Perception
Geometric and statistical recovery of the planar ball state $X = (x, y, \dot x, \dot y) \in \mathbb{R}^4$ from an overhead RGB stream: ArUco-based homography from the image plane onto the plate, colour-segmentation of the ball, and a constant-velocity Kalman filter that fuses the noisy position measurements with the dynamic model to produce a smooth estimate of position and velocity at the controller rate.
Three frames are involved: the image plane in pixels, the camera frame in metres centred at the optical centre, and the plate frame on the plate surface. Every stage that follows is a transformation between two of these.
Pixel coordinates $(u, v) \in \mathbb{N}^2$ with origin at the top-left corner of the image. Everything the camera sees lives here first.
Centred at the optical centre of the pinhole model. $\hat z_C$ runs along the optical axis (positive in front of the camera); $(\hat x_C, \hat y_C)$ align with the image axes.
Defined on the setup page: origin at the plate centre, $\hat x_P$ to the right in the overhead view, $\hat y_P$ up, $\hat z_P$ along the plate normal.
The output of the pipeline is the ball state $(x, y, \dot x, \dot y) \in \{P\}$ in metres and metres per second.
The image $\rightarrow$ ball-state mapping is a pipeline of four stages. The first three turn a pixel into a metric position; the last fuses noisy positions into a smooth state.
A checkerboard photographed in many poses fixes the focal lengths, principal point, and lens distortion. Cached and applied to every runtime frame to rectify it back to an ideal pinhole image.
Four ArUco fiducials at the plate corners give eight point correspondences. The Direct Linear Transform fits a $3 \times 3$ matrix $H$ that maps any pixel on the plate back to plate-frame metres.
The orange ball is selected by an HSV colour range. After a morphological clean-up, the largest circular blob's centroid in pixels is taken as the ball; the homography $H$ converts it to plate-frame metres.
A linear-Gaussian Kalman filter fuses the noisy per-frame positions into a smooth state, using the commanded plate tilt as a known control input so the velocity estimate anticipates the gravity component.
The camera is modelled as an ideal pinhole with intrinsic parameter matrix
where $(f_x, f_y)$ are the focal lengths in pixels along the image axes and $(c_x, c_y)$ is the principal point. For a 3-D point $X_C = (X, Y, Z)^\top$ expressed in $\{C\}$, the ideal projection onto the image plane is
Real lenses introduce radial and tangential distortion. After normalising $(x_n, y_n) = (X/Z, Y/Z)$, the distorted coordinates used to look up the pixel colour are
where $r^2 = x_n^2 + y_n^2$. The five-element distortion vector is $d = (k_1, k_2, p_1, p_2, k_3)$.
Calibration is performed offline by photographing an $8 \times 5$ inner-corner checkerboard with a known $30$ mm square size in roughly twenty distinct poses. For each detected pose with $N$ corner correspondences $(X_j^{\text{world}}, u_j)$, the calibration solves
where $\pi(\cdot)$ is the full distorted-pinhole projection and $(R_i, t_i)$ is the pose of the checkerboard in image $i$. The minimisation is initialised from a closed-form Direct Linear Transform (DLT) and refined by Levenberg-Marquardt; the reported reprojection error is typically below $0.5$ pixels. The output of calibration is the pair $(K, d)$, cached once and applied to every runtime frame: each pixel is rectified by the inverse distortion mapping so that downstream stages see a clean pinhole image.
The plate's surface is a known plane. After undistortion, every undistorted pixel on that plane is related to its plate-frame metric coordinates by a planar homography $H \in \mathbb{R}^{3 \times 3}$: in homogeneous coordinates,
Each pixel-to-plate correspondence $(u_i, v_i) \leftrightarrow (x_i, y_i)$ contributes two linear constraints in the eight free entries of $H$ (the matrix is defined up to a scale). Four point correspondences are enough to determine $H$ uniquely up to that scale; this is the Direct Linear Transform (DLT).
The four ArUco fiducial markers are placed at the plate corners, with centre coordinates known a priori in $\{P\}$. A standard ArUco detector returns four image corners per visible marker. During the first second of operation, while the plate is stationary at home and all four markers are visible, each marker's image corners are averaged over twenty frames and projected through a centroid-fit $H$ into $\{P\}$. Once the four corners of all four markers are known in $\{P\}$, every visible marker contributes four point correspondences, so a single visible marker is sufficient to refit $H$ and the system tolerates partial occlusion.
A full Perspective-n-Point (PnP) recovery of the plate's six-DOF pose in $\{C\}$ would also work, but the ball is constrained to move on the plate plane, so its world position is a function only of $(u, v)$. The 2-D homography is faster to fit and avoids estimating quantities that are not subsequently used.
The undistorted BGR (Blue-Green-Red) frame is converted to the Hue-Saturation-Value (HSV) representation. The orange ping-pong ball is selected by an HSV range threshold tuned offline; a morphological opening removes speckle and a closing fills small holes. Connected components are extracted; the component with the largest area passing a circularity test is taken as the ball, and its centroid in pixel coordinates is computed via the minimum enclosing circle. Application of $H$ from § 4 yields the ball position $(x, y) \in \{P\}$ in metres.
Per-frame finite differences of $(x, y)$ are too noisy to feed into the controller: the perception error at the camera scale is on the order of a few millimetres, which becomes hundreds of millimetres per second after differentiation. A linear-Gaussian Kalman filter (KF) is used instead.
The state is the four-vector $X_t = (x_t, y_t, \dot x_t, \dot y_t)^\top$. The process model is constant-velocity in the plate frame, with the in-plane component of gravity entering as a known control input $u_t = (a_x, a_y)^\top$ derived from the commanded plate tilt:
with the discrete-time matrices
The control input is the in-plane acceleration produced by gravity on the tilted plate; from the rolling-without-slipping equation derived on the physics page,
with $\alpha = 5/7$ for a uniform solid sphere. The plate tilt angles $(\theta_x, \theta_y)$ are the commanded tilt from the controller — already known when the predict step runs. Without this input the filter degrades gracefully to a pure constant-velocity estimator; with it, the filter anticipates the gravity-induced acceleration and reduces the lag on the velocity estimate.
Only the position is measured directly:
with $R = \sigma_{\text{meas}}^2 I_2$.
At every camera frame the filter performs the standard two-step update. Let $\hat X_{t \mid t-1}$ denote the state estimate before the new measurement and $P_{t \mid t-1}$ its covariance. The predict step propagates the estimate and its uncertainty forward in time:
Once a new measurement $Y_t$ arrives, the update step computes the innovation, the innovation covariance, the Kalman gain $K_t$, and the posterior estimate:
The process-noise covariance is the discrete-time continuous-white-noise-acceleration form parametrised by a single standard deviation $\sigma_a$ on the unmodelled acceleration:
The four components $(x, y, \dot x, \dot y)$ from the Kalman estimate are the perception output, produced once per camera frame and consumed by the controller. Accompanying tracker-health flags (was the ball detected this frame, is the tracking estimate still valid, how many ArUco markers were seen) indicate when the estimate should be trusted.