Perception

Vision

Geometric and statistical recovery of the planar ball state $X = (x, y, \dot x, \dot y) \in \mathbb{R}^4$ from an overhead RGB stream: ArUco-based homography from the image plane onto the plate, colour-segmentation of the ball, and a constant-velocity Kalman filter that fuses the noisy position measurements with the dynamic model to produce a smooth estimate of position and velocity at the controller rate.

§ 1Coordinate frames

Three frames are involved: the image plane in pixels, the camera frame in metres centred at the optical centre, and the plate frame on the plate surface. Every stage that follows is a transformation between two of these.

image · pixels

Image plane

Pixel coordinates $(u, v) \in \mathbb{N}^2$ with origin at the top-left corner of the image. Everything the camera sees lives here first.

$\{C\}$ · metres

Camera frame

Centred at the optical centre of the pinhole model. $\hat z_C$ runs along the optical axis (positive in front of the camera); $(\hat x_C, \hat y_C)$ align with the image axes.

$\{P\}$ · metres

Plate frame

Defined on the setup page: origin at the plate centre, $\hat x_P$ to the right in the overhead view, $\hat y_P$ up, $\hat z_P$ along the plate normal.

P ŷP $\{P\}$ C C $\{C\}$ (u, v)
Figure 1. The overhead camera looks down at the plate; the projection ray maps a point on the plate surface to a pixel $(u, v)$ on the image plane. The estimation pipeline inverts that mapping, recovering plate-frame metric coordinates from $(u, v)$.

The output of the pipeline is the ball state $(x, y, \dot x, \dot y) \in \{P\}$ in metres and metres per second.

§ 2Estimation pipeline

The image $\rightarrow$ ball-state mapping is a pipeline of four stages. The first three turn a pixel into a metric position; the last fuses noisy positions into a smooth state.

offline · $\sim 20$ poses

Intrinsic calibration

checkerboard $\rightarrow$ $(K,\, d)$

A checkerboard photographed in many poses fixes the focal lengths, principal point, and lens distortion. Cached and applied to every runtime frame to rectify it back to an ideal pinhole image.

1 2 3 4 image pixels (u, v) H homography 1 2 3 4 plate metres (x, y)

Homography from ArUco

$(u, v) \rightarrow (x, y)$ on the plate

Four ArUco fiducials at the plate corners give eight point correspondences. The Direct Linear Transform fits a $3 \times 3$ matrix $H$ that maps any pixel on the plate back to plate-frame metres.

orange range centroid $(u, v)$

HSV ball detection

image $\rightarrow$ pixel centroid $(u, v)$

The orange ball is selected by an HSV colour range. After a morphological clean-up, the largest circular blob's centroid in pixels is taken as the ball; the homography $H$ converts it to plate-frame metres.

noisy samples $\rightarrow$ smooth state

Kalman filter

$(x, y) \rightarrow (x, y, \dot x, \dot y)$

A linear-Gaussian Kalman filter fuses the noisy per-frame positions into a smooth state, using the commanded plate tilt as a known control input so the velocity estimate anticipates the gravity component.

§ 3Pinhole camera model and calibration

The camera is modelled as an ideal pinhole with intrinsic parameter matrix

$$K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \in \mathbb{R}^{3 \times 3},$$

where $(f_x, f_y)$ are the focal lengths in pixels along the image axes and $(c_x, c_y)$ is the principal point. For a 3-D point $X_C = (X, Y, Z)^\top$ expressed in $\{C\}$, the ideal projection onto the image plane is

$$\lambda \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} \;=\; K \begin{pmatrix} X \\ Y \\ Z \end{pmatrix}, \qquad \lambda = Z.$$

Real lenses introduce radial and tangential distortion. After normalising $(x_n, y_n) = (X/Z, Y/Z)$, the distorted coordinates used to look up the pixel colour are

$$\begin{aligned} x_d &= x_n (1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x_n y_n + p_2(r^2 + 2 x_n^2), \\ y_d &= y_n (1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2 y_n^2) + 2 p_2 x_n y_n, \end{aligned}$$

where $r^2 = x_n^2 + y_n^2$. The five-element distortion vector is $d = (k_1, k_2, p_1, p_2, k_3)$.

Calibration is performed offline by photographing an $8 \times 5$ inner-corner checkerboard with a known $30$ mm square size in roughly twenty distinct poses. For each detected pose with $N$ corner correspondences $(X_j^{\text{world}}, u_j)$, the calibration solves

$$\min_{K, d, \{R_i, t_i\}} \;\; \sum_i \sum_{j=1}^{N} \;\big\|\, u_j - \pi(K, d; R_i, t_i, X_j^{\text{world}}) \big\|^2,$$

where $\pi(\cdot)$ is the full distorted-pinhole projection and $(R_i, t_i)$ is the pose of the checkerboard in image $i$. The minimisation is initialised from a closed-form Direct Linear Transform (DLT) and refined by Levenberg-Marquardt; the reported reprojection error is typically below $0.5$ pixels. The output of calibration is the pair $(K, d)$, cached once and applied to every runtime frame: each pixel is rectified by the inverse distortion mapping so that downstream stages see a clean pinhole image.

3-D world points 2-D image points non-linear least-squares $K$ (intrinsics) $d$ (distortion) runtime undistort
Figure 1. Calibration pipeline. Checkerboard pose pairs are passed to non-linear least-squares with reprojection-error cost; the recovered $(K, d)$ are then used to rectify every runtime frame.

§ 4Plane recovery via ArUco markers and homography

The plate's surface is a known plane. After undistortion, every undistorted pixel on that plane is related to its plate-frame metric coordinates by a planar homography $H \in \mathbb{R}^{3 \times 3}$: in homogeneous coordinates,

$$\lambda \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \;=\; H \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}, \qquad H = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix}.$$

Each pixel-to-plate correspondence $(u_i, v_i) \leftrightarrow (x_i, y_i)$ contributes two linear constraints in the eight free entries of $H$ (the matrix is defined up to a scale). Four point correspondences are enough to determine $H$ uniquely up to that scale; this is the Direct Linear Transform (DLT).

The four ArUco fiducial markers are placed at the plate corners, with centre coordinates known a priori in $\{P\}$. A standard ArUco detector returns four image corners per visible marker. During the first second of operation, while the plate is stationary at home and all four markers are visible, each marker's image corners are averaged over twenty frames and projected through a centroid-fit $H$ into $\{P\}$. Once the four corners of all four markers are known in $\{P\}$, every visible marker contributes four point correspondences, so a single visible marker is sufficient to refit $H$ and the system tolerates partial occlusion.

A full Perspective-n-Point (PnP) recovery of the plate's six-DOF pose in $\{C\}$ would also work, but the ball is constrained to move on the plate plane, so its world position is a function only of $(u, v)$. The 2-D homography is faster to fit and avoids estimating quantities that are not subsequently used.

§ 5Ball detection in HSV space

The undistorted BGR (Blue-Green-Red) frame is converted to the Hue-Saturation-Value (HSV) representation. The orange ping-pong ball is selected by an HSV range threshold tuned offline; a morphological opening removes speckle and a closing fills small holes. Connected components are extracted; the component with the largest area passing a circularity test is taken as the ball, and its centroid in pixel coordinates is computed via the minimum enclosing circle. Application of $H$ from § 4 yields the ball position $(x, y) \in \{P\}$ in metres.

§ 6Velocity estimation: Kalman filter

Per-frame finite differences of $(x, y)$ are too noisy to feed into the controller: the perception error at the camera scale is on the order of a few millimetres, which becomes hundreds of millimetres per second after differentiation. A linear-Gaussian Kalman filter (KF) is used instead.

5.1   State and dynamics

The state is the four-vector $X_t = (x_t, y_t, \dot x_t, \dot y_t)^\top$. The process model is constant-velocity in the plate frame, with the in-plane component of gravity entering as a known control input $u_t = (a_x, a_y)^\top$ derived from the commanded plate tilt:

$$X_{t+1} = F\, X_t + B\, u_t + w_t, \qquad w_t \sim \mathcal{N}(0, Q),$$

with the discrete-time matrices

$$F = \begin{pmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad B = \begin{pmatrix} \tfrac{1}{2}\Delta t^2 & 0 \\ 0 & \tfrac{1}{2}\Delta t^2 \\ \Delta t & 0 \\ 0 & \Delta t \end{pmatrix}.$$

The control input is the in-plane acceleration produced by gravity on the tilted plate; from the rolling-without-slipping equation derived on the physics page,

$$a_x \;=\; +\alpha\, g\, \sin\theta_y, \qquad a_y \;=\; -\alpha\, g\, \sin\theta_x,$$

with $\alpha = 5/7$ for a uniform solid sphere. The plate tilt angles $(\theta_x, \theta_y)$ are the commanded tilt from the controller — already known when the predict step runs. Without this input the filter degrades gracefully to a pure constant-velocity estimator; with it, the filter anticipates the gravity-induced acceleration and reduces the lag on the velocity estimate.

5.2   Measurement model

Only the position is measured directly:

$$Y_t = C\, X_t + v_t, \qquad C = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}, \qquad v_t \sim \mathcal{N}(0, R),$$

with $R = \sigma_{\text{meas}}^2 I_2$.

5.3   Predict-update recursion

At every camera frame the filter performs the standard two-step update. Let $\hat X_{t \mid t-1}$ denote the state estimate before the new measurement and $P_{t \mid t-1}$ its covariance. The predict step propagates the estimate and its uncertainty forward in time:

$$\hat X_{t+1 \mid t} = F\, \hat X_{t \mid t} + B\, u_t, \qquad P_{t+1 \mid t} = F\, P_{t \mid t}\, F^\top + Q.$$

Once a new measurement $Y_t$ arrives, the update step computes the innovation, the innovation covariance, the Kalman gain $K_t$, and the posterior estimate:

$$\nu_t = Y_t - C\, \hat X_{t \mid t-1}, \quad S_t = C\, P_{t \mid t-1}\, C^\top + R, \quad K_t = P_{t \mid t-1}\, C^\top S_t^{-1},$$
$$\hat X_{t \mid t} = \hat X_{t \mid t-1} + K_t\, \nu_t, \qquad P_{t \mid t} = (I - K_t\, C)\, P_{t \mid t-1}.$$

The process-noise covariance is the discrete-time continuous-white-noise-acceleration form parametrised by a single standard deviation $\sigma_a$ on the unmodelled acceleration:

$$Q = \sigma_a^2 \begin{pmatrix} \tfrac{1}{4}\Delta t^4 & 0 & \tfrac{1}{2}\Delta t^3 & 0 \\ 0 & \tfrac{1}{4}\Delta t^4 & 0 & \tfrac{1}{2}\Delta t^3 \\ \tfrac{1}{2}\Delta t^3 & 0 & \Delta t^2 & 0 \\ 0 & \tfrac{1}{2}\Delta t^3 & 0 & \Delta t^2 \end{pmatrix}.$$
noisy measurements $Y_t = (x, y) + v_t$ Kalman filter predict + update $u_t$ — in-plane gravity from the commanded tilt smoothed state $\hat X_t = (x, y, \dot x, \dot y)$
Figure 2. Velocity estimation. The per-frame noisy positions (left) are fused by the linear-Gaussian Kalman filter (centre) into a smooth state with consistent velocity estimates (right, tangent arrows). The control input $u_t$ is the in-plane gravity computed from the commanded plate tilt; the filter uses it during the predict step so the velocity estimate anticipates the gravity-induced acceleration.

§ 7Output

The four components $(x, y, \dot x, \dot y)$ from the Kalman estimate are the perception output, produced once per camera frame and consumed by the controller. Accompanying tracker-health flags (was the ball detected this frame, is the tracking estimate still valid, how many ArUco markers were seen) indicate when the estimate should be trusted.

§ 8Notation

$(u, v)$
pixel coordinates in the rectified image plane.
$\{C\}, \{P\}$
camera frame and plate frame.
$K \in \mathbb{R}^{3\times 3}$
camera intrinsic matrix.
$d = (k_1, k_2, p_1, p_2, k_3)$
radial-tangential distortion coefficients.
$H \in \mathbb{R}^{3\times 3}$
plate-plane homography (pixels to plate metres).
$X_t = (x, y, \dot x, \dot y)^\top$
Kalman state.
$F, B, C$
state-space matrices.
$Q, R$
process- and measurement-noise covariances.
$u_t = (a_x, a_y)^\top$
in-plane gravity input computed from the commanded plate tilt.
$\alpha = 5/7$
rolling factor for a uniform solid sphere.

§ 9References

  1. Garrido-Jurado, S., Muñoz-Salinas, R., Madrid-Cuevas, F. J., Marín-Jiménez, M. J. (2014). Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognition, 47(6), 2280–2292. doi.org/10.1016/j.patcog.2014.01.005
  2. Romero-Ramirez, F. J., Muñoz-Salinas, R., Medina-Carnicer, R. (2018). Speeded up detection of squared fiducial markers. Image and Vision Computing, 76, 38–47. doi.org/10.1016/j.imavis.2018.05.004
  3. Zhang, Z. (2000). A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11), 1330–1334. doi.org/10.1109/34.888718
  4. Hartley, R., Zisserman, A. (2004). Multiple View Geometry in Computer Vision (2nd ed.). Cambridge University Press. robots.ox.ac.uk/~vgg/hzbook
  5. Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME — Journal of Basic Engineering, 82(D), 35–45. doi.org/10.1115/1.3662552
  6. Bradski, G. (2000). The OpenCV Library. Dr. Dobb's Journal of Software Tools, 25(11), 120–125. github.com/opencv/opencv/wiki/CiteOpenCV