Reinforcement Learning · Hold UR Pla7e

§ 1 Motivation

The hand-tuned PID of § 2 is reliable inside the operating envelope it was tuned for: nominal ball mass, nominal plate friction, low sensing noise, low actuation latency. Outside that envelope its fixed gains $(K_p, K_i, K_d)$ no longer match the closed-loop dynamics, and the tracking error grows or the ball drops. Re-tuning by hand for every parameter setting is impractical, and a single conservative gain schedule trades nominal performance for robustness.

Residual reinforcement learning takes a different route. The PID is left in place, and a small additional correction is added on top of its command at every control step. Four design choices make this work:

Keep the PID baseline

The hand-tuned PID continues to do most of the work: pulling the ball toward the reference, damping the response, and cancelling steady-state offsets. The learned component only adds to its command; it never replaces the proportional, integral, or derivative terms.

Add a neural correction

A feedforward policy reads the current observation of the ball-plate system and emits a two-dimensional output. This output is the residual: a small per-axis tilt adjustment that is summed with the PID command before the wiring stage.

Bound the correction

The residual is multiplied by a small fixed factor $\alpha = 0.3$ before being added, so $\lVert \alpha\, \pi_\theta \rVert_\infty \le \alpha$: the network can refine the baseline but cannot override it. A zero policy reproduces the deployed PID exactly, so a network whose outputs start near zero behaves almost identically to the baseline at the beginning of training.

Train across many plants

A correction that generalises across many operating conditions requires training samples from many operating conditions. The simulator is therefore not run on the nominal plant alone: at the start of every episode it is reset with seven physical and sensing parameters drawn from a wide envelope (domain randomisation).

The seven randomised parameters are the ball mass, the ball-plate friction, the plate-air friction, the initial ball offset from the plate centre, a small multiplicative perturbation on gravity, the standard deviation of the position-measurement noise, and the end-to-end actuation latency. The exact ranges are listed in § 5. Because every episode is a slightly different plant, the policy is forced to remain competent across the whole envelope of conditions rather than overfit to a single nominal setting.

Concretely, the closed-loop tilt command at each control step is the sum of the PID output and the scaled policy output:

$$(u_x, u_y) \;=\; \underbrace{\mathrm{PID}(e_p, e_v)}_{\text{baseline}} \;+\; \underbrace{\alpha\, \pi_\theta(o_t)}_{\text{residual}}, \qquad \alpha = 0.3, \;\; \pi_\theta(o_t) \in [-1, 1]^2.$$

Figure 1. Closed-loop action. The PID baseline reads the tracking errors $(e_p, e_v)$ and emits the bulk of the tilt command through its three gains $K_p, K_i, K_d$. In parallel, the policy network $\pi_\theta$ reads the 12-dimensional observation $o_t$ and emits a bounded residual $a_t \in [-1, 1]^2$, scaled by $\alpha$ before being summed. The combined command $(u_x, u_y)$ is forwarded to the wiring stage of the PID page (§ 3), then to the IK and the arm; the plant feeds the vision pipeline, which closes the loop.
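The summation described in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the deployed controller: the function name `combine_command` and the example PID output are hypothetical, and the defensive clamp on the policy output is an assumption (the $\tanh$ output layer already keeps it in $[-1, 1]$).

```python
ALPHA = 0.3  # residual scale from the text


def combine_command(pid_cmd, policy_out, alpha=ALPHA):
    """Return (u_x, u_y) = PID command + alpha * bounded policy output."""
    # Defensive clamp (assumption): keep each policy component in [-1, 1].
    bounded = [max(-1.0, min(1.0, a)) for a in policy_out]
    return tuple(u + alpha * a for u, a in zip(pid_cmd, bounded))


# Hypothetical PID output, for illustration only.
pid_cmd = (0.10, -0.05)
zero = combine_command(pid_cmd, (0.0, 0.0))         # zero policy: PID unchanged
saturated = combine_command(pid_cmd, (1.0, -1.0))   # residual at its bound
```

Whatever the network emits, each component of the combined command stays within $\alpha$ of the PID command, which is the sense in which the residual refines but cannot override the baseline.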

Training has to learn only the corrections that the fixed PID gains cannot express, which is a substantially easier problem than learning a full controller from scratch. The region of plant conditions over which the residual is competent is shown schematically below.

Figure 2. Operating envelope (schematic). A two-dimensional slice of the seven-dimensional perturbation space, showing the regions in which the PID alone and the PID + residual scheme keep the ball on the plate. The outer envelope matches the domain-randomisation distribution of § 5; outside it neither scheme is guaranteed to track.

§ 2 Markov decision process

The training problem is cast as an episodic Markov decision process (MDP) with state $s_t$, action $a_t$, transition kernel $P(s_{t+1} \mid s_t, a_t)$, and reward $r(s_t, a_t)$. An episode lasts up to $T = 600$ control steps at a $30$ Hz policy rate, for a $20$ s horizon. The episode terminates early if the ball leaves the plate, in which case the terminal drop penalty of § 4 is applied. The objective is the expected discounted return,

$$J(\pi_\theta) \;=\; \mathbb{E}_{\pi_\theta} \!\left[ \sum_{t=0}^{T-1} \gamma^t \, r(s_t, a_t) \right], \qquad \gamma = 0.99.$$
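As a small worked example of this objective, the sketch below evaluates the discounted sum for a single episode. The constant reward of 1 per step is purely illustrative and is not the reward of § 4; it only shows that with $\gamma = 0.99$ the return saturates near $1/(1-\gamma) = 100$, so the effective planning horizon is roughly 100 steps, or about 3.3 s at 30 Hz.

```python
GAMMA = 0.99
HORIZON = 600  # control steps (20 s at 30 Hz)


def discounted_return(rewards, gamma=GAMMA):
    """Sum of gamma**t * r_t, accumulated backwards over the episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Illustrative constant reward of 1 per step over a full episode.
full_episode = discounted_return([1.0] * HORIZON)

# Closed form of the same geometric sum, for comparison.
closed_form = (1.0 - GAMMA**HORIZON) / (1.0 - GAMMA)
```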

The state $s_t$ is the full MuJoCo state of the arm and the ball; the policy observes a 12-dimensional projection $o_t$ defined in § 3.

Figure 3. Residual-RL training loop. Sixteen MuJoCo environments run in parallel, each carrying its own domain-randomisation sample and its own PID baseline. The transitions $(o_t, a_t, r_t, o_{t+1})$ are written into the on-policy rollout buffer; after $T$ steps the buffer is consumed by a clipped PPO gradient step on the policy $\pi_\theta$ and the value head $V_\phi$, and the updated policy is sent back to all environments.

§ 3 Observation

The policy receives a 12-dimensional observation $o_t \in \mathbb{R}^{12}$ concatenated from the following blocks, expressed in the plate frame $\{P\}$ unless stated otherwise:

$(x, y)$: ball position in $\{P\}$ (m).
$(\dot x, \dot y)$: ball velocity in $\{P\}$ (m/s).
$(x_r, y_r, \dot x_r, \dot y_r)$: reference position (m) and velocity (m/s) at the current time.
$(x_r^{+\Delta}, y_r^{+\Delta})$: reference position (m) at a fixed lookahead $\Delta = 0.2$ s.
roll, pitch: plate tilt angles in $\{W\}$.

The lookahead lets the policy anticipate target motion. The PID alone trails a moving reference by a lag proportional to the bandwidth ratio between the reference and the closed-loop dynamics, which is visible on circular and figure-of-eight trajectories; the lookahead term allows the residual to compensate for this lag without modifying the PID gains.
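The assembly of the observation can be sketched as follows. The block order mirrors the list above, but the actual ordering inside the deployed vector is an assumption, as are the function names and the circular reference (its radius and period are hypothetical values chosen only to make the example runnable).

```python
import math

LOOKAHEAD_S = 0.2  # lookahead Delta from the text


def circle_reference(t, radius=0.05, period=4.0):
    """Hypothetical circular reference: (x_r, y_r, vx_r, vy_r) at time t."""
    w = 2.0 * math.pi / period
    return (radius * math.cos(w * t), radius * math.sin(w * t),
            -radius * w * math.sin(w * t), radius * w * math.cos(w * t))


def build_observation(ball_pos, ball_vel, t, roll, pitch,
                      reference=circle_reference):
    """Concatenate the five blocks into a flat 12-element list."""
    x_r, y_r, vx_r, vy_r = reference(t)
    # Only the position of the lookahead sample is observed.
    x_la, y_la, _, _ = reference(t + LOOKAHEAD_S)
    return [*ball_pos, *ball_vel, x_r, y_r, vx_r, vy_r, x_la, y_la, roll, pitch]


obs = build_observation(ball_pos=(0.04, 0.01), ball_vel=(0.0, 0.06),
                        t=0.0, roll=0.0, pitch=0.0)
```

On a moving reference the lookahead entries differ from the current reference position, and that difference is the signal the residual can use to lead the target.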

§ 4 Reward shaping

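The per-step reward combines three shaping terms that enter with a positive sign and two regularisers that enter with a negative sign (the progress term itself becomes negative whenever the position error grows):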

$$\begin{aligned} r_t \;=\;\; & \textcolor{#288c5a}{w_t\, e^{-\|e_p\| / \sigma_p}} &&\quad \text{tracking} \;(+) \\ + & \textcolor{#288c5a}{w_v\, e^{-\|e_v\| / \sigma_v}} &&\quad \text{velocity match} \;(+) \\ + & \textcolor{#003262}{w_g \big( \|e_p^{(t-1)}\| - \|e_p^{(t)}\| \big)} &&\quad \text{progress shaping} \;(+) \\ - & \textcolor{#b41e28}{w_a\, \|a_t\|^2} &&\quad \text{action magnitude} \;(-) \\ - & \textcolor{#b41e28}{w_s\, \|a_t - a_{t-1}\|^2} &&\quad \text{action smoothness} \;(-) \end{aligned}$$

The tracking term is an exponential bonus on the position error norm with scale $\sigma_p = 0.05$ m.
The velocity-match term rewards ball velocity tracking the reference velocity, with scale $\sigma_v = 0.20$ m/s.
The progress term is a potential-based shaping bonus proportional to the reduction in $\|e_p\|$ between consecutive steps; it densifies the reward signal when the ball is far from the reference.
The action-magnitude penalty regularises the residual norm; with $\pi_\theta \equiv 0$ the term is identically zero.
The smoothness penalty regularises the temporal difference of the residual to suppress high-frequency oscillation.

A drop event terminates the episode and applies a large terminal penalty $-w_d \cdot (\text{steps remaining}) / T$, so dropping the ball early in an episode is punished more heavily than dropping it late.

The deployed weights are $w_t = 1.0$, $w_v = 0.3$, $w_g = 0.5$, $w_a = 0.01$, $w_s = 0.05$, $w_d = 100$.
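The reward can be written out directly from the formula and the weights above. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and "steps remaining" is taken to be $T$ minus the 0-based index of the step at which the drop occurs, which the text does not spell out.

```python
import math

W_T, W_V, W_G, W_A, W_S, W_D = 1.0, 0.3, 0.5, 0.01, 0.05, 100.0
SIGMA_P, SIGMA_V = 0.05, 0.20   # m, m/s
HORIZON = 600                   # episode length T in control steps


def norm(v):
    """Euclidean norm of a small vector."""
    return math.sqrt(sum(c * c for c in v))


def step_reward(e_p, e_p_prev, e_v, a, a_prev):
    """Per-step reward: three shaping terms minus two action regularisers."""
    tracking = W_T * math.exp(-norm(e_p) / SIGMA_P)
    velocity = W_V * math.exp(-norm(e_v) / SIGMA_V)
    progress = W_G * (norm(e_p_prev) - norm(e_p))
    magnitude = W_A * norm(a) ** 2
    smoothness = W_S * norm([x - y for x, y in zip(a, a_prev)]) ** 2
    return tracking + velocity + progress - magnitude - smoothness


def drop_penalty(step):
    """Terminal penalty for a drop at `step` (0-based index, assumption)."""
    return -W_D * (HORIZON - step) / HORIZON


# Perfect tracking with a zero residual earns the maximum w_t + w_v.
perfect = step_reward((0, 0), (0, 0), (0, 0), (0, 0), (0, 0))
```

With these weights the best achievable per-step reward is $w_t + w_v = 1.3$, while a drop on the very first step costs $-100$, so avoiding a drop dominates any short-term tracking gain.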

§ 5 Domain randomisation

Each training episode samples seven physical and sensing parameters uniformly within ranges that bracket the values seen on the real hardware. The policy is forced to perform across the full range, which produces robustness to whichever values the real system actually exhibits.

ball mass: $[2.0, 4.0]$ g (ITTF nominal $2.7$ g).
ball-plate friction: $[0.30, 0.70]$.
plate-air friction: $[0.30, 0.80]$.
initial ball offset: $[0, 2]$ cm radius from plate centre.
gravity scale: $\pm 2 \%$ multiplicative perturbation on $g$.
observation noise: $0$–$3$ mm Gaussian standard deviation on $(x, y)$.
action delay: $[30, 150]$ ms (camera + ROS + trajectory command latency).
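A per-episode sampler for the seven parameters listed above might look like the sketch below. The class and field names are hypothetical, values are converted to SI units, and two details are assumptions not stated in the text: the offset radius is drawn uniformly in radius (not uniformly in area) and its direction is drawn uniformly in angle.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PlantSample:
    ball_mass_kg: float
    ball_plate_friction: float
    plate_air_friction: float
    init_offset_m: tuple      # (x, y) offset from the plate centre
    gravity_scale: float      # multiplies the nominal g
    obs_noise_std_m: float
    action_delay_s: float


def sample_plant(rng):
    """Draw one episode's parameters uniformly from the listed ranges."""
    radius = rng.uniform(0.0, 0.02)            # 0-2 cm, uniform in radius (assumption)
    angle = rng.uniform(0.0, 2.0 * math.pi)    # uniform direction (assumption)
    return PlantSample(
        ball_mass_kg=rng.uniform(2.0e-3, 4.0e-3),
        ball_plate_friction=rng.uniform(0.30, 0.70),
        plate_air_friction=rng.uniform(0.30, 0.80),
        init_offset_m=(radius * math.cos(angle), radius * math.sin(angle)),
        gravity_scale=rng.uniform(0.98, 1.02),
        obs_noise_std_m=rng.uniform(0.0, 3.0e-3),
        action_delay_s=rng.uniform(0.030, 0.150),
    )


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [sample_plant(rng) for _ in range(1000)]
```

Passing an explicit `random.Random` instance keeps each parallel environment's sampling stream independent and reproducible.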

Action delay is implemented as a circular buffer of length $\lceil \tau / \Delta t \rceil$ where $\tau$ is the sampled latency and $\Delta t$ is the policy step; the action applied at step $t$ is the action requested $\lceil \tau / \Delta t \rceil$ steps earlier.
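The delay mechanism can be sketched with a fixed-length FIFO queue. The class name `ActionDelay` is hypothetical, and two details are assumptions: the queue is pre-filled with a neutral zero action until the first real action has propagated through, and a tiny tolerance is subtracted before rounding up so that a latency that is an exact multiple of $\Delta t$ is not pushed one step higher by floating-point error.

```python
import math
from collections import deque

DT = 1.0 / 30.0  # policy step at 30 Hz


class ActionDelay:
    """Delay actions by ceil(latency / dt) policy steps."""

    def __init__(self, latency_s, dt=DT, fill=(0.0, 0.0)):
        # Small tolerance (assumption) guards against float round-off in the ratio.
        self.steps = max(0, math.ceil(latency_s / dt - 1e-9))
        # Pre-fill with a neutral action (assumption) for the first `steps` calls.
        self.buffer = deque([fill] * self.steps)

    def push(self, action):
        """Store the requested action and return the one to apply now."""
        self.buffer.append(action)
        return self.buffer.popleft()


# 90 ms of latency at 30 Hz rounds up to 3 policy steps.
delay = ActionDelay(latency_s=0.09)
requested = [(float(k), 0.0) for k in range(1, 7)]
applied = [delay.push(a) for a in requested]
```

Across the sampled range of 30 to 150 ms this gives delays of one to five policy steps.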

§ 6 Training algorithm

The policy is trained with Proximal Policy Optimization (PPO), an on-policy policy-gradient method that stabilises updates by clipping the importance-sampling ratio between the new and old policies. Writing the ratio as $\rho_t(\theta) = \pi_\theta(a_t \mid o_t) / \pi_{\theta_{\text{old}}}(a_t \mid o_t)$ (to keep it distinct from the per-step reward $r_t$) and the generalised advantage estimate as $\hat A_t$, the surrogate objective is

$$\mathcal{L}^{\mathrm{CLIP}}(\theta) \;=\; \mathbb{E}_t \!\left[ \min\!\big( \rho_t(\theta)\, \hat A_t, \;\; \mathrm{clip}(\rho_t(\theta), 1 - \epsilon, 1 + \epsilon)\, \hat A_t \big) \right],$$

with clip range $\epsilon = 0.1$. The advantages are computed by Generalised Advantage Estimation (GAE) with parameter $\lambda = 0.95$. Both the policy and the value networks are multi-layer perceptrons (MLPs) with two hidden layers of $256$ units and $\tanh$ activations. Sixteen MuJoCo environments run in parallel via the Stable-Baselines3 vectorised wrapper. The learning rate decays linearly from $10^{-4}$ to $10^{-5}$ and the entropy coefficient from $5 \cdot 10^{-3}$ to $5 \cdot 10^{-4}$ over the full training schedule. The training curve and the benchmark are reported on the results page.
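The three numerical ingredients of this section (the clipped surrogate, GAE, and the linear decay) can be sketched in plain Python. This illustrates the formulas only and is not the library's internal implementation. The function names are hypothetical, the GAE routine assumes a rollout segment with no episode termination inside it, and the schedule follows the convention that its argument is the remaining training progress, running from 1 at the start to 0 at the end. How the entropy coefficient is actually scheduled during training is not stated in the text; applying the same linear function to it is an assumption.

```python
def clipped_surrogate(ratio, advantage, eps=0.1):
    """Per-sample PPO objective: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalised advantage estimates for one non-terminated rollout segment."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value  # bootstrap value after the final step
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def linear_schedule(start, end):
    """Map progress_remaining (1 -> 0) to a value decaying linearly start -> end."""
    return lambda progress_remaining: end + (start - end) * progress_remaining


learning_rate = linear_schedule(1e-4, 1e-5)
entropy_coef = linear_schedule(5e-3, 5e-4)
```

For a positive advantage the objective stops rewarding ratios above $1 + \epsilon$, and for a negative advantage it stops rewarding ratios below $1 - \epsilon$, which removes the incentive for a single update to move the policy far from the one that collected the data.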

Figure 4. Policy network $\pi_\theta$. The 12-dimensional observation is mapped through two fully connected layers of $256$ units with $\tanh$ activations to a 2-dimensional output, also passed through $\tanh$ so that $a_t \in [-1, 1]^2$. Only ten representative units per hidden layer are drawn; the value head $V_\phi$ shares the same hidden-layer topology but emits a single scalar.
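A forward pass through a network of this shape can be sketched without any deep-learning library. The weights below are random and purely illustrative; they are not the training initialisation or the trained policy, and the function names are hypothetical. The parameter count covers only the three fully connected layers drawn in the figure.

```python
import math
import random

LAYER_SIZES = [12, 256, 256, 2]  # observation, two hidden layers, action


def init_mlp(rng, sizes=LAYER_SIZES):
    """Random weights and zero biases, for illustration only."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, obs):
    """Apply tanh after every layer, including the output layer."""
    h = list(obs)
    for weights, biases in layers:
        h = [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
             for row, b in zip(weights, biases)]
    return h


def parameter_count(sizes=LAYER_SIZES):
    """Weights plus biases over all fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


layers = init_mlp(random.Random(0))
action = forward(layers, [0.0] * 12)   # zero observation as a placeholder input
n_params = parameter_count()           # 3328 + 65792 + 514 = 69634
```

Because the final activation is $\tanh$, every component of the output lies in $[-1, 1]$ regardless of the weights, which is what makes the bound on the scaled residual hold by construction.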

§ 7 Notation

$o_t \in \mathbb{R}^{12}$: policy observation.
$\pi_\theta : \mathbb{R}^{12} \to [-1, 1]^2$: policy network.
$a_t \in [-1, 1]^2$: residual action (before scaling).
$r_t$: per-step reward.
$\gamma = 0.99$: discount factor.
$\lambda = 0.95$: GAE parameter.
$\epsilon = 0.1$: PPO clip range.
MDP, MLP, GAE: Markov decision process, multi-layer perceptron, generalised advantage estimation.

Residual PPO controller