Control · Residual PPO
A bounded neural correction trained with Proximal Policy Optimization, summed on top of the deployed PID command so that it fine-tunes the baseline without replacing it. Training is performed in MuJoCo with seven physical and sensing parameters randomised per episode.
The hand-tuned PID of § 2 is reliable inside the operating envelope it was tuned for: nominal ball mass, nominal plate friction, low sensing noise, low actuation latency. Outside that envelope its fixed gains $(K_p, K_i, K_d)$ no longer match the closed-loop dynamics, and the tracking error grows or the ball drops. Re-tuning by hand for every parameter setting is impractical, and a single conservative gain schedule trades nominal performance for robustness.
Residual reinforcement learning takes a different route. The PID is left in place, and a small additional correction is added on top of its command at every control step. Four design choices make this work:
The hand-tuned PID continues to do most of the work: pulling the ball toward the reference, damping the response, and cancelling steady-state offsets. The learned component only adds to its command; it never replaces the proportional, integral, or derivative terms.
A feedforward policy reads the current observation of the ball-plate system and emits a two-dimensional output. This output is the residual: a small per-axis tilt adjustment that is summed with the PID command before the wiring stage.
The residual is multiplied by a small fixed factor $\alpha = 0.3$ before being added. Provided each component of the policy output is bounded to $[-1, 1]$, this gives $\lVert \alpha\, \pi_\theta \rVert_\infty \le \alpha$: the network can refine the baseline but cannot override it. A zero policy output reproduces the deployed PID exactly, so a network whose initial output is close to zero starts out close to the baseline.
Producing a correction that generalises across many operating conditions requires training samples from many operating conditions. The simulator is therefore not run on the nominal plant: at every episode it is reset with seven physical and sensing parameters drawn from a wide envelope (domain randomisation).
The seven randomised parameters are the ball mass, the ball-plate friction, the plate-air friction, the initial ball offset from the plate centre, a small multiplicative perturbation on gravity, the standard deviation of the position-measurement noise, and the end-to-end actuation latency. The exact ranges are listed in § 5. Because every episode is a slightly different plant, the policy is forced to remain competent across the whole envelope of conditions rather than overfit to a single nominal setting.
Concretely, the closed-loop tilt command $u_t$ at each control step is the sum of the PID output $u_t^{\text{PID}}$ and the scaled policy output:

$$u_t = u_t^{\text{PID}} + \alpha\, \pi_\theta(o_t).$$
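A minimal Python sketch of this sum, for illustration only. The function and argument names are hypothetical, and clipping the raw policy output to $[-1, 1]$ per axis is an assumption made here so that the correction stays within $\pm\alpha$; the deployed controller may bound the output differently (for example with a $\tanh$ output layer).

```python
from typing import List, Sequence

ALPHA = 0.3  # residual scale alpha from the text


def clamp(x: float, lo: float = -1.0, hi: float = 1.0) -> float:
    """Clip x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def residual_command(pid_cmd: Sequence[float],
                     policy_out: Sequence[float],
                     alpha: float = ALPHA) -> List[float]:
    """Per-axis tilt command: PID output plus the scaled, bounded residual."""
    if len(pid_cmd) != len(policy_out):
        raise ValueError("pid_cmd and policy_out must have the same length")
    # Assumption: the raw policy output is clipped to [-1, 1] per axis, so the
    # learned correction can never exceed alpha in magnitude.
    return [u + alpha * clamp(a) for u, a in zip(pid_cmd, policy_out)]


# A zero residual leaves the PID command untouched.
print(residual_command([0.1, -0.2], [0.0, 0.0]))
```

With this structure, even a saturated policy output shifts each axis of the command by at most `alpha`.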
Training has to learn only the corrections that the fixed PID gains cannot express, which is a substantially easier problem than learning a full controller from scratch. The region of plant conditions over which the residual is competent is shown schematically below.
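The training problem is cast as an episodic Markov decision process (MDP) with state $s_t$, action $a_t$, transition kernel $\mathcal{T}(s_{t+1} \mid s_t, a_t)$, and reward $r(s_t, a_t)$. An episode lasts up to $T = 600$ control steps at a $30$ Hz policy rate, giving a $20$ s horizon. The episode terminates early if the ball leaves the plate, in which case a drop-terminal reward is applied. The objective is the expected discounted return,

$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T-1} \gamma^{t}\, r(s_t, a_t)\right],$$

where $\gamma \in (0, 1]$ is the discount factor and the sum is truncated at the drop step when the episode terminates early.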
The state $s_t$ is the full MuJoCo state of the arm and the ball; the policy observes a 12-dimensional projection $o_t$ defined in § 3.
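The discounted return of a single episode can be computed as in the sketch below. The discount factor, called `gamma` here, is passed in explicitly because its value is not stated in this section; the helper name is hypothetical.

```python
from typing import Sequence

T_MAX = 600                      # maximum control steps per episode (from the text)
POLICY_HZ = 30.0                 # policy rate in Hz (from the text)
HORIZON_S = T_MAX / POLICY_HZ    # episode horizon in seconds


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over one episode, which may end before T_MAX."""
    g = 0.0
    # Accumulate backwards so each step costs a single multiply-add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


print(HORIZON_S)
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

An episode that ends early simply supplies a shorter `rewards` sequence, with the drop-terminal reward as its last entry.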
The policy receives a 12-dimensional observation $o_t \in \mathbb{R}^{12}$ concatenated from the following blocks, all expressed in the plate frame $\{P\}$:
The lookahead lets the policy anticipate target motion. The PID alone trails a moving reference by a lag that grows with the ratio of the reference bandwidth to the closed-loop bandwidth; this lag is visible on circular and figure-of-eight trajectories. The lookahead term allows the residual to compensate for it without modifying the PID gains.
The per-step reward combines three positive shaping terms and two negative regularisers:
A drop event terminates the episode and applies a large terminal penalty $-w_d \cdot (\text{steps remaining}) / T$, so dropping the ball early in an episode is punished more heavily than dropping it late.
The deployed weights are $w_t = 1.0$, $w_v = 0.3$, $w_g = 0.5$, $w_a = 0.01$, $w_s = 0.05$, $w_d = 100$.
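As a worked example of the drop penalty with $w_d = 100$ and $T = 600$, the sketch below evaluates $-w_d \cdot (\text{steps remaining}) / T$ for an early and a late drop. It assumes that "steps remaining" means $T - t$ for a drop at step $t$; the function name is hypothetical.

```python
W_D = 100.0   # drop weight w_d from the text
T_MAX = 600   # episode length T in control steps


def drop_penalty(step: int, w_d: float = W_D, t_max: int = T_MAX) -> float:
    """Terminal penalty -w_d * (steps remaining) / T for a drop at `step`."""
    if not 0 <= step <= t_max:
        raise ValueError("step must lie in [0, t_max]")
    steps_remaining = t_max - step  # assumption: counted from the drop step
    return -w_d * steps_remaining / t_max


print(drop_penalty(150))  # drop after 5 s
print(drop_penalty(540))  # drop after 18 s
```

Under this assumption a drop after 5 s costs $-75$ and a drop after 18 s costs $-10$, so the penalty shrinks linearly as the ball survives longer.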
Each training episode samples seven physical and sensing parameters uniformly within ranges that bracket the values seen on the real hardware. Because the policy must perform across the full range, it becomes robust to whichever values within those ranges the real system actually exhibits.
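Per-episode uniform sampling can be sketched as follows. The numeric ranges in `PLACEHOLDER_RANGES` are invented for illustration and are not the ranges used in training; the parameter keys and function name are likewise hypothetical.

```python
import random
from typing import Dict, Optional, Tuple

Range = Tuple[float, float]

# PLACEHOLDER values for illustration only; NOT the ranges used in training.
PLACEHOLDER_RANGES: Dict[str, Range] = {
    "ball_mass_kg": (0.02, 0.06),
    "ball_plate_friction": (0.2, 0.8),
    "plate_air_friction": (0.0, 0.1),
    "initial_offset_m": (0.0, 0.05),
    "gravity_scale": (0.98, 1.02),
    "position_noise_std_m": (0.0, 0.003),
    "actuation_latency_s": (0.0, 0.1),
}


def sample_episode_params(ranges: Dict[str, Range],
                          rng: Optional[random.Random] = None) -> Dict[str, float]:
    """Draw one independent uniform sample per parameter for a new episode."""
    rng = rng or random.Random()
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


params = sample_episode_params(PLACEHOLDER_RANGES, random.Random(0))
print(sorted(params))
```

Passing a seeded `random.Random` makes an individual episode reproducible, which can help when debugging a failure at a specific parameter draw.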
Action delay is implemented as a circular buffer of length $\lceil \tau / \Delta t \rceil$ where $\tau$ is the sampled latency and $\Delta t$ is the policy step; the action applied at step $t$ is the action requested $\lceil \tau / \Delta t \rceil$ steps earlier.
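The delay buffer described above can be sketched with a fixed-length `collections.deque`. The class name is hypothetical, and two details are assumptions: the buffer is pre-filled with a zero action, and a small epsilon guards the ceiling against floating-point round-up when $\tau / \Delta t$ is an exact integer.

```python
import math
from collections import deque
from typing import Deque, Tuple

Action = Tuple[float, float]


class ActionDelay:
    """Delay actions by ceil(tau / dt) policy steps using a fixed-length buffer."""

    def __init__(self, tau: float, dt: float, fill: Action = (0.0, 0.0)) -> None:
        if tau < 0 or dt <= 0:
            raise ValueError("tau must be >= 0 and dt must be > 0")
        # The epsilon avoids e.g. 3.0000000000000004 being rounded up to 4.
        self.n = max(0, math.ceil(tau / dt - 1e-9))
        # Assumption: the first n steps apply `fill` (a zero action).
        self.buf: Deque[Action] = deque([fill] * self.n, maxlen=self.n)

    def step(self, requested: Action) -> Action:
        """Store the requested action and return the one requested n steps ago."""
        if self.n == 0:
            return requested
        delayed = self.buf[0]        # oldest entry in the buffer
        self.buf.append(requested)   # maxlen evicts the oldest entry
        return delayed


delay = ActionDelay(tau=0.05, dt=1.0 / 30.0)  # 50 ms latency at 30 Hz
print(delay.n)
print([delay.step((float(i), 0.0)) for i in range(1, 5)])
```

With a 50 ms latency at the 30 Hz policy rate the buffer holds two entries, so the action requested at step $t$ takes effect at step $t + 2$.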
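The policy is trained with Proximal Policy Optimization (PPO), an on-policy policy-gradient method that stabilises updates by clipping the importance-sampling ratio between the new and old policies. With $r_t(\theta) = \pi_\theta(a_t \mid o_t) / \pi_{\theta_{\text{old}}}(a_t \mid o_t)$ and the generalised advantage estimate $\hat A_t$, the surrogate objective is

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\,\hat A_t,\ \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat A_t\big)\right]$$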
with clip range $\epsilon = 0.1$. The advantages are computed by Generalised Advantage Estimation (GAE) with parameter $\lambda = 0.95$. Both the policy and the value network are multi-layer perceptrons (MLPs) with two hidden layers of $256$ units and $\tanh$ activations. Sixteen MuJoCo environments run in parallel via the Stable-Baselines3 vectorised wrapper. Over the full training schedule, the learning rate decays linearly from $10^{-4}$ to $10^{-5}$ and the entropy coefficient from $5 \cdot 10^{-3}$ to $5 \cdot 10^{-4}$. The training curve and the benchmark are reported on the results page.
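The three numerical ingredients named above (linear decay, GAE, and the clipped surrogate) can be sketched in plain Python as below. This is an illustration, not the training code: the function names are hypothetical, the discount `gamma` is passed in explicitly, and the GAE helper assumes a single rollout segment without an episode boundary inside it. Stable-Baselines3 accepts a callable of `progress_remaining` (running from 1 to 0) as `learning_rate`, which is the convention `linear_schedule` follows; how the entropy coefficient is decayed in the actual setup is not stated here.

```python
from typing import Callable, List, Sequence


def linear_schedule(start: float, end: float) -> Callable[[float], float]:
    """Return f(progress_remaining): `start` at 1.0, `end` at 0.0, linear between."""
    def f(progress_remaining: float) -> float:
        return end + progress_remaining * (start - end)
    return f


def gae(rewards: Sequence[float], values: Sequence[float], last_value: float,
        gamma: float, lam: float = 0.95) -> List[float]:
    """Generalised Advantage Estimation over one segment with no terminal step."""
    adv = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # discounted sum of residuals
        adv[t] = running
        next_value = values[t]
    return adv


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.1) -> float:
    """Per-sample PPO objective: min of the unclipped and clipped terms."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


lr = linear_schedule(1e-4, 1e-5)       # learning-rate endpoints from the text
ent = linear_schedule(5e-3, 5e-4)      # entropy-coefficient endpoints from the text
print(lr(0.5), ent(0.5))               # values halfway through training
print(clipped_surrogate(1.5, 1.0))     # ratio above 1 + eps, positive advantage
```

With $\epsilon = 0.1$, a sample whose ratio has already moved past $1.1$ in the direction favoured by a positive advantage contributes a constant, so its gradient vanishes and the update cannot push that ratio further.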