Benchmark across the DR envelope
The residual policy of § 1 is benchmarked against the deployed PID across nine simulation conditions: a nominal baseline that matches the calibrated lab setup, seven perturbations that each push a single parameter outside the tuning window (weaker or stronger gravity, a more slippery or a more sticky plate, a heavier ball, noisier vision, longer actuation delay), and a combined condition that varies all of them at once. Each (condition, controller) pair is run for $30$ episodes with seeds shared across controllers. The residual reduces the drop rate under every perturbation. The largest reductions are under stronger gravity ($0.70 \to 0.03$) and on a high-friction plate ($0.33 \to 0.00$).
Mean episode reward of the Residual PPO policy across $60$ million PPO timesteps, collected from $16$ parallel MuJoCo environments. The reward climbs from near zero to roughly $200$ and plateaus once the linear learning-rate schedule of the training algorithm (§ 6 of the Residual PPO page) has decayed.
The PID of the PID page and the trained Residual PPO policy of the Residual PPO page are evaluated on eight perturbed simulation domains plus the nominal one. Each (domain, policy) cell is run for $30$ episodes with fixed per-episode seeds shared across policies, so within a domain the two controllers see the same random domain parameters in each episode. Each episode lasts $600$ control steps. Two scalar metrics are reported per cell: the drop rate, the fraction of episodes in which the ball leaves the plate before the horizon expires; and the tracking RMSE, the root-mean-square position error against the reference trajectory averaged over the surviving steps and over the $30$ episodes.
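The two metrics and the shared-seed protocol can be sketched as follows. This is a minimal illustration, not the evaluation code behind the results: `run_episode` is a hypothetical callable standing in for the simulator rollout, and the sketch assumes the RMSE is computed per episode over the steps before the ball leaves the plate and then averaged over episodes, which is one possible reading of the definition above.

```python
import numpy as np

N_EPISODES = 30  # episodes per (domain, policy) cell, from the text
HORIZON = 600    # control steps per episode, from the text


def episode_metrics(pos, ref, on_plate):
    """Return (dropped, rmse) for one episode.

    pos, ref : (T, 2) arrays of ball and reference positions in metres.
    on_plate : (T,) boolean array, False from the step the ball leaves the plate.
    """
    on_plate = np.asarray(on_plate, dtype=bool)
    dropped = not on_plate.all()
    # Surviving steps are those before the first off-plate step.
    n_alive = int(np.argmin(on_plate)) if dropped else len(on_plate)
    if n_alive == 0:
        return dropped, float("nan")
    diff = np.asarray(pos)[:n_alive] - np.asarray(ref)[:n_alive]
    err = np.linalg.norm(diff, axis=1)
    return dropped, float(np.sqrt(np.mean(err ** 2)))


def evaluate_cell(run_episode, controller, seeds):
    """Drop rate and mean per-episode RMSE for one (domain, policy) cell.

    run_episode(controller, seed) -> (pos, ref, on_plate) is a hypothetical
    rollout function; passing the same seeds for every controller gives
    both controllers the same sampled domain parameters per episode.
    """
    results = [episode_metrics(*run_episode(controller, s)) for s in seeds]
    drop_rate = float(np.mean([d for d, _ in results]))
    rmse = float(np.nanmean([r for _, r in results]))
    return drop_rate, rmse
```

Calling `evaluate_cell` once per controller with the same `seeds` list (for example `list(range(N_EPISODES))`) reproduces the paired comparison: any difference between the two cells then comes from the controller rather than from the sampled domain.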
The numerical results behind Figure 2 are listed in Table 1. The residual reduces the drop rate in every perturbed condition. The largest gain is under stronger-than-nominal gravity, where $70\%$ of the PID episodes end with a dropped ball against $3\%$ for the residual. The gains are smallest under longer actuation delay ($30\% \to 20\%$) and under weaker gravity ($40\% \to 27\%$); weaker gravity is also where the residual's drop rate stays highest, and there the ball moves slowly enough that the dynamics, not the controller, dominate the error. On the nominal baseline both controllers complete every episode.
| Domain | Drop rate (PID) | Drop rate (Residual PPO) | RMSE (PID) [m] | RMSE (Residual PPO) [m] |
|---|---|---|---|---|
| standard | 0.00 | 0.00 | 0.034 | 0.036 |
| slippery | 0.20 | 0.03 | 0.053 | 0.038 |
| sticky | 0.33 | 0.00 | 0.074 | 0.035 |
| heavy_ball | 0.30 | 0.03 | 0.068 | 0.039 |
| low_gravity | 0.40 | 0.27 | 0.091 | 0.070 |
| high_gravity | 0.70 | 0.03 | 0.117 | 0.037 |
| noisy_vision | 0.30 | 0.03 | 0.075 | 0.040 |
| high_latency | 0.30 | 0.20 | 0.075 | 0.054 |
| full_dr | 0.33 | 0.07 | 0.075 | 0.041 |
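The per-domain change in drop rate can be recomputed from Table 1. The sketch below copies the drop rates from the table; the function and field names are illustrative, and the conversion to episode counts assumes each rate is an episode count out of $30$ rounded to two decimals, which is inferred from the evaluation protocol rather than stated.

```python
# Drop rates copied from Table 1: domain -> (PID, Residual PPO).
DROP_RATE = {
    "slippery": (0.20, 0.03),
    "sticky": (0.33, 0.00),
    "heavy_ball": (0.30, 0.03),
    "low_gravity": (0.40, 0.27),
    "high_gravity": (0.70, 0.03),
    "noisy_vision": (0.30, 0.03),
    "high_latency": (0.30, 0.20),
    "full_dr": (0.33, 0.07),
}
N_EPISODES = 30  # episodes per cell, from the evaluation protocol


def drop_reductions(table, n_episodes=N_EPISODES):
    """Rows sorted by absolute drop-rate reduction, largest first."""
    rows = []
    for domain, (pid, residual) in table.items():
        rows.append({
            "domain": domain,
            "absolute": pid - residual,
            # Assumed: rates are rounded episode counts out of n_episodes.
            "episodes_saved": round(pid * n_episodes) - round(residual * n_episodes),
        })
    return sorted(rows, key=lambda r: r["absolute"], reverse=True)


if __name__ == "__main__":
    for row in drop_reductions(DROP_RATE):
        print(f"{row['domain']:<13} {row['absolute']:.2f} {row['episodes_saved']:>3}")
```

With only $30$ episodes per cell, a difference of one or two episodes moves the drop rate by $0.03$ to $0.07$, so small gaps between domains should be read with that granularity in mind.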
A single episode on the nominal domain with a square reference trajectory. Both controllers complete the trajectory without dropping the ball. The square pattern exposes the corner-tracking behaviour: at each turn the reference velocity changes direction discontinuously, and each controller deviates slightly from the reference for a few control ticks before catching up.
The PID was tuned at the nominal operating point and tracks the reference well there. Its gains are fixed, so when the simulator moves away from those conditions the closed-loop dynamics no longer match what the gains were chosen for and the ball is lost more often. The residual policy was trained across the whole envelope under domain randomisation; its drop rate stays at or below $0.07$ in six of the eight perturbed conditions (the exceptions are high latency at $0.20$ and low gravity at $0.27$). In those six conditions its RMSE stays within $0.005$ m of its nominal value, so it recovers most of the tracking accuracy that the fixed-gain PID loses off nominal.
The improvement under weaker-than-nominal gravity is among the smallest, and the residual's drop rate stays highest there ($0.27$). The ball moves slowly, the dynamics evolve on a longer effective time horizon, and the constant-velocity prior of the Kalman filter is least accurate in this regime. The remaining error is dominated by estimation rather than control, so both controllers degrade together.