Benchmark across the DR envelope

Domain randomisation

The residual policy of § 1 is benchmarked against the deployed PID across nine simulation conditions: a nominal baseline that matches the calibrated lab setup, plus eight perturbed conditions. Seven of these each push a single physical parameter outside the tuning window (weaker or stronger gravity, a more slippery or a stickier plate, a heavier ball, noisier vision, longer actuation delay), and the eighth is a combined condition that varies all of them at once. Each (condition, controller) pair is run for $30$ episodes with seeds shared across controllers. The residual reduces the drop rate under every perturbation. The largest reductions are under stronger gravity ($0.70 \to 0.03$) and on a high-friction plate ($0.33 \to 0.00$).

§ 1 Training curve

Figure 1 shows the mean episode reward of the Residual PPO policy across $60$ million PPO timesteps, collected from $16$ parallel MuJoCo environments. The reward climbs from near zero to roughly $200$ and plateaus once the linear learning-rate schedule of the training algorithm (§ 6 of the Residual PPO page) has decayed.
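A linear learning-rate schedule of the kind mentioned above can be sketched as follows. This is a minimal illustration, not the training code: the initial rate `lr_start` and the final rate `lr_end` are assumed values chosen only for the example, and only the $60$ M timestep budget comes from the text.

```python
def linear_lr(step, total_steps=60_000_000, lr_start=3e-4, lr_end=0.0):
    """Linearly interpolate the learning rate from lr_start to lr_end.

    total_steps matches the 60 M timestep budget quoted in the text;
    lr_start and lr_end are hypothetical values for illustration.
    """
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lr_start + frac * (lr_end - lr_start)
```

Under these assumptions the update size shrinks steadily over training, which is consistent with the reward curve flattening towards the end of the run.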

Figure 1. Mean episode reward over PPO training, $60$ M timesteps.

§ 2 Benchmark protocol

The PID controller described on the PID page and the trained Residual PPO policy described on the Residual PPO page are evaluated on eight perturbed simulation domains plus the nominal one. Each (domain, policy) cell is run for $30$ episodes with fixed per-episode seeds shared across policies, so the two controllers see the same random domain parameters within a cell. Each episode lasts $600$ control steps. Two scalar metrics are reported per cell. The drop rate is the fraction of episodes in which the ball leaves the plate before the horizon expires. The tracking RMSE is the root-mean-square position error against the reference trajectory, averaged over the surviving steps and over the $30$ episodes.
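The two metrics and the shared seeding can be sketched as below. This is an illustration under stated assumptions, not the benchmark code: the seeding formula in `episode_seeds` is hypothetical, each episode is assumed to log one position-error norm per step while the ball is on the plate, and the RMSE is assumed to be computed per episode over the surviving steps and then averaged over episodes (the text does not say whether errors are pooled instead).

```python
import numpy as np

HORIZON = 600      # control steps per episode (from the protocol)
N_EPISODES = 30    # episodes per (domain, policy) cell (from the protocol)

def episode_seeds(domain_index, n_episodes=N_EPISODES):
    """Hypothetical seeding scheme: the seed depends on the domain and the
    episode index only, never on the policy, so both controllers replay
    the same sampled domain parameters."""
    return [1000 * domain_index + k for k in range(n_episodes)]

def cell_metrics(errors, horizon=HORIZON):
    """Drop rate and tracking RMSE for one (domain, policy) cell.

    errors[k] is a 1-D array of position-error norms [m], one entry per
    step that the ball stayed on the plate in episode k. An episode
    shorter than the horizon is counted as a drop.
    """
    dropped = [len(e) < horizon for e in errors]
    drop_rate = sum(dropped) / len(errors)
    # Assumption: RMSE per episode over surviving steps, then episode mean.
    per_episode = [float(np.sqrt(np.mean(np.square(e)))) for e in errors if len(e) > 0]
    rmse = float(np.mean(per_episode))
    return drop_rate, rmse
```

Calling `episode_seeds(d)` once per domain and reusing the list for both controllers is what makes the comparison within a cell paired rather than independent.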

§ 3 Per-domain comparison

Figure 2. Drop rate (left) and tracking RMSE (right) across eight perturbed domains plus the nominal one. PID baseline against Residual PPO, $30$ episodes per cell.

The numerical results behind Figure 2 are listed in Table 1. The residual reduces the drop rate in every perturbed condition. The largest gain is under stronger-than-nominal gravity, where $70\%$ of the PID episodes ended with a dropped ball against $3\%$ for the residual. The smallest gains are under longer actuation delay ($30\% \to 20\%$) and weaker gravity ($40\% \to 27\%$); in the latter the ball moves slowly enough that the dynamics, not the controller, dominate the error. On the nominal domain both controllers complete every episode, and their tracking RMSE is nearly identical ($0.034$ m for the PID against $0.036$ m for the residual).

Table 1. Drop rate and tracking RMSE per domain. Lower is better.
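| Domain | Drop rate (PID) | Drop rate (Residual PPO) | RMSE (PID) [m] | RMSE (Residual PPO) [m] |
|---|---|---|---|---|
| standard | 0.00 | 0.00 | 0.034 | 0.036 |
| slippery | 0.20 | 0.03 | 0.053 | 0.038 |
| sticky | 0.33 | 0.00 | 0.074 | 0.035 |
| heavy_ball | 0.30 | 0.03 | 0.068 | 0.039 |
| low_gravity | 0.40 | 0.27 | 0.091 | 0.070 |
| high_gravity | 0.70 | 0.03 | 0.117 | 0.037 |
| noisy_vision | 0.30 | 0.03 | 0.075 | 0.040 |
| high_latency | 0.30 | 0.20 | 0.075 | 0.054 |
| full_dr | 0.33 | 0.07 | 0.075 | 0.041 |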
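The per-domain gains can be recomputed from the drop-rate columns of Table 1. The sketch below copies those values into a dictionary and reports both the absolute reduction and the relative reduction; the helper name `drop_rate_gains` is introduced here for illustration only.

```python
# Drop rates from Table 1: domain -> (PID, Residual PPO).
TABLE_1 = {
    "standard":     (0.00, 0.00),
    "slippery":     (0.20, 0.03),
    "sticky":       (0.33, 0.00),
    "heavy_ball":   (0.30, 0.03),
    "low_gravity":  (0.40, 0.27),
    "high_gravity": (0.70, 0.03),
    "noisy_vision": (0.30, 0.03),
    "high_latency": (0.30, 0.20),
    "full_dr":      (0.33, 0.07),
}

def drop_rate_gains(table):
    """Return {domain: (absolute_reduction, relative_reduction)}.

    The relative reduction is None where the PID never drops the ball,
    because the ratio is undefined there.
    """
    gains = {}
    for domain, (pid, residual) in table.items():
        absolute = round(pid - residual, 2)
        relative = absolute / pid if pid > 0 else None
        gains[domain] = (absolute, relative)
    return gains

if __name__ == "__main__":
    for domain, (absolute, relative) in drop_rate_gains(TABLE_1).items():
        rel = "n/a" if relative is None else f"{relative:.0%}"
        print(f"{domain:13s} absolute {absolute:.2f}  relative {rel}")
```

The two measures rank the domains slightly differently: the absolute reduction is smallest for `high_latency` ($0.10$), while the relative reduction is smallest for `low_gravity` (about a third of the PID drops are avoided).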

§ 4 Example trajectory

Figure 3 shows a single episode on the nominal domain with a square reference trajectory. Both controllers complete the trajectory without dropping the ball. The square pattern exposes the corner-tracking behaviour: at each turn the reference velocity changes direction discontinuously, and each controller deviates slightly from the reference for a few control ticks before catching up.

Figure 3. A square reference trajectory, with the ball positions tracked by the PID and by Residual PPO overlaid. Both controllers complete the episode on the nominal domain.
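To make the velocity discontinuity at the corners concrete, a constant-speed square reference can be generated as sketched below. The half side length, the number of laps and the counter-clockwise direction are assumptions made for the example; the actual reference used in Figure 3 may be parameterised differently.

```python
import numpy as np

def square_reference(n_steps=600, half_side=0.05, laps=1):
    """Constant-speed square path centred on the plate origin.

    Returns (pos, vel), each of shape (n_steps, 2). pos is in metres and
    vel in metres per control step. half_side and laps are hypothetical.
    """
    corners = np.array([
        [ half_side,  half_side],
        [-half_side,  half_side],
        [-half_side, -half_side],
        [ half_side, -half_side],
    ])
    # Phase in [0, 4 * laps): integer part selects the edge,
    # fractional part is the progress along that edge.
    phase = np.arange(n_steps) * 4 * laps / n_steps
    edge = np.floor(phase).astype(int) % 4
    frac = (phase - np.floor(phase))[:, None]
    start = corners[edge]
    end = corners[(edge + 1) % 4]
    pos = start + frac * (end - start)
    # Velocity is constant along an edge and jumps by 90 degrees at a corner.
    vel = (end - start) * (4 * laps / n_steps)
    return pos, vel
```

With the default arguments the reference turns every $150$ steps; the speed is the same on both sides of a corner, but the direction rotates by a quarter turn between two consecutive steps, which is the jump that both controllers need a few ticks to absorb.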

§ 5 Discussion

The PID was tuned at the nominal operating point and tracks the reference well there. Its gains are fixed, so when the simulator moves away from those conditions the closed-loop dynamics no longer match what the gains were chosen for and the ball is lost. The residual policy was trained across the whole envelope under domain randomisation; its drop rate is lower than the PID's in every perturbed condition (at most $0.07$ in six of the eight), and it recovers most of the tracking accuracy that the fixed-gain PID loses off nominal.

The weakest result is under weaker-than-nominal gravity, where the residual still drops $27\%$ of episodes, its highest drop rate in any domain. The ball moves slowly, the dynamics evolve on a longer effective time horizon, and the constant-velocity prior of the Kalman filter is least accurate in this regime. The remaining error is dominated by estimation rather than control, so both controllers degrade together.
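For reference, the prediction step of a constant-velocity Kalman filter can be sketched as follows for a single plate axis. This is a generic textbook form, not the filter used in the project: the sample period `dt`, the process-noise intensity `q` and the white-noise-acceleration form of `Q` are assumptions made for the example.

```python
import numpy as np

def cv_predict(x, P, dt, q):
    """One predict step of a constant-velocity Kalman filter (one axis).

    x  : state [position, velocity]
    P  : 2x2 state covariance
    dt : sample period [s] (hypothetical value supplied by the caller)
    q  : process-noise intensity (hypothetical tuning parameter)
    """
    # Constant-velocity transition: position advances by velocity * dt,
    # velocity is assumed unchanged over the step.
    F = np.array([[1.0, dt],
                  [0.0, 1.0]])
    # White-noise-acceleration process noise (an assumed, common choice).
    Q = q * np.array([[dt**3 / 3.0, dt**2 / 2.0],
                      [dt**2 / 2.0, dt]])
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred
```

The transition matrix contains no acceleration term, so any acceleration of the ball appears as prediction error that the measurement update has to correct. When that correction is poor, both controllers receive the same degraded state estimate.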

§ 6 Notation

RMSE
root-mean-square error.
Drop rate
fraction of episodes in which the ball leaves the plate before the $600$-step horizon expires.
Nominal domain
simulation with the mean values of the randomisation ranges and no perturbation.
DR
domain randomisation (see § 5 of the Residual PPO page).
