Domain randomisation · Hold UR Pla7e

§ 1 Training curve

Mean episode reward of the Residual PPO policy across $60$ million PPO timesteps, collected from $16$ parallel MuJoCo environments. The reward climbs from near zero to roughly $200$ and plateaus once the linear learning-rate schedule of the training algorithm (§ 6 of the Residual PPO page) has decayed.
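A linear learning-rate schedule of the kind mentioned above can be sketched as follows. This is a minimal illustration, not the training code: the starting rate `lr_start` and the final rate `lr_end` are assumptions chosen for the example, and only the $60$ M timestep budget comes from this page.

```python
def linear_lr(step, total_steps=60_000_000, lr_start=3e-4, lr_end=0.0):
    """Linearly interpolate the learning rate from lr_start to lr_end.

    lr_start and lr_end are hypothetical values for illustration only.
    """
    # Clamp progress to [0, 1] so calls past the budget return lr_end.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_start + (lr_end - lr_start) * progress
```

With a schedule like this, late updates move the policy very little, which is consistent with the reward flattening towards the end of training.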

Figure 1. PPO training curve: mean episode reward over PPO training, $60$ M timesteps.

§ 2 Benchmark protocol

The PID controller of the PID page and the trained Residual PPO policy of the Residual PPO page are evaluated on eight perturbed simulation domains plus the nominal one. Each (domain, policy) cell is run for $30$ episodes of $600$ control steps each. The per-episode seeds are fixed and shared across policies, so the two controllers see the same random domain parameters within a cell. Two scalar metrics are reported per cell: the drop rate, the fraction of episodes in which the ball leaves the plate before the horizon expires; and the tracking RMSE, the root-mean-square position error against the reference trajectory, averaged over the surviving steps and over the $30$ episodes.
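The protocol can be sketched in a few helper functions. The seeding scheme (`episode_seeds`) is hypothetical and only illustrates seeds that depend on the domain and episode but not on the policy. The RMSE aggregation (mean of per-episode RMSE values) is one reading of the definition above and is likewise an assumption.

```python
import math

def episode_seeds(domain_index, n_episodes=30, base_seed=0):
    """Hypothetical scheme: one seed per (domain, episode), shared by both policies."""
    return [base_seed + 1000 * domain_index + k for k in range(n_episodes)]

def drop_rate(dropped):
    """Fraction of episodes in which the ball left the plate (list of booleans)."""
    return sum(dropped) / len(dropped)

def episode_rmse(errors):
    """RMS of the per-step position errors over the surviving steps of one episode."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def cell_rmse(per_episode_errors):
    """Assumed aggregation: average the per-episode RMSE over the episodes of a cell."""
    return sum(episode_rmse(e) for e in per_episode_errors) / len(per_episode_errors)
```

Because the seed list does not take the policy as an argument, the PID and the residual policy are evaluated on identical draws within a cell.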

§ 3 Per-domain comparison

Figure 2. PID vs Residual PPO benchmark: drop rate (left) and tracking RMSE (right) across eight perturbed domains plus the nominal one. PID baseline against Residual PPO, $30$ episodes per cell.

The numerical results behind Figure 2 are listed in Table 1. The residual reduces the drop rate in every perturbed condition. The largest gain is under stronger-than-nominal gravity, where $70\%$ of the PID episodes ended with a dropped ball against $3\%$ for the residual. The gains are smallest under weaker gravity ($40\% \to 27\%$) and high latency ($30\% \to 20\%$), each removing about a third of the PID drops. Weaker gravity also leaves the residual with its highest drop rate; there the ball moves slowly enough that the dynamics, not the controller, dominate the error. On the nominal domain (standard in Table 1) both controllers complete every episode.

Table 1. Drop rate and tracking RMSE per domain. Lower is better.
Domain	Drop rate (PID)	Drop rate (Residual PPO)	RMSE (PID) [m]	RMSE (Residual PPO) [m]
standard	0.00	0.00	0.034	0.036
slippery	0.20	0.03	0.053	0.038
sticky	0.33	0.00	0.074	0.035
heavy_ball	0.30	0.03	0.068	0.039
low_gravity	0.40	0.27	0.091	0.070
high_gravity	0.70	0.03	0.117	0.037
noisy_vision	0.30	0.03	0.075	0.040
high_latency	0.30	0.20	0.075	0.054
full_dr	0.33	0.07	0.075	0.041
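To compare gains across domains, the drop rates in Table 1 can be converted back to whole-episode counts. The sketch below assumes that each reported rate is a multiple of $1/30$ rounded to two decimals (for example $0.03 \approx 1/30$); that rounding assumption is not stated on this page.

```python
EPISODES_PER_CELL = 30

# (PID drop rate, Residual PPO drop rate), copied from Table 1.
DROP_RATES = {
    "slippery": (0.20, 0.03),
    "sticky": (0.33, 0.00),
    "heavy_ball": (0.30, 0.03),
    "low_gravity": (0.40, 0.27),
    "high_gravity": (0.70, 0.03),
    "noisy_vision": (0.30, 0.03),
    "high_latency": (0.30, 0.20),
    "full_dr": (0.33, 0.07),
}

def dropped_episodes(rate, n=EPISODES_PER_CELL):
    """Recover the episode count from a rate, assuming it is a rounded multiple of 1/n."""
    return round(rate * n)

def gains(drop_rates):
    """Per domain: (PID drops, residual drops, episodes saved, fraction of PID drops removed)."""
    out = {}
    for domain, (pid, residual) in drop_rates.items():
        pid_drops = dropped_episodes(pid)
        res_drops = dropped_episodes(residual)
        saved = pid_drops - res_drops
        out[domain] = (pid_drops, res_drops, saved, saved / pid_drops)
    return out

if __name__ == "__main__":
    for domain, row in gains(DROP_RATES).items():
        print(domain, row)
```

Under that assumption the weaker-gravity cell goes from 12 dropped episodes to 8 and the high-latency cell from 9 to 6, while the stronger-gravity cell goes from 21 to 1.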

§ 4 Example trajectory

A single episode on the nominal domain with a square reference trajectory. Both controllers complete the trajectory without dropping the ball. The square pattern exposes the corner-tracking behaviour: at each turn the reference velocity changes direction discontinuously, and each controller deviates slightly from the reference for a few control ticks before catching up.

Figure 3. Square trajectory overlay: a square reference trajectory, with the ball positions tracked by the PID and by Residual PPO overlaid. Both controllers complete the episode on the nominal domain.
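A square reference of this kind can be generated as in the sketch below. The half side length, the period and the constant traversal speed are hypothetical choices for illustration; the parameters of the reference actually used are not given on this page.

```python
def square_reference(step, half_side=0.05, period=600):
    """Reference point on a square perimeter, traversed at constant speed.

    half_side (metres) and period (control steps per lap) are hypothetical.
    """
    s = (step % period) / period * 4.0  # position along the perimeter, in [0, 4)
    edge = int(s)                       # index of the current edge, 0..3
    t = s - edge                        # progress along that edge, in [0, 1)
    a = half_side
    if edge == 0:
        return (-a + 2 * a * t, -a)     # bottom edge, moving in +x
    if edge == 1:
        return (a, -a + 2 * a * t)      # right edge, moving in +y
    if edge == 2:
        return (a - 2 * a * t, a)       # top edge, moving in -x
    return (-a, a - 2 * a * t)          # left edge, moving in -y
```

At each change of `edge` the reference velocity switches axis instantaneously, which is the discontinuity that produces the brief corner deviations described above.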

§ 5 Discussion

The PID was tuned at the nominal operating point and tracks the reference well there. Its gains are fixed, so when the simulator moves away from those conditions the closed-loop dynamics no longer match what the gains were chosen for, and the ball is lost more often. The residual policy was trained across the whole envelope under domain randomisation; its drop rate is below the PID's in every perturbed condition and at or below $7\%$ in all but the low-gravity and high-latency domains, and it recovers most of the tracking accuracy that the fixed-gain PID loses off nominal.
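The division of labour between the two controllers can be sketched as a fixed-gain PID whose output is corrected by a learned term. The additive, scaled and clipped combination below is the form commonly used in residual reinforcement learning and is an assumption here; the actual combination is defined on the Residual PPO page. The gains, `scale` and `limit` are hypothetical.

```python
class PID:
    """Single-axis PID with fixed gains (hypothetical values supplied by the caller)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def __call__(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def residual_action(u_pid, residual, scale=0.2, limit=1.0):
    """Assumed combination: PID command plus a scaled policy output, clipped to the actuator range."""
    u = u_pid + scale * residual
    return max(-limit, min(limit, u))
```

With `residual` equal to zero the command reduces to the PID output, so in this sketch the learned term only has to correct the mismatch that appears off nominal.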

Residual PPO fares worst under weaker-than-nominal gravity, where it retains its highest drop rate ($27\%$) and tracking RMSE ($0.070$ m). The ball moves slowly, the dynamics evolve on a longer effective time horizon, and the constant-velocity prior of the Kalman filter is least accurate in this regime. The remaining error is dominated by estimation rather than control, so both controllers degrade together.
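For reference, the prediction step of a constant-velocity Kalman filter on one axis looks like the sketch below. The state layout (position, velocity), the time step and the process-noise terms are assumptions for illustration; the filter used in the project may differ.

```python
def cv_predict(state, cov, dt, q_pos=0.0, q_vel=0.0):
    """One constant-velocity prediction step for a single axis.

    state: (position, velocity); cov: 2x2 covariance as nested tuples.
    Implements x' = F x and P' = F P F^T + Q with F = [[1, dt], [0, 1]]
    and Q = diag(q_pos, q_vel). All parameter values are hypothetical.
    """
    p, v = state
    (a, b), (c, d) = cov
    new_state = (p + v * dt, v)  # velocity is assumed unchanged over the step
    new_cov = (
        (a + dt * (b + c) + dt * dt * d + q_pos, b + dt * d),
        (c + dt * d, d + q_vel),
    )
    return new_state, new_cov
```

The model assumes zero acceleration between steps, so any real acceleration of the ball has to be absorbed by the process noise and corrected by the measurement update.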

§ 6 Notation

RMSE: root-mean-square error.
Drop rate: fraction of episodes in which the ball leaves the plate before the $600$-step horizon expires.
Nominal domain: simulation with the mean values of the randomisation ranges and no perturbation (listed as standard in Table 1).
DR: domain randomisation (see § 5 of the Residual PPO page).
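The relation between the nominal domain and a randomised one can be sketched as follows. The parameter names and ranges are hypothetical, since the real ranges are not listed on this page, and uniform sampling is an assumption under which the mean of a range is its midpoint.

```python
import random

# Hypothetical randomisation ranges (low, high), for illustration only.
RANGES = {
    "friction": (0.2, 1.0),
    "ball_mass": (0.02, 0.06),
    "gravity": (7.0, 12.0),
}

def nominal_domain(ranges):
    """Nominal parameters: the mean of each range (its midpoint under uniform sampling)."""
    return {name: 0.5 * (lo + hi) for name, (lo, hi) in ranges.items()}

def sample_domain(ranges, rng):
    """One randomised domain: each parameter drawn uniformly from its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
```

Seeding the generator, for example `sample_domain(RANGES, random.Random(7))`, makes a draw reproducible, which is what allows two controllers to be compared on identical domain parameters.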
