Note

Off-Policy Drift

Classical synchronous RL training waits for data generation to finish before doing any gradient update, so GPUs sit idle while data is being generated.
Asynchronous RL runs data generation and policy updates in parallel, which makes it more efficient. But data generation is then no longer happening under the current policy; it uses a slightly older version, since gradient updates keep landing while the data is still being generated. So algorithms that are on-policy on paper become off-policy in practice once the RL infra is optimized this way.
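
A rough way to see the drift is a toy sketch, not a real training loop. It assumes a hypothetical 3-action softmax policy where each gradient update nudges one logit by a fixed step, and it assumes the generator always runs a policy that is `lag` updates behind the learner. The names `policy_at`, `drift`, `step`, and `lag` are made up for illustration. It reports KL(behavior || current) at each update.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    # KL divergence KL(p || q) for two discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def policy_at(version, step=0.1):
    # Hypothetical toy policy: every update pushes the first logit up by `step`.
    return softmax([step * version, 0.0, 0.0])

def drift(num_updates, lag):
    # For each learner version, compare it with the (older) version that
    # generated the data it is training on.
    out = []
    for current in range(num_updates):
        behavior = max(0, current - lag)  # generator is `lag` updates behind
        out.append(kl(policy_at(behavior), policy_at(current)))
    return out

sync_drift = drift(5, lag=0)   # synchronous: data always comes from the current policy
async_drift = drift(5, lag=2)  # asynchronous: data comes from a policy 2 updates old

print("sync :", sync_drift)
print("async:", async_drift)
```

With `lag=0` the behavior and current policies are identical, so the divergence is zero at every step. With `lag=2` it becomes positive as soon as the learner has moved past the version that generated the data, which is the off-policy gap described above.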

Created June 7, 2026 · 4:32 PM · updated June 7, 2026 · 4:42 PM