Aryaman's research archive

Note

Off-Policy Drift

  • Classical synchronous RL training waits for data generation to finish before performing any gradient update, so GPUs sit idle while data is being generated.
  • Asynchronous RL runs data generation and policy updates in parallel, which uses the hardware more efficiently. The catch is that new data is no longer generated under the current policy: because gradient updates happen at the same time, the generator is always using a slightly older version of the policy. As a result, algorithms that are on-policy in theory become off-policy in practice, as a side effect of optimizing the RL infrastructure.
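The version lag described above can be made concrete with a toy simulation. This is a minimal sketch under stated assumptions, not a description of any particular training system: the fixed sync schedule (`sync_every`), the function names, and the two-action softmax policy are all hypothetical. The importance ratio is included as one common way to quantify how far the generating policy has drifted from the current one; the note itself does not prescribe it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def simulate_staleness(num_steps, sync_every):
    """Return the policy-version lag of the generator at each trainer step.

    Assumption: the trainer bumps the policy version once per step, and the
    generator only receives fresh weights every `sync_every` steps.
    """
    lags = []
    generator_version = 0
    for trainer_version in range(num_steps):
        if trainer_version % sync_every == 0:
            generator_version = trainer_version  # weights pushed to generator
        lags.append(trainer_version - generator_version)
    return lags


def importance_ratio(current_logits, behavior_logits, action):
    """pi_current(action) / pi_behavior(action) for a toy categorical policy."""
    return softmax(current_logits)[action] / softmax(behavior_logits)[action]


# sync_every=1 mimics synchronous training: the generator is never stale.
print(simulate_staleness(num_steps=6, sync_every=1))  # [0, 0, 0, 0, 0, 0]

# sync_every=3 mimics asynchronous training: data is up to 2 versions old.
print(simulate_staleness(num_steps=6, sync_every=3))  # [0, 1, 2, 0, 1, 2]

# A ratio of 1.0 means the sample is effectively on-policy; anything else
# means the generating policy and the current policy disagree on this action.
print(importance_ratio([1.0, 0.0], [1.0, 0.0], action=0))  # 1.0
print(importance_ratio([1.0, 0.0], [0.0, 0.0], action=0))
```

A lag of zero at every step corresponds to the synchronous case in the first bullet; any nonzero lag means the sample was drawn from an older policy than the one being updated.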

Created June 7, 2026 · 4:32 PM · updated June 7, 2026 · 4:42 PM