Aryaman's research archive

Note

GRPO vs DAPO

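Main changes DAPO makes relative to GRPO:

  1. Asymmetric clipping (Clip-Higher): Use separate lower and upper clip bounds, ε_low and ε_high, for the importance ratio, with ε_high > ε_low. The larger upper bound lets low-probability tokens gain probability mass, which maintains exploration and prevents entropy collapse.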
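  2. Dynamic sampling: Keep only prompts whose response group contains at least one correct and at least one incorrect sample, resampling until the batch is full. If every response in a group gets the same reward, all advantages are zero and the prompt contributes no gradient. Dr. GRPO, by contrast, only removes the division by the standard deviation from the advantage, which its authors motivate as removing a difficulty bias (prompts with low reward variance get upweighted) rather than as a fix for zero-advantage groups.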
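  3. Token-level loss: Replace sample-level averaging with token-level averaging over all tokens in the group; otherwise each token in a longer response is down-weighted by that response's length, so long responses contribute less gradient per token. (The Dr. GRPO paper also addresses this length bias.)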
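
The three points above can be sketched in plain Python. This is a toy illustration, not the DAPO reference implementation: all function names are hypothetical, the per-token terms are passed in as plain lists, and the default bounds (0.2 and 0.28) are assumed to match the values reported in the DAPO paper.

```python
def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: (reward - group mean) / (group std + eps)."""
    n = len(rewards)
    mu = sum(rewards) / n
    std = (sum((r - mu) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mu) / (std + eps) for r in rewards]


def keep_group(rewards):
    """Dynamic sampling filter: drop groups where every reward is identical."""
    return max(rewards) > min(rewards)


def clipped_term(ratio, adv, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate for one token with separate lower/upper bounds."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * adv, clipped * adv)


def sample_level_objective(per_token):
    """Average each response over its own tokens, then average over responses."""
    per_response = [sum(tokens) / len(tokens) for tokens in per_token]
    return sum(per_response) / len(per_response)


def token_level_objective(per_token):
    """Average over every token in the group, regardless of response."""
    total_tokens = sum(len(tokens) for tokens in per_token)
    return sum(sum(tokens) for tokens in per_token) / total_tokens


# A group where every response is correct: advantages vanish, so it is filtered.
all_correct = [1.0, 1.0, 1.0, 1.0]
print(group_advantages(all_correct), keep_group(all_correct))

# Upper bound: a ratio of 1.5 is capped near 1.28 instead of near 1.2.
print(clipped_term(1.5, 1.0), clipped_term(1.5, 1.0, eps_low=0.2, eps_high=0.2))

# One short good response (2 tokens) and one long bad response (8 tokens).
terms = [[1.0, 1.0], [-1.0] * 8]
print(sample_level_objective(terms), token_level_objective(terms))
```

In the last example the sample-level average treats the two responses as equal, so the eight tokens of the long response each carry a quarter of the weight of a short-response token, while the token-level average weights all ten tokens the same.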

Created June 7, 2026 · 4:23 PM · updated June 7, 2026 · 10:13 PM