GRPO vs DAPO
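- Asymmetric clipping (Clip-Higher): Decouple the clipping range into a lower and an upper epsilon, and set the upper one higher. This leaves more room to raise the probability of low-probability tokens, which maintains exploration and prevents entropy collapse.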
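- Dynamic sampling: Keep only prompts whose response group contains at least one correct and at least one incorrect sample, resampling until the batch is full. If every response in a group gets the same reward, all advantages are zero and the prompt contributes no gradient. Dr. GRPO instead changes the advantage itself: it drops the division by the std dev, which removes the extra weight given to prompts with low reward variance (very easy or very hard ones).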
- Token-level loss: Replace sample-level averaging with token-level averaging over all tokens in the batch; otherwise, each token in a longer response contributes less to the gradient. (The Dr. GRPO paper addresses the same length bias.)
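The three changes above can be sketched in a few lines of plain Python. This is a minimal illustration, not the DAPO reference implementation: the function names are hypothetical, rewards are assumed to be scalar per response, the per-token objectives are assumed to be precomputed, and the default epsilons (0.2 and 0.28) are illustrative values.

```python
import math

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: (r - mean) / (std + eps) within one response group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def keep_group(rewards):
    """Dynamic sampling filter: keep a prompt only if its rewards are not all equal.

    With binary rewards this means at least one correct and one incorrect sample.
    """
    return max(rewards) > min(rewards)

def clipped_objective(ratio, adv, eps_low=0.2, eps_high=0.28):
    """Per-token clipped surrogate with a decoupled (asymmetric) clip range."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * adv, clipped * adv)

def sample_level_loss(token_objs):
    """Average over tokens inside each response, then over responses."""
    per_sample = [sum(t) / len(t) for t in token_objs]
    return -sum(per_sample) / len(per_sample)

def token_level_loss(token_objs):
    """Average over all tokens in the batch, so every token has equal weight."""
    total = sum(sum(t) for t in token_objs)
    count = sum(len(t) for t in token_objs)
    return -total / count

# A group where every response is correct yields all-zero advantages,
# which is why the filter drops it.
print(keep_group([1, 1, 1, 1]), group_advantages([1, 1, 1, 1]))

# One short response (2 tokens) and one long response (8 tokens):
# the two averaging schemes weight the long response differently.
objs = [[1.0, 1.0], [0.0] * 8]
print(sample_level_loss(objs), token_level_loss(objs))
```

With a positive advantage, a ratio of 1.25 is left unclipped under the higher upper bound but would be cut to 1.2 under a symmetric epsilon of 0.2, which is the extra headroom the first bullet refers to.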
Created June 7, 2026 · 4:23 PM · updated June 7, 2026 · 10:13 PM