Note

GRPO vs DAPO

Asymmetric clipping: Clip the importance ratio with separate lower and upper epsilons, using a larger upper epsilon so that low-probability tokens have room to increase. This maintains exploration and prevents entropy collapse.
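Dynamic sampling: Keep only prompts whose response group contains at least one correct and at least one incorrect sample, and resample otherwise. If every response in a group gets the same reward, all advantages are zero and the group contributes no gradient. Dr. GRPO, by contrast, only removes the division by the std dev, which it motivates as removing a bias toward questions with low reward variance (this also avoids dividing by a near-zero value).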
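Token-level loss: Replace sample-level averaging with token-level averaging over all tokens in the batch; otherwise, each token in a longer response is down-weighted by that response's length and contributes less to the gradient. (The Dr. GRPO paper also removes the per-response length normalization.)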
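
A minimal sketch of the three changes above, in plain Python on scalar values rather than tensors. All function names are hypothetical and chosen for illustration. The default epsilons (0.2 lower, 0.28 upper) are illustrative values, and the rewards are assumed to be simple per-response scalars such as 0/1 correctness. This is not any library's implementation.

```python
import math

def clipped_token_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate for one token with an asymmetric clip range.

    ratio is pi_new / pi_old for the token. The epsilon defaults are
    illustrative assumptions, not values taken from this note.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def keep_group(rewards):
    """Dynamic-sampling filter: keep a group only if its rewards differ."""
    return len(set(rewards)) > 1

def sample_level_mean(per_token_losses):
    """Average within each response first, then across responses."""
    per_response = [sum(t) / len(t) for t in per_token_losses]
    return sum(per_response) / len(per_response)

def token_level_mean(per_token_losses):
    """Average over every token in the group at once."""
    total_tokens = sum(len(t) for t in per_token_losses)
    return sum(sum(t) for t in per_token_losses) / total_tokens

# Asymmetric clipping: a ratio of 1.25 is cut to 1.2 by a symmetric 0.2
# range but passes through when the upper epsilon is 0.28.
symmetric = clipped_token_objective(1.25, 1.0, eps_low=0.2, eps_high=0.2)
asymmetric = clipped_token_objective(1.25, 1.0)

# Dynamic sampling: an all-correct group has zero advantages everywhere,
# so the filter drops it; a mixed group is kept.
all_correct = [1.0, 1.0, 1.0, 1.0]
mixed = [1.0, 0.0, 0.0, 1.0]
zero_adv = group_advantages(all_correct)

# Token-level loss: one 2-token response (loss 1.0 per token) and one
# 8-token response (loss 2.0 per token).
losses = [[1.0] * 2, [2.0] * 8]
by_sample = sample_level_mean(losses)  # each response weighs the same
by_token = token_level_mean(losses)    # each token weighs the same
```

In the last example the sample-level mean is 1.5 while the token-level mean is 1.8: under sample-level averaging the eight tokens of the long response together count as much as the two tokens of the short one, so each long-response token is down-weighted.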

Created June 7, 2026 · 4:23 PM · updated June 7, 2026 · 10:13 PM