Aryaman's research archive

GRPO vs DAPO

Jun 7, 2026

Note

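1. Asymmetric clipping: Clip the policy ratio with separate lower and upper epsilons, setting the upper epsilon higher so that low-probability tokens can still gain probability mass. This maintains exploration and prevents entropy collapse.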
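
A minimal sketch of the asymmetric clip, assuming a PPO-style per-token surrogate. The function name `clipped_surrogate` and its parameters are illustrative, not from the note. The defaults `eps_low=0.2` and `eps_high=0.28` are the values I recall from the DAPO paper and should be treated as an assumption.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token clipped surrogate with separate lower and upper clip bounds.

    ratio     : pi_new(token) / pi_old(token)
    advantage : advantage estimate for that token
    eps_low   : lower clip width (assumed default)
    eps_high  : upper clip width (assumed default, larger than eps_low)
    """
    # Clip the ratio into [1 - eps_low, 1 + eps_high].
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic (minimum) of the unclipped and clipped terms, as in PPO.
    return min(ratio * advantage, clipped_ratio * advantage)


# With a positive advantage, a ratio of 1.25 is cut off by a symmetric
# clip (eps_high=0.2) but passes through the wider upper bound unchanged.
symmetric = clipped_surrogate(1.25, 1.0, eps_low=0.2, eps_high=0.2)
asymmetric = clipped_surrogate(1.25, 1.0)
```

Setting `eps_high` equal to `eps_low` recovers the usual symmetric clip, so the two settings are easy to compare side by side.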

Off-Policy Drift

Jun 7, 2026

Note

- Classical synchronous RL training waits for data generation to finish before doing any gradient update, so the training GPUs sit idle while rollouts are being generated.

Reasoning Models Don't Always Say What They Think

May 30, 2026

Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, … — 2025 · arXiv

CoT reasoning in SOTA models (Claude 3.7 Sonnet, DeepSeek R1) is often faithful <20% of the time, i.e. it rarely verbalizes the hints the model actually uses; RL doesn't fix it; reward hacking is rarely verbalized.

chain-of-thought safety reward-hacking runtime-monitoring alignment-faking faithfulness

DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models

May 29, 2026

Cheng Yin, Yankai Lin, Wang Xu, Sikyuen Tam, … — 2025 · arXiv

Identifies two necessary conditions for CoT to work in VLA, then builds DeepThinkVLA (hybrid-attention + SFT→RL) achieving SOTA on LIBERO, LIBERO-Plus, RoboTwin 2.0.

chain-of-thought vla out-of-distribution action-chunking robot-foundation-model hierarchical-policy

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

May 28, 2026

Moo Jin Kim, Chelsea Finn, Percy Liang — 2025 · arXiv

Systematic VLA fine-tuning study yields OpenVLA-OFT: parallel decoding + action chunking + continuous L1 regression achieves 97.1% on LIBERO and outperforms π0/RDT-1B on ALOHA.

vla dexterous-manipulation robot-foundation-model action-chunking film-conditioning

Deep Reinforcement Learning for Sim-to-Real Policy Transfer of VTOL-UAVs Offshore Docking Operations

May 28, 2026

Ali M. Ali, Aryaman Gupta, Hashim A. Hashim — 2024 · arXiv

Hierarchical DRL (model-based approach + PPO landing) for VTOL-UAV autonomous docking on wave-disturbed offshore platforms with sim-to-real transfer.

sim-to-real hierarchical-policy domain-randomization UAV fallback-controller safety

Detecting and Mitigating System-Level Anomalies of Vision-Based Controllers

May 28, 2026

Aryaman Gupta, Kaustav Chakraborty, Somil Bansal — 2023 · arXiv

HJ reachability stress-tests vision-based controllers offline to mine system-level failures; trains runtime anomaly classifier + fallback controller.

reachability-analysis anomaly-detection runtime-monitoring fallback-controller out-of-distribution safety

Unsupervised Discovery of Failure Taxonomies from Deployment Logs

May 28, 2026

Aryaman Gupta, Yusuf Umut Ciftci, Somil Bansal — 2025 · arXiv

Unsupervised framework: VLM-inferred failure explanations + LLM clustering → interpretable failure taxonomies from raw deployment logs.

failure-analysis unsupervised-clustering deployment-logs runtime-monitoring chain-of-thought data-collection

π0: A Vision-Language-Action Flow Model for General Robot Control

May 28, 2026

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, … — 2024 · arXiv

π0: 3.3B VLA built on PaliGemma + flow-matching action expert, pre-trained on 10k hrs cross-embodiment data, enabling dexterous long-horizon manipulation.

vla flow-matching robot-foundation-model dexterous-manipulation cross-embodiment hierarchical-policy