DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models
Cheng Yin, Yankai Lin, Wang Xu, Sikyuen Tam, Xiangrui Zeng, Zhiyuan Liu, Zhouping Yin — 2025 · arXiv
chain-of-thought
vla
out-of-distribution
action-chunking
robot-foundation-model
hierarchical-policy
TL;DR
Identifies two necessary conditions for CoT to work in VLA models, then builds DeepThinkVLA (hybrid attention + SFT→RL), which achieves SOTA on LIBERO, LIBERO-Plus, and RoboTwin 2.0.
Summary
Diagnoses why CoT fails in VLA models via two conditions: Decoding Alignment (modality-appropriate generation) and Causal Alignment (outcome-linked reasoning). DeepThinkVLA satisfies both with a hybrid causal+bidirectional attention decoder and an SFT-then-GRPO RL pipeline, achieving 97.0% on LIBERO and +21.7pp over π0-FAST on RoboTwin 2.0.
Key contributions
- — Empirical identification of two necessary conditions (Decoding Alignment, Causal Alignment) whose violation makes CoT harmful or decorative in VLA.
- — Hybrid-attention decoder (causal for CoT, bidirectional for parallel action decoding) + SFT-then-GRPO-RL pipeline achieving SOTA on three benchmarks.
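The hybrid-attention idea can be illustrated with a small mask builder. This is a minimal sketch, not the paper's implementation: the function name `hybrid_attention_mask`, the `[prefix | CoT | action]` token layout, and the choice to treat the prefix causally are all assumptions made for illustration.

```python
def hybrid_attention_mask(n_prefix: int, n_cot: int, n_action: int) -> list[list[bool]]:
    """Return mask[i][j] == True when query token i may attend to key token j.

    Assumed layout (hypothetical): [prefix tokens | CoT tokens | action tokens].
    """
    n = n_prefix + n_cot + n_action
    action_start = n_prefix + n_cot
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i >= action_start:
                # Action tokens attend to the full context and to every
                # token in the action chunk (bidirectional), so the chunk
                # can be decoded in parallel.
                mask[i][j] = True
            else:
                # Prefix and CoT tokens attend only to themselves and
                # earlier positions (causal). Treating the prefix causally
                # is an assumption of this sketch.
                mask[i][j] = j <= i
    return mask


mask = hybrid_attention_mask(n_prefix=2, n_cot=3, n_action=2)
```

With this layout, a CoT token never sees a later CoT token or any action token, while each action token sees the whole action chunk.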
Novelty
Unlike ECoT and CoT-VLA, which apply SFT-trained CoT to autoregressive decoders, DeepThinkVLA diagnoses both the architectural and the training failure modes and fixes them jointly.
Methods
- — Hybrid-attention decoder: causal attention for CoT tokens, bidirectional for parallel action-chunk decoding.
- — Two-stage CoT dataset construction: cloud VLM on keyframes (gripper-state detected), local fine-tuned VLM for intermediate frames.
- — GRPO-style RL with sparse task-success reward and group-normalized credit assignment over full reasoning-action trajectories.
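The group-normalized credit assignment in the last bullet can be sketched as follows. This is a hedged illustration of the general GRPO-style advantage computation, not the paper's code: the function name, the `eps` constant, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize sparse task-success rewards within one group of rollouts.

    Each rollout is a full reasoning-plus-action trajectory sampled for the
    same task; its single scalar advantage would then be applied to every
    token of that trajectory.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task: 1.0 = success, 0.0 = failure.
advantages = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts receive an advantage close to +1 and failed ones close to -1. When every rollout in a group has the same outcome, all advantages are zero and the group contributes no learning signal.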
Strengths
- — OOD ablation (Joint-Limit dynamics) rigorously distinguishes SFT-CoT (32pp drop ≈ reasoning-free 31.6pp) from RL-CoT (24.4pp), proving functional causal role.
- — Backbone generality confirmed: Qwen3-VL (no robotic pretraining) reaches 94.9% LIBERO / 77.0% LIBERO-Plus, showing conditions generalize beyond π0-FAST weights.
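The OOD comparison in the first bullet comes down to two gaps between the quoted drops. The snippet below only redoes that arithmetic with the figures given above; the dictionary keys and the helper name `gap_pp` are hypothetical.

```python
# OOD success-rate drops in percentage points, as quoted in the bullet above.
drops_pp = {"reasoning_free": 31.6, "sft_cot": 32.0, "rl_cot": 24.4}


def gap_pp(a: str, b: str) -> float:
    """Difference between two drops, rounded to one decimal place."""
    return round(drops_pp[a] - drops_pp[b], 1)


sft_vs_free = gap_pp("sft_cot", "reasoning_free")  # SFT-CoT vs. no reasoning
sft_vs_rl = gap_pp("sft_cot", "rl_cot")            # SFT-CoT vs. RL-CoT
```

The SFT-CoT drop differs from the reasoning-free drop by 0.4pp, while the RL-CoT drop is 7.6pp smaller than the SFT-CoT drop.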
Weaknesses
- — Real-robot evaluation limited to 20 trials on 3 tasks with no baseline comparison — insufficient to validate real-world claims.
- — RL stage evaluated only on LIBERO-Long (+2pp) and RoboTwin 2.0 (+6.8pp); gains modest relative to compute cost on 8×A800s.
Future work
- — Extend causal-alignment RL to more diverse, contact-rich real-robot tasks with dense reward signals.
- — Investigate whether Decoding/Causal Alignment conditions hold for diffusion-based action heads.
Key insights
- — SFT-learned CoT is 'fake reasoning': its OOD drop matches reasoning-free baseline until RL causally links CoT to task outcomes.
- — Forcing language CoT and high-dim actions through a single AR decoder is actively harmful (−4.2pp), not merely suboptimal.
Connections
- $π_0$: A Vision-Language-Action Flow Model for General Robot Control
  π0 is a competing VLA baseline used on the same benchmarks; hierarchical policy and action-chunking approaches are directly compared/referenced in DeepThinkVLA.
- Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
  Both implement parallel/bidirectional action decoding with action chunking in VLA models; OpenVLA-OFT's parallel decoding directly relates to DeepThinkVLA's hybrid-attention design and LIBERO benchmarking.
Related papers
- Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success d=0.72
- $π_0$: A Vision-Language-Action Flow Model for General Robot Control d=0.83
- Reasoning Models Don't Always Say What They Think d=0.88
- Unsupervised Discovery of Failure Taxonomies from Deployment Logs d=0.97
- Detecting and Mitigating System-Level Anomalies of Vision-Based Controllers d=1.05
Extracted by claude-sonnet-4-6.