Aryaman's research archive

DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models

Cheng Yin, Yankai Lin, Wang Xu, Sikyuen Tam, Xiangrui Zeng, Zhiyuan Liu, Zhouping Yin — 2025 · arXiv

Added May 29, 2026
chain-of-thought vla out-of-distribution action-chunking robot-foundation-model hierarchical-policy

TL;DR

Identifies two necessary conditions for chain-of-thought (CoT) reasoning to help vision-language-action (VLA) models, then builds DeepThinkVLA (a hybrid-attention decoder trained with SFT followed by RL), which achieves SOTA on LIBERO, LIBERO-Plus, and RoboTwin 2.0.

Summary

Diagnoses why CoT fails in VLA models in terms of two conditions: Decoding Alignment (each modality is generated with an appropriate decoding scheme) and Causal Alignment (the reasoning is linked to task outcomes). DeepThinkVLA satisfies both with a hybrid causal and bidirectional attention decoder and an SFT-then-GRPO RL pipeline, achieving 97.0% on LIBERO and +21.7pp over π0-FAST on RoboTwin 2.0.

Key contributions

  • Empirical identification of two necessary conditions (Decoding Alignment and Causal Alignment) whose violation makes CoT harmful or merely decorative in VLA models.
  • A hybrid-attention decoder (causal for CoT, bidirectional for parallel action decoding) combined with an SFT-then-GRPO RL pipeline, achieving SOTA on three benchmarks.
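
To make the hybrid-attention idea concrete, here is a minimal sketch of what such a mask could look like. The sequence layout (prompt and CoT tokens first, one action chunk last), the function name `hybrid_attention_mask`, and the choice to let action tokens see every earlier token are assumptions for illustration; the paper's actual masking details may differ.

```python
def hybrid_attention_mask(n_causal: int, n_action: int) -> list[list[bool]]:
    """Build mask[q][k], True where query position q may attend to key position k.

    Assumed (hypothetical) layout: the first n_causal positions hold prompt
    and CoT tokens, the last n_action positions hold a single action chunk.
    """
    if n_causal < 0 or n_action < 0:
        raise ValueError("segment lengths must be non-negative")

    total = n_causal + n_action
    mask = [[False] * total for _ in range(total)]

    for q in range(total):
        for k in range(total):
            if q < n_causal:
                # Causal segment: a token sees itself and earlier tokens only.
                mask[q][k] = k <= q
            else:
                # Action segment: sees the whole causal prefix and every
                # action token, so the chunk can be decoded in parallel.
                mask[q][k] = True
    return mask


if __name__ == "__main__":
    # Print the mask for 3 causal tokens followed by 2 action tokens.
    for row in hybrid_attention_mask(3, 2):
        print(" ".join("1" if allowed else "0" for allowed in row))
```

With 3 causal tokens and 2 action tokens, the first three rows are lower-triangular and the last two rows are all ones.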

Novelty

Unlike ECoT and CoT-VLA, which apply SFT-trained CoT to autoregressive (AR) decoders, DeepThinkVLA diagnoses both the architectural and the training failure modes and fixes them jointly.

Methods

  • Hybrid-attention decoder: causal attention for CoT tokens, bidirectional attention for parallel action-chunk decoding.
  • Two-stage CoT dataset construction: a cloud VLM annotates keyframes (detected via gripper state), and a locally fine-tuned VLM annotates the intermediate frames.
  • GRPO-style RL with a sparse task-success reward and group-normalized credit assignment over full reasoning-action trajectories.
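
The group-normalized credit assignment in the last bullet can be sketched as follows. This is the generic GRPO-style advantage (reward minus group mean, divided by group standard deviation), not the authors' code; the function name, the `eps` value, and the use of the population standard deviation are assumptions.

```python
from statistics import fmean, pstdev


def group_normalized_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Return one advantage per rollout in a group sampled for the same task.

    rewards: sparse task-success rewards, e.g. 1.0 for success, 0.0 for failure.
    eps: small constant (assumed value) that avoids division by zero when
         every rollout in the group gets the same reward.
    """
    if not rewards:
        raise ValueError("a group needs at least one rollout")

    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts of one task: two succeed, two fail.
    for reward, adv in zip([1.0, 0.0, 0.0, 1.0],
                           group_normalized_advantages([1.0, 0.0, 0.0, 1.0])):
        print(f"reward={reward:.0f} advantage={adv:+.3f}")
```

In this setup each trajectory's single advantage would be applied to all of its tokens, reasoning and action alike. A group in which every rollout succeeds or every rollout fails yields zero advantages and so contributes no gradient signal.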

Strengths

  • The OOD ablation (Joint-Limit dynamics) cleanly separates SFT-CoT (32pp drop, roughly equal to the reasoning-free drop of 31.6pp) from RL-CoT (24.4pp drop), indicating that RL gives the CoT a functional causal role.
  • Backbone generality is confirmed: Qwen3-VL (no robotic pretraining) reaches 94.9% on LIBERO and 77.0% on LIBERO-Plus, showing that the two conditions generalize beyond π0-FAST weights.
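
The arithmetic behind the OOD comparison is simple but worth spelling out. The snippet below only recomputes gaps from the three drop figures quoted in these notes; the dictionary keys and the helper name `gap_pp` are illustrative.

```python
# Success-rate drops under the Joint-Limit shift, in percentage points,
# as quoted in the notes above.
drops_pp = {"reasoning_free": 31.6, "sft_cot": 32.0, "rl_cot": 24.4}


def gap_pp(a: str, b: str) -> float:
    """Difference in OOD drop between variants a and b, rounded to 0.1pp."""
    return round(drops_pp[a] - drops_pp[b], 1)


# SFT-CoT degrades almost exactly as much as the reasoning-free model.
sft_vs_reasoning_free = gap_pp("sft_cot", "reasoning_free")

# RL-CoT degrades noticeably less than SFT-CoT.
sft_vs_rl = gap_pp("sft_cot", "rl_cot")

if __name__ == "__main__":
    print(f"SFT-CoT vs reasoning-free: {sft_vs_reasoning_free}pp")
    print(f"SFT-CoT vs RL-CoT: {sft_vs_rl}pp")
```

The 0.4pp gap between SFT-CoT and the reasoning-free model is small next to the 7.6pp gap between SFT-CoT and RL-CoT, which is the basis for reading SFT-only reasoning as decorative.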

Weaknesses

  • Real-robot evaluation is limited to 20 trials on 3 tasks with no baseline comparison, which is insufficient to validate the real-world claims.
  • The RL stage is evaluated only on LIBERO-Long (+2pp) and RoboTwin 2.0 (+6.8pp); the gains are modest relative to the compute cost of training on 8×A800 GPUs.

Future work

  • Extend causal-alignment RL to more diverse, contact-rich real-robot tasks with dense reward signals.
  • Investigate whether the Decoding Alignment and Causal Alignment conditions hold for diffusion-based action heads.

Key insights

  • SFT-learned CoT is 'fake reasoning': its OOD drop matches that of the reasoning-free baseline until RL causally links the CoT to task outcomes.
  • Forcing language CoT and high-dimensional actions through a single AR decoder is actively harmful (−4.2pp), not merely suboptimal.

My thoughts

Initial Reaction

Connections

Implementation Notes

Open Questions

Connections

Related papers

Extracted by claude-sonnet-4-6.