DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models

Cheng Yin, Yankai Lin, Wang Xu, Sikyuen Tam, Xiangrui Zeng, Zhiyuan Liu, Zhouping Yin — 2025 · arXiv

Added May 29, 2026

chain-of-thought vla out-of-distribution action-chunking robot-foundation-model hierarchical-policy

TL;DR

Identifies two conditions that chain-of-thought (CoT) reasoning must satisfy to help vision-language-action (VLA) models, then builds DeepThinkVLA (hybrid attention plus an SFT→RL pipeline), which reaches SOTA on LIBERO, LIBERO-Plus, and RoboTwin 2.0.

Summary

Diagnoses why CoT fails in VLA models through two conditions: Decoding Alignment (modality-appropriate generation) and Causal Alignment (outcome-linked reasoning). DeepThinkVLA satisfies both with a hybrid causal + bidirectional attention decoder and an SFT-then-GRPO RL pipeline, reaching 97.0% on LIBERO and +21.7pp over π0-FAST on RoboTwin 2.0.

Key contributions

— Empirical identification of two necessary conditions (Decoding Alignment, Causal Alignment) whose violation makes CoT harmful or decorative in VLA.
— Hybrid-attention decoder (causal for CoT, bidirectional for parallel action decoding) + SFT-then-GRPO-RL pipeline achieving SOTA on three benchmarks.
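
The hybrid-attention idea in the second contribution can be illustrated with a small mask builder. This is only a sketch under assumptions, not the paper's implementation: the token layout `[prefix | CoT | action chunk]`, the function name `hybrid_attention_mask`, and the choice to treat the prefix causally are all hypothetical.

```python
def hybrid_attention_mask(num_prefix, num_cot, num_action):
    """Build mask[i][j] == True when query token i may attend to key token j.

    Assumed (hypothetical) layout: [prefix | CoT | action chunk].
    Prefix and CoT tokens use causal attention; action tokens attend to
    the whole context and to every other action token in the chunk.
    """
    total = num_prefix + num_cot + num_action
    action_start = num_prefix + num_cot
    mask = []
    for i in range(total):
        if i < action_start:
            # Causal part: a token sees itself and earlier tokens only,
            # so CoT tokens never see the action chunk.
            row = [j <= i for j in range(total)]
        else:
            # Bidirectional part: every action token sees all context
            # and the full action chunk, enabling parallel decoding.
            row = [True] * total
        mask.append(row)
    return mask


def render(mask):
    """Render the mask as rows of 1/0 characters for quick inspection."""
    return ["".join("1" if allowed else "0" for allowed in row) for row in mask]


if __name__ == "__main__":
    # One prefix token, two CoT tokens, two action tokens.
    for line in render(hybrid_attention_mask(1, 2, 2)):
        print(line)
    # 10000
    # 11000
    # 11100
    # 11111
    # 11111
```

In a real model the boolean mask would be passed to the attention layers; how the prefix (image and instruction tokens) is masked depends on the backbone and is not specified in these notes.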

Novelty

Unlike ECoT and CoT-VLA, which apply SFT-trained CoT to autoregressive (AR) decoders, DeepThinkVLA diagnoses both the architectural and the training failure modes and fixes them jointly.

Methods

— Hybrid-attention decoder: causal attention for CoT tokens, bidirectional for parallel action-chunk decoding.
— Two-stage CoT dataset construction: cloud VLM on keyframes (gripper-state detected), local fine-tuned VLM for intermediate frames.
— GRPO-style RL with sparse task-success reward and group-normalized credit assignment over full reasoning-action trajectories.
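
The group-normalized credit assignment in the last item can be sketched as follows. This is a minimal illustration of the generic GRPO-style advantage, assuming a binary task-success reward per rollout and population standard deviation; the function name and the `eps` value are hypothetical, and the paper's exact normalization may differ.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """Turn per-rollout rewards from one group into normalized advantages.

    All rollouts in the group are assumed to come from the same task and
    initial state. Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, not from the paper
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Sparse reward: 1.0 if the rollout solved the task, else 0.0.
    rewards = [1.0, 0.0, 0.0, 1.0]
    advantages = group_normalized_advantages(rewards)
    # Successful rollouts get a positive advantage, failed ones a negative one.
    for reward, advantage in zip(rewards, advantages):
        print(reward, round(advantage, 3))
```

Because the reward is attached to the whole trajectory, the same advantage would weight every reasoning token and every action token of that rollout. A group in which all rollouts succeed or all fail yields zero advantages and therefore no learning signal.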

Strengths

— OOD ablation (Joint-Limit dynamics) separates SFT-CoT, whose 32pp drop roughly matches the reasoning-free baseline's 31.6pp, from RL-CoT, which drops only 24.4pp; this is evidence that RL gives the CoT a functional causal role.
— Backbone generality confirmed: Qwen3-VL (no robotic pretraining) reaches 94.9% LIBERO / 77.0% LIBERO-Plus, showing conditions generalize beyond π0-FAST weights.

Weaknesses

— Real-robot evaluation is limited to 20 trials on 3 tasks with no baseline comparison, which is insufficient to validate real-world claims.
— RL stage evaluated only on LIBERO-Long (+2pp) and RoboTwin 2.0 (+6.8pp); gains modest relative to compute cost on 8×A800s.
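
To make the small-sample concern in the first weakness concrete, the sketch below computes a Wilson score interval for a success rate. The success counts are hypothetical, chosen only for illustration; the notes do not report the real-robot numbers.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials))
    ) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical outcome: 18 successes out of 20 trials.
    low, high = wilson_interval(18, 20)
    print(f"20 trials:  [{low:.3f}, {high:.3f}]")
    # The same observed rate with ten times the trials, for comparison.
    low, high = wilson_interval(180, 200)
    print(f"200 trials: [{low:.3f}, {high:.3f}]")
```

With 20 trials the interval spans more than 20 percentage points even at a 90% observed rate, so differences of a few points between methods could not be resolved at that sample size.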

Future work

— Extend causal-alignment RL to more diverse, contact-rich real-robot tasks with dense reward signals.
— Investigate whether Decoding/Causal Alignment conditions hold for diffusion-based action heads.

Key insights

— SFT-learned CoT is 'fake reasoning': its OOD drop matches reasoning-free baseline until RL causally links CoT to task outcomes.
— Forcing language CoT and high-dim actions through a single AR decoder is actively harmful (−4.2pp), not merely suboptimal.

My thoughts

Connections

$π_0$: A Vision-Language-Action Flow Model for General Robot Control
π0 is a competing VLA baseline used on the same benchmarks; hierarchical policy and action-chunking approaches are directly compared/referenced in DeepThinkVLA.
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Both implement parallel/bidirectional action decoding with action chunking in VLA models; OpenVLA-OFT's parallel decoding directly relates to DeepThinkVLA's hybrid-attention design and LIBERO benchmarking.

Related papers

Extracted by claude-sonnet-4-6.