Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Moo Jin Kim, Chelsea Finn, Percy Liang — 2025 · arXiv
vla
dexterous-manipulation
robot-foundation-model
action-chunking
film-conditioning
TL;DR
Systematic VLA fine-tuning study yields OpenVLA-OFT: parallel decoding + action chunking + continuous L1 regression achieves 97.1% on LIBERO and outperforms π0/RDT-1B on ALOHA.
Summary
Studies key fine-tuning design choices for OpenVLA (action decoding, representation, objective) and proposes OFT recipe combining parallel decoding, action chunking, continuous actions, and L1 regression. Achieves SOTA 97.1% on LIBERO (up from 76.5%), 26× throughput gain, and beats π0/RDT-1B on real bimanual ALOHA tasks by up to 15%.
Key contributions
- — OFT recipe: parallel decoding + action chunking + continuous L1 regression boosts OpenVLA from 76.5%→97.1% LIBERO with 26× speedup.
- — FiLM language modulation applied to vision transformer blocks enables reliable language grounding in multi-camera bimanual ALOHA setup.
Novelty
Unlike prior work (FAST, TinyVLA, π0) that modifies tokenization or uses diffusion, shows simple L1 regression with parallel decoding matches diffusion quality at far lower compute, even beating π0 fine-tuned with its default recipe.
Methods
- — Parallel decoding with bidirectional attention replacing causal mask, generating full action chunks in one forward pass.
- — 4-layer MLP action head with L1 regression over continuous normalized actions replacing discrete token prediction.
- — Per-block FiLM conditioning (spatially-agnostic hidden-dimension modulation) in both SigLIP and DINOv2 vision transformers.
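The FiLM step listed above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the function name `film_modulate`, the projection matrices `w_gamma` and `w_beta`, the `(1 + gamma)` scaling form, and all tensor sizes are assumptions chosen for the example. It shows what "spatially-agnostic" means here: one scale and one shift per hidden dimension, shared by every image patch.

```python
import numpy as np

def film_modulate(patch_tokens, lang_embedding, w_gamma, w_beta):
    """Scale and shift every patch token with language-derived vectors.

    patch_tokens:   (num_patches, hidden) visual features inside one block
    lang_embedding: (lang_dim,) pooled language embedding (assumed input)
    w_gamma/w_beta: (lang_dim, hidden) hypothetical learned projections
    """
    gamma = lang_embedding @ w_gamma  # (hidden,) per-dimension scale
    beta = lang_embedding @ w_beta    # (hidden,) per-dimension shift
    # Broadcasting applies the same gamma/beta to all patches.
    return (1.0 + gamma) * patch_tokens + beta

rng = np.random.default_rng(0)
num_patches, hidden, lang_dim = 4, 8, 6  # toy sizes for illustration
patch_tokens = rng.normal(size=(num_patches, hidden))
lang_embedding = rng.normal(size=(lang_dim,))
w_gamma = rng.normal(scale=0.1, size=(lang_dim, hidden))
w_beta = rng.normal(scale=0.1, size=(lang_dim, hidden))

modulated = film_modulate(patch_tokens, lang_embedding, w_gamma, w_beta)
print(modulated.shape)  # (4, 8)
```

With zero-valued projections the function returns its input unchanged, so under this assumed form the modulation starts from an identity mapping.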
Strengths
- — Controlled ablations isolate contributions of each design choice (decoding, representation, objective) with quantitative throughput + success rate metrics.
- — Real-world ALOHA evaluation against π0, RDT-1B, ACT, Diffusion Policy with fine-grained rubric scoring on dexterous bimanual tasks.
Weaknesses
- — All ALOHA tasks use very small demo sets (20–300); unclear if findings hold at scale or generalize across robot morphologies beyond bimanual/single-arm.
- — L1 regression limitation with truly multimodal action distributions acknowledged but not empirically characterized; no benchmark with explicit multimodality tested.
Future work
- — Extend OFT to pretraining regime; assess if L1 vs. diffusion gap emerges at scale.
- — Characterize failure modes of L1 regression on multimodal demonstration datasets.
Key insights
- — Fine-tuning recipe design can matter more than pretraining data coverage—OpenVLA (single-arm pretrain only) outperforms π0/RDT-1B pretrained on bimanual data.
- — Parallel decoding with action chunking improves not just speed but task success (+14% absolute), suggesting temporal dependency modeling benefits beyond efficiency.
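To make the speed argument concrete, the sketch below counts sequential decoding steps per action chunk for the two decoding schemes. The helper `sequential_steps` is hypothetical, and the chunk size and action dimension are illustrative values rather than figures from these notes. Step counts are not wall-clock throughput, so this does not reproduce the reported 26× number; it only shows why the gap grows with chunk length.

```python
def sequential_steps(chunk_size: int, action_dim: int, parallel: bool) -> int:
    """Count sequential decoder steps needed to emit one action chunk."""
    if parallel:
        # All chunk_size * action_dim outputs come from a single forward pass.
        return 1
    # Autoregressive decoding emits one token per action dimension per timestep.
    return chunk_size * action_dim

chunk_size, action_dim = 8, 7  # illustrative values (assumed)
autoregressive_steps = sequential_steps(chunk_size, action_dim, parallel=False)
parallel_steps = sequential_steps(chunk_size, action_dim, parallel=True)
print(autoregressive_steps, parallel_steps)  # 56 1
```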
My thoughts
- Feature-wise linear modulation (FiLM) seems like a nice way to regulate how strongly the model attends to each feature. The authors use it to make the model attend more to language; it could be extended to other modalities, such as reasoning, as well.
- Action chunking and parallel decoding speed up inference and improve overall performance.
- Action chunking captures temporal dependencies.
- Continuous actions achieve better performance.
- L1 regression and diffusion perform comparably to each other; both predict continuous actions, which do better than discrete tokens trained with cross-entropy (CE) loss.
Final Takeaway
- use parallel decoding and action chunking over causal autoregressive decoding
- use continuous action representations over discrete
- use L1 regression for predicting continuous actions over diffusion
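A continuous action head trained with L1 loss, as recommended above, might look like the following forward-pass sketch. It is an illustration under assumptions: the helper names, the layer widths, the ReLU activations, and the choice of one hidden vector per chunk timestep are all made up for the example. Only the "4-layer MLP" and "L1 regression over normalized actions" parts come from the notes.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Create (weight, bias) pairs for consecutive layer sizes."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_action_head(hidden_states, params):
    """Map decoder hidden states to continuous actions."""
    x = hidden_states
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on all but the output layer
    return x

def l1_loss(pred, target):
    """Mean absolute error between predicted and demonstrated actions."""
    return float(np.mean(np.abs(pred - target)))

rng = np.random.default_rng(0)
chunk_size, hidden_dim, action_dim = 8, 32, 7  # toy sizes (assumed)
params = init_mlp(rng, [hidden_dim, 64, 64, 64, action_dim])  # 4 linear layers

hidden_states = rng.normal(size=(chunk_size, hidden_dim))
target_actions = rng.uniform(-1.0, 1.0, size=(chunk_size, action_dim))  # normalized

pred_actions = mlp_action_head(hidden_states, params)
loss = l1_loss(pred_actions, target_actions)
print(pred_actions.shape)  # (8, 7)
```

Unlike a diffusion head, this produces the whole chunk in one deterministic pass with no iterative denoising, which is where the compute saving comes from.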
Interesting Points
- parallel decoding requires empty input embeddings differing only in their positional encodings
- L1 regression works well for continuous actions, comparable to diffusion
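The first point above can be checked with a toy single-layer attention in NumPy. This is a sketch, not the model's architecture: the weights, sizes, and random positional encodings are all assumptions. It shows that all-zero placeholder inputs under a bidirectional mask produce identical outputs at every action position, and that adding distinct positional encodings is what lets the positions differ.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv, mask):
    """Single-head attention; mask[i, j] is True where i may attend to j."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
dim, num_prefix, num_action = 16, 5, 4  # toy sizes (assumed)
wq, wk, wv = (rng.normal(scale=0.3, size=(dim, dim)) for _ in range(3))

prefix = rng.normal(size=(num_prefix, dim))   # stands in for image/text tokens
placeholders = np.zeros((num_action, dim))    # empty action inputs
tokens = np.concatenate([prefix, placeholders], axis=0)
pos = rng.normal(scale=0.5, size=tokens.shape)  # hypothetical positional encodings

seq_len = num_prefix + num_action
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)  # replaces the causal mask
causal_mask = np.tril(bidirectional_mask)                     # shown for contrast only

# One forward pass yields outputs for every action position at once.
out_no_pos = self_attention(tokens, wq, wk, wv, bidirectional_mask)[num_prefix:]
out_with_pos = self_attention(tokens + pos, wq, wk, wv, bidirectional_mask)[num_prefix:]

print(np.allclose(out_no_pos[0], out_no_pos[1]))      # True: positions are indistinguishable
print(np.allclose(out_with_pos[0], out_with_pos[1]))  # False: positions now differ
```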
Connections
- $π_0$: A Vision-Language-Action Flow Model for General Robot Control
  Source paper directly benchmarks against π0 on ALOHA dexterous manipulation using same VLA fine-tuning framework.
- DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models
  Both implement parallel/bidirectional action decoding with action chunking in VLA models; OpenVLA-OFT's parallel decoding directly relates to DeepThinkVLA's hybrid-attention design and LIBERO benchmarking.
Related papers
- $π_0$: A Vision-Language-Action Flow Model for General Robot Control d=0.71
- DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models d=0.72
- Unsupervised Discovery of Failure Taxonomies from Deployment Logs d=0.96
- Detecting and Mitigating System-Level Anomalies of Vision-Based Controllers d=1.02
- Deep Reinforcement Learning for Sim-to-Real Policy Transfer of VTOL-UAVs Offshore Docking Operations d=1.03
Extracted by claude-sonnet-4-6.