Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Moo Jin Kim, Chelsea Finn, Percy Liang — 2025 · arXiv

Added May 28, 2026

vla dexterous-manipulation robot-foundation-model action-chunking film-conditioning

TL;DR

Systematic VLA fine-tuning study yields OpenVLA-OFT: parallel decoding + action chunking + continuous L1 regression achieves 97.1% on LIBERO and outperforms π0/RDT-1B on ALOHA.

Summary

Studies key fine-tuning design choices for OpenVLA (action decoding, representation, objective) and proposes the OFT recipe combining parallel decoding, action chunking, continuous actions, and L1 regression. Achieves SOTA 97.1% on LIBERO (up from 76.5%), 26× throughput gain, and beats π0/RDT-1B on real bimanual ALOHA tasks by up to 15%.

Key contributions

— OFT recipe: parallel decoding + action chunking + continuous L1 regression boosts OpenVLA from 76.5%→97.1% LIBERO with 26× speedup.
— FiLM language modulation applied to vision transformer blocks enables reliable language grounding in multi-camera bimanual ALOHA setup.

Novelty

Unlike prior work (FAST, TinyVLA, π0) that modifies tokenization or uses diffusion, shows that simple L1 regression with parallel decoding matches diffusion quality at far lower compute, even beating π0 fine-tuned with its default recipe.

Methods

— Parallel decoding with bidirectional attention replacing causal mask, generating full action chunks in one forward pass.
— 4-layer MLP action head with L1 regression over continuous normalized actions replacing discrete token prediction.
— Per-block FiLM conditioning (spatially-agnostic hidden-dimension modulation) in both SigLIP and DINOv2 vision transformers.
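The FiLM conditioning in the last bullet can be illustrated with a minimal, dependency-free sketch. It assumes the common `(1 + gamma) * x + beta` form of feature-wise modulation and treats `gamma` and `beta` as already projected from the language embedding; the function name `film` and the toy values are hypothetical, not taken from the paper's code.

```python
def film(patches, gamma, beta):
    """Apply feature-wise linear modulation to a list of patch embeddings.

    patches: list of patch vectors, each of length hidden_dim
    gamma, beta: per-hidden-dimension scale and shift (length hidden_dim),
                 assumed here to be projected from the language embedding
    """
    # The same (gamma, beta) pair is reused for every patch, which is what
    # makes the modulation spatially agnostic.
    return [
        [(1.0 + g) * x + b for x, g, b in zip(patch, gamma, beta)]
        for patch in patches
    ]


# Toy example: 3 patches, hidden_dim = 2 (values are illustrative only).
patches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
gamma = [0.5, -1.0]  # hypothetical output of a language -> gamma projection
beta = [0.0, 1.0]    # hypothetical output of a language -> beta projection
modulated = film(patches, gamma, beta)
```

In this toy case the second hidden dimension is scaled to zero and shifted to 1.0 for every patch, showing how a language-derived signal can amplify or suppress a feature channel uniformly across the image.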

Strengths

— Controlled ablations isolate contributions of each design choice (decoding, representation, objective) with quantitative throughput + success rate metrics.
— Real-world ALOHA evaluation against π0, RDT-1B, ACT, Diffusion Policy with fine-grained rubric scoring on dexterous bimanual tasks.

Weaknesses

— All ALOHA tasks use very small demo sets (20–300); unclear if findings hold at scale or generalize across robot morphologies beyond bimanual/single-arm.
— L1 regression limitation with truly multimodal action distributions acknowledged but not empirically characterized; no benchmark with explicit multimodality tested.

Future work

— Extend OFT to pretraining regime; assess if L1 vs. diffusion gap emerges at scale.
— Characterize failure modes of L1 regression on multimodal demonstration datasets.

Key insights

— Fine-tuning recipe design can matter more than pretraining data coverage—OpenVLA (single-arm pretrain only) outperforms π0/RDT-1B pretrained on bimanual data.
— Parallel decoding with action chunking improves not just speed but task success (+14% absolute), suggesting temporal dependency modeling benefits beyond efficiency.

My thoughts

Feature-wise linear modulation (FiLM) seems like a nice way to regulate how much the model attends to each feature. The authors use it to make the model attend more to language; we could extend it to other modalities, such as reasoning, as well.
Action chunking and parallel decoding speed up inference and improve overall performance.
Action chunking captures temporal dependencies.
Continuous actions achieve better performance.
L1 regression and diffusion perform similarly to each other; both improve on discrete token prediction trained with CE loss.
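To make the L1 objective concrete, here is a small sketch of a mean-absolute-error loss over one chunk of continuous actions. The function name `l1_chunk_loss`, the averaging over all chunk entries, and the toy numbers are assumptions for illustration; the paper's exact reduction may differ.

```python
def l1_chunk_loss(predicted, target):
    """Mean absolute error over a chunk of continuous actions.

    predicted, target: lists of K actions, each a list of D floats
    (assumed to be normalized already).
    """
    total, count = 0.0, 0
    for pred_action, true_action in zip(predicted, target):
        for p, t in zip(pred_action, true_action):
            total += abs(p - t)
            count += 1
    return total / count


# Toy chunk with K = 2 actions of dimension D = 2 (illustrative values).
predicted = [[0.0, 0.5], [1.0, -1.0]]
target = [[0.5, 0.5], [1.0, 0.0]]
loss = l1_chunk_loss(predicted, target)
```

Unlike a cross-entropy loss over discretized bins, this objective penalizes each prediction in proportion to its distance from the demonstrated action.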

Final Takeaway

Use parallel decoding and action chunking over causal autoregressive decoding.
Use continuous action representations over discrete ones.
Use L1 regression over diffusion for predicting continuous actions.
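One way to see why action chunking helps throughput is to count how often the policy has to be queried during an episode. The sketch below assumes the whole chunk is executed open-loop before replanning; `run_episode`, `toy_policy`, and `toy_env_step` are hypothetical stand-ins, not part of any real robot stack.

```python
def run_episode(policy, env_step, first_obs, horizon, chunk_size):
    """Execute actions chunk by chunk, re-querying the policy only when a chunk runs out."""
    obs, queries, executed = first_obs, 0, []
    while len(executed) < horizon:
        chunk = policy(obs, chunk_size)  # one model call yields chunk_size actions
        queries += 1
        for action in chunk:
            if len(executed) == horizon:
                break
            obs = env_step(obs, action)
            executed.append(action)
    return executed, queries


# Stand-in policy and environment, purely hypothetical.
def toy_policy(obs, chunk_size):
    return [obs + i for i in range(chunk_size)]


def toy_env_step(obs, action):
    return obs + 1


_, queries_chunked = run_episode(toy_policy, toy_env_step, 0, horizon=20, chunk_size=8)
_, queries_single = run_episode(toy_policy, toy_env_step, 0, horizon=20, chunk_size=1)
```

With a 20-step horizon, a chunk size of 8 needs 3 policy queries while single-step prediction needs 20. The query count is only a rough proxy for wall-clock speed, since each query may cost a different amount of compute.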

Interesting Points

Parallel decoding requires empty input embeddings that differ only in their positional encodings.
L1 regression works well for continuous actions, with performance comparable to diffusion.
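The first point can be sketched as follows: every action slot starts from the same empty embedding, only the positional encoding tells slots apart, and the causal mask is swapped for a bidirectional one. This is a simplified illustration using an additive positional encoding; the actual backbone may handle positions differently, and all function names here are hypothetical.

```python
def causal_mask(n):
    # Position i may attend only to positions j <= i.
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    # Every position may attend to every other position.
    return [[True] * n for _ in range(n)]


def placeholder_inputs(num_action_tokens, dim, position_encoding):
    # Each placeholder starts from the same empty (all-zero) embedding;
    # only the added positional encoding distinguishes one slot from another.
    empty = [0.0] * dim
    return [
        [e + p for e, p in zip(empty, position_encoding(pos, dim))]
        for pos in range(num_action_tokens)
    ]


def toy_position_encoding(pos, dim):
    # Hypothetical stand-in for whatever positional scheme the backbone uses.
    return [float(pos)] * dim


slots = placeholder_inputs(num_action_tokens=4, dim=2, position_encoding=toy_position_encoding)
```

Because no slot depends on a previously decoded token, all slots can be filled in a single forward pass, which is what removes the sequential loop of autoregressive decoding.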

Implementation Notes

Open Questions

Connections

$π_0$: A Vision-Language-Action Flow Model for General Robot Control
Source paper directly benchmarks against π0 on ALOHA dexterous manipulation using same VLA fine-tuning framework.
DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models
Both implement parallel/bidirectional action decoding with action chunking in VLA models; OpenVLA-OFT's parallel decoding directly relates to DeepThinkVLA's hybrid-attention design and LIBERO benchmarking.

Related papers

Extracted by claude-sonnet-4-6.