Unsupervised Discovery of Failure Taxonomies from Deployment Logs
Aryaman Gupta, Yusuf Umut Ciftci, Somil Bansal — 2025 · arXiv
failure-analysis
unsupervised-clustering
deployment-logs
runtime-monitoring
chain-of-thought
data-collection
TL;DR
Unsupervised framework: VLM-inferred failure explanations + LLM clustering → interpretable failure taxonomies from raw deployment logs.
Summary
Introduces unsupervised failure taxonomy discovery from multimodal robot deployment logs. VLM chain-of-thought extracts structured failure explanations; LLM ensemble-and-refine clustering organizes them into actionable taxonomies. Achieves 95.8% SAS vs. expert labels; failure-guided data collection cuts indoor navigation failure rate from 46% to 18%.
Key contributions
- Full pipeline: CLIP-based semantic downsampling → VLM chain-of-thought (CoT) failure reasoning → LLM ensemble-and-refine taxonomy clustering, all unsupervised.
- Taxonomy-guided runtime monitor outperforms supervised classifiers on out-of-distribution (OOD) crash detection; failure-guided data collection is 2× more efficient than uniform collection.
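The ensemble-and-refine step in the pipeline above can be pictured as "propose several taxonomies from rephrased prompts, then reconcile them into one". The following is a minimal sketch under stated assumptions, not the authors' implementation: `ensemble_and_refine`, `toy_propose`, and `toy_reconcile` are hypothetical names, and the two toy functions are simple stand-ins for the LLM calls so the example runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical representation: category name -> indices of the explanations in it.
Taxonomy = Dict[str, List[int]]


def ensemble_and_refine(
    explanations: List[str],
    prompt_variants: List[str],
    propose: Callable[[str, List[str]], Taxonomy],
    reconcile: Callable[[List[Taxonomy], List[str]], Taxonomy],
) -> Taxonomy:
    """Collect one candidate taxonomy per prompt variant, then merge them."""
    if not prompt_variants:
        raise ValueError("need at least one prompt variant")
    # Ensemble: each rephrased prompt yields its own candidate taxonomy.
    candidates = [propose(p, explanations) for p in prompt_variants]
    # Refine: a reconciliation step merges the candidates into one taxonomy.
    final = reconcile(candidates, explanations)
    # Sanity check (an assumption of this sketch): every explanation is
    # assigned to exactly one category.
    assigned = sorted(i for members in final.values() for i in members)
    if assigned != list(range(len(explanations))):
        raise ValueError("taxonomy must cover each explanation exactly once")
    return final


def toy_propose(prompt: str, explanations: List[str]) -> Taxonomy:
    """Stand-in for an LLM call: keyword grouping; the prompt only changes labels."""
    label = "perception" if "sensor" in prompt else "seeing"
    taxonomy: Taxonomy = {}
    for i, text in enumerate(explanations):
        key = f"{label} error" if "glass" in text else "control error"
        taxonomy.setdefault(key, []).append(i)
    return taxonomy


def toy_reconcile(candidates: List[Taxonomy], explanations: List[str]) -> Taxonomy:
    """Stand-in for the reconciliation model: merge categories that have
    identical membership, keeping the first label seen."""
    merged: Taxonomy = {}
    seen = set()
    for taxonomy in candidates:
        for name, members in taxonomy.items():
            key = tuple(members)
            if key not in seen:
                seen.add(key)
                merged[name] = list(members)
    return merged


explanations = [
    "robot missed glass door",
    "wheel slipped on ramp",
    "glass wall not detected",
]
prompts = ["Group by sensor cause", "Group by what went wrong"]
result = ensemble_and_refine(explanations, prompts, toy_propose, toy_reconcile)
print(result)  # {'perception error': [0, 2], 'control error': [1]}
```

Passing the proposal and reconciliation steps in as callables keeps the orchestration independent of any particular model API; in a real setup they would wrap LLM requests rather than keyword rules.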
Novelty
Unlike prior episode-level failure explanation (REFLECT, AHA, RoboFAC), discovers corpus-level failure structure across deployment logs without predefined labels or human annotation.
Methods
- CLIP-embedding bidirectional change-point downsampling centered on failure event
- Gemini 2.5 Pro chain-of-thought failure reasoning from downsampled multimodal trajectories
- LLM ensemble-and-refine taxonomy aggregation (prompt rephrasing + reconciliation via o4-mini)
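One plausible reading of the first method above is: start at the failure frame and sweep outward in both directions, keeping a frame only when its embedding has drifted far enough from the last kept frame. The sketch below illustrates that idea with plain NumPy arrays standing in for CLIP embeddings. The function names, the cosine-distance criterion, the threshold, and the per-side cap are all assumptions for illustration; the paper's exact change-point rule is not given in this summary.

```python
import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity between two 1-D vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 1.0 - float(np.dot(a, b) / denom)


def _sweep(embeddings, start, stop, step, threshold, max_keep):
    """Walk from `start` toward `stop` (exclusive), keeping a frame whenever
    it differs from the last kept frame by more than `threshold`."""
    kept = []
    anchor = start
    for i in range(start + step, stop, step):
        if len(kept) >= max_keep:
            break
        if cosine_distance(embeddings[anchor], embeddings[i]) > threshold:
            kept.append(i)
            anchor = i
    return kept


def downsample_around_failure(embeddings, failure_idx, threshold=0.1, max_each_side=8):
    """Return sorted frame indices: the failure frame plus change points
    found by sweeping backward and forward from it."""
    n = len(embeddings)
    before = _sweep(embeddings, failure_idx, -1, -1, threshold, max_each_side)
    after = _sweep(embeddings, failure_idx, n, +1, threshold, max_each_side)
    return sorted(before + [failure_idx] + after)


# Synthetic stand-in for per-frame embeddings: three visually distinct
# segments of four frames each, with the failure at frame 6.
e1, e2, e3 = np.eye(3)
frames = np.stack([e1] * 4 + [e2] * 4 + [e3] * 4)
keyframes = downsample_around_failure(frames, failure_idx=6)
print(keyframes)  # [3, 6, 8]
```

Centering the sweep on the failure keeps the frames nearest the event even when the per-side cap is reached, which is the motivation for sweeping outward rather than scanning from the start of the log.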
Strengths
- Quantitative validation against expert RoboFail taxonomy (SAS=0.958) plus ablations over VLMs, downsampling strategies, and single-run vs. aggregation.
- Cross-domain evaluation (manipulation, autonomous driving, indoor nav) with concrete downstream metrics (F1 monitoring, failure rate reduction).
Weaknesses
- The largest dataset has roughly 1,500 videos; there is no evaluation at scale (10k+ trajectories), where LLM cost and coherence may degrade.
- Runtime monitor F1 gains are modest on in-distribution (In-D) driving (71.4 vs. 65.3 for VideoMAE); the lead-time advantage is not statistically validated.
Future work
- Scale to temporally extended logs; integrate causal/simulation-based validation of discovered failure modes.
- Incorporate formal safety analysis (STPA/FRAM) to ground taxonomy clusters beyond semantic plausibility.
Key insights
- Clustering in semantic reasoning space (LLM explanations) beats clustering in perceptual embedding space for failure taxonomy quality.
- Failure taxonomy context enables OOD generalization in runtime monitors where supervised classifiers collapse.
Connections
- Detecting and Mitigating System-Level Anomalies of Vision-Based Controllers: Both address runtime failure/anomaly detection from deployment logs; the taxonomy-guided monitor in the source paper directly parallels the runtime anomaly classifier for system-level failures in the connected paper.
Related papers
- Detecting and Mitigating System-Level Anomalies of Vision-Based Controllers d=0.86
- $π_0$: A Vision-Language-Action Flow Model for General Robot Control d=0.96
- Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success d=0.96
- DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models d=0.97
- Reasoning Models Don't Always Say What They Think d=0.99
Extracted by claude-sonnet-4-6.