Unsupervised Discovery of Failure Taxonomies from Deployment Logs

Aryaman Gupta, Yusuf Umut Ciftci, Somil Bansal — 2025 · arXiv

Added May 28, 2026

failure-analysis unsupervised-clustering deployment-logs runtime-monitoring chain-of-thought data-collection

TL;DR

Unsupervised framework: VLM-inferred failure explanations + LLM clustering → interpretable failure taxonomies from raw deployment logs.

Summary

Introduces unsupervised failure taxonomy discovery from multimodal robot deployment logs. VLM chain-of-thought extracts structured failure explanations; LLM ensemble-and-refine clustering organizes them into actionable taxonomies. Achieves 95.8% SAS vs. expert labels; failure-guided data collection cuts indoor navigation failure rate from 46% to 18%.
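As a quick check on the reported navigation numbers, the drop from 46% to 18% can be expressed both as an absolute and as a relative reduction. The snippet below only restates the two figures quoted in the summary; the variable names are illustrative.

```python
# Failure rates quoted in the summary (indoor navigation).
before = 0.46
after = 0.18

# Absolute reduction in percentage points and relative reduction.
absolute_drop = before - after
relative_drop = absolute_drop / before

print(f"{absolute_drop * 100:.0f} points absolute, {relative_drop:.1%} relative")
# prints "28 points absolute, 60.9% relative"
```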

Key contributions

— Full pipeline: CLIP-based semantic downsampling → VLM CoT failure reasoning → LLM ensemble-refine taxonomy clustering, all unsupervised.
— Taxonomy-guided runtime monitor outperforms supervised classifiers on OOD crash detection; failure-guided data collection 2× more efficient than uniform.
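The last stage of the pipeline, ensemble-and-refine clustering, can be pictured as "ask several rephrased prompts for a taxonomy, then reconcile the candidates". The sketch below shows only that control flow. Everything in it is an assumption for illustration: `propose` and `reconcile` are hypothetical callables standing in for LLM calls, the majority-vote reconciler is a simple stand-in and not the paper's LLM-based reconciliation, and the failure labels are made-up toy data.

```python
from collections import Counter
from typing import Callable, List, Sequence

Taxonomy = List[str]


def ensemble_and_refine(
    explanations: Sequence[str],
    prompts: Sequence[str],
    propose: Callable[[str, Sequence[str]], Taxonomy],
    reconcile: Callable[[List[Taxonomy]], Taxonomy],
) -> Taxonomy:
    """Collect one candidate taxonomy per prompt variant, then merge them."""
    candidates = [propose(prompt, explanations) for prompt in prompts]
    return reconcile(candidates)


def majority_reconcile(candidates: List[Taxonomy], min_votes: int = 2) -> Taxonomy:
    """Stand-in reconciler: keep labels proposed by at least `min_votes` runs."""
    votes = Counter(label for cand in candidates for label in set(cand))
    return sorted(label for label, n in votes.items() if n >= min_votes)


# Toy stand-in for an LLM: a fixed, invented answer per prompt variant.
FAKE_RESPONSES = {
    "v1": ["occlusion", "grasp slip", "localization drift"],
    "v2": ["occlusion", "grasp slip"],
    "v3": ["occlusion", "localization drift", "lighting change"],
}


def fake_propose(prompt: str, explanations: Sequence[str]) -> Taxonomy:
    return FAKE_RESPONSES[prompt]


taxonomy = ensemble_and_refine(
    explanations=["(failure explanations would go here)"],
    prompts=["v1", "v2", "v3"],
    propose=fake_propose,
    reconcile=majority_reconcile,
)
print(taxonomy)  # ['grasp slip', 'localization drift', 'occlusion']
```

Passing the proposer and reconciler in as callables keeps the ensemble loop independent of any particular model API, so a real LLM client could be substituted without changing the flow.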

Novelty

Unlike prior episode-level failure explanation (REFLECT, AHA, RoboFAC), discovers corpus-level failure structure across deployment logs without predefined labels or human annotation.

Methods

— CLIP-embedding bidirectional change-point downsampling centered on failure event
— Gemini 2.5 Pro chain-of-thought failure reasoning from downsampled multimodal trajectories
— LLM ensemble-and-refine taxonomy aggregation (prompt rephrasing + reconciliation via o4-mini)
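The first method above can be sketched in a few lines. This is one plausible reading of "bidirectional change-point downsampling centered on the failure event", not the authors' implementation: starting from the failure frame, walk backward and then forward, and keep a frame whenever its embedding has drifted far enough (in cosine distance) from the last frame kept on that side. The function name, the threshold value, and the toy 2-D "embeddings" are all assumptions; real inputs would be precomputed, non-zero CLIP image embeddings.

```python
import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def bidirectional_downsample(embeddings: np.ndarray, failure_idx: int,
                             threshold: float = 0.1) -> list:
    """Return sorted indices of frames to keep, always including failure_idx."""
    kept = [failure_idx]

    # Backward pass: from the failure frame toward the start of the log.
    last = failure_idx
    for i in range(failure_idx - 1, -1, -1):
        if cosine_distance(embeddings[i], embeddings[last]) > threshold:
            kept.append(i)
            last = i

    # Forward pass: from the failure frame toward the end of the log.
    last = failure_idx
    for i in range(failure_idx + 1, len(embeddings)):
        if cosine_distance(embeddings[i], embeddings[last]) > threshold:
            kept.append(i)
            last = i

    return sorted(kept)


# Toy trajectory: three visually distinct "scenes" of three frames each.
toy_embeddings = np.array(
    [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3 + [[-1.0, 0.0]] * 3
)
print(bidirectional_downsample(toy_embeddings, failure_idx=4))  # [2, 4, 6]
```

In the toy run, the failure frame is kept along with the nearest frame from each neighboring scene, so the frames passed on to the VLM cover each change in scene content without the redundant frames in between.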

Strengths

— Quantitative validation against expert RoboFail taxonomy (SAS=0.958) plus ablations over VLMs, downsampling strategies, and single-run vs. aggregation.
— Cross-domain evaluation (manipulation, autonomous driving, indoor nav) with concrete downstream metrics (F1 monitoring, failure rate reduction).

Weaknesses

— Largest dataset ~1500 videos; no evaluation at scale (10k+ trajectories) where LLM cost and coherence may degrade.
— Runtime monitor F1 gains are modest on in-distribution (In-D) driving (71.4 vs. 65.3 for VideoMAE); the lead-time advantage is not statistically validated.

Future work

— Scale to temporally extended logs; integrate causal/simulation-based validation of discovered failure modes.
— Incorporate formal safety analysis (STPA/FRAM) to ground taxonomy clusters beyond semantic plausibility.

Key insights

— Clustering in semantic reasoning space (LLM explanations) beats clustering in perceptual embedding space for failure taxonomy quality.
— Failure taxonomy context enables OOD generalization in runtime monitors where supervised classifiers collapse.

Connections

Detecting and Mitigating System-Level Anomalies of Vision-Based Controllers
Both address runtime failure/anomaly detection from deployment logs; this paper's taxonomy-guided monitor directly parallels that paper's runtime anomaly classifier for system-level failures.

Extracted by claude-sonnet-4-6.