Aryaman's research archive

Unsupervised Discovery of Failure Taxonomies from Deployment Logs

Aryaman Gupta, Yusuf Umut Ciftci, Somil Bansal — 2025 · arXiv

Added May 28, 2026
failure-analysis unsupervised-clustering deployment-logs runtime-monitoring chain-of-thought data-collection

TL;DR

Unsupervised framework: VLM-inferred failure explanations + LLM clustering → interpretable failure taxonomies from raw deployment logs.

Summary

Introduces unsupervised failure taxonomy discovery from multimodal robot deployment logs. VLM chain-of-thought extracts structured failure explanations; LLM ensemble-and-refine clustering organizes them into actionable taxonomies. Achieves 95.8% SAS vs. expert labels; failure-guided data collection cuts indoor navigation failure rate from 46% to 18%.
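As a quick sanity check on the headline number, the drop from 46% to 18% can be expressed both as an absolute and as a relative reduction. The helper name `failure_rate_reduction` is made up for this note; only the two rates come from the summary above.

```python
def failure_rate_reduction(before: float, after: float) -> tuple[float, float]:
    """Return (absolute reduction in percentage points, relative reduction as a fraction)."""
    absolute_pp = (before - after) * 100
    relative = (before - after) / before
    return absolute_pp, relative


# Indoor navigation failure rates quoted in the summary: 46% before, 18% after.
absolute_pp, relative = failure_rate_reduction(0.46, 0.18)
print(f"{absolute_pp:.0f} percentage points, {relative:.1%} relative reduction")
# prints "28 percentage points, 60.9% relative reduction"
```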

Key contributions

  • Full pipeline: CLIP-based semantic downsampling → VLM CoT failure reasoning → LLM ensemble-refine taxonomy clustering, all unsupervised.
  • Taxonomy-guided runtime monitor outperforms supervised classifiers on OOD crash detection; failure-guided data collection is 2× more efficient than uniform collection.
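The last stage of the pipeline (ensemble-and-refine clustering) might look roughly like the sketch below. This is my reading of the idea, not the authors' code. The function name, the prompt wording, and the `llm` callable are all placeholders, and the model call is injected so that no particular API is assumed.

```python
from typing import Callable, Sequence


def ensemble_and_refine(
    explanations: Sequence[str],
    prompt_variants: Sequence[str],
    llm: Callable[[str], str],
) -> str:
    """Cluster explanations once per prompt rephrasing, then reconcile the drafts."""
    joined = "\n".join(f"- {e}" for e in explanations)
    # Ensemble step: one draft taxonomy per rephrased clustering prompt.
    drafts = [llm(f"{prompt}\n\nFailure explanations:\n{joined}") for prompt in prompt_variants]
    # Refine step: a final call merges the drafts into a single taxonomy.
    numbered = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
    return llm(f"Reconcile these draft taxonomies into one consistent taxonomy.\n\n{numbered}")


# Stand-in model (hypothetical) that just records the prompts it receives.
calls = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"taxonomy-{len(calls)}"


taxonomy = ensemble_and_refine(
    ["robot clipped a table leg", "gripper dropped the cup"],
    ["Group these failures into categories.", "Cluster the failures below by root cause."],
    fake_llm,
)
print(len(calls), taxonomy)  # two draft calls plus one reconciliation call
```

Passing the model as a callable keeps the sketch testable offline; in practice the drafting and reconciliation calls could go to different models.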

Novelty

Unlike prior episode-level failure explanation (REFLECT, AHA, RoboFAC), discovers corpus-level failure structure across deployment logs without predefined labels or human annotation.

Methods

  • CLIP-embedding bidirectional change-point downsampling centered on failure event
  • Gemini 2.5 Pro chain-of-thought failure reasoning from downsampled multimodal trajectories
  • LLM ensemble-and-refine taxonomy aggregation (prompt rephrasing + reconciliation via o4-mini)
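To make the first step concrete, here is a minimal sketch of bidirectional downsampling around a failure frame. It assumes per-frame embeddings are already computed (e.g. by CLIP) and uses a simple greedy rule: keep a frame when its cosine distance from the last kept frame exceeds a threshold. That rule is a stand-in for whatever change-point criterion the paper actually uses, and the function names and the `threshold` value are assumptions of mine.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 0.0


def _scan(embeddings, indices, anchor, threshold):
    """Walk `indices` in order, keeping a frame whenever it drifts from the last kept one."""
    kept, last = [], anchor
    for i in indices:
        if cosine_distance(embeddings[last], embeddings[i]) > threshold:
            kept.append(i)
            last = i
    return kept


def downsample_around_failure(embeddings, failure_idx, threshold=0.1):
    """Return sorted frame indices kept by scanning backward and forward from the failure."""
    backward = _scan(embeddings, range(failure_idx - 1, -1, -1), failure_idx, threshold)
    forward = _scan(embeddings, range(failure_idx + 1, len(embeddings)), failure_idx, threshold)
    return sorted(backward + [failure_idx] + forward)


# Toy 2-D "embeddings" (hypothetical); the failure is at frame 3.
frames = [(1.0, 0.0), (1.0, 0.01), (0.0, 1.0), (0.0, 1.0), (0.01, 1.0), (1.0, 0.0)]
kept = downsample_around_failure(frames, failure_idx=3)
print(kept)  # [1, 3, 5]
```

Scanning outward from the failure frame, rather than from the start of the log, means the frames nearest the failure are compared first and near-duplicates of it are dropped.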

Strengths

  • Quantitative validation against expert RoboFail taxonomy (SAS=0.958) plus ablations over VLMs, downsampling strategies, and single-run vs. aggregation.
  • Cross-domain evaluation (manipulation, autonomous driving, indoor nav) with concrete downstream metrics (F1 monitoring, failure rate reduction).

Weaknesses

  • Largest dataset ~1500 videos; no evaluation at scale (10k+ trajectories) where LLM cost and coherence may degrade.
  • Runtime monitor F1 gains are modest on in-distribution driving (71.4 vs. 65.3 for VideoMAE); lead-time advantage not statistically validated.

Future work

  • Scale to temporally extended logs; integrate causal/simulation-based validation of discovered failure modes.
  • Incorporate formal safety analysis (STPA/FRAM) to ground taxonomy clusters beyond semantic plausibility.

Key insights

  • Clustering in semantic reasoning space (LLM explanations) beats clustering in perceptual embedding space for failure taxonomy quality.
  • Failure taxonomy context enables OOD generalization in runtime monitors where supervised classifiers collapse.

Extracted by claude-sonnet-4-6.