Unsupervised Discovery of Failure Taxonomies from Deployment Logs

In Submission

Abstract

As robotic systems become increasingly integrated into real-world environments, ranging from autonomous vehicles to household assistants, they inevitably encounter diverse and unstructured scenarios that lead to failures. While such failures pose safety and reliability challenges, they also provide rich perceptual data for improving system robustness. However, manually analyzing large-scale failure datasets is impractical and does not scale. In this work, we introduce the problem of unsupervised discovery of failure taxonomies from large volumes of raw failure logs, aiming to obtain semantically coherent and actionable failure modes directly from perceptual trajectories. Our approach first infers structured failure explanations from multimodal inputs using vision-language reasoning, and then performs clustering in the resulting semantic reasoning space, enabling the discovery of recurring failure modes rather than isolated episode-level descriptions. We evaluate our method across robotic manipulation, indoor navigation, and autonomous driving domains, and demonstrate that the discovered taxonomies are consistent, interpretable, and practically useful. In particular, we show that structured failure taxonomies guide targeted data collection for offline policy refinement and enhance runtime failure monitoring systems.

Discovering Failure Taxonomies from Robot Datasets

We propose a method to automatically cluster robotic failure data into semantically meaningful failure modes using Multimodal Large Language Models (MLLMs). This provides a structured and autonomous way to understand the failure patterns of a robotic system and to use that knowledge to enhance its safety and performance, as the generated taxonomy provides rich information about system failures. We propose a two-step approach to achieve this:

  1. Use an MLLM to infer a textual description and the failure reason of each trajectory in the failure data from a sequence of visual inputs.
  2. Use a reasoning LLM to group the data based on the inferred failure reasons and generate clusters representing the underlying failure modes.
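The two steps above can be sketched as a short pipeline. The model calls here (`query_mllm`, `query_reasoning_llm`) are hypothetical stubs standing in for real MLLM/LLM API calls, and the keyword-based grouping is only a placeholder for the reasoning LLM's semantic clustering:

```python
def query_mllm(trajectory):
    # Step 1 (stub): infer a textual failure explanation from visual inputs.
    # A real implementation would send the trajectory's frames to an MLLM.
    return trajectory["stub_reason"]

def query_reasoning_llm(reasons):
    # Step 2 (stub): group episode-level reasons into failure modes.
    # A naive first-word key stands in for the LLM's semantic grouping.
    clusters = {}
    for idx, reason in enumerate(reasons):
        key = reason.split()[0]  # placeholder "semantic" cluster key
        clusters.setdefault(key, []).append(idx)
    return clusters

def discover_failure_taxonomy(trajectories):
    reasons = [query_mllm(t) for t in trajectories]  # step 1: per-episode reasons
    return query_reasoning_llm(reasons)              # step 2: cluster into modes

logs = [
    {"stub_reason": "collision with dynamic obstacle"},
    {"stub_reason": "collision with parked car"},
    {"stub_reason": "grasp slipped on smooth surface"},
]
taxonomy = discover_failure_taxonomy(logs)
# e.g. two modes: one grouping the collisions, one for the failed grasp
```

The key design point is that clustering happens over inferred failure *reasons*, not raw pixels, so recurring modes emerge across visually dissimilar episodes.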

  • Robot Manipulation:

  • Real-World Car Crash:

  • Vision-Based Indoor Navigation:

Runtime Failure Monitoring Leveraging Failure Clusters
MLLMs can be used to perform runtime failure monitoring, but they fail when not provided any contextual information about the system's behavior. We propose using the generated failure clusters as that information: they list the most probable failures of the particular system and help the MLLM detect them much more reliably at runtime while looking at the past observations. Our prompt is structured in a Chain-of-Thought (CoT) manner, where the LLM first reasons about the robot's possible future trajectory and then considers whether its predicted path and the surrounding environment could lead to any of the situations listed in the clusters. We compare our method with SOTA LLM-based anomaly detection (LLM-AD), CNN-based failure classification methods (VideoMAE-BC, ENet-BC), and also ablate by removing the cluster information from the prompt (NoContext).
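A minimal sketch of how the discovered clusters can be injected into such a CoT monitoring prompt is shown below. The cluster names and prompt wording are illustrative assumptions, not the paper's exact prompt:

```python
# Illustrative failure clusters as they might come out of the taxonomy step.
FAILURE_CLUSTERS = [
    "1. Collision with dynamic obstacles entering the planned path",
    "2. Loss of traction on low-friction surfaces",
]

def build_monitoring_prompt(past_observations):
    # Inject the cluster list as system-specific context, then ask for
    # two explicit CoT steps: trajectory prediction, then cluster matching.
    cluster_text = "\n".join(FAILURE_CLUSTERS)
    return (
        "You are monitoring a robot at runtime.\n"
        f"Known failure modes of this system:\n{cluster_text}\n\n"
        f"Recent observations: {past_observations}\n"
        "Step 1: Reason about the robot's likely future trajectory.\n"
        "Step 2: Decide whether that trajectory and the surrounding "
        "environment match any failure mode listed above.\n"
        "Answer with the matching failure mode, or 'no failure'."
    )

prompt = build_monitoring_prompt(["pedestrian ahead", "speed constant"])
```

The NoContext ablation corresponds to building the same prompt without the `FAILURE_CLUSTERS` section.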

Table: Runtime monitoring performance comparison.

The table above shows that we obtain higher F1 scores and earlier detection times than the baseline methods. We also outperform NoContext, which reinforces the importance of providing context about system behavior in the prompt and the utility of our failure clusters for that purpose.
(Refer to the paper for more details on the baseline implementations and prompts.)

We also show an example of runtime monitoring in action with an expert fallback policy, which overrides the underlying controller when a failure is detected.

Fig. Runtime Failure Monitoring in action while using a fallback controller.

Targeted Failure Data Collection and Policy Refinement

We use the discovered clusters to guide expert data collection in targeted regions of the environment. The robot policy is fine-tuned on an augmented dataset containing an additional 40K samples collected in identified failure zones along with the original training data. The failure rate in sampled trajectories drops from 46% to 18%, demonstrating enhanced safety in previously failure-prone situations, whereas fine-tuning with randomly collected additional data only improves the failure rate to 34%. This forms a closed-loop pipeline of failure discovery, targeted intervention, and policy refinement for continuously enhancing system safety.
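One simple way to make the "targeted" part concrete is to split the expert data-collection budget across discovered clusters in proportion to how often each mode occurs. This is a hedged sketch of one plausible allocation scheme; the cluster names and counts are invented for the example, and the paper may use a different strategy:

```python
def allocate_budget(cluster_counts, total_budget):
    # Proportional allocation: clusters that account for more observed
    # failures receive more of the expert data-collection budget.
    total = sum(cluster_counts.values())
    return {
        mode: round(total_budget * count / total)
        for mode, count in cluster_counts.items()
    }

# Hypothetical failure counts per discovered cluster, with the 40K
# additional-sample budget mentioned above.
budget = allocate_budget(
    {"collision": 300, "grasp_slip": 100}, total_budget=40_000
)
# -> {"collision": 30000, "grasp_slip": 10000}
```

Feeding the newly collected data back into fine-tuning, then re-running failure discovery on the refined policy, closes the loop described above.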