From Perception Logs to Failure Modes: Language-Driven Semantic Clustering of Failures for Robot Safety

In Submission

Abstract

As robotic systems become increasingly integrated into real-world environments—ranging from autonomous vehicles to household assistants—they inevitably encounter diverse and unstructured scenarios that lead to failures. While such failures pose safety and reliability challenges, they also provide rich perceptual data for improving future performance. Manually analyzing large-scale failure datasets, however, is impractical. In this work, we present a method for automatically organizing large-scale robotic failure data into semantically meaningful clusters, enabling scalable learning from failure without human supervision. Our approach leverages the reasoning capabilities of Multimodal Large Language Models (MLLMs), trained on internet-scale data, to infer high-level failure causes from raw perceptual trajectories and to discover interpretable structure within uncurated failure logs. The resulting semantic clusters reveal recurring patterns and hypothesized causes of failure. We show that the discovered failure modes can guide targeted data collection for policy refinement, accelerating iterative improvement of agent policies and overall safety. We further show that these clusters can benefit online failure monitoring systems, offering a lightweight yet powerful safeguard for real-time operation. Together, these results demonstrate that our framework enhances robot learning and robustness by turning real-world failures into actionable, interpretable signals for adaptation.

Clustering Failure Modes from Robot Datasets

We propose a method to automatically cluster robotic failure data into semantically meaningful groups using a Multimodal Large Language Model (MLLM). This provides a structured, autonomous way to understand the failure modes of a system and to use that knowledge to enhance its safety and performance, since the clusters carry rich information about how and why the system fails. We take a two-step approach:

  1. Use an MLLM to generate a textual description of each failure trajectory from its sequence of visual inputs.
  2. Use a reasoning LLM to group the trajectories by their generated descriptions into clusters of underlying failure modes.
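
A minimal sketch of this pipeline is shown below. The wrapper objects `mllm` and `llm`, and the prompt wording, are illustrative assumptions; swap in your actual model clients and prompts.

```python
# Sketch of the two-step clustering pipeline.
# `mllm.generate` / `llm.generate` are assumed client interfaces, not a real API.
from typing import List

def describe_trajectory(frames: List[bytes], mllm) -> str:
    """Step 1: ask the MLLM to describe one failure trajectory."""
    prompt = (
        "These frames show a robot trajectory that ended in failure. "
        "Describe what happened and the most likely cause of the failure."
    )
    return mllm.generate(prompt=prompt, images=frames)

def cluster_descriptions(descriptions: List[str], llm) -> str:
    """Step 2: ask a reasoning LLM to group descriptions into failure modes."""
    numbered = "\n".join(f"{i}: {d}" for i, d in enumerate(descriptions))
    prompt = (
        "Below are failure descriptions from a robot system. Group them into a "
        "small set of semantically coherent failure modes. For each cluster, "
        "give a name, a one-line cause hypothesis, and its member indices.\n\n"
        + numbered
    )
    return llm.generate(prompt=prompt)

# Usage, assuming `failure_logs` is a list of frame sequences:
# descriptions = [describe_trajectory(f, gemini_client) for f in failure_logs]
# clusters = cluster_descriptions(descriptions, o4_mini_client)
```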

We use Gemini 2.5 Pro to generate the descriptions and OpenAI o4-mini to cluster them. Here are the generated failure clusters for three different datasets:

  • Robot Manipulation
  • Real-World Car Crash
  • Vision-Based Indoor Navigation

Evaluation of Generated Failure Reasons and Clusters

We evaluate several open- and closed-source models for failure reasoning and find that Gemini 2.5 Pro performs best. We also compare the generated clusters against an expert-defined failure taxonomy using similarity scores; our method produces failure modes that align well with human understanding of failures, describing the core issue behind each failure and covering both prominent and long-tail cases.

Table. Failure reasoning performance of different MLLMs.

Fig. A failure inference example in which the robot dropped a pot of water on the floor.

Fig. Heatmap comparing similarity scores between the expert-defined failure taxonomy and the generated clusters.
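
One plausible way to compute such a heatmap (the paper's exact metric may differ) is to embed the expert taxonomy entries and the generated cluster names and take their pairwise cosine similarity. The sketch below uses sentence-transformers; the embedding model and the example labels are our assumptions.

```python
# Sketch: pairwise similarity between an expert taxonomy and generated clusters.
# Embedding model and labels are illustrative, not taken from the paper.
from sentence_transformers import SentenceTransformer, util

expert_taxonomy = [
    "collision with obstacle",
    "object slip during grasp",
    "localization drift",
]
generated_clusters = [
    "robot bumps into furniture",
    "robot drops object mid-transfer",
    "robot loses track of its position",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
expert_emb = model.encode(expert_taxonomy, convert_to_tensor=True)
cluster_emb = model.encode(generated_clusters, convert_to_tensor=True)

# Rows: expert categories; columns: generated clusters.
print(util.cos_sim(expert_emb, cluster_emb))
```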

Runtime Failure Monitoring Leveraging Failure Clusters

MLLMs can perform runtime failure monitoring, but they struggle when given no contextual information about the system's behavior. We propose to supply the generated failure clusters as that context: they list the most probable failures of the particular system and help the MLLM detect them far more reliably at runtime from past observations. Our prompt is structured in a Chain-of-Thought (CoT) manner: the MLLM first reasons about the robot's likely future trajectory and then considers whether the predicted path and the surrounding environment could lead to any of the situations listed in the clusters. We compare our method with a SOTA LLM-based anomaly detector (LLM-AD), CNN-based failure classification methods (VideoMAE-BC, ENet-BC), and an ablation that omits the cluster information from the prompt (NoContext).
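
The prompt can be structured roughly as follows; this is an illustrative template rather than the exact prompt from the paper (see the paper for the full prompts).

```python
# Illustrative CoT monitoring prompt; wording and horizon are assumptions.
def build_monitor_prompt(cluster_list: str, horizon_s: int = 3) -> str:
    return (
        "You are a runtime safety monitor for a robot, given its recent "
        "camera observations.\n\n"
        "Known failure modes of this system:\n"
        f"{cluster_list}\n\n"
        "Think step by step:\n"
        f"1. Predict the robot's likely trajectory over the next {horizon_s} seconds.\n"
        "2. Check whether the predicted path and surrounding environment could "
        "lead to any of the failure modes listed above.\n"
        "3. Answer FAILURE or SAFE, citing the matching failure mode if any."
    )
```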

Table. Runtime monitoring performance comparison.

The table above shows that our method achieves higher F1 scores and earlier detection times than the baseline methods, and outperforms NoContext. This reinforces the importance of providing context about system behavior in the prompt, and the utility of our failure clusters for supplying it.
(Refer to the paper for more details on the baseline implementations and prompts.)

We also show an example of runtime monitoring in action with an expert fallback policy, which overrides the underlying controller when a failure is detected.

Fig. Runtime failure monitoring in action with a fallback controller.
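
A minimal sketch of this monitor-plus-fallback loop, with all object interfaces assumed rather than taken from the paper, could look like:

```python
# Sketch of runtime monitoring with a fallback controller.
# `monitor`, `nominal_policy`, and `fallback_policy` are hypothetical stand-ins.
def control_step(observations, monitor, nominal_policy, fallback_policy):
    verdict = monitor.check(observations)  # MLLM queried with the CoT prompt
    if verdict.failure_detected:
        # Override the underlying controller when a failure is predicted.
        return fallback_policy.act(observations[-1])
    return nominal_policy.act(observations[-1])
```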

Targeted Failure Data Collection and Policy Refinement

We use the discovered clusters to guide expert data collection in targeted regions of the environment. The robot policy is fine-tuned on an augmented dataset containing an additional 40K samples collected in the identified failure zones alongside the original training data. The failure rate over sampled trajectories drops from 46% to 18%, demonstrating improved safety in previously failure-prone situations, whereas fine-tuning with randomly collected additional data only reduces the failure rate to 34%. This forms a closed-loop pipeline of failure discovery, targeted intervention, and policy refinement for continuously enhancing system safety.
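
Schematically, the closed loop might look like the following sketch; every helper function here is a hypothetical placeholder for the corresponding stage described above.

```python
# Schematic of the closed-loop refinement pipeline; all helpers are placeholders.
def refinement_loop(policy, environment, rounds: int = 3):
    for _ in range(rounds):
        failures = rollout_and_log_failures(policy, environment)  # gather failure logs
        clusters = cluster_failures(failures)            # two-step MLLM pipeline
        zones = identify_failure_zones(clusters)         # where failures concentrate
        expert_data = collect_expert_demos(zones)        # targeted data collection
        policy = finetune(policy, expert_data)           # policy refinement
    return policy
```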