Enhancing Robot Safety via MLLM-Based Semantic Interpretation of Failure Data

In Submission


Abstract

As robotic systems become increasingly integrated into real-world environments—ranging from autonomous vehicles to household assistants—they inevitably encounter diverse and unstructured scenarios that lead to failures. While such failures pose safety and reliability challenges, they also provide rich perceptual data for improving future performance. However, manually analyzing large-scale failure datasets is impractical. In this work, we present a method for automatically organizing large-scale robotic failure data into semantically meaningful clusters, enabling scalable learning from failure without human supervision. Our approach leverages the reasoning capabilities of Multimodal Large Language Models (MLLMs), trained on internet-scale data, to infer high-level failure causes from raw perceptual trajectories and discover interpretable structure within uncurated failure logs. These semantic clusters reveal latent patterns and hypothesized causes of failure, enabling scalable learning from experience. We demonstrate that the discovered failure modes can guide targeted data collection for policy refinement, accelerating iterative improvement in agent policies and overall safety. Additionally, we show that these semantic clusters can be employed for online failure detection, offering a lightweight yet powerful safeguard for real-time adaptation. We demonstrate that this framework enhances robot learning and robustness by transforming real-world failures into actionable and interpretable signals for adaptation.

Clustering Failure Modes from Robot Datasets

We propose a method to automatically organize robotic failure data into semantically meaningful clusters using a Multimodal Large Language Model (MLLM). This provides a structured, autonomous way to understand a system's failure modes and to use that knowledge to enhance its safety and performance, since the clusters carry rich information about how and why the system fails. We take a two-step approach (sketched in code below the list):

  1. Use an MLLM to generate a textual description of each trajectory in the failure data from its sequence of visual inputs.
  2. Use a reasoning LLM to group the trajectories based on the generated descriptions and output clusters of underlying failure modes.
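
Below is a minimal sketch of this two-step pipeline. It assumes a generic `generate(prompt, images=None) -> str` callable that wraps whichever MLLM and reasoning-LLM APIs are available; the function names, prompts, and file layout are illustrative, not the exact ones used in the paper.

```python
import json
from pathlib import Path
from typing import Callable

def describe_failure(generate: Callable[..., str], frame_paths: list[Path]) -> str:
    """Step 1: an MLLM summarizes a single failure trajectory from its frames."""
    prompt = ("These frames show a robot run that ended in failure. "
              "In 2-3 sentences, describe what the robot was doing and the "
              "most likely cause of the failure.")
    return generate(prompt, images=frame_paths)

def cluster_failures(generate: Callable[..., str], descriptions: list[str]) -> dict:
    """Step 2: a reasoning LLM groups the descriptions into failure-mode clusters."""
    listing = "\n".join(f"{i}: {d}" for i, d in enumerate(descriptions))
    prompt = ("Group the following robot-failure descriptions into a small set of "
              "semantically meaningful failure modes. Return JSON mapping a short "
              "cluster name to the list of trajectory indices.\n\n" + listing)
    return json.loads(generate(prompt))

def build_failure_clusters(mllm_generate: Callable[..., str],
                           llm_generate: Callable[..., str],
                           trajectory_dirs: list[Path]) -> dict:
    """Run both steps over a directory-per-trajectory failure log (assumed layout)."""
    descriptions = [describe_failure(mllm_generate, sorted(d.glob("*.jpg")))
                    for d in trajectory_dirs]
    return cluster_failures(llm_generate, descriptions)
```

Separating description from clustering means the second step operates over text only, which keeps the grouping step cheap and lets it scale to long, uncurated failure logs.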

We use Gemini 2.5 Pro to generate the descriptions and OpenAI o4-mini to cluster them. Below are the generated failure clusters for two different datasets:

Fig. Overview of the clustering process.
  • Driving Dataset:
    Fig. Generated Clusters for Driving Dataset.
  • Indoor Navigation Robot:
    Fig. Generated Clusters for Vision-based Indoor Navigation.

Runtime Monitoring Leveraging Failure Clusters

MLLMs can be used for runtime failure monitoring, but they perform poorly when given no contextual information about the system's behavior. We propose to use the generated failure clusters as that context: they list the most probable failures of the particular system and help the MLLM detect them much more reliably at runtime from the past observations. Our prompt is structured in a Chain-of-Thought (CoT) manner, where the MLLM first reasons about the robot's likely future trajectory and then considers whether the predicted path and the surrounding environment could lead to any of the situations listed in the clusters. We compare our method with SOTA VLM-based anomaly detection (VLM-AD), CNN-based failure classification methods (Leaderboard, ENet-BC), and an ablation that omits the cluster information from the prompt (NoContext).
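
The sketch below illustrates how the failure clusters can be injected as context into a CoT-style monitoring prompt. It reuses the hypothetical `generate` wrapper from the clustering sketch, and the prompt wording is illustrative rather than the exact prompt from the paper.

```python
from pathlib import Path
from typing import Callable

def monitor_step(generate: Callable[..., str],
                 recent_frames: list[Path],
                 failure_clusters: list[str]) -> bool:
    """Return True if the MLLM flags an imminent failure for this observation window."""
    cluster_text = "\n".join(f"- {c}" for c in failure_clusters)
    prompt = (
        "You are monitoring a robot using its most recent camera frames.\n"
        "Known failure modes for this system:\n" + cluster_text + "\n\n"
        "Step 1: Reason about the robot's likely trajectory over the next few seconds.\n"
        "Step 2: Decide whether that trajectory and the surrounding environment could "
        "lead to any of the failure modes listed above.\n"
        "End your answer with a single line containing either FAILURE or SAFE."
    )
    answer = generate(prompt, images=recent_frames)
    # Parse only the final verdict line; the preceding reasoning is discarded.
    return answer.strip().splitlines()[-1].strip().upper().startswith("FAILURE")
```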

Table: Runtime Monitoring Performance comparison for the Driving dataset.

Table: Runtime Monitoring Performance comparison for Vision-based Indoor Navigation.

The above tables show that we achieve higher F1 scores and earlier detection times than the baseline methods. We also outperform NoContext, which reinforces the importance of providing context about system behavior in the prompt and the utility of our failure clusters for doing so.
(Refer to the paper for more details on the baseline implementations and prompts.)

We also show an example of runtime monitoring in action with an expert fallback policy, which overrides the underlying controller when a failure is detected.

Fig. Runtime Failure Monitoring in action while using a fallback controller.
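
An illustrative control loop for this setup is sketched below. The environment, policy, and monitor interfaces are placeholders (the monitor could be the `monitor_step` sketch above bound to the system's failure clusters), and the observation window length is an arbitrary choice for illustration.

```python
def run_with_fallback(env, nominal_policy, fallback_policy, monitor,
                      horizon: int = 1000, window: int = 8):
    """Gate control between the learned policy and an expert fallback using the monitor."""
    obs_history = []
    obs = env.reset()
    for _ in range(horizon):
        obs_history.append(obs)
        # If the monitor predicts an upcoming failure from the recent observation
        # window, hand control to the expert fallback policy for this step.
        unsafe = monitor(obs_history[-window:])
        action = fallback_policy(obs) if unsafe else nominal_policy(obs)
        obs, done = env.step(action)  # placeholder environment interface
        if done:
            break
```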

Targeted Failure Data Collection and Policy Refinement

The generated failure clusters can be used to collect more data for the failure modes that are under-represented in the training data. For the indoor navigation robot, we collect expert demonstrations targeting those clusters and augment the original training dataset with the new data. After fine-tuning the policy on this augmented dataset, the failure rate of the system drops from 46% to 18%.
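
A minimal sketch of this refinement step is given below; the dataset and fine-tuning interfaces are placeholders, and the paper should be consulted for the actual training details.

```python
def refine_policy(policy, original_dataset, cluster_to_demos, finetune):
    """Merge targeted expert demonstrations with the original data and fine-tune."""
    # `cluster_to_demos` maps a failure-cluster name to expert demonstrations
    # collected specifically around that failure mode.
    targeted_demos = [demo for demos in cluster_to_demos.values() for demo in demos]
    augmented_dataset = list(original_dataset) + targeted_demos
    # Fine-tune the existing policy on the augmented dataset using whatever
    # imitation objective it was originally trained with (placeholder callable).
    return finetune(policy, augmented_dataset)
```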