KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis
Pith reviewed 2026-05-10 18:43 UTC · model grok-4.3
The pith
KITE turns robot videos into compact keyframe and bird's-eye-view tokens so off-the-shelf VLMs can detect, identify, and explain failures without any training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird's-eye-view representation that encodes relative object layout, axes, timestamps, and detection confidence; these visual cues are serialized with robot-profile and scene-context tokens into a unified prompt that supports failure detection, identification, localization, explanation, and correction with an off-the-shelf VLM.
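To make the serialization step concrete, here is a minimal, hypothetical sketch; the dataclass names, tag syntax, and final question are our own assumptions about what "tokenized evidence" could look like, not KITE's released prompt schema.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str          # open-vocabulary class name
    confidence: float   # detector confidence in [0, 1]
    xy: tuple           # (x, y) position in the BEV plane, metres

@dataclass
class Keyframe:
    timestamp: float    # seconds since trajectory start
    detections: list    # list[Detection]

def bev_to_text(kf: Keyframe) -> str:
    """Serialize one keyframe's schematic BEV layout into plain text,
    covering the cues the paper names: relative layout, axes,
    timestamps, and detection confidence."""
    lines = [f"<keyframe t={kf.timestamp:.2f}s> axes: +x right, +y forward"]
    for d in kf.detections:
        lines.append(
            f"  {d.label} at (x={d.xy[0]:+.2f}, y={d.xy[1]:+.2f}) "
            f"conf={d.confidence:.2f}")
    return "\n".join(lines)

def build_prompt(robot_profile: str, scene_context: str,
                 keyframes: list) -> str:
    """Unified prompt: robot-profile and scene-context tokens,
    followed by the tokenized evidence for every keyframe."""
    header = f"<robot>{robot_profile}</robot>\n<scene>{scene_context}</scene>"
    body = "\n".join(bev_to_text(kf) for kf in keyframes)
    return f"{header}\n{body}\nQuestion: did the task fail, and why?"

# Toy usage with two keyframes of a tabletop pick-and-place
kfs = [
    Keyframe(1.2, [Detection("mug", 0.91, (0.10, 0.42))]),
    Keyframe(4.8, [Detection("mug", 0.88, (0.35, 0.40)),
                   Detection("gripper", 0.95, (0.33, 0.41))]),
]
print(build_prompt("dual-arm, parallel-jaw grippers",
                   "tabletop pick-and-place", kfs))
```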
What carries the argument
The KITE front-end: keyframe selection driven by motion salience, paired with schematic BEV layouts and serialized robot-profile tokens, together producing compact tokenized evidence for VLMs.
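Since the paper cites Farnebäck's two-frame flow estimator [48], a mean-flow-magnitude rule is one plausible reading of "motion salience." The sketch below is an assumption, not the released selection rule; the values of `mag_thresh` and `min_gap` are illustrative.

```python
import cv2
import numpy as np

def select_keyframes(gray_frames, mag_thresh=1.5, min_gap=10):
    """Return indices of motion-salient keyframes.

    gray_frames: list of HxW uint8 grayscale frames.
    A frame is kept when its mean Farneback flow magnitude against the
    previous frame exceeds mag_thresh; min_gap enforces a minimum
    inter-keyframe interval so the evidence stays compact.
    """
    keep, last = [], -min_gap
    for i in range(1, len(gray_frames)):
        flow = cv2.calcOpticalFlowFarneback(
            gray_frames[i - 1], gray_frames[i], None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        mean_mag = np.linalg.norm(flow, axis=2).mean()  # mean pixel displacement
        if mean_mag > mag_thresh and i - last >= min_gap:
            keep.append(i)
            last = i
    return keep
```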
If this is right
- On RoboFAC, KITE with Qwen2.5-VL substantially outperforms vanilla Qwen2.5-VL in training-free failure detection, identification, and localization.
- The same KITE prompt supports the full pipeline of detection through correction with one off-the-shelf model.
- A small QLoRA fine-tune on top of KITE further improves explanation and correction quality (a configuration sketch follows this list).
- Qualitative results on real dual-arm robots indicate that the structured evidence transfers beyond simulation.
- KITE remains competitive with a fully tuned RoboFAC baseline while requiring no task-specific training for the base VLM.
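For the QLoRA bullet above, a minimal configuration sketch using the Hugging Face transformers and peft APIs; the checkpoint id, LoRA rank, and target modules are illustrative assumptions, and the paper's actual Qwen2.5-VL recipe may require a model-specific class rather than AutoModelForCausalLM.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Hypothetical checkpoint id, for illustration only
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-vlm-checkpoint", quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# Small low-rank adapters on the attention projections (the "LoRA" part)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights train
```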
Where Pith is reading between the lines
- The same keyframe-plus-BEV serialization could be reused for other long-horizon video tasks such as anomaly detection in assembly or navigation logs.
- Because the BEV is schematic and human-readable, the evidence stream may also serve as an interpretable log for human oversight or regulatory review.
- If keyframe selection misses subtle contact events, adding a lightweight optical-flow or contact-sensor filter before tokenization would be a direct extension (a minimal sketch follows this list).
- The method's reliance on open-vocabulary detections suggests easy swapping of the underlying detector without retraining the VLM prompt logic.
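For the contact-sensor filter suggested in the list above, a minimal illustration assuming a per-frame gripper-force trace is available; the function name, threshold, and spacing rule are our assumptions.

```python
def add_contact_keyframes(keyframes, force_trace,
                          force_thresh=5.0, min_gap=10):
    """Augment motion-salient keyframes with contact events.

    force_trace: per-frame gripper force reading in newtons. Frames
    where the force crosses force_thresh are added even when visual
    motion is low, so slow-drift or grasp-instability failures are
    less likely to be dropped before tokenization.
    """
    kept = sorted(keyframes)
    for i in range(1, len(force_trace)):
        crossed = ((force_trace[i - 1] < force_thresh)
                   != (force_trace[i] < force_thresh))
        if crossed and all(abs(i - k) >= min_gap for k in kept):
            kept.append(i)
    return sorted(kept)
```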
Load-bearing premise
The chosen keyframes and BEV schematics preserve every piece of information an off-the-shelf VLM needs to correctly analyze failures without critical omissions or misleading cues.
What would settle it
A robot failure whose root cause is visible only in frames or spatial details omitted by the keyframe and BEV selection process, causing the VLM to produce an incorrect detection or explanation.
Original abstract
We present KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird's-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence. These visual cues are serialized with robot-profile and scene-context tokens into a unified prompt, allowing the same front-end to support failure detection, identification, localization, explanation, and correction with an off-the-shelf VLM. On the RoboFAC benchmark, KITE with Qwen2.5-VL substantially improves over vanilla Qwen2.5-VL in the training-free setting, with especially large gains on simulation failure detection, identification, and localization, while remaining competitive with a RoboFAC-tuned baseline. A small QLoRA fine-tune further improves explanation and correction quality. We also report qualitative results on real dual-arm robots, demonstrating the practical applicability of KITE as a structured and interpretable front-end for robot failure analysis. Code and models are released on our project page: https://m80hz.github.io/kite/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces KITE, a training-free front-end that distills robot execution trajectories into a compact set of motion-salient keyframes augmented with open-vocabulary detections and schematic bird's-eye-view (BEV) representations encoding layout, axes, timestamps, and confidence. These are serialized with robot-profile and scene-context tokens to form prompts for off-the-shelf VLMs, enabling failure detection, identification, localization, explanation, and correction. On the RoboFAC benchmark, KITE paired with Qwen2.5-VL yields substantial gains over the vanilla VLM in the training-free regime (especially on simulation tasks) while remaining competitive with a RoboFAC-tuned baseline; a small QLoRA fine-tune further boosts explanation and correction quality. Qualitative results on real dual-arm robots are also presented, and code/models are released.
Significance. If the benchmark results hold under scrutiny, KITE provides a practical, interpretable, and training-free mechanism for injecting structured visual evidence into VLMs for robot failure analysis, which could improve reliability and debuggability in deployed robotic systems. The explicit release of code and models is a clear strength that supports reproducibility and extension by the community.
major comments (2)
- [§3] §3 (Keyframe extraction and BEV serialization): The motion-salient keyframe heuristic is described at a high level but without the precise selection rule (e.g., optical-flow magnitude threshold, change-detection metric, or minimum inter-keyframe interval). This is load-bearing for the central claim because low-velocity failure modes (slow drift, grasp instability, force-threshold violations) may be systematically excluded; if the extractor drops such cues, the subsequent BEV prompt receives no signal and the reported gains on RoboFAC may be benchmark-specific rather than a general property of the representation.
- [§4] §4 (Experimental evaluation): The manuscript reports large improvements on RoboFAC failure detection/identification/localization but does not specify the exact evaluation metrics, confidence intervals, statistical tests, or controls for keyframe-selection bias. Without these details it is impossible to determine whether the gains are robust or sensitive to the particular failure distribution in the benchmark.
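The confidence intervals requested in the comment above are straightforward to supply; a minimal percentile-bootstrap sketch over per-trajectory correctness (our illustration, not the paper's protocol):

```python
import numpy as np

def bootstrap_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for per-trajectory accuracy.

    correct: 0/1 array, one entry per evaluated trajectory.
    """
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    boots = rng.choice(correct, size=(n_boot, len(correct)),
                       replace=True).mean(axis=1)
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), (lo, hi)

# Example: 140 of 200 trajectories scored correct
acc, (lo, hi) = bootstrap_ci(np.r_[np.ones(140), np.zeros(60)])
print(f"accuracy={acc:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```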
minor comments (2)
- The abstract and §4 could state the typical number of keyframes retained per trajectory and the resulting prompt token count; this would help readers assess the claimed compactness.
- Figure captions for the qualitative real-robot examples should explicitly note which failure types are illustrated and whether any low-motion cases were included.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important areas where additional precision will strengthen the manuscript. We address each major comment below and will incorporate the necessary revisions.
Point-by-point responses
Referee: [§3] §3 (Keyframe extraction and BEV serialization): The motion-salient keyframe heuristic is described at a high level but without the precise selection rule (e.g., optical-flow magnitude threshold, change-detection metric, or minimum inter-keyframe interval). This is load-bearing for the central claim because low-velocity failure modes (slow drift, grasp instability, force-threshold violations) may be systematically excluded; if the extractor drops such cues, the subsequent BEV prompt receives no signal and the reported gains on RoboFAC may be benchmark-specific rather than a general property of the representation.
Authors: We agree that the keyframe extraction procedure requires a more explicit description to support reproducibility and to allow evaluation of its coverage for low-velocity failures. In the revised manuscript we will expand §3 with the precise selection rule as implemented in the released code, including the motion-saliency metric, any thresholds, and the minimum inter-keyframe interval. We will also add a short discussion of how low-velocity or static failure cues are captured through the complementary open-vocabulary detections and BEV layout encodings, while acknowledging that the current heuristic is primarily motion-driven and may benefit from future extensions for purely static anomalies.
Revision: yes
Referee: [§4] §4 (Experimental evaluation): The manuscript reports large improvements on RoboFAC failure detection/identification/localization but does not specify the exact evaluation metrics, confidence intervals, statistical tests, or controls for keyframe-selection bias. Without these details it is impossible to determine whether the gains are robust or sensitive to the particular failure distribution in the benchmark.
Authors: We concur that the experimental reporting should be more complete. In the revised §4 we will explicitly define the metrics used for each sub-task (detection, identification, localization, explanation, and correction), report confidence intervals, include the results of appropriate statistical tests, and add controls or ablations that address potential keyframe-selection bias (for example, comparisons against uniform or random keyframe baselines). These additions will make the robustness of the observed gains clearer.
Revision: yes
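The uniform and random keyframe baselines promised in this response are easy to pin down; a minimal sketch (illustrative function names and seed, not the authors' released ablation code):

```python
import numpy as np

def uniform_keyframes(n_frames, k):
    """Evenly spaced keyframes: a selection-bias control baseline."""
    return np.linspace(0, n_frames - 1, num=k, dtype=int).tolist()

def random_keyframes(n_frames, k, seed=0):
    """Uniformly random keyframes (sorted), same budget k."""
    rng = np.random.default_rng(seed)
    return sorted(rng.choice(n_frames, size=k, replace=False).tolist())
```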
Circularity Check
No circularity in KITE empirical pipeline
Full rationale
The paper describes an empirical front-end pipeline that selects motion-salient keyframes from robot videos, generates schematic BEV representations, and serializes them with tokens for an off-the-shelf VLM. Performance is measured via direct comparison on the external RoboFAC benchmark against vanilla VLM and a tuned baseline. No equations, fitted parameters, or predictions are presented that reduce to the method's own inputs by construction. No uniqueness theorems, self-cited ansatzes, or self-definitional steps appear in the provided text. The keyframe heuristic and BEV encoding are presented as standard components without load-bearing self-references.
Reference graph
Works this paper leans on
- [1] A. Brohan, Y. Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Irpan, E. Jang, R. Julian, et al., "Do as I can, not as I say: Grounding language in robotic affordances," in Conference on Robot Learning. PMLR, 2023, pp. 287–318.
- [2] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al., "PaLM-E: An embodied multimodal language model," arXiv preprint arXiv:2303.03378, 2023.
- [3] S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Gimenez, Y. Sulsky, J. Kay, J. T. Springenberg, et al., "A generalist agent," arXiv preprint arXiv:2205.06175, 2022.
- [4] Y. Hu, Q. Xie, V. Jain, J. Francis, J. Patrikar, N. Keetha, S. Kim, Y. Xie, T. Zhang, Z. Zhao, et al., "Toward general-purpose robots via foundation models: A survey and meta-analysis," arXiv preprint arXiv:2312.08782, 2023.
- [5] R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y. Zhu, S. Song, A. Kapoor, K. Hausman, et al., "Foundation models in robotics: Applications, challenges, and the future," arXiv preprint arXiv:2312.07843, 2023.
- [6] F. Liu, K. Fang, P. Abbeel, and S. Levine, "MOKA: Open-vocabulary robotic manipulation through mark-based visual prompting," arXiv preprint arXiv:2403.03174, 2024.
- [7] H. Huang, F. Lin, Y. Hu, S. Wang, and Y. Gao, "CoPa: General robotic manipulation through spatial constraints of parts with foundation models," arXiv preprint arXiv:2403.08248, 2024.
- [8] W. Huang, C. Wang, Y. Li, R. Zhang, and L. Fei-Fei, "ReKep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation," arXiv preprint arXiv:2409.01652, 2024.
- [9] Z. Liu, A. Bahety, and S. Song, "REFLECT: Summarizing robot experiences for failure explanation and correction," in CoRL, 2023.
- [10] J. Duan, W. Pumacay, N. Kumar, Y. R. Wang, S. Tian, W. Yuan, R. Krishna, D. Fox, A. Mandlekar, and Y. Guo, "AHA: A vision-language-model for detecting and reasoning over failures in robotic manipulation," in ICLR, 2025.
- [11] W. Lu, M. Ye, Z. Ye, R. Tao, S. Yang, and B. Zhao, "RoboFAC: A comprehensive framework for robotic failure analysis and correction," arXiv preprint arXiv:2505.12224, 2025. [Online]. Available: https://arxiv.org/abs/2505.12224
- [13] RealMan Robotics, "Compound robot," https://www.realman-robotics.com/compound-robot, 2024, accessed 2024-09-07.
- [14] ALOHA 2 Team, J. Aldaco, T. Armstrong, R. Baruch, J. Bingham, S. Chan, K. Draper, D. Dwibedi, C. Finn, P. Florence, S. Goodrich, W. Gramlich, T. Hage, A. Herzog, J. Hoech, T. Nguyen, I. Storz, B. Tabanpour, L. Takayama, J. Tompson, A. Wahid, T. Wahrburg, S. Xu, S. Yaroshenko, K. Zakka, and T. Z. Zhao, "ALOHA 2: An enhanced low-cost hardware for bimanual teleoperation," 2024.
- [15] F. Zeng, W. Gan, Y. Wang, N. Liu, and P. S. Yu, "Large language models for robotics: A survey," arXiv preprint arXiv:2311.07226, 2023.
- [16] C. Zhang, J. Chen, J. Li, Y. Peng, and Z. Mao, "Large language models for human–robot interaction: A review," Biomimetic Intelligence and Robotics, vol. 3, no. 4, p. 100131, 2023.
- [17] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
- [18] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023.
- [19] OpenAI, "GPT-4o system card," 2024. [Online]. Available: https://arxiv.org/abs/2410.21276
- [20] H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual instruction tuning," 2023.
- [21] H. Liu, C. Li, Y. Li, B. Li, Y. Zhang, S. Shen, and Y. J. Lee, "LLaVA-NeXT: Improved reasoning, OCR, and world knowledge," January 2024. [Online]. Available: https://llava-vl.github.io/blog/2024-01-30-llava-next
- [22] M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-B. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, et al., "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context," arXiv preprint arXiv:2403.05530, 2024.
- [23] Gemini Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al., "Gemini: A family of highly capable multimodal models," arXiv preprint arXiv:2312.11805, 2023.
- [24] J. Duan, S. Yu, H. L. Tan, H. Zhu, and C. Tan, "A survey of embodied AI: From simulators to research tasks," IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 6, no. 2, pp. 230–244, 2022.
- [25] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., "Training language models to follow instructions with human feedback," Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
- [26] M. Crosby, M. Rovatsos, and R. Petrick, "Automated agent decomposition for classical planning," in Proceedings of the International Conference on Automated Planning and Scheduling, vol. 23, 2013, pp. 46–54.
- [27] B. Xu, Z. Peng, B. Lei, S. Mukherjee, Y. Liu, and D. Xu, "ReWOO: Decoupling reasoning from observations for efficient augmented language models," arXiv preprint arXiv:2305.18323, 2023.
- [28] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, "Large language models are zero-shot reasoners," Advances in Neural Information Processing Systems, vol. 35, pp. 22199–22213, 2022.
- [29] D. Das, S. Banerjee, and S. Chernova, "Explainable AI for robot failures: Generating explanations that improve user assistance in fault recovery," in Proceedings of the 2021 ACM/IEEE International Conference on Human-Robot Interaction, 2021, pp. 351–360.
- [30] S. Rosenthal, S. P. Selvaraj, and M. M. Veloso, "Verbalization: Narration of autonomous robot experience," in IJCAI, vol. 16, 2016, pp. 862–868.
- [31] S. Ye, G. Neville, M. Schrum, M. Gombolay, S. Chernova, and A. Howard, "Human trust after robot mistakes: Study of the effects of different forms of robot communication," in 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE, 2019, pp. 1–7.
- [32] P. Khanna, E. Yadollahi, M. Björkman, I. Leite, and C. Smith, "User study exploring the role of explanation of failures by robots in human robot collaboration tasks," arXiv preprint arXiv:2303.16010, 2023.
- [33] J. Arkin, D. Park, S. Roy, M. R. Walter, N. Roy, T. M. Howard, and R. Paul, "Multimodal estimation and communication of latent semantic knowledge for robust execution of robot instructions," The International Journal of Robotics Research, vol. 39, no. 10–11, pp. 1279–1304, 2020.
- [34] A. Bucker, L. Figueredo, S. Haddadin, A. Kapoor, S. Ma, S. Vemprala, and R. Bonatti, "LaTTe: Language trajectory transformer," arXiv preprint arXiv:2208.02918, 2022.
- [35] S. S. Raman, V. Cohen, I. Idrees, E. Rosen, R. Mooney, S. Tellex, and D. Paulius, "CAPE: Corrective actions from precondition errors using large language models," in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 14070–14077.
- [36] Z. Wang, B. Liang, V. Dhat, Z. Brumbaugh, N. Walker, R. Krishna, and M. Cakmak, "I can tell what I am doing: Toward real-world natural language grounding of robot experiences," arXiv preprint arXiv:2411.12960, 2024.
- [37] C. DeChant, I. Akinola, and D. Bauer, "Learning to summarize and answer questions about a virtual robot's past actions," Autonomous Robots, vol. 47, no. 8, pp. 1103–1118, 2023.
- [38] Y. Du, K. Konyushkova, M. Denil, A. Raju, J. Landon, F. Hill, N. de Freitas, and S. Cabi, "Vision-language models as success detectors," arXiv preprint arXiv:2303.07280, 2023.
- [39] Y. J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V. Kumar, and A. Zhang, "VIP: Towards universal visual reward and representation via value-implicit pre-training," arXiv preprint arXiv:2210.00030, 2022.
- [40] H. Ha, P. Florence, and S. Song, "Scaling up and distilling down: Language-guided robot skill acquisition," in Conference on Robot Learning. PMLR, 2023, pp. 3766–3777.
- [41] L. Wang, Y. Ling, Z. Yuan, M. Shridhar, C. Bao, Y. Qin, B. Wang, H. Xu, and X. Wang, "GenSim: Generating robotic simulation tasks via large language models," arXiv preprint arXiv:2310.01361, 2023.
- [42] K. Rana, J. Haviland, S. Garg, J. Abou-Chakra, I. Reid, and N. Suenderhauf, "SayPlan: Grounding large language models using 3D scene graphs for scalable robot task planning," arXiv preprint arXiv:2307.06135, 2023.
- [43] T. Choudhary, V. Dewangan, S. Chandhok, S. Priyadarshan, A. Jain, A. K. Singh, S. Srivastava, K. M. Jatavallabhula, and K. M. Krishna, "Talk2BEV: Language-enhanced bird's-eye view maps for autonomous driving," in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 16345–16352.
- [44] C. Deng, S. Chen, D. Chen, Y. He, and Q. Wu, "Sketch, ground, and refine: Top-down dense video captioning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 234–243.
- [45] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu, et al., "Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection," arXiv preprint arXiv:2303.05499, 2023.
- [46] L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, "Depth Anything V2," arXiv preprint arXiv:2406.09414, 2024.
- [47] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, "Qwen2.5-VL technical report," arXiv preprint arXiv:2502.13923, 2025.
- [48] G. Farnebäck, "Two-frame motion estimation based on polynomial expansion," in Scandinavian Conference on Image Analysis. Springer, 2003, pp. 363–370.
- [49] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "QLoRA: Efficient finetuning of quantized LLMs," arXiv preprint arXiv:2305.14314, 2023.