Identifying the Unknown: Prompt-Free Open Vocabulary Anomaly Recognition for Robot-Object Interaction

Jan-Gerrit Habekost; Philipp Allgeuer; Stefan Wermter

arxiv: 2606.26829 · v1 · pith:FYT7SMQKnew · submitted 2026-06-25 · 💻 cs.CV

Pith reviewed 2026-06-26 05:43 UTC · model grok-4.3

classification 💻 cs.CV

keywords anomaly detection · open vocabulary · prompt-free · robot-object interaction · masked autoencoder · object recognition · open-world autonomy · NOVIC

The pith

AnomNOVIC lets robots recognize previously unseen objects without prompts by pairing an anomaly detection stage that proposes bounding boxes with a prompt-free open vocabulary classifier.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AnomNOVIC as a two-stage framework for robots to identify previously unseen objects in open-world settings. It first applies a masked autoencoder to locate anomalous regions through generic bounding boxes, then uses the NOVIC classifier to label those regions without any text prompts or candidate class lists. This design targets continuous deployment needs in robotic environments where new objects appear unpredictably. Tests in a humanoid robot tabletop workspace yield 47.1 percent AP in fully prompt-free mode and higher scores when candidates are optionally supplied, exceeding several open vocabulary baselines. The approach aims to support more autonomous robot behavior by reducing reliance on predefined vocabularies.

Core claim

AnomNOVIC is a known-workspace two-stage system that combines a masked autoencoder (MAE) trained for anomaly detection, which generates object-agnostic bounding boxes, with NOVIC, a real-time prompt-free open vocabulary image classifier. The MAE stage enables classification of salient regions without requiring a predefined candidate class list or prompts. On a tabletop robot-object environment with the NICOL humanoid, it reaches 47.1 percent AP and 57.5 percent AP50 for prompt-free recognition, rising to 59.0 percent AP and 72.5 percent AP50 with class candidates provided. Across additional datasets including an in-the-wild set of 48 unique objects, it attains up to 82.6 percent prompt-free detection and classification accuracy.

What carries the argument

AnomNOVIC two-stage framework, where the masked autoencoder produces generic object-agnostic bounding boxes that enable the NOVIC classifier to operate without prompts or class lists.
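The two-stage flow can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the reconstruction is passed in as a plain array rather than produced by an MAE, the fixed error threshold and 4-connected flood fill are stand-ins for whatever localization the paper actually uses, and `classify` is a hypothetical callable, not the NOVIC API.

```python
from collections import deque
from typing import Callable, List, Tuple

import numpy as np

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


def error_map(image: np.ndarray, reconstruction: np.ndarray) -> np.ndarray:
    """Per-pixel absolute reconstruction error, averaged over channels."""
    diff = image.astype(np.float32) - reconstruction.astype(np.float32)
    return np.abs(diff).mean(axis=-1)


def boxes_from_mask(mask: np.ndarray, min_area: int = 4) -> List[Box]:
    """Bounding boxes of 4-connected regions in a boolean mask (BFS flood fill)."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes: List[Box] = []
    for y in range(h):
        for x in range(w):
            if not mask[y, x] or seen[y, x]:
                continue
            queue = deque([(y, x)])
            seen[y, x] = True
            ys, xs = [y], [x]
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
                        ys.append(ny)
                        xs.append(nx)
            if len(ys) >= min_area:  # drop tiny regions as noise
                boxes.append((min(xs), min(ys), max(xs) + 1, max(ys) + 1))
    return boxes


def recognize(
    image: np.ndarray,
    reconstruction: np.ndarray,
    classify: Callable[[np.ndarray], str],  # hypothetical prompt-free classifier
    threshold: float = 0.2,                 # illustrative value, not from the paper
) -> List[Tuple[Box, str]]:
    """Stage 1: class-agnostic boxes from anomalies. Stage 2: label each crop."""
    boxes = boxes_from_mask(error_map(image, reconstruction) > threshold)
    return [(b, classify(image[b[1]:b[3], b[0]:b[2]])) for b in boxes]
```

The point of the sketch is the interface: stage one never sees class names, and stage two never sees the full scene, which is what allows either part to be swapped out independently.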

If this is right

Robots can detect and classify novel objects in real time during continuous operation without predefined class lists.
Performance reaches 47.1 percent AP and 57.5 percent AP50 in prompt-free mode on the evaluated robot workspace.
Accuracy increases to 59.0 percent AP and 72.5 percent AP50 when optional class candidates are supplied.
The method attains up to 82.6 percent prompt-free detection and classification accuracy across additional datasets, including an in-the-wild test set with 48 unique objects.
It exceeds the AP results of YOLO-World-v2, OWLv2, and YOLOE across the reported test sets.
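For readers unfamiliar with the AP and AP50 figures quoted in this list, the following is a rough sketch of average precision for a single class at one IoU threshold. AP50 corresponds to `iou_thr=0.5`. It is an assumption, not something this page confirms, that the paper's plain "AP" follows the common COCO-style convention of averaging over thresholds from 0.50 to 0.95 and over classes. The function names and data layout are illustrative.

```python
from typing import Dict, List, Tuple

import numpy as np

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(
    detections: List[Tuple[str, Box, float]],  # (image_id, box, score), one class
    ground_truth: Dict[str, List[Box]],        # image_id -> boxes of that class
    iou_thr: float = 0.5,
) -> float:
    total_gt = sum(len(v) for v in ground_truth.values())
    if total_gt == 0:
        return 0.0
    matched = {img: [False] * len(b) for img, b in ground_truth.items()}
    tp = []
    # Visit detections from most to least confident; each GT box matches once.
    for img, box, _score in sorted(detections, key=lambda d: d[2], reverse=True):
        best, best_j = 0.0, -1
        for j, g in enumerate(ground_truth.get(img, [])):
            overlap = iou(box, g)
            if overlap > best and not matched[img][j]:
                best, best_j = overlap, j
        hit = best_j >= 0 and best >= iou_thr
        if hit:
            matched[img][best_j] = True
        tp.append(1.0 if hit else 0.0)
    tp_arr = np.array(tp, dtype=np.float64)
    tp_cum = np.cumsum(tp_arr)
    fp_cum = np.cumsum(1.0 - tp_arr)
    recall = tp_cum / total_gt
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-12)
    # All-point interpolation: area under the monotone precision envelope.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    idx = np.nonzero(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```

Under this definition a detection only counts if its box overlaps an unmatched ground-truth box well enough, so the metric penalizes both poor localization and false alarms.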

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The modular separation of anomaly-based localization from classification could allow independent upgrades to either component for new environments.
Deployment on additional robot platforms beyond the tested humanoid could reveal how workspace assumptions affect generalization.
Integration with downstream robotic actions such as grasping might become feasible once objects are identified without prior vocabulary.

Load-bearing premise

The masked autoencoder must generate bounding boxes that are accurate enough and sufficiently generic for the NOVIC classifier to perform effective recognition without prompts or any candidate class list.

What would settle it

An evaluation in which the anomaly detection stage produces imprecise or incomplete bounding boxes on novel objects, causing overall AP to fall below that of the tested baselines such as YOLO-World-v2.

Figures

Figures reproduced from arXiv: 2606.26829 by Jan-Gerrit Habekost, Philipp Allgeuer, Stefan Wermter.

**Figure 1.** Left: The LLM-powered NICOL is used in a human-robot-object interaction scenario requiring open vocabulary object detection. Right: A two-stage anomaly recognition model, AnomNOVIC, unites an anomaly detection masked autoencoder (MAE) with a strong prompt-free open vocabulary image classification model (NOVIC).

**Figure 2.** Examples from our training (left, center) and test (right) datasets, with (bottom) and without (top) distractor objects. The training dataset only contains clean images of the workspace and corresponding table masks (making it quick and easy to collect), while the test dataset contains anomalous images with many (often overlapping) objects.

**Figure 3.** Left: Easy, medium, and hard examples (left to right) of the jello, minibus, bowl, and green pepper test classes (top to bottom). Right: Example masked autoencoder input (top left), target reconstruction (top right), target anomaly mask (bottom left), and target table mask (bottom right), for the training dataset with distractor objects.

**Figure 4.** Anomaly localization example, showing the input image, binary anomaly mask, mask-based anomalies, reconstructed image $\hat{I}$, reconstruction error score map $S_r$, reconstruction-based anomalies, and the final merged anomalies with corresponding prompt-free NOVIC classifications and NOVIC / merged MAE anomaly scores.

**Figure 5.** Top: Prompt-free predictions of YOLOE-11L (left) and AnomNOVIC-S N-FT0 (right) on the Wild dataset, showing the starkly better performance of AnomNOVIC. Bottom: Prompt-free predictions with a moving camera in a fuselage environment.

read the original abstract

Robots operating in real-world environments must in general be able to recognize previously unseen objects. As robotic systems move toward open-world autonomy, there is a growing, yet largely unmet, need for open vocabulary object detectors that are prompt-free and efficient enough for continuous deployment. We present AnomNOVIC, a two-stage known-workspace framework that combines a masked autoencoder (MAE) trained for anomaly detection, with NOVIC, a powerful real-time prompt-free open vocabulary image classifier. The MAE produces generic object-agnostic bounding boxes, allowing NOVIC to classify salient image regions without requiring a predefined candidate class list. We evaluate AnomNOVIC against strong open vocabulary baselines in a tabletop robot-object environment featuring the NICOL humanoid robot, reaching 47.1% AP / 57.5% AP50 for prompt-free recognition, and 59.0% AP / 72.5% AP50 if class candidates are provided. Across additional datasets, including an in-the-wild test set with 48 unique objects, AnomNOVIC reaches up to 82.6% prompt-free detection and classification accuracy. These results significantly surpass all tested open vocabulary baselines, including YOLO-World-v2, OWLv2, and YOLOE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AnomNOVIC pairs MAE anomaly boxes with prompt-free NOVIC classification and beats listed baselines on robot tabletop data, but the absence of any MAE-stage metrics leaves the source of the gains unclear.

The main point is that this paper describes a two-stage pipeline called AnomNOVIC for prompt-free open-vocabulary recognition in a robot-object setting. A masked autoencoder generates generic bounding boxes for anomalies, then NOVIC classifies those regions without prompts or candidate lists. On their NICOL tabletop setup it reaches 47.1% AP prompt-free and 59.0% AP with candidates, and up to 82.6% accuracy on an in-the-wild set with 48 objects, outperforming the cited baselines like YOLO-World-v2, OWLv2, and YOLOE.

The practical focus is useful. Robotics needs methods that run continuously without per-query prompts, and separating anomaly proposal from classification is a reasonable way to handle unknowns. The additional dataset results give some sense that the approach is not limited to one controlled scene.

The weakest link is exactly what the stress-test note flags: the abstract supplies no numbers on the MAE proposals themselves, with no recall, no IoU against ground-truth unknowns, and no precision of the anomaly boxes. Without those, the headline AP numbers cannot be cleanly attributed to the prompt-free classifier. The gains could come from unusually clean proposals rather than from NOVIC's ability to work on generic regions. The paper would be stronger with even basic detector-stage diagnostics.
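The detector-stage diagnostics asked for here are cheap to compute. The sketch below shows one way to get class-agnostic proposal recall, precision, and mean matched IoU for a single image, using greedy one-to-one matching by descending IoU. The function names, the 0.5 threshold, and the matching rule are assumptions for illustration. Nothing on this page states how the authors would measure proposal quality.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def proposal_diagnostics(
    proposals: List[Box], ground_truth: List[Box], iou_thr: float = 0.5
) -> Dict[str, float]:
    """Class-agnostic recall, precision, and mean IoU of matched proposals."""
    # Score every (proposal, ground-truth) pair, best overlaps first.
    pairs = sorted(
        (
            (iou(p, g), pi, gi)
            for pi, p in enumerate(proposals)
            for gi, g in enumerate(ground_truth)
        ),
        reverse=True,
    )
    used_p, used_g, matched = set(), set(), []
    for overlap, pi, gi in pairs:
        if overlap < iou_thr:
            break  # remaining pairs overlap even less
        if pi in used_p or gi in used_g:
            continue  # each proposal and each GT box is matched at most once
        used_p.add(pi)
        used_g.add(gi)
        matched.append(overlap)
    n = len(matched)
    return {
        "recall": n / len(ground_truth) if ground_truth else 0.0,
        "precision": n / len(proposals) if proposals else 0.0,
        "mean_iou": sum(matched) / n if n else 0.0,
    }
```

Reporting these three numbers alongside AP would separate "the boxes were good" from "the classifier was good", which is the attribution question raised above.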

The work is incremental engineering rather than a new theoretical result, but the empirical comparisons are direct and the problem statement matches a real deployment constraint. It is aimed at robotics and applied CV groups that need open-world perception on physical platforms. A reader building similar systems would find the architecture and the reported deltas worth examining.

It deserves peer review. The core idea is concrete, the baselines are relevant, and the robotics motivation is clear; the missing detector metrics are fixable in revision.

Referee Report

2 major / 1 minor

Summary. The paper presents AnomNOVIC, a two-stage framework for prompt-free open vocabulary anomaly recognition in robot-object interactions. A masked autoencoder (MAE) generates generic object-agnostic bounding boxes for anomaly detection, which are then classified by NOVIC, a real-time prompt-free open vocabulary image classifier, without requiring predefined class lists or prompts. Evaluations on a tabletop environment with the NICOL humanoid robot report 47.1% AP / 57.5% AP50 for prompt-free recognition (59.0% AP / 72.5% AP50 with class candidates), outperforming baselines including YOLO-World-v2, OWLv2, and YOLOE. Additional results on in-the-wild datasets with 48 unique objects reach up to 82.6% prompt-free accuracy.

Significance. If the performance claims hold after addressing the proposal quality issue, this could meaningfully advance open-world robotic perception by demonstrating a practical, efficient prompt-free pipeline for detecting and classifying unknown objects in continuous deployment. The empirical scope across a robot-specific tabletop setup and multiple additional datasets, including in-the-wild tests, provides concrete evidence of applicability beyond standard benchmarks.

major comments (2)

[Methods / two-stage framework] Description of the two-stage framework (MAE component): the quality of the bounding boxes produced by the MAE is not quantified with any metrics such as recall, IoU against ground-truth unknowns, or proposal precision. This is load-bearing for the central claim, as the reported 47.1% prompt-free AP cannot be unambiguously attributed to NOVIC's prompt-free classification unless the MAE proposals are shown to be sufficiently accurate and complete.
[Results / evaluation] Results section reporting AP numbers: no training details, evaluation protocols, variance estimates, or statistical tests accompany the comparisons to YOLO-World-v2, OWLv2, and YOLOE. Without these, the headline gains cannot be verified as robust support for the prompt-free design.

minor comments (1)

[Abstract and Results] The distinction between the prompt-free and class-candidate-provided settings could be made more explicit when presenting the 47.1% vs. 59.0% AP figures to avoid reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable comments on our work. We have carefully considered each point and provide our responses below, along with plans for revisions to the manuscript.

Referee: [Methods / two-stage framework] Description of the two-stage framework (MAE component): the quality of the bounding boxes produced by the MAE is not quantified with any metrics such as recall, IoU against ground-truth unknowns, or proposal precision. This is load-bearing for the central claim, as the reported 47.1% prompt-free AP cannot be unambiguously attributed to NOVIC's prompt-free classification unless the MAE proposals are shown to be sufficiently accurate and complete.

Authors: We agree that providing quantitative evaluation of the MAE-generated bounding boxes is essential to support the attribution of performance to the NOVIC classifier. In the revised manuscript, we will add metrics including recall, average IoU with ground-truth unknown objects, and proposal precision for the MAE component on the relevant datasets. This will demonstrate the quality and completeness of the proposals. revision: yes
Referee: [Results / evaluation] Results section reporting AP numbers: no training details, evaluation protocols, variance estimates, or statistical tests accompany the comparisons to YOLO-World-v2, OWLv2, and YOLOE. Without these, the headline gains cannot be verified as robust support for the prompt-free design.

Authors: We acknowledge the need for more rigorous reporting in the results. The updated manuscript will include comprehensive training details for the MAE and NOVIC models, a full description of the evaluation protocols used for computing AP, variance estimates from multiple experimental runs, and appropriate statistical tests to validate the performance improvements over the baselines. revision: yes
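As a rough illustration of what "variance estimates from multiple experimental runs" and "appropriate statistical tests" could look like, the sketch below summarizes per-run scores and runs an exact paired sign-flip permutation test. The choice of test is an assumption. The rebuttal does not say which test would be used, and the function names are hypothetical. The exact enumeration visits 2^n sign patterns, so it only suits a small number of paired runs.

```python
from itertools import product
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over runs (needs at least two runs)."""
    return mean(scores), stdev(scores)


def paired_sign_flip_p_value(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis that both methods perform equally, each paired
    difference is equally likely to be positive or negative, so we enumerate
    every sign assignment and count how often the mean difference is at least
    as extreme as the observed one.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("a and b must be non-empty and the same length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        if abs(mean(flipped)) >= observed - 1e-12:  # tolerance for float ties
            extreme += 1
    return extreme / total
```

With only a handful of paired runs the smallest attainable p-value is 2 / 2^n, so a test like this mainly guards against over-reading a gap that a few seeds could explain.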

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper describes an empirical two-stage framework (MAE for generic bounding boxes followed by NOVIC classification) and reports performance metrics such as 47.1% AP as outcomes of evaluation on tabletop and in-the-wild datasets. No equations, fitted parameters, self-referential derivations, or load-bearing self-citations appear in the provided text. Claims are presented as experimental results rather than quantities defined in terms of the method itself, making the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract contains no mathematical models, free parameters, axioms, or invented entities; it describes an applied machine learning system evaluated on empirical tasks.

Reference graph

Works this paper leans on

23 extracted references

[1] Allgeuer, P.: GPT Batch API, https://github.com/pallgeuer/gpt_batch_api, accessed 15 May 2026

[2] Allgeuer, P., Ahrens, K., Wermter, S.: Unconstrained open vocabulary image classification: Zero-shot transfer from text to image via CLIP inversion. In: Proceedings of the Winter Conference on Applications of Computer Vision (WACV) (2025)

[3] Bergmann, P., Batzner, K., Fauser, M., Sattlegger, D., Steger, C.: The MVTec anomaly detection dataset: A comprehensive real-world dataset for unsupervised anomaly detection. International Journal of Computer Vision 129(4) (2021)

[4] Chen, L., You, Z., Zhang, N., Xi, J., Le, X.: UTRAD: Anomaly detection and localization with U-Transformer. Neural Networks 147, 53–62 (2022)

[5] Cheng, T., Song, L., Ge, Y., et al.: YOLO-World: Real-time open-vocabulary object detection. In: Computer Vision and Pattern Recognition (CVPR) (2024)

[6] Fang, A., Jose, A.M., Jain, A., Schmidt, L., Toshev, A., Shankar, V.: Data filtering networks. In: Int. Conf. on Learning Representations (ICLR) (2024)

[7] Gajdošech, L., Ali, H., Habekost, J.G., Madaras, M., Kerzel, M., et al.: Shaken, not stirred: A novel dataset for visual understanding of glasses in human-robot bartending tasks. In: Int. Conf. on Intelligent Robots and Systems (IROS) (2025)

[8] Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. In: Int. Conf. on Learning Repr. (ICLR) (2022)

[9] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Computer Vision and Pattern Recog. (CVPR) (2022)

[10] Lee, Y., Kang, P.: AnoViT: Unsupervised anomaly detection and localization with vision transformer-based encoder-decoder. IEEE Access 10 (2022)

[11] Liu, S., Zeng, Z., Ren, T., et al.: Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In: Computer Vision (ECCV) (2024)

[12] Minderer, M., Gritsenko, A., Stone, A., Neumann, M., et al.: Simple open-vocabulary object detection. In: European Conference on Computer Vision (ECCV) (2022)

[13] Minderer, M., Gritsenko, A.A., Houlsby, N.: Scaling open-vocabulary object detection. In: Conference on Neural Information Processing Systems (NeurIPS) (2023)

[14] Radford, A., Kim, J.W., Hallacy, C., et al.: Learning transferable visual models from natural language supervision. In: Int. Conf. on Machine Learning (2021)

[15] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., et al.: SAM 2: Segment anything in images and videos. In: Int. Conf. on Learning Representations (ICLR) (2025)

[16] Schwartz, E., Arbelle, A., Karlinsky, L., et al.: MAEDAY: MAE for few- and zero-shot AnomalY-Detection. Comp. Vision and Image Understanding 241 (2024)

[17] Tao, X., Adak, C., et al.: ViTALnet: Anomaly on industrial textured surfaces with hybrid transformer. Trans. on Instrumentation and Measurement 72 (2023)

[18] Wang, A., Liu, L., Chen, H., Lin, Z., Han, J., Ding, G.: YOLOE: Real-time seeing anything. In: International Conference on Computer Vision (ICCV) (2025)

[19] Yao, L., Pi, R., et al.: DetCLIPv3: Towards versatile generative open-vocabulary object detection. In: Comp. Vis. and Pattern Recognition (CVPR) (2024)

[20] Zareian, A., Rosa, K.D., Hu, D.H., Chang, S.F.: Open-vocabulary object detection using captions. In: Computer Vision and Pattern Recognition (CVPR) (2021)

[21] Zavrtanik, V., et al.: DRAEM: A discriminatively trained reconstruction embedding for surface anomaly detection. In: Int. Conf. on Comp. Vis. (ICCV) (2021)

[22] Zhang, Y., Huang, X., Ma, J., et al.: Recognize Anything: A strong image tagging model. In: Computer Vision and Pattern Recognition (CVPR) Workshops (2024)

[23] Zhong, Y., Yang, J., Zhang, P., et al.: RegionCLIP: Region-based language-image pretraining. In: Computer Vision and Pattern Recognition (CVPR) (2022)
