pith. machine review for the scientific record.

arxiv: 2602.22683 · v2 · submitted 2026-02-26 · 💻 cs.CV · cs.AI

Recognition: no theorem link

SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:16 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords smart glasses · vision question answering · VQA benchmark · vision language models · multimodal agent · retrieval augmented generation · egocentric vision

The pith

A specialized agent for smart glasses vision questions outperforms GPT-4o by 2.19 percent through object detection and targeted web search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates SUPERGLASSES, a benchmark of 2,422 real egocentric image-question pairs gathered directly from smart glasses devices across 14 domains and 8 categories, complete with search trajectories and reasoning steps. Existing vision-language models show clear shortfalls on these queries because they often fail to first locate the relevant object before seeking outside knowledge. The authors introduce SUPERLENS, which adds automatic object detection, breaks queries into parts, and performs multimodal web searches to generate answers. This pipeline reaches higher accuracy than GPT-4o and other leading models. The results indicate that general-purpose VLMs are insufficient for wearable smart-glasses tasks and that purpose-built components deliver measurable gains.

Core claim

SUPERGLASSES is the first comprehensive VQA benchmark built on real-world data entirely collected by smart glasses devices, comprising 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. SUPERLENS integrates automatic object detection, query decoupling, and multimodal web search for retrieval-augmented answer generation and achieves state-of-the-art performance, outperforming GPT-4o by 2.19 percent on the benchmark.

What carries the argument

SUPERLENS, the multimodal smart glasses agent that performs automatic object detection, followed by query decoupling and multimodal web search, to support retrieval-augmented answer generation.
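
A minimal sketch of that flow, under stated assumptions: detect, vlm, and search are hypothetical stand-ins rather than the paper's actual module interfaces, and the refusal trigger echoes the direct-answer prompt shown in Figure 12.

```python
# Detection-first, retrieval-augmented answering in the spirit of
# SUPERLENS. `detect`, `vlm`, and `search` are hypothetical stand-ins,
# not the paper's actual module interfaces; the "I have no knowledge
# about" trigger echoes the direct-answer prompt shown in Figure 12.
from typing import Callable

def answer_query(image, query: str,
                 detect: Callable,  # image -> name of the object of interest
                 vlm: Callable,     # (image, query, context) -> answer text
                 search: Callable,  # (query, image) -> list of evidence strings
                 top_k: int = 5) -> str:
    # 1. Ground the query: identify the object before any external lookup.
    object_name = detect(image)
    grounded = f"{query} (object of interest: {object_name})"

    # 2. Demand-adaptive control: try a direct answer first and only
    #    retrieve when the VLM flags missing knowledge.
    draft = vlm(image, grounded, context=None)
    if "I have no knowledge about" not in draft:
        return draft

    # 3. Multimodal web search, then answer again over retrieved context.
    evidence = search(grounded, image)[:top_k]
    return vlm(image, grounded, context="\n".join(evidence))
```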

If this is right

  • Existing VLMs leave significant performance gaps on realistic smart glasses queries that require precise object identification before external retrieval.
  • Automatic object detection must precede knowledge lookup to handle the core challenge of smart glasses VQA.
  • Query decoupling combined with multimodal web search improves answer quality over direct generation (see the sketch after this list).
  • Traditional multimodal datasets fail to reflect the sequential identification-then-retrieval demands of wearable devices.
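
The third point is concrete enough to sketch, using the paper's Sushiro case (Figures 14 and 15) as the pattern; the split rule and tool routing below are illustrative heuristics, not the paper's learned behavior.

```python
# Query decoupling with per-hop search-tool routing, patterned on the
# Sushiro case in Figures 14-15. The split rule and tool choice are
# illustrative heuristics, not the paper's learned behavior.

def decouple_and_route(query: str, object_name: str | None) -> list[dict]:
    hops = []
    if object_name is None:
        # Hop 1: the entity in view is unnamed; reverse image search can
        # turn pixels into an exact name ("this restaurant" -> "Sushiro").
        hops.append({"sub_query": "identify the exact object in the image",
                     "tool": "image_search"})
        object_name = "<resolved by hop 1>"
    # Hop 2: factual follow-ups about a named entity (founders, history)
    # are served by text search rather than image matching.
    hops.append({"sub_query": f"{query} (entity: {object_name})",
                 "tool": "text_search"})
    return hops
```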

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Developers building smart glasses applications may need custom agent pipelines instead of relying solely on general VLMs.
  • Extending the benchmark to continuous video or live interaction streams could expose further limitations in current models.
  • Comparable detection-plus-retrieval designs might improve performance in other egocentric or wearable vision settings.

Load-bearing premise

The 2,422 collected pairs and the eight query categories sufficiently capture real smart glasses usage challenges, and the observed performance gain comes from the added detection and search components rather than evaluation choices or model scale.

What would settle it

A general-purpose VLM that reaches or exceeds SUPERLENS accuracy on the SUPERGLASSES benchmark without using the object-detection and query-decoupling pipeline would show the specialized steps are not required.
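
That settling experiment is cheap to script. A sketch, assuming the released dataset exposes image, question, and answer columns (the exact schema is not documented on this page) and taking the VLM and a Figure 13-style judge as injected callables:

```python
# Score a general-purpose VLM on SUPERGLASSES with no detection or
# decoupling pipeline. The dataset ID comes from the paper's release;
# the split name and the "image"/"question"/"answer" columns are
# assumptions about its schema, and `vlm_answer`/`judge` are injected.
from datasets import load_dataset

def plain_vlm_accuracy(vlm_answer, judge, split="test") -> float:
    ds = load_dataset("xandery/SuperGlasses", split=split)
    correct = 0
    for ex in ds:
        pred = vlm_answer(ex["image"], ex["question"])
        # `judge` mirrors the paper's evaluation prompt (Figure 13):
        # correct iff the prediction captures all key ground-truth info.
        correct += bool(judge(ex["question"], ex["answer"], pred))
    return correct / len(ds)
```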

Figures

Figures reproduced from arXiv: 2602.22683 by Haohao Qu, Kanglong Liu, Qing Li, Shanru Lin, Wenqi Fan, Xu Yuan, Zhuohang Jiang.

Figure 1. SUPERGLASSES contains 14 image domains and 8 query categories, with 2,422 question–answer pairs. Each example includes an egocentric image captured by smart glasses, a manually annotated question–answer pair, and associated multimodal search logs.
Figure 2. The A⁴ data collection pipeline, consisting of four stages: Acquirement, Annotation, Assessment, and Analysis.
Figure 4. Overview of the proposed SUPERLENS, composed of a Demand-Adaptive Answerer and a Dual-Lens Knowledge Retriever. Modules marked in blue are powered by VLMs; modules in green are external tools.
Figure 6. Performance across categories for DeepSeek-3B/27B, LLaMA-11B/90B, Qwen2.5-3B/7B/72B, and Gemini.
Figure 8. Distribution of error types.
Figure 9. Question/answer length distribution and question prefixes of SUPERGLASSES.
Figure 10. The common topics in questions and answers of SUPERGLASSES.
Figure 11. The prompt used for domain recognition.
Figure 12. The prompt used for direct answer generation.
Figure 13. The prompt used for answer evaluation.
Figure 14. Success case of using image search.
Figure 15. Success case of using text search.
Figure 16. Failure case caused by incorrect search tool selection.
Figure 17. Failure case caused by incorrect search query generation.
Figure 18. Case study in SUPERGLASSES: "Campbell's Soup Can".
Figure 19. Hop 1 and Hop 2 of "Campbell's Soup Can".
Figure 20. Hop 3 and Hop 4 of "Campbell's Soup Can".
read the original abstract

The rapid advancement of AI-powered smart glasses, one of the hottest wearable devices, has unlocked new frontiers for multimodal interaction, with Visual Question Answering (VQA) over external knowledge sources emerging as a core application. Existing Vision Language Models (VLMs) adapted to smart glasses are typically trained and evaluated on traditional multimodal datasets; however, these datasets lack the variety and realism needed to reflect smart glasses usage scenarios and diverge from their specific challenges, where accurately identifying the object of interest must precede any external knowledge retrieval. To bridge this gap, we introduce SUPER- GLASSES, the first comprehensive VQA benchmark built on real-world data entirely collected by smart glasses devices. SUPERGLASSES comprises 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. We evaluate 26 representative VLMs on this benchmark, revealing significant performance gaps. To address the limitations of existing models, we further propose the SUPERLENS, a multimodal smart glasses agent that enables retrieval-augmented answer generation by integrating automatic object detection, query decoupling, and multimodal web search. SUPERLENS achieves state-of-the-art performance, outperforming GPT-4o by 2.19%, underscoring the need for task-specific solutions in smart glasses VQA. Our dataset is publicly available at https://huggingface.co/datasets/xandery/SuperGlasses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SUPERGLASSES, a new VQA benchmark of 2,422 real-world egocentric image-question pairs collected from smart glasses devices across 14 domains and 8 query categories, complete with search trajectories and annotations. It evaluates 26 VLMs on the benchmark, identifies performance gaps, and proposes SUPERLENS, a multimodal agent that integrates automatic object detection, query decoupling, and multimodal web search for retrieval-augmented generation, claiming state-of-the-art results with a 2.19% improvement over GPT-4o.

Significance. If the 2.19% gain is robust and specifically attributable to the three proposed components rather than prompt or protocol differences, the work would supply a useful public benchmark for smart-glasses VQA and illustrate the value of task-specific agent designs. The public dataset release is a clear strength for reproducibility and follow-on research.

major comments (2)
  1. [SUPERLENS agent description and results] The central claim attributes the 2.19% margin over GPT-4o specifically to automatic object detection + query decoupling + multimodal web search, yet the manuscript contains no ablation tables that disable one component at a time while holding the others fixed. Without these isolations it is impossible to rule out that the delta arises from prompt engineering, base-model choice, or evaluation-protocol details rather than the advertised integrations. A minimal sketch of such an ablation harness follows this report.
  2. [Abstract and experimental results] The abstract and results report a 2.19% absolute improvement without statistical significance testing, confidence intervals, or full evaluation-protocol details (exact prompts, temperature, decoding strategy). This information is required to establish that the margin is reliable rather than an artifact of the chosen protocol.
minor comments (2)
  1. [Abstract] Abstract contains inconsistent hyphenation and spacing in 'SUPER- GLASSES'.
  2. [Dataset construction] Clarify the rationale for selecting the 8 query categories and whether they adequately sample the full range of real smart-glasses usage scenarios.
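
The harness major comment 1 asks for is mechanical to set up. A minimal sketch, assuming a hypothetical run_benchmark callable that can toggle SUPERLENS's three modules while everything else stays fixed:

```python
# One-component-at-a-time ablation over the three advertised modules.
# `run_benchmark` is a hypothetical runner: given the set of disabled
# components, it re-scores the benchmark with base model, prompts, and
# protocol held fixed, and returns accuracy.
COMPONENTS = ("object_detection", "query_decoupling", "multimodal_web_search")

def ablation_grid(run_benchmark) -> dict[str, float]:
    results = {"full": run_benchmark(disabled=frozenset())}
    for comp in COMPONENTS:
        # Disable exactly one module per run.
        results[f"-{comp}"] = run_benchmark(disabled=frozenset({comp}))
    return results
```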

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify gaps in the current manuscript regarding component isolation and statistical rigor. We address each point below and will revise the manuscript to incorporate the requested analyses and details.

read point-by-point responses
  1. Referee: [SUPERLENS agent description and results] The central claim attributes the 2.19% margin over GPT-4o specifically to automatic object detection + query decoupling + multimodal web search, yet the manuscript contains no ablation tables that disable one component at a time while holding the others fixed. Without these isolations it is impossible to rule out that the delta arises from prompt engineering, base-model choice, or evaluation-protocol details rather than the advertised integrations.

    Authors: We agree that the manuscript lacks explicit ablation studies isolating each component of SUPERLENS. The current version describes the three integrations but does not quantify their individual contributions through controlled ablations. In the revised manuscript we will add ablation tables that systematically disable one component at a time (automatic object detection, query decoupling, and multimodal web search) while holding the base model, prompts, and evaluation protocol fixed, thereby demonstrating the incremental benefit attributable to each module. revision: yes

  2. Referee: [Abstract and experimental results] The abstract and results report a 2.19% absolute improvement without statistical significance testing, confidence intervals, or full evaluation-protocol details (exact prompts, temperature, decoding strategy). This information is required to establish that the margin is reliable rather than an artifact of the chosen protocol.

    Authors: We acknowledge the absence of statistical testing and protocol transparency in the current version. The revised manuscript will include bootstrap confidence intervals and paired significance tests for the 2.19% margin, along with a new appendix that reports the exact prompts, temperature settings, decoding strategy, and all other hyperparameters used for SUPERLENS and the 26 evaluated VLMs, including GPT-4o. These additions will allow readers to reproduce and assess the reliability of the reported gains. revision: yes
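
For reference, the paired bootstrap the authors commit to fits in a few lines: resample the per-example correctness pairs for both systems together and read the interval off the resampled accuracy deltas. A sketch, assuming aligned 0/1 score vectors:

```python
# Paired bootstrap for the SUPERLENS-vs-GPT-4o accuracy delta. `a` and
# `b` are assumed to be aligned 0/1 correctness vectors, one entry per
# benchmark example (2,422 here). A sketch, not the authors' code.
import numpy as np

def paired_bootstrap_ci(a, b, n_boot=10_000, alpha=0.05, seed=0):
    a, b = np.asarray(a, float), np.asarray(b, float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(a), size=(n_boot, len(a)))  # resampled example indices
    deltas = a[idx].mean(axis=1) - b[idx].mean(axis=1)    # accuracy delta per replicate
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    return a.mean() - b.mean(), (lo, hi)                  # point estimate, CI
```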

Circularity Check

0 steps flagged

No circularity: empirical benchmark results on new dataset

full rationale

The paper's core claims are direct accuracy measurements (e.g., SUPERLENS outperforming GPT-4o by 2.19% on the 2,422-pair SUPERGLASSES benchmark). No equations, fitted parameters, or derivations are presented that reduce the reported gains to inputs by construction. The benchmark collection, query categories, and SUPERLENS components (object detection, query decoupling, web search) are described as independent engineering choices; performance deltas are evaluated externally against 26 VLMs without self-referential definitions or load-bearing self-citations. This is a standard empirical benchmarking paper with no detectable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard VQA evaluation practices and existing VLM and search APIs; no new physical entities or ad-hoc fitted constants are introduced beyond typical ML hyperparameters.

axioms (1)
  • domain assumption: Existing VQA metrics and web-search APIs remain valid when applied to egocentric smart-glasses imagery.
    The evaluation and agent design presuppose that standard multimodal retrieval and detection tools transfer directly to the new data distribution.

pith-pipeline@v0.9.0 · 5571 in / 1338 out tokens · 36364 ms · 2026-05-15T19:16:07.073025+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 10 internal anchors

  1. [1] Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, and Amit Bahree. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.

  2. [2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  3. [3] Jina AI. jina-reranker-m0: Multilingual multimodal document reranker, 2025.

  4. [4] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report.

  5. [5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

  6. [6] Eun Chang, Zhuangqun Huang, Yiwei Liao, Sagar Ravi Bhavsar, Amogh Param, Tammy Stark, Adel Ahmadyan, Xiao Yang, Jiaqi Wang, Ahsan Abdullah, Giang Nguyen, Akil Iyer, David Patrick Hall, Elissa Li, Nicolas Scheffer, Ahmed Kirmani, Babak Damavandi, Rakesh Wanga, Anuj Kumar, Rohit Patel, Seungwhan Moon, and Xin Luna Dong. WearVQA: A visual question answering benchmark for wearables in egocentric authentic real-world scenarios.

  7. [7] Yingshan Chang, Mridu Narang, Hisami Suzuki, Guihong Cao, Jianfeng Gao, and Yonatan Bisk. WebQA: Multihop and multimodal QA. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16495–16504, 2022.

  8. [8] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45, 2024.

  9. [9] Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming-Wei Chang. Can pre-trained vision and language models answer visual information-seeking questions? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14948–14968, 2023.

  10. [10] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, 2024.

  11. [11] Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. YOLO-World: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16901–16911, 2024.

  12. [12] Yuheng Cheng, Ceyao Zhang, Zhengwen Zhang, Xiangrui Meng, Sirui Hong, Wenhao Li, Zihao Wang, Zekai Wang, Feng Yin, Junhua Zhao, et al. Exploring large language model based intelligent agents: Definitions, methods, and prospects. arXiv preprint arXiv:2401.03428, 2024.

  13. [13] Federico Cocchi, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Augmenting multimodal LLMs with self-reflective tokens for knowledge-based visual question answering. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9199–9209, 2025.

  14. [14] Oscar Danielsson, Magnus Holm, and Anna Syberfeldt. Augmented reality smart glasses in industrial assembly: Current status and future challenges. Journal of Industrial Information Integration, 20:100175, 2020.

  15. [15] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. In Proceedings of the 40th International Conference on Machine Learning, pages 8469–8488, 2023.

  16. [16] Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. A survey on RAG meeting LLMs: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6491–6501, 2024.

  17. [17] Mingyang Fu, Yuyang Peng, Dongping Chen, Zetong Zhou, Benlin Liu, Yao Wan, Zhou Zhao, Philip S. Yu, and Ranjay Krishna. Seeking and updating with live visual knowledge. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.

  18. [18] Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025.

  19. [19] Wenbo Hu, Jia-Chen Gu, Zi-Yi Dou, Mohsen Fayyaz, Pan Lu, Kai-Wei Chang, and Nanyun Peng. MRAG-Bench: Vision-centric evaluation for retrieval-augmented multimodal models. In The Thirteenth International Conference on Learning Representations, 2025.

  20. [20] Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Guanglu Song, Peng Gao, Yu Liu, Chunyuan Li, and Hongsheng Li. MMSearch: Unveiling the potential of large models as multimodal search engines. In The Thirteenth International Conference on Learning Representations, 2025.

  21. [21] Zhuohang Jiang, Pangjing Wu, Ziran Liang, Peter Q Chen, Xu Yuan, Ye Jia, Jiancheng Tu, Chen Li, Peter HF Ng, and Qing Li. HiBench: Benchmarking LLMs capability on hierarchical structure reasoning. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 5505–5515, 2025.

  22. [22] Zhuohang Jiang, Pangjing Wu, Xu Yuan, Wenqi Fan, and Qing Li. QA-Dragon: Query-aware dynamic RAG system for knowledge-intensive visual question answering. In 2025 KDD Cup Workshop for Multimodal Retrieval Augmented Generation, 2025.

  23. [23] Fucai Ke, Zhixi Cai, Simindokht Jahangard, Weiqing Wang, Pari Delir Haghighi, and Hamid Rezatofighi. HYDRA: A hyper agent for dynamic compositional visual reasoning. In European Conference on Computer Vision, pages 132–149. Springer, 2024.

  24. [24] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

  25. [25] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. Transactions on Machine Learning Research.

  26. [26] Yangning Li, Yinghui Li, Xinyu Wang, Yong Jiang, Zhen Zhang, Xinran Zheng, Hui Wang, Hai-Tao Zheng, Fei Huang, Jingren Zhou, et al. Benchmarking multimodal retrieval augmented generation with dynamic VQA dataset and self-adaptive planning agent. In The Thirteenth International Conference on Learning Representations, 2025.

  27. [27] Weizhe Lin and Bill Byrne. Retrieval augmented visual question answering with outside knowledge. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11238–11254, 2022.

  28. [28] Weizhe Lin, Jinghong Chen, Jingbiao Mei, Alexandru Coca, and Bill Byrne. Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering. Advances in Neural Information Processing Systems, 36:22820–22840, 2023.

  29. [29] Zhaojiang Lin, Yong Xu, Kai Sun, Jing Zheng, Yin Huang, Surya Teja Appini, Krish Narang, Renjie Tao, Ishan Kapil Jain, Siddhant Arora, et al. WearVox: An egocentric multichannel voice assistant benchmark for wearables. arXiv preprint arXiv:2601.02391, 2025.

  30. [30] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.

  31. [31] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024.

  32. [32] Zhenghao Liu, Xingsheng Zhu, Tianshuo Zhou, Xinyi Zhang, Xiaoyuan Yi, Yukun Yan, Ge Yu, and Maosong Sun. Benchmarking retrieval-augmented generation in multi-modal contexts. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 4817–4826, 2025.

  33. [33] Damiano Marsili, Rohun Agrawal, Yisong Yue, and Georgia Gkioxari. Visual agentic AI for spatial reasoning with a dynamic API. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19446–19455, 2025.

  34. [34] Meta-AI. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models, 2024.

  35. [35] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

  36. [36] Jiahua Rao, Zifei Shan, Longpo Liu, Yao Zhou, and Yuedong Yang. Retrieval-based knowledge augmented vision language pre-training. In Proceedings of the 31st ACM International Conference on Multimedia, pages 5399–5409, 2023.

  37. [37] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-OKVQA: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022.

  38. [38] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. Advances in Neural Information Processing Systems, 36:38154–38180, 2023.

  39. [39] Michael Spitzer, Ibrahim Nanic, and Martin Ebner. Distance learning and assistance using smart glasses. Education Sciences, 8(1):21, 2018.

  40. [40] Dídac Surís, Sachit Menon, and Carl Vondrick. ViperGPT: Visual inference via Python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888–11898, 2023.

  41. [41] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

  42. [42] Boyuan Wang, Ying Zheng, Xihao Han, Liang Kong, Gexin Xiao, Zunxiong Xiao, and Shanji Chen. A systematic literature review on integrating AI-powered smart glasses into digital health management for proactive healthcare solutions. npj Digital Medicine, 8(1):410, 2025.

  43. [43] Chenyu Wang, Weixin Luo, Sixun Dong, Xiaohua Xuan, Zhengxin Li, Lin Ma, and Shenghua Gao. MLLM-Tool: A multimodal large language model for tool agent learning. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6678–6687. IEEE, 2025.

  44. [44] Feng Wang, Zesheng Shi, Bo Wang, Nan Wang, and Han Xiao. ReaderLM-v2: Small language model for HTML to Markdown and JSON. arXiv preprint arXiv:2503.01151, 2025.

  45. [45] Jialiang Wang, Daniel Scharstein, Akash Bapat, Kevin Blackburn-Matzen, Matthew Yu, Jonathan Lehman, Suhib Alsisan, Yanghan Wang, Sam Tsai, Jan-Michael Frahm, et al. A practical stereo depth system for smart glasses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21498–21507, 2023.

  46. [46] Jiaqi Wang, Xiao Yang, Kai Sun, Parth Suresh, Sanat Sharma, Adam Czyzewski, Derek Andersen, Surya Appini, Arkav Banerjee, Sajal Choudhary, et al. CRAG-MM: Multi-modal multi-turn comprehensive RAG benchmark. arXiv preprint arXiv:2510.26160, 2025.

  47. [47] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024.

  48. [48] Ziyue Wang, Chi Chen, Peng Li, and Yang Liu. Filling the image information gap for VQA: Prompting large language models to proactively ask questions. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.

  49. [49] Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual ChatGPT: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023.

  50. [50] Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, and Ziwei Liu. MMSearch-R1: Incentivizing LMMs to search. arXiv preprint arXiv:2506.20670, 2025.

  51. [51] Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024.

  52. [52] LLM-Core-Team Xiaomi. MiMo-VL technical report, 2025.

  53. [53] Junlin Xie, Zhihong Chen, Ruifei Zhang, Xiang Wan, and Guanbin Li. Large multimodal agents: A survey. arXiv preprint arXiv:2402.15116, 2024.

  54. [54] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2025.

  55. [55] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 11(12):nwae403, 2024.

  56. [56] Xu Yuan, Liangbo Ning, Wenqi Fan, and Qing Li. mKG-RAG: Multimodal knowledge graph-enhanced RAG for visual question answering. arXiv preprint arXiv:2508.05318, 2025.

  57. [57] Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. AppAgent: Multimodal agents as smartphone users. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2025.

  58. [58] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.

  59. [59] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.
