pith. machine review for the scientific record.

arxiv: 2602.22683 · v2 · submitted 2026-02-26 · 💻 cs.CV · cs.AI

Recognition: no theorem link

SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:16 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords smart glasses · vision question answering · VQA benchmark · vision language models · multimodal agent · retrieval augmented generation · egocentric vision

The pith

A specialized agent for smart glasses vision questions outperforms GPT-4o by 2.19 percent through object detection and targeted web search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates SUPERGLASSES, a benchmark of 2,422 real egocentric image-question pairs gathered directly from smart glasses devices across 14 domains and 8 categories, complete with search trajectories and reasoning steps. Existing vision-language models show clear shortfalls on these queries because they often fail to first locate the relevant object before seeking outside knowledge. The authors introduce SUPERLENS, which adds automatic object detection, breaks queries into parts, and performs multimodal web searches to generate answers. This pipeline reaches higher accuracy than GPT-4o and other leading models. The results indicate that general-purpose VLMs are insufficient for wearable smart-glasses tasks and that purpose-built components deliver measurable gains.

Core claim

SUPERGLASSES is the first comprehensive VQA benchmark built on real-world data entirely collected by smart glasses devices, comprising 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. SUPERLENS integrates automatic object detection, query decoupling, and multimodal web search for retrieval-augmented answer generation and achieves state-of-the-art performance, outperforming GPT-4o by 2.19 percent on the benchmark.

What carries the argument

SUPERLENS, the multimodal smart glasses agent that performs automatic object detection, followed by query decoupling and multimodal web search, to support retrieval-augmented answer generation.
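
A minimal sketch of that flow, under stated assumptions: detect, vlm, and search are hypothetical stand-ins rather than the paper's actual module interfaces, and the refusal trigger echoes the direct-answer prompt shown in Figure 12.

```python
# Detection-first, retrieval-augmented answering in the spirit of
# SUPERLENS. `detect`, `vlm`, and `search` are hypothetical stand-ins,
# not the paper's actual module interfaces; the "I have no knowledge
# about" trigger echoes the direct-answer prompt shown in Figure 12.
from typing import Callable

def answer_query(image, query: str,
                 detect: Callable,  # image -> name of the object of interest
                 vlm: Callable,     # (image, query, context) -> answer text
                 search: Callable,  # (query, image) -> list of evidence strings
                 top_k: int = 5) -> str:
    # 1. Ground the query: identify the object before any external lookup.
    object_name = detect(image)
    grounded = f"{query} (object of interest: {object_name})"

    # 2. Demand-adaptive control: try a direct answer first and only
    #    retrieve when the VLM flags missing knowledge.
    draft = vlm(image, grounded, context=None)
    if "I have no knowledge about" not in draft:
        return draft

    # 3. Multimodal web search, then answer again over retrieved context.
    evidence = search(grounded, image)[:top_k]
    return vlm(image, grounded, context="\n".join(evidence))
```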

If this is right

  • Existing VLMs leave significant performance gaps on realistic smart glasses queries that require precise object identification before external retrieval.
  • Automatic object detection must precede knowledge lookup to handle the core challenge of smart glasses VQA.
  • Query decoupling combined with multimodal web search improves answer quality over direct generation (see the sketch after this list).
  • Traditional multimodal datasets fail to reflect the sequential identification-then-retrieval demands of wearable devices.
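
The third point is concrete enough to sketch, using the paper's Sushiro case (Figures 14 and 15) as the pattern; the split rule and tool routing below are illustrative heuristics, not the paper's learned behavior.

```python
# Query decoupling with per-hop search-tool routing, patterned on the
# Sushiro case in Figures 14-15. The split rule and tool choice are
# illustrative heuristics, not the paper's learned behavior.

def decouple_and_route(query: str, object_name: str | None) -> list[dict]:
    hops = []
    if object_name is None:
        # Hop 1: the entity in view is unnamed; reverse image search can
        # turn pixels into an exact name ("this restaurant" -> "Sushiro").
        hops.append({"sub_query": "identify the exact object in the image",
                     "tool": "image_search"})
        object_name = "<resolved by hop 1>"
    # Hop 2: factual follow-ups about a named entity (founders, history)
    # are served by text search rather than image matching.
    hops.append({"sub_query": f"{query} (entity: {object_name})",
                 "tool": "text_search"})
    return hops
```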

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Developers building smart glasses applications may need custom agent pipelines instead of relying solely on general VLMs.
  • Extending the benchmark to continuous video or live interaction streams could expose further limitations in current models.
  • Comparable detection-plus-retrieval designs might improve performance in other egocentric or wearable vision settings.

Load-bearing premise

The 2,422 collected pairs and the eight query categories sufficiently capture real smart glasses usage challenges, and the observed performance gain comes from the added detection and search components rather than evaluation choices or model scale.

What would settle it

A general-purpose VLM that reaches or exceeds SUPERLENS accuracy on the SUPERGLASSES benchmark without using the object-detection and query-decoupling pipeline would show the specialized steps are not required.
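
That settling experiment is cheap to script. A sketch, assuming the released dataset exposes image, question, and answer columns (the exact schema is not documented on this page) and taking the VLM and a Figure 13-style judge as injected callables:

```python
# Score a general-purpose VLM on SUPERGLASSES with no detection or
# decoupling pipeline. The dataset ID comes from the paper's release;
# the split name and the "image"/"question"/"answer" columns are
# assumptions about its schema, and `vlm_answer`/`judge` are injected.
from datasets import load_dataset

def plain_vlm_accuracy(vlm_answer, judge, split="test") -> float:
    ds = load_dataset("xandery/SuperGlasses", split=split)
    correct = 0
    for ex in ds:
        pred = vlm_answer(ex["image"], ex["question"])
        # `judge` mirrors the paper's evaluation prompt (Figure 13):
        # correct iff the prediction captures all key ground-truth info.
        correct += bool(judge(ex["question"], ex["answer"], pred))
    return correct / len(ds)
```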

Figures

Figures reproduced from arXiv: 2602.22683 by Haohao Qu, Kanglong Liu, Qing Li, Shanru Lin, Wenqi Fan, Xu Yuan, Zhuohang Jiang.

Figure 1. SUPERGLASSES contains 14 image domains and 8 query categories, with 2,422 question–answer pairs. Each example includes an egocentric image captured by smart glasses, a manually annotated question–answer pair, and associated multimodal search logs.
Figure 2. The A⁴ data collection pipeline, consisting of four stages: Acquirement, Annotation, Assessment, and Analysis.
Figure 4. Overview of the proposed SUPERLENS, composed of a Demand-Adaptive Answerer and a Dual-Lens Knowledge Retriever. Modules marked in blue are powered by VLMs; modules in green are external tools.
Figure 6. Performance across categories for DeepSeek-3B/27B, LLaMA-11B/90B, Qwen2.5-3B/7B/72B, and Gemini.
Figure 8. Distribution of error types.
Figure 9. Question/answer length distribution and question prefixes of SUPERGLASSES.
Figure 10. The common topics in questions and answers of SUPERGLASSES.
Figure 11. The prompt used for domain recognition.
Figure 12. The prompt used for direct answer generation.
Figure 13. The prompt used for answer evaluation.
Figure 14. Success case of using image search.
Figure 15. Success case of using text search.
Figure 16. Failure case caused by incorrect search tool selection.
Figure 17. Failure case caused by incorrect search query generation.
Figure 18. Case study in SUPERGLASSES: "Campbell's Soup Can".
Figure 19. Hop 1 and Hop 2 of "Campbell's Soup Can".
Figure 20. Hop 3 and Hop 4 of "Campbell's Soup Can".
read the original abstract

The rapid advancement of AI-powered smart glasses, one of the hottest wearable devices, has unlocked new frontiers for multimodal interaction, with Visual Question Answering (VQA) over external knowledge sources emerging as a core application. Existing Vision Language Models (VLMs) adapted to smart glasses are typically trained and evaluated on traditional multimodal datasets; however, these datasets lack the variety and realism needed to reflect smart glasses usage scenarios and diverge from their specific challenges, where accurately identifying the object of interest must precede any external knowledge retrieval. To bridge this gap, we introduce SUPER- GLASSES, the first comprehensive VQA benchmark built on real-world data entirely collected by smart glasses devices. SUPERGLASSES comprises 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. We evaluate 26 representative VLMs on this benchmark, revealing significant performance gaps. To address the limitations of existing models, we further propose the SUPERLENS, a multimodal smart glasses agent that enables retrieval-augmented answer generation by integrating automatic object detection, query decoupling, and multimodal web search. SUPERLENS achieves state-of-the-art performance, outperforming GPT-4o by 2.19%, underscoring the need for task-specific solutions in smart glasses VQA. Our dataset is publicly available at https://huggingface.co/datasets/xandery/SuperGlasses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SUPERGLASSES, a new VQA benchmark of 2,422 real-world egocentric image-question pairs collected from smart glasses devices across 14 domains and 8 query categories, complete with search trajectories and annotations. It evaluates 26 VLMs on the benchmark, identifies performance gaps, and proposes SUPERLENS, a multimodal agent that integrates automatic object detection, query decoupling, and multimodal web search for retrieval-augmented generation, claiming state-of-the-art results with a 2.19% improvement over GPT-4o.

Significance. If the 2.19% gain is robust and specifically attributable to the three proposed components rather than prompt or protocol differences, the work would supply a useful public benchmark for smart-glasses VQA and illustrate the value of task-specific agent designs. The public dataset release is a clear strength for reproducibility and follow-on research.

major comments (2)
  1. [SUPERLENS agent description and results] The central claim attributes the 2.19% margin over GPT-4o specifically to automatic object detection + query decoupling + multimodal web search, yet the manuscript contains no ablation tables that disable one component at a time while holding the others fixed. Without these isolations it is impossible to rule out that the delta arises from prompt engineering, base-model choice, or evaluation-protocol details rather than the advertised integrations. A minimal sketch of such an ablation harness follows this report.
  2. [Abstract and experimental results] The abstract and results report a 2.19% absolute improvement without statistical significance testing, confidence intervals, or full evaluation-protocol details (exact prompts, temperature, decoding strategy). This information is required to establish that the margin is reliable rather than an artifact of the chosen protocol.
minor comments (2)
  1. [Abstract] Abstract contains inconsistent hyphenation and spacing in 'SUPER- GLASSES'.
  2. [Dataset construction] Clarify the rationale for selecting the 8 query categories and whether they adequately sample the full range of real smart-glasses usage scenarios.
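
The harness major comment 1 asks for is mechanical to set up. A minimal sketch, assuming a hypothetical run_benchmark callable that can toggle SUPERLENS's three modules while everything else stays fixed:

```python
# One-component-at-a-time ablation over the three advertised modules.
# `run_benchmark` is a hypothetical runner: given the set of disabled
# components, it re-scores the benchmark with base model, prompts, and
# protocol held fixed, and returns accuracy.
COMPONENTS = ("object_detection", "query_decoupling", "multimodal_web_search")

def ablation_grid(run_benchmark) -> dict[str, float]:
    results = {"full": run_benchmark(disabled=frozenset())}
    for comp in COMPONENTS:
        # Disable exactly one module per run.
        results[f"-{comp}"] = run_benchmark(disabled=frozenset({comp}))
    return results
```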

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify gaps in the current manuscript regarding component isolation and statistical rigor. We address each point below and will revise the manuscript to incorporate the requested analyses and details.

read point-by-point responses
  1. Referee: [SUPERLENS agent description and results] The central claim attributes the 2.19% margin over GPT-4o specifically to automatic object detection + query decoupling + multimodal web search, yet the manuscript contains no ablation tables that disable one component at a time while holding the others fixed. Without these isolations it is impossible to rule out that the delta arises from prompt engineering, base-model choice, or evaluation-protocol details rather than the advertised integrations.

    Authors: We agree that the manuscript lacks explicit ablation studies isolating each component of SUPERLENS. The current version describes the three integrations but does not quantify their individual contributions through controlled ablations. In the revised manuscript we will add ablation tables that systematically disable one component at a time (automatic object detection, query decoupling, and multimodal web search) while holding the base model, prompts, and evaluation protocol fixed, thereby demonstrating the incremental benefit attributable to each module. revision: yes

  2. Referee: [Abstract and experimental results] The abstract and results report a 2.19% absolute improvement without statistical significance testing, confidence intervals, or full evaluation-protocol details (exact prompts, temperature, decoding strategy). This information is required to establish that the margin is reliable rather than an artifact of the chosen protocol.

    Authors: We acknowledge the absence of statistical testing and protocol transparency in the current version. The revised manuscript will include bootstrap confidence intervals and paired significance tests for the 2.19% margin, along with a new appendix that reports the exact prompts, temperature settings, decoding strategy, and all other hyperparameters used for SUPERLENS and the 26 evaluated VLMs, including GPT-4o. These additions will allow readers to reproduce and assess the reliability of the reported gains. revision: yes
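
For reference, the paired bootstrap the authors commit to fits in a few lines: resample the per-example correctness pairs for both systems together and read the interval off the resampled accuracy deltas. A sketch, assuming aligned 0/1 score vectors:

```python
# Paired bootstrap for the SUPERLENS-vs-GPT-4o accuracy delta. `a` and
# `b` are assumed to be aligned 0/1 correctness vectors, one entry per
# benchmark example (2,422 here). A sketch, not the authors' code.
import numpy as np

def paired_bootstrap_ci(a, b, n_boot=10_000, alpha=0.05, seed=0):
    a, b = np.asarray(a, float), np.asarray(b, float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(a), size=(n_boot, len(a)))  # resampled example indices
    deltas = a[idx].mean(axis=1) - b[idx].mean(axis=1)    # accuracy delta per replicate
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    return a.mean() - b.mean(), (lo, hi)                  # point estimate, CI
```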

Circularity Check

0 steps flagged

No circularity: empirical benchmark results on new dataset

full rationale

The paper's core claims are direct accuracy measurements (e.g., SUPERLENS outperforming GPT-4o by 2.19% on the 2,422-pair SUPERGLASSES benchmark). No equations, fitted parameters, or derivations are presented that reduce the reported gains to inputs by construction. The benchmark collection, query categories, and SUPERLENS components (object detection, query decoupling, web search) are described as independent engineering choices; performance deltas are evaluated externally against 26 VLMs without self-referential definitions or load-bearing self-citations. This is a standard empirical benchmarking paper with no detectable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard VQA evaluation practices and existing VLM and search APIs; no new physical entities or ad-hoc fitted constants are introduced beyond typical ML hyperparameters.

axioms (1)
  • domain assumption: Existing VQA metrics and web-search APIs remain valid when applied to egocentric smart-glasses imagery.
    The evaluation and agent design presuppose that standard multimodal retrieval and detection tools transfer directly to the new data distribution.

pith-pipeline@v0.9.0 · 5571 in / 1338 out tokens · 36364 ms · 2026-05-15T19:16:07.073025+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 10 internal anchors

  1. [1] Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, and Amit Bahree. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.

  2. [2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  3. [3] Jina AI. jina-reranker-m0: Multilingual multimodal document reranker, 2025.

  4. [4] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report.

  5. [5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

  6. [6] Eun Chang, Zhuangqun Huang, Yiwei Liao, Sagar Ravi Bhavsar, Amogh Param, Tammy Stark, Adel Ahmadyan, Xiao Yang, Jiaqi Wang, Ahsan Abdullah, Giang Nguyen, Akil Iyer, David Patrick Hall, Elissa Li, Nicolas Scheffer, Ahmed Kirmani, Babak Damavandi, Rakesh Wanga, Anuj Kumar, Rohit Patel, Seungwhan Moon, and Xin Luna Dong. WearVQA: A visual question answering benchmark for wearables in egocentric authentic real-world scenarios.

  7. [7] Yingshan Chang, Mridu Narang, Hisami Suzuki, Guihong Cao, Jianfeng Gao, and Yonatan Bisk. WebQA: Multihop and multimodal QA. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16495–16504, 2022.

  8. [8] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45, 2024.

  9. [9] Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming-Wei Chang. Can pre-trained vision and language models answer visual information-seeking questions? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14948–14968, 2023.

  10. [10] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, 2024.

  11. [11] Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. YOLO-World: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16901–16911, 2024.

  12. [12] Yuheng Cheng, Ceyao Zhang, Zhengwen Zhang, Xiangrui Meng, Sirui Hong, Wenhao Li, Zihao Wang, Zekai Wang, Feng Yin, Junhua Zhao, et al. Exploring large language model based intelligent agents: Definitions, methods, and prospects. arXiv preprint arXiv:2401.03428, 2024.

  13. [13] Federico Cocchi, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Augmenting multimodal LLMs with self-reflective tokens for knowledge-based visual question answering. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9199–9209, 2025.

  14. [14] Oscar Danielsson, Magnus Holm, and Anna Syberfeldt. Augmented reality smart glasses in industrial assembly: Current status and future challenges. Journal of Industrial Information Integration, 20:100175, 2020.

  15. [15] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. In Proceedings of the 40th International Conference on Machine Learning, pages 8469–8488, 2023.

  16. [16] Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. A survey on RAG meeting LLMs: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6491–6501, 2024.

  17. [17] Mingyang Fu, Yuyang Peng, Dongping Chen, Zetong Zhou, Benlin Liu, Yao Wan, Zhou Zhao, Philip S. Yu, and Ranjay Krishna. Seeking and updating with live visual knowledge. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.

  18. [18] Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025.

  19. [19] Wenbo Hu, Jia-Chen Gu, Zi-Yi Dou, Mohsen Fayyaz, Pan Lu, Kai-Wei Chang, and Nanyun Peng. MRAG-Bench: Vision-centric evaluation for retrieval-augmented multimodal models. In The Thirteenth International Conference on Learning Representations, 2025.

  20. [20] Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Guanglu Song, Peng Gao, Yu Liu, Chunyuan Li, and Hongsheng Li. MMSearch: Unveiling the potential of large models as multimodal search engines. In The Thirteenth International Conference on Learning Representations, 2025.

  21. [21] Zhuohang Jiang, Pangjing Wu, Ziran Liang, Peter Q Chen, Xu Yuan, Ye Jia, Jiancheng Tu, Chen Li, Peter HF Ng, and Qing Li. HiBench: Benchmarking LLMs capability on hierarchical structure reasoning. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 5505–5515, 2025.

  22. [22] Zhuohang Jiang, Pangjing Wu, Xu Yuan, Wenqi Fan, and Qing Li. QA-Dragon: Query-aware dynamic RAG system for knowledge-intensive visual question answering. In 2025 KDD Cup Workshop for Multimodal Retrieval Augmented Generation, 2025.

  23. [23] Fucai Ke, Zhixi Cai, Simindokht Jahangard, Weiqing Wang, Pari Delir Haghighi, and Hamid Rezatofighi. HYDRA: A hyper agent for dynamic compositional visual reasoning. In European Conference on Computer Vision, pages 132–149. Springer, 2024.

  24. [24] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

  25. [25] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. Transactions on Machine Learning Research.

  26. [26] Yangning Li, Yinghui Li, Xinyu Wang, Yong Jiang, Zhen Zhang, Xinran Zheng, Hui Wang, Hai-Tao Zheng, Fei Huang, Jingren Zhou, et al. Benchmarking multimodal retrieval augmented generation with dynamic VQA dataset and self-adaptive planning agent. In The Thirteenth International Conference on Learning Representations, 2025.

  27. [27] Weizhe Lin and Bill Byrne. Retrieval augmented visual question answering with outside knowledge. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11238–11254, 2022.

  28. [28] Weizhe Lin, Jinghong Chen, Jingbiao Mei, Alexandru Coca, and Bill Byrne. Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering. Advances in Neural Information Processing Systems, 36:22820–22840, 2023.

  29. [29] Zhaojiang Lin, Yong Xu, Kai Sun, Jing Zheng, Yin Huang, Surya Teja Appini, Krish Narang, Renjie Tao, Ishan Kapil Jain, Siddhant Arora, et al. WearVox: An egocentric multichannel voice assistant benchmark for wearables. arXiv preprint arXiv:2601.02391, 2025.

  30. [30] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.

  31. [31] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024.

  32. [32] Zhenghao Liu, Xingsheng Zhu, Tianshuo Zhou, Xinyi Zhang, Xiaoyuan Yi, Yukun Yan, Ge Yu, and Maosong Sun. Benchmarking retrieval-augmented generation in multi-modal contexts. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 4817–4826, 2025.

  33. [33] Damiano Marsili, Rohun Agrawal, Yisong Yue, and Georgia Gkioxari. Visual agentic AI for spatial reasoning with a dynamic API. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19446–19455, 2025.

  34. [34] Meta-AI. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models, 2024.

  35. [35] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

  36. [36] Jiahua Rao, Zifei Shan, Longpo Liu, Yao Zhou, and Yuedong Yang. Retrieval-based knowledge augmented vision language pre-training. In Proceedings of the 31st ACM International Conference on Multimedia, pages 5399–5409, 2023.

  37. [37] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-OKVQA: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022.

  38. [38] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. Advances in Neural Information Processing Systems, 36:38154–38180, 2023.

  39. [39] Michael Spitzer, Ibrahim Nanic, and Martin Ebner. Distance learning and assistance using smart glasses. Education Sciences, 8(1):21, 2018.

  40. [40] Dídac Surís, Sachit Menon, and Carl Vondrick. ViperGPT: Visual inference via Python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888–11898, 2023.

  41. [41] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

  42. [42] Boyuan Wang, Ying Zheng, Xihao Han, Liang Kong, Gexin Xiao, Zunxiong Xiao, and Shanji Chen. A systematic literature review on integrating AI-powered smart glasses into digital health management for proactive healthcare solutions. npj Digital Medicine, 8(1):410, 2025.

  43. [43] Chenyu Wang, Weixin Luo, Sixun Dong, Xiaohua Xuan, Zhengxin Li, Lin Ma, and Shenghua Gao. MLLM-Tool: A multimodal large language model for tool agent learning. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6678–6687. IEEE, 2025.

  44. [44] Feng Wang, Zesheng Shi, Bo Wang, Nan Wang, and Han Xiao. ReaderLM-v2: Small language model for HTML to Markdown and JSON. arXiv preprint arXiv:2503.01151, 2025.

  45. [45] Jialiang Wang, Daniel Scharstein, Akash Bapat, Kevin Blackburn-Matzen, Matthew Yu, Jonathan Lehman, Suhib Alsisan, Yanghan Wang, Sam Tsai, Jan-Michael Frahm, et al. A practical stereo depth system for smart glasses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21498–21507, 2023.

  46. [46] Jiaqi Wang, Xiao Yang, Kai Sun, Parth Suresh, Sanat Sharma, Adam Czyzewski, Derek Andersen, Surya Appini, Arkav Banerjee, Sajal Choudhary, et al. CRAG-MM: Multi-modal multi-turn comprehensive RAG benchmark. arXiv preprint arXiv:2510.26160, 2025.

  47. [47] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024.

  48. [48] Ziyue Wang, Chi Chen, Peng Li, and Yang Liu. Filling the image information gap for VQA: Prompting large language models to proactively ask questions. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.

  49. [49] Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual ChatGPT: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023.

  50. [50] Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, and Ziwei Liu. MMSearch-R1: Incentivizing LMMs to search. arXiv preprint arXiv:2506.20670, 2025.

  51. [51] Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024.

  52. [52] LLM-Core-Team Xiaomi. MiMo-VL technical report, 2025.

  53. [53] Junlin Xie, Zhihong Chen, Ruifei Zhang, Xiang Wan, and Guanbin Li. Large multimodal agents: A survey. arXiv preprint arXiv:2402.15116, 2024.

  54. [54] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2025.

  55. [55] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 11(12):nwae403, 2024.

  56. [56] Xu Yuan, Liangbo Ning, Wenqi Fan, and Qing Li. mKG-RAG: Multimodal knowledge graph-enhanced RAG for visual question answering. arXiv preprint arXiv:2508.05318, 2025.

  57. [57] Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. AppAgent: Multimodal agents as smartphone users. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2025.

  58. [58] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.

  59. [59] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.
