UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks
Pith reviewed 2026-05-08 08:37 UTC · model grok-4.3
The pith
A modular framework uses upstream reasoning models to generate explicit object and scene traces before video question answering, raising accuracy and transparency in some cases but lowering them when baselines are already strong.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UpstreamQA is a modular framework that disentangles video reasoning by employing large reasoning models to perform object identification and scene context generation, then passing the resulting enriched reasoning traces to large multimodal models for the VideoQA task. Evaluations across two LRMs and two LMMs on OpenEQA and NExTQA demonstrate that explicit reasoning boosts performance and interpretability in several scenarios while causing degradation when baseline performance is already sufficiently high.
What carries the argument
The UpstreamQA modular pipeline, which routes explicit object-identification and scene-context traces generated by upstream multimodal LRMs into downstream LMMs for final VideoQA.
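The paper does not publish its pipeline as code; the minimal Python sketch below is one plausible reading of the three-stage flow described above. The callables, the instruction wording, and the prompt template are assumptions for illustration, not the authors' implementation.

```python
from typing import Callable, Dict, Sequence

# Hypothetical stand-ins: in practice these would wrap calls to an LRM
# (e.g. o4-mini or Gemini 2.5 Pro) and an LMM (e.g. GPT-4o or Gemini 2.5 Flash).
UpstreamModel = Callable[[Sequence[bytes], str], str]    # (frames, instruction) -> reasoning trace
DownstreamModel = Callable[[Sequence[bytes], str], str]  # (frames, prompt) -> answer


def upstream_qa(frames: Sequence[bytes],
                question: str,
                upstream: UpstreamModel,
                downstream: DownstreamModel) -> Dict[str, str]:
    """Run the two upstream modules, then answer with the enriched prompt."""
    # Stage 1: explicit object identification on the sampled frames.
    object_trace = upstream(frames, "List the salient objects visible across these frames.")
    # Stage 2: explicit scene-context generation.
    scene_trace = upstream(frames, "Describe the scene and the activity taking place.")
    # Stage 3: the downstream LMM answers with both traces prepended to the question.
    enriched_prompt = (
        f"Objects identified: {object_trace}\n"
        f"Scene context: {scene_trace}\n"
        f"Question: {question}\n"
        "Answer concisely."
    )
    answer = downstream(frames, enriched_prompt)
    # Returning the traces alongside the answer keeps the intermediate steps inspectable.
    return {"objects": object_trace, "scene": scene_trace, "answer": answer}
```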
If this is right
- VideoQA accuracy rises when explicit upstream traces are supplied to models whose implicit reasoning is currently weak.
- The decision process becomes more interpretable because intermediate object and scene steps are available for inspection.
- Performance drops occur precisely on questions where the baseline LMM already achieves high accuracy without added traces.
- The modular split allows separate testing of object identification, scene description, and final answering stages.
Where Pith is reading between the lines
- Systems could decide at runtime whether to invoke the upstream modules based on a quick estimate of baseline uncertainty for each question; a possible gate is sketched after this list.
- Adding temporal-event detection as another upstream module might extend the gains to questions that hinge on action sequences.
- Error patterns in the generated traces could be used to fine-tune the upstream models and reduce cases of performance degradation.
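None of these extensions appears in the paper. As a purely illustrative sketch of the first bullet, a runtime gate might look like the following, where `baseline_confidence` and the 0.8 threshold are hypothetical stand-ins.

```python
from typing import Callable, Sequence


def answer_with_optional_upstream(
    frames: Sequence[bytes],
    question: str,
    baseline_answer: Callable[[Sequence[bytes], str], str],
    enriched_answer: Callable[[Sequence[bytes], str], str],
    baseline_confidence: Callable[[Sequence[bytes], str], float],
    threshold: float = 0.8,
) -> str:
    """Invoke the upstream reasoning modules only when the baseline looks uncertain."""
    # `baseline_confidence` is a hypothetical estimator, e.g. agreement among
    # several sampled baseline answers (self-consistency).
    if baseline_confidence(frames, question) >= threshold:
        # Baseline already strong: skip explicit traces, since the paper reports
        # degradation in exactly this regime.
        return baseline_answer(frames, question)
    # Baseline looks weak: pay for the upstream object and scene traces.
    return enriched_answer(frames, question)
```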
Load-bearing premise
The reasoning traces produced by the upstream models are accurate and add useful information rather than introducing errors that confuse the downstream model.
What would settle it
Measuring VideoQA accuracy after deliberately replacing the upstream traces with noisy or random descriptions on the same questions and videos would show whether the content of those traces drives the reported gains or losses.
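A minimal harness for that control could look like the sketch below. The perturbation strategies (shuffling traces across videos, dropping them entirely) and the callables are assumptions for illustration, not an experiment reported in the paper.

```python
import random
from typing import Callable, Dict, Sequence


def trace_ablation(
    examples: Sequence[dict],                               # each: {"frames", "question", "answer"}
    real_traces: Sequence[str],                             # traces produced by the upstream LRM
    answer_with_trace: Callable[[object, str, str], str],   # (frames, question, trace) -> prediction
    is_correct: Callable[[str, str], bool],                 # dataset-specific answer matcher
    seed: int = 0,
) -> Dict[str, float]:
    """Compare downstream accuracy with real, shuffled, and empty upstream traces."""
    rng = random.Random(seed)
    shuffled = list(real_traces)
    rng.shuffle(shuffled)  # traces paired with the wrong videos act as "random" descriptions

    conditions = {"real": real_traces, "shuffled": shuffled, "none": [""] * len(examples)}
    accuracy = {}
    for name, traces in conditions.items():
        hits = 0
        for example, trace in zip(examples, traces):
            prediction = answer_with_trace(example["frames"], example["question"], trace)
            hits += is_correct(prediction, example["answer"])
        accuracy[name] = hits / max(len(examples), 1)
    # If accuracy["real"] is close to accuracy["shuffled"], trace content is not doing the work.
    return accuracy
```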
Original abstract
Video Question Answering (VideoQA) demands models that jointly reason over spatial, temporal, and linguistic cues. However, the task's inherent complexity often requires multi-step reasoning that current large multimodal models (LMMs) perform implicitly, leaving their internal decision process opaque. In contrast, large reasoning models (LRMs) explicitly generate intermediate logical steps that enhance interpretability and can improve multi-hop reasoning accuracy. Yet, these models are not designed for native video understanding, as they typically rely on static frame sampling. We propose UpstreamQA, a modular framework that disentangles and evaluates core video reasoning components through explicit upstream reasoning modules. Specifically, we employ multimodal LRMs to perform object identification and scene context generation before passing enriched reasoning traces to downstream LMMs for VideoQA. We evaluate UpstreamQA on the OpenEQA and NExTQA datasets using two LRMs (o4-mini, Gemini 2.5 Pro) and two LMMs (GPT-4o, Gemini 2.5 Flash). Our results demonstrate that introducing explicit reasoning can significantly boost performance and interpretability of downstream VideoQA, but can also lead to performance degradation when baseline performance is sufficiently high. Overall, UpstreamQA offers a principled framework for combining explicit reasoning and multimodal understanding, advancing both performance and diagnostic transparency in VideoQA in several scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces UpstreamQA, a modular framework that splits VideoQA into an upstream explicit-reasoning stage, in which LRMs (o4-mini, Gemini 2.5 Pro) perform object identification and scene-context generation on sampled static frames, and a downstream stage, in which these traces are passed to LMMs (GPT-4o, Gemini 2.5 Flash) for final question answering. It evaluates the approach on OpenEQA and NExTQA, claiming that explicit reasoning boosts performance and interpretability in some cases but causes degradation when baseline LMM performance is already high.
Significance. If the empirical claims hold after proper validation, the framework would provide a useful diagnostic tool for isolating the contributions of explicit multi-step reasoning versus implicit multimodal understanding in VideoQA, potentially improving both accuracy and transparency in scenarios where baselines are weak.
major comments (3)
- [Abstract] The central claim that 'introducing explicit reasoning can significantly boost performance and interpretability of downstream VideoQA, but can also lead to performance degradation when baseline performance is sufficiently high' is presented without any quantitative accuracy numbers, deltas, error bars, ablation results, or statistical tests, leaving the magnitude and reliability of the reported effects unverifiable.
- [Evaluation] No metrics are reported for the accuracy, precision, or error rates of the upstream LRM-generated traces (object identification and scene context), so it is impossible to determine whether observed end-to-end gains or degradations arise from the quality of explicit reasoning or from noise/errors injected by inaccurate traces.
- [Experiments] The manuscript describes only end-to-end VideoQA accuracy on the two datasets and does not include ablations that substitute LRM traces with ground-truth annotations or controlled noisy variants; without these controls the assumption that enriched traces are additive for weaker LMMs yet non-harmful for stronger ones cannot be tested.
minor comments (2)
- [Abstract] The abstract refers to 'several scenarios' in which the framework advances performance and transparency, but does not enumerate or characterize those scenarios.
- Additional detail is needed on the exact mechanism for integrating the upstream reasoning traces into the downstream LMM prompts and on the frame-sampling strategy used by the LRMs; one plausible form of each is sketched below.
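Neither mechanism is specified in the manuscript. The sketch below shows one common choice, uniform frame sampling plus a templated prompt that prepends the traces, purely as an assumed example rather than the authors' method.

```python
from typing import List


def sample_frame_indices(num_frames: int, num_samples: int = 8) -> List[int]:
    """Uniformly spaced frame indices (an assumed strategy, not documented in the paper)."""
    if num_frames <= num_samples:
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


def build_downstream_prompt(object_trace: str, scene_trace: str, question: str) -> str:
    """Assemble the enriched downstream prompt (template wording is illustrative)."""
    return (
        "You are given reasoning traces produced by an upstream model.\n"
        f"Objects: {object_trace}\n"
        f"Scene: {scene_trace}\n"
        f"Question: {question}\n"
        "Answer using both the traces and the sampled video frames."
    )
```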
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key opportunities to make our empirical claims more verifiable and to strengthen the experimental analysis. We address each major comment below and outline the revisions we will incorporate.
Point-by-point responses
- Referee: [Abstract] The central claim that 'introducing explicit reasoning can significantly boost performance and interpretability of downstream VideoQA, but can also lead to performance degradation when baseline performance is sufficiently high' is presented without any quantitative accuracy numbers, deltas, error bars, ablation results, or statistical tests, leaving the magnitude and reliability of the reported effects unverifiable.
Authors: We agree that the abstract should provide concrete quantitative support for the central claim. The body of the manuscript contains detailed results tables showing end-to-end accuracies on OpenEQA and NExTQA for the different LRM-LMM combinations, including specific deltas (both positive and negative) relative to the LMM-only baselines. In the revised version we will condense the key numerical findings—such as the largest observed gains and the conditions under which degradation occurs—directly into the abstract, along with a brief reference to the experimental conditions. This will make the magnitude and direction of the effects immediately verifiable. revision: yes
- Referee: [Evaluation] No metrics are reported for the accuracy, precision, or error rates of the upstream LRM-generated traces (object identification and scene context), so it is impossible to determine whether observed end-to-end gains or degradations arise from the quality of explicit reasoning or from noise/errors injected by inaccurate traces.
Authors: This concern is well-founded. Because the OpenEQA and NExTQA datasets do not provide ground-truth annotations for intermediate object lists or scene descriptions, we did not compute direct accuracy metrics for the upstream traces. Our evaluation instead relies on the downstream VideoQA accuracy as an implicit indicator of trace utility. In the revision we will add a dedicated limitations paragraph that explicitly states this gap, include qualitative examples of generated traces with commentary on common error patterns, and note that future work could incorporate human or automated verification of trace quality. We will not claim that the traces are error-free. revision: partial
- Referee: [Experiments] The manuscript describes only end-to-end VideoQA accuracy on the two datasets and does not include ablations that substitute LRM traces with ground-truth annotations or controlled noisy variants; without these controls the assumption that enriched traces are additive for weaker LMMs yet non-harmful for stronger ones cannot be tested.
Authors: We acknowledge that controlled ablations using ground-truth or synthetically noisy traces would provide stronger causal evidence. Creating such annotations at scale was outside the scope of the present study. In the revised manuscript we will expand the experiments section with additional per-pair breakdowns and qualitative analysis of when and why performance changes, and we will explicitly list the absence of ground-truth and noise-injection ablations as a limitation together with a concrete suggestion for how they could be performed in follow-up work. This will clarify the current evidential basis without overstating it. revision: partial
Circularity Check
No circularity: purely empirical framework with no derivations or self-referential reductions
Full rationale
The paper describes an empirical modular framework (UpstreamQA) that routes static-frame object identification and scene context traces from LRMs (o4-mini, Gemini 2.5 Pro) into downstream LMMs (GPT-4o, Gemini 2.5 Flash) for VideoQA evaluation on OpenEQA and NExTQA. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the abstract or described experiments. Performance deltas are reported as direct experimental outcomes rather than as quantities that reduce to their own definitions or to previously fitted values. The evaluation is therefore grounded in external benchmarks rather than in quantities of the framework's own construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- Selection of specific LRMs and LMMs
axioms (1)
- domain assumption: Explicit intermediate reasoning traces from LRMs improve, or at least do not harm, downstream LMM VideoQA performance
invented entities (1)
- UpstreamQA modular framework (no independent evidence)