Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
Pith reviewed 2026-05-19 15:57 UTC · model grok-4.3
The pith
Providing models with hints on where and when to look improves performance on complex egocentric video reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Minerva-Ego extends egocentric video sources with challenging multi-step multimodal questions and spatiotemporally-dense human-annotated reasoning traces, plus object mask annotations for each trace. Benchmarking shows state-of-the-art models maintain a large performance gap relative to humans. Prompting the same models with hints that indicate the relevant locations and times in the video yields substantial improvements in accuracy.
What carries the argument
Spatiotemporal hints drawn from human-annotated reasoning traces and object masks that direct models to the key locations and moments needed for each question.
If this is right
- Frontier models gain measurable accuracy on egocentric video questions when supplied with explicit spatial and temporal guidance.
- Dense object mask annotations allow direct evaluation of which visual elements matter for each reasoning step.
- Benchmarks that score intermediate traces reveal gaps invisible to answer-only evaluation.
- Embodied agents can benefit from inference-time guidance that mimics human focus patterns in first-person video.
Where Pith is reading between the lines
- The same hinting method could be tested on non-egocentric video domains to check if spatial-temporal guidance generalizes.
- Models trained to generate their own reasoning traces and masks might reduce reliance on human annotations over time.
- Real-time embodied systems could incorporate online prediction of where and when to attend for ongoing tasks.
Load-bearing premise
The human-annotated reasoning traces and object masks accurately identify the intermediate steps and visual elements required to solve the questions.
What would settle it
Running the same models on the benchmark questions while supplying random or mismatched spatiotemporal hints and observing no performance change would indicate the hints are not the driver of gains.
Figures
read the original abstract
Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks for assessing models provide only evaluation of the output (e.g. the answer to a question), without evaluation of intermediate reasoning steps, and most provide answers only in the text domain. We introduce Minerva-Ego, a benchmark for evaluating complex egocentric visual reasoning. We extend recent high-quality video data sources recorded from egocentric / embodied settings with a set of challenging, multi-step multimodal questions and spatiotemporally-dense human-annotated reasoning traces. Benchmarking experiments show that state-of-the-art models still have a large gap to human performance. To investigate this gap in detail, we annotate each reasoning trace in the dataset with the objects of interest required to solve the question, as spatiotemporal mask annotations. Through extensive evaluations, we identify that prompting frontier models with hints of 'where' and 'when' to look yields substantial improvements in performance. Minerva-Ego can be downloaded at https://github.com/google-deepmind/neptune.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Minerva-Ego, a benchmark for complex egocentric visual reasoning. It augments existing egocentric video sources with multi-step multimodal questions, spatiotemporally-dense human-annotated reasoning traces, and object mask annotations. Benchmarking shows state-of-the-art models lag human performance, and experiments indicate that providing frontier models with 'where' and 'when' hints yields substantial gains.
Significance. If the results hold after addressing annotation validation, Minerva-Ego would be a useful resource for egocentric video understanding research, enabling analysis of intermediate reasoning steps rather than final answers alone. The empirical demonstration that spatiotemporal hints improve performance could guide model architectures for embodied agents.
major comments (2)
- [Abstract] Abstract: the central claim that 'prompting frontier models with hints of 'where' and 'when' to look yields substantial improvements in performance' depends on the human-annotated reasoning traces and masks accurately isolating the minimal visual elements and steps required. Without reported controls (e.g., machine-generated masks or ablations removing semantic content from traces), it remains possible that gains arise from answer leakage rather than pure spatiotemporal guidance.
- [Benchmarking experiments] Benchmarking experiments: the manuscript reports performance gaps and hint-based gains but provides no inter-annotator agreement statistics or validation that models using only the provided masks reach human-level accuracy on the questions. This is load-bearing for interpreting the hints as faithful spatiotemporal guidance.
minor comments (2)
- [Abstract] The GitHub link is provided, but the paper should explicitly state the data splits, question counts per split, and statistical significance tests used for the reported gains to support reproducibility.
- [Dataset description] Clarify how the 'spatiotemporally-dense' object masks are aligned with the reasoning traces (e.g., frame-level vs. clip-level) to avoid ambiguity in how hints are constructed from them.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential utility of Minerva-Ego for analyzing intermediate reasoning steps in egocentric video understanding. We address each major comment below and will incorporate revisions to strengthen the claims regarding spatiotemporal hints.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that 'prompting frontier models with hints of 'where' and 'when' to look yields substantial improvements in performance' depends on the human-annotated reasoning traces and masks accurately isolating the minimal visual elements and steps required. Without reported controls (e.g., machine-generated masks or ablations removing semantic content from traces), it remains possible that gains arise from answer leakage rather than pure spatiotemporal guidance.
Authors: We agree that explicit controls are needed to isolate the contribution of spatiotemporal information and rule out leakage. In the revised manuscript we will add two new ablation experiments: (1) replacing human masks with machine-generated masks from an off-the-shelf detector, and (2) stripping semantic labels from the traces while preserving only temporal and spatial coordinates. These controls will quantify how much of the observed gain is attributable to faithful 'where/when' guidance versus direct answer content. revision: yes
-
Referee: [Benchmarking experiments] Benchmarking experiments: the manuscript reports performance gaps and hint-based gains but provides no inter-annotator agreement statistics or validation that models using only the provided masks reach human-level accuracy. This is load-bearing for interpreting the hints as faithful spatiotemporal guidance.
Authors: We will report inter-annotator agreement statistics (IoU for masks and Cohen’s kappa for trace steps) in the revised version. For validation, we will include additional results showing model accuracy when given only the annotated masks and traces; we will also discuss the remaining gap to human performance, which we attribute to limitations in current models’ ability to integrate the provided spatiotemporal cues rather than to annotation infidelity. revision: yes
Circularity Check
No circularity: empirical benchmark with no derivations or fitted predictions
full rationale
The paper introduces Minerva-Ego as a new benchmark consisting of egocentric videos, multi-step questions, human-annotated reasoning traces, and spatiotemporal object masks. Its central empirical claim—that prompting frontier models with 'where' and 'when' hints yields performance gains—is obtained directly from evaluations on this newly created dataset. There are no equations, first-principles derivations, fitted parameters, or predictions that reduce to prior quantities by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing manner for any derivation, as none exist. The work is a self-contained empirical contribution whose findings rest on external model evaluations rather than internal redefinitions or self-referential fits.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
https://openai.com/index/learning- to-reason-with-llms , 2025
Open AI. https://openai.com/index/learning- to-reason-with-llms , 2025. [Accessed 19-02-2025]. 1
work page 2025
-
[3]
System Card: Claude Opus 4 & Claude Sonnet 4
Anthropic. System Card: Claude Opus 4 & Claude Sonnet 4. https://www- cdn.anthropic.com/ 6d8a8055020700718b0c49369f60816ba2a7c285. pdf. [Accessed 30-04-2025]. 4
work page 2025
-
[4]
Anthropic. Claude 3.5 sonnet v2. Anthropic API, 2023. A language model from Anthropic, featuring improved capabili- ties over the original Claude 3.5 Sonnet, including enhanced computer action generation. 1
work page 2023
-
[5]
Kirolos Ataallah, Chenhui Gou, Eslam Abdelrahman, Khushbu Pahwa, Jian Ding, and Mohamed Elhoseiny. In- finiBench: A comprehensive benchmark for large multimodal models in very long video understanding.arXiv preprint arXiv:2406.19875, 2024. 2
-
[6]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Amir Bar, Yossi Gandelsman, Trevor Darrell, Amir Glober- son, and Alexei Efros. Visual prompting via image inpaint- ing.Advances in Neural Information Processing Systems, 35: 25005–25017, 2022. 3
work page 2022
-
[8]
Look, Remember and Reason: Grounded Reasoning in Videos with Language Models
Apratim Bhattacharyya, Sunny Panchal, Reza Pourreza, Mingu Lee, Pulkit Madan, and Roland Memisevic. Look, Remember and Reason: Grounded Reasoning in Videos with Language Models. InThe Twelfth International Conference on Learning Representations, 2024. 3
work page 2024
-
[9]
TemporalBench: Benchmarking fine- grained temporal understanding for multimodal video models
Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, and Jianwei Yang. TemporalBench: Benchmarking fine- grained temporal understanding for multimodal video models. arXiv preprint arXiv:2410.10818, 2024. 2
-
[10]
HourVideo: 1-hour video-language understanding.arXiv preprint arXiv:2411.04998, 2024
Keshigeyan Chandrasegaran, Agrim Gupta, Lea M Hadzic, Taran Kota, Jimming He, Cristóbal Eyzaguirre, Zane Du- rante, Manling Li, Jiajun Wu, and Li Fei-Fei. HourVideo: 1-hour video-language understanding.arXiv preprint arXiv:2411.04998, 2024. 2
-
[11]
Egothink: Evaluating first- person perspective thinking capability of vision-language models
Sijie Cheng, Zhicheng Guo, Jingwen Wu, Kechen Fang, Peng Li, Huaping Liu, and Yang Liu. Egothink: Evaluating first- person perspective thinking capability of vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14291–14302, 2024. 3
work page 2024
-
[12]
TVBench: Redesigning video-language evaluation.arXiv preprint arXiv:2410.07752,
Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees G M Snoek, and Yuki M Asano. TVBench: Redesigning video-language evaluation.arXiv preprint arXiv:2410.07752,
-
[13]
Egovqa: An egocentric video question an- swering benchmark dataset
Yifei Fan, Bo Zhang, Yun Zhou, Qingshan He, Zhenyi Liu, and Liang Wang. Egovqa: An egocentric video question an- swering benchmark dataset. InProceedings of the IEEE/CVF International Conference on Computer Vision Workshops (IC- CVW), pages 2501–2504, 2019. 2
work page 2019
-
[14]
Embodied ai: From llms to world models.arXiv preprint arXiv:2509.20021, 2025
Tongtong Feng, Xin Wang, Yu-Gang Jiang, and Wenwu Zhu. Embodied ai: From llms to world models.arXiv preprint arXiv:2509.20021, 2025. 1
-
[15]
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.arXiv preprint arXiv:2405.21075, 2024. 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal- ski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller- Freitag, et al. The" something something" video database for learning and evaluating visual common sense. InProceed- ings of the IEEE international conference on computer vision, pages 584...
work page 2017
-
[17]
Ego4d: Around the world in 3,000 hours of egocentric video, 2022
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Na- garajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, et al. Ego4d: Around the world in 3,000 hours of egocentric video, 2022. 2
work page 2022
-
[18]
Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, and Si Liu. Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection.arXiv preprint arXiv:2411.14794, 2024. 3
-
[19]
Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kem- bhavi, and Ranjay Krishna. Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality.Advances in neural information processing systems, 36:31096–31116,
-
[20]
Egotaskqa: Understanding human tasks in egocentric videos
Baoxiong Jia, Ting Lei, Song-Chun Zhu, and Siyuan Huang. Egotaskqa: Understanding human tasks in egocentric videos. InThe 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks,
work page 2022
-
[21]
Scaling scaling laws with board games.arXiv preprint arXiv:2104.03113, 2021
Andy L Jones. Scaling scaling laws with board games.arXiv preprint arXiv:2104.03113, 2021. 1
-
[22]
The Kinetics Human Action Video Dataset
Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics hu- man action video dataset.arXiv preprint arXiv:1705.06950,
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
Adversarial filters of dataset biases
Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew Peters, Ashish Sabharwal, and Yejin Choi. Adversarial filters of dataset biases. InInternational conference on machine learning, pages 1078–1088. Pmlr,
-
[24]
Yunxin Li, Xinyu Chen, Baotian Hu, Longyue Wang, Haoyuan Shi, and Min Zhang. Videovista: A versatile bench- mark for video understanding and reasoning.arXiv preprint arXiv:2406.11303, 2024. 2
-
[25]
Xiatoian Liu, Hector Palacios, and Christian Muise. Egocen- tric planning for scalable embodied task achievement.Ad- vances in Neural Information Processing Systems, 36:54586– 54613, 2023. 1
work page 2023
- [26]
-
[27]
Egoschema: A diagnostic benchmark for very long- form video language understanding
Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding. InThirty-seventh Con- ference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. 1, 2
work page 2023
-
[28]
Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scal- ing open-vocabulary object detection.Advances in Neural Information Processing Systems, 36:72983–73007, 2023. 2, 7
work page 2023
-
[29]
Arsha Nagrani, Mingda Zhang, Ramin Mehran, Rachel Hor- nung, Nitesh Bharadwaj Gundavarapu, Nilpa Jha, Austin Myers, Xingyi Zhou, Boqing Gong, Cordelia Schmid, et al. Neptune: The long orbit to benchmarking long video under- standing.arXiv preprint arXiv:2412.09582, 2024. 1, 2
-
[30]
Minerva: Evaluating complex video reasoning.arXiv preprint arXiv:2505.00681,
Arsha Nagrani, Sachit Menon, Ahmet Iscen, Shyamal Buch, Ramin Mehran, Nilpa Jha, Anja Hauth, Yukun Zhu, Carl V ondrick, Mikhail Sirotenko, et al. Minerva: Evaluating complex video reasoning.arXiv preprint arXiv:2505.00681,
-
[31]
Introducing GPT-4.1 in the API
OpenAI. Introducing GPT-4.1 in the API. https:// openai.com/index/gpt-4-1/ , 2025. [Accessed 30- 04-2025]. 4
work page 2025
-
[32]
OpenAI. Introducing GPT-5. https://openai.com/ index/introducing-gpt-5/, 2025. [Accessed 16-09- 2025]. 1, 4
work page 2025
-
[33]
Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Re- casens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Per- ception test: A diagnostic benchmark for multimodal video models.Advances in Neural Information Processing Systems, 36, 2024. 2
work page 2024
-
[34]
Hd-epic: A highly-detailed egocentric video dataset
Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Parida, Kaiting Liu, Prajwal Gatti, Sid- dhant Bansal, Kevin Flanagan, Jacob Chalk, Zhifan Zhu, Rho- dri Guerrier, Fahd Abdelazim, Bin Zhu, Davide Moltisanti, Michael Wray, Hazel Doughty, and Dima Damen. Hd-epic: A highly-detailed egocentric video dataset. InProceedings of the...
work page 2025
-
[35]
Omnia de egotempo: Bench- marking temporal understanding of multi-modal llms in ego- centric videos
Chiara Plizzari, Alessio Tonioni, Yongqin Xian, Ace Kul- shrestha, and Federico Tombari. Omnia de egotempo: Bench- marking temporal understanding of multi-modal llms in ego- centric videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 1, 2, 7, 8
work page 2025
-
[36]
Sam 2: Segment anything in images and videos.ICLR, 2025
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.ICLR, 2025. 5
work page 2025
-
[37]
Ruchit Rawal, Khalid Saifullah, Ronen Basri, David Jacobs, Gowthami Somepalli, and Tom Goldstein. CinePile: A Long Video Question Answering Dataset and Benchmark.arXiv preprint arXiv:2405.08813, 2024. 2
-
[38]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrit- twieser, et al. Gemini 1.5: Unlocking multimodal under- standing across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024. 1
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[39]
Tanik Saikh, Tirthankar Ghosal, Amish Mittal, Asif Ekbal, and Pushpak Bhattacharyya. Scienceqa: A novel resource for question answering on scholarly articles.International Journal on Digital Libraries, 23(3):289–301, 2022. 3
work page 2022
-
[40]
Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612– 8642, 2024. 3
work page 2024
-
[41]
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Ku- mar. Scaling llm test-time compute optimally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314, 2024. 1
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[42]
Gemini Team. Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next gen- eration agentic capabilities.arXiv preprint arXiv:2507.06261,
work page internal anchor Pith review Pith/arXiv arXiv
-
[43]
LVBench: An Extreme Long Video Understanding Benchmark
Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, et al. Lvbench: An extreme long video understanding benchmark.arXiv preprint arXiv:2406.08035, 2024. 1, 2, 4
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[44]
Yan Wang, Yawen Zeng, Jingsheng Zheng, Xiaofen Xing, Jin Xu, and Xiangmin Xu. Videocot: A video chain-of- thought dataset with active annotation tool.arXiv preprint arXiv:2407.05355, 2024. 3
-
[45]
Visual prompting in multimodal large language models: A survey.arXiv preprint arXiv:2409.15310, 2024
Junda Wu, Zhehao Zhang, Yu Xia, Xintong Li, Zhaoyang Xia, Aaron Chang, Tong Yu, Sungchul Kim, Ryan A Rossi, Ruiyi Zhang, et al. Visual prompting in multimodal large language models: A survey.arXiv preprint arXiv:2409.15310, 2024. 3
-
[46]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[47]
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v.arXiv preprint arXiv:2310.11441, 2023. 3, 5
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[48]
MMEgo: Towards Building Egocentric Multimodal LLMs for Video QA
Hanrong Ye, Haotian Zhang, Erik Daxberger, Lin Chen, Zongyu Lin, Yanghao Li, Bowen Zhang, Haoxuan You, Dan Xu, Zhe Gan, Jiasen Lu, and Yinfei Yang. MMEgo: Towards Building Egocentric Multimodal LLMs for Video QA. In International Conference on Learning Representations, pages 71705–71723, 2025. 3
work page 2025
-
[49]
Activitynet-qa: A dataset for understanding complex web videos via question answering
Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9127–9134, 2019. 1 Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding Supplementary Material 0 10 2...
work page 2019
-
[50]
Minerva-Ego 6.1. Additional Statistics The shortest videos is 10 seconds while the longest video is 75 minutes. The mean video length is 20 minutes. The full distribution can be found in Fig. 7. Figure 8 shows statistics of reasoning lengths and object occurrences by question type. We note that the distribution is flat, other thanAudio Reasoning, which ha...
-
[51]
All raters are native English speakers with graduate degrees
Rater Guidelines All textual data in Minerva-Ego was manually annotated by human annotators (raters). All raters are native English speakers with graduate degrees. 7.1. Annotation Guidelines The raters were given the following guidelines before being asked to propose question, answers, decoys and reasoning traces: You will be given a video. For each video...
-
[52]
Egocentric and Spatial Perception
-
[53]
Numerical Reasoning (all math operations other than counting)
-
[54]
Counterfactual (“what if”) Important things to keep in mind:
-
[55]
Questions should be multi-step
-
[56]
They should be difficult to solve
-
[57]
Ideally they should involve looking at multiple different time segments of the video
-
[58]
Each question should be cover multiple skills and have multiple reasoning steps
-
[59]
The questions should be phrased with the word “I”, as if the cameraperson is asking the question
-
[60]
I” to refer to you (the rater, as you are doing the reasoning, and use the word “user
The reasoning traces should use the word “I” to refer to you (the rater, as you are doing the reasoning, and use the word “user” to refer to the cameraperson). Note in the question, I refers to the camera person
- [61]
-
[62]
Every time that object is referenced (even if it is in a different time step of the video), please use the same annotation
-
[63]
Please number the “steps” in each reasoning trace, as shown in the examples
-
[64]
All reasoning traces must have at least one time stamp, and at least one object of interest reference. 7.2. Human Study We used a disjoint pool of 12 raters for the human study, by asking the rater leads to ensure that the same rater who proposed the question (or even saw the video before) is not the same as the one who performs the human study. Hence no ...
-
[65]
Model Implementation Details Hyperparameters for all our models are provided in Table 7
-
[66]
Results 9.1. MiRA Reasoning Evaluation MiRA is a model-based metric that uses Gemini 2.5 Pro for scoring. The exact prompt used to compute the MiRA score appears in Figure 9. Since using the same LLM for hinting and as a judge couldpotentiallybias the results, we repeat the MiRA anal- ysis from Fig. 3 (left) using GPT-5 as a judge, and report the scores i...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.