
arxiv: 2512.01707 · v3 · pith:LHIZBS5Cnew · submitted 2025-12-01 · 💻 cs.CV · cs.AI · cs.CL

StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos

Pith reviewed 2026-05-17 02:44 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords streaming video understanding · gaze-guided reasoning · multimodal large language models · egocentric videos · proactive prediction · temporal reasoning · benchmark · question answering

The pith

The StreamGaze benchmark shows state-of-the-art MLLMs lag humans substantially when using gaze signals for temporal and proactive reasoning in streaming videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces StreamGaze as the first benchmark to test whether multimodal large language models can interpret human gaze trajectories in streaming egocentric videos. It creates three task types that require models to reason about the past, track the present, and anticipate future intentions using only incoming frames and real-time gaze data. A custom pipeline generates the test questions by extracting fixations, applying region-specific prompts, and building scanpaths from raw gaze records. Evaluations across these tasks reveal clear performance shortfalls compared with human baselines, which the authors tie to weaknesses in gaze-based attention following and intention prediction.

Core claim

StreamGaze establishes gaze-guided past, present, and proactive tasks that measure how effectively MLLMs can leverage real-time gaze to follow shifting attention and infer user intentions from partially observed streaming video frames. The supporting QA generation pipeline aligns egocentric videos with gaze trajectories through fixation extraction, region-specific visual prompting, and scanpath construction, yielding spatio-temporally grounded evaluation data that reflects human perceptual dynamics. Benchmark results document substantial gaps between current MLLMs and human performance on all tasks, along with analyses of prompting strategies and task-specific failure modes.

What carries the argument

The gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories through fixation extraction, region-specific visual prompting, and scanpath construction to create spatio-temporally grounded question-answer pairs.
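
The fixation-extraction step can be illustrated with a minimal sketch. This page does not say which fixation detector the paper uses, so the code below swaps in the common dispersion-threshold method (I-DT) purely as an assumption. The `Fixation` type, the function names, and both thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# (time in seconds, x, y) with x and y normalized to [0, 1]; assumed input format.
Sample = Tuple[float, float, float]


@dataclass
class Fixation:
    start: float  # time of the first sample in the fixation
    end: float    # time of the last sample in the fixation
    x: float      # centroid x
    y: float      # centroid y


def dispersion(window: Sequence[Sample]) -> float:
    """I-DT dispersion: (max x - min x) + (max y - min y)."""
    xs = [s[1] for s in window]
    ys = [s[2] for s in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def extract_fixations(
    samples: Sequence[Sample],
    max_dispersion: float = 0.05,  # hypothetical threshold
    min_duration: float = 0.1,     # hypothetical threshold, in seconds
) -> List[Fixation]:
    """Group time-sorted gaze samples into fixations with the I-DT rule."""
    fixations: List[Fixation] = []
    i, n = 0, len(samples)
    while i < n:
        # Smallest window starting at i that spans at least min_duration.
        j = i
        while j < n and samples[j][0] - samples[i][0] < min_duration:
            j += 1
        if j >= n:
            break  # remaining samples are too short to form a fixation
        if dispersion(samples[i:j + 1]) <= max_dispersion:
            # Grow the window while the gaze stays within the dispersion limit.
            while j + 1 < n and dispersion(samples[i:j + 2]) <= max_dispersion:
                j += 1
            window = samples[i:j + 1]
            cx = sum(s[1] for s in window) / len(window)
            cy = sum(s[2] for s in window) / len(window)
            fixations.append(Fixation(window[0][0], window[-1][0], cx, cy))
            i = j + 1
        else:
            i += 1  # saccade or noise: slide the window start forward
    return fixations
```

Samples that never settle within the dispersion limit for the minimum duration are treated as saccades and produce no fixation.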

If this is right

  • Gaze prompting strategies can be refined to better integrate real-time signals into streaming video models.
  • Task-specific failure modes identify concrete areas for improving intention modeling and temporal reasoning.
  • Public release of the data and code supports further work on gaze-guided streaming understanding.
  • Applications such as AR glasses will require models that close the observed gaps in proactive prediction from partial observations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Closing these gaps could enable AR systems that proactively assist users by anticipating what they will look at next.
  • The same pipeline approach for turning sensor streams into grounded QA pairs might be adapted to other real-time signals such as hand tracking or audio.
  • Architectures that process continuous video and auxiliary gaze data together may be needed before models match human performance on these tasks.

Load-bearing premise

That the automatically generated QA pairs accurately reflect human perceptual dynamics, and that gaze signals supply useful, non-redundant information for the past, present, and proactive tasks.

What would settle it

A follow-up experiment in which state-of-the-art MLLMs reach human-level accuracy on the proactive tasks when supplied with the same gaze data, or show no drop in performance when the gaze signal is removed, would indicate the claimed limitations do not hold.
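
One way to set up the "gaze removed" comparison is sketched below. It is not taken from the paper: the three condition names, the shuffle-based randomization, and every function name are illustrative assumptions. Shuffling keeps the same set of gaze points but breaks their alignment with the frames.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple

Gaze = Tuple[float, float]  # normalized (x, y) per frame; assumed format


def make_gaze_conditions(
    gaze: Sequence[Gaze], seed: int = 0
) -> Dict[str, Optional[List[Gaze]]]:
    """Build the three inputs for a gaze ablation on identical questions."""
    rng = random.Random(seed)   # fixed seed so the control is reproducible
    randomized = list(gaze)
    rng.shuffle(randomized)     # same points, temporal alignment destroyed
    return {
        "original": list(gaze),
        "randomized": randomized,
        "none": None,           # the model sees video frames only
    }


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of exact matches between predicted and gold choices."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and aligned")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def gaze_contribution(acc: Dict[str, float]) -> Dict[str, float]:
    """Accuracy gained from real gaze relative to each control condition."""
    return {
        "vs_randomized": acc["original"] - acc["randomized"],
        "vs_none": acc["original"] - acc["none"],
    }
```

If both differences are near zero for a model, its score on that task does not depend on the gaze signal, which is the outcome this section says would undercut the claimed limitation.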

Figures

Figures reproduced from arXiv: 2512.01707 by Branislav Kveton, Daeun Lee, Franck Dernoncourt, Mohit Bansal, Ryan A. Rossi, Seunghyun Yoon, Subhojyoti Mukherjee, Trung Bui, Viet Dac Lai.

Figure 1
Figure 1: STREAMGAZE's task taxonomy. We introduce STREAMGAZE, the first benchmark designed to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos. We introduce gaze-guided past, present, and proactive streaming video understanding tasks.
Figure 2
Figure 2: MLLMs performance across STREAMGAZE tasks.
Figure 3
Figure 3: Gaze-guided streaming data construction pipeline for STREAMGAZE. Given egocentric video sources and raw gaze projections (Sec. 3.1), we first extract fixation moments across the entire video (Sec. 3.2). Next, we divide each frame into FOV and out-of-FOV regions and extract objects within the gaze area (Sec. 3.3). Finally, we construct scanpaths and generate streaming QA pairs (Sec. 4).
Figure 4
Figure 4: Task-wise effect of gaze input strategies on STREAMGAZE. Different tasks respond unevenly to textual, visual, and salience-map prompts, revealing strong task-specific behavior in streaming gaze understanding.
Figure 5
Figure 5: Offline MLLMs' behavior on STREAMGAZE proactive tasks. We evaluate type1 (false positive) and type2 (false negative) error from Qwen2.5-VL, InternVL-3.5, and GPT4o.
    Panel labels: Type 1 Error (False Pos), Type 2 Error (False Neg); models: Qwen2.5-VL-7B, InternVL3.5-8B, GPT4o.
Figure 6
Figure 6: Data statistics of STREAMGAZE. We report domain proportion, video length, word cloud for FOV extracted objects.
Figure 7
Figure 7: Task proportion of STREAMGAZE. We summarize the sample distribution for each task in STREAMGAZE.
    Extracted chart text, task proportion (%), in extraction order: labels OAA, OAR, OI (Easy), OI (Hard), FAP, NFI, GSM, GTA, OTP, SR; values 20.6%, 16.7%, 16.7%, 15.2%, 10.5%, 7.0%, 5.5%, 3.2%, 2.9%, 1.5%.
Figure 8
Figure 8: Example of negative sampling strategy for proactive task fine-tuning. We apply negative sampling to fine-tune ViSpeak.
Figure 9
Figure 9: Gaze prompting strategies for STREAMGAZE. We adopt three strategies for inputting gaze information to MLLMs.
Figure 10
Figure 10: Qualitative comparison with text reasoning and gaze-based reasoning. We visualize examples from the scene reconstruction task comparing (a) text and (b) gaze-based reasoning.
    Extracted example: [Task: Scene Reconstruction] When the user was gazing at the {paper}, which background object was NOT visible? A. towel B. bowl C. spoon D. pot. Thinking: When the gaze is centered on the paper, the towel on the right side of the fram…
Figure 11
Figure 11: Qualitative comparison with text reasoning and visual-based reasoning. We visualize examples from the object identification task comparing (a) text and (b) visual-based reasoning.
Figure 12
Figure 12: STREAMGAZE data example for OTP and NFI task.
Figure 13
Figure 13: STREAMGAZE data example for SR and GSM task.
    Extracted example (Gaze Sequence Matching): Which transition best matches the user's gaze pattern? A. {pot, spoon} → {vegetables, sink, faucet} → {pepper, knife, hand} B. {paper, fork} → {paper, tomato} → {bowl, egg} C. {bell pepper, bottle} → {bowl, knife} → {cucumber, faucet} D. {vegetables, sink, faucet} → {pot, spoon} → {pepper, knife, hand}
Figure 14
Figure 14: STREAMGAZE data example for OI (Easy/Hard) task.
Figure 15
Figure 15: STREAMGAZE data example for OAR and FAP task.
    Extracted example (Future Action Prediction): Based on the recent fixation pattern (lid → spoon → bottle), what action will the user do next? A. Open the carrot container B. Move around the grocery bag C. Transfer the vegetable from the plate to the bowl. D. Pour seasoning from the seasoning container into the pot.
    Extracted example (Object Attribute Recognition): What color is this obj…
Figure 16
Figure 16: STREAMGAZE data example for GTA and OAA task.
Figure 17
Figure 17: Prompt for FOV region object extraction.
Figure 18
Figure 18: Prompt for out-of-FOV region object extraction.
Figure 19
Figure 19: Prompt for text-based reasoning.
    Extracted prompt text: You are an expert at gaze-conditioned streaming video reasoning. Rules: 1) Only use visual evidence up to the given timestamp. 2) Do your step-by-step reasoning ONLY inside <think>...</think>. 3) Give exactly ONE final choice (A/B/C/D) or yes/no with its short text inside <answer>...</answer>. {Question}
Figure 20
Figure 20: Prompt for gaze-based reasoning.
Figure 21
Figure 21: Prompt for visual-based reasoning.
    Extracted prompt text: You are an expert at gaze-conditioned streaming video reasoning. Rules: 1) Only use visual evidence up to the given timestamp (no future leakage). 2) Produce your reasoning ONLY inside <think>...</think> and never reveal reasoning outside the tag. 3) Return exactly FOUR tags in this order: <gaze>...</gaze> <objects>...</objects> <think>...…
Figure 22
Figure 22: HTML for human verification of STREAMGAZE data construction.
Figure 23
Figure 23: HTML for human oracle evaluation of STREAMGAZE.
Figure 24
Figure 24: Instructions provided to human annotators.
Original abstract

Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications such as Augmented Reality (AR) glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether Multimodal Large Language Models (MLLMs) can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs utilize gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively assess streaming video understanding. These tasks evaluate whether models can use real-time gaze signals to follow shifting attention and infer user intentions based only on past and currently observed frames. To build StreamGaze, we develop a gaze-video Question Answering (QA) generation pipeline that aligns egocentric videos with raw gaze trajectories through fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, highlighting key limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze prompting strategies, reasoning behaviors, and task-specific failure modes, offering insights into current limitations and directions for future research. All data and code are publicly available to support continued research in gaze-guided streaming video understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces StreamGaze, the first benchmark for assessing how effectively Multimodal Large Language Models (MLLMs) utilize human gaze signals for gaze-guided past, present, and proactive reasoning in streaming videos. It presents a QA generation pipeline that aligns egocentric videos with raw gaze trajectories via fixation extraction, region-specific visual prompting, and scanpath construction to produce spatio-temporally grounded QA pairs, reports substantial performance gaps between SOTA MLLMs and humans across tasks, and provides analyses of prompting strategies and failure modes, with all data and code released publicly.

Significance. If the central claim holds after verification that the tasks isolate gaze-based reasoning deficits rather than generation artifacts, the work would be significant for streaming video understanding and AR applications by highlighting specific MLLM limitations in intention modeling and proactive prediction. The public data and code release is a clear strength supporting reproducibility and follow-on research.

major comments (2)
  1. §3 (QA Generation Pipeline): The headline claim of substantial gaps highlighting limitations in gaze-based temporal reasoning rests on the assumption that fixation extraction + region-specific prompting + scanpath construction produces QA pairs whose difficulty is attributable to gaze utilization. The manuscript should include an ablation or control experiment comparing model performance on the same questions with vs. without gaze signals (or with randomized gaze) to rule out incidental properties of the generation process such as unnatural phrasing or region boundaries.
  2. §4 (Task Definitions and Human Baselines): It is unclear whether the past/present/proactive task formulations avoid future-frame leakage in the streaming setting and whether human baselines control for gaze visibility (i.e., do humans see the raw gaze overlay or only the video frames). These details are load-bearing for attributing the observed gaps specifically to missing gaze-temporal reasoning capacity rather than task construction.
minor comments (2)
  1. Abstract and §5: The specific gaze prompting strategies analyzed could be enumerated more explicitly (e.g., raw coordinates vs. region masks) to aid reader understanding of the reported insights.
  2. [Results tables] Table 1 or equivalent results table: Ensure all evaluated MLLMs are listed with exact model versions, input resolutions, and whether gaze is provided as text coordinates, visual overlays, or both.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects for strengthening the attribution of performance gaps to gaze utilization and for improving clarity on task construction. We address each point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: §3 (QA Generation Pipeline): The headline claim of substantial gaps highlighting limitations in gaze-based temporal reasoning rests on the assumption that fixation extraction + region-specific prompting + scanpath construction produces QA pairs whose difficulty is attributable to gaze utilization. The manuscript should include an ablation or control experiment comparing model performance on the same questions with vs. without gaze signals (or with randomized gaze) to rule out incidental properties of the generation process such as unnatural phrasing or region boundaries.

    Authors: We agree that a direct control experiment is necessary to isolate the contribution of gaze signals from potential artifacts in the QA generation pipeline. While the current manuscript analyzes multiple gaze prompting strategies and failure modes, it does not include an explicit ablation with randomized gaze or no-gaze conditions on identical QA pairs. In the revised version, we will add this experiment, reporting model performance across the three conditions (original gaze, randomized gaze, and no gaze) to confirm that the observed gaps stem from challenges in gaze-guided temporal reasoning. revision: yes

  2. Referee: §4 (Task Definitions and Human Baselines): It is unclear whether the past/present/proactive task formulations avoid future-frame leakage in the streaming setting and whether human baselines control for gaze visibility (i.e., do humans see the raw gaze overlay or only the video frames). These details are load-bearing for attributing the observed gaps specifically to missing gaze-temporal reasoning capacity rather than task construction.

    Authors: We appreciate the need for explicit clarification on these design choices. The task formulations are strictly streaming: models receive only frames up to the current timestamp with no access to future frames, preventing leakage by construction. Human baselines were collected with participants viewing the egocentric video frames overlaid with the raw gaze trajectory to match the multimodal input given to MLLMs. We will expand §4 with a dedicated subsection detailing the streaming protocol, frame access constraints, and human evaluation setup to remove any ambiguity. revision: yes
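
The frame-access constraint described in the second response can be made concrete with a small sketch. It illustrates the general idea only (a query at time t may see items with timestamps at or before t) and is not the authors' protocol. `StreamItem`, `visible_context`, and the optional `window` parameter are hypothetical names.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class StreamItem:
    t: float                    # timestamp in seconds
    frame_id: int               # index of the video frame
    gaze: Tuple[float, float]   # normalized gaze point for that frame


def visible_context(
    stream: Sequence[StreamItem],
    query_t: float,
    window: Optional[float] = None,
) -> List[StreamItem]:
    """Return the items a streaming model may see for a query at query_t.

    `stream` must be sorted by timestamp. Items after query_t are never
    returned, so future frames cannot leak into the model input. If
    `window` is given, only the last `window` seconds are kept.
    """
    times = [item.t for item in stream]
    hi = bisect_right(times, query_t)  # first index with t > query_t
    lo = 0 if window is None else bisect_left(times, query_t - window)
    return list(stream[lo:hi])
```

Applying the same slicing to the material shown to human annotators would keep the human baseline under the same access constraint as the models.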

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

full rationale

The paper introduces StreamGaze as a new benchmark with a gaze-video QA generation pipeline based on fixation extraction, region-specific prompting, and scanpath construction applied to egocentric videos. No mathematical derivations, equations, or fitted parameters are presented as predictions. Central claims rest on empirical performance gaps between MLLMs and humans on the newly constructed tasks, without reduction to self-citations, self-definitions, or ansatzes from prior author work. The pipeline is described as a novel contribution that produces the evaluation data, making the work self-contained against external benchmarks rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central contribution rests on domain assumptions about gaze data rather than new mathematical axioms or fitted parameters. No free parameters are introduced. No new physical or invented entities are postulated.

axioms (2)
  • domain assumption Raw gaze trajectories from egocentric videos can be processed via fixation extraction and scanpath construction to produce spatio-temporally grounded QA pairs that reflect human perceptual dynamics.
    Invoked in the QA generation pipeline description; this assumption underpins the validity of all downstream tasks and model evaluations.
  • domain assumption Gaze signals provide non-redundant information that enables models to follow shifting attention and infer user intentions from past and current frames alone.
    Central to the definition of gaze-guided past, present, and proactive tasks.
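
The first axiom depends on turning fixations into scanpaths. A minimal sketch of one plausible construction follows. It assumes each fixation already carries an object label, which the pipeline is described as obtaining through region-specific visual prompting; here the labels are simply given. It merges consecutive fixations on the same object and renders the result in the arrow notation of the extracted example question ("lid → spoon → bottle"). All type and function names are hypothetical.

```python
from itertools import groupby
from typing import List, NamedTuple, Sequence


class LabeledFixation(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    label: str    # object under the gaze point; assumed to be given


class ScanpathStep(NamedTuple):
    label: str
    start: float  # start of the first fixation in the run
    end: float    # end of the last fixation in the run
    dwell: float  # total fixation time on the object within the run


def build_scanpath(fixations: Sequence[LabeledFixation]) -> List[ScanpathStep]:
    """Merge consecutive fixations on the same object into scanpath steps."""
    ordered = sorted(fixations, key=lambda f: f.start)
    steps: List[ScanpathStep] = []
    for label, group in groupby(ordered, key=lambda f: f.label):
        run = list(group)
        dwell = sum(f.end - f.start for f in run)
        steps.append(ScanpathStep(label, run[0].start, run[-1].end, dwell))
    return steps


def format_scanpath(steps: Sequence[ScanpathStep], last_k: int = 3) -> str:
    """Render the most recent last_k steps as 'a → b → c'."""
    if last_k < 1:
        raise ValueError("last_k must be at least 1")
    return " → ".join(step.label for step in steps[-last_k:])
```

Because only consecutive repeats are merged, a return to an earlier object appears as a new step, which preserves the order of attention shifts.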

pith-pipeline@v0.9.0 · 5605 in / 1384 out tokens · 48852 ms · 2026-05-17T02:44:28.076302+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Object Referring-Guided Scanpath Prediction with Perception-Enhanced Vision-Language Models

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    ScanVLA uses a vision-language model with a history-enhanced decoder and frozen segmentation LoRA to outperform prior methods on object-referring scanpath prediction.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

  2. [2]

    Videollm-online: Online video large language model for streaming video

    Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. Videollm-online: Online video large language model for streaming video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18407–18418, 2024.

  3. [3]

    VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

    Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li, et al. Vita-1.5: Towards gpt-4o level real-time vision and speech interaction. arXiv preprint arXiv:2501.01957,

  4. [4]

    Vispeak: Visual instruction feedback in streaming videos

    Shenghao Fu, Qize Yang, Yuan-Ming Li, Yi-Xing Peng, Kun-Yu Lin, Xihan Wei, Jian-Fang Hu, Xiaohua Xie, and Wei-Shi Zheng. Vispeak: Visual instruction feedback in streaming videos. arXiv preprint arXiv:2503.12769, 2025.

  5. [5]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022.

  6. [6]

    Egoexolearn: A dataset for bridging asynchronous ego-and exo-centric view of procedural activities in real world

    Yifei Huang, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, Limin Wang, et al. Egoexolearn: A dataset for bridging asynchronous ego-and exo-centric view of procedural activities in real world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22072–22086, 2024.

  7. [7]

    Gazevqa: a video question answering dataset for multiview eye-gaze task-oriented collaborations

    Muhammet Ilaslan, Chenan Song, Joya Chen, Difei Gao, Weixian Lei, Qianli Xu, Joo Lim, and Mike Shou. Gazevqa: a video question answering dataset for multiview eye-gaze task-oriented collaborations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10462–10479, 2023.

  8. [8]

    A gaze-grounded visual question answering dataset for clarifying ambiguous japanese questions

    Shun Inadumi, Seiya Kawano, Akishige Yuguchi, Yasutomo Kawanishi, and Koichiro Yoshino. A gaze-grounded visual question answering dataset for clarifying ambiguous japanese questions. arXiv preprint arXiv:2403.17545, 2024.

  9. [9]

    Foveal vision anticipates defining features of eye movement targets

    Lisa M Kroell and Martin Rolfs. Foveal vision anticipates defining features of eye movement targets. Elife, 11:e78106,

  10. [10]

    Listen to look into the future: Audio-visual egocentric gaze anticipation

    Bolin Lai, Fiona Ryan, Wenqi Jia, Miao Liu, and James M Rehg. Listen to look into the future: Audio-visual egocentric gaze anticipation. In European Conference on Computer Vision, pages 192–210. Springer, 2024.

  11. [11]

    In the eye of the beholder: Gaze and actions in first person video

    Yin Li, Miao Liu, and James M Rehg. In the eye of the beholder: Gaze and actions in first person video. IEEE transactions on pattern analysis and machine intelligence, 45(6): 6731–6747, 2021.

  12. [12]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.

  13. [13]

    Streamingbench: Assessing the gap for mllms to achieve streaming video understanding

    Junming Lin, Zheng Fang, Chi Chen, Zihao Wan, Fuwen Luo, Peng Li, Yang Liu, and Maosong Sun. Streamingbench: Assessing the gap for mllms to achieve streaming video understanding. arXiv preprint arXiv:2411.03628, 2024.

  14. [14]

    Kangaroo: A powerful video-language model supporting long-context video input

    Jiajun Liu, Yibing Wang, Hanghang Ma, Xiaoping Wu, Xiaoqi Ma, Xiaoming Wei, Jianbin Jiao, Enhua Wu, and Jie Hu. Kangaroo: A powerful video-language model supporting long-context video input. arXiv preprint arXiv:2408.15542,

  15. [15]

    Ovo-bench: How far is your video-llms from real-world online video understanding?

    Junbo Niu, Yifei Li, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, et al. Ovo-bench: How far is your video-llms from real-world online video understanding? In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18902–18913, 2025.

  16. [16]

    Visual search in naturalistic scenes from foveal to peripheral vision: A comparison between dynamic and static displays

    Antje Nuthmann and Teresa Canas-Bajo. Visual search in naturalistic scenes from foveal to peripheral vision: A comparison between dynamic and static displays. Journal of Vision, 22(1):10–10, 2022.

  17. [17]

    Gpt-4 technical report

    OpenAI. Gpt-4 technical report, 2024.

  18. [18]

    Introducing claude 4

    Anthropic PBC. Introducing claude 4. Online blog post, 2025. https://www.anthropic.com/news/claude-4 (accessed 2025-11-14).

  19. [19]

    In the eye of mllm: Benchmarking egocentric video intent understanding with gaze-guided prompting

    Taiying Peng, Jiacheng Hua, Miao Liu, and Feng Lu. In the eye of mllm: Benchmarking egocentric video intent understanding with gaze-guided prompting. arXiv preprint arXiv:2509.07447, 2025.

  20. [20]

    Hd-epic: A highly-detailed egocentric video dataset

    Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Kumar Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, et al. Hd-epic: A highly-detailed egocentric video dataset. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23901–23913, 2025.

  21. [21]

    Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction

    Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24045–24055, 2025.

  22. [22]

    Histogram correlation for video scene change detection

    Nisreen I. Radwan, Nancy M. Salem, and Mohamed I. El Adawy. Histogram correlation for video scene change detection. In Advances in Computer Science, Engineering & Applications: Proceedings of the Second International Conference on Computer Science, Engineering and Applications (ICCSEA 2012), May 25–27, 2012, New Delhi, India, Volume 1, pages 647–655, B...

  23. [23]

    Gazellm: Multimodal llms incorporating human visual attention

    Jun Rekimoto. Gazellm: Multimodal llms incorporating human visual attention, 2025.

  24. [24]

    A review of interactions between peripheral and foveal vision

    Emma EM Stewart, Matteo Valsecchi, and Alexander C Schütz. A review of interactions between peripheral and foveal vision. Journal of vision, 20(12):2–2, 2020.

  25. [25]

    Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, 2024.

  26. [26]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.

  27. [27]

    Holoassist: an egocen- tric human interaction dataset for interactive ai assistants in the real world

    Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bu- gra Tekin, Felipe Vieira Frujeri, et al. Holoassist: an egocen- tric human interaction dataset for interactive ai assistants in the real world. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20270–20281, 2023. 3, 1, 2, 5

  28. [28]

    Omnimmi: A comprehensive multi- modal interaction benchmark in streaming video contexts

    Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, and Zilong Zheng. Omnimmi: A comprehensive multi- modal interaction benchmark in streaming video contexts. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18925–18935, 2025. 2, 3

  29. [29]

    Streaming video under- standing and multi-round interaction with memory-enhanced knowledge.arXiv preprint arXiv:2501.13468, 2025

    Haomiao Xiong, Zongxin Yang, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, and Huchuan Lu. Streaming video under- standing and multi-round interaction with memory-enhanced knowledge.arXiv preprint arXiv:2501.13468, 2025. 2, 3

  30. [30]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 3, 4

  31. [31]

    Svbench: A benchmark with temporal multi-turn dialogues for streaming video understanding

    Zhenyu Yang, Yuhang Hu, Zemin Du, Dizhan Xue, Shengsheng Qian, Jiahong Wu, Fan Yang, Weiming Dong, and Changsheng Xu. Svbench: A benchmark with temporal multi-turn dialogues for streaming video understanding. arXiv preprint arXiv:2502.10810, 2025.

  32. [32]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024.

  33. [33]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.

  34. [34]

    Flash-vstream: Memory-based real-time understanding for long video streams

    Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, and Xiaojie Jin. Flash-vstream: Memory-based real-time understanding for long video streams, 2024.

  35. [35]

    Deep future gaze: Gaze anticipation on egocentric videos using adversarial networks

    Mengmi Zhang, Keng Teck Ma, Joo Hwee Lim, Qi Zhao, and Jiashi Feng. Deep future gaze: Gaze anticipation on egocentric videos using adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4372–4381, 2017.

  36. [36]

    Proactive assistant dialogue generation from streaming egocentric videos

    Yichi Zhang, Xin Luna Dong, Zhaojiang Lin, Andrea Madotto, Anuj Kumar, Babak Damavandi, Joyce Chai, and Seungwhan Moon. Proactive assistant dialogue generation from streaming egocentric videos. arXiv preprint arXiv:2506.05904, 2025.

    STREAMGAZE: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos Supplementary Materia...

    Prompt for text-based reasoning

    Only use visual evidence up to the given timestamp

    Give exactly ONE final choice (A/B/C/D) or yes/no with its short text inside <answer>...</answer>.

    {Question}

    Figure 19. Prompt for text-based reasoning.

    Prompt for gaze-based reasoning

    You are an expert at gaze-conditioned streaming video reasoning. Rules:

    Do your step-by-step reasoning ONLY inside <think>...</think>

    Return exactly three tags in order: <gaze>...</gaze> <think>...</think> <answer>...</answer>

    In <gaze>, estimate the most recent gaze center and FOV radius from the green dot and red circle in the frames up to the timestamp

    If the green dot/Red circle is not visible, output "unknown" in <gaze> but still reason and answer using other available evidence

    In <answer>, give exactly ONE final choice (A/B/C/D) or Yes/No with a short text.

    {Question}

    Figure 20. Prompt for gaze-based reasoning.

    Prompt for visual-based reasoning

    You are an expert at gaze-conditioned streaming video reasoning. Rules:

    Only use visual evidence up to the given timestamp (no future leakage)

    Produce your reasoning ONLY inside <think>...</think> and never reveal reasoning outside the tag

    Return exactly FOUR tags in this order: <gaze>...</gaze> <objects>...</objects> <think>...</think> <answer>...</answer>

    In <gaze>, estimate the most recent gaze center and FOV radius from the green dot and red circle, or output "unknown" if not visible

    In <objects>, ground the most relevant objects inside or closest to the gaze/FOV region; for each object provide: object_name, bbox_x1, bbox_y1, bbox_x2, bbox_y2 normalized to [0,1], or output "none" if no relevant object appears

    In <think>, use the gaze and grounded objects to perform step-by-step reasoning and explain exclusion of irrelevant options

    In <answer>, provide exactly ONE final label (A/B/C/D or Yes/No) with a short justification.

    {Question}

    Figure 21. Prompt for visual-based reasoning.

    Figure 22. HTML for human verification of STREAMGAZE data construction.

    Figure 23. HTML for human oracle evaluation of STREAMGAZE.

    Figure 24. Instructions provided to human annotators.
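The tagged output format requested by the gaze-based and visual-based prompts (Figures 20 and 21) lends itself to a mechanical check before scoring. The following is a minimal sketch of such a check, not the authors' evaluation code. The function names (`parse_tagged_response`, `extract_label`, `parse_objects`) are hypothetical, and the assumption that each object in `<objects>` is written as one comma-separated line is ours, since the prompt lists the fields but does not fix a serialization.

```python
import re
from typing import Optional

# Tag order required by the visual-based prompt (Figure 21); the gaze-based
# prompt (Figure 20) asks for the same order without <objects>.
VISUAL_TAGS = ("gaze", "objects", "think", "answer")
GAZE_TAGS = ("gaze", "think", "answer")


def parse_tagged_response(text: str, tags=VISUAL_TAGS) -> Optional[dict]:
    """Return {tag: content} if every tag occurs exactly once, in order,
    with nothing but whitespace outside the tags; otherwise return None."""
    for t in tags:
        if text.count(f"<{t}>") != 1 or text.count(f"</{t}>") != 1:
            return None
    pattern = r"\s*".join(rf"<{t}>(.*?)</{t}>" for t in tags)
    match = re.fullmatch(r"\s*" + pattern + r"\s*", text, flags=re.DOTALL)
    if match is None:
        return None
    return {t: g.strip() for t, g in zip(tags, match.groups())}


def extract_label(answer: str) -> Optional[str]:
    """Pull the leading choice (A/B/C/D, or yes/no in any case) from <answer>."""
    m = re.match(r"\s*\(?([A-D]|(?i:yes|no))\)?(?![A-Za-z])", answer)
    if m is None:
        return None
    label = m.group(1)
    return label if len(label) == 1 else label.capitalize()


def parse_objects(block: str) -> list:
    """Parse <objects> content into (name, x1, y1, x2, y2) tuples.

    Assumption: one object per line, fields separated by commas.
    Lines that do not fit, or whose box leaves [0, 1], are skipped.
    """
    if block.strip().lower() == "none":
        return []
    objects = []
    for line in block.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 5:
            continue
        try:
            x1, y1, x2, y2 = (float(p) for p in parts[1:])
        except ValueError:
            continue
        if 0.0 <= x1 <= x2 <= 1.0 and 0.0 <= y1 <= y2 <= 1.0:
            objects.append((parts[0], x1, y1, x2, y2))
    return objects
```

Passing `GAZE_TAGS` as the second argument of `parse_tagged_response` checks the three-tag variant. Rejecting a malformed response outright, rather than attempting a best-effort repair, keeps format-following errors separate from reasoning errors when analyzing failures.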