pith. machine review for the scientific record. sign in

arxiv: 2605.07575 · v2 · submitted 2026-05-08 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link

· Lean Theorem

Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

Authors on Pith no claims yet

Pith reviewed 2026-05-12 02:49 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords scene graphstreaming videoproactive responseVideo-LLMresponse timingretrieval augmentedvideo understanding
0
0 comments X

The pith

Response-G1 aligns video evidence with query conditions through explicit scene graphs to decide response timing in streaming video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Response-G1 as a framework that creates explicit alignment between accumulated video evidence and a query's response conditions by representing both in scene graphs. It processes streaming video in three stages without any fine-tuning: generating query-guided scene graphs from incoming clips, retrieving the most relevant past graphs from memory, and prompting a decision to stay silent or respond on each frame. A reader would care because prior Video-LLM methods handle evidence implicitly and without reference to the query, which leads to unreliable timing in proactive settings, whereas the graph structure makes the decisions both more accurate and easier to interpret.

Core claim

Response-G1 establishes explicit, structured alignment between the accumulated video evidence and the query's expected response conditions via scene graphs. The framework operates in three fine-tuning-free stages: (1) online query-guided scene graph generation from streaming clips; (2) memory-based retrieval of the most semantically relevant historical scene graphs; and (3) retrieval-augmented trigger prompting for per-frame silence/response decisions. By grounding both evidence and conditions in a shared graph representation, Response-G1 achieves more interpretable and accurate response timing decisions.

What carries the argument

Explicit scene graph modeling that grounds both video evidence and query response conditions in a shared graph representation to enable memory retrieval and trigger prompting.

If this is right

  • The method shows higher accuracy than prior approaches on both proactive and reactive streaming video tasks.
  • Decisions become more interpretable because they rest on explicit graph alignments rather than hidden representations.
  • No fine-tuning is required because the three stages rely on generation, retrieval, and prompting.
  • Performance gains hold across established benchmarks for streaming video understanding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If reliable online scene graph generators become available for new domains, the same three-stage structure could apply to other timing-sensitive multimodal tasks such as live captioning or robotic monitoring.
  • The memory retrieval step suggests a path toward longer-term video memory in Video-LLMs beyond single-clip processing.
  • Replacing the scene graph generator with a stronger model would directly test whether the claimed gains scale with graph quality.

Load-bearing premise

Online query-guided scene graph generation from streaming clips works reliably and memory retrieval of historical graphs consistently surfaces relevant evidence for the timing decision.

What would settle it

An experiment in which query-guided scene graph generation produces noisy or incomplete graphs on a benchmark clip set, or memory retrieval returns irrelevant graphs, resulting in response timing accuracy no better than implicit baselines.

Figures

Figures reproduced from arXiv: 2605.07575 by Bin Guo, Jiaqi Tang, Ke Ma, Qifeng Chen, Qingfeng He, Ruonan Xu, Xueting Han, Xu Wang, Yunhao Liu, Zhiwen Yu, Ziheng Wang.

Figure 1
Figure 1. Figure 1: Existing proactive mechanisms in streaming [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the Response-G1 framework. The system processes streaming video through three core components: (1) Online Query-Guided Scene Graph Generation, (2) Memory-Based Scene Graph Retrieval, and (3) Retrieval-Augmented Streaming Pipeline for proactive decision-making. the observed evidence in F1:t satisfies the response conditions implicit in Qtask , outputting a proactive action rt ∈ R = {silence, res… view at source ↗
Figure 3
Figure 3. Figure 3: Performance of different K Values for top-K [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Case study of Response-G1 on the CRR subtask in OVO-Bench. The user query describes a target object (“the boy wearing a red T-shirt”) and a relation (“talking with others”). The results show that at time “18:51”, Response-G1 accurately retrieves query-relevant scene graphs (i.e., evidence) and triggers a response, whereas the baselines fail to respond throughout the video stream. Strategies Proactive Subta… view at source ↗
Figure 5
Figure 5. Figure 5: Prompt template for query-guided online scene graph generation. [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompt template for query parsing [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Prompt template for the original trigger on the CRR subtask in OVO-Bench. [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Prompt template for Response-G1’s trigger on the CRR subtask in OVO-Bench. {query} Is it the right time to output \"{ground_truth_output}\"? You can only answer yes or no. Prompts [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Prompt template for the original trigger on the PO subtask in StreamingBench. [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Prompt template for Response-G1’s trigger on the PO subtask in StreamingBench [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
read the original abstract

Proactive streaming video understanding requires Video-LLMs to decide when to respond as a video unfolds, a task where existing methods often fall short due to their implicit, query-agnostic modeling of visual evidence. We introduce Response-G1, a novel framework that establishes explicit, structured alignment between the accumulated video evidence and the query's expected response conditions via scene graphs. The framework operates in three fine-tuning-free stages: (1) online query-guided scene graph generation from streaming clips; (2) memory-based retrieval of the most semantically relevant historical scene graphs; and (3) retrieval-augmented trigger prompting for per-frame "silence/response" decisions. By grounding both evidence and conditions in a shared graph representation, Response-G1 achieves more interpretable and accurate response timing decisions. Experimental results on established benchmarks demonstrate the superiority of our method in both proactive and reactive tasks, validating the advantage of explicit scene graph modeling and retrieval in streaming video understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Response-G1, a three-stage framework for proactive streaming video understanding in Video-LLMs. It performs online query-guided scene graph generation from streaming clips, memory-based retrieval of semantically relevant historical scene graphs, and retrieval-augmented prompting to decide per-frame whether to output a response or remain silent. By grounding both visual evidence and query response conditions in an explicit shared scene-graph representation, the method claims to deliver more interpretable and accurate trigger decisions than implicit, query-agnostic baselines, with reported superiority on established benchmarks for both proactive and reactive tasks.

Significance. If the empirical claims hold, the work offers a structured alternative to black-box modeling in streaming video understanding, leveraging scene graphs for explicit alignment between evidence and query conditions. The fine-tuning-free pipeline and emphasis on retrieval-augmented interpretability represent a concrete advance that could improve reliability in applications such as real-time monitoring or interactive video systems.

major comments (1)
  1. Experimental Results section: The manuscript asserts experimental superiority on benchmarks yet provides no details on the specific datasets, baseline methods, evaluation metrics, error bars, statistical tests, or ablation studies. This absence leaves the central claim of improved response timing without visible empirical support and prevents assessment of whether the gains are attributable to the scene-graph grounding or to other factors.
minor comments (2)
  1. The description of the memory-based retrieval step would benefit from an explicit algorithmic outline or pseudocode to clarify how semantic relevance is computed and how historical graphs are stored and queried.
  2. A pipeline diagram illustrating the three stages, the flow of streaming clips, and the per-frame decision process would improve readability and help readers follow the retrieval-augmented prompting mechanism.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback and the opportunity to improve the manuscript. We address the major comment below.

read point-by-point responses
  1. Referee: Experimental Results section: The manuscript asserts experimental superiority on benchmarks yet provides no details on the specific datasets, baseline methods, evaluation metrics, error bars, statistical tests, or ablation studies. This absence leaves the central claim of improved response timing without visible empirical support and prevents assessment of whether the gains are attributable to the scene-graph grounding or to other factors.

    Authors: We agree that the Experimental Results section requires substantially more detail to support the claims of superiority. In the revised manuscript we will expand this section to specify the exact datasets and benchmarks used for both proactive and reactive tasks, describe all baseline methods in full, list the evaluation metrics, report error bars from multiple runs, include statistical significance tests, and present comprehensive ablation studies that isolate the contributions of query-guided scene graph generation, memory-based retrieval, and retrieval-augmented prompting. These additions will make the empirical support explicit and allow readers to assess the role of explicit scene-graph grounding. revision: yes

Circularity Check

0 steps flagged

No significant circularity; procedural framework with no derivations

full rationale

The paper presents Response-G1 as a three-stage procedural pipeline (online query-guided scene graph generation from streaming clips; memory-based retrieval of historical scene graphs; retrieval-augmented trigger prompting) without any equations, parameter fittings, or mathematical derivations. Claims of improved interpretability and accuracy rest on explicit shared graph representations and external benchmark comparisons rather than self-referential definitions or predictions that reduce to inputs by construction. No self-citation chains or ansatzes are invoked as load-bearing elements in the provided description, rendering the method self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that scene graphs form a sufficient shared representation for both accumulated video evidence and query response conditions; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption Scene graphs generated from streaming clips can be aligned with query-expected response conditions in a way that supports accurate silence/response decisions
    Invoked by the design of the three stages and the claim of improved interpretability and accuracy.

pith-pipeline@v0.9.0 · 5492 in / 1229 out tokens · 59554 ms · 2026-05-12T02:49:59.533787+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 13 internal anchors

  1. [1]

    Aho and Jeffrey D

    Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

  2. [2]

    Publications Manual , year = "1983", publisher =

  3. [3]

    Chandra and Dexter C

    Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

  4. [4]

    Scalable training of

    Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

  5. [5]

    Dan Gusfield , title =. 1997

  6. [6]

    Tetreault , title =

    Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

  7. [7]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

    Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

  8. [8]

    Proceedings of the 33rd ACM International Conference on Multimedia , pages=

    Timechat-online: 80\ author=. Proceedings of the 33rd ACM International Conference on Multimedia , pages=

  9. [9]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Streaming dense video captioning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  10. [10]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Videollm-online: Online video large language model for streaming video , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  11. [11]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    Lion-fs: Fast & slow video-language thinker as online video assistant , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  12. [12]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  13. [13]

    arXiv preprint arXiv:2505.05467 , year=

    StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant , author=. arXiv preprint arXiv:2505.05467 , year=

  14. [14]

    StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding

    Streamagent: Towards anticipatory agents for streaming video understanding , author=. arXiv preprint arXiv:2508.01875 , year=

  15. [15]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  16. [16]

    Neurocomputing , volume=

    Scene graph generation: A comprehensive survey , author=. Neurocomputing , volume=. 2024 , publisher=

  17. [17]

    arXiv preprint arXiv:2510.14359 , year=

    AI for Service: Proactive Assistance with AI Glasses , author=. arXiv preprint arXiv:2510.14359 , year=

  18. [18]

    arXiv preprint arXiv:2510.14560 , year=

    Eyes wide open: Ego proactive video-llm for streaming video , author=. arXiv preprint arXiv:2510.14560 , year=

  19. [19]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution , author=. arXiv preprint arXiv:2409.12191 , year=

  20. [20]

    arXiv preprint arXiv:2501.13468 , year=

    Streaming video understanding and multi-round interaction with memory-enhanced knowledge , author=. arXiv preprint arXiv:2501.13468 , year=

  21. [21]

    arXiv preprint arXiv:2509.24871 , year=

    Streamforest: Efficient online video understanding with persistent event memory , author=. arXiv preprint arXiv:2509.24871 , year=

  22. [22]

    arXiv preprint arXiv:2503.00540 , year=

    Streaming video question-answering with in-context video kv-cache retrieval , author=. arXiv preprint arXiv:2503.00540 , year=

  23. [23]

    arXiv preprint arXiv:2506.23825 , year=

    Flash-VStream: Efficient Real-Time Understanding for Long Video Streams , author=. arXiv preprint arXiv:2506.23825 , year=

  24. [24]

    Advances in Neural Information Processing Systems , volume=

    Streaming long video understanding with large language models , author=. Advances in Neural Information Processing Systems , volume=

  25. [25]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Videollama 3: Frontier multimodal foundation models for image and video understanding , author=. arXiv preprint arXiv:2501.13106 , year=

  26. [26]

    Qwen3 Technical Report

    Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

  27. [27]

    Stream- ingbench: Assessing the gap for mllms to achieve streaming video understanding.CoRR, abs/2411.03628, 2024

    Streamingbench: Assessing the gap for mllms to achieve streaming video understanding , author=. arXiv preprint arXiv:2411.03628 , year=

  28. [28]

    European Conference on Computer Vision , pages=

    ActionSwitch: Class-Agnostic Detection of Simultaneous Actions in Streaming Videos , author=. European Conference on Computer Vision , pages=. 2024 , organization=

  29. [29]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  30. [30]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    Online Video Understanding: OVBench and VideoChat-Online , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  31. [31]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Learning situation hyper-graphs for video question answering , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  32. [32]

    IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

    Contrastive video question answering via video graph transformer , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=. 2023 , publisher=

  33. [33]

    Forty-second International Conference on Machine Learning , year=

    Fine-Grained Captioning of Long Videos through Scene Graph Consolidation , author=. Forty-second International Conference on Machine Learning , year=

  34. [34]

    Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

    Image retrieval using scene graphs , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

  35. [35]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops , pages=

    Structured query-based image retrieval using scene graphs , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops , pages=

  36. [36]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Image-to-image retrieval by learning similarity between scene graphs , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  37. [37]

    Proceedings of the IEEE/CVF international conference on computer vision , pages=

    Spatial-temporal transformer for dynamic scene graph generation , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

  38. [38]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Hig: Hierarchical interlacement graph approach to scene graph generation in video understanding , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  39. [39]

    Advances in neural information processing systems , volume=

    Faster r-cnn: Towards real-time object detection with region proposal networks , author=. Advances in neural information processing systems , volume=

  40. [40]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    From pixels to graphs: Open-vocabulary scene graph generation with vision-language models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  41. [41]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Llm meets scene graph: Can large language models understand and generate scene graphs? a benchmark and empirical study , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  42. [42]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    Hyperglm: Hypergraph for video scene graph generation and anticipation , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  43. [43]

    Advances in Neural Information Processing Systems , volume=

    Et bench: Towards open-ended event-level video-language understanding , author=. Advances in Neural Information Processing Systems , volume=

  44. [44]

    Qwen3-VL Technical Report

    Qwen3-VL Technical Report , author=. arXiv preprint arXiv:2511.21631 , year=

  45. [45]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context , author=. arXiv preprint arXiv:2403.05530 , year=

  46. [46]

    GPT-4o System Card

    Gpt-4o system card , author=. arXiv preprint arXiv:2410.21276 , year=

  47. [47]

    2024 , howpublished =

    Claude 3.5 Sonnet , author =. 2024 , howpublished =

  48. [48]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms , author=. arXiv preprint arXiv:2406.07476 , year=

  49. [49]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Vila: On pre-training for visual language models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  50. [50]

    Video-ccam: Enhancing video-language un- derstanding with causal cross-attention masks for short and long videos.arXiv preprint arXiv:2408.14023, 2024

    Video-ccam: Enhancing video-language understanding with causal cross-attention masks for short and long videos , author=. arXiv preprint arXiv:2408.14023 , year=

  51. [51]

    Long Context Transfer from Language to Vision

    Long context transfer from language to vision , author=. arXiv preprint arXiv:2406.16852 , year=

  52. [52]

    Science China Information Sciences , volume=

    How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites , author=. Science China Information Sciences , volume=. 2024 , publisher=

  53. [53]

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee

    Kangaroo: A powerful video-language model supporting long-context video input , author=. arXiv preprint arXiv:2408.15542 , year=

  54. [54]

    Llavanext: Improved reasoning, ocr, and world knowledge , author=

  55. [55]

    MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    Minicpm: Unveiling the potential of small language models with scalable training strategies , author=. arXiv preprint arXiv:2404.06395 , year=

  56. [56]

    LLaVA-OneVision: Easy Visual Task Transfer

    Llava-onevision: Easy visual task transfer , author=. arXiv preprint arXiv:2408.03326 , year=

  57. [57]

    Qwen2.5-VL Technical Report

    Qwen2. 5-vl technical report , author=. arXiv preprint arXiv:2502.13923 , year=

  58. [58]

    Longvu: Spatiotemporal adaptive compression for long video-language understanding.arXiv preprint arXiv:2410.17434, 2024

    Longvu: Spatiotemporal adaptive compression for long video-language understanding , author=. arXiv preprint arXiv:2410.17434 , year=

  59. [59]

    Timemarker: A versatile video-llm for long and short video understanding with superior temporal localization ability

    Timemarker: A versatile video-llm for long and short video understanding with superior temporal localization ability , author=. arXiv preprint arXiv:2411.18211 , year=

  60. [60]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  61. [61]

    Frontiers of Computer Science , volume=

    Cascade context-oriented spatio-temporal attention network for efficient and fine-grained video-grounded dialogues , author=. Frontiers of Computer Science , volume=. 2025 , publisher=

  62. [62]

    ACM Transactions on Information Systems , volume=

    Enabling harmonious human-machine interaction with visual-context augmented dialogue system: A review , author=. ACM Transactions on Information Systems , volume=. 2025 , publisher=

  63. [63]

    Advances in Neural Information Processing Systems , volume=

    Hawk: Learning to understand open-world video anomalies , author=. Advances in Neural Information Processing Systems , volume=

  64. [64]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Surgeon: Memory-adaptive fully test-time adaptation via dynamic activation sparsity , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  65. [65]

    IEEE Transactions on Mobile Computing , year=

    AdaShift: Anti-Collapse and Real-Time Deep Model Evolution for Mobile Vision Applications , author=. IEEE Transactions on Mobile Computing , year=

  66. [66]

    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

    Llava-onevision-1.5: Fully open framework for democratized multimodal training , author=. arXiv preprint arXiv:2509.23661 , year=