arxiv: 2605.26104 · v1 · pith:Y3HESH5Rnew · submitted 2026-05-25 · 💻 cs.CV

EVIDENT: Routing MLLM Adaptation through Entity-Grounded Visual Evidence for Cross-Domain Video Temporal Grounding

Pith reviewed 2026-06-29 22:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords video temporal grounding · MLLM adaptation · entity grounding · cross-domain robustness · parameter-efficient fine-tuning · visual evidence routing

The pith

Routing MLLM adaptation through entity-grounded visual evidence improves cross-domain video temporal grounding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that fine-tuning multimodal large language models (MLLMs) for video temporal grounding (VTG) succeeds in-domain but collapses under domain shift, because visual changes break the model's ability to link its learned timing knowledge to its built-in attention to objects. EVIDENT addresses this by forcing adaptation to pass through explicit visual entity evidence instead of letting the model rely on dataset-specific shortcuts. It does so with three parts: an adapter that compresses visual tokens into entity slots, a distillation step that teaches those slots to represent coherent objects, and a gating step that uses the resulting entities to guide moment localization. The paper reports stronger out-of-domain results while keeping in-domain accuracy competitive and adding only a modest number of parameters. Readers should care because the method offers a lightweight way to make video localization models transfer across visual styles without retraining from scratch.

Core claim

EVIDENT anchors temporal grounding in the inherent entity-attention of pre-trained MLLMs by routing VTG adaptation through explicit visual entity evidence, using an Entity Bottleneck Adapter to create compact entity-level slots, an Entity-Binding Distillation loss to instill objectness priors, and an Entity-to-eVidence gating mechanism to steer localization toward query-relevant entities, thereby enabling fine-tuning to rely on entity-grounded evidence rather than brittle dataset shortcuts.

What carries the argument

Entity Bottleneck Adapter that compresses dense visual tokens into compact entity-level slots, paired with Entity-Binding Distillation loss and Entity-to-eVidence gating to route adaptation through captured entities.
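
The paper describes the Entity Bottleneck Adapter as decomposing visual tokens into entity-level slots via slot attention (Figure 4a). The sketch below illustrates generic slot attention in the spirit of Locatello et al. [31]; it is not the paper's adapter. It omits the learned query/key/value projections, normalization layers, and GRU update, and the name `slot_attention` and its parameters are illustrative assumptions.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def slot_attention(tokens, num_slots=4, iters=3, seed=0, eps=1e-8):
    """Group `tokens` (a list of d-dim vectors) into `num_slots` slot vectors.

    Returns (slots, assign). assign[i][k] is the share of token i claimed by
    slot k in the last iteration; each row sums to 1 because the softmax runs
    over the slot axis, so slots compete for tokens.
    """
    rng = random.Random(seed)
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    # Assumption: slots start from random Gaussian vectors.
    slots = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_slots)]
    assign = []
    for _ in range(iters):
        assign = [softmax([scale * dot(t, s) for s in slots]) for t in tokens]
        new_slots = []
        for k in range(num_slots):
            weights = [row[k] for row in assign]
            norm = sum(weights) + eps
            # Simplification: a weighted mean of tokens stands in for the
            # learned GRU update used in full slot attention.
            new_slots.append([
                sum(w * t[j] for w, t in zip(weights, tokens)) / norm
                for j in range(dim)
            ])
        slots = new_slots
    return slots, assign
```

The softmax over slots rather than over tokens is what makes this a bottleneck: every token must distribute its mass among a small, fixed number of slots.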

If this is right

  • EVIDENT raises out-of-domain robustness on cross-domain VTG benchmarks while matching in-domain performance.
  • The approach adds only modest parameter overhead.
  • Entity-level grounding functions as an inductive bias that supports generalizable temporal localization.
  • Adaptation no longer depends on dataset shortcuts that fail under visual shift.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same entity-routing idea could be tested on other multimodal tasks such as action recognition or video question answering where domain shift affects attention.
  • If entity slots prove stable across more video styles, the method might reduce the need for large-scale domain-specific fine-tuning.
  • Combining the entity bottleneck with other low-rank adapters might further lower the parameter cost.

Load-bearing premise

Visual domain shift is the main reason models lose the ability to couple temporal localization knowledge with their existing entity-attention, and explicit routing through entity evidence will overcome that.

What would settle it

A test set in which visual style is held constant across train and test but query concepts change, showing whether EVIDENT still improves over standard fine-tuning or whether the gain disappears when visual shift is removed.
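
Any such control needs a way to measure how far a test video sits from the training distribution. The paper's Figure 6 does this by taking the cosine similarity between each test video descriptor and the training-set centroid, then splitting the sorted videos into five equally sized bins. A minimal sketch of that partitioning, assuming video descriptors are already available as plain float vectors (the helper names are hypothetical):

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def partition_by_similarity(test_vecs, train_vecs, num_bins=5):
    """Return `num_bins` lists of test indices, ordered least to most similar.

    Similarity is the cosine between each test vector and the centroid of
    the training vectors.
    """
    c = centroid(train_vecs)
    order = sorted(range(len(test_vecs)), key=lambda i: cosine(test_vecs[i], c))
    bins = [[] for _ in range(num_bins)]
    for rank, idx in enumerate(order):
        # Integer arithmetic keeps the bins as equal in size as possible.
        bins[rank * num_bins // len(order)].append(idx)
    return bins
```

Comparing a model's accuracy on the first and last bins gives the kind of top-20% versus bottom-20% contrast reported in Figure 3(b).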

Figures

Figures reproduced from arXiv: 2605.26104 by Geo Ahn, Jinwoo Choi, Jiwook Han, Joonseok Lee, Youngrae Kim.

Figure 1: Teaser. In this work, we tackle the domain generalization problem of MLLM-based VTG. We observe that naïvely fine-tuned MLLMs learn entangled visual representations that rely on dataset-specific scene context rather than the actual visual content, indiscriminately responding to memorized patterns across frames in OOD videos. In contrast, our method learns an entity-centric visual representation that regula…
Figure 2: Naïve fine-tuning breaks generalization. (a) Cosine similarity between visual patches and object text tokens from Qwen2.5-VL-3B [3] on Charades-STA (Cha.) [10] and QVHighlights (QVH) [22]. Even zero-shot, the MLLM exhibits strong visual-text alignment regardless of video domain. (b) When fine-tuned on a specific domain, the model suffers severe OOD drops. (c) The model’s attention on GT-interval frames rel…
Figure 3: Visual domain gap dominates. We decompose the domain gap into concept domain gap and visual domain gap. (a) Splitting OOD samples by query-concept overlap with ID yields only a marginal gap. (b) Ranking OOD samples by visual similarity to ID, the top-20% most similar samples substantially outperform the bottom-20%, identifying the visual domain gap as the primary cause of degradation. (c) On ID, perturbing…
Figure 4: Overview. (a) Entity Bottleneck Adapter (EB Adapter). A lightweight adapter in the early LLM decoder layers decomposes visual tokens into entity-level slots via slot attention. (b) Entity Binding Distillation (EB Distillation). Entity cluster maps derived from DINOv2 [37] patch features are distilled into slot assignment maps via BCE loss, encouraging semantically coherent slot formation. (c) Entity-to-eVi…
Figure 5: EB Adapter attention visualization. We visualize the attention map of EB Adapter (fine-tuned on Cha. [10]) on samples from Cha. (ID), QVH [22] (OOD), and DiDeMo [13] (OOD) by masking each frame with its highest-attending slot.
… decomposition generalizes to unseen domains without any domain-specific supervision, confirming that EVIDENT learns transferable entity-level representations rather than source-biase…
Figure 6: Visual similarity distribution and quintile partitioning. Distribution of cosine similarities s(v) = cos(v, c_Cha.) between QVHighlights [22] test video descriptors v and the Charades-STA [10] training centroid c_Cha., computed in a frozen Qwen2.5-VL-3B [3] embedding space. The 1,519 test videos (1,550 query samples) are sorted by s(v) and partitioned into five equally sized quantile bins (dashed lines). We …
Figure 7: Concept and visual axes carry independent information. Each dot is one of the 1,200 QVHighlights [22] test samples, plotted at its cosine similarity s(v) to the Charades-STA [10] centroid (x-axis); samples are split into two lanes by query concept (Seen / Unseen). Diamonds mark the per-group mean. The two lanes occupy nearly the same range of s(v) and their means sit only 0.013 apart, suggesting that the c…
Figure 8: Problem analysis on InternVL3-2B [53]. We replicate the analysis of …
Figure 9: Zero-shot prompt used with Qwen3-4B-Instruct-2507 [47] for…
Figure 10: DINOv2 cluster maps vs. EB Adapter slot assignments. Each group shows a frame and four isolated patch views across Charades-STA (Cha.) [10] (ID), QVHighlights (QVH) [22] (OOD), and DiDeMo [13] (OOD). Top (DINO): DINOv2 [37] K-means naturally separates semantically coherent regions, validating its use as pseudo-ground-truth in EB Distillation. Bottom (EB Adapter): Slots trained with EB Distillation mirror t…
Figure 11: Analysis on slot attention weights in EB Adapter. We plot the entropy and pairwise cosine similarity of normalized slot attention weights in EB Adapter across layers. The naïve MLLM visual space yields nearly uniform attention, indicating no entity-level binding, whereas our EB Distillation loss yields lower entropy and cosine similarity, evidence that each slot captures a distinct entity.
Figure 12: EB Adapter visualization. We visualize the slot assignments on samples from Charades-STA (Cha.) [10] (ID), QVHighlights (QVH) [22] (OOD), and DiDeMo [13] (OOD) by masking each frame with its highest-attending slot. Frames are arranged in temporal order from left to right, and the same color denotes the same slot.
Figure 13: E2V gating score visualization. We visualize the per-entity gating scores from E2V gating (see Section 4.4) on (a) Charades-STA [10] (ID) and (b) QVHighlights [22] (OOD) samples.
E Limitations and Future Work: EVIDENT addresses cross-domain VTG under visual distribution shift, with experiments spanning three diverse benchmarks (Charades-STA, QVHighlights, DiDeMo). Extending our analysis to more challenging…
Original abstract

Fine-tuning MLLMs for Video Temporal Grounding (VTG) often improves in-domain performance but degrades sharply under domain shift. In this work, we find that this failure is primarily driven not just by unseen query concepts, but by visual domain shift, which prevents the model from coupling its learned temporal localization knowledge with its inherent entity-attention capability. To address this, we introduce EVIDENT, a parameter-efficient adaptation framework that anchors temporal grounding in the inherent entity-attention of pre-trained MLLMs by routing VTG adaptation through explicit visual entity evidence. EVIDENT consists of three components: (i) an Entity Bottleneck Adapter that transforms dense visual tokens into compact entity-level slots, (ii) an Entity-Binding Distillation loss that instills objectness priors into the semantically unstructured MLLM visual space, guiding each slot to bind to a coherent entity, and (iii) an Entity-to-eVidence gating mechanism that leverages the captured entities as evidence, steering the model to localize moments containing query-relevant entities. Together, these components enable VTG fine-tuning to rely on entity-grounded evidence rather than brittle dataset shortcuts. Experiments on cross-domain VTG benchmarks show that EVIDENT consistently improves out-of-domain robustness while preserving competitive in-domain performance with modest parameter overhead. These results suggest that entity-level grounding is an effective inductive bias for generalizable temporal localization.
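
Component (ii) can be made concrete using the description in Figure 4(b), where entity cluster maps derived from DINOv2 patch features are distilled into slot assignment maps via a BCE loss. The sketch below shows only that loss term under stated assumptions: the teacher is given as one cluster index per patch, slot k is already paired with cluster k, and the function name `bce_distillation` is hypothetical. How the paper matches slots to clusters or weights the loss is not stated on this page.

```python
import math


def bce_distillation(assign, cluster_ids, num_slots, eps=1e-7):
    """Mean binary cross-entropy between slot maps and one-hot cluster maps.

    assign[i][k]   : probability that patch i belongs to slot k
    cluster_ids[i] : index of the teacher cluster for patch i
    Assumes slot k is already paired with cluster k.
    """
    total = 0.0
    for probs, cid in zip(assign, cluster_ids):
        for k in range(num_slots):
            p = min(max(probs[k], eps), 1.0 - eps)  # clamp for log stability
            y = 1.0 if k == cid else 0.0
            total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / (len(assign) * num_slots)
```

The loss approaches zero when each patch puts all of its mass on the slot paired with its teacher cluster, and grows as slot assignments drift away from the teacher's regions.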

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims that fine-tuning MLLMs for Video Temporal Grounding (VTG) degrades under domain shift primarily due to visual domain shift blocking coupling of temporal localization knowledge with inherent entity-attention; it introduces EVIDENT, a parameter-efficient framework with an Entity Bottleneck Adapter (transforming dense tokens to entity slots), Entity-Binding Distillation loss (instilling objectness priors), and Entity-to-eVidence gating (steering localization via query-relevant entities) to enable entity-grounded adaptation, yielding improved out-of-domain robustness while preserving in-domain performance with modest overhead.

Significance. If the cross-domain gains hold with proper controls, the work establishes entity-level grounding as a practical inductive bias for generalizable temporal localization in MLLMs, directly targeting visual domain shift rather than query-concept novelty, with potential extension to other multimodal grounding tasks under distribution shift.

major comments (1)
  1. [Abstract] Abstract: the central claim of consistent out-of-domain improvement is asserted without any reported metrics, baselines, ablation tables, or error analysis, preventing verification of the experimental support for the entity-routing hypothesis.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and the opportunity to respond. We address the single major comment below regarding the abstract.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of consistent out-of-domain improvement is asserted without any reported metrics, baselines, ablation tables, or error analysis, preventing verification of the experimental support for the entity-routing hypothesis.

    Authors: We acknowledge that the abstract presents a high-level summary of the claims without embedding specific numerical metrics, baseline names, or table references, as is conventional for abstracts to remain concise. The full manuscript contains the requested experimental details, including cross-domain VTG benchmark results with quantitative comparisons, ablation studies on the three proposed components, and supporting analysis in the Experiments section. To strengthen verifiability directly in the abstract while preserving its brevity, we will revise it to include key quantitative out-of-domain gains (e.g., relative improvements over baselines) drawn from the reported tables.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents EVIDENT as an architectural framework with three explicitly described components (Entity Bottleneck Adapter, Entity-Binding Distillation loss, Entity-to-eVidence gating) that directly implement the stated inductive bias of entity-grounded routing. No equations, derivations, or parameter-fitting steps are shown that reduce the claimed cross-domain gains to quantities defined by the method itself. The central claim rests on empirical out-of-domain robustness results, which remain externally falsifiable. No self-citation chains or uniqueness theorems are invoked as load-bearing premises in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated or derivable.

pith-pipeline@v0.9.1-grok · 5795 in / 989 out tokens · 24547 ms · 2026-06-29T22:25:11.022294+00:00 · methodology


Reference graph

Works this paper leans on

59 extracted references · 9 canonical work pages · 5 internal anchors

  1. [1]

    Slot-guided adaptation of pre-trained diffusion models for object-centric learning and compositional generation

    Adil Kaan Akan and Yücel Yemez. Slot-guided adaptation of pre-trained diffusion models for object-centric learning and compositional generation. In ICLR, 2025.

  2. [2]

    DEVIAS: Learning disentangled video representations of action and scene

    Kyungho Bae, Geo Ahn, Youngrae Kim, and Jinwoo Choi. DEVIAS: Learning disentangled video representations of action and scene. In ECCV, 2024.

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.ar...

  4. [4]

    Learning sample importance for cross-scenario video temporal grounding

    Peijun Bao and Yadong Mu. Learning sample importance for cross-scenario video temporal grounding. In ICMR, 2022.

  5. [5]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.

  6. [6]

    Towards a complete benchmark on video moment localization

    Jinyeong Chae, Donghwa Kim, Kwanseok Kim, Doyeon Lee, Sangho Lee, Seongsu Ha, Jonghwan Mun, Wooyoung Kang, Byungseok Roh, and Joonseok Lee. Towards a complete benchmark on video moment localization. In AISTATS, 2024.

  7. [7]

    Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM

    Donghwan Chi, Hyomin Kim, Yoonjin Oh, Yongjin Kim, Donghoon Lee, Daejin Jo, Jongmin Kim, Junyeob Baek, Sungjin Ahn, and Sungwoong Kim. Slot-MLLM: Object-centric visual tokenization for multimodal llm. arXiv preprint arXiv:2505.17726, 2025.

  8. [8]

    Learning phrase representations using RNN encoder–decoder for statistical machine translation

    Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, 2014.

  9. [9]

    Why can’t i dance in the mall? learning to mitigate scene bias in action recognition

    Jinwoo Choi, Chen Gao, Joseph CE Messou, and Jia-Bin Huang. Why can’t i dance in the mall? learning to mitigate scene bias in action recognition. In NeurIPS, 2019.

  10. [10]

    Tall: Temporal activity localization via language query

    Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In ICCV, 2017.

  11. [11]

    TRACE: Temporal grounding video llm via causal event modeling

    Yongxin Guo, Jingyu Liu, Mingda Li, Xiaoying Tang, Qingbin Liu, and Xi Chen. TRACE: Temporal grounding video llm via causal event modeling. arXiv preprint arXiv:2410.05643, 2024.

  12. [12]

    Can shuffling video benefit temporal bias problem: A novel training framework for temporal grounding

    Jiachang Hao, Haifeng Sun, Pengfei Ren, Jingyu Wang, Qi Qi, and Jianxin Liao. Can shuffling video benefit temporal bias problem: A novel training framework for temporal grounding. In ECCV, 2022.

  13. [13]

    Localizing moments in video with natural language

    Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In ICCV, 2017.

  14. [14]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022.

  15. [15]

    Vtimellm: Empower llm to grasp video moments

    Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. Vtimellm: Empower llm to grasp video moments. In CVPR, 2024.

  16. [16]

    Knowing where to focus: Event-aware transformer for video grounding

    Jinhyun Jang, Jungin Park, Jin Kim, Hyeongjun Kwon, and Kwanghoon Sohn. Knowing where to focus: Event-aware transformer for video grounding. In ICCV, 2023.

  17. [17]

    Transferable video moment localization by moment-guided query prompting

    Hao Jiang, Yang Yizhang, and Yadong Mu. Transferable video moment localization by moment-guided query prompting. In AAAI, 2024.

  18. [18]

    Map the flow: Revealing hidden pathways of information in videollms

    Minji Kim, Taekyung Kim, and Bohyung Han. Map the flow: Revealing hidden pathways of information in videollms. In ICLR, 2026.

  19. [19]

    Conditional object-centric learning from video

    Thomas Kipf, Gamaleldin F Elsayed, Aravindh Mahendran, Austin Stone, Sara Sabour, Georg Heigold, Rico Jonschkowski, Alexey Dosovitskiy, and Klaus Greff. Conditional object-centric learning from video. In ICLR, 2022.

  20. [20]

    The hungarian method for the assignment problem

    Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.

  21. [21]

    Curriculum multi-negative augmentation for debiased video grounding

    Xiaohan Lan, Yitian Yuan, Hong Chen, Xin Wang, Zequn Jie, Lin Ma, Zhi Wang, and Wenwu Zhu. Curriculum multi-negative augmentation for debiased video grounding. In AAAI, 2023.

  22. [22]

    Detecting moments and highlights in videos via natural language queries

    Jie Lei, Tamara L Berg, and Mohit Bansal. Detecting moments and highlights in videos via natural language queries. In NeurIPS, 2021.

  23. [23]

    Revealing single frame bias for video-and-language learning

    Jie Lei, Tamara Berg, and Mohit Bansal. Revealing single frame bias for video-and-language learning. In ACL, 2023.

  24. [24]

    CORE: Compact object-centric representations as a new paradigm for token merging in lvlms

    Jingyu Lei, Gaoang Wang, and Der-Horng Lee. CORE: Compact object-centric representations as a new paradigm for token merging in lvlms. In CVPR, 2026.

  25. [25]

    Compositional temporal grounding with structured variational cross-graph correspondence learning

    Juncheng Li, Junlin Xie, Long Qian, Linchao Zhu, Siliang Tang, Fei Wu, Yi Yang, Yueting Zhuang, and Xin Eric Wang. Compositional temporal grounding with structured variational cross-graph correspondence learning. In CVPR, 2022.

  26. [26]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.

  27. [27]

    Resound: Towards action recognition without representation bias

    Yingwei Li, Yi Li, and Nuno Vasconcelos. Resound: Towards action recognition without representation bias. In ECCV, 2018.

  28. [28]

    Universal video temporal grounding with generative multi-modal large language models

    Zeqian Li, Shangzhe Di, Zhonghua Zhai, Weilin Huang, Yanfeng Wang, and Weidi Xie. Universal video temporal grounding with generative multi-modal large language models. In NeurIPS, 2025.

  29. [29]

    Univtg: Towards unified video-language temporal grounding

    Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, and Mike Zheng Shou. Univtg: Towards unified video-language temporal grounding. In ICCV,

  30. [30]

    VideoMind: A chain-of-lora agent for temporal-grounded video reasoning

    Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, and Mike Zheng Shou. VideoMind: A chain-of-lora agent for temporal-grounded video reasoning. In ICLR, 2026.

  31. [31]

    Object-centric learning with slot attention

    Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention. In NeurIPS,

  32. [32]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.

  33. [33]

    Chrono: A simple blueprint for representing time in mllms

    Boris Meinardus, Hector Rodriguez, Anil Batra, Anna Rohrbach, and Marcus Rohrbach. Chrono: A simple blueprint for representing time in mllms. In ICCVW, 2025.

  34. [34]

    Correlation-guided query-dependency calibration for video temporal grounding

    WonJun Moon, Sangeek Hyun, SuBeen Lee, and Jae-Pil Heo. Correlation-guided query-dependency calibration for video temporal grounding. arXiv preprint arXiv:2311.08835, 2023.

  35. [35]

    Query-dependent video representation for moment retrieval and highlight detection

    WonJun Moon, Sangeek Hyun, SangUk Park, Dongchan Park, and Jae-Pil Heo. Query-dependent video representation for moment retrieval and highlight detection. In CVPR, 2023.

  36. [36]

    Interventional video grounding with dual contrastive learning

    Guoshun Nan, Rui Qiao, Yao Xiao, Jun Liu, Sicong Leng, Hao Zhang, and Wei Lu. Interventional video grounding with dual contrastive learning. In CVPR, 2021.

  37. [37]

    Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Laba...

  38. [38]

    Uncovering hidden challenges in query-based video moment retrieval

    Mayu Otani, Yuta Nakashima, Esa Rahtu, and Janne Heikkilä. Uncovering hidden challenges in query-based video moment retrieval. arXiv preprint arXiv:2009.00325, 2020.

  39. [39]

    Bias-conflict sample synthesis and adversarial removal debias strategy for temporal sentence grounding in video

    Zhaobo Qi, Yibo Yuan, Xiaowen Ruan, ShuHui Wang, Weigang Zhang, and QingMing Huan. Bias-conflict sample synthesis and adversarial removal debias strategy for temporal sentence grounding in video. In AAAI, 2024.

  40. [40]

    Timechat: A time-sensitive multimodal large language model for long video understanding

    Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. In CVPR, 2024.

  41. [41]

    Bridging the gap to real-world object-centric learning

    Maximilian Seitzer, Max Horn, Andrii Zadaianchuk, Dominik Zietlow, Tianjun Xiao, Carl-Johann Simon-Gabriel, Tong He, Zheng Zhang, Bernhard Schölkopf, Thomas Brox, et al. Bridging the gap to real-world object-centric learning. In ICLR, 2023.

  42. [42]

    Tr-detr: Task-reciprocal transformer for joint moment retrieval and highlight detection

    Hao Sun, Mingyao Zhou, Wenjing Chen, and Wei Xie. Tr-detr: Task-reciprocal transformer for joint moment retrieval and highlight detection. In AAAI, 2024.

  43. [43]

    Time-R1: Post-training large vision language model for temporal video grounding

    Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, Xiangnan Fang, Zewen He, Zhenbo Luo, Wenxuan Wang, Junqi Lin, Jian Luan, and Qin Jin. Time-R1: Post-training large vision language model for temporal video grounding. In NeurIPS,

  44. [44]

    HawkEye: Training video-text llms for grounding text in videos

    Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, and Dongyan Zhao. HawkEye: Training video-text llms for grounding text in videos. arXiv preprint arXiv:2403.10228, 2024.

  45. [45]

    Slotformer: Unsupervised visual dynamics simulation with object-centric models

    Ziyi Wu, Nikita Dvornik, Klaus Greff, Thomas Kipf, and Animesh Garg. Slotformer: Unsupervised visual dynamics simulation with object-centric models. In ICLR, 2023.

  46. [46]

    Slot-VLM: Object-event slots for video-language modeling

    Jiaqi Xu, Cuiling Lan, Wenxuan Xie, Xuejin Chen, and Yan Lu. Slot-VLM: Object-event slots for video-language modeling. In NeurIPS, 2024.

  47. [47]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  48. [48]

    AIM: Adapting image models for efficient video understanding

    Taojiannan Yang, Yi Zhu, Yusheng Xie, Aston Zhang, Chen Chen, and Mu Li. AIM: Adapting image models for efficient video understanding. In ICLR, 2023.

  49. [49]

    Deconfounded video moment retrieval with causal intervention

    Xun Yang, Fuli Feng, Wei Ji, Meng Wang, and Tat-Seng Chua. Deconfounded video moment retrieval with causal intervention. In ACM SIGIR, 2021.

  50. [50]

    A closer look at temporal sentence grounding in videos: Dataset and metric

    Yitian Yuan, Xiaohan Lan, Xin Wang, Long Chen, Zhi Wang, and Wenwu Zhu. A closer look at temporal sentence grounding in videos: Dataset and metric. In ACM MM Workshop, 2021.

  51. [51]

    TimeSuite: Improving MLLMs for long video understanding via grounded tuning

    Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, Yali Wang, Yu Qiao, and Limin Wang. TimeSuite: Improving MLLMs for long video understanding via grounded tuning. In ICLR, 2025.

  52. [52]

    Timelens: Rethinking video temporal grounding with multimodal llms

    Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, and Limin Wang. Timelens: Rethinking video temporal grounding with multimodal llms. In CVPR, 2026.

  53. [53]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.

Appendix. In this appendix, we provide extended related work, comprehensive analyses...

Figure 9 prompt text

  • Return a JSON list containing up to 2 strings
  • Each string MUST start with "Subject:" or "Object:"
  • If the Subject is interacting with a physical item, you MUST extract it as an Object
  • Ignore verbs, scenes, abstract concepts, and meta-descriptions
  • Keep the nouns extremely concise (1–2 words)

Return ONLY a valid JSON list. Examples:
Query: “a person opens the refrigerator in the kitchen” JSON list: ["Subject: person", "Object: refrigerator"]
Query: “The girl dances around the room.” JSON list: ["Subject: girl"]
Query: {query} JSON list:

A: JSON list of extracted concepts

Figure 9: Zero-shot prompt used with Qwen3-4B-Instruct-2507 [47] for...
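
The output contract in the Figure 9 prompt above (a JSON list of at most two strings, each starting with "Subject:" or "Object:") is easy to check mechanically. The sketch below validates a model reply against that contract; the function name `parse_entities` and the decision to raise `ValueError` on violations are assumptions for illustration, and nothing here reflects how the paper post-processes replies.

```python
import json


def parse_entities(raw, max_items=2):
    """Parse a model reply into (role, noun) pairs, or raise ValueError.

    Enforces the prompt's stated rules: a JSON list of at most `max_items`
    strings, each starting with "Subject:" or "Object:".
    """
    items = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(items, list) or len(items) > max_items:
        raise ValueError("expected a JSON list with at most %d strings" % max_items)
    parsed = []
    for item in items:
        if not isinstance(item, str):
            raise ValueError("list items must be strings")
        role, sep, noun = item.partition(":")
        if not sep or role not in ("Subject", "Object") or not noun.strip():
            raise ValueError("bad entry: %r" % item)
        parsed.append((role, noun.strip()))
    return parsed
```

Run on the two example replies from the prompt, this yields `[("Subject", "person"), ("Object", "refrigerator")]` and `[("Subject", "girl")]`.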