pith. machine review for the scientific record.

arxiv: 2604.02860 · v1 · submitted 2026-04-03 · 💻 cs.CV · cs.AI

Recognition: no theorem link

A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos

Allen He, Kun Liu, Qi Liu, Wu Liu, Xinchen Liu

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:40 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: temporal sentence grounding · end-to-end training · video localization · sentence conditioned adapter · feature modulation · multimodal video understanding · adapter tuning

The pith

Fully end-to-end training of video backbones and localization heads, guided by a sentence-conditioned adapter, outperforms frozen-backbone methods on temporal sentence grounding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most existing approaches extract video features once using classification-trained encoders that stay frozen during grounding training, creating a mismatch between what the backbone learned and what the task needs. The paper demonstrates that switching to joint optimization of the entire pipeline closes this gap and raises accuracy on standard benchmarks. A small Sentence Conditioned Adapter (SCADA) injects sentence information into selected layers of the backbone so that only a modest number of parameters are updated, keeping memory costs manageable even for deeper networks. This adaptive modulation produces video features that are more directly useful for locating sentence-matched segments.
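
To make the mechanism concrete, here is a minimal sketch of sentence-conditioned feature modulation in a FiLM-like style; the module name, dimensions, and placement are illustrative assumptions, not the paper's actual SCADA design.

    import torch
    import torch.nn as nn

    class SentenceConditionedModulation(nn.Module):
        """Hypothetical FiLM-style adapter: a sentence embedding yields a
        per-channel scale and shift applied to intermediate video features."""

        def __init__(self, sent_dim: int, feat_dim: int, bottleneck: int = 64):
            super().__init__()
            # A small bottleneck MLP keeps the trainable parameter count modest.
            self.to_gamma_beta = nn.Sequential(
                nn.Linear(sent_dim, bottleneck),
                nn.ReLU(),
                nn.Linear(bottleneck, 2 * feat_dim),
            )

        def forward(self, feats: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
            # feats: (batch, time, channels); sent_emb: (batch, sent_dim)
            gamma, beta = self.to_gamma_beta(sent_emb).chunk(2, dim=-1)
            # Broadcast one scale and shift per sentence across the time axis.
            return feats * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

Inserting such a module after selected backbone layers lets gradients flow end-to-end while only the adapter's two small linear layers add trainable parameters.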

Core claim

The central claim is that a fully end-to-end training paradigm, in which the video backbone and localization head are optimized together, resolves the task discrepancy between classification pre-training and temporal sentence grounding. Combined with the Sentence Conditioned Adapter, which uses sentence embeddings to modulate feature maps, this joint optimization delivers higher localization performance than prior state-of-the-art methods that keep backbones frozen.

What carries the argument

Sentence Conditioned Adapter (SCADA), a lightweight module that conditions a small subset of video-backbone parameters on linguistic embeddings to adaptively modulate feature maps during joint optimization.
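
A hedged sketch of the training surface this implies: gradients flow through the whole network, but only the adapters and the localization head hold trainable parameters. The names backbone, adapters, and head are placeholders, not the paper's API.

    def collect_trainable(backbone, adapters, head):
        # Freeze the pre-trained backbone weights ...
        for p in backbone.parameters():
            p.requires_grad = False
        # ... and leave only adapter and head parameters trainable.
        trainable = []
        for module in (*adapters, head):
            for p in module.parameters():
                p.requires_grad = True
                trainable.append(p)
        total = sum(p.numel() for p in trainable)
        print(f"trainable parameters: {total / 1e6:.2f}M")
        return trainable

An optimizer built over collect_trainable(...) updates a small fraction of the full model, which is where the reported memory headroom for deeper backbones would come from.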

If this is right

  • End-to-end training improves results over frozen baselines at multiple model scales.
  • SCADA permits deeper backbones to be trained with lower peak memory than full fine-tuning.
  • Visual features become more task-aligned for sentence-guided localization through direct linguistic modulation.
  • The overall pipeline exceeds previously published state-of-the-art numbers on the two evaluated datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar lightweight conditioning adapters could be tested on other video-language tasks that currently rely on frozen encoders.
  • The memory savings reported for deeper networks suggest the method may scale to larger backbones where full fine-tuning is currently prohibitive.
  • If SCADA-style modulation proves stable across datasets, it could reduce reliance on massive classification pre-training corpora for grounding-specific models.

Load-bearing premise

End-to-end optimization together with a small sentence-conditioned adapter can close the classification-to-grounding gap without causing overfitting or exceeding practical memory limits.

What would settle it

An ablation on the two reported benchmarks that trains the same deeper backbone end-to-end without SCADA, or with a randomly initialized adapter: if accuracy shows no gain, or a clear drop, relative to the frozen baseline while memory usage remains comparable, the central claim fails.
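
One way to frame that decisive experiment is as a grid of controlled configurations; the names below are illustrative assumptions layered on whatever training loop the paper uses, where "random-frozen" means an adapter with random weights that never updates, isolating architecture from learned modulation.

    # Hypothetical ablation grid; not the paper's experiment code.
    ABLATIONS = {
        "frozen-baseline":    dict(train_backbone=False, adapter=None),
        "e2e-no-adapter":     dict(train_backbone=True,  adapter=None),
        "e2e-random-adapter": dict(train_backbone=True,  adapter="random-frozen"),
        "e2e-scada":          dict(train_backbone=True,  adapter="scada"),
    }

If the last configuration fails to separate from the first on accuracy while memory stays comparable, the load-bearing premise does not hold.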

Figures

Figures reproduced from arXiv: 2604.02860 by Allen He, Kun Liu, Qi Liu, Wu Liu, Xinchen Liu.

Figure 1: An overview of the classic pipeline for TSGV (top) vs …
Figure 2: The proposed overall framework for fully end-to-end training for TSGV. (a) In the Sentence Conditioned Adapter, sentence …
Figure 3: The architecture of the sentence conditioned adapter. It con…
Figure 4: The effect of image and temporal resolution.
Figure 5: Comparison of training time and test performance for the …
Original abstract

Temporal sentence grounding in videos (TSGV) aims to localize a temporal segment that semantically corresponds to a sentence query from an untrimmed video. Most current methods adopt pre-trained query-agnostic visual encoders for offline feature extraction, and the video backbones are frozen and not optimized for TSGV. This leads to a task discrepancy issue for the video backbone trained for visual classification, but utilized for TSGV. To bridge this gap, we propose a fully end-to-end paradigm that jointly optimizes the video backbone and localization head. We first conduct an empirical study validating the effectiveness of end-to-end learning over frozen baselines across different model scales. Furthermore, we introduce a Sentence Conditioned Adapter (SCADA), which leverages sentence features to train a small portion of video backbone parameters adaptively. SCADA facilitates the deployment of deeper network backbones with reduced memory and significantly enhances visual representation by modulating feature maps through precise integration of linguistic embeddings. Experiments on two benchmarks show that our method outperforms state-of-the-art approaches. The code and models will be released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a fully end-to-end training paradigm for temporal sentence grounding in videos (TSGV) that jointly optimizes the video backbone and localization head, in contrast to prior work that freezes pre-trained query-agnostic visual encoders. It introduces the Sentence Conditioned Adapter (SCADA) to adaptively modulate a small subset of backbone parameters using sentence features, enabling deeper networks with lower memory while addressing the classification-to-grounding task discrepancy. An empirical study across model scales and experiments on two benchmarks are reported to show that the method outperforms state-of-the-art approaches.

Significance. If the empirical gains hold under rigorous controls, the work would mark a meaningful shift in TSGV by demonstrating that end-to-end optimization with a lightweight sentence-conditioned adapter can close the task gap without prohibitive memory costs, potentially improving visual representations for video-language localization tasks and encouraging similar joint training in related domains.

major comments (2)
  1. [Empirical study and experiments section] The empirical study validating end-to-end learning (mentioned in the abstract and presumably detailed in the experiments section) reports benchmark gains but provides no information on statistical significance testing, variance across runs, or full ablation controls isolating the contribution of joint backbone optimization versus the SCADA adapter alone; this weakens support for the central claim that the paradigm shift is effective across model scales.
  2. [SCADA introduction and method] The description of SCADA (abstract and method) claims it modulates feature maps through precise integration of linguistic embeddings while training only a small portion of parameters, yet no explicit formulation, pseudocode, or memory analysis is referenced to confirm it avoids overfitting or enables the stated deeper backbones; this is load-bearing for the weakest assumption identified in the review.
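
For the memory analysis requested in comment 2, the following is a minimal measurement sketch using standard PyTorch CUDA accounting; the model, batch, targets, loss, and optimizer are assumed stand-ins, not the paper's code.

    import torch

    def peak_memory_mb(model, batch, targets, loss_fn, optimizer):
        # Reset the CUDA peak-allocation counter, run one full training step,
        # then read back the high-water mark in MiB.
        torch.cuda.reset_peak_memory_stats()
        optimizer.zero_grad()
        loss = loss_fn(model(batch), targets)
        loss.backward()
        optimizer.step()
        return torch.cuda.max_memory_allocated() / 2**20

Running this once with all parameters trainable and once with adapter-only training would quantify the savings the paper attributes to SCADA.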

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the empirical validation and methodological clarity of our work on fully end-to-end training for temporal sentence grounding. We address each major comment below and will incorporate revisions to provide additional statistical details, expanded ablations, and explicit SCADA documentation.

Point-by-point responses
  1. Referee: The empirical study validating end-to-end learning (mentioned in the abstract and presumably detailed in the experiments section) reports benchmark gains but provides no information on statistical significance testing, variance across runs, or full ablation controls isolating the contribution of joint backbone optimization versus the SCADA adapter alone; this weakens support for the central claim that the paradigm shift is effective across model scales.

    Authors: We agree that reporting variance, statistical significance, and more targeted ablations would provide stronger support for the claims. In the revised manuscript, we will add results from multiple independent runs (reporting mean and standard deviation), paired t-tests for significance on key gains (a minimal sketch of such reporting appears after these responses), and expanded ablation tables that separately quantify the contribution of joint backbone optimization versus the SCADA adapter across model scales. The existing study already demonstrates consistent improvements over frozen baselines, but these additions will address the noted gaps directly. revision: yes

  2. Referee: The description of SCADA (abstract and method) claims it modulates feature maps through precise integration of linguistic embeddings while training only a small portion of parameters, yet no explicit formulation, pseudocode, or memory analysis is referenced to confirm it avoids overfitting or enables the stated deeper backbones; this is load-bearing for the weakest assumption identified in the review.

    Authors: Section 3.2 of the manuscript provides the mathematical formulation for SCADA's sentence-conditioned modulation of feature maps. To further clarify, the revision will include pseudocode for the adapter integration, a quantitative memory analysis (comparing trainable parameters and peak GPU usage against full fine-tuning), and explicit discussion of how the low parameter count reduces overfitting risk while supporting deeper backbones. These additions will make the implementation details self-contained. revision: yes
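
Response 1 promises mean and standard-deviation reporting plus paired t-tests; below is a minimal sketch of that reporting, using hypothetical per-seed numbers rather than results from the paper.

    import numpy as np
    from scipy.stats import ttest_rel

    # Hypothetical R@1 scores over five seeds; not the paper's numbers.
    frozen = np.array([41.2, 40.8, 41.5, 40.9, 41.1])
    e2e    = np.array([43.0, 42.6, 43.3, 42.8, 43.1])

    print(f"frozen: {frozen.mean():.2f} ± {frozen.std(ddof=1):.2f}")
    print(f"e2e:    {e2e.mean():.2f} ± {e2e.std(ddof=1):.2f}")
    t_stat, p_value = ttest_rel(e2e, frozen)
    print(f"paired t-test: t={t_stat:.2f}, p={p_value:.4f}")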

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper advances a fully end-to-end training paradigm for temporal sentence grounding, backed by an empirical study of end-to-end vs. frozen backbones across scales and by benchmark results showing gains from the SCADA adapter. No derivation chain, equations, or first-principles claims appear that reduce to fitted inputs or self-citations by construction. The method is presented as a practical design choice whose value is measured externally via experiments, not internally forced by its own definitions or prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that frozen classification backbones create a measurable task discrepancy that end-to-end training plus a small adapter can resolve; SCADA is an invented module whose effectiveness is demonstrated empirically.

axioms (1)
  • domain assumption: Pre-trained visual encoders are query-agnostic and induce a task discrepancy when used frozen for grounding
    Explicitly stated in the abstract as the core motivation for the new paradigm.
invented entities (1)
  • Sentence Conditioned Adapter (SCADA): no independent evidence
    purpose: Modulate video backbone feature maps using sentence embeddings while updating only a small parameter subset
    New module introduced to enable memory-efficient end-to-end training

pith-pipeline@v0.9.0 · 5490 in / 1215 out tokens · 35564 ms · 2026-05-13T19:40:30.968259+00:00 · methodology

discussion (0)

