pith. machine review for the scientific record.

arxiv: 2604.26565 · v1 · submitted 2026-04-29 · 💻 cs.CV

Recognition: unknown

DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation

Mingji Ge, Qirui Chen, Weidi Xie, Zeqian Li

Pith reviewed 2026-05-07 13:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords instructional videos · dense video captioning · procedural step grounding · training-free annotation · multimodal large models · temporal localization · long-form video understanding

The pith

A training-free pipeline segments instructional videos and uses multimodal models to generate 2 million temporally grounded procedural steps from 100K videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an automated pipeline that divides videos into coherent shots, removes poorly aligned segments, and applies off-the-shelf multimodal models to produce structured step-by-step annotations. This addresses the noise and misalignment problems in existing instructional video collections, creating a cleaner resource for training models on long-term temporal reasoning and procedural activities. The resulting dataset is shown to improve performance on captioning, step localization, and retrieval tasks, with models generalizing across different camera perspectives without additional fine-tuning.

Core claim

We introduce a scalable, training-free pipeline that segments in-the-wild instructional videos into shots, applies alignment filtering, and leverages Qwen2.5-VL together with DeepSeek-R1 to output dense, temporally grounded procedural steps, yielding the DenseStep2M dataset of roughly 100K videos and 2M steps that supports measurable gains on dense captioning, procedural grounding, and cross-modal retrieval while maintaining zero-shot robustness across egocentric, exocentric, and mixed views.

What carries the argument

The training-free pipeline of shot segmentation followed by alignment filtering and structured generation from Qwen2.5-VL and DeepSeek-R1, which converts raw video-plus-narration into temporally accurate procedural steps.
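To make the shape of that pipeline concrete, here is a minimal sketch of the segment, filter, generate, refine flow. Everything in it is a placeholder: the stub functions (segment_into_shots, passes_alignment_filter, generate_steps, refine_steps) and their toy return values stand in for components that, in the paper, are implemented with shot detection, alignment filtering, Qwen2.5-VL, and DeepSeek-R1.

```python
from dataclasses import dataclass

@dataclass
class Step:
    start: float       # seconds
    end: float         # seconds
    instruction: str   # natural-language description of the procedural step

# Hypothetical stubs standing in for the paper's actual components.
def segment_into_shots(video_path):
    """Stand-in for shot segmentation; returns (start_s, end_s) pairs."""
    return [(0.0, 12.5), (12.5, 40.0), (40.0, 71.2)]

def passes_alignment_filter(shot, narration):
    """Stand-in for the narration/visual alignment filter."""
    return bool(narration.strip())  # placeholder criterion

def generate_steps(shot, narration):
    """Stand-in for structured step generation (Qwen2.5-VL in the paper)."""
    return [Step(shot[0], shot[1], f"(placeholder step for: {narration[:40]!r})")]

def refine_steps(steps):
    """Stand-in for refining/merging adjacent steps (DeepSeek-R1 in the paper)."""
    return steps

def annotate_video(video_path, narration_by_shot):
    """Training-free annotation flow: segment, filter, generate, refine."""
    steps = []
    for shot in segment_into_shots(video_path):
        narration = narration_by_shot.get(shot, "")
        if not passes_alignment_filter(shot, narration):
            continue  # drop shots whose narration and visuals do not line up
        steps.extend(generate_steps(shot, narration))
    return refine_steps(steps)

if __name__ == "__main__":
    narrations = {(0.0, 12.5): "Peel and dice the onions.", (12.5, 40.0): ""}
    for step in annotate_video("howto_clip.mp4", narrations):  # path is a placeholder
        print(f"[{step.start:6.1f}-{step.end:6.1f}] {step.instruction}")
```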

If this is right

  • Models fine-tuned on DenseStep2M produce higher-quality dense captions and more precise temporal localization of procedural steps.
  • The same models exhibit robust zero-shot generalization to egocentric, exocentric, and mixed-perspective videos.
  • The dataset enables stronger performance on cross-modal retrieval and long-term activity reasoning without task-specific retraining.
  • A new human-written benchmark, DenseCaption100, confirms close alignment between the auto-generated steps and expert annotations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same segmentation-plus-generation approach could be applied to non-instructional video domains such as sports or surveillance to create large-scale temporal annotations without manual labeling.
  • Combining the dataset with existing narration-only corpora might further reduce reliance on ASR noise in long-form video pretraining.
  • If the pipeline scales to additional video sources, it could support training of models that follow real-world procedures in robotics or educational settings.

Load-bearing premise

Off-the-shelf Qwen2.5-VL and DeepSeek-R1 models, combined with the shot segmentation and filtering steps, generate temporally accurate and semantically correct procedural annotations at scale without substantial hallucinations or systematic biases.

What would settle it

A human audit of several thousand generated steps that finds frequent temporal misalignment with visual actions or semantic inaccuracies, or downstream models that show no measurable improvement on captioning and localization benchmarks after fine-tuning on the dataset.
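As a rough illustration of what such an audit could report, the sketch below computes a misalignment rate over a random sample of generated steps, given human-corrected boundaries for each sampled step. The 3-second tolerance, the sample size, and the toy data are assumptions for illustration, not values from the paper.

```python
import random

def temporal_offset(generated, corrected):
    """Largest boundary error in seconds between a generated step and its human correction."""
    (gs, ge), (cs, ce) = generated, corrected
    return max(abs(gs - cs), abs(ge - ce))

def audit_misalignment(pairs, tolerance_s=3.0, sample_size=2000, seed=0):
    """Fraction of sampled generated steps whose boundaries miss the human-corrected
    boundaries by more than tolerance_s seconds."""
    random.seed(seed)
    sample = random.sample(pairs, min(sample_size, len(pairs)))
    misses = sum(temporal_offset(g, c) > tolerance_s for g, c in sample)
    return misses / len(sample)

# Toy usage: each pair is ((gen_start, gen_end), (human_start, human_end)) in seconds.
pairs = [((10.0, 25.0), (11.0, 24.0)), ((30.0, 48.0), (38.0, 55.0))]
print(f"misalignment rate: {audit_misalignment(pairs):.0%}")  # -> 50%
```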

Figures

Figures reproduced from arXiv: 2604.26565 by Mingji Ge, Qirui Chen, Weidi Xie, Zeqian Li.

Figure 1. The proposed three-stage pipeline for generating temporally grounded procedural steps from instructional videos.
Figure 2. Three-step pipeline for Stage 1 (Video Filtering and Segmentation). WhisperX […]
Figure 3. Examples of issues in initial step generation by Qwen2.5-VL […]
Figure 4. Illustrations of the DenseStep2M dataset characteristics. (a) Distribution statistics showing task categories (left), video […]
Figure 5. Examples of low-quality videos from the HowTo100M dataset [50].
Figure 6. Illustrative examples of the DenseCaption100 benchmark. Annotated instructional steps show detailed instructions […]
Figure 7. Illustrative R1-Score computation for instructional step alignment. This example shows how we (1) match each […]
Figure 8. Illustration of downstream tasks: dense video captioning and procedural step grounding. (a) Dense Video Captioning […]
Figure 9. Additional qualitative results demonstrating the model's ability to produce temporally aligned instructional steps […]
Figure 10. Prompts for splitting the video and generating the title with Qwen2.5. Relevant examples have been omitted for clarity […]
Figure 11. Prompts for extracting textual information with Qwen2.5. Relevant examples have been omitted for clarity […]
Figure 12. Prompts for extracting visual information with Qwen2.5-VL. Relevant examples have been omitted for clarity […]
Figure 13. Prompts for matching textual and visual information with Qwen2.5. Relevant examples have been omitted for clarity […]
Figure 14. Prompts for generating instructional steps with Qwen2.5-VL. Relevant examples have been omitted for clarity […]
Figure 15. Prompts for refining and merging steps with DeepSeek-R1. Relevant examples have been omitted for clarity […]
Figure 16. Prompts for evaluating the R1-Score with DeepSeek-R1. The full prompts can be found in the released code.
Figure 17. Prompts for simplifying the steps generated by the pipeline to align with YouCook2 annotations.
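Figure 7 above describes the R1-Score, which matches generated steps to reference steps before scoring. The sketch below is a generic, hedged stand-in for that idea: it matches predicted to reference steps by temporal IoU and reports step-level precision, recall, and F1. The 0.5 IoU threshold and greedy matching are illustrative choices; the paper's actual metric delegates matching to DeepSeek-R1.

```python
def tiou(a, b):
    """Temporal IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def step_prf(predicted, reference, tiou_threshold=0.5):
    """Greedy one-to-one matching of predicted to reference steps at a temporal-IoU
    threshold, then precision/recall/F1 over the matches. A generic stand-in for the
    paper's R1-Score, which uses DeepSeek-R1 for semantic matching."""
    unmatched_ref = list(range(len(reference)))
    matches = 0
    for p in predicted:
        best_j, best_iou = None, tiou_threshold
        for j in unmatched_ref:
            iou = tiou(p, reference[j])
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            unmatched_ref.remove(best_j)
            matches += 1
    precision = matches / len(predicted) if predicted else 0.0
    recall = matches / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy usage with (start, end) intervals in seconds.
pred = [(0.0, 10.0), (12.0, 30.0), (35.0, 50.0)]
ref = [(1.0, 9.0), (13.0, 28.0)]
print(step_prf(pred, ref))  # -> (0.666..., 1.0, 0.8)
```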
Original abstract

Long-term video understanding requires interpreting complex temporal events and reasoning over procedural activities. While instructional video corpora, like HowTo100M, offer rich resources for model training, they present significant challenges, including noisy ASR transcripts and inconsistent temporal alignments between narration and visual content. In this work, we introduce an automated, training-free pipeline to extract high-quality procedural annotations from in-the-wild instructional videos. Our approach segments videos into coherent shots, filters poorly aligned content, and leverages state-of-the-art multimodal and large language models (Qwen2.5-VL and DeepSeek-R1) to generate structured, temporally grounded procedural steps. This pipeline yields DenseStep2M, a large-scale dataset comprising approximately 100K videos and 2M detailed instructional steps, designed to support comprehensive long-form video understanding. To rigorously evaluate our pipeline, we curate DenseCaption100, a benchmark of high-quality, human-written captions. Evaluations demonstrate strong alignment between our auto-generated steps and human annotations. Furthermore, we validate the utility of DenseStep2M across three core downstream tasks: dense video captioning, procedural step grounding, and cross-modal retrieval. Models fine-tuned on DenseStep2M achieve substantial gains in captioning quality and temporal localization, while exhibiting robust zero-shot generalization across egocentric, exocentric, and mixed-perspective domains. These results underscore the effectiveness of DenseStep2M in facilitating advanced multimodal alignment and long-term activity reasoning. Our dataset is available at https://huggingface.co/datasets/mingjige/DenseStep2M.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces DenseStep2M, a dataset of ~2M dense procedural steps from 100K instructional videos, generated via a training-free pipeline that segments videos into shots, applies alignment filtering, and uses off-the-shelf Qwen2.5-VL and DeepSeek-R1 models to produce temporally grounded annotations. It curates DenseCaption100 as a human-annotated benchmark to show alignment with auto-generated steps, and reports downstream improvements in dense video captioning, procedural step grounding, and cross-modal retrieval when models are fine-tuned on DenseStep2M, with claimed robust zero-shot generalization across egocentric, exocentric, and mixed domains.

Significance. If the pipeline produces temporally accurate and semantically reliable steps at scale, DenseStep2M would be a valuable addition to instructional video resources, addressing noisy alignments in datasets like HowTo100M through scalable, training-free extraction. Strengths include the dataset release, multi-task downstream validation, and emphasis on cross-perspective generalization. However, the absence of quantified error rates for hallucinations or misalignments limits claims about quality and attribution of gains.

major comments (3)
  1. [Abstract] Abstract: The claims of 'strong alignment' with human annotations on DenseCaption100 and 'substantial gains' in captioning quality and temporal localization are unsupported by any quantitative metrics, error bars, ablation results, or statistical tests. This is load-bearing for the central claim that the pipeline yields high-quality annotations suitable for model training.
  2. [Methods/Evaluation] Pipeline and evaluation sections: No large-scale audit quantifies hallucination rates, factual errors, or temporal misalignment fractions (e.g., steps with offset > threshold) from Qwen2.5-VL + DeepSeek-R1 after filtering, across the full 2M-step distribution. Known VLM limitations on long procedural videos could propagate, undermining both dataset quality and attribution of downstream gains to DenseStep2M rather than data volume.
  3. [Evaluation] DenseCaption100 benchmark: Insufficient details on curation (video selection criteria, annotation protocol, inter-annotator agreement, size, and diversity) and baseline selection, raising risks of selection bias that could inflate reported alignment and generalization results.
minor comments (2)
  1. [Related Work] Add explicit references to prior work on instructional video datasets (e.g., HowTo100M limitations) and VLM hallucination mitigation techniques in the related work section.
  2. [Methods] Clarify the exact criteria and thresholds used in the 'alignment filtering' stage, including any hyperparameters, to improve reproducibility.
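The paper's actual filtering criteria and thresholds are exactly what this comment requests; purely as an illustration of the kind of rule an alignment filter can apply, the sketch below keeps a shot only when the cosine similarity between a narration embedding and a frame embedding clears a threshold. The embed_text and embed_frames callables and the 0.25 threshold are hypothetical, not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_shots(shots, embed_text, embed_frames, threshold=0.25):
    """Keep shots whose narration and visual content agree above a similarity threshold.
    Each shot is a dict with 'narration' and 'frames'; the embed_* callables stand in
    for whatever text and video encoders a pipeline like this actually uses."""
    kept = []
    for shot in shots:
        score = cosine(embed_text(shot["narration"]), embed_frames(shot["frames"]))
        if score >= threshold:
            kept.append(shot)
    return kept

# Toy usage with trivial 3-d embeddings standing in for real encoders.
shots = [{"narration": "dice the onion", "frames": None},
         {"narration": "", "frames": None}]
embed_text = lambda t: [1.0, 0.0, 1.0] if t else [0.0, 0.0, 0.0]
embed_frames = lambda f: [1.0, 0.2, 0.9]
print(len(filter_shots(shots, embed_text, embed_frames)))  # -> 1
```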

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive comments, which help clarify areas where additional rigor and transparency will strengthen the presentation of our results. We address each major point below and commit to revisions that improve the manuscript without altering its core contributions.

point-by-point responses
  1. Referee: [Abstract] Abstract: The claims of 'strong alignment' with human annotations on DenseCaption100 and 'substantial gains' in captioning quality and temporal localization are unsupported by any quantitative metrics, error bars, ablation results, or statistical tests. This is load-bearing for the central claim that the pipeline yields high-quality annotations suitable for model training.

    Authors: The body of the manuscript reports quantitative alignment metrics (e.g., step-level precision/recall/F1 on DenseCaption100) and downstream gains with baseline comparisons in Sections 4 and 5. We agree the abstract would be stronger with explicit numbers. In revision we will insert the key quantitative results, reference the ablations, and note where error bars or significance tests appear in the main text. revision: yes

  2. Referee: [Methods/Evaluation] Pipeline and evaluation sections: No large-scale audit quantifies hallucination rates, factual errors, or temporal misalignment fractions (e.g., steps with offset > threshold) from Qwen2.5-VL + DeepSeek-R1 after filtering, across the full 2M-step distribution. Known VLM limitations on long procedural videos could propagate, undermining both dataset quality and attribution of downstream gains to DenseStep2M rather than data volume.

    Authors: A full audit of all 2M steps is not present in the current manuscript and would require prohibitive human effort. We will add a limitations paragraph acknowledging VLM error modes on long videos and describing how the alignment filter reduces (but does not eliminate) misalignment. We will also expand the attribution discussion by highlighting the controlled volume-matched comparisons already present in the experiments. A complete large-scale audit remains outside the scope of this work. revision: partial

  3. Referee: [Evaluation] DenseCaption100 benchmark: Insufficient details on curation (video selection criteria, annotation protocol, inter-annotator agreement, size, and diversity) and baseline selection, raising risks of selection bias that could inflate reported alignment and generalization results.

    Authors: We will expand the DenseCaption100 description in the revised manuscript to specify video selection criteria (domain and perspective diversity, no overlap with training data), the annotation protocol and guidelines, inter-annotator agreement statistics, exact size, and category coverage. Baseline selection will be justified with references to standard methods in the literature. These additions will reduce ambiguity around potential bias. revision: yes

standing simulated objections not resolved
  • A complete large-scale human audit of hallucination and temporal misalignment rates across the entire 2M-step dataset

Circularity Check

0 steps flagged

No significant circularity; pipeline and evaluations are externally grounded

full rationale

The paper describes a training-free pipeline that applies off-the-shelf Qwen2.5-VL and DeepSeek-R1 models after shot segmentation and alignment filtering to produce the DenseStep2M dataset. Quality is assessed against an independently curated human benchmark (DenseCaption100) and downstream gains are measured on separate tasks with zero-shot generalization tests. No equations, fitted parameters, or self-citations are invoked in a manner that reduces any claimed result to the inputs by construction; the central claims rest on external models and human annotations rather than self-referential fitting or renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The pipeline depends on the assumption that current large multimodal models can reliably extract structured steps and that simple shot segmentation plus alignment filtering suffices to remove noise at scale.

axioms (2)
  • domain assumption Qwen2.5-VL and DeepSeek-R1 can generate accurate, temporally grounded procedural steps from video frames and narration without task-specific fine-tuning.
    The entire annotation pipeline is built on the performance of these two specific models.
  • domain assumption Standard video shot segmentation and alignment filtering can reliably separate well-aligned from poorly aligned content in instructional videos.
    This step is presented as a core component of the scalable pipeline.
invented entities (1)
  • DenseStep2M dataset · no independent evidence
    purpose: Large-scale collection of dense, temporally grounded instructional steps for training long-form video models.
    The dataset is the primary output of the pipeline and is validated only against the authors' own human benchmark.
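The second axiom leans on "standard video shot segmentation." As a hedged illustration of what a minimal version of that component can look like (not the paper's implementation, which is not detailed in this review), the sketch below declares a shot boundary whenever the intensity histograms of consecutive frames diverge past a threshold; the 0.4 threshold, 32 bins, and toy frames are invented for illustration.

```python
import numpy as np

def shot_boundaries(frames, threshold=0.4, bins=32):
    """Detect shot boundaries from grayscale frames (H x W uint8 arrays) by
    thresholding the L1 distance between consecutive normalized intensity histograms."""
    boundaries = []
    prev_hist = None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=bins, range=(0, 255))
        hist = hist / max(hist.sum(), 1)
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            boundaries.append(i)  # frame i starts a new shot
        prev_hist = hist
    return boundaries

def shots_from_boundaries(n_frames, boundaries, fps=30.0):
    """Turn boundary frame indices into (start_s, end_s) shot intervals."""
    cuts = [0] + boundaries + [n_frames]
    return [(cuts[k] / fps, cuts[k + 1] / fps) for k in range(len(cuts) - 1)]

# Toy usage: 60 dark frames followed by 60 bright frames -> one detected cut.
frames = [np.full((8, 8), 20, np.uint8)] * 60 + [np.full((8, 8), 200, np.uint8)] * 60
b = shot_boundaries(frames)
print(b, shots_from_boundaries(len(frames), b))  # -> [60] [(0.0, 2.0), (2.0, 4.0)]
```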

pith-pipeline@v0.9.0 · 5590 in / 1570 out tokens · 64318 ms · 2026-05-07T13:45:08.185339+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

164 extracted references · 29 canonical work pages · 20 internal anchors

  1. [1]

    Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. Youtube-8m: A large-scale video classification benchmark.arXiv preprint arXiv:1609.08675 (2016)

  2. [2]

    Triantafyllos Afouras, Effrosyni Mavroudi, Tushar Nagarajan, Huiyu Wang, and Lorenzo Torresani. 2023. Ht-step: Aligning instructional articles with how-to videos.Advances in Neural Information Processing Systems36 (2023), 50310–50326

  3. [3]

    Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, and Simon Lacoste-Julien. 2016. Unsupervised learning from narrated instruction videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4575–4583

  4. [4]

    Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision. 5803–5812

  5. [5]

    Anthropic. 2024. The Claude 3 Model Family: Opus, Sonnet, Haiku. https://www.anthropic.com

  6. [6]

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report.arXiv preprint arXiv:2309.16609(2023)

  7. [7]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al . 2025. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631(2025)

  8. [8]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923(2025)

  9. [9]

    Max Bain, Jaesung Huh, Tengda Han, and Andrew Zisserman. 2023. WhisperX: Time-Accurate Speech Transcription of Long-Form Audio.INTERSPEECH 2023 (2023)

  10. [10]

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. 2021. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. InProceedings of the IEEE/CVF International Conference on Computer Vision

  11. [11]

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. InProceedings of the IEEE/CVF international conference on computer vision. 1728–1738

  12. [12]

    Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. InProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. 65–72

  13. [13]

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. 2023. Rt-2: Vision-language-action models transfer web knowledge to robotic control.arXiv preprint arXiv:2307.15818(2023)

  14. [14]

    Keshigeyan Chandrasegaran, Agrim Gupta, Lea M Hadzic, Taran Kota, Jimming He, Cristóbal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, and Fei-Fei Li. 2024. HourVideo: 1-hour video-language understanding. Advances in Neural Information Processing Systems 37 (2024), 53168–53197

  16. [16]

    Chien-Yi Chang, De-An Huang, Danfei Xu, Ehsan Adeli, Li Fei-Fei, and Juan Car- los Niebles. 2020. Procedure planning in instructional videos. InEuropean Con- ference on Computer Vision. Springer, 334–350

  17. [17]

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. 2024. Gr-2: A generative video- language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158(2024)

  18. [18]

    David Chen and William B Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. InProceedings of the 49th annual meeting of the association for computational linguistics: human language technologies. 190–200

  19. [19]

    Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Zhenyu Tang, Li Yuan, et al. 2024. Sharegpt4video: Im- proving video understanding and generation with better captions.Advances in Neural Information Processing Systems37 (2024), 19472–19495

  20. [20]

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming- Hsuan Yang, et al . 2024. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13320–13331

  21. [21]

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al . 2024. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271(2024)

  22. [22]

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. 2024. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences67, 12 (2024), 220101

  23. [23]

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al . 2024. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 24185–24198

  24. [24]

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. 2024. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476(2024)

  25. [25]

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multi- modality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261(2025)

  26. [26]

    Tri Dao. 2024. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. InInternational Conference on Learning Representations (ICLR)

  27. [27]

    Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. The Faiss library. (2024). arXiv:2401.08281 [cs.LG]

  28. [28]

    Nikita Dvornik, Isma Hadji, Ran Zhang, Konstantinos G Derpanis, Richard P Wildes, and Allan D Jepson. 2023. Stepformer: Self-supervised step discovery and localization in instructional videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18952–18961

  29. [29]

    Soichiro Fujita, Tsutomu Hirao, Hidetaka Kamigaito, Manabu Okumura, and Masaaki Nagata. 2020. Soda: Story oriented dense video captioning evaluation framework. InComputer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16. Springer, 517–531

  30. [30]

    Deepti Ghadiyaram, Du Tran, and Dhruv Mahajan. 2019. Large-scale weakly- supervised pre-training for video action recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 12046–12055

  31. [31]

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. 2022. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 18995–19012

  32. [32]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al . 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948(2025)

  33. [33]

    Tengda Han, Weidi Xie, and Andrew Zisserman. 2022. Temporal alignment networks for long-term video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2906–2916

  34. [34]

    Yuping He, Yifei Huang, Guo Chen, Baoqi Pei, Jilan Xu, Tong Lu, and Jiangmiao Pang. 2025. Egoexobench: A benchmark for first-and third-person view video understanding in mllms.arXiv preprint arXiv:2507.18342(2025)

  35. [35]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2022. Lora: Low-rank adaptation of large language models.ICLR1, 2 (2022), 3

  36. [36]

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card.arXiv preprint arXiv:2410.21276(2024)

  37. [37]

    Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, and Seong Tae Kim. 2024. Do you remember? Dense video captioning with cross-modal memory retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13894–13904

  39. [39]

    Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, and Seong Tae Kim. 2025. HiCM 2: Hierarchical Compact Memory Modeling for Dense Video Captioning. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 4293–4301

  40. [40]

    Hilde Kuehne, Ali Arslan, and Thomas Serre. 2014. The language of actions: Recovering the syntax and semantics of goal-directed human activities. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 780–787

  41. [41]

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. 2024. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 22195–22206

  42. [42]

    Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. 2020. HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2046–2065

  44. [44]

    Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, et al . 2026. Qwen3-VL- Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking.arXiv preprint arXiv:2601.04720(2026)

  45. [45]

    Zeqian Li, Qirui Chen, Tengda Han, Ya Zhang, Yanfeng Wang, and Weidi Xie. 2024. Multi-Sentence Grounding for Long-term Instructional Video. In European Conference on Computer Vision. 200–216

  47. [47]

    Kevin Qinghong Lin, Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Z Xu, Difei Gao, Rong-Cheng Tu, Wenzhe Zhao, Weijie Kong, et al. 2022. Egocentric video-language pretraining. Advances in Neural Information Processing Systems 35 (2022), 7575–7586

  48. [48]

    Yikun Liu, Yajie Zhang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, and Weidi Xie. 2025. Lamra: Large multimodal model as your advanced retrieval assistant. InProceedings of the Computer Vision and Pattern Recognition Conference. 4015–4025

  49. [49]

    Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101(2017)

  50. [50]

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. 2023. Egoschema: A diagnostic benchmark for very long-form video language un- derstanding.Advances in Neural Information Processing Systems36 (2023), 46212– 46244

  51. [51]

    Effrosyni Mavroudi, Triantafyllos Afouras, and Lorenzo Torresani. 2023. Learning to ground instructional articles in videos through narrations. InProceedings of the IEEE/CVF International Conference on Computer Vision. 15201–15213

  52. [52]

    Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, et al . 2025. Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents.arXiv preprint arXiv:2507.04590(2025)

  53. [53]

    Sachit Menon, Ishan Misra, and Rohit Girdhar. 2024. Generating illustrated instructions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6274–6284

  54. [54]

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. InProceedings of the IEEE/CVF International Conference on Computer Vision. 2630–2640

  55. [55]

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748(2018)

  56. [56]

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deep- speed: System optimizations enable training deep learning models with over 100 billion parameters. InProceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 3505–3506

  57. [57]

    Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics1 (2013), 25–36

  58. [58]

    Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, and Arman Cohan. 2024. TOMATO: Assessing Visual Tempo- ral Reasoning Capabilities in Multimodal Foundation Models.arXiv preprint arXiv:2410.23266(2024)

  59. [59]

    Nina Shvetsova, Anna Kukleva, Xudong Hong, Christian Rupprecht, Bernt Schiele, and Hilde Kuehne. 2024. Howtocaption: Prompting llms to transform video annotations at scale. InEuropean Conference on Computer Vision. 1–18

  60. [60]

    Gunnar A Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, and Karteek Alahari. 2018. Charades-ego: A large-scale dataset of paired third and first person videos.arXiv preprint arXiv:1804.09626(2018)

  61. [61]

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. 2025. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267 (2025)

  63. [63]

    Yale Song, Eugene Byrne, Tushar Nagarajan, Huiyu Wang, Miguel Martin, and Lorenzo Torresani. 2023. Ego4d goal-step: Toward hierarchical understanding of procedural activities.Advances in Neural Information Processing Systems36 (2023), 38863–38886

  64. [64]

    Tomáš Souček, Dima Damen, Michael Wray, Ivan Laptev, and Josef Sivic. 2024. Genhowto: Learning to generate actions and state transformations from instruc- tional videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6561–6571

  65. [65]

    Tomáš Souček, Prajwal Gatti, Michael Wray, Ivan Laptev, Dima Damen, and Josef Sivic. 2024. ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions.arXiv preprint arXiv:2412.01987(2024)

  66. [66]

    Tomáš Souček, Prajwal Gatti, Michael Wray, Ivan Laptev, Dima Damen, and Josef Sivic. 2025. ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  67. [67]

    Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. 2019. Coin: A large-scale dataset for comprehensive in- structional video analysis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1207–1216

  68. [68]

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805(2023)

  69. [69]

    Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. InProceedings of the IEEE confer- ence on computer vision and pattern recognition. 4566–4575

  70. [70]

    Hanlin Wang, Yilu Wu, Sheng Guo, and Limin Wang. 2023. Pdpp: Projected diffusion for procedure planning in instructional videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14836–14845

  71. [71]

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. 2024. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191(2024)

  72. [72]

    Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, and Ping Luo. 2021. End-to-end dense video captioning with parallel decoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6847–6857

  74. [74]

    Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. 2023. Internvid: A large-scale video-text dataset for multimodal understanding and generation.arXiv preprint arXiv:2307.06942(2023)

  75. [75]

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. 2024. Longvideobench: A benchmark for long-context interleaved video-language understanding.Advances in Neural Information Processing Systems37 (2024), 28828–28857

  76. [76]

    Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, and Weidi Xie. 2024. Retrieval-augmented egocentric video captioning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13525–13536

  77. [77]

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. Msr-vtt: A large video description dataset for bridging video and language. InProceedings of the IEEE conference on computer vision and pattern recognition. 5288–5296

  78. [78]

    Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. 2022. Advancing high-resolution video- language representation with large-scale video transcriptions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5036–5045

  79. [79]

    Zihui Xue, Kumar Ashutosh, and Kristen Grauman. 2024. Learning object state changes in videos: An open-world perspective. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18493–18503

  80. [80]

    Semih Yagcioglu, Aykut Erdem, Erkut Erdem, and Nazli Ikizler-Cinbis. 2018. RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 1358–1368

Showing first 80 references.