pith. machine review for the scientific record.

arxiv: 2604.26707 · v1 · submitted 2026-04-29 · 💻 cs.CV · cs.LG

Recognition: unknown

CurEvo: Curriculum-Guided Self-Evolution for Video Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 11:48 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords: curriculum learning · self-evolution · video understanding · VideoQA · autonomous learning · adaptive QA · model competence

The pith

CurEvo adds curriculum guidance to self-evolution so video models improve autonomously by matching task difficulty to their current ability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the lack of structure in existing self-evolution methods for video understanding, where difficulty progression and optimization remain weakly controlled. CurEvo inserts curriculum learning to regulate task difficulty, refine evaluation criteria, and balance data diversity according to the model's competence. This creates a feedback loop that aligns learning complexity with capability. The approach is realized through a multi-dimensional adaptive QA framework that evolves questions and answers jointly across perception, recognition, and understanding. Results across seven backbones on four VideoQA benchmarks show gains in both accuracy and semantic scores.

Core claim

CurEvo introduces curriculum learning into the self-evolution loop to achieve more structured and progressive model improvement. It dynamically regulates task difficulty, refines evaluation criteria, and balances data diversity according to model competence, forming a curriculum-guided feedback loop that aligns learning complexity with model capability. Built on this principle, a multi-dimensional adaptive QA framework jointly evolves question generation and answer evaluation across the perception, recognition, and understanding dimensions.

What carries the argument

The curriculum-guided feedback loop, implemented via a multi-dimensional adaptive QA framework that evolves question generation and answer evaluation across perception, recognition, and understanding dimensions.
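
To make that loop concrete: below is a minimal sketch of the control flow described above, not CurEvo's actual algorithm. The paper, as excerpted on this page, gives no pseudocode, so every helper is passed in as a parameter and every name is illustrative.

```python
# Hedged sketch of a curriculum-guided self-evolution loop. All helper
# functions (generate_qa, evaluate_answer, select_diverse, fine_tune,
# estimate_competence) are hypothetical stand-ins for procedures the
# paper does not specify here.

DIMENSIONS = ("perception", "recognition", "understanding")

def curriculum_self_evolution(model, videos, *, generate_qa, evaluate_answer,
                              select_diverse, fine_tune, estimate_competence,
                              iterations=5):
    # Start easy in every dimension; difficulty lives in [0, 1].
    difficulty = {d: 0.1 for d in DIMENSIONS}
    for _ in range(iterations):
        # 1. Generate QA pairs whose difficulty matches current competence.
        qa_pairs = [generate_qa(model, v, difficulty) for v in videos]
        # 2. Score the model's own answers with the co-evolving evaluator.
        scored = [(qa, evaluate_answer(model, qa)) for qa in qa_pairs]
        # 3. Retain a diverse, competence-appropriate subset for training.
        retained = select_diverse(scored, difficulty)
        model = fine_tune(model, retained)
        # 4. Re-estimate per-dimension competence and ratchet difficulty
        #    upward only where the model has caught up, closing the loop.
        competence = estimate_competence(model)
        for d in DIMENSIONS:
            difficulty[d] = min(1.0, max(difficulty[d], competence[d]))
    return model
```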

If this is right

  • Video understanding models can advance without human annotations through structured, progressive self-evolution.
  • Both standard accuracy metrics and semantic evaluation scores improve when curriculum guidance controls the self-evolution process.
  • The method produces gains on multiple VideoQA benchmarks when applied to different backbone models.
  • Joint evolution of questions and answers across perception, recognition, and understanding maintains coherent curriculum progression.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar curriculum mechanisms could stabilize self-improvement loops in other multimodal or language tasks.
  • The adaptive evaluation component might reduce reliance on fixed human benchmarks during autonomous training.
  • Measuring model competence more precisely could further strengthen the feedback loop's alignment.
  • Extending the multi-dimensional adaptation to longer videos or open-ended tasks would test broader applicability.

Load-bearing premise

That dynamically regulating task difficulty, refining evaluation criteria, and balancing data diversity according to model competence will form an effective curriculum-guided feedback loop that aligns learning complexity with capability without introducing new instabilities or biases.
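
One way to read the "without introducing new instabilities" clause: a raw competence signal is noisy, and standard practice would damp it before it drives difficulty. The illustration below is hedged; the smoothing constant and clipping bounds are our choices, not the paper's.

```python
# Illustration of the stability concern in the premise, not a method from
# the paper: an exponential moving average damps a noisy competence
# estimate before it sets task difficulty. alpha, lo, and hi are
# hypothetical choices.

def update_difficulty(prev_difficulty, competence, alpha=0.3, lo=0.05, hi=1.0):
    """Blend the latest competence estimate into the running difficulty,
    then clip so the curriculum neither collapses nor saturates abruptly."""
    smoothed = (1 - alpha) * prev_difficulty + alpha * competence
    return min(hi, max(lo, smoothed))

# A jittery competence trace yields a steadier difficulty path.
d = 0.2
for c in (0.35, 0.15, 0.40, 0.45, 0.30):
    d = update_difficulty(d, c)
```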

What would settle it

An experiment comparing CurEvo against a baseline self-evolution method on identical backbones and the same four VideoQA benchmarks, checking whether benchmark accuracy and evaluator semantic scores rise, stay flat, or fall.
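
In code, that settling experiment is just a paired comparison. The sketch below assumes hypothetical training and scoring functions, since this page names neither the four benchmarks nor the evaluator.

```python
# Hedged harness for the settling experiment: identical backbones, the same
# benchmarks, CurEvo versus a plain self-evolution baseline. train_curevo,
# train_baseline, accuracy, and semantic_score are hypothetical stand-ins.

def settle(backbones, benchmarks, *, train_curevo, train_baseline,
           accuracy, semantic_score):
    deltas = {}
    for name, backbone in backbones.items():
        cur = train_curevo(backbone)     # curriculum-guided self-evolution
        base = train_baseline(backbone)  # uncontrolled self-evolution
        for bench in benchmarks:
            # The claim survives only if both deltas are positive across
            # the full grid of backbones and benchmarks.
            deltas[(name, bench)] = (
                accuracy(cur, bench) - accuracy(base, bench),
                semantic_score(cur, bench) - semantic_score(base, bench),
            )
    return deltas
```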

Figures

Figures reproduced from arXiv: 2604.26707 by Guiyi Zeng, Junqing Yu, Wei Yang, Xu Chen, Yi-Ping Phoebe Chen, Zikai Song.

Figure 1: Supervised VideoQA models rely on human …

Figure 2: Overview of the curriculum-guided self-evolution pipeline. At each iteration, the base model generates multi…

Figure 3: Multi-dimensional question generation module.

Figure 4: Examples of generated data. Examples of the video, …

Figure 5: Left: retained samples across self-evolution iterations. Right: accuracy improvement over the baseline model across four video question answering datasets as training iterations increase.
Original abstract

Recent advances in self-evolution video understanding frameworks have demonstrated the potential of autonomous learning without human annotations. However, existing methods often suffer from weakly controlled optimization and uncontrolled difficulty progression, as they lack structured guidance throughout the iterative learning process. To address these limitations, we propose CurEvo, a curriculum-guided self-evolution framework that introduces curriculum learning into self-evolution to achieve more structured and progressive model improvement. CurEvo dynamically regulates task difficulty, refines evaluation criteria, and balances data diversity according to model competence, forming a curriculum-guided feedback loop that aligns learning complexity with model capability. Built upon this principle, we develop a multi-dimensional adaptive QA framework that jointly evolves question generation and answer evaluation across perception, recognition, and understanding dimensions, ensuring coherent and measurable curriculum progression. Through this integration, CurEvo transforms weakly controlled self-evolution into a more structured learning process for autonomous video understanding. Across seven backbones, CurEvo consistently improves both benchmark accuracy and evaluator-based semantic score on four VideoQA benchmarks, validating the effectiveness of curriculum-guided self-evolution for video understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes CurEvo, a curriculum-guided self-evolution framework for video understanding that integrates structured curriculum learning into autonomous self-evolution loops. It dynamically adjusts task difficulty, evaluation criteria, and data diversity based on model competence, and introduces a multi-dimensional adaptive QA framework that jointly evolves questions and answers across perception, recognition, and understanding dimensions. The central empirical claim is that this approach yields consistent gains in benchmark accuracy and evaluator-based semantic scores across seven backbones on four VideoQA benchmarks.

Significance. If the reported gains are robustly validated with proper controls and ablations, the work could meaningfully advance autonomous video understanding by providing a more stable and progressive alternative to uncontrolled self-evolution methods, reducing reliance on human annotations while aligning learning complexity with model capability.

major comments (2)
  1. [Abstract and §4 (Experiments)] The central claim of 'consistent improvements' across seven backbones and four benchmarks is stated without any numerical results, tables, standard deviations, or direct comparisons to prior self-evolution baselines. This makes it impossible to assess whether the curriculum-guided feedback loop actually delivers the claimed gains or merely reflects uncontrolled variance.
  2. [§3 (Method)] The description of the multi-dimensional adaptive QA framework and the curriculum feedback loop lacks any equations, pseudocode, or precise definitions of how difficulty, diversity, and evaluation criteria are quantified and updated from model competence. Without these, the mechanism cannot be reproduced or checked for internal consistency or hidden biases.
minor comments (1)
  1. [Abstract] The abstract and introduction would benefit from a brief statement of the specific VideoQA benchmarks and backbones used, even at a high level, to orient readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and outline the specific revisions that will be incorporated to improve clarity, reproducibility, and the presentation of results.

Point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The central claim of 'consistent improvements' across seven backbones and four benchmarks is stated without any numerical results, tables, standard deviations, or direct comparisons to prior self-evolution baselines. This makes it impossible to assess whether the curriculum-guided feedback loop actually delivers the claimed gains or merely reflects uncontrolled variance.

    Authors: We agree that the abstract presents the improvements qualitatively and that §4 would benefit from more explicit quantitative support. In the revised manuscript we will add a concise summary of key numerical gains (average accuracy and semantic-score improvements) to the abstract. We will also augment §4 with complete tables that report per-backbone and per-benchmark results, include standard deviations across multiple runs, and provide direct side-by-side comparisons against prior self-evolution baselines. These changes will allow readers to evaluate the robustness of the reported gains. (Revision: yes.)

  2. Referee: [§3 (Method)] The description of the multi-dimensional adaptive QA framework and the curriculum feedback loop lacks any equations, pseudocode, or precise definitions of how difficulty, diversity, and evaluation criteria are quantified and updated from model competence. Without these, the mechanism cannot be reproduced or checked for internal consistency or hidden biases.

    Authors: We acknowledge that the current textual description would be strengthened by formal definitions. In the revision we will insert (i) pseudocode for the overall curriculum-guided feedback loop, (ii) an equation defining task difficulty as a function of model competence (e.g., accuracy on a held-out validation set), (iii) a diversity metric (entropy-based) that is updated at each iteration, and (iv) the adaptive rules for the three evaluation dimensions (perception, recognition, understanding). These additions will make the quantification and update procedures explicit and facilitate reproducibility checks. (Revision: yes.)
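
To preview what those additions might look like, here is a hedged rendering of items (ii) and (iii) exactly as the rebuttal sketches them: difficulty read off held-out accuracy, and diversity as Shannon entropy over question categories. These are the rebuttal's suggestions in code, not definitions from the paper.

```python
import math
from collections import Counter

# Sketch of rebuttal items (ii) and (iii). Assumptions: competence is
# proxied by held-out accuracy, and diversity is Shannon entropy over the
# categories of retained questions. Neither formula appears in the paper
# as excerpted on this page.

def task_difficulty(heldout_accuracy):
    """Item (ii): difficulty d_t = f(c_t) with competence c_t measured as
    held-out accuracy; here f is the identity, so harder tasks arrive only
    as the model demonstrably improves."""
    return heldout_accuracy

def diversity_entropy(question_categories):
    """Item (iii): entropy of the category distribution of questions
    retained this iteration; higher means a more balanced curriculum
    across perception, recognition, and understanding."""
    counts = Counter(question_categories)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# A perception-heavy batch scores lower diversity than a balanced one.
skewed = ["perception"] * 8 + ["recognition", "understanding"]
balanced = ["perception"] * 4 + ["recognition"] * 3 + ["understanding"] * 3
assert diversity_entropy(balanced) > diversity_entropy(skewed)
```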

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper advances an empirical curriculum-guided self-evolution framework for video understanding, with the central claim being experimental gains in accuracy and semantic scores across seven backbones and four VideoQA benchmarks. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The method description (dynamic regulation of difficulty, evaluation, and diversity) is presented as a design choice tested experimentally rather than a quantity forced by definition or prior self-work. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the framework is described conceptually without mathematical or implementation specifics.

pith-pipeline@v0.9.0 · 5502 in / 984 out tokens · 34354 ms · 2026-05-07T11:48:19.788233+00:00 · methodology

