CurEvo: Curriculum-Guided Self-Evolution for Video Understanding
Pith reviewed 2026-05-07 11:48 UTC · model grok-4.3
The pith
CurEvo adds curriculum guidance to self-evolution so video models improve autonomously by matching task difficulty to their current ability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CurEvo is a curriculum-guided self-evolution framework that introduces curriculum learning into self-evolution to achieve more structured and progressive model improvement. It dynamically regulates task difficulty, refines evaluation criteria, and balances data diversity according to model competence, forming a curriculum-guided feedback loop that aligns learning complexity with model capability. Built upon this principle, a multi-dimensional adaptive QA framework jointly evolves question generation and answer evaluation across perception, recognition, and understanding dimensions.
What carries the argument
The curriculum-guided feedback loop, implemented via a multi-dimensional adaptive QA framework that evolves question generation and answer evaluation across perception, recognition, and understanding dimensions.
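The provided text stops at this description, with no pseudocode or equations, so the sketch below is one plausible rendering of such a feedback loop rather than the authors' algorithm: the competence probe, target accuracy, and update rates are all invented for illustration.

```python
# Minimal self-contained sketch of a curriculum-guided feedback loop,
# reconstructed from the abstract alone: the provided text has no
# pseudocode, so the probe, targets, and rates are all assumptions.

DIMS = ("perception", "recognition", "understanding")

def probe_accuracy(skill, difficulty):
    """Toy competence probe: accuracy falls as difficulty outpaces skill."""
    return max(0.0, min(1.0, skill - difficulty + 0.5))

def curriculum_self_evolve(iters=10, target=0.7, step=0.2):
    skill = {d: 0.3 for d in DIMS}        # stand-in for model competence
    difficulty = {d: 0.1 for d in DIMS}   # curriculum starts easy
    for _ in range(iters):
        for d in DIMS:
            acc = probe_accuracy(skill[d], difficulty[d])
            # Regulate difficulty toward current ability: push harder when
            # the model beats the target accuracy, ease off when it lags.
            difficulty[d] = max(0.0, min(1.0,
                                difficulty[d] + step * (acc - target)))
            # Toy learning dynamic: training pays off most when difficulty
            # sits near the edge of ability (our assumption, not the paper's).
            skill[d] = min(1.0, skill[d] + 0.05 * (1.0 - abs(acc - target)))
    return skill, difficulty

print(curriculum_self_evolve())
```

Running the toy loop shows difficulty ratcheting up in each dimension only as the simulated skill catches up, which is the qualitative behavior the core claim depends on.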
If this is right
- Video understanding models can advance without human annotations through structured, progressive self-evolution.
- Both standard accuracy metrics and semantic evaluation scores improve when curriculum guidance controls the self-evolution process.
- The method produces gains on multiple VideoQA benchmarks when applied to different backbone models.
- Joint evolution of questions and answers across perception, recognition, and understanding maintains coherent curriculum progression.
Where Pith is reading between the lines
- Similar curriculum mechanisms could stabilize self-improvement loops in other multimodal or language tasks.
- The adaptive evaluation component might reduce reliance on fixed human benchmarks during autonomous training.
- Measuring model competence more precisely could further strengthen the feedback loop's alignment.
- Extending the multi-dimensional adaptation to longer videos or open-ended tasks would test broader applicability.
Load-bearing premise
That dynamically regulating task difficulty, refining evaluation criteria, and balancing data diversity according to model competence will form an effective curriculum-guided feedback loop that aligns learning complexity with capability without introducing new instabilities or biases.
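Since the paper text gives no update rule, one minimal reading of "regulating task difficulty according to model competence" is a proportional controller on held-out accuracy; the notation below is an assumption, not the authors'.

```latex
% Hypothetical update rule (not the paper's notation): d_t is curriculum
% difficulty at iteration t, a_t the model's held-out accuracy, a^* a
% target accuracy, and \eta a step size; clipping bounds the curriculum.
d_{t+1} = \mathrm{clip}\!\bigl(d_t + \eta\,(a_t - a^{*}),\ 0,\ 1\bigr)
```

Under this reading, the premise amounts to the controller converging: too large an \(\eta\) would make difficulty overshoot ability and reintroduce exactly the instability the premise rules out.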
What would settle it
An experiment comparing CurEvo against a baseline self-evolution method on identical backbones and the same four VideoQA benchmarks, checking whether benchmark accuracy and evaluator semantic scores rise, stay flat, or fall.
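One hedged way to score that comparison: collect the per-cell accuracy deltas (CurEvo minus baseline) over the 7 × 4 backbone-benchmark grid and test whether the mean gain clears zero. The sketch below uses a paired bootstrap on randomly drawn placeholder deltas, since the provided text reports no numbers.

```python
import random

# Hypothetical scoring of the settling experiment: a paired bootstrap over
# the 7-backbone x 4-benchmark grid of accuracy deltas (CurEvo minus the
# baseline self-evolution method). The deltas below are drawn at random
# purely for illustration; the paper text reports no numbers.

def paired_bootstrap(deltas, n_boot=10_000, seed=0):
    """Fraction of bootstrap resamples whose mean gain is <= 0."""
    rng = random.Random(seed)
    n = len(deltas)
    worse = sum(
        1 for _ in range(n_boot)
        if sum(rng.choice(deltas) for _ in range(n)) / n <= 0
    )
    return worse / n_boot

rng = random.Random(42)
deltas = [rng.gauss(0.01, 0.02) for _ in range(7 * 4)]  # fake per-cell gains
print(f"P(mean gain <= 0) ~ {paired_bootstrap(deltas):.3f}")
```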
Original abstract
Recent advances in self-evolution video understanding frameworks have demonstrated the potential of autonomous learning without human annotations. However, existing methods often suffer from weakly controlled optimization and uncontrolled difficulty progression, as they lack structured guidance throughout the iterative learning process. To address these limitations, we propose CurEvo, a curriculum-guided self-evolution framework that introduces curriculum learning into self-evolution to achieve more structured and progressive model improvement. CurEvo dynamically regulates task difficulty, refines evaluation criteria, and balances data diversity according to model competence, forming a curriculum-guided feedback loop that aligns learning complexity with model capability. Built upon this principle, we develop a multi-dimensional adaptive QA framework that jointly evolves question generation and answer evaluation across perception, recognition, and understanding dimensions, ensuring coherent and measurable curriculum progression. Through this integration, CurEvo transforms weakly controlled self-evolution into a more structured learning process for autonomous video understanding. Across seven backbones, CurEvo consistently improves both benchmark accuracy and evaluator-based semantic score on four VideoQA benchmarks, validating the effectiveness of curriculum-guided self-evolution for video understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CurEvo, a curriculum-guided self-evolution framework for video understanding that integrates structured curriculum learning into autonomous self-evolution loops. It dynamically adjusts task difficulty, evaluation criteria, and data diversity based on model competence, and introduces a multi-dimensional adaptive QA framework that jointly evolves questions and answers across perception, recognition, and understanding dimensions. The central empirical claim is that this approach yields consistent gains in benchmark accuracy and evaluator-based semantic scores across seven backbones on four VideoQA benchmarks.
Significance. If the reported gains are robustly validated with proper controls and ablations, the work could meaningfully advance autonomous video understanding by providing a more stable and progressive alternative to uncontrolled self-evolution methods, reducing reliance on human annotations while aligning learning complexity with model capability.
Major comments (2)
- [Abstract, §4 (Experiments)] The central claim of 'consistent improvements' across seven backbones and four benchmarks is stated without any numerical results, tables, standard deviations, or direct comparisons to prior self-evolution baselines. This makes it impossible to assess whether the curriculum-guided feedback loop actually delivers the claimed gains or merely reflects uncontrolled variance.
- [§3 (Method)] The description of the multi-dimensional adaptive QA framework and the curriculum feedback loop lacks any equations, pseudocode, or precise definitions of how difficulty, diversity, and evaluation criteria are quantified and updated from model competence. Without these, the mechanism cannot be reproduced or checked for internal consistency or hidden biases.
Minor comments (1)
- [Abstract] The abstract and introduction would benefit from a brief statement of the specific VideoQA benchmarks and backbones used, even at high level, to orient readers.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and outline the specific revisions that will be incorporated to improve clarity, reproducibility, and the presentation of results.
Point-by-point responses
Referee: [Abstract, §4 (Experiments)] The central claim of 'consistent improvements' across seven backbones and four benchmarks is stated without any numerical results, tables, standard deviations, or direct comparisons to prior self-evolution baselines. This makes it impossible to assess whether the curriculum-guided feedback loop actually delivers the claimed gains or merely reflects uncontrolled variance.
Authors: We agree that the abstract presents the improvements qualitatively and that §4 would benefit from more explicit quantitative support. In the revised manuscript we will add a concise summary of key numerical gains (average accuracy and semantic score improvements) to the abstract. We will also augment §4 with complete tables that report per-backbone and per-benchmark results, include standard deviations across multiple runs, and provide direct side-by-side comparisons against prior self-evolution baselines. These changes will allow readers to evaluate the robustness of the reported gains. Revision: yes.
Referee: [§3 (Method)] The description of the multi-dimensional adaptive QA framework and the curriculum feedback loop lacks any equations, pseudocode, or precise definitions of how difficulty, diversity, and evaluation criteria are quantified and updated from model competence. Without these, the mechanism cannot be reproduced or checked for internal consistency or hidden biases.
Authors: We acknowledge that the current textual description would be strengthened by formal definitions. In the revision we will insert (i) pseudocode for the overall curriculum-guided feedback loop, (ii) an equation defining task difficulty as a function of model competence (e.g., accuracy on a held-out validation set), (iii) a diversity metric (entropy-based) that is updated at each iteration, and (iv) the adaptive rules for the three evaluation dimensions (perception, recognition, understanding). These additions will make the quantification and update procedures explicit and facilitate reproducibility checks. Revision: yes.
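To make item (iii) concrete before the revision appears, an entropy-based diversity metric over generated question types might look like the sketch below; the category labels and the resampling rule are assumptions, and the revised paper may define both differently.

```python
import math
from collections import Counter

# Sketch of the entropy-based diversity metric the rebuttal promises in
# item (iii). Category names and the oversampling rule are assumptions.

def diversity_entropy(question_types):
    """Shannon entropy (bits) of the generated question-type distribution."""
    counts = Counter(question_types)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

pool = ["perception", "perception", "recognition",
        "understanding", "understanding", "understanding"]
h, h_max = diversity_entropy(pool), math.log2(3)
print(f"diversity: {h:.3f} of {h_max:.3f} bits")
# A curriculum step could then oversample whichever dimension is
# under-represented whenever h drops below a chosen fraction of h_max.
```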
Circularity Check
No significant circularity
Full rationale
The paper advances an empirical curriculum-guided self-evolution framework for video understanding, with the central claim being experimental gains in accuracy and semantic scores across seven backbones and four VideoQA benchmarks. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The method description (dynamic regulation of difficulty, evaluation, and diversity) is presented as a design choice tested experimentally rather than a quantity forced by definition or prior self-work. The claim therefore rests on external benchmarks rather than on a self-referential derivation chain.