Recognition: unknown
All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding
Pith reviewed 2026-05-10 15:23 UTC · model grok-4.3
The pith
A unified synthetic pipeline generates unlimited multimodal video data; models trained on it generalize to real datasets and often outperform counterparts trained on real data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a unified synthetic data generation pipeline that automatically produces unlimited multimodal video data with rich and diverse supervision across multiple task formats. We further add a VQA-based fine-tuning strategy that trains models to answer structured questions about visual content. Models trained predominantly on data from this pipeline generalize effectively to real-world datasets in video object counting, video-based visual question answering, and video object segmentation, often outperforming models trained on traditional real data.
What carries the argument
The unified synthetic data generation pipeline that produces consistent multimodal video data with annotations for multiple tasks in one framework, together with a VQA-based fine-tuning method that requires structured answers about visual content.
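The paper does not publish its data schema, but a minimal sketch of what one unified multi-task record from such a pipeline might look like helps make this premise concrete. Everything below is a hypothetical illustration: the field names, paths, and values are assumptions, not the authors' actual format.

```python
# Hypothetical sketch of a single record a unified synthetic video pipeline
# could emit; field names and values are illustrative, not the paper's schema.
from dataclasses import dataclass


@dataclass
class SyntheticVideoRecord:
    video_path: str       # rendered synthetic clip
    num_frames: int
    object_counts: dict   # counting supervision, e.g. {"car": 3}
    vqa_pairs: list       # structured question/answer supervision
    mask_paths: list      # per-frame segmentation masks


record = SyntheticVideoRecord(
    video_path="clips/clip_00042.mp4",
    num_frames=64,
    object_counts={"car": 3, "pedestrian": 2},
    vqa_pairs=[
        {"question": "How many cars appear in the clip?", "answer": "3"},
        {"question": "Does the red car leave the frame?", "answer": "yes"},
    ],
    mask_paths=[f"masks/clip_00042/{i:04d}.png" for i in range(64)],
)

# The VQA-based fine-tuning described above would supervise the model on the
# structured answers rather than on a free-form caption of the clip.
print(record.object_counts, record.vqa_pairs[0])
```

A single record carrying counting, VQA, and mask supervision at once is what "consistent data creation across tasks" would amount to in practice.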
Load-bearing premise
The synthetic videos and their annotations are realistic enough in appearance, motion, and semantics that models can learn from them and apply the knowledge directly to real videos.
What would settle it
A controlled experiment in which models trained only on the synthetic data are tested on standard real-world benchmarks for counting, VQA, or segmentation against equivalent models trained on real annotated data; clearly lower accuracy for the synthetically trained models would refute the claim.
Original abstract
Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, question answering, and segmentation. However, collecting and annotating multimodal video data in the real world is costly, slow, and inherently limited in diversity and coverage. To address this challenge, we propose a unified synthetic data generation pipeline capable of automatically producing unlimited multimodal video data with rich and diverse supervision. Our framework supports multiple task formats within a single pipeline, enabling scalable and consistent data creation across tasks. To further enhance reasoning ability, we introduce a VQA-based fine-tuning strategy that trains models to answer structured questions about visual content rather than relying solely on captions or simple instructions. This formulation encourages deeper visual grounding and reasoning. We evaluate our approach on three challenging tasks: video object counting, video-based visual question answering, and video object segmentation. Experimental results demonstrate that models trained predominantly on synthetic data generalize effectively to real-world datasets, often outperforming traditionally trained counterparts. Our findings highlight the potential of unified synthetic data pipelines as a scalable alternative to expensive real-world annotation for multimodal video understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a unified synthetic data generation pipeline that automatically produces large-scale multimodal video data with rich annotations supporting multiple tasks (object counting, VQA, and segmentation) within a single framework. It introduces a VQA-based fine-tuning strategy to promote deeper visual reasoning in MLLMs and reports that models trained predominantly on this synthetic data generalize well to real-world video benchmarks, often outperforming models trained on conventional real-world datasets.
Significance. If the central claims are substantiated with proper controls, the work could have substantial significance for multimodal video understanding by demonstrating a scalable, low-cost alternative to real-world data collection and annotation, potentially enabling broader experimentation and deployment of MLLMs in video tasks.
major comments (1)
- [Experimental evaluation] The experimental comparisons between predominantly synthetic training and traditional real-world training do not report or control for total training data volume (sample counts, tokens, or epoch budgets). Without ablations that hold data scale fixed while varying source, it is impossible to isolate whether reported gains arise from the unified pipeline, VQA fine-tuning, or simply from differences in data quantity. This directly undermines the headline claim of effective generalization and outperformance.
minor comments (1)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., accuracy delta on a specific benchmark) to allow immediate assessment of effect size.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address the major comment point by point below and will revise the manuscript to incorporate the suggested controls.
Point-by-point responses
- Referee: [Experimental evaluation] The experimental comparisons between predominantly synthetic training and traditional real-world training do not report or control for total training data volume (sample counts, tokens, or epoch budgets). Without ablations that hold data scale fixed while varying source, it is impossible to isolate whether reported gains arise from the unified pipeline, VQA fine-tuning, or simply from differences in data quantity. This directly undermines the headline claim of effective generalization and outperformance.
Authors: We agree that the current experiments do not explicitly control for or report total training data volume, which limits the ability to isolate the contribution of the synthetic pipeline and VQA fine-tuning from potential differences in data scale. In the revised manuscript, we will add new ablations that hold the number of training samples fixed (and report approximate token budgets and epoch counts) across synthetic-only, real-only, and mixed training regimes for each task. These controls will allow direct comparison of data sources while clarifying the source of the reported generalization gains. Revision: yes.
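Neither the comment nor the response specifies how the matched-budget control would be implemented. A minimal sketch under simple assumptions (a fixed sample budget and a synthetic/real mixing ratio; all names are hypothetical) is shown below; matching token and epoch budgets, as the referee also asks, would require additional bookkeeping.

```python
# Hypothetical sketch of the matched-budget ablation the rebuttal promises:
# hold the total number of training samples fixed while varying the data source.
import random


def build_training_mix(synthetic_pool, real_pool, total_samples, synthetic_fraction, seed=0):
    """Draw a fixed-size training set with a given synthetic/real ratio."""
    rng = random.Random(seed)
    n_synth = int(round(total_samples * synthetic_fraction))
    n_real = total_samples - n_synth
    mix = rng.sample(synthetic_pool, n_synth) + rng.sample(real_pool, n_real)
    rng.shuffle(mix)
    return mix


# Illustrative regimes at an identical budget of 100k samples:
# synthetic-only, real-only, and a 50/50 mix (pool sizes are made up).
synthetic_pool = [f"synth_{i}" for i in range(200_000)]
real_pool = [f"real_{i}" for i in range(150_000)]

regimes = {
    "synthetic_only": build_training_mix(synthetic_pool, real_pool, 100_000, 1.0),
    "real_only": build_training_mix(synthetic_pool, real_pool, 100_000, 0.0),
    "mixed_50_50": build_training_mix(synthetic_pool, real_pool, 100_000, 0.5),
}

for name, samples in regimes.items():
    print(name, len(samples))  # every regime sees the same number of samples
```

Holding the sample count constant across regimes is what would let the comparison attribute any remaining gap to the data source rather than to data quantity.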
Circularity Check
No significant circularity; empirical evaluation uses independent real-world benchmarks
full rationale
The paper describes a synthetic data generation pipeline and reports empirical results from training models on that data and then testing on separate real-world video datasets for counting, VQA, and segmentation. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described claims. Generalization performance is measured against external real data, satisfying the criterion for self-contained evaluation against external benchmarks. The skeptic's concern about data volume is an experimental-design issue, not a circular reduction of the derivation to its inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- various hyperparameters in the data generation and model training
axioms (1)
- domain assumption: Synthetic data can approximate real-world video distributions sufficiently for generalization.