ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
7 indexed papers cite this work; polarity classification is still in progress.
Citing papers explorer
- MLVU: Benchmarking Multi-task Long Video Understanding
  MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
- Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos
  EgoIn uses a fine-tuned vision-language model to infer transition steps and a conditioning module plus auxiliary supervision to generate coherent egocentric video sequences of object state changes.
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- LLaVA-Video: Video Instruction Tuning With Synthetic Data
  LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
- LLaVA-OneVision: Easy Visual Task Transfer
  LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
- VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
  VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation