Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
hub Baseline reference
Sharegpt4video: Improving video understanding and generation with better captions
Baseline reference. 71% of citing Pith papers use this work as a benchmark or comparison.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
HumanVBench provides a 16-task benchmark for human-centric video understanding in MLLMs, created through automated annotation and distractor synthesis pipelines, and shows top models lag human performance on emotion perception and cross-modal alignment.
PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
MOSS-Video-Preview introduces a cross-attention architecture and synthesized real-time QA data to enable continuous perception, answer revision, and faster inference in video-language models compared to decoder-only designs.
SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
EgoIn uses a fine-tuned vision-language model to infer transition steps and a conditioning module plus auxiliary supervision to generate coherent egocentric video sequences of object state changes.
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.
LLaVA-Octopus introduces instruction-driven adaptive fusion of multiple visual projectors in a multimodal LLM to improve video understanding performance.
HunyuanVideo presents a 13B-parameter open-source video generative model with integrated data, architecture, training, and inference systems whose professional evaluations show it outperforming prior SOTA models including Runway Gen-3 and Luma 1.6.
CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.
mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
UNIVID generates policy-aware captions for video moderation, reducing violation leakage by 42.7% and overkill rate by 37.0% while replacing over 1,000 policy-specific models with a single backbone.
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
citing papers explorer
-
Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos
EgoIn uses a fine-tuned vision-language model to infer transition steps and a conditioning module plus auxiliary supervision to generate coherent egocentric video sequences of object state changes.