The SYNCR benchmark shows that leading MLLMs reach only 52.5% average accuracy on cross-video reasoning tasks, against an 89.5% human baseline, with major weaknesses in physical and spatial reasoning.
MLVU: Benchmarking Multi-task Long Video Understanding
Baseline reference. 60% of citing Pith papers use this work as a benchmark or comparison.
Representative citing papers
StoryTR is a new benchmark and agentic data pipeline that adds explicit Theory of Mind reasoning chains to train smaller video retrieval models, yielding a 15% relative IoU gain over larger baselines on narrative content.
Mosaic uses cross-modal clusters as the unit of KV-cache organization in VLMs, achieving up to a 1.38x speedup in streaming long-video understanding.
AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
VAP is a training-free active-perception method that improves zero-shot long-form video QA performance and frame efficiency by up to 5.6x in VLMs by selecting keyframes that differ from priors generated by a text-conditioned video model (see the keyframe-selection sketch after this list).
XComp reaches extreme video compression (one token per selected frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9% to 46.2% after tuning on 2.5% of standard data.
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
SmolVLM-256M outperforms a 300x larger model while using under 1 GB of GPU memory, and the 2.2B version matches state-of-the-art VLMs at half the memory cost.
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
StereoNav reaches new benchmark highs on R2R-CE and RxR-CE and improves real-robot reliability by supplying persistent target-location priors and stereo-derived geometry that stay stable under lighting changes and blur.
TTF fuses temporally redundant visual tokens via local similarity search in a plug-and-play way, cutting roughly 67% of visual tokens on Qwen3-VL-8B while retaining 99.5% accuracy with minimal overhead (see the token-fusion sketch after this list).
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.
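Two of the summaries above describe concrete mechanisms that are easier to grasp with a short sketch. The first is a minimal, hypothetical PyTorch sketch of TTF-style fusion of temporally redundant visual tokens by local cosine-similarity search; the function name, window size, and similarity threshold are illustrative assumptions, not details taken from the TTF paper.

```python
import torch
import torch.nn.functional as F

def fuse_redundant_tokens(tokens: torch.Tensor,
                          window: int = 4,
                          sim_thresh: float = 0.9) -> torch.Tensor:
    """Greedily merge visual tokens that near-duplicate a token kept within the
    last `window` positions (cosine similarity above `sim_thresh`).

    tokens: (N, D) visual tokens ordered by frame/time.
    Returns an (M, D) tensor with M <= N; each merged token is folded into the
    kept token it matched via a running mean.
    """
    out, counts = [], []
    normed = F.normalize(tokens, dim=-1)
    for i in range(tokens.shape[0]):
        merged = False
        # Local similarity search: only compare against the most recently kept tokens.
        for j in range(len(out) - 1, max(len(out) - 1 - window, -1), -1):
            if torch.dot(normed[i], F.normalize(out[j], dim=-1)) > sim_thresh:
                # Running mean keeps the fused token representative of its group.
                out[j] = (out[j] * counts[j] + tokens[i]) / (counts[j] + 1)
                counts[j] += 1
                merged = True
                break
        if not merged:
            out.append(tokens[i].clone())
            counts.append(1)
    return torch.stack(out)
```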
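The second sketch illustrates VAP-style keyframe selection under the assumption that features are already available both for the observed frames and for prior frames generated by a text-conditioned video model; the surprise score and frame budget here are illustrative choices, not the paper's actual procedure.

```python
import torch
import torch.nn.functional as F

def select_surprising_frames(frame_feats: torch.Tensor,
                             prior_feats: torch.Tensor,
                             budget: int = 8) -> torch.Tensor:
    """Pick the `budget` frames whose features deviate most from the prior.

    frame_feats: (T, D) features of the observed video frames.
    prior_feats: (T, D) features of frames anticipated by a prior
                 (e.g. frames synthesized from the question/text alone).
    Returns indices of the selected frames, sorted by time.
    """
    frame_n = F.normalize(frame_feats, dim=-1)
    prior_n = F.normalize(prior_feats, dim=-1)
    # Low cosine similarity to the prior means high "surprise".
    surprise = 1.0 - (frame_n * prior_n).sum(dim=-1)
    idx = torch.topk(surprise, k=min(budget, frame_feats.shape[0])).indices
    return torch.sort(idx).values
```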