DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
hub Baseline reference
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Baseline reference. 71% of citing Pith papers use this work as a benchmark or comparison.
abstract
We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a pairwise manner, where each standard instance is paired with an unanswerable variant that has minimal semantic differences, in order for a reliable assessment. Evaluated upon 20 recent multi-modal LLMs, our results reveal that even the best-performing models like GPT-4o and Gemini Pro find it challenging to solve MuirBench, achieving 68.0% and 49.3% in accuracy. Open-source multimodal LLMs trained on single images can hardly generalize to multi-image questions, hovering below 33.3% in accuracy. These results highlight the importance of MuirBench in encouraging the community to develop multimodal LLMs that can look beyond a single image, suggesting potential pathways for future improvements.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
A closed-loop self-evolving training system for spatial reasoning in MLLMs that iteratively generates QA pairs matched to the model's current capabilities via confidence feedback, achieving gains with an order of magnitude less data.
COHERENCE is a new benchmark for measuring MLLMs' ability to recover fine-grained image-text correspondences in interleaved multimodal contexts.
CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.
Kamera stores a low-rank patch with each position-free KV chunk to restore cross-chunk conditioning lost in naive reuse, enabling cheap reordering, sliding windows, and recall across attention mechanisms.
Qwen-RobotWorld is a language-conditioned video world model using Double-Stream MMDiT, an 8.6M-frame embodied corpus, and progressive curriculum training that ranks first on EWMBench and DreamGen Bench.
MOSS-Video-Preview introduces a cross-attention architecture and synthesized real-time QA data to enable continuous perception, answer revision, and faster inference in video-language models compared to decoder-only designs.
SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.
Reformulating 53 visual reasoning tasks in polar coordinates causes frontier MLLMs to drop from 70-83% to 31-39% accuracy while preserving logical equivalence, revealing a Cartesian shortcut in current benchmarks.
MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.
S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.
PulseFocus improves multi-image reasoning in VLMs by interleaving planning and attention-gated focus blocks during chain-of-thought, achieving gains on BLINK and MuirBench.
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
Presents a failure taxonomy and intervention-mapping methodology for industry-scale AVLM development in content moderation, applied to a global video platform system.
BA-Att introduces pre-downsampled block selection with norm-sorting and diagonal covariance correction to approximate sparse attention, yielding up to 6.95x speedup at 50% sparsity across language, multimodal, and video models.
Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.
Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.
Qwen2.5-VL reports a vision-language model family using native dynamic-resolution ViT and absolute time encoding that matches GPT-4o on document and diagram tasks while supporting hour-long videos with second-level localization.
citing papers explorer
-
DataComp-VLM: Improved Open Datasets for Vision-Language Models
DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
-
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
-
Ouroboros-Spatial: Closing the Data-Model Loop for Spatial Reasoning
A closed-loop self-evolving training system for spatial reasoning in MLLMs that iteratively generates QA pairs matched to the model's current capabilities via confidence feedback, achieving gains with an order of magnitude less data.
-
COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts
COHERENCE is a new benchmark for measuring MLLMs' ability to recover fine-grained image-text correspondences in interleaved multimodal contexts.
-
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.
-
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.
-
Kamera: Unified Position-Invariant Multimodal KV Cache for Training-Free Reuse
Kamera stores a low-rank patch with each position-free KV chunk to restore cross-chunk conditioning lost in naive reuse, enabling cheap reordering, sliding windows, and recall across attention mechanisms.
-
Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation
Qwen-RobotWorld is a language-conditioned video world model using Double-Stream MMDiT, an 8.6M-frame embodied corpus, and progressive curriculum training that ranks first on EWMBench and DreamGen Bench.
-
MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention
MOSS-Video-Preview introduces a cross-attention architecture and synthesized real-time QA data to enable continuous perception, answer revision, and faster inference in video-language models compared to decoder-only designs.
-
Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models
SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.
-
The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space
Reformulating 53 visual reasoning tasks in polar coordinates causes frontier MLLMs to drop from 70-83% to 31-39% accuracy while preserving logical equivalence, revealing a Cartesian shortcut in current benchmarks.
-
MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.
-
S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models
S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.
-
Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks
PulseFocus improves multi-image reasoning in VLMs by interleaving planning and attention-gated focus blocks during chain-of-thought, achieving gains on BLINK and MuirBench.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
-
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
-
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
From Failure Taxonomy to Intervention: A Diagnostic Methodology for Industry-Scale AVLM in Video and Live-Streaming Platform Moderation
Presents a failure taxonomy and intervention-mapping methodology for industry-scale AVLM development in content moderation, applied to a global video platform system.
-
Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention
BA-Att introduces pre-downsampled block selection with norm-sorting and diagonal covariance correction to approximate sparse attention, yielding up to 6.95x speedup at 50% sparsity across language, multimodal, and video models.
-
Context Unrolling in Omni Models
Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.
-
Seed1.8 Model Card: Towards Generalized Real-World Agency
Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.
-
Qwen2.5-VL Technical Report
Qwen2.5-VL reports a vision-language model family using native dynamic-resolution ViT and absolute time encoding that matches GPT-4o on document and diagram tasks while supporting hour-long videos with second-level localization.
-
LLaVA-OneVision: Easy Visual Task Transfer
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
-
Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity
Seed2.0 model series reports gains in reasoning, visual understanding, search, and reliability on intricate long-horizon tasks via an internal evaluation system.
- ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs
- LLM Harms: A Taxonomy and Discussion
- MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs