hub Baseline reference

Ku, Qian Liu, and Wenhu Chen

Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, Wenhu Chen · 2024 · arXiv 2405.01483

Baseline reference. 70% of citing Pith papers use this work as a benchmark or comparison.

21 Pith papers citing it

Baseline 70% of classified citations

read on arXiv browse 21 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

dataset 4 background 3 baseline 3

citation-polarity summary

use dataset 4 baseline 3 background 2 unclear 1

representative citing papers

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

cs.CL · 2024-09-04 · accept · novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

cs.CV · 2026-04-24 · unverdicted · novelty 7.0

CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

cs.CV · 2024-07-10 · unverdicted · novelty 7.0

LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

cs.CV · 2024-06-13 · conditional · novelty 7.0

MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

LoMo is a lightweight data curation technique that locally substitutes text with images in prompts to enforce cross-modal invariance, yielding 2.67-2.82 point gains over standard SFT on two VLMs across 13 benchmarks.

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

ROVER introduces a learnable routing plugin for object-centric visual evidence in MLLMs via token triplets and differential attention, reporting gains on MM-GCoT and VideoEspresso when integrated into Qwen2.5-VL-7B.

SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

cs.CL · 2026-04-30 · unverdicted · novelty 6.0

MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.

Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks

cs.CV · 2026-03-04 · unverdicted · novelty 6.0

PulseFocus improves multi-image reasoning in VLMs by interleaving planning and attention-gated focus blocks during chain-of-thought, achieving gains on BLINK and MuirBench.

MiMo-Embodied: X-Embodied Foundation Model Technical Report

cs.RO · 2025-11-20 · unverdicted · novelty 6.0

MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

cs.CV · 2025-08-25 · unverdicted · novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

cs.CV · 2025-04-14 · conditional · novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

Improving Video Generation with Human Feedback

cs.CV · 2025-01-23 · unverdicted · novelty 6.0

A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

cs.CV · 2024-12-06 · unverdicted · novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

CrossView Suite supplies a 1.6M-sample dataset, scene-disjoint benchmark, and explicit-alignment framework to advance MLLMs from single-view perception to cross-view spatial intelligence.

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

cs.LG · 2025-09-16 · unverdicted · novelty 5.0

An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

cs.CV · 2024-08-09 · unverdicted · novelty 5.0

mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.

LLaVA-OneVision: Easy Visual Task Transfer

cs.CV · 2024-08-06 · unverdicted · novelty 5.0

LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

DFLOP: A Data-driven Framework for Multimodal LLM Training Pipeline Optimization

cs.DC · 2026-03-26 · unverdicted · novelty 4.0

DFLOP is a data-driven framework that profiles data-induced computation variance and uses predictive scheduling to balance workloads in multimodal LLM training pipelines, claiming up to 3.6x faster training than existing frameworks.

Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning

cs.CV · 2025-07-01 · unverdicted · novelty 4.0

A pipeline of chain-of-thought data synthesis, LoRA-based supervised fine-tuning, rejection sampling, and rule-based reinforcement learning raises multi-image grounding accuracy by 9.04% on MIG-Bench and 4.41% on average across seven other benchmarks.

citing papers explorer

Showing 21 of 21 citing papers.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models cs.CV · 2024-09-25 · accept · none · ref 44
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark cs.CL · 2024-09-04 · accept · none · ref 16
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding cs.CV · 2026-04-24 · unverdicted · none · ref 22
CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models cs.CV · 2024-07-10 · unverdicted · none · ref 19
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding cs.CV · 2024-06-13 · conditional · none · ref 25
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
LoMo: Local Modality Substitution for Deeper Vision-Language Fusion cs.CV · 2026-05-28 · unverdicted · none · ref 14
LoMo is a lightweight data curation technique that locally substitutes text with images in prompts to enforce cross-modal invariance, yielding 2.67-2.82 point gains over standard SFT on two VLMs across 13 benchmarks.
ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning cs.CV · 2026-05-27 · unverdicted · none · ref 26
ROVER introduces a learnable routing plugin for object-centric visual evidence in MLLMs via token triplets and differential attention, reporting gains on MM-GCoT and VideoEspresso when integrated into Qwen2.5-VL-7B.
SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images cs.CV · 2026-05-12 · unverdicted · none · ref 52
SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction cs.CL · 2026-04-30 · unverdicted · none · ref 45
MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.
Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks cs.CV · 2026-03-04 · unverdicted · none · ref 6
PulseFocus improves multi-image reasoning in VLMs by interleaving planning and attention-gated focus blocks during chain-of-thought, achieving gains on BLINK and MuirBench.
MiMo-Embodied: X-Embodied Foundation Model Technical Report cs.RO · 2025-11-20 · unverdicted · none · ref 28
MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency cs.CV · 2025-08-25 · unverdicted · none · ref 51
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models cs.CV · 2025-04-14 · conditional · none · ref 51
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
Improving Video Generation with Human Feedback cs.CV · 2025-01-23 · unverdicted · none · ref 28
A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling cs.CV · 2024-12-06 · unverdicted · none · ref 100
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark cs.CV · 2026-05-18 · unverdicted · none · ref 17
CrossView Suite supplies a 1.6M-sample dataset, scene-disjoint benchmark, and explicit-alignment framework to advance MLLMs from single-view perception to cross-view spatial intelligence.
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe cs.LG · 2025-09-16 · unverdicted · none · ref 62
An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models cs.CV · 2024-08-09 · unverdicted · none · ref 215
mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
LLaVA-OneVision: Easy Visual Task Transfer cs.CV · 2024-08-06 · unverdicted · none · ref 47
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
DFLOP: A Data-driven Framework for Multimodal LLM Training Pipeline Optimization cs.DC · 2026-03-26 · unverdicted · none · ref 32
DFLOP is a data-driven framework that profiles data-induced computation variance and uses predictive scheduling to balance workloads in multimodal LLM training pipelines, claiming up to 3.6x faster training than existing frameworks.
Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning cs.CV · 2025-07-01 · unverdicted · none · ref 13
A pipeline of chain-of-thought data synthesis, LoRA-based supervised fine-tuning, rejection sampling, and rule-based reinforcement learning raises multi-image grounding accuracy by 9.04% on MIG-Bench and 4.41% on average across seven other benchmarks.

Ku, Qian Liu, and Wenhu Chen

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer