Qwen2.5-VL Technical Report

Haiyang Xu; Hang Zhang; Humen Zhong; Jiabo Ye; Jialin Wang; Jianqiang Wan; Jun Tang; Junyang Lin (additional authors not shown); Kai Dang; Keqin Chen

arxiv: 2502.13923 · v1 · submitted 2025-02-19 · 💻 cs.CV · cs.CL

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin (additional authors not shown)

Pith reviewed 2026-05-23 02:22 UTC · model grok-4.3

keywords Qwen2.5-VL, vision-language model, dynamic resolution, object localization, document parsing, long video understanding, multimodal agent

The pith

Qwen2.5-VL-72B reaches parity with GPT-4o on document and diagram tasks via native dynamic resolution and absolute time encoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Qwen2.5-VL as the next model in the Qwen vision-language series. It claims major gains in visual recognition, precise object localization with bounding boxes or points, structured extraction from invoices and tables, and comprehension of hour-long videos with second-level event timing. These gains come from training a native dynamic-resolution Vision Transformer from scratch and adding absolute time encoding, which lets the model handle variable image sizes and extended video durations without normalization steps. The 72B variant is presented as matching leading closed models like GPT-4o and Claude 3.5 Sonnet, especially on document and diagram benchmarks, while retaining the language strengths of the Qwen2.5 base. Smaller variants are offered for edge use.

Core claim

Qwen2.5-VL uses a native dynamic-resolution Vision Transformer trained from scratch together with Window Attention and absolute time encoding; this combination supports accurate bounding-box and point localization, robust parsing of forms and charts, and temporal localization in videos up to hours long, allowing the 72B model to match GPT-4o and Claude 3.5 Sonnet performance on those tasks.

What carries the argument

Native dynamic-resolution Vision Transformer (ViT) with Window Attention and absolute time encoding, which processes inputs at their original scales and durations without normalization.
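To make "native dynamic resolution" concrete, the sketch below counts visual tokens for an image kept near its original size and for the same pipeline applied to a fixed square. The 14-pixel patch, the 2x2 patch merge, and the 448-pixel fixed side are illustrative assumptions; none of them is stated in the abstract summarized here.

```python
import math


def visual_token_count(height, width, patch=14, merge=2):
    """Tokens for an image processed at roughly its native resolution.

    Assumptions (hypothetical, not from the abstract): square patches of
    `patch` pixels, with `merge` x `merge` neighbouring patches fused
    into a single token.
    """
    unit = patch * merge
    # Round each side up to a whole number of merged-patch units.
    rows = math.ceil(height / unit)
    cols = math.ceil(width / unit)
    return rows * cols


def fixed_token_count(side=448, patch=14, merge=2):
    """Tokens when every image is first resized to a side x side square."""
    return visual_token_count(side, side, patch, merge)


if __name__ == "__main__":
    # A tall, narrow document strip versus a fixed-size square.
    print(visual_token_count(1120, 280))  # token budget follows the input shape
    print(fixed_token_count())            # token budget is constant
```

Under these assumptions a 1120x280 strip maps to 400 tokens while any image forced into a 448x448 square maps to 256, so the token budget tracks the input's actual shape instead of a preset one.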

If this is right

The model can function as an interactive visual agent for operating computers and mobile devices.
It processes images of arbitrary sizes and videos lasting multiple hours with second-level event localization.
Document and diagram understanding improves without separate normalization pipelines.
Smaller model sizes extend the same capabilities to edge devices.
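The second-level localization point above depends on tying temporal position to elapsed time and not to frame count. The sketch below contrasts the two; the function names and the `ids_per_second` granularity are hypothetical, and this is an illustration of the general idea, not the paper's actual encoding.

```python
def frame_index_ids(num_frames):
    """Temporal ids that only count frames, ignoring the sampling rate."""
    return list(range(num_frames))


def absolute_time_ids(num_frames, fps, ids_per_second=2):
    """Temporal ids proportional to each frame's timestamp in seconds.

    `ids_per_second` is an assumed granularity chosen for illustration.
    """
    return [round(i / fps * ids_per_second) for i in range(num_frames)]


if __name__ == "__main__":
    # Four frames sampled densely (2 fps) and sparsely (0.5 fps).
    print(frame_index_ids(4))
    print(absolute_time_ids(4, fps=2.0))
    print(absolute_time_ids(4, fps=0.5))
```

With frame-index ids, both clips look identical to the model. With time-aligned ids, the sparse clip spans a wider id range, so the gap between ids carries the real elapsed time.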

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Removing normalization steps may simplify deployment pipelines for variable-resolution inputs in other vision systems.
The same dynamic-resolution and time-encoding pattern could be tested on non-video sequences such as audio or sensor streams.
If the approach scales, future models might drop fixed-resolution assumptions entirely.

Load-bearing premise

The benchmarks used to claim parity with GPT-4o are free of data contamination and fairly measure document parsing, localization, and long-video performance.

What would settle it

A fresh benchmark of held-out invoices, diagrams, and hour-long videos where the 72B model scores more than 5 points below GPT-4o on localization precision or parsing F1.
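The test stated above reduces to a simple per-metric comparison. A minimal sketch follows, assuming scores are on a 0-100 scale; the metric names and every number in the example are made-up placeholders, not results from the paper.

```python
def parity_refuted(model_scores, reference_scores, margin=5.0):
    """Return the metrics on which the model trails the reference by more than `margin` points."""
    return sorted(
        name
        for name, ref in reference_scores.items()
        if name in model_scores and ref - model_scores[name] > margin
    )


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    model = {"localization_precision": 70.0, "parsing_f1": 88.0}
    reference = {"localization_precision": 76.5, "parsing_f1": 90.0}
    print(parity_refuted(model, reference))
```

An empty result would leave the parity claim standing on that benchmark; any listed metric would count against it.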

Figures

Figures reproduced from arXiv: 2502.13923 by Haiyang Xu, Hang Zhang, Humen Zhong, Jiabo Ye, Jialin Wang, Jianqiang Wan, Jun Tang, Junyang Lin (additional authors not shown), Kai Dang, Keqin Chen, Mingkun Yang, Pengfei Wang, Peng Wang, Shijie Wang, Shuai Bai, Sibo Song, Tianbao Xie, Wei Ding, Wenbin Ge, Xi Zhang, Xuejing Liu, Yiheng Xu, Yuanzhi Zhu, Zesen Cheng, Zhaohai Li, Zheren Fu, Zhibo Yang.

Original abstract

We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Qwen2.5-VL, the latest flagship in the Qwen vision-language series, which incorporates a native dynamic-resolution Vision Transformer trained from scratch along with Window Attention and absolute time encoding. These enable native handling of variable image resolutions and long videos (up to hours) with second-level localization. The paper claims major advances in visual recognition, precise object localization via bounding boxes or points, structured document parsing (invoices, forms, tables, charts, diagrams), and long-video comprehension, positioning the 72B model as matching GPT-4o and Claude 3.5 Sonnet especially on document/diagram tasks while preserving the linguistic capabilities of the underlying Qwen2.5 LLM. The model is released in three sizes for different deployment scenarios and is presented as an interactive visual agent.

Significance. If the performance claims hold after proper verification, the work would be significant for demonstrating that open multimodal models can reach parity with leading proprietary systems on practical visual tasks such as document understanding and agentic interaction. The architectural choices around dynamic-resolution ViT and absolute temporal encoding, if shown to be effective without heavy normalization, could provide reusable techniques for future vision-language models handling real-world scale and duration variations.

major comments (2)

[Abstract] The central claim that 'the flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding' is unsupported by any quantitative benchmark scores, tables, or evaluation protocols. This absence is load-bearing because the manuscript provides no evidence (e.g., DocVQA, ChartQA, or video localization results) against which the parity assertion can be assessed.
[Abstract] No details are given on training data composition, decontamination procedures for the cited document/diagram/long-video benchmarks, or ablation studies isolating the contribution of the dynamic-resolution ViT and absolute time encoding. Without these, it is impossible to rule out data contamination as an alternative explanation for the headline performance, directly undermining the reliability of the SOTA-matching claim.
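For readers unfamiliar with what a decontamination check looks like, here is a generic text-only sketch based on word n-gram overlap. It is a common baseline technique, not the procedure used by the authors (which this page does not describe), and it does not cover image-level overlap. The function names are hypothetical.

```python
def ngrams(text, n=8):
    """Set of lowercase word n-grams in `text`; empty if the text is shorter than n words."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the training text."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(benchmark_items) if benchmark_items else 0.0


if __name__ == "__main__":
    # Toy strings for illustration; n is lowered so the short examples can overlap.
    train = ["the total amount due on this invoice is 42 dollars"]
    bench = ["what is the total amount due", "how many bars are in the chart"]
    print(contamination_rate(bench, train, n=3))
```

A report of this rate per benchmark, with the chosen `n` and tokenization, is the kind of detail the comment above asks for.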

minor comments (1)

[Abstract] The phrasing 'native dynamic-resolution Vision Transformer (ViT) trained from scratch' and 'Window Attention' is introduced at a high level without even brief clarification of how these differ from standard ViT or relative positional encodings, which would aid readability.
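As a rough aid to the distinction requested here, the sketch below compares how many query-key pairs are scored under full self-attention and under attention restricted to non-overlapping windows. It treats tokens as a flat 1D sequence and uses a hypothetical window of 64 tokens; real window attention partitions a 2D patch grid, and the paper's window size is not given on this page.

```python
def full_attention_pairs(num_tokens):
    """Query-key pairs when every token attends to every token."""
    return num_tokens * num_tokens


def windowed_attention_pairs(num_tokens, window_tokens):
    """Query-key pairs when tokens attend only within non-overlapping windows."""
    full_windows, remainder = divmod(num_tokens, window_tokens)
    # Each full window costs window_tokens^2; a trailing partial window costs remainder^2.
    return full_windows * window_tokens ** 2 + remainder ** 2


if __name__ == "__main__":
    n = 4096  # e.g. a large image at native resolution (illustrative)
    print(full_attention_pairs(n))
    print(windowed_attention_pairs(n, window_tokens=64))
```

Full attention grows quadratically with token count, while the windowed count grows linearly for a fixed window, which is the sense in which windowing can reduce overhead at native resolution.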

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, with planned revisions where appropriate.

Referee: [Abstract] The central claim that 'the flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding' is unsupported by any quantitative benchmark scores, tables, or evaluation protocols. This absence is load-bearing because the manuscript provides no evidence (e.g., DocVQA, ChartQA, or video localization results) against which the parity assertion can be assessed.

Authors: The full manuscript contains quantitative benchmark tables and results in the Experiments section, with direct comparisons on DocVQA, ChartQA, diagram understanding, and video localization tasks against GPT-4o and Claude 3.5 Sonnet. The abstract summarizes these findings at a high level. We will revise the abstract to explicitly reference the relevant tables, sections, and key metrics to make the supporting evidence clear.

revision: yes

Referee: [Abstract] No details are given on training data composition, decontamination procedures for the cited document/diagram/long-video benchmarks, or ablation studies isolating the contribution of the dynamic-resolution ViT and absolute time encoding. Without these, it is impossible to rule out data contamination as an alternative explanation for the headline performance, directly undermining the reliability of the SOTA-matching claim.

Authors: We will add expanded ablation studies in the manuscript that isolate the contributions of the native dynamic-resolution ViT (with Window Attention) and absolute time encoding. However, full details on training data composition and decontamination procedures cannot be provided.

revision: partial

standing simulated objections not resolved

Detailed training data composition and decontamination procedures for the benchmarks

Circularity Check

0 steps flagged

No derivation chain present; empirical benchmark reporting only

The paper is a technical report on model architecture, training procedure, and benchmark results. It contains no equations, first-principles derivations, fitted parameters presented as predictions, or load-bearing self-citations that reduce claims to inputs by construction. All performance claims rest on reported empirical numbers rather than any self-referential mathematical structure. This matches the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical model release paper. No mathematical derivations, fitted constants in equations, or new theoretical entities are introduced.

pith-pipeline@v0.9.0 · 5960 in / 1079 out tokens · 25942 ms · 2026-05-23T02:22:15.285961+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Yes-Man Syndrome: Benchmarking Abstention in Embodied Robotic Agents
cs.RO 2026-05 conditional novelty 8.0

The paper presents RoboAbstention, a new benchmark showing frontier VLMs and embodied planners abstain on only 16.5-39% of 6,069 instructions grounded in robotics images, with prompting interventions raising rates to ...
Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination
cs.CV 2026-05 unverdicted novelty 8.0

VLMs fail to detect semantically different image swaps up to 60% of the time despite self-reflective statements, with thinking models more vulnerable and attention analysis showing self-reflection does not increase vi...
MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays
cs.CV 2026-05 conditional novelty 8.0

MI-CXR is a new benchmark that shows state-of-the-art vision-language models achieve only 29.3% accuracy on longitudinal reasoning tasks across multi-visit chest X-ray sequences.
CalibAnyView: Beyond Single-View Camera Calibration in the Wild
cs.CV 2026-05 conditional novelty 8.0

A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.
Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts
cs.CV 2026-05 unverdicted novelty 8.0

An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.
Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation
cs.CR 2026-05 unverdicted novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but inco...
DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents
cs.CV 2026-05 accept novelty 8.0

DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.
Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models
cs.CV 2026-05 unverdicted novelty 8.0

Flame3D enables zero-shot compositional 3D scene reasoning by representing scenes as editable visual-textual memories exposed to agentic MLLMs through composable and synthesizable spatial tools.
RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation
cs.AI 2026-05 unverdicted novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.
MedHorizon: Towards Long-context Medical Video Understanding in the Wild
cs.CV 2026-05 unverdicted novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
cs.CV 2026-05 unverdicted novelty 8.0

VEBENCH is the first benchmark evaluating LMMs on video editing technique recognition and operation simulation using 3.9K videos and 3,080 QA pairs, revealing a large performance gap to humans.
EgoSound: Benchmarking Sound Understanding in Egocentric Videos
cs.CV 2026-02 unverdicted novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.
VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing
cs.CV 2026-02 unverdicted novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.
Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving
cs.LG 2025-12 conditional novelty 8.0

Cornfigurator is the first automated deployment planner for generic any-to-any multimodal models that explores the full range of colocation-to-disaggregation strategies and delivers 1.12x to 6.32x higher goodput than ...
ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos
cs.CV 2025-12 accept novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
cs.CV 2025-11 unverdicted novelty 8.0

MVI-Bench supplies the first taxonomy and dataset focused on misleading visual inputs to measure LVLM robustness, with tests on 18 models revealing clear weaknesses.
FLEX: A Largescale Multimodal, Multiview Dataset for Learning Structured Representations for Fitness Action Quality Assessment
cs.CV 2025-06 conditional novelty 8.0

FLEX is the first large-scale multimodal multiview dataset for fitness AQA, featuring RGB, 3D pose, sEMG and physiological data plus a Fitness Knowledge Graph for structured annotations and a VideoQA benchmark.
Beyond Binary Edits Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment
cs.AI 2026-05 unverdicted novelty 7.0

Introduces Latent Adversarial Robustification and Rank-Constrained Subspace Learning to enable robust generalization in multimodal knowledge editing through adversarial subspace alignment.
VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset
cs.CV 2026-05 unverdicted novelty 7.0

VINS-120K supplies the first large-scale set of instruction-image-edited-image triplets at ultra-high resolution together with an adaptation strategy that improves detail synthesis.
DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection
cs.CV 2026-05 unverdicted novelty 7.0

A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.
CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
cs.CV 2026-05 unverdicted novelty 7.0

Introduces CaST-Bench, a dataset of 2,066 causal questions on 1,015 videos with annotated causal chains and metrics to evaluate VLMs on spatio-temporal causal reasoning.
CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration
cs.CV 2026-05 unverdicted novelty 7.0

CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.
Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs
cs.CV 2026-05 conditional novelty 7.0

Video-LLMs exhibit directional motion blindness from a direction binding gap; DeltaDirect projector objective lifts synthetic accuracy to 85.4% and real accuracy by 21.9 points while preserving other video capabilities.
ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs
cs.AI 2026-05 unverdicted novelty 7.0

ST-SimDiff is a training-free method using a spatio-temporal graph and dual similarity-difference selection to compress video tokens for MLLMs while retaining static and dynamic content.
SDGBiasBench: Benchmarking and Mitigating Vision-Language Models' Biases in Sustainable Development Goals
cs.CV 2026-05 unverdicted novelty 7.0

SDGBiasBench reveals intrinsic SDG biases in VLMs driven by priors rather than evidence, and CADE mitigates them with up to 25% accuracy gains and 12-point MAE reductions.
Seizure-Semiology-Suite (S3): A Clinically Multimodal Dataset, Benchmark, and Models for Seizure Semiology Understanding
cs.CV 2026-05 unverdicted novelty 7.0

Seizure-Semiology-Suite provides a new clinically annotated video dataset and hierarchical benchmark that exposes weaknesses in current MLLMs for seizure semiology and demonstrates gains from fine-tuning and a neuro-s...
Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?
cs.CV 2026-05 unverdicted novelty 7.0

The Token Replacement Test shows VLMs keep most accuracy gains even after corrupting or replacing continuous thought token content, indicating the tokens are not used as information bottlenecks.
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
cs.CV 2026-05 unverdicted novelty 7.0

Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.
iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance
cs.CV 2026-05 unverdicted novelty 7.0

iTryOn is a video diffusion Transformer that injects spatial 3D hand guidance and semantic action captions to enable interactive garment replacement in videos.
ArPoMeme: An Annotated Arabic Multimodal Dataset for Political Ideology and Polarization
cs.CL 2026-05 unverdicted novelty 7.0

ArPoMeme is a new annotated multimodal dataset of Arabic political memes collected via Facebook scraping and vision-language model text extraction, then manually labeled for ideology and polarization aspects.
What Semantics Survive the Connector? Diagnosing VLM-to-DiT Alignment in Video Editing
cs.CV 2026-05 unverdicted novelty 7.0

VLM-to-DiT alignment in video editing models acts as a semantic bottleneck that degrades fine-grained structural semantics, demonstrated via a new diagnostic dataset and protocol on relation-based edits.
Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction
cs.CV 2026-05 unverdicted novelty 7.0

Draw2Think recasts geometric reasoning as agentic interaction with a constraint engine, achieving 95.9% predicate-level construction fidelity and up to 16.4% accuracy gains on solid geometry tasks.
ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 7.0

ConceptSeg-R1 uses Meta-GRPO meta-RL to learn transferable rules from visual demonstrations and apply them via concept translation for generalized concept segmentation across CI, CD, and CR levels.
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 7.0

ParaVT is a parallel video tool-calling RL framework that resolves the Tool Prior Paradox via PARA-GRPO, delivering +7.9% average gains on six long-video benchmarks and raising format compliance from 0.13 to 0.64.
CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents
cs.CV 2026-05 accept novelty 7.0

WildRoadBench provides a professionally annotated UAV corpus and dual-track protocol showing frontier VLMs and LLM agents achieve limited performance on wild aerial road-damage grounding under unified metrics.
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
cs.CL 2026-05 conditional novelty 7.0

AutoTool uses reinforcement learning with dual-mode rewards to train multimodal LLMs to adaptively choose between tool-assisted and text-centric reasoning, yielding accuracy and efficiency gains on V* and POPE benchmarks.
FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
cs.CV 2026-05 unverdicted novelty 7.0

FineBench is a new dense VQA benchmark for fine-grained human activity in long videos that exposes weaknesses in open VLMs and demonstrates gains from the proposed FineAgent modular framework.
MCNav: Memory-Aware Dynamic Cognitive Map for Zero-shot Goal-oriented Navigation
cs.RO 2026-05 unverdicted novelty 7.0

MCNav builds a dynamic cognitive map with goal re-validation and missed-goal re-exploration to reach state-of-the-art results on instance-level zero-shot navigation in HM3D environments.
Vision Harnessing Agent for Open Ad-hoc Segmentation
cs.CV 2026-05 unverdicted novelty 7.0

VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
LMM-Track4D: Eliciting 4D Dynamic Reasoning in LMMs via Trajectory-Grounded Dialogue
cs.CV 2026-05 unverdicted novelty 7.0

LMM-Track4D formulates a trajectory-grounded dialogue task, releases Track4D-Bench with 526 samples, and proposes RTGE encoding, TRK state token, and OSK-RA decoder to elicit better 4D spatiotemporal reasoning in LMMs.
EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos
cs.CV 2026-05 unverdicted novelty 7.0

EgoExoMem is the first benchmark for cross-view memory reasoning on synchronized egocentric-exocentric videos, where E2-Select raises MLLM accuracy from 55.3% to 58.2% over baselines.
Dexora: Open-source VLA for High-DoF Bimanual Dexterity
cs.RO 2026-05 unverdicted novelty 7.0

Dexora is the first open-source VLA system for dual-arm dual-hand high-DoF manipulation, trained on 100K simulated and 10K real teleoperated trajectories with a discriminator-weighted diffusion policy, achieving 66.7%...
Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models
cs.CV 2026-05 conditional novelty 7.0

SP-CoR is a multimodal LLM framework using dynamics-aware sampling, spectral-physics view fusion, and prompt distillation that outperforms baselines on the new CoopSR benchmark and EgoTeam dataset for multi-robot coop...
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
cs.CL 2026-05 unverdicted novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic onlin...
ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation
cs.CV 2026-05 unverdicted novelty 7.0

ReAlign distills LLM-generated reasoning texts into a lightweight AIGI forgery detector via contrastive image-text alignment to improve generalization on complex forgeries.
GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions
cs.CV 2026-05 unverdicted novelty 7.0

GRASP is a large-scale dataset and benchmark for social reasoning grounded in gaze and gesture events in multi-person videos, with Social Grounding Reward (SGR) proposed to improve model performance on GRASP-Bench.
Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.
GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding
cs.CV 2026-05 unverdicted novelty 7.0

GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolutio...
From Table to Cell: Attention for Better Reasoning with TABALIGN
cs.AI 2026-05 unverdicted novelty 7.0

TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding exec...
Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
cs.LG 2026-05 unverdicted novelty 7.0

BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
cs.CV 2026-05 unverdicted novelty 7.0

R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
cs.CV 2026-05 unverdicted novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
RotVLA: Rotational Latent Action for Vision-Language-Action Model
cs.RO 2026-05 unverdicted novelty 7.0

RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
Asymmetric Flow Models
cs.CV 2026-05 unverdicted novelty 7.0

Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...
Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
cs.CV 2026-05 unverdicted novelty 7.0

Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
WinDeskGround: A Benchmark for Robust GUI Grounding in Complex Multi-Window Desktop Environments
cs.CV 2026-05 unverdicted novelty 7.0

WinDeskGround is a parametrically generated benchmark of 1,356 instruction-target pairs that reveals accuracy declines in state-of-the-art MLLMs under partial occlusion in multi-window GUI settings.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 895 Pith papers · 29 internal anchors

[1]

Pixtral 12B

Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Monicault, Saurabh Garg, Theophile Gervet, et al. Pixtral 12b. arXiv preprint arXiv:2410.07073,

[2]

ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models

Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v-synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684, 2024a. Jiacheng Chen, Tianhao Liang, Sherman Siu, Zhengqing Wang, Kai Wang, Yubo Wang, Yuansheng Ni, Wang ...

[3]

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238,

[4]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024d. Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and ...

[5]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168,

[6]

URL https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/. DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guantin...

[7]

DeepSeek-V3 Technical Report

doi: 10.48550/ARXIV.2412.19437. URL https://doi.org/10.48550/arXiv.2412.19437. Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146,

[8]

Mmbench-video: A long-form multi-shot benchmark for holistic video understanding

Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. arXiv preprint arXiv:2406.14515,

[9]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv:2306.13394,

[10]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv:2405.21075, 2024a. Ling Fu, Biao Yang, Zhebin Kuang, Jiajun Song, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, ...

[11]

Are we done with mmlu?

Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with mmlu? CoRR, abs/2406.04127,

[12]

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models. arXiv:2310.14566,

[13]

Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale

Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xiang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale. arXiv preprint arXiv:2412.05237,

[14]

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826,

[15]

Otterhd: A high-resolution multi-modality model

Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. Otterhd: A high-resolution multi-modality model. arXiv:2311.04219, 2023a. Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 202...

[16]

Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv:2310.03744, 2023a. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv:2304.08485, 2023b. Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chun yue Li, Jianwei Yang, Hang Su, Jun-Juan Zh...

[17]

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv:2203.10244,

[18]

Minesh Mathew, Viraj Bagal, Rubèn Pérez Tito, Dimosthenis Karatzas, Ernest Valveny, and C.V. Jawahar. Infographicvqa. 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 2582–2591, 2021a. Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In WACV, 2021b. MiniMax, Aonian Li, Bangw...

[19]

URL https://arxiv.org/abs/2501.08313. Openai. Chatml documents,

[20]

Teaching clip to count to ten

URL https://arxiv.org/abs/2412.07626. Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching clip to count to ten. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3170–3180,

[21]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv:2306.14824,

[22]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on...

[23]

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

URL http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html. Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents. ar...

[24]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. CoRR, abs/2311.12022,

[25]

Grounding dino 1.5: Advance the "edge" of open-set object detection

Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, et al. Grounding dino 1.5: Advance the "edge" of open-set object detection. arXiv preprint arXiv:2405.10300,

[26]

Mtvqa: Benchmarking multilingual text-centric visual question answering

Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shu Wei, Chunhui Lin, Wanqing Li, Mohamad Fitri Faiz Bin Mahmood, Hao Feng, Zhen Zhao, Yanjie Wang, Yuliang Liu, Hao Liu, Xiang Bai, and Can Huang. Mtvqa: Benchmarking multilingual text-centric visual question answering. arXiv:2405.11985,

[27]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

[28]

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. arXiv preprint arXiv:2406.16860,

[29]

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

Fei Wang, Xingyu Fu, James Y Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, et al. Muirbench: A comprehensive benchmark for robust multi-image understanding. arXiv preprint arXiv:2406.09411, 2024a. Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-...

[30]

Emu3: Next-Token Prediction is All You Need

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024i. Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li...

[31]

Charxiv: Charting gaps in realistic chart understanding in multimodal llms

Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. arXiv preprint arXiv:2406.18521, 2024k. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, ...

[32]

URL https://arxiv.org/abs/2201.11903. Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. LiveBench: A challenging, contamination-free LLM benchmark. CoRR, abs/2406.19314,

[33]

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding, 2024a. URL https://arxiv.org/abs/2407.15754. Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-lang...

[34]

Florence-2: Advancing a unified representation for a variety of vision tasks

Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks (2023). URL https://arxiv.org/abs/2311.06242,

[35]

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous gui interaction. arXiv preprint arXiv:2412.04454,

[36]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2.5 technical report. arXiv:2412.15115, 2024a. Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Shuai Bai, LianWen Jin, and Junyang Lin. Cc-ocr: A comprehensive and challe...

[37]

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. arXiv:2311.04257,

[38]

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv:2311.16502,

[39]

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Ming Yin, Botao Yu, Ge Zhang, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2409.02813,

[40]

Ferret-v2: An improved baseline for referring and grounding with large language models

Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, and Yinfei Yang. Ferret-v2: An improved baseline for referring and grounding with large language models. arXiv:2404.07973, 2024a. Pan Zhang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang, Rui Qian, Xilin Wei, Lin Chen, Yifei Li,...

[41]

Mmvu: Measuring expert-level multi-discipline video understanding, 2025

URL https://arxiv.org/abs/2501.12380. Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. CoRR, abs/2311.07911,

[42]

MLVU: Benchmarking Multi-task Long Video Understanding

Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. Mlvu: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264,

