Qwen2.5-VL Technical Report
Pith reviewed 2026-05-23 02:22 UTC · model grok-4.3
The pith
Qwen2.5-VL-72B reaches parity with GPT-4o on document and diagram tasks via native dynamic resolution and absolute time encoding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Qwen2.5-VL uses a native dynamic-resolution Vision Transformer trained from scratch together with Window Attention and absolute time encoding; this combination supports accurate bounding-box and point localization, robust parsing of forms and charts, and temporal localization in videos up to hours long, allowing the 72B model to match GPT-4o and Claude 3.5 Sonnet performance on those tasks.
What carries the argument
Native dynamic-resolution Vision Transformer (ViT) with Window Attention and absolute time encoding, which processes inputs at their original scales and durations without normalization.
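The three ingredients named above can be illustrated with a rough sketch. It is not the paper's implementation: the patch size of 14 pixels, the window of 8 patches, and the `ids_per_second` rate are assumptions chosen for illustration, and the function names are hypothetical. The sketch shows how a native-resolution image maps to a variable patch grid, how restricting attention to windows shrinks the number of query-key pairs, and how temporal position IDs can follow timestamps rather than frame indices.

```python
import math

PATCH = 14  # assumed patch edge in pixels (not stated in the text above)


def patch_grid(height, width, patch=PATCH):
    """Rows and columns of patches when the image keeps its own size.

    Each side is rounded up to a whole number of patches instead of
    being resized to a fixed square.
    """
    return math.ceil(height / patch), math.ceil(width / patch)


def attention_pairs(rows, cols, window=None):
    """Count query-key pairs for full attention (window=None) or for
    attention restricted to non-overlapping window x window patch tiles."""
    if window is None:
        n = rows * cols
        return n * n
    pairs = 0
    for r0 in range(0, rows, window):
        for c0 in range(0, cols, window):
            h = min(window, rows - r0)  # edge tiles may be smaller
            w = min(window, cols - c0)
            pairs += (h * w) ** 2
    return pairs


def absolute_time_ids(num_frames, fps, ids_per_second=2):
    """Temporal position IDs derived from timestamps, not frame indices.

    ids_per_second is a hypothetical granularity used only in this sketch.
    """
    return [round(i / fps * ids_per_second) for i in range(num_frames)]


rows, cols = patch_grid(224, 448)
full = attention_pairs(rows, cols)
windowed = attention_pairs(rows, cols, window=8)
```

Under these assumptions, two clips that cover the same two seconds at 1 fps and at 2 fps both end at temporal ID 4, which is the property that lets event timestamps be read off directly regardless of the sampling rate.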
If this is right
- The model can function as an interactive visual agent for operating computers and mobile devices.
- It processes images of arbitrary sizes and videos lasting multiple hours with second-level event localization.
- Document and diagram understanding improves without separate normalization pipelines.
- Smaller model sizes extend the same capabilities to edge devices.
Where Pith is reading between the lines
- Removing normalization steps may simplify deployment pipelines for variable-resolution inputs in other vision systems.
- The same dynamic-resolution and time-encoding pattern could be tested on non-video sequences such as audio or sensor streams.
- If the approach scales, future models might drop fixed-resolution assumptions entirely.
Load-bearing premise
The benchmarks used to claim parity with GPT-4o are free of data contamination and fairly measure document parsing, localization, and long-video performance.
What would settle it
A fresh benchmark of held-out invoices, diagrams, and hour-long videos where the 72B model scores more than 5 points below GPT-4o on localization precision or parsing F1.
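The test above can be written down as a small decision rule. The following is a minimal sketch, assuming scores are reported in points on a 0 to 100 scale and that parsing quality is summarized by a single F1 value; the function names and the example counts are hypothetical.

```python
def f1_points(true_pos, false_pos, false_neg):
    """F1 on a 0-100 scale from raw match counts."""
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)


def parity_refuted(candidate_points, reference_points, margin=5.0):
    """True when the candidate trails the reference by MORE than `margin` points."""
    return reference_points - candidate_points > margin


# Hypothetical counts on a held-out parsing set, for illustration only.
candidate = f1_points(true_pos=80, false_pos=20, false_neg=20)
reference = f1_points(true_pos=90, false_pos=10, false_neg=10)
verdict = parity_refuted(candidate, reference)
```

A gap of exactly 5 points does not trigger the rule, matching the "more than 5 points" wording.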
Original abstract
We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.
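Claims about bounding-box localization and second-level event localization are usually scored with intersection-over-union. The sketch below is a generic illustration of that metric, not the paper's evaluation code; the `(x1, y1, x2, y2)` box layout and the `(start, end)` segment layout in seconds are assumptions.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def temporal_iou(a, b):
    """IoU of two time segments given as (start_seconds, end_seconds)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0
```

A predicted event of 10 s to 20 s against a ground-truth event of 15 s to 25 s overlaps for 5 s out of a 15 s union, so its temporal IoU is one third.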
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Qwen2.5-VL, the latest flagship in the Qwen vision-language series, which incorporates a native dynamic-resolution Vision Transformer trained from scratch along with Window Attention and absolute time encoding. These enable native handling of variable image resolutions and long videos (up to hours) with second-level localization. The paper claims major advances in visual recognition, precise object localization via bounding boxes or points, structured document parsing (invoices, forms, tables, charts, diagrams), and long-video comprehension, positioning the 72B model as matching GPT-4o and Claude 3.5 Sonnet especially on document/diagram tasks while preserving the linguistic capabilities of the underlying Qwen2.5 LLM. The model is released in three sizes for different deployment scenarios and is presented as an interactive visual agent.
Significance. If the performance claims hold after proper verification, the work would be significant for demonstrating that open multimodal models can reach parity with leading proprietary systems on practical visual tasks such as document understanding and agentic interaction. The architectural choices around dynamic-resolution ViT and absolute temporal encoding, if shown to be effective without heavy normalization, could provide reusable techniques for future vision-language models handling real-world scale and duration variations.
major comments (2)
- [Abstract] The central claim that 'the flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding' is unsupported by any quantitative benchmark scores, tables, or evaluation protocols. This absence is load-bearing because the manuscript provides no evidence (e.g., DocVQA, ChartQA, or video localization results) against which the parity assertion can be assessed.
- [Abstract] No details are given on training data composition, decontamination procedures for the cited document/diagram/long-video benchmarks, or ablation studies isolating the contribution of the dynamic-resolution ViT and absolute time encoding. Without these, it is impossible to rule out data contamination as an alternative explanation for the headline performance, directly undermining the reliability of the SOTA-matching claim.
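One common form of the decontamination check the referee asks about is an n-gram overlap test between benchmark items and training text. The sketch below is a generic illustration under stated assumptions (whitespace tokenization, lowercase matching, an 8-gram default); it is not the procedure used for Qwen2.5-VL, which the report does not describe, and the function names are hypothetical.

```python
def ngrams(text, n):
    """Set of word n-grams from lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with training text."""
    if not benchmark_items:
        return 0.0
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(benchmark_items)


# Toy data for illustration; n=3 keeps the example short.
training = ["The total amount due is 42 dollars"]
benchmark = ["What is the total amount due", "How many bars are in the chart"]
rate = contamination_rate(benchmark, training, n=3)
```

A text-only check like this cannot catch leaked images or video frames, so it would be a lower bound on contamination for a vision-language benchmark.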
minor comments (1)
- [Abstract] The terms 'native dynamic-resolution Vision Transformer (ViT) trained from scratch' and 'Window Attention' are introduced at a high level, without even a brief clarification of how they differ from a standard ViT or from relative positional encodings; such a clarification would aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below, with planned revisions where appropriate.
- Referee: [Abstract] The central claim that 'the flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding' is unsupported by any quantitative benchmark scores, tables, or evaluation protocols. This absence is load-bearing because the manuscript provides no evidence (e.g., DocVQA, ChartQA, or video localization results) against which the parity assertion can be assessed.
  Authors: The full manuscript contains quantitative benchmark tables and results in the Experiments section, with direct comparisons on DocVQA, ChartQA, diagram understanding, and video localization tasks against GPT-4o and Claude 3.5 Sonnet. The abstract summarizes these findings at a high level. We will revise the abstract to explicitly reference the relevant tables, sections, and key metrics to make the supporting evidence clear.
  Revision: yes
- Referee: [Abstract] No details are given on training data composition, decontamination procedures for the cited document/diagram/long-video benchmarks, or ablation studies isolating the contribution of the dynamic-resolution ViT and absolute time encoding. Without these, it is impossible to rule out data contamination as an alternative explanation for the headline performance, directly undermining the reliability of the SOTA-matching claim.
  Authors: We will add expanded ablation studies in the manuscript that isolate the contributions of the native dynamic-resolution ViT (with Window Attention) and absolute time encoding. However, full details on training data composition and decontamination procedures cannot be provided.
  Revision: partial
Left unresolved by the rebuttal:
- Detailed training data composition and decontamination procedures for the benchmarks
Circularity Check
No derivation chain present; empirical benchmark reporting only
The paper is a technical report on model architecture, training procedure, and benchmark results. It contains no equations, first-principles derivations, fitted parameters presented as predictions, or load-bearing self-citations that reduce claims to inputs by construction. All performance claims rest on reported empirical numbers rather than any self-referential mathematical structure. This matches the default expectation of no significant circularity.
Forward citations
Cited by 60 Pith papers
- The Yes-Man Syndrome: Benchmarking Abstention in Embodied Robotic Agents
  The paper presents RoboAbstention, a new benchmark showing frontier VLMs and embodied planners abstain on only 16.5-39% of 6,069 instructions grounded in robotics images, with prompting interventions raising rates to ...
- Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination
  VLMs fail to detect semantically different image swaps up to 60% of the time despite self-reflective statements, with thinking models more vulnerable and attention analysis showing self-reflection does not increase vi...
- MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays
  MI-CXR is a new benchmark that shows state-of-the-art vision-language models achieve only 29.3% accuracy on longitudinal reasoning tasks across multi-visit chest X-ray sequences.
- CalibAnyView: Beyond Single-View Camera Calibration in the Wild
  A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.
- Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts
  An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.
- Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation
  M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but inco...
- DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents
  DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.
- Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models
  Flame3D enables zero-shot compositional 3D scene reasoning by representing scenes as editable visual-textual memories exposed to agentic MLLMs through composable and synthesizable spatial tools.
- RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation
  RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.
- MedHorizon: Towards Long-context Medical Video Understanding in the Wild
  MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
- VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
  VEBENCH is the first benchmark evaluating LMMs on video editing technique recognition and operation simulation using 3.9K videos and 3,080 QA pairs, revealing a large performance gap to humans.
- EgoSound: Benchmarking Sound Understanding in Egocentric Videos
  EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.
- VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing
  VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.
- Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving
  Cornfigurator is the first automated deployment planner for generic any-to-any multimodal models that explores the full range of colocation-to-disaggregation strategies and delivers 1.12x to 6.32x higher goodput than ...
- ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos
  ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.
- MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
  MVI-Bench supplies the first taxonomy and dataset focused on misleading visual inputs to measure LVLM robustness, with tests on 18 models revealing clear weaknesses.
- FLEX: A Largescale Multimodal, Multiview Dataset for Learning Structured Representations for Fitness Action Quality Assessment
  FLEX is the first large-scale multimodal multiview dataset for fitness AQA, featuring RGB, 3D pose, sEMG and physiological data plus a Fitness Knowledge Graph for structured annotations and a VideoQA benchmark.
- Beyond Binary Edits Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment
  Introduces Latent Adversarial Robustification and Rank-Constrained Subspace Learning to enable robust generalization in multimodal knowledge editing through adversarial subspace alignment.
- VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset
  VINS-120K supplies the first large-scale set of instruction-image-edited-image triplets at ultra-high resolution together with an adaptation strategy that improves detail synthesis.
- DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection
  A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.
- CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
  Introduces CaST-Bench, a dataset of 2,066 causal questions on 1,015 videos with annotated causal chains and metrics to evaluate VLMs on spatio-temporal causal reasoning.
- CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration
  CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.
- Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs
  Video-LLMs exhibit directional motion blindness from a direction binding gap; DeltaDirect projector objective lifts synthetic accuracy to 85.4% and real accuracy by 21.9 points while preserving other video capabilities.
- ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs
  ST-SimDiff is a training-free method using a spatio-temporal graph and dual similarity-difference selection to compress video tokens for MLLMs while retaining static and dynamic content.
- SDGBiasBench: Benchmarking and Mitigating Vision-Language Models' Biases in Sustainable Development Goals
  SDGBiasBench reveals intrinsic SDG biases in VLMs driven by priors rather than evidence, and CADE mitigates them with up to 25% accuracy gains and 12-point MAE reductions.
- Seizure-Semiology-Suite (S3): A Clinically Multimodal Dataset, Benchmark, and Models for Seizure Semiology Understanding
  Seizure-Semiology-Suite provides a new clinically annotated video dataset and hierarchical benchmark that exposes weaknesses in current MLLMs for seizure semiology and demonstrates gains from fine-tuning and a neuro-s...
- Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?
  The Token Replacement Test shows VLMs keep most accuracy gains even after corrupting or replacing continuous thought token content, indicating the tokens are not used as information bottlenecks.
- Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
  Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.
- iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance
  iTryOn is a video diffusion Transformer that injects spatial 3D hand guidance and semantic action captions to enable interactive garment replacement in videos.
- ArPoMeme: An Annotated Arabic Multimodal Dataset for Political Ideology and Polarization
  ArPoMeme is a new annotated multimodal dataset of Arabic political memes collected via Facebook scraping and vision-language model text extraction, then manually labeled for ideology and polarization aspects.
- What Semantics Survive the Connector? Diagnosing VLM-to-DiT Alignment in Video Editing
  VLM-to-DiT alignment in video editing models acts as a semantic bottleneck that degrades fine-grained structural semantics, demonstrated via a new diagnostic dataset and protocol on relation-based edits.
- Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction
  Draw2Think recasts geometric reasoning as agentic interaction with a constraint engine, achieving 95.9% predicate-level construction fidelity and up to 16.4% accuracy gains on solid geometry tasks.
- ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning
  ConceptSeg-R1 uses Meta-GRPO meta-RL to learn transferable rules from visual demonstrations and apply them via concept translation for generalized concept segmentation across CI, CD, and CR levels.
- ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
  ParaVT is a parallel video tool-calling RL framework that resolves the Tool Prior Paradox via PARA-GRPO, delivering +7.9% average gains on six long-video benchmarks and raising format compliance from 0.13 to 0.64.
- CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models
  Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
- WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents
  WildRoadBench provides a professionally annotated UAV corpus and dual-track protocol showing frontier VLMs and LLM agents achieve limited performance on wild aerial road-damage grounding under unified metrics.
- Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
  AutoTool uses reinforcement learning with dual-mode rewards to train multimodal LLMs to adaptively choose between tool-assisted and text-centric reasoning, yielding accuracy and efficiency gains on V* and POPE benchmarks.
- FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
  FineBench is a new dense VQA benchmark for fine-grained human activity in long videos that exposes weaknesses in open VLMs and demonstrates gains from the proposed FineAgent modular framework.
- FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
  FineBench is a new dense VQA benchmark for fine-grained human activity understanding in long videos, revealing weaknesses in open VLMs and showing that FineAgent improves them via localization and description modules.
- MCNav: Memory-Aware Dynamic Cognitive Map for Zero-shot Goal-oriented Navigation
  MCNav builds a dynamic cognitive map with goal re-validation and missed-goal re-exploration to reach state-of-the-art results on instance-level zero-shot navigation in HM3D environments.
- Vision Harnessing Agent for Open Ad-hoc Segmentation
  VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
- LMM-Track4D: Eliciting 4D Dynamic Reasoning in LMMs via Trajectory-Grounded Dialogue
  LMM-Track4D formulates a trajectory-grounded dialogue task, releases Track4D-Bench with 526 samples, and proposes RTGE encoding, TRK state token, and OSK-RA decoder to elicit better 4D spatiotemporal reasoning in LMMs.
- EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos
  EgoExoMem is the first benchmark for cross-view memory reasoning on synchronized egocentric-exocentric videos, where E2-Select raises MLLM accuracy from 55.3% to 58.2% over baselines.
- Dexora: Open-source VLA for High-DoF Bimanual Dexterity
  Dexora is the first open-source VLA system for dual-arm dual-hand high-DoF manipulation, trained on 100K simulated and 10K real teleoperated trajectories with a discriminator-weighted diffusion policy, achieving 66.7%...
- Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models
  SP-CoR is a multimodal LLM framework using dynamics-aware sampling, spectral-physics view fusion, and prompt distillation that outperforms baselines on the new CoopSR benchmark and EgoTeam dataset for multi-robot coop...
- PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
  PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic onlin...
- ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation
  ReAlign distills LLM-generated reasoning texts into a lightweight AIGI forgery detector via contrastive image-text alignment to improve generalization on complex forgeries.
- GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions
  GRASP is a large-scale dataset and benchmark for social reasoning grounded in gaze and gesture events in multi-person videos, with Social Grounding Reward (SGR) proposed to improve model performance on GRASP-Bench.
- Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
  Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.
- Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
  Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.
- GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding
  GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolutio...
- From Table to Cell: Attention for Better Reasoning with TABALIGN
  TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding exec...
- Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
  BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.
- Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
  BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.
- R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
  R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.
- EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
  A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
- RotVLA: Rotational Latent Action for Vision-Language-Action Model
  RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
- Asymmetric Flow Models
  Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...
- Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
  Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
- WinDeskGround: A Benchmark for Robust GUI Grounding in Complex Multi-Window Desktop Environments
  WinDeskGround is a parametrically generated benchmark of 1,356 instruction-target pairs that reveals accuracy declines in state-of-the-art MLLMs under partial occlusion in multi-window GUI settings.
Reference graph
Works this paper leans on
-
[1]
Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Monicault, Saurabh Garg, Theophile Gervet, et al. Pixtral 12b. arXiv preprint arXiv:2410.07073,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models
Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v-synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684, 2024a. Jiacheng Chen, Tianhao Liang, Sherman Siu, Zhengqing Wang, Kai Wang, Yubo Wang, Yuansheng Ni, Wang ...
work page internal anchor Pith review arXiv
-
[3]
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024d. Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
URL https: //blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/ . 16 DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guantin...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
doi: 10.48550/ARXIV .2412.19437. URL https://doi.org/10.48550/arXiv.2412.19437. Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv
-
[8]
Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing
Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. arXiv preprint arXiv:2406.14515,
-
[9]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv:2306.13394,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv:2405.21075, 2024a. Ling Fu, Biao Yang, Zhebin Kuang, Jiajun Song, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Are we done with mmlu? CoRR, abs/2406.04127,
Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with mmlu? CoRR, abs/2406.04127,
-
[12]
Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models. arXiv:2310.14566,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale
Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xiang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale. arXiv preprint arXiv:2412.05237,
-
[14]
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video- mmmu: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Otterhd: A high- resolution multi-modality model
Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. Otterhd: A high- resolution multi-modality model. arXiv:2311.04219, 2023a. Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 202...
-
[16]
Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv:2310.03744, 2023a. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv:2304.08485, 2023b. Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chun yue Li, Jianwei Yang, Hang Su, Jun-Juan Zh...
-
[17]
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv:2203.10244,
-
[18]
Minesh Mathew, Viraj Bagal, Rubèn Pérez Tito, Dimosthenis Karatzas, Ernest Valveny, and C.V. Jawahar. Infographicvqa. 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 2582–2591, 2021a. Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In WACV, 2021b. MiniMax, Aonian Li, Bangw...
-
[19]
URL https://arxiv.org/abs/2501.08313. Openai. Chatml documents,
-
[20]
URL https://arxiv.org/abs/2412.07626. Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching clip to count to ten. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3170–3180,
-
[21]
Kosmos-2: Grounding Multimodal Large Language Models to the World
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv:2306.14824,
-
[22]
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on...
-
[23]
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
URL http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html. Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents. ar...
-
[24]
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. CoRR, abs/2311.12022,
-
[25]
Grounding dino 1.5: Advance the "edge" of open-set object detection
Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, et al. Grounding dino 1.5: Advance the "edge" of open-set object detection. arXiv preprint arXiv:2405.10300,
-
[26]
Mtvqa: Benchmarking multilingual text-centric visual question answering
Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shu Wei, Chunhui Lin, Wanqing Li, Mohamad Fitri Faiz Bin Mahmood, Hao Feng, Zhen Zhao, Yanjie Wang, Yuliang Liu, Hao Liu, Xiang Bai, and Can Huang. Mtvqa: Benchmarking multilingual text-centric visual question answering. arXiv:2405.11985,
-
[27]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,
-
[28]
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. arXiv preprint arXiv:2406.16860,
-
[29]
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Fei Wang, Xingyu Fu, James Y Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, et al. Muirbench: A comprehensive benchmark for robust multi-image understanding. arXiv preprint arXiv:2406.09411, 2024a. Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-...
-
[30]
Emu3: Next-Token Prediction is All You Need
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024i. Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li...
-
[31]
Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. arXiv preprint arXiv:2406.18521, 2024k. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, ...
-
[32]
URL https://arxiv.org/abs/2201.11903. Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. LiveBench: A challenging, contamination-free LLM benchmark. CoRR, abs/2406.19314,
-
[33]
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding, 2024a. URL https://arxiv.org/abs/2407.15754. Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-lang...
-
[34]
Florence-2: Advancing a unified representation for a variety of vision tasks (2023)
Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks (2023). URL https://arxiv.org/abs/2311.06242,
-
[35]
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous gui interaction. arXiv preprint arXiv:2412.04454,
-
[36]
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2.5 technical report. arXiv:2412.15115, 2024a. Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Shuai Bai, LianWen Jin, and Junyang Lin. Cc-ocr: A comprehensive and challe...
-
[37]
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. arXiv:2311.04257,
-
[38]
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv:2311.16502,
-
[39]
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Ming Yin, Botao Yu, Ge Zhang, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2409.02813,
-
[40]
Ferret-v2: An improved baseline for referring and grounding with large language models
Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, and Yinfei Yang. Ferret-v2: An improved baseline for referring and grounding with large language models. arXiv:2404.07973, 2024a. Pan Zhang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang, Rui Qian, Xilin Wei, Lin Chen, Yifei Li,...
-
[41]
Mmvu: Measuring expert-level multi-discipline video understanding, 2025
URL https://arxiv.org/abs/2501.12380. Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. CoRR, abs/2311.07911,
-
[42]
MLVU: Benchmarking Multi-task Long Video Understanding
Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. Mlvu: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264,