pith. machine review for the scientific record.

arxiv: 2504.10479 · v3 · submitted 2025-04-14 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Botian Shi, Conghui He, Dahua Lin, Erfei Cui, Han Lv, Hao Li, Haomin Wang, Hao Tian, Hongjie Zhang, Huipeng Deng, Jiahao Wang, Jiapeng Luo, Jiaye Ge, Jie Shao, Jifeng Dai, Jinguo Zhu, Junjun He, Kai Chen, Kaipeng Zhang, Lewei Lu, Lijun Wu, Limin Wang, Lixin Gu, Min Dou, Nianchen Deng, Penglong Jiao, Peng Sun, Shenglong Ye, Songze Li, Tan Jiang, Tong Lu, Weijie Su, Weiye Xu, Weiyun Wang, Wenhai Wang, Wenqi Shao, Wenwen Qu, Xingcheng Zhang, Xingguang Wei, Xizhou Zhu, Xuehui Wang, Yangzhou Liu, Yinan He, Yingtong Xiong, Yi Wang, Yuchen Duan, Yue Cao, Yu Qiao, Zhangwei Gao, Zhaoyang Liu, Zhe Chen

Pith reviewed 2026-05-10 13:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords native multimodal pre-training · open-source MLLM · variable visual position encoding · supervised fine-tuning · mixed preference optimization · test-time scaling · MMMU benchmark

The pith

InternVL3 jointly pre-trains language and vision capabilities in one stage to avoid later alignment steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InternVL3 as a new open-source multimodal model built through a native pre-training approach. Rather than first training a text-only model and then adapting it for images, the system learns both linguistic and visual skills at the same time from mixed data sources. This single-stage process is intended to reduce the alignment problems that arise in conventional two-step pipelines. The work adds variable visual position encoding for longer contexts along with refined post-training and test-time methods. If the central claim holds, the approach would simplify the creation of capable multimodal systems while preserving strong text-only performance.

Core claim

InternVL3 employs a native multimodal pre-training paradigm in which the model jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training addresses complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. The paradigm is paired with variable visual position encoding to support extended multimodal contexts, advanced post-training techniques such as supervised fine-tuning and mixed preference optimization, and test-time scaling strategies.
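
As a rough illustration of what single-stage joint training means at the data level, the sketch below interleaves text-only and multimodal samples in one stream instead of running two separate training phases. It is a minimal sketch, not the authors' pipeline: the function name `mixed_stream`, the `multimodal_ratio` parameter, and its default of 0.75 are hypothetical and not taken from the paper.

```python
import random
from typing import Iterator, Sequence, Tuple


def mixed_stream(
    text_docs: Sequence[str],
    multimodal_docs: Sequence[str],
    multimodal_ratio: float = 0.75,  # hypothetical value, not from the paper
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (source, sample) pairs from a single interleaved stream.

    Each draw comes from the multimodal pool with probability
    `multimodal_ratio`, otherwise from the text-only pool.
    """
    rng = random.Random(seed)
    while True:
        if rng.random() < multimodal_ratio:
            yield "multimodal", rng.choice(multimodal_docs)
        else:
            yield "text", rng.choice(text_docs)
```

The stream is infinite, so a caller would bound it, for example with `itertools.islice(mixed_stream(texts, pairs), 1000)`. The mixing ratio is exactly the kind of free parameter the ledger further down flags.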

What carries the argument

native multimodal pre-training paradigm that jointly acquires multimodal and linguistic capabilities from multimodal data and pure-text corpora in a single stage

Load-bearing premise

Performance gains arise chiefly from the joint native pre-training rather than from model scale, data mixture choices, or the post-training and test-time techniques.

What would settle it

A controlled comparison that trains an otherwise identical model using the conventional text-first adaptation pipeline on the same data and scale would show whether the unified pre-training stage is required for the reported benchmark gains.

Original abstract

We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents InternVL3, a family of open-source multimodal large language models that adopt a native multimodal pre-training paradigm: jointly training on multimodal data and pure-text corpora in a single pre-training stage rather than post-hoc vision-language alignment of a text-only LLM. The work further incorporates variable visual position encoding (V2PE) for extended contexts, supervised fine-tuning (SFT) plus mixed preference optimization (MPO) post-training, and test-time scaling. Extensive benchmark results are reported, with the headline claim that InternVL3-78B reaches 72.2 on MMMU (new open-source SOTA) while remaining competitive with proprietary models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro; the authors commit to releasing both model weights and training data.

Significance. If the performance gains can be causally attributed to the unified pre-training paradigm rather than scale, data curation, or post-training choices, the result would meaningfully advance open-source MLLM training recipes by reducing alignment overhead and improving joint multimodal-linguistic capability. The public release of weights and data is a concrete contribution that enables reproducibility and follow-on research.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): the central attribution of the 72.2 MMMU score and competitiveness with proprietary models to the 'native multimodal pre-training paradigm' is not isolated from confounding factors. No controlled ablation compares the unified single-stage training against a same-scale (78B) post-hoc alignment baseline trained on identical data mixtures; without this comparison the headline claim that the paradigm 'effectively addresses the complexities and alignment challenges' remains unsupported.
  2. [§3 and Table 2] §3 (Method) and Table 2: the description of V2PE and the data mixture ratios are presented without quantitative ablations showing their individual contributions to the reported gains. The paper lists numerous training hyperparameters and data ratios as free parameters, yet provides no sensitivity analysis or removal experiments that would demonstrate the paradigm's necessity over scale and curation alone.
minor comments (2)
  1. [§4.1] §4.1 and Appendix: statistical significance or variance estimates across multiple runs are not reported for the MMMU, MMBench, or other key benchmarks, making it difficult to assess whether the 72.2 score reflects a stable improvement.
  2. [Abstract and §2] Abstract and §2: the claim that InternVL3 'maintains strong pure-language proficiency' would benefit from explicit comparison tables against the base LLM (e.g., Qwen2.5-72B for InternVL3-78B) on standard text-only benchmarks such as MMLU or GSM8K.
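
Minor comment 1 can be made concrete with a back-of-the-envelope estimate. The sketch below computes the binomial standard error of an accuracy score and a normal-approximation interval around it. This captures sampling noise over benchmark questions only, not the run-to-run variance the comment also asks about. The function names are illustrative, and any question count passed in is the caller's assumption.

```python
import math
from typing import Tuple


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy in [0, 1] over n questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def normal_interval(
    accuracy: float, n_questions: int, z: float = 1.96
) -> Tuple[float, float]:
    """Normal-approximation interval (z = 1.96 gives roughly 95%)."""
    se = accuracy_standard_error(accuracy, n_questions)
    return accuracy - z * se, accuracy + z * se
```

For instance, `accuracy_standard_error(0.722, 1000)` comes out to roughly 0.014, that is, about 1.4 points. The count of 1000 is a placeholder, not the actual size of any benchmark split.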

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below, providing honest clarifications based on the manuscript's content and scope while outlining planned revisions.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the central attribution of the 72.2 MMMU score and competitiveness with proprietary models to the 'native multimodal pre-training paradigm' is not isolated from confounding factors. No controlled ablation compares the unified single-stage training against a same-scale (78B) post-hoc alignment baseline trained on identical data mixtures; without this comparison the headline claim that the paradigm 'effectively addresses the complexities and alignment challenges' remains unsupported.

    Authors: We agree that a direct, controlled ablation at the 78B scale against an identical-data post-hoc baseline would provide the strongest causal evidence. Such an experiment is not present in the manuscript and would require prohibitive additional compute. Our claims are supported by consistent gains over prior InternVL versions that used post-hoc alignment, as well as cross-model comparisons. We will revise the abstract and §4 to attribute performance to the full integrated recipe (native pre-training plus V2PE, MPO, and test-time scaling) rather than the paradigm in isolation, and add a limitations paragraph noting the absence of this specific ablation. revision: partial

  2. Referee: [§3 and Table 2] §3 (Method) and Table 2: the description of V2PE and the data mixture ratios are presented without quantitative ablations showing their individual contributions to the reported gains. The paper lists numerous training hyperparameters and data ratios as free parameters, yet provides no sensitivity analysis or removal experiments that would demonstrate the paradigm's necessity over scale and curation alone.

    Authors: We acknowledge that additional quantitative ablations would strengthen the presentation of V2PE and data-mixture choices. The manuscript already reports some supporting results for these components in §3 and the experiments; we will expand this in revision by adding sensitivity analyses and removal experiments at smaller scales (e.g., 8B/14B) in §3 or an appendix, showing their impact on key benchmarks while keeping the main 78B results unchanged. revision: yes

standing simulated objections not resolved
  • A same-scale (78B) controlled ablation of native multimodal pre-training versus post-hoc alignment on identical data mixtures, due to prohibitive computational cost.

Circularity Check

0 steps flagged

No circularity: empirical benchmark reporting with no derivational reduction

full rationale

The paper is an empirical model-release work. It describes a training paradigm (native multimodal pre-training in one stage), reports benchmark scores such as 72.2 on MMMU for the 78B model, and compares to proprietary systems. No equations, derivations, or 'predictions' appear in the provided text. Claims rest on external, independently verifiable benchmarks rather than self-referential metrics or fitted parameters renamed as outputs. Prior InternVL citations exist but are not load-bearing for the central empirical result; the performance numbers can be checked against released weights and public test sets. Absence of ablations is a question of evidence strength, not circularity. The derivation chain is empty, so no reduction to inputs occurs.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

The work rests on standard deep-learning scaling assumptions and the premise that joint multimodal-text pre-training reduces alignment issues; V2PE is introduced as a new component without independent prior validation.

free parameters (1)
  • Numerous training hyperparameters and data mixture ratios
    Typical in large-scale LLM/MLLM training; values are chosen to optimize final benchmark scores.
invented entities (1)
  • Variable Visual Position Encoding (V2PE) · no independent evidence
    purpose: Support extended multimodal contexts
    New encoding scheme introduced to handle longer visual sequences.
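
For readers unfamiliar with V2PE, the idea as we understand it is that text tokens advance the position index by 1 while visual tokens advance it by a smaller step, so long image or video sequences consume less of the position range. The sketch below illustrates that indexing only. The function name `v2pe_positions`, the default `delta` of 1/4, and the toy sequence in the usage note are assumptions for illustration; the paper should be consulted for the exact formulation.

```python
from fractions import Fraction
from typing import List, Sequence


def v2pe_positions(
    is_visual: Sequence[bool],
    delta: Fraction = Fraction(1, 4),  # hypothetical step for visual tokens
) -> List[Fraction]:
    """Assign a position index to each token.

    The first token sits at position 0. Every later token advances the
    index by `delta` if it is visual and by 1 if it is text.
    """
    positions: List[Fraction] = []
    pos = Fraction(0)
    for i, visual in enumerate(is_visual):
        if i > 0:
            pos += delta if visual else 1
        positions.append(pos)
    return positions
```

With one text token, four visual tokens, and one more text token, `v2pe_positions([False, True, True, True, True, False])` gives positions 0, 1/4, 1/2, 3/4, 1, 2, so the four visual tokens together occupy a single unit of position range. Setting `delta` to 1 recovers ordinary consecutive indexing.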

pith-pipeline@v0.9.0 · 5754 in / 1376 out tokens · 26485 ms · 2026-05-10T13:36:25.141546+00:00 · methodology



Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

    cs.CV 2026-05 unverdicted novelty 8.0

    TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

  2. MedHorizon: Towards Long-context Medical Video Understanding in the Wild

    cs.CV 2026-05 unverdicted novelty 8.0

    MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

  3. S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

    cs.CV 2026-04 unverdicted novelty 8.0

    S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

  4. EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

    cs.CV 2026-04 unverdicted novelty 8.0

    EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

  5. Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

    cs.CV 2026-04 conditional novelty 8.0

    VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

  6. When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR

    cs.CY 2026-04 unverdicted novelty 8.0

    VLMs over-correct multi-line handwritten math OCR, and the PINK metric using LLM rubric grading penalizes this for better human alignment.

  7. G²TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

    cs.CV 2026-05 unverdicted novelty 7.0

    G²TR reduces visual tokens and prefill computation by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency while preserving reasoning accuracy and editing quality.

  8. Allegory of the Cave: Measurement-Grounded Vision-Language Learning

    cs.AI 2026-05 unverdicted novelty 7.0

    PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.

  9. Count Anything at Any Granularity

    cs.CV 2026-05 unverdicted novelty 7.0

    Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for impro...

  10. BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

    cs.AI 2026-05 unverdicted novelty 7.0

    BenchCAD is a new benchmark showing that frontier multimodal models recover coarse geometry but fail to generate faithful parametric CAD programs for industrial parts.

  11. BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

    cs.AI 2026-05 unverdicted novelty 7.0

    BenchCAD benchmark shows frontier multimodal models recover coarse geometry but fail to produce accurate parametric CAD programs for industrial parts, with limited generalization after fine-tuning.

  12. SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation

    cs.CV 2026-05 unverdicted novelty 7.0

    SciVQR is a new benchmark dataset for evaluating multimodal AI models on complex scientific reasoning tasks across six disciplines, including expert solutions for nearly half the items.

  13. TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

    cs.CV 2026-05 conditional novelty 7.0

    TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.

  14. TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    TOC-Bench is an object-track-grounded benchmark that filters for temporally dependent questions and shows Video-LLMs have major weaknesses in event counting, ordering, identity reasoning, and hallucination detection.

  15. When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

    cs.AI 2026-05 conditional novelty 7.0

    State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating l...

  16. Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    STEMO-Bench evaluates intermediate spatio-temporal reasoning in video MLLMs via object-centric facts, and STEMO-Track improves consistency by chunk-wise trajectory construction and aggregation.

  17. SphereVAD: Training-Free Video Anomaly Detection via Geodesic Inference on the Unit Hypersphere

    cs.CV 2026-05 unverdicted novelty 7.0

    SphereVAD performs training-free video anomaly detection by recasting anomaly discrimination as von Mises-Fisher likelihood-ratio geodesic inference on the unit hypersphere using intermediate MLLM features, with Frech...

  18. GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.

  19. PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    PolarVLM is the first VLM framework to integrate polarimetric physical parameters via dual-stream architecture and progressive training, delivering 25.4% gains over RGB baselines on reflection and transparency tasks w...

  20. PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    PolarVLM integrates polarimetric physical parameters into VLMs via dual-stream architecture and progressive training, outperforming RGB baselines by 25.4% on a new 75K-pair polarization-aware VQA benchmark.

  21. Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    ScaleEarth conditions remote sensing VLMs on continuous GSD via CS-HLoRA and a visual GSD predictor, creating a closed training loop with GeoScale-VQA to achieve SOTA on Earth observation benchmarks.

  22. Structured Role-Aware Policy Optimization for Multimodal Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    SRPO refines GRPO into role-aware token-level advantages by emphasizing perception tokens based on visual dependency (original vs. corrupted inputs) and reasoning tokens based on consistency with perception, unified v...

  23. Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks...

  24. VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    VT-Bench is the first unified benchmark aggregating 14 visual-tabular datasets with over 756K samples and evaluating 23 models to expose challenges in this multi-modal area.

  25. GEASS: Training-Free Caption Steering for Hallucination Mitigation in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    GEASS selectively gates and weights self-generated captions using confidence and entropy to reduce object hallucinations in VLMs, outperforming vanilla inference and contrastive decoding on POPE and HallusionBench.

  26. QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

    quant-ph 2026-04 unverdicted novelty 7.0

    Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.

  27. FCMBench-Video: Benchmarking Document Video Intelligence

    cs.CV 2026-04 unverdicted novelty 7.0

    FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.

  28. Improving Vision-language Models with Perception-centric Process Reward Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.

  29. CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...

  30. SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

    cs.CV 2026-04 unverdicted novelty 7.0

    SpaMEM benchmark shows multimodal LLMs succeed at spatial tasks with text histories but sharply fail at long-horizon belief maintenance from raw visual streams alone.

  31. Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    Introduces the diagnosis-driven CE video summarization task, the VideoCAP dataset with 240 annotated videos, and the DiCE framework that outperforms prior methods by screening candidates then weaving them into diagnos...

  32. X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis

    cs.CV 2026-04 unverdicted novelty 7.0

    X-PCR is a new benchmark of 26,415 images and 177,868 expert VQA pairs that evaluates MLLMs on six-stage progressive reasoning and cross-modality integration in ophthalmology.

  33. Hybrid Latent Reasoning with Decoupled Policy Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.

  34. Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts

    cs.CL 2026-04 unverdicted novelty 7.0

    Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.

  35. GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning

    cs.RO 2026-04 unverdicted novelty 7.0

    GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.

  36. S-GRPO: Unified Post-Training for Large Vision-Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.

  37. HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning

    cs.CL 2026-04 unverdicted novelty 7.0

    HyperGVL is the first benchmark for LVLMs on hypergraph tasks from basic counting to NP-hard reasoning, with 12 models tested and a router proposed to adapt representations.

  38. MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production

    cs.MM 2026-04 unverdicted novelty 7.0

    MCSC-Bench is the first large-scale dataset for the Multimodal Context-to-Script Creation task, requiring models to select relevant shots from redundant materials, plan missing shots, and generate coherent scripts wit...

  39. SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    SLQ turns frozen MLLMs into retrievers via shared latent queries appended to inputs, outperforming fine-tuning on COCO and Flickr30K while introducing KARR-Bench for knowledge-aware evaluation.

  40. Why MLLMs Struggle to Determine Object Orientations

    cs.CV 2026-04 accept novelty 7.0

    Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.

  41. Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote s...

  42. OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

    cs.CV 2026-04 unverdicted novelty 7.0

    OmniScript is a new 8B omni-modal model that turns long cinematic videos into scene-by-scene scripts and matches top proprietary models on temporal localization and semantic accuracy.

  43. MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.

  44. EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks

    cs.CV 2026-04 unverdicted novelty 7.0

    EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.

  45. ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    ActFER reformulates facial expression recognition as active tool-augmented visual reasoning with a custom reinforcement learning algorithm UC-GRPO that outperforms passive MLLM baselines on AU prediction.

  46. CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    CrashSight is a new infrastructure-focused benchmark showing that state-of-the-art vision-language models can describe crash scenes but fail at temporal and causal reasoning.

  47. ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

  48. Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding

    cs.CV 2026-04 unverdicted novelty 7.0

    Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.

  49. MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments

    cs.CV 2026-04 unverdicted novelty 7.0

    MARINER is a new benchmark dataset and evaluation framework for fine-grained perception and causal reasoning in open-water scenes using 16,629 images across 63 vessel categories, diverse environments, and maritime incidents.

  50. DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    DISSECT benchmark reveals that VLMs extract visual details from scientific diagrams but frequently lose them during reasoning, with open-source models showing a larger integration gap than closed-source ones.

  51. The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Fine-tuning VLMs for driving erodes pre-trained world knowledge, but shifting adaptation to prompt space via the Drive Expert Adapter preserves generalization while improving task performance.

  52. TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables

    cs.AI 2026-04 conditional novelty 7.0

    TableVision benchmark shows explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering 12.3% accuracy improvement through a decoupled perception-reasoning framework.

  53. Token Warping Helps MLLMs Look from Nearby Viewpoints

    cs.CV 2026-04 unverdicted novelty 7.0

    Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.

  54. V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators

    cs.CV 2026-03 unverdicted novelty 7.0

    V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...

  55. DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    cs.CV 2025-05 unverdicted novelty 7.0

    DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.

  56. Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training

    cs.CV 2026-05 unverdicted novelty 6.0

    VISTA uses prefix resampling and a vision-aware attention score to address data imbalance and language prior bias in self-improvement training of MLLMs, yielding up to 13.66% gains on reasoning tasks.

  57. Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

    cs.CV 2026-05 unverdicted novelty 6.0

    A new distillation method uses token-wise salient reasoning-prefix masking and self-paced scheduling to anchor student VLM thinking on visual inputs, outperforming prior distillation approaches on multimodal reasoning...

  58. Logit-Attention Divergence: Mitigating Position Bias in Multi-Image Retrieval via Attention-Guided Calibration

    cs.CV 2026-05 unverdicted novelty 6.0

    A training-free attention-guided debiasing framework mitigates position bias in MLLM multi-image retrieval by exploiting the observed mismatch between biased logits and aligned attention maps, yielding over 40% accura...

  59. RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

    cs.CV 2026-05 unverdicted novelty 6.0

    RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...

  60. Adversarial Attacks Against MLLMs via Progressive Resolution Processing and Adaptive Feature Alignment

    cs.CV 2026-05 unverdicted novelty 6.0

    PRAF-Attack improves targeted attack transferability on black-box MLLMs by using multi-scale progressive resolution and adaptive intermediate feature alignment instead of final-layer global features.

Reference graph

Works this paper leans on

154 extracted references · 154 canonical work pages · cited by 135 Pith papers · 28 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    CG-bench: Clue-grounded question answering benchmark for long video understanding

    Anonymous. CG-bench: Clue-grounded question answering benchmark for long video understanding. In Submitted to The Thirteenth International Conference on Learning Representations, 2024. under review. 14, 15

  3. [3]

    The claude 3 model family: Opus, sonnet, haiku

    Anthropic. The claude 3 model family: Opus, sonnet, haiku. https://www.anthropic.com, 2024. 2, 8, 9, 10, 11, 12

  4. [4]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732,

  5. [5]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966,

  6. [7]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 1, 2, 9, 10, 15

  7. [8]

    Smollm-corpus

    Loubna Ben Allal, Anton Lozhkov, Guilherme Penedo, Thomas Wolf, and Leandro von Werra. Smollm-corpus, 2024. 5

  8. [9]

    Scene text visual question answering

    Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4291–4301, 2019. 6

  9. [10]

    An augmented benchmark dataset for geometric question answering through dual parallel text encoding

    Jie Cao and Jing Xiao. An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1511–1520,

  10. [11]

    MapQA: A dataset for question answering on choropleth maps

    Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler-Lussier, and Ningchuan Xiao. Mapqa: A dataset for question answering on choropleth maps. arXiv preprint arXiv:2211.08545, 2022. 6

  11. [12]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023. 13

  12. [13]

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024. 12

  13. [14]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. 16

  14. [15]

    Internevo: Efficient long-sequence large language model training via hybrid parallelism and redundant sharding

    Qiaoling Chen, Diandian Gu, Guoteng Wang, Xun Chen, YingTong Xiong, Ting Huang, Qinghao Hu, Xin Jin, Yonggang Wen, Tianwei Zhang, et al. Internevo: Efficient long-sequence large language model training via hybrid parallelism and redundant sharding. arXiv preprint arXiv:2401.09149, 2024. 2, 7

  15. [16]

    M3CoT: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought

    Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, and Wanxiang Che. M3cot: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought. arXiv preprint arXiv:2405.16473, 2024. 6

  16. [17]

    Theoremqa: A theorem-driven question answering dataset

    Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. Theoremqa: A theorem-driven question answering dataset. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 7889–7901...

  17. [18]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024. 1, 2, 3, 5, 6, 9, 10, 11, 12, 13, 14, 15

  18. [20]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024. 2, 3

  19. [21]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024. 1, 2, 3

  20. [22]

    Seeclick: Harnessing gui grounding for advanced visual gui agents

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. arXiv preprint arXiv:2401.10935, 2024. 16

  21. [23]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024. 15

  22. [24]

    Simple and effective multi-paragraph reading comprehension

    Christopher Clark and Matt Gardner. Simple and effective multi-paragraph reading comprehension. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 845–855, 2018. 6

  23. [25]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. 16

  24. [26]

    Opencompass: A universal evaluation platform for foundation models

    OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023. 9, 10, 11, 12

  25. [27]

    Grok-1.5 vision preview: Connecting the digital and physical worlds with our first multimodal model

    X.AI Corp. Grok-1.5 vision preview: Connecting the digital and physical worlds with our first multimodal model. https://x.ai/blog/grok-1.5v, 2024. 11

  26. [28]

    NVLM: Open frontier-class multimodal LLMs

    Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nvlm: Open frontier-class multimodal llms. arXiv preprint arXiv:2409.11402, 2024. 10

  27. [29]

    Gemini 2.0 is now available to everyone

    Google Deepmind. Gemini 2.0 is now available to everyone. https://blog.google/technology/google-deepmind/gemini-model-updates-february-2025/, 202. 9

  28. [30]

    Introducing gemini 2.0: our new ai model for the agentic era

    Google Deepmind. Introducing gemini 2.0: our new ai model for the agentic era. https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/, 2024. 9

  29. [31]

    Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146, 2024. 1, 10

  30. [32]

    Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd

    Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, et al. Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd. arXiv preprint arXiv:2404.06512, 2024. 1

  31. [33]

    Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

    Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024. 7

  32. [34]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

  33. [35]

    Mmbench-video: A long-form multi-shot benchmark for holistic video understanding

    Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. arXiv preprint arXiv:2406.14515, 2024. 14, 15

  34. [36]

    Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories

    Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In Conference on Computer Vision and Pattern Recognition Workshop, pages 178–178, 2004. 7

  35. [37]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 12

  36. [38]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 14, 15

  37. [39]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390, 2024. 9, 11

  38. [40]

    G-llava: Solving geometric problem with multi-modal large language model

    Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, et al. G-llava: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370, 2023. 6

  39. [41]

    Mini-internvl: A flexible-transfer pocket multimodal model with 5% parameters and 90% performance

    Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, et al. Mini-internvl: A flexible-transfer pocket multimodal model with 5% parameters and 90% performance. arXiv preprint arXiv:2410.16261, 2024. 3

  40. [42]

    V2pe: Improving multimodal long-context capability of vision-language models with variable visual position encoding

    Junqi Ge, Ziyi Chen, Jintao Lin, Jinguo Zhu, Xihui Liu, Jifeng Dai, and Xizhou Zhu. V2pe: Improving multimodal long-context capability of vision-language models with variable visual position encoding. arXiv preprint arXiv:2412.09616, 2024. 2, 3, 18

  41. [43]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017. 6

  42. [44]

    Infinity-mm: Scaling multimodal performance with large-scale and high-quality instruction data

    Shuhao Gu, Jialing Zhang, Siyuan Zhou, Kevin Yu, Zhaohu Xing, Liangdong Wang, Zhou Cao, Jintao Jia, Zhuoyi Zhang, Yixuan Wang, et al. Infinity-mm: Scaling multimodal performance with large-scale and high-quality instruction data. arXiv preprint arXiv:2410.18558, 2024. 9, 10

  43. [45]

    Hallusionbench: You see what you think? or you think what you see? an image-context reasoning benchmark challenging for gpt-4v (ision), llava-1.5, and other multi-modality models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models. arXiv preprint arXiv:2310.14566, 2023. 8, 12, 13

  44. [46]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In The International Conference on Learning Representations,

  45. [47]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 202...

  46. [48]

    C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

    Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Yao Fu, et al. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Advances in Neural Information Processing Systems, 36, 2024. 16

  47. [49]

    Icdar2019 competition on scanned receipt ocr and information extraction

    Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and CV Jawahar. Icdar2019 competition on scanned receipt ocr and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1516–1520. IEEE, 2019. 6

  48. [50]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019. 6

  49. [51]

    Mantis: Interleaved multi-image instruction tuning

    Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. arXiv preprint arXiv:2405.01483, 2024. 9, 11

  50. [52]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017. 16

  51. [53]

    Binary classifier optimization for large language model alignment

    Seungjae Jung, Gunsoo Han, Daniel Wontae Nam, and Kyoung-Woon On. Binary classifier optimization for large language model alignment. arXiv preprint arXiv:2404.04656, 2024. 6

  52. [54]

    Dvqa: Understanding data visualizations via question answering

    Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5648–5656, 2018. 6

  53. [55]

    Geomverse: A systematic evaluation of large models for geometric reasoning

    Mehran Kazemi, Hamidreza Alvari, Ankit Anand, Jialin Wu, Xi Chen, and Radu Soricut. Geomverse: A systematic evaluation of large models for geometric reasoning. arXiv preprint arXiv:2312.12241, 2023. 6

  54. [56]

    Referitgame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 787–798, 2014. 13

  55. [57]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In European Conference on Computer Vision, pages 235–251, 2016. 6, 7, 8, 10

  56. [58]

    Natural questions: a benchmark for question answering research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019. 16

  57. [59]

    RACE: Large-scale ReAding Comprehension Dataset From Examinations

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017. 16

  58. [60]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 9, 10, 11, 12, 15, 16

  59. [61]

    SEED-Bench-2-Plus: Benchmarking multimodal large language models with text-rich visual comprehension

    Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, and Ying Shan. Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension. arXiv preprint arXiv:2404.16790, 2024. 8, 10

  60. [62]

    R-bench: Are your large multimodal model robust to real-world corruptions?

    Chunyi Li, Jianbo Zhang, Zicheng Zhang, Haoning Wu, Yuan Tian, Wei Sun, Guo Lu, Xiaohong Liu, Xiongkuo Min, Weisi Lin, et al. R-bench: Are your large multimodal model robust to real-world corruptions? arXiv preprint arXiv:2410.05474, 2024. 11

  61. [63]

    CMMLU: Measuring Massive Multitask Language Understanding in Chinese

    Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. Cmmlu: Measuring massive multitask language understanding in chinese. arXiv preprint arXiv:2306.09212, 2023. 16

  62. [64]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023. 15

  63. [65]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024. 14, 15

  64. [66]

    Mvitv2: Improved multiscale vision transformers for classification and detection

    Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Mvitv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4804–4814, 2022. 1, 3

  65. [67]

    Evaluating object hallucination in large vision-language models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In The Conference on Empirical Methods in Natural Language Processing, pages 292–305, 2023. 12, 13

  66. [68]

    Monkey: Image resolution and text label are important things for large multi-modal models

    Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607, 2023. 1

  67. [69]

    Eagle 2: Building post-training data strategies from scratch for frontier vision-language models

    Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, et al. Eagle 2: Building post-training data strategies from scratch for frontier vision-language models. arXiv preprint arXiv:2501.14818, 2025. 1

  68. [70]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023. 7

  69. [71]

    Vila: On pre-training for visual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699, 2024. 1, 9, 10, 15, 16

  70. [72]

    Clevr-math: A dataset for compositional language, visual and mathematical reasoning

    Adam Dahlgren Lindström and Savitha Sam Abraham. Clevr-math: A dataset for compositional language, visual and mathematical reasoning. arXiv preprint arXiv:2208.05358, 2022. 6

  71. [73]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2023. 2

  72. [74]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2025. 13

  73. [75]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281,

  74. [76]

    Ocrbench: On the hidden mystery of ocr in large multimodal models

    Yuliang Liu, Zhang Li, Hongliang Li, Wenwen Yu, Mingxin Huang, Dezhi Peng, Mingyu Liu, Mingrui Chen, Chunyuan Li, Lianwen Jin, et al. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023. 8, 10

  75. [77]

    Acemath: Advancing frontier math reasoning with post-training and reward modeling

    Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acemath: Advancing frontier math reasoning with post-training and reward modeling. arXiv preprint, 2024. 5

  76. [78]

    Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution

    Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution. arXiv preprint arXiv:2409.12961, 2024. 15

  77. [79]

    Scp-116k: A high-quality problem-solution dataset and a generalized pipeline for automated extraction in the higher education science domain

    Dakuan Lu, Xiaoyu Tan, Rui Xu, Tianchu Yao, Chao Qu, Wei Chu, Yinghui Xu, and Yuan Qi. Scp-116k: A high-quality problem-solution dataset and a generalized pipeline for automated extraction in the higher education science domain, 2025. 5

  78. [80]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023. 7, 8, 9

  79. [81]

    Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning

    Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165,

  80. [82]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022. 6

Showing first 80 references.