pith. machine review for the scientific record.

arxiv: 2408.01800 · v1 · submitted 2024-08-03 · 💻 cs.CV

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Ao Zhang, Chongyi Wang, Dahai Li, Guoyang Zeng, Haoye Zhang, Haoyu Li, Hongji Zhu, Huarong Zhou, Jie Cai, Jie Zhou, Junbo Cui, Maosong Sun, Qianyu Chen, Shengding Hu, Tianchi Cai, Tianyu Yu, Weilin Zhao, Xu Han, Yuan Yao, Zhensheng Zou, Zhihui He, Zhiyuan Liu, Zhi Zheng

Pith reviewed 2026-05-10 21:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords MiniCPM-V · multimodal large language models · mobile deployment · GPT-4V comparison · on-device AI · high-resolution image perception · OCR capability · low hallucination

The pith

MiniCPM-V reaches GPT-4V-level performance on benchmarks while running on mobile phones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MiniCPM-V as a family of efficient multimodal large language models built for end-side devices such as phones. It shows that the latest version, MiniCPM-Llama3-V 2.5, surpasses GPT-4V-1106, Gemini Pro, and Claude 3 on the OpenCompass suite of 11 benchmarks through combined advances in architecture, pretraining, and alignment. The model also delivers strong OCR, perception of 1.8M-pixel images at any aspect ratio, low hallucination rates, and support for more than 30 languages. If these results hold, high-capability MLLMs no longer require cloud servers and can operate locally in mobile, offline, and privacy-sensitive settings.

Core claim

MiniCPM-Llama3-V 2.5, obtained by integrating the latest MLLM techniques, outperforms leading closed models on a broad set of multimodal benchmarks while supporting high-resolution image input, accurate OCR, multilingual use, and trustworthy responses, all within an efficient footprint that fits mobile phones.

What carries the argument

The MiniCPM-V series architecture and training process, which combines recent advances in model design, pretraining, and alignment to achieve high performance at reduced size and compute cost.

If this is right

  • Practical deployment of GPT-4V-level MLLMs becomes possible in offline and privacy-protected environments.
  • End-device computation can now support applications previously limited to cloud servers.
  • Model size for usable multimodal performance continues to shrink as device hardware improves.
  • Real-world AI use cases expand into mobile, energy-constrained, and localized settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Continued scaling down of capable MLLMs could enable fully on-device personalized assistants.
  • Local processing would reduce reliance on network connectivity and third-party data handling.
  • Specialized mobile tools for vision-language tasks may emerge faster once the performance threshold is crossed on phones.

Load-bearing premise

The chosen benchmarks and evaluation protocol measure genuine multimodal ability without bias or data contamination.

What would settle it

A fresh, uncontaminated multimodal benchmark or real-world user study in which MiniCPM-Llama3-V 2.5 falls below GPT-4V-1106 on the same tasks.
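
One concrete form such a check could take is a paired resampling test over per-item scores rather than a single aggregate leaderboard number. A minimal sketch, assuming per-item scores are available for both models on the same tasks; the function name, `iters`, and `alpha` are illustrative choices, not anything specified in the paper:

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, iters=10000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for the mean score difference between
    two models evaluated on the same benchmark items (paired design)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample the paired differences with replacement and collect means.
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(iters)
    )
    lo = means[int(alpha / 2 * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi
```

A CI for the difference that excludes zero on a fresh benchmark would be far more settling than a single aggregate win.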

Original abstract

The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain preventing MLLMs from being practical in real-world applications. The most notable challenge comes from the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs need to be deployed on high-performing cloud servers, which greatly limits their application scopes such as mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features: (1) Strong performance, outperforming GPT-4V-1106, Gemini Pro and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks, (2) strong OCR capability and 1.8M pixel high-resolution image perception at any aspect ratio, (3) trustworthy behavior with low hallucination rates, (4) multilingual support for 30+ languages, and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: The model sizes for achieving usable (e.g., GPT-4V) level performance are rapidly decreasing, along with the fast growth of end-side computation capacity. This jointly shows that GPT-4V level MLLMs deployed on end devices are becoming increasingly possible, unlocking a wider spectrum of real-world AI applications in the near future.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MiniCPM-V, a series of compact multimodal large language models (MLLMs) optimized for on-device deployment. It claims that the latest MiniCPM-Llama3-V 2.5 variant achieves GPT-4V level performance by outperforming GPT-4V-1106, Gemini Pro, and Claude 3 across the OpenCompass suite of 11 benchmarks, while also providing strong OCR, 1.8M-pixel high-resolution perception at arbitrary aspect ratios, low hallucination rates, support for 30+ languages, and efficient inference on mobile phones. The work positions this as evidence of a broader trend toward smaller models reaching usable multimodal capabilities as on-device compute improves.

Significance. If the benchmark results hold under matched evaluation conditions, the paper provides concrete empirical support for the feasibility of GPT-4V-level MLLMs on consumer hardware. This has clear implications for privacy-preserving, offline, and energy-constrained applications. The demonstration of high-resolution image handling and multilingual capability in a mobile-friendly footprint is a practical contribution, and the trend observation about shrinking model sizes for frontier-level performance is worth documenting.

major comments (2)
  1. [Abstract and evaluation results] The central claim that MiniCPM-Llama3-V 2.5 outperforms GPT-4V-1106, Gemini Pro, and Claude 3 on OpenCompass rests on aggregate benchmark wins, but the manuscript does not confirm that the closed models were re-evaluated under the authors' exact protocol (identical prompt templates, image preprocessing, resolution handling, decoding parameters, and few-shot settings). Public leaderboard numbers are likely used instead; any mismatch in evaluation conditions would undermine the ranking and is therefore load-bearing for the outperformance assertion.
  2. [Evaluation section] No contamination audits, training-data overlap analysis, or explicit data-split details are provided for the 11 OpenCompass benchmarks, nor are error bars or multiple-run statistics reported on the scores. Since model scale, architecture hyperparameters, and training data mixture weights are free parameters, the absence of these checks leaves open the possibility that reported gains partly reflect data leakage rather than genuine capability gains.
minor comments (2)
  1. [Abstract] The abstract states '1.8M pixel high-resolution image perception at any aspect ratio' without a corresponding equation or diagram in the main text that defines the exact tokenization or positional-encoding scheme used to achieve this; adding a short methods paragraph or figure would improve clarity.
  2. [Methods] The manuscript would benefit from a brief data-card summary or reference to the training mixture composition, even if high-level, to aid readers in assessing the multilingual and OCR claims.
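
The first minor comment asks what scheme actually carries "1.8M pixel perception at any aspect ratio." The usual recipe is adaptive slicing: partition the image into encoder-sized tiles whose grid best matches the original aspect ratio. A minimal sketch of one plausible grid-selection rule; `max_slices`, `patch_size`, and the log-ratio score are illustrative assumptions, not the paper's exact algorithm:

```python
import math

def choose_grid(width, height, max_slices=9, patch_size=448):
    """Pick a (rows, cols) slice grid for an arbitrary-aspect-ratio image.

    Illustrative constants: max_slices and patch_size are assumed here,
    not taken from the paper.
    """
    # How many encoder-sized tiles the raw pixel count suggests.
    ideal = max(1, math.ceil(width * height / patch_size**2))
    n = min(ideal, max_slices)
    # Enumerate factorizations of n and its neighbors as candidate grids.
    candidates = []
    for m in {max(1, n - 1), n, n + 1}:
        for rows in range(1, m + 1):
            if m % rows == 0:
                candidates.append((rows, m // rows))
    # Pick the grid whose aspect ratio best matches the image's.
    log_ar = math.log(width / height)
    return min(candidates, key=lambda rc: abs(math.log(rc[1] / rc[0]) - log_ar))
```

Each slice is then resized to the encoder's native input and encoded independently, which is what keeps arbitrary aspect ratios from being squashed into a square.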

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each major point below with honest responses and indicate planned changes to improve transparency without overstating our results.

Point-by-point responses
  1. Referee: The central claim that MiniCPM-Llama3-V 2.5 outperforms GPT-4V-1106, Gemini Pro, and Claude 3 on OpenCompass rests on aggregate benchmark wins, but the manuscript does not confirm that the closed models were re-evaluated under the authors' exact protocol (identical prompt templates, image preprocessing, resolution handling, decoding parameters, and few-shot settings). Public leaderboard numbers are likely used instead; any mismatch in evaluation conditions would undermine the ranking and is therefore load-bearing for the outperformance assertion.

    Authors: We thank the referee for this important clarification. The scores for GPT-4V-1106, Gemini Pro, and Claude 3 are taken directly from the public OpenCompass leaderboard, as re-evaluating proprietary models under our precise protocol is not feasible. Our MiniCPM-Llama3-V 2.5 was evaluated following the standard OpenCompass protocols for prompts, preprocessing, and decoding. We will revise the abstract and evaluation section to explicitly state the source of all scores and provide additional details on our model's evaluation settings. revision: partial

  2. Referee: No contamination audits, training-data overlap analysis, or explicit data-split details are provided for the 11 OpenCompass benchmarks, nor are error bars or multiple-run statistics reported on the scores. Since model scale, architecture hyperparameters, and training data mixture weights are free parameters, the absence of these checks leaves open the possibility that reported gains partly reflect data leakage rather than genuine capability gains.

    Authors: We agree that these elements would strengthen the evaluation. The manuscript does not include contamination audits, data overlap analysis, or multi-run statistics. We will add a dedicated paragraph in the evaluation section acknowledging these absences, noting that results are from single runs, and discussing the general risk of data leakage in large-scale training. This will be presented as a limitation while retaining the reported benchmark numbers. revision: yes
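
The audit the authors concede is missing is mechanically cheap to approximate. A hedged sketch of a verbatim n-gram overlap probe between training corpora and benchmark items; whitespace tokenization and `n=8` are illustrative choices, not a standard the paper commits to:

```python
def ngram_overlap(train_texts, eval_text, n=8):
    """Fraction of the eval text's n-grams that appear verbatim in the
    training texts. A high fraction flags likely contamination."""
    def ngrams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    train_grams = set()
    for t in train_texts:
        train_grams |= ngrams(t.split())
    eval_grams = ngrams(eval_text.split())
    if not eval_grams:
        return 0.0
    return len(eval_grams & train_grams) / len(eval_grams)
```

Reporting this statistic per benchmark, even over a sampled subset of the training mixture, would substantiate the planned limitations paragraph with a number rather than a caveat.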

Circularity Check

0 steps flagged

No circularity: empirical benchmark results are independent measurements

Full rationale

The paper presents an empirical MLLM with performance claims based on direct evaluation against external public benchmarks in the OpenCompass suite. No derivation chain, equations, or fitted parameters are used to generate the target metrics; the reported outperformance is a measurement outcome rather than a quantity constructed from the paper's own inputs or prior self-citations. Self-citations to earlier MiniCPM work exist but are not load-bearing for the benchmark scores themselves, which rely on independent external evaluation protocols and data.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard supervised fine-tuning and alignment assumptions plus the unstated premise that benchmark scores generalize. No new physical or mathematical axioms are introduced; the work is an empirical systems paper.

free parameters (2)
  • model scale and architecture hyperparameters
    Number of parameters, vision encoder size, and projection layers are chosen to fit mobile constraints and are not derived from first principles.
  • training data mixture weights
    Relative proportions of image-text, OCR, and multilingual data are selected to achieve the reported capabilities.
axioms (1)
  • domain assumption: Standard next-token prediction plus instruction tuning produces aligned multimodal behavior
    Invoked implicitly when claiming trustworthy low-hallucination outputs after alignment.
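
The second free parameter can be made concrete: once mixture weights are tuned, they determine how a fixed training budget is split across sources. A minimal sketch using largest-remainder rounding; the source names and weights below are invented for illustration, not the paper's mixture:

```python
def mixture_allocation(weights, total_examples):
    """Deterministically split a training budget across data sources in
    proportion to mixture weights (largest-remainder rounding)."""
    total_w = sum(weights.values())
    raw = {k: total_examples * w / total_w for k, w in weights.items()}
    alloc = {k: int(v) for k, v in raw.items()}
    leftover = total_examples - sum(alloc.values())
    # Hand remaining units to the sources with the largest fractional parts.
    for k in sorted(raw, key=lambda k: raw[k] - alloc[k], reverse=True)[:leftover]:
        alloc[k] += 1
    return alloc
```

For example, hypothetical weights of 6:3:1 over image-text, OCR, and multilingual data split a 1000-example budget into 600/300/100; since the weights are tuned rather than derived, the ledger is right to flag them.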

pith-pipeline@v0.9.0 · 5709 in / 1384 out tokens · 51464 ms · 2026-05-10T21:00:59.104631+00:00 · methodology

Forward citations

Cited by 57 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

    cs.CV 2026-05 unverdicted novelty 8.0

    TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

  2. MedHorizon: Towards Long-context Medical Video Understanding in the Wild

    cs.CV 2026-05 unverdicted novelty 8.0

    MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

  3. SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression

    cs.NE 2026-04 unverdicted novelty 8.0

    SpikeMLLM is the first spike-based MLLM framework that maintains near-lossless performance under aggressive timestep compression and delivers 9x throughput and 25x power efficiency gains via a custom RTL accelerator.

  4. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  5. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

  6. SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation

    cs.CV 2026-05 unverdicted novelty 7.0

    SciVQR is a new benchmark dataset for evaluating multimodal AI models on complex scientific reasoning tasks across six disciplines, including expert solutions for nearly half the items.

  7. TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    TOC-Bench is an object-track-grounded benchmark that filters for temporally dependent questions and shows Video-LLMs have major weaknesses in event counting, ordering, identity reasoning, and hallucination detection.

  8. TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

    cs.CV 2026-05 conditional novelty 7.0

    TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.

  9. VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.

  10. QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

    quant-ph 2026-04 unverdicted novelty 7.0

    Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.

  11. CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...

  12. Towards Temporal Compositional Reasoning in Long-Form Sports Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    SportsTime benchmark and CoTR method improve multimodal AI's temporal compositional reasoning and evidence grounding in long-form sports videos.

  13. Grounding Video Reasoning in Physical Signals

    cs.CV 2026-04 unverdicted novelty 7.0

    A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robust...

  14. WildFireVQA: A Large-Scale Radiometric Thermal VQA Benchmark for Aerial Wildfire Monitoring

    cs.CV 2026-04 unverdicted novelty 7.0

    WildFireVQA is a new large-scale visual question answering benchmark that pairs RGB imagery with radiometric thermal measurements for aerial wildfire monitoring across six task categories.

  15. LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Ghost-100 benchmark shows prompt tone drives hallucination rates and intensities in VLMs, with non-monotonic peaks at intermediate pressure and task-specific differences that aggregate metrics hide.

  16. Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts

    cs.CL 2026-04 unverdicted novelty 7.0

    Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.

  17. OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.

  18. MNAFT: modality neuron-aware fine-tuning of multimodal large language models for image translation

    cs.CL 2026-04 unverdicted novelty 7.0

    MNAFT identifies language-agnostic and language-specific neurons via activation analysis and selectively fine-tunes only relevant ones in MLLMs to close the modality gap and outperform full fine-tuning and other metho...

  19. Zero-Shot Retail Theft Detection via Orchestrated Vision Models: A Model-Agnostic, Cost-Effective Alternative to Trained Single-Model Systems

    cs.CV 2026-04 conditional novelty 7.0

    Paza is a zero-shot, model-agnostic pipeline that uses behavioral pre-filters on cheap object and pose models to trigger expensive VLMs only when needed, delivering 89.5% precision and 92.8% specificity on a synthesiz...

  20. Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

    cs.CV 2026-04 unverdicted novelty 7.0

    MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.

  21. UIPress: Bringing Optical Token Compression to UI-to-Code Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...

  22. MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments

    cs.CV 2026-04 unverdicted novelty 7.0

    MARINER is a new benchmark dataset and evaluation framework for fine-grained perception and causal reasoning in open-water scenes using 16,629 images across 63 vessel categories, diverse environments, and maritime incidents.

  23. VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models

    cs.CV 2026-04 unverdicted novelty 7.0

    VSAS-Bench offers temporally dense annotations and synchronous/asynchronous protocols to evaluate streaming VLMs on timeliness, consistency, accuracy, and latency trade-offs, showing that adapted conventional VLMs can...

  24. TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables

    cs.AI 2026-04 conditional novelty 7.0

    TableVision benchmark shows explicit spatial grounding recovers MLLM reasoning on hierarchical tables, delivering 12.3% accuracy improvement through a decoupled perception-reasoning framework.

  25. LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

    cs.CV 2026-05 unverdicted novelty 6.0

    LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...

  26. VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding

    cs.CV 2026-05 unverdicted novelty 6.0

    VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.

  27. From Priors to Perception: Grounding Video-LLMs in Physical Reality

    cs.CV 2026-05 unverdicted novelty 6.0

    Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard L...

  28. KARMA-MV: A Benchmark for Causal Question Answering on Music Videos

    cs.CV 2026-05 unverdicted novelty 6.0

    KARMA-MV is a new benchmark showing that causal knowledge graphs improve VLMs on causal audio-visual reasoning in music videos.

  29. MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

    cs.CL 2026-04 unverdicted novelty 6.0

    MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.

  30. See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection

    cs.CV 2026-04 unverdicted novelty 6.0

    ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.

  31. One Identity, Many Roles: Multimodal Entity Coreference for Enhanced Video Situation Recognition

    cs.CV 2026-04 unverdicted novelty 6.0

    CineMEC performs multimodal entity coreference by clustering visual entities and aligning them with text role mentions to boost captioning and grounding performance on an extended VidSitu dataset.

  32. Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy a...

  33. Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.

  34. Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.

  35. MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging

    cs.CL 2026-04 unverdicted novelty 6.0

    MedRCube is a new fine-grained evaluation framework that benchmarks 33 MLLMs on medical imaging, ranks Lingshu-32B highest, and finds a significant positive link between shortcut behaviors and diagnostic performance.

  36. UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing

    cs.CV 2026-04 unverdicted novelty 6.0

    UHR-BAT is a budget-aware framework that uses text-guided multi-scale importance estimation plus region-wise preserve and merge strategies to compress visual tokens in ultra-high-resolution remote sensing vision-langu...

  37. POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.

  38. HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    HAWK is a training-free method that prunes over 80% of visual tokens in MLLMs while retaining 96% accuracy by using head importance weights and text-guided attention to select task-relevant tokens.

  39. AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

    cs.CV 2026-04 unverdicted novelty 6.0

    AICA-Bench evaluates 23 VLMs on affective image analysis, identifies weak intensity calibration and shallow descriptions as limitations, and proposes training-free Grounded Affective Tree Prompting to improve performance.

  40. Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

    cs.CV 2026-04 unverdicted novelty 6.0

    Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.

  41. Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.

  42. Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.

  43. ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    ITIScore evaluates MLLM image captions via image-to-text-to-image reconstruction consistency and aligns with human judgments on a new 40K-caption benchmark.

  44. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  45. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    cs.LG 2025-06 unverdicted novelty 6.0

    SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

  46. SkyReels-V2: Infinite-length Film Generative Model

    cs.CV 2025-04 unverdicted novelty 6.0

    SkyReels-V2 produces infinite-length film videos via MLLM-based captioning, progressive pretraining, motion RL, and diffusion forcing with non-decreasing noise schedules.

  47. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  48. SmolVLM: Redefining small and efficient multimodal models

    cs.AI 2025-04 unverdicted novelty 6.0

    SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.

  49. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  50. Cross-Modal Navigation with Multi-Agent Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    CRONA is a MARL framework that uses modality-specialized agents with auxiliary beliefs and a centralized multi-modal critic to achieve better performance and efficiency than single-agent baselines on visual-acoustic n...

  51. Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective

    cs.LG 2026-04 unverdicted novelty 5.0

    CmIR uses causal inference to separate invariant causal representations from spurious ones in multimodal data, improving generalization under distribution shifts and noise via invariance, mutual information, and recon...

  52. EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling

    cs.CV 2026-04 unverdicted novelty 5.0

    EvoComp compresses visual tokens in MLLMs by 3x while retaining 99.3% accuracy via an evolutionary labeling strategy that searches for low-loss, semantically diverse token subsets.

  53. DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

    cs.AI 2026-04 unverdicted novelty 5.0

    DocSeeker improves long-document understanding in MLLMs via a two-stage training process that combines supervised fine-tuning from distilled data with evidence-aware group relative policy optimization and memory-effic...

  54. DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

    cs.AI 2026-04 unverdicted novelty 5.0

    DocSeeker uses supervised fine-tuning on distilled data followed by evidence-aware group relative policy optimization to improve long-document understanding and evidence grounding in MLLMs.

  55. Visual Prompt Based Reasoning for Offroad Mapping using Multimodal LLMs

    cs.RO 2026-04 unverdicted novelty 5.0

    A zero-shot pipeline uses SAM2 segmentation plus numeric-label prompting of a VLM to identify drivable off-road areas and enable navigation without task-specific training or datasets.

  56. DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    cs.CV 2024-12 accept novelty 5.0

    DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...

  57. A Multistage Extraction Pipeline for Long Scanned Financial Documents: An Empirical Study in Industrial KYC Workflows

    cs.CV 2026-04 unverdicted novelty 4.0

    A multistage extraction pipeline with page-level retrieval improves field-level accuracy by up to 31.9 percentage points over direct VLM application on 3000 pages of real multilingual KYC documents, reaching 87.27% wi...

Reference graph

Works this paper leans on

121 extracted references · 121 canonical work pages · cited by 54 Pith papers · 21 internal anchors

  1. [1]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    RealCQA: Scientific chart question answering as a test-bed for first-order logic

    Saleem Ahmed, Bhavin Jawade, Shubham Pandey, Srirangaraj Setlur, and Venu Govindaraju. RealCQA: Scientific chart question answering as a test-bed for first-order logic. In ICDAR, pages 66–83. Springer, 2023

  4. [4]

    Flamingo: A visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: A visual language model for few-shot learning. NeurIPS, 35:23716–23736, 2022

  5. [5]

    Introducing the next generation of Claude, 2024

    Anthropic. Introducing the next generation of Claude, 2024. URL https://www.anthropic.com/news/claude-3-family

  6. [6]

    VQA: Visual question answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, pages 2425–2433, 2015

  7. [7]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023

  8. [8]

    Gemma: Introducing new state-of-the-art open models

    Jeanine Banks and Tris Warkentin. Gemma: Introducing new state-of-the-art open models. https://blog.google/technology/developers/gemma-open-models/, 2024

  9. [9]

    Introducing our multimodal models

    Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. Introducing our multimodal models. adept.ai/blog/fuyu-8b, 2023

  10. [10]

    BELLE: Be everyone’s large language model engine

    BELLEGroup. BELLE: Be everyone’s large language model engine. https://github.com/LianjiaTech/BELLE, 2023

  11. [11]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. PaliGemma: A versatile 3B VLM for transfer. arXiv preprint arXiv:2407.07726, 2024

  12. [12]

    Scene text visual question answering

    Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In CVPR, pages 4291–4301, 2019

  13. [13]

    OCR-IDL: OCR annotations for industry document library dataset

    Ali Furkan Biten, Rubèn Tito, Lluis Gomez, Ernest Valveny, and Dimosthenis Karatzas. OCR-IDL: OCR annotations for industry document library dataset. In ECCV, pages 241–252. Springer, 2022

  14. [14]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023

  15. [15]

    COYO-700M: Image-text pair dataset

    Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. COYO-700M: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022

  16. [16]

    TextOCR-GPT4V

    Jimmy Carter. TextOCR-GPT4V. https://huggingface.co/datasets/jimmycarter/textocr-gpt4v, 2024

  17. [17]

    Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, pages 3558–3568, 2021

  18. [18]

    ALLaVA: Harnessing GPT4V-synthesized data for a lite vision-language model

    Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. ALLaVA: Harnessing GPT4V-synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684, 2024

  19. [19]

    GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning

    Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P Xing, and Liang Lin. GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning. arXiv preprint arXiv:2105.14517, 2021

  20. [20]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal LLM’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023

  21. [21]

    ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. ShareGPT4V: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023

  22. [22]

    TabFact: A large-scale dataset for table-based fact verification

    Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. TabFact: A large-scale dataset for table-based fact verification. arXiv preprint arXiv:1909.02164, 2019

  23. [23]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to GPT-4V? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024

  24. [24]

    Are deep neural networks smarter than second graders?

    Anoop Cherian, Kuan-Chuan Peng, Suhas Lohit, Kevin A Smith, and Joshua B Tenenbaum. Are deep neural networks smarter than second graders? In CVPR, pages 10834–10844, 2023

  25. [25]

    MobileVLM: A fast, strong and open vision language assistant for mobile devices

    Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, et al. MobileVLM: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023

  26. [26]

    OpenCompass: A universal evaluation platform for foundation models

    OpenCompass Contributors. OpenCompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023

  27. [27]

    XTuner: A toolkit for efficiently fine-tuning LLM

    XTuner Contributors. XTuner: A toolkit for efficiently fine-tuning LLM. https://github.com/InternLM/xtuner, 2023

  28. [28]

    Visual Dialog

    Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual Dialog. In CVPR, pages 326–335, 2017

  29. [29]

    Project Astra, 2024

    Google Deepmind. Project Astra, 2024. URL https://deepmind.google/technologies/gemini/ project-astra/

  30. [30]

    Enhancing chat language models by scaling high-quality instructional conversations

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023

  31. [31]

    InternLM-XComposer2-4KHD: A pioneering large vision-language model handling resolutions from 336 pixels to 4K HD

    Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, et al. InternLM-XComposer2-4KHD: A pioneering large vision-language model handling resolutions from 336 pixels to 4K HD. arXiv preprint arXiv:2404.06512, 2024

  32. [32]

    What makes for good visual instructions? Synthesizing complex visual reasoning instructions for visual instruction tuning

    Yifan Du, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Jinpeng Wang, Chuyuan Wang, Mingchen Cai, Ruihua Song, and Ji-Rong Wen. What makes for good visual instructions? Synthesizing complex visual reasoning instructions for visual instruction tuning. arXiv preprint arXiv:2311.01487, 2023

  33. [33]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

  34. [34]

    Are you talking to a machine? Dataset and methods for multilingual image question answering

    Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. Are you talking to a machine? Dataset and methods for multilingual image question answering. NeurIPS, 28, 2015

  35. [35]

    Wukong: A 100 million large-scale Chinese cross-modal pre-training benchmark

    Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, et al. Wukong: A 100 million large-scale Chinese cross-modal pre-training benchmark. NeurIPS, 35:26418–26431, 2022

  36. [36]

    LVIS: A dataset for large vocabulary instance segmentation

    Agrim Gupta, Piotr Dollar, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In CVPR, pages 5356–5364, 2019

  37. [37]

    Synthetic data for text localisation in natural images

    Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for text localisation in natural images. In CVPR, pages 2315–2324, 2016

  38. [38]

    VizWiz Grand Challenge: Answering visual questions from blind people

    Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. VizWiz Grand Challenge: Answering visual questions from blind people. In CVPR, pages 3608–3617, 2018

  39. [39]

    Efficient multimodal learning from data-centric perspective

    Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, and Bo Zhao. Efficient multimodal learning from data-centric perspective. arXiv preprint arXiv:2402.11530, 2024

  40. [40]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  41. [41]

    Large multilingual models pivot zero-shot multimodal learning across languages

    Jinyi Hu, Yuan Yao, Chongyi Wang, Shan Wang, Yinxu Pan, Qianyu Chen, Tianyu Yu, Hanghao Wu, Yue Zhao, Haoye Zhang, et al. Large multilingual models pivot zero-shot multimodal learning across languages. arXiv preprint arXiv:2308.12038, 2023

  42. [42]

    MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024

  43. [43]

    Language is not all you need: Aligning perception with language models

    Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, et al. Language is not all you need: Aligning perception with language models. NeurIPS, 36, 2024

  44. [44]

    GQA: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pages 6700–6709, 2019

  45. [45]

    Phi-2: The surprising power of small language models

    Mojan Javaheripi and Sébastien Bubeck. Phi-2: The surprising power of small language models. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/, 2023

  46. [46]

    CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning

    Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, pages 2901–2910, 2017

  47. [47]

    DVQA: Understanding data visualizations via question answering

    Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. DVQA: Understanding data visualizations via question answering. In CVPR, pages 5648–5656, 2018

  48. [48]

    FigureQA: An annotated figure dataset for visual reasoning

    Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and Yoshua Bengio. FigureQA: An annotated figure dataset for visual reasoning. arXiv preprint arXiv:1710.07300, 2017

  49. [49]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, pages 235–251. Springer, 2016

  50. [50]

    OCR-free document understanding transformer

    Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. OCR-free document understanding transformer. In ECCV, 2022

  51. [51]

    Visual Genome: Connecting language and vision using crowdsourced dense image annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123:32–73, 2017

  52. [52]

    What matters when building vision-language models?, 2024

    Hugo Laurencon, Leo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? arXiv preprint arXiv:2405.02246, 2024

  53. [53]

    LLaVA-NeXT: Stronger LLMs supercharge multimodal capabilities in the wild, 2024

    Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. LLaVA-NeXT: Stronger LLMs supercharge multimodal capabilities in the wild, 2024. URL https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/

  54. [54]

    BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ICML, pages 19730–19742, 2023

  55. [56]

    Multimodal ArXiv: A dataset for improving scientific comprehension of large vision-language models

    Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. Multimodal ArXiv: A dataset for improving scientific comprehension of large vision-language models. arXiv preprint arXiv:2403.00231, 2024

  56. [57]

    Mini-Gemini: Mining the potential of multi-modality vision language models

    Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-Gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814, 2024

  57. [58]

    OpenOrca: An open dataset of GPT augmented FLAN reasoning traces

    Wing Lian, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, and "Teknium". OpenOrca: An open dataset of GPT augmented FLAN reasoning traces. https://huggingface.co/Open-Orca/OpenOrca, 2023

  58. [59]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755. Springer, 2014

  59. [60]

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023

  60. [61]

    LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/

  61. [62]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36, 2024

  62. [63]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023

  63. [64]

    On the hidden mystery of OCR in large multimodal models

    Yuliang Liu, Zhang Li, Hongliang Li, Wenwen Yu, Mingxin Huang, Dezhi Peng, Mingyu Liu, Mingrui Chen, Chunyuan Li, Lianwen Jin, et al. On the hidden mystery of OCR in large multimodal models. arXiv preprint arXiv:2305.07895, 2023

  64. [65]

    TextMonkey: An OCR-free large multimodal model for understanding document

    Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai. TextMonkey: An OCR-free large multimodal model for understanding document. arXiv preprint arXiv:2403.04473, 2024

  65. [66]

    MobileLLM: Optimizing sub-billion parameter language models for on-device use cases

    Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, et al. MobileLLM: Optimizing sub-billion parameter language models for on-device use cases. arXiv preprint arXiv:2402.14905, 2024

  66. [67]

    llama.cpp

    llama.cpp Group. llama.cpp. https://github.com/ggerganov/llama.cpp, 2023

  67. [68]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, et al. DeepSeek-VL: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024

  68. [69]

    IconQA: A new benchmark for abstract diagram understanding and visual language reasoning

    Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. IconQA: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021

  69. [70]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. NeurIPS, 35:2507–2521, 2022

  70. [71]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023

  71. [72]

    OK-VQA: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR, pages 3195–3204, 2019

  72. [73]

    ChartQA: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022

  73. [74]

    DocVQA: A dataset for VQA on document images

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. DocVQA: A dataset for VQA on document images. In WACV, pages 2200–2209, 2021

  74. [75]

    InfographicVQA

    Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. InfographicVQA. In WACV, pages 1697–1706, 2022

  75. [76]

    MM1: Methods, analysis & insights from multimodal LLM pre-training

    Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, et al. MM1: Methods, analysis & insights from multimodal LLM pre-training. arXiv preprint arXiv:2403.09611, 2024

  76. [77]

    OCR-VQA: Visual question answering by reading text in images

    Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. OCR-VQA: Visual question answering by reading text in images. In ICDAR, pages 947–952, 2019

  77. [78]

    Hello GPT-4o, 2024

    OpenAI. Hello GPT-4o, 2024. URL https://openai.com/index/hello-gpt-4o/

  78. [79]

    Compositional semantic parsing on semi-structured tables

    Panupong Pasupat and Percy Liang. Compositional semantic parsing on semi-structured tables. arXiv preprint arXiv:1508.00305, 2015

  79. [80]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos- 2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023

  80. [81]

    Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models

    Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, pages 2641–2649, 2015

Showing first 80 references.