Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
Pith reviewed 2026-05-10 13:18 UTC · model grok-4.3
The pith
Scaling model size, improving data quality and quantity, and adding test-time reasoning allow open-source multimodal models to rival commercial systems and exceed 70% on the MMMU benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that scaling up vision encoders and language models, improving dataset quality and size, and applying test-time strategies such as chain-of-thought (CoT) reasoning together let the InternVL 2.5 series rival top commercial multimodal large language models across a broad set of benchmarks. Most notably, it is the first open-source MLLM to surpass 70% accuracy on the MMMU benchmark, aided by a 3.7-point boost from CoT reasoning.
What carries the argument
The scaling of model components, including vision encoders and language models, combined with data-quality enhancements and test-time configurations such as chain-of-thought reasoning. Together these levers drive the performance trends and improvements observed on multimodal benchmarks.
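To make the test-time lever concrete, here is a minimal sketch of how a CoT pass and a direct-answer pass over an MMMU-style multiple-choice item might be scored. The templates, the ask_model callable, and the answer-extraction regex are illustrative assumptions, not the paper's released protocol.

    import re

    # Illustrative templates; the paper's actual prompts are not reproduced here.
    DIRECT_TEMPLATE = "{question}\n{options}\nAnswer with the option letter only."
    COT_TEMPLATE = ("{question}\n{options}\n"
                    "Think step by step, then finish with 'Answer: <letter>'.")

    def extract_letter(text):
        # Accept either an 'Answer: X' marker or a bare leading letter.
        match = re.search(r"Answer:\s*([A-D])", text) or re.match(r"\s*([A-D])\b", text)
        return match.group(1) if match else None

    def accuracy(ask_model, items, use_cot):
        # ask_model(prompt) -> str is any inference backend; each item carries
        # a question, pre-formatted options, and a gold option letter.
        template = COT_TEMPLATE if use_cot else DIRECT_TEMPLATE
        hits = sum(
            extract_letter(ask_model(template.format(**item))) == item["gold"]
            for item in items
        )
        return hits / len(items)

    # The reported test-time gain is then the simple difference:
    # accuracy(ask_model, items, use_cot=True) - accuracy(ask_model, items, use_cot=False)

Everything downstream of the 3.7-point claim reduces to this delta, which is why the referee report below presses on template and extraction parity.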
If this is right
- Performance scales positively with larger vision encoders and language models (see the trend-check sketch after this list).
- Increasing dataset size and quality leads to better multimodal understanding and reasoning.
- Test-time scaling via chain-of-thought provides measurable gains, particularly on complex tasks like MMMU.
- The model matches or approaches commercial performance on document understanding, video, hallucination detection, and multilingual tasks.
- These methods establish new benchmarks for what open-source multimodal systems can achieve.
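A rough way to check the first bullet is to fit benchmark score against log parameter count across the model series, as sketched below; the arrays hold placeholder values, not the paper's reported scores.

    import numpy as np

    # Hypothetical (placeholder) model sizes in billions of parameters and
    # matching benchmark accuracies; substitute the paper's per-size results.
    params_b = np.array([1.0, 2.0, 8.0, 26.0, 38.0, 78.0])
    scores = np.array([41.0, 44.5, 51.0, 58.5, 61.0, 65.0])

    # Least-squares fit of score = slope * log10(params) + intercept.
    slope, intercept = np.polyfit(np.log10(params_b), scores, deg=1)
    print(f"score ~= {slope:.1f} * log10(params_B) + {intercept:.1f}")
    # A clearly positive slope is what the scaling claim predicts; a slope
    # that flattens at the largest sizes would instead suggest saturation.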
Where Pith is reading between the lines
- Applying similar multi-dimensional scaling could help other open-source models close performance gaps with proprietary ones on reasoning benchmarks.
- Test-time chain-of-thought reasoning might generalize as a low-cost way to boost accuracy in multimodal tasks without additional training.
- Further scaling in these areas could unlock capabilities in real-world applications like visual grounding and multi-image analysis.
Load-bearing premise
The reported gains result primarily from the scaling of models, data improvements, and test-time strategies rather than from undisclosed training details, specific benchmark tuning, or evaluation choices.
What would settle it
An independent evaluation that implements the described scaling approaches but achieves significantly lower scores than reported on MMMU or other key benchmarks would falsify the central claim.
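One way to operationalize "significantly lower" is a binomial standard error over the evaluation set. A minimal sketch, assuming an MMMU-style set of roughly 900 questions (the exact count should come from the benchmark release):

    from math import sqrt

    def differs_significantly(reported, reproduced, n_questions, z=1.96):
        # Is the reproduced accuracy outside an approximate 95% interval
        # around the reported accuracy, given n independent questions?
        se = sqrt(reported * (1.0 - reported) / n_questions)
        return abs(reported - reproduced) > z * se

    # Around 70% accuracy with ~900 questions the standard error is ~1.5
    # points, so a faithful reproduction landing several points lower would
    # falsify the claim rather than reflect sampling noise.
    print(differs_significantly(0.70, 0.66, n_questions=900))  # -> True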
Original abstract
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. A HuggingFace demo is available at https://huggingface.co/spaces/OpenGVLab/InternVL
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces InternVL 2.5, an open-source MLLM series extending InternVL 2.0 with improvements to training strategies, data quality, model scaling (vision encoder and language model sizes), dataset composition, and test-time configurations including Chain-of-Thought reasoning. It reports extensive benchmark results across multi-discipline reasoning (e.g., MMMU), document understanding, video/multi-image tasks, hallucination detection, grounding, multilingual, and language-only capabilities, claiming competitive performance with closed models such as GPT-4o and Claude-3.5-Sonnet. The central empirical claims are that InternVL 2.5 is the first open-source MLLM to exceed 70% on MMMU and that CoT yields a 3.7-point gain demonstrating test-time scaling potential.
Significance. If the reported MMMU threshold and attribution of gains hold under controlled comparisons, the work provides concrete evidence that combined model/data/test-time scaling can push open-source MLLMs toward parity with leading closed models on challenging multimodal reasoning benchmarks. The broad evaluation suite and explicit scaling ablations offer reusable insights for the community on where performance gains accrue.
major comments (2)
- §4 (Results on MMMU and related tables): The claim that InternVL 2.5 is the first open-source MLLM to surpass 70% on MMMU, together with the 3.7-point CoT gain, requires an exhaustive side-by-side table of recent open-source models (Qwen2-VL, LLaVA-Next, etc.) reporting exact scores both with and without CoT under identical prompt templates, sampling temperature, and inference settings. Without this, the novelty and causal attribution to the described test-time scaling cannot be verified.
- §3.3 (Test-time scaling and CoT description): The manuscript must specify the precise CoT template, number of reasoning steps, stop conditions, and whether the non-CoT baseline uses the identical prompt prefix and output format. Any deviation in formatting or post-processing would undermine the reported 3.7-point delta as evidence of test-time scaling.
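In sketch form, the pinned-settings harness the first major comment calls for: every model is scored under identical templates and decoding settings, with and without CoT. The config fields and the evaluate interface are assumptions for illustration, not an existing tool's API.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class EvalConfig:
        # Everything here must be held fixed across all compared models.
        prompt_template: str
        temperature: float
        max_new_tokens: int
        seed: int
        use_cot: bool

    def run_matrix(model_names, items, evaluate):
        # evaluate(model_name, item, config) -> bool is any scoring backend.
        # Returns the with/without-CoT grid needed for a side-by-side table.
        results = {}
        for use_cot in (False, True):
            config = EvalConfig(
                prompt_template="cot" if use_cot else "direct",
                temperature=0.0,  # greedy decoding for comparability
                max_new_tokens=2048 if use_cot else 32,
                seed=0,
                use_cot=use_cot,
            )
            for name in model_names:
                hits = sum(evaluate(name, item, config) for item in items)
                results[(name, use_cot)] = hits / len(items)
        return results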
minor comments (2)
- Figure 3 and associated text: clarify whether the reported MMMU scores use the standard zero-shot or few-shot protocol and whether any benchmark-specific filtering or answer-extraction rules differ from prior open-source evaluations.
- Appendix or §4: add an explicit statement of total training compute (FLOPs or GPU-hours) and whether any post-training data filtering was applied after the main scaling experiments.
Simulated Author's Rebuttal
We thank the referee for the constructive comments that help strengthen the rigor of our claims. We address each major point below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: §4 (Results on MMMU and related tables): The claim that InternVL 2.5 is the first open-source MLLM to surpass 70% on MMMU, together with the 3.7-point CoT gain, requires an exhaustive side-by-side table of recent open-source models (Qwen2-VL, LLaVA-Next, etc.) reporting exact scores both with and without CoT under identical prompt templates, sampling temperature, and inference settings. Without this, the novelty and causal attribution to the described test-time scaling cannot be verified.
Authors: We agree that a controlled side-by-side comparison is essential to substantiate the novelty claim and the causal role of test-time scaling. In the revised manuscript we will add an expanded table in §4 that lists MMMU scores for recent open-source MLLMs (including Qwen2-VL, LLaVA-Next and others) together with the prompt templates, sampling temperature, and inference settings used for each entry. Where CoT results are already reported in the literature we will include them; where they are not, we will note the standard (non-CoT) scores and indicate that identical-prompt re-evaluations were performed for the models we could run under our evaluation harness. This provides the strongest feasible verification of the 70% threshold and the 3.7-point CoT delta while remaining transparent about practical constraints on exhaustive re-implementation. revision: partial
-
Referee: §3.3 (Test-time scaling and CoT description): The manuscript must specify the precise CoT template, number of reasoning steps, stop conditions, and whether the non-CoT baseline uses the identical prompt prefix and output format. Any deviation in formatting or post-processing would undermine the reported 3.7-point delta as evidence of test-time scaling.
Authors: We concur that full reproducibility details are required to support the test-time scaling claim. In the revised §3.3 we will insert the exact CoT prompt template, the number of reasoning steps, the stop conditions, and an explicit confirmation that the non-CoT baseline employs the identical prompt prefix and output format with no additional post-processing. These additions will make the 3.7-point gain directly attributable to the described test-time procedure. revision: yes
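What the promised revision amounts to, in sketch form: a CoT and a non-CoT prompt sharing an identical prefix and differing only in the appended instruction, plus an explicit stop condition. The wording below is hypothetical, not the manuscript's actual template.

    SHARED_PREFIX = "Question: {question}\nOptions:\n{options}\n"
    DIRECT_SUFFIX = "Respond with only the option letter."
    COT_SUFFIX = ("Reason step by step, then end with a line of the form "
                  "'Final answer: <letter>'.")

    def build_prompt(question, options, use_cot):
        # The identical prefix ensures any score delta is attributable to the
        # test-time reasoning instruction rather than to formatting drift.
        prefix = SHARED_PREFIX.format(question=question, options=options)
        return prefix + (COT_SUFFIX if use_cot else DIRECT_SUFFIX)

    def truncate_at_final_answer(generation):
        # Stop condition: keep reasoning up to and including the final-answer
        # line; anything after it is discarded before answer extraction.
        lines = generation.splitlines()
        for i, line in enumerate(lines):
            if line.startswith("Final answer:"):
                return "\n".join(lines[: i + 1])
        return generation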
Circularity Check
No circularity: results are direct empirical benchmark measurements
Full rationale
The paper reports empirical performance of InternVL 2.5 on public benchmarks (MMMU, etc.) after training with scaled models and data, plus test-time CoT. No equations, derivations, or fitted parameters are presented as predictions that reduce to the inputs by construction. The "first open-source >70%" and "3.7-point CoT gain" claims are comparisons against external prior results rather than self-definitional or self-citation load-bearing steps internal to a derivation chain. The work is validated against external benchmarks rather than against its own constructions.
Axiom & Free-Parameter Ledger
free parameters (2)
- vision encoder and language model scales
- training dataset composition and size
axioms (1)
- Domain assumption: standard supervised fine-tuning and evaluation protocols for multimodal LLMs produce reliable capability measurements.
Forward citations
Cited by 60 Pith papers
-
VISTA: Video Interaction Spatio-Temporal Analysis Benchmark
VISTA is the first large-scale interaction-aware benchmark that decomposes videos into entities, actions, and relations to diagnose spatio-temporal biases in vision-language models.
-
Towards Unified Surgical Scene Understanding:Bridging Reasoning and Grounding via MLLMs
SurgMLLM unifies high-level reasoning and low-level visual grounding in one MLLM-based model for surgical videos, raising triplet recognition AP from 40.7% to 46.0% on the new CholecT45-Scene dataset with 64,299 annot...
-
Count Anything at Any Granularity
Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for impro...
-
AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation
AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domai...
-
V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning
V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.
-
Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs
ScaleEarth conditions remote sensing VLMs on continuous GSD via CS-HLoRA and a visual GSD predictor, creating a closed training loop with GeoScale-VQA to achieve SOTA on Earth observation benchmarks.
-
VISD: Enhancing Video Reasoning via Structured Self-Distillation
VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...
-
VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA
VTAgent uses a question-guided agent to anchor keyframes for evidence-aware Video TextVQA, delivering up to +12 accuracy and new SOTA results via training-free operation plus SFT and RL.
-
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
A foresight-based local purification method using multi-persona simulations and recursive diagnosis reduces infectious jailbreak spread in multi-agent systems from over 95% to below 5.47% while matching benign perform...
-
GEASS: Training-Free Caption Steering for Hallucination Mitigation in Vision-Language Models
GEASS selectively gates and weights self-generated captions using confidence and entropy to reduce object hallucinations in VLMs, outperforming vanilla inference and contrastive decoding on POPE and HallusionBench.
-
Act2See: Emergent Active Visual Perception for Video Reasoning
Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.
-
Membership Inference Attacks Against Video Large Language Models
A temperature-perturbed black-box attack infers video training membership in VideoLLMs with 0.68 AUC by exploiting sharper generation behavior on member samples.
-
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...
-
SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments
SpaMEM benchmark shows multimodal LLMs succeed at spatial tasks with text histories but sharply fail at long-horizon belief maintenance from raw visual streams alone.
-
OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving
OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve pe...
-
DistortBench: Benchmarking Vision Language Models on Image Distortion Identification
Vision-language models achieve at most 61.9% accuracy on identifying image distortion types and severities, falling short of human majority-vote performance at 65.7%.
-
DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models
DO-Bench is a controlled benchmark that attributes VLM object hallucination errors to textual prior pressure, perceptual limits, or their interaction via two diagnostic dimensions and metrics.
-
S-GRPO: Unified Post-Training for Large Vision-Language Models
S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.
-
VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models
VisPCO uses continuous relaxation, straight-through estimators, and budget-aware Pareto-frontier learning to automatically discover optimal visual token pruning configurations that approximate grid-search results acro...
-
MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models
MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.
-
AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
-
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
-
Open-Ended Video Game Glitch Detection with Agentic Reasoning and Temporal Grounding
Introduces the first benchmark for open-ended video game glitch detection with temporal localization and proposes GliDe, an agentic framework that achieves stronger performance than vanilla multimodal models.
-
ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference
ID-Selection combines importance scoring with iterative diversity suppression to prune 97.2% of visual tokens in LVLMs while retaining 91.8% performance and cutting FLOPs by over 97% without retraining.
-
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
-
Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models
Suppressing low-attention tokens during the focus phase of vision-encoder processing reduces object hallucinations in LVLMs while preserving caption quality and adding negligible inference time.
-
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...
-
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
-
CFMS: Towards Explainable and Fine-Grained Chinese Multimodal Sarcasm Detection Benchmark
CFMS is the first fine-grained Chinese multimodal sarcasm benchmark with detailed annotations, paired with a PGDS reinforcement learning strategy that improves model results on sarcasm tasks.
-
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
-
SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models
SceneGraphVLM generates dynamic scene graphs from video using compact VLMs, TOON serialization, and hallucination-aware RL to improve precision and achieve one-second latency.
-
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
-
LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.
-
SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
-
Towards Annotation-Free Validation of MLLMs: A Vision-Language Logical Consistency Metric
VL-LCM measures vision-language logical consistency without annotations and shows that recent MLLMs have high accuracy but low logical consistency on benchmarks like MMMU and NaturalBench.
-
From Priors to Perception: Grounding Video-LLMs in Physical Reality
Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard L...
-
ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting
ScrapMem introduces optical forgetting to compress multimodal memories for LLM agents on edge devices, cutting storage by up to 93% while reaching 51.0% Joint@10 and 70.3% Recall@10 on ATM-Bench.
-
Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding
CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.
-
CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding
CoVSpec achieves up to 2.21x higher throughput and over 96% lower communication overhead for device-edge VLM inference via training-free visual token reduction, adaptive drafting, and decoupled parallel verification-c...
-
Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts
Chart-FR1 uses Focus-CoT for linking reasoning to visual cues and Focus-GRPO reinforcement learning with efficiency rewards to outperform prior MLLMs on dense chart reasoning tasks.
-
DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation
A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...
-
Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs
IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.
-
Long-Horizon Manipulation via Trace-Conditioned VLA Planning
LoHo-Manip enables robust long-horizon robot manipulation by using a receding-horizon VLM manager to output progress-aware subtask sequences and 2D visual traces that condition a VLA executor for automatic replanning.
-
Latent Denoising Improves Visual Alignment in Large Multimodal Models
A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
-
Exploring High-Order Self-Similarity for Video Understanding
The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.
-
V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization
V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.
-
Object Referring-Guided Scanpath Prediction with Perception-Enhanced Vision-Language Models
ScanVLA uses a vision-language model with a history-enhanced decoder and frozen segmentation LoRA to outperform prior methods on object-referring scanpath prediction.
-
Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving
DCM-Agent improves LLM performance on multi-paradigm optimization problems by 11-21% via dual-cluster memory construction and dynamic inference guidance.
-
Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
-
MetaEarth3D: Unlocking World-scale 3D Generation with Spatially Scalable Generative Modeling
MetaEarth3D is the first generative foundation model for spatially consistent, unbounded 3D scene generation at planetary scale using optical Earth observation data.
-
PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging
PivotMerge merges heterogeneous multimodal pre-trained models via shared-space decomposition to filter conflicts and layer-wise weights based on alignment contributions, outperforming baselines on multimodal benchmarks.
-
Where Do Vision-Language Models Fail? World Scale Analysis for Image Geolocalization
Vision-language models display large performance differences and clear limits in zero-shot country-level geolocalization from ground-view photos, with semantic cues helping coarse guesses but failing on fine details.
-
Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding
Chain-of-Glimpse is a reinforcement learning framework that builds progressive, spatially grounded reasoning traces around task-relevant objects in videos to enable more accurate and interpretable multi-step decisions.
-
The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models
Centroid erasure shows language representations overshadow vision in multimodal models, and text-centroid contrastive decoding recovers substantial accuracy on visual reasoning tasks.
-
Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models
Delta-LLaVA adds Change-Enhanced Attention, Change-SEG with prior embeddings, and Local Causal Attention to MLLMs to overcome temporal blindness, outperforming general models on a new unified benchmark for bi- and tri...
-
UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing
UHR-BAT is a budget-aware framework that uses text-guided multi-scale importance estimation plus region-wise preserve and merge strategies to compress visual tokens in ultra-high-resolution remote sensing vision-langu...
-
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
-
MLLM-as-a-Judge Exhibits Model Preference Bias
MLLMs show self-preference bias and family-level mutual bias when judging captions; Philautia-Eval quantifies it and Pomms ensemble reduces it.