arxiv: 2408.03326 · v3 · submitted 2024-08-06 · 💻 cs.CV · cs.AI· cs.CL

Recognition: 3 theorem links

· Lean Theorem

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Chunyuan Li, Dong Guo, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Renrui Zhang, Yanwei Li, Yuanhan Zhang, Ziwei Liu

Pith reviewed 2026-05-10 14:17 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CL

keywords large multimodal modelsvisual task transfersingle-image understandingmulti-image understandingvideo understandingtransfer learningopen LMMs

0 comments

The pith

LLaVA-OneVision is the first single open model to advance performance in single-image, multi-image, and video understanding at once.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LLaVA-OneVision, a family of open large multimodal models built by combining lessons on data, models, and visual representations from the authors' prior LLaVA-NeXT work. It demonstrates that this single model family reaches new performance levels across three distinct visual scenarios simultaneously. The architecture supports direct transfer of skills learned on images to video tasks, which in turn produces additional emerging abilities. A reader would care because the result points to simpler ways of building versatile visual AI that does not require separate models for each input type.

Core claim

LLaVA-OneVision consolidates insights into data, models, and visual representations to create a single model family that simultaneously pushes performance boundaries of open LMMs in single-image, multi-image, and video scenarios. The design enables strong transfer learning across these modalities and scenarios, yielding new emerging capabilities, with particularly strong video understanding demonstrated through task transfer from images to videos.

What carries the argument

LLaVA-OneVision family of models, which unifies data curation, model design, and visual representation strategies to enable cross-scenario task transfer.

If this is right

One model suffices to reach leading results in single-image understanding.
The same model reaches leading results in multi-image understanding.
Video understanding improves through direct transfer of image-based capabilities.
New abilities emerge that were not present in the source image-only training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar consolidation of data and representation choices could be tested on pairs of other visual tasks to check whether transfer appears consistently.
The single-model approach may lower the engineering effort needed to deploy visual AI across varied input formats in practice.
If the transfer mechanism holds, it raises the question of whether further modalities such as 3D scenes could be added without rebuilding the model from scratch.

Load-bearing premise

That the reported performance gains and transfer abilities stem mainly from consolidating prior insights on data, models, and visual representations rather than from unstated differences in training scale or benchmark selection.

What would settle it

A head-to-head evaluation on standard single-image, multi-image, and video benchmarks where another single open LMM without the described consolidation matches or exceeds LLaVA-OneVision across all three scenarios would falsify the claim of being the first to push boundaries in this unified way.

read the original abstract

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents LLaVA-OneVision, a family of open large multimodal models (LMMs) obtained by consolidating insights from the LLaVA-NeXT blog series on data curation, model architecture, and visual representations. It claims that a single model simultaneously sets new performance records for open LMMs on single-image, multi-image, and video tasks while enabling emergent transfer capabilities, especially image-to-video task transfer.

Significance. If the reported benchmark gains and transfer results are shown to arise specifically from the consolidated recipe rather than scale or data volume, the work would be significant: it would demonstrate a practical route to unified open LMMs that handle multiple visual modalities without task-specific retraining, reducing fragmentation in the open-source multimodal ecosystem.

major comments (2)

[Experimental results] Experimental results section: the central attribution of performance gains and cross-scenario transfer to the consolidation of LLaVA-NeXT insights on data, models, and visual representations is not supported by ablations that hold total training tokens, model size, and optimizer settings fixed while varying only the recipe versus a standard LLaVA-style mixture; without such controls the claim that the design enables 'easy visual task transfer' cannot be isolated from increased scale.
[Abstract and results tables] Abstract and results tables: the assertion that LLaVA-OneVision is 'the first single model' to push boundaries simultaneously across the three scenarios requires explicit side-by-side benchmark tables (with numerical scores on standard single-image, multi-image, and video datasets) against all relevant prior open LMMs; the current presentation leaves the 'first' claim difficult to verify.

minor comments (2)

[Introduction] Notation for the three scenarios (single-image, multi-image, video) is introduced without a compact summary table that lists the exact benchmarks and metrics used for each.
[Qualitative results] Figure captions for qualitative transfer examples should explicitly state the source image task and the target video task to make the transfer claim easier to follow.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing honest responses based on the manuscript's content and indicating where revisions will be made to strengthen the work.

read point-by-point responses

Referee: [Experimental results] Experimental results section: the central attribution of performance gains and cross-scenario transfer to the consolidation of LLaVA-NeXT insights on data, models, and visual representations is not supported by ablations that hold total training tokens, model size, and optimizer settings fixed while varying only the recipe versus a standard LLaVA-style mixture; without such controls the claim that the design enables 'easy visual task transfer' cannot be isolated from increased scale.

Authors: We acknowledge that a fully controlled ablation isolating the consolidated recipe (data curation, architecture, and visual representations) from differences in total training tokens would provide stronger causal evidence. The manuscript fixes model sizes (e.g., 7B and 13B) and uses consistent optimizer settings across our variants, with direct comparisons to prior LLaVA models of similar scale; however, exact token counts are not matched against a baseline LLaVA-style mixture in the reported experiments. We will revise the Experimental Results section to add a detailed breakdown of training data volumes used in LLaVA-OneVision versus prior works, along with a discussion clarifying the differences in the recipe and acknowledging that scale may contribute to some gains. The cross-scenario transfer results (image-to-video) are presented as emergent evidence supporting the unified design, but we agree this does not fully substitute for the requested controls. revision: partial
Referee: [Abstract and results tables] Abstract and results tables: the assertion that LLaVA-OneVision is 'the first single model' to push boundaries simultaneously across the three scenarios requires explicit side-by-side benchmark tables (with numerical scores on standard single-image, multi-image, and video datasets) against all relevant prior open LMMs; the current presentation leaves the 'first' claim difficult to verify.

Authors: We agree that an aggregated side-by-side table would improve verifiability of the 'first single model' claim. The manuscript already reports results on standard benchmarks for each scenario with comparisons to prior open LMMs in dedicated tables. We will add a new summary table in the results section that collates key numerical scores for LLaVA-OneVision and the leading prior open models across representative single-image, multi-image, and video datasets. This will explicitly support the simultaneous performance claim and we will reference it in the abstract. revision: yes

standing simulated objections not resolved

Performing new large-scale training runs for ablations that hold total training tokens exactly fixed against a standard LLaVA-style mixture is not feasible due to computational constraints.

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of self-cited insights

full rationale

The paper presents LLaVA-OneVision as a model family built by consolidating design insights from the authors' prior LLaVA-NeXT blog series on data, models, and visual representations. Its central claims rest on experimental benchmark results demonstrating performance across single-image, multi-image, and video scenarios plus image-to-video transfer. These outcomes are measured independently via standard evaluations and are not reduced by construction to the prior insights or any fitted parameters. No self-definitional equations, predictions that are statistically forced from subsets of the same data, or load-bearing self-citations that render the performance claims tautological appear in the abstract or described structure. The self-reference functions as engineering motivation rather than a mathematical premise that collapses the reported gains into the inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claims depend on empirical training outcomes from techniques consolidated from the LLaVA-NeXT series; no new mathematical axioms or derivations are introduced.

free parameters (2)

Model scale and architecture variants
Family of models with different sizes and configurations selected to achieve the reported performance levels.
Training data composition ratios
Proportions of single-image, multi-image, and video data used to enable the claimed transfer learning.

axioms (1)

domain assumption Insights from the LLaVA-NeXT blog series on data, models, and visual representations are valid and sufficient to build improved LMMs.
The model is explicitly developed by consolidating these prior insights.

pith-pipeline@v0.9.0 · 5436 in / 1245 out tokens · 66569 ms · 2026-05-10T14:17:35.070908+00:00 · methodology

discussion (0)

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models
cs.CV 2026-05 unverdicted novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents
cs.CV 2026-05 accept novelty 8.0

DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
cs.CV 2026-04 unverdicted novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning
cs.AI 2026-04 unverdicted novelty 8.0

FeynmanBench is the first benchmark for evaluating multimodal LLMs on diagrammatic reasoning with Feynman diagrams, revealing systematic failures in enforcing physical constraints and global topology.
Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning
cs.CV 2026-04 conditional novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
cs.CL 2024-09 accept novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
cs.CV 2026-05 unverdicted novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.
UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
cs.MM 2026-05 unverdicted novelty 7.0

UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...
Count Anything at Any Granularity
cs.CV 2026-05 unverdicted novelty 7.0

Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for impro...
V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
cs.CV 2026-05 unverdicted novelty 7.0

ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
cs.CV 2026-05 conditional novelty 7.0

TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
cs.CV 2026-05 unverdicted novelty 7.0

TOC-Bench is an object-track-grounded benchmark that filters for temporally dependent questions and shows Video-LLMs have major weaknesses in event counting, ordering, identity reasoning, and hallucination detection.
SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding
cs.CV 2026-05 unverdicted novelty 7.0

SYNCR benchmark shows leading MLLMs reach only 52.5% average accuracy on cross-video reasoning tasks against an 89.5% human baseline, with major weaknesses in physical and spatial reasoning.
Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.
Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs
cs.CV 2026-05 unverdicted novelty 7.0

ScaleEarth conditions remote sensing VLMs on continuous GSD via CS-HLoRA and a visual GSD predictor, creating a closed training loop with GeoScale-VQA to achieve SOTA on Earth observation benchmarks.
VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.
MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO 2026-05 unverdicted novelty 7.0

MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.
Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
cs.CV 2026-05 unverdicted novelty 7.0

Gromov-Wasserstein distance between modalities provides a stronger, inference-only predictor of final VLM performance than conventional encoder metrics, backed by theory linking it to cross-modal learnability and veri...
SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images
cs.AI 2026-04 unverdicted novelty 7.0

SpecVQA is a new benchmark dataset and evaluation suite for testing multimodal large language models on scientific spectral image understanding and visual question answering, supported by a curve-preserving sampling m...
Membership Inference Attacks Against Video Large Language Models
cs.CR 2026-04 unverdicted novelty 7.0

A temperature-perturbed black-box attack infers video training membership in VideoLLMs with 0.68 AUC by exploiting sharper generation behavior on member samples.
MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding
cs.MM 2026-04 unverdicted novelty 7.0

MarkIt uses a query-to-mask bridge with open-vocabulary segmentation to add visual markers and frame indices to videos, enabling Vid-LLMs to achieve state-of-the-art temporal grounding on moment retrieval and highligh...
Don't Pause! Every prediction matters in a streaming video
cs.CV 2026-04 unverdicted novelty 7.0

SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
cs.CV 2026-04 unverdicted novelty 7.0

LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
cs.CV 2026-04 unverdicted novelty 7.0

CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...
SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments
cs.CV 2026-04 unverdicted novelty 7.0

SpaMEM benchmark shows multimodal LLMs succeed at spatial tasks with text histories but sharply fail at long-horizon belief maintenance from raw visual streams alone.
Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs
cs.LG 2026-04 unverdicted novelty 7.0

Sink-Token-aware Pruning (SToP) suppresses semantically uninformative sink tokens during visual token pruning in Video LLMs, boosting fine-grained performance even at 90% pruning rates across hallucination, reasoning,...
Hybrid Latent Reasoning with Decoupled Policy Optimization
cs.CV 2026-04 unverdicted novelty 7.0

HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.
MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
cs.CL 2026-04 unverdicted novelty 7.0

MM-JudgeBias benchmark shows that many MLLM judges neglect modalities and produce unstable evaluations under small input changes, based on tests of 26 models with over 1,800 samples.
Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts
cs.CL 2026-04 unverdicted novelty 7.0

Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.
GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning
cs.RO 2026-04 unverdicted novelty 7.0

GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning
cs.CV 2026-04 unverdicted novelty 7.0

OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
Towards Unconstrained Human-Object Interaction
cs.CV 2026-04 unverdicted novelty 7.0

Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.
Why MLLMs Struggle to Determine Object Orientations
cs.CV 2026-04 accept novelty 7.0

Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.
Unveiling the Surprising Efficacy of Navigation Understanding in End-to-End Autonomous Driving
cs.RO 2026-04 unverdicted novelty 7.0

The SNG framework and SNG-VLA model enable end-to-end driving systems to better incorporate global navigation for state-of-the-art route following without auxiliary perception losses.
Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding
cs.CV 2026-04 unverdicted novelty 7.0

DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote s...
Bottleneck Tokens for Unified Multimodal Retrieval
cs.LG 2026-04 unverdicted novelty 7.0

Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
Mosaic: Cross-Modal Clustering for Efficient Video Understanding
cs.PF 2026-04 unverdicted novelty 7.0

Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
UIPress: Bringing Optical Token Compression to UI-to-Code Generation
cs.CL 2026-04 unverdicted novelty 7.0

UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...
SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos
cs.CV 2026-04 unverdicted novelty 7.0

SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.
CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning
cs.CV 2026-04 unverdicted novelty 7.0

CrashSight is a new infrastructure-focused benchmark showing that state-of-the-art vision-language models can describe crash scenes but fail at temporal and causal reasoning.
Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
cs.CV 2026-04 unverdicted novelty 7.0

Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
Open-Ended Video Game Glitch Detection with Agentic Reasoning and Temporal Grounding
cs.MA 2026-04 unverdicted novelty 7.0

Introduces the first benchmark for open-ended video game glitch detection with temporal localization and proposes GliDe, an agentic framework that achieves stronger performance than vanilla multimodal models.
MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments
cs.CV 2026-04 unverdicted novelty 7.0

MARINER is a new benchmark dataset and evaluation framework for fine-grained perception and causal reasoning in open-water scenes using 16,629 images across 63 vessel categories, diverse environments, and maritime incidents.
PLUME: Latent Reasoning Based Universal Multimodal Embedding
cs.CV 2026-04 unverdicted novelty 7.0

PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
cs.CV 2026-03 unverdicted novelty 7.0

V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
cs.CV 2025-05 unverdicted novelty 7.0

DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
Video-R1: Reinforcing Video Reasoning in MLLMs
cs.CV 2025-03 conditional novelty 7.0

Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
Unified Reward Model for Multimodal Understanding and Generation
cs.CV 2025-03 unverdicted novelty 7.0

UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
cs.CV 2025-01 unverdicted novelty 7.0

Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
MLVU: Benchmarking Multi-task Long Video Understanding
cs.CV 2024-06 conditional novelty 7.0

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
cs.CV 2026-05 unverdicted novelty 6.0

Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior methods.
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
cs.CV 2026-05 unverdicted novelty 6.0

SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoni...
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
cs.CV 2026-05 unverdicted novelty 6.0

SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...
OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models
cs.CV 2026-05 unverdicted novelty 6.0

OTT-Vid uses optimal transport with non-uniform token mass and locality-aware costs to dynamically allocate compression budgets across video frames, retaining 95.8% VQA and 73.9% VTG performance at 10% token retention.
SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
cs.CV 2026-05 unverdicted novelty 6.0

SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
cs.AI 2026-05 unverdicted novelty 6.0

LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.
RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
cs.CV 2026-05 unverdicted novelty 6.0

RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...

Reference graph

Works this paper leans on

179 extracted references · 179 canonical work pages · cited by 152 Pith papers · 16 internal anchors

[1]

Tallyqa: Answering complex counting questions

Manoj Acharya, Kushal Kafle, and Christopher Kanan. Tallyqa: Answering complex counting questions. In AAAI, 2019. 39

work page 2019
[2]

Mathqa: Towards interpretable math word problem solving with operation-based formalisms, 2019

Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms, 2019. 39

work page 2019
[3]

Claude-3.5

Anthropic. Claude-3.5. https://www.anthropic.com/news/claude-3-5-sonnet , 2024. 2, 11

work page 2024
[4]

Vqa: Visual question answering

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In ICCV, 2015. 39

work page 2015
[5]

Scanqa: 3d question answering for spatial scene understanding

Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129–19139, 2022. 9

work page 2022
[6]

Scanqa: 3d question answering for spatial scene understanding

Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 40

work page 2022
[7]

Vision datasets: A benchmark for vision-based industrial inspection, 2023

Haoping Bai, Shancong Mou, Tatiana Likhomanenko, Ramazan Gokberk Cinbis, Oncel Tuzel, Ping Huang, Jiulong Shan, Jianjun Shi, and Meng Cao. Vision datasets: A benchmark for vision-based industrial inspection, 2023. 40

work page 2023
[8]

Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. Technical Report, 2023. 11, 37

work page 2023
[9]

Visual question answering on image sets

Ankan Bansal, Yuting Zhang, and Rama Chellappa. Visual question answering on image sets. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16, pages 51–67. Springer, 2020. 9

work page 2020
[10]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024. 2

work page internal anchor Pith review arXiv 2024
[11]

Scene text visual question answering

Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In ICCV, 2019. 39

work page 2019
[12]

Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom

Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving, 2020. 40

work page 2020
[13]

Textocr-gpt4v

Jimmy Carter. Textocr-gpt4v. https://huggingface.co/datasets/jimmycarter/ textocr-gpt4v, 2024. 39

work page 2024
[14]

Mapqa: A dataset for question answering on choropleth maps, 2022

Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler-Lussier, and Ningchuan Xiao. Mapqa: A dataset for question answering on choropleth maps, 2022. 39

work page 2022
[15]

Webqa: Multihop and multimodal qa

Yingshan Chang, Mridu Narang, Hisami Suzuki, Guihong Cao, Jianfeng Gao, and Yonatan Bisk. Webqa: Multihop and multimodal qa. arXiv preprint arXiv:2109.00590, 2021. 40

work page arXiv 2021
[16]

Allava: Harness- ing gpt4v-synthesized data for a lite vision-language model

Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v- synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684, 2024. 6, 7, 39

work page arXiv 2024
[17]

Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression,

Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression,

work page
[18]

Xing, and Liang Lin

Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning, 2022. 39

work page 2022
[19]

Are We on the Right Way for Evaluating Large Vision-Language Models?

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision- language models? arXiv preprint arXiv:2403.20330, 2024. 10

work page internal anchor Pith review arXiv 2024
[20]

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023. 5

work page internal anchor Pith review arXiv 2023
[21]

Sharegpt4video: Improving video understand- ing and generation with better captions

Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, and Jiaqi Wang. Sharegpt4video: Improving video understanding and generation with better captions. arXiv preprint arXiv:2406.04325, 2024. 38, 40

work page arXiv 2024
[22]

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023. 9, 11, 37, 39

work page internal anchor Pith review arXiv 2023
[23]

Hitab: A hierarchical table dataset for question answering and natural language generation

Zhoujun Cheng, Haoyu Dong, Zhiruo Wang, Ran Jia, Jiaqi Guo, Yan Gao, Shi Han, Jian-Guang Lou, and Dongmei Zhang. Hitab: A hierarchical table dataset for question answering and natural language generation. In ACL, 2022. 39

work page 2022
[24]

Puzzlevqa: Diagnosing multimodal reasoning challenges of language models with abstract visual patterns, 2024

Yew Ken Chia, Vernon Toh Yan Han, Deepanway Ghosal, Lidong Bing, and Soujanya Poria. Puzzlevqa: Diagnosing multimodal reasoning challenges of language models with abstract visual patterns. arXiv preprint arXiv:2403.13315, 2024. 9

work page arXiv 2024
[25]

Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner

Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017. 40

work page 2017
[26]

Instructblip: Towards general-purpose vision-language models with instruction tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2024. 2

work page 2024
[27]

Neural naturalist: Generating fine-grained image comparisons, 2019

Maxwell Forbes, Christine Kaeser-Chen, Piyush Sharma, and Serge Belongie. Neural naturalist: Generating fine-grained image comparisons, 2019. 40

work page 2019
[28]

Mme: A comprehensive evaluation benchmark for multimodal large language models, 2024

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2024. 10, 36, 38

work page 2024
[29]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever com- prehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 10, 11

work page internal anchor Pith review arXiv 2024
[30]

Dreamsim: Learning new dimensions of human visual similarity using synthetic data, 2023

Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data, 2023. 40

work page 2023
[31]

Blink: Multimodal large language models can see but not perceive

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390, 2024. 9, 10

work page arXiv 2024
[32]

G-llava: Solving geometric problem with multi-modal large language model, 2023

Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, and Lingpeng Kong. G-llava: Solving geometric problem with multi-modal large language model, 2023. 39 24

work page 2023
[33]

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Carti...

work page 2022
[34]

Sciverse

Ziyu Guo, Renrui Zhang, Hao Chen, Jialin Gao, Peng Gao, Hongsheng Li, and Pheng-Ann Heng. Sciverse. https://sciverse-cuhk.github.io, 2024. 9

work page 2024
[35]

Point-bind & point-llm: Aligning point cloud with multi- modality for 3d understanding, generation, and instruction following,

Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, et al. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615, 2023. 2

work page arXiv 2023
[36]

Imagine this! scripts to compositions to videos, 2018

Tanmay Gupta, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. Imagine this! scripts to compositions to videos, 2018. 40

work page 2018
[37]

Vizwiz grand challenge: Answering visual questions from blind people

Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In CVPR, 2018. 39, 40

work page 2018
[38]

3d-llm: Injecting the 3d world into large language models

Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. Advances in Neural Information Processing Systems, 36:20482–20494, 2023. 9

work page 2023
[39]

Image change captioning by learning from an auxiliary task

Mehrdad Hosseinzadeh and Yang Wang. Image change captioning by learning from an auxiliary task. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2724–2733, 2021. 40

work page 2021
[40]

Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Aish- warya Agrawal, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al

Ting-Hao K. Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Aish- warya Agrawal, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. Visual storytelling. In 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2016), 2016. 9

work page 2016
[41]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019. 39

work page 2019
[42]

Hq-edit: A high-quality dataset for instruction-based image editing, 2024

Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, and Cihang Xie. Hq-edit: A high-quality dataset for instruction-based image editing, 2024. 40

work page 2024
[43]

Lim, and Edward H

Phillip Isola, Joseph J. Lim, and Edward H. Adelson. Discovering states and transformations in image collections. In CVPR, 2015. 40

work page 2015
[44]

The amazing mysteries of the gutter: Drawing inferences between panels in comic book narratives, 2017

Mohit Iyyer, Varun Manjunatha, Anupam Guha, Yogarshi Vyas, Jordan Boyd-Graber, Hal Daumé III au2, and Larry Davis. The amazing mysteries of the gutter: Drawing inferences between panels in comic book narratives, 2017. 40

work page 2017
[45]

Learning to describe differences between pairs of similar images

Harsh Jhamtani and Taylor Berg-Kirkpatrick. Learning to describe differences between pairs of similar images. arXiv preprint arXiv:1808.10584, 2018. 9

work page arXiv 2018
[46]

Learning to describe differences between pairs of similar images, 2018

Harsh Jhamtani and Taylor Berg-Kirkpatrick. Learning to describe differences between pairs of similar images, 2018. 40 25

work page 2018
[47]

Mantis: Interleaved multi-image instruction tuning.arXiv preprint arXiv:2405.01483, 2024

Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. arXiv preprint arXiv:2405.01483, 2024. 2, 10, 12, 40

work page arXiv 2024
[48]

Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017. 39

work page 2017
[49]

Dvqa: Understanding data visualizations via question answering

Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via question answering. In CVPR, 2018. 37, 39

work page 2018
[50]

Figureqa: An annotated figure dataset for visual reasoning, 2018

Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Akos Kadar, Adam Trischler, and Yoshua Bengio. Figureqa: An annotated figure dataset for visual reasoning, 2018. 39

work page 2018
[51]

Prismatic vlms: Investigating the design space of visually-conditioned language models

Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. Technical Report, 2024. 2

work page 2024
[52]

Geomverse: A systematic evaluation of large models for geometric reasoning

Mehran Kazemi, Hamidreza Alvari, Ankit Anand, Jialin Wu, Xi Chen, and Radu Soricut. Geomverse: A systematic evaluation of large models for geometric reasoning. arXiv preprint arXiv:2312.12241, 2023. 39

work page arXiv 2023
[53]

A diagram is worth a dozen images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016. 10, 37, 39

work page 2016
[54]

A diagram is worth a dozen images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14 , pages 235–251. Springer, 2016. 9, 36, 38

work page 2016
[55]

Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension

Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, and Hannaneh Hajishirzi. Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern recognition, pages 4999–5007, 2017. 39

work page 2017
[56]

Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension

Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, and Hannaneh Hajishirzi. Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5376–5384, 2017. 40

work page 2017
[57]

The hateful memes challenge: Detecting hate speech in multimodal memes

Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. The hateful memes challenge: Detecting hate speech in multimodal memes. In NeurIPS, 2020. 39

work page 2020
[58]

Ocr-free document understanding transformer

Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Ocr-free document understanding transformer. In European Conference on Computer Vision (ECCV), 2022. 37, 39

work page 2022
[59]

Shamma, Michael S

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. Visual genome: Connecting language and vision using crowdsourced dense image annotations,

work page
[60]

Image retrieval from contextual descriptions

Benno Krojer, Vaibhav Adlakha, Vibhav Vineet, Yash Goyal, Edoardo Ponti, and Siva Reddy. Image retrieval from contextual descriptions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Online, May 2022. Association for Computational Linguistics. 40

work page 2022
[61]

Sharegpt-4o: Comprehensive multimodal annotations with gpt-4o,

Shanghai AI Laboratory. Sharegpt-4o: Comprehensive multimodal annotations with gpt-4o,

work page
[62]

A dataset of clinically generated visual questions and answers about radiology images

Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 5(1):1–10, 2018. 39

work page 2018
[63]

What matters when building vision-language models? Technical Report, 2024

Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? Technical Report, 2024. 2, 6, 37

work page 2024
[64]

Llava-next: What else influences visual instruction tuning beyond data?, May 2024

Bo Li, Hao Zhang, Kaichen Zhang, Dong Guo, Yuanhan Zhang, Renrui Zhang, Feng Li, Ziwei Liu, and Chunyuan Li. Llava-next: What else influences visual instruction tuning beyond data?, May 2024. 1, 2, 3, 5, 34, 35

work page 2024
[65]

Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024

Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024. 1, 3, 9, 10, 34, 36, 38

work page 2024
[66]

Seed-bench: Benchmarking multimodal llms with generative comprehension, 2023

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension, 2023. 10

work page 2023
[67]

Multimodal foundation models: From specialists to general-purpose assistants

Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao, et al. Multimodal foundation models: From specialists to general-purpose assistants. Foundations and Trends® in Computer Graphics and Vision, 2024. 1

work page 2024
[68]

Llava-next: Tackling multi-image, video, and 3d in large multimodal models, June 2024

Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next: Tackling multi-image, video, and 3d in large multimodal models, June 2024. 1, 2, 5, 6, 7, 9, 10, 12, 34, 35, 36, 38

work page 2024
[69]

Fine-tuning multimodal llms to follow zero-shot demonstrative instructions, 2024

Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Fine-tuning multimodal llms to follow zero-shot demonstrative instructions, 2024. 7, 40

work page 2024
[70]

Fine-tuning multimodal llms to follow zero-shot demonstrative instructions.arXiv preprint:2308.04152, 2023

Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Hanwang Zhang, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, and Yueting Zhuang. Empowering vision-language models to follow interleaved vision-language instructions. arXiv preprint arXiv:2308.04152, 2023. 2, 12

work page arXiv 2023
[71]

Mvbench: A comprehensive multi-modal video understanding benchmark, 2023

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark, 2023. 10

work page 2023
[72]

Llama-vid: An image is worth 2 tokens in large language models

Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, 2024. 2

work page 2024
[73]

Mini-gemini: Mining the potential of multi-modality vision language models

Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. Technical Report, 2024. 2

work page 2024
[74]

Storygan: A sequential conditional gan for story visualization,

Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, and Jianfeng Gao. Storygan: A sequential conditional gan for story visualization,

work page
[75]

Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning, 2023

Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, and Alan Yuille. Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning, 2023. 39

work page 2023
[76]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023. 2

work page internal anchor Pith review arXiv 2023
[77]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699, 2024. 2, 11, 12

work page 2024
[78]

Lawrence Zitnick, and Piotr Dollár

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015. 37, 39 27

work page 2015
[79]

Visual spatial reasoning

Fangyu Liu, Guy Edward Toh Emerson, and Nigel Collier. Visual spatial reasoning. Transac- tions of the Association for Computational Linguistics, 2023. 39

work page 2023
[80]

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565,

work page internal anchor Pith review arXiv

Showing first 80 references.