pith. machine review for the scientific record.

arXiv: 2408.03326 · v3 · submitted 2024-08-06 · cs.CV · cs.AI · cs.CL

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Chunyuan Li, Dong Guo, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Renrui Zhang, Yanwei Li, Yuanhan Zhang, Ziwei Liu

Pith reviewed 2026-05-10 14:17 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.CL
keywords large multimodal models · visual task transfer · single-image understanding · multi-image understanding · video understanding · transfer learning · open LMMs

The pith

LLaVA-OneVision is the first single open model to advance performance in single-image, multi-image, and video understanding at once.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LLaVA-OneVision, a family of open large multimodal models built by combining lessons on data, models, and visual representations from the authors' prior LLaVA-NeXT work. It demonstrates that this single model family reaches new performance levels across three distinct visual scenarios simultaneously. The architecture supports direct transfer of skills learned on images to video tasks, which in turn produces additional emerging abilities. A reader would care because the result points to simpler ways of building versatile visual AI that does not require separate models for each input type.

Core claim

LLaVA-OneVision consolidates insights into data, models, and visual representations to create a single model family that simultaneously pushes performance boundaries of open LMMs in single-image, multi-image, and video scenarios. The design enables strong transfer learning across these modalities and scenarios, yielding new emerging capabilities, with particularly strong video understanding demonstrated through task transfer from images to videos.

What carries the argument

LLaVA-OneVision family of models, which unifies data curation, model design, and visual representation strategies to enable cross-scenario task transfer.

If this is right

  • One model suffices to reach leading results in single-image understanding.
  • The same model reaches leading results in multi-image understanding.
  • Video understanding improves through direct transfer of image-based capabilities.
  • New abilities emerge that were not present in the source image-only training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar consolidation of data and representation choices could be tested on pairs of other visual tasks to check whether transfer appears consistently.
  • The single-model approach may lower the engineering effort needed to deploy visual AI across varied input formats in practice.
  • If the transfer mechanism holds, it raises the question of whether further modalities such as 3D scenes could be added without rebuilding the model from scratch.

Load-bearing premise

That the reported performance gains and transfer abilities stem mainly from consolidating prior insights on data, models, and visual representations rather than from unstated differences in training scale or benchmark selection.

What would settle it

A head-to-head evaluation on standard single-image, multi-image, and video benchmarks where another single open LMM without the described consolidation matches or exceeds LLaVA-OneVision across all three scenarios would falsify the claim of being the first to push boundaries in this unified way.
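
The test described above can be sketched as a simple dominance check. This is only an illustration: the scenario keys, the benchmark names (`bench_a` and so on), and every score below are made-up placeholders, not numbers from the paper or from any real model.

```python
from typing import Dict, List, Tuple

# scenario -> benchmark -> score (higher is better). All names are placeholders.
Scores = Dict[str, Dict[str, float]]

SCENARIOS = ("single_image", "multi_image", "video")


def shortfalls(challenger: Scores, reference: Scores) -> List[Tuple[str, str]]:
    """List (scenario, benchmark) pairs where the challenger falls below the reference."""
    failed = []
    for scenario in SCENARIOS:
        if not reference.get(scenario):
            raise ValueError(f"reference has no benchmarks for {scenario}")
        challenger_scores = challenger.get(scenario, {})
        for benchmark, ref_score in reference[scenario].items():
            score = challenger_scores.get(benchmark)
            # A benchmark the challenger never reported counts as a shortfall.
            if score is None or score < ref_score:
                failed.append((scenario, benchmark))
    return failed


def claim_is_challenged(challenger: Scores, reference: Scores) -> bool:
    """True only if the challenger matches or exceeds the reference everywhere."""
    return not shortfalls(challenger, reference)


# Hypothetical scores, invented purely to exercise the functions.
reference = {
    "single_image": {"bench_a": 70.0},
    "multi_image": {"bench_b": 60.0},
    "video": {"bench_c": 55.0},
}
challenger = {
    "single_image": {"bench_a": 71.0},
    "multi_image": {"bench_b": 60.0},
    "video": {"bench_c": 50.0},
}
print(shortfalls(challenger, reference))
```

Requiring a match on every benchmark in all three scenarios mirrors the "across all three scenarios" wording; a looser criterion (for example, a per-scenario average) would be a different test.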

Original abstract

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents LLaVA-OneVision, a family of open large multimodal models (LMMs) obtained by consolidating insights from the LLaVA-NeXT blog series on data curation, model architecture, and visual representations. It claims that a single model simultaneously sets new performance records for open LMMs on single-image, multi-image, and video tasks while enabling emergent transfer capabilities, especially image-to-video task transfer.

Significance. If the reported benchmark gains and transfer results are shown to arise specifically from the consolidated recipe rather than scale or data volume, the work would be significant: it would demonstrate a practical route to unified open LMMs that handle multiple visual modalities without task-specific retraining, reducing fragmentation in the open-source multimodal ecosystem.

major comments (2)
  1. [Experimental results] Experimental results section: the central attribution of performance gains and cross-scenario transfer to the consolidation of LLaVA-NeXT insights on data, models, and visual representations is not supported by ablations that hold total training tokens, model size, and optimizer settings fixed while varying only the recipe versus a standard LLaVA-style mixture; without such controls the claim that the design enables 'easy visual task transfer' cannot be isolated from increased scale.
  2. [Abstract and results tables] Abstract and results tables: the assertion that LLaVA-OneVision is 'the first single model' to push boundaries simultaneously across the three scenarios requires explicit side-by-side benchmark tables (with numerical scores on standard single-image, multi-image, and video datasets) against all relevant prior open LMMs; the current presentation leaves the 'first' claim difficult to verify.
minor comments (2)
  1. [Introduction] Notation for the three scenarios (single-image, multi-image, video) is introduced without a compact summary table that lists the exact benchmarks and metrics used for each.
  2. [Qualitative results] Figure captions for qualitative transfer examples should explicitly state the source image task and the target video task to make the transfer claim easier to follow.
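
The control requested in major comment 1 amounts to comparing two runs that differ in the recipe and nothing else. A minimal sketch of that check follows, assuming a hypothetical `RunConfig` record; the field names and all values are illustrative and do not come from the paper.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical description of one training run (fields are assumptions)."""
    recipe: str
    model_params: int
    training_tokens: int
    optimizer: str
    learning_rate: float


def differing_fields(a: RunConfig, b: RunConfig) -> List[str]:
    """Names of the fields on which the two runs disagree, in sorted order."""
    da, db = asdict(a), asdict(b)
    return sorted(name for name in da if da[name] != db[name])


def is_controlled_ablation(a: RunConfig, b: RunConfig) -> bool:
    """True when the recipe is the only thing that changes between the runs."""
    return differing_fields(a, b) == ["recipe"]


# Placeholder values, chosen only to exercise the check.
baseline = RunConfig("standard_mixture", 7_000_000_000, 1_000_000_000, "adamw", 1e-5)
candidate = RunConfig("consolidated_recipe", 7_000_000_000, 1_000_000_000, "adamw", 1e-5)
scaled_up = RunConfig("consolidated_recipe", 7_000_000_000, 4_000_000_000, "adamw", 1e-5)

print(is_controlled_ablation(baseline, candidate))
print(differing_fields(baseline, scaled_up))
```

A pair such as `baseline` and `scaled_up` is the confound the referee describes: the recipe and the token budget change together, so a gain cannot be attributed to either one alone.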

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing honest responses based on the manuscript's content and indicating where revisions will be made to strengthen the work.

Point-by-point responses
  1. Referee: [Experimental results] Experimental results section: the central attribution of performance gains and cross-scenario transfer to the consolidation of LLaVA-NeXT insights on data, models, and visual representations is not supported by ablations that hold total training tokens, model size, and optimizer settings fixed while varying only the recipe versus a standard LLaVA-style mixture; without such controls the claim that the design enables 'easy visual task transfer' cannot be isolated from increased scale.

    Authors: We acknowledge that a fully controlled ablation isolating the consolidated recipe (data curation, architecture, and visual representations) from differences in total training tokens would provide stronger causal evidence. The manuscript fixes model sizes (e.g., 0.5B, 7B, and 72B) and uses consistent optimizer settings across our variants, with direct comparisons to prior LLaVA models of similar scale; however, exact token counts are not matched against a baseline LLaVA-style mixture in the reported experiments. We will revise the Experimental Results section to add a detailed breakdown of training data volumes used in LLaVA-OneVision versus prior works, along with a discussion clarifying the differences in the recipe and acknowledging that scale may contribute to some gains. The cross-scenario transfer results (image-to-video) are presented as emergent evidence supporting the unified design, but we agree this does not fully substitute for the requested controls.
    revision: partial

  2. Referee: [Abstract and results tables] Abstract and results tables: the assertion that LLaVA-OneVision is 'the first single model' to push boundaries simultaneously across the three scenarios requires explicit side-by-side benchmark tables (with numerical scores on standard single-image, multi-image, and video datasets) against all relevant prior open LMMs; the current presentation leaves the 'first' claim difficult to verify.

    Authors: We agree that an aggregated side-by-side table would improve verifiability of the 'first single model' claim. The manuscript already reports results on standard benchmarks for each scenario with comparisons to prior open LMMs in dedicated tables. We will add a new summary table in the results section that collates key numerical scores for LLaVA-OneVision and the leading prior open models across representative single-image, multi-image, and video datasets. This will explicitly support the simultaneous performance claim and we will reference it in the abstract.
    revision: yes

standing simulated objections not resolved
  • Performing new large-scale training runs for ablations that hold total training tokens exactly fixed against a standard LLaVA-style mixture is not feasible due to computational constraints.

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of self-cited insights

full rationale

The paper presents LLaVA-OneVision as a model family built by consolidating design insights from the authors' prior LLaVA-NeXT blog series on data, models, and visual representations. Its central claims rest on experimental benchmark results demonstrating performance across single-image, multi-image, and video scenarios plus image-to-video transfer. These outcomes are measured independently via standard evaluations and are not reduced by construction to the prior insights or any fitted parameters. No self-definitional equations, predictions that are statistically forced from subsets of the same data, or load-bearing self-citations that render the performance claims tautological appear in the abstract or described structure. The self-reference functions as engineering motivation rather than a mathematical premise that collapses the reported gains into the inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claims depend on empirical training outcomes from techniques consolidated from the LLaVA-NeXT series; no new mathematical axioms or derivations are introduced.

free parameters (2)
  • Model scale and architecture variants
    Family of models with different sizes and configurations selected to achieve the reported performance levels.
  • Training data composition ratios
    Proportions of single-image, multi-image, and video data used to enable the claimed transfer learning.
axioms (1)
  • domain assumption: Insights from the LLaVA-NeXT blog series on data, models, and visual representations are valid and sufficient to build improved LMMs.
    The model is explicitly developed by consolidating these prior insights.
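
To make the second free parameter concrete, the sketch below turns a set of mixture weights into per-scenario sample counts with largest-remainder rounding. The weights 5:3:2 are invented for illustration; the ledger does not state the paper's actual proportions, and the function name is hypothetical.

```python
from typing import Dict


def allocate_samples(weights: Dict[str, int], total: int) -> Dict[str, int]:
    """Split `total` samples across scenarios in proportion to integer weights."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0 or total < 0:
        raise ValueError("weights must sum to a positive value and total must be >= 0")
    # Integer arithmetic keeps the split exact: floor share plus a remainder.
    counts = {name: total * w // weight_sum for name, w in weights.items()}
    remainders = {name: total * w % weight_sum for name, w in weights.items()}
    leftover = total - sum(counts.values())
    # Hand the leftover samples to the scenarios with the largest remainders.
    for name in sorted(weights, key=lambda n: remainders[n], reverse=True)[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical weights, not the paper's data composition.
mixture = {"single_image": 5, "multi_image": 3, "video": 2}
print(allocate_samples(mixture, 10))
print(allocate_samples(mixture, 7))
```

Changing the weights while holding `total` fixed is one way to vary composition without also varying data volume, which is the separation the referee report asks for.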

pith-pipeline@v0.9.0 · 5436 in / 1245 out tokens · 66569 ms · 2026-05-10T14:17:35.070908+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 8.0

    SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

  2. DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

    cs.CV 2026-05 accept novelty 8.0

    DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.

  3. EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

    cs.CV 2026-04 unverdicted novelty 8.0

    EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

  4. FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

    cs.AI 2026-04 unverdicted novelty 8.0

    FeynmanBench is the first benchmark for evaluating multimodal LLMs on diagrammatic reasoning with Feynman diagrams, revealing systematic failures in enforcing physical constraints and global topology.

  5. Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

    cs.CV 2026-04 conditional novelty 8.0

    VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

  6. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  7. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

  8. AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.

  9. UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

    cs.MM 2026-05 unverdicted novelty 7.0

    UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...

  10. Count Anything at Any Granularity

    cs.CV 2026-05 unverdicted novelty 7.0

    Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for impro...

  11. V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.

  12. ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

  13. TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

    cs.CV 2026-05 conditional novelty 7.0

    TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.

  14. TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    TOC-Bench is an object-track-grounded benchmark that filters for temporally dependent questions and shows Video-LLMs have major weaknesses in event counting, ordering, identity reasoning, and hallucination detection.

  15. SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    SYNCR benchmark shows leading MLLMs reach only 52.5% average accuracy on cross-video reasoning tasks against an 89.5% human baseline, with major weaknesses in physical and spatial reasoning.

  16. Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.

  17. Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    ScaleEarth conditions remote sensing VLMs on continuous GSD via CS-HLoRA and a visual GSD predictor, creating a closed training loop with GeoScale-VQA to achieve SOTA on Earth observation benchmarks.

  18. VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.

  19. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 7.0

    MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.

  20. Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance

    cs.CV 2026-05 unverdicted novelty 7.0

    Gromov-Wasserstein distance between modalities provides a stronger, inference-only predictor of final VLM performance than conventional encoder metrics, backed by theory linking it to cross-modal learnability and veri...

  21. SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images

    cs.AI 2026-04 unverdicted novelty 7.0

    SpecVQA is a new benchmark dataset and evaluation suite for testing multimodal large language models on scientific spectral image understanding and visual question answering, supported by a curve-preserving sampling m...

  22. Membership Inference Attacks Against Video Large Language Models

    cs.CR 2026-04 unverdicted novelty 7.0

    A temperature-perturbed black-box attack infers video training membership in VideoLLMs with 0.68 AUC by exploiting sharper generation behavior on member samples.

  23. MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding

    cs.MM 2026-04 unverdicted novelty 7.0

    MarkIt uses a query-to-mask bridge with open-vocabulary segmentation to add visual markers and frame indices to videos, enabling Vid-LLMs to achieve state-of-the-art temporal grounding on moment retrieval and highligh...

  24. Don't Pause! Every prediction matters in a streaming video

    cs.CV 2026-04 unverdicted novelty 7.0

    SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.

  25. LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...

  26. CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...

  27. SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

    cs.CV 2026-04 unverdicted novelty 7.0

    SpaMEM benchmark shows multimodal LLMs succeed at spatial tasks with text histories but sharply fail at long-horizon belief maintenance from raw visual streams alone.

  28. Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

    cs.LG 2026-04 unverdicted novelty 7.0

    Sink-Token-aware Pruning (SToP) suppresses semantically uninformative sink tokens during visual token pruning in Video LLMs, boosting fine-grained performance even at 90% pruning rates across hallucination, reasoning,...

  29. Hybrid Latent Reasoning with Decoupled Policy Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.

  30. MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge

    cs.CL 2026-04 unverdicted novelty 7.0

    MM-JudgeBias benchmark shows that many MLLM judges neglect modalities and produce unstable evaluations under small input changes, based on tests of 26 models with over 1,800 samples.

  31. Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts

    cs.CL 2026-04 unverdicted novelty 7.0

    Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.

  32. GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning

    cs.RO 2026-04 unverdicted novelty 7.0

    GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.

  33. OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.

  34. Towards Unconstrained Human-Object Interaction

    cs.CV 2026-04 unverdicted novelty 7.0

    Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.

  35. Why MLLMs Struggle to Determine Object Orientations

    cs.CV 2026-04 accept novelty 7.0

    Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.

  36. Unveiling the Surprising Efficacy of Navigation Understanding in End-to-End Autonomous Driving

    cs.RO 2026-04 unverdicted novelty 7.0

    The SNG framework and SNG-VLA model enable end-to-end driving systems to better incorporate global navigation for state-of-the-art route following without auxiliary perception losses.

  37. Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote s...

  38. Bottleneck Tokens for Unified Multimodal Retrieval

    cs.LG 2026-04 unverdicted novelty 7.0

    Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.

  39. Mosaic: Cross-Modal Clustering for Efficient Video Understanding

    cs.PF 2026-04 unverdicted novelty 7.0

    Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.

  40. UIPress: Bringing Optical Token Compression to UI-to-Code Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...

  41. SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.

  42. CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    CrashSight is a new infrastructure-focused benchmark showing that state-of-the-art vision-language models can describe crash scenes but fail at temporal and causal reasoning.

  43. Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding

    cs.CV 2026-04 unverdicted novelty 7.0

    Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.

  44. Open-Ended Video Game Glitch Detection with Agentic Reasoning and Temporal Grounding

    cs.MA 2026-04 unverdicted novelty 7.0

    Introduces the first benchmark for open-ended video game glitch detection with temporal localization and proposes GliDe, an agentic framework that achieves stronger performance than vanilla multimodal models.

  45. MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments

    cs.CV 2026-04 unverdicted novelty 7.0

    MARINER is a new benchmark dataset and evaluation framework for fine-grained perception and causal reasoning in open-water scenes using 16,629 images across 63 vessel categories, diverse environments, and maritime incidents.

  46. PLUME: Latent Reasoning Based Universal Multimodal Embedding

    cs.CV 2026-04 unverdicted novelty 7.0

    PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.

  47. V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators

    cs.CV 2026-03 unverdicted novelty 7.0

    V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...

  48. DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    cs.CV 2025-05 unverdicted novelty 7.0

    DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.

  49. Video-R1: Reinforcing Video Reasoning in MLLMs

    cs.CV 2025-03 conditional novelty 7.0

    Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

  50. Unified Reward Model for Multimodal Understanding and Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

  51. Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    cs.CV 2025-01 unverdicted novelty 7.0

    Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.

  52. MLVU: Benchmarking Multi-task Long Video Understanding

    cs.CV 2024-06 conditional novelty 7.0

    MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

  53. Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

    cs.CV 2026-05 unverdicted novelty 6.0

    Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.

  54. GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior methods.

  55. Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoni...

  56. Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...

  57. OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    OTT-Vid uses optimal transport with non-uniform token mass and locality-aware costs to dynamically allocate compression budgets across video frames, retaining 95.8% VQA and 73.9% VTG performance at 10% token retention.

  58. SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images

    cs.CV 2026-05 unverdicted novelty 6.0

    SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.

  59. LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

    cs.AI 2026-05 unverdicted novelty 6.0

    LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.

  60. RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

    cs.CV 2026-05 unverdicted novelty 6.0

    RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...

Reference graph

Works this paper leans on

179 extracted references · 179 canonical work pages · cited by 152 Pith papers · 16 internal anchors

  1. [1]

    Tallyqa: Answering complex counting questions

    Manoj Acharya, Kushal Kafle, and Christopher Kanan. Tallyqa: Answering complex counting questions. In AAAI, 2019. 39

  2. [2]

    Mathqa: Towards interpretable math word problem solving with operation-based formalisms, 2019

    Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms, 2019. 39

  3. [3]

    Claude-3.5

    Anthropic. Claude-3.5. https://www.anthropic.com/news/claude-3-5-sonnet, 2024. 2, 11

  4. [4]

    Vqa: Visual question answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In ICCV, 2015. 39

  5. [5]

    Scanqa: 3d question answering for spatial scene understanding

    Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129–19139, 2022. 9

  6. [6]

    Scanqa: 3d question answering for spatial scene understanding

    Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 40

  7. [7]

    Vision datasets: A benchmark for vision-based industrial inspection, 2023

    Haoping Bai, Shancong Mou, Tatiana Likhomanenko, Ramazan Gokberk Cinbis, Oncel Tuzel, Ping Huang, Jiulong Shan, Jianjun Shi, and Meng Cao. Vision datasets: A benchmark for vision-based industrial inspection, 2023. 40

  8. [8]

    Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. Technical Report, 2023. 11, 37

  9. [9]

    Visual question answering on image sets

    Ankan Bansal, Yuting Zhang, and Rama Chellappa. Visual question answering on image sets. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16, pages 51–67. Springer, 2020. 9

  10. [10]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024. 2

  11. [11]

    Scene text visual question answering

    Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In ICCV, 2019. 39

  12. [12]

    nuscenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving, 2020. 40

  13. [13]

    Textocr-gpt4v

    Jimmy Carter. Textocr-gpt4v. https://huggingface.co/datasets/jimmycarter/textocr-gpt4v, 2024. 39

  14. [14]

    Mapqa: A dataset for question answering on choropleth maps, 2022

    Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler-Lussier, and Ningchuan Xiao. Mapqa: A dataset for question answering on choropleth maps, 2022. 39

  15. [15]

    Webqa: Multihop and multimodal qa

    Yingshan Chang, Mridu Narang, Hisami Suzuki, Guihong Cao, Jianfeng Gao, and Yonatan Bisk. Webqa: Multihop and multimodal qa. arXiv preprint arXiv:2109.00590, 2021. 40

  16. [16]

    Allava: Harnessing gpt4v-synthesized data for a lite vision-language model

    Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v-synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684, 2024. 6, 7, 39

  17. [17]

    Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression

    Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression,

  18. [18]

    Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning

    Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning, 2022. 39

  19. [19]

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024. 10

  20. [20]

    ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023. 5

  21. [21]

    Sharegpt4video: Improving video understanding and generation with better captions

    Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, and Jiaqi Wang. Sharegpt4video: Improving video understanding and generation with better captions. arXiv preprint arXiv:2406.04325, 2024. 38, 40

  22. [22]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023. 9, 11, 37, 39

  23. [23]

    Hitab: A hierarchical table dataset for question answering and natural language generation

    Zhoujun Cheng, Haoyu Dong, Zhiruo Wang, Ran Jia, Jiaqi Guo, Yan Gao, Shi Han, Jian-Guang Lou, and Dongmei Zhang. Hitab: A hierarchical table dataset for question answering and natural language generation. In ACL, 2022. 39

  24. [24]

    Puzzlevqa: Diagnosing multimodal reasoning challenges of language models with abstract visual patterns, 2024

    Yew Ken Chia, Vernon Toh Yan Han, Deepanway Ghosal, Lidong Bing, and Soujanya Poria. Puzzlevqa: Diagnosing multimodal reasoning challenges of language models with abstract visual patterns. arXiv preprint arXiv:2403.13315, 2024. 9

  25. [25]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017. 40

  26. [26]

    Instructblip: Towards general-purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2024. 2

  27. [27]

    Neural naturalist: Generating fine-grained image comparisons, 2019

    Maxwell Forbes, Christine Kaeser-Chen, Piyush Sharma, and Serge Belongie. Neural naturalist: Generating fine-grained image comparisons, 2019. 40

  28. [28]

    Mme: A comprehensive evaluation benchmark for multimodal large language models, 2024

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2024. 10, 36, 38

  29. [29]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 10, 11

  30. [30]

    Dreamsim: Learning new dimensions of human visual similarity using synthetic data, 2023

    Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data, 2023. 40

  31. [31]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390, 2024. 9, 10

  32. [32]

    G-llava: Solving geometric problem with multi-modal large language model, 2023

    Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, and Lingpeng Kong. G-llava: Solving geometric problem with multi-modal large language model, 2023. 39

  33. [33]

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Carti...

  34. [34]

    Sciverse

    Ziyu Guo, Renrui Zhang, Hao Chen, Jialin Gao, Peng Gao, Hongsheng Li, and Pheng-Ann Heng. Sciverse. https://sciverse-cuhk.github.io, 2024. 9

  35. [35]

    Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following

    Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, et al. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615, 2023. 2

  36. [36]

    Imagine this! scripts to compositions to videos, 2018

    Tanmay Gupta, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. Imagine this! scripts to compositions to videos, 2018. 40

  37. [37]

    Vizwiz grand challenge: Answering visual questions from blind people

    Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In CVPR, 2018. 39, 40

  38. [38]

    3d-llm: Injecting the 3d world into large language models

    Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. Advances in Neural Information Processing Systems, 36:20482–20494, 2023. 9

  39. [39]

    Image change captioning by learning from an auxiliary task

    Mehrdad Hosseinzadeh and Yang Wang. Image change captioning by learning from an auxiliary task. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2724–2733, 2021. 40

  40. [40]

    Visual storytelling

    Ting-Hao K. Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Aishwarya Agrawal, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. Visual storytelling. In 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2016), 2016. 9

  41. [41]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019. 39

  42. [42]

    Hq-edit: A high-quality dataset for instruction-based image editing, 2024

    Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, and Cihang Xie. Hq-edit: A high-quality dataset for instruction-based image editing, 2024. 40

  43. [43]

    Discovering states and transformations in image collections

    Phillip Isola, Joseph J. Lim, and Edward H. Adelson. Discovering states and transformations in image collections. In CVPR, 2015. 40

  44. [44]

    The amazing mysteries of the gutter: Drawing inferences between panels in comic book narratives, 2017

    Mohit Iyyer, Varun Manjunatha, Anupam Guha, Yogarshi Vyas, Jordan Boyd-Graber, Hal Daumé III, and Larry Davis. The amazing mysteries of the gutter: Drawing inferences between panels in comic book narratives, 2017. 40

  45. [45]

    Learning to describe differences between pairs of similar images

    Harsh Jhamtani and Taylor Berg-Kirkpatrick. Learning to describe differences between pairs of similar images. arXiv preprint arXiv:1808.10584, 2018. 9

  46. [46]

    Learning to describe differences between pairs of similar images, 2018

    Harsh Jhamtani and Taylor Berg-Kirkpatrick. Learning to describe differences between pairs of similar images, 2018. 40

  47. [47]

    Mantis: Interleaved multi-image instruction tuning

    Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. arXiv preprint arXiv:2405.01483, 2024. 2, 10, 12, 40

  48. [48]

    Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

    Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017. 39

  49. [49]

    Dvqa: Understanding data visualizations via question answering

    Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via question answering. In CVPR, 2018. 37, 39

  50. [50]

    Figureqa: An annotated figure dataset for visual reasoning, 2018

    Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Akos Kadar, Adam Trischler, and Yoshua Bengio. Figureqa: An annotated figure dataset for visual reasoning, 2018. 39

  51. [51]

    Prismatic vlms: Investigating the design space of visually-conditioned language models

    Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. Technical Report, 2024. 2

  52. [52]

    Geomverse: A systematic evaluation of large models for geometric reasoning

    Mehran Kazemi, Hamidreza Alvari, Ankit Anand, Jialin Wu, Xi Chen, and Radu Soricut. Geomverse: A systematic evaluation of large models for geometric reasoning. arXiv preprint arXiv:2312.12241, 2023. 39

  53. [53]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016. 10, 37, 39

  54. [54]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 235–251. Springer, 2016. 9, 36, 38

  55. [55]

    Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension

    Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, and Hannaneh Hajishirzi. Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern recognition, pages 4999–5007, 2017. 39

  56. [56]

    Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension

    Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, and Hannaneh Hajishirzi. Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5376–5384, 2017. 40

  57. [57]

    The hateful memes challenge: Detecting hate speech in multimodal memes

    Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. The hateful memes challenge: Detecting hate speech in multimodal memes. In NeurIPS, 2020. 39

  58. [58]

    Ocr-free document understanding transformer

    Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Ocr-free document understanding transformer. In European Conference on Computer Vision (ECCV), 2022. 37, 39

  59. [59]

    Visual genome: Connecting language and vision using crowdsourced dense image annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. Visual genome: Connecting language and vision using crowdsourced dense image annotations,

  60. [60]

    Image retrieval from contextual descriptions

    Benno Krojer, Vaibhav Adlakha, Vibhav Vineet, Yash Goyal, Edoardo Ponti, and Siva Reddy. Image retrieval from contextual descriptions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Online, May 2022. Association for Computational Linguistics. 40

  61. [61]

    Sharegpt-4o: Comprehensive multimodal annotations with gpt-4o

    Shanghai AI Laboratory. Sharegpt-4o: Comprehensive multimodal annotations with gpt-4o,

  62. [62]

    A dataset of clinically generated visual questions and answers about radiology images

    Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 5(1):1–10, 2018. 39

  63. [63]

    What matters when building vision-language models? Technical Report, 2024

    Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? Technical Report, 2024. 2, 6, 37

  64. [64]

    Llava-next: What else influences visual instruction tuning beyond data?, May 2024

    Bo Li, Hao Zhang, Kaichen Zhang, Dong Guo, Yuanhan Zhang, Renrui Zhang, Feng Li, Ziwei Liu, and Chunyuan Li. Llava-next: What else influences visual instruction tuning beyond data?, May 2024. 1, 2, 3, 5, 34, 35

  65. [65]

    Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024

    Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024. 1, 3, 9, 10, 34, 36, 38

  66. [66]

    Seed-bench: Benchmarking multimodal llms with generative comprehension, 2023

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension, 2023. 10

  67. [67]

    Multimodal foundation models: From specialists to general-purpose assistants

    Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao, et al. Multimodal foundation models: From specialists to general-purpose assistants. Foundations and Trends® in Computer Graphics and Vision, 2024. 1

  68. [68]

    Llava-next: Tackling multi-image, video, and 3d in large multimodal models, June 2024

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next: Tackling multi-image, video, and 3d in large multimodal models, June 2024. 1, 2, 5, 6, 7, 9, 10, 12, 34, 35, 36, 38

  69. [69]

    Fine-tuning multimodal llms to follow zero-shot demonstrative instructions, 2024

    Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Fine-tuning multimodal llms to follow zero-shot demonstrative instructions, 2024. 7, 40

  70. [70]

    Fine-tuning multimodal llms to follow zero-shot demonstrative instructions

    Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Hanwang Zhang, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, and Yueting Zhuang. Empowering vision-language models to follow interleaved vision-language instructions. arXiv preprint arXiv:2308.04152, 2023. 2, 12

  71. [71]

    Mvbench: A comprehensive multi-modal video understanding benchmark, 2023

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark, 2023. 10

  72. [72]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, 2024. 2

  73. [73]

    Mini-gemini: Mining the potential of multi-modality vision language models

    Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. Technical Report, 2024. 2

  74. [74]

    Storygan: A sequential conditional gan for story visualization

    Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, and Jianfeng Gao. Storygan: A sequential conditional gan for story visualization,

  75. [75]

    Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning, 2023

    Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, and Alan Yuille. Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning, 2023. 39

  76. [76]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023. 2

  77. [77]

    Vila: On pre-training for visual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699, 2024. 2, 11, 12

  78. [78]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015. 37, 39

  79. [79]

    Visual spatial reasoning

    Fangyu Liu, Guy Edward Toh Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 2023. 39

  80. [80]

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565,

Showing first 80 references.