hub Canonical reference

Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee · 2023 · cs.CV · arXiv 2310.03744

Canonical reference. 80% of citing Pith papers cite this work as background.

36 Pith papers citing it

Background 80% of classified citations

open full Pith review browse 36 citing papers arXiv PDF

abstract

Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 method 1

citation-polarity summary

background 4 use method 1

claims ledger

abstract Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1

co-cited works

representative citing papers

FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

cs.AI · 2026-04-04 · unverdicted · novelty 8.0

FeynmanBench is the first benchmark for evaluating multimodal LLMs on diagrammatic reasoning with Feynman diagrams, revealing systematic failures in enforcing physical constraints and global topology.

A Sanity Check on Composed Image Retrieval

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

The paper creates FISD, a controlled benchmark for composed image retrieval that removes query ambiguity via generative models, and proposes a multi-round agentic evaluation to assess models in interactive settings.

Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.

Revealing Interpretable Failure Modes of VLMs

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

REVELIO uncovers interpretable failure modes in VLMs by searching combinatorial concept spaces with diversity-aware beam search and Gaussian-process Thompson sampling, revealing vulnerabilities in autonomous driving and indoor robotics.

When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

Layer-wise Laplacian energy of visual attention reveals hallucination emergence in MLLMs and enables LaSCD, a closed-form logit remapping strategy that mitigates hallucinations while preserving general performance.

Text-Conditional JEPA for Learning Semantically Rich Visual Representations

cs.LG · 2026-05-05 · unverdicted · novelty 6.0

TC-JEPA conditions masked feature prediction on text captions via sparse cross-attention to produce more semantically rich visual representations and outperforms contrastive methods on fine-grained tasks.

Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

cs.AI · 2026-05-03 · unverdicted · novelty 6.0 · 2 refs

A foresight-based local purification method simulates future agent interactions, detects infections via response diversity across personas, and applies targeted rollback or recursive diagnosis to cut maximum infection rates from over 95% to under 5.47% while preserving benign performance.

PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model

cs.AI · 2026-04-27 · unverdicted · novelty 6.0

PhysNote lets VLMs externalize physical knowledge into hierarchical self-generated notes, stabilizing spatio-temporal reasoning and yielding 56.68% accuracy on PhysBench with a 4.96% gain over the best multi-agent baseline.

HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs

cs.AI · 2026-04-14 · unverdicted · novelty 6.0

MemJack achieves 71.48% attack success rate on unmodified COCO val2017 images against Qwen3-VL-Plus by coordinating agents to map visual entities to malicious intents, apply multi-angle camouflage, and filter refusals via iterative nullspace projection while transferring strategies through a shared

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.

Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.

Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM

cs.CV · 2026-03-29 · unverdicted · novelty 6.0

Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five benchmarks using pre-trained encoders.

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

cs.AI · 2025-01-28 · unverdicted · novelty 6.0

Reinforcement learning post-training enables generalization to unseen textual rule variants and visual changes in foundation models, while supervised fine-tuning primarily leads to memorization.

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

cs.CV · 2024-12-06 · unverdicted · novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

OpenVLA: An Open-Source Vision-Language-Action Model

cs.RO · 2024-06-13 · unverdicted · novelty 6.0

OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.

Chameleon: Mixed-Modal Early-Fusion Foundation Models

cs.CL · 2024-05-16 · unverdicted · novelty 6.0

Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro on captioning, VQA, text, and image tasks.

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

cs.CL · 2024-04-22 · accept · novelty 6.0

Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.

Are We on the Right Way for Evaluating Large Vision-Language Models?

cs.CV · 2024-03-29 · conditional · novelty 6.0

Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6 capabilities and 18 axes with new metrics for leakage and true multi-modal gain.

SGLang: Efficient Execution of Structured Language Model Programs

cs.AI · 2023-12-12 · conditional · novelty 6.0

SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

cs.CV · 2023-11-21 · conditional · novelty 6.0

A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

cs.CV · 2023-11-16 · unverdicted · novelty 6.0

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

MMBench: Is Your Multi-modal Model an All-around Player?

cs.CV · 2023-07-12 · accept · novelty 6.0

MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.

Make Your LVLM KV Cache More Lightweight

cs.CV · 2026-05-01 · unverdicted · novelty 5.0

LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.

citing papers explorer

Showing 36 of 36 citing papers.

FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning cs.AI · 2026-04-04 · unverdicted · none · ref 28 · internal anchor
FeynmanBench is the first benchmark for evaluating multimodal LLMs on diagrammatic reasoning with Feynman diagrams, revealing systematic failures in enforcing physical constraints and global topology.
A Sanity Check on Composed Image Retrieval cs.CV · 2026-04-14 · unverdicted · none · ref 26 · internal anchor
The paper creates FISD, a controlled benchmark for composed image retrieval that removes query ambiguity via generative models, and proposes a multi-round agentic evaluation to assess models in interactive settings.
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment cs.CV · 2026-04-09 · unverdicted · none · ref 17 · internal anchor
Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
Revealing Interpretable Failure Modes of VLMs cs.AI · 2026-05-12 · unverdicted · none · ref 21 · internal anchor
REVELIO uncovers interpretable failure modes in VLMs by searching combinatorial concept spaces with diversity-aware beam search and Gaussian-process Thompson sampling, revealing vulnerabilities in autonomous driving and indoor robotics.
When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs cs.CV · 2026-05-12 · unverdicted · none · ref 39 · internal anchor
Layer-wise Laplacian energy of visual attention reveals hallucination emergence in MLLMs and enables LaSCD, a closed-form logit remapping strategy that mitigates hallucinations while preserving general performance.
Text-Conditional JEPA for Learning Semantically Rich Visual Representations cs.LG · 2026-05-05 · unverdicted · none · ref 8 · internal anchor
TC-JEPA conditions masked feature prediction on text captions via sparse cross-attention to produce more semantically rich visual representations and outperforms contrastive methods on fine-grained tasks.
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems cs.AI · 2026-05-03 · unverdicted · none · ref 22 · 2 links · internal anchor
A foresight-based local purification method simulates future agent interactions, detects infections via response diversity across personas, and applies targeted rollback or recursive diagnosis to cut maximum infection rates from over 95% to under 5.47% while preserving benign performance.
PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model cs.AI · 2026-04-27 · unverdicted · none · ref 8 · internal anchor
PhysNote lets VLMs externalize physical knowledge into hierarchical self-generated notes, stabilizing spatio-temporal reasoning and yielding 56.68% accuracy on PhysBench with a 4.96% gain over the best multi-agent baseline.
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering cs.AI · 2026-04-22 · unverdicted · none · ref 264 · internal anchor
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs cs.AI · 2026-04-14 · unverdicted · none · ref 22 · internal anchor
MemJack achieves 71.48% attack success rate on unmodified COCO val2017 images against Qwen3-VL-Plus by coordinating agents to map visual entities to malicious intents, apply multi-angle camouflage, and filter refusals via iterative nullspace projection while transferring strategies through a shared
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs cs.LG · 2026-04-10 · unverdicted · none · ref 47 · internal anchor
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models cs.CV · 2026-04-09 · unverdicted · none · ref 20 · internal anchor
Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM cs.CV · 2026-03-29 · unverdicted · none · ref 11 · internal anchor
Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five benchmarks using pre-trained encoders.
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training cs.AI · 2025-01-28 · unverdicted · none · ref 4 · internal anchor
Reinforcement learning post-training enables generalization to unseen textual rule variants and visual changes in foundation models, while supervised fine-tuning primarily leads to memorization.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling cs.CV · 2024-12-06 · unverdicted · none · ref 150 · internal anchor
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
OpenVLA: An Open-Source Vision-Language-Action Model cs.RO · 2024-06-13 · unverdicted · none · ref 44 · internal anchor
OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
Chameleon: Mixed-Modal Early-Fusion Foundation Models cs.CL · 2024-05-16 · unverdicted · none · ref 18 · internal anchor
Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro on captioning, VQA, text, and image tasks.
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone cs.CL · 2024-04-22 · accept · none · ref 15 · internal anchor
Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.
Are We on the Right Way for Evaluating Large Vision-Language Models? cs.CV · 2024-03-29 · conditional · none · ref 24 · internal anchor
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6 capabilities and 18 axes with new metrics for leakage and true multi-modal gain.
SGLang: Efficient Execution of Structured Language Model Programs cs.AI · 2023-12-12 · conditional · none · ref 27 · internal anchor
SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions cs.CV · 2023-11-21 · conditional · none · ref 30 · internal anchor
A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection cs.CV · 2023-11-16 · unverdicted · none · ref 60 · internal anchor
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
MMBench: Is Your Multi-modal Model an All-around Player? cs.CV · 2023-07-12 · accept · none · ref 32 · internal anchor
MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
Make Your LVLM KV Cache More Lightweight cs.CV · 2026-05-01 · unverdicted · none · ref 62 · internal anchor
LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.
Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation cs.CV · 2026-04-22 · unverdicted · none · ref 145 · internal anchor
MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.
CoGR-MoE: Concept-Guided Expert Routing with Consistent Selection and Flexible Reasoning for Visual Question Answering cs.CV · 2026-04-18 · unverdicted · none · ref 23 · internal anchor
CoGR-MoE improves VQA by using concept-guided expert routing with option feature reweighting and contrastive learning to achieve consistent yet flexible reasoning across answer options.
Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs cs.LG · 2026-04-17 · unverdicted · none · ref 32 · internal anchor
Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.
Qwen2.5-Omni Technical Report cs.CL · 2025-03-26 · conditional · none · ref 26 · internal anchor
Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.
Hallucination of Multimodal Large Language Models: A Survey cs.CV · 2024-04-29 · accept · none · ref 114 · internal anchor
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks cs.CV · 2023-12-21 · unverdicted · none · ref 92 · internal anchor
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise cs.CV · 2026-05-06 · unverdicted · none · ref 7 · 2 links · internal anchor
Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt and preprocessing fixes providing only partial relief.
Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification cs.CV · 2026-04-06 · unverdicted · none · ref 40 · internal anchor
Detection-guided prompting raises small VLM hazard F1 from 34.5% to 50.6% and BERTScore from 0.61 to 0.82 on construction images with only 2.5 ms added latency.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding cs.CV · 2025-01-22 · unverdicted · none · ref 49 · internal anchor
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites cs.CV · 2024-04-25 · unverdicted · none · ref 62 · internal anchor
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
Yi: Open Foundation Models by 01.AI cs.CL · 2024-03-07 · unverdicted · none · ref 48 · internal anchor
Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
A Survey on Hallucination in Large Vision-Language Models cs.CV · 2024-02-01 · unverdicted · none · ref 29 · internal anchor
This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

Improved Baselines with Visual Instruction Tuning

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer