pith. machine review for the scientific record.

Improved Baselines with Visual Instruction Tuning

Canonical reference. 80% of citing Pith papers cite this work as background.

36 Pith papers citing it
Background: 80% of classified citations
abstract

Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.
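
As a concrete illustration of the cross-modal connector described above, here is a minimal sketch in PyTorch. It assumes the two-layer GELU MLP design; the default dimensions (1024 for CLIP-ViT-L patch features, 5120 for a 13B LLM's embedding width) are illustrative, not prescriptive.

```python
import torch
import torch.nn as nn

class VisionLanguageProjector(nn.Module):
    """Sketch of an MLP cross-modal connector: maps frozen CLIP patch
    features into the LLM's token-embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        # two-layer MLP with GELU, replacing a single linear projection
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, vision_dim) from the vision encoder
        return self.proj(patches)  # (batch, num_patches, llm_dim)
```

The projected patch embeddings are concatenated with the text token embeddings before the language model's forward pass, which is what makes this connector so cheap relative to cross-attention alternatives.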

hub tools

citation-role summary

background 4 · method 1




representative citing papers

A Sanity Check on Composed Image Retrieval

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

The paper creates FISD, a controlled benchmark for composed image retrieval that removes query ambiguity via generative models, and proposes a multi-round agentic evaluation to assess models in interactive settings.

Revealing Interpretable Failure Modes of VLMs

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

REVELIO uncovers interpretable failure modes in VLMs by searching combinatorial concept spaces with diversity-aware beam search and Gaussian-process Thompson sampling, revealing vulnerabilities in autonomous driving and indoor robotics.
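
The sampling component can be sketched generically. Below is one Gaussian-process Thompson-sampling step over candidate concept combinations, assuming a fitted scikit-learn GP over candidate feature vectors; REVELIO's actual featurization and diversity-aware beam search are defined in the paper.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def thompson_pick(gp: GaussianProcessRegressor, candidates: np.ndarray) -> int:
    """One Thompson-sampling step: draw a single function realization
    from the GP posterior and pick the candidate that maximizes it.
    Randomness in the draw is what trades off exploration/exploitation."""
    draw = gp.sample_y(candidates, n_samples=1, random_state=None).ravel()
    return int(np.argmax(draw))
```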

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize a sparse autoencoder (SAE) that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
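
The steering mechanism itself is generic enough to sketch: encode an activation with a sparse autoencoder, nudge one concept latent, and decode back. The helpers below are hypothetical stand-ins, not DACO's API; the dictionary curation and safety objective are the paper's contribution.

```python
import torch
import torch.nn as nn

def steer_concept(h: torch.Tensor, encoder: nn.Module, decoder: nn.Module,
                  concept_id: int, strength: float) -> torch.Tensor:
    """Generic SAE concept steering (hypothetical helpers): boost
    (strength > 0) or suppress (strength < 0) a single concept latent
    in the sparse code of a model activation, then decode it back."""
    z = torch.relu(encoder(h))      # sparse, non-negative concept code
    z[..., concept_id] += strength  # intervene on one dictionary concept
    return decoder(z)               # steered activation for the forward pass
```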

Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM

cs.CV · 2026-03-29 · unverdicted · novelty 6.0

Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five benchmarks using pre-trained encoders.

OpenVLA: An Open-Source Vision-Language-Action Model

cs.RO · 2024-06-13 · unverdicted · novelty 6.0

OpenVLA achieves a 16.5% higher absolute task success rate than the 55B RT-2-X model across 29 tasks with 7x fewer parameters, while supporting effective fine-tuning and quantization without loss of performance.

Chameleon: Mixed-Modal Early-Fusion Foundation Models

cs.CL · 2024-05-16 · unverdicted · novelty 6.0

Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro on captioning, VQA, text, and image tasks.
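
Early fusion here means a single token stream: images are quantized into discrete codes and spliced into the text sequence, so one autoregressive transformer handles both modalities. A schematic sketch follows; the tokenizer and the begin/end-of-image sentinel tokens are assumptions, not Chameleon's actual vocabulary.

```python
def build_mixed_sequence(text_tokens: list[int], image_codes: list[int],
                         boi: int, eoi: int) -> list[int]:
    """Interleave text tokens with discrete image codes from a
    (hypothetical) VQ image tokenizer, bracketed by assumed
    begin/end-of-image tokens, yielding one flat sequence that a
    single transformer can model autoregressively."""
    return text_tokens + [boi] + image_codes + [eoi]
```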

Are We on the Right Way for Evaluating Large Vision-Language Models?

cs.CV · 2024-03-29 · conditional · novelty 6.0

Current LVLM benchmarks overestimate capabilities because many questions can be answered without images, owing to design flaws or data leakage. MMStar is a human-curated set of 1,500 vision-indispensable samples spanning 6 capabilities and 18 axes, with new metrics for data leakage and true multi-modal gain.
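
A hedged sketch of what the two metrics could compute, assuming gain is the accuracy improvement from actually seeing the images and leakage is how far an image-blind LVLM exceeds its text-only base LLM; MMStar's official definitions are in the paper.

```python
def gain_and_leakage(acc_with_images: float,
                     acc_no_images: float,
                     acc_base_llm: float) -> tuple[float, float]:
    """Assumed formulations (see MMStar for the exact definitions):
    gain    -- accuracy the LVLM owes to genuinely using the image;
    leakage -- accuracy on image-blind runs beyond the text-only base
               LLM, hinting the answers leaked via training data."""
    gain = acc_with_images - acc_no_images
    leakage = max(0.0, acc_no_images - acc_base_llm)
    return gain, leakage
```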

MMBench: Is Your Multi-modal Model an All-around Player?

cs.CV · 2023-07-12 · accept · novelty 6.0

MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
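
CircularEval is simple to state in code: a model gets credit for a question only if it picks the correct option under every circular rotation of the choice list, which filters out position-biased guessing. A minimal sketch, with `ask` standing in for whatever model-query callable is available:

```python
def circular_eval(ask, question: str, choices: list[str], answer_idx: int) -> bool:
    """Credit the model only if it answers correctly under every circular
    shift of the options. `ask(question, choices) -> int` is an assumed
    callable returning the index of the chosen option."""
    n = len(choices)
    for shift in range(n):
        rotated = choices[shift:] + choices[:shift]
        target = (answer_idx - shift) % n  # where the answer lands after rotation
        if ask(question, rotated) != target:
            return False
    return True
```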

Make Your LVLM KV Cache More Lightweight

cs.CV · 2026-05-01 · unverdicted · novelty 5.0

LightKV compresses the vision-token KV cache in LVLMs to 55% of its original size via prompt-guided cross-modality aggregation, roughly halving cache memory, cutting compute by 40%, and maintaining performance across benchmarks.
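
For intuition, shrinking a vision-token KV cache to roughly 55% can be as simple as pooling neighbouring tokens. The sketch below uses plain mean-pooling over contiguous groups; it is a generic baseline, not LightKV's prompt-guided cross-modality weighting.

```python
import torch

def pool_vision_kv(k: torch.Tensor, v: torch.Tensor, keep_ratio: float = 0.55):
    """Baseline KV compression: mean-pool contiguous groups of vision
    tokens down to keep_ratio of the original count (illustrative only;
    LightKV weights the aggregation with prompt-guided signals).
    k, v: (num_tokens, head_dim)."""
    n = k.shape[0]
    m = max(1, int(n * keep_ratio))                       # compressed length
    bounds = torch.linspace(0, n, m + 1).long().tolist()  # group boundaries
    k_out = torch.stack([k[a:b].mean(dim=0) for a, b in zip(bounds[:-1], bounds[1:])])
    v_out = torch.stack([v[a:b].mean(dim=0) for a, b in zip(bounds[:-1], bounds[1:])])
    return k_out, v_out
```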
