pith. sign in

hub Mixed citations

Improved Baselines with Visual Instruction Tuning

Mixed citation behavior. Most common role is background (69%).

95 Pith papers citing it
Background 69% of classified citations
abstract

Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.

hub tools

citation-role summary

background 19 method 5 baseline 4 dataset 1

citation-polarity summary

claims ledger

  • abstract Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1

co-cited works

representative citing papers

Vision Harnessing Agent for Open Ad-hoc Segmentation

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.

A Sanity Check on Composed Image Retrieval

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

The paper creates FISD, a controlled benchmark for composed image retrieval that removes query ambiguity via generative models, and proposes a multi-round agentic evaluation to assess models in interactive settings.

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

cs.CV · 2024-06-24 · unverdicted · novelty 7.0

Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.

EventDrive: Event Cameras for Vision-Language Driving Intelligence

cs.CV · 2026-06-16 · unverdicted · novelty 6.0

EventDrive supplies a multi-task benchmark and EventDrive-VLM architecture that fuses event data, RGB, and language supervision, reporting gains in temporal precision and motion awareness for driving intelligence.

Your Embedding Model is SMARTer Than You Think

cs.IR · 2026-05-24 · unverdicted · novelty 6.0

SMART unlocks latent multi-vector capabilities in single-vector embedding models by applying late interaction to frozen hidden states shaped by contrastive training, yielding consistent gains on MMEB-V2 and visual document retrieval.

citing papers explorer

Showing 50 of 95 citing papers.