LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu · 2023 · cs.CV · arXiv 2303.16199

Canonical reference. 95% of citing Pith papers cite this work as background.

51 Pith papers citing it

Background 95% of classified citations

abstract

We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter introduces only 1.2M learnable parameters on top of the frozen LLaMA 7B model, and takes less than one hour to fine-tune on 8 A100 GPUs. Specifically, we adopt a set of learnable adaption prompts and prepend them to the word tokens at higher transformer layers. Then, a zero-initialized attention mechanism with zero gating is proposed, which adaptively injects the new instructional cues into LLaMA while effectively preserving its pre-trained knowledge. With our efficient training, LLaMA-Adapter can generate high-quality responses, comparable to Alpaca with fully fine-tuned 7B parameters. Besides language commands, our approach can be simply extended to multi-modal instructions to learn an image-conditioned LLaMA model, which achieves superior reasoning performance on ScienceQA and COCO Caption benchmarks. Furthermore, we also evaluate the zero-initialized attention mechanism for fine-tuning other pre-trained models (ViT, RoBERTa) on traditional vision and language tasks, demonstrating the superior generalization capacity of our approach. Code is released at https://github.com/OpenGVLab/LLaMA-Adapter.
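
The zero-gated attention described in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the released implementation: the function names (gated_attention, frozen_attention, prompt_parameter_count), the single-query, single-head setup, and the choice to normalize prompt scores and word-token scores with separate softmaxes are all hypothetical. The sizes 10 x 30 x 4096 in the usage note are likewise assumed, picked because they multiply out to roughly the 1.2M parameters quoted above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, values):
    """Combine value vectors using the given attention weights."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def frozen_attention(query, word_keys, word_values):
    """Ordinary scaled dot-product attention over the word tokens only."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in word_keys])
    return weighted_sum(weights, word_values)


def gated_attention(query, word_keys, word_values,
                    prompt_keys, prompt_values, gate):
    """Attention with an extra, gated contribution from adaption prompts.

    Assumption: prompt scores get their own softmax and are scaled by a
    learnable scalar `gate` that starts at zero, so the prompts contribute
    nothing at the start of training.
    """
    scale = 1.0 / math.sqrt(len(query))
    word_out = frozen_attention(query, word_keys, word_values)
    prompt_w = softmax([dot(query, k) * scale for k in prompt_keys])
    prompt_out = weighted_sum(prompt_w, prompt_values)
    return [w + gate * p for w, p in zip(word_out, prompt_out)]


def prompt_parameter_count(prompt_len, num_layers, hidden_dim):
    """Learnable prompt parameters: one vector per prompt slot per layer."""
    return prompt_len * num_layers * hidden_dim
```

With gate set to 0.0, gated_attention returns exactly what frozen_attention returns, which is the sense in which a zero-initialized gate leaves the pre-trained behavior untouched at the first training step; as the gate moves away from zero, the prompt contribution is blended in. Under the assumed sizes, prompt_parameter_count(10, 30, 4096) gives 1,228,800, in line with the "1.2M learnable parameters" figure.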

citation-role summary

background: 19 · baseline: 1

citation-polarity summary

background: 19 · baseline: 1

representative citing papers

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

cs.CL · 2023-11-27 · unverdicted · novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

cs.CL · 2024-09-04 · accept · novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

Instruction Tuning with GPT-4

cs.CL · 2023-04-06 · unverdicted · novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation

cs.RO · 2026-06-11 · unverdicted · novelty 7.0

FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.

HARP: Efficient Data Selection for Finetuning Large Language Models

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

HARP is a train-based data selector for LLM finetuning that uses hierarchical active region pruning and empirical Bayes posteriors to achieve up to 8.9 point gains with roughly 7 times fewer training examples.

P²-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

P²-DPO generates on-policy preference pairs targeting focus-and-enhance perception and visual robustness, combined with a calibration loss, to reduce hallucinations in LVLMs more effectively than human-feedback baselines.

EchoVQA: Enabling Conversational Assistance for Point-of-Care Cardiac Ultrasound

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

EchoVQA is the first large-scale VQA dataset for echocardiography spanning high- and low-quality images across views, with acquisition guidance questions, paired with a low-parameter multimodal prompt model that reports SOTA on several benchmarks.

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

cs.CV · 2025-02-06 · unverdicted · novelty 7.0

WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

cs.AI · 2024-07-01 · accept · novelty 7.0

WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

cs.CV · 2023-03-28 · conditional · novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

cs.CV · 2023-10-17 · accept · novelty 7.0

Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.

Visual Instruction Tuning

cs.CV · 2023-04-17 · unverdicted · novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

cs.CV · 2026-06-15 · unverdicted · novelty 6.0

Qwen-RobotWorld is a language-conditioned video world model using Double-Stream MMDiT, an 8.6M-frame embodied corpus, and progressive curriculum training that ranks first on EWMBench and DreamGen Bench.

Lance: Unified Multimodal Modeling by Multi-Task Synergy

cs.CV · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.

UniMind: Unleashing the Power of LLMs for Unified Multi-Task Brain Decoding

cs.HC · 2025-06-23 · unverdicted · novelty 6.0

UniMind unifies multi-task brain decoding from EEG by bridging signals to LLMs via a Neuro-Language Connector and dynamic task queries, outperforming prior models by 12% on average across ten datasets.

MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems

cs.CV · 2025-03-19 · unverdicted · novelty 6.0

MathFlow decouples perception and inference stages in MLLMs for visual math, with a dedicated perception model delivering gains on the FlowVerse benchmark when paired with existing reasoners.

HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

cs.CV · 2025-03-13 · unverdicted · novelty 6.0

HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.

What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction

cs.CV · 2024-07-11 · unverdicted · novelty 6.0

Introduces the QEVD benchmark for asynchronous situated interaction in fitness coaching and proposes a streaming baseline to address limitations of existing vision-language models.

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

cs.CV · 2023-11-28 · accept · novelty 6.0

MVBench is a benchmark of 20 temporal video understanding tasks built by transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

cs.CL · 2023-06-01 · conditional · novelty 6.0

AWQ quantizes LLM weights to low bits by scaling salient channels based on activation statistics, outperforming prior methods on language, coding, math, and multi-modal benchmarks.

Otter: A Multi-Modal Model with In-Context Instruction Tuning

cs.CV · 2023-05-05 · unverdicted · novelty 6.0

Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.

LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.

ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

cs.RO · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

citing papers explorer

Showing 28 of 28 citing papers after filters.

P²-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization cs.CV · 2026-06-02 · unverdicted · none · ref 116 · internal anchor
P²-DPO generates on-policy preference pairs targeting focus-and-enhance perception and visual robustness, combined with a calibration loss, to reduce hallucinations in LVLMs more effectively than human-feedback baselines.
EchoVQA: Enabling Conversational Assistance for Point-of-Care Cardiac Ultrasound cs.CV · 2026-05-22 · unverdicted · none · ref 25 · internal anchor
EchoVQA is the first large-scale VQA dataset for echocardiography spanning high- and low-quality images across views, with acquisition guidance questions, paired with a low-parameter multimodal prompt model that reports SOTA on several benchmarks.
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs cs.CV · 2025-02-06 · unverdicted · none · ref 83 · internal anchor
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention cs.CV · 2023-03-28 · conditional · none · ref 282 · internal anchor
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation cs.CV · 2026-04-20 · unverdicted · none · ref 182
AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V cs.CV · 2023-10-17 · accept · none · ref 57
Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
Visual Instruction Tuning cs.CV · 2023-04-17 · unverdicted · none · ref 59
LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation cs.CV · 2026-06-15 · unverdicted · none · ref 46 · internal anchor
Qwen-RobotWorld is a language-conditioned video world model using Double-Stream MMDiT, an 8.6M-frame embodied corpus, and progressive curriculum training that ranks first on EWMBench and DreamGen Bench.
Lance: Unified Multimodal Modeling by Multi-Task Synergy cs.CV · 2026-05-18 · unverdicted · none · ref 146 · 2 links · internal anchor
Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems cs.CV · 2025-03-19 · unverdicted · none · ref 78 · internal anchor
MathFlow decouples perception and inference stages in MLLMs for visual math, with a dedicated perception model delivering gains on the FlowVerse benchmark when paired with existing reasoners.
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model cs.CV · 2025-03-13 · unverdicted · none · ref 8 · internal anchor
HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.
What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction cs.CV · 2024-07-11 · unverdicted · none · ref 66 · internal anchor
Introduces the QEVD benchmark for asynchronous situated interaction in fitness coaching and proposes a streaming baseline to address limitations of existing vision-language models.
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark cs.CV · 2023-11-28 · accept · none · ref 103 · internal anchor
MVBench is a benchmark of 20 temporal video understanding tasks built by transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.
Otter: A Multi-Modal Model with In-Context Instruction Tuning cs.CV · 2023-05-05 · unverdicted · none · ref 100 · internal anchor
Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.
ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning cs.CV · 2026-05-08 · unverdicted · none · ref 66
ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions cs.CV · 2023-11-21 · conditional · none · ref 61
A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection cs.CV · 2023-11-16 · unverdicted · none · ref 81
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models cs.CV · 2023-08-13 · unverdicted · none · ref 33
IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
CLIP-SVD: Efficient and Interpretable Vision-Language Adaptation via Singular Values cs.CV · 2025-09-03 · unverdicted · none · ref 79 · internal anchor
CLIP-SVD performs parameter-efficient adaptation of CLIP by fine-tuning singular values from SVD of weight matrices, reporting SOTA few-shot accuracy on 21 datasets plus a language-based interpretability analysis.
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding cs.CV · 2025-01-09 · unverdicted · none · ref 89 · internal anchor
LLaVA-Octopus introduces instruction-driven adaptive fusion of multiple visual projectors in a multimodal LLM to improve video understanding performance.
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning cs.CV · 2024-04-25 · conditional · none · ref 51 · internal anchor
A temporal pooling layer added to LLaVA smooths video feature distributions and lifts performance on dense video captioning and QA to new SOTA levels without extra parameters.
AppAgent: Multimodal Agents as Smartphone Users cs.CV · 2023-12-21 · unverdicted · none · ref 44 · internal anchor
AppAgent lets large language models operate diverse smartphone apps via visual interactions and learns app usage from exploration or demonstrations.
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model cs.CV · 2023-04-28 · conditional · none · ref 72 · internal anchor
LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.
LLaVA-OneVision: Easy Visual Task Transfer cs.CV · 2024-08-06 · unverdicted · none · ref 164
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks cs.CV · 2023-12-21 · unverdicted · none · ref 181
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning cs.CV · 2026-05-05 · unverdicted · none · ref 35
UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models cs.CV · 2023-08-02 · unverdicted · none · ref 41
OpenFlamingo provides open-source autoregressive vision-language models that achieve 80-89% of Flamingo performance on seven vision-language datasets.
A Survey on Multimodal Large Language Models cs.CV · 2023-06-23 · accept · none · ref 78 · internal anchor
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
