hub Canonical reference

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu · 2023 · cs.CV · arXiv 2303.16199

Canonical reference. 95% of citing Pith papers cite this work as background.

51 Pith papers citing it

Background 95% of classified citations

open full Pith review browse 51 citing papers arXiv PDF

abstract

We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter only introduces 1.2M learnable parameters upon the frozen LLaMA 7B model, and costs less than one hour for fine-tuning on 8 A100 GPUs. Specifically, we adopt a set of learnable adaption prompts, and prepend them to the word tokens at higher transformer layers. Then, a zero-initialized attention mechanism with zero gating is proposed, which adaptively injects the new instructional cues into LLaMA, while effectively preserves its pre-trained knowledge. With our efficient training, LLaMA-Adapter can generate high-quality responses, comparable to Alpaca with fully fine-tuned 7B parameters. Besides language commands, our approach can be simply extended to multi-modal instructions for learning image-conditioned LLaMA model, which achieves superior reasoning performance on ScienceQA and COCO Caption benchmarks. Furthermore, we also evaluate the zero-initialized attention mechanism for fine-tuning other pre-trained models (ViT, RoBERTa) on traditional vision and language tasks, demonstrating the superior generalization capacity of our approach. Code is released at https://github.com/OpenGVLab/LLaMA-Adapter.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 19 baseline 1

citation-polarity summary

background 19 baseline 1

representative citing papers

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

cs.CL · 2023-11-27 · unverdicted · novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

cs.CL · 2024-09-04 · accept · novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

Instruction Tuning with GPT-4

cs.CL · 2023-04-06 · unverdicted · novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation

cs.RO · 2026-06-11 · unverdicted · novelty 7.0

FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.

HARP: Efficient Data Selection for Finetuning Large Language Models

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

HARP is a train-based data selector for LLM finetuning that uses hierarchical active region pruning and empirical Bayes posteriors to achieve up to 8.9 point gains with roughly 7 times fewer training examples.

P$^2$-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

P²-DPO generates on-policy preference pairs targeting focus-and-enhance perception and visual robustness, combined with a calibration loss, to reduce hallucinations in LVLMs more effectively than human-feedback baselines.

EchoVQA: Enabling Conversational Assistance for Point-of-Care Cardiac Ultrasound

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

EchoVQA is the first large-scale VQA dataset for echocardiography spanning high- and low-quality images across views, with acquisition guidance questions, paired with a low-parameter multimodal prompt model that reports SOTA on several benchmarks.

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

cs.CV · 2025-02-06 · unverdicted · novelty 7.0

WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

cs.AI · 2024-07-01 · accept · novelty 7.0

WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

cs.CV · 2023-03-28 · conditional · novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

cs.CV · 2023-10-17 · accept · novelty 7.0

Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.

Visual Instruction Tuning

cs.CV · 2023-04-17 · unverdicted · novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

cs.CV · 2026-06-15 · unverdicted · novelty 6.0

Qwen-RobotWorld is a language-conditioned video world model using Double-Stream MMDiT, an 8.6M-frame embodied corpus, and progressive curriculum training that ranks first on EWMBench and DreamGen Bench.

Lance: Unified Multimodal Modeling by Multi-Task Synergy

cs.CV · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.

UniMind: Unleashing the Power of LLMs for Unified Multi-Task Brain Decoding

cs.HC · 2025-06-23 · unverdicted · novelty 6.0

UniMind unifies multi-task brain decoding from EEG by bridging signals to LLMs via a Neuro-Language Connector and dynamic task queries, outperforming prior models by 12% on average across ten datasets.

MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems

cs.CV · 2025-03-19 · unverdicted · novelty 6.0

MathFlow decouples perception and inference stages in MLLMs for visual math, with a dedicated perception model delivering gains on the FlowVerse benchmark when paired with existing reasoners.

HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

cs.CV · 2025-03-13 · unverdicted · novelty 6.0

HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.

What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction

cs.CV · 2024-07-11 · unverdicted · novelty 6.0

Introduces the QEVD benchmark for asynchronous situated interaction in fitness coaching and proposes a streaming baseline to address limitations of existing vision-language models.

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

cs.CV · 2023-11-28 · accept · novelty 6.0

MVBench is a benchmark of 20 temporal video understanding tasks built by transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

cs.CL · 2023-06-01 · conditional · novelty 6.0

AWQ quantizes LLM weights to low bits by scaling salient channels based on activation statistics, outperforming prior methods on language, coding, math, and multi-modal benchmarks.

Otter: A Multi-Modal Model with In-Context Instruction Tuning

cs.CV · 2023-05-05 · unverdicted · novelty 6.0

Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.

LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.

ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

cs.RO · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

citing papers explorer

Showing 7 of 7 citing papers after filters.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI cs.CL · 2023-11-27 · unverdicted · none · ref 89 · internal anchor
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark cs.CL · 2024-09-04 · accept · none · ref 67
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
Instruction Tuning with GPT-4 cs.CL · 2023-04-06 · unverdicted · none · ref 15
GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration cs.CL · 2023-06-01 · conditional · none · ref 37 · internal anchor
AWQ quantizes LLM weights to low bits by scaling salient channels based on activation statistics, outperforming prior methods on language, coding, math, and multi-modal benchmarks.
Multimodal Chain-of-Thought Reasoning in Language Models cs.CL · 2023-02-02 · accept · none · ref 40
Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.
PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models cs.CL · 2025-12-02 · unverdicted · none · ref 81 · internal anchor
PEFT-Factory supplies a ready-to-use, extensible codebase that unifies 19 PEFT methods and evaluation pipelines for fine-tuning large autoregressive language models.
A Comprehensive Overview of Large Language Models cs.CL · 2023-07-12 · unverdicted · none · ref 285 · internal anchor
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer