Improving instruction-following in language models through activation steering

Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi · 2024 · arXiv 2410.12877

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

representative citing papers

Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.

Enabling Performant and Flexible Model-Internal Observability for LLM Inference

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.

Do Large Language Models Plan Answer Positions? Position Bias in Multiple-Choice Question Generation

cs.CL · 2026-05-03 · unverdicted · novelty 6.0

LLMs implicitly plan answer positions during MCQ generation, as shown by predictive signals in hidden representations and controllable shifts via activation steering.

How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals

cs.LG · 2026-04-24 · unverdicted · novelty 6.0

LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.

Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

cs.CL · 2026-05-11

citing papers explorer

Showing 5 of 5 citing papers.

Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic cs.LG · 2026-05-08 · unverdicted · none · ref 23
Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.
Enabling Performant and Flexible Model-Internal Observability for LLM Inference cs.LG · 2026-05-11 · unverdicted · none · ref 37
DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
Do Large Language Models Plan Answer Positions? Position Bias in Multiple-Choice Question Generation cs.CL · 2026-05-03 · unverdicted · none · ref 9
LLMs implicitly plan answer positions during MCQ generation, as shown by predictive signals in hidden representations and controllable shifts via activation steering.
How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals cs.LG · 2026-04-24 · unverdicted · none · ref 19
LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.
Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions cs.CL · 2026-05-11 · unreviewed · ref 13

Improving instruction-following in language models through activation steering

fields

years

verdicts

representative citing papers

citing papers explorer