pith. machine review for the scientific record.

arxiv: 2604.24693 · v1 · submitted 2026-04-27 · 💻 cs.CL

Recognition: unknown

Contextual Linear Activation Steering of Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:26 UTC · model grok-4.3

classification 💻 cs.CL
keywords contextual linear activation steering · language model steering · activation engineering · few-shot specialization · prompt adaptation · model control

The pith

Adapting activation steering strength to each prompt improves language model control with limited data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Linear activation steering modifies internal model activations to change behavior but applies one fixed strength to all inputs, which can yield uneven results across different prompts. The paper introduces a method to compute or learn a distinct steering strength for each context so the adjustment fits the specific input. Experiments on eleven benchmarks across four model families show this dynamic approach beats fixed-strength steering and matches or exceeds ReFT and LoRA when labeled data is scarce. A sympathetic reader would care because it supplies an efficient, low-data route to specializing large models without retraining their parameters.

Core claim

Contextual Linear Activation Steering computes per-prompt steering strengths rather than using a constant value, producing more consistent and higher-quality control over language model outputs than fixed-strength linear activation steering while remaining competitive with parameter-efficient fine-tuning techniques under data constraints.
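The distinction the claim draws can be sketched in a few lines. The contextual rule below, which scales the edit so each prompt's activation lands on a target projection along the steering direction, is an illustrative stand-in, not the paper's actual formula:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                  # toy hidden dimension
v = rng.normal(size=d)
v /= np.linalg.norm(v)                  # unit steering direction

def steer_fixed(h, alpha=2.0):
    """Standard linear activation steering: one strength for every prompt."""
    return h + alpha * v

def steer_contextual(h, target=2.0):
    """Per-prompt strength: close this prompt's gap to a target
    projection along v (an illustrative rule, not the paper's method)."""
    alpha = target - h @ v
    return h + alpha * v

h = rng.normal(size=d)                  # one prompt's activation
print(round(float(steer_contextual(h) @ v), 6))  # 2.0: projection lands on target
```

With a fixed alpha, the post-steering projection `h @ v + alpha` still varies from prompt to prompt, which is exactly the inconsistency the contextual variant is meant to remove.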

What carries the argument

Contextual Linear Activation Steering (CLAS), which determines input-specific steering strengths to adjust activations dynamically instead of applying uniform strength.

Load-bearing premise

Suitable context-dependent steering strengths can be computed or learned scalably without adding inconsistencies or heavy extra computation across diverse prompts.
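One low-cost way such strengths could be learned, offered here as a hypothetical sketch rather than the paper's procedure, is a linear map from a prompt's pooled activation to a scalar strength, fit on the few labeled examples available:

```python
import numpy as np

# Hypothetical: fit strength = activation @ w from a handful of labeled
# prompts, then apply it to new prompts at negligible inference cost.
rng = np.random.default_rng(1)
d, n = 16, 8                           # hidden dim, labeled prompts (toy)
X = rng.normal(size=(n, d))            # pooled activations per prompt
alpha_true = X @ rng.normal(size=d)    # strengths we pretend are "right"

w, *_ = np.linalg.lstsq(X, alpha_true, rcond=None)  # minimum-norm fit
pred = X @ w
print(np.allclose(pred, alpha_true))   # True on this noiseless toy system
```

A single d-dimensional weight vector adds one dot product per prompt, which is the kind of overhead the premise requires to stay small relative to ReFT or LoRA.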

What would settle it

A new suite of steering benchmarks where the context-dependent version shows no improvement or clear degradation relative to fixed-strength steering would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2604.24693 by Adityanarayanan Radhakrishnan, Brandon Hsu, Daniel Beaglehole, Mikhail Belkin.

Figure 1. Per-task improvement over LAS (∆ = method accuracy …)
Figure 2. Accuracy difference (∆) between the steered and original model (∆ = steered …)
Figure 3. Alpaca dataset prompt and evaluation templates
Figure 4. Grandiloquent dataset prompt and evaluation templates
Figure 5. GSM8K dataset prompt and evaluation templates
Figure 6. IMDb dataset prompt and evaluation templates
Figure 7. JailbreakBench dataset prompt and evaluation templates
Figure 8. LeetCode dataset prompt and evaluation templates
Figure 9. MMLU dataset prompt and evaluation templates
Figure 10. MNMT dataset prompt and evaluation templates
Figure 11. Repetition dataset prompt and evaluation templates
Figure 12. TLDR dataset prompt and evaluation templates
Figure 13. ToxicChat dataset prompt and evaluation templates
Figure 14. Alpaca-steered Llama-3.1-70B completions
Figure 15. Grandiloquent-steered Llama-3.1-70B completions
Figure 16. GSM8K-steered Llama-3.1-70B completions
Figure 17. IMDb-steered Llama-3.1-70B completions
Figure 18. JailbreakBench-steered Llama-3.1-70B completions
Figure 19. LeetCode-steered Llama-3.1-70B completions
Figure 20. MMLU-steered Llama-3.1-70B completions
Figure 21. MNMT-steered Llama-3.1-70B completions
Figure 22. Repetition-steered Llama-3.1-70B completions
Figure 23. TLDR-steered Llama-3.1-70B completions
Figure 24. ToxicChat-steered Llama-3.1-70B completions
read the original abstract

Linear activation steering is a powerful approach for eliciting the capabilities of large language models and specializing their behavior using limited labeled data. While effective, existing methods often apply a fixed steering strength to all tokens, resulting in inconsistent steering quality across diverse input prompts. In this work, we introduce Contextual Linear Activation Steering (CLAS), a method that dynamically adapts linear activation steering to context-dependent steering strengths. Across eleven steering benchmarks and four model families, it consistently outperforms standard linear activation steering and matches or exceeds the performance of ReFT and LoRA in settings with limited labeled data. We therefore propose CLAS as a scalable, interpretable, and accurate method for specializing and steering large language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces Contextual Linear Activation Steering (CLAS), which extends linear activation steering by dynamically computing context-dependent steering strengths instead of using a fixed value for all tokens. The central empirical claim is that CLAS outperforms standard linear activation steering across eleven steering benchmarks and four model families while matching or exceeding ReFT and LoRA performance in limited-labeled-data regimes, positioning CLAS as a scalable and interpretable alternative for LLM specialization.

Significance. If the reported gains hold under rigorous controls, CLAS would represent a targeted, low-overhead improvement to activation steering that mitigates prompt-dependent inconsistency without sacrificing the method's interpretability or data efficiency. The work's strength lies in its direct empirical comparison to both fixed-strength baselines and parameter-efficient fine-tuning methods on a broad benchmark suite.

minor comments (3)
  1. [§3] §3 (Method): The precise functional form used to derive per-token or per-prompt steering strengths from context should be stated explicitly, including any learned parameters or heuristics, to allow replication and to clarify why the approach remains more scalable than ReFT/LoRA.
  2. [§4] §4 (Experiments): Table 1 and Figure 2 would benefit from reporting standard deviations across multiple random seeds or prompt shuffles, as the headline claim of 'consistent' outperformance rests on these aggregate numbers.
  3. [§4.2] The paper should include a brief ablation isolating the contribution of the context-adaptation mechanism versus simply using a stronger fixed steering vector.
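The ablation asked for in the third comment can be illustrated on synthetic data: when the ideal steering strength genuinely varies across prompts, no single constant (however well tuned) matches an oracle contextual rule. The toy setup below is an assumption-laden sketch, not the paper's experiment:

```python
import numpy as np

# Toy ablation: best fixed strength vs. an oracle contextual rule
# on prompts whose ideal strength varies.
rng = np.random.default_rng(2)
ideal = rng.uniform(0.5, 3.0, size=100)        # per-prompt ideal strengths

def error(alphas):                             # squared error vs. ideal
    return float(np.mean((alphas - ideal) ** 2))

best_fixed = min(error(np.full_like(ideal, a))
                 for a in np.linspace(0.5, 3.0, 26))
contextual = error(ideal)                      # oracle: matches each prompt
print(best_fixed > contextual)                 # True: one alpha can't fit all
```

If a stronger fixed vector closed most of this gap on the real benchmarks, the context-adaptation mechanism would be doing little work; that is the comparison the ablation should isolate.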

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision for our manuscript on Contextual Linear Activation Steering (CLAS). We are pleased that the work is viewed as a targeted improvement to activation steering with strong empirical support across benchmarks and model families. As the report lists no specific major comments, we have no point-by-point rebuttals to provide at this stage.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces CLAS as an empirical method for context-dependent linear activation steering and supports its claims solely through performance comparisons on eleven benchmarks across four model families. No equations, derivations, or mathematical chains are described in the provided abstract or structure; the central results are external benchmark outcomes that remain independently falsifiable and do not reduce to self-definitions, fitted parameters renamed as predictions, or self-citation chains. Background citations to prior steering work function as standard context rather than load-bearing premises that collapse into the present contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented physical entities are introduced; the contribution is an empirical algorithmic variant evaluated on benchmarks.

pith-pipeline@v0.9.0 · 5411 in / 1118 out tokens · 64226 ms · 2026-05-08T03:26:19.504753+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

    cs.CL · 2026-05 · unverdicted · novelty 7.0

    GCAD reduces coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1 in persona-steering tasks by using gated attention-delta interventions from system prompts.

  2. Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    GCAD steering extracts prompt-based attention deltas and gates them at token level, cutting coherence drift from -18.6 to -1.9 while raising trait expression at turn 10 from 78 to 93 on multi-turn persona benchmarks.

Reference graph

Works this paper leans on

40 extracted references · 19 canonical work pages · cited by 1 Pith paper · 8 internal anchors

  1. [1] Abirate. English quotes dataset. https://huggingface.co/datasets/Abirate/english_quotes, 2023.

  2. [2] Y. Adi, E. Kermany, Y. Belinkov, O. Lavi, and Y. Goldberg. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=BJh6Ztuxl.

  3. [3] G. Alain and Y. Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.

  4. [4] S. Azizi, E. B. Potraghloo, and M. Pedram. Activation steering for chain-of-thought compression. arXiv preprint arXiv:2507.04742, 2025.

  5. [5] L. Bartoszcze, S. Munshi, B. Sukidi, J. Yen, Z. Yang, D. Williams-King, L. Le, K. Asuzu, and C. Maple. Representation engineering for large-language models: Survey and research challenges. arXiv preprint arXiv:2502.17601, 2025.

  6. [6] D. Beaglehole, A. Radhakrishnan, E. Boix-Adserà, and M. Belkin. Toward universal steering and monitoring of AI models. Science, 391(6787):787–792, 2026. doi: 10.1126/science.aea6792. URL https://www.science.org/doi/abs/10.1126/science.aea6792.

  7. [7] Y. Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, 2022.

  8. [8] Y. Belinkov, N. Durrani, F. Dalvi, H. Sajjad, and J. Glass. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872, Vancouver, Canada, July 2017. Association for Computational Linguistics.

  9. [9] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. In Conference on Neural Information Processing Systems, 2020.

  10. [10] P. Chao, A. Robey, E. Dobler, C. Butoi, L. He, E. Myers, Z. Doan, A. Chen, P. Chaudhari, and A. Zou. JailbreakBench: An open robustness benchmark for jailbreaking LLMs. https://huggingface.co/datasets/jailbreakhub/jailbreakbench, 2024.

  11. [11] R. Chen, A. Arditi, H. Sleight, O. Evans, and J. Lindsey. Persona vectors: Monitoring and controlling character traits in language models. arXiv preprint arXiv:2507.21509, 2025.

  12. [12] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

  13. [13] P. Davarmanesh, A. Wilson, and A. Radhakrishnan. Efficient and accurate steering of large language models through attention-guided feature learning, 2026. URL https://arxiv.org/abs/2602.00333.

  14. [14] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  15. [15] Greengerong. LeetCode benchmark dataset. https://huggingface.co/datasets/greengerong/leetcode, 2024.

  16. [16] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. International Conference on Learning Representations, 2021.

  17. [18] URL http://arxiv.org/abs/1909.03368.

  18. [19] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

  19. [20] Y. Jiang, G. Rajendran, P. K. Ravikumar, B. Aragam, and V. Veitch. On the origins of linear representations in large language models. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research.

  20. [21] K. Konen, S. Jentzsch, D. Diallo, P. Schütt, O. Bensch, R. El Baff, D. Opitz, and T. Hecking. Style vectors for steering generative large language models. In Findings of the Association for Computational Linguistics: EACL 2024, pages 782–802, Mar. 2024.

  21. [22] K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=aLLuYpn83y.

  22. [23] Z. Lin, Z. Wang, Y. Tong, Y. Wang, Y. Guo, Y. Wang, and J. Shang. ToxicChat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4694–4702, Singapore, Dec. 2023. Association for Computational Linguistics.

  23. [24] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, 2011.

  24. [25] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Conference on Neural Information Processing Systems, volume 26, 2013.

  25. [26] T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, 2013.

  26. [27] N. Nanda, A. Lee, and M. Wattenberg. Emergent linear representations in world models of self-supervised sequence models. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 16–30, Singapore, Dec. 2023. Association for Computational Linguistics.

  27. [28] K. Park, Y. J. Choe, and V. Veitch. The linear representation hypothesis and the geometry of large language models. In Workshop on Causal Representation Learning at Advances in Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=T0PoOJg8cK.

  28. [29] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

  29. [30] A. Radhakrishnan, D. Beaglehole, P. Pandit, and M. Belkin. Mechanism for feature learning in neural networks and backpropagation-free machine learning models. Science, 383(6690):1461–1467, 2024. doi: 10.1126/science.adi5639.

  30. [31] N. Rimsky, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. Turner. Steering Llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 15504–15522, Aug. 2024.

  31. [32] S. Syed, M. Voelske, M. Potthast, and B. Stein. Dataset for generating TL;DR, 2018.

  32. [33] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. Stanford Alpaca: An instruction-following Llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

  33. [34] I. Tenney, D. Das, and E. Pavlick. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1452.

  34. [35] A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid. Steering language models with activation engineering. arXiv preprint arXiv:2308.10248, 2023.

  35. [36] T. van der Weij, M. Poesio, and N. Schoots. Extending activation steering to broad skills and multiple behaviours. arXiv preprint arXiv:2403.05767, 2024.

  36. [37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.

  37. [38] Z. Wu, A. Arora, Z. Wang, A. Geiger, D. Jurafsky, C. D. Manning, and C. Potts. ReFT: Representation finetuning for language models. Advances in Neural Information Processing Systems, 37:63908–63962, 2024.

  38. [39] Q. A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y.-C. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, Z. Qiu, …

  39. [40] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 1(2):1–124, 2023.

  40. [41] A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A.-K. Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023.