pith. machine review for the scientific record.

arxiv: 2605.01167 · v1 · submitted 2026-05-01 · 💻 cs.LG · cs.AI

Recognition: unknown

Minimizing Collateral Damage in Activation Steering

Richard G. Baraniuk, Sina Alemohammad, Tam Nguyen, Tu Anh Nguyen

Pith reviewed 2026-05-09 18:53 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords activation steering · collateral damage · second-moment matrix · constrained optimization · large language models · feature directions · representation intervention

The pith

Steering LLMs by minimizing squared changes weighted by the empirical second-moment matrix reduces collateral damage to non-target features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes collateral damage as unintended shifts in alignment along non-target feature directions when intervening in model activations. Standard vector addition implicitly treats every non-target direction as equally costly to perturb, even when some directions matter more for unrelated capabilities. The authors recast steering as a constrained optimization that selects a new activation minimizing the expected squared change, with the quadratic penalty given by the empirical second-moment matrix of observed activations. This weighting automatically penalizes changes more heavily in directions where observed activations vary the most. If the approach works, steering becomes more surgical, and model performance on tasks unrelated to the target feature degrades less.

Core claim

Collateral damage arises because isotropic penalties treat every non-target direction as equally costly; replacing the uniform penalty with the quadratic form induced by the empirical second-moment matrix produces a steered activation that achieves the target direction while keeping total expected collateral cost low.

What carries the argument

A constrained optimization that minimizes the quadratic form δ^T Σ δ of the perturbation δ under the empirical second-moment matrix Σ = E[a a^T] of activations, subject to a target alignment constraint; the program and its closed form are sketched below.
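
From the setup stated in the paper (activation h ∈ R^p, target feature direction d ∈ R^p, and an alignment budget d^T x = α with α ∈ [−1, 1]), the program would be the equality-constrained quadratic below. The closed form is the textbook single-Lagrange-multiplier solution, assuming Σ ≻ 0; it is a sketch derived from the stated definitions, not a step checked against the paper's appendix.

    \[
    x^\star \;=\; \arg\min_{x \in \mathbb{R}^p} \; (x - h)^\top \Sigma \, (x - h)
    \quad \text{s.t.} \quad d^\top x = \alpha,
    \qquad \Sigma = \mathbb{E}\!\left[ a a^\top \right],
    \]
    \[
    x^\star \;=\; h \;+\; \frac{\alpha - d^\top h}{\,d^\top \Sigma^{-1} d\,}\, \Sigma^{-1} d .
    \]

Setting Σ = I collapses this to plain vector addition along d, which is the isotropy special case the paper argues against; an anisotropic Σ bends the update toward directions the penalty prices cheaply.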

If this is right

  • Steering interventions become more selective because directions with high observed variance are protected.
  • Model performance on tasks orthogonal to the steering objective degrades more slowly.
  • The same optimization framework can be applied at any layer where an empirical covariance can be estimated from a modest set of activations (a minimal estimate-then-solve sketch follows this list).
  • Steering vectors can be chosen to satisfy multiple target constraints simultaneously by extending the quadratic objective.
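
On the third point above: a minimal NumPy sketch of that estimate-then-solve loop, assuming the closed form given earlier. The names estimate_second_moment and steer are illustrative, and the ridge term is our numerical guard rather than a detail from the paper.

    import numpy as np

    def estimate_second_moment(acts: np.ndarray, ridge: float = 1e-4) -> np.ndarray:
        """Uncentered second moment Sigma = E[a a^T] from a calibration batch
        of activations with shape (n_samples, p). The small ridge keeps Sigma
        invertible when n_samples < p (our guard, not the paper's)."""
        n, p = acts.shape
        return acts.T @ acts / n + ridge * np.eye(p)

    def steer(h: np.ndarray, d: np.ndarray, alpha: float, sigma: np.ndarray) -> np.ndarray:
        """Closed form of  min_x (x - h)^T Sigma (x - h)  s.t.  d^T x = alpha,
        solving Sigma z = d rather than inverting Sigma explicitly."""
        z = np.linalg.solve(sigma, d)           # z = Sigma^{-1} d
        return h + (alpha - d @ h) / (d @ z) * z

    # Usage on synthetic anisotropic activations.
    rng = np.random.default_rng(0)
    acts = rng.normal(size=(2048, 64)) * np.linspace(0.1, 3.0, 64)  # per-dimension scales
    sigma = estimate_second_moment(acts)
    d = np.zeros(64)
    d[0] = 1.0                                  # target feature direction
    x = steer(acts[0], d, alpha=0.5, sigma=sigma)
    assert np.isclose(d @ x, 0.5)               # the alignment budget is met exactly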

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same second-moment weighting could be used to regularize other representation edits such as concept erasure or safety fine-tuning.
  • If the second-moment matrix changes rapidly across contexts, the method would need to be recomputed or approximated online (see the running-estimate sketch after this list).
  • The approach implicitly assumes that second-moment statistics estimated from a calibration set remain representative for new prompts.
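
On the online-approximation point: a standard option (our suggestion, not the paper's) is an exponential moving average of a a^T, where the decay rate trades adaptation speed against estimator variance.

    import numpy as np

    class OnlineSecondMoment:
        """Running estimate of E[a a^T] via an exponential moving average.
        The decay value is a knob of this sketch, not a paper parameter."""

        def __init__(self, p: int, decay: float = 0.99, ridge: float = 1e-4):
            self.sigma = ridge * np.eye(p)  # ridge seed keeps early solves stable
            self.decay = decay

        def update(self, a: np.ndarray) -> None:
            self.sigma = self.decay * self.sigma + (1.0 - self.decay) * np.outer(a, a)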

Load-bearing premise

That the empirical second-moment matrix correctly encodes the nonuniform costs of perturbing different non-target directions, and that minimizing the resulting weighted squared change produces less collateral damage in practice.

What would settle it

On a fixed set of steering prompts and evaluation tasks, measure whether the new method produces smaller drops in accuracy on unrelated benchmarks than standard addition while still shifting the target feature by the same amount.
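
A skeleton of that comparison, with the model-dependent pieces left as a caller-supplied callable; the accuracy function and the pre-matched steering vectors are hypothetical stand-ins, since no evaluation harness is specified here.

    from typing import Callable, Dict, Optional, Sequence
    import numpy as np

    def collateral_report(
        accuracy: Callable[[str, Optional[np.ndarray]], float],  # hypothetical: (benchmark, steering vector or None) -> accuracy
        benchmarks: Sequence[str],
        matched_vectors: Dict[str, np.ndarray],  # method name -> vector pre-calibrated to the same target shift
    ) -> Dict[str, Dict[str, float]]:
        """Accuracy drop per unrelated benchmark, per method, at a matched
        target-feature shift. Matching the shift first is the point of the
        test: otherwise a method looks surgical simply by steering less."""
        baseline = {b: accuracy(b, None) for b in benchmarks}
        return {
            method: {b: baseline[b] - accuracy(b, vec) for b in benchmarks}
            for method, vec in matched_vectors.items()
        }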

Figures

Figures reproduced from arXiv: 2605.01167 by Richard G. Baraniuk, Sina Alemohammad, Tam Nguyen, Tu Anh Nguyen.

Figure 1. Negative correlation between collateral damage and accuracy of steered models on Qwen2.5-14B-Instruct. Across six benchmarks, we observe a strong negative Pearson correlation (r < −0.9) between the average collateral damage and the model's accuracy, validating the collateral damage metric as a reliable proxy for performance degradation.
Figure 2. Unlike the rigid Slerp path (blue), our optimized COAST trajectory (red) traverses the manifold of valid steering vectors (green) in order to minimize the collateral damage.
Figure 3. Trade-off analysis: accuracy (%) vs. attack success rate (ASR) on tinyBenchmarks across four models. COAST (ours) consistently maintains higher task accuracy while driving higher attack success rates compared to the Angular Steering, ActAdd, and No Steering baselines.
original abstract

Activation steering is a method for controlling Large Language Model (LLM) behavior by intervening in its internal representations to increase the alignment with a specific target feature direction. However, standard interventions, such as vector addition, often cause "collateral damage", defined as unintended changes in the alignment of activations along other non-target feature directions. This damage occurs because standard methods implicitly assume the isotropy of non-target features. In this work, we provide a mathematical formalization of collateral damage and introduce a principled framework that models steering as a constrained optimization problem. Our method finds a new activation that minimizes the expected squared collateral change weighted by the empirical second-moment matrix of activations. This weighting encodes the nonuniform cost of the perturbation in different feature directions, in contrast to isotropic approaches that penalize changes uniformly in all feature directions. By accounting for the empirical second-moment of activations, our approach achieves more precise control while reducing the degradation of model performance on unrelated tasks.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that activation steering in LLMs produces collateral damage (unintended shifts along non-target feature directions) because standard vector-addition interventions implicitly assume isotropy. It formalizes collateral damage and recasts steering as a constrained optimization problem whose objective is to minimize the expected squared change in activations, weighted by the empirical second-moment matrix Σ = E[a a^T]. The weighting is asserted to capture nonuniform perturbation costs across directions, yielding more precise target control and less degradation on unrelated tasks than isotropic baselines.

Significance. If the central claim holds, the work supplies a parameter-free, data-driven refinement to activation steering that directly addresses a practical limitation of current editing techniques. The absence of new hyperparameters and the grounding of the penalty in observed activation statistics are genuine strengths that could make the method immediately usable in interpretability pipelines.

major comments (2)
  1. [Abstract / modeling choice] Abstract and modeling section: the assertion that Σ = E[a a^T] 'encodes the nonuniform cost of the perturbation in different feature directions' is load-bearing for the entire claim, yet the text supplies no derivation or toy example showing why variance-based weighting reduces task-specific performance loss. A low-variance direction that is critical to an unrelated capability would be under-penalized, so the optimization may not achieve the stated reduction in collateral damage.
  2. [Abstract] The abstract states that the method 'achieves more precise control while reducing the degradation of model performance on unrelated tasks,' but neither the full derivation of the constrained optimization nor any quantitative results or error analysis are provided. Without these, the central empirical benefit cannot be checked against the modeling choice.
minor comments (1)
  1. Notation for the second-moment matrix and the optimization variables should be introduced with explicit definitions before the formalization is used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We provide point-by-point responses to the major comments below, offering clarifications on the theoretical foundation and empirical support while committing to revisions that address the concerns raised.

point-by-point responses
  1. Referee: [Abstract / modeling choice] Abstract and modeling section: the assertion that Σ = E[a a^T] 'encodes the nonuniform cost of the perturbation in different feature directions' is load-bearing for the entire claim, yet the text supplies no derivation or toy example showing why variance-based weighting reduces task-specific performance loss. A low-variance direction that is critical to an unrelated capability would be under-penalized, so the optimization may not achieve the stated reduction in collateral damage.

    Authors: We agree that including an explicit derivation and a toy example would improve the manuscript's accessibility. The objective minimizes the quadratic form δ^T Σ δ, where Σ = E[a a^T] is the uncentered second-moment matrix. This penalizes perturbations more heavily in directions of high activation variance because such changes have a larger expected impact on the model's internal representations, as they deviate from the observed statistics. For the concern about low-variance directions: if a direction has low empirical variance, it contributes less to the typical activation patterns, so under-penalizing changes there may not lead to significant collateral damage in practice. However, we acknowledge that this is an assumption and will add a dedicated toy example in the appendix demonstrating the weighting's effect on a simple linear model (a hedged stand-in is sketched after these responses), along with a step-by-step derivation in Section 2. We will also discuss potential limitations regarding critical low-variance features. revision: yes

  2. Referee: [Abstract] The abstract states that the method 'achieves more precise control while reducing the degradation of model performance on unrelated tasks,' but neither the full derivation of the constrained optimization nor any quantitative results or error analysis are provided. Without these, the central empirical benefit cannot be checked against the modeling choice.

    Authors: The manuscript contains the full derivation of the constrained optimization in Section 3, formulating steering as arg min_δ δ^T Σ δ subject to the target feature alignment constraint. Quantitative results are reported in Section 4, where we evaluate the method on several LLMs and tasks, showing reduced performance degradation on unrelated benchmarks compared to standard steering, with statistical comparisons. We note that while error analysis (e.g., variance across runs) is partially included, we will expand it in the revision to include confidence intervals and additional ablation studies. The abstract summarizes these findings, but we will revise it to explicitly point to the relevant sections for clarity. revision: partial
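
Picking up the toy example promised in response 1: a two-axis sketch (ours, not the authors' appendix) that makes both sides of the exchange concrete. The Σ-weighted update routes nearly the whole perturbation through the low-variance axis, which is efficient if that axis is unimportant, and is exactly the failure mode the referee flags if it is not.

    import numpy as np

    # 2-D toy: axis 0 has high observed variance, axis 1 very low.
    sigma = np.diag([4.0, 0.01])               # empirical second moment E[a a^T]
    h = np.array([1.0, 0.0])                   # current activation
    d = np.array([1.0, 1.0]) / np.sqrt(2.0)    # target direction mixing both axes
    alpha = 1.2                                # alignment budget d^T x = alpha

    z = np.linalg.solve(sigma, d)              # Sigma^{-1} d
    x = h + (alpha - d @ h) / (d @ z) * z      # Sigma-weighted steering
    x_iso = h + (alpha - d @ h) / (d @ d) * d  # isotropic vector addition

    print(x - h)       # ~[0.002, 0.695]: perturbation lands on the low-variance axis
    print(x_iso - h)   # ~[0.349, 0.349]: the isotropic update splits it evenly
    for name, delta in [("weighted", x - h), ("isotropic", x_iso - h)]:
        print(name, "penalty:", delta @ sigma @ delta)  # ~0.005 vs ~0.487

If axis 1 were load-bearing for an unrelated capability, the weighted update would damage it the most, which is why the promised limitations discussion matters.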

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper formalizes collateral damage and proposes a constrained optimization that minimizes the expected squared change in non-target directions, weighted by the empirical second-moment matrix Σ = E[a a^T] computed directly from observed activations. This weighting is an external data-derived quantity rather than a parameter fitted to the target steering outcome or defined circularly in terms of the performance metric being optimized. No load-bearing steps reduce to self-citation, self-definition, or renaming of known results; the framework is a straightforward application of quadratic optimization with a data-driven Mahalanobis-style penalty, independent of the claimed reductions in collateral damage.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the second-moment matrix captures the relevant nonuniform costs and that the optimization objective correctly trades off target steering against collateral change.

axioms (1)
  • domain assumption Non-target feature directions have nonuniform perturbation costs that are captured by the empirical second-moment matrix of activations
    This replaces the isotropy assumption and is invoked to justify the weighted objective.

pith-pipeline@v0.9.0 · 5461 in / 1105 out tokens · 41764 ms · 2026-05-09T18:53:57.987251+00:00 · methodology

