Investigating Linguistic Steering: An Analysis of Adjectival Effects Across Large Language Model Architectures

Lars Malmqvist

arxiv: 2606.20572 · v1 · pith:K3FOQTV6new · submitted 2026-04-28 · 💻 cs.CL · cs.AI


Pith reviewed 2026-07-01 08:58 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords Shapley values · linguistic steering · adjective effects · LLM alignment · MMLU benchmark · model families · interaction effects · prompt position


The pith

Adjectives steer large language models through non-additive interactions that depend on model family and prompt position.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a Shapley value framework to measure the steering impact of adjectives on LLM performance across models and the MMLU benchmark. It establishes that certain adjectives have outsized effects, but these effects correlate within model families and shift with syntactic context. Larger models exhibit complex interactions where adjectives can amplify or reverse each other, while smaller models react more literally. This matters because it shows that reliable control of LLMs requires accounting for these interactions rather than relying on fixed prompting rules.

Core claim

Using Shapley values to attribute performance changes to individual adjectives, the analysis reveals a family effect in sensitivity profiles across models like o3, gpt-4o-mini, phi-3, llama-3-70b, and deepseek-r1. It further demonstrates non-additive interaction effects in larger models, where adjectives synergistically amplify, antagonistically dampen, or reverse impacts, contrasting with more literal responses in smaller models.

What carries the argument

The Shapley value attribution method applied to adjective tokens, which quantifies each adjective's marginal contribution to the model's benchmark score while accounting for interactions.
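As a rough illustration of what such an attribution involves, the sketch below computes exact Shapley values for a two-adjective game. It is a minimal example, not the paper's implementation: the function name `shapley_values`, the adjectives "careful" and "brief", and the accuracy numbers in `TOY_ACCURACY` are all hypothetical stand-ins for benchmark scores measured with each subset of adjectives present in the prompt.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a set function `value` over `players`.

    `value` takes a frozenset of players and returns a score, for example
    benchmark accuracy with exactly those adjectives in the prompt.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical accuracies for every subset of two adjectives (assumed numbers).
TOY_ACCURACY = {
    frozenset(): 0.70,
    frozenset({"careful"}): 0.74,
    frozenset({"brief"}): 0.68,
    frozenset({"careful", "brief"}): 0.71,
}


def toy_value(coalition):
    return TOY_ACCURACY[frozenset(coalition)]


phi = shapley_values(["careful", "brief"], toy_value)
```

With these assumed numbers the attributions come out near 0.035 for "careful" and -0.025 for "brief", and they sum to the gap between the full prompt and the adjective-free prompt (0.01). The exact computation enumerates every subset, so it only scales to small adjective sets.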

If this is right

Powerful adjectives are not universal but show correlated effects within model lineages.
Steering direction depends on syntactic role and position in the prompt.
Larger models display strong non-additive interactions among adjectives.
Smaller models show more literal and less compositional responses to adjectives.
Prompting strategies must be model-specific to achieve reliable steering.
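One simple way to make "non-additive" concrete for a pair of adjectives is to compare the joint effect with the sum of the individual effects. The sketch below does that; the function name `classify_pair`, the labels, and their thresholds are assumptions made for illustration and are not taken from the paper.

```python
def classify_pair(v_none, v_a, v_b, v_both, tol=1e-9):
    """Label how two adjectives combine, given four benchmark scores.

    v_none: score with neither adjective, v_a / v_b: score with one,
    v_both: score with both. The labels are illustrative conventions.
    """
    additive = (v_a - v_none) + (v_b - v_none)  # prediction if effects simply add
    joint = v_both - v_none                     # observed combined effect
    interaction = joint - additive
    if abs(interaction) <= tol:
        return "additive"
    if additive != 0 and joint * additive < 0:
        return "reversal"        # combined effect flips sign
    if abs(joint) > abs(additive):
        return "synergistic"     # combined effect is amplified
    return "antagonistic"        # combined effect is dampened


# Hypothetical scores, chosen only to exercise each branch.
labels = [
    classify_pair(0.70, 0.72, 0.73, 0.75),
    classify_pair(0.70, 0.72, 0.73, 0.80),
    classify_pair(0.70, 0.72, 0.73, 0.72),
    classify_pair(0.70, 0.72, 0.73, 0.66),
]
```

A literal, compositional model in this toy framing would land mostly in the "additive" bucket, while the interaction-heavy behaviour described for larger models would show up as the other three labels.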

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Alignment methods may need to incorporate compositional prompt analysis for scaled models.
Testing on other benchmarks could reveal whether these effects generalize beyond MMLU.
Extending the framework to verbs or other modifiers might uncover similar patterns.
The family effect suggests that fine-tuning within a lineage could preserve steering behaviors.

Load-bearing premise

That the Shapley value calculation on adjective tokens accurately isolates their causal steering contribution without major interference from prompt syntax or tokenization.

What would settle it

Recomputing the attributions after rephrasing the prompts to alter syntax while keeping adjectives the same, and finding that the values shift substantially, would indicate the method does not isolate the intended effects.
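The check described above can be sketched as a comparison between two attribution tables, one from the original prompts and one from the rephrased prompts. The helper `attribution_shift` and all numbers below are hypothetical; what counts as a "substantial" shift is left to the reader.

```python
def attribution_shift(original, rephrased):
    """Compare per-adjective attributions before and after rephrasing.

    Both arguments map adjective -> attribution. Only adjectives present
    in both tables are compared.
    """
    shared = sorted(set(original) & set(rephrased))
    if not shared:
        raise ValueError("no adjectives in common")
    diffs = [abs(original[a] - rephrased[a]) for a in shared]
    # An adjective "flips" if its attribution changes sign.
    flips = [a for a in shared if original[a] * rephrased[a] < 0]
    return {
        "mean_abs_shift": sum(diffs) / len(diffs),
        "max_abs_shift": max(diffs),
        "sign_flips": flips,
    }


# Hypothetical attributions (assumed values, not results from the paper).
original = {"careful": 0.04, "brief": -0.02, "bold": 0.01}
rephrased = {"careful": 0.03, "brief": 0.01, "bold": 0.01}
shift = attribution_shift(original, rephrased)
```

Large shifts or many sign flips under syntax-only rephrasing would point toward structural artifacts; small shifts would support the premise that the attributions track the adjectives themselves.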

Figures

Figures reproduced from arXiv: 2606.20572 by Lars Malmqvist.

**Figure 2.** Spearman rank correlation of adjective impact across models. The strong positive correlation ...

**Figure 3.** Comparison of directional steering for o3 (left) and gpt-4o-mini (right). The plots show an inversion ...

**Figure 4.** Comparison of Adjective Steering Profiles for the hyper-sensitive deepseek-r1 (left) and the insen...

**Figure 5.** Overall model sensitivity to adjectival steering, broken down by broad academic domain. Note the ...

**Figure 6.** Correlation of adjective steering effects across questions for o3. The strong negative correlation ...
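For readers unfamiliar with the statistic named in the Figure 2 caption, the sketch below computes a Spearman rank correlation between two models' per-adjective attribution vectors using only the standard library. The vectors `model_a`, `model_b`, and `model_c` are made-up values, and the helper names are assumptions; in practice a library routine such as `scipy.stats.spearmanr` would serve the same purpose.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical attributions for the same four adjectives on three models.
model_a = [0.04, -0.02, 0.01, 0.00]
model_b = [0.03, -0.01, 0.02, -0.005]   # same ordering as model_a
model_c = [-0.03, 0.02, -0.01, 0.00]    # reversed ordering

rho_ab = spearman(model_a, model_b)
rho_ac = spearman(model_a, model_c)
```

Two models that rank adjectives identically give a correlation of 1 even when the magnitudes differ, which is why a rank statistic suits the "family effect" comparison better than raw differences.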

Original abstract

Achieving reliable control of Large Language Models (LLMs) requires a precise, scalable understanding of how they interpret linguistic cues. We introduce a rigorous framework using Shapley values to quantify the steering effect of individual adjectives on model performance, moving beyond anecdotal heuristics to principled attribution. Applying this method to 100 adjectives across a diverse suite of models (including o3, gpt-4o-mini, phi-3, llama-3-70b, and deepseek-r1) on the MMLU benchmark, we uncover several critical findings for AI alignment. First, we find that a small subset of adjectives act as disproportionately powerful "levers," yet their effects are not universal. Cross-model analysis reveals a "family effect": models of a shared lineage exhibit correlated sensitivity profiles, while architecturally distinct models react in a largely uncorrelated manner, challenging the notion of a one-size-fits-all prompting strategy. Second, focused follow-up studies demonstrate that the steering direction of these powerful adjectives is not intrinsic but is highly contingent on their syntactic role and position within the prompt. For larger models like gpt-4o-mini, we provide the first quantitative evidence of strong, non-additive interaction effects where adjectives can synergistically amplify, antagonistically dampen, or even reverse each other's impact. In contrast, smaller models like phi-3 exhibit a more literal and less compositional response. These results suggest that as models scale, their interpretation of prompts becomes more sophisticated but also less predictable, posing a significant challenge for robustly steering model behavior and highlighting the need for compositional and model-specific alignment techniques.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper applies Shapley values to adjective steering on MMLU and reports family effects plus non-additive interactions in larger models, but the abstract gives no indication the method controls for syntactic and positional artifacts.


The paper applies Shapley values to attribute the impact of adjectives on LLM performance on MMLU. The headline findings are family-specific sensitivity patterns and non-additive interactions that grow with model size.

This approach is new in its systematic quantification across architectures, and the observation that larger models show more complex, sometimes reversing interactions is worth noting. It does a decent job of moving from qualitative prompting tricks to measurable effects and points out that prompting strategies need to be model-specific.

The main concern is the validity of the attribution. Shapley values assume the value function can be evaluated on subsets independently, but in a prompt, the presence of one adjective affects the context for others through position and syntax. The abstract acknowledges position contingency but gives no sign that the method fixes positions or uses equivalent syntactic structures for coalitions. If the interactions come from those mechanics instead of semantic composition, the conclusion about scaling and alignment challenges doesn't follow as strongly. No mention of statistical significance or robustness checks either.

This paper is for people studying prompt engineering and model alignment who want quantitative tools. A reader working on steering techniques could get ideas from the family effects, even if the interaction claims need verification.

It deserves a serious referee because the questions it raises are concrete and the framework is a reasonable starting point, though the current presentation leaves the soundness open.

I would recommend sending it for review with a note to clarify the value function definition and any controls for prompt structure.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces a Shapley value-based framework to quantify the steering effects of individual adjectives on LLM performance on the MMLU benchmark across multiple models including gpt-4o-mini and phi-3. It reports model family correlations in sensitivity profiles, contingency of effects on syntactic position and role, and evidence of non-additive interactions (synergistic, antagonistic, or reversing) in larger models.

Significance. If the attribution method validly isolates semantic contributions without syntactic confounds, the findings would offer the first quantitative evidence of scaling trends in prompt interpretation, with important implications for AI alignment and the need for model-specific techniques. The cross-architecture analysis challenges universal prompting strategies.

major comments (1)

[Abstract] The central claim of strong non-additive interaction effects in gpt-4o-mini rests on Shapley values correctly attributing causal contributions of adjectives. However, the framework description provides no indication that the value function or coalition sampling controls for changes in tokenization, attention, or positional encodings when adjectives are added or removed, despite noting contingency on syntactic role and position. This leaves open the possibility that measured interactions reflect structural prompt artifacts rather than compositional semantics.

minor comments (2)

[Abstract] No mention of error bars, statistical significance testing, or number of prompt variations used in the Shapley computations.
[Abstract] The 'first quantitative evidence' claim would benefit from explicit comparison to prior work on prompt sensitivity.
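To show what error bars on a Shapley estimate could look like, the sketch below uses permutation sampling and reports a standard error per adjective. This is a generic Monte Carlo estimator written for illustration; whether the paper samples coalitions this way is not stated in the abstract, and `mc_shapley`, `toy_value`, and the scores inside it are hypothetical.

```python
import random
from statistics import mean, stdev


def mc_shapley(players, value, n_permutations=200, seed=0):
    """Permutation-sampling Shapley estimate with a standard error.

    Returns {player: (estimate, standard_error)}.
    """
    rng = random.Random(seed)
    samples = {p: [] for p in players}
    for _ in range(n_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = frozenset()
        previous = value(coalition)
        for p in order:
            coalition = coalition | {p}
            current = value(coalition)
            samples[p].append(current - previous)  # marginal contribution
            previous = current
    return {p: (mean(s), stdev(s) / len(s) ** 0.5) for p, s in samples.items()}


def toy_value(coalition):
    """Hypothetical accuracy with a small negative interaction term."""
    score = 0.70
    if "careful" in coalition:
        score += 0.04
    if "brief" in coalition:
        score -= 0.02
    if "careful" in coalition and "brief" in coalition:
        score -= 0.01
    return score


estimates = mc_shapley(["careful", "brief"], toy_value)
```

Reporting the estimate together with its standard error (or a confidence interval built from it) would let a reader judge whether a claimed interaction exceeds sampling noise. Note that this captures only the noise from coalition sampling, not variance from the model's own decoding or from the choice of questions.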

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive feedback. The concern about potential structural artifacts in the Shapley framework is well-taken, and we address it directly below. We will revise the manuscript to expand the methodological description and add explicit controls.


Referee: The central claim of strong non-additive interaction effects in gpt-4o-mini rests on Shapley values correctly attributing causal contributions of adjectives. However, the framework description provides no indication that the value function or coalition sampling controls for changes in tokenization, attention, or positional encodings when adjectives are added or removed, despite noting contingency on syntactic role and position. This leaves open the possibility that measured interactions reflect structural prompt artifacts rather than compositional semantics.

Authors: We agree that the abstract is terse on these controls and that an explicit discussion belongs in the main text. In the Methods, the value function is defined over coalitions where each adjective is inserted into one of several pre-specified syntactic slots (subject, modifier, etc.) within an otherwise fixed prompt template; this holds positional encodings and attention patterns constant for any given coalition. Tokenization variance is reduced by restricting the adjective set to items with comparable token lengths and by using the identical base prompt across all evaluations. The follow-up position/role experiments already serve as an internal control, demonstrating that interaction patterns persist (and sometimes reverse) when syntactic context is altered, which would not be expected if effects were purely structural. Nevertheless, we will add a dedicated subsection on potential confounds, including length-normalized ablations and attention-map comparisons, to make these safeguards transparent. This constitutes a partial revision focused on clarification and additional validation rather than a change in core results.

revision: partial
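The slot-based construction described in the simulated rebuttal can be sketched as follows. Everything here is an assumption for illustration: the instruction wording, the slot names `first` and `second`, the adjectives, and the choice to drop absent adjectives rather than substitute a neutral filler.

```python
from itertools import combinations

# Hypothetical fixed instruction with two adjective slots.
INSTRUCTION = "Provide your {first} {second} answer to the question."
# Each slot is tied to one adjective, so slot order never changes.
SLOT_ADJECTIVES = {"first": "careful", "second": "brief"}


def build_prompt(coalition, question):
    """Render the prompt for one coalition of adjectives.

    Adjectives outside the coalition are left out of their slot, and
    the resulting double spaces are collapsed.
    """
    fills = {
        slot: (adj if adj in coalition else "")
        for slot, adj in SLOT_ADJECTIVES.items()
    }
    instruction = " ".join(INSTRUCTION.format(**fills).split())
    return f"{instruction}\n\nQuestion: {question}"


def all_coalition_prompts(question):
    """Build one prompt per subset of the slot adjectives."""
    adjectives = list(SLOT_ADJECTIVES.values())
    prompts = {}
    for k in range(len(adjectives) + 1):
        for subset in combinations(adjectives, k):
            prompts[frozenset(subset)] = build_prompt(frozenset(subset), question)
    return prompts


prompts = all_coalition_prompts("What is 2 + 2?")
```

This keeps the relative order of slots fixed across coalitions, but removing a word still shifts the absolute token positions of everything after it. A sketch like this therefore does not by itself answer the referee's concern about positional encodings; a neutral filler word of matched token length would be one way to tighten the control.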

Circularity Check

0 steps flagged

No circularity; standard Shapley attribution applied directly to benchmark outputs


The paper applies the established Shapley value method to quantify adjective contributions to MMLU accuracy across models, without presenting any derivations, equations, or fitted parameters that reduce the claimed non-additive interactions or family effects to inputs by construction. No self-citations are invoked as load-bearing for uniqueness theorems or ansatzes, and the framework relies on external game-theoretic attribution rather than redefining results in terms of themselves. The analysis is therefore self-contained as an empirical measurement exercise.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the full paper would be needed for an audit.

pith-pipeline@v0.9.1-grok · 5816 in / 961 out tokens · 23661 ms · 2026-07-01T08:58:46.123057+00:00 · methodology


