pith. machine review for the scientific record.

arxiv: 2604.22027 · v1 · submitted 2026-04-23 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

Shared Lexical Task Representations Explain Behavioral Variability In LLMs

David Bau, Ellie Pavlick, Eric Todd, Francisco Piedrahita Velez, Jacob Xiaochen Li, Michael L. Littman, Stephen H. Bach, Zhuonan Yang

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:12 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords attention heads · prompt sensitivity · task representations · large language models · few-shot prompting · instruction prompts · model interpretability · behavioral variability

The pith

Large language models share the same task-specific attention heads across instruction and example prompts, with head activation explaining performance differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models show inconsistent results on the same task when the prompt wording changes. The paper demonstrates that models nevertheless activate common internal components for a given task regardless of whether the prompt uses natural language instructions or few-shot examples. Certain attention heads produce outputs that encode the task itself and these heads are reused across prompt styles. The degree to which these heads activate predicts how reliably the model answers, while weak performance sometimes traces to other task representations interfering with the main signal.

Core claim

Despite large variation in performance as a function of the prompt, the model engages common underlying mechanisms across different prompts for the same task. Specifically, task-specific attention heads whose outputs literally describe the task are shared across prompting styles and trigger subsequent answer production. Behavioral variation between prompts can be explained by the degree to which these heads are activated, and failures are at least sometimes due to competing task representations that dilute the signal of the target task.

What carries the argument

Lexical task heads: attention heads whose outputs describe the task and are activated similarly across prompt styles to drive answer generation.

If this is right

  • Prompt performance differences arise from varying levels of activation in the same set of task heads rather than entirely separate mechanisms.
  • Competing task representations can dilute the target task signal and produce errors even when the model has the relevant heads.
  • The same heads trigger answer production after being activated by either instruction or demonstration prompts.
  • Task representations remain consistent enough across prompt styles that variability is largely a matter of activation strength.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Intervening on these heads could provide a direct way to stabilize model behavior without changing the prompt text.
  • Models may maintain multiple task representations simultaneously, and the dominant one determines the output when activation levels differ.
  • This mechanism suggests that failures on new tasks could be diagnosed by checking whether the expected lexical task heads are present and sufficiently activated.

Load-bearing premise

The identified heads causally drive task performance and are genuinely shared rather than merely correlated with the prompt style used.

What would settle it

Measuring whether directly suppressing or boosting activation in the identified heads alters task accuracy in the same direction for both instruction and example prompts, independent of other model components.
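In code, that settling experiment amounts to an activation-scaling intervention: replace a head's contribution to the residual stream with a scaled copy and measure accuracy as the scale varies. A hedged sketch on plain vectors; real experiments would hook a transformer's attention outputs, and the names here are illustrative.

```python
import numpy as np

def scale_head_contribution(residual, head_output, alpha):
    """Rescale one head's (already-added) contribution to the residual
    stream: alpha = 0 ablates the head, alpha = 1 leaves it untouched,
    alpha > 1 boosts it, as in a scaling-factor sweep."""
    return residual + (alpha - 1.0) * head_output

residual = np.array([1.0, 2.0])     # toy residual stream (head already added)
head_out = np.array([0.5, -0.5])    # toy lexical-task-head output

print(scale_head_contribution(residual, head_out, 0.0))  # ablated: [0.5 2.5]
print(scale_head_contribution(residual, head_out, 2.0))  # boosted: [1.5 1.5]
```

If accuracy moves in the same direction with alpha for both instruction and example prompts, the shared-causal-mechanism reading is supported; if only one style responds, the heads are likely style-specific correlates.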

Figures

Figures reproduced from arXiv: 2604.22027 by David Bau, Ellie Pavlick, Eric Todd, Francisco Piedrahita Velez, Jacob Xiaochen Li, Michael L. Littman, Stephen H. Bach, Zhuonan Yang.

Figure 1: A representative result showing that outputs of Llama-3.1-8B vary in accuracy based on both prompt styles (example-based vs. instruction-based prompts) and prompt templates (number of examples, wording, etc.). A–E refer to specific wordings of instructions (see §A.3 for prompting details). Although the prompts all query the antonyms of a same set of words, accuracy can differ by a factor of 4.
Figure 2: An overview of Section 2. In §2.2, we describe the key model components studied in this work, lexical task components (shown as diamonds) and retrieval components (shown as circles). In §2.3, we focus our study on the cases where the model gets correct results and find that these key components are shared across prompting styles. (a) example-based prompts and (b) instruction-based prompts for the same targ…
Figure 3: Overlap between the lexical task heads identified for the instruction-based vs. example-based prompting styles for each task. On average, 73% of lexical task heads for a given task are shared across the two prompting styles. This result suggests that some components are reused when processing different surface instantiations of the same underlying task. …
Figure 4: The shared lexical task heads produce functionally equivalent outputs across prompting styles. Each solid line represents an activation patching experiment. For all lines, the activations of the same set of heads are patched into the same prompts; the only difference is the source of the patched activation, cached from different prompting templates and styles (shown in different colors). …
Figure 6: There is a positive correlation between accuracy and the magnitude of the outputs of lexical task heads. In each subplot for a task, each dot represents a given shot count. The plots for all the tasks are in Appendix H.1.2.
Figure 8: Lexical task heads modulate the output of retrieval heads. After scaling up lexical task head activations, more retrieval heads retrieve the correct answer (top), and the logits of the correct answer contributed by these heads also increase (bottom).
Figure 7: Scaling up the activation of lexical task heads can fix a portion of originally failed prompts. The average outputs of lexical task heads from correct prompts are patched into the incorrect prompts. The baseline accuracy is 0 for the incorrect prompts. See Appendix H.1.3 for results on all the tasks.
Figure 10: Quantification of the causal effect of lexical task heads.
Figure 9: Ambiguous prompts trigger the internal circuits of a competing task, diluting the signals of the intended target task.
Figure 11: Steering the model to generate Python code by activating lexical task heads. Left: when the scaling factor is small, the model continues to generate JavaScript code. Middle: with a moderate scaling factor, the model switches from generating JavaScript to Python code. Right: when the scaling factor is too strong, the model's generation deteriorates to repeating "Python".
Figure 12: Behavior variance across all tasks for the Llama-3.1-8B-Instruct model.
Figure 13: Behavior variance across all tasks for the Llama-3.1-70B-Instruct model.
Figure 14: Behavior variance across all tasks for the gemma-2-9b-it model.
Figure 15: Behavior variance across all tasks for the gemma-2-27b-it model.
Figure 16: Behavior variance across all tasks for the Qwen2.5-7B-Instruct model.
Figure 17: Behavior variance across all tasks for the Qwen2.5-32B-Instruct model.
Figure 18: Behavior variance across all tasks for the Qwen3-30B-A3B-Instruct-2507 model. Each data point represents the accuracy of a specific prompt template for a given task and prompting style.
Figure 19: Behavior variance across all tasks for the Qwen3-30B-A3B-Thinking-2507 model. Each data point represents the accuracy of a specific prompt template for a given task and prompting style.
Figure 20: Quantification of the number of lexical task heads for example-based prompts in Llama-3.1-8B-Instruct (1,024 total heads).
Figure 21: Quantification of the number of lexical task heads for example-based prompts in Llama-3.1-70B-Instruct (5,120 total heads).
Figure 22: Quantification of the number of lexical task heads for example-based prompts in gemma-2-9b-it (672 total heads).
Figure 23: Quantification of the number of lexical task heads for example-based prompts in gemma-2-27b-it (1,472 total heads).
Figure 24: Quantification of the number of lexical task heads for example-based prompts in Qwen2.5-7B-Instruct (784 total heads).
Figure 25: Quantification of the number of lexical task heads for example-based prompts in Qwen2.5-32B-Instruct (2,560 total heads).
Figure 26: Quantification of the number of lexical task heads for example-based prompts in the Qwen3-30B-A3B-Instruct model.
Figure 27: Quantification of the number of lexical task heads for example-based prompts in the Qwen3-30B-A3B-Thinking model.
Figure 28: Quantification of the number of lexical task heads for instruction-based prompts in Llama-3.1-8B-Instruct (1,024 total heads).
Figure 29: Quantification of the number of lexical task heads for instruction-based prompts in Llama-3.1-70B-Instruct (5,120 total heads).
Figure 30: Quantification of the number of lexical task heads for instruction-based prompts in Qwen2.5-7B-Instruct (784 total heads).
Figure 31: Quantification of the number of lexical task heads for instruction-based prompts in Qwen2.5-32B-Instruct (2,560 total heads).
Figure 32: Quantification of the number of lexical task heads for instruction-based prompts in gemma-2-9b-it (672 total heads).
Figure 33: Quantification of the number of lexical task heads for instruction-based prompts in gemma-2-27b-it (1,472 total heads).
Figure 34: Distribution of the lexical task heads across model layers in Llama-3.1-8B-Instruct.
Figure 51: Gemma-2-9b-it.
Figure 53: Qwen2.5-7B-Instruct.
Figure 56.
Figure 57: Quantification of the causal effect of lexical task heads in the Llama-3.1-8B-Instruct model.
Figure 58.
Figure 59.
Figure 60.
Figure 61.
Figure 62: Quantification of the causal effect of lexical task heads in the Llama-3.1-8B-Instruct model.
Figure 63.
Figure 64.
Figure 65: Some correct prompts activate more lexical task heads, and activate them more strongly, than incorrect prompts. The y-axis is the difference in the number of heads, or in their norms, between correct and incorrect prompts.
Figure 66: In 14 of the 17 tasks, there is a positive correlation between accuracy and the magnitude of the outputs of lexical task heads. In each subplot for a task, each dot represents a given shot count.
Figure 67: Scaling up the activation of lexical task heads can fix a portion of originally failed prompts. The average outputs of lexical task heads from correct prompts are patched into the incorrect prompts. The baseline accuracy is 0 for the incorrect prompts.
Figure 68: The distribution of lexical task heads and the top 20 universal function vector heads across layers of the Llama-3.1-8B-Instruct model. Although both kinds of heads generate task representations, they are largely disjoint sets of heads.
Figure 69: The number of retrieval heads across tasks for example-based prompts in the Llama-3.1-8B model.
Figure 70: The number of retrieval heads across tasks for instruction-based prompts in the Llama-3.1-8B model.
Figure 71: Heatmaps visualizing the distribution of retrieval heads in the Llama-3.1-8B-Instruct model. The color scale represents the proportion of prompts for which a given head retrieves the correct answer. Heads surpassing the 10% threshold are highlighted in red, indicating the retrieval heads for a given prompting style.
Figure 72: The x-axis of the heatmap displays tasks for instruction-based prompting, and the y-axis for example-based prompting. The darker the color, the higher the percentage of heads that overlap for a given task across the two prompting styles.
Figure 73: Ambiguous prompts dilute the signals of the intended target task (Product-Producer) and trigger the internal circuits of a competing task (Product-Country).
Figure 74: Ambiguous prompts dilute the signals of the target task (selecting the even number) and trigger the internal circuits of the off-target task (selecting the first number).
Figure 75: Quantification and visualization of lexical task heads.
Figure 76: Behavioral results of two-hop compositional tasks.
Figure 77: Quantification of the number of lexical task heads in two-hop compositional tasks.
Figure 78: Quantification of the causal effect of lexical task heads. Scaling up the activation of lexical task heads can fix a portion of originally failed prompts. The average outputs of lexical task heads from correct prompts are patched into the incorrect prompts. The baseline accuracy is 0 for the incorrect prompts. Each solid line represents an activation patching experiment. …
Figure 79: Comparison of the number of lexical task heads in the Instruct model (Llama-3.1-8B-Instruct) and the Base model (Llama-3.1-8B).
Figure 80: Comparison of the number of lexical task heads in the Instruct model (Qwen2.5-7B-Instruct) and the Base model (Qwen2.5-7B).
Figure 81: Comparison of the number of lexical task heads in the Instruct model (gemma-2-9b-it) and the Base model (gemma-2-9b).
Figure 82: Comparison of the locations of lexical task heads in the Instruct model (Llama-3.1-8B-Instruct) and the Base model (Llama-3.1-8B).
Figure 83: Comparison of the locations of lexical task heads in the Instruct model (Qwen2.5-7B-Instruct) and the Base model (Qwen2.5-7B).
Figure 84: Comparison of the locations of lexical task heads in the Instruct model (gemma-2-9b-it) and the Base model (gemma-2-9b).
Figure 85: Quantification of the causal effect of lexical task heads in the Llama-3.1-8B model. Scaling up the activation of lexical task heads can fix a portion of originally failed prompts. The average outputs of lexical task heads from correct prompts are patched into the incorrect prompts. The baseline accuracy is 0 for the incorrect prompts. …
Figure 86: Quantification of the causal effect of lexical task heads in the Qwen2.5-7B model. …
Figure 87: Quantification of the causal effect of lexical task heads in the gemma-2-9b model. …
Figure 88: Comparison of the number of lexical task heads identified with the original predefined task-descriptive terms versus terms generated by three language models: GPT-5.4 Mini, Claude 4.6 Sonnet, and Grok 4.2 Beta.
Figure 89: Comparison of the distribution of lexical task heads identified with the original predefined task-descriptive terms versus the combined set of terms generated by three language models: GPT-5.4 Mini, Claude 4.6 Sonnet, and Grok 4.2 Beta.
read the original abstract

One of the most common complaints about large language models (LLMs) is their prompt sensitivity -- that is, the fact that their ability to perform a task or provide a correct answer to a question can depend unpredictably on the way the question is posed. We investigate this variation by comparing two very different but commonly-used styles of prompting: instruction-based prompts, which describe the task in natural language, and example-based prompts, which provide in-context few-shot demonstration pairs to illustrate the task. We find that, despite large variation in performance as a function of the prompt, the model engages some common underlying mechanisms across different prompts of a task. Specifically, we identify task-specific attention heads whose outputs literally describe the task -- which we dub lexical task heads -- and show that these heads are shared across prompting styles and trigger subsequent answer production. We further find that behavioral variation between prompts can be explained by the degree to which these heads are activated, and that failures are at least sometimes due to competing task representations that dilute the signal of the target task. Our results together present an increasingly clear picture of how LLMs' internal representations can explain behavior that otherwise seems idiosyncratic to users and developers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates prompt sensitivity in LLMs by comparing instruction-based and example-based prompting. It identifies task-specific attention heads (termed lexical task heads) whose outputs encode task descriptions; these heads are shared across prompt styles, trigger answer production, and their activation degree accounts for performance variation, with failures sometimes arising from competing task representations that dilute the target signal.

Significance. If the causal and sharing claims hold with appropriate controls, the work supplies a mechanistic account of why LLMs exhibit prompt-dependent behavior. It moves beyond descriptive observations of inconsistency by tying variability to identifiable, reusable internal components, which could inform both interpretability research and practical prompting strategies.

major comments (2)
  1. [Results section on head activation and performance correlation] The central claim that lexical task heads are causally responsible for triggering answer production and that their activation degree explains behavioral variation across prompts requires intervention evidence. Correlational activation patterns alone leave open the possibility that the heads are downstream correlates of successful runs rather than upstream drivers; ablation, patching, or activation-manipulation experiments are needed to establish necessity. This issue is load-bearing for the explanation of prompt failures via competing representations.
  2. [Methods section describing lexical task head identification] The method for discovering and validating that heads 'literally describe the task' and are shared across prompting styles must include quantitative controls for prompt-style confounds. Without explicit metrics showing that the identification isolates task representations independent of surface prompt features, the sharing claim risks circularity with the prompting manipulation itself.
minor comments (2)
  1. Clarify the precise definition and quantification of 'degree to which these heads are activated' (e.g., via a specific activation metric or threshold) to allow replication.
  2. [Abstract] The abstract would benefit from naming the specific tasks or benchmarks used, even briefly, to ground the generality of the findings.
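One replication-friendly reading of "degree of activation" (the paper's exact metric is not stated in this review) is the L2 norm of a head's output vectors averaged over prompts, in line with the norm-difference axis mentioned for Figure 65. A sketch under that assumption; the function name and array shapes are illustrative:

```python
import numpy as np

def activation_magnitude(head_outputs):
    """Mean L2 norm of a head's output vectors across prompts; one
    plausible (hypothetical) operationalization of activation degree.
    head_outputs: array of shape (n_prompts, d_model)."""
    return float(np.linalg.norm(head_outputs, axis=-1).mean())

outs = np.array([[3.0, 4.0],    # norm 5.0
                 [0.0, 0.0]])   # norm 0.0
print(activation_magnitude(outs))  # → 2.5
```

Whatever metric the authors used, pinning it down this concretely (norm vs. attention weight vs. projection onto a task direction) is what would make the correlation results replicable.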

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights key opportunities to strengthen the causal and methodological foundations of our work. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [Results section on head activation and performance correlation] The central claim that lexical task heads are causally responsible for triggering answer production and that their activation degree explains behavioral variation across prompts requires intervention evidence. Correlational activation patterns alone leave open the possibility that the heads are downstream correlates of successful runs rather than upstream drivers; ablation, patching, or activation-manipulation experiments are needed to establish necessity. This issue is load-bearing for the explanation of prompt failures via competing representations.

    Authors: We agree that correlational evidence, while consistent with our interpretation, leaves room for alternative accounts in which the observed heads are downstream effects rather than causal drivers. Our manuscript currently relies on activation strength correlating with performance, heads being shared across prompt styles, and their decoded outputs encoding task information. To establish necessity, we will add activation-patching experiments in the revised manuscript: we will selectively boost or suppress the activations of the identified lexical task heads during inference and quantify changes in task accuracy and prompt sensitivity. These interventions will directly test whether manipulating head activation alters answer production and resolves or induces competing-representation failures. revision: yes

  2. Referee: [Methods section describing lexical task head identification] The method for discovering and validating that heads 'literally describe the task' and are shared across prompting styles must include quantitative controls for prompt-style confounds. Without explicit metrics showing that the identification isolates task representations independent of surface prompt features, the sharing claim risks circularity with the prompting manipulation itself.

    Authors: Our head-identification procedure already employs a contrastive approach that isolates heads whose outputs differ systematically between task-specific and task-irrelevant conditions, and we verify cross-style sharing by matching heads discovered independently from instruction-based versus example-based prompts. Nevertheless, we acknowledge that additional quantitative safeguards against surface-feature confounds would increase rigor. In the revision we will report (i) cosine similarity of activation vectors for the same heads across the two prompt styles, (ii) controls using style-matched but semantically unrelated prompts, and (iii) metrics showing that the decoded task descriptions remain stable after lexical overlap between prompt styles is minimized. These additions will demonstrate that the identified representations are task-specific rather than prompt-style artifacts. revision: yes
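The first proposed safeguard, cross-style cosine similarity of a head's mean activation, is simple to compute. A sketch assuming activations are stacked per prompt; the array names and toy data are illustrative:

```python
import numpy as np

def cross_style_similarity(acts_instruction, acts_example):
    """Cosine similarity between a head's mean output under
    instruction-based vs. example-based prompts.
    Each input: array of shape (n_prompts, d_model)."""
    a = acts_instruction.mean(axis=0)
    b = acts_example.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy data: both styles perturb one shared task direction.
rng = np.random.default_rng(1)
shared_direction = rng.normal(size=8)
inst = shared_direction + 0.1 * rng.normal(size=(5, 8))  # instruction-style runs
ex = shared_direction + 0.1 * rng.normal(size=(5, 8))    # example-style runs
print(cross_style_similarity(inst, ex))  # close to 1.0 for a genuinely shared head
```

A value near 1.0 across many tasks would support sharing; values near 0 for style-matched but semantically unrelated prompts (control ii) would argue against a surface-feature confound.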

Circularity Check

0 steps flagged

No circularity; empirical head identification is independent of claims

full rationale

The paper presents an empirical analysis identifying task-specific attention heads via inspection of model internals, then correlates their activation and sharing across prompt styles with observed behavioral variation. No derivation step reduces a claimed prediction to a fitted input by construction, invokes a self-citation as the sole justification for uniqueness, or renames a known result under new coordinates. The account of shared representations and competing signals is grounded in direct observation of activations rather than self-referential definitions, leaving the central claims self-contained against external model behavior benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based solely on the abstract, the central claim rests on the existence of newly identified lexical task heads as causal drivers of task behavior; no explicit free parameters are mentioned, and the main added element is the postulated heads themselves.

axioms (1)
  • domain assumption Attention heads can encode and output representations that literally describe a task
    Implicit in the identification and causal role assigned to lexical task heads
invented entities (1)
  • lexical task heads (no independent evidence)
    purpose: Attention heads whose outputs describe the task, are shared across prompting styles, and trigger answer production
    Newly named and posited in the abstract to explain shared mechanisms and performance variation

pith-pipeline@v0.9.0 · 5534 in / 1333 out tokens · 50554 ms · 2026-05-09T21:12:15.291387+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

17 extracted references · 11 canonical work pages · 1 internal anchor
