arxiv: 2507.20906 · v3 · submitted 2025-07-28 · 💻 cs.CL

Soft Head Selection for Injecting ICL-Derived Task Embeddings

Jungwon Park, Jimyeong Kim, Changin Choi, Wonjong Rhee

Pith reviewed 2026-05-19 02:37 UTC · model grok-4.3

classification 💻 cs.CL

keywords in-context learning · task adaptation · attention heads · LLM adaptation · task embeddings · gradient-based selection · parameter-efficient methods


The pith

Gradient-based head selection lets ICL task embeddings outperform few-shot ICL and PEFT

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors propose SITE, a method that derives task embeddings from few-shot ICL prompts and uses gradients to select which attention heads should receive those embeddings during inference. This selective injection leads to better performance than prior embedding adaptation techniques and plain few-shot ICL, while requiring many fewer trainable parameters than PEFT. The approach is shown to work across twelve different large language models from 4B to 70B parameters. Patching analyses inside and across tasks indicate that attention head roles depend strongly on the specific task being performed. This matters for anyone adapting LLMs because it offers an efficient middle ground between prompting and full fine-tuning.

Core claim

SITE identifies task-relevant attention heads by computing gradients with respect to task embeddings derived from few-shot in-context learning prompts and then injects the embeddings preferentially into the selected heads to adapt the model to downstream tasks.
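A minimal sketch of the idea on toy numbers, in pure Python. It assumes, without confirmation from the paper, that each head's selection value is the sigmoid of a learnable logit (initialised to 0, which matches the 0.5 starting value mentioned in the Figure 3 caption) and that injection is a convex blend of the head's zero-shot activation with its task embedding. The function names, the blend rule, the squared-error toy loss, and the finite-difference gradients (a real implementation would use automatic differentiation) are all illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def build_task_embedding(prompt_activations):
    """Average one head's last-token activation over several few-shot prompts."""
    count = len(prompt_activations)
    dim = len(prompt_activations[0])
    return [sum(vec[i] for vec in prompt_activations) / count for i in range(dim)]


def inject(zero_shot, task_embedding, logit):
    """Blend a head's zero-shot activation with its task embedding.

    ASSUMPTION: convex interpolation; the paper's exact injection rule may differ.
    """
    alpha = sigmoid(logit)
    return [(1.0 - alpha) * z + alpha * e for z, e in zip(zero_shot, task_embedding)]


def toy_loss(logits, zero_shot_heads, task_embeddings, target):
    """Hypothetical stand-in for the task loss: squared error of summed head outputs."""
    output = [0.0] * len(target)
    for z, e, logit in zip(zero_shot_heads, task_embeddings, logits):
        output = [o + b for o, b in zip(output, inject(z, e, logit))]
    return sum((o - t) ** 2 for o, t in zip(output, target))


def optimize_selection(zero_shot_heads, task_embeddings, target,
                       steps=200, lr=0.5, eps=1e-5):
    """Gradient descent on the per-head logits using central differences."""
    logits = [0.0] * len(zero_shot_heads)  # sigmoid(0) = 0.5 for every head
    for _ in range(steps):
        grads = []
        for h in range(len(logits)):
            up, down = logits[:], logits[:]
            up[h] += eps
            down[h] -= eps
            grads.append(
                (toy_loss(up, zero_shot_heads, task_embeddings, target)
                 - toy_loss(down, zero_shot_heads, task_embeddings, target)) / (2 * eps)
            )
        logits = [l - lr * g for l, g in zip(logits, grads)]
    return [sigmoid(l) for l in logits]


# Two toy heads: head 0's embedding points at the target, head 1's does not.
task_embeddings = [
    build_task_embedding([[0.9, 0.1], [1.1, -0.1]]),
    build_task_embedding([[0.0, 1.2], [0.0, 0.8]]),
]
zero_shot_heads = [[0.0, 0.0], [0.0, 0.0]]
selection = optimize_selection(zero_shot_heads, task_embeddings, target=[1.0, 0.0])
print([round(s, 2) for s in selection])
```

In this toy setting the selection value of the helpful head rises above 0.5 and the unhelpful head's value falls below it, which is the qualitative pattern the sorted curves in Figure 3 are described as showing.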

What carries the argument

Gradient-based soft head selection that identifies task-relevant attention heads for targeted injection of ICL-derived task embeddings

If this is right

SITE outperforms prior embedding-based adaptation methods and few-shot ICL across open-ended generation, reasoning, and natural language understanding tasks.
The method requires substantially fewer trainable parameters than PEFT while delivering higher performance.
Results hold across twelve LLMs sized from 4B to 70B parameters.
Intra-task and inter-task activation patching shows strong task dependence in attention head functionality.
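To give the parameter-count claim a sense of scale, the arithmetic below compares one selection value per head with a hypothetical LoRA configuration. The head count of 1024 for Llama-3.1-8B comes from the Figure 3 caption. The LoRA setup (rank 8 on the query and value projections) is an assumption chosen for illustration, not the PEFT baseline the paper used, and the projection shapes are taken from the public Llama-3.1-8B configuration. Whether SITE trains anything beyond the per-head values is not stated in this summary, so the count below is a lower-bound sketch.

```python
# Llama-3.1-8B shapes (public model configuration).
NUM_LAYERS = 32
HEADS_PER_LAYER = 32
HIDDEN = 4096
KV_DIM = 8 * 128  # 8 key/value heads of dimension 128 (grouped-query attention)

# ASSUMPTION: exactly one trainable selection value per attention head.
site_params = NUM_LAYERS * HEADS_PER_LAYER

# HYPOTHETICAL LoRA setup, for scale only.
LORA_RANK = 8


def lora_params(d_in, d_out, rank):
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


lora_per_layer = (lora_params(HIDDEN, HIDDEN, LORA_RANK)      # query projection
                  + lora_params(HIDDEN, KV_DIM, LORA_RANK))   # value projection
lora_total = NUM_LAYERS * lora_per_layer
ratio = lora_total / site_params

print(site_params, lora_total, ratio)
```

Under these assumptions the adapter has a few thousand times more trainable parameters than the selection vector; a different rank or set of target modules would change the ratio but not the order-of-magnitude gap.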

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar gradient signals could be used to focus other adaptation techniques on the most relevant model components.
Strong task dependence in heads implies attention mechanisms may implement identifiable modular computations that could transfer across related tasks.
The approach might extend to zero-shot or dynamic selection settings to further lower adaptation costs.

Load-bearing premise

Gradients from few-shot ICL prompts can reliably identify task-relevant attention heads for embedding injection without selection bias or task-specific overfitting.

What would settle it

If randomly selected heads produce the same performance gains as gradient-selected heads when the same task embeddings are injected, the claim that gradient-based selection matters would be falsified.
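One way to set up that control, sketched under assumptions: shuffle the learned selection values across heads so that the random condition injects with the same overall strength, then compare scores. `evaluate` is a hypothetical callback standing in for a real evaluation run, and `toy_evaluate` is a made-up scoring rule; nothing below comes from the paper's code.

```python
import random


def shuffled_control(selection_values, seed):
    """Return the same selection values assigned to randomly permuted heads."""
    rng = random.Random(seed)
    control = list(selection_values)
    rng.shuffle(control)
    return control


def selection_gap(selection_values, evaluate, num_controls=20):
    """Score of the learned assignment minus the mean score of shuffled controls."""
    learned_score = evaluate(selection_values)
    control_scores = [
        evaluate(shuffled_control(selection_values, seed))
        for seed in range(num_controls)
    ]
    return learned_score - sum(control_scores) / len(control_scores)


def toy_evaluate(values):
    """HYPOTHETICAL scorer: only heads 0 and 1 help, in proportion to their value."""
    helpful_heads = (0, 1)
    return sum(values[h] for h in helpful_heads) / len(helpful_heads)


learned_values = [0.9, 0.8, 0.1, 0.1, 0.05, 0.05]
gap = selection_gap(learned_values, toy_evaluate)
print(gap)
```

A gap near zero on real evaluations would support the falsification scenario; a clearly positive gap would indicate that which heads receive the embeddings matters, not just how much is injected.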

Figures

Figures reproduced from arXiv: 2507.20906 by Changin Choi, Jimyeong Kim, Jungwon Park, Wonjong Rhee.

**Figure 1.** Method Overview. Our method consists of three stages. (1) A set of task embeddings is constructed by averaging attention head activations for the last token across few-shot ICL prompts, using M=50 prompts each with N=10 input–output pairs. (2) Soft head-selection parameters are optimized to determine how the task embeddings should be injected into the model during zero-shot inference. (3) At inference time…

**Figure 2.** Average performance across 57 ICL tasks for 12 backbone large language models. For each backbone model, the performance of our method is presented along with 0-shot and 10-shot baselines. Average accuracies are annotated above each bar. Task-wise results for all 57 tasks (our method, 0-shot, and 10-shot) are provided in …

**Figure 3.** Optimized values of soft head-selection parameters for three ICL tasks. Each plot shows the optimized soft head-selection values for all 1024 attention heads in Llama-3.1-8B, sorted in descending order. Dashed lines indicate the initial value of 0.5 assigned to all selection parameters at the start of training. Results for all 57 tasks are provided in Figures 6-8 of Appendix C.1.

**Figure 4.** Training dynamics of soft head-selection parameters for three ICL tasks. Validation loss (left y-axis) and test accuracy (right y-axis) are plotted over 400 training iterations for AG News, Person-Instrument, and Choose Middle Of 5. Dashed lines indicate the 10-shot baseline accuracies for reference. Plots for all 57 tasks are provided in Figures 12-14 of Appendix E.1.

**Figure 5.** Impact of shot count on task performance for six ICL tasks. The plots show the performance of the N-shot baseline as the shot count (N) increases from 0 to 100. For comparison, our method is also evaluated using different values of N for task embedding construction (default: N=10), while keeping M=50 fixed.

**Figure 6.** Optimized values of soft head-selection parameters for 57 ICL tasks (part 1 of 3). Each plot shows the optimized values of the soft head-selection parameters for all 1024 attention heads in Llama-3.1-8B, sorted in descending order. Dashed lines indicate the initial value of 0.5 assigned to all selection parameters at the start of training. Plots for the remaining tasks are provided in Figures 7-8.

**Figure 7.** Optimized values of soft head-selection parameters for 57 ICL tasks (part 2 of 3). This figure continues from …

**Figure 8.** Optimized values of soft head-selection parameters for 57 ICL tasks (part 3 of 3). This figure concludes the series from Figures 6-7. Each plot shows the optimized values of the soft head-selection parameters for all 1024 attention heads in Llama-3.1-8B, sorted in descending order. Dashed lines indicate the initial value of 0.5 assigned to all selection parameters at the start of training.

**Figure 9.** Optimized values of soft head-selection parameters for six ICL tasks using Qwen3-32B. Each plot shows the optimized values of the soft head-selection parameters for all 4096 attention heads in Qwen3-32B, sorted in descending order. Dashed lines indicate the initial value of 0.5 assigned to all selection parameters at the start of training.

**Figure 10.** Optimized values of soft head-selection parameters for six ICL tasks using Mixtral-8x7B-v0.1. Each plot shows the optimized values of the soft head-selection parameters for all 1024 attention heads in Mixtral-8x7B-v0.1, sorted in descending order. Dashed lines indicate the initial value of 0.5 assigned to all selection parameters at the start of training.

**Figure 11.** Optimized values of soft head-selection parameters for six ICL tasks using Llama-3.1-70B. Each plot shows the optimized values of the soft head-selection parameters for all 5120 attention heads in Llama-3.1-70B, sorted in descending order. Dashed lines indicate the initial value of 0.5 assigned to all selection parameters at the start of training.

**Figure 12.** Training dynamics of soft head-selection parameters for 57 ICL tasks (part 1 of 3). Validation loss (left y-axis) and test accuracy (right y-axis) are plotted over 400 training iterations. Dashed lines indicate the 10-shot baseline accuracies for reference. The results are based on Llama-3.1-8B. Plots for the remaining tasks are provided in …

**Figure 13.** Training dynamics of soft head-selection parameters for 57 ICL tasks (part 2 of 3). This figure continues from …

**Figure 14.** Training dynamics of soft head-selection parameters for 57 ICL tasks (part 3 of 3). This figure concludes the series from Figures 12-13. Validation loss (left y-axis) and test accuracy (right y-axis) are plotted over 400 training iterations. Dashed lines indicate the 10-shot baseline accuracies for reference. The results are based on Llama-3.1-8B.

**Figure 15.** Training dynamics of soft head-selection parameters for six ICL tasks using Qwen3-32B. Validation loss (left y-axis) and test accuracy (right y-axis) are plotted over 400 training iterations. Dashed lines indicate the 10-shot baseline accuracies for reference.

**Figure 16.** Training dynamics of soft head-selection parameters for six ICL tasks using Mixtral-8x7B-v0.1. Validation loss (left y-axis) and test accuracy (right y-axis) are plotted over 400 training iterations. Dashed lines indicate the 10-shot baseline accuracies for reference.

**Figure 17.** Training dynamics of soft head-selection parameters for six ICL tasks using Llama-3.1-70B. Validation loss (left y-axis) and test accuracy (right y-axis) are plotted over 400 training iterations. Dashed lines indicate the 10-shot baseline accuracies for reference.

Original abstract

Large language models (LLMs) are commonly adapted to downstream tasks using parameter-efficient fine-tuning (PEFT) or in-context learning (ICL). Recently, ICL-driven embedding-based adaptation has been proposed as a distinct task adaptation paradigm. It derives task-specific embeddings from intermediate activations using few-shot prompts and injects them during inference. Despite its conceptual appeal, this approach has not demonstrated consistent performance gains over PEFT or ICL, and its empirical advantages have been limited in practice. We propose Soft head-selection for ICL-derived Task Embeddings (SITE), a gradient-based method that identifies task-relevant attention heads to enable effective task embedding injection. Across various types of open-ended generation, reasoning, and natural language understanding tasks, SITE significantly outperforms prior embedding-based adaptation methods and few-shot ICL, while using substantially fewer trainable parameters than PEFT. Experiments on 12 LLMs ranging from 4B to 70B parameters demonstrate the generality of our approach, and intra-task and inter-task activation patching analyses further provide new mechanistic insights by revealing strong task dependence in attention head functionality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SITE uses gradients to softly select attention heads for ICL embedding injection and reports gains across 12 models, but the selection risks being tied to specific prompts rather than tasks.


The punchline is that the authors have developed SITE, a method that uses gradients to softly select which attention heads receive injected task embeddings derived from ICL prompts. This yields better performance than previous embedding-based adaptation methods and few-shot ICL while keeping the number of trainable parameters much lower than typical PEFT approaches.

What stands out positively is the scale of the evaluation. They test 12 LLMs ranging from 4B to 70B parameters and cover generation, reasoning, and natural language understanding tasks. The intra-task and inter-task activation patching analyses are also useful, as they reveal patterns of task dependence in how attention heads function. This goes beyond pure performance reporting and touches on interpretability.

Where the paper is weaker is its reliance on gradients computed from a particular set of few-shot ICL prompts to identify the relevant heads. Because ICL results are often highly sensitive to the order, selection, and formatting of examples, the selected heads might reflect those prompt-specific characteristics more than intrinsic task properties. If so, both the performance improvements and the mechanistic insights from patching could be less general than claimed. The paper would be stronger with explicit tests showing that the head selections remain stable across different ICL prompt variations.

This work is relevant for researchers exploring efficient task adaptation strategies for large models and for those investigating the roles of individual attention heads. Given the experimental coverage and the concrete technical contribution, it deserves to go through peer review rather than being rejected at the desk. Referees could help clarify the robustness aspects and the exact experimental protocols.

Referee Report

2 major / 2 minor

Summary. The paper proposes Soft head-selection for ICL-derived Task Embeddings (SITE), a gradient-based method to identify and softly select task-relevant attention heads for injecting embeddings derived from few-shot ICL prompts. It claims consistent outperformance over prior embedding-based adaptation methods and few-shot ICL across open-ended generation, reasoning, and NLU tasks on 12 LLMs (4B–70B parameters), while using substantially fewer trainable parameters than PEFT. Intra-task and inter-task activation patching analyses are presented as evidence of strong task dependence in attention head functionality.

Significance. If the claims hold after addressing selection robustness, the work offers a parameter-efficient adaptation paradigm that bridges ICL and embedding injection with mechanistic interpretability. The scale of experiments across model sizes and task categories, plus the patching analyses, would constitute a solid empirical contribution to efficient LLM adaptation.

major comments (2)

[§3.2] §3.2 (gradient-based head selection): The method computes head importance from gradients on a fixed few-shot ICL prompt set. Because ICL is known to be sensitive to example order, selection, and formatting, it is unclear whether the resulting soft mask captures task-general heads or prompt artifacts. This directly bears on the performance gains in §4 and the task-dependence conclusions from the patching experiments in §5; a stability analysis across multiple prompt configurations for selection is needed.
[§4] §4 (experimental results): The headline claim of significant outperformance lacks reported statistical testing, exact baseline re-implementation details, and confirmation that head-selection hyperparameters were not tuned on the same data used for final evaluation. These omissions weaken the reliability of the comparisons to PEFT, ICL, and prior embedding methods.
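The stability analysis requested in the first major comment could be as simple as comparing the soft masks learned from different prompt sets. The sketch below assumes each mask is a list of per-head selection values and reports two illustrative metrics, top-k overlap (Jaccard) and Pearson correlation. The metric choice, the function names, and the toy masks are assumptions, not something the paper reports.

```python
import math


def top_k(mask, k):
    """Indices of the k heads with the largest selection values."""
    ranked = sorted(range(len(mask)), key=lambda h: mask[h], reverse=True)
    return set(ranked[:k])


def jaccard_top_k(mask_a, mask_b, k):
    """Overlap between the top-k head sets of two masks (1.0 means identical sets)."""
    a, b = top_k(mask_a, k), top_k(mask_b, k)
    return len(a & b) / len(a | b)


def pearson(mask_a, mask_b):
    """Pearson correlation between two masks of equal length."""
    n = len(mask_a)
    mean_a, mean_b = sum(mask_a) / n, sum(mask_b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(mask_a, mask_b))
    var_a = sum((x - mean_a) ** 2 for x in mask_a)
    var_b = sum((y - mean_b) ** 2 for y in mask_b)
    return cov / math.sqrt(var_a * var_b)


# Toy masks: runs 1 and 2 agree on the important heads, run 3 does not.
mask_run1 = [0.95, 0.90, 0.10, 0.05, 0.50]
mask_run2 = [0.85, 0.92, 0.20, 0.07, 0.40]
mask_run3 = [0.10, 0.05, 0.95, 0.90, 0.50]

print(jaccard_top_k(mask_run1, mask_run2, k=2), pearson(mask_run1, mask_run2))
print(jaccard_top_k(mask_run1, mask_run3, k=2), pearson(mask_run1, mask_run3))
```

High overlap and correlation across prompt orderings, example selections, and templates would suggest the mask reflects the task; low values would point to prompt artifacts.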

minor comments (2)

[Abstract] Abstract: Include a brief mention of the number of tasks, model sizes, and whether gains are statistically significant to give readers a clearer sense of scope.
[§3] Notation in §3: Define the soft selection weights and their injection mechanism more explicitly to avoid ambiguity when reproducing the forward pass.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate additional analyses and clarifications that strengthen the reliability of our claims.


Referee: [§3.2] §3.2 (gradient-based head selection): The method computes head importance from gradients on a fixed few-shot ICL prompt set. Because ICL is known to be sensitive to example order, selection, and formatting, it is unclear whether the resulting soft mask captures task-general heads or prompt artifacts. This directly bears on the performance gains in §4 and the task-dependence conclusions from the patching experiments in §5; a stability analysis across multiple prompt configurations for selection is needed.

Authors: We agree that ICL sensitivity to prompt variations is a valid concern that could affect whether the selected heads reflect task-general properties or prompt artifacts. To address this directly, we will perform a stability analysis by varying example order, selection, and formatting during head selection, then report the consistency of the resulting soft masks and downstream performance. These results and discussion will be added to §3.2, with explicit links to how they support the task-dependence findings from the patching experiments in §5.

Revision: yes

Referee: [§4] §4 (experimental results): The headline claim of significant outperformance lacks reported statistical testing, exact baseline re-implementation details, and confirmation that head-selection hyperparameters were not tuned on the same data used for final evaluation. These omissions weaken the reliability of the comparisons to PEFT, ICL, and prior embedding methods.

Authors: We acknowledge the need for greater statistical rigor and transparency. In the revision we will add statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests across seeds) for all headline comparisons. We will also expand §4 with precise re-implementation details for every baseline, including code references and hyperparameter choices. Finally, we will explicitly confirm and document that head-selection hyperparameters were tuned exclusively on a held-out validation split disjoint from all evaluation data.

Revision: yes
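As an illustration of what a paired comparison across seeds involves, the sketch below runs an exact paired sign-flip permutation test using only the standard library. This is a stand-in technique, not one of the tests named in the simulated rebuttal (paired t-test, Wilcoxon signed-rank), and the per-seed accuracies are made-up numbers, not results from the paper.

```python
from itertools import product


def paired_sign_flip_p_value(scores_a, scores_b):
    """Two-sided exact p-value for the mean paired difference.

    Under the null hypothesis each paired difference is equally likely to have
    either sign, so every sign assignment is enumerated and compared with the
    observed total.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total


# MADE-UP per-seed accuracies for illustration only.
method_acc = [0.62, 0.65, 0.61, 0.66, 0.63]
baseline_acc = [0.55, 0.57, 0.56, 0.58, 0.54]

p_value = paired_sign_flip_p_value(method_acc, baseline_acc)
print(p_value)
```

With five seeds the smallest attainable two-sided p-value is 2/32, so a handful of seeds cannot establish significance at the 0.05 level with this test no matter how large the gap; that is one reason the number of seeds is worth reporting alongside any test result.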

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks


The paper proposes SITE, a gradient-based soft head selection procedure for injecting ICL-derived task embeddings, and supports its claims through direct experiments on 12 LLMs (4B–70B) that compare against independent external baselines (PEFT, few-shot ICL, prior embedding methods). Performance metrics and the intra-/inter-task patching analyses are defined and measured outside the head-selection rule itself; no equation or result is shown to reduce to the selection mask by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that attention heads exhibit stable task-specific roles detectable via gradients from few-shot prompts; no new physical or mathematical entities are postulated and no free parameters are explicitly fitted in the abstract description.

axioms (1)

Domain assumption: Attention heads in transformer LLMs possess task-dependent functionality that can be identified via gradient signals from ICL prompts.
This premise is required for the soft-selection step to be meaningful and is invoked to justify why selective injection improves performance.


