
arxiv: 2606.04057 · v1 · pith:JWNWGB7Nnew · submitted 2026-06-02 · 💻 cs.SE · cs.AI · cs.LG

The Invisible Lottery: How Subtle Cues Steer Algorithm Choice in LLM Code Generation

Pith reviewed 2026-06-28 08:56 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG
keywords LLM code generation · prompt sensitivity · algorithm choice · algorithm steering · prompt cues · software engineering · invisible lottery

The pith

Incidental prompt cues steer which algorithm LLMs select for coding tasks, even when all outputs pass the same tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that contextual words or metadata outside the core task specification can influence the algorithm family an LLM chooses when generating code. It demonstrates this through 46,535 controlled experiments across 11 tasks and 19 cue types, revealing shifts in algorithm distributions as large as 100 percentage points that track the meaning of the cues. This creates an unpredictable selection process over code properties like performance and security. Direct naming of the desired algorithm emerges as the strongest tested control on the choice.

Core claim

Incidental prompt cues steer algorithm choice under fixed correctness, producing systematic shifts in algorithm-family distributions up to 100 percentage points that align with cue semantics across tasks including rate limiting, with direct algorithm naming as the most reliable mitigation tested.

What carries the argument

Algorithm steering, defined as cue-induced shifts in algorithm-family distributions, quantified via controlled prompt variations and code parsing across models.
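
A minimal sketch of how such a shift could be quantified, assuming each generated solution has already been assigned one family label. The function names, cue values, and counts are hypothetical illustrations, not data from the paper. The two measures mirror the max–min delta and the baseline-referenced |∆| described in the captions of Figures 2 and 5.

```python
from collections import Counter


def family_distribution(labels):
    """Return {family: share in percent} for a list of family labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {family: 100.0 * n / total for family, n in counts.items()}


def max_min_delta(labels_by_cue, family):
    """Largest swing (pp) in one family's selection rate across cue values."""
    rates = [
        family_distribution(labels).get(family, 0.0)
        for labels in labels_by_cue.values()
    ]
    return max(rates) - min(rates)


def baseline_delta(labels_by_cue, family, baseline="none"):
    """Largest |delta| (pp) of one family's rate relative to the baseline cue."""
    base = family_distribution(labels_by_cue[baseline]).get(family, 0.0)
    return max(
        abs(family_distribution(labels).get(family, 0.0) - base)
        for cue, labels in labels_by_cue.items()
        if cue != baseline
    )


# Hypothetical labels for one task and one cue channel (not from the paper).
runs = {
    "none": ["iterative"] * 6 + ["memoized"] * 4,
    "academic": ["memoized"] * 9 + ["iterative"] * 1,
    "junior": ["iterative"] * 10,
}

print(max_min_delta(runs, "memoized"))   # swing between the extreme cue values
print(baseline_delta(runs, "memoized"))  # largest move away from the no-cue rate
```

The max–min form compares the two most extreme cue values with each other, so it can exceed the baseline-referenced form whenever cues push in opposite directions.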

If this is right

  • Accidental context in prompts creates an invisible lottery over performance, security, and maintainability of generated code.
  • Direct algorithm naming in the prompt reliably reduces unwanted steering across the tested models and tasks.
  • Shifts appear in applied settings such as rate limiting and remain consistent with cue meaning rather than surface form.
  • The effect holds across 15 model configurations and persists even when all generated solutions pass identical tests.
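
To make "fixed correctness, different algorithm" concrete, here is a small sketch with three implementations from different families that all pass one shared test set. Fibonacci is a stand-in chosen for brevity; it is an assumption of this example, not a task confirmed by the paper.

```python
from functools import lru_cache


def fib_iterative(n: int) -> int:
    """Linear-time loop, constant extra memory."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


@lru_cache(maxsize=None)
def fib_memoized(n: int) -> int:
    """Recursive with a cache: linear time, linear memory, bounded by recursion depth."""
    return n if n < 2 else fib_memoized(n - 1) + fib_memoized(n - 2)


def fib_naive(n: int) -> int:
    """Plain recursion: exponential time, still functionally correct."""
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)


def passes_tests(fn) -> bool:
    """Shared test set; it checks outputs only, never how they were computed."""
    cases = {0: 0, 1: 1, 2: 1, 10: 55, 20: 6765}
    return all(fn(n) == expected for n, expected in cases.items())


for candidate in (fib_iterative, fib_memoized, fib_naive):
    print(candidate.__name__, passes_tests(candidate))
```

All three pass, yet they differ in running time, memory use, and failure modes on large inputs, which is the kind of property a pass/fail test suite does not observe.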

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Prompt design guidelines for code generation may need explicit rules against incidental context that could bias algorithm selection.
  • Developers using LLMs for production code could benefit from always specifying the target algorithm family to remove hidden variability.
  • Future model training could incorporate techniques to reduce sensitivity to non-task cues in algorithm choice.
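
One way the "always specify the target algorithm family" suggestion might look in practice is a prompt builder that appends an explicit directive after any surrounding context. This is a hypothetical helper: the function name, the directive wording, and the example strings are assumptions, and the paper's exact mitigation prompt is not shown in this summary.

```python
from typing import Optional


def build_prompt(task_spec: str,
                 algorithm: Optional[str] = None,
                 context: str = "") -> str:
    """Assemble a code-generation prompt.

    context   -- incidental material (persona, file header, ...) that may steer the model
    task_spec -- the actual task specification
    algorithm -- if given, an explicit directive naming the desired algorithm family
    """
    parts = []
    if context.strip():
        parts.append(context.strip())
    parts.append(task_spec.strip())
    if algorithm:
        # Hypothetical wording; any unambiguous naming of the family would do.
        parts.append(f"Use this algorithm family: {algorithm}.")
    return "\n\n".join(parts)


prompt = build_prompt(
    task_spec="Implement a rate limiter allowing 100 requests per minute per user.",
    algorithm="token bucket",
    context="You are reviewing code for a university course.",
)
print(prompt)
```

Placing the directive last keeps it adjacent to the task rather than buried in context; whether position matters for a given model would need to be measured.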

Load-bearing premise

Observed distribution shifts are driven by the semantic content of the cues rather than sampling noise, parsing differences, or task ambiguity, and algorithm families can be identified consistently from generated code.
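
The simulated rebuttal further down describes an automated AST-based classifier with keyword heuristics. The sketch below shows what a very small version of that idea could look like for Python output. The family names, the decorator list, and the rule order are assumptions made for illustration; this is not the authors' classifier.

```python
import ast

# Assumed heuristic: these decorator names signal a memoized solution.
MEMO_DECORATORS = {"lru_cache", "cache"}


def _decorator_name(node: ast.expr) -> str:
    """Reduce `lru_cache`, `functools.lru_cache`, or `lru_cache(...)` to a bare name."""
    if isinstance(node, ast.Call):
        node = node.func
    if isinstance(node, ast.Attribute):
        return node.attr
    if isinstance(node, ast.Name):
        return node.id
    return ""


def classify_family(source: str) -> str:
    """Label generated code as 'memoized', 'recursive', 'iterative', or 'other'."""
    tree = ast.parse(source)
    for fn in ast.walk(tree):
        if not isinstance(fn, ast.FunctionDef):
            continue
        if any(_decorator_name(d) in MEMO_DECORATORS for d in fn.decorator_list):
            return "memoized"
        calls_itself = any(
            isinstance(n, ast.Call)
            and isinstance(n.func, ast.Name)
            and n.func.id == fn.name
            for n in ast.walk(fn)
        )
        if calls_itself:
            return "recursive"
        if any(isinstance(n, (ast.For, ast.While)) for n in ast.walk(fn)):
            return "iterative"
    return "other"


loop_src = "def f(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a\n"
rec_src = "def f(n):\n    return n if n < 2 else f(n - 1) + f(n - 2)\n"
memo_src = "from functools import lru_cache\n@lru_cache(maxsize=None)\ndef f(n):\n    return n if n < 2 else f(n - 1) + f(n - 2)\n"

print(classify_family(loop_src), classify_family(rec_src), classify_family(memo_src))
```

Because `ast.parse` discards comments, cue text echoed into comments cannot change the label under this scheme; cue text that leaks into identifiers or docstrings still could, which is why the labeling step needs its own validation.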

What would settle it

Repeating the full experiment suite with fixed sampling seeds and an independent code classifier to test whether the shifts remain when cue semantics are isolated from other variables.
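
If a second, independent classifier were available, a first check would be chance-corrected agreement between the two sets of labels. The sketch below uses Cohen's kappa for two raters; note that the rebuttal reports Fleiss' kappa, which is the multi-rater analogue, so this is a simpler stand-in. The labels are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both classifiers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both labeled independently at their own base rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both classifiers always emit the same single label
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from the original and an independent classifier.
original = ["memoized", "memoized", "iterative", "iterative"]
independent = ["memoized", "memoized", "iterative", "memoized"]
print(cohens_kappa(original, independent))
```

High agreement alone would not settle the question; the steering deltas would also need to be recomputed from the independent labels and compared with the originals.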

Figures

Figures reproduced from arXiv: 2606.04057 by Akanksha Narula, Laurent Bindschaedler, Mofasshara Binte Rafique.

Figure 1
Four representative steering cases. Each group holds the task specification fixed and compares the max–min cue pair within that task–channel; bars report the selected algorithm-family rate.
Labels extracted with the figure: Persona Context Import Dependency Benchmark Code Review Constraint Test Complexity Style Docstring Style Comment Density Variable Naming Error Handling Input Format Temporal Placebo Innocuous Interaction Expr. Parsing M…
Figure 2
Maximum steering delta (pp) per task–channel pair. Each cell is the largest algorithm-family swing observed when varying cue values within that channel. Appendix …
Figure 3
Representative steering examples showing algorithm-family distributions before (baseline) and after a cue. Appendix B.2 gives the full per-cue distributions.
Figure 4
Model divergence under identical cues. Each cell shows a target algorithm’s selection rate for one model–cue pair; pass rate is annotated when below 85%. Columns span three tasks and six cue conditions.
Finding. Steering generalizes across models and tested temperatures; direction varies by model. Quantization and Reasoning: Quantization modestly attenuates steering: Dee…
Figure 5
Baseline-referenced steering across all task–channel pairs. Each cell reports the largest |∆| in pp relative to the none baseline across cue values within the channel; …
Figure 7
Sophistication–reliability diagnostic across task–cue conditions. Each point is one condition; the x-axis is the share of outputs selecting a task-specific sophisticated family (e.g., matrix exponentiation for memoization, eval for expression parsing), and the y-axis is pass rate. The aggregate Pearson correlation is near zero (r≈0.09), masking task-specific tradeoffs (memoization with academic personas re…
Original abstract

Large language models (LLMs) now generate substantial production code, often for tasks with multiple valid algorithmic solutions. Incidental prompt cues, meaning contextual words or metadata outside the task specification, can steer which algorithm the model selects, even when all outputs pass the same tests. Prompt sensitivity is well studied as a tool to improve output quality. Here, output policy means algorithm choice under fixed correctness. We define algorithm steering as cue-induced shifts in algorithm-family distributions and run 46,535 controlled experiments across 11 tasks, 19 cue types (18 channels plus a memoization semantic-vs-surface ablation that preserves meaning while changing typography and punctuation), and 15 model configurations. We find large, systematic shifts in algorithm-family distributions (up to 100 pp), largely consistent with cue semantics, including in applied tasks such as rate limiting. Direct algorithm naming is the most reliable mitigation we tested. Accidental context therefore creates an "invisible lottery" over performance, security, and maintainability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that incidental prompt cues (contextual words or metadata outside the task spec) steer LLMs toward different algorithm families in code generation tasks, even when all outputs pass the same tests. It reports results from 46,535 controlled experiments across 11 tasks, 19 cue types, and 15 model configurations, finding systematic shifts in algorithm-family distributions of up to 100 percentage points that align with cue semantics; direct algorithm naming is identified as the most reliable mitigation.

Significance. If the empirical results hold after addressing classification and statistical concerns, the work would be significant for the field of LLM code generation. It provides a large-scale demonstration of how non-task prompt elements influence output policy (algorithm choice under fixed correctness), with direct implications for reproducibility, security, and maintainability in applied settings such as rate limiting. The scale of the experiment set and the inclusion of a semantic-vs-surface ablation are strengths.

major comments (3)
  1. [Abstract / Methods] The abstract and methods description provide no information on the procedure used to classify generated code into algorithm families. This classification step is load-bearing for the central claim of cue-induced shifts, yet no inter-rater protocol, blinding procedure, automated classifier validation, or consistency metrics across tasks are reported.
  2. [Results / Experimental Setup] No details are given on statistical controls, error bars, multiple-testing corrections, or how baseline distributions (without cues) were established and compared. The reported shifts of up to 100 pp cannot be evaluated for robustness without this information.
  3. [Discussion] The weakest assumption—that observed distribution shifts are caused by cue semantics rather than uncontrolled variables in sampling, parsing, or task ambiguity—is not tested. The paper does not report any sensitivity analysis or validation that the family labels are independent of the incidental cue.
minor comments (1)
  1. [Introduction] The term 'output policy' is introduced without a formal definition or contrast to standard usage in the LLM literature.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important areas for clarification on classification, statistics, and causal attribution. We address each point below and have revised the manuscript to strengthen these aspects.

  1. Referee: [Abstract / Methods] The abstract and methods description provide no information on the procedure used to classify generated code into algorithm families. This classification step is load-bearing for the central claim of cue-induced shifts, yet no inter-rater protocol, blinding procedure, automated classifier validation, or consistency metrics across tasks are reported.

    Authors: We agree the original submission under-described the classification pipeline. The revised manuscript adds a dedicated subsection in Methods that defines the algorithm-family taxonomy from standard references, details an automated AST-based classifier with keyword heuristics, reports manual validation on a 10% stratified sample, provides inter-annotator agreement (Fleiss' kappa = 0.81), and describes blinding procedures in which annotators were unaware of cue conditions. A new supplementary table reports per-task consistency metrics. revision: yes

  2. Referee: [Results / Experimental Setup] No details are given on statistical controls, error bars, multiple-testing corrections, or how baseline distributions (without cues) were established and compared. The reported shifts of up to 100 pp cannot be evaluated for robustness without this information.

    Authors: We have expanded the Results and Experimental Setup sections to include bootstrap 95% confidence intervals on all reported distribution shifts, Holm-Bonferroni correction across the 11 tasks and 19 cue types, and a precise description of baseline construction (minimum 4,000 no-cue trials per task, with explicit comparison via chi-squared tests). All figures now display error bars and corrected p-values. revision: yes

  3. Referee: [Discussion] The weakest assumption—that observed distribution shifts are caused by cue semantics rather than uncontrolled variables in sampling, parsing, or task ambiguity—is not tested. The paper does not report any sensitivity analysis or validation that the family labels are independent of the incidental cue.

    Authors: The original manuscript already contains the memoization semantic-vs-surface ablation, which isolates semantics while holding surface form constant. We have added further sensitivity analyses in the revised Discussion: (i) re-running subsets at varied temperatures and top-p values yields stable shifts; (ii) task rephrasings that preserve semantics but alter surface wording produce unchanged family distributions; (iii) an explicit check confirms cue tokens do not alter the AST parser or family-labeling logic. These results support attribution to cue semantics. revision: yes
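
The second response mentions bootstrap 95% confidence intervals on distribution shifts. A minimal percentile-bootstrap sketch for one family's shift between a baseline and a cued condition follows. Resample counts, the seed, and the toy data are assumptions; the authors' exact procedure is not given in this summary.

```python
import random


def bootstrap_shift_ci(baseline, cued, family, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for (cued rate - baseline rate), in percentage points.

    baseline, cued -- lists of family labels, one per generated solution
    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    def rate(sample):
        return 100.0 * sum(label == family for label in sample) / len(sample)

    diffs = []
    for _ in range(n_boot):
        # Resample each condition independently, with replacement.
        b = rng.choices(baseline, k=len(baseline))
        c = rng.choices(cued, k=len(cued))
        diffs.append(rate(c) - rate(b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return rate(cued) - rate(baseline), lower, upper


# Hypothetical data: 40% memoized at baseline, 90% under a cue, 100 runs each.
baseline_labels = ["memoized"] * 40 + ["iterative"] * 60
cued_labels = ["memoized"] * 90 + ["iterative"] * 10

point, lower, upper = bootstrap_shift_ci(baseline_labels, cued_labels, "memoized")
print(f"shift = {point:.1f} pp, 95% CI [{lower:.1f}, {upper:.1f}]")
```

An interval that excludes zero suggests the shift is unlikely to be sampling noise alone, though it says nothing about labeling errors or task ambiguity.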

Circularity Check

0 steps flagged

No circularity: purely empirical measurement of prompt effects


The paper reports results from 46,535 controlled experiments measuring shifts in algorithm-family distributions induced by incidental prompt cues. No equations, fitted models, predictions, or first-principles derivations appear in the provided text; the central claim is an observed empirical pattern across tasks and models. Algorithm-family labeling is a methodological classification step whose consistency is not secured by any self-referential definition or self-citation chain. No load-bearing self-citations, ansatzes, or renamings of known results are present. The work is therefore self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper is purely empirical and introduces no new free parameters, axioms beyond standard statistical inference, or invented entities. It relies on the background assumption that algorithm families are identifiable and that cue semantics can be isolated.

axioms (2)
  • domain assumption: Algorithm families can be reliably and consistently identified from generated code outputs across different tasks.
    Required to convert raw model outputs into the reported distribution shifts.
  • standard math: Statistical significance testing can establish that observed shifts are due to the tested cues.
    Implicit in claims of large, systematic shifts.
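
Since many task–cue comparisons are tested at once, the significance axiom depends on a multiple-testing correction; the rebuttal names Holm-Bonferroni. Below is a short sketch of that step-down procedure. The p-values are invented for illustration.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm's step-down procedure.

    Returns a list of booleans in the original order: True where the
    corresponding null hypothesis is rejected at family-wise level alpha.
    """
    m = len(p_values)
    # Visit hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The threshold loosens as hypotheses are rejected: alpha/m, alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # first failure stops the procedure; the rest are retained
    return reject


# Hypothetical p-values for four task-cue comparisons.
print(holm_bonferroni([0.001, 0.04, 0.03, 0.2]))
```

In the example only the first comparison survives, even though three of the four raw p-values fall below 0.05.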

pith-pipeline@v0.9.1-grok · 5716 in / 1431 out tokens · 23344 ms · 2026-06-28T08:56:39.257676+00:00 · methodology

