pith. machine review for the scientific record.

arxiv: 2604.23267 · v1 · submitted 2026-04-25 · 💻 cs.CL · cs.LG

Recognition: unknown

Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

Bishwamittra Ghosh, Deepak Garg, Evimaria Terzi, Krishna P. Gummadi, Mohammad Aflah Khan, Qinyuan Wu, Soumi Das, Till Speicher

Pith reviewed 2026-05-08 08:02 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords fine-tuning · in-context learning · formal languages · language proficiency · inductive biases · generalization · large language models · discriminative test

The pith

Fine-tuning achieves greater language proficiency than in-context learning for in-distribution generalization in formal languages; the two modes match on out-of-distribution generalization, and their inductive biases diverge at high proficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares fine-tuning and in-context learning by training and testing large language models on a formal language learning task that defines exact language boundaries, samples strings in a controlled way, and avoids any data contamination from natural language. It introduces a discriminative test in which a model demonstrates proficiency by assigning higher generation probability to strings that belong to the target language than to strings that do not. Under this setup, fine-tuning produces higher in-distribution proficiency than in-context learning, yet both methods reach the same level of out-of-distribution generalization. Inductive biases, measured by the correlation between the probability distributions each method assigns to strings, stay aligned while proficiency remains partial but diverge once models reach higher proficiency. In-context learning also exhibits larger differences across model sizes, families, and token vocabularies than fine-tuning does.
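To make the sampling machinery concrete: each language in the paper is generated by a probabilistic context-free grammar, and a string's generation probability is the product of the probabilities of the rules applied to derive it (the paper's Figure 3 works one example, with P(s) = (0.5)^23). A minimal sketch of that mechanism, using an invented toy grammar rather than the paper's actual grammars:

```python
import random

# Toy probabilistic CFG: each non-terminal maps to (right-hand side, probability)
# pairs; symbols absent from the table are terminals. The grammar is invented
# for illustration and is not the paper's G^Numerical_alpha.
GRAMMAR = {
    "S": [(("A", "B"), 0.5), (("B",), 0.5)],
    "A": [(("a",), 0.5), (("a", "S"), 0.5)],
    "B": [(("b",), 1.0)],
}

def sample(symbol="S"):
    """Expand `symbol` recursively; return (tokens, generation probability)."""
    if symbol not in GRAMMAR:                       # terminal symbol
        return [symbol], 1.0
    rules = GRAMMAR[symbol]
    rhs, p = random.choices(rules, weights=[w for _, w in rules])[0]
    tokens, prob = [], p
    for sym in rhs:
        sub_tokens, sub_prob = sample(sym)
        tokens += sub_tokens
        prob *= sub_prob                            # multiply rule probabilities
    return tokens, prob

tokens, p = sample()
print(" ".join(tokens), p)                          # e.g. "a b b" 0.125
```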

Core claim

In a formal language learning task, fine-tuning yields greater language proficiency than in-context learning for in-distribution generalization, as measured by higher generation probabilities for in-language strings. Both approaches perform equally on out-of-distribution generalization. Their inductive biases are similar at partial learning levels but diverge at higher proficiency, and in-context learning varies more with model size, model family, and token vocabulary.

What carries the argument

The discriminative test for language proficiency, in which an LLM succeeds when it assigns higher generation probability to in-language strings than to out-of-language strings, applied inside a formal language framework that supplies precise boundaries, controlled sampling, and no data contamination.
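A minimal sketch of how such a test could be scored, assuming per-string log-probabilities have already been extracted from the model under each learning mode; the function name and numbers are illustrative, not from the paper's code:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def discriminative_auc(logp_in, logp_out):
    """AUC of ranking in-language above out-of-language strings by model
    log-probability: 1.0 = perfect proficiency, 0.5 = chance."""
    scores = np.concatenate([logp_in, logp_out])
    labels = np.concatenate([np.ones(len(logp_in)), np.zeros(len(logp_out))])
    return roc_auc_score(labels, scores)

# Toy numbers: a model that assigns higher log-probability to in-language strings
print(discriminative_auc(np.array([-12.1, -10.4, -11.8]),
                         np.array([-18.3, -16.9, -20.2])))   # -> 1.0
```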

If this is right

  • Fine-tuning should be chosen over in-context learning whenever high in-distribution accuracy on a well-defined language is required.
  • In-context learning performance will continue to depend more strongly on the choice of base model and its tokenizer than fine-tuning performance does.
  • Inductive biases of the two learning modes remain comparable only while both are still acquiring partial command of the language.
  • Formal languages supply a contamination-free testbed that can isolate LLM behaviors otherwise entangled in natural-language datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Practitioners could use small formal-language proxies to decide whether fine-tuning or prompting is likely to be more effective for a target task before investing in large-scale training.
  • The observed divergence in biases at higher proficiency levels suggests that further scaling or continued training would widen the difference between the two modes rather than close it.
  • The same formal-language protocol could be applied to other structured domains, such as programming languages or logical formulas, to obtain clean comparisons free of natural-language leakage.

Load-bearing premise

Assigning higher probability to in-language strings on the formal-language discriminative test accurately measures differences in language proficiency between fine-tuning and in-context learning that would matter for natural language.

What would settle it

Running the identical models on a natural-language task that supplies known in-language and out-of-language strings without contamination and finding that the probability gap between fine-tuning and in-context learning disappears or reverses on in-distribution items.

Figures

Figures reproduced from arXiv: 2604.23267 by Bishwamittra Ghosh, Deepak Garg, Evimaria Terzi, Krishna P. Gummadi, Mohammad Aflah Khan, Qinyuan Wu, Soumi Das, Till Speicher.

Figure 1: Fine-tuning and in-context learning are two …
Figure 3: A string s generated by the grammar in Figure 2. The rule 'A19 → A18 A16 [1]' indicates that non-terminal A19 is expanded to A18 followed by A16 with probability 1, and so on, until reaching T. The generation probability of s is the product of the probabilities of the rules applied recursively to generate s, and P(s) = (0.5)^23.
Figure 4: We visualize the set of all strings in a hierar…
Figure 5: Language proficiency of Mistral-7B on language L1, while varying the number of examples in both learning modes.
Figure 6: FT and ICL across different LLMs while learning language L1. Different LLMs demonstrate similar FT performance, but their ICL ability varies: good (AUC ≥ 0.75) for Qwen-2.5-7B, Mistral-7B, Qwen-2.5-1.5B, Llama-2-13B, Qwen-2.5-0.5B, Llama-2-7B, and Mistral-12B; moderate (AUC ≥ 0.6) for Gemma-2-2B, Gemma-2-9B, Pythia-6.9B, Opt-1.3B, Opt-6.7B, Pythia-1B, Llama-3.2-3B, Opt-2.7B, and Llama-3.2-1B; poor (AUC < 0.6) for the remaining models.
Figure 7: In-distribution generalization of FT vs. ICL on L1 in comparable ≈ 7B parameter LLMs. FT usually dominates ICL, except in Qwen-2.5-7B, Mistral-7B, and Llama-2-7B, where ICL is close to FT.
Figure 9: Inductive bias of ICL and FT, computed as the Pearson correlation of the generation loss of FT and ICL on identical test strings. Correlation, despite being positive, tends to decrease with more examples (larger markers).
Figure 10: Robustness of language proficiency of FT and ICL in Qwen-2.5-7B while varying languages in two ways: changing the grammar rules (rows) and changing the alphabet tokens (columns). The underlying grammar for a language is inside the parentheses. Compared to FT, ICL is sensitive to the tokens used in the language, despite the same underlying grammar.
Figure 11: Production rules of G^Numerical_α (left) and G^Latin_α (right).
Figure 12: Production rules of G^Numerical_β (left) and G^Latin_β (right).
Figure 13: Length distribution of the considered probabilistic languages…
Figure 14: Representative strings from different languages, annotated with the non-terminals applied…
Figure 15: Optimal fine-tuning performance of all models across different languages.
Figure 16: In-context learning performance of all models across different languages.
Figure 17: Intra-family FT performance.
Figure 18: Intra-family ICL performance.
Figure 19: Qwen-2.5-7B: comparison between fine-tuning and in-context learning across different languages (panels plot number of examples vs. AUC for L1 (G^Numerical_α), L2 (G^Latin_α), L3 (G^Under-trained_α), …).
Figure 20: Mistral-7B: comparison between fine-tuning and in-context learning across different languages.
Figure 21: Llama-2-7B: comparison between fine-tuning and in-context learning across different languages (panels plot number of examples vs. AUC for L1 (G^Numerical_α), L3 (G^Under-trained_α), L4 (G^Numerical_β), …).
Figure 22: Llama-3.1-8B: comparison between fine-tuning and in-context learning across different languages.
Figure 23: Inductive bias of ICL and FT on language L1, computed as the Pearson correlation of the generation loss of FT and ICL on identical test strings. Correlation, despite being positive, tends to decrease with more examples (larger markers).
Figure 24: Inductive bias of ICL and FT on language L2, computed as the Pearson correlation of the generation loss of FT and ICL on identical test strings. Correlation, despite being positive, tends to decrease with more examples (larger markers).
Figure 25: Inductive bias of ICL and FT on language L4, computed as the Pearson correlation of the generation loss of FT and ICL on identical test strings. Correlation, despite being positive, tends to decrease with more examples (larger markers).
Figure 26: Inductive bias of ICL and FT on language L5, computed as the Pearson correlation of the generation loss of FT and ICL on identical test strings. Correlation, despite being positive, tends to decrease with more examples (larger markers).
Figure 27: Out-of-distribution generalization to languages of increasing distance…
Figure 28: In-distribution generalization of FT and ICL on the MNLI dataset, where the learning task is to perform natural language inference by generating the label {entailment, neutral, contradiction} given a premise and hypothesis. At a high level, FT is better than ICL with more examples, consistent with the results on formal languages. In a detailed analysis, different LLMs perform different…
Figure 29: MNLI dataset: in-distribution (inference within the same genre)…
Figure 30: Testing the limit of utilizing the ICL context (1536 examples ≈ 77K tokens) on language L1. Training loss provides a lower bound on test loss in ICL. Long-context LLMs cannot improve further from additional examples.
Figure 31: Testing the limit of utilizing the ICL context (1536 examples ≈ 77K tokens) on language L2. Training loss provides a lower bound on test loss in ICL. Long-context LLMs cannot improve further from additional examples.
Figure 32: Testing the limit of utilizing the ICL context on language L4. Training loss provides a lower bound on test loss in ICL. Long-context LLMs cannot improve further from additional examples.
Figure 33: Testing the limit of utilizing the ICL context on language L5. Training loss provides a lower bound on test loss in ICL. Long-context LLMs cannot improve further from additional examples.
Figure 34: Qwen-2.5-7B: language proficiency according to generative (first row) and discriminative (second row) tests. The first two columns are for language L1, the last two for language L4.
Figure 35: Mistral-7B: language proficiency according to generative (first row) and discriminative (second row) tests. The first two columns are for language L1, the last two for language L4.
Figure 36: Llama-2-7B: language proficiency according to generative (first row) and discriminative (second row) tests. The first two columns are for language L1, the last two for language L4.
Figure 37: Llama-3.1-8B: language proficiency according to generative (first row) and discriminative (second row) tests. The first two columns are for language L1, the last two for language L4.
Figure 38: Gemma-2-9B: language proficiency according to generative (first row) and discriminative (second row) tests. The first two columns are for language L1, the last two for language L4.
Figure 39: Pythia-6.9B: language proficiency according to generative (first row) and discriminative (second row) tests. The first two columns are for language L1, the last two for language L4.
Figure 40: Opt-6.7B: language proficiency according to generative (first row) and discriminative (second row) tests. The first two columns are for language L1, the last two for language L4.
Figure 41: Comparing FT and ICL across compute costs: training cost (column 1), inference cost (column 2), and memory cost (column 3), recorded for language L1. FT and ICL are expensive in different phases of computation: FT incurs training cost, which does not apply to ICL; ICL has significantly higher inference cost, despite requiring less memory.
read the original abstract

Large language models (LLMs) operate in two fundamental learning modes - fine-tuning (FT) and in-context learning (ICL) - raising key questions about which mode yields greater language proficiency and whether they differ in their inductive biases. Prior studies comparing FT and ICL have yielded mixed and inconclusive results due to inconsistent experimental setups. To enable a rigorous comparison, we propose a formal language learning task - offering precise language boundaries, controlled string sampling, and no data contamination - and introduce a discriminative test for language proficiency, where an LLM succeeds if it assigns higher generation probability to in-language strings than to out-of-language strings. Empirically, we find that: (a) FT has greater language proficiency than ICL on in-distribution generalization, but both perform equally well on out-of-distribution generalization. (b) Their inductive biases, measured by the correlation in string generation probabilities, are similar when both modes partially learn the language but diverge at higher proficiency levels. (c) Unlike FT, ICL performance differs substantially across models of varying sizes and families and is sensitive to the token vocabulary of the language. Thus, our work demonstrates the promise of formal languages as a controlled testbed for evaluating LLM behaviors that are difficult to isolate in natural language datasets. Our source code is available at https://github.com/bishwamittra/formallm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a formal language learning task as a controlled testbed to compare fine-tuning (FT) and in-context learning (ICL) in LLMs. It defines a discriminative test for language proficiency (higher generation probability assigned to in-language strings than out-of-language strings) and reports three main empirical findings: (a) FT shows greater proficiency than ICL on in-distribution generalization but both modes perform equally on out-of-distribution generalization; (b) inductive biases (measured by correlation in string generation probabilities) are similar at partial learning levels but diverge at higher proficiency; (c) ICL performance varies more across model sizes/families and is sensitive to token vocabulary, unlike FT. Source code is released.

Significance. If the results hold under rigorous controls, the work supplies a contamination-free, precisely bounded testbed that can help resolve mixed prior comparisons of FT vs. ICL. The public code release is a clear strength for reproducibility. The approach of using formal languages to isolate inductive biases and generalization modes is a useful methodological contribution to the field.

major comments (3)
  1. §4 (Experimental Setup) and §3.2 (Discriminative Test): The sampling of out-of-language strings is not described with sufficient controls (length matching, n-gram overlap, or structural violation types). This is load-bearing for claims (a) and (b), because the reported equal OOD performance and the FT/ICL bias divergence at high proficiency could be artifacts if OOD strings share local surface statistics with training data rather than testing full grammar acquisition.
  2. §5 (Results): No statistical tests, number of independent runs, sample sizes, or error bars are reported for the performance differences, correlations, or sensitivity analyses. This undermines verification of the empirical claims, especially since the abstract reports none of these details either.
  3. §3.1 (Formal Language Task): The specific formal languages (e.g., regular, context-free) and their generative grammars are not enumerated with examples. Without this, it is difficult to assess whether the discriminative test requires learning the underlying grammar or can be satisfied by rejecting local invalid patterns, directly affecting the interpretation of proficiency differences.
minor comments (2)
  1. Abstract: The phrase 'precise language boundaries, controlled string sampling' is used but not previewed with even one concrete language or sampling rule; adding a brief example would improve immediate readability.
  2. §3.3 (Notation): The correlation metric for inductive biases is introduced without an explicit equation; adding a short definition (e.g., Pearson correlation over log-probabilities, as in the sketch below) would aid clarity.
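For concreteness, the definition the referee is asking for is presumably the sample Pearson correlation of the two modes' generation losses over a shared test set; a sketch under that assumption, with invented losses:

```python
import numpy as np
from scipy.stats import pearsonr

# Generation loss of each learning mode on the same test strings (invented)
loss_ft  = np.array([0.8, 1.1, 2.3, 0.5, 1.9])
loss_icl = np.array([1.0, 1.4, 2.0, 0.7, 2.2])

rho, _ = pearsonr(loss_ft, loss_icl)   # the paper's inductive-bias proxy
print(f"Pearson r = {rho:.2f}")        # closer to 1 => more aligned biases
```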

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us strengthen the clarity and empirical rigor of the manuscript. We address each major comment below and have incorporated revisions to improve the presentation of our controlled testbed and results.

read point-by-point responses
  1. Referee: §4 (Experimental Setup) and §3.2 (Discriminative Test): The sampling of out-of-language strings is not described with sufficient controls (length matching, n-gram overlap, or structural violation types). This is load-bearing for claims (a) and (b), because the reported equal OOD performance and the FT/ICL bias divergence at high proficiency could be artifacts if OOD strings share local surface statistics with training data rather than testing full grammar acquisition.

    Authors: We agree that explicit controls on OOD sampling are essential to substantiate claims about full grammar acquisition versus surface-level patterns. Although §4 described the overall sampling approach and noted the use of structural violations, we acknowledge that additional specifics were needed. We have revised §4 and §3.2 to detail: length matching between in- and out-of-language strings, explicit minimization of 3-gram overlap with the training distribution, and enumeration of violation categories (e.g., nesting depth violations for context-free languages and symbol-order violations for regular languages). These additions confirm that OOD evaluation targets global structural properties, supporting the interpretation of equal OOD performance and bias divergence. revision: yes
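A minimal sketch of the controls described above, treating strings as token sequences; the 3-gram overlap threshold and helper names are ours, not the paper's:

```python
def ngrams(tokens, n=3):
    """Set of n-grams occurring in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def controlled_negatives(in_strings, candidates, max_overlap=0.2):
    """Keep out-of-language candidates that match some in-language string in
    length and share few 3-grams with the in-language set, so the test probes
    global structure rather than local surface statistics."""
    in_grams = set().union(*(ngrams(s) for s in in_strings))
    in_lengths = {len(s) for s in in_strings}
    kept = []
    for cand in candidates:
        if len(cand) not in in_lengths:          # length matching
            continue
        g = ngrams(cand)
        if len(g & in_grams) <= max_overlap * max(len(g), 1):
            kept.append(cand)                    # low local-overlap negative
    return kept
```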

  2. Referee: §5 (Results): No statistical tests, number of independent runs, sample sizes, or error bars are reported for the performance differences, correlations, or sensitivity analyses. This undermines verification of the empirical claims, especially since the abstract reports none of these details either.

    Authors: We accept that the lack of statistical reporting limits verifiability. We have updated §5 to specify five independent runs per condition, test sample sizes (500 strings for proficiency discrimination and 1,000 for probability correlations), paired t-tests with reported p-values for all FT-ICL comparisons, and standard-error bars on all figures. We have also added a brief mention of these controls to the abstract. These changes provide the necessary quantitative support for the reported differences and sensitivity results. revision: yes
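Under the reporting scheme described, each FT-ICL comparison could be backed by a paired test across runs; a sketch with invented run-level AUCs:

```python
import numpy as np
from scipy.stats import ttest_rel

# AUC of each mode across five independent runs (numbers invented)
auc_ft  = np.array([0.93, 0.95, 0.94, 0.96, 0.94])
auc_icl = np.array([0.86, 0.88, 0.85, 0.89, 0.87])

t, p = ttest_rel(auc_ft, auc_icl)      # paired t-test over matched runs
print(f"t = {t:.2f}, p = {p:.4f}")
```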

  3. Referee: §3.1 (Formal Language Task): The specific formal languages (e.g., regular, context-free) and their generative grammars are not enumerated with examples. Without this, it is difficult to assess whether the discriminative test requires learning the underlying grammar or can be satisfied by rejecting local invalid patterns, directly affecting the interpretation of proficiency differences.

    Authors: We thank the referee for noting this gap in concreteness. While §3.1 outlined the task framework, we agree that explicit grammars and examples are required to demonstrate that the test probes full grammar learning. We have expanded §3.1 to enumerate the languages (the Dyck language as the primary context-free example and a^n b^n as a minimal non-regular example), provide their generative grammars in standard notation, and include sample in-language versus out-of-language strings. This makes clear that local n-gram rejection is insufficient for high proficiency scores, thereby clarifying the interpretation of FT versus ICL differences. revision: yes
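The Dyck example makes the point testable in a few lines: membership depends on global bracket balance, which no fixed window of local patterns can certify. A sketch of the standard stack-based check, ours rather than the paper's code:

```python
def in_dyck(s, pairs={"(": ")", "[": "]"}):
    """Membership test for the Dyck language over two bracket types."""
    stack = []
    for ch in s:
        if ch in pairs:
            stack.append(pairs[ch])              # expect this closer later
        elif not stack or ch != stack.pop():
            return False
    return not stack                             # all brackets closed

print(in_dyck("([])()"))   # True
print(in_dyck("([)]"))     # False: crossing brackets
print(in_dyck("(()"))      # False: locally valid bigrams, globally unbalanced
```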

Circularity Check

0 steps flagged

No circularity: empirical comparisons on formal language task

full rationale

The paper's central claims rest on empirical evaluations of FT versus ICL using a proposed formal language learning task and a discriminative probability test. No load-bearing derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described methodology; results are reported directly from model runs on controlled string samples, with no equations or premises that reduce to their own inputs by construction. The work is self-contained as an experimental study introducing a new testbed rather than deriving results from prior fitted quantities or author-specific uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that formal languages with controlled sampling can isolate learning-mode differences without the confounds of natural language data.

axioms (1)
  • domain assumption Formal languages offer precise boundaries and controlled string sampling with no data contamination.
    Invoked to justify the rigorous comparison setup described in the abstract.

pith-pipeline@v0.9.0 · 5578 in / 1243 out tokens · 39846 ms · 2026-05-08T08:02:43.886482+00:00 · methodology


Reference graph

Works this paper leans on

80 extracted references · 24 canonical work pages · 9 internal anchors

  3. [3]

    Ekin Akyürek, Bailin Wang, Yoon Kim, and Jacob Andreas. 2024. In-context language learning: Architectures and algorithms. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org

  4. [4]

    Zeyuan Allen-Zhu and Yuanzhi Li. 2023. Physics of language models: Part 1, learning hierarchical language structures. arXiv preprint arXiv:2305.13673

  5. [5]

    Akari Asai, Sneha Kudugunta, Xinyan Yu, Terra Blevins, Hila Gonen, Machel Reid, Yulia Tsvetkov, Sebastian Ruder, and Hannaneh Hajishirzi. 2024. BUFFET: Benchmarking large language models for few-shot cross-lingual transfer. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.naacl-long.100

  6. [6]

    Anas Awadalla, Mitchell Wortsman, Gabriel Ilharco, Sewon Min, Ian Magnusson, Hannaneh Hajishirzi, and Ludwig Schmidt. 2022. Exploring the landscape of distributional robustness for question answering models. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics

  7. [7]

    Amanda Bertsch, Maor Ivgi, Uri Alon, Jonathan Berant, Matthew R. Gormley, and Graham Neubig. 2024. In-context learning with long-context models: An in-depth exploration. In First Workshop on Long-Context Foundation Models @ ICML 2024. https://openreview.net/forum?id=4KAmc7vUbq

  8. [8]

    Kush Bhatia, Avanika Narayan, Christopher M De Sa, and Christopher Ré. 2023. TART: A plug-and-play transformer module for task-agnostic reasoning. Advances in Neural Information Processing Systems, 36:9751--9788

  9. [9]

    Satwik Bhattamishra, Kabir Ahuja, and Navin Goyal. 2020. On the ability and limitations of transformers to recognize formal languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online. Association for Computational Linguistics

  10. [10]

    Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, and 1 others. 2023. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397--2430. PMLR

  11. [11]

    Nadav Borenstein, Anej Svete, Robin Chan, Josef Valvoda, Franz Nowak, Isabelle Augenstein, Eleanor Chodroff, and Ryan Cotterell. 2024. What languages are easy to language-model? a perspective from learning probabilistic regular languages. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

  12. [12]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901

  13. [13]

    Nick Chater and Christopher D Manning. 2006. Probabilistic models of language processing and acquisition. Trends in cognitive sciences, 10(7):335--344

  14. [14]

    Wentong Chen, Yankai Lin, ZhenHao Zhou, HongYun Huang, YanTao Jia, Zhao Cao, and Ji-Rong Wen. 2025. ICLEval: Evaluating in-context learning ability of large language models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10398--10422, Abu Dhabi, UAE. Association for Computational Linguistics. https://aclanthology.org/2025.coling-main.693/

  15. [15]

    Ta-Chung Chi, Ting-Han Fan, Alexander I Rudnicky, and Peter J Ramadge. 2023. Transformer working memory enables regular language reasoning and natural language length extrapolation. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore. Association for Computational Linguistics

  16. [16]

    Noam Chomsky. 1956. Three models for the description of language. IRE Transactions on information theory, 2(3):113--124

  17. [17]

    Michael Collins. 2013. Probabilistic context-free grammars (PCFGs)

  18. [18]

    Ryan Cotterell, Sabrina J Mielke, Jason Eisner, and Brian Roark. 2018. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana. Association for Computational Linguistics

  19. [19]

    Colin de la Higuera, James Scicluna, and Mark-Jan Nederhof. 2014. On the computation of distances for probabilistic context-free grammars. arXiv preprint arXiv:1407.1513

  20. [20]

    Grégoire Delétang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin Wenliang, Elliot Catt, Chris Cundy, Marcus Hutter, Shane Legg, Joel Veness, and 1 others. 2023. Neural networks and the Chomsky hierarchy. In The Eleventh International Conference on Learning Representations

  21. [21]

    Ricardo Dominguez-Olmedo, Florian E Dorner, and Moritz Hardt. 2025. Training on the test task confounds evaluation and emergence. In The Thirteenth International Conference on Learning Representations

  22. [22]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, and 1 others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783

  23. [23]

    Himanshu Gupta, Saurabh Arjun Sawant, Swaroop Mishra, Mutsumi Nakamura, Arindam Mitra, Santosh Mashetty, and Chitta Baral. 2023. Instruction tuned models are quick learners. arXiv preprint arXiv:2306.05539

  24. [24]

    Michael Hahn. 2020. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8:156--171

  25. [25]

    Michael Hahn and Mark Rofin. 2024. Why are sensitive functions hard for transformers? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand. Association for Computational Linguistics

  26. [26]

    Mark Hopkins. 2022. Towards more natural artificial languages. In Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL), pages 85--94

  27. [27]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9

  28. [28]

    Shengding Hu, Yuge Tu, Xu Han, Ganqu Cui, Chaoqun He, Weilin Zhao, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Xinrong Zhang, Zhen Leng Thai, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, and 5 others. 2024. MiniCPM: Unveiling the potential of small language models with scalable training strategies. https://openreview.net/forum?id=3X2L2TFr0f

  29. [29]

    Thomas F Icard. 2020. Calibrating generative models: The probabilistic Chomsky–Schützenberger hierarchy. Journal of Mathematical Psychology, 95:102308

  30. [30]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. https://arxiv.org/abs/2310.0...

  31. [31]

    Jaap Jumelet and Willem Zuidema. 2023. Transparency at the source: Evaluating and interpreting language models with access to the true distribution. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore. Association for Computational Linguistics

  32. [32]

    Julie Kallini, Isabel Papadimitriou, Richard Futrell, Kyle Mahowald, and Christopher Potts. 2024. Mission: Impossible language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand. Association for Computational Linguistics

  33. [33]

    Masahiro Kaneko, Danushka Bollegala, and Timothy Baldwin. 2025. The gaps between fine tuning and in-context learning in bias evaluation and debiasing. In Proceedings of the 31st International Conference on Computational Linguistics, pages 2758--2764

  34. [34]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361

  35. [35]

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics

  36. [36]

    Sander Land and Max Bartolo. 2024. Fishing for Magikarp: Automatically detecting under-trained tokens in large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA. Association for Computational Linguistics

  37. [37]

    Teven Le Scao and Alexander M Rush. 2021. How many data points is a prompt worth? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2627--2636

  38. [38]

    Eric Lehman, Evan Hernandez, Diwakar Mahajan, Jonas Wulff, Micah J Smith, Zachary Ziegler, Daniel Nadler, Peter Szolovits, Alistair Johnson, and Emily Alsentzer. 2023. Do we still need clinical language models? In Conference on health, inference, and learning, pages 578--597. PMLR

  39. [39]

    Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics

  40. [40]

    Ziqian Lin and Kangwook Lee. 2024. Dual operating modes of in-context learning. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models. https://openreview.net/forum?id=5H4nJIGqmK

  41. [41]

    Bingbin Liu, Jordan T Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. 2023. Transformers learn shortcuts to automata. In The Eleventh International Conference on Learning Representations

  42. [42]

    Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. 2022. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950--1965

  43. [43]

    Christopher D Manning. 2003. Probabilistic syntax. Probabilistic Linguistics, pages 289--341

  44. [44]

    William Merrill. 2023. Formal languages and the NLP black box. In International Conference on Developments in Language Theory, pages 1--8. Springer

  45. [45]

    Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, and 88 others. 2024. Gemma: Open models based on Gemini research and technology

  46. [46]

    Sabrina J Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy. Association for Computational Linguistics

  47. [47]

    Marius Mosbach, Tiago Pimentel, Shauli Ravfogel, Dietrich Klakow, and Yanai Elazar. 2023. Few-shot fine-tuning vs. in-context learning: A fair comparison and evaluation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 12284--12314, Toronto, Canada. Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.779

  48. [48]

    Shikhar Murty, Pratyusha Sharma, Jacob Andreas, and Christopher D Manning. 2023. Characterizing intrinsic compositionality in transformers with tree projections. In The Eleventh International Conference on Learning Representations

  49. [49]

    Michael Oliver and Guan Wang. 2024. Crafting efficient fine-tuning strategies for large language models. arXiv preprint arXiv:2407.13906

  50. [50]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730--27744

  51. [51]

    Jane Pan, Tianyu Gao, Howard Chen, and Danqi Chen. 2023. What in-context learning learns in-context: Disentangling task recognition and task learning. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8298--8319, Toronto, Canada. Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.527

  52. [52]

    Isabel Papadimitriou and Dan Jurafsky. 2023. Injecting structural hints: Using language models to study inductive biases in language learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore. Association for Computational Linguistics

  53. [53]

    Branislav Pecher, Ivan Srba, and Maria Bielikova. 2025. Comparing specialised small and general large language models on text classification: 100 labelled samples to achieve break-even performance. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 165--184

  54. [54]

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners

  55. [55]

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383--2392, Austin, Texas. Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1264

  56. [56]

    Shauli Ravfogel, Yoav Goldberg, and Tal Linzen. 2019. Studying the inductive biases of RNNs with synthetic variations of natural languages. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota. Association for Computational Linguistics

  57. [57]

    Amirhossein Razavi, Mina Soltangheis, Negar Arabzadeh, Sara Salamat, Morteza Zihayat, and Ebrahim Bagheri. 2025. Benchmarking prompt sensitivity in large language models. In European Conference on Information Retrieval, pages 303--313. Springer

  58. [58]

    Gautam Reddy. 2024. The mechanistic basis of data dependence and abrupt learning in an in-context classification task. In The Twelfth International Conference on Learning Representations

  59. [59]

    Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, and 178 others. 2024. Gemma 2: Improving open language models at a practical size

  60. [60]

    Lingfeng Shen, Aayush Mishra, and Daniel Khashabi. 2023. Do pretrained transformers really learn in-context by gradient descent? arXiv preprint arXiv:2310.08540

  61. [61]

    Hui Shi, Sicun Gao, Yuandong Tian, Xinyun Chen, and Jishen Zhao. 2022. Learning bounded context-free-grammar via LSTM and the transformer: difference and the explanations. In Proceedings of the AAAI conference on artificial intelligence, volume 36, pages 8267--8276

  62. [62]

    Heydar Soudani, Evangelos Kanoulas, and Faegheh Hasibi. 2024. Fine tuning vs. retrieval augmented generation for less popular knowledge. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pages 12--22

  63. [63]

    Krishna Prasad Varadarajan Srinivasan, Prasanth Gumpena, Madhusudhana Yattapu, and Vishal H Brahmbhatt. 2024. Comparative analysis of different efficient fine tuning methods of large language models (LLMs) in low-resource setting. arXiv preprint arXiv:2405.13181

  64. [64]

    Lena Strobl, William Merrill, Gail Weiss, David Chiang, and Dana Angluin. 2023. Transformers as recognizers of formal languages: A survey on expressivity. arXiv preprint arXiv:2311.00208

  65. [65]

    Hongjin Su, Jungo Kasai, Chen Henry Wu, Weijia Shi, Tianlu Wang, Jiayi Xin, Rui Zhang, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and 1 others. 2023. Selective annotation makes language models better few-shot learners. In The Eleventh International Conference on Learning Representations

  66. [66]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, and 1 others. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

  67. [67]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and 1 others. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

  68. [68]

    Shunjie Wang. 2021. Evaluating transformer’s ability to learn mildly context-sensitive languages. University of Washington

  69. [69]

    Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, and 1 others. 2023. Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846

  70. [70]

    Jennifer C White and Ryan Cotterell. 2021. Examining the inductive bias of neural language models with artificial languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online. Association for Computational Linguistics

  71. [71]

    Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112--1122. http://aclweb.org/anthology/N18-1101

  72. [72]

    Qinyuan Wu, Mohammad Aflah Khan, Soumi Das, Vedant Nanda, Bishwamittra Ghosh, Camila Kolling, Till Speicher, Laurent Bindschaedler, Krishna Gummadi, and Evimaria Terzi. 2025. Towards reliable latent knowledge estimation in LLMs: Zero-prompt many-shot based factual knowledge extraction. In Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining

  73. [73]

    Cheng Xu, Shuhao Guan, Derek Greene, M Kechadi, and 1 others. 2024. Benchmark data contamination of large language models: A survey. arXiv preprint arXiv:2406.04244

  74. [74]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, and 1 others. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115

  75. [75]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369--2380, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1259

  76. [76]

    Qingyu Yin, Xuzheng He, Chak Tou Leong, Fan Wang, Yanzhao Yan, Xiaoyu Shen, and Qiang Zhang. 2024. Deeper insights without updates: The power of in-context learning over fine-tuning. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 4138--4151, Miami, Florida, USA. Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.findings-emnlp.239

  77. [77]

    Biao Zhang, Zhongtao Liu, Colin Cherry, and Orhan Firat. 2024. When scaling meets LLM finetuning: The effect of data, model and finetuning method. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=5HCnKDeTws

  78. [78]

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, and 1 others. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068

  79. [79]

    Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In International conference on machine learning, pages 12697--12706. PMLR

  80. [80]

    Jingming Zhuo, Songyang Zhang, Xinyu Fang, Haodong Duan, Dahua Lin, and Kai Chen. 2024. ProSA: Assessing and understanding the prompt sensitivity of LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1950--1976