Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction
Pith reviewed 2026-05-10 17:29 UTC · model grok-4.3
The pith
LLMs' in-context translation accuracy falls sharply with larger grammars, longer sentences, and mismatches in morphology or script.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When given a synchronous context-free grammar and a source-language sentence, LLMs produce correct translations less often as the grammar grows larger or the sentences grow longer; performance is further reduced when the source and target languages differ in morphology or orthography; and the dominant error modes are recalling incorrect target-language words, hallucinating new words, or leaving source words untranslated.
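The three error modes named in this claim can be operationalized with a simple vocabulary check. The sketch below is an illustrative scoring scheme only; the function name, signature, and bucketing precedence are assumptions, not the paper's actual evaluation code.

```python
# Hedged sketch of counting the three reported error modes in one model output.
# Bucketing rules here are illustrative assumptions, not the paper's scorer.
def classify_errors(output_words, reference_words, source_vocab, target_vocab):
    """Bucket non-reference output words into the three dominant error modes."""
    counts = {"wrong_recall": 0, "hallucinated": 0, "untranslated": 0}
    reference = set(reference_words)
    for word in output_words:
        if word in reference:
            continue                      # correct target word: not an error
        if word in target_vocab:
            counts["wrong_recall"] += 1   # real target word, but the wrong one
        elif word in source_vocab:
            counts["untranslated"] += 1   # source word copied through unchanged
        else:
            counts["hallucinated"] += 1   # appears in neither vocabulary
    return counts
```

Under this scheme each word gets exactly one category, with reference membership checked first; the paper may use a different precedence.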
What carries the argument
Synchronous context-free grammars that pair formal languages, allowing the effects of grammar size, sentence length, morphology, and script differences to be isolated and varied independently.
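To make this machinery concrete, a toy SCFG can be written as paired source/target right-hand sides that share nonterminals and are expanded together. The grammar and nonsense vocabulary below are invented for illustration, not taken from the paper's benchmarks, and the sketch assumes each nonterminal occurs at most once per rule.

```python
import random

# Toy SCFG: each rule pairs a source-side and a target-side expansion.
# A nonterminal appearing on both sides of a rule is expanded once, and the
# two halves of its expansion are spliced into the matching sides, keeping
# source and target sentences synchronized.
SCFG = {
    "S":  [(["NP", "VP"], ["NP", "VP"])],
    "NP": [(["dax"], ["tok"]), (["wug"], ["pim"])],
    "VP": [(["blicket"], ["zup"]), (["fep", "NP"], ["NP", "lor"])],  # reordering rule
}

def derive(symbol, rng):
    """Synchronously expand one nonterminal into (source_words, target_words)."""
    src_rhs, tgt_rhs = rng.choice(SCFG[symbol])
    # One shared expansion per co-indexed nonterminal (at most one per rule here).
    sub = {s: derive(s, rng) for s in set(src_rhs) & set(SCFG)}
    src = [w for s in src_rhs for w in (sub[s][0] if s in SCFG else [s])]
    tgt = [w for s in tgt_rhs for w in (sub[s][1] if s in SCFG else [s])]
    return src, tgt

src, tgt = derive("S", random.Random(0))  # one synchronized sentence pair
```

Growing the `SCFG` dict (more nonterminals, more rules per nonterminal) is exactly the grammar-size axis the paper varies.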
If this is right
- Accuracy will continue to degrade as the number of rules or the length of input sentences increases.
- Languages with richer morphology or different scripts will remain harder to translate under in-context conditions.
- Models will keep producing the same three dominant error types rather than other kinds of syntactic mistakes.
- In-context grammar use alone will not close the performance gap for languages lacking large training corpora.
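The first two predictions are cheap to monitor once a metric is fixed. The sketch below assumes exact-match accuracy over (source, reference) pairs, with `translate` standing in for whatever model call is under test; both names are placeholders, not from the paper.

```python
# Illustrative exact-match accuracy over paired test items; `translate` is a
# placeholder for a model call (e.g. an LLM prompted with the grammar).
def exact_match_accuracy(pairs, translate):
    """Fraction of (source, reference) pairs translated to an exact match."""
    if not pairs:
        return 0.0
    hits = sum(1 for source, reference in pairs if translate(source) == reference)
    return hits / len(pairs)
```

Tracking this value while sweeping grammar size or sentence length yields the trend curves the predictions describe.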
Where Pith is reading between the lines
- The formal-task results imply that simply scaling context length or model size may not overcome the observed limits without changes to how rules are represented.
- These SCFG benchmarks could serve as a quick filter for new in-context methods before testing them on scarce real-language data.
- The error patterns suggest models treat the grammar more like a lookup table than as a generative rule system when context grows complex.
- Future work could add explicit rule-application steps to the prompt to test whether the drop in performance is due to inference difficulty rather than rule encoding.
Load-bearing premise
That success or failure on these artificial grammar tasks will predict how the same models behave when given real-language grammars and dictionaries for low-resource natural languages.
What would settle it
Running the same models on actual low-resource language pairs supplied with textbook-style grammar descriptions and dictionaries, then checking whether the error rates and accuracy trends match those observed on the SCFG tasks.
Original abstract
Low-resource languages pose a challenge for machine translation with large language models (LLMs), which require large amounts of training data. One potential way to circumvent this data dependence is to rely on LLMs' ability to use in-context descriptions of languages, like textbooks and dictionaries. To do so, LLMs must be able to infer the link between the languages' grammatical descriptions and the sentences in question. Here we isolate this skill using a formal analogue of the task: string transduction based on a formal grammar provided in-context. We construct synchronous context-free grammars which define pairs of formal languages designed to model particular aspects of natural language grammar, morphology, and written representation. Using these grammars, we measure how well LLMs can translate sentences from one formal language into another when given both the grammar and the source-language sentence. We vary the size of the grammar, the lengths of the sentences, the syntactic and morphological properties of the languages, and their written script. We note three key findings. First, LLMs' translation accuracy decreases markedly as a function of grammar size and sentence length. Second, differences in morphology and written representation between the source and target languages can strongly diminish model performance. Third, we examine the types of errors committed by models and find they are most prone to recall the wrong words from the target language vocabulary, hallucinate new words, or leave source-language words untranslated.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the use of synchronous context-free grammars (SCFGs) as a formal proxy for evaluating LLMs' ability to perform in-context translation between languages. The authors construct SCFGs to model grammar, morphology, and script differences, then test LLM performance on transduction tasks while varying grammar size, sentence length, and linguistic properties. They report that accuracy decreases with larger grammars and longer sentences, is reduced by morphological and orthographic mismatches, and that models commonly make errors by recalling incorrect words, hallucinating, or failing to translate source words.
Significance. If the SCFG-based tasks validly capture the challenges of in-context learning for low-resource translation, this provides a valuable controlled experimental paradigm for diagnosing LLM limitations in grammar induction and transduction from descriptions. The consistent directional findings across controlled variables strengthen the case for using such formal analogues in future work on LLM capabilities.
Major comments (2)
- [Methods] The experimental setup does not specify the exact prompting templates used for providing the grammar and source sentence to the models, nor the precise versions of the LLMs tested (e.g., specific GPT or Llama checkpoints). This omission makes it difficult to assess or reproduce the quantitative results on performance degradation.
- [Results] No statistical significance tests, error bars, or variance measures are reported for the accuracy trends as a function of grammar size and sentence length. Given that the central claims rely on 'decreases markedly' observations, this weakens the ability to evaluate the robustness of the findings.
Minor comments (2)
- [Abstract] The abstract mentions that 'we examine the types of errors' but defers the specific error categories to the findings; naming them briefly at first mention would improve clarity.
- [Discussion] The paper could benefit from a more explicit comparison to related work on in-context learning for translation, such as studies using natural low-resource languages, to better situate the SCFG approach.
Simulated Author's Rebuttal
We thank the referee for their constructive review and recommendation for minor revision. We appreciate the positive evaluation of the work's significance and address each major comment below.
Point-by-point responses
Referee: [Methods] The experimental setup does not specify the exact prompting templates used for providing the grammar and source sentence to the models, nor the precise versions of the LLMs tested (e.g., specific GPT or Llama checkpoints). This omission makes it difficult to assess or reproduce the quantitative results on performance degradation.
Authors: We agree that explicit specification of prompting templates and exact model versions is necessary for reproducibility. The revised manuscript will include the complete prompting templates (including variations for grammar presentation and in-context examples) in a dedicated appendix and will list the precise model checkpoints used (e.g., gpt-4-turbo-2024-04-09 and Meta-Llama-3-70B-Instruct). Revision: yes.
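For illustration only, a prompt along the lines described might be rendered as below; the wording, rule notation, and function name are assumptions, not the templates the authors promise for the appendix.

```python
# Hypothetical prompt template for the in-context transduction task.
# The layout and phrasing are assumptions made for illustration.
def build_prompt(rules, source_sentence):
    """Render SCFG rules as numbered lines, then request a translation."""
    rule_lines = "\n".join(
        f"{i}. {lhs} -> {src} / {tgt}"          # source side / target side
        for i, (lhs, src, tgt) in enumerate(rules, start=1)
    )
    return (
        "You are given a synchronous grammar. Each rule maps a source-side\n"
        "expansion to a target-side expansion.\n\n"
        f"Rules:\n{rule_lines}\n\n"
        "Translate this source sentence into the target language:\n"
        f"{source_sentence}\n"
        "Answer with the target sentence only."
    )
```

Publishing something of this shape verbatim, per model, is what would make the degradation curves reproducible.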
Referee: [Results] No statistical significance tests, error bars, or variance measures are reported for the accuracy trends as a function of grammar size and sentence length. Given that the central claims rely on 'decreases markedly' observations, this weakens the ability to evaluate the robustness of the findings.
Authors: We acknowledge that the absence of statistical tests and variance measures limits the strength of the reported trends. In the revision we will add error bars computed over multiple independent runs (varying sampling seeds where applicable) and include paired statistical significance tests (e.g., McNemar's test or Wilcoxon signed-rank) for the key comparisons of accuracy across grammar sizes and sentence lengths. Revision: yes.
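As a concrete reference point for the proposed analysis, the exact (binomial) form of McNemar's test depends only on the discordant pairs. This stdlib sketch is illustrative, not the authors' analysis code.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value from paired 0/1 correctness lists."""
    # Only discordant pairs carry information about a difference in accuracy.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n, k = b + c, min(b, c)
    if n == 0:
        return 1.0  # the two conditions never disagree
    # Under H0 (equal accuracy), discordant outcomes are Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, an 8-vs-2 split of discordant pairs gives p = 0.109375 regardless of how many concordant items there are, so the design needs a fair number of disagreements to reach significance.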
Circularity Check
No significant circularity; purely empirical measurement study
full rationale
The paper constructs synchronous context-free grammars to generate controlled test cases modeling aspects of grammar, morphology, and script, then directly queries LLMs on source sentences given the grammar in context and measures translation accuracy, error types, and degradation with grammar size or length. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the methods or claims; results follow from explicit experimental runs on the generated data. The generalization to natural languages is flagged as a scope limitation rather than an internal derivation step.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: synchronous context-free grammars can be constructed to model particular aspects of natural language grammar, morphology, and written representation.