arxiv: 2411.01141 · v2 · pith:RPQVG4E6new · submitted 2024-11-02 · 💻 cs.CL

Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models

Pith reviewed 2026-05-23 18:00 UTC · model grok-4.3

classification 💻 cs.CL
keywords Dictionary Insertion Prompting · Multilingual Reasoning · Large Language Models · Prompt Engineering · Synthetic Benchmarks · Interleaved Prompts · Multilingual Translation · Reasoning Tasks

The pith

Dictionary Insertion Prompting interleaves English word translations into non-English prompts to improve LLM reasoning performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Dictionary Insertion Prompting (DIP) to address the limits of English-centric LLMs on non-English reasoning tasks. DIP works by looking up words in a dictionary and inserting their English counterparts directly into the middle of the input prompt rather than prepending or appending them. This is tested on synthetic multilingual versions of English benchmarks such as GSM8K and AQuA, created by translating into up to 200 languages using NLLB. The method is claimed to produce better English translations of the query and stronger English-based reasoning steps inside the model. Interleaving the translations outperforms other placements under the same dictionary.

Core claim

When a non-English prompt is given, DIP looks up a word dictionary and inserts the English counterparts of those words into the middle of the prompt; this produces better translation into English and better English model thinking steps, which in turn yields improved results on multilingual reasoning. The placement of the insertions matters, with interleaving outperforming prepending or appending under the same dictionary. Experiments cover 10 to 200 languages drawn from FLORES-200 on four synthetic benchmarks derived from English reasoning datasets.
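
The three placements compared above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy dictionary, the whitespace tokenization, and the `word (english)` and `word = english` formats are all assumptions made for the example, and the paper's actual insertion format may differ.

```python
from typing import Dict, Optional

# Hypothetical toy dictionary (Indonesian -> English), for illustration only.
TOY_DICT: Dict[str, str] = {"membeli": "buy", "lima": "five", "apel": "apple"}

PUNCT = ".,?!"


def lookup(token: str, dictionary: Dict[str, str]) -> Optional[str]:
    """Return the English entry for a token, ignoring case and edge punctuation."""
    return dictionary.get(token.strip(PUNCT).lower())


def interleave(prompt: str, dictionary: Dict[str, str]) -> str:
    """Insert each English counterpart right after the source word it translates."""
    out = []
    for token in prompt.split():
        english = lookup(token, dictionary)
        out.append(f"{token} ({english})" if english else token)
    return " ".join(out)


def glossary(prompt: str, dictionary: Dict[str, str]) -> str:
    """Collect the same word pairs into one standalone glossary line."""
    pairs = []
    for token in prompt.split():
        english = lookup(token, dictionary)
        if english:
            pairs.append(f"{token.strip(PUNCT)} = {english}")
    return "; ".join(pairs)


def prepend(prompt: str, dictionary: Dict[str, str]) -> str:
    """Baseline placement: glossary first, then the untouched prompt."""
    return f"{glossary(prompt, dictionary)}\n{prompt}"


def append(prompt: str, dictionary: Dict[str, str]) -> str:
    """Baseline placement: untouched prompt first, then the glossary."""
    return f"{prompt}\n{glossary(prompt, dictionary)}"


PROMPT = "Ani membeli lima apel."
print(interleave(PROMPT, TOY_DICT))
print(prepend(PROMPT, TOY_DICT))
print(append(PROMPT, TOY_DICT))
```

All three variants carry the same dictionary entries; only their position relative to the source words changes, which is the variable the position ablation is meant to isolate.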

What carries the argument

Dictionary Insertion Prompting (DIP), a prompting technique that interleaves English word translations into non-English inputs.

If this is right

  • Interleaving dictionary words in the prompt produces larger gains than prepending or appending the same words.
  • The technique scales to experiments involving 10 to 200 languages on translated reasoning tasks.
  • The method enables better English translation of the query and stronger English reasoning steps inside the model.
  • Synthetic benchmarks translated by NLLB and back-checked into English serve as evaluation resources when native multilingual reasoning data are scarce.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be extended to insert other forms of external knowledge, such as facts or code snippets, at chosen positions inside prompts.
  • Performance might vary if the dictionary itself contains errors or covers only a subset of the prompt vocabulary.
  • Applying DIP to genuinely native multilingual problems, rather than translated English ones, would test whether the gains transfer beyond synthetic data.

Load-bearing premise

Two assumptions carry the result: that the synthetic benchmarks created by translating English datasets accurately reflect real multilingual reasoning challenges, and that the observed gains come from the dictionary insertions rather than from prompt-length changes or translation artifacts.

What would settle it

Running the same prompts with English words replaced by random strings of matching length and observing whether the performance advantage over baselines disappears.
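
One way to build such a control is sketched below. It is an editorial illustration under stated assumptions: the toy dictionary is hypothetical, and "matching length" is taken to mean the same number of characters, which does not guarantee the same number of model tokens. A stricter control would match token counts under the target model's tokenizer.

```python
import random
import string
from typing import Dict

# Hypothetical toy dictionary (Indonesian -> English), for illustration only.
TOY_DICT: Dict[str, str] = {"membeli": "buy", "lima": "five", "apel": "apple"}


def random_like(word: str, rng: random.Random) -> str:
    """Random lowercase string with the same number of characters as `word`."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in word)


def scramble_dictionary(dictionary: Dict[str, str], seed: int = 0) -> Dict[str, str]:
    """Replace every English entry with a length-matched random string.

    Source-side keys are untouched, so the control dictionary triggers
    insertions at exactly the same positions as the real one.
    """
    rng = random.Random(seed)  # fixed seed keeps the control reproducible
    return {src: random_like(eng, rng) for src, eng in dictionary.items()}


CONTROL = scramble_dictionary(TOY_DICT, seed=0)
for src, eng in TOY_DICT.items():
    print(src, eng, CONTROL[src])
```

Feeding `CONTROL` through the same insertion routine as the real dictionary keeps the prompt's shape and character length fixed while removing the lexical information.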

Figures

Figures reproduced from arXiv: 2411.01141 by Hongyuan Lu, Wai Lam, Zixuan Li.

Figure 1: An illustrated comparison of the GSM8K dataset made up in Buginese. Compared to the standard ...
Figure 2: An illustrated comparison of the GSM8K dataset with 200 languages from FLORES-200 on six base ...
Original abstract

There are two shortages in the current Large Language Models (LLMs) era. The first is short of multilingual models, where most LLMs are English-centric and performance is limited on multilingual reasoning. The second is the place of external knowledge to be used, where most retrieved knowledge is prepended to the user queries (maybe sub-optimal). This paper presents a novel and simple yet effective method called Dictionary Insertion Prompting (DIP). When providing a non-English prompt, DIP looks up a word dictionary and inserts words' English counterparts into the middle of the prompt for LLMs. It then enables better translation into English and better English model thinking steps which leads to obviously better results. We experiment with 10 to 200 languages from FLORES-200. (Footnote: The number of languages varies on the datasets, and we experiment with 200 languages on GSM8K as in Appendix.) Since there are no adequate datasets, we use the NLLB translator to create synthetic multilingual benchmarks from the existing 4 English reasoning benchmarks such as GSM8K and AQuA. The synthetic benchmarks are translated back into English for quality assurance with manual annotation. Interestingly, the place for injecting the dictionary plays an important factor in the performance gains, and we found that interleaving the dictionary with the original words gives a better performance compared to prepending/appending the dictionary, under the same dictionary constructed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that Dictionary Insertion Prompting (DIP) improves multilingual reasoning in English-centric LLMs by retrieving English word translations from a dictionary and interleaving them into non-English prompts. This is said to facilitate better translation to English and subsequent reasoning steps. Experiments use NLLB to create synthetic multilingual versions of English benchmarks (GSM8K, AQuA and others) across 10–200 languages from FLORES-200; back-translation plus manual annotation serves as quality assurance. Ablations indicate that interleaving the dictionary outperforms prepending or appending it under the same dictionary.

Significance. If the gains prove attributable to the insertion mechanism rather than translation artifacts or length changes, DIP would supply a simple, training-free technique for boosting multilingual performance on existing LLMs at scale. The breadth of language coverage and the explicit position ablation are positive features; however, the synthetic-data foundation limits immediate applicability until native multilingual validation is provided.

major comments (3)
  1. [Data construction / abstract] Data construction (abstract and methods): synthetic benchmarks are produced via NLLB with only back-translation and manual spot-checks described; no quantitative translation-quality metrics (BLEU, COMET, or error-rate statistics) or inter-annotator agreement figures are supplied. This is load-bearing because the central claim that DIP yields “obviously better results” on multilingual reasoning rests on the fidelity of these benchmarks.
  2. [Results / position ablation] Ablation design (results section): the interleaving vs. prepend/append comparison does not control for total prompt length or lexical density; all conditions add the same dictionary content, so observed position effects could be confounded by differences in effective context length or token placement rather than insertion location per se.
  3. [Results] Evaluation reporting (abstract and results): performance differences by insertion position are stated without reference to statistical significance tests, run-to-run variance, error bars, or detailed baseline definitions (standard prompting, CoT, retrieval baselines). This weakens the evidential support for the claimed gains.
minor comments (1)
  1. [Abstract] The footnote on varying language counts across datasets should be expanded with a table or explicit per-dataset counts in the main text for clarity.
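
To make major comment 1 concrete, the sketch below scores a back-translation against its English original with a simplified character n-gram F-score. It is written in the spirit of chrF but is not the official chrF, BLEU, or COMET; a real audit would use an established implementation. The sentence pair is invented for illustration and is not taken from the benchmarks.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_fscore(reference: str, hypothesis: str, max_n: int = 3, beta: float = 2.0) -> float:
    """Simplified character n-gram F-beta score in [0, 1].

    Precision and recall are averaged over n-gram orders 1..max_n, then
    combined with recall weighted beta times as heavily as precision.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        ref, hyp = char_ngrams(reference, n), char_ngrams(hypothesis, n)
        if not ref or not hyp:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((ref & hyp).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# Invented example pair: English original vs. a hypothetical back-translation.
REFERENCE = "Ani buys five apples."
BACK_TRANSLATION = "Ani bought five apples."
SCORE = ngram_fscore(REFERENCE, BACK_TRANSLATION)
print(round(SCORE, 3))
```

Reporting a score like this per language, next to the manual spot-check counts, would show which of the translated test sets are reliable enough to support the headline comparison.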

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Data construction / abstract] Data construction (abstract and methods): synthetic benchmarks are produced via NLLB with only back-translation and manual spot-checks described; no quantitative translation-quality metrics (BLEU, COMET, or error-rate statistics) or inter-annotator agreement figures are supplied. This is load-bearing because the central claim that DIP yields “obviously better results” on multilingual reasoning rests on the fidelity of these benchmarks.

    Authors: We agree that quantitative metrics would strengthen the presentation. In the revised manuscript we will report BLEU and COMET scores (computed on the back-translations against the original English) for a sampled subset of each benchmark, together with the exact number of sentences manually inspected and the observed error categories. The back-translation step already functions as an automatic filter, and the manual checks were performed on a stratified sample; adding the numeric metrics will make this process fully transparent without altering the experimental conclusions. revision: yes

  2. Referee: [Results / position ablation] Ablation design (results section): the interleaving vs. prepend/append comparison does not control for total prompt length or lexical density; all conditions add the same dictionary content, so observed position effects could be confounded by differences in effective context length or token placement rather than insertion location per se.

    Authors: The concern is valid: although the dictionary content is identical, tokenization can produce different effective lengths. In the revision we will add a controlled-length ablation in which we truncate or pad the dictionary entries so that total prompt length (in tokens) is matched across the three insertion positions, and we will report average token counts for each condition. This will isolate the effect of insertion location from length confounds. revision: yes

  3. Referee: [Results] Evaluation reporting (abstract and results): performance differences by insertion position are stated without reference to statistical significance tests, run-to-run variance, error bars, or detailed baseline definitions (standard prompting, CoT, retrieval baselines). This weakens the evidential support for the claimed gains.

    Authors: We will revise the results section and abstract to include (i) standard deviations and error bars from at least three independent runs with different random seeds, (ii) paired t-test p-values for the key position comparisons, and (iii) explicit definitions and prompting templates for all baselines (standard zero-shot, CoT, and any retrieval-augmented variants). These additions will be placed in both the main text and the appendix. revision: yes
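
The paired comparison promised in response 3 can be sketched with the standard library alone. The per-seed counts below are placeholders invented to show the mechanics; they are not results from the paper. The function returns the t statistic and degrees of freedom only, since the p-value needs a t-distribution table or a statistics package (for example `scipy.stats.ttest_rel`).

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched runs a[i], b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 runs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# PLACEHOLDER data: correct answers per seed for two insertion positions.
# These numbers are made up for illustration and are not from the paper.
INTERLEAVE_CORRECT = [10, 12, 11]
PREPEND_CORRECT = [8, 9, 7]

T_STAT, DOF = paired_t(INTERLEAVE_CORRECT, PREPEND_CORRECT)
print("interleave mean/sd:", statistics.mean(INTERLEAVE_CORRECT), statistics.stdev(INTERLEAVE_CORRECT))
print("prepend mean/sd:", statistics.mean(PREPEND_CORRECT), statistics.stdev(PREPEND_CORRECT))
print("t statistic:", T_STAT, "degrees of freedom:", DOF)
```

Pairing by seed matters here: each seed sees the same test items under both positions, so the test is run on per-seed differences rather than on two independent samples.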

Circularity Check

0 steps flagged

No circularity: empirical prompting method with external benchmarks

Full rationale

The paper describes a prompting technique (DIP) and evaluates it empirically on NLLB-translated versions of standard English reasoning datasets (GSM8K, AQuA, etc.), with back-translation and manual annotation for quality. No equations, derivations, fitted parameters, or self-citations appear in the provided text as load-bearing elements of any claimed result. The central claims rest on observed performance differences rather than any reduction of outputs to self-defined quantities or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach depends on the quality and representativeness of NLLB-generated synthetic data and the assumption that English word insertion improves internal translation and reasoning without introducing new artifacts.

axioms (1)
  • Domain assumption: the NLLB translator produces sufficiently accurate translations to create valid multilingual reasoning benchmarks from the English originals.
    Used to generate the test sets for 10 to 200 languages; quality is checked only by manual annotation on a subset.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models

    cs.CL 2025-07 conditional novelty 6.0

    SLoW selects low-frequency word dictionaries to boost LLM translation quality and efficiency across 100 languages from FLORES.
