SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models
Pith reviewed 2026-05-21 23:26 UTC · model grok-4.3
The pith
SLoW selects low-frequency word dictionaries to enhance LLM translation while reducing token costs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that selecting dictionaries based on low word frequency, estimated without access to the LLM's training data, provides a flexible trade-off that improves translation quality and lowers token usage. In experiments using the FLORES benchmark across 100 languages, this approach outperforms strong baselines and, for many languages, even exceeds the performance obtained when all dictionaries are included in the prompt.
What carries the argument
The SLoW selection criterion that identifies and uses only dictionaries containing lower-frequency words, as estimated from public corpora.
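The criterion above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the frequency table `freq`, the bilingual `dictionary`, the function name `select_low_frequency`, and the budget `v` are all hypothetical, and the toy counts are made up for the example.

```python
from typing import Dict, List, Tuple


def select_low_frequency(
    source_words: List[str],
    dictionary: Dict[str, str],
    freq: Dict[str, int],
    v: int,
) -> List[Tuple[str, str]]:
    """Keep the v dictionary entries whose source words are rarest.

    All names are hypothetical. Words missing from the frequency table
    are treated as frequency 0, i.e. as the rarest words (an assumption).
    """
    # Deduplicate while preserving order; keep only words with an entry.
    candidates = [w for w in dict.fromkeys(source_words) if w in dictionary]
    # Sort ascending by estimated frequency; sorted() is stable, so ties
    # keep their order of first appearance in the sentence.
    ranked = sorted(candidates, key=lambda w: freq.get(w, 0))
    return [(w, dictionary[w]) for w in ranked[:v]]


# Toy frequency counts standing in for statistics from a public corpus.
freq = {"the": 1000, "river": 40, "estuary": 3, "near": 200}
dictionary = {"river": "rivière", "estuary": "estuaire", "near": "près"}
sentence = ["the", "estuary", "near", "the", "river"]

selected = select_low_frequency(sentence, dictionary, freq, v=2)
print(selected)
```

With `v=2`, the two rarest words that have dictionary entries are kept and the entry for the more common word is left out of the prompt.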
If this is right
- It saves token usage compared to the full dictionary baseline.
- It surpasses strong baselines on 100 languages from FLORES.
- Many languages show better translation performance than using the full set of dictionaries.
- No access to training data or model tuning is needed for the frequency estimates.
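To make the token-saving point concrete, here is a rough sketch comparing prompt length with the full dictionary against a selected subset. The prompt template in `build_prompt` is hypothetical (the paper's actual template is not shown here), and `rough_token_count` uses a whitespace split as a crude stand-in for a real LLM tokenizer.

```python
from typing import List, Tuple


def build_prompt(sentence: str, entries: List[Tuple[str, str]], target: str) -> str:
    """Assemble a dictionary-augmented translation prompt (hypothetical template)."""
    hints = [f'"{src}" means "{tgt}" in {target}.' for src, tgt in entries]
    return " ".join(hints + [f"Translate into {target}: {sentence}"])


def rough_token_count(text: str) -> int:
    """Whitespace split; real tokenizers will give different counts."""
    return len(text.split())


sentence = "the estuary near the river"
full_entries = [("estuary", "estuaire"), ("near", "près"), ("river", "rivière")]
selected_entries = [("estuary", "estuaire")]  # e.g. only the rarest word

full_cost = rough_token_count(build_prompt(sentence, full_entries, "French"))
selected_cost = rough_token_count(build_prompt(sentence, selected_entries, "French"))
print(full_cost, selected_cost)
```

Each dropped entry removes one hint sentence from the prompt, so the saving grows with the number of entries left out.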
Where Pith is reading between the lines
- Similar frequency-based selection could be explored for other LLM prompting tasks like summarization or question answering.
- Public frequency proxies might help in other low-resource language applications beyond translation.
- Testing the method on different benchmarks or larger sets of languages could reveal its broader applicability.
Load-bearing premise
That estimates of word frequency from public resources accurately point to the dictionaries that will most benefit the LLM's translation output.
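One way to probe this premise, sketched under stated assumptions, is to check how much the selection changes when the frequency estimates come from a different public corpus. The two frequency tables below are made-up toy numbers, and `rarest` and `jaccard` are hypothetical helper names; a low overlap would suggest the selection is sensitive to the choice of proxy.

```python
from typing import Dict, List, Set


def rarest(words: List[str], freq: Dict[str, int], k: int) -> Set[str]:
    """The k words with the lowest estimated frequency (unseen words count as 0)."""
    # The secondary key breaks frequency ties alphabetically.
    return set(sorted(words, key=lambda w: (freq.get(w, 0), w))[:k])


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


words = ["river", "estuary", "near", "delta", "the"]
# Toy counts standing in for two different public corpora.
freq_corpus_a = {"river": 40, "estuary": 3, "near": 200, "delta": 9, "the": 1000}
freq_corpus_b = {"river": 15, "estuary": 20, "near": 90, "delta": 4, "the": 800}

sel_a = rarest(words, freq_corpus_a, k=2)
sel_b = rarest(words, freq_corpus_b, k=2)
overlap = jaccard(sel_a, sel_b)
print(sel_a, sel_b, overlap)
```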
What would settle it
An experiment showing that selecting high-frequency word dictionaries yields better translation quality than low-frequency ones on the same set of languages would challenge the central mechanism.
Original abstract
There are more than 7,000 languages around the world, and current Large Language Models (LLMs) only support hundreds of languages. Dictionary-based prompting methods can enhance translation on them, but most methods use all the available dictionaries, which could be expensive. Instead, it will be flexible to have a trade-off between token consumption and translation performance. This paper proposes a novel task called Automatic Dictionary Selection (ADS). The goal of the task is to automatically select which dictionary to use to enhance translation. We propose a novel and effective method which we call Select Low-frequency Words! (SLoW) which selects those dictionaries that have a lower frequency. Our methods have unique advantages. First, there is no need for access to the training data for frequency estimation (which is usually unavailable). Second, it inherits the advantage of dictionary-based methods, where no additional tuning is required on LLMs. Experimental results on 100 languages from FLORES indicate that SLoW surpasses strong baselines, and it can obviously save token usage, with many languages even surpassing the translation performance of the full dictionary baseline.
Footnote 1: A shocking fact is that there is no need to use the actual training data (often unobtainable) for frequency estimation, and an estimation frequency obtained using public resources is still apparently effective in improving translation with ChatGPT and Llama, and DeepSeek.
Footnote 2: Code and data available upon publication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Automatic Dictionary Selection (ADS) task for LLM-based translation on low-resource languages. It proposes SLoW, which automatically selects dictionaries by ranking on low-frequency words estimated from public corpora rather than LLM training data. Experiments on 100 languages from FLORES report that SLoW outperforms strong baselines, reduces token usage, and for many languages exceeds the translation quality of the full-dictionary baseline with models including ChatGPT, Llama, and DeepSeek.
Significance. If the results hold under closer scrutiny, the work provides a practical, tuning-free method for balancing token cost and quality in multilingual prompting without requiring access to proprietary training data. The scale of the evaluation and the demonstration that public frequency resources can serve as effective proxies are notable strengths that could extend to other dictionary-augmented tasks.
major comments (3)
- [Experimental Setup] Experimental Setup: the low-frequency selection threshold or ranking cutoff is listed as a free parameter yet the text claims 'no additional tuning'; clarify whether this cutoff is fixed globally, chosen per language, or determined without reference to the FLORES evaluation data.
- [Results and Analysis] Results and Analysis: outperformance over the full-dictionary baseline and token savings are reported, but no statistical significance tests, confidence intervals, or per-language variance are provided; this makes it difficult to assess whether the gains are robust or driven by a subset of languages.
- [Method] Method: the central claim that public-resource frequencies reliably identify useful dictionaries rests on an untested assumption that these frequencies correlate with LLM-internal token utility; an ablation across alternative public corpora or a direct comparison to model-derived frequencies would be needed to support the 'shocking fact' that training data access is unnecessary.
minor comments (3)
- [Abstract] Abstract footnote uses informal phrasing ('a shocking fact'); rephrase for journal tone while retaining the substantive point.
- [Experiments] Provide explicit references or implementation details for all 'strong baselines' mentioned in the experiments.
- [Introduction] Clarify early in the paper what constitutes a 'dictionary' entry (word-to-translation pair, full phrase, etc.) to avoid ambiguity in the selection procedure.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment below, providing clarifications and indicating planned revisions where appropriate.
- Referee: [Experimental Setup] Experimental Setup: the low-frequency selection threshold or ranking cutoff is listed as a free parameter yet the text claims 'no additional tuning'; clarify whether this cutoff is fixed globally, chosen per language, or determined without reference to the FLORES evaluation data.
  Authors: The ranking cutoff is a fixed global threshold (bottom 20% frequency rank) computed once from the public corpus statistics and applied uniformly across all languages. It is not tuned per language and was chosen without reference to FLORES development or test data. We will revise the manuscript to state this explicitly and remove any phrasing that could imply the cutoff is a tunable hyperparameter.
  Revision: yes
- Referee: [Results and Analysis] Results and Analysis: outperformance over the full-dictionary baseline and token savings are reported, but no statistical significance tests, confidence intervals, or per-language variance are provided; this makes it difficult to assess whether the gains are robust or driven by a subset of languages.
  Authors: We agree that statistical tests and variance reporting would strengthen the presentation. In the revision we will add bootstrap confidence intervals on the mean improvements, paired significance tests against the full-dictionary baseline, and a supplementary table showing the fraction of languages where SLoW exceeds the full dictionary.
  Revision: yes
- Referee: [Method] Method: the central claim that public-resource frequencies reliably identify useful dictionaries rests on an untested assumption that these frequencies correlate with LLM-internal token utility; an ablation across alternative public corpora or a direct comparison to model-derived frequencies would be needed to support the 'shocking fact' that training data access is unnecessary.
  Authors: The consistent gains across 100 languages and three LLMs (including closed models) constitute empirical support for the proxy. Direct comparison to model-internal frequencies is not possible for proprietary models such as ChatGPT, which is precisely why a public-corpus method is valuable. We will add an ablation using an alternative public corpus (e.g., Wikipedia) and expand the discussion of the frequency-utility correlation assumption.
  Revision: partial
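The bootstrap confidence intervals mentioned in the second response could be computed along the following lines. This is a sketch of a standard paired percentile bootstrap over per-language scores, not the authors' analysis code; the function name and parameters are hypothetical, and the score lists are placeholder numbers invented for the example, not results from the paper.

```python
import random
from typing import List, Tuple


def paired_bootstrap(
    scores_a: List[float],
    scores_b: List[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, CI low, CI high) for a - b at the 95% level."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    # Pairing: each language contributes one difference.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample languages with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    # Percentile interval: 2.5th and 97.5th percentiles of the resampled means.
    low = means[n_resamples * 25 // 1000]
    high = means[n_resamples * 975 // 1000 - 1]
    return sum(diffs) / n, low, high


# Placeholder per-language scores (invented for illustration only).
slow_scores = [20.1, 15.3, 30.2, 12.0, 25.5]
full_scores = [19.0, 15.9, 29.0, 11.2, 24.9]

mean_diff, ci_low, ci_high = paired_bootstrap(slow_scores, full_scores)
win_rate = sum(a > b for a, b in zip(slow_scores, full_scores)) / len(slow_scores)
print(mean_diff, ci_low, ci_high, win_rate)
```

If the interval excludes zero, the mean gain is unlikely to be a resampling artifact. The win rate is the "fraction of languages" statistic the response promises.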
Circularity Check
No significant circularity; selection criterion is externally defined
The paper defines the SLoW method as selecting low-frequency words via frequency estimates drawn from public resources, explicitly noting that no access to LLM training data is required. This selection rule is fixed in advance and independent of the FLORES evaluation outcomes or any fitted parameters derived from translation performance. Experimental results comparing SLoW to baselines and the full-dictionary case are presented as post-selection validation rather than inputs that define or force the selection itself. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the derivation chain; the central claims rest on empirical comparison under an external frequency proxy rather than reducing to the inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- low-frequency selection threshold or ranking cutoff
axioms (1)
- domain assumption: Frequency estimates from public resources serve as a valid proxy for usefulness in dictionary-based prompting
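The single free parameter in this ledger is the cutoff. The simulated rebuttal describes it as a fixed global threshold at the bottom 20% of the frequency ranking, computed once from public corpus statistics. A minimal sketch of such a cutoff follows, assuming the ranking is over distinct word types with ties broken alphabetically; both are assumptions, as are the function name and the toy corpus.

```python
from collections import Counter
from typing import Iterable, Set


def bottom_fraction_vocab(tokens: Iterable[str], fraction: float = 0.2) -> Set[str]:
    """Return the word types in the lowest `fraction` of the frequency ranking.

    Assumptions: ranks are over distinct word types, and ties are broken
    alphabetically so the result is deterministic.
    """
    counts = Counter(tokens)
    # Rarest first; the word itself is the tie-breaker.
    ranked = sorted(counts, key=lambda w: (counts[w], w))
    cutoff = int(len(ranked) * fraction)
    return set(ranked[:cutoff])


# Toy corpus with 10 word types; the bottom 20% is therefore 2 types.
corpus = "a a a a b b b c c d d e f g h i j j j j".split()
rare = bottom_fraction_vocab(corpus, fraction=0.2)
print(sorted(rare))
```

Because the set is computed once from corpus statistics alone, it can be fixed before any evaluation data is seen, which is the property the rebuttal relies on.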
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: SLoW selects the dictionaries that have a lower frequency in the training data: $\hat{D} = \mathrm{first}\big(\mathrm{sort}_{\bar{x}_i \in D}\big(G(\bar{x}_i, T)\big),\, V\big)$
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: there is no need for access to the training data for frequency estimation (which is usually unavailable)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.