pith. sign in

arxiv: 2605.14890 · v2 · pith:TYEQY3D6new · submitted 2026-05-14 · 💻 cs.CL · cs.AI

Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study

Pith reviewed 2026-05-20 21:04 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords tokenizer fertilityUkrainian legal textfoundation modelszero-shot performancefew-shot promptingtemporal generalizationlegal NLP
0
0 comments X

The pith

Tokenizer fertility varies by a factor of 1.6 across foundation models on Ukrainian legal text, a cost factor ignored in current model selection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper measures how many tokens different foundation models need to encode the same Ukrainian court decisions. It shows that some models use 60 percent more tokens than others for identical inputs, directly raising inference costs. The benchmarks also compare zero-shot and few-shot results on three tasks and track performance drops when models trained on older legal texts encounter wartime documents. Scale proves a weak predictor of results, and few-shot examples from Ukrainian hurt rather than help accuracy. The work matters for anyone deploying models in under-resourced legal domains where token efficiency and temporal stability affect real-world reliability.

Core claim

Tokenizer fertility, defined as the number of tokens a model uses for a given text, ranges from lowest in Llama-family models to 60 percent higher in Qwen 3 models on Ukrainian legal inputs. NVIDIA Nemotron Super 3 (120B) scores highest overall while costing less than larger alternatives. Few-shot prompting lowers performance by up to 26 points due to language-specific effects. Classifiers lose 27.9 points when tested on post-invasion decisions after training on pre-war ones, with newer models transferring better backward than forward.

What carries the argument

Tokenizer fertility, which counts the tokens needed to represent fixed legal text and thereby sets the effective cost and context length for any downstream task.

If this is right

  • Model selection for Ukrainian legal applications must include tokenizer fertility checks to avoid unnecessary cost increases.
  • Zero-shot prompting is more reliable than few-shot for this morphologically rich language.
  • Model parameter count does not reliably predict performance on specialized legal classification tasks.
  • Legal classifiers require regular updates or validation when the underlying social and legal context changes over time.
  • Public release of annotated Ukrainian court data fills a gap in existing benchmarks for non-English legal NLP.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Token efficiency differences likely affect other languages with complex morphology beyond Ukrainian.
  • Systems could dynamically route inputs to the most fertile tokenizer for a given language or domain.
  • The observed asymmetry in temporal transfer may generalize to other domains experiencing rapid language evolution.
  • Dataset releases like this one enable future work on conflict-related shifts in legal language.

Load-bearing premise

The 273 court decisions drawn from the state registry stand in for the broader Ukrainian legal language and the three evaluation tasks across the studied time periods.

What would settle it

A follow-up experiment that applies the same tokenization and task measurements to a larger, independently collected sample of Ukrainian legal texts and finds no 1.6x fertility spread or the reported performance gaps would falsify the central claims.

Figures

Figures reproduced from arXiv: 2605.14890 by Volodymyr Ovcharov.

Figure 1
Figure 1. Figure 1: Tokenizer fertility (average tokens per whitespace-delimited word) on 100 Ukrainian [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Few-shot effect (few-shot minus zero-shot, in percentage points) across all model–task [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Cost–quality frontier for seven models on Ukrainian legal text. Each point represents [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Cross-temporal generalization heatmap. Rows: training epoch; columns: test epoch. [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗
read the original abstract

Tokenizer fertility varies 1.6x across foundation models on Ukrainian legal text, yet this cost-critical dimension is absent from model selection practice. We benchmark seven models from five providers on 273 validated court decisions from Ukraine's state registry (EDRSR), measuring tokenizer fertility and zero-shot performance on three tasks. Four findings emerge. (1) Qwen 3 models consume 60% more tokens than Llama-family models on identical input, making tokenizer analysis a prerequisite for cost-efficient deployment. (2) NVIDIA Nemotron Super 3 (120B) achieves the highest composite score (83.1), outperforming Mistral Large 3 (5.6x more total parameters) at one-third the API cost model scale is a poor proxy for domain performance. (3) Few-shot prompting degrades performance by up to 26 percentage points; stratified and prompt-sensitivity ablations confirm this is intrinsic to Ukrainian-language demonstrations, not an artifact of example selection. (4) A cross-temporal generalization experiment reveals that classifiers trained on pre-war court ecisions (2008-2013) lose 27.9 percentage points when applied to full-scale invasion era decisions (2022-2026), with a pronounced forward-backward asymmetry: newer models transfer backward (+14.6 pp above forward transfer), but older models fail catastrophically on wartime legal language. For practitioners: tokenizer analysis should precede model selection, and zero-shot is a more reliable default than few-shot for morphologically rich languages. To support reproducibility and address the absence of Ukrainian from legal NLP benchmarks, we release a public dataset of 14,452 court decisions spanning 2008-2026, annotated with seven outcome labels across three temporal epochs that capture the impact of armed conflict on judicial proceedings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that tokenizer fertility varies by a factor of 1.6 across seven foundation models on Ukrainian legal text, with Qwen 3 models consuming 60% more tokens than Llama-family models on the same inputs. Benchmarking on 273 validated court decisions from the EDRSR registry shows that NVIDIA Nemotron Super 3 (120B) achieves the highest composite score while being cheaper than larger models, that few-shot prompting degrades performance by up to 26 points due to intrinsic issues with Ukrainian demonstrations, and that cross-temporal generalization from pre-war (2008-2013) to wartime (2022-2026) decisions drops by 27.9 points with asymmetric transfer favoring newer models. The authors release a dataset of 14,452 annotated decisions spanning 2008-2026 to support reproducibility.

Significance. If the empirical measurements hold, the work demonstrates that tokenizer fertility is a first-order cost factor for deploying LLMs in morphologically rich, low-resource legal domains and that model scale is a poor proxy for domain performance. The public release of the large temporally stratified dataset fills a gap in legal NLP resources and enables studies of language shift under armed conflict, strengthening the practical recommendation that tokenizer analysis should precede model selection for Ukrainian legal applications.

major comments (2)
  1. [Dataset and Experimental Setup] The headline 1.6x fertility variation and all performance results are measured exclusively on the 273 validated EDRSR decisions. The manuscript releases 14,452 decisions but provides no stratification, sampling justification, or robustness checks across subsamples by case type, court level, document length, or temporal epoch. This makes it unclear whether the reported ratios (e.g., Qwen 3 vs. Llama-family) are stable properties of Ukrainian legal text or artifacts of the validated subset, directly weakening the claim that tokenizer analysis must precede model selection in practice.
  2. [Cross-Temporal Generalization Experiment] The cross-temporal experiment reports a 27.9 pp drop when training on 2008-2013 decisions and testing on 2022-2026 decisions, with a +14.6 pp backward-transfer advantage for newer models. Without explicit details on the classifier architecture, input representation, or controls for document-length and vocabulary-shift confounds, it is difficult to attribute the asymmetry specifically to wartime language change rather than other distributional differences.
minor comments (2)
  1. [Abstract] The abstract refers to performance on 'three tasks' without naming or briefly describing them; adding this information in the abstract or early introduction would improve readability and allow readers to assess the scope of the zero-shot and few-shot claims.
  2. [Results] Tables reporting accuracy or composite scores should include standard errors or confidence intervals and the exact number of examples per condition to support the comparative statements (e.g., Nemotron outperforming Mistral Large 3).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive suggestions. Below we provide point-by-point responses to the major comments and indicate the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Dataset and Experimental Setup] The headline 1.6x fertility variation and all performance results are measured exclusively on the 273 validated EDRSR decisions. The manuscript releases 14,452 decisions but provides no stratification, sampling justification, or robustness checks across subsamples by case type, court level, document length, or temporal epoch. This makes it unclear whether the reported ratios (e.g., Qwen 3 vs. Llama-family) are stable properties of Ukrainian legal text or artifacts of the validated subset, directly weakening the claim that tokenizer analysis must precede model selection in practice.

    Authors: We agree that additional details on the dataset construction and robustness would strengthen the paper. The 273 decisions represent a carefully validated subset selected for annotation quality and task suitability from the larger corpus of 14,452 decisions. To address the concern, we will include in the revised manuscript a new appendix or subsection detailing the sampling justification, including available stratifications by case type, court level, and temporal distribution. We will also conduct and report robustness checks by computing tokenizer fertility ratios on multiple random subsamples of the full released dataset to demonstrate that the 1.6x variation is not specific to the validated subset. revision: yes

  2. Referee: [Cross-Temporal Generalization Experiment] The cross-temporal experiment reports a 27.9 pp drop when training on 2008-2013 decisions and testing on 2022-2026 decisions, with a +14.6 pp backward-transfer advantage for newer models. Without explicit details on the classifier architecture, input representation, or controls for document-length and vocabulary-shift confounds, it is difficult to attribute the asymmetry specifically to wartime language change rather than other distributional differences.

    Authors: We thank the referee for this feedback. While the manuscript describes the overall experimental design, we recognize that more granular details are warranted to support the attribution to language change. In the revised version, we will expand the relevant section to provide explicit details on the classifier architecture, the input representation and preprocessing steps, and add controls including document length normalization and measures of vocabulary shift between the pre-war and wartime periods. These additions will help clarify the sources of the observed asymmetry. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on external models and new dataset

full rationale

The paper reports tokenizer fertility ratios and zero-shot/few-shot performance scores obtained by applying seven public foundation models to a fixed set of 273 court decisions drawn from the released EDRSR corpus. No equations, fitted parameters, or derivations appear; the 1.6x fertility variation and composite scores are direct counts and accuracy measurements rather than quantities defined in terms of themselves. The cross-temporal generalization experiment likewise consists of training and testing classifiers on temporally partitioned subsets of the same external data. No self-citation is invoked to justify a uniqueness theorem or ansatz, and the central claims remain independent of any internal fitting loop. The study is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard NLP evaluation practices and the assumption that the selected court decisions and tasks adequately represent Ukrainian legal text across time periods. No free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The three tasks on court decisions validly measure model performance on Ukrainian legal text understanding.
    Invoked when reporting composite scores and performance differences without further justification in the abstract.

pith-pipeline@v0.9.0 · 5856 in / 1378 out tokens · 47799 ms · 2026-05-20T21:04:05.270080+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 6 internal anchors

  1. [1]

    How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

    Rust, P., Pfeiffer, J., Vuli\' c , I., Ruder, S., and Gurevych, I. How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models. Proceedings of the 59th Annual Meeting of the ACL, pages 3118--3135, 2021. https://aclanthology.org/2021.acl-long.243/

  2. [2]

    Language Model Tokenizers Introduce Unfairness Between Languages

    Petrov, A., La Malfa, E., Torr, P., and Bibi, A. Language Model Tokenizers Introduce Unfairness Between Languages. Advances in Neural Information Processing Systems, 37, 2024. https://arxiv.org/abs/2305.15425

  3. [3]

    I., Kreutzer, J., and Hooker, S

    Ahia, O., Ogueji, K., Winata, G. I., Kreutzer, J., and Hooker, S. Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models. Proceedings of EMNLP 2023, pages 9524--9538, 2023. https://aclanthology.org/2023.emnlp-main.614/

  4. [4]

    Neural Machine Translation of Rare Words with Subword Units

    Sennrich, R., Haddow, B., and Birch, A. Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the ACL, pages 1715--1725, 2016. https://aclanthology.org/P16-1162/

  5. [5]

    and Richardson, J

    Kudo, T. and Richardson, J. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. Proceedings of EMNLP 2018: System Demonstrations, pages 66--71, 2018. https://aclanthology.org/D18-2012/

  6. [6]

    LEGAL-BERT: The Muppets straight out of Law School

    Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., and Androutsopoulos, I. LEGAL-BERT: The Muppets straight out of Law School. Findings of EMNLP 2020, pages 2898--2904, 2020. https://aclanthology.org/2020.findings-emnlp.261/

  7. [7]

    LEXTREME : A Multi-Lingual and Multi-Task Benchmark for the Legal Domain

    Niklaus, J., Matoshi, V., Rani, P., Galassi, A., St \"u rmer, M., and Chalkidis, I. LEXTREME : A Multi-Lingual and Multi-Task Benchmark for the Legal Domain. Findings of EMNLP 2023, pages 12898--12916, 2023. https://aclanthology.org/2023.findings-emnlp.865/

  8. [8]

    Measuring Massive Multitask Language Understanding

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring Massive Multitask Language Understanding. Proceedings of ICLR, 2021. https://arxiv.org/abs/2009.03300

  9. [9]

    Language Models are Few-Shot Learners

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., et al. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 33:1877--1901, 2020. https://arxiv.org/abs/2005.14165

  10. [10]

    Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity

    Lu, Y., Bartolo, M., Moore, A., Riedel, S., and Stenetorp, P. Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. Proceedings of the 60th Annual Meeting of the ACL, pages 8086--8098, 2022. https://aclanthology.org/2022.acl-long.556/

  11. [11]

    Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? Proceedings of EMNLP 2022, pages 11048--11064, 2022

    Min, S., Lyu, X., Holtzman, A., Arber, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? Proceedings of EMNLP 2022, pages 11048--11064, 2022. https://aclanthology.org/2022.emnlp-main.759/

  12. [12]

    D., Ngo, N

    Lai, V. D., Ngo, N. T., Veyseh, A. P. B., Man, H., Dernoncourt, F., Bui, T., and Nguyen, T. H. ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning. Findings of EMNLP 2023, pages 13171--13189, 2023. https://aclanthology.org/2023.findings-emnlp.878/

  13. [13]

    Unsupervised Cross-lingual Representation Learning at Scale

    Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., et al. Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of the 58th Annual Meeting of the ACL, pages 8440--8451, 2020. https://aclanthology.org/2020.acl-main.747/

  14. [14]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, H., Martin, L., Stone, K., et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288, 2023. https://arxiv.org/abs/2307.09288

  15. [15]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., et al. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783, 2024. https://arxiv.org/abs/2407.21783

  16. [16]

    A Survey of AI Agent Protocols.arXiv preprint arXiv:2504.16736, 2025

    Meta AI . The Llama 4 Herd of Models. arXiv preprint arXiv:2504.16736, 2025. https://arxiv.org/abs/2504.16736

  17. [17]

    Mistral Large

    Mistral AI . Mistral Large. Technical report, 2024. https://mistral.ai/news/mistral-large-2407/

  18. [18]

    Nemotron Super: Open Hybrid Mamba-Transformer Models

    NVIDIA . Nemotron Super: Open Hybrid Mamba-Transformer Models. Technical report, 2025. https://developer.nvidia.com/blog/nemotron-super-open-model-for-enterprise-reasoning/

  19. [19]

    Amazon Nova: Foundation Models for Enterprise AI

    Amazon Web Services . Amazon Nova: Foundation Models for Enterprise AI. Technical report, 2024. https://aws.amazon.com/ai/generative-ai/nova/

  20. [20]

    Qwen3 Technical Report

    Qwen Team . Qwen3 Technical Report. Technical report, 2025. https://qwenlm.github.io/blog/qwen3/

  21. [21]

    Finetuned Language Models Are Zero-Shot Learners

    Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned Language Models Are Zero-Shot Learners. Proceedings of ICLR, 2022. https://arxiv.org/abs/2109.01652

  22. [22]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Zheng, L., Chiang, W.-L., Sheng, Y., et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36, 2024. https://arxiv.org/abs/2306.05685

  23. [23]

    lang-uk: Building a Comprehensive Corpus and Language Technology for Ukrainian

    Kotsyba, N., Mykulyak, A., and Shvedova, M. lang-uk: Building a Comprehensive Corpus and Language Technology for Ukrainian. Proceedings of LREC 2018, 2018. https://lang.org.ua/en/

  24. [24]

    and Nahorna, O

    Syvokon, O. and Nahorna, O. UA-GEC : Grammatical Error Correction and Fluency Corpus for the Ukrainian Language. Proceedings of the Second UNLP Workshop, pages 96--102, 2023. https://aclanthology.org/2023.unlp-1.12/

  25. [25]

    Introducing UberText 2.0: A Corpus of Modern Ukrainian at Scale

    Chaplynskyi, D. Introducing UberText 2.0: A Corpus of Modern Ukrainian at Scale. Proceedings of the Second UNLP Workshop, pages 1--10, 2023. https://aclanthology.org/2023.unlp-1.1/

  26. [26]

    Regularization Paths for Generalized Linear Models via Coordinate Descent

    Friedman, J., Hastie, T., and Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1):1--22, 2010. https://www.jstatsoft.org/article/view/v033i01