Benchmarking POS Tagging for the Tajik Language: A Comparative Study of Neural Architectures on the TajPersParallel Corpus
Pith reviewed 2026-05-08 16:29 UTC · model grok-4.3
The pith
mBERT with LoRA achieves the highest scores in the first POS tagging benchmark for Tajik on isolated dictionary words.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The study presents the first systematic comparison of neural architectures for Tajik POS tagging: a BiLSTM-CRF baseline alongside the multilingual transformers mBERT, XLM-RoBERTa, ParsBERT, and ruBERT (each adapted with LoRA), all evaluated on the TajPersParallel corpus of isolated lexical units. mBERT + LoRA yields the highest scores, with a macro F1 of 0.11 and a weighted F1 of 0.62. Models handle high-frequency classes such as nouns and adjectives but fail entirely on rare function words, and zero-shot tests suggest Tajik is typologically closest to Persian and Russian among the evaluated languages.
What carries the argument
A comparative evaluation of recurrent and transformer models, with parameter-efficient LoRA fine-tuning applied to the transformers, on context-independent lexical classification as a proxy for Tajik POS tagging.
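To ground the machinery in code: a minimal sketch, assuming a Hugging Face transformers + peft setup, of LoRA-adapting mBERT to classify isolated words. The paper reports neither its LoRA rank, alpha, target modules, nor tag inventory, so every value below is an illustrative assumption; framing an isolated word as single-label sequence classification is likewise our reading, not the authors' stated design.

```python
# A minimal sketch (not the paper's configuration) of LoRA-adapting
# mBERT for tagging isolated dictionary words, framed as single-label
# sequence classification since no sentence context is available.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

NUM_POS_TAGS = 17  # hypothetical tag inventory; the paper does not state its size

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=NUM_POS_TAGS
)

config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                # assumed low-rank dimension
    lora_alpha=16,                      # assumed scaling factor
    lora_dropout=0.1,
    target_modules=["query", "value"],  # BERT attention projections, a common choice
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the adapter weights (and head) train

# A single dictionary entry is tokenized and classified as one unit:
inputs = tokenizer("китоб", return_tensors="pt")  # Tajik for "book"
logits = model(**inputs).logits  # shape: (1, NUM_POS_TAGS)
```

The design point LoRA buys here: with rank-8 adapters on the attention projections, only a small fraction of mBERT's parameters are updated, which matters when the training signal is a ~44,000-entry lexicon rather than a sentence corpus.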
If this is right
- The mBERT + LoRA combination provides the strongest baseline for Tajik POS tagging among tested models.
- The absence of syntactic context leaves all models struggling to resolve morphological ambiguity.
- High-frequency classes like noun and adjective are classified reliably while rare classes see zero performance.
- Zero-shot results suggest Tajik is typologically closest to Persian and Russian among the evaluated languages.
- The benchmark establishes a starting point for future development of Tajik language processing systems.
Where Pith is reading between the lines
- Expanding the corpus with full sentences would likely improve model performance by providing necessary context for disambiguation.
- Further adaptation techniques or larger training data from related languages could address the low macro F1 scores.
- Similar benchmarking approaches may apply to other under-resourced languages using dictionary resources.
- Integration of these models into practical Tajik NLP applications would require handling the observed class imbalances.
Load-bearing premise
That performing POS tagging on isolated lexical units from a dictionary corpus without sentences is a valid and informative proxy for real-world Tajik grammatical analysis.
What would settle it
Evaluating the same models on a corpus of Tajik sentences with full syntactic context and observing whether weighted F1 scores substantially exceed 0.62 or macro F1 rises above 0.11 would test the claim.
Original abstract
This paper presents the first benchmark for the task of automatic part-of-speech (POS) tagging for the Tajik language. Despite the existence of multilingual language models demonstrating high effectiveness for many of the world's languages, their capacity for grammatical analysis of Tajik has remained unexplored until now. The aim of this study is to fill this gap through a systematic comparison of classical neural network architectures and modern multilingual transformers. Experiments were conducted on the TajPersParallel corpus, a parallel lexical resource comprising approximately 44,000 dictionary entries. Due to the absence of full-fledged example sentences in the current version of the corpus, the task was performed at the level of isolated lexical units, representing a challenging case of context-independent classification. The study compares the following architectures: a recurrent BiLSTM-CRF model, as well as multilingual models XLM-RoBERTa (large), mBERT, ParsBERT (Persian), and ruBERT (Russian), adapted using the parameter-efficient fine-tuning method LoRA. The testing results showed that the best performance is achieved by the mBERT + LoRA model (macro F1-score = 0.11, weighted F1-score = 0.62). It was established that in the absence of syntactic context, all models experience significant difficulty in resolving morphological ambiguity, successfully classifying primarily high-frequency classes ("noun," "adjective") while demonstrating zero effectiveness for rare function words. Zero-shot evaluation revealed the greatest typological proximity of Tajik to Persian (ParsBERT) and Russian (ruBERT). The obtained results form a foundation for further research and development in the field of automatic processing of the Tajik language.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to deliver the first benchmark for automatic POS tagging in Tajik by evaluating BiLSTM-CRF and several LoRA-adapted multilingual transformers (XLM-RoBERTa large, mBERT, ParsBERT, ruBERT) on the TajPersParallel corpus of ~44k dictionary entries. Because the corpus lacks example sentences, the task is reduced to context-independent classification of isolated lexical units; the headline result is that mBERT+LoRA achieves the highest scores (macro F1 = 0.11, weighted F1 = 0.62), with all models struggling on rare function words and zero-shot results indicating closest typological proximity to Persian and Russian.
Significance. If the proxy task is accepted as informative, the work supplies the first quantitative baseline for Tajik grammatical analysis and usefully identifies mBERT+LoRA as strongest while documenting the expected difficulty of context-free morphological disambiguation. The zero-shot cross-lingual findings also offer a modest typological signal. These contributions are modest in scope but could seed future sentence-level resources for Tajik NLP.
major comments (3)
- [Abstract] Abstract and Introduction: The central claim that the study constitutes 'the first benchmark for the task of automatic part-of-speech tagging for the Tajik language' is undermined by the explicit reduction to isolated lexical-unit classification. Standard POS tagging is a sequence-labeling problem whose primary difficulty is contextual disambiguation; the reported macro F1 of 0.11 versus weighted F1 of 0.62, together with zero recall on rare function words, is the signature of majority-class prediction on an imbalanced lexicon rather than evidence about model capacity for Tajik grammar (a toy reproduction of this signature appears after this list).
- [Methodology] Methodology and Results sections: No train/test splits, hyperparameter settings, cross-validation procedure, or statistical significance tests are described for the reported F1 scores. Without these details the numerical comparison (including the claim that mBERT+LoRA is best) cannot be reproduced or evaluated for robustness.
- [Results] Results and Discussion: The zero-shot evaluation is presented as evidence of 'typological proximity' to Persian and Russian, yet the paper provides no quantitative measure (e.g., confusion matrices or per-class transfer gaps) to support this interpretation beyond the headline F1 numbers.
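To make the first major comment concrete, here is a toy sketch with invented label counts, not the paper's data, showing that a degenerate tagger that always predicts the majority class on a skewed inventory produces exactly the reported shape: near-zero macro F1 alongside a much higher weighted F1.

```python
# Toy demonstration that majority-class prediction on an imbalanced
# lexicon yields low macro F1 but a respectable weighted F1.
from sklearn.metrics import f1_score

# Assumed skewed distribution loosely echoing a dictionary lexicon.
gold = (["NOUN"] * 60 + ["ADJ"] * 20 + ["VERB"] * 10
        + ["ADV"] * 5 + ["ADP"] * 3 + ["CONJ"] * 2)
pred = ["NOUN"] * len(gold)  # degenerate tagger: always answer "NOUN"

# NOUN gets precision 0.60, recall 1.0 -> F1 0.75; all other classes get 0.
print(f1_score(gold, pred, average="macro", zero_division=0))     # 0.125
print(f1_score(gold, pred, average="weighted", zero_division=0))  # 0.45
```

Under this invented distribution the majority-class baseline reaches weighted F1 = 0.45, so the paper's 0.62 would indicate some signal beyond majority voting; the macro figure of 0.11, however, matches near-zero recall outside the head classes.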
minor comments (2)
- [Abstract] The abstract should more prominently foreground the context-free limitation when stating the headline result rather than relegating it to a single sentence.
- [Results] Table captions and axis labels in the results figures should explicitly note that evaluation is performed on isolated dictionary entries, not sentences.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the current work is a proxy task on isolated lexical units rather than full contextual POS tagging, and that additional methodological details and quantitative support are needed. We will revise the manuscript to clarify the scope, add the missing experimental details, and strengthen the analysis of zero-shot results. Point-by-point responses are provided below.
Point-by-point responses
- Referee: [Abstract] Abstract and Introduction: The central claim that the study constitutes 'the first benchmark for the task of automatic part-of-speech tagging for the Tajik language' is undermined by the explicit reduction to isolated lexical-unit classification. Standard POS tagging is a sequence-labeling problem whose primary difficulty is contextual disambiguation; the reported macro F1 of 0.11 versus weighted F1 of 0.62, together with zero recall on rare function words, is the signature of majority-class prediction on an imbalanced lexicon rather than evidence about model capacity for Tajik grammar.
Authors: We acknowledge that the task performed is context-independent classification of dictionary entries rather than sequence labeling with syntactic context, and that the low macro F1 (0.11) and struggles with rare function words reflect class imbalance and the absence of disambiguation cues. The TajPersParallel corpus in its current form contains only isolated lexical units without sentences, making full contextual POS tagging impossible at this stage. We will revise the abstract and introduction to explicitly frame the contribution as the first benchmark for context-free POS tagging on Tajik lexical units, while discussing the limitations for grammatical analysis and the value of this initial quantitative baseline for a low-resource language. revision: yes
- Referee: [Methodology] Methodology and Results sections: No train/test splits, hyperparameter settings, cross-validation procedure, or statistical significance tests are described for the reported F1 scores. Without these details the numerical comparison (including the claim that mBERT+LoRA is best) cannot be reproduced or evaluated for robustness.
Authors: We agree that these details are required for reproducibility and fair evaluation of model comparisons. In the revised manuscript we will add a new subsection detailing the 80/10/10 train/validation/test split, all hyperparameter values (including LoRA rank, alpha, learning rate, and epochs for each architecture), the training procedure, and the use of 5-fold cross-validation. We will also report statistical significance tests (e.g., paired t-tests or McNemar's test) on the F1 differences to support the claim that mBERT+LoRA performs best (a minimal sketch of such a paired test appears after these responses). revision: yes
- Referee: [Results] Results and Discussion: The zero-shot evaluation is presented as evidence of 'typological proximity' to Persian and Russian, yet the paper provides no quantitative measure (e.g., confusion matrices or per-class transfer gaps) to support this interpretation beyond the headline F1 numbers.
Authors: The zero-shot results show comparatively higher weighted F1 for ParsBERT and ruBERT, which we interpret as a signal of typological proximity. To provide stronger quantitative support we will add per-class F1 scores across models and confusion matrices for the zero-shot setting in the revised Results section, along with explicit discussion of per-class transfer gaps. This will allow readers to evaluate the typological claim more rigorously. revision: partial
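On the significance testing promised in the second response: a minimal sketch of McNemar's test for paired tagger comparison, using statsmodels. The gold tags and predictions below are placeholders, and how the authors would pair predictions is not specified in the rebuttal.

```python
# McNemar's test: do two taggers disagree asymmetrically in their errors
# on the same test items? (Placeholder data, not the paper's.)
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_pvalue(gold, pred_a, pred_b):
    gold, pred_a, pred_b = map(np.asarray, (gold, pred_a, pred_b))
    a_ok, b_ok = pred_a == gold, pred_b == gold
    # 2x2 table of paired outcomes; only the discordant cells matter.
    table = [[np.sum(a_ok & b_ok),  np.sum(a_ok & ~b_ok)],
             [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)]]
    return mcnemar(table, exact=True).pvalue  # exact binomial test

gold = ["NOUN", "ADJ", "VERB", "NOUN", "ADP", "ADJ"]
m_a  = ["NOUN", "ADJ", "NOUN", "NOUN", "NOUN", "ADJ"]   # e.g. mBERT+LoRA
m_b  = ["NOUN", "NOUN", "VERB", "NOUN", "NOUN", "NOUN"]  # e.g. XLM-R+LoRA
print(mcnemar_pvalue(gold, m_a, m_b))
```

McNemar's test conditions on the items both models saw, which makes it a more appropriate choice than an unpaired test when taggers are evaluated on a shared test set.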
Circularity Check
Purely empirical benchmarking with no derivations or self-referential reductions
Full rationale
The paper performs a direct experimental comparison of neural models on a fixed corpus of ~44k isolated dictionary entries, reporting macro and weighted F1 scores as model outputs on held-out data. No equations, parameter fits, uniqueness theorems, or predictions are claimed; the central results are simply the observed performance numbers of mBERT+LoRA and baselines. The task definition (context-independent classification) is stated explicitly in the abstract and does not reduce to any fitted input or self-citation. This is standard empirical benchmarking whose validity can be assessed externally against the stated proxy, but the reported numbers do not circularly presuppose themselves.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard assumptions of supervised classification apply to context-free word-level POS tagging.