pith. machine review for the scientific record.

arxiv: 2605.03799 · v2 · submitted 2026-05-05 · 💻 cs.CL

Recognition: no theorem link

Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF

Mullosharaf K. Arabov

Pith reviewed 2026-05-12 01:45 UTC · model grok-4.3

classification 💻 cs.CL
keywords natural language processing · low-resource languages · Tajik · Tatar · tokenization · RLHF · reproducible research · large language models

The pith

A twelve-session practicum teaches the full NLP pipeline while embedding original tools and benchmarks for Tajik and Tatar.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out a hands-on curriculum that moves step by step from tokenisation and vectorisation through fine-tuning, retrieval-augmented generation, and reinforcement learning from human feedback. It stands out by inserting fresh work on Tajik and Tatar at every stage, such as new subword tokenisers, embeddings, lexical databases, and transliteration benchmarks. This shows how standard NLP methods can be adapted to languages with scarce data while keeping the same level of technical detail. The guide requires every session to end with public code, models, and reports built on one shared corpus, turning the course into a growing research resource rather than a static textbook. It also pushes the use of open-weight models and the Hugging Face ecosystem instead of commercial APIs.
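
To make the pipeline's first stage concrete, here is a minimal sketch of training a BPE subword tokeniser with the Hugging Face `tokenizers` library, the ecosystem the guide favours. The corpus file, vocabulary size, and special tokens are illustrative assumptions, not the paper's actual settings.

```python
# Minimal sketch, assuming a one-sentence-per-line corpus file and an
# illustrative vocabulary size; not the paper's actual configuration.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=16_000,  # assumed; tuned per corpus size in practice
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]"],
)
tokenizer.train(files=["tajik_corpus.txt"], trainer=trainer)  # hypothetical path
tokenizer.save("tajik-bpe-16k.json")

# Inspect the learned segmentation on a Tajik phrase.
print(tokenizer.encode("Забони тоҷикӣ").tokens)
```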

Core claim

The paper claims that the full modern NLP pipeline can be taught as a single, reproducible practicum that weaves original contributions on Tajik and Tatar throughout its twelve sessions, allowing learners to implement and evaluate methods from classical tokenisation to RLHF on a single evolving corpus while publishing all artefacts publicly.

What carries the argument

The twelve-session structure that combines concise theory, implementation plans, evaluation metrics, and the requirement to publish code and models publicly, with original Tajik and Tatar resources inserted at each relevant stage on one shared corpus.
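
The paper's formalised evaluation metrics are not enumerated on this page; as one plausible example for the tokenisation session, the sketch below computes fertility, the mean number of subword tokens per whitespace word, a standard tokeniser metric for morphologically rich languages. It is offered as an illustration, not necessarily the exact metric the paper formalises.

```python
# Fertility = mean subword tokens per whitespace-delimited word; lower
# usually means the vocabulary fits the language better.
def fertility(tokenize, sentences):
    """tokenize: callable mapping a string to a list of subword tokens."""
    n_subwords = sum(len(tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_subwords / max(n_words, 1)

# Usage with the tokeniser sketched above (held_out_sentences is hypothetical):
# score = fertility(lambda s: tokenizer.encode(s).tokens, held_out_sentences)
```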

If this is right

  • Students generate public repositories containing working implementations for every stage from tokenisation to RLHF.
  • The same methods and metrics apply directly to data-scarce languages without requiring separate pipelines.
  • Cumulative experiments on one evolving corpus allow direct comparison of improvements across sessions.
  • Open-weight models are used and extended instead of commercial APIs, producing shareable artefacts (a minimal sketch follows this list).
  • Transparent assessment criteria make it possible to evaluate student work consistently across sessions.
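
As a concrete illustration of the open-weight workflow referenced above, the following sketch loads an open-weight causal LM and attaches LoRA adapters via the `peft` library. The model name and hyperparameters are assumptions for illustration, not choices taken from the paper.

```python
# Sketch: fine-tune an open-weight model with LoRA adapters rather than
# calling a commercial API. Model name and hyperparameters are assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "mistralai/Mistral-7B-v0.1"  # any open-weight causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common choice
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the small adapter matrices train
```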

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The session-by-session publication model could be copied for practicums in speech processing or computer vision that target other low-resource languages.
  • Focusing all work on one corpus might reveal how small, incremental additions of data or rules affect performance in morphologically rich settings.
  • The Tajik and Tatar resources could serve as starting points for similar efforts on related Turkic or Central Asian languages.
  • Mandating public releases might speed up the growth of open datasets and tools for languages that currently lack them.

Load-bearing premise

The original contributions on Tajik and Tatar are genuinely new and have not appeared in prior work, and requiring students to publish code and models will produce consistently verifiable and high-quality artefacts.

What would settle it

A search of prior literature and public repositories for the claimed subword tokenisers, word embeddings, lexical databases, and transliteration benchmarks for Tajik and Tatar would show whether the novelty holds; inspecting the published session outputs for completeness and reproducibility would test the artefact-quality claim.
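
Part of that novelty search can be automated. The sketch below queries the Hugging Face Hub for existing Tajik and Tatar models and datasets via `huggingface_hub`; a full audit would also cover GitHub and the ACL Anthology.

```python
# Sketch of an automated novelty check against the Hugging Face Hub.
from huggingface_hub import HfApi

api = HfApi()
for query in ("tajik", "tatar"):
    model_ids = [m.id for m in api.list_models(search=query, limit=20)]
    dataset_ids = [d.id for d in api.list_datasets(search=query, limit=20)]
    print(f"{query}: {len(model_ids)} models, {len(dataset_ids)} datasets")
    print("  models:", model_ids)
    print("  datasets:", dataset_ids)
```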

read the original abstract

This preprint presents a systematic, research-oriented practicum that guides the reader through the entire modern NLP pipeline: from tokenisation and vectorisation to fine-tuning of large language models, retrieval-augmented generation, and reinforcement learning from human feedback. A distinctive feature of the work is its consistent attention to low-resource and morphologically rich languages -- original contributions on Tajik and Tatar, including subword tokenisers, word embeddings, lexical databases, and transliteration benchmarks, are woven throughout the twelve sessions, demonstrating how modern NLP can be adapted to data-scarce environments without sacrificing rigour. Each session combines concise theory with detailed implementation plans, formalised evaluation metrics, and transparent assessment criteria. The work is not a conventional textbook: it is designed as a reproducible research artefact where every session requires publishing code, models, and reports in public repositories. All experiments are conducted on a single evolving corpus, and the work advocates open-weight models over commercial APIs, with special attention to the Hugging Face ecosystem. Designed for senior undergraduates, graduate students, and practising developers seeking to implement, compare, and deploy methods from classical ML to state-of-the-art LLM-based systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The preprint presents a twelve-session practicum covering the full modern NLP pipeline from tokenisation and vectorisation through fine-tuning, RAG, and RLHF. A core claimed distinctive feature is the consistent integration of original contributions for low-resource morphologically rich languages, specifically subword tokenisers, word embeddings, lexical databases, and transliteration benchmarks for Tajik and Tatar, all demonstrated on a single evolving corpus while requiring students to publish code, models, and reports publicly and preferring open-weight models via the Hugging Face ecosystem.

Significance. If the Tajik and Tatar artefacts are verifiably novel and the sessions deliver rigorous, reproducible implementations with formalised metrics, the work could provide a useful template for teaching adaptation of contemporary NLP methods to data-scarce settings without reliance on commercial APIs. The explicit reproducibility mandate and single-corpus design are positive structural choices that could support verifiable student outputs.

major comments (1)
  1. [Abstract and session descriptions] Abstract and the description of the twelve sessions: the manuscript asserts that 'original contributions on Tajik and Tatar, including subword tokenisers, word embeddings, lexical databases, and transliteration benchmarks, are woven throughout' as the distinctive feature, yet supplies no related-work subsection, citation table, or explicit side-by-side comparison against prior Tajik/Tatar NLP resources. This absence makes it impossible to evaluate whether the listed artefacts constitute genuine novelty or incremental extensions, directly undermining the central claim of distinctive low-resource adaptation.
minor comments (1)
  1. [Abstract] The abstract is unusually long and contains multiple overlapping claims about design goals; condensing it would improve readability while preserving the core description of the practicum structure.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our preprint. We address the single major comment below and agree that revisions are needed to better substantiate our claims.

read point-by-point responses
  1. Referee: [Abstract and session descriptions] Abstract and the description of the twelve sessions: the manuscript asserts that 'original contributions on Tajik and Tatar, including subword tokenisers, word embeddings, lexical databases, and transliteration benchmarks, are woven throughout' as the distinctive feature, yet supplies no related-work subsection, citation table, or explicit side-by-side comparison against prior Tajik/Tatar NLP resources. This absence makes it impossible to evaluate whether the listed artefacts constitute genuine novelty or incremental extensions, directly undermining the central claim of distinctive low-resource adaptation.

    Authors: We acknowledge the validity of this observation. The current version presents the Tajik and Tatar resources primarily through their pedagogical integration into the twelve-session pipeline rather than through a formal research lens, which has resulted in the omission of a dedicated related-work discussion or comparative table. To address this, we will add a new 'Related Work and Contributions' subsection (likely in the introduction or as a standalone section) that surveys existing NLP resources for Tajik and Tatar, cites relevant prior work on tokenisation, embeddings, lexicons, and transliteration for these languages, and includes a side-by-side comparison table. This will explicitly delineate the novel aspects of our adaptations (such as their consistent use within a single evolving corpus, emphasis on morphological richness, and open reproducibility requirements) while clarifying any incremental elements. We believe this addition will allow readers to properly evaluate the distinctiveness without changing the manuscript's core focus as a practical guide.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive practicum guide with no derivations or predictions

full rationale

The manuscript is a systematic instructional guide covering the NLP pipeline from tokenisation to RLHF, with emphasis on low-resource languages via claimed original Tajik/Tatar resources. It contains no mathematical derivations, equations, first-principles results, fitted parameters, or predictive claims that could reduce to inputs by construction. Claims of originality are asserted descriptively without any self-referential logic, self-citation chains, or ansatz smuggling that would trigger the enumerated circularity patterns. The work is self-contained as a reproducible teaching artefact rather than an analytic paper with load-bearing derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an educational practicum rather than a theoretical derivation, so the central description rests on standard NLP concepts already established in the field with no new free parameters, axioms, or invented entities introduced.

pith-pipeline@v0.9.0 · 5499 in / 1177 out tokens · 38386 ms · 2026-05-12T01:45:05.119125+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 9 internal anchors

  1. [1] J. Allen. Natural Language Understanding. Benjamin/Cummings, Redwood City, CA, 2nd edition, 1995.

  2. [2] N. Indurkhya and F. J. Damerau, editors. The Handbook of Natural Language Processing. CRC Press, Boca Raton, FL, 2nd edition, 2010.

  3. [3] A. Clark, C. Fox, and S. Lappin, editors. The Handbook of Computational Linguistics and Natural Language Processing. Wiley-Blackwell, Oxford, 2012.

  4. [4] R. Mitkov, editor. The Oxford Handbook of Computational Linguistics. Oxford University Press, Oxford, 2003.

  5. [5] C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.

  6. [6] P. Goyal, S. Pandey, K. Jain, and K. Nagpal. Deep Learning for Natural Language Processing. BPB Publications, 2018.

  7. [7] B. Bengfort, R. Bilbro, and T. Ojeda. Applied Text Analysis with Python. O'Reilly, Beijing, 2018.

  8. [8] H. Lane, C. Howard, and H. Hapke. Natural Language Processing in Action. Manning Publications, Shelter Island, NY, 2019.

  9. [9] D. Jurafsky and J. H. Martin. Speech and Language Processing. Prentice Hall, Upper Saddle River, NJ, 3rd edition, 2022.

  10. [10] J. Eisenstein. Introduction to Natural Language Processing. MIT Press, Cambridge, MA, 2019.

  11. [11] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.

  12. [12] M. K. Arabov and S. S. Khaibullina. Analysis of the effectiveness of subword tokenisers in a low-resource linguistic environment: Implementation experience for the Tajik language. Russian Digital Libraries Journal, 29(2):546–564, 2026.

  13. [13] S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, 2009.

  14. [14] D. Rao and B. McMahan. NLP with PyTorch. O'Reilly Media, Sebastopol, CA, 2020.

  15. [15] S. Singh and A. Mahmood. The NLP Cookbook: Modern Recipes for Transformer-based Deep Learning Architectures. Independently published, 2021.

  16. [16] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

  17. [17] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237, 2018.

  18. [18] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

  19. [19] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.

  20. [20] V. Sanh, L. Debut, J. Chaumond, and T. Wolf. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

  21. [21] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush. Natural Language Processing with Transformers. O'Reilly Media, Sebastopol, CA, 2022.

  22. [22] Hugging Face. Transformers documentation, 2025. Accessed: 2025-11-27.

  23. [23] J. Li, Y. Liang, and R. Zhang. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

  24. [24] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv preprint arXiv:2005.11401, 2020.

  25. [25] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67, 2020.

  26. [26] N. Reimers and I. Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019.

  27. [27] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amo...

  28. [28] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.

  29. [29] Hugging Face. Course: How to build LLM pipelines, 2025. Accessed: 2025-11-27.

  30. [30] M. K. Arabov. Tajperslexon: A Tajik–Persian lexical resource and hybrid model for cross-script low-resource NLP. In The Proceedings of SilkRoadNLP 2026, pages 29–37, Rabat, Morocco, 2026. ACL.

  31. [31] M. K. Arabov. Tatar2vec: Word embeddings for the Tatar language based on a heterogeneous corpus,

  32. [32] Certificate of State Registration of a Computer Program No. 2026610619, Russian Federation. Application: 23 December 2025; published: 14 January 2026.

  33. [33] M. K. Arabov. Developing the Tajik language in the era of large language models: Corpus infrastructure, linguistic challenges, and safety alignment. Modern Science, (12-2):85–93, 2025. EDN LQLURB.

  34. [34] M. K. Arabov. A systematic benchmark of machine transliteration models for the Tajik–Farsi language pair: A comparative study from rule-based to transformer architectures. arXiv preprint arXiv:2605.02270, 2026.

  35. [35] M. K. Arabov, R. A. Burnashev, and O. A. Medvedeva. Comparative analysis of intelligent methods for automatic anomaly detection in industrial and distributed systems based on machine learning and deep learning algorithms. In Proceedings of the International Russian Automation Conference (RusAutoCon), pages 279–284, 2025.

  36. [36] M. K. Arabov and V. V. Sedykh. Comparative analysis of methods for modelling semantic word representations under limited language resource conditions: The case of the Tajik language. Scientific and Technical Bulletin of the Volga Region, (6):196–198, 2025. EDN ZHBKFG.