pith. machine review for the scientific record.

arxiv: 2605.03799 · v2 · submitted 2026-05-05 · 💻 cs.CL

Recognition: no theorem link

Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF

Mullosharaf K. Arabov

Pith reviewed 2026-05-12 01:45 UTC · model grok-4.3

classification 💻 cs.CL
keywords natural language processing · low-resource languages · Tajik · Tatar · tokenization · RLHF · reproducible research · large language models

The pith

A twelve-session practicum teaches the full NLP pipeline while embedding original tools and benchmarks for Tajik and Tatar.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out a hands-on curriculum that moves step by step from tokenisation and vectorisation through fine-tuning, retrieval-augmented generation, and reinforcement learning from human feedback. It stands out by inserting fresh work on Tajik and Tatar at every stage, such as new subword tokenisers, embeddings, lexical databases, and transliteration benchmarks. This shows how standard NLP methods can be adapted to languages with scarce data while keeping the same level of technical detail. The guide requires every session to end with public code, models, and reports built on one shared corpus, turning the course into a growing research resource rather than a static textbook. It also pushes the use of open-weight models and the Hugging Face ecosystem instead of commercial APIs.
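
To make the pipeline's first stage concrete, here is a minimal sketch of training a BPE subword tokeniser with the Hugging Face `tokenizers` library, the ecosystem the guide favours. The corpus file, vocabulary size, and special tokens are illustrative assumptions, not the paper's actual settings.

```python
# Minimal sketch, assuming a one-sentence-per-line corpus file and an
# illustrative vocabulary size; not the paper's actual configuration.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=16_000,  # assumed; tuned per corpus size in practice
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]"],
)
tokenizer.train(files=["tajik_corpus.txt"], trainer=trainer)  # hypothetical path
tokenizer.save("tajik-bpe-16k.json")

# Inspect the learned segmentation on a Tajik phrase.
print(tokenizer.encode("Забони тоҷикӣ").tokens)
```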

Core claim

The paper claims that the full modern NLP pipeline can be taught as a single, reproducible practicum that weaves original contributions on Tajik and Tatar throughout its twelve sessions, allowing learners to implement and evaluate methods from classical tokenisation to RLHF on a single evolving corpus while publishing all artefacts publicly.

What carries the argument

The twelve-session structure that combines concise theory, implementation plans, evaluation metrics, and the requirement to publish code and models publicly, with original Tajik and Tatar resources inserted at each relevant stage on one shared corpus.
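
The paper's formalised evaluation metrics are not enumerated on this page; as one plausible example for the tokenisation session, the sketch below computes fertility, the mean number of subword tokens per whitespace word, a standard tokeniser metric for morphologically rich languages. It is offered as an illustration, not necessarily the exact metric the paper formalises.

```python
# Fertility = mean subword tokens per whitespace-delimited word; lower
# usually means the vocabulary fits the language better.
def fertility(tokenize, sentences):
    """tokenize: callable mapping a string to a list of subword tokens."""
    n_subwords = sum(len(tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_subwords / max(n_words, 1)

# Usage with the tokeniser sketched above (held_out_sentences is hypothetical):
# score = fertility(lambda s: tokenizer.encode(s).tokens, held_out_sentences)
```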

If this is right

  • Students generate public repositories containing working implementations for every stage from tokenisation to RLHF.
  • The same methods and metrics apply directly to data-scarce languages without requiring separate pipelines.
  • Cumulative experiments on one evolving corpus allow direct comparison of improvements across sessions.
  • Open-weight models are used and extended instead of commercial APIs, producing shareable artefacts (a minimal sketch follows this list).
  • Transparent assessment criteria make it possible to evaluate student work consistently across sessions.
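
As a concrete illustration of the open-weight workflow referenced above, the following sketch loads an open-weight causal LM and attaches LoRA adapters via the `peft` library. The model name and hyperparameters are assumptions for illustration, not choices taken from the paper.

```python
# Sketch: fine-tune an open-weight model with LoRA adapters rather than
# calling a commercial API. Model name and hyperparameters are assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "mistralai/Mistral-7B-v0.1"  # any open-weight causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common choice
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the small adapter matrices train
```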

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The session-by-session publication model could be copied for practicums in speech processing or computer vision that target other low-resource languages.
  • Focusing all work on one corpus might reveal how small, incremental additions of data or rules affect performance in morphologically rich settings.
  • The Tajik and Tatar resources could serve as starting points for similar efforts on related Turkic or Central Asian languages.
  • Mandating public releases might speed up the growth of open datasets and tools for languages that currently lack them.

Load-bearing premise

The original contributions on Tajik and Tatar are genuinely new and have not appeared in prior work, and requiring students to publish code and models will produce consistently verifiable and high-quality artefacts.

What would settle it

A search of prior literature and public repositories for the claimed subword tokenisers, word embeddings, lexical databases, and transliteration benchmarks for Tajik and Tatar would show whether the novelty holds; inspecting the published session outputs for completeness and reproducibility would test the artefact-quality claim.
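
Part of that novelty search can be automated. The sketch below queries the Hugging Face Hub for existing Tajik and Tatar models and datasets via `huggingface_hub`; a full audit would also cover GitHub and the ACL Anthology.

```python
# Sketch of an automated novelty check against the Hugging Face Hub.
from huggingface_hub import HfApi

api = HfApi()
for query in ("tajik", "tatar"):
    model_ids = [m.id for m in api.list_models(search=query, limit=20)]
    dataset_ids = [d.id for d in api.list_datasets(search=query, limit=20)]
    print(f"{query}: {len(model_ids)} models, {len(dataset_ids)} datasets")
    print("  models:", model_ids)
    print("  datasets:", dataset_ids)
```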

read the original abstract

This preprint presents a systematic, research-oriented practicum that guides the reader through the entire modern NLP pipeline: from tokenisation and vectorisation to fine-tuning of large language models, retrieval-augmented generation, and reinforcement learning from human feedback. A distinctive feature of the work is its consistent attention to low-resource and morphologically rich languages -- original contributions on Tajik and Tatar, including subword tokenisers, word embeddings, lexical databases, and transliteration benchmarks, are woven throughout the twelve sessions, demonstrating how modern NLP can be adapted to data-scarce environments without sacrificing rigour. Each session combines concise theory with detailed implementation plans, formalised evaluation metrics, and transparent assessment criteria. The work is not a conventional textbook: it is designed as a reproducible research artefact where every session requires publishing code, models, and reports in public repositories. All experiments are conducted on a single evolving corpus, and the work advocates open-weight models over commercial APIs, with special attention to the Hugging Face ecosystem. Designed for senior undergraduates, graduate students, and practising developers seeking to implement, compare, and deploy methods from classical ML to state-of-the-art LLM-based systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The preprint presents a twelve-session practicum covering the full modern NLP pipeline from tokenisation and vectorisation through fine-tuning, RAG, and RLHF. A core claimed distinctive feature is the consistent integration of original contributions for low-resource morphologically rich languages, specifically subword tokenisers, word embeddings, lexical databases, and transliteration benchmarks for Tajik and Tatar, all demonstrated on a single evolving corpus while requiring students to publish code, models, and reports publicly and preferring open-weight models via the Hugging Face ecosystem.

Significance. If the Tajik and Tatar artefacts are verifiably novel and the sessions deliver rigorous, reproducible implementations with formalised metrics, the work could provide a useful template for teaching adaptation of contemporary NLP methods to data-scarce settings without reliance on commercial APIs. The explicit reproducibility mandate and single-corpus design are positive structural choices that could support verifiable student outputs.

major comments (1)
  1. [Abstract and session descriptions] Abstract and the description of the twelve sessions: the manuscript asserts that 'original contributions on Tajik and Tatar, including subword tokenisers, word embeddings, lexical databases, and transliteration benchmarks, are woven throughout' as the distinctive feature, yet supplies no related-work subsection, citation table, or explicit side-by-side comparison against prior Tajik/Tatar NLP resources. This absence makes it impossible to evaluate whether the listed artefacts constitute genuine novelty or incremental extensions, directly undermining the central claim of distinctive low-resource adaptation.
minor comments (1)
  1. [Abstract] The abstract is unusually long and contains multiple overlapping claims about design goals; condensing it would improve readability while preserving the core description of the practicum structure.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our preprint. We address the single major comment below and agree that revisions are needed to better substantiate our claims.

read point-by-point responses
  1. Referee: [Abstract and session descriptions] Abstract and the description of the twelve sessions: the manuscript asserts that 'original contributions on Tajik and Tatar, including subword tokenisers, word embeddings, lexical databases, and transliteration benchmarks, are woven throughout' as the distinctive feature, yet supplies no related-work subsection, citation table, or explicit side-by-side comparison against prior Tajik/Tatar NLP resources. This absence makes it impossible to evaluate whether the listed artefacts constitute genuine novelty or incremental extensions, directly undermining the central claim of distinctive low-resource adaptation.

    Authors: We acknowledge the validity of this observation. The current version presents the Tajik and Tatar resources primarily through their pedagogical integration into the twelve-session pipeline rather than through a formal research lens, which has resulted in the omission of a dedicated related-work discussion or comparative table. To address this, we will add a new 'Related Work and Contributions' subsection (likely in the introduction or as a standalone section) that surveys existing NLP resources for Tajik and Tatar, cites relevant prior work on tokenisation, embeddings, lexicons, and transliteration for these languages, and includes a side-by-side comparison table. This will explicitly delineate the novel aspects of our adaptations (such as their consistent use within a single evolving corpus, emphasis on morphological richness, and open reproducibility requirements) while clarifying any incremental elements. We believe this addition will allow readers to properly evaluate the distinctiveness without changing the manuscript's core focus as a practical guide.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive practicum guide with no derivations or predictions

full rationale

The manuscript is a systematic instructional guide covering the NLP pipeline from tokenisation to RLHF, with emphasis on low-resource languages via claimed original Tajik/Tatar resources. It contains no mathematical derivations, equations, first-principles results, fitted parameters, or predictive claims that could reduce to inputs by construction. Claims of originality are asserted descriptively without any self-referential logic, self-citation chains, or ansatz smuggling that would trigger the enumerated circularity patterns. The work is self-contained as a reproducible teaching artefact rather than an analytic paper with load-bearing derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an educational practicum rather than a theoretical derivation, so the central description rests on standard NLP concepts already established in the field with no new free parameters, axioms, or invented entities introduced.

pith-pipeline@v0.9.0 · 5499 in / 1177 out tokens · 38386 ms · 2026-05-12T01:45:05.119125+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 9 internal anchors

  1. [1] J. Allen. Natural Language Understanding. Benjamin/Cummings, Redwood City, CA, 2nd edition, 1995.

  2. [2] N. Indurkhya and F. J. Damerau, editors. The Handbook of Natural Language Processing. CRC Press, Boca Raton, FL, 2nd edition, 2010.

  3. [3] A. Clark, C. Fox, and S. Lappin, editors. The Handbook of Computational Linguistics and Natural Language Processing. Wiley-Blackwell, Oxford, 2012.

  4. [4] R. Mitkov, editor. The Oxford Handbook of Computational Linguistics. Oxford University Press, Oxford, 2003.

  5. [5] C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.

  6. [6] P. Goyal, S. Pandey, K. Jain, and K. Nagpal. Deep Learning for Natural Language Processing. BPB Publications, 2018.

  7. [7] B. Bengfort, R. Bilbro, and T. Ojeda. Applied Text Analysis with Python. O'Reilly, Beijing, 2018.

  8. [8] H. Lane, C. Howard, and H. Hapke. Natural Language Processing in Action. Manning Publications, Shelter Island, NY, 2019.

  9. [9] D. Jurafsky and J. H. Martin. Speech and Language Processing. Prentice Hall, Upper Saddle River, NJ, 3rd edition, 2022.

  10. [10] J. Eisenstein. Introduction to Natural Language Processing. MIT Press, Cambridge, MA, 2019.

  11. [11] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.

  12. [12] M. K. Arabov and S. S. Khaibullina. Analysis of the effectiveness of subword tokenisers in a low-resource linguistic environment: Implementation experience for the Tajik language. Russian Digital Libraries Journal, 29(2):546–564, 2026.

  13. [13] S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, 2009.

  14. [14] D. Rao and B. McMahan. NLP with PyTorch. O'Reilly Media, Sebastopol, CA, 2020.

  15. [15] S. Singh and A. Mahmood. The NLP Cookbook: Modern Recipes for Transformer-based Deep Learning Architectures. Independently published, 2021.

  16. [16] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

  17. [17] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237, 2018.

  18. [18] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

  19. [19] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.

  20. [20] V. Sanh, L. Debut, J. Chaumond, and T. Wolf. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

  21. [21] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush. Natural Language Processing with Transformers. O'Reilly Media, Sebastopol, CA, 2022.

  22. [22] Hugging Face. Transformers documentation, 2025. Accessed: 2025-11-27.

  23. [23] J. Li, Y. Liang, and R. Zhang. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

  24. [24] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv preprint arXiv:2005.11401, 2020.

  25. [25] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67, 2020.

  26. [26] N. Reimers and I. Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019.

  27. [27] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amo...

  28. [28] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.

  29. [29] Hugging Face. Course: How to build LLM pipelines, 2025. Accessed: 2025-11-27.

  30. [30] M. K. Arabov. Tajperslexon: A Tajik–Persian lexical resource and hybrid model for cross-script low-resource NLP. In The Proceedings of SilkRoadNLP 2026, pages 29–37, Rabat, Morocco, 2026. ACL.

  31. [31] M. K. Arabov. Tatar2vec: Word embeddings for the Tatar language based on a heterogeneous corpus,

  32. [32] Certificate of State Registration of a Computer Program No. 2026610619, Russian Federation. Application: 23 December 2025; published: 14 January 2026.

  33. [33] M. K. Arabov. Developing the Tajik language in the era of large language models: Corpus infrastructure, linguistic challenges, and safety alignment. Modern Science, (12-2):85–93, 2025. EDN LQLURB.

  34. [34] M. K. Arabov. A systematic benchmark of machine transliteration models for the Tajik–Farsi language pair: A comparative study from rule-based to transformer architectures. arXiv preprint arXiv:2605.02270, 2026.

  35. [35] M. K. Arabov, R. A. Burnashev, and O. A. Medvedeva. Comparative analysis of intelligent methods for automatic anomaly detection in industrial and distributed systems based on machine learning and deep learning algorithms. In Proceedings of the International Russian Automation Conference (RusAutoCon), pages 279–284, 2025.

  36. [36] M. K. Arabov and V. V. Sedykh. Comparative analysis of methods for modelling semantic word representations under limited language resource conditions: The case of the Tajik language. Scientific and Technical Bulletin of the Volga Region, (6):196–198, 2025. EDN ZHBKFG.