pith. machine review for the scientific record.

arxiv: 2604.11066 · v1 · submitted 2026-04-13 · 💻 cs.CL

Recognition: unknown

ks-pret-5m: a 5 million word, 12 million token kashmiri pretraining dataset

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:01 UTC · model grok-4.3

classification 💻 cs.CL
keywords: kashmiri · pretraining dataset · language model · low-resource language · text corpus · natural language processing · dataset release · perso-arabic script

The pith

The largest public pretraining dataset for Kashmiri contains 5.09 million words and 12.13 million tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents KS-PRET-5M as the largest openly released dataset for pretraining language models on Kashmiri. It combines recovered archival texts from proprietary InPage files with web-sourced Unicode material, then applies an eleven-stage cleaning process that leaves almost all text in proper Kashmiri script. The resulting stream is tokenized with a multilingual model to yield roughly 2.38 subword tokens per word. The authors release the full 5.09 million words under a permissive license so that researchers can train models, build tokenizers, and conduct linguistic analysis on this previously data-scarce language. A reader would care because low-resource languages like Kashmiri have historically lacked the scale of text needed for modern NLP systems.

Core claim

We present KS-PRET-5M, the largest publicly available pretraining dataset for the Kashmiri language, comprising 5,090,244 words, 27,692,959 characters, and a vocabulary of 295,433 unique word types. The dataset was assembled from digitized archival and literary material recovered from the proprietary InPage format together with Unicode-native text from Kashmiri-language web sources. All text passed through an eleven-stage cleaning pipeline that achieves a mean Kashmiri script ratio of 0.9965. Empirical tokenization with google/muril-base-cased produces a subword ratio of 2.383 tokens per word and a total of approximately 12.13 million subword tokens. The resource is released as a single continuous text stream under CC BY 4.0.
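
A minimal sketch of how the reported word count and subword ratio could be reproduced, assuming the release is a plain UTF-8 text stream (the filename below is hypothetical) and that words are whitespace-delimited; the paper's own counting rules may differ:

    from transformers import AutoTokenizer

    # Same multilingual tokenizer the paper reports using for empirical counts.
    tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")

    words, subwords = 0, 0
    vocab = set()
    with open("ks_pret_5m.txt", encoding="utf-8") as f:  # hypothetical filename
        for line in f:
            toks = line.split()
            words += len(toks)
            vocab.update(toks)
            # Count content subwords only; no [CLS]/[SEP] special tokens.
            subwords += len(tokenizer.tokenize(line))

    print(f"words={words:,}  types={len(vocab):,}  subwords={subwords:,}  "
          f"ratio={subwords / max(words, 1):.3f}")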

What carries the argument

KS-PRET-5M itself, the cleaned continuous text stream assembled from archival recovery and web sources to serve directly as pretraining input.

Load-bearing premise

The eleven-stage cleaning pipeline and InPage converter preserve the linguistic content and representativeness of the original Kashmiri sources without introducing systematic errors or biases in the recovered text.

What would settle it

A systematic audit finding frequent mistranscribed words, omitted passages, or meaning-altering script substitutions across more than a small fraction of the dataset would show that the cleaning and conversion steps did not preserve fidelity.

read the original abstract

We present KS-PRET-5M, the largest publicly available pretraining dataset for the Kashmiri language, comprising 5,090,244 (5.09M) words, 27,692,959 (27.6M) characters, and a vocabulary of 295,433 (295.4K) unique word types. We assembled the dataset from two source classes: digitized archival and literary material, encompassing literature, news, biographies, novels, poetry, religious scholarship, and academic writing, recovered from the proprietary InPage desktop-publishing format using the converter of Malik [1], and Unicode-native text collected from Kashmiri-language web sources. All text was processed through an eleven-stage cleaning pipeline that achieves a mean Kashmiri script ratio of 0.9965, reducing Devanagari contamination to 146 characters across the full dataset. We tokenized the dataset empirically using google/muril-base-cased, yielding a subword ratio of 2.383 tokens per word and a total of approximately 12.13 million subword tokens, substantially higher than prior estimates derived from non-Kashmiri Perso-Arabic analogues. KS-PRET-5M is released as a single continuous text stream under CC BY 4.0 to support language model pretraining, tokenizer training, and computational linguistic research for Kashmiri.
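
The script ratio and Devanagari count are character-level statistics. A minimal sketch of one plausible way to compute them, assuming the ratio is taken over letter characters and that Kashmiri text occupies the Perso-Arabic Unicode blocks (the exact definition and ranges used by the authors are not stated, so these are assumptions):

    import unicodedata

    # Assumed Perso-Arabic blocks; the paper does not list its exact ranges.
    ARABIC_BLOCKS = [(0x0600, 0x06FF), (0x0750, 0x077F), (0x08A0, 0x08FF),
                     (0xFB50, 0xFDFF), (0xFE70, 0xFEFF)]
    DEVANAGARI_BLOCK = [(0x0900, 0x097F)]

    def in_blocks(ch, blocks):
        cp = ord(ch)
        return any(lo <= cp <= hi for lo, hi in blocks)

    def script_stats(text):
        letters = [c for c in text if unicodedata.category(c).startswith("L")]
        arabic = sum(in_blocks(c, ARABIC_BLOCKS) for c in letters)
        devanagari = sum(in_blocks(c, DEVANAGARI_BLOCK) for c in letters)
        return (arabic / len(letters) if letters else 0.0), devanagari

    text = open("ks_pret_5m.txt", encoding="utf-8").read()  # hypothetical filename
    ratio, deva = script_stats(text)
    print(f"kashmiri script ratio={ratio:.4f}  devanagari chars={deva}")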

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents KS-PRET-5M, the largest publicly available pretraining dataset for Kashmiri, comprising 5,090,244 words, 27,692,959 characters, and 295,433 unique word types. It is assembled from InPage-converted archival/literary sources (via Malik, 2024) and Unicode web text, processed through an eleven-stage cleaning pipeline that achieves a mean Kashmiri script ratio of 0.9965, tokenized with google/muril-base-cased to yield ~12.13M subword tokens (2.383 per word), and released as a single text stream under CC BY 4.0.

Significance. If the conversion and cleaning steps preserve authentic Kashmiri linguistic content without systematic artifacts, this would be a valuable addition to low-resource language resources, enabling pretraining, tokenizer development, and computational linguistics work on Kashmiri where few large corpora exist.

major comments (2)
  1. [Abstract and data-processing description] The headline statistics (5,090,244 words, 295,433 unique types, 12.13M tokens) are produced by the InPage converter and eleven-stage pipeline, yet no error-rate measurements, before/after sample pairs, manual audits, or ground-truth comparisons are reported. The Kashmiri script ratio of 0.9965 alone cannot detect orthographic substitutions, ligature errors, or word-boundary shifts that would directly change the reported counts and dataset utility.
  2. [Eleven-stage cleaning pipeline] Exact exclusion rules for each stage, their quantitative effect on final size/vocabulary, and any bias introduced by the InPage recovery are not specified, preventing assessment of whether the final corpus remains representative of the original sources.
minor comments (1)
  1. [Abstract] The subword token count is described as 'approximately 12.13 million'; an exact figure computed from the 5,090,244 words and 2.383 ratio would improve reproducibility.
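
For reference, the total implied by the reported figures is easy to back out, though the exact count should come from tokenizing the released stream rather than from the rounded ratio:

    # Subword total implied by the reported word count and rounded ratio.
    words = 5_090_244
    ratio = 2.383
    print(round(words * ratio))  # 12,130,051 -> "approximately 12.13 million"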

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback on the KS-PRET-5M dataset description. We address each major comment below with honest assessments of what can be strengthened through revision.

read point-by-point responses
  1. Referee: [Abstract and data-processing description] The headline statistics (5,090,244 words, 295,433 unique types, 12.13M tokens) are produced by the InPage converter and eleven-stage pipeline, yet no error-rate measurements, before/after sample pairs, manual audits, or ground-truth comparisons are reported. The Kashmiri script ratio of 0.9965 alone cannot detect orthographic substitutions, ligature errors, or word-boundary shifts that would directly change the reported counts and dataset utility.

    Authors: We agree that the script ratio metric alone is insufficient to fully validate against orthographic or boundary errors. The pipeline reduced Devanagari contamination to 146 characters, but no ground-truth clean versions of the archival sources exist for direct error rates or audits. We will add representative before-and-after sample pairs and per-stage removal statistics in the revised data-processing section to increase transparency. revision: partial

  2. Referee: [Eleven-stage cleaning pipeline] Exact exclusion rules for each stage, their quantitative effect on final size/vocabulary, and any bias introduced by the InPage recovery are not specified, preventing assessment of whether the final corpus remains representative of the original sources.

    Authors: The manuscript provides a high-level description of the pipeline. We will revise the data-processing section to list the exact exclusion rules for each of the eleven stages and include a table quantifying the effect of each stage on word count and vocabulary size. We will also add a limitations paragraph on potential InPage converter artifacts, referencing the validation details in Malik (2024). revision: yes
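
A minimal sketch of the per-stage accounting promised in the revision, assuming each cleaning stage can be expressed as a function over the text stream; the stage names and rules below are hypothetical placeholders, not the authors' eleven-stage pipeline:

    def per_stage_stats(text, stages):
        """Record word count and vocabulary size after each cleaning stage."""
        rows = []
        for name, stage in stages:
            text = stage(text)
            tokens = text.split()
            rows.append((name, len(tokens), len(set(tokens))))
        return rows

    # Two illustrative stages; the real pipeline has eleven, with its own rules.
    stages = [
        ("strip_control_chars", lambda t: "".join(c for c in t if c.isprintable() or c == "\n")),
        ("collapse_whitespace", lambda t: " ".join(t.split())),
    ]
    raw = open("raw_sources.txt", encoding="utf-8").read()  # hypothetical input
    for name, n_words, n_types in per_stage_stats(raw, stages):
        print(f"{name}: {n_words:,} words, {n_types:,} types")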

standing simulated objections not resolved
  • Direct error-rate measurements, manual audits, or ground-truth comparisons, as no parallel reference corpora exist for the InPage archival sources

Circularity Check

0 steps flagged

No derivations, predictions, or self-referential steps present

full rationale

The paper is a descriptive report of dataset construction from archival and web sources, followed by an eleven-stage cleaning pipeline and empirical tokenization counts. No equations, fitted parameters, predictions, or first-principles derivations are claimed. The single self-citation to the InPage converter (Malik et al. 2024) supplies an external tool whose output is then processed and counted; the reported word counts, character totals, vocabulary size, and script-ratio statistic are direct measurements of the processed text rather than quantities defined in terms of themselves or forced by the citation. The central claims therefore remain independent of any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper is a data resource contribution that introduces no free parameters, mathematical axioms, or invented entities; it rests only on standard assumptions about text recovery and cleaning.

axioms (2)
  • domain assumption The InPage converter accurately recovers original Kashmiri text from proprietary desktop-publishing files.
    Invoked when assembling the archival portion of the dataset.
  • domain assumption The eleven-stage cleaning pipeline removes noise while preserving authentic Kashmiri linguistic content and script usage.
    Central to the reported 0.9965 Kashmiri script ratio and low Devanagari contamination.

pith-pipeline@v0.9.0 · 5563 in / 1647 out tokens · 72433 ms · 2026-05-10T16:01:05.198970+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

13 extracted references · 4 canonical work pages

  1. [1]

    A Robust InPage-to-Unicode Encoding Converter for Kashmiri and Related Languages: Addressing a 15-Year-Old Challenge

    Haq Nawaz Malik. A Robust InPage-to-Unicode Encoding Converter for Kashmiri and Related Languages: Addressing a 15-Year-Old Challenge. ResearchGate, 2024. https://www.researchgate.net/publication/387522744

  2. [2]

    KS-LIT-3M: A 3.1 Million Word Kashmiri Text Dataset for LLM Pretraining

    Haq Nawaz Malik. KS-LIT-3M: A 3.1 Million Word Kashmiri Text Dataset for LLM Pretraining. arXiv preprint arXiv:2601.01091, 2026.

  3. [3]

    mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

    Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of NAACL 2021, pages 483–498, 2021.

  4. [4]

    Unsupervised Cross-lingual Representation Learning at Scale

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of ACL 2020, pages 8440–8451, 2020.

  5. [5]

    The BigScience ROOTS dataset: A 1.6TB Composite Multilingual Dataset

    Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, et al. The BigScience ROOTS dataset: A 1.6TB Composite Multilingual Dataset. In Advances in Neural Information Processing Systems, 2022.

  6. [6]

    MuRIL: Multilingual Representations for Indian Languages

    Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja Nagipogu, Shachi Dave, et al. MuRIL: Multilingual Representations for Indian Languages. arXiv preprint arXiv:2103.10730, 2021.

  7. [7]

    The State and Fate of Linguistic Diversity and Inclusion in the NLP World

    Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In Proceedings of ACL 2020, pages 6282–6293, 2020.

  8. [8]

    IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages

    Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages. In Findings of EMNLP 2020, pages 4948–4961, 2020.

  9. [9]

    Sangraha: A Large-Scale Multilingual Dataset for Indian Languages

    Mohammed Safi Ur Rahman Khan et al. Sangraha: A Large-Scale Multilingual Dataset for Indian Languages. arXiv preprint, 2024

  10. [10]

    A Dependency Treebank of Kashmiri

    Irshad Ahmad Bhat, Vasu Sharma, Jonathon Read, and Dipti Misra Sharma. A Dependency Treebank of Kashmiri. In Proceedings of the LaTeCH Workshop, ACL 2014, 2014.

  11. [11]

    The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation

    Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, Francisco Guzmán, and Angela Fan. The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation. Transactions of the ACL, 10:522–538, 2022.

  12. [12]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL 2019, pages 4171–4186, 2019.

  13. [13]

    XLM-R: Unsupervised Cross-lingual Representation Learning at Scale

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, et al. XLM-R: Unsupervised Cross-lingual Representation Learning at Scale. arXiv preprint arXiv:1911.02116, 2020.