KS-PRET-5M: A 5 Million Word, 12 Million Token Kashmiri Pretraining Dataset
Pith reviewed 2026-05-10 16:01 UTC · model grok-4.3
The pith
The largest public pretraining dataset for Kashmiri contains 5.09 million words and 12.13 million tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present KS-PRET-5M, the largest publicly available pretraining dataset for the Kashmiri language, comprising 5,090,244 words, 27,692,959 characters, and a vocabulary of 295,433 unique word types. The dataset was assembled from digitized archival and literary material recovered from the proprietary InPage format together with Unicode-native text from Kashmiri-language web sources. All text passed through an eleven-stage cleaning pipeline that achieves a mean Kashmiri script ratio of 0.9965. Empirical tokenization with google/muril-base-cased produces a subword ratio of 2.383 tokens per word and a total of approximately 12.13 million subword tokens. The resource is released as a single continuous text stream under CC BY 4.0.
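The headline numbers are internally consistent, which a quick arithmetic cross-check confirms. This sketch uses only the figures quoted in the abstract; it does not re-measure the corpus:

```python
# Consistency check of the reported KS-PRET-5M statistics.
# All inputs are figures quoted in the abstract; nothing is re-measured here.
words = 5_090_244        # reported word count
chars = 27_692_959       # reported character count
subword_ratio = 2.383    # tokens per word under google/muril-base-cased

est_tokens = words * subword_ratio
print(f"estimated subword tokens: {est_tokens / 1e6:.2f}M")  # ≈ 12.13M, matching the claim
print(f"mean word length: {chars / words:.2f} chars/word")
```

5,090,244 × 2.383 ≈ 12,130,051, which rounds to the stated 12.13 million; since the 2.383 ratio is itself rounded, the exact token count produced by the tokenizer may differ slightly from this product.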
What carries the argument
KS-PRET-5M itself, the cleaned continuous text stream assembled from archival recovery and web sources to serve directly as pretraining input.
Load-bearing premise
The eleven-stage cleaning pipeline and InPage converter preserve the linguistic content and representativeness of the original Kashmiri sources without introducing systematic errors or biases in the recovered text.
What would settle it
A systematic audit that finds frequent mistranscribed words, omitted passages, or script substitutions altering meaning across more than a small fraction of the dataset would show the cleaning and conversion steps did not preserve fidelity.
read the original abstract
We present KS-PRET-5M, the largest publicly available pretraining dataset for the Kashmiri language, comprising 5,090,244 (5.09M) words, 27,692,959 (27.6M) characters, and a vocabulary of 295,433 (295.4K) unique word types. We assembled the dataset from two source classes: digitized archival and literary material, encompassing literature, news, biographies, novels, poetry, religious scholarship, and academic writing, recovered from the proprietary InPage desktop-publishing format using the converter of Malik [1], and Unicode-native text collected from Kashmiri-language web sources. All text was processed through an eleven-stage cleaning pipeline that achieves a mean Kashmiri script ratio of 0.9965, reducing Devanagari contamination to 146 characters across the full dataset. We tokenized the dataset empirically using google/muril-base-cased, yielding a subword ratio of 2.383 tokens per word and a total of approximately 12.13 million subword tokens, substantially higher than prior estimates derived from non-Kashmiri Perso-Arabic analogues. KS-PRET-5M is released as a single continuous text stream under CC BY 4.0 to support language model pretraining, tokenizer training, and computational linguistic research for Kashmiri.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents KS-PRET-5M, the largest publicly available pretraining dataset for Kashmiri, comprising 5,090,244 words, 27,692,959 characters, and 295,433 unique word types. It is assembled from InPage-converted archival/literary sources (via Malik et al. 2024) and Unicode web text, processed through an eleven-stage cleaning pipeline that achieves a mean Kashmiri script ratio of 0.9965, tokenized with google/muril-base-cased to yield ~12.13M subword tokens (2.383 per word), and released as a single text stream under CC BY 4.0.
Significance. If the conversion and cleaning steps preserve authentic Kashmiri linguistic content without systematic artifacts, this would be a valuable addition to low-resource language resources, enabling pretraining, tokenizer development, and computational linguistics work on Kashmiri where few large corpora exist.
major comments (2)
- [Abstract and data-processing description] The headline statistics (5,090,244 words, 295,433 unique types, 12.13M tokens) are produced by the InPage converter and eleven-stage pipeline, yet no error-rate measurements, before/after sample pairs, manual audits, or ground-truth comparisons are reported. The Kashmiri script ratio of 0.9965 alone cannot detect orthographic substitutions, ligature errors, or word-boundary shifts that would directly change the reported counts and dataset utility.
- [Eleven-stage cleaning pipeline] Exact exclusion rules for each stage, their quantitative effect on final size/vocabulary, and any bias introduced by the InPage recovery are not specified, preventing assessment of whether the final corpus remains representative of the original sources.
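The referee's point that a script ratio is blind to boundary and substitution errors can be made concrete. A minimal sketch, assuming "Kashmiri script" is approximated by the Arabic Unicode blocks that Kashmiri's Perso-Arabic orthography draws from (the paper does not specify its exact character classes):

```python
# Approximate Kashmiri (Perso-Arabic) script blocks; an assumption, since the
# paper's exact character classes for the 0.9965 ratio are not documented.
ARABIC_RANGES = [(0x0600, 0x06FF), (0x0750, 0x077F), (0x08A0, 0x08FF),
                 (0xFB50, 0xFDFF), (0xFE70, 0xFEFF)]

def script_ratio(text: str) -> float:
    """Fraction of non-whitespace characters falling in the Arabic blocks."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    in_script = sum(1 for c in chars
                    if any(lo <= ord(c) <= hi for lo, hi in ARABIC_RANGES))
    return in_script / len(chars)

# The metric cannot see word-boundary shifts: deleting a space changes no
# character's script class, so the ratio is identical before and after.
sample = "کٲشُر زَبان"
assert script_ratio(sample) == script_ratio(sample.replace(" ", ""))
```

The same holds for ligature errors and same-script letter substitutions: any corruption that maps Arabic-block characters to other Arabic-block characters leaves the ratio at 1.0.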
minor comments (1)
- [Abstract] The subword token count is described as 'approximately 12.13 million'; an exact figure computed from the 5,090,244 words and 2.383 ratio would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on the KS-PRET-5M dataset description. We address each major comment below with honest assessments of what can be strengthened through revision.
read point-by-point responses
- Referee: [Abstract and data-processing description] The headline statistics (5,090,244 words, 295,433 unique types, 12.13M tokens) are produced by the InPage converter and eleven-stage pipeline, yet no error-rate measurements, before/after sample pairs, manual audits, or ground-truth comparisons are reported. The Kashmiri script ratio of 0.9965 alone cannot detect orthographic substitutions, ligature errors, or word-boundary shifts that would directly change the reported counts and dataset utility.
Authors: We agree that the script ratio metric alone is insufficient to fully validate against orthographic or boundary errors. The pipeline reduced Devanagari contamination to 146 characters, but no ground-truth clean versions of the archival sources exist for direct error rates or audits. We will add representative before-and-after sample pairs and per-stage removal statistics in the revised data-processing section to increase transparency. revision: partial
- Referee: [Eleven-stage cleaning pipeline] Exact exclusion rules for each stage, their quantitative effect on final size/vocabulary, and any bias introduced by the InPage recovery are not specified, preventing assessment of whether the final corpus remains representative of the original sources.
Authors: The manuscript provides a high-level description of the pipeline. We will revise the data-processing section to list the exact exclusion rules for each of the eleven stages and include a table quantifying the effect of each stage on word count and vocabulary size. We will also add a limitations paragraph on potential InPage converter artifacts, referencing the validation details in Malik et al. (2024). revision: yes
- Not addressable in revision: direct error-rate measurements, manual audits, or ground-truth comparisons, as no parallel reference corpora exist for the InPage archival sources.
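The per-stage accounting the authors promise could take roughly the following shape. The stage names and rules below are illustrative placeholders, not the paper's actual eleven stages:

```python
# Hypothetical per-stage accounting for a text-cleaning pipeline: each stage
# logs its effect on word count and vocabulary size, producing the kind of
# table the revision promises. The stages here are illustrative only.
def strip_latin(text: str) -> str:
    """Drop Latin letters (a stand-in for one contamination-removal rule)."""
    return "".join(c for c in text if not ("a" <= c.lower() <= "z"))

def collapse_whitespace(text: str) -> str:
    """Normalize runs of whitespace to single spaces."""
    return " ".join(text.split())

STAGES = [("strip_latin", strip_latin),
          ("collapse_whitespace", collapse_whitespace)]

def run_pipeline(text: str):
    """Apply each stage in order, recording word/vocab deltas per stage."""
    stats = []
    for name, stage in STAGES:
        before_words = len(text.split())
        before_vocab = len(set(text.split()))
        text = stage(text)
        stats.append({"stage": name,
                      "word_delta": len(text.split()) - before_words,
                      "vocab_delta": len(set(text.split())) - before_vocab})
    return text, stats
```

Summing the per-stage deltas reconciles the raw extraction counts with the final 5,090,244 words and 295,433 types, which is exactly the audit trail the referee asks for.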
Circularity Check
No derivations, predictions, or self-referential steps present
full rationale
The paper is a descriptive report of dataset construction from archival and web sources, followed by an eleven-stage cleaning pipeline and empirical tokenization counts. No equations, fitted parameters, predictions, or first-principles derivations are claimed. The single self-citation to the InPage converter (Malik et al. 2024) supplies an external tool whose output is then processed and counted; the reported word counts, character totals, vocabulary size, and script-ratio statistic are direct measurements of the processed text rather than quantities defined in terms of themselves or forced by the citation. The central claims therefore remain independent of any circular reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: The InPage converter accurately recovers original Kashmiri text from proprietary desktop-publishing files.
- Domain assumption: The eleven-stage cleaning pipeline removes noise while preserving authentic Kashmiri linguistic content and script usage.
Reference graph
Works this paper leans on
- [1] Haq Nawaz Malik. A Robust InPage-to-Unicode Encoding Converter for Kashmiri and Related Languages: Addressing a 15-Year-Old Challenge. ResearchGate, 2024. https://www.researchgate.net/publication/387522744
- [2] Haq Nawaz Malik. KS-LIT-3M: A 3.1 Million Word Kashmiri Text Dataset for LLM Pretraining. arXiv preprint arXiv:2601.01091, 2026.
- [3] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of NAACL 2021, pages 483–498, 2021.
- [4] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of ACL 2020, pages 8440–8451, 2020.
- [5] Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, et al. The BigScience ROOTS dataset: A 1.6TB Composite Multilingual Dataset. In Advances in Neural Information Processing Systems, 2022.
- [6] Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savita Khosa, Atreyee Dey, Shachi Dave, Ruchi Garg, Amna Nawaz Khan, and Partha Talukdar. MuRIL: Multilingual Representations for Indian Languages. arXiv preprint arXiv:2103.10730, 2021.
- [7] Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In Proceedings of ACL 2020, pages 6282–6293, 2020.
- [8] Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages. In Findings of EMNLP 2020, pages 4948–4961, 2020.
- [9] Mohammed Safi Ur Rahman Khan et al. Sangraha: A Large-Scale Multilingual Dataset for Indian Languages. arXiv preprint, 2024.
- [10] Irshad Ahmad Bhat, Vasu Sharma, Jonathon Read, and Dipti Misra Sharma. A Dependency Treebank of Kashmiri. In Proceedings of the LaTeCH Workshop, ACL 2014, 2014.
- [11] Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation. Transactions of the ACL, 10:522–538, 2022.
- [12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL 2019, pages 4171–4186, 2019.
- [13] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, et al. XLM-R: Unsupervised Cross-lingual Representation Learning at Scale. arXiv preprint arXiv:1911.02116, 2020.