KS-PRET-5M: A 5 Million Word, 12 Million Token Kashmiri Pretraining Dataset
Pith reviewed 2026-05-10 16:01 UTC · model grok-4.3
The pith
The largest public pretraining dataset for Kashmiri contains 5.09 million words and 12.13 million tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present KS-PRET-5M, the largest publicly available pretraining dataset for the Kashmiri language, comprising 5,090,244 words, 27,692,959 characters, and a vocabulary of 295,433 unique word types. The dataset was assembled from digitized archival and literary material recovered from the proprietary InPage format together with Unicode-native text from Kashmiri-language web sources. All text passed through an eleven-stage cleaning pipeline that achieves a mean Kashmiri script ratio of 0.9965. Empirical tokenization with google/muril-base-cased produces a subword ratio of 2.383 tokens per word and a total of approximately 12.13 million subword tokens. The resource is released as a single continuous text stream under CC BY 4.0.
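The headline numbers are internally consistent, which a quick arithmetic cross-check confirms. This sketch uses only the figures quoted in the abstract; it does not re-measure the corpus:

```python
# Consistency check of the reported KS-PRET-5M statistics.
# All inputs are figures quoted in the abstract; nothing is re-measured here.
words = 5_090_244        # reported word count
chars = 27_692_959       # reported character count
subword_ratio = 2.383    # tokens per word under google/muril-base-cased

est_tokens = words * subword_ratio
print(f"estimated subword tokens: {est_tokens / 1e6:.2f}M")  # ≈ 12.13M, matching the claim
print(f"mean word length: {chars / words:.2f} chars/word")
```

5,090,244 × 2.383 ≈ 12,130,051, which rounds to the stated 12.13 million; since the 2.383 ratio is itself rounded, the exact token count produced by the tokenizer may differ slightly from this product.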
What carries the argument
KS-PRET-5M itself, the cleaned continuous text stream assembled from archival recovery and web sources to serve directly as pretraining input.
Load-bearing premise
The eleven-stage cleaning pipeline and InPage converter preserve the linguistic content and representativeness of the original Kashmiri sources without introducing systematic errors or biases in the recovered text.
What would settle it
A systematic audit that finds frequent mistranscribed words, omitted passages, or script substitutions altering meaning across more than a small fraction of the dataset would show the cleaning and conversion steps did not preserve fidelity.
read the original abstract
We present KS-PRET-5M, the largest publicly available pretraining dataset for the Kashmiri language, comprising 5,090,244 (5.09M) words, 27,692,959 (27.6M) characters, and a vocabulary of 295,433 (295.4K) unique word types. We assembled the dataset from two source classes: digitized archival and literary material, encompassing literature, news, biographies, novels, poetry, religious scholarship, and academic writing, recovered from the proprietary InPage desktop-publishing format using the converter of Malik [1], and Unicode-native text collected from Kashmiri-language web sources. All text was processed through an eleven-stage cleaning pipeline that achieves a mean Kashmiri script ratio of 0.9965, reducing Devanagari contamination to 146 characters across the full dataset. We tokenized the dataset empirically using google/muril-base-cased, yielding a subword ratio of 2.383 tokens per word and a total of approximately 12.13 million subword tokens, substantially higher than prior estimates derived from non-Kashmiri Perso-Arabic analogues. KS-PRET-5M is released as a single continuous text stream under CC BY 4.0 to support language model pretraining, tokenizer training, and computational linguistic research for Kashmiri.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents KS-PRET-5M, the largest publicly available pretraining dataset for Kashmiri, comprising 5,090,244 words, 27,692,959 characters, and 295,433 unique word types. It is assembled from InPage-converted archival/literary sources (via Malik et al. 2024) and Unicode web text, processed through an eleven-stage cleaning pipeline that achieves a mean Kashmiri script ratio of 0.9965, tokenized with google/muril-base-cased to yield ~12.13M subword tokens (2.383 per word), and released as a single text stream under CC BY 4.0.
Significance. If the conversion and cleaning steps preserve authentic Kashmiri linguistic content without systematic artifacts, this would be a valuable addition to low-resource language resources, enabling pretraining, tokenizer development, and computational linguistics work on Kashmiri where few large corpora exist.
major comments (2)
- [Abstract and data-processing description] The headline statistics (5,090,244 words, 295,433 unique types, 12.13M tokens) are produced by the InPage converter and eleven-stage pipeline, yet no error-rate measurements, before/after sample pairs, manual audits, or ground-truth comparisons are reported. The Kashmiri script ratio of 0.9965 alone cannot detect orthographic substitutions, ligature errors, or word-boundary shifts that would directly change the reported counts and dataset utility.
- [Eleven-stage cleaning pipeline] Exact exclusion rules for each stage, their quantitative effect on final size/vocabulary, and any bias introduced by the InPage recovery are not specified, preventing assessment of whether the final corpus remains representative of the original sources.
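The referee's point that a script ratio is blind to boundary and substitution errors can be made concrete. A minimal sketch, assuming "Kashmiri script" is approximated by the Arabic Unicode blocks that Kashmiri's Perso-Arabic orthography draws from (the paper does not specify its exact character classes):

```python
# Approximate Kashmiri (Perso-Arabic) script blocks; an assumption, since the
# paper's exact character classes for the 0.9965 ratio are not documented.
ARABIC_RANGES = [(0x0600, 0x06FF), (0x0750, 0x077F), (0x08A0, 0x08FF),
                 (0xFB50, 0xFDFF), (0xFE70, 0xFEFF)]

def script_ratio(text: str) -> float:
    """Fraction of non-whitespace characters falling in the Arabic blocks."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    in_script = sum(1 for c in chars
                    if any(lo <= ord(c) <= hi for lo, hi in ARABIC_RANGES))
    return in_script / len(chars)

# The metric cannot see word-boundary shifts: deleting a space changes no
# character's script class, so the ratio is identical before and after.
sample = "کٲشُر زَبان"
assert script_ratio(sample) == script_ratio(sample.replace(" ", ""))
```

The same holds for ligature errors and same-script letter substitutions: any corruption that maps Arabic-block characters to other Arabic-block characters leaves the ratio at 1.0.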
minor comments (1)
- [Abstract] The subword token count is described as 'approximately 12.13 million'; an exact figure computed from the 5,090,244 words and 2.383 ratio would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on the KS-PRET-5M dataset description. We address each major comment below with honest assessments of what can be strengthened through revision.
read point-by-point responses
- Referee: [Abstract and data-processing description] The headline statistics (5,090,244 words, 295,433 unique types, 12.13M tokens) are produced by the InPage converter and eleven-stage pipeline, yet no error-rate measurements, before/after sample pairs, manual audits, or ground-truth comparisons are reported. The Kashmiri script ratio of 0.9965 alone cannot detect orthographic substitutions, ligature errors, or word-boundary shifts that would directly change the reported counts and dataset utility.
Authors: We agree that the script ratio metric alone is insufficient to fully validate against orthographic or boundary errors. The pipeline reduced Devanagari contamination to 146 characters, but no ground-truth clean versions of the archival sources exist for direct error rates or audits. We will add representative before-and-after sample pairs and per-stage removal statistics in the revised data-processing section to increase transparency. revision: partial
- Referee: [Eleven-stage cleaning pipeline] Exact exclusion rules for each stage, their quantitative effect on final size/vocabulary, and any bias introduced by the InPage recovery are not specified, preventing assessment of whether the final corpus remains representative of the original sources.
Authors: The manuscript provides a high-level description of the pipeline. We will revise the data-processing section to list the exact exclusion rules for each of the eleven stages and include a table quantifying the effect of each stage on word count and vocabulary size. We will also add a limitations paragraph on potential InPage converter artifacts, referencing the validation details in Malik et al. (2024). revision: yes
- Not addressable in revision: direct error-rate measurements, manual audits, or ground-truth comparisons, as no parallel reference corpora exist for the InPage archival sources.
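The per-stage accounting the authors promise could take roughly the following shape. The stage names and rules below are illustrative placeholders, not the paper's actual eleven stages:

```python
# Hypothetical per-stage accounting for a text-cleaning pipeline: each stage
# logs its effect on word count and vocabulary size, producing the kind of
# table the revision promises. The stages here are illustrative only.
def strip_latin(text: str) -> str:
    """Drop Latin letters (a stand-in for one contamination-removal rule)."""
    return "".join(c for c in text if not ("a" <= c.lower() <= "z"))

def collapse_whitespace(text: str) -> str:
    """Normalize runs of whitespace to single spaces."""
    return " ".join(text.split())

STAGES = [("strip_latin", strip_latin),
          ("collapse_whitespace", collapse_whitespace)]

def run_pipeline(text: str):
    """Apply each stage in order, recording word/vocab deltas per stage."""
    stats = []
    for name, stage in STAGES:
        before_words = len(text.split())
        before_vocab = len(set(text.split()))
        text = stage(text)
        stats.append({"stage": name,
                      "word_delta": len(text.split()) - before_words,
                      "vocab_delta": len(set(text.split())) - before_vocab})
    return text, stats
```

Summing the per-stage deltas reconciles the raw extraction counts with the final 5,090,244 words and 295,433 types, which is exactly the audit trail the referee asks for.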
Circularity Check
No derivations, predictions, or self-referential steps present
full rationale
The paper is a descriptive report of dataset construction from archival and web sources, followed by an eleven-stage cleaning pipeline and empirical tokenization counts. No equations, fitted parameters, predictions, or first-principles derivations are claimed. The single self-citation to the InPage converter (Malik et al. 2024) supplies an external tool whose output is then processed and counted; the reported word counts, character totals, vocabulary size, and script-ratio statistic are direct measurements of the processed text rather than quantities defined in terms of themselves or forced by the citation. The central claims therefore remain independent of any circular reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: The InPage converter accurately recovers original Kashmiri text from proprietary desktop-publishing files.
- Domain assumption: The eleven-stage cleaning pipeline removes noise while preserving authentic Kashmiri linguistic content and script usage.
Reference graph
Works this paper leans on
- [1] Haq Nawaz Malik. A Robust InPage-to-Unicode Encoding Converter for Kashmiri and Related Languages: Addressing a 15-Year-Old Challenge. ResearchGate, 2024. https://www.researchgate.net/publication/387522744
- [2] Haq Nawaz Malik. KS-LIT-3M: A 3.1 Million Word Kashmiri Text Dataset for LLM Pretraining. arXiv preprint arXiv:2601.01091, 2026.
- [3] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of NAACL 2021, pages 483–498, 2021.
- [4] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of ACL 2020, pages 8440–8451, 2020.
- [5] Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, et al. The BigScience ROOTS dataset: A 1.6TB Composite Multilingual Dataset. In Advances in Neural Information Processing Systems, 2022.
- [6] Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savita Khosa, Atreyee Dey, Shachi Dave, Ruchi Garg, Amna Nawaz Khan, and Partha Talukdar. MuRIL: Multilingual Representations for Indian Languages. arXiv preprint arXiv:2103.10730, 2021.
- [7] Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In Proceedings of ACL 2020, pages 6282–6293, 2020.
- [8] Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages. In Findings of EMNLP 2020, pages 4948–4961, 2020.
- [9] Mohammed Safi Ur Rahman Khan et al. Sangraha: A Large-Scale Multilingual Dataset for Indian Languages. arXiv preprint, 2024.
- [10] Irshad Ahmad Bhat, Vasu Sharma, Jonathon Read, and Dipti Misra Sharma. A Dependency Treebank of Kashmiri. In Proceedings of the LaTeCH Workshop, ACL 2014, 2014.
- [11] Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation. Transactions of the ACL, 10:522–538, 2022.
- [12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL 2019, pages 4171–4186, 2019.
- [13] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, et al. XLM-R: Unsupervised Cross-lingual Representation Learning at Scale. arXiv preprint arXiv:1911.02116, 2020.