2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CL (2)
citing papers
- Script Sensitivity: Benchmarking Language Models on Unicode, Romanized and Mixed-Script Sinhala
  Language models degrade over 300 times in performance on Romanized Sinhala versus Unicode, with model size showing no correlation to script robustness.
- SiDiaC: Sinhala Diachronic Corpus
  SiDiaC is a new historical corpus of Sinhala literary works spanning the 5th to 20th centuries, constructed via OCR digitization, orthography modernization, and genre-based annotation.