Effect of Unknown and Fragmented Tokens on the Performance of Multilingual Language Models at Low-Resource Tasks
1 Pith paper cites this work.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Script Sensitivity: Benchmarking Language Models on Unicode, Romanized and Mixed-Script Sinhala
Language model performance degrades by a factor of over 300 on Romanized Sinhala relative to Unicode Sinhala, and model size shows no correlation with script robustness.