An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CL (3)
years: 2026 (3)
verdicts: UNVERDICTED (3)
citing papers explorer
- CHALIS: A Challenge Dataset for Language Identification in Difficult Scenarios
  Introduces CHALIS, a benchmark dataset that tests language identification on mutually intelligible cousin-language pairs and orthographically noisy inputs; the evaluation shows that existing systems struggle substantially.
- On the Limits of Model Merging for Multilinguality in Pre-Training
  Merging any combination of monolingual pre-trained models leads to performance collapse due to interference, indicating that the merging flexibility seen in fine-tuning does not extend to pre-training.
- CAT-Translate: Building Compact Open-Source Models for Japanese-English Translation
  Compact 0.8B-7B models for bidirectional Japanese-English translation outperform large multilingual models on real-world domain benchmarks.