An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)

Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Mateusz Klimaszewski, Ville Komulainen, Andrey Kutuzov, Joon · 2025 · DOI 10.18653/v1/2025.acl-long.854

3 Pith papers cite this work. Polarity classification is still being indexed.


representative citing papers

CHALIS: A Challenge Dataset for Language Identification in Difficult Scenarios

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

Introduces CHALIS benchmark dataset testing language ID on mutually intelligible cousin language pairs and orthographically noisy inputs, with evaluation showing existing systems struggle substantially.

On the Limits of Model Merging for Multilinguality in Pre-Training

cs.CL · 2026-05-25 · unverdicted · novelty 5.0

Merging any combination of monolingual pre-trained models leads to performance collapse due to interference, indicating that merging flexibility from fine-tuning does not extend to pre-training.

CAT-Translate: Building Compact Open-Source Models for Japanese-English Translation

cs.CL · 2026-06-19 · unverdicted · novelty 3.0

Compact 0.8B-7B models for bidirectional Japanese-English translation outperform large multilingual models on real-world domain benchmarks.

citing papers explorer

Showing 3 of 3 citing papers after filters.

CHALIS: A Challenge Dataset for Language Identification in Difficult Scenarios
cs.CL · 2026-06-04 · unverdicted · none · ref 61

On the Limits of Model Merging for Multilinguality in Pre-Training
cs.CL · 2026-05-25 · unverdicted · none · ref 7

CAT-Translate: Building Compact Open-Source Models for Japanese-English Translation
cs.CL · 2026-06-19 · unverdicted · none · ref 5
