Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild
2 Pith papers cite this work.
abstract
Hy-MT2 is a family of fast-thinking multilingual translation models designed for complex real-world scenarios. It includes three model sizes: 1.8B, 7B, and 30B-A3B (MoE), all of which support translation among 33 languages and effectively follow translation instructions in multiple languages. Multi-dimensional evaluations show that Hy-MT2 delivers outstanding performance across general, real-world business, domain-specific, and instruction-following translation tasks. The 7B and 30B models outperform open-source models such as DeepSeek-V4-Pro and Kimi K2.6 in fast-thinking mode, while the lightweight 1.8B model also surpasses mainstream commercial APIs from providers such as Microsoft and Doubao overall. Moreover, when paired with AngelSlim's 1.25-bit extreme quantization for on-device deployment, the lightweight 1.8B model requires only 440 MB of storage and achieves a 1.5x inference speedup.
fields
cs.CL (2)
years
2026 (2)
verdicts
UNVERDICTED (2)
representative citing papers
HardMTBench is a difficulty-aware benchmark of 20,000 directional test items across 12 domains that widens GEMBA score ranges by a factor of two and reveals domain-specific weaknesses in 22 MT systems.
citing papers explorer
- IFMTBench: A Comprehensive Benchmark for Multilingual Translation Instruction Following
  IFMTBench is a new benchmark for multilingual translation instruction following that tests models on single and multi-constraint scenarios using deterministic checkers and LLM judges.
- HardMTBench: Stress-Testing Chinese-English Translation on Knowledge-Intensive Domains
  HardMTBench is a difficulty-aware benchmark of 20,000 directional test items across 12 domains that widens GEMBA score ranges by a factor of two and reveals domain-specific weaknesses in 22 MT systems.