IFMTBench: A Comprehensive Benchmark for Multilingual Translation Instruction Following

Mao Zheng; Mingrui Sun; Mingyang Song; Zheng Li

arXiv: 2605.28218 · v1 · pith:7YD3LL2X · submitted 2026-05-27 · 💻 cs.CL

Pith reviewed 2026-06-29 12:29 UTC · model grok-4.3

classification 💻 cs.CL

keywords multilingual translation · instruction following · benchmark · constraint adherence · glossary constraints · structured formats · model evaluation · multi-constraint tasks

The pith

A new benchmark for multilingual translation instruction following reveals that constraint adherence scales more sharply with model size than semantic quality does and correlates only weakly with general instruction-following rankings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmark covering seven languages with thousands of single- and multi-constraint translation items that test preservation of formats, glossaries, context, and registers. Prior metrics such as BLEU capture only meaning and ignore these rules, while general instruction benchmarks overlook the cross-lingual setting. Evaluation of fifteen models shows larger models gain more on instruction following than on basic translation, that glossary and structured-format constraints create the steepest difficulty, and that rankings from non-translation benchmarks fail to predict behavior here. Real translation workflows routinely combine several such constraints at once, so the benchmark supplies a more direct signal for practical model selection.

Core claim

The benchmark reveals systematic gaps missed by prior protocols: instruction following scales with size more sharply than translation quality, glossary and structured-format constraints dominate the difficulty gradient, and general instruction following rankings correlate only weakly with translation behavior.

What carries the argument

IFMTBench, a collection of 4,506 single-constraint and 2,838 multi-constraint items spanning six constraint dimensions and five compositional patterns, with instructions in all seven languages. A gating subset of constraints is verified by deterministic checkers, a continuous subset is scored by a rubric-based LLM judge, and the two are combined under a multiplicative rule.
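The multiplicative rule can be illustrated with a minimal sketch. It follows the Figure 2 caption (binary gate checks, judge scores on a 0–5 scale normalized to [0, 1], gate product times continuous mean). The function name `if_score`, its parameters, and the handling of items with no continuous constraints are assumptions made for illustration, not the authors' released code.

```python
from typing import Sequence


def if_score(gate_checks: Sequence[bool],
             judge_scores: Sequence[float],
             max_judge: float = 5.0) -> float:
    """Hypothetical IF_Score: gate product times mean normalized judge score."""
    # Gate product: 1.0 only if every deterministic check passed.
    gate = 1.0 if all(gate_checks) else 0.0

    # Assumption: an item with no continuous constraints is scored by the gate alone.
    if not judge_scores:
        return gate

    for s in judge_scores:
        if not 0.0 <= s <= max_judge:
            raise ValueError(f"judge score {s} outside [0, {max_judge}]")

    # Normalize each 0-5 rubric score to [0, 1] and average.
    continuous = sum(s / max_judge for s in judge_scores) / len(judge_scores)
    return gate * continuous


if __name__ == "__main__":
    # All gates pass, style scored 5/5 and context 4/5.
    print(if_score([True, True], [5, 4]))
    # One failed gate zeroes the item regardless of the judge scores.
    print(if_score([True, False], [5, 5]))
```

Under this reading a single failed gate drives the item score to zero, which is the property the abstract describes as resisting reward hacking.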

If this is right

Larger models will continue to show outsized gains on multi-constraint translation tasks relative to smaller models.
Glossary and structured-format constraints will remain the primary bottlenecks even as overall capability increases.
General instruction-following benchmarks will continue to give misleading signals about performance on translation-specific constraints.
Multi-constraint items will expose failures that single-constraint items miss.
Performance will vary with the language in which the instruction is given relative to the translation target.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Training objectives that explicitly optimize for simultaneous satisfaction of glossary and format rules may close the observed gaps faster than scale alone.
The multiplicative scoring rule will penalize any single violation more harshly than additive alternatives, potentially changing which models appear strongest.
Extending the same constraint dimensions to additional language pairs could test whether the difficulty gradient generalizes beyond the seven languages tested.
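The point about multiplicative versus additive scoring can be made concrete with a toy comparison. Everything here is hypothetical: the two models, their gate results, their judge means, and the particular additive alternative (the average of gate pass rate and continuous mean) are invented for illustration and do not come from the paper.

```python
from typing import Sequence


def multiplicative(gates: Sequence[bool], continuous_mean: float) -> float:
    """Any failed gate zeroes the score."""
    return (1.0 if all(gates) else 0.0) * continuous_mean


def additive(gates: Sequence[bool], continuous_mean: float) -> float:
    """One possible additive alternative: average of gate pass rate and judge mean."""
    pass_rate = sum(gates) / len(gates)
    return (pass_rate + continuous_mean) / 2


# Hypothetical model A: passes every gate, mediocre style and context scores.
model_a = ([True, True, True, True], 0.60)
# Hypothetical model B: fluent output, but breaks one structural gate.
model_b = ([True, True, True, False], 0.95)

if __name__ == "__main__":
    for name, (gates, cont) in {"A": model_a, "B": model_b}.items():
        print(name, multiplicative(gates, cont), additive(gates, cont))
```

In this toy case model A ranks first under the multiplicative rule and model B ranks first under the additive one, so the choice of combination rule alone can change which model appears stronger.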

Load-bearing premise

The assumption that deterministic checkers combined with a rubric-based LLM judge under a multiplicative scoring rule accurately and unbiasedly measure constraint adherence across the described constraint dimensions.

What would settle it

Human raters scoring the same model outputs on the identical constraints produce a different model ranking or difficulty ordering than the automated scores.
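One way to run this check is to compare the model ranking induced by automated scores with the ranking induced by human scores. The sketch below computes Spearman's rank correlation with the standard no-ties formula; the helper names and any scores passed in are placeholders, and the paper does not report such a study.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank models from best (1) to worst; assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


def spearman_rho(auto: Dict[str, float], human: Dict[str, float]) -> float:
    """Spearman's rho between two score tables over the same models (no ties)."""
    if auto.keys() != human.keys():
        raise ValueError("both tables must cover the same models")
    n = len(auto)
    if n < 2:
        raise ValueError("need at least two models")
    ra, rh = ranks(auto), ranks(human)
    d2 = sum((ra[m] - rh[m]) ** 2 for m in auto)
    return 1 - 6 * d2 / (n * (n * n - 1))


if __name__ == "__main__":
    # Placeholder numbers, not results from the paper.
    automated = {"model_x": 0.71, "model_y": 0.64, "model_z": 0.52}
    human = {"model_x": 0.69, "model_y": 0.55, "model_z": 0.60}
    print(spearman_rho(automated, human))
```

A rho close to 1 would support the automated ordering; a low or negative value would indicate that the hybrid scorer and human raters disagree about which models are better.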

Figures

Figures reproduced from arXiv: 2605.28218 by Mao Zheng, Mingrui Sun, Mingyang Song, Zheng Li.

**Figure 1.** Three-stage data construction pipeline for IFMTBench. (1) Meta-prompt synthesis samples language pairs, domain tags, and style seeds with randomized parameters. (2) Constraint-driven generation instantiates prompts under seven dimensions of translation constraints. (3) Expert review verifies constraint satisfaction, rewrites unnatural phrasings, and performs a safety pass.

**Figure 2.** Hybrid evaluation pipeline. A deterministic rule validation module checks structure, glossary, code, and layout to produce a binary gate score. In parallel, a rubric-conditioned LLM judge scores style and context on a 0–5 scale, normalized to [0, 1]. The gating product and continuous mean are multiplied to yield the final IF_Score.

**Figure 3.** Overall results on IFMTBench. Each row is one of 15 models, ordered by aggregate IF_Score_total. Filled diamonds denote single-constraint IF_Score and hollow circles denote multi-constraint IF_Score. ∆ annotates the absolute drop under composition. Models are color-coded by family.

**Figure 4.** Per-dimension IF_Score heatmaps. (a) Single-constraint: 15 models × 7 dimensions. (b) Multi-constraint: 4 co-occurring dimensions. Darker cells indicate higher scores. The structure column (boxed) degrades most sharply under composition; glossary remains the most discriminative dimension.

**Figure 5.** Contrast of single- and multi-constraint IF_Score for six representative models (Hy-MT2-1.8B, Hy-MT2-7B, Hy-MT2-A3B, gemma4 A4B, Qwen3.6 A3B, Gemini 3.1 Pro), with the drop decomposed along the four co-occurring dimensions (context, glossary, structure, style). Panel (a) plots single (solid) vs. multi (hatched) IF score on a 0–100 axis, with annotated drops Δ = −19.2, −17.1, −16.1, −9.3, −6.5, −7.4 in plot order.

Original abstract

Modern translation workflows demand more than semantic equivalence. Users routinely require models to preserve JSON or HTML schemas, honor curated glossaries, disambiguate with provided context, and match prescribed registers, often several at once. Conventional metrics such as BLEU and xCOMET capture semantic fidelity but provide little signal on constraint adherence, while general instruction following benchmarks ignore the cross-lingual nature of translation. We introduce IFMTBench, a benchmark for multilingual translation instruction following covering seven languages, with 4,506 single-constraint and 2,838 multi-constraint items spanning six constraint dimensions and five compositional patterns with instructions issued in all seven languages. Constraints are split into a gating subset verified by deterministic checkers and a continuous subset scored by a rubric-based LLM judge, combined under a multiplicative rule that resists reward hacking. Evaluating 15 models reveals systematic gaps that prior protocols miss: Instruction following scales with size more sharply than translation quality, glossary and structured-format constraints dominate the difficulty gradient, and general instruction following rankings correlate only weakly with translation behavior. Our benchmark are available at https://github.com/Tencent-Hunyuan/Hy-MT2/tree/main/IFMTBench.
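To make the idea of a deterministic gating check concrete, here is a minimal sketch of two such checks: one for JSON structure preservation and one for glossary adherence. The function names, the matching rules (same keys and list lengths, unchanged non-string values, plain substring matching for glossary terms), and the examples are assumptions for illustration; the benchmark's actual checkers may differ.

```python
import json
from typing import Any, Dict


def same_structure(src: Any, out: Any) -> bool:
    """True if out keeps src's JSON shape; string values may be translated."""
    if isinstance(src, dict):
        return (isinstance(out, dict) and src.keys() == out.keys()
                and all(same_structure(src[k], out[k]) for k in src))
    if isinstance(src, list):
        return (isinstance(out, list) and len(src) == len(out)
                and all(same_structure(a, b) for a, b in zip(src, out)))
    if isinstance(src, str):
        return isinstance(out, str)
    # Assumption: numbers, booleans, and null must be copied unchanged.
    return type(src) is type(out) and src == out


def structure_gate(source_json: str, output_text: str) -> bool:
    """Fail the gate if the output is not valid JSON or its shape changed."""
    try:
        out = json.loads(output_text)
    except json.JSONDecodeError:
        return False
    return same_structure(json.loads(source_json), out)


def glossary_gate(source_text: str, output_text: str,
                  glossary: Dict[str, str]) -> bool:
    """Every glossary term found in the source must appear as its required target."""
    return all(target in output_text
               for term, target in glossary.items()
               if term in source_text)


if __name__ == "__main__":
    src = '{"title": "Hello", "count": 2}'
    print(structure_gate(src, '{"title": "Bonjour", "count": 2}'))
    print(structure_gate(src, '{"titre": "Bonjour", "count": 2}'))
    print(glossary_gate("Open the dashboard", "Ouvrez le tableau de bord",
                        {"dashboard": "tableau de bord"}))
```

Substring matching is the simplest possible glossary rule and would miss inflected forms, which may be one reason the paper's appendix mentions a glossary fallback judge prompt.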

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New benchmark for constraint-aware multilingual MT evaluation, but the LLM judge for continuous scores has no reported human validation so the scaling and difficulty claims rest on untested ground.

The paper's main contribution is IFMTBench, a dataset of 7,344 items across seven languages that tests translation models on six constraint types (glossary, format, context, register, etc.) in both single- and multi-constraint settings. Instructions are issued in all seven languages, constraints are split into deterministic gating checks and rubric-scored continuous items, and the final metric multiplies the two so that any gating failure zeroes the score. That design directly targets a real workflow gap that BLEU and xCOMET ignore.

What works is the coverage and the compositional patterns. Having both single- and multi-constraint items, plus the multiplicative rule, makes reward hacking harder than simple accuracy averages. The reported pattern that instruction-following improves more steeply with model size than raw translation quality is at least plausible and worth checking.

The soft spot is the continuous scoring. The abstract gives no inter-annotator numbers, no human-LLM agreement on the rubric, and no judge-model or prompt details across the seven languages. If the LLM systematically underrates glossary adherence in lower-resource languages or overrates format constraints, the claimed difficulty ordering and the weak correlation with general instruction-following benchmarks become unreliable. Model selection and prompting details are also missing, so the 15-model comparison is hard to reproduce.

This is useful for groups already building or evaluating instruction-tuned MT systems who need a testbed beyond semantic metrics. It is not yet ready for heavy citation until the judge validation appears. A serious editor should send it to review; the core idea is sound and the gap is genuine, but the evaluation protocol needs the missing checks before the headline findings can be trusted.

Referee Report

2 major / 1 minor

Summary. The paper introduces IFMTBench, a benchmark for multilingual translation instruction following across seven languages. It comprises 4,506 single-constraint and 2,838 multi-constraint items spanning six constraint dimensions and five compositional patterns, with instructions in all seven languages. Constraints are partitioned into a gating subset verified by deterministic checkers and a continuous subset scored by a rubric-based LLM judge; scores are combined multiplicatively. Evaluation of 15 models is reported to reveal that instruction following scales more sharply with model size than translation quality, that glossary and structured-format constraints dominate difficulty, and that general instruction-following rankings correlate only weakly with translation-specific behavior.

Significance. If the LLM-judge component is shown to be reliable, the benchmark would address a genuine gap between conventional MT metrics (BLEU, xCOMET) and real-world translation workflows that impose simultaneous schema, glossary, context, and register constraints. The reported weak correlation between general IF rankings and translation behavior would be a useful empirical observation for the field.

major comments (2)

[Evaluation / Methods] The central claims about sharper size scaling for instruction following, glossary/structured-format dominance, and weak correlation with general IF benchmarks all rest on the composite scores produced by the multiplicative rule. No inter-annotator agreement, human-LLM correlation, or judge-model details are reported for the rubric-based LLM judge across the seven languages (Evaluation / Methods section). Without such validation, systematic bias in the continuous subset could artifactually produce the reported difficulty gradients and model orderings.
[Experiments] The manuscript states that 15 models were evaluated but supplies no information on model selection criteria, exact prompting templates, decoding parameters, or whether the same prompt format was used for all constraint types and languages. These details are required to interpret the scaling and correlation results.
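The human-LLM agreement the first major comment asks for could be reported with a chance-corrected statistic such as Cohen's kappa over the 0–5 rubric labels. The sketch below is a plain unweighted implementation; the function name and the example labels are hypothetical, and no such numbers appear in the paper.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(human: Sequence[int], judge: Sequence[int]) -> float:
    """Unweighted Cohen's kappa between two raters labeling the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)

    # Observed agreement: fraction of items given the identical label.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Expected agreement under independence, from each rater's label frequencies.
    ch, cj = Counter(human), Counter(judge)
    expected = sum(ch[label] * cj[label] for label in ch) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical rubric scores (0-5) for the same eight outputs.
    human_scores = [5, 4, 4, 2, 0, 3, 5, 1]
    judge_scores = [5, 4, 3, 2, 1, 3, 5, 1]
    print(cohens_kappa(human_scores, judge_scores))
```

Unweighted kappa treats a 4-versus-5 disagreement the same as 0-versus-5; for an ordinal rubric a weighted variant, computed per language, would likely be more informative.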

minor comments (1)

[Abstract] Abstract contains a subject-verb agreement error: "Our benchmark are available" should read "Our benchmark is available" (or "benchmarks are").

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate the revisions that will be made to the manuscript.

Referee: [Evaluation / Methods] The central claims about sharper size scaling for instruction following, glossary/structured-format dominance, and weak correlation with general IF benchmarks all rest on the composite scores produced by the multiplicative rule. No inter-annotator agreement, human-LLM correlation, or judge-model details are reported for the rubric-based LLM judge across the seven languages (Evaluation / Methods section). Without such validation, systematic bias in the continuous subset could artifactually produce the reported difficulty gradients and model orderings.

Authors: We agree that the LLM judge requires more documentation. The revised manuscript will specify the exact judge model, provide the full rubric, and include the prompting template used for scoring the continuous subset. We will also add a brief discussion of the multiplicative combination rule and its intended robustness against isolated judge errors. A comprehensive human-LLM correlation study across all languages and items was not conducted in the original work; we will note this as a limitation and indicate that the deterministic gating subset provides an independent check on a substantial portion of the data.
revision: partial

Referee: [Experiments] The manuscript states that 15 models were evaluated but supplies no information on model selection criteria, exact prompting templates, decoding parameters, or whether the same prompt format was used for all constraint types and languages. These details are required to interpret the scaling and correlation results.

Authors: We acknowledge that these implementation details were omitted. The revised manuscript will add a dedicated subsection (or appendix) listing the 15 models and the criteria used for their selection, the exact prompt templates applied to single-constraint and multi-constraint items, the decoding parameters (temperature, top-p, etc.), and explicit confirmation that the same prompt structure was used across all languages and constraint types.
revision: yes

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or evaluation

The paper introduces IFMTBench as an independent evaluation resource with explicitly defined constraint dimensions, deterministic gating checkers, and a separate rubric-based LLM judge under a multiplicative rule. Model evaluations and reported gradients (size scaling, constraint difficulty, weak correlation with general IF) are direct empirical outputs from applying these fixed procedures to 15 external models across 7 languages. No parameter is fitted to the evaluation results and then re-used as a 'prediction'; no uniqueness theorem or ansatz is imported via self-citation; no renaming of known results occurs. The derivation chain consists solely of benchmark definition followed by independent measurement, making the findings self-contained against external model outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the LLM judge is reliable for continuous constraints and that the chosen constraint dimensions and patterns are representative; no free parameters or invented entities are described.

axioms (1)

Domain assumption: a rubric-based LLM judge provides consistent and accurate scores for continuous constraint adherence.
Invoked for scoring the continuous subset of items.
