
arxiv: 2605.28218 · v1 · pith:7YD3LL2Xnew · submitted 2026-05-27 · 💻 cs.CL

IFMTBench: A Comprehensive Benchmark for Multilingual Translation Instruction Following

Pith reviewed 2026-06-29 12:29 UTC · model grok-4.3

classification 💻 cs.CL
keywords multilingual translation, instruction following, benchmark, constraint adherence, glossary constraints, structured formats, model evaluation, multi-constraint tasks

The pith

A new benchmark for multilingual translation instruction following reveals that constraint adherence scales more sharply with model size than semantic quality does and correlates only weakly with general instruction-following rankings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmark covering seven languages with thousands of single- and multi-constraint translation items that test preservation of formats, glossaries, context, and registers. Prior metrics such as BLEU capture only meaning and ignore these rules, while general instruction benchmarks overlook the cross-lingual setting. Evaluation of fifteen models shows larger models gain more on instruction following than on basic translation, that glossary and structured-format constraints create the steepest difficulty, and that rankings from non-translation benchmarks fail to predict behavior here. Real translation workflows routinely combine several such constraints at once, so the benchmark supplies a more direct signal for practical model selection.
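To make the notion of a deterministic constraint check concrete, here is a minimal sketch of two such checks: one for preserving a JSON schema and one for honoring a glossary. The function names, the substring-based glossary matching, and the sample strings are all assumptions made for illustration; the paper's actual checkers are not reproduced here and may differ.

```python
import json


def same_json_structure(source, translated):
    """True if two parsed JSON values share keys, nesting, and list lengths."""
    if isinstance(source, dict) and isinstance(translated, dict):
        return (source.keys() == translated.keys()
                and all(same_json_structure(source[k], translated[k])
                        for k in source))
    if isinstance(source, list) and isinstance(translated, list):
        return (len(source) == len(translated)
                and all(same_json_structure(s, t)
                        for s, t in zip(source, translated)))
    # Leaves: only the type has to match, since string values get translated.
    return type(source) is type(translated)


def check_structure(source_text, output_text):
    """Hypothetical structure gate: output must parse and mirror the source."""
    try:
        return same_json_structure(json.loads(source_text),
                                   json.loads(output_text))
    except json.JSONDecodeError:
        return False


def check_glossary(source_text, output_text, glossary):
    """Hypothetical glossary gate: each glossary term found in the source
    must show up in the output as its mandated target term (plain substring
    match, which is an assumption and ignores inflection)."""
    return all(target in output_text
               for term, target in glossary.items()
               if term in source_text)
```

A translation that renames a key or drops a mandated term fails its gate regardless of how fluent it is, which is the kind of failure that BLEU-style metrics do not surface.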

Core claim

The benchmark reveals systematic gaps missed by prior protocols: instruction following scales with size more sharply than translation quality, glossary and structured-format constraints dominate the difficulty gradient, and general instruction following rankings correlate only weakly with translation behavior.

What carries the argument

IFMTBench, a collection of 4,506 single-constraint and 2,838 multi-constraint items spanning six constraint dimensions and five compositional patterns, with instructions in all seven languages. A gating subset of constraints is verified by deterministic checkers and a continuous subset is scored by a rubric-based LLM judge; the two are combined under a multiplicative rule.
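The scoring rule can be sketched in a few lines. This follows the description in the Figure 2 caption (binary gates multiplied together, judge scores on a 0–5 scale normalized to [0, 1] and averaged, then the two multiplied). The function name, its signature, and the handling of items with no judged constraints are assumptions, not the authors' implementation.

```python
from statistics import mean


def if_score(gate_results, judge_scores, max_judge_score=5):
    """Sketch of a multiplicative IF_Score.

    gate_results: booleans from deterministic checkers (structure, glossary, ...).
    judge_scores: rubric scores from the LLM judge on a 0..max_judge_score scale.
    """
    # Gating product: a single failed gate zeroes the whole item.
    gate = 1.0 if all(gate_results) else 0.0

    if judge_scores:
        # Continuous mean after normalizing each rubric score to [0, 1].
        continuous = mean(s / max_judge_score for s in judge_scores)
    else:
        # Assumption: an item with no judged constraints is scored by gates alone.
        continuous = 1.0

    return gate * continuous
```

For example, an item that passes both gates with judge scores of 4 and 5 scores about 0.9, while the same item with one failed gate scores 0 no matter how the judge rated it.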

If this is right

  • Larger models will continue to show outsized gains on multi-constraint translation tasks relative to smaller models.
  • Glossary and structured-format constraints will remain the primary bottlenecks even as overall capability increases.
  • General instruction-following benchmarks will continue to give misleading signals about performance on translation-specific constraints.
  • Multi-constraint items will expose failures that single-constraint items miss.
  • Performance will vary with the language in which the instruction is given relative to the translation target.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training objectives that explicitly optimize for simultaneous satisfaction of glossary and format rules may close the observed gaps faster than scale alone.
  • The multiplicative scoring rule will penalize any single violation more harshly than additive alternatives, potentially changing which models appear strongest.
  • Extending the same constraint dimensions to additional language pairs could test whether the difficulty gradient generalizes beyond the seven languages tested.
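The point about the multiplicative rule changing which models look strongest can be illustrated with a toy comparison. The additive baseline below (a plain mean over all components) is a hypothetical alternative chosen for contrast, and the component values are made-up numbers, not results from the paper.

```python
def multiplicative(gates, continuous):
    """Product of binary gates times the mean of the continuous scores."""
    product = 1.0
    for g in gates:
        product *= g
    return product * (sum(continuous) / len(continuous))


def additive(gates, continuous):
    """Hypothetical additive alternative: plain mean over every component."""
    parts = list(gates) + list(continuous)
    return sum(parts) / len(parts)


# Illustrative outputs only (not data from the paper).
# Output A satisfies both gates but gets mediocre style/context scores.
a_gates, a_cont = [1.0, 1.0], [0.4, 0.4]
# Output B has perfect style/context scores but violates one gate.
b_gates, b_cont = [1.0, 0.0], [1.0, 1.0]

scores = {
    "A": (multiplicative(a_gates, a_cont), additive(a_gates, a_cont)),
    "B": (multiplicative(b_gates, b_cont), additive(b_gates, b_cont)),
}
```

Under the additive mean, B edges out A because its strong continuous scores offset the failed gate; under the multiplicative rule, B drops to zero and A ranks first. That is the ranking flip the bullet above anticipates.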

Load-bearing premise

The assumption that deterministic checkers, combined with a rubric-based LLM judge under a multiplicative scoring rule, measure constraint adherence accurately and without bias across the described constraint dimensions.

What would settle it

Human raters scoring the same model outputs on the identical constraints produce a different model ranking or difficulty ordering than the automated scores.
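One way to quantify such a disagreement is a rank correlation between the automated and the human model orderings. The sketch below uses Spearman's rho in its simple no-ties form; the choice of statistic, the function names, and the dictionary input format are assumptions, since the page does not specify how the comparison would be run.

```python
def ranks(scores):
    """Map each model name to its rank (1 = highest score). Assumes no ties."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(order)}


def spearman(auto_scores, human_scores):
    """Spearman rank correlation between two score dicts over the same models.

    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which is only valid
    when neither ranking contains ties and n >= 2.
    """
    auto_ranks, human_ranks = ranks(auto_scores), ranks(human_scores)
    n = len(auto_ranks)
    d_squared = sum((auto_ranks[m] - human_ranks[m]) ** 2 for m in auto_ranks)
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

A value near 1 would mean human raters reproduce the automated ordering; a value well below 1 would be evidence against the load-bearing premise.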

Figures

Figures reproduced from arXiv: 2605.28218 by Mao Zheng, Mingrui Sun, Mingyang Song, Zheng Li.

Figure 1: Three-stage data construction pipeline for IFMTBench. (1) Meta-prompt synthesis samples language pairs, domain tags, and style seeds with randomized parameters. (2) Constraint-driven generation instantiates prompts under seven dimensions of translation constraints. (3) Expert review verifies constraint satisfaction, rewrites unnatural phrasings, and performs a safety pass.

Figure 2: Hybrid evaluation pipeline. A deterministic rule validation module checks structure, glossary, code, and layout to produce a binary gate score. In parallel, a rubric-conditioned LLM judge scores style and context on a 0–5 scale, normalized to [0, 1]. The gating product and continuous mean are multiplied to yield the final IF_Score.

Figure 3: Overall results on IFMTBench. Each row is one of 15 models, ordered by aggregate IF_Score_total. Filled diamonds denote single-constraint IF_Score and hollow circles denote multi-constraint IF_Score. Δ annotates the absolute drop under composition. Models are color-coded by family.

Figure 4: Per-dimension IF_Score heatmaps. (a) Single-constraint: 15 models × 7 dimensions. (b) Multi-constraint: 4 co-occurring dimensions. Darker cells indicate higher scores. The structure column (boxed) degrades most sharply under composition; glossary remains the most discriminative dimension.

Figure 5: Contrasts single- and multi-constraint IF_Score for six representative models and decomposes the drop along the four co-occurring dimensions (context, glossary, structure, style). Panel (a): Single (solid) vs. Multi (hatched), axis "IF score" from 0 to 100. Models shown: Hy-MT2-1.8B, Hy-MT2-7B, Hy-MT2-A3B, gemma4 A4B, Qwen3.6 A3B, Gemini 3.1 Pro. Δ annotations, in the order extracted from the figure: −19.2, −17.1, −16.1, −9.3, −6.5, −7.4.

Modern translation workflows demand more than semantic equivalence. Users routinely require models to preserve JSON or HTML schemas, honor curated glossaries, disambiguate with provided context, and match prescribed registers, often several at once. Conventional metrics such as BLEU and xCOMET capture semantic fidelity but provide little signal on constraint adherence, while general instruction following benchmarks ignore the cross-lingual nature of translation. We introduce IFMTBench, a benchmark for multilingual translation instruction following covering seven languages, with 4,506 single-constraint and 2,838 multi-constraint items spanning six constraint dimensions and five compositional patterns with instructions issued in all seven languages. Constraints are split into a gating subset verified by deterministic checkers and a continuous subset scored by a rubric-based LLM judge, combined under a multiplicative rule that resists reward hacking. Evaluating 15 models reveals systematic gaps that prior protocols miss: Instruction following scales with size more sharply than translation quality, glossary and structured-format constraints dominate the difficulty gradient, and general instruction following rankings correlate only weakly with translation behavior. Our benchmark are available at https://github.com/Tencent-Hunyuan/Hy-MT2/tree/main/IFMTBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces IFMTBench, a benchmark for multilingual translation instruction following across seven languages. It comprises 4,506 single-constraint and 2,838 multi-constraint items spanning six constraint dimensions and five compositional patterns, with instructions in all seven languages. Constraints are partitioned into a gating subset verified by deterministic checkers and a continuous subset scored by a rubric-based LLM judge; scores are combined multiplicatively. Evaluation of 15 models is reported to reveal that instruction following scales more sharply with model size than translation quality, that glossary and structured-format constraints dominate difficulty, and that general instruction-following rankings correlate only weakly with translation-specific behavior.

Significance. If the LLM-judge component is shown to be reliable, the benchmark would address a genuine gap between conventional MT metrics (BLEU, xCOMET) and real-world translation workflows that impose simultaneous schema, glossary, context, and register constraints. The reported weak correlation between general IF rankings and translation behavior would be a useful empirical observation for the field.

major comments (2)
  1. [Evaluation / Methods] The central claims about sharper size scaling for instruction following, glossary/structured-format dominance, and weak correlation with general IF benchmarks all rest on the composite scores produced by the multiplicative rule. No inter-annotator agreement, human-LLM correlation, or judge-model details are reported for the rubric-based LLM judge across the seven languages (Evaluation / Methods section). Without such validation, systematic bias in the continuous subset could artifactually produce the reported difficulty gradients and model orderings.
  2. [Experiments] The manuscript states that 15 models were evaluated but supplies no information on model selection criteria, exact prompting templates, decoding parameters, or whether the same prompt format was used for all constraint types and languages. These details are required to interpret the scaling and correlation results.
minor comments (1)
  1. [Abstract] Abstract contains a subject-verb agreement error: "Our benchmark are available" should read "Our benchmark is available" (or "benchmarks are").

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate the revisions that will be made to the manuscript.

  1. Referee: [Evaluation / Methods] The central claims about sharper size scaling for instruction following, glossary/structured-format dominance, and weak correlation with general IF benchmarks all rest on the composite scores produced by the multiplicative rule. No inter-annotator agreement, human-LLM correlation, or judge-model details are reported for the rubric-based LLM judge across the seven languages (Evaluation / Methods section). Without such validation, systematic bias in the continuous subset could artifactually produce the reported difficulty gradients and model orderings.

    Authors: We agree that the LLM judge requires more documentation. The revised manuscript will specify the exact judge model, provide the full rubric, and include the prompting template used for scoring the continuous subset. We will also add a brief discussion of the multiplicative combination rule and its intended robustness against isolated judge errors. A comprehensive human-LLM correlation study across all languages and items was not conducted in the original work; we will note this as a limitation and indicate that the deterministic gating subset provides an independent check on a substantial portion of the data. revision: partial

  2. Referee: [Experiments] The manuscript states that 15 models were evaluated but supplies no information on model selection criteria, exact prompting templates, decoding parameters, or whether the same prompt format was used for all constraint types and languages. These details are required to interpret the scaling and correlation results.

    Authors: We acknowledge that these implementation details were omitted. The revised manuscript will add a dedicated subsection (or appendix) listing the 15 models and the criteria used for their selection, the exact prompt templates applied to single-constraint and multi-constraint items, the decoding parameters (temperature, top-p, etc.), and explicit confirmation that the same prompt structure was used across all languages and constraint types. revision: yes

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or evaluation


The paper introduces IFMTBench as an independent evaluation resource with explicitly defined constraint dimensions, deterministic gating checkers, and a separate rubric-based LLM judge under a multiplicative rule. Model evaluations and reported gradients (size scaling, constraint difficulty, weak correlation with general IF) are direct empirical outputs from applying these fixed procedures to 15 external models across 7 languages. No parameter is fitted to the evaluation results and then re-used as a 'prediction'; no uniqueness theorem or ansatz is imported via self-citation; no renaming of known results occurs. The derivation chain consists solely of benchmark definition followed by independent measurement, making the findings self-contained against external model outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the LLM judge is reliable for continuous constraints and that the chosen constraint dimensions and patterns are representative; no free parameters or invented entities are described.

axioms (1)
  • Domain assumption: a rubric-based LLM judge provides consistent and accurate scores for continuous constraint adherence.
    Invoked for scoring the continuous subset of items.


