pith. machine review for the scientific record. sign in

arxiv: 2605.11632 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Enhancing Multilingual Counterfactual Generation through Alignment-as-Preference Optimization

Authors on Pith no claims yet

Pith reviewed 2026-05-13 01:13 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords counterfactual explanationsmultilingual generationpreference optimizationdirect preference optimizationLLM explanationsvalidity minimalitymodel alignmentself-generated counterfactuals
0
0 comments X

The pith

A preference alignment method called Macro improves the validity of multilingual self-generated counterfactual explanations by 12.55 percent on average while maintaining minimality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Self-generated counterfactual explanations help explain LLM predictions by minimally changing inputs to flip outputs, but extending this to non-English languages faces a validity-minimality trade-off. The paper proposes Macro, which uses Direct Preference Optimization with pairs scored by a composite function that rewards both valid flips and small changes. This approach is tested on four LLMs and seven diverse languages, yielding higher validity rates than chain-of-thought prompting without hurting minimality, and outperforming translation and supervised fine-tuning baselines. Analyses show better cross-lingual consistency and fewer errors in the generated explanations.

Core claim

Macro applies Direct Preference Optimization to multilingual SCE generation using a composite scoring function to construct preference pairs that translate the validity-minimality trade-off into training signals. Across four LLMs and seven typologically diverse languages, it improves validity by 12.55% on average over chain-of-thought without degrading minimality, avoids the minimality issues of translation baselines, and surpasses supervised fine-tuning on both metrics, with added benefits in cross-lingual alignment and error reduction.

What carries the argument

Macro, a DPO framework that builds preference pairs via a composite scoring function evaluating both validity and minimality for multilingual counterfactual generation.

If this is right

  • Validity of generated explanations increases by 12.55% on average compared to chain-of-thought prompting.
  • Minimality is preserved, unlike in translation-based methods that violate it severely.
  • Performance on both validity and minimality exceeds that of supervised fine-tuning.
  • Cross-lingual perturbation alignment improves and common generation errors decrease.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar preference optimization could help resolve trade-offs in other LLM explanation or generation tasks.
  • Testing Macro on additional low-resource languages might reveal whether the method scales without language-specific adjustments.
  • The reliance on a composite score suggests that refining the scoring function could further enhance results in specific domains.

Load-bearing premise

The composite scoring function used to build preference pairs measures the validity and minimality trade-off accurately and without introducing bias across languages and models.

What would settle it

Running the same experiments with Macro on the four LLMs and seven languages and finding no significant average improvement in validity or a degradation in minimality compared to the chain-of-thought baseline would falsify the main claim.

Figures

Figures reproduced from arXiv: 2605.11632 by Bohao Chu, Jing Yang, Qianli Wang, Simon Ostermann, Yihong Liu, Yilong Wang.

Figure 1
Figure 1. Figure 1: Overview of our three-stage framework (MACRO). Stage 1 samples counterfactual candidates via word￾level perturbations across multilingual inputs. Stage 2 ranks candidates using Rflip, Raug, and Redit to construct preference pairs. Stage 3 applies DPO to align the model toward generating minimal, effective counterfactuals. achieved without degrading minimality, marking a pronounced distinction from the tran… view at source ↗
Figure 2
Figure 2. Figure 2: The validity-minimality trade-off across lan [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Relative performance change across languages for [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Cross-lingual edit similarity score changes [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Total score distributions before and after ap [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Label distributions of the two evaluation [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Prediction prompts used for the two evaluation datasets: [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Counterfactual generation prompts used for [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Dataset examples [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Impact of MACRO on multilingual general capability measured on MMLU [PITH_FULL_IMAGE:figures/full_fig_p027_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Impact of MACRO on reasoning capability measured on MMLU-ProX from the category perspective . Subfigures (a) and (b) present the category-wise performance of Qwen3-4B and Gemma3-4B, respectively [PITH_FULL_IMAGE:figures/full_fig_p028_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Impact of MACRO on cross-lingual generalization measured on MMLU-ProX from the language perspec￾tive. Subfigures (a) and (b) present the language-wise performance of Qwen3-4B and Gemma3-4B, respectively [PITH_FULL_IMAGE:figures/full_fig_p029_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: The validity-minimality trade-off across languages across all models on [PITH_FULL_IMAGE:figures/full_fig_p030_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Cross-lingual edit similarity scores [PITH_FULL_IMAGE:figures/full_fig_p030_15.png] view at source ↗
read the original abstract

Self-generated counterfactual explanations (SCEs) are minimally modified inputs (minimality) generated by large language models (LLMs) that flip their own predictions (validity), offering a causally grounded approach to unraveling black-box LLM behavior. Yet extending them beyond English remains challenging: existing methods struggle to produce valid SCEs in non-dominant languages, and a persistent trade-off between validity and minimality undermines explanation quality. We introduce Macro, a preference alignment framework that applies Direct Preference Optimization (DPO) to multilingual SCE generation, using a composite scoring function to construct preference pairs that effectively translate the trade-off into measurable preference signals. Experiments across four LLMs and seven typologically diverse languages show that Macro improves validity by 12.55\% on average over the chain-of-thought baseline without degrading minimality, while avoiding the severe minimality violations of the translation-based baseline. Compared to supervised fine-tuning, Macro achieves superior performance on both metrics, confirming that explicit preference optimization is essential for balancing this trade-off. Further analyses reveal that Macro increases cross-lingual perturbation alignment and mitigates common generation errors. Our results highlight preference optimization as a promising direction for enhancing multilingual model explanations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Macro, a preference alignment framework that applies Direct Preference Optimization (DPO) to multilingual self-generated counterfactual explanation (SCE) generation. Preference pairs are constructed using a composite scoring function that encodes the validity-minimality trade-off; experiments across four LLMs and seven typologically diverse languages report that Macro improves validity by 12.55% on average over a chain-of-thought baseline without degrading minimality, outperforms supervised fine-tuning, and avoids the minimality violations seen in translation-based baselines.

Significance. If the composite scoring function is shown to produce unbiased, language-agnostic preference signals that align with human judgments of explanation quality, the result would be significant for multilingual explainable AI. It would demonstrate that explicit preference optimization can resolve the validity-minimality trade-off more effectively than standard fine-tuning or translation pipelines, with potential implications for cross-lingual model interpretability.

major comments (2)
  1. [§3.2] §3.2 (Preference Pair Construction): The composite scoring function used to order pairs for DPO is described only at a high level; the explicit formula, the weighting scheme between validity and minimality components, and any cross-lingual or cross-model validation of those weights are not provided. Because the entire DPO training signal depends on the ordering induced by this function, the absence of these details makes it impossible to verify that the reported 12.55% validity gain reflects a genuine improvement rather than an artifact of the scoring rule.
  2. [§4] §4 (Experiments): The headline result of a 12.55% average validity improvement is presented without language-specific breakdowns, per-model tables, error bars, or statistical significance tests. In addition, the precise operational definitions of the validity and minimality metrics (and how they are computed for non-English inputs) are not stated. These omissions are load-bearing because the central claim is an empirical average over seven typologically diverse languages; without the supporting data it cannot be assessed whether the improvement is uniform or driven by a subset of languages or models.
minor comments (2)
  1. [Abstract] The abstract and introduction use the acronym 'Macro' without expanding it or briefly glossing its construction.
  2. [§3] Notation for the validity and minimality scores is introduced without a consolidated table of symbols, making it harder to track how the composite function is assembled.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments identify key areas where additional detail will improve clarity and verifiability. We address each major comment below and will revise the manuscript to incorporate the requested information.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Preference Pair Construction): The composite scoring function used to order pairs for DPO is described only at a high level; the explicit formula, the weighting scheme between validity and minimality components, and any cross-lingual or cross-model validation of those weights are not provided. Because the entire DPO training signal depends on the ordering induced by this function, the absence of these details makes it impossible to verify that the reported 12.55% validity gain reflects a genuine improvement rather than an artifact of the scoring rule.

    Authors: We agree that the current high-level description in §3.2 leaves important details unspecified. In the revised manuscript we will expand this section to provide the explicit formula for the composite scoring function, the precise weighting scheme applied to the validity and minimality components, and the results of cross-lingual and cross-model validation performed to confirm that the induced preference ordering is stable. These additions will allow readers to reproduce the preference-pair construction and assess whether the reported gains arise from the scoring rule itself. revision: yes

  2. Referee: [§4] §4 (Experiments): The headline result of a 12.55% average validity improvement is presented without language-specific breakdowns, per-model tables, error bars, or statistical significance tests. In addition, the precise operational definitions of the validity and minimality metrics (and how they are computed for non-English inputs) are not stated. These omissions are load-bearing because the central claim is an empirical average over seven typologically diverse languages; without the supporting data it cannot be assessed whether the improvement is uniform or driven by a subset of languages or models.

    Authors: We acknowledge that §4 would benefit from more granular reporting. In the revision we will add language-specific and per-model tables for both validity and minimality, include error bars, and report statistical significance via paired t-tests. We will also state the operational definitions explicitly: validity is the fraction of generated SCEs that flip the model’s original prediction, and minimality is the normalized token-level edit distance. Both metrics are computed using language-appropriate tokenizers and the same underlying classifier for all languages, ensuring consistent evaluation across the seven typologically diverse languages. These changes will demonstrate that the 12.55 % average improvement is not driven by a subset of languages or models. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results from DPO on externally scored pairs

full rationale

The paper's chain consists of (1) defining a composite scorer to rank candidate counterfactuals, (2) building preference pairs from those rankings, (3) running DPO, and (4) measuring validity/minimality gains on held-out test sets across four LLMs and seven languages. None of these steps reduces to its own inputs by construction: the scorer is an input assumption whose correctness is tested by the downstream human-aligned metrics, the DPO objective is standard, and the reported 12.55 % average improvement is an empirical average against independent baselines. No equations, self-definitional loops, fitted-parameter-as-prediction, or load-bearing self-citations appear in the abstract or method description.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unstated assumption that the composite scoring function produces reliable preference signals without language-specific biases; no free parameters, axioms, or invented entities are explicitly named in the abstract.

axioms (1)
  • domain assumption Composite scoring function accurately reflects the validity-minimality trade-off for preference pair construction
    Invoked to enable DPO training; appears in the method description in the abstract.

pith-pipeline@v0.9.0 · 5518 in / 1296 out tokens · 54636 ms · 2026-05-13T01:13:58.101109+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages

  1. [1]

    and Mao, Yanke and Gao, Haonan and Lee, En-Shiun Annie

    Adelani, David Ifeoluwa and Liu, Hannah and Shen, Xiaoyu and Vassilyev, Nikita and Alabi, Jesujoba O. and Mao, Yanke and Gao, Haonan and Lee, En-Shiun Annie. SIB -200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects. Proceedings of the 18th Conference of the European Chapter of the Association for Co...

  2. [2]

    2026 , eprint=

    From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations , author=. 2026 , eprint=

  3. [3]

    Are Text Classifiers Xenophobic? A Country-Oriented Bias Detection Method with Least Confounding Variables

    Barriere, Valentin and Cifuentes, Sebastian. Are Text Classifiers Xenophobic? A Country-Oriented Bias Detection Method with Least Confounding Variables. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024

  4. [4]

    A Study of Nationality Bias in Names and Perplexity using Off-the-Shelf Affect-related Tweet Classifiers

    Barriere, Valentin and Cifuentes, Sebastian. A Study of Nationality Bias in Names and Perplexity using Off-the-Shelf Affect-related Tweet Classifiers. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.34

  5. [5]

    2024 , volume =

    Bhattacharjee, Amrita and Moraffah, Raha and Garland, Joshua and Liu, Huan , booktitle =. 2024 , volume =. doi:10.1109/BigData62323.2024.10825537 , url =

  6. [6]

    TIGTEC: Token Importance Guided TExt Counterfactuals

    Bhan, Milan and Vittaut, Jean-No \"e l and Chesneau, Nicolas and Lesot, Marie-Jeanne. TIGTEC: Token Importance Guided TExt Counterfactuals. Machine Learning and Knowledge Discovery in Databases: Research Track. 2023

  7. [7]

    the method of paired comparisons , author=

    Rank analysis of incomplete block designs: I. the method of paired comparisons , author=. Biometrika , volume=. 1952 , publisher=

  8. [8]

    Analyzing the Effect of Linguistic Similarity on Cross-Lingual Transfer: Tasks and Experimental Setups Matter

    Blaschke, Verena and Fedzechkina, Masha and Ter Hoeve, Maartje. Analyzing the Effect of Linguistic Similarity on Cross-Lingual Transfer: Tasks and Experimental Setups Matter. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.454

  9. [9]

    2026 , eprint=

    DRIV-EX: Counterfactual Explanations for Driving LLMs , author=. 2026 , eprint=

  10. [10]

    Unsupervised Cross-lingual Representation Learning at Scale

    Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm \'a n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin. Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. ...

  11. [11]

    XNLI : Evaluating Cross-lingual Sentence Representations

    Conneau, Alexis and Rinott, Ruty and Lample, Guillaume and Williams, Adina and Bowman, Samuel and Schwenk, Holger and Stoyanov, Veselin. XNLI : Evaluating Cross-lingual Sentence Representations. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. doi:10.18653/v1/D18-1269

  12. [12]

    RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLM s

    Dang, John and Ahmadian, Arash and Marchisio, Kelly and Kreutzer, Julia and. RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLM s. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.729

  13. [13]

    Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of m T 5 and B y T 5

    Dang, Thao Anh and Raviv, Limor and Galke, Lukas. Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of m T 5 and B y T 5. Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP-2025). 2025

  14. [14]

    Can LLM s Explain Themselves Counterfactually?

    Dehghanighobadi, Zahra and Fischer, Asja and Zafar, Muhammad Bilal. Can LLM s Explain Themselves Counterfactually?. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.396

  15. [15]

    Proceedings of the 41st International Conference on Machine Learning , articleno =

    Ethayarajh, Kawin and Xu, Winnie and Muennighoff, Niklas and Jurafsky, Dan and Kiela, Douwe , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

  16. [16]

    Do Multilingual Language Models Think Better in E nglish?

    Etxaniz, Julen and Azkune, Gorka and Soroa, Aitor and Lopez de Lacalle, Oier and Artetxe, Mikel. Do Multilingual Language Models Think Better in E nglish?. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers). 2024. doi:10.18653/v1/2024.naacl...

  17. [17]

    Goodhart, C. A. E. Problems of Monetary Management: The UK Experience. Monetary Theory and Practice: The UK Experience. 1984. doi:10.1007/978-1-349-17295-5_4

  18. [18]

    2005 , publisher=

    The world atlas of language structures , author=. 2005 , publisher=

  19. [19]

    International Conference on Learning Representations , year=

    Measuring Massive Multitask Language Understanding , author=. International Conference on Learning Representations , year=

  20. [20]

    Unlocking language barriers: Assessing pre-trained large language models across multilingual tasks and unveiling the black box with Explainable Artificial Intelligence , journal =

    Muhamet Kastrati and Ali Shariq Imran and Ehtesham Hashmi and Zenun Kastrati and Sher Muhammad Daudpota and Marenglen Biba , keywords =. Unlocking language barriers: Assessing pre-trained large language models across multilingual tasks and unveiling the black box with Explainable Artificial Intelligence , journal =. 2025 , issn =. doi:https://doi.org/10.1...

  21. [21]

    Memory-efficient NLLB -200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model

    Koishekenov, Yeskendir and Berard, Alexandre and Nikoulina, Vassilina. Memory-efficient NLLB -200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.198

  22. [22]

    The annals of mathematical statistics , volume=

    On information and sufficiency , author=. The annals of mathematical statistics , volume=. 1951 , publisher=

  23. [23]

    C hat GPT Beyond E nglish: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning

    Lai, Viet Dac and Ngo, Nghia and Pouran Ben Veyseh, Amir and Man, Hieu and Dernoncourt, Franck and Bui, Trung and Nguyen, Thien Huu. C hat GPT Beyond E nglish: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.878

  24. [24]

    Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback

    Lai, Viet and Nguyen, Chien and Ngo, Nghia and Nguyen, Thuat and Dernoncourt, Franck and Rossi, Ryan and Nguyen, Thien. Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2023. do...

  25. [25]

    2025 , eprint=

    CONGRAD:Conflicting Gradient Filtering for Multilingual Preference Alignment , author=. 2025 , eprint=

  26. [26]

    Counterfactual Data Augmentation for Neural Machine Translation

    Liu, Qi and Kusner, Matt and Blunsom, Phil. Counterfactual Data Augmentation for Neural Machine Translation. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021. doi:10.18653/v1/2021.naacl-main.18

  27. [27]

    2026 , eprint=

    GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization , author=. 2026 , eprint=

  28. [28]

    and Delaney, Eoin D

    Mayne, Harry and Kearns, Ryan Othniel and Yang, Yushi and Bean, Andrew M. and Delaney, Eoin D. and Russell, Chris and Mahdi, Adam. LLM s Don ' t Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2...

  29. [29]

    Taxi1500: A Dataset for Multilingual Text Classification in 1500 Languages

    Ma, Chunlan and Imani, Ayyoob and Ye, Haotian and Pei, Renhao and Asgari, Ehsaneddin and Schuetze, Hinrich. Taxi1500: A Dataset for Multilingual Text Classification in 1500 Languages. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Pape...

  30. [30]

    LLM s for Generating and Evaluating Counterfactuals: A Comprehensive Study

    Nguyen, Van Bach and Youssef, Paul and Seifert, Christin and Schl. LLM s for Generating and Evaluating Counterfactuals: A Comprehensive Study. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.870

  31. [31]

    The Roles of English in Evaluating Multilingual Language Models

    Poelman, Wessel and de Lhoneux, Miryam. The Roles of English in Evaluating Multilingual Language Models. Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025). 2025

  32. [32]

    When Models Reason in Your Language: Controlling Thinking Language Comes at the Cost of Accuracy

    Qi, Jirui and Chen, Shan and Xiong, Zidi and Fern \'a ndez, Raquel and Bitterman, Danielle and Bisazza, Arianna. When Models Reason in Your Language: Controlling Thinking Language Comes at the Cost of Accuracy. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.1103

  33. [33]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model , url =

    Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D and Ermon, Stefano and Finn, Chelsea , booktitle =. Direct Preference Optimization: Your Language Model is Secretly a Reward Model , url =

  34. [34]

    Explainability and Interpretability of Multilingual Large Language Models: A Survey

    Resck, Lucas and Augenstein, Isabelle and Korhonen, Anna. Explainability and Interpretability of Multilingual Large Language Models: A Survey. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1033

  35. [35]

    Explaining NLP Models via Minimal Contrastive Editing ( M i CE )

    Ross, Alexis and Marasovi \'c , Ana and Peters, Matthew. Explaining NLP Models via Minimal Contrastive Editing ( M i CE ). Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021. doi:10.18653/v1/2021.findings-acl.336

  36. [36]

    2025 , isbn =

    Saha Roy, Rishiraj and Schlotthauer, Joel and Hinze, Chris and Foltyn, Andreas and Hahn, Luzian and Kuech, Fabian , title =. 2025 , isbn =. doi:10.1145/3701551.3704126 , booktitle =

  37. [37]

    2024 , eprint=

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. 2024 , eprint=

  38. [38]

    MAPO : Advancing Multilingual Reasoning through Multilingual-Alignment-as-Preference Optimization

    She, Shuaijie and Zou, Wei and Huang, Shujian and Zhu, Wenhao and Liu, Xiang and Geng, Xiang and Chen, Jiajun. MAPO : Advancing Multilingual Reasoning through Multilingual-Alignment-as-Preference Optimization. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.539

  39. [39]

    m GPT : Few-Shot Learners Go Multilingual

    Shliazhko, Oleh and Fenogenova, Alena and Tikhonova, Maria and Kozlova, Anastasia and Mikhailov, Vladislav and Shavrina, Tatiana. m GPT : Few-Shot Learners Go Multilingual. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00633

  40. [40]

    BioScience , volume =

    Steigerwald, Emma and Ramírez-Castañeda, Valeria and Brandt, Débora Y C and Báldi, András and Shapiro, Julie Teresa and Bowker, Lynne and Tarvin, Rebecca D , title =. BioScience , volume =. 2022 , month =. doi:10.1093/biosci/biac062 , url =

  41. [41]

    2024 , eprint=

    Self-rationalization improves LLM as a fine-grained judge , author=. 2024 , eprint=

  42. [42]

    2024 , eprint=

    Anchored Alignment for Self-Explanations Enhancement , author=. 2024 , eprint=

  43. [43]

    2026 , eprint =

    Parallel Universes, Parallel Languages: A Comprehensive Study on LLM-based Multilingual Counterfactual Example Generation , author =. 2026 , eprint =

  44. [44]

    Multilingual Datasets for Custom Input Extraction and Explanation Requests Parsing in Conversational XAI Systems

    Wang, Qianli and Anikina, Tatiana and Feldhus, Nils and Ostermann, Simon and Splitt, Fedor and Li, Jiaao and Tsoneva, Yoana and M. Multilingual Datasets for Custom Input Extraction and Explanation Requests Parsing in Conversational XAI Systems. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.29

  45. [45]

    2026 , eprint =

    iFlip: Iterative Feedback-driven Counterfactual Example Refinement , author =. 2026 , eprint =

  46. [46]

    ACL Findings , month =

    Wang, Qianli and Feldhus, Nils and Ostermann, Simon and Villa-Arenas, Luis Felipe and Möller, Sebastian and Schmitt, Vera , editor =. ACL Findings , month =. 2025 , address =. doi:10.18653/v1/2025.findings-acl.64 , pages =

  47. [47]

    A Survey on Natural Language Counterfactual Generation

    Wang, Yongjie and Qiu, Xiaoqi and Yue, Yu and Guo, Xu and Zeng, Zhiwei and Feng, Yuhong and Shen, Zhiqi. A Survey on Natural Language Counterfactual Generation. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.276

  48. [48]

    and Le, Quoc V

    Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Ichter, Brian and Xia, Fei and Chi, Ed H. and Le, Quoc V. and Zhou, Denny , title =. Proceedings of the 36th International Conference on Neural Information Processing Systems , articleno =. 2022 , isbn =

  49. [49]

    MMLU - P ro X : A Multilingual Benchmark for Advanced Large Language Model Evaluation

    Xuan, Weihao and Yang, Rui and Qi, Heli and Zeng, Qingcheng and Xiao, Yunze and Feng, Aosong and Liu, Dairui and Xing, Yun and Wang, Junjue and Gao, Fan and Lu, Jinghui and Jiang, Yuang and Li, Huitao and Li, Xin and Yu, Kunyu and Dong, Ruihai and Gu, Shangding and Li, Yuekang and Xie, Xiaofei and Juefei-Xu, Felix and Khomh, Foutse and Yoshie, Osamu and C...

  50. [50]

    supports O1

    Zhao, Haiyan and Chen, Hanjie and Yang, Fan and Liu, Ninghao and Deng, Huiqi and Cai, Hengyi and Wang, Shuaiqiang and Yin, Dawei and Du, Mengnan , title =. 2024 , issue_date =. doi:10.1145/3639372 , journal =

  51. [51]

    PLOS ONE , publisher =

    EMSA: Explainable multilingual sentiment analysis models providing sentiment analysis across multiple languages , year =. PLOS ONE , publisher =. doi:10.1371/journal.pone.0333508 , author =

  52. [52]

    Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and Zhang, Xiaokang and Yu, Xingkai and Wu, Yu and Wu, Z. F. and Gou, Zhibin and Shao, Zhihong and Li, Zhuoshu and Gao, Ziyi and Liu, Aixin and Xue, Bing and Wang, Bingxuan and Wu, Bochao and Feng, Bei ...

  53. [53]

    2025 , eprint=

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training , author=. 2025 , eprint=

  54. [54]

    2024 , eprint=

    OpenAI o1 System Card , author=. 2024 , eprint=

  55. [55]

    2025 , eprint=

    Qwen3 Technical Report , author=. 2025 , eprint=

  56. [56]

    2025 , eprint=

    Gemma 3 Technical Report , author=. 2025 , eprint=