
arxiv: 2308.09583 · v3 · pith:L6ZPUUAInew · submitted 2023-08-18 · 💻 cs.CL · cs.AI · cs.LG

WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

Pith reviewed 2026-05-17 03:56 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG
keywords mathematical reasoning · large language models · reinforcement learning · instruction evolution · chain of thought · GSM8K · MATH benchmark

The pith

WizardMath applies reinforced evol-instruct feedback to boost LLMs' math chain-of-thought reasoning without external tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WizardMath as a way to strengthen mathematical reasoning in large language models by using Reinforcement Learning from Evol-Instruct Feedback, or RLEIF. This process evolves instructions and applies process supervision directly in the math domain. A sympathetic reader would care because the resulting models, especially the 70B version, reach or exceed the performance of closed models such as GPT-3.5-Turbo, Claude 2, Gemini Pro, and early GPT-4 on GSM8K and MATH benchmarks. The work also points to instruction evolution and process supervision as central to these gains and shows strong results even with the smaller Mistral 7B base.

Core claim

By applying Reinforcement Learning from Evol-Instruct Feedback (RLEIF) to the math domain, WizardMath enhances the mathematical CoT reasoning abilities of LLMs without using external python tools, yielding WizardMath-Mistral 7B that surpasses top-tier open-source LLMs and WizardMath 70B that outperforms GPT-3.5-Turbo, Claude 2, Gemini Pro and GPT-4-early-version on GSM8K and MATH.

What carries the argument

Reinforcement Learning from Evol-Instruct Feedback (RLEIF), which evolves math instructions iteratively and reinforces correct reasoning steps through feedback.

Load-bearing premise

The reported performance gains stem primarily from the RLEIF procedure rather than from differences in the base model, data mixture, or evaluation protocol.

What would settle it

A controlled experiment that fine-tunes the identical base models on the same evolved instructions but omits the reinforcement learning feedback loop and checks whether the large accuracy lifts on GSM8K and MATH still appear.
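
One way to read off the outcome of such a controlled experiment is a paired bootstrap over per-problem results. The sketch below is illustrative only: the function name, the labels "SFT-only" and "SFT + RL", and the toy outcome lists are assumptions, not artifacts or numbers from the paper. It assumes both models were scored on the same test items and that each outcome is recorded as 0 or 1.

```python
import random


def paired_bootstrap_diff(correct_a, correct_b, n_resamples=2000, seed=0):
    """Return (observed_diff, (low, high)) for accuracy(b) - accuracy(a).

    correct_a and correct_b are per-problem 0/1 outcomes on the SAME test
    items, e.g. a = SFT-only model, b = SFT + RL model (hypothetical labels).
    The interval is a 95% percentile bootstrap interval.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty outcome lists of equal length")
    n = len(correct_a)
    # Pairing: resample per-item differences, not the two models separately.
    per_item = [b - a for a, b in zip(correct_a, correct_b)]
    observed = sum(per_item) / n
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        sample = rng.choices(per_item, k=n)  # resample items with replacement
        diffs.append(sum(sample) / n)
    diffs.sort()
    low = diffs[int(0.025 * n_resamples)]
    high = diffs[int(0.975 * n_resamples) - 1]
    return observed, (low, high)


if __name__ == "__main__":
    # Toy data, invented for illustration: 100 shared test problems.
    sft_only = [1] * 60 + [0] * 40
    sft_plus_rl = [1] * 70 + [0] * 30
    diff, (low, high) = paired_bootstrap_diff(sft_only, sft_plus_rl)
    print(f"accuracy lift = {diff:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

If the interval for the lift sits clearly above zero on both benchmarks, the reinforcement stage is doing measurable work beyond the evolved data; an interval that straddles zero would point toward the data instead.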

Original abstract

Large language models (LLMs), such as GPT-4, have shown remarkable performance in natural language processing (NLP) tasks, including challenging mathematical reasoning. However, most existing open-source models are only pre-trained on large-scale internet data and without math-related optimization. In this paper, we present WizardMath, which enhances the mathematical CoT reasoning abilities of LLMs without using external python tools, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math. Through extensive experiments on two mathematical reasoning benchmarks, namely GSM8k and MATH, we reveal the extraordinary capabilities of our model. Remarkably, WizardMath-Mistral 7B surpasses top-tier open-source LLMs by a substantial margin with higher data efficiency. Furthermore, WizardMath 70B even outperforms GPT-3.5-Turbo, Claude 2, Gemini Pro and GPT-4-early-version. Additionally, our preliminary exploration highlights the pivotal role of instruction evolution and process supervision in achieving exceptional math performance. For more details refer to https://github.com/nlpxucan/WizardLM

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces WizardMath, which applies Reinforcement Learning from Evol-Instruct Feedback (RLEIF) to boost chain-of-thought mathematical reasoning in LLMs without external tools. It reports that WizardMath-Mistral-7B substantially outperforms leading open-source models on GSM8k and MATH, while WizardMath-70B surpasses GPT-3.5-Turbo, Claude 2, Gemini Pro, and an early GPT-4 variant; a preliminary analysis emphasizes the roles of instruction evolution and process supervision.

Significance. If the gains are shown to stem specifically from RLEIF rather than data quality or base-model differences, the work would provide a practical recipe for elevating open-source mathematical reasoning to near-proprietary levels using only evolved instructions and process-level RL, with the preliminary ablation-style exploration of evolution and supervision serving as a useful starting point for follow-on research.

major comments (2)
  1. [Experiments] The manuscript provides no direct SFT-only baseline trained on the identical Evol-Instruct dataset before applying the RL stage. Without this comparison, it remains unclear whether the headline gains on GSM8k and MATH are driven by the RLEIF reinforcement step or simply by the quality of the evolved data.
  2. [Abstract and §4] Benchmark results in the abstract and main results section are presented without error bars, details on prompt formatting, data exclusion criteria, or full training curves. These omissions make it difficult to assess the statistical reliability of the claim that WizardMath-70B outperforms the listed closed models.
minor comments (2)
  1. [Conclusion] The GitHub link is referenced but the paper would benefit from an explicit statement of which artifacts (code, data splits, evaluation prompts) are released.
  2. [Method] Notation for the process-supervision reward model could be introduced earlier and used consistently in the method description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where feasible.

Point-by-point responses
  1. Referee: [Experiments] The manuscript provides no direct SFT-only baseline trained on the identical Evol-Instruct dataset before applying the RL stage. Without this comparison, it remains unclear whether the headline gains on GSM8k and MATH are driven by the RLEIF reinforcement step or simply by the quality of the evolved data.

    Authors: We agree that an explicit SFT-only baseline on the identical dataset would help isolate the contribution of the RLEIF stage. In the revised manuscript we will add results from such a baseline trained on the same Evol-Instruct data, allowing direct comparison of performance before and after the reinforcement learning phase. revision: yes

  2. Referee: [Abstract and §4] Benchmark results in the abstract and main results section are presented without error bars, details on prompt formatting, data exclusion criteria, or full training curves. These omissions make it difficult to assess the statistical reliability of the claim that WizardMath-70B outperforms the listed closed models.

    Authors: We acknowledge the value of these details for assessing reliability. The revised version will include error bars from repeated evaluations where computationally feasible, explicit prompt formatting descriptions, data exclusion criteria, and full training curves in the appendix to support the reported comparisons. revision: yes
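
For readers who want a sense of what an error bar on a single benchmark accuracy looks like, the following is a minimal sketch using the Wilson score interval for a binomial proportion. It is not the authors' evaluation code; the function name and the example counts are hypothetical, and it treats each test problem as an independent pass/fail trial, which is itself an assumption.

```python
import math


def wilson_interval(num_correct, num_total, z=1.96):
    """Wilson score interval for an accuracy of num_correct / num_total.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    if num_total <= 0 or not 0 <= num_correct <= num_total:
        raise ValueError("need 0 <= num_correct <= num_total and num_total > 0")
    p = num_correct / num_total
    z2 = z * z
    denom = 1 + z2 / num_total
    center = (p + z2 / (2 * num_total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / num_total + z2 / (4 * num_total ** 2)) / denom
    )
    # Clamp to [0, 1] to guard against tiny floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical counts, not results from the paper.
    low, high = wilson_interval(80, 100)
    print(f"80/100 correct -> [{low:.3f}, {high:.3f}]")  # roughly [0.711, 0.867]
```

The interval narrows as the test set grows, so a gap of a point or two between models means more on a large benchmark than on a small one.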

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external benchmarks without self-referential derivations

Full rationale

The paper introduces the RLEIF procedure and reports accuracy numbers on GSM8k and MATH, comparing WizardMath variants against GPT-3.5-Turbo, Claude 2, Gemini Pro and early GPT-4. No equations appear that define a quantity in terms of itself, no fitted parameter is relabeled as a prediction, and no uniqueness theorem or ansatz is imported via self-citation to force the central result. The derivation chain consists of standard RL training steps whose outputs are evaluated on independent test sets; therefore the headline performance numbers are not equivalent to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method implicitly assumes that process-level feedback signals are reliable and that benchmark scores reflect genuine reasoning gains.

pith-pipeline@v0.9.0 · 5533 in / 1083 out tokens · 20517 ms · 2026-05-17T03:56:42.931558+00:00 · methodology


Forward citations

Cited by 38 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning First Integrals via Backward-Generated Data and Guided Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    FISolver trains a compact LLM on backward-generated (differential equation, first integral) pairs and uses guided reinforcement learning to outperform larger models and Mathematica on first-integral benchmarks at lower cost.

  2. Beyond Parameter Aggregation: Semantic Consensus for Federated Fine-Tuning of LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    Semantic consensus on model outputs for public prompts enables federated LLM fine-tuning that matches parameter-aggregation baselines with orders-of-magnitude lower communication.

  3. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  4. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  5. Learning from Contrasts: Synthesizing Reasoning Paths from Diverse Search Trajectories

    cs.AI 2026-04 unverdicted novelty 7.0

    CRPS synthesizes reasoning paths by contrasting high- and low-quality MCTS trajectories, enabling models trained on 60K examples to match or exceed those trained on 590K standard examples with better out-of-domain gen...

  6. CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning

    cs.AI 2025-12 unverdicted novelty 7.0

    CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.

  7. Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models

    cs.SE 2025-10 unverdicted novelty 7.0

    LLMs achieve 81% coherent execution simulation on HumanEval but show mostly random or weak consistency across tests, with frontier models relying on natural language shortcuts instead of true program analysis.

  8. SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From

    cs.CR 2025-09 unverdicted novelty 7.0

    SeedPrints fingerprints LLMs using persistent biases from initialization seeds for lineage verification across pretraining and adaptation stages.

  9. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  10. MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

    cs.CV 2024-03 conditional novelty 7.0

    MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.

  11. CodeMind: Evaluating Large Language Models for Code Reasoning

    cs.SE 2024-02 unverdicted novelty 7.0

    CodeMind evaluates ten LLMs on four benchmarks using three new code reasoning tasks, finding performance varies by model size and drops with complexity while showing no correlation with bug repair ability.

  12. Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning

    cs.LG 2026-05 unverdicted novelty 6.0

    DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.

  13. DEL: Digit Entropy Loss for Numerical Learning of Large Language Models

    cs.CL 2026-05 conditional novelty 6.0

    DEL is a new loss for LLM numerical learning that applies supervised digit entropy optimization and extends to floating-point numbers, showing improved accuracy and distance metrics over prior methods on math benchmarks.

  14. Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs

    cs.SE 2026-05 unverdicted novelty 6.0

    FireFly inverts task synthesis by exploring real MCP servers first via pairwise tool graphs and sub-DAG sampling, then generates 5,144 verified tasks backward from outcomes to train a 4B model that matches Claude Sonn...

  15. Distribution Corrected Offline Data Distillation for Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.

  16. CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference

    cs.CV 2026-05 unverdicted novelty 6.0

    CROP uses compositional reasoning and expert preference alignment in VLMs to produce aesthetic crops that match human experts more closely than previous methods.

  17. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 6.0

    RL training compute for logical reasoning follows a power law with horizon depth whose exponent rises with logical expressiveness, yielding better downstream transfer when models train on richer logics.

  18. Segment-Aligned Policy Optimization for Multi-Modal Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.

  19. MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications

    cs.AI 2025-11 unverdicted novelty 6.0

    MM-Telco creates multimodal benchmarks for telecom and demonstrates that fine-tuned LLMs and VLMs achieve significant performance gains on domain-specific tasks.

  20. Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

    cs.CL 2025-08 unverdicted novelty 6.0

    Fin-PRM is a domain-specialized process reward model that supplies binary step-level and trajectory-level supervision signals for financial reasoning in LLMs and outperforms general PRMs on CFLUE and FinQA benchmarks.

  21. League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

    cs.AI 2025-07 unverdicted novelty 6.0

    League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.

  22. HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

    cs.CL 2024-12 unverdicted novelty 6.0

    HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.

  23. Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

    cs.LG 2024-06 conditional novelty 6.0

    Step-DPO performs preference optimization on individual reasoning steps rather than complete answers, producing nearly 3% accuracy gains on MATH for 70B+ parameter models with 10K preference pairs.

  24. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    cs.CL 2024-02 unverdicted novelty 6.0

    DeepSeekMath 7B reaches 51.7% on MATH via continued pretraining on curated web math data and Group Relative Policy Optimization.

  25. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

    cs.AI 2023-12 conditional novelty 6.0

    Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.

  26. Llemma: An Open Language Model For Mathematics

    cs.CL 2023-10 unverdicted novelty 6.0

    Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.

  27. ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving

    cs.CL 2023-09 conditional novelty 6.0

    ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on...

  28. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    cs.CL 2023-09 conditional novelty 6.0

    Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.

  29. MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

    cs.CL 2023-09 conditional novelty 6.0

    MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.

  30. GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    A neuro-symbolic engine generates GeoSym127K, a 127K-question dataset with symbolic ground truths and verified CoT pairs, yielding +22.21% gains on MathVerse Vision-Only after SFT on Qwen3-VL-8B.

  31. ARMove: Learning to Predict Human Mobility through Agentic Reasoning

    cs.MA 2026-04 unverdicted novelty 5.0

    ARMove is a transferable framework for human mobility prediction that combines agentic LLM reasoning, feature management, and large-small model synergy to outperform baselines on several metrics while improving interp...

  32. Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

    cs.CV 2025-02 unverdicted novelty 4.0

    Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.

  33. A Survey on LLM-as-a-Judge

    cs.CL 2024-11 unverdicted novelty 4.0

    A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.

  34. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    cs.CL 2024-01 unverdicted novelty 4.0

    DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

  35. From System 1 to System 2: A Survey of Reasoning Large Language Models

    cs.AI 2025-02 accept novelty 3.0

    The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

  36. Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

    cs.AI 2025-01 unverdicted novelty 3.0

    The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.

  37. A Survey on Knowledge Distillation of Large Language Models

    cs.CL 2024-02 accept novelty 3.0

    A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.

  38. Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

    cs.CL 2025-02 unverdicted novelty 2.0

    Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 36 Pith papers · 1 internal anchor

  1. [1]

    URL https://api.semanticscholar.org/CorpusID:266818336. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Wint...

  2. [2]

    doi: 10.18653/v1/n19-1421

    Association for Computational Linguistics. doi: 10.18653/v1/N19-1421. URL https://aclanthology.org/N19-1421. Published as a conference paper at ICLR 2025. Zhengyang Tang, Xingxing Zhang, Benyou Wang, and Furu Wei. Mathscale: Scaling instruction tuning for mathematical reasoning. arXiv preprint arXiv:2403.02884, 2024. Rohan Taori, Ishaan Gulrajani, Tiany...

  3. [3]

    Instruction Evolution and SFT In the first step, we apply upward and downward instruction evolution on the GSM8k and MATH datasets, generating evolved instructions for the SFT. On the leftmost side of Figure 1, the three blue arrows, from top to bottom, represent: (a) the adoption of the instruction evolution technique, (b) the generation of evolved instr...

  4. [4]

    “A” represents the original instruction, while “B,

    Reward Model Training The second step involves two reward models: the Instruction Quality Scoring Reward Model (IRM) and the Process-Supervised Reward Model (PRM), depicted in the central section of Figure 1. • IRM: We employ upward and downward evolution on a seed instruction, yielding five instructions (original + evolved). These instructions are ranked...

  5. [5]

    As depicted in the far-right section of Figure 1, the process is as follows: (a) The first blue arrow represents instruction scoring by the IRM

    Reinforcement Learning with PPO In the final step, we integrate the IRM and PRM within a Proximal Policy Optimization (PPO)-based reinforcement learning framework. As depicted in the far-right section of Figure 1, the process is as follows: (a) The first blue arrow represents instruction scoring by the IRM. (b) The second blue arrow shows PPO initializati...

  6. [6]

    • On Llama-2-13B and Llama-2-70B, WizardMath-SFT achieves comparable performance to Xwin-Math

    Performance Comparison: • On Llama-2-7B and Mistral-7B-v0.1, WizardMath-SFT performs marginally below SOTA models (i.e., Xwin-Math and Skywork-Math) and outperforms other excellent existing models (i.e., DART-Math). • On Llama-2-13B and Llama-2-70B, WizardMath-SFT achieves comparable performance to Xwin-Math. • On all various base models, WizardMath-SFT s...

  7. [7]

    Meanwhile, WizardMath-SFT demonstrates comparable or superior performance to advanced data synthesis methods, such as DART-Math and MetaMath, across all base models

    Comparison with advanced data synthesis methods (i.e., DART-Math, MetaMath) As shown in the following Table 15, DART-Math demonstrates strong performance across various base models and the data synthesis method proposed by DART-Math shows the effectiveness and outstanding performance. Meanwhile, WizardMath-SFT demonstrates comparable or superior performan...

  8. [8]

    It also significantly enhances the mathematical reasoning capabilities of our models

    The proposed Math Evol Instruct data synthesis method is also as effective and practical as the current state-of-the-art data synthesis methods, such as DART-Math, Skywork-Math and Xwin-Math in the SFT stage. It also significantly enhances the mathematical reasoning capabilities of our models

  9. [9]

    The proposed IRM and PRM models substantially improve performance during the reinforcement learning phase. They not only continuously enhance the mathematical reasoning abilities of our Table 18: The performance comparison of WizardMath-SFT with DART-Math, Xwin-Math, and Skywork-Math on the Llama2-7B base mo...

  10. [10]

    In Table 6, we provide a detailed analysis of the effects of downward evolution

    Unlike WizardLM/WizardCoder, which primarily focus on increasing instruction difficulty, we are the first to propose the novel concept of downward evolution, a major distinction in instruction evolution. In Table 6, we provide a detailed analysis of the effects of downward evolution. Specifically, two rounds of downward evolution led to a remarkable impro...

  11. [11]

    In reinforcement learning (RL) training, we firstly propose the instruction quality scoring reward model (IRM) combined with the process supervision reward model (PRM) further enhancing WizardMath mathematical reasoning ability. As demonstrated in Table 3, our method achieves a remarkable 5%–8% improvement in GSM8k and MATH performance over the SFT backbo...

  12. [12]

    Additionally, the training datasets for SFT, PRM, and IRM are fully synthesized using AI systems

    We firstly propose to use AI to annotate the step-level PRM training data. Additionally, the training datasets for SFT, PRM, and IRM are fully synthesized using AI systems. This fully AI-automated data generation pipeline ensures scalability

  13. [13]

    It surpasses all existing open-source state-of-the-art models, showcasing the effectiveness and robustness of the RLEIF approach proposed in our study

    WizardMath demonstrates outstanding performance across a wide range of model scales, from 100M to 1B and 70B parameters, on benchmarks such as GSM8k, MATH, and out-of-distribution (OOD) tasks like MWPBench (Tang et al., 2024). It surpasses all existing open-source state-of-the-art models, showcasing the effectiveness and robustness of the RLEIF approa...
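
The extracted snippets above describe an instruction-level score from the IRM and step-level scores from the PRM feeding a PPO stage, but the truncated text does not say how the two signals are merged into one scalar. The sketch below shows one plausible aggregation purely for illustration: it assumes all scores lie in [0, 1], takes the weakest step as the solution score, and multiplies it by the instruction score. The function name and the aggregation rule are assumptions, not the paper's stated formula.

```python
from typing import Sequence


def combined_reward(instruction_score: float, step_scores: Sequence[float]) -> float:
    """Hypothetical scalar reward for one (instruction, solution) pair.

    instruction_score: IRM-style quality score for the instruction, in [0, 1].
    step_scores: PRM-style correctness scores, one per reasoning step, in [0, 1].
    ASSUMPTION: solution score = min over steps, final reward = product.
    """
    if not step_scores:
        raise ValueError("a solution needs at least one scored step")
    for score in (instruction_score, *step_scores):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores are assumed to lie in [0, 1]")
    # A single weak step caps the reward for the whole chain of thought.
    return instruction_score * min(step_scores)


if __name__ == "__main__":
    # Invented scores: a good instruction whose solution has one shaky step.
    print(combined_reward(0.5, [1.0, 0.5]))  # 0.25
```

Taking the minimum makes the reward sensitive to any single flawed step, which is one common motivation for process supervision over outcome-only rewards; averaging the step scores would be a softer alternative.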