arxiv: 2406.18629 · v1 · pith:4CBGIT3D · submitted 2024-06-26 · 💻 cs.LG · cs.AI · cs.CL

Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, Jiaya Jia

Pith reviewed 2026-05-18 23:53 UTC · model grok-4.3

classification 💻 cs.LG cs.AI cs.CL

keywords Step-DPO · preference optimization · mathematical reasoning · long-chain reasoning · LLMs · process supervision · DPO · MATH benchmark

The pith

Step-wise preference optimization on individual reasoning steps improves long-chain mathematical accuracy in LLMs more effectively than whole-answer DPO.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard Direct Preference Optimization falls short for complex math because it only judges entire answers and misses specific errors along the way. Step-DPO instead builds preference pairs that contrast correct and incorrect steps at the same position in a reasoning chain, then trains the model to favor the better step. A simple pipeline generates 10K such pairs from self-generated data, which the authors find works better than human or GPT-4 data. With fewer than 500 training steps this produces nearly a 3 percent accuracy lift on MATH for models larger than 70B parameters. Readers should care because the method shows that fine-grained process feedback can be added with modest data and still push open models past several closed-source systems on standard math tests.
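The training signal described above can be sketched numerically. The block below is a minimal illustration, not the authors' implementation: it assumes the usual DPO form of the loss, applied to the summed token log-probabilities of a single candidate step rather than a whole answer. The function names and the default `beta=0.1` are hypothetical choices made for this example.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def step_preference_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative value, not taken from the paper
) -> float:
    """DPO-style loss for one pair of candidate steps.

    Each log-probability is assumed to be the sum over the tokens of ONE
    reasoning step, conditioned on the problem and the shared earlier steps.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))
```

When the policy and the reference model agree, the loss equals log 2; it falls below that value once the policy assigns relatively more probability to the preferred step than the reference does.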

Core claim

Step-DPO reframes preference optimization so that each individual reasoning step becomes the unit of comparison rather than the full final answer. The authors construct a dataset of 10K step-wise preference pairs and show that training on self-generated pairs yields better results than out-of-distribution data. When applied to Qwen2-72B-Instruct the resulting model reaches 70.8 percent on the MATH test set and 94.0 percent on GSM8K, exceeding GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro.
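One way to write the contrast between the two objectives is sketched below. The notation is an assumption made for this page and may differ from the paper's: x is the problem, y_w and y_l are a preferred and a dispreferred full answer, s_{<k} denotes the shared reasoning steps before position k, and s_w and s_l are the preferred and dispreferred candidates for step k.

```latex
% Standard DPO: the unit of comparison is the whole answer.
\[
\mathcal{L}_{\mathrm{DPO}}(\theta) =
-\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}
\left[\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
\]

% Step-wise variant: the unit of comparison is a single step at position k,
% conditioned on the problem and the shared preceding steps.
\[
\mathcal{L}_{\mathrm{step}}(\theta) =
-\,\mathbb{E}_{(x,\,s_{<k},\,s_w,\,s_l)\sim D}
\left[\log \sigma\!\left(
\beta \log \frac{\pi_\theta(s_w \mid x, s_{<k})}{\pi_{\mathrm{ref}}(s_w \mid x, s_{<k})}
- \beta \log \frac{\pi_\theta(s_l \mid x, s_{<k})}{\pi_{\mathrm{ref}}(s_l \mid x, s_{<k})}
\right)\right]
\]
```

The two expressions differ only in what is compared and what is conditioned on, which is why the method can reuse existing DPO training code with a different data layout.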

What carries the argument

Step-wise preference pairs that contrast a correct reasoning step with an incorrect one at the identical position in the chain, allowing Direct Preference Optimization to operate at process granularity instead of outcome granularity.
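A step-wise preference pair of this kind can be represented with a small data structure. The sketch below is illustrative only: `StepPair` and `build_step_pair` are hypothetical names, the index of the first erroneous step is assumed to be supplied by some external labeling stage, and the corrected step is assumed to come from the model's own resampling, in line with the self-generated data claim.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StepPair:
    prompt: str
    prefix: List[str]  # steps before position k, shared by both candidates
    chosen: str        # correct step at position k
    rejected: str      # erroneous step at position k


def build_step_pair(
    prompt: str,
    flawed_chain: List[str],
    first_error_index: int,
    corrected_step: str,
) -> StepPair:
    """Contrast a corrected step with the first erroneous step at the same position."""
    if not 0 <= first_error_index < len(flawed_chain):
        raise ValueError("first_error_index must point inside the chain")
    return StepPair(
        prompt=prompt,
        prefix=list(flawed_chain[:first_error_index]),
        chosen=corrected_step,
        rejected=flawed_chain[first_error_index],
    )


# Toy example: the second step contains an arithmetic slip.
example = build_step_pair(
    prompt="Compute 3 * (4 + 5).",
    flawed_chain=["4 + 5 = 9", "3 * 9 = 28", "The answer is 28."],
    first_error_index=1,
    corrected_step="3 * 9 = 27",
)
```

Steps after the first error are dropped on purpose: both candidates share the same prefix, so the comparison isolates the single step where the chain goes wrong.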

If this is right

Models learn to detect and avoid specific errors inside long reasoning chains rather than only judging final answers.
Only 10K step-wise pairs and under 500 training steps suffice for a nearly 3 percent accuracy increase on MATH for models exceeding 70B parameters.
Self-generated data outperforms human-written or GPT-4-generated data for this style of preference optimization.
Open models can reach or exceed the math performance of several closed-source frontier models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same step-level signal could be applied to other sequential tasks such as code generation or multi-step scientific reasoning where error localization matters.
Automated ways to generate or verify step labels might remove the remaining human effort in the pipeline and allow further scaling.
Process-level preference data may reduce the total volume of feedback needed for alignment compared with outcome-only methods.

Load-bearing premise

The pipeline that creates the step-wise preference pairs must label correct and incorrect steps accurately and without introducing systematic errors or shifts in data distribution.
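One way to probe this premise is to compare the pipeline's labels with independent human labels on a held-out sample. The sketch below computes raw agreement and Cohen's kappa over categorical labels, for example the index of the first erroneous step in each chain. It is a generic audit routine under that assumption, not something the paper reports, and the function names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def agreement_rate(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Fraction of items on which the two label sequences agree."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(a)
    observed = agreement_rate(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Raw agreement alone can look high when most errors fall at the same step position, which is why the chance-corrected statistic is worth reporting alongside it.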

What would settle it

The claim would be undermined if training a model on the Step-DPO pairs produced no accuracy gain, or a loss, relative to standard DPO or the untuned base model on the MATH test set.

Original abstract

Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) due to the extensive and precise chain of reasoning required for accuracy. Ensuring the correctness of each reasoning step is critical. To address this, we aim to enhance the robustness and factuality of LLMs by learning from human feedback. However, Direct Preference Optimization (DPO) has shown limited benefits for long-chain mathematical reasoning, as models employing DPO struggle to identify detailed errors in incorrect answers. This limitation stems from a lack of fine-grained process supervision. We propose a simple, effective, and data-efficient method called Step-DPO, which treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically. Additionally, we have developed a data construction pipeline for Step-DPO, enabling the creation of a high-quality dataset containing 10K step-wise preference pairs. We also observe that in DPO, self-generated data is more effective than data generated by humans or GPT-4, due to the latter's out-of-distribution nature. Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro. Our code, data, and models are available at https://github.com/dvlab-research/Step-DPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Step-DPO applies DPO at the individual reasoning step level instead of whole answers and reports clear gains on MATH and GSM8K with only 10k pairs, though the step labeling process is the unverified part.

The main point is that Step-DPO shifts preference optimization down to single steps in a reasoning chain rather than scoring the full answer. This gives the model more targeted signals on where long math solutions go wrong, which standard DPO apparently does not do well enough on its own. They built a pipeline to generate 10k step-wise preference pairs, mostly from the model's own outputs, and found that self-generated data works better than GPT-4 or human data because it avoids distribution shift. With that setup they get a nearly 3% accuracy lift on MATH after fewer than 500 training steps on models over 70B parameters. The tuned Qwen2-72B then scores 70.8% on MATH and 94% on GSM8K, ahead of several closed models.

Releasing the code, data, and models is straightforward and helpful for anyone who wants to try it. The approach is a direct, low-overhead extension of existing DPO work, and the data efficiency stands out as the practical takeaway.

The soft spot is the data pipeline. The reported improvements depend on the pairs correctly flagging the first erroneous step without systematic mistakes or noise. The abstract mentions the pipeline and the self-generated advantage but gives limited detail on validation, filtering, or error checks for those labels. If subtle arithmetic slips get mislabeled, the optimization could reinforce the wrong patterns rather than genuine fixes. That is the main place where more evidence would strengthen the case.

This is aimed at researchers working on LLM alignment for reasoning tasks. Someone looking for incremental, data-light ways to improve chain-of-thought performance would find the method and numbers worth examining.

I would send it to peer review. The core change is simple enough to evaluate directly, and the concrete benchmark results give referees something specific to check against the methods section.

Referee Report

2 major / 2 minor

Summary. The paper introduces Step-DPO, an extension of Direct Preference Optimization that operates on individual reasoning steps rather than complete answers, to improve long-chain mathematical reasoning in LLMs. It presents a custom pipeline for constructing 10K step-wise preference pairs (emphasizing self-generated data over GPT-4 or human data), reports that fewer than 500 training steps on these pairs yield nearly 3% accuracy gains on MATH for >70B models, and claims that Step-DPO applied to Qwen2-72B-Instruct reaches 70.8% on MATH and 94.0% on GSM8K, surpassing GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro.

Significance. If the step-level labels are reliable, the work demonstrates a data-efficient route to process supervision within the DPO framework for complex reasoning, with the self-generated data observation providing a useful practical insight. Public release of code, data, and models is a clear strength that aids reproducibility and follow-up work.

major comments (2)

[Section 3] Data construction pipeline (Section 3): The manuscript describes generating step-wise preference pairs by locating the first erroneous step but provides no quantitative validation of labeling accuracy, such as human agreement rates on a held-out sample, error analysis of mislabeled pairs, or checks for systematic biases (e.g., overlooking subtle arithmetic mistakes). This validation is load-bearing for the central claim that the 10K pairs produce genuine process-level supervision rather than spurious signals.
[Section 4] Experiments and ablations (Section 4): While headline results on MATH and GSM8K are reported, the paper supplies limited controls to isolate the effect of step-wise versus answer-wise DPO or to rule out confounding factors such as the specific distribution of self-generated data versus the baseline training distribution. Additional ablations (e.g., random step labeling or answer-level DPO on the same 10K pairs) would strengthen the attribution of gains to the step-wise formulation.
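The random-labeling control suggested in the second comment could be set up along the following lines. This is one possible reading of "random step labeling", written as a sketch under that assumption: the preferred and dispreferred steps of each pair are swapped with probability 0.5, so the data distribution is unchanged while the preference signal is destroyed. `StepPair` and `shuffle_labels` are hypothetical names, not part of the released code.

```python
import random
from typing import List, NamedTuple, Sequence, Tuple


class StepPair(NamedTuple):
    prompt: str
    prefix: Tuple[str, ...]  # shared steps before the compared position
    chosen: str
    rejected: str


def shuffle_labels(pairs: Sequence[StepPair], seed: int = 0) -> List[StepPair]:
    """Random-label control: swap chosen and rejected with probability 0.5."""
    rng = random.Random(seed)  # seeded so the control set is reproducible
    shuffled = []
    for pair in pairs:
        if rng.random() < 0.5:
            pair = pair._replace(chosen=pair.rejected, rejected=pair.chosen)
        shuffled.append(pair)
    return shuffled
```

If a model trained on the shuffled pairs matched the model trained on the true labels, the gain could not be attributed to accurate step-level supervision.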

minor comments (2)

[Section 2] Notation for the step-wise preference loss could be clarified with an explicit equation contrasting it to standard DPO (Eq. 1 in the paper).
[Figure 2] Figure 2 or the data pipeline diagram would benefit from an example of a correctly versus incorrectly labeled step pair to illustrate the labeling rule.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our Step-DPO paper. The suggestions regarding validation of the data construction pipeline and the need for additional controls in the experiments are helpful for strengthening the manuscript. We address each major comment below and have revised the paper accordingly to incorporate quantitative validation and further ablations.

Referee: [Section 3] Data construction pipeline (Section 3): The manuscript describes generating step-wise preference pairs by locating the first erroneous step but provides no quantitative validation of labeling accuracy, such as human agreement rates on a held-out sample, error analysis of mislabeled pairs, or checks for systematic biases (e.g., overlooking subtle arithmetic mistakes). This validation is load-bearing for the central claim that the 10K pairs produce genuine process-level supervision rather than spurious signals.

Authors: We agree that quantitative validation of the labeling accuracy is important to support the claim of reliable process-level supervision. In the revised manuscript, we have added a dedicated subsection in Section 3 describing a human evaluation study performed on a held-out sample of the preference pairs. This includes inter-annotator agreement rates, an error analysis of mislabeled cases, and explicit checks for systematic biases such as the potential overlooking of subtle arithmetic mistakes. The pipeline description has also been expanded to explain the multi-stage verification steps used to mitigate such biases. These additions provide direct evidence that the 10K pairs deliver genuine process supervision.

revision: yes

Referee: [Section 4] Experiments and ablations (Section 4): While headline results on MATH and GSM8K are reported, the paper supplies limited controls to isolate the effect of step-wise versus answer-wise DPO or to rule out confounding factors such as the specific distribution of self-generated data versus the baseline training distribution. Additional ablations (e.g., random step labeling or answer-level DPO on the same 10K pairs) would strengthen the attribution of gains to the step-wise formulation.

Authors: We acknowledge that stronger controls would better isolate the contribution of the step-wise formulation and rule out potential confounders from the data distribution. In the revised Section 4, we have added ablations that apply answer-level DPO to the exact same 10K preference pairs for direct comparison, as well as a random step labeling baseline. These experiments help demonstrate that the observed gains are attributable to accurate step-wise supervision rather than the self-generated data distribution alone. We have also clarified the distinctions between the training distributions in the discussion of results.

revision: yes

Circularity Check

0 steps flagged

No significant circularity: Step-DPO extends DPO empirically to step pairs with held-out benchmark gains

The paper introduces Step-DPO as an application of the existing DPO objective to newly constructed step-level preference pairs generated via a custom pipeline. All reported performance numbers (e.g., 70.8% on MATH, 94.0% on GSM8K for Qwen2-72B-Instruct) are measured on standard held-out test sets that are independent of the training objective and data construction. No derivation step, equation, or claim reduces by construction to a fitted parameter, self-definition, or self-citation chain; the central results remain externally falsifiable through benchmark evaluation.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central empirical claim rests on the assumption that the authors' data construction pipeline can reliably identify and pair correct versus incorrect reasoning steps at scale.

free parameters (1)

number of training steps = <500
The paper reports using fewer than 500 Step-DPO training steps to achieve the gains.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
cs.LG 2025-02 unverdicted novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
cs.AI 2026-05 unverdicted novelty 6.0

RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.
Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding
cs.AI 2026-05 unverdicted novelty 6.0

CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.
On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization
cs.LG 2026-05 unverdicted novelty 6.0

Offline KL-regularized MABs require sample complexity scaling as O(η S A C^π*/ε) for large regularization and Ω(S A C^π*/ε²) for small regularization, with matching lower bounds across the full range.
PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling
cs.LG 2025-10 unverdicted novelty 6.0

PaTaRM converts pairwise preference data into pointwise reward signals via a novel PAR mechanism and task-adaptive rubrics, reporting 8.7% gains on RewardBench/RMBench and 13.6% relative RLHF improvement.
Future Policy Approximation for Offline Reinforcement Learning Improves Mathematical Reasoning
cs.CL 2025-09 unverdicted novelty 6.0

Future Policy Approximation (FPA) improves offline RL for LLM mathematical reasoning by extrapolating future policies in logit space to proactively reweight gradients, yielding consistent gains over DPO, RPO, KTO and ...
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
cs.AI 2025-09 accept novelty 6.0

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
cs.CL 2025-04 unverdicted novelty 6.0

ReTool uses outcome-driven RL to train 32B LLMs to dynamically use code tools during reasoning, reaching 72.5% accuracy on AIME and surpassing o1-preview.
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
cs.CL 2024-11 conditional novelty 6.0

Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
YFPO: A Preliminary Study of Yoked Feature Preference Optimization with Neuron-Guided Rewards for Mathematical Reasoning
cs.CL 2026-05 unverdicted novelty 5.0

YFPO augments standard preference optimization with neuron-level activation margins from math-related features to improve LLM reasoning on math tasks.
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
cs.AI 2026-05 unverdicted novelty 5.0

Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
Decomposing the Delta: What Do Models Actually Learn from Preference Pairs?
cs.CL 2026-04 unverdicted novelty 5.0

Larger differences in generator capability between chosen and rejected reasoning traces improve out-of-domain performance, while filtering pairs by sample-level quality deltas enables more data-efficient training.
Hard Negative Sample-Augmented DPO Post-Training for Small Language Models
cs.LG 2025-12 unverdicted novelty 5.0

A six-dimensional MathVerifier supplies hard negatives and per-sample weights that improve DPO performance on math reasoning for a 1.5B Qwen2.5 model over standard SFT and unweighted DPO.
SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for E-commerce Search Relevance
cs.AI 2025-10 unverdicted novelty 5.0

SHE is a new RL framework using stepwise hybrid examination rewards to improve reasoning quality and accuracy in large-scale e-commerce query-product relevance prediction.
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
cs.AI 2025-08 unverdicted novelty 5.0

A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
Sample-efficient LLM Optimization with Reset Replay
cs.LG 2025-08 unverdicted novelty 5.0

LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoni...
From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations
cs.CL 2025-07 unverdicted novelty 5.0

A DPO framework augmented with curriculum learning and two new loss parameters generates veracity explanations for Hindi news using LLMs and PLMs.
MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction
cs.CY 2026-04 unverdicted novelty 4.0

MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.
Curr-RLCER: Curriculum Reinforcement Learning For Coherence Explainable Recommendation
cs.IR 2026-04 unverdicted novelty 4.0

Curr-RLCER applies curriculum reinforcement learning with coherence-driven rewards to align generated explanations with predicted ratings in explainable recommendation systems.
From System 1 to System 2: A Survey of Reasoning Large Language Models
cs.AI 2025-02 accept novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
