RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

Fei Sun; Fengyuan Liu; Mengnan Du; Na Zou; Wei Shi; Yanguang Liu; Yongliang Miao

arxiv: 2606.07006 · v1 · pith:HBBWPG6Dnew · submitted 2026-06-05 · 💻 cs.LG · cs.CL

RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

Yongliang Miao , Fengyuan Liu , Wei Shi , Yanguang Liu , Fei Sun , Na Zou , Mengnan Du This is my paper

Pith reviewed 2026-06-27 22:23 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords supervised fine-tuningreasoningon-policy rolloutspolicy-aware adaptationlarge language modelsmathematical reasoningcode reasoningfine-tuning methods

0 comments

The pith

RASFT improves LLM reasoning by adjusting expert imitation strength per problem using the model's own verified rollouts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard SFT copies one expert trajectory for every problem, which risks overfitting to surface forms and ignoring what the model can already do. RASFT estimates how solvable each problem is for the current policy by generating and verifying its own rollouts. When solvability is low it increases reliance on the expert trajectory; when solvability is high it relaxes imitation and accepts correct self-generated paths. A clipped inverse ratio between a frozen reference model and the current policy limits unwanted drift from useful priors. Across multiple models the resulting policy outperforms both conventional SFT and representative RL methods on six math and two code benchmarks.

Core claim

RASFT is a policy-aware SFT framework that calibrates expert supervision according to problem-level solvability estimated from verified on-policy rollouts. For each problem, RASFT strengthens expert guidance when the current policy struggles, while relaxing rigid imitation and incorporating correct self-generated trajectories when the model already exhibits reliable reasoning behavior. To preserve useful reasoning priors, RASFT further introduces a clipped inverse ratio between the frozen reference model and the current policy to constrain excessive policy drift.

What carries the argument

problem-level solvability estimated from verified on-policy rollouts, which dynamically scales the weight of expert trajectories and decides whether to accept self-generated correct solutions.

If this is right

RASFT produces higher overall accuracy than standard SFT and SFT variants on mathematical and code reasoning tasks.
The method outperforms representative RL baselines while remaining within a supervised fine-tuning regime.
The clipped inverse-ratio term keeps policy updates from erasing useful reasoning behavior learned in pre-training.
Correct trajectories generated by the model itself are retained as training targets when the policy already solves the problem reliably.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same per-problem adaptation logic could be applied to non-reasoning tasks where fixed imitation risks overwriting model capabilities.
Because the method only needs on-policy samples that are already generated during training, it may reduce the volume of external expert data required.
The approach offers a middle ground between pure SFT and full RL that could be combined with existing preference or reward-model techniques.

Load-bearing premise

Problem-level solvability estimated from verified on-policy rollouts provides a reliable, unbiased signal for dynamically calibrating the strength of expert supervision without introducing training instability or selection artifacts.

What would settle it

Training identical models with the same expert data but replacing the rollout-derived solvability signal by a random or constant value and observing no accuracy gain on the same benchmarks would show that the adaptive calibration is not responsible for the reported improvement.

Figures

Figures reproduced from arXiv: 2606.07006 by Fei Sun, Fengyuan Liu, Mengnan Du, Na Zou, Wei Shi, Yanguang Liu, Yongliang Miao.

**Figure 1.** Figure 1: RASFT pipeline. (a) For each prompt, policy model πθ samples multiple rollouts, which are verified and combined with offline expert trajectory. (b) Rollout-based solvability ζi , which adaptively calibrates expert and rollout trajectory weights. (c) RASFT updates the policy model πθ by optimizing candidate trajectories with a compound weight that combines normalized trajectory weights, an inverse policy ra… view at source ↗

**Figure 2.** Figure 2: Comparison between RASFT and representa [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 4.** Figure 4: Sensitivity to the Rollout Number. increasingly rely on reinforcement learning with sampled trajectories and outcome feedback (Cobbe et al., 2021; Yuan et al., 2023; Zelikman et al., 2022; Gulcehre et al., 2023; DeepSeek-AI, 2025). PPO (Schulman et al., 2017) has been widely used as a policy-optimization algorithm, using clipped policy updates to improve training stability. Building on this direction, GRP… view at source ↗

read the original abstract

Supervised fine-tuning (SFT) is a prevailing method for adapting large language models to reasoning tasks by imitating offline expert demonstrations, often treating a single expert trajectory as the target behavior. However, reasoning is not simple path imitation: rigidly following one demonstrated solution may overfit to surface forms and suppress the model's own reasoning distribution. We propose Rollout-Adaptive Supervised Fine-Tuning (RASFT), a policy-aware SFT framework that calibrates expert supervision according to problem-level solvability estimated from verified on-policy rollouts. For each problem, RASFT strengthens expert guidance when the current policy struggles, while relaxing rigid imitation and incorporating correct self-generated trajectories when the model already exhibits reliable reasoning behavior. To preserve useful reasoning priors, RASFT further introduces a clipped inverse ratio between the frozen reference model and the current policy to constrain excessive policy drift. Experiments across multiple models on six mathematical reasoning benchmarks and two code reasoning benchmarks show that RASFT achieves better overall performance than SFT, SFT variants, and representative RL methods. The code is available at https://github.com/zjd1sq/RASFT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RASFT adds per-problem solvability weighting from on-policy rollouts plus a clipped inverse reference ratio to standard SFT, but the abstract supplies no numbers so the gains cannot be assessed yet.

read the letter

RASFT estimates how solvable each problem is for the current policy via verified rollouts, then strengthens expert imitation on hard problems and relaxes it to include correct self-generated trajectories on easier ones. It also clips the inverse probability ratio to a frozen reference model to limit drift. This combination is the main novelty; it tries to make SFT less rigid without switching to full RL.

The approach targets a real weakness in reasoning SFT, where a single fixed trajectory can suppress the model's own distribution. The abstract claims better results than plain SFT, SFT variants, and some RL methods across six math and two code benchmarks on multiple models, which would be useful if the numbers hold.

The soft spot is the complete absence of any quantitative results, baselines, or error bars in the abstract, so there is no way to check whether the reported gains are real or artifacts. The stress-test concern about high-variance solvability estimates on hard problems is plausible and needs checking in the full paper; if the authors used enough rollouts and showed stability, that would address it. The on-policy nature of the estimates could also introduce selection effects that are not obviously controlled.

This is for groups doing post-training on reasoning models who want a lightweight SFT tweak. It deserves peer review so the experiments and implementation can be examined properly.

Referee Report

2 major / 1 minor

Summary. The paper proposes Rollout-Adaptive Supervised Fine-Tuning (RASFT), a policy-aware variant of SFT for reasoning tasks. For each problem, RASFT estimates solvability from verified on-policy rollouts and uses this signal to strengthen expert imitation on hard problems while relaxing imitation and incorporating correct self-generated trajectories on problems the current policy already solves reliably. A clipped inverse-ratio term between the reference model and current policy is added to bound drift. The abstract states that experiments on multiple models across six mathematical reasoning benchmarks and two code reasoning benchmarks show RASFT outperforming SFT, SFT variants, and representative RL methods.

Significance. If the reported gains are robust, RASFT would provide a concrete mechanism for making SFT adaptive to the model's evolving capabilities without full RL, potentially improving sample efficiency on reasoning tasks. The public release of code at https://github.com/zjd1sq/RASFT is a clear strength that supports reproducibility and further investigation.

major comments (2)

[Method description of solvability estimation and adaptive weighting] The central performance claim rests on the reliability of the per-problem solvability signal derived from finite verified on-policy rollouts. On hard problems the success-rate estimator necessarily has high variance; the manuscript provides no analysis, ablation, or stability diagnostics showing that this variance does not produce erratic supervision weights or selection artifacts across training iterations.
[Description of the clipped inverse-ratio term and its integration with rollout-based weighting] The interaction between the adaptive weighting and the clipped inverse-ratio term is presented as stabilizing policy drift, yet no derivation or empirical check demonstrates that the combination prevents the on-policy conditioning from introducing systematic bias in the supervision signal.

minor comments (1)

[Abstract] The abstract asserts superior performance but contains no numerical results, dataset sizes, or statistical details; moving at least the headline numbers into the abstract would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the reliability of the solvability signal and the interaction of the drift constraint. We respond to each major comment below.

read point-by-point responses

Referee: The central performance claim rests on the reliability of the per-problem solvability signal derived from finite verified on-policy rollouts. On hard problems the success-rate estimator necessarily has high variance; the manuscript provides no analysis, ablation, or stability diagnostics showing that this variance does not produce erratic supervision weights or selection artifacts across training iterations.

Authors: We agree that finite rollouts can produce high variance in the solvability estimate on hard problems and that the manuscript lacks explicit stability diagnostics. To address this directly, we will add an ablation varying the number of rollouts (4 vs. 8) and report weight variance across iterations in the revised version. revision: yes
Referee: The interaction between the adaptive weighting and the clipped inverse-ratio term is presented as stabilizing policy drift, yet no derivation or empirical check demonstrates that the combination prevents the on-policy conditioning from introducing systematic bias in the supervision signal.

Authors: The clipped inverse-ratio term follows standard importance-sampling bounds to limit drift from the reference policy. While we provide no formal derivation of the joint effect, the reported results show consistent gains without degradation indicative of bias. We will add an empirical policy-drift analysis (KL and success-rate trends) with and without the term in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity: RASFT performance claims rest on external benchmarks, not definitional reduction

full rationale

The paper defines RASFT via on-policy rollout solvability estimates and a clipped inverse-ratio term to a frozen reference model, then reports empirical gains on six math and two code benchmarks against SFT and RL baselines. No equations reduce the reported performance to a fitted quantity or self-generated signal by construction; the evaluation uses held-out benchmarks independent of the training signal. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation. This is the normal non-circular case for an empirical method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5739 in / 1131 out tokens · 27029 ms · 2026-06-27T22:23:38.527032+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 42 canonical work pages · 35 internal anchors

[1]

International Conference on Learning Representations , year =

On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification , author =. International Conference on Learning Representations , year =. doi:10.48550/arXiv.2508.05629 , url =. 2508.05629 , archivePrefix =

work page doi:10.48550/arxiv.2508.05629
[2]

International Conference on Learning Representations , year =

Anchored Supervised Fine-Tuning , author =. International Conference on Learning Representations , year =. doi:10.48550/arXiv.2509.23753 , url =. 2509.23753 , archivePrefix =

work page doi:10.48550/arxiv.2509.23753
[3]

ProFit: Leveraging High-Value Signals in SFT via Probability-Guided Token Selection

ProFit: Leveraging High-Value Signals in SFT via Probability-Guided Token Selection , author =. arXiv preprint arXiv:2601.09195 , year =. doi:10.48550/arXiv.2601.09195 , url =. 2601.09195 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.09195
[4]

2026 , eprint =

Wang, Xiaoxuan and Zhang, Han and Wang, Haixin and Shi, Yidan and Li, Ruoyan and Han, Kaiqiao and Tong, Chenyi and Deng, Haoran and Sun, Renliang and Taylor, Alexander and Zhu, Yanqiao and Cong, Jason and Sun, Yizhou and Wang, Wei , journal =. 2026 , eprint =. doi:10.48550/arXiv.2602.21534 , url =

work page doi:10.48550/arxiv.2602.21534 2026
[5]

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

arXiv preprint arXiv:2512.02556 , year =. doi:10.48550/arXiv.2512.02556 , url =. 2512.02556 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2512.02556
[6]

arXiv preprint arXiv:2503.02951 , year =

KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding , author =. arXiv preprint arXiv:2503.02951 , year =. doi:10.48550/arXiv.2503.02951 , url =. 2503.02951 , archivePrefix =

work page doi:10.48550/arxiv.2503.02951
[7]

Li, Jia and Beeching, Edward and Tunstall, Lewis and Lipkin, Ben and Soletskyi, Roman and Huang, Shengyi Costa and Rasul, Kashif and Yu, Longhui and Jiang, Albert and Shen, Ziju and Qin, Zihan and Dong, Bin and Zhou, Li and Fleureau, Yann and Lample, Guillaume and Polu, Stanislas , year =
[8]

Let's Verify Step by Step

Let's Verify Step by Step , author =. arXiv preprint arXiv:2305.20050 , year =. doi:10.48550/arXiv.2305.20050 , url =. 2305.20050 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.20050
[9]

Solving Quantitative Reasoning Problems with Language Models

Solving Quantitative Reasoning Problems with Language Models , author =. arXiv preprint arXiv:2206.14858 , year =. doi:10.48550/arXiv.2206.14858 , url =. 2206.14858 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2206.14858
[10]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems , author =. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , year =. doi:10.48550/arXiv.2402.14008 , url =. 2402.14008 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.14008
[11]

2024 , howpublished =

2024
[12]

2025 , howpublished =

2025
[13]

2023 , howpublished =

2023
[14]

Program Synthesis with Large Language Models

Program Synthesis with Large Language Models , author =. arXiv preprint arXiv:2108.07732 , year =. doi:10.48550/arXiv.2108.07732 , url =. 2108.07732 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2108.07732
[15]

Evaluating Large Language Models Trained on Code

Evaluating Large Language Models Trained on Code , author =. arXiv preprint arXiv:2107.03374 , year =. doi:10.48550/arXiv.2107.03374 , url =. 2107.03374 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2107.03374
[16]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author =. Advances in Neural Information Processing Systems , year =. doi:10.48550/arXiv.2201.11903 , url =. 2201.11903 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2201.11903
[17]

WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct , author =. International Conference on Learning Representations , year =. doi:10.48550/arXiv.2308.09583 , url =. 2308.09583 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2308.09583
[18]

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models , author =. International Conference on Learning Representations , year =. doi:10.48550/arXiv.2309.12284 , url =. 2309.12284 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.12284
[19]

InfoSFT: Learn More and Forget Less with Information-Aware Token Weighting

InfoSFT: Learn More and Forget Less with Information-Aware Token Weighting , author =. arXiv preprint arXiv:2605.14967 , year =. doi:10.48550/arXiv.2605.14967 , url =. 2605.14967 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.14967
[20]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training , author =. International Conference on Machine Learning , year =. doi:10.48550/arXiv.2501.17161 , url =. 2501.17161 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.17161
[21]

arXiv preprint arXiv:1707.06347 , year=

Proximal Policy Optimization Algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

Pith/arXiv arXiv
[22]

Proximal Supervised Fine-Tuning

Proximal Supervised Fine-Tuning , author =. arXiv preprint arXiv:2508.17784 , year =. doi:10.48550/arXiv.2508.17784 , url =. 2508.17784 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2508.17784
[23]

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning , author =. arXiv preprint arXiv:2602.01058 , year =. doi:10.48550/arXiv.2602.01058 , url =. 2602.01058 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.01058
[24]

DeepSeek-V3 Technical Report

arXiv preprint arXiv:2412.19437 , year =. doi:10.48550/arXiv.2412.19437 , url =. 2412.19437 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.19437
[25]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

arXiv preprint arXiv:2501.12948 , year =. doi:10.48550/arXiv.2501.12948 , url =. 2501.12948 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12948
[26]

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

Yang, An and Zhang, Beichen and Hui, Binyuan and Gao, Bofei and Yu, Bowen and Li, Chengpeng and Liu, Dayiheng and Tu, Jianhong and Zhou, Jingren and Lin, Junyang and Lu, Keming and Xue, Mingfeng and Lin, Runji and Liu, Tianyu and Ren, Xingzhang and Zhang, Zhenru , journal =. 2024 , eprint =. doi:10.48550/arXiv.2409.12122 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2409.12122 2024
[27]

Solving math word problems with process- and outcome-based feedback

Solving Math Word Problems with Process- and Outcome-Based Feedback , author =. arXiv preprint arXiv:2211.14275 , year =. doi:10.48550/arXiv.2211.14275 , url =. 2211.14275 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2211.14275
[28]

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations , author =. arXiv preprint arXiv:2312.08935 , year =. doi:10.48550/arXiv.2312.08935 , url =. 2312.08935 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.08935
[29]

arXiv preprint arXiv:2408.06195 , year =

Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers , author =. arXiv preprint arXiv:2408.06195 , year =. doi:10.48550/arXiv.2408.06195 , url =. 2408.06195 , archivePrefix =

work page doi:10.48550/arxiv.2408.06195
[30]

Reinforced Self-Training (ReST) for Language Modeling

Gulcehre, Caglar and Paine, Tom Le and Srinivasan, Srivatsan and Konyushkova, Ksenia and Weerts, Lotte and Sharma, Abhishek and Siddhant, Aditya and Ahern, Alex and Wang, Miaosen and Gu, Chenjie and Macherey, Wolfgang and Doucet, Arnaud and Firat, Orhan and de Freitas, Nando , journal =. Reinforced Self-Training. 2023 , eprint =. doi:10.48550/arXiv.2308.0...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2308.08998 2023
[31]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author =. arXiv preprint arXiv:2402.03300 , year =. doi:10.48550/arXiv.2402.03300 , url =. 2402.03300 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.03300
[32]

Learning to Reason under Off-Policy Guidance

Learning to Reason under Off-Policy Guidance , author =. arXiv preprint arXiv:2504.14945 , year =. doi:10.48550/arXiv.2504.14945 , url =. 2504.14945 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.14945
[33]

Qwen2.5 Technical Report

Qwen2.5 Technical Report , author =. arXiv preprint arXiv:2412.15115 , year =. doi:10.48550/arXiv.2412.15115 , url =. 2412.15115 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115
[34]

2024 , month = sep, day =

Llama 3.2: Revolutionizing Edge AI and Vision with Open, Customizable Models , author =. 2024 , month = sep, day =

2024
[35]

Training Verifiers to Solve Math Word Problems

Training Verifiers to Solve Math Word Problems , author =. arXiv preprint arXiv:2110.14168 , year =. doi:10.48550/arXiv.2110.14168 , url =. 2110.14168 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2110.14168
[36]

Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

Scaling Relationship on Learning Mathematical Reasoning with Large Language Models , author =. arXiv preprint arXiv:2308.01825 , year =. doi:10.48550/arXiv.2308.01825 , url =. 2308.01825 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2308.01825
[37]

, booktitle =

Zelikman, Eric and Wu, Yuhuai and Mu, Jesse and Goodman, Noah D. , booktitle =. 2022 , eprint =. doi:10.48550/arXiv.2203.14465 , url =

work page doi:10.48550/arxiv.2203.14465 2022
[38]

Journal of Machine Learning Research , volume =

Scaling Instruction-Finetuned Language Models , author =. Journal of Machine Learning Research , volume =. 2024 , eprint =

2024
[39]

Orca: Progressive Learning from Complex Explanation Traces of GPT-4

Mukherjee, Subhabrata and Mitra, Arindam and Jawahar, Ganesh and Agarwal, Sahaj and Palangi, Hamid and Awadallah, Ahmed , journal =. Orca: Progressive Learning from Complex Explanation Traces of. 2023 , eprint =. doi:10.48550/arXiv.2306.02707 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2306.02707 2023
[40]

MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

Yue, Xiang and Qu, Xingwei and Zhang, Ge and Fu, Yao and Huang, Wenhao and Sun, Huan and Su, Yu and Chen, Wenhu , journal =. 2023 , eprint =. doi:10.48550/arXiv.2309.05653 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.05653 2023
[41]

Training language models to follow instructions with human feedback

Training Language Models to Follow Instructions with Human Feedback , author =. Advances in Neural Information Processing Systems , year =. doi:10.48550/arXiv.2203.02155 , url =. 2203.02155 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2203.02155
[42]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics , pages =

Self-Instruct: Aligning Language Models with Self-Generated Instructions , author =. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics , pages =. 2023 , doi =

2023
[43]

LIMA: Less Is More for Alignment

Zhou, Chunting and Liu, Pengfei and Xu, Puxin and Iyer, Srini and Sun, Jiao and Mao, Yuning and Ma, Xuezhe and Efrat, Avia and Yu, Ping and Yu, Lili and Zhang, Susan and Ghosh, Gargi and Lewis, Mike and Zettlemoyer, Luke and Levy, Omer , booktitle =. 2023 , eprint =. doi:10.48550/arXiv.2305.11206 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.11206 2023
[44]

Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob , booktitle =. Measuring Mathematical Problem Solving With the. 2021 , eprint =. doi:10.48550/arXiv.2103.03874 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2103.03874 2021
[45]

Learning to summarize from human feedback

Learning to Summarize from Human Feedback , author =. Advances in Neural Information Processing Systems , year =. doi:10.48550/arXiv.2009.01325 , url =. 2009.01325 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2009.01325 2009
[46]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback , author =. arXiv preprint arXiv:2204.05862 , year =. doi:10.48550/arXiv.2204.05862 , url =. 2204.05862 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2204.05862
[47]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

Large Language Models are not Fair Evaluators , author =. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =. 2024 , doi =

2024
[48]

International Conference on Learning Representations , year =

Evaluating Large Language Models at Evaluating Instruction Following , author =. International Conference on Learning Representations , year =. doi:10.48550/arXiv.2310.07641 , url =. 2310.07641 , archivePrefix =

work page doi:10.48550/arxiv.2310.07641
[49]

Measuring Coding Challenge Competence With APPS

Hendrycks, Dan and Basart, Steven and Kadavath, Saurav and Mazeika, Mantas and Arora, Akul and Guo, Ethan and Burns, Collin and Puranik, Samir and He, Horace and Song, Dawn and Steinhardt, Jacob , booktitle =. Measuring Coding Challenge Competence With. 2021 , eprint =. doi:10.48550/arXiv.2105.09938 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2105.09938 2021
[50]

and Robson, Esme and Kohli, Pushmeet and de Freitas, Nando and Kavukcuoglu, Koray and Vinyals, Oriol , journal =

Li, Yujia and Choi, David and Chung, Junyoung and Kushman, Nate and Schrittwieser, Julian and Leblond, Rémi and Eccles, Tom and Keeling, James and Gimeno, Felix and Dal Lago, Agustin and Hubert, Thomas and Choy, Peter and de Masson d'Autume, Cyprien and Babuschkin, Igor and Chen, Xinyun and Huang, Po-Sen and Welbl, Johannes and Gowal, Sven and Cherepanov,...

2022
[51]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion , booktitle =. Judging. 2023 , eprint =. doi:10.48550/arXiv.2306.05685 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2306.05685 2023
[52]

2023 , doi =

Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang , booktitle =. 2023 , doi =

2023
[53]

Qwen2.5-Coder Technical Report

Hui, Binyuan and Yang, Jian and Cui, Zeyu and Yang, Jiaxi and Liu, Dayiheng and Zhang, Lei and Liu, Tianyu and Zhang, Jiajun and Yu, Bowen and Lu, Keming and Dang, Kai and Fan, Yang and Zhang, Yichang and Yang, An and Men, Rui and Huang, Fei and Zheng, Bo and Miao, Yibo and Quan, Shanghaoran and Feng, Yunlong and Ren, Xingzhang and Ren, Xuancheng and Zhou...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2409.12186 2024

[1] [1]

International Conference on Learning Representations , year =

On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification , author =. International Conference on Learning Representations , year =. doi:10.48550/arXiv.2508.05629 , url =. 2508.05629 , archivePrefix =

work page doi:10.48550/arxiv.2508.05629

[2] [2]

International Conference on Learning Representations , year =

Anchored Supervised Fine-Tuning , author =. International Conference on Learning Representations , year =. doi:10.48550/arXiv.2509.23753 , url =. 2509.23753 , archivePrefix =

work page doi:10.48550/arxiv.2509.23753

[3] [3]

ProFit: Leveraging High-Value Signals in SFT via Probability-Guided Token Selection

ProFit: Leveraging High-Value Signals in SFT via Probability-Guided Token Selection , author =. arXiv preprint arXiv:2601.09195 , year =. doi:10.48550/arXiv.2601.09195 , url =. 2601.09195 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.09195

[4] [4]

2026 , eprint =

Wang, Xiaoxuan and Zhang, Han and Wang, Haixin and Shi, Yidan and Li, Ruoyan and Han, Kaiqiao and Tong, Chenyi and Deng, Haoran and Sun, Renliang and Taylor, Alexander and Zhu, Yanqiao and Cong, Jason and Sun, Yizhou and Wang, Wei , journal =. 2026 , eprint =. doi:10.48550/arXiv.2602.21534 , url =

work page doi:10.48550/arxiv.2602.21534 2026

[5] [5]

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

arXiv preprint arXiv:2512.02556 , year =. doi:10.48550/arXiv.2512.02556 , url =. 2512.02556 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2512.02556

[6] [6]

arXiv preprint arXiv:2503.02951 , year =

KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding , author =. arXiv preprint arXiv:2503.02951 , year =. doi:10.48550/arXiv.2503.02951 , url =. 2503.02951 , archivePrefix =

work page doi:10.48550/arxiv.2503.02951

[7] [7]

Li, Jia and Beeching, Edward and Tunstall, Lewis and Lipkin, Ben and Soletskyi, Roman and Huang, Shengyi Costa and Rasul, Kashif and Yu, Longhui and Jiang, Albert and Shen, Ziju and Qin, Zihan and Dong, Bin and Zhou, Li and Fleureau, Yann and Lample, Guillaume and Polu, Stanislas , year =

[8] [8]

Let's Verify Step by Step

Let's Verify Step by Step , author =. arXiv preprint arXiv:2305.20050 , year =. doi:10.48550/arXiv.2305.20050 , url =. 2305.20050 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.20050

[9] [9]

Solving Quantitative Reasoning Problems with Language Models

Solving Quantitative Reasoning Problems with Language Models , author =. arXiv preprint arXiv:2206.14858 , year =. doi:10.48550/arXiv.2206.14858 , url =. 2206.14858 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2206.14858

[10] [10]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems , author =. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , year =. doi:10.48550/arXiv.2402.14008 , url =. 2402.14008 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.14008

[11] [11]

2024 , howpublished =

2024

[12] [12]

2025 , howpublished =

2025

[13] [13]

2023 , howpublished =

2023

[14] [14]

Program Synthesis with Large Language Models

Program Synthesis with Large Language Models , author =. arXiv preprint arXiv:2108.07732 , year =. doi:10.48550/arXiv.2108.07732 , url =. 2108.07732 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2108.07732

[15] [15]

Evaluating Large Language Models Trained on Code

Evaluating Large Language Models Trained on Code , author =. arXiv preprint arXiv:2107.03374 , year =. doi:10.48550/arXiv.2107.03374 , url =. 2107.03374 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2107.03374

[16] [16]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author =. Advances in Neural Information Processing Systems , year =. doi:10.48550/arXiv.2201.11903 , url =. 2201.11903 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2201.11903

[17] [17]

WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct , author =. International Conference on Learning Representations , year =. doi:10.48550/arXiv.2308.09583 , url =. 2308.09583 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2308.09583

[18] [18]

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models , author =. International Conference on Learning Representations , year =. doi:10.48550/arXiv.2309.12284 , url =. 2309.12284 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.12284

[19] [19]

InfoSFT: Learn More and Forget Less with Information-Aware Token Weighting

InfoSFT: Learn More and Forget Less with Information-Aware Token Weighting , author =. arXiv preprint arXiv:2605.14967 , year =. doi:10.48550/arXiv.2605.14967 , url =. 2605.14967 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.14967

[20] [20]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training , author =. International Conference on Machine Learning , year =. doi:10.48550/arXiv.2501.17161 , url =. 2501.17161 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.17161

[21] [21]

arXiv preprint arXiv:1707.06347 , year=

Proximal Policy Optimization Algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

Pith/arXiv arXiv

[22] [22]

Proximal Supervised Fine-Tuning

Proximal Supervised Fine-Tuning , author =. arXiv preprint arXiv:2508.17784 , year =. doi:10.48550/arXiv.2508.17784 , url =. 2508.17784 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2508.17784

[23] [23]

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning , author =. arXiv preprint arXiv:2602.01058 , year =. doi:10.48550/arXiv.2602.01058 , url =. 2602.01058 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.01058

[24] [24]

DeepSeek-V3 Technical Report

arXiv preprint arXiv:2412.19437 , year =. doi:10.48550/arXiv.2412.19437 , url =. 2412.19437 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.19437

[25] [25]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

arXiv preprint arXiv:2501.12948 , year =. doi:10.48550/arXiv.2501.12948 , url =. 2501.12948 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12948

[26] [26]

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

Yang, An and Zhang, Beichen and Hui, Binyuan and Gao, Bofei and Yu, Bowen and Li, Chengpeng and Liu, Dayiheng and Tu, Jianhong and Zhou, Jingren and Lin, Junyang and Lu, Keming and Xue, Mingfeng and Lin, Runji and Liu, Tianyu and Ren, Xingzhang and Zhang, Zhenru , journal =. 2024 , eprint =. doi:10.48550/arXiv.2409.12122 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2409.12122 2024

[27] [27]

Solving math word problems with process- and outcome-based feedback

Solving Math Word Problems with Process- and Outcome-Based Feedback , author =. arXiv preprint arXiv:2211.14275 , year =. doi:10.48550/arXiv.2211.14275 , url =. 2211.14275 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2211.14275

[28] [28]

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations , author =. arXiv preprint arXiv:2312.08935 , year =. doi:10.48550/arXiv.2312.08935 , url =. 2312.08935 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.08935

[29] [29]

arXiv preprint arXiv:2408.06195 , year =

Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers , author =. arXiv preprint arXiv:2408.06195 , year =. doi:10.48550/arXiv.2408.06195 , url =. 2408.06195 , archivePrefix =

work page doi:10.48550/arxiv.2408.06195

[30] [30]

Reinforced Self-Training (ReST) for Language Modeling

Gulcehre, Caglar and Paine, Tom Le and Srinivasan, Srivatsan and Konyushkova, Ksenia and Weerts, Lotte and Sharma, Abhishek and Siddhant, Aditya and Ahern, Alex and Wang, Miaosen and Gu, Chenjie and Macherey, Wolfgang and Doucet, Arnaud and Firat, Orhan and de Freitas, Nando , journal =. Reinforced Self-Training. 2023 , eprint =. doi:10.48550/arXiv.2308.0...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2308.08998 2023

[31] [31]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author =. arXiv preprint arXiv:2402.03300 , year =. doi:10.48550/arXiv.2402.03300 , url =. 2402.03300 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.03300

[32] [32]

Learning to Reason under Off-Policy Guidance

Learning to Reason under Off-Policy Guidance , author =. arXiv preprint arXiv:2504.14945 , year =. doi:10.48550/arXiv.2504.14945 , url =. 2504.14945 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.14945

[33] [33]

Qwen2.5 Technical Report

Qwen2.5 Technical Report , author =. arXiv preprint arXiv:2412.15115 , year =. doi:10.48550/arXiv.2412.15115 , url =. 2412.15115 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115

[34] [34]

2024 , month = sep, day =

Llama 3.2: Revolutionizing Edge AI and Vision with Open, Customizable Models , author =. 2024 , month = sep, day =

2024

[35] [35]

Training Verifiers to Solve Math Word Problems

Training Verifiers to Solve Math Word Problems , author =. arXiv preprint arXiv:2110.14168 , year =. doi:10.48550/arXiv.2110.14168 , url =. 2110.14168 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2110.14168

[36] [36]

Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

Scaling Relationship on Learning Mathematical Reasoning with Large Language Models , author =. arXiv preprint arXiv:2308.01825 , year =. doi:10.48550/arXiv.2308.01825 , url =. 2308.01825 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2308.01825

[37] [37]

, booktitle =

Zelikman, Eric and Wu, Yuhuai and Mu, Jesse and Goodman, Noah D. , booktitle =. 2022 , eprint =. doi:10.48550/arXiv.2203.14465 , url =

work page doi:10.48550/arxiv.2203.14465 2022

[38] [38]

Journal of Machine Learning Research , volume =

Scaling Instruction-Finetuned Language Models , author =. Journal of Machine Learning Research , volume =. 2024 , eprint =

2024

[39] [39]

Orca: Progressive Learning from Complex Explanation Traces of GPT-4

Mukherjee, Subhabrata and Mitra, Arindam and Jawahar, Ganesh and Agarwal, Sahaj and Palangi, Hamid and Awadallah, Ahmed , journal =. Orca: Progressive Learning from Complex Explanation Traces of. 2023 , eprint =. doi:10.48550/arXiv.2306.02707 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2306.02707 2023

[40] [40]

MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

Yue, Xiang and Qu, Xingwei and Zhang, Ge and Fu, Yao and Huang, Wenhao and Sun, Huan and Su, Yu and Chen, Wenhu , journal =. 2023 , eprint =. doi:10.48550/arXiv.2309.05653 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.05653 2023

[41] [41]

Training language models to follow instructions with human feedback

Training Language Models to Follow Instructions with Human Feedback , author =. Advances in Neural Information Processing Systems , year =. doi:10.48550/arXiv.2203.02155 , url =. 2203.02155 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2203.02155

[42] [42]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics , pages =

Self-Instruct: Aligning Language Models with Self-Generated Instructions , author =. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics , pages =. 2023 , doi =

2023

[43] [43]

LIMA: Less Is More for Alignment

Zhou, Chunting and Liu, Pengfei and Xu, Puxin and Iyer, Srini and Sun, Jiao and Mao, Yuning and Ma, Xuezhe and Efrat, Avia and Yu, Ping and Yu, Lili and Zhang, Susan and Ghosh, Gargi and Lewis, Mike and Zettlemoyer, Luke and Levy, Omer , booktitle =. 2023 , eprint =. doi:10.48550/arXiv.2305.11206 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.11206 2023

[44] [44]

Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob , booktitle =. Measuring Mathematical Problem Solving With the. 2021 , eprint =. doi:10.48550/arXiv.2103.03874 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2103.03874 2021

[45] [45]

Learning to summarize from human feedback

Learning to Summarize from Human Feedback , author =. Advances in Neural Information Processing Systems , year =. doi:10.48550/arXiv.2009.01325 , url =. 2009.01325 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2009.01325 2009

[46] [46]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback , author =. arXiv preprint arXiv:2204.05862 , year =. doi:10.48550/arXiv.2204.05862 , url =. 2204.05862 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2204.05862

[47] [47]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

Large Language Models are not Fair Evaluators , author =. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =. 2024 , doi =

2024

[48] [48]

International Conference on Learning Representations , year =

Evaluating Large Language Models at Evaluating Instruction Following , author =. International Conference on Learning Representations , year =. doi:10.48550/arXiv.2310.07641 , url =. 2310.07641 , archivePrefix =

work page doi:10.48550/arxiv.2310.07641

[49] [49]

Measuring Coding Challenge Competence With APPS

Hendrycks, Dan and Basart, Steven and Kadavath, Saurav and Mazeika, Mantas and Arora, Akul and Guo, Ethan and Burns, Collin and Puranik, Samir and He, Horace and Song, Dawn and Steinhardt, Jacob , booktitle =. Measuring Coding Challenge Competence With. 2021 , eprint =. doi:10.48550/arXiv.2105.09938 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2105.09938 2021

[50] [50]

and Robson, Esme and Kohli, Pushmeet and de Freitas, Nando and Kavukcuoglu, Koray and Vinyals, Oriol , journal =

Li, Yujia and Choi, David and Chung, Junyoung and Kushman, Nate and Schrittwieser, Julian and Leblond, Rémi and Eccles, Tom and Keeling, James and Gimeno, Felix and Dal Lago, Agustin and Hubert, Thomas and Choy, Peter and de Masson d'Autume, Cyprien and Babuschkin, Igor and Chen, Xinyun and Huang, Po-Sen and Welbl, Johannes and Gowal, Sven and Cherepanov,...

2022

[51] [51]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion , booktitle =. Judging. 2023 , eprint =. doi:10.48550/arXiv.2306.05685 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2306.05685 2023

[52] [52]

2023 , doi =

Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang , booktitle =. 2023 , doi =

2023

[53] [53]

Qwen2.5-Coder Technical Report

Hui, Binyuan and Yang, Jian and Cui, Zeyu and Yang, Jiaxi and Liu, Dayiheng and Zhang, Lei and Liu, Tianyu and Zhang, Jiajun and Yu, Bowen and Lu, Keming and Dang, Kai and Fan, Yang and Zhang, Yichang and Yang, An and Men, Rui and Huang, Fei and Zheng, Bo and Miao, Yibo and Quan, Shanghaoran and Feng, Yunlong and Ren, Xingzhang and Ren, Xuancheng and Zhou...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2409.12186 2024