
arxiv: 2606.07006 · v1 · pith:HBBWPG6Dnew · submitted 2026-06-05 · 💻 cs.LG · cs.CL

RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

Pith reviewed 2026-06-27 22:23 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords: supervised fine-tuning · reasoning · on-policy rollouts · policy-aware adaptation · large language models · mathematical reasoning · code reasoning · fine-tuning methods

The pith

RASFT improves LLM reasoning by adjusting expert imitation strength per problem using the model's own verified rollouts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard SFT copies one expert trajectory for every problem, which risks overfitting to surface forms and ignoring what the model can already do. RASFT estimates how solvable each problem is for the current policy by generating and verifying its own rollouts. When solvability is low, it increases reliance on the expert trajectory; when solvability is high, it relaxes imitation and accepts correct self-generated paths. A clipped inverse ratio between a frozen reference model and the current policy limits unwanted drift from useful priors. Across multiple models, the paper reports better overall performance than conventional SFT, SFT variants, and representative RL methods on six math and two code benchmarks.

Core claim

RASFT is a policy-aware SFT framework that calibrates expert supervision according to problem-level solvability estimated from verified on-policy rollouts. For each problem, RASFT strengthens expert guidance when the current policy struggles, while relaxing rigid imitation and incorporating correct self-generated trajectories when the model already exhibits reliable reasoning behavior. To preserve useful reasoning priors, RASFT further introduces a clipped inverse ratio between the frozen reference model and the current policy to constrain excessive policy drift.
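As an illustration of the drift constraint, the sketch below computes a clipped inverse ratio for a single trajectory. It assumes the ratio is taken at the sequence level as π_ref(y|x) / π_θ(y|x) and clipped to a fixed interval. The function name, the sequence-level granularity, and the bounds `lo` and `hi` are hypothetical and are not taken from the paper.

```python
import math


def clipped_inverse_ratio(logp_ref, logp_cur, lo=0.5, hi=2.0):
    """Return pi_ref(y|x) / pi_theta(y|x) for one trajectory, clipped to [lo, hi].

    logp_ref and logp_cur are summed token log-probabilities of the same
    trajectory under the frozen reference model and the current policy.
    The bounds lo and hi are illustrative assumptions, not the paper's values.
    """
    if not 0.0 < lo <= hi:
        raise ValueError("bounds must satisfy 0 < lo <= hi")
    log_ratio = logp_ref - logp_cur
    # Clamp in log space so that exp() cannot overflow on long trajectories.
    log_ratio = max(math.log(lo), min(math.log(hi), log_ratio))
    return math.exp(log_ratio)
```

Under this reading, a trajectory that the current policy has made much more likely than the reference receives a weight below 1, which damps further movement away from the reference.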

What carries the argument

Problem-level solvability, estimated from verified on-policy rollouts, dynamically scales the weight of expert trajectories and decides whether to accept self-generated correct solutions.
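A minimal sketch of how such a signal could drive the weighting, assuming solvability is the fraction of rollouts that pass the verifier. The weighting rule here (the expert trajectory gets 1 - ζ, and correct rollouts share ζ equally) is a hypothetical stand-in chosen for clarity; the paper's actual calibration function is not given in the material above.

```python
def solvability(verified):
    """Fraction of rollouts that passed the verifier (0.0 if there are none)."""
    return sum(verified) / len(verified) if verified else 0.0


def trajectory_weights(verified):
    """Return (expert_weight, rollout_weights), which together sum to 1.

    verified is a list of booleans, one per on-policy rollout.
    Hypothetical rule: the expert trajectory gets 1 - zeta, and the
    correct rollouts split zeta equally. Incorrect rollouts get 0.
    """
    zeta = solvability(verified)
    n_correct = sum(verified)
    expert_weight = 1.0 - zeta
    # The division only runs for correct rollouts, so n_correct > 0 there.
    rollout_weights = [zeta / n_correct if ok else 0.0 for ok in verified]
    return expert_weight, rollout_weights
```

With one correct rollout out of four, this rule puts 0.75 on the expert trajectory and 0.25 on the single correct rollout. With no correct rollouts, it falls back to pure expert imitation.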

If this is right

  • RASFT produces higher overall accuracy than standard SFT and SFT variants on mathematical and code reasoning tasks.
  • The method outperforms representative RL baselines while remaining within a supervised fine-tuning regime.
  • The clipped inverse-ratio term keeps policy updates from erasing useful reasoning behavior learned in pre-training.
  • Correct trajectories generated by the model itself are retained as training targets when the policy already solves the problem reliably.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same per-problem adaptation logic could be applied to non-reasoning tasks where fixed imitation risks overwriting model capabilities.
  • Because the method only needs on-policy samples that are already generated during training, it may reduce the volume of external expert data required.
  • The approach offers a middle ground between pure SFT and full RL that could be combined with existing preference or reward-model techniques.

Load-bearing premise

Problem-level solvability estimated from verified on-policy rollouts provides a reliable, unbiased signal for dynamically calibrating the strength of expert supervision without introducing training instability or selection artifacts.

What would settle it

Train identical models on the same expert data, but replace the rollout-derived solvability signal with a random or constant value. If these controls match RASFT's accuracy on the same benchmarks, the adaptive calibration is not what drives the reported improvement.
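One way to wire up this control is a single switch that swaps the source of the calibration signal while everything else in training stays fixed. The function below is a hypothetical helper, not part of the released code; the mode names and the default constant of 0.5 are assumptions.

```python
import random


def calibration_signal(verified, mode="rollout", constant=0.5, rng=None):
    """Return a value in [0, 1] used to calibrate expert supervision.

    mode="rollout":  fraction of verified rollouts (the adaptive signal).
    mode="constant": the same fixed value for every problem (control).
    mode="random":   a uniform random value per problem (control).
    All mode names and defaults are illustrative assumptions.
    """
    if mode == "rollout":
        return sum(verified) / len(verified) if verified else 0.0
    if mode == "constant":
        return constant
    if mode == "random":
        rng = rng or random.Random(0)
        return rng.random()
    raise ValueError(f"unknown mode: {mode}")
```

Passing a shared `random.Random` instance as `rng` keeps the random control reproducible across runs.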

Figures

Figures reproduced from arXiv: 2606.07006 by Fei Sun, Fengyuan Liu, Mengnan Du, Na Zou, Wei Shi, Yanguang Liu, Yongliang Miao.

Figure 1: RASFT pipeline. (a) For each prompt, policy model πθ samples multiple rollouts, which are verified and combined with offline expert trajectory. (b) Rollout-based solvability ζi, which adaptively calibrates expert and rollout trajectory weights. (c) RASFT updates the policy model πθ by optimizing candidate trajectories with a compound weight that combines normalized trajectory weights, an inverse policy ra…
Figure 2: Comparison between RASFT and representa…
Figure 4: Sensitivity to the Rollout Number.
Original abstract

Supervised fine-tuning (SFT) is a prevailing method for adapting large language models to reasoning tasks by imitating offline expert demonstrations, often treating a single expert trajectory as the target behavior. However, reasoning is not simple path imitation: rigidly following one demonstrated solution may overfit to surface forms and suppress the model's own reasoning distribution. We propose Rollout-Adaptive Supervised Fine-Tuning (RASFT), a policy-aware SFT framework that calibrates expert supervision according to problem-level solvability estimated from verified on-policy rollouts. For each problem, RASFT strengthens expert guidance when the current policy struggles, while relaxing rigid imitation and incorporating correct self-generated trajectories when the model already exhibits reliable reasoning behavior. To preserve useful reasoning priors, RASFT further introduces a clipped inverse ratio between the frozen reference model and the current policy to constrain excessive policy drift. Experiments across multiple models on six mathematical reasoning benchmarks and two code reasoning benchmarks show that RASFT achieves better overall performance than SFT, SFT variants, and representative RL methods. The code is available at https://github.com/zjd1sq/RASFT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Rollout-Adaptive Supervised Fine-Tuning (RASFT), a policy-aware variant of SFT for reasoning tasks. For each problem, RASFT estimates solvability from verified on-policy rollouts and uses this signal to strengthen expert imitation on hard problems while relaxing imitation and incorporating correct self-generated trajectories on problems the current policy already solves reliably. A clipped inverse-ratio term between the reference model and current policy is added to bound drift. The abstract states that experiments on multiple models across six mathematical reasoning benchmarks and two code reasoning benchmarks show RASFT outperforming SFT, SFT variants, and representative RL methods.

Significance. If the reported gains are robust, RASFT would provide a concrete mechanism for making SFT adaptive to the model's evolving capabilities without full RL, potentially improving sample efficiency on reasoning tasks. The public release of code at https://github.com/zjd1sq/RASFT is a clear strength that supports reproducibility and further investigation.

major comments (2)
  1. [Method description of solvability estimation and adaptive weighting] The central performance claim rests on the reliability of the per-problem solvability signal derived from finite verified on-policy rollouts. On hard problems the success-rate estimator necessarily has high variance; the manuscript provides no analysis, ablation, or stability diagnostics showing that this variance does not produce erratic supervision weights or selection artifacts across training iterations.
  2. [Description of the clipped inverse-ratio term and its integration with rollout-based weighting] The interaction between the adaptive weighting and the clipped inverse-ratio term is presented as stabilizing policy drift, yet no derivation or empirical check demonstrates that the combination prevents the on-policy conditioning from introducing systematic bias in the supervision signal.
minor comments (1)
  1. [Abstract] The abstract asserts superior performance but contains no numerical results, dataset sizes, or statistical details; moving at least the headline numbers into the abstract would improve readability.
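The variance concern in the first major comment can be made concrete with the binomial standard error. The sketch below assumes the rollouts for a problem are independent Bernoulli trials with a fixed true success probability `p`; that independence assumption and the function names are illustrative and do not come from the paper.

```python
import math


def success_rate_std_error(p, n):
    """Standard error of the empirical success rate from n independent rollouts."""
    return math.sqrt(p * (1.0 - p) / n)


def prob_all_fail(p, n):
    """Probability that all n rollouts fail, so the estimate is exactly zero."""
    return (1.0 - p) ** n


if __name__ == "__main__":
    # A hard problem with a true success rate of 10%.
    for n in (4, 8):
        print(n, round(success_rate_std_error(0.1, n), 3), round(prob_all_fail(0.1, n), 3))
```

For a true success rate of 0.1, the standard error is 0.15 with 4 rollouts and about 0.106 with 8, and the estimate is exactly zero with probability about 0.656 and 0.430 respectively. On hard problems the estimate is therefore often either zero or a large overestimate, which is the regime the referee flags.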

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the reliability of the solvability signal and the interaction of the drift constraint. We respond to each major comment below.

Point-by-point responses
  1. Referee: The central performance claim rests on the reliability of the per-problem solvability signal derived from finite verified on-policy rollouts. On hard problems the success-rate estimator necessarily has high variance; the manuscript provides no analysis, ablation, or stability diagnostics showing that this variance does not produce erratic supervision weights or selection artifacts across training iterations.

    Authors: We agree that finite rollouts can produce high variance in the solvability estimate on hard problems and that the manuscript lacks explicit stability diagnostics. To address this directly, we will add an ablation varying the number of rollouts (4 vs. 8) and report weight variance across iterations in the revised version. revision: yes

  2. Referee: The interaction between the adaptive weighting and the clipped inverse-ratio term is presented as stabilizing policy drift, yet no derivation or empirical check demonstrates that the combination prevents the on-policy conditioning from introducing systematic bias in the supervision signal.

    Authors: The clipped inverse-ratio term follows standard importance-sampling bounds to limit drift from the reference policy. While we provide no formal derivation of the joint effect, the reported results show consistent gains without degradation indicative of bias. We will add an empirical policy-drift analysis (KL and success-rate trends) with and without the term in the revision. revision: yes
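The policy-drift analysis promised in the second response could track a Monte Carlo estimate of KL(π_θ ‖ π_ref) over training. The sketch below assumes sequence-level log-probabilities are available for trajectories sampled from the current policy; the function name and the choice of this particular estimator are assumptions, not details from the paper.

```python
def estimate_kl(logp_cur, logp_ref):
    """Monte Carlo estimate of KL(pi_theta || pi_ref).

    logp_cur[i] and logp_ref[i] are the log-probabilities of the i-th
    trajectory under the current policy and the frozen reference model.
    The trajectories must be sampled from the current policy for the
    average of (logp_cur - logp_ref) to estimate this KL divergence.
    """
    if not logp_cur or len(logp_cur) != len(logp_ref):
        raise ValueError("need two non-empty lists of equal length")
    diffs = [c - r for c, r in zip(logp_cur, logp_ref)]
    return sum(diffs) / len(diffs)
```

Logging this value at each evaluation step, with and without the clipped term, alongside the rollout success rate would give the two trends the authors mention.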

Circularity Check

0 steps flagged

No circularity: RASFT performance claims rest on external benchmarks, not definitional reduction

full rationale

The paper defines RASFT via on-policy rollout solvability estimates and a clipped inverse-ratio term to a frozen reference model, then reports empirical gains on six math and two code benchmarks against SFT and RL baselines. No equations reduce the reported performance to a fitted quantity or self-generated signal by construction; the evaluation uses held-out benchmarks independent of the training signal. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation. This is the normal non-circular case for an empirical method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described.


