Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs
Pith reviewed 2026-05-12 04:51 UTC · model grok-4.3
The pith
Team-based self-play with dual adaptive weighting enables stable self-supervised alignment of LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TPAW adopts a team-based framework in which the current policy model both collaborates with and competes against historical checkpoints. Two adaptive mechanisms complete the method: a response reweighting scheme that adjusts the importance of target responses, and a player weighting strategy that dynamically modulates each team member's contribution during training. Together these allow iterative refinement of alignment without additional human supervision.
What carries the argument
The team-based self-play framework with dual adaptive weighting, in which the current policy interacts with historical checkpoints while response importance and player contributions are adjusted dynamically to sustain training progress.
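The described machinery can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's actual equations: `response_weight`, `player_weights`, the sigmoid/softmax choices, and all numbers are hypothetical stand-ins for the two weighting mechanisms.

```python
import math

def response_weight(margin, tau=1.0):
    """Reweight a target response by its team-scored margin.

    Hypothetical choice: a sigmoid, so pairs with small margins
    (the diminishing positive-negative gap) are not discarded
    outright but contribute proportionally less.
    """
    return 1.0 / (1.0 + math.exp(-margin / tau))

def player_weights(agreements, beta=2.0):
    """Softmax over each team member's recent agreement with the
    current policy's preferences (a hypothetical proxy for how
    useful that checkpoint's signal currently is)."""
    exps = [math.exp(beta * a) for a in agreements]
    total = sum(exps)
    return [e / total for e in exps]

# Toy team: the current policy plus three historical checkpoints,
# with simulated agreement rates driving the player weighting.
agreements = [0.8, 0.6, 0.4, 0.2]
w = player_weights(agreements)

# Each team member scores the same (positive, negative) pair slightly
# differently; rows are players, entries are that player's margin.
player_margins = [
    [0.7, 0.1, -0.05],   # current policy
    [0.6, 0.05, 0.0],    # checkpoint t-1
    [0.4, -0.1, 0.1],    # checkpoint t-2
    [0.2, -0.2, 0.15],   # checkpoint t-3
]
for i in range(3):
    team_margin = sum(wi * pm[i] for wi, pm in zip(w, player_margins))
    print(f"pair {i}: team margin={team_margin:+.3f}  "
          f"weight={response_weight(team_margin, tau=0.2):.3f}")
```

The point of the sketch is the interaction: player weights modulate how much each checkpoint's preference counts, and the resulting team margin in turn sets the response weight used in the loss.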
Where Pith is reading between the lines
- Similar team competition structures could stabilize self-training loops in domains such as code generation or mathematical reasoning where synthetic data quality also varies.
- The method may lower reliance on external reward models by using internal model comparisons to maintain signal strength.
- Varying the number or selection strategy of historical checkpoints could be tested to optimize the diversity of the competing signals.
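The checkpoint-selection knob in the last bullet can be made concrete with a toy strided selector. `select_checkpoints`, `k`, and `stride` are hypothetical names for illustration, not the paper's mechanism: the trade-off is recency (nearby checkpoints agree more) against diversity of the competing signals (older checkpoints disagree more).

```python
def select_checkpoints(history, k=3, stride=2):
    """Pick up to k checkpoints, every `stride`-th one counting back
    from the end of training history. A larger stride favors diversity
    of competing signals; a smaller one favors recency."""
    picked = history[::-1][::stride][:k]
    return list(reversed(picked))

# With checkpoints saved at steps 0..9, stride 2 keeps steps 5, 7, 9.
print(select_checkpoints(list(range(10)), k=3, stride=2))  # [5, 7, 9]
```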
Load-bearing premise
That the team-based self-play framework and the two adaptive weighting mechanisms sufficiently resolve sensitivity to synthetic data quality and the diminishing positive-negative gap without introducing new instabilities or biases.
What would settle it
Running TPAW and a standard self-training baseline on identical base models and synthetic data for the same number of iterations, then measuring win rates on a benchmark such as MT-Bench or AlpacaEval; failure of TPAW to exceed the baseline would falsify the performance claim.
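The decisive experiment reduces to a paired win-rate comparison, which can be sketched as a minimal harness. Everything here is illustrative: the `judge` is a trivial stand-in (it prefers the longer response) for an MT-Bench or AlpacaEval style LLM judge, and the output strings are not real model generations.

```python
def judge(resp_a: str, resp_b: str) -> int:
    """Stand-in pairwise judge: 1 if A wins, else 0. A real run would
    use an MT-Bench / AlpacaEval style LLM judge; this toy version
    simply prefers the longer response."""
    return 1 if len(resp_a) > len(resp_b) else 0

def win_rate(candidate_outputs, baseline_outputs):
    """Fraction of prompts on which the candidate beats the baseline."""
    wins = sum(judge(a, b) for a, b in zip(candidate_outputs, baseline_outputs))
    return wins / len(candidate_outputs)

# Hypothetical outputs from TPAW and a standard self-training baseline
# on the same prompts, same base model, same number of iterations.
tpaw = ["a detailed answer", "short", "another detailed answer"]
base = ["brief", "a much longer baseline answer", "ok"]
print(f"TPAW win rate: {win_rate(tpaw, base):.2f}")  # <= 0.50 would falsify the claim
```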
Original abstract
While recent self-training approaches have reduced reliance on human-labeled data for aligning LLMs, they still face critical limitations: (i) sensitivity to synthetic data quality, leading to instability and bias amplification in iterative training; (ii) ineffective optimization due to a diminishing gap between positive and negative responses over successive training iterations. In this paper, we propose Team-based self-Play with dual Adaptive Weighting (TPAW), a novel self-play algorithm designed to improve alignment in a fully self-supervised setting. TPAW adopts a team-based framework in which the current policy model both collaborates with and competes against historical checkpoints, promoting more stable and efficient optimization. To further enhance learning, we design two adaptive weighting mechanisms: (i) a response reweighting scheme that adjusts the importance of target responses, and (ii) a player weighting strategy that dynamically modulates each team member's contribution during training. Initialized from a SFT model, TPAW iteratively refines alignment without requiring additional human supervision. Experimental results demonstrate that TPAW consistently outperforms existing baselines across various base models and LLM benchmarks. Our code is publicly available at https://github.com/lab-klc/TPAW.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Team-based self-Play with dual Adaptive Weighting (TPAW), a self-supervised algorithm for aligning LLMs. It introduces a team-based framework in which the current policy model collaborates and competes against historical checkpoints, combined with two adaptive weighting mechanisms (response reweighting to adjust target response importance and player weighting to modulate team member contributions). Initialized from an SFT model, TPAW iteratively refines alignment without human supervision. The central claim is that TPAW consistently outperforms existing baselines across various base models and LLM benchmarks, with public code released.
Significance. If the experimental results hold under scrutiny, the work provides a concrete algorithmic advance for reducing instability and bias in iterative self-training of LLMs. The team-based self-play plus dual weighting directly targets the stated problems of synthetic data sensitivity and shrinking positive-negative gaps. Public code availability strengthens the contribution by enabling direct verification and extension.
Major comments (2)
- [§4] (Experiments): The claim of consistent outperformance is central, yet the manuscript provides no quantitative details on effect sizes, statistical significance, or variance across runs in the main results tables. Without these, it is impossible to determine whether the reported gains exceed baseline variability or arise from post-hoc hyperparameter choices.
- [§3.2] (Adaptive Weighting Mechanisms): The response reweighting and player weighting are presented as solving the diminishing gap problem, but the manuscript does not include an ablation isolating each component's contribution to stability (e.g., training curves with and without each weighting). This leaves open whether the dual weighting is load-bearing or whether simpler reweighting suffices.
Minor comments (3)
- [§2] The abstract states the two limitations but does not quantify them (e.g., how rapidly the positive-negative gap shrinks in prior methods). Adding a short illustrative plot or metric in §2 would strengthen the motivation.
- [§3] Notation for the team members and weighting functions is introduced without a consolidated table; a single table summarizing symbols, their meanings, and update rules would improve readability.
- [Appendix] The public code link is welcome, but the manuscript should include a brief reproducibility checklist (random seeds, exact hyperparameter ranges, hardware) in the appendix.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which have helped us improve the clarity and rigor of our work. We address each major comment below and have revised the manuscript to incorporate additional analyses and details where feasible.
Point-by-point responses
- Referee: [§4] (Experiments): The claim of consistent outperformance is central, yet the manuscript provides no quantitative details on effect sizes, statistical significance, or variance across runs in the main results tables. Without these, it is impossible to determine whether the reported gains exceed baseline variability or arise from post-hoc hyperparameter choices.
  Authors: We agree that reporting variance, effect sizes, and statistical significance is essential to substantiate the performance claims. In the revised manuscript, we have updated all main results tables in Section 4 to include standard deviations computed over five independent runs with different random seeds. We also report Cohen's d effect sizes for the key performance differences and include p-values from paired t-tests against each baseline. To address potential concerns about hyperparameter selection, we have added a dedicated paragraph in Section 4.1 describing the tuning protocol, which used a fixed held-out validation split and grid search performed prior to final test evaluation. Revision: yes.
- Referee: [§3.2] (Adaptive Weighting Mechanisms): The response reweighting and player weighting are presented as solving the diminishing gap problem, but the manuscript does not include an ablation isolating each component's contribution to stability (e.g., training curves with and without each weighting). This leaves open whether the dual weighting is load-bearing or whether simpler reweighting suffices.
  Authors: We acknowledge that isolating the contribution of each weighting mechanism is necessary to establish their individual and joint importance. In the revised version, we have expanded Section 3.2 with new ablation experiments and added a corresponding appendix subsection. These include training dynamics plots (positive-negative response gap and reward curves) for the full TPAW model, the model without response reweighting, the model without player weighting, and a single-weighting baseline. The results indicate that both mechanisms are required to sustain the gap and prevent instability; removing either leads to measurable degradation in stability and final performance, with the combination providing benefits beyond simpler reweighting alone. Revision: yes.
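The ablation logic described in this response can be mimicked with a toy simulation of the positive-negative gap across self-play iterations. The decay rate and floor below are illustrative assumptions, not measurements from the paper; the sketch only shows the qualitative shape the claimed training-dynamics plots would need to exhibit.

```python
def simulate_gap(iters, reweight=True, decay=0.7, floor=0.4):
    """Toy dynamics of the positive-negative gap over self-play
    iterations. Without reweighting the gap decays geometrically;
    with it, low-margin pairs are up-weighted so the effective gap
    is held above `floor`. Numbers are illustrative, not measured."""
    gap, history = 1.0, []
    for _ in range(iters):
        gap *= decay
        if reweight:
            gap = max(gap, floor)
        history.append(round(gap, 3))
    return history

print("no reweighting:  ", simulate_gap(5, reweight=False))
print("with reweighting:", simulate_gap(5, reweight=True))
```

An ablation in this style would overlay one such curve per variant (full model, each weighting removed, single-weighting baseline) and check which ones keep the gap from collapsing.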
Circularity Check
No significant circularity detected
Full rationale
The paper presents TPAW as a novel algorithmic contribution consisting of a team-based self-play framework and two explicitly designed adaptive weighting mechanisms (response reweighting and player weighting). The abstract and high-level description frame these as new constructs initialized from an SFT model, with performance claims tied to experimental outperformance rather than any reduction to fitted parameters, self-defined quantities, or prior self-citations. Public code availability allows independent reproduction. No load-bearing derivation step is shown to collapse to its own inputs by construction, and the central claims remain self-contained against external benchmarks.