Self-Rewarding Language Models
23 Pith papers cite this work. Polarity classification is still indexing.
abstract
We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level; moreover, these separate frozen reward models cannot learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction following ability improve, but so does the model's ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While there is much left still to explore, this work opens the door to the possibility of models that can continually improve on both axes.
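As a rough illustration of the loop described in the abstract, the sketch below shows one self-rewarding iteration. The callables generate_fn, judge_fn, and dpo_train_fn are hypothetical stand-ins (not names from the paper) for sampling, LLM-as-a-Judge scoring, and a DPO training step, and the judge prompt is heavily abbreviated relative to the paper's additive 5-point rubric.

```python
from typing import Callable, List, Tuple

# Hypothetical helper signatures (assumptions, not from the paper):
#   generate_fn(prompt) -> str          samples one candidate response
#   judge_fn(judge_prompt) -> float     LLM-as-a-Judge score, e.g. 0-5
#   dpo_train_fn(pairs) -> None         one DPO round on (prompt, chosen, rejected) pairs

JUDGE_TEMPLATE = (
    "Review the user's question and the corresponding response, then assign a score "
    "out of 5 based on relevance, coverage, helpfulness, clarity, and insight.\n"
    "Question: {prompt}\nResponse: {response}\nScore:"
)

def self_rewarding_iteration(
    prompts: List[str],
    generate_fn: Callable[[str], str],
    judge_fn: Callable[[str], float],
    dpo_train_fn: Callable[[List[Tuple[str, str, str]]], None],
    num_candidates: int = 4,
) -> None:
    pairs: List[Tuple[str, str, str]] = []
    for prompt in prompts:
        # 1. The current model samples several candidate responses per prompt.
        candidates = [generate_fn(prompt) for _ in range(num_candidates)]
        # 2. The same model scores its own candidates via LLM-as-a-Judge prompting.
        scores = [
            judge_fn(JUDGE_TEMPLATE.format(prompt=prompt, response=c))
            for c in candidates
        ]
        # 3. The highest- and lowest-scored responses form a DPO preference pair.
        best, worst = max(scores), min(scores)
        if best > worst:
            pairs.append((prompt, candidates[scores.index(best)], candidates[scores.index(worst)]))
    # 4. One round of DPO on the self-generated pairs yields the next iteration's model,
    #    which (per the paper) both follows instructions and judges better.
    dpo_train_fn(pairs)
```

Repeating this loop, each time starting from the previous round's model, corresponds to the three-iteration fine-tuning setup behind the Llama 2 70B result quoted in the abstract.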
hub tools
citation-role summary
citation-polarity summary
roles: method (1)
polarities: use method (1)
citing papers explorer
-
Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization
DGAO uses reinforcement learning to optimize LLMs for both accuracy and order stability by balancing intra-group accuracy advantages and inter-group stability advantages.
-
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
-
Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
DuST self-trains LLMs for code generation by ranking their own test-time samples via sandbox execution and applying GRPO, improving judgment by +6.2 NDCG and single-sample pass@1 by +3.1 on LiveCodeBench.
-
CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18% (a generic sketch of GRPO's group-relative advantage appears after this list).
-
Beyond Static Bias: Adaptive Multi-Fidelity Bandits with Improving Proxies
The TACC algorithm for adaptive multi-fidelity bandits with improving proxies achieves instance-dependent regret by replacing logarithmic high-fidelity pulls with bounded low-fidelity continuation for intermediate arms.
-
Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning
Iterative search over reward functions with ranked feedback in GRPO training improves LLM math reasoning, achieving an F1 of 0.795 on GSM8K versus 0.609 for the baseline.
-
IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning
IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of the annotated data.
-
Neural Garbage Collection: Learning to Forget while Learning to Reason
Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.
-
TextGrad: Automatic "Differentiation" via Text
TextGrad performs automatic differentiation for compound AI systems by backpropagating natural-language feedback from LLMs to optimize variables ranging from code to molecular structures.
-
KTO: Model Alignment as Prospect Theoretic Optimization
KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.
-
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
-
SEIF: Self-Evolving Reinforcement Learning for Instruction Following
SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.
-
Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation
MISE proves that hindsight self-evaluation rewards equal minimizing mutual information plus KL divergence to a proxy policy, and experiments show 7B LLMs reaching GPT-4o-level results on validation tasks.
-
Pioneer Agent: Continual Improvement of Small Language Models in Production
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style tasks.
-
Can LLMs Learn to Reason Robustly under Noisy Supervision?
Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning benchmarks even at high noise levels.
-
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.
-
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
-
ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration
ARIS is a three-layer open-source system that uses cross-model adversarial collaboration plus claim-auditing pipelines to make LLM-driven research workflows more reliable.
-
PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs
PoliLegalLM, trained with continued pretraining, progressive SFT, and preference RL on a legal corpus, outperforms similar-scale models on LawBench, LexEval, and a real-world PoliLegal dataset while staying competitive with much larger models.
-
MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction
MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.
-
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
LlamaFactory provides a unified no-code framework for efficient fine-tuning of 100+ LLMs via an integrated web UI and has been released on GitHub.
-
Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO
Skills-Coach optimizes LLM agent skills via task generation, prompt/code tuning, comparative execution, and traceable evaluation, reporting gains on a 48-skill benchmark called Skill-X.
-
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
This survey organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
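Several entries above (DuST, CoDistill-GRPO, the reward-function search paper, StraTA, and Skills-Coach) train with GRPO-style objectives, and DGAO's intra- and inter-group advantages build on the same idea. As a point of reference, the minimal sketch below shows the usual group-relative advantage computation, standardizing each sampled response's reward against its own group; it is a generic illustration, not any of these papers' exact implementation.

```python
import statistics
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages for one group of responses sampled from the same prompt.

    Each reward (e.g. a sandbox unit-test signal or a judge score) is standardized
    against the group's own mean and standard deviation, so no learned value
    function is needed to form the advantage.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled solutions to one coding prompt, two of which pass the tests.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
# ~[1.0, -1.0, 1.0, -1.0]: passing samples get positive advantage, failing ones negative.
```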