arxiv: 2310.01377 · v2 · pith:3T7I6SRLnew · submitted 2023-10-02 · 💻 cs.CL · cs.AI · cs.LG

UltraFeedback: Boosting Language Models with Scaled AI Feedback

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, Zhiyuan Liu, Maosong Sun

Pith reviewed 2026-05-17 16:26 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords: AI feedback, language model alignment, reinforcement learning, best-of-n sampling, chat benchmarks, LLaMA, GPT-4, feedback dataset

The pith

A dataset of over one million GPT-4 feedbacks enables effective alignment of LLaMA-based chat models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that human feedback limits alignment research due to its small scale and narrow topics, so the authors build a much larger alternative by having GPT-4 evaluate and critique responses across 250,000 diverse user-assistant conversations. They broaden the instructions and apply bias-mitigation steps to make the AI signals more reliable, then use the resulting UltraFeedback dataset for best-of-n sampling and reinforcement learning on a LLaMA base model. This produces open-source chat models that perform strongly on standard benchmarks. A sympathetic reader would care because the approach removes the main bottleneck of collecting expensive human preferences and shows that scaled AI feedback can substitute for it in practice.

Core claim

UltraFeedback is a large-scale, high-quality, and diversified AI feedback dataset containing over 1 million GPT-4 feedbacks for 250k user-assistant conversations; when used to align a LLaMA-based model via best-of-n sampling and reinforcement learning, it produces exceptional performance on chat benchmarks and validates scaled AI feedback as an effective foundation for open-source alignment.

What carries the argument

The UltraFeedback dataset, built by broadening instructions and responses then applying bias-mitigation techniques to GPT-4 annotations, which supplies the training signal for best-of-n sampling and reinforcement learning.
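
Best-of-n sampling, as named above, can be illustrated with a minimal sketch. It assumes a hypothetical `generate` callable that samples one response per call and a hypothetical `score` callable standing in for a reward signal learned from the feedback data. Neither name, nor the toy generator and scorer, comes from the paper.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 4,
) -> Tuple[str, float]:
    """Sample n candidate responses and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [score(prompt, c) for c in candidates]
    # Index of the best candidate; ties resolve to the earliest sample.
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]


def make_toy_generator(seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a chat model: draws from a fixed pool of replies."""
    rng = random.Random(seed)
    pool = ["Sure.", "Here is a short answer.", "Here is a longer, step-by-step answer."]
    return lambda prompt: rng.choice(pool)


def toy_score(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a learned reward: response length (toy only)."""
    return float(len(response))


if __name__ == "__main__":
    response, value = best_of_n("Explain X.", make_toy_generator(0), toy_score, n=8)
    print(response, value)
```

The selection step is independent of how the score is produced, so the same function would work with any scorer that maps a prompt and a response to a number.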

If this is right

Open-source chat models can reach strong benchmark performance using only AI feedback instead of human feedback.
Best-of-n sampling combined with reinforcement learning on the feedback data improves alignment quality.
The dataset and approach serve as a foundation for further feedback-learning research.
Scaling both the amount and diversity of feedback data is what drives the alignment gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same scaling approach could be tested on alignment tasks beyond chat, such as instruction following or safety.
Hybrid pipelines that mix UltraFeedback with limited human data might close remaining gaps with proprietary models.
The bias-mitigation steps could be reused or refined when other large models serve as feedback providers.

Load-bearing premise

The series of techniques applied to mitigate annotation biases in GPT-4 feedback produces sufficiently reliable and unbiased signals for effective model alignment.

What would settle it

If models trained on UltraFeedback show no measurable gain over baselines trained on smaller human-feedback datasets across multiple chat benchmarks, the claim that scaled AI feedback is effective would be falsified.

Original abstract

Learning from human feedback has become a pivot technique in aligning large language models (LLMs) with human preferences. However, acquiring vast and premium human feedback is bottlenecked by time, labor, and human capability, resulting in small sizes or limited topics of current datasets. This further hinders feedback learning as well as alignment research within the open-source community. To address this issue, we explore how to go beyond human feedback and collect high-quality AI feedback automatically for a scalable alternative. Specifically, we identify scale and diversity as the key factors for feedback data to take effect. Accordingly, we first broaden instructions and responses in both amount and breadth to encompass a wider range of user-assistant interactions. Then, we meticulously apply a series of techniques to mitigate annotation biases for more reliable AI feedback. We finally present UltraFeedback, a large-scale, high-quality, and diversified AI feedback dataset, which contains over 1 million GPT-4 feedback for 250k user-assistant conversations from various aspects. Built upon UltraFeedback, we align a LLaMA-based model by best-of-n sampling and reinforcement learning, demonstrating its exceptional performance on chat benchmarks. Our work validates the effectiveness of scaled AI feedback data in constructing strong open-source chat language models, serving as a solid foundation for future feedback learning research. Our data and models are available at https://github.com/thunlp/UltraFeedback.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces UltraFeedback, a large-scale dataset containing over 1 million GPT-4 feedbacks across 250k diverse user-assistant conversations. The authors broaden the scope of instructions and responses and apply a series of techniques to mitigate annotation biases in the GPT-4 signals. They then align a LLaMA-based model using best-of-n sampling and reinforcement learning on this dataset, reporting strong results on chat benchmarks and positioning the work as a scalable open-source alternative to human feedback for alignment research.

Significance. If the central empirical claims hold, the work supplies a publicly released, high-volume AI feedback resource that could meaningfully accelerate open-source LLM alignment experiments. The explicit focus on scale, diversity, and bias mitigation, together with the release of both data and models, constitutes a concrete contribution to the feedback-learning literature.

major comments (2)

[Dataset construction and bias-mitigation subsection] The manuscript describes a series of techniques to reduce GPT-4 annotation biases but provides no controlled comparison (e.g., agreement rates or win rates) of the resulting preference signals against human labels on an overlapping instruction set. Without such validation, it remains unclear whether residual GPT-4 biases (verbosity, sycophancy) are sufficiently suppressed for the subsequent RL stage to be reliable.
[Alignment experiments section] The headline claim of 'exceptional performance' on chat benchmarks is presented without reported standard deviations across multiple runs, without explicit baseline numbers for models trained on comparable human-feedback datasets, and without ablation results isolating the contribution of the bias-mitigation steps. These omissions make it difficult to determine whether the observed gains are statistically robust and attributable to UltraFeedback quality.
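
The controlled comparison requested in the first major comment could be computed along the following lines. This is a minimal sketch, assuming each comparison pair has been labelled "A" or "B" by both GPT-4 and a human annotator. The function names and the toy labels are hypothetical, and nothing here is taken from the paper.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(ai_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of comparison pairs on which the AI and human annotators agree."""
    if len(ai_labels) != len(human_labels):
        raise ValueError("label sequences must have the same length")
    if not ai_labels:
        raise ValueError("at least one labelled pair is required")
    matches = sum(a == h for a, h in zip(ai_labels, human_labels))
    return matches / len(ai_labels)


def cohens_kappa(ai_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Chance-corrected agreement: (observed - expected) / (1 - expected)."""
    observed = agreement_rate(ai_labels, human_labels)
    total = len(ai_labels)
    ai_counts = Counter(ai_labels)
    human_counts = Counter(human_labels)
    # Agreement expected if both annotators labelled independently
    # with their own marginal label frequencies.
    expected = sum(
        (ai_counts[label] / total) * (human_counts[label] / total)
        for label in set(ai_counts) | set(human_counts)
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when both annotators use one identical label")
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the paper.
    ai = ["A", "A", "B", "B", "A", "B"]
    human = ["A", "B", "B", "B", "A", "A"]
    print(agreement_rate(ai, human), cohens_kappa(ai, human))
```

Raw agreement alone can look high when one label dominates, which is why the sketch also reports a chance-corrected statistic.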

minor comments (2)

[Abstract] The phrase 'exceptional performance' is used without any numeric benchmark scores or direct comparisons, reducing immediate readability.
[Alignment experiments section] Notation: the description of best-of-n sampling and the RL objective would benefit from an explicit equation or pseudocode block to clarify the exact training procedure.
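
For orientation, the kind of explicit statement the second minor comment asks for might look like the following. This is a generic sketch in a commonly used form, not the paper's own notation: the reward function r, the reference policy \pi_ref, the prompt distribution D, and the coefficient \beta are all assumed symbols, and the KL-regularized objective is a standard formulation that the paper may or may not use.

```latex
% Best-of-n sampling: draw n candidates from the policy and keep the top-scoring one.
\[
y_1, \dots, y_n \sim \pi(\cdot \mid x), \qquad
y^{\ast} = \arg\max_{i \in \{1, \dots, n\}} r(x, y_i)
\]

% A common KL-regularized RL objective (assumed form, not taken from the paper).
\[
\max_{\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)} \big[ r(x, y) \big]
\;-\; \beta \, \mathbb{E}_{x \sim \mathcal{D}} \Big[ \mathrm{KL}\big( \pi_{\theta}(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big]
\]
```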

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the positive assessment of the work's significance and for the constructive major comments. We respond to each point below, acknowledging where the current manuscript is limited and describing the revisions we will make.

Point-by-point responses

Referee: [Dataset construction and bias-mitigation subsection] The manuscript describes a series of techniques to reduce GPT-4 annotation biases but provides no controlled comparison (e.g., agreement rates or win rates) of the resulting preference signals against human labels on an overlapping instruction set. Without such validation, it remains unclear whether residual GPT-4 biases (verbosity, sycophancy) are sufficiently suppressed for the subsequent RL stage to be reliable.

Authors: We agree that a direct controlled comparison against human labels on an overlapping set would provide valuable additional validation. The manuscript does not contain such a comparison, as collecting human annotations at the scale of 250k conversations was not feasible and is precisely the bottleneck our work seeks to address. We will revise the bias-mitigation subsection to explicitly acknowledge this limitation, discuss the known properties of GPT-4 as a judge (including residual risks of verbosity and sycophancy), and cite relevant studies on LLM-judge reliability. We will also note that downstream benchmark gains serve as an indirect indicator of signal quality.
revision: yes

Referee: [Alignment experiments section] The headline claim of 'exceptional performance' on chat benchmarks is presented without reported standard deviations across multiple runs, without explicit baseline numbers for models trained on comparable human-feedback datasets, and without ablation results isolating the contribution of the bias-mitigation steps. These omissions make it difficult to determine whether the observed gains are statistically robust and attributable to UltraFeedback quality.

Authors: We acknowledge that the experimental section would be strengthened by these elements. The current manuscript reports single-run results for the primary models and does not include explicit human-feedback baselines or full ablations on bias mitigation. We will revise the alignment experiments section to report standard deviations from any available multi-seed runs, add direct comparisons against models trained on established human-feedback datasets (e.g., HH-RLHF), and include targeted ablations isolating the bias-mitigation techniques. Due to computational constraints, the scope of new experiments will be limited to feasible re-runs and smaller-scale ablations.
revision: partial
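
Reporting variability across seeds, as promised in the second response, needs very little machinery. The sketch below assumes one benchmark score per seed. The function name and the example scores are hypothetical placeholders, not results from the paper.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of one benchmark score across seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    # statistics.stdev uses the n - 1 denominator (sample standard deviation).
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    # Placeholder scores for three hypothetical seeds; illustration only.
    mean, std = summarize_runs([7.0, 7.2, 6.8])
    print(f"{mean:.2f} +/- {std:.2f}")
```

With only a handful of seeds the estimate of the spread is itself noisy, so the number of runs should be reported alongside it.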

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and external benchmark validation

The paper presents an empirical pipeline: broadening instructions/responses, applying bias-mitigation techniques to GPT-4 annotations, releasing the resulting UltraFeedback dataset of 1M+ feedbacks, and then performing best-of-n sampling plus RL alignment on a LLaMA model whose chat-benchmark scores are reported as external evidence. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation chain; the performance claims rest on measured outcomes against independent benchmarks rather than reducing to the input data or prior author results by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract states no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that GPT-4 feedback after bias mitigation is a valid proxy for human preferences.

axioms (1)

Domain assumption: GPT-4 can generate reliable preference feedback when annotation biases are mitigated by the described techniques.
Invoked implicitly when claiming the dataset enables effective alignment.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "Built upon UltraFeedback, we align a LLaMA-based model by best-of-n sampling and reinforcement learning, demonstrating its exceptional performance on chat benchmarks"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Split the Differences, Pool the Rest: Provably Efficient Multi-Objective Imitation
cs.LG 2026-05 unverdicted novelty 7.0

MA-BC partitions divergent expert data while pooling non-conflicting pairs in MOMDPs, converging faster to Pareto-optimal policies than independent learners and matching a new minimax lower bound.
Mind the Gap: Structure-Aware Consistency in Preference Learning
cs.LG 2026-04 unverdicted novelty 7.0

Standard DPO surrogates are inconsistent for equicontinuous neural nets; SA-DPO provides structure-aware H-consistency bounds by adapting margins to semantic distance and shows heavy-tailed losses yield superior guara...
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
cs.CL 2024-06 unverdicted novelty 7.0

Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, Ar...
What should post-training optimize? A test-time scaling law perspective
cs.LG 2026-05 unverdicted novelty 6.0

Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.
Optimal Transport for LLM Reward Modeling from Noisy Preference
cs.LG 2026-05 unverdicted novelty 6.0

SelectiveRM applies optimal transport with a joint consistency discrepancy and partial mass relaxation to produce reward models that optimize a tighter upper bound on clean risk while autonomously dropping noisy prefe...
Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models
cs.LG 2026-05 conditional novelty 6.0

Gate-DPO attenuates gradients on low-probability rejected responses to reduce probability collapse and improve chosen-response likelihood during preference optimization.
Diversity in Large Language Models under Supervised Fine-Tuning
cs.LG 2026-04 unverdicted novelty 6.0

TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
Pref-CTRL: Preference Driven LLM Alignment using Representation Editing
cs.CL 2026-04 unverdicted novelty 6.0

Pref-CTRL trains a multi-objective value function on preferences to guide representation editing for LLM alignment, outperforming RE-Control on benchmarks with better out-of-domain generalization.
MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment
cs.LG 2026-04 unverdicted novelty 6.0

MGDA-Decoupled applies geometry-based multi-objective optimization within the DPO framework to find shared descent directions that account for each objective's convergence dynamics, yielding higher win rates on UltraFeedback.
GroupDPO: Memory efficient Group-wise Direct Preference Optimization
cs.CL 2026-04 unverdicted novelty 6.0

GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.
Shorter, but Still Trustworthy? An Empirical Study of Chain-of-Thought Compression
cs.CL 2026-04 unverdicted novelty 6.0

CoT compression frequently introduces trustworthiness regressions with method-specific degradation profiles; a proposed normalized efficiency score and alignment-aware DPO variant reduce length by 19.3% with smaller t...
VC-Soup: Value-Consistency Guided Multi-Value Alignment for Large Language Models
cs.LG 2026-03 unverdicted novelty 6.0

VC-Soup uses a cosine-similarity consistency metric to filter data, trains value-consistent policies, and applies linear merging with Pareto filtering to improve multi-value LLM alignment trade-offs.
Robust Policy Optimization to Prevent Catastrophic Forgetting
cs.LG 2026-02 unverdicted novelty 6.0

FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.
Multiplayer Nash Preference Optimization
cs.AI 2025-09 unverdicted novelty 6.0

MNPO extends NLHF to multiplayer Nash games, inheriting equilibrium guarantees while showing empirical gains on instruction-following benchmarks under diverse preferences.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV 2024-12 unverdicted novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
Zephyr: Direct Distillation of LM Alignment
cs.LG 2023-10 accept novelty 6.0

Zephyr-7B achieves state-of-the-art chat benchmark results among 7B models by distilling alignment via dDPO on AI feedback preferences, surpassing the 70B Llama-2-Chat model on MT-Bench with no human data required.
Two Ways to De-Bias an LLM-as-a-Judge: A Continuous-Score Comparison of Hierarchical Bayesian Calibration and Neural-ODE Score Transport
cs.CL 2026-05 unverdicted novelty 5.0

With 100 anchors the Bayesian linear corrector matches or beats the Neural-ODE flow on distribution recovery while both fix mean offset; with 1500 anchors the flow wins on MAE, Pearson correlation, and KL divergence.
Diversity in Large Language Models under Supervised Fine-Tuning
cs.LG 2026-04 unverdicted novelty 5.0

Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.
SpikingMamba: Towards Energy-Efficient Large Language Models via Knowledge Distillation from Mamba
cs.NE 2025-10 unverdicted novelty 5.0

SpikingMamba distills Mamba into an SNN LLM achieving 4.76x energy savings with a 4.78% zero-shot accuracy gap that narrows to 2.23% after RL.
Failure Modes of Maximum Entropy RLHF
cs.LG 2025-09 unverdicted novelty 5.0

Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.
Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs
cs.AI 2024-10 unverdicted novelty 4.0

Data-centric filtering yields an 80K preference dataset and reward models that lead RewardBench while boosting other top entries.
A Survey on Knowledge Distillation of Large Language Models
cs.CL 2024-02 accept novelty 3.0

A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.
