arxiv: 2310.06452 · v3 · pith:XJ4TZBEW · submitted 2023-10-10 · 💻 cs.LG · cs.AI · cs.CL

Understanding the Effects of RLHF on LLM Generalisation and Diversity

Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, Roberta Raileanu

Pith reviewed 2026-05-19 02:29 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords RLHF · LLM fine-tuning · out-of-distribution generalization · output diversity · supervised fine-tuning · instruction following · summarization · tradeoff



The pith

RLHF makes language models generalize better to new inputs than supervised fine-tuning but cuts their output diversity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares the effects of supervised fine-tuning, reward modeling, and the full RLHF process on two properties of LLMs: how well they handle inputs far from their training data and how varied the responses they produce are. Experiments on summarization and instruction-following tasks with two base models show that the RL stage improves generalization, with the benefit growing as the gap between training and test inputs widens. The same stage, however, lowers output diversity on multiple measures. This reveals a concrete tradeoff that affects which fine-tuning approach suits a given application.

Core claim

RLHF generalises better than SFT to new inputs, particularly as the distribution shift between train and test becomes larger. However, RLHF significantly reduces output diversity compared to SFT across a variety of measures, implying a tradeoff in current LLM fine-tuning methods between generalisation and diversity.

What carries the argument

Staged ablation of supervised fine-tuning versus full RLHF, measured on out-of-distribution generalization and multiple output-diversity statistics across summarization and instruction-following tasks.

Load-bearing premise

The selected tasks and diversity-plus-generalization metrics serve as accurate stand-ins for the properties that matter in actual LLM use.

What would settle it

A follow-up experiment on a high-shift instruction-following task in which RLHF shows no generalization edge over SFT or maintains equal output diversity.

Original abstract

Large language models (LLMs) fine-tuned with reinforcement learning from human feedback (RLHF) have been used in some of the most widely deployed AI models to date, such as OpenAI's ChatGPT or Anthropic's Claude. While there has been significant work developing these methods, our understanding of the benefits and downsides of each stage in RLHF is still limited. To fill this gap, we present an extensive analysis of how each stage of the process (i.e. supervised fine-tuning (SFT), reward modelling, and RLHF) affects two key properties: out-of-distribution (OOD) generalisation and output diversity. OOD generalisation is crucial given the wide range of real-world scenarios in which these models are being used, while output diversity refers to the model's ability to generate varied outputs and is important for a variety of use cases. We perform our analysis across two base models on both summarisation and instruction following tasks, the latter being highly relevant for current LLM use cases. We find that RLHF generalises better than SFT to new inputs, particularly as the distribution shift between train and test becomes larger. However, RLHF significantly reduces output diversity compared to SFT across a variety of measures, implying a tradeoff in current LLM fine-tuning methods between generalisation and diversity. Our results provide guidance on which fine-tuning method should be used depending on the application, and show that more research is needed to improve the tradeoff between generalisation and diversity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RLHF improves OOD generalization over SFT especially under larger shifts but cuts output diversity, shown through stage-by-stage experiments on summarization and instruction tasks.


The main thing to know is that this paper finds RLHF produces better generalization to new inputs than SFT alone, with the advantage growing as the train-test shift increases, while also lowering output diversity across several measures. They break the process into SFT, reward modeling, and RL stages and test on two base models for both summarization and instruction following. That staged comparison is the clearest part of the work and gives some practical signal on where the generalization gain and diversity loss occur. The results line up with the abstract claim and supply concrete numbers that could help decide between methods depending on whether robustness or variety matters more for a given use case. The experiments look reasonably controlled for an empirical paper in this area.

The soft spots are mostly around the proxies. The diversity metrics appear to be standard surface ones such as distinct-n or entropy, which may miss semantic variety that actually matters in deployment. The distribution shifts are created by swapping datasets rather than by measuring continuous distances or KL divergence, so it is not obvious how far the pattern generalizes beyond these particular task pairs. The tasks themselves are relevant but narrow, and the paper does not show whether the same tradeoff appears in more open-ended generation or in settings with stronger distribution shifts.

Readers working on LLM fine-tuning pipelines or alignment will get the most out of it, especially if they need data on this specific tradeoff. The work is grounded enough in direct comparisons to merit sending out for peer review rather than a desk reject, though referees will probably press on the metric choices and shift quantification.

Referee Report

2 major / 2 minor

Summary. The paper conducts an empirical study across two base LLMs and two tasks (summarization and instruction following) to examine how SFT, reward modeling, and RLHF affect OOD generalization and output diversity. It reports that RLHF yields better generalization than SFT, with the advantage growing as the train-test distribution shift increases, while simultaneously reducing output diversity across multiple metrics, suggesting a tradeoff in current fine-tuning pipelines.

Significance. If the central empirical findings hold under more rigorous controls, the work supplies actionable guidance for practitioners choosing between SFT and RLHF depending on whether generalization or diversity is prioritized, and it motivates targeted research to close the observed diversity gap without sacrificing generalization gains.

major comments (2)

§4 (OOD generalization results): the claim that RLHF's advantage grows with larger distribution shift is supported only by categorical dataset swaps; no continuous quantification of shift (e.g., mean embedding cosine distance, KL divergence, or Wasserstein distance between train and test distributions) is reported, so the 'particularly as the shift becomes larger' qualifier rests on an unmeasured ordinal ranking of the chosen test sets.

§3.2 (Diversity evaluation): the reported diversity measures (distinct-n, entropy, self-BLEU, etc.) are primarily lexical/surface-form statistics; the manuscript does not demonstrate that these correlate with semantic or functional diversity relevant to downstream use cases, leaving open the possibility that the observed diversity reduction is an artifact of the chosen proxies rather than a general property of RLHF.
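
For readers unfamiliar with the surface-form statistics named in the diversity comment, the following is a minimal Python sketch of distinct-n and n-gram entropy. It assumes whitespace tokenization and pools n-grams over a set of outputs sampled for the same input; the function names are illustrative, and the paper's exact definitions may differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(outputs, n):
    """Unique n-grams divided by total n-grams, pooled over all outputs.

    Assumption: whitespace tokenization; real evaluations often use a
    model tokenizer instead.
    """
    pooled = [g for text in outputs for g in ngrams(text.split(), n)]
    if not pooled:
        return 0.0
    return len(set(pooled)) / len(pooled)


def ngram_entropy(outputs, n=1):
    """Shannon entropy (in bits) of the pooled n-gram distribution."""
    counts = Counter(g for text in outputs for g in ngrams(text.split(), n))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Both statistics fall when sampled outputs reuse the same wording, which is why they are lexical rather than semantic measures: two paraphrases with no shared words would score as highly diverse even if they say the same thing.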

minor comments (2)

Table 1 and Figure 2: add error bars or report standard deviations across random seeds to make the SFT-vs-RLHF comparisons statistically interpretable.

§5 (Discussion): the limitations paragraph should explicitly address whether the chosen tasks and metrics are representative of real-world deployment distributions.
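
One simple way to act on the error-bar request is to report the mean and sample standard deviation of each metric over random seeds. The sketch below is illustrative only: `summarize_seeds` is a hypothetical helper, not something taken from the paper, and it assumes one scalar score per seed.

```python
import statistics


def summarize_seeds(scores):
    """Return (mean, sample standard deviation) of per-seed scores.

    With a single seed the standard deviation is undefined, so 0.0 is
    returned as a placeholder (an assumption of this sketch).
    """
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std
```

For example, `summarize_seeds([1.0, 2.0, 3.0])` returns `(2.0, 1.0)`; these inputs are placeholders, not results from the paper.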

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate the changes we will make in the revised manuscript.


Referee: §4 (OOD generalization results): the claim that RLHF's advantage grows with larger distribution shift is supported only by categorical dataset swaps; no continuous quantification of shift (e.g., mean embedding cosine distance, KL divergence, or Wasserstein distance between train and test distributions) is reported, so the 'particularly as the shift becomes larger' qualifier rests on an unmeasured ordinal ranking of the chosen test sets.

Authors: We agree that quantifying the distribution shifts continuously would strengthen the claim. In the revision we will compute and report mean cosine distances between sentence embeddings of the training and test distributions for each dataset pair, allowing us to relate the size of the generalization advantage to a numeric measure of shift rather than relying solely on the categorical ordering.

Revision: yes

Referee: §3.2 (Diversity evaluation): the reported diversity measures (distinct-n, entropy, self-BLEU, etc.) are primarily lexical/surface-form statistics; the manuscript does not demonstrate that these correlate with semantic or functional diversity relevant to downstream use cases, leaving open the possibility that the observed diversity reduction is an artifact of the chosen proxies rather than a general property of RLHF.

Authors: We acknowledge that the primary metrics are lexical. While these are standard in the literature, we will add a supplementary analysis using embedding-based semantic diversity in the revision and include a brief discussion of the limitations of surface-form proxies for functional diversity. The consistent pattern across several lexical metrics still provides evidence of a reduction, but the added semantic check will address the concern directly.

Revision: partial
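
The two embedding-based quantities proposed in the responses above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: embeddings are already computed by some sentence-embedding model (none is named in the source), "mean cosine distance between train and test" is read as the average over all train-test pairs, and "semantic diversity" is read as the average pairwise distance within one set of outputs. All function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def mean_cross_distance(train_emb, test_emb):
    """Average cosine distance over all (train, test) embedding pairs.

    One possible numeric measure of train-test shift; larger is farther.
    """
    dists = [cosine_distance(u, v) for u in train_emb for v in test_emb]
    return sum(dists) / len(dists)


def mean_pairwise_distance(output_emb):
    """Average cosine distance over all unordered pairs within one set.

    One possible embedding-based diversity score for sampled outputs.
    """
    n = len(output_emb)
    dists = [
        cosine_distance(output_emb[i], output_emb[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return sum(dists) / len(dists) if dists else 0.0
```

The all-pairs loops are quadratic in the number of examples, so for large datasets a subsample or a vectorized implementation would be the practical choice.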

Circularity Check

0 steps flagged

No circularity: purely empirical comparisons with no derivations


The paper reports direct experimental results from training SFT, reward, and RLHF models on summarization and instruction-following tasks, then measuring OOD generalization (via accuracy/ROUGE on held-out sets) and diversity (via distinct-n, entropy, etc.). No mathematical derivation, first-principles prediction, or fitted-parameter renaming occurs; the central claims are simply the observed differences between the three fine-tuning stages. Any self-citations are incidental background and do not support the load-bearing empirical findings, which are independently verifiable from the reported training and evaluation protocols.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical study that relies on standard machine-learning assumptions about model training and evaluation rather than introducing new theoretical constructs.

axioms (1)

Domain assumption: Summarization and instruction-following tasks are representative of typical LLM use cases for measuring generalization and diversity.
The analysis extrapolates from these two tasks to broader LLM behavior.

pith-pipeline@v0.9.0 · 5824 in / 1263 out tokens · 48354 ms · 2026-05-19T02:29:27.172883+00:00 · methodology


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ORPO: Monolithic Preference Optimization without Reference Model
cs.CL 2024-03 conditional novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...

Ex Ante Evaluation of AI-Induced Idea Diversity Collapse
cs.AI 2026-05 unverdicted novelty 7.0

Frontier LLMs generate creative ideas with excess population-level crowding below human-relative parity across tasks, but targeted generation protocols can reduce it.

EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention
cs.SE 2025-08 unverdicted novelty 7.0

EyeMulator augments CodeLLM fine-tuning loss with token weights derived from human eye-tracking scan paths, producing large gains on code translation and summarization across StarCoder, Llama-3.2 and DeepSeek-Coder.

What should post-training optimize? A test-time scaling law perspective
cs.LG 2026-05 unverdicted novelty 6.0

Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.

Annotations Mitigate Post-Training Mode Collapse
cs.CL 2026-05 unverdicted novelty 6.0

Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.

Post-training makes large language models less human-like
cs.CL 2026-05 unverdicted novelty 6.0

Post-training reduces LLMs' behavioral alignment with humans across families and sizes, with the misalignment increasing in newer generations while persona induction fails to improve individual-level predictions.

PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
cs.AI 2026-05 unverdicted novelty 6.0

PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.

Diversity in Large Language Models under Supervised Fine-Tuning
cs.LG 2026-04 unverdicted novelty 6.0

TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.

The Surprising Universality of LLM Outputs: A Real-Time Verification Primitive
cs.CR 2026-04 unverdicted novelty 6.0

LLM token rank-frequency distributions converge to a shared Mandelbrot distribution across models and domains, enabling a microsecond-scale statistical primitive for provenance verification and black-box anomaly triage.

LLM Safety From Within: Detecting Harmful Content with Internal Representations
cs.AI 2026-04 unverdicted novelty 6.0

SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

Differences in Text Generated by Diffusion and Autoregressive Language Models
cs.CL 2026-04 unverdicted novelty 6.0

DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.

BEAGLE: Behavior-Enforced Agent for Grounded Learner Emulation
cs.AI 2026-02 unverdicted novelty 6.0

BEAGLE uses a semi-Markov model, Bayesian knowledge tracing with injected flaws, and decoupled strategy-code actions to make LLM agents produce authentic student learning trajectories that humans cannot distinguish fr...

Culinary Crossroads: A RAG Framework for Enhancing Diversity in Cross-Cultural Recipe Adaptation
cs.CL 2025-07 unverdicted novelty 6.0

CARRIAGE is a RAG framework that improves output diversity in cross-cultural recipe adaptation by enhancing retrieval and context handling, reaching Pareto efficiency on diversity and quality versus closed-book LLMs.

Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling
cs.CL 2025-07 unverdicted novelty 6.0

REFORM uses reward-guided controlled decoding to generate adversarial failures and augments training data to improve reward model robustness on preference datasets.

Towards Understanding Sycophancy in Language Models
cs.CL 2023-10 conditional novelty 6.0

Sycophancy is prevalent in state-of-the-art AI assistants and is likely driven in part by human preferences that favor agreement over truthfulness.

Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning
cs.AI 2026-05 unverdicted novelty 5.0

Novelty estimation via LLM prompts enables pruning in Tree-of-Thought search, reducing overall token usage on language planning benchmarks.

Diversity in Large Language Models under Supervised Fine-Tuning
cs.LG 2026-04 unverdicted novelty 5.0

Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.

Agentic Reasoning for Large Language Models
cs.AI 2026-01 unverdicted novelty 4.0

The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...

A Survey of Reinforcement Learning for Large Reasoning Models
cs.CL 2025-09 accept novelty 3.0

A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
