hub

Training language models to follow instructions with human feedback , url =

Ouyang, Long, Wu, Jeffrey, Jiang, Xu, Almeida, Diogo, Wainwright, Carroll, Mishkin, Pamela

15 Pith papers cite this work. Polarity classification is still indexing.

15 Pith papers citing it

browse 15 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

WildChat: 1M ChatGPT Interaction Logs in the Wild

cs.CL · 2024-05-02 · accept · novelty 8.0

WildChat releases a dataset of 1 million ChatGPT conversations with timestamps, demographics, and headers, claimed to be the most diverse and multilingual such resource available.

Instance-Optimal Estimation with Multiple LLM Judges on a Budget

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Introduces budgeted heteroskedastic multi-judge estimation and proves instance-optimality of an adaptive inverse-variance weighted estimator via matching upper and lower bounds.

CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.

A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.

CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation

cs.CL · 2025-02-28 · unverdicted · novelty 7.0

CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8k at GPT-2 scale with 3.1x compression and 28.2% higher accuracy than prior implicit approaches.

More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts

cs.CL · 2026-05-21 · unverdicted · novelty 6.0 · 2 refs

Context and retrieved moral knowledge improve sentence-level Schwartz value detection more consistently than model scaling, with early-fusion RAG outperforming other variants in matched comparisons.

More Aligned, Less Diverse? Analyzing the Grammar and Lexicon of Two Generations of LLMs

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

Newer LLMs exhibit reduced syntactic and lexical diversity in English news text generation compared to older models, as measured by HPSG grammar and diversity metrics from ecology and information theory, while human-authored text shows little change.

SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

SCURank ranks multiple summary candidates with Summary Content Units to outperform ROUGE and LLM-based methods in summarization distillation.

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

cs.AI · 2024-08-13 · unverdicted · novelty 6.0

Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.

torchtune: PyTorch native post-training library

cs.LG · 2026-05-20 · unverdicted · novelty 5.0

torchtune is a modular PyTorch library for LLM post-training that delivers competitive performance and memory efficiency while supporting rapid research iteration through hackable components.

PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling

cs.CL · 2026-05-19 · unverdicted · novelty 5.0 · 2 refs

PromptRad reformulates multi-label radiology report classification as masked language modeling and enriches verbalizers with UMLS synonyms, outperforming baselines with only 32 training examples.

DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 5.0

DACA-GRPO adds denoising-aware credit assignment and bias-reduced likelihood estimation to GRPO, delivering consistent gains up to 36.3pp on math, code, constraint, and schema benchmarks for diffusion LLMs.

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

cs.LG · 2024-02-18 · unverdicted · novelty 5.0

POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.

Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization

cs.LG · 2026-05-06 · unverdicted · novelty 4.0

Outcome-level RL with binary or composite rewards improves compositional generalization over supervised fine-tuning by avoiding overfitting to frequent training patterns.

citing papers explorer

Showing 15 of 15 citing papers.

WildChat: 1M ChatGPT Interaction Logs in the Wild cs.CL · 2024-05-02 · accept · none · ref 54
WildChat releases a dataset of 1 million ChatGPT conversations with timestamps, demographics, and headers, claimed to be the most diverse and multilingual such resource available.
Instance-Optimal Estimation with Multiple LLM Judges on a Budget cs.LG · 2026-05-22 · unverdicted · none · ref 8
Introduces budgeted heteroskedastic multi-judge estimation and proves instance-optimality of an adaptive inverse-variance weighted estimator via matching upper and lower bounds.
CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization cs.LG · 2026-05-09 · unverdicted · none · ref 54
CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.
A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis cs.CL · 2026-05-02 · unverdicted · none · ref 84
Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF cs.CL · 2026-04-20 · unverdicted · none · ref 2
R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation cs.CL · 2025-02-28 · unverdicted · none · ref 1
CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8k at GPT-2 scale with 3.1x compression and 28.2% higher accuracy than prior implicit approaches.
More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts cs.CL · 2026-05-21 · unverdicted · none · ref 10 · 2 links
Context and retrieved moral knowledge improve sentence-level Schwartz value detection more consistently than model scaling, with early-fusion RAG outperforming other variants in matched comparisons.
More Aligned, Less Diverse? Analyzing the Grammar and Lexicon of Two Generations of LLMs cs.CL · 2026-05-07 · unverdicted · none · ref 62
Newer LLMs exhibit reduced syntactic and lexical diversity in English news text generation compared to older models, as measured by HPSG grammar and diversity metrics from ecology and information theory, while human-authored text shows little change.
SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization cs.CL · 2026-04-21 · unverdicted · none · ref 30
SCURank ranks multiple summary candidates with Summary Content Units to outperform ROUGE and LLM-based methods in summarization distillation.
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents cs.AI · 2024-08-13 · unverdicted · none · ref 118
Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.
torchtune: PyTorch native post-training library cs.LG · 2026-05-20 · unverdicted · none · ref 24
torchtune is a modular PyTorch library for LLM post-training that delivers competitive performance and memory efficiency while supporting rapid research iteration through hackable components.
PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling cs.CL · 2026-05-19 · unverdicted · none · ref 60 · 2 links
PromptRad reformulates multi-label radiology report classification as masked language modeling and enriches verbalizers with UMLS synonyms, outperforming baselines with only 32 training examples.
DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models cs.LG · 2026-05-08 · unverdicted · none · ref 16
DACA-GRPO adds denoising-aware credit assignment and bias-reduced likelihood estimation to GRPO, delivering consistent gains up to 36.3pp on math, code, constraint, and schema benchmarks for diffusion LLMs.
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning cs.LG · 2024-02-18 · unverdicted · none · ref 36
POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.
Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization cs.LG · 2026-05-06 · unverdicted · none · ref 23
Outcome-level RL with binary or composite rewards improves compositional generalization over supervised fine-tuning by avoiding overfitting to frequent training patterns.

Training language models to follow instructions with human feedback , url =

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer