RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
17 Pith papers cite this work. Polarity classification is still indexing.
Citation-role summary: background (2)
Citation-polarity summary: still indexing
Citing papers explorer
- PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token count by 55% on TIMIT.
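For intuition on the evaluation the summary mentions, here is one plausible way to score cross-view consistency via normalized edit distance between the token sequences of two views; the function names and normalization are illustrative assumptions, not PairAlign's published code.

```python
def edit_distance(a: list, b: list) -> int:
    """Levenshtein distance between two token sequences (classic DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def cross_view_consistency(tokens_a: list, tokens_b: list) -> float:
    """1.0 when both views tokenize identically, 0.0 when fully different."""
    denom = max(len(tokens_a), len(tokens_b), 1)
    return 1.0 - edit_distance(tokens_a, tokens_b) / denom
```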
- Personalizing Text-to-Image Generation to Individual Taste
PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
- Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
- Self-Rewarding Language Models
Iterative self-rewarding, in which Llama 2 70B scores its own generations as an LLM-as-Judge to build DPO preference pairs, improves both instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
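A minimal sketch of one such round, with generation and judging passed in as callables since the judge prompt and sampling setup are not specified here; the helper names are hypothetical.

```python
from typing import Callable, List, Tuple

def build_self_reward_pairs(
    generate: Callable[[str], str],
    judge: Callable[[str, str], float],
    prompts: List[str],
    n_candidates: int = 4,
) -> List[Tuple[str, str, str]]:
    """Form DPO preference pairs by letting the model score its own samples."""
    pairs = []
    for prompt in prompts:
        # 1. Sample several candidate responses from the current model.
        candidates = [generate(prompt) for _ in range(n_candidates)]
        # 2. The same model, prompted as a judge, scores each candidate.
        scores = [judge(prompt, c) for c in candidates]
        # 3. Highest- vs. lowest-scored responses become a preference pair.
        chosen = candidates[scores.index(max(scores))]
        rejected = candidates[scores.index(min(scores))]
        pairs.append((prompt, chosen, rejected))
    return pairs  # feed to a standard DPO trainer; repeat with the new model
```

Each round's pairs feed a standard DPO update, and the updated model serves as both generator and judge in the next round.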
- BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning
BalCapRL applies balanced multi-objective RL with GDPO-style normalization and length-conditional masking to improve MLLM image captioning, reporting gains of up to +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena on LLaVA and Qwen models.
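A minimal sketch of per-objective group normalization in the spirit of the balancing described above; the reward names, array shape, and plain sum-aggregation are assumptions, and the paper's length-conditional masking is omitted.

```python
import numpy as np

def balanced_group_advantage(rewards: np.ndarray) -> np.ndarray:
    """rewards: (group_size, n_objectives) scores for one prompt's rollouts.

    Z-score each objective within the rollout group before summing, so no
    single reward's scale dominates the policy gradient."""
    mean = rewards.mean(axis=0, keepdims=True)
    std = rewards.std(axis=0, keepdims=True) + 1e-8
    per_objective = (rewards - mean) / std
    return per_objective.sum(axis=1)

# Four rollouts scored on (detail, accuracy), deliberately on different scales:
adv = balanced_group_advantage(
    np.array([[80.0, 0.2], [60.0, 0.9], [90.0, 0.4], [70.0, 0.7]]))
print(adv)  # one balanced advantage per rollout
```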
- WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning
WebGen-R1 uses end-to-end RL with scaffold-driven generation and cascaded rewards for structure, function, and aesthetics to transform a 7B model into a generator of deployable multi-page websites that rivals much larger models.
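One way to read "cascaded rewards": later objectives earn credit only after earlier ones clear a gate. A toy version, with the threshold and additive form purely illustrative:

```python
def cascaded_reward(structure: float, function: float, aesthetics: float,
                    gate: float = 0.5) -> float:
    """Toy cascade: each later objective counts only after the previous one
    clears a gate, so the policy learns valid structure before working
    behavior, and working behavior before visual polish."""
    reward = structure
    if structure >= gate:
        reward += function
        if function >= gate:
            reward += aesthetics
    return reward

print(cascaded_reward(0.9, 0.2, 0.8))  # 1.1: aesthetics gated out by function
```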
- GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification
GRASP improves multimodal sarcasm target identification by anchoring visual regions in grounded chain-of-thought reasoning and using dual-stage optimization on a new balanced dataset.
- Valve: Production Online-Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate
Valve jointly bounds preemption latency and rate for online-offline LLM colocation on GPUs, delivering 34.6% higher cluster utilization and a 2,170-GPU saving in a production deployment of 8,054 GPUs with under 5% TTFT and 2% TPOT impact.
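To unpack "jointly bounded", here is a toy admission check that enforces a per-event stall bound and a token-bucket preemption-rate budget at the same time; it sketches the idea only and is not Valve's actual scheduler.

```python
import time

class PreemptionLimiter:
    """Toy admission check enforcing two bounds jointly: a per-event stall
    (latency) bound and a token-bucket preemption-rate budget."""

    def __init__(self, max_stall_s: float, max_rate_per_s: float):
        self.max_stall_s = max_stall_s
        self.rate = max_rate_per_s
        self.tokens = max_rate_per_s   # bucket starts full
        self.last = time.monotonic()

    def admit(self, estimated_stall_s: float) -> bool:
        now = time.monotonic()
        # Refill the bucket in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if estimated_stall_s > self.max_stall_s or self.tokens < 1.0:
            return False               # would violate one of the two bounds
        self.tokens -= 1.0
        return True
```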
- HybridFlow: A Flexible and Efficient RLHF Framework
HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
SPIN strengthens weak LLMs through self-play: each round, the previous model version generates synthetic responses, and the current model is trained to prefer the human-annotated responses over those self-generated ones, outperforming DPO even when DPO is given extra GPT-4 preference data.
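Up to notation, each self-play round minimizes a DPO-style logistic objective in which the "chosen" response comes from the SFT data and the "rejected" one from the previous iterate $p_{\theta_t}$:

$$\min_\theta\; \mathbb{E}_{x\sim q,\; y\sim p_{\mathrm{data}}(\cdot\mid x),\; y'\sim p_{\theta_t}(\cdot\mid x)}\;\ell\!\left(\lambda\log\frac{p_\theta(y\mid x)}{p_{\theta_t}(y\mid x)}-\lambda\log\frac{p_\theta(y'\mid x)}{p_{\theta_t}(y'\mid x)}\right),\qquad \ell(t)=\log(1+e^{-t}).$$

Solving this round by round moves $p_\theta$ toward the data distribution without any new human labels.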
- Probably Approximately Consensus: On the Learning Theory of Finding Common Ground
The paper models consensus as a PAC-learnable interval in an embedded one-dimensional opinion space, found via ERM over interval hypotheses that maximizes expected agreement across a distribution of issues.
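One plausible formalization consistent with this summary (notation assumed, not the paper's): with issues $x_1,\dots,x_m$ drawn i.i.d. and $A(I,x)$ the fraction of participants whose embedded opinion on issue $x$ falls inside interval $I$, ERM picks

$$\hat I = \arg\max_{I=[a,b]}\; \frac{1}{m}\sum_{j=1}^{m} A(I, x_j),$$

and since intervals have VC dimension 2, uniform convergence gives $\mathbb{E}_x[A(\hat I,x)] \ge \sup_I \mathbb{E}_x[A(I,x)] - \epsilon$ with probability $1-\delta$ once $m = O(\epsilon^{-2}\log(1/\delta))$.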
- Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
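Read as a formula (an illustrative formalization, not the paper's exact statement): if a high-dimensional true objective $R^*$ is compressed into a low-dimensional proxy $r_\phi \approx C(R^*)$, reward hacking is the gap that opens under optimization pressure,

$$\pi^\dagger = \arg\max_\pi\; \mathbb{E}[r_\phi(\pi)] \quad\text{yet}\quad \mathbb{E}[R^*(\pi^\dagger)] \ll \max_\pi \mathbb{E}[R^*(\pi)],$$

a Goodhart-style divergence that grows as the policy exploits directions the compression discarded.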
- OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems
OOM-RL aligns multi-agent LLM systems for software engineering by using real financial-market losses as an unhackable negative gradient; under a strict test-driven workflow it reports a mature-phase annualized Sharpe ratio of 2.06.
- Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications
RAG is more effective and cost-efficient than fine-tuning for industrial QA adaptation on automotive datasets.
- ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration
ARIS is a three-layer open-source system that uses cross-model adversarial collaboration plus claim-auditing pipelines to make LLM-driven research workflows more reliable.
- A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
- Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.