hub

Raft: Reward ranked finetuning for generative foundation model alignment

Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, Tong Zhang · 2023 · arXiv 2304.06767

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

read on arXiv browse 12 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.

Beyond Static Best-of-N: Bayesian List-wise Alignment for LLM-based Recommendation

cs.IR · 2026-05-06 · conditional · novelty 7.0

BLADE uses Bayesian list-wise alignment with dynamic estimation to create a self-evolving target that overcomes limitations of static references in LLM-based recommendation, yielding sustained gains in ranking and complex metrics.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.

CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

CauSim turns scarce causal reasoning labels into scalable supervised data by having LLMs incrementally construct complex executable structural causal models.

Response Time Enhances Alignment with Heterogeneous Preferences

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

AlignCultura: Towards Culturally Aligned Large Language Models?

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.

Bias at the End of the Score

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

Reward models used as quality scorers in text-to-image generation encode demographic biases that cause reward-guided training to sexualize female subjects, reinforce stereotypes, and reduce diversity.

Improving Video Generation with Human Feedback

cs.CV · 2025-01-23 · unverdicted · novelty 6.0

A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.

Reinforced Self-Training (ReST) for Language Modeling

cs.CL · 2023-08-17 · unverdicted · novelty 6.0

ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.

ReMedi: Reasoner for Medical Clinical Prediction

cs.CL · 2026-05-02 · unverdicted · novelty 5.0

ReMedi boosts LLM performance on EHR clinical predictions by up to 19.9% F1 through ground-truth-guided rationale regeneration and fine-tuning.

Curr-RLCER:Curriculum Reinforcement Learning For Coherence Explainable Recommendation

cs.IR · 2026-04-07 · unverdicted · novelty 4.0

Curr-RLCER applies curriculum reinforcement learning with coherence-driven rewards to align generated explanations with predicted ratings in explainable recommendation systems.

citing papers explorer

Showing 12 of 12 citing papers.

Flow-GRPO: Training Flow Matching Models via Online RL cs.CV · 2025-05-08 · unverdicted · none · ref 37
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs cs.CV · 2026-05-10 · unverdicted · none · ref 7
PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
Beyond Static Best-of-N: Bayesian List-wise Alignment for LLM-based Recommendation cs.IR · 2026-05-06 · conditional · none · ref 13
BLADE uses Bayesian list-wise alignment with dynamic estimation to create a self-evolving target that overcomes limitations of static references in LLM-based recommendation, yielding sustained gains in ranking and complex metrics.
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching cs.CL · 2026-05-12 · unverdicted · none · ref 106
TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators cs.AI · 2026-05-09 · unverdicted · none · ref 42
CauSim turns scarce causal reasoning labels into scalable supervised data by having LLMs incrementally construct complex executable structural causal models.
Response Time Enhances Alignment with Heterogeneous Preferences cs.LG · 2026-05-07 · unverdicted · none · ref 48
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
AlignCultura: Towards Culturally Aligned Large Language Models? cs.CL · 2026-04-21 · unverdicted · none · ref 20
Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.
Bias at the End of the Score cs.CV · 2026-04-14 · unverdicted · none · ref 13
Reward models used as quality scorers in text-to-image generation encode demographic biases that cause reward-guided training to sexualize female subjects, reinforce stereotypes, and reduce diversity.
Improving Video Generation with Human Feedback cs.CV · 2025-01-23 · unverdicted · none · ref 14
A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.
Reinforced Self-Training (ReST) for Language Modeling cs.CL · 2023-08-17 · unverdicted · none · ref 7
ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.
ReMedi: Reasoner for Medical Clinical Prediction cs.CL · 2026-05-02 · unverdicted · none · ref 67
ReMedi boosts LLM performance on EHR clinical predictions by up to 19.9% F1 through ground-truth-guided rationale regeneration and fine-tuning.
Curr-RLCER:Curriculum Reinforcement Learning For Coherence Explainable Recommendation cs.IR · 2026-04-07 · unverdicted · none · ref 3
Curr-RLCER applies curriculum reinforcement learning with coherence-driven rewards to align generated explanations with predicted ratings in explainable recommendation systems.

Raft: Reward ranked finetuning for generative foundation model alignment

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer