Aligning dialogue agents with global feedback via large language model reward decomposition

Lee, Dong Won; Park, Hae Won; Breazeal, Cynthia; Morency, Louis-Philippe · 2025 · arXiv 2505.15922

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

RREDCoT: Segment-Level Reward Redistribution for Reasoning Models

cs.LG · 2026-06-04 · unverdicted · novelty 5.0

RREDCoT approximates segment-level reward redistribution for CoT traces by querying the model itself, offering a lower-cost alternative to Monte Carlo credit assignment in reasoning-model RL.

Trust Region On-Policy Distillation

cs.LG · 2026-05-31 · unverdicted · novelty 5.0

TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.

A Survey of Reinforcement Learning for Large Reasoning Models

cs.CL · 2025-09-10 · accept · novelty 3.0

A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

citing papers explorer

Showing 2 of 2 citing papers after filters.

RREDCoT: Segment-Level Reward Redistribution for Reasoning Models · cs.LG · 2026-06-04 · unverdicted · none · ref 42
RREDCoT approximates segment-level reward redistribution for CoT traces by querying the model itself, offering a lower-cost alternative to Monte Carlo credit assignment in reasoning-model RL.
Trust Region On-Policy Distillation · cs.LG · 2026-05-31 · unverdicted · none · ref 147
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
