Reward models in deep reinforcement learning: A survey

Reward models in deep reinforcement learning: A survey · 2025 · arXiv 2506.15421

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

AudioProcessBench: Benchmark for Identifying Process Errors in Audio-Grounded Reasoning

cs.SD · 2026-06-07 · unverdicted · novelty 7.0

AudioProcessBench is a new benchmark with segmented and annotated reasoning traces from six audio and omni-language models for step correctness identification and error-type detection in audio-grounded reasoning.

D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

cs.AI · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.

Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning

cs.LG · 2026-04-22 · conditional · novelty 6.0

Occupancy Reward Shaping extracts goal-reaching rewards from world-model occupancy measures using optimal transport, improving offline goal-conditioned RL performance 2.2x on 13 tasks without changing the optimal policy.

Multi-objective Reinforcement Learning With Augmented States Requires Rewards After Deployment

cs.LG · 2026-04-17 · accept · novelty 5.0

MORL with augmented states for non-linear utilities requires ongoing reward signal access post-deployment.

citing papers explorer

Showing 4 of 4 citing papers.

AudioProcessBench: Benchmark for Identifying Process Errors in Audio-Grounded Reasoning

cs.SD · 2026-06-07 · unverdicted · none · ref 3

D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

cs.AI · 2026-05-13 · unverdicted · none · ref 27 · 2 links

Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning

cs.LG · 2026-04-22 · conditional · none · ref 23

Multi-objective Reinforcement Learning With Augmented States Requires Rewards After Deployment

cs.LG · 2026-04-17 · accept · none · ref 24
