AudioProcessBench is a new benchmark with segmented and annotated reasoning traces from six audio and omni-language models for step correctness identification and error-type detection in audio-grounded reasoning.
Reward models in deep reinforcement learning: A survey
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4representative citing papers
D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.
Occupancy Reward Shaping extracts goal-reaching rewards from world-model occupancy measures using optimal transport, improving offline goal-conditioned RL performance 2.2x on 13 tasks without changing the optimal policy.
MORL with augmented states for non-linear utilities requires ongoing reward signal access post-deployment.
citing papers explorer
-
AudioProcessBench: Benchmark for Identifying Process Errors in Audio-Grounded Reasoning
AudioProcessBench is a new benchmark with segmented and annotated reasoning traces from six audio and omni-language models for step correctness identification and error-type detection in audio-grounded reasoning.