The Reward Hacking Benchmark shows RL post-training raises exploit rates in tool-using LLM agents from 0.6% to 13.9%, with environmental hardening cutting exploits by 87.7% relative without lowering task success.
arXiv preprint arXiv:2505.07291 , year=
5 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 5representative citing papers
Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.
AReaL decouples generation and training in LLM reinforcement learning to achieve up to 2.77x speedup with matched or better performance on math and code benchmarks.
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.
citing papers explorer
-
Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use
The Reward Hacking Benchmark shows RL post-training raises exploit rates in tool-using LLM agents from 0.6% to 13.9%, with environmental hardening cutting exploits by 87.7% relative without lowering task success.
-
Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing
Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.
-
AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
AReaL decouples generation and training in LLM reinforcement learning to achieve up to 2.77x speedup with matched or better performance on math and code benchmarks.
-
Trust Region On-Policy Distillation
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
-
PubSwap: Public-Data Off-Policy Coordination for Federated RLVR
PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.