pith. machine review for the scientific record.

arxiv: 2512.13399 · v2 · submitted 2025-12-15 · 💻 cs.AI · cs.CL

Recognition: no theorem link

Differentiable Evolutionary Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 22:31 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords reinforcement learning · reward optimization · bi-level optimization · differentiable evolution · policy gradients · meta-learning · generalization · autonomous reward design

The pith

DERL makes evolutionary reward search differentiable by updating a meta-optimizer with policy gradients from inner-loop validation performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DERL, a bi-level framework that automates reward design for reinforcement learning on complex tasks. An outer meta-optimizer composes reward functions from atomic primitives and guides an inner policy learner, but unlike prior black-box evolutionary methods it derives updates for the meta-optimizer directly from policy gradients on validation performance. This supplies dense feedback that lets the system progressively learn a meta-gradient for task success. Experiments across embodied agents in ALFWorld, scientific simulations in ScienceWorld, and math reasoning on GSM8K and MATH show state-of-the-art results with notably stronger out-of-distribution generalization than non-differentiable baselines. Trajectory analysis is presented as evidence that the discovered rewards capture the intrinsic causal structure of the tasks.

Core claim

DERL is a bi-level framework for autonomous reward discovery in which a Meta-Optimizer evolves reward functions through composition of structured atomic primitives to guide an inner-loop policy. Differentiability enters the meta-optimization by updating the Meta-Optimizer with policy gradients computed from the inner loop's validation performance, replacing derivative-free search with dense, actionable feedback that progressively learns a meta-gradient for task success. The approach is validated on embodied agent, scientific simulation, and mathematical reasoning benchmarks, where it reports state-of-the-art performance and improved out-of-distribution generalization, while trajectory analyses are offered as evidence that the discovered rewards capture the intrinsic causal structure of the tasks.
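To make the bi-level structure concrete, here is a minimal runnable sketch of how such a loop could be wired: an outer distribution over primitive weights is updated with a score-function (REINFORCE-style) gradient of inner-loop validation success. The chain environment, the three primitives, and the Gaussian meta-distribution are illustrative assumptions, not the paper's implementation; only the loop shape follows the claim above.

```python
# Hedged sketch of a DERL-style bi-level loop on a toy chain task.
# The environment, primitives, and Gaussian meta-policy are placeholders.
import numpy as np

rng = np.random.default_rng(0)
N, HORIZON = 8, 20                                # chain length, episode horizon

def composed_reward(w, s, s_next):
    """Weighted sum of three atomic primitives: progress, step cost, goal bonus."""
    progress = float(s_next - s) / (N - 1)
    step_cost = -1.0 / HORIZON
    goal_bonus = 1.0 if s_next == N - 1 else 0.0
    return w[0] * progress + w[1] * step_cost + w[2] * goal_bonus

def run_episode(theta, w=None):
    """Roll out a tabular softmax policy from state 0.
    Returns (log-prob gradients, shaped return, true task success)."""
    s, grads, ret, success = 0, [], 0.0, 0.0
    for _ in range(HORIZON):
        logits = theta[s]
        p = np.exp(logits - logits.max())
        p /= p.sum()
        a = rng.choice(2, p=p)                    # 0 = left, 1 = right
        s_next = min(max(s + (1 if a == 1 else -1), 0), N - 1)
        g = -p
        g[a] += 1.0                               # d log pi(a|s) / d logits
        grads.append((s, g))
        if w is not None:
            ret += composed_reward(w, s, s_next)
        if s_next == N - 1:                       # sparse true success signal
            success = 1.0
            break
        s = s_next
    return grads, ret, success

def inner_loop(w, steps=200, lr=0.5):
    """Inner loop: train a fresh policy under the composed Meta-Reward."""
    theta = np.zeros((N, 2))
    for _ in range(steps):
        grads, ret, _ = run_episode(theta, w)
        for s, g in grads:
            theta[s] += lr * ret * g              # plain REINFORCE on the shaped return
    return theta

def validation_success(theta, episodes=20):
    """Held-out evaluation scored only by the true sparse success metric."""
    return float(np.mean([run_episode(theta)[2] for _ in range(episodes)]))

# Outer loop: a Gaussian "meta-policy" over primitive weights, updated with a
# score-function gradient of validation success -- one reading of "policy
# gradients derived from inner-loop validation performance".
phi, sigma, meta_lr, baseline = np.zeros(3), 0.3, 0.5, 0.0
for outer_step in range(30):
    w = phi + sigma * rng.standard_normal(3)      # sample a reward composition
    theta = inner_loop(w)                         # inner-loop policy training
    v = validation_success(theta)                 # outer feedback signal
    phi += meta_lr * (v - baseline) * (w - phi) / sigma**2
    baseline = 0.9 * baseline + 0.1 * v           # EMA baseline for variance reduction
    print(outer_step, round(v, 2), np.round(phi, 2))
```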

What carries the argument

The Meta-Optimizer that composes reward functions from atomic primitives and receives updates via policy gradients derived from inner-loop validation performance.
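As a small illustration of "composition of structured atomic primitives", the snippet below builds a parameterized Meta-Reward as a weighted combination of hypothetical primitives. The paper's actual primitive library and composition grammar are not specified here, so every primitive named below is an assumption for illustration only.

```python
# Hedged sketch of composing atomic reward primitives into a parameterized
# Meta-Reward. Primitive names and the linear composition rule are assumptions.
from typing import Callable, Dict

Primitive = Callable[[dict], float]   # maps a transition record to a scalar

PRIMITIVES: Dict[str, Primitive] = {
    "subgoal_hit":   lambda t: 1.0 if t.get("subgoal_reached") else 0.0,
    "step_penalty":  lambda t: -0.01,
    "distance_gain": lambda t: t.get("prev_dist", 0.0) - t.get("dist", 0.0),
    "task_success":  lambda t: 1.0 if t.get("done_success") else 0.0,
}

def compose_reward(weights: Dict[str, float]) -> Callable[[dict], float]:
    """Return a Meta-Reward: a weighted combination of the atomic primitives."""
    def meta_reward(transition: dict) -> float:
        return sum(w * PRIMITIVES[name](transition)
                   for name, w in weights.items())
    return meta_reward

# Example: a composition the Meta-Optimizer might propose at some outer step.
reward_fn = compose_reward({"distance_gain": 0.7, "step_penalty": 1.0,
                            "task_success": 2.0})
print(reward_fn({"prev_dist": 4.0, "dist": 3.0, "done_success": False}))
```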

If this is right

  • Reward structures can be discovered autonomously without manual engineering for each new task.
  • Meta-optimization receives dense feedback rather than sparse black-box evaluations.
  • Out-of-distribution generalization improves on embodied navigation, scientific simulation, and mathematical reasoning benchmarks.
  • Learned rewards capture the intrinsic causal structure of tasks, supporting self-improving agent alignment.
  • The bi-level setup separates reward evolution from policy learning while keeping both trainable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gradient-through-validation pattern could be applied to evolve other meta-components such as policy architectures or exploration schedules.
  • Modular reward primitives may make the discovered functions easier to inspect or transfer across related environments.
  • If the method scales, it could reduce reliance on human reward design in domains where task success is hard to specify directly.
  • The approach invites tests of whether the learned meta-gradients themselves transfer to new but structurally similar tasks.

Load-bearing premise

Policy gradients derived from inner-loop validation performance supply stable, unbiased, and sufficiently dense signals for updating the outer meta-optimizer without introducing circularity or instability.
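One standard way to read this premise is as a score-function estimator over sampled reward compositions. The identity below is ordinary REINFORCE algebra (an interpretive assumption, not an equation quoted from the paper): it shows why a constant baseline keeps the estimator unbiased, provided the validation return depends on the meta-parameters only through the sampled composition, which is exactly the independence condition the referee questions below.

```latex
% Score-function reading of the meta-update (an assumption about the
% estimator, not a quotation). \phi: Meta-Optimizer parameters,
% r \sim p_\phi: sampled reward composition, V(r): inner-loop validation return.
\nabla_\phi \, \mathbb{E}_{r \sim p_\phi}\!\left[ V(r) \right]
  = \mathbb{E}_{r \sim p_\phi}\!\left[ \bigl( V(r) - b \bigr)\,
      \nabla_\phi \log p_\phi(r) \right],
\qquad
\text{since } \mathbb{E}_{r \sim p_\phi}\!\left[ b \, \nabla_\phi \log p_\phi(r) \right]
  = b \, \nabla_\phi \!\int p_\phi(r)\, dr = 0 .
```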

What would settle it

A controlled ablation that replaces the policy-gradient meta-updates with a derivative-free optimizer while keeping the same reward primitives and inner-loop training, then measures whether the performance and generalization gaps on ALFWorld and GSM8K disappear.
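A harness for that ablation could look like the sketch below, which holds the reward parameterization and inner loop fixed and swaps only the outer update rule. `train_and_validate` is a stand-in for the real inner-loop training plus held-out evaluation (a smooth dummy objective here so the harness runs), and (1+1) hill climbing is one representative derivative-free choice.

```python
# Sketch of the controlled ablation: identical primitives and inner loop,
# two outer update rules. All specifics below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

def train_and_validate(weights):
    """Placeholder for: train a policy under the composed reward given by
    `weights`, then return held-out task success."""
    target = np.array([0.7, 0.2, 1.0])
    return float(np.exp(-np.sum((weights - target) ** 2)))

def policy_gradient_outer(steps=100, sigma=0.2, lr=0.5):
    """Score-function meta-updates from validation returns (DERL-style reading)."""
    phi, baseline = np.zeros(3), 0.0
    for _ in range(steps):
        w = phi + sigma * rng.standard_normal(3)
        v = train_and_validate(w)
        phi += lr * (v - baseline) * (w - phi) / sigma**2
        baseline = 0.9 * baseline + 0.1 * v
    return train_and_validate(phi)

def derivative_free_outer(steps=100, sigma=0.2):
    """Black-box baseline: (1+1) hill climbing over the same composition space."""
    best_w, best_v = np.zeros(3), train_and_validate(np.zeros(3))
    for _ in range(steps):
        w = best_w + sigma * rng.standard_normal(3)
        v = train_and_validate(w)
        if v > best_v:
            best_w, best_v = w, v
    return best_v

print("policy-gradient outer loop:", round(policy_gradient_outer(), 3))
print("derivative-free outer loop:", round(derivative_free_outer(), 3))
```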

Figures

Figures reproduced from arXiv: 2512.13399 by Difan Zou, Sitao Cheng, Tianle Li, Xuhan Huang, Xunjian Yin.

Figure 1
Figure 1: Illustration of DERL and performance of Meta-Reward. Left: Overview of DERL versus traditional approaches. In DERL, a Meta-Optimizer generates a parameterized Meta-Reward to guide policy model evolution. Crucially, the validation performance serves as a feedback signal to update the Meta-Optimizer via policy gradients, establishing a differentiable, closed-loop optimization process. DERL eliminates the ne…
Figure 2
Figure 2: Bi-level evolutionary training for DERL.
Figure 3
Figure 3: Optimization dynamics on ALFWorld, GSM8k and MATH Benchmark. The plots illus…
Figure 4
Figure 4: Evolution dynamics of reward structures on ALFWorld. We visualize the proportion of Stable Structures and Unstable Structures over outer-loop steps. The consistent upward trend of stable structures demonstrates the Meta-Optimizer's selection preference for mathematical robustness…
Figure 5
Figure 5: Demonstration of the training loop of our differentiable evolutionary reward. We adopt…
Figure 6
Figure 6: Illustration of Gradient Propagation in bi-level evolutionary training loop. The top row…
Figure 7
Figure 7: Training dynamics of DERL-population. We present the training dynamics of DERL…
read the original abstract

Crafting effective reward signals remains a central challenge in Reinforcement Learning (RL), especially for complex reasoning tasks. Existing automated reward optimization methods typically rely on derivative-free search heuristics that treat the reward function as a black box, failing to exploit the causal dynamics between reward structure modifications and policy performance. We introduce Differentiable Evolutionary Reinforcement Learning (DERL), a bi-level framework for the autonomous discovery of optimal reward structures. DERL employs a Meta-Optimizer that evolves a reward function through the composition of structured atomic primitives to guide an inner-loop policy. Unlike prior black-box methods, DERL introduces differentiability into the meta-optimization process by updating the Meta-Optimizer using policy gradients derived from inner-loop validation performance. This allows for the progressive learning of a "meta-gradient" for task success, providing the system with dense, actionable feedback. We validate DERL across diverse reasoning domains: embodied agent (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8K, MATH). Results show that DERL achieves state-of-the-art performance on agent benchmarks, substantially outperforming non-differentiable baselines, especially in out-of-distribution generalization. Trajectory analyses confirm that DERL captures the intrinsic causal structure of tasks, enabling fully autonomous, self-improving agent alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Differentiable Evolutionary Reinforcement Learning (DERL), a bi-level framework in which a Meta-Optimizer evolves reward functions by composing atomic primitives and is updated via policy gradients computed from inner-loop validation performance. The central claim is that this differentiable meta-optimization yields state-of-the-art results on embodied (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8K, MATH) benchmarks, with particular gains in out-of-distribution generalization, by capturing intrinsic task causal structure.

Significance. If the meta-gradients prove stable and unbiased, the approach would represent a meaningful advance over derivative-free reward search methods by supplying dense feedback for reward-structure discovery. The reported OOD improvements, if reproducible, would be of practical interest for aligning agents on complex reasoning tasks without manual reward engineering.

major comments (3)
  1. [§3.2] §3.2 (Meta-Optimizer update rule): the meta-gradient is defined directly from inner-loop validation returns, yet the manuscript provides no explicit guarantee that validation trajectories are independent of the meta-parameters being optimized; this risks circularity and optimistic bias in the reported SOTA and OOD gains.
  2. [§4.1] §4.1 (Experimental protocol): no variance-reduction techniques (control variates, learned baselines, or Hessian-vector products) are described for the policy-gradient estimates that are back-propagated through the outer evolutionary composition; given the known high variance of REINFORCE-style estimators, this omission undermines claims of stable meta-optimization at scale.
  3. [Table 2] Table 2 (Benchmark results): the OOD generalization improvements are presented without error bars, statistical significance tests, or ablation on the validation-split construction, making it impossible to determine whether the gains are robust or sensitive to post-hoc choices.
minor comments (2)
  1. [§1] The abstract and §1 use the term 'meta-gradient' without a precise equation reference; adding an explicit definition (e.g., Eq. (X)) would improve clarity.
  2. [Figure 3] Figure 3 (trajectory analysis) lacks axis labels and scale information, making it difficult to verify the claimed capture of causal structure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below, clarifying our approach where possible and committing to revisions that strengthen the presentation and rigor of the results.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Meta-Optimizer update rule): the meta-gradient is defined directly from inner-loop validation returns, yet the manuscript provides no explicit guarantee that validation trajectories are independent of the meta-parameters being optimized; this risks circularity and optimistic bias in the reported SOTA and OOD gains.

    Authors: We agree that explicit independence between validation trajectories and meta-parameters is essential to avoid bias. In the DERL bi-level setup, the inner-loop policy is trained exclusively on training trajectories using the current reward composition, while the meta-gradient is computed solely from a held-out validation set whose trajectories are generated after inner-loop convergence and never participate in policy optimization. This separation is implicit in our experimental protocol but was not stated clearly in §3.2. We will add an explicit paragraph clarifying the train/validation split, the generation of independent validation rollouts, and a short argument that the meta-gradient therefore reflects out-of-distribution performance rather than circular feedback. No change to the core algorithm is required. revision: partial

  2. Referee: [§4.1] §4.1 (Experimental protocol): no variance-reduction techniques (control variates, learned baselines, or Hessian-vector products) are described for the policy-gradient estimates that are back-propagated through the outer evolutionary composition; given the known high variance of REINFORCE-style estimators, this omission undermines claims of stable meta-optimization at scale.

    Authors: The referee correctly notes that §4.1 does not describe variance reduction. In our implementation we applied a simple exponential moving-average baseline to the REINFORCE estimator before back-propagation through the meta-optimizer; this was sufficient to obtain stable meta-optimization on the reported benchmarks. We will revise §4.1 to document this baseline explicitly, report its effect on gradient variance, and add a brief ablation comparing runs with and without the baseline. We do not claim the current estimator is optimal at arbitrary scale, but the empirical stability we observed supports the reported results. revision: yes

  3. Referee: [Table 2] Table 2 (Benchmark results): the OOD generalization improvements are presented without error bars, statistical significance tests, or ablation on the validation-split construction, making it impossible to determine whether the gains are robust or sensitive to post-hoc choices.

    Authors: We acknowledge that Table 2 currently lacks error bars, significance tests, and validation-split ablations. We will recompute all OOD results over five random seeds, add standard-error bars to the table, include paired t-test p-values against the strongest baseline, and append a new supplementary table showing OOD performance under two alternative validation-split constructions. These additions will be incorporated in the revised manuscript. revision: yes
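For reference, the statistics promised here are routine to compute once per-seed scores exist. The sketch below uses placeholder numbers (not results from the paper) to show the standard-error and paired t-test computation across seeds.

```python
# Sketch of the promised statistics: per-seed OOD scores, standard errors,
# and a paired t-test against the strongest baseline. Numbers are hypothetical.
import numpy as np
from scipy import stats

derl_scores     = np.array([0.62, 0.58, 0.65, 0.60, 0.63])   # 5 seeds (placeholder)
baseline_scores = np.array([0.55, 0.53, 0.57, 0.54, 0.56])   # same seeds (placeholder)

def mean_and_se(x):
    """Mean and standard error over seeds."""
    return x.mean(), x.std(ddof=1) / np.sqrt(len(x))

m_d, se_d = mean_and_se(derl_scores)
m_b, se_b = mean_and_se(baseline_scores)
t, p = stats.ttest_rel(derl_scores, baseline_scores)   # paired across seeds

print(f"DERL     {m_d:.3f} +/- {se_d:.3f}")
print(f"baseline {m_b:.3f} +/- {se_b:.3f}")
print(f"paired t-test: t = {t:.2f}, p = {p:.4f}")
```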

Circularity Check

0 steps flagged

No significant circularity in DERL bi-level derivation

full rationale

The paper presents a bi-level optimization where the meta-optimizer evolves reward compositions and receives updates via policy gradients computed on inner-loop validation performance. This separation of inner policy optimization from outer meta-updates via held-out validation trajectories constitutes an independent feedback signal rather than a self-definitional or fitted-input reduction. No equations or descriptions in the provided text show the meta-gradient being equivalent to its own inputs by construction, nor do they rely on load-bearing self-citations or smuggled ansatzes. The performance claims rest on empirical results across benchmarks, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on the assumption that reward functions can be usefully decomposed into composable atomic primitives and that validation-performance gradients are a reliable signal for meta-optimization; no explicit free parameters are named in the abstract, and the Meta-Optimizer is the only newly introduced component.

axioms (2)
  • domain assumption Reward functions can be effectively constructed by composing structured atomic primitives
    Stated as the mechanism for the Meta-Optimizer to evolve rewards
  • domain assumption Policy gradients from inner-loop validation performance supply dense actionable feedback for the meta-optimizer
    Central to introducing differentiability into the evolutionary process
invented entities (1)
  • Meta-Optimizer no independent evidence
    purpose: Evolves reward function through composition of atomic primitives and is updated via policy gradients
    New component introduced to make evolutionary reward search differentiable

pith-pipeline@v0.9.0 · 5528 in / 1387 out tokens · 77833 ms · 2026-05-16T22:31:21.753597+00:00 · methodology
