Eureka: Human-Level Reward Design via Coding Large Language Models
Pith reviewed 2026-05-14 20:10 UTC · model grok-4.3
The pith
For robot learning tasks, large language models can design reward functions that outperform those engineered by human experts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Eureka uses the zero-shot generation and in-context improvement capabilities of LLMs to perform evolutionary optimization over reward code, producing reward functions that train RL policies stronger than those obtained from human-engineered rewards.
What carries the argument
Evolutionary search over LLM-written reward functions, starting from zero-shot code generation and iterating based on performance feedback.
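As a rough illustration of that loop, the sketch below samples several candidate reward programs per round, trains a policy on each, and feeds a textual summary of the best candidate's training back into the next prompt. This is a minimal sketch, not Eureka's actual interface: the injected helpers (query_llm, train_policy, summarize_training) and all parameter values are hypothetical stand-ins.
```python
# Minimal sketch of an LLM-driven evolutionary search over reward code.
# query_llm, train_policy, and summarize_training are hypothetical stand-ins
# passed in by the caller; they are not Eureka's real API.

def evolve_reward(task, env_source, query_llm, train_policy, summarize_training,
                  iterations=5, samples_per_iter=16):
    best_code, best_score, feedback = None, float("-inf"), ""
    for _ in range(iterations):
        # First round is zero-shot; later rounds append textual feedback about
        # the best candidate so far (the "in-context improvement" step).
        prompt = f"{env_source}\n\nTask: {task}\n{feedback}"
        candidates = [query_llm(prompt) for _ in range(samples_per_iter)]
        for code in candidates:
            try:
                score, stats = train_policy(code)  # run RL, measure task fitness
            except Exception:
                continue  # discard reward code that fails to execute
            if score > best_score:
                best_code, best_score = code, score
                feedback = summarize_training(stats)  # folded into the next prompt
    return best_code, best_score
```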
If this is right
- Robots can acquire dexterous skills such as pen spinning through curriculum learning with these rewards.
- Human feedback can be integrated into reward design via gradient-free in-context learning without updating the LLM (see the sketch after this list).
- Reward engineering becomes feasible across diverse robot types without manual template design.
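The second point above amounts to prompt augmentation: the human critique is appended to the next generation request, so no gradients ever touch the LLM. A minimal sketch under that reading, with illustrative names rather than the paper's API:
```python
# Sketch: gradient-free "RLHF" as prompt augmentation. The LLM weights are
# never updated; the human note simply becomes part of the next request.
# query_llm is an injected stand-in for any chat-completion call.

def refine_reward_with_feedback(query_llm, env_source, task, reward_code, human_note):
    prompt = (
        f"{env_source}\n\n"
        f"Task: {task}\n"
        f"Previous reward function:\n{reward_code}\n\n"
        f"Human feedback: {human_note}\n"
        "Rewrite the reward function to address the feedback."
    )
    return query_llm(prompt)

# Hypothetical use: steer a pen-spinning reward toward smoother motion.
# new_code = refine_reward_with_feedback(
#     query_llm, env_source, "spin the pen about its long axis", old_code,
#     "the hand jerks too violently; penalize large joint velocities")
```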
Where Pith is reading between the lines
- Success in simulation suggests potential for faster prototyping of robot behaviors before real-world deployment.
- Extending this to handle safety constraints directly in the reward generation could reduce the need for separate alignment steps.
- If LLMs improve in code reliability, this method may generalize to physical hardware with minimal adjustments.
Load-bearing premise
LLM-generated reward code will produce stable policies that do not exploit simulator-specific artifacts when applied to new situations.
What would settle it
Running the generated rewards on physical robot hardware or additional unseen simulation tasks and checking whether the performance advantage over human rewards persists.
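A minimal version of that check, assuming generic make_env and evaluate helpers rather than any particular simulator API, would re-run frozen policies under perturbed dynamics and see whether the Eureka-vs-human gap survives:
```python
import numpy as np

# Sketch of the settling experiment: evaluate frozen policies (one trained on
# the Eureka reward, one on the human reward) in perturbed variants of the
# simulator. make_env, evaluate, and the friction_scale parameter are
# hypothetical placeholders, not a real simulator interface.

def robustness_gap(eureka_policy, human_policy, make_env, evaluate,
                   friction_scales=(0.5, 0.75, 1.25, 1.5), episodes=20):
    gaps = []
    for scale in friction_scales:
        env = make_env(friction_scale=scale)  # perturb contact/friction dynamics
        gaps.append(evaluate(eureka_policy, env, episodes)
                    - evaluate(human_policy, env, episodes))
    # A gap that stays positive under perturbation suggests the learned reward
    # encodes task semantics rather than simulator-specific artifacts.
    return float(np.mean(gaps)), gaps
```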
read the original abstract
Large Language Models (LLMs) have excelled as high-level semantic planners for sequential decision-making tasks. However, harnessing them to learn complex low-level manipulation tasks, such as dexterous pen spinning, remains an open problem. We bridge this fundamental gap and present Eureka, a human-level reward design algorithm powered by LLMs. Eureka exploits the remarkable zero-shot generation, code-writing, and in-context improvement capabilities of state-of-the-art LLMs, such as GPT-4, to perform evolutionary optimization over reward code. The resulting rewards can then be used to acquire complex skills via reinforcement learning. Without any task-specific prompting or pre-defined reward templates, Eureka generates reward functions that outperform expert human-engineered rewards. In a diverse suite of 29 open-source RL environments that include 10 distinct robot morphologies, Eureka outperforms human experts on 83% of the tasks, leading to an average normalized improvement of 52%. The generality of Eureka also enables a new gradient-free in-context learning approach to reinforcement learning from human feedback (RLHF), readily incorporating human inputs to improve the quality and the safety of the generated rewards without model updating. Finally, using Eureka rewards in a curriculum learning setting, we demonstrate for the first time, a simulated Shadow Hand capable of performing pen spinning tricks, adeptly manipulating a pen in circles at rapid speed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Eureka, an algorithm that uses LLMs (e.g., GPT-4) to perform evolutionary optimization over reward code for RL tasks. Without task-specific prompting or templates, it claims to generate rewards that outperform expert human-engineered rewards on 83% of 29 diverse open-source RL environments (10 robot morphologies), yielding 52% average normalized improvement. Extensions include gradient-free in-context RLHF via human feedback and a curriculum demonstration of pen-spinning with a simulated Shadow Hand.
Significance. If the central empirical claim holds under fixed hyperparameters and without simulator overfitting, the work would represent a notable advance in automating reward engineering for RL, lowering the barrier to complex manipulation skills. The LLM-driven evolutionary loop and RLHF integration are technically interesting and could influence future hybrid LLM-RL systems. The pen-spinning result is a strong qualitative demonstration, but overall significance is tempered by the need for controls that distinguish genuine task encoding from simulator-specific exploitation.
major comments (3)
- [Experimental Setup] Experimental protocol: the manuscript must explicitly document and verify that the evolutionary hyperparameters (population size, number of generations, mutation prompts) were identical across all 29 tasks. Any per-task adjustment would make the 83% win-rate and 52% normalized improvement claims non-generalizable.
- [Results] Results reporting: the 52% normalized improvement and 83% outperformance figures require a clear definition of the normalization procedure, the number of independent RL training runs per reward, statistical tests, and confirmation that no post-hoc selection of successful evolutionary trajectories occurred (one plausible convention is sketched after this list).
- [Discussion / Limitations] Simulator robustness: because fitness is measured solely as RL return inside the 29 fixed simulators, the central claim that Eureka rewards encode task semantics (rather than simulator artifacts such as contact models or integration quirks) requires at least one control experiment, e.g., re-evaluation under perturbed friction/contact parameters or transfer to a second physics engine.
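To make the results-reporting comment concrete, the sketch below shows one plausible way the normalization and win-rate figures could be defined: score each reward by final task return averaged over independent RL seeds, use a sparse task-completion reward as the floor, and normalize against the human-engineered reward. Whether this matches the paper's actual procedure is exactly what the authors should state.
```python
import numpy as np

# Assumed convention (not taken from the paper): per-task arrays of mean
# returns for the Eureka reward, the human reward, and a sparse baseline.

def human_normalized_score(eureka, human, sparse):
    eureka, human, sparse = map(np.asarray, (eureka, human, sparse))
    return (eureka - sparse) / np.maximum(np.abs(human - sparse), 1e-8)

def headline_numbers(eureka, human, sparse):
    """Return (win rate over tasks, average normalized improvement)."""
    scores = human_normalized_score(eureka, human, sparse)
    win_rate = float(np.mean(np.asarray(eureka) > np.asarray(human)))
    avg_normalized_improvement = float(np.mean(scores - 1.0))
    return win_rate, avg_normalized_improvement
```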
minor comments (2)
- [Abstract / Title] The title and abstract use 'human-level' without qualification; this should be rephrased to 'outperforms human-engineered rewards' to avoid misinterpretation.
- [Method] Method section: the exact prompt templates and in-context examples supplied to the LLM should be provided in an appendix or supplementary material for reproducibility.
Axiom & Free-Parameter Ledger
free parameters (1)
- evolutionary hyperparameters (population size, number of generations, mutation prompts)
axioms (2)
- domain assumption: Frontier LLMs can reliably produce executable and semantically meaningful reward code from natural-language task descriptions
- domain assumption: RL training on the generated rewards will converge to policies whose performance reflects reward quality rather than simulator artifacts
Forward citations
Cited by 27 Pith papers
- SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
  SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
- SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
  SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.
- The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
  The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.
- Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers
  LLM-generated combinatorial solvers achieve highest correctness when the model formalizes problems for verified backends rather than attempting to optimize search, which often causes regressions.
- Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
  Agentick is a new benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human approaches across 37 tasks and finds no single method dominates.
- CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
  CoRAL lets LLMs act as adaptive cost designers for motion planners while using VLM priors and online identification to handle unknown physics, achieving over 50% higher success rates than baselines in unseen contact-r...
- Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning
  Iterative LLM-driven search over reward functions, screened via GRPO on GSM8K, raises F1 from 0.609 baseline to 0.795 with ensembles on Llama-3.2-3B.
- Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning
  Iterative search over reward functions with ranked feedback in GRPO training improves LLM math reasoning, achieving F1 of 0.795 on GSM8K versus 0.609 for baseline.
- A2DEPT: Large Language Model-Driven Automated Algorithm Design via Evolutionary Program Trees
  A2DEPT generates complete algorithms for COPs using LLM-driven evolutionary program trees with hybrid selection and repair, reducing mean normalized optimality gap by 9.8% versus strongest AHD baselines on standard be...
- AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery
  AutoSOTA uses eight specialized agents to replicate and optimize models from recent AI papers, producing 105 new SOTA results in about five hours per paper on average.
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
  VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
- EvoNav: Evolutionary Reward Function Design for Robot Navigation with Large Language Models
  EvoNav automates the design of reward functions for RL robot navigation by evolving LLM proposals through a three-stage cheap-to-expensive evaluation process and claims better policies than hand-crafted or prior autom...
- Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation
  SAGE trains agents in physics-grounded semantic abstractions via RL with asymmetric clipping, achieving 53.21% LLM-Match Success on A-EQA (+9.7% over baseline) and encouraging physical robot transfer.
- Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
  Agentick is a new unified benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human agents across 37 tasks, showing no single approach dominates.
- Do LLM-derived graph priors improve multi-agent coordination?
  LLM-generated coordination graph priors improve multi-agent reinforcement learning performance on MPE benchmarks, with models as small as 1.5B parameters proving effective.
- From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning
  EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.
- Generative Simulation for Policy Learning in Physical Human-Robot Interaction
  A text-to-simulation pipeline using LLMs and VLMs generates synthetic pHRI data to train vision-based imitation learning policies that achieve over 80% success in zero-shot sim-to-real transfer on real assistive tasks.
- Sumo: Dynamic and Generalizable Whole-Body Loco-Manipulation
  Test-time steering of pre-trained whole-body policies via sample-based planning lets legged robots generalize dynamic loco-manipulation to varied heavy objects and tasks without additional training or tuning.
- RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains
  RoboPlayground reframes robotic manipulation evaluation as a language-driven process over structured physical domains, letting users author varied yet reproducible tasks that reveal policy generalization failures.
- Learning Task-Invariant Properties via Dreamer: Enabling Efficient Policy Transfer for Quadruped Robots
  DreamTIP adds LLM-identified task-invariant properties as auxiliary targets in Dreamer's world model plus a mixed-replay adaptation step, delivering 28.1% average simulated transfer gains and 100% real-world climb suc...
- SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
  The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
- SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
  SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' p...
- CVEvolve: Autonomous Algorithm Discovery for Unstructured Scientific Data Processing
  CVEvolve uses LLM agents with lineage-aware search to autonomously discover algorithms that outperform baselines on scientific image tasks including registration, peak detection, and segmentation.
- CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
  CoRAL lets LLMs design objective functions for robot motion planners and uses vision-language models plus real-time identification to adapt to unknown physical properties, raising success rates by over 50 percent on n...
- AgenticRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization
  AgenticRecTune deploys five LLM agents (Actor, Critic, Insight, Skill, Online) and a self-evolving Skillhub to handle end-to-end configuration optimization for multi-stage recommendation systems.
- A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
  The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
- AgenticRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization
  AgenticRecTune deploys Actor, Critic, Insight, Skill, and Online agents plus a self-evolving Skillhub to propose, filter, test, and learn from recommendation system configurations using Gemini LLMs.