pith. machine review for the scientific record.

arxiv: 2310.12931 · v2 · submitted 2023-10-19 · 💻 cs.RO · cs.AI · cs.LG

Recognition: unknown

Eureka: Human-Level Reward Design via Coding Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:10 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords reward design · large language models · reinforcement learning · robotics · evolutionary optimization · RLHF

The pith

Large language models can design reward functions for robot tasks that outperform those created by human experts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Eureka is an algorithm that leverages large language models to generate and evolve reward code for reinforcement learning. It does this without any task-specific prompts or templates, relying on the models' code-writing abilities. The resulting rewards lead to better robot performance in a wide variety of simulated environments compared to expert human designs. This opens the door to automating a key bottleneck in training complex robotic skills.

Core claim

Eureka uses the zero-shot generation and in-context improvement capabilities of LLMs to perform evolutionary optimization over reward code, producing functions that enable superior RL policies.

What carries the argument

Evolutionary search over LLM-written reward functions, starting from zero-shot code generation and iterating based on performance feedback.
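The loop described above can be sketched in a few lines. The three callables below — `sample_reward_code`, `evaluate`, and `reflect` — are injected stand-ins for the LLM proposal step, an RL training run, and the textual reward reflection; none of them are the paper's actual interfaces.

```python
# Minimal sketch of a Eureka-style evolutionary loop, assuming three injected
# callables: sample_reward_code (LLM proposal), evaluate (RL training return,
# or None if the code crashed), and reflect (textual feedback for the prompt).
def eureka_loop(sample_reward_code, evaluate, reflect, iterations=3, samples=4):
    best_code, best_fitness = None, float("-inf")
    feedback = ""  # empty on the first generation: pure zero-shot sampling
    for _ in range(iterations):
        scored = []
        for _ in range(samples):
            code = sample_reward_code(feedback)
            fitness = evaluate(code)
            if fitness is not None:  # discard non-executable reward code
                scored.append((fitness, code))
        if not scored:
            continue  # every candidate failed; resample next generation
        fitness, code = max(scored)
        if fitness > best_fitness:
            best_fitness, best_code = fitness, code
        feedback = reflect(code, fitness)  # in-context improvement signal
    return best_code, best_fitness

# Toy stand-ins: "reward code" is one coefficient; fitness peaks at 1.0, so
# each generation resamples around the previous best and drifts toward it.
import random
random.seed(0)
sample = lambda fb: (float(fb) if fb else 0.0) + random.uniform(-0.5, 0.5)
best, best_fit = eureka_loop(sample, lambda c: -abs(c - 1.0), lambda c, f: str(c))
```

The key structural point is that selection pressure comes entirely from the fitness (RL return) while variation comes from re-prompting the LLM with the reflection text.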

If this is right

  • Robots can acquire dexterous skills such as pen spinning through curriculum learning with these rewards.
  • Human feedback can be integrated into reward design via gradient-free in-context learning without updating the LLM.
  • Reward engineering becomes feasible across diverse robot types without manual template design.
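The second bullet can be made concrete: because human input enters only as prompt text, "integration" is just string assembly — no gradient ever touches the LLM. The function name and prompt wording below are illustrative assumptions, not the paper's actual prompt.

```python
# Sketch of gradient-free in-context RLHF: human critique becomes prompt text
# for the next reward-generation round; no LLM parameters are updated.
# The prompt structure and wording here are hypothetical.
def build_rlhf_prompt(task, reward_code, human_feedback=""):
    parts = [
        f"Task: {task}",
        "Previous reward function:",
        reward_code,
    ]
    if human_feedback:
        parts += ["Human feedback on the resulting behavior:", human_feedback]
    parts.append("Write an improved reward function that addresses the feedback.")
    return "\n".join(parts)

prompt = build_rlhf_prompt(
    "spin a pen with a five-fingered hand",
    "def reward(s): return -s.pen_orientation_error",
    "The hand drops the pen; penalize losing contact.",
)
```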

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Success in simulation suggests potential for faster prototyping of robot behaviors before real-world deployment.
  • Extending this to handle safety constraints directly in the reward generation could reduce the need for separate alignment steps.
  • If LLMs improve in code reliability, this method may generalize to physical hardware with minimal adjustments.

Load-bearing premise

LLM-generated reward code will produce stable policies that do not exploit simulator-specific artifacts when applied to new situations.

What would settle it

Running the generated rewards on physical robot hardware or additional unseen simulation tasks and checking whether the performance advantage over human rewards persists.
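A minimal version of that control is a parameter sweep: re-run the fixed trained policy while contact parameters are perturbed and watch for a disproportionate drop in return. The `evaluate` hook below is hypothetical, standing in for a rollout under given physics settings.

```python
# Sketch of a simulator-robustness check, assuming a hypothetical evaluate()
# hook that runs the fixed policy under given friction/restitution settings.
def robustness_sweep(evaluate, frictions=(0.5, 1.0, 1.5), restitutions=(0.0, 0.3)):
    return {(mu, e): evaluate(mu, e) for mu in frictions for e in restitutions}

# Toy stand-in mimicking a reward that overfit the nominal contact model:
# return degrades sharply away from the default friction of 1.0.
overfit = lambda mu, e: 100.0 - 80.0 * abs(mu - 1.0) - 10.0 * e
returns = robustness_sweep(overfit)
worst_drop = returns[(1.0, 0.0)] - min(returns.values())
```

A policy whose advantage over the human-reward baseline survives such perturbations (or transfer to a second physics engine) is far stronger evidence that the reward encodes task semantics.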

Original abstract

Large Language Models (LLMs) have excelled as high-level semantic planners for sequential decision-making tasks. However, harnessing them to learn complex low-level manipulation tasks, such as dexterous pen spinning, remains an open problem. We bridge this fundamental gap and present Eureka, a human-level reward design algorithm powered by LLMs. Eureka exploits the remarkable zero-shot generation, code-writing, and in-context improvement capabilities of state-of-the-art LLMs, such as GPT-4, to perform evolutionary optimization over reward code. The resulting rewards can then be used to acquire complex skills via reinforcement learning. Without any task-specific prompting or pre-defined reward templates, Eureka generates reward functions that outperform expert human-engineered rewards. In a diverse suite of 29 open-source RL environments that include 10 distinct robot morphologies, Eureka outperforms human experts on 83% of the tasks, leading to an average normalized improvement of 52%. The generality of Eureka also enables a new gradient-free in-context learning approach to reinforcement learning from human feedback (RLHF), readily incorporating human inputs to improve the quality and the safety of the generated rewards without model updating. Finally, using Eureka rewards in a curriculum learning setting, we demonstrate for the first time, a simulated Shadow Hand capable of performing pen spinning tricks, adeptly manipulating a pen in circles at rapid speed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Eureka, an algorithm that uses LLMs (e.g., GPT-4) to perform evolutionary optimization over reward code for RL tasks. Without task-specific prompting or templates, it claims to generate rewards that outperform expert human-engineered rewards on 83% of 29 diverse open-source RL environments (10 robot morphologies), yielding 52% average normalized improvement. Extensions include gradient-free in-context RLHF via human feedback and a curriculum demonstration of pen-spinning with a simulated Shadow Hand.

Significance. If the central empirical claim holds under fixed hyperparameters and without simulator overfitting, the work would represent a notable advance in automating reward engineering for RL, lowering the barrier to complex manipulation skills. The LLM-driven evolutionary loop and RLHF integration are technically interesting and could influence future hybrid LLM-RL systems. The pen-spinning result is a strong qualitative demonstration, but overall significance is tempered by the need for controls that distinguish genuine task encoding from simulator-specific exploitation.

major comments (3)
  1. [Experimental Setup] Experimental protocol: the manuscript must explicitly document and verify that the evolutionary hyperparameters (population size, number of generations, mutation prompts) were identical across all 29 tasks. Any per-task adjustment would make the 83% win-rate and 52% normalized improvement claims non-generalizable.
  2. [Results] Results reporting: the 52% normalized improvement and 83% outperformance figures require a clear definition of the normalization procedure, the number of independent RL training runs per reward, statistical tests, and confirmation that no post-hoc selection of successful evolutionary trajectories occurred.
  3. [Discussion / Limitations] Simulator robustness: because fitness is measured solely as RL return inside the 29 fixed simulators, the central claim that Eureka rewards encode task semantics (rather than simulator artifacts such as contact models or integration quirks) requires at least one control experiment, e.g., re-evaluation under perturbed friction/contact parameters or transfer to a second physics engine.
minor comments (2)
  1. [Abstract / Title] The title and abstract use 'human-level' without qualification; this should be rephrased to 'outperforms human-engineered rewards' to avoid misinterpretation.
  2. [Method] Method section: the exact prompt templates and in-context examples supplied to the LLM should be provided in an appendix or supplementary material for reproducibility.
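To make major comment 2 concrete, here is one plausible reading of "normalized improvement" — Eureka's gain over the human reward, scaled by the human reward's margin over a random-policy baseline. This is an assumption about the metric, not the paper's documented definition; pinning down the actual formula is exactly what the comment requests.

```python
# One plausible normalization (assumed, not the paper's confirmed definition):
# scores are shifted so a random policy maps to 0 and the human-engineered
# reward maps to 1, and the improvement is Eureka's gain in those units.
def normalized_improvement(eureka_score, human_score, random_score):
    denom = human_score - random_score
    if denom == 0:
        raise ValueError("human and random baselines coincide")
    return (eureka_score - human_score) / abs(denom)

# e.g. eureka=90, human=60, random=10 -> (90 - 60) / 50 = 0.6, i.e. +60%
gain = normalized_improvement(90.0, 60.0, 10.0)
```

Whether the reported 52% is a mean of such per-task ratios, and over how many seeds, is what the referee asks the authors to state.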

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The approach rests on the pre-existing code-generation and in-context learning abilities of frontier LLMs plus standard RL assumptions; no new physical entities or fitted constants are introduced beyond typical evolutionary hyperparameters.

free parameters (1)
  • evolutionary hyperparameters (population size, number of generations, mutation prompts)
    These control the search process and are chosen by the authors to make the optimization practical.
axioms (2)
  • domain assumption Frontier LLMs can reliably produce executable and semantically meaningful reward code from natural-language task descriptions
    Invoked throughout the method description as the foundation for zero-shot generation.
  • domain assumption RL training on the generated rewards will converge to policies whose performance reflects reward quality rather than simulator artifacts
    Required for the claim that higher rewards translate to better skills.

pith-pipeline@v0.9.0 · 5571 in / 1461 out tokens · 57909 ms · 2026-05-14T20:10:45.963363+00:00 · methodology

discussion (0)


Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 accept novelty 8.0

    SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

  2. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 unverdicted novelty 8.0

    SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.

  3. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    cs.AI 2024-08 unverdicted novelty 8.0

    The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.

  4. Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers

    cs.AI 2026-05 unverdicted novelty 7.0

    LLM-generated combinatorial solvers achieve highest correctness when the model formalizes problems for verified backends rather than attempting to optimize search, which often causes regressions.

  5. Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Agentick is a new benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human approaches across 37 tasks and finds no single method dominates.

  6. CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    CoRAL lets LLMs act as adaptive cost designers for motion planners while using VLM priors and online identification to handle unknown physics, achieving over 50% higher success rates than baselines in unseen contact-r...

  7. Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning

    cs.CL 2026-05 accept novelty 7.0

    Iterative LLM-driven search over reward functions, screened via GRPO on GSM8K, raises F1 from 0.609 baseline to 0.795 with ensembles on Llama-3.2-3B.

  8. Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    Iterative search over reward functions with ranked feedback in GRPO training improves LLM math reasoning, achieving F1 of 0.795 on GSM8K versus 0.609 for baseline.

  9. A2DEPT: Large Language Model-Driven Automated Algorithm Design via Evolutionary Program Trees

    cs.AI 2026-04 unverdicted novelty 7.0

    A2DEPT generates complete algorithms for COPs using LLM-driven evolutionary program trees with hybrid selection and repair, reducing mean normalized optimality gap by 9.8% versus strongest AHD baselines on standard be...

  10. AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery

    cs.CL 2026-04 unverdicted novelty 7.0

    AutoSOTA uses eight specialized agents to replicate and optimize models from recent AI papers, producing 105 new SOTA results in about five hours per paper on average.

  11. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    cs.RO 2023-07 unverdicted novelty 7.0

    VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

  12. EvoNav: Evolutionary Reward Function Design for Robot Navigation with Large Language Models

    cs.RO 2026-05 unverdicted novelty 6.0

    EvoNav automates the design of reward functions for RL robot navigation by evolving LLM proposals through a three-stage cheap-to-expensive evaluation process and claims better policies than hand-crafted or prior autom...

  13. Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation

    cs.RO 2026-05 unverdicted novelty 6.0

    SAGE trains agents in physics-grounded semantic abstractions via RL with asymmetric clipping, achieving 53.21% LLM-Match Success on A-EQA (+9.7% over baseline) and encouraging physical robot transfer.

  14. Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    Agentick is a new unified benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human agents across 37 tasks, showing no single approach dominates.

  15. Do LLM-derived graph priors improve multi-agent coordination?

    cs.LG 2026-04 unverdicted novelty 6.0

    LLM-generated coordination graph priors improve multi-agent reinforcement learning performance on MPE benchmarks, with models as small as 1.5B parameters proving effective.

  16. From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.

  17. Generative Simulation for Policy Learning in Physical Human-Robot Interaction

    cs.RO 2026-04 unverdicted novelty 6.0

    A text-to-simulation pipeline using LLMs and VLMs generates synthetic pHRI data to train vision-based imitation learning policies that achieve over 80% success in zero-shot sim-to-real transfer on real assistive tasks.

  18. Sumo: Dynamic and Generalizable Whole-Body Loco-Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    Test-time steering of pre-trained whole-body policies via sample-based planning lets legged robots generalize dynamic loco-manipulation to varied heavy objects and tasks without additional training or tuning.

  19. RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains

    cs.RO 2026-04 unverdicted novelty 6.0

    RoboPlayground reframes robotic manipulation evaluation as a language-driven process over structured physical domains, letting users author varied yet reproducible tasks that reveal policy generalization failures.

  20. Learning Task-Invariant Properties via Dreamer: Enabling Efficient Policy Transfer for Quadruped Robots

    cs.RO 2026-04 unverdicted novelty 6.0

    DreamTIP adds LLM-identified task-invariant properties as auxiliary targets in Dreamer's world model plus a mixed-replay adaptation step, delivering 28.1% average simulated transfer gains and 100% real-world climb suc...

  21. SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

    cs.CR 2026-02 unverdicted novelty 6.0

    The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.

  22. SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

    cs.RO 2025-09 conditional novelty 6.0

    SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' p...

  23. CVEvolve: Autonomous Algorithm Discovery for Unstructured Scientific Data Processing

    cs.AI 2026-05 unverdicted novelty 5.0

    CVEvolve uses LLM agents with lineage-aware search to autonomously discover algorithms that outperform baselines on scientific image tasks including registration, peak detection, and segmentation.

  24. CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    CoRAL lets LLMs design objective functions for robot motion planners and uses vision-language models plus real-time identification to adapt to unknown physical properties, raising success rates by over 50 percent on n...

  25. AgenticRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization

    cs.IR 2026-04 unverdicted novelty 5.0

    AgenticRecTune deploys five LLM agents (Actor, Critic, Insight, Skill, Online) and a self-evolving Skillhub to handle end-to-end configuration optimization for multi-stage recommendation systems.

  26. A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

    cs.IR 2026-05 unverdicted novelty 4.0

    The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

  27. AgenticRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization

    cs.IR 2026-04 unverdicted novelty 4.0

    AgenticRecTune deploys Actor, Critic, Insight, Skill, and Online agents plus a self-evolving Skillhub to propose, filter, test, and learn from recommendation system configurations using Gemini LLMs.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · cited by 22 Pith papers
