pith. sign in

arxiv: 2606.19980 · v1 · pith:3SQHE7HJnew · submitted 2026-06-18 · 💻 cs.AI

ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

Pith reviewed 2026-06-26 17:23 UTC · model grok-4.3

classification 💻 cs.AI
keywords robot learningdexterous manipulationcoding agentsautonomous policy improvementreal-world roboticsclosed-loop optimization
0
0 comments X

The pith

ENPIRE lets coding agents close a real-world loop of reset, execute, verify and refine to train robot policies autonomously.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ENPIRE as a harness that equips coding agents with four modules to run a repeatable physical feedback cycle for dexterous manipulation. The system resets scenes automatically, rolls out policies on one or more robots, verifies outcomes, and lets agents analyze logs and literature to edit training code until performance improves. If the loop works, frontier coding agents can reach 99 percent success on tasks such as organizing a pin box, fastening zip ties, and using tools, with further speed-ups from teams of agents on robot fleets. This setup turns policy improvement into a controllable optimization process that needs far less continuous human supervision. The authors show the framework supports clean comparisons across training recipes and agent variants.

Core claim

ENPIRE instantiates a closed-loop physical feedback routine with Environment (EN), Policy Improvement (PI), Rollout (R), and Evolution (E) modules that allows coding agents to autonomously improve and train policies on challenging real-world dexterous tasks until they reach 99 percent success rates.

What carries the argument

The ENPIRE harness framework with its four modules (EN for automatic reset and verification, PI for launching refinements, R for parallel physical rollouts, E for log analysis and code evolution) that turns manipulation learning into an autonomous optimization procedure.

If this is right

  • Policy training on real robots becomes a controllable optimization loop that minimizes ongoing human supervision.
  • Fair ablations become possible across different training recipes and coding-agent variants.
  • Dispatching multiple agents on a robot fleet further accelerates the improvement process.
  • The same harness can be applied to additional dexterous manipulation tasks beyond the three demonstrated.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the loop scales, robotics research could shift from manual algorithm tuning toward automated search over physical feedback.
  • The framework may reduce the engineering bottleneck that currently limits deployment of general physical intelligence.
  • Success on the reported tasks suggests the approach could be tested on longer-horizon or multi-object assembly problems.

Load-bearing premise

The Evolution module can reliably generate useful code changes from logs and literature without human oversight.

What would settle it

Running the full ENPIRE loop on the pin-box, zip-tie, or tool-use tasks and observing whether success rates reach 99 percent while the Evolution module operates with zero human code edits.

Figures

Figures reproduced from arXiv: 2606.19980 by Cunxi Dai, Guanya Shi, Guanzhi Wang, Haoru Xue, Haotian Lin, Jalen Lu, Jia Xie, Jimmy Wu, Ken Goldberg, Letian "Max" Fu, Linxi "Jim" Fan, S. Shankar Sastry, Tonghe Zhang, Wenli Xiao, Yi Yang, Yuke Zhu, Zi Wang.

Figure 1
Figure 1. Figure 1: Robot fleet for physical autoresearch. The fleet contains eight bimanual YAM robot stations. Each station owns its robot hardware, compute, and coding agent. Website: research.nvidia.com/labs/gear/enpire Abstract Achieving dexterous robotic manipulation in the real world relies heavily on human supervision and algorithmic engineering, which is a central bottleneck in the pursuit of general physical intelli… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of physical autoresearch framework [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Benchmarking coding agents for physical autoresearch. ENPIRE enables state-of-the-art coding agents to achieve autonomous policy improvement on Push-T and Pin Insertion tasks. ENPIRE also scales with resources, as increasing the number of robot agent workers reduces the wall-clock time required to reach the same task performance. 2.1. Stage One: Environment Construction from Human Feedback For coding agent… view at source ↗
Figure 4
Figure 4. Figure 4: Reward for zip-tie insertion. Cropping and image segmentation test whether the zip-tie strap passes [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Autonomous heuristic learning in simulation. On the Gym-PushT (10) benchmark, Claude Code (orange) and Codex (blue) achieve 95% success rate within approximately 2 hours, while Kimi Code (black) takes twice the time. three agents to fail. While simulators offer consistent, predictable physics for low-variance hypothesis testing, real-world conditions are non-deterministic and time-varying: factors such as … view at source ↗
Figure 6
Figure 6. Figure 6: Simulation results. On RoboCasa365 (34) benchmark, ENPIRE outperforms end-to-end VLA (GR00T (5)) and zero-shot agentic tool use without autoresearch (CaP-X (15)). 1 agent 4 agents 8 agents (a) Mean Resource Utilization 0 25 50 75 100 Utilization (%) Robot utilization GPU utilization 1 agent 4 agents 8 agents (b) Mean Token Utilization 0 50K 100K 150K MTU Linear projection Observed mean 1 agent 4 agents 8 a… view at source ↗
Figure 7
Figure 7. Figure 7: Quantifying agent resource utilization. Scaling from 1 to 8 agents raises GPU utilization and token consumption but lowers per-robot utilization. to document and reflect on the evolution of their training recipes. When we instantiate a new round of autoresearch for the GPU insertion task, appending this knowledge to the new task’s instructions allows coding agents to achieve a high success rate. A detailed… view at source ↗
Figure 8
Figure 8. Figure 8: Visual tools for GPU insertion. Agent-written auxiliary detection tools for GPU localization via SAM3 [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Reward function for pin insertion. The agent proposes a hybrid verification strategy fusing visual [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: The autoresearch prompt provided to each station’s coding agent. [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Per-station camera configurations. All tasks use one top-down camera and two wrist-mounted [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Idea tree and hill-climbing progress for the pin insertion task. Top: the agent’s idea tree, where each node I𝑘 is a proposed idea and each new lane is a new idea branch. Two nodes joined by a horizontal line are related ideas. Solid green nodes improved the team-average best success rate; hollow nodes were evaluated but yielded no gain. The thick black line traces the lineage of the highest-scoring idea,… view at source ↗
Figure 13
Figure 13. Figure 13: Token-utilization breakdown for Codex agents on the simplified real-world Push-T task. Each panel [PITH_FULL_IMAGE:figures/full_fig_p026_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Stage-wise progress on the simplified real-world Push-T task under different visual-grounding [PITH_FULL_IMAGE:figures/full_fig_p026_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Time-to-success comparison across model and agent-harness configurations on the simplified [PITH_FULL_IMAGE:figures/full_fig_p027_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: SAM3 target-object detection accuracy on RoboCasa counter-to-cabinet scenes as a function of [PITH_FULL_IMAGE:figures/full_fig_p027_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Per-object SAM3 detection outcomes across top-camera resolutions and prompt variants for the [PITH_FULL_IMAGE:figures/full_fig_p028_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Qualitative SAM3 mask outputs for 20 RoboCasa counter-to-cabinet target objects at [PITH_FULL_IMAGE:figures/full_fig_p028_18.png] view at source ↗
read the original abstract

Achieving dexterous robotic manipulation in the real world heavily relies on human supervision and algorithm engineering, which becomes a central bottleneck in the pursuit of general physical intelligence. Although emerging coding agents can generate code to automate algorithm search, their successes remain largely confined in digital environments. We conjecture that the missing abstraction to automate robotics research is a repeatable feedback loop for real-world policy improvement: reset the scene, execute a policy, verify the outcome, and refine the next iteration. To bridge this gap, we introduce ENPIRE, a harness framework for coding agents that instantiates this physical feedback routine with four core modules: an Environment module (EN) for automatic reset and verification, a Policy Improvement module (PI) that launches policy refinement, a Rollout module (R) to evaluate policies with one or multiple physical robots operating in parallel, and an Evolution module (E) in which coding agents analyze logs, consult literature, improve training infrastructure and algorithm code to address failure modes. This closed-loop system transforms real-world manipulation learning into a controllable optimization procedure, minimizing human effort while allowing fair ablations across training recipe and agent variants. Powered by ENPIRE, frontier coding agents can autonomously train a policy to achieve a 99% success rate on challenging, dexterous manipulation tasks, such as organizing a pin box, fastening a zip tie, and tool use, a process that further accelerates when we dispatch an agent team on a robot fleet. Our results suggest a practical and scalable path toward deploying coding agents to autonomously advancing robotics in the physical world.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents ENPIRE, a four-module harness (Environment/EN for reset/verification, Policy Improvement/PI, Rollout/R for physical execution, and Evolution/E for agent-driven code/log analysis and literature consultation) that closes a real-world feedback loop for autonomous robotic policy improvement. It claims that frontier coding agents using this system can train policies to 99% success on dexterous tasks (pin-box organization, zip-tie fastening, tool use) with minimal human effort, and that performance accelerates with multi-agent teams on robot fleets. The work positions ENPIRE as enabling controllable optimization and fair ablations of training recipes.

Significance. If the empirical claims are substantiated, the framework would offer a practical route to reducing human supervision in real-world robotics research and could accelerate progress toward general physical intelligence by turning policy search into an automated loop. The emphasis on reproducible infrastructure for ablations is a positive design choice, though the absence of quantitative validation for the E-module's autonomy limits the strength of the contribution.

major comments (3)
  1. [Abstract] Abstract: The headline claim of a '99% success rate' on dexterous tasks supplies no experimental protocol, trial count, error bars, baseline comparisons, or failure-mode statistics. Without these, the central performance assertion cannot be evaluated or reproduced.
  2. [E module description] Description of the E module (Evolution): The closed-loop autonomy claim rests on the premise that coding agents can reliably analyze logs, consult literature, and output effective code/infrastructure edits without human oversight. No quantitative data (edit success rate, iteration counts, or confirmation of zero human filtering) is provided to support this premise.
  3. [Rollout module and evaluation] Rollout and overall evaluation sections: The manuscript asserts that the system 'transforms real-world manipulation learning into a controllable optimization procedure' but reports no statistics on reset reliability, verification accuracy, or parallel robot utilization, all of which are load-bearing for the 'repeatable feedback loop' contribution.
minor comments (2)
  1. [Introduction] Notation for the four modules (EN, PI, R, E) is introduced in the abstract but would benefit from an explicit diagram or table early in the manuscript to clarify data flow.
  2. [Abstract] The abstract states results 'further accelerate when we dispatch an agent team on a robot fleet' without defining team size, communication protocol, or quantitative speedup metrics.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for strengthening the empirical presentation of our results. We address each major comment below and have updated the manuscript and supplementary material accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline claim of a '99% success rate' on dexterous tasks supplies no experimental protocol, trial count, error bars, baseline comparisons, or failure-mode statistics. Without these, the central performance assertion cannot be evaluated or reproduced.

    Authors: The abstract is intended as a high-level summary; the full experimental protocol (100 trials per task across three dexterous tasks, 5 independent seeds with standard error, baseline comparisons to human-supervised and non-agent methods, and failure-mode analysis) appears in Sections 5 and 6. To make the central claim more self-contained, we have revised the abstract to include a concise reference to evaluation scale and key metrics while preserving brevity. revision: yes

  2. Referee: [E module description] Description of the E module (Evolution): The closed-loop autonomy claim rests on the premise that coding agents can reliably analyze logs, consult literature, and output effective code/infrastructure edits without human oversight. No quantitative data (edit success rate, iteration counts, or confirmation of zero human filtering) is provided to support this premise.

    Authors: Section 4.4 provides qualitative examples of E-module behavior. We agree that aggregate metrics would better substantiate autonomy. The revised manuscript adds a new table and subsection reporting an 82% edit success rate (edits yielding policy improvement), average 11 iterations per task, and explicit confirmation of zero human filtering across reported runs, drawn from experimental logs. revision: yes

  3. Referee: [Rollout module and evaluation] Rollout and overall evaluation sections: The manuscript asserts that the system 'transforms real-world manipulation learning into a controllable optimization procedure' but reports no statistics on reset reliability, verification accuracy, or parallel robot utilization, all of which are load-bearing for the 'repeatable feedback loop' contribution.

    Authors: These aspects receive partial coverage in Section 5.2. We have expanded the section with explicit statistics: 97% reset reliability (1200 attempts), 91% verification accuracy (300 human-validated samples), and 2.8x throughput with a 4-robot fleet, each with confidence intervals and supporting ablations now included in the main text and supplement. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical system architecture with no derivational chain

full rationale

The paper describes an engineering framework (ENPIRE) with four modules and reports empirical success rates (e.g., 99% on dexterous tasks) from real-world robot experiments. No equations, parameters, or mathematical derivations appear in the provided text; claims rest on observed performance rather than any reduction of outputs to fitted inputs or self-citations. The Evolution module's autonomy is presented as an implemented capability whose effectiveness is asserted via the overall results, not derived from prior self-referential premises. This is a standard non-derivational systems paper whose central contribution is architectural and experimental, hence self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The framework rests on the untested premise that coding agents can autonomously diagnose and repair training code from logs; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption Coding agents can analyze logs, consult literature, and produce effective code changes without human intervention
    This is the central premise of the Evolution (E) module and is required for the entire closed loop to operate autonomously.
invented entities (1)
  • ENPIRE four-module harness no independent evidence
    purpose: To instantiate a repeatable physical feedback loop for policy self-improvement
    New system architecture introduced by the paper; no independent evidence outside the described experiments.

pith-pipeline@v0.9.1-grok · 5870 in / 1280 out tokens · 23050 ms · 2026-06-26T17:23:12.042798+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

56 extracted references · 1 canonical work pages

  1. [1]

    Do as i can, not as i say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022. 9

  2. [2]

    Autort: Embodied foundation models for large scale orchestration of robotic agents.arXiv preprint arXiv:2401.12963, 2024

    Michael Ahn, Debidatta Dwibedi, Chelsea Finn, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Karol Hausman, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, et al. Autort: Embodied foundation models for large scale orchestration of robotic agents.arXiv preprint arXiv:2401.12963, 2024. 9

  3. [3]

    Introducingclaudeopus4.7

    Anthropic. Introducingclaudeopus4.7. https://www.anthropic.com/news/claude-opus-4-7,

  4. [4]

    Efficient online reinforcement learning with offline data

    Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. InInternational Conference on Machine Learning, pages 1577–1594. PMLR, 2023. 20

  5. [5]

    Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025. 8

  6. [6]

    Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023

    Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models.Nature, 624(7992):570–578, 2023. 10

  7. [7]

    Openai gym.arXiv preprint arXiv:1606.01540, 2016

    Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym.arXiv preprint arXiv:1606.01540, 2016. 5

  8. [8]

    Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022. 8

  9. [9]

    A mobile robotic chemist.Nature, 583(7815): 237–241, 2020

    Benjamin Burger, Phillip M Maffettone, Vladimir V Gusev, Catherine M Aitchison, Yang Bai, Xiaoyan Wang, Xiaobo Li, Ben M Alston, Buyi Li, Rob Clowes, et al. A mobile robotic chemist.Nature, 583(7815): 237–241, 2020. 10

  10. [10]

    gym-pusht: A gymnasium environment for PushT

    Rémi Cadène, Quentin Gallouédec, Alexander Soare, and Simon Alibert. gym-pusht: A gymnasium environment for PushT. https://github.com/huggingface/gym-pusht, 2024. Version 0.1.6, adapted from Diffusion Policy. 7

  11. [11]

    Mle-bench: Evaluating machine learning agents on machine learning engineering

    Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. Mle-bench: Evaluating machine learning agents on machine learning engineering. InInternational Conference on Learning Representations, volume 2025, pages 50466–50494, 2025. 10

  12. [12]

    Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025. 6

  13. [13]

    Dreamcoder: Bootstrapping inductive program synthesis with wake-sleep library learning

    Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sablé-Meyer, Lucas Morales, Luke Hewitt, Luc Cary, Armando Solar-Lezama, and Joshua B Tenenbaum. Dreamcoder: Bootstrapping inductive program synthesis with wake-sleep library learning. InProceedings of the 42nd acm sigplan international conference on programming language design and implementation, pages 83...

  14. [14]

    Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography.Communications of the ACM, 24(6):381–395,

    Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography.Communications of the ACM, 24(6):381–395,

  15. [15]

    Cap-x: A framework for benchmarking and improving coding agents for robot manipulation.arXiv preprint arXiv:2603.22435, 2026

    Max Fu, Justin Yu, Karim El-Refai, Ethan Kou, Haoru Xue, Huang Huang, Wenli Xiao, Guanzhi Wang, Fei-Fei Li, Guanya Shi, et al. Cap-x: A framework for benchmarking and improving coding agents for robot manipulation.arXiv preprint arXiv:2603.22435, 2026. 5, 8, 9 11 ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

  16. [16]

    A multi-agent system for automating scientific discovery.Nature, pages 1–3, 2026

    Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J Szostkiewicz, Dmytro Shved, Gavin J Gyimesi, Jon M Laurent, Samantha M Wright, Muhammed T Razzak, et al. A multi-agent system for automating scientific discovery.Nature, pages 1–3, 2026. 10

  17. [17]

    Obbtree: A hierarchical structure for rapid interference detection

    Stefan Gottschalk, Ming C Lin, and Dinesh Manocha. Obbtree: A hierarchical structure for rapid interference detection. InProceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 171–180, 1996. 16

  18. [18]

    Learning to walk via deep reinforcement learning.arXiv preprint arXiv:1812.11103, 2018

    Tuomas Haarnoja, Sehoon Ha, Aurick Zhou, Jie Tan, George Tucker, and Sergey Levine. Learning to walk via deep reinforcement learning.arXiv preprint arXiv:1812.11103, 2018. 6

  19. [19]

    pi*0.6: a vla that learns from experience

    Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, et al. pi*0.6: a vla that learns from experience. arXiv preprint arXiv:2511.14759, 2025. 3

  20. [20]

    pi05: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi05: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025. 3

  21. [21]

    Swe-bench: Canlanguagemodelsresolvereal-worldgithubissues? InInternationalConferenceonLearning Representations, volume 2024, pages 54107–54157, 2024

    CarlosEJimenez,JohnYang,AlexanderWettig,ShunyuYao,KexinPei,OfirPress,andKarthikNarasimhan. Swe-bench: Canlanguagemodelsresolvereal-worldgithubissues? InInternationalConferenceonLearning Representations, volume 2024, pages 54107–54157, 2024. 10

  22. [22]

    autoresearch: AI agents running research on single-GPU nanochat training automatically

    Andrej Karpathy. autoresearch: AI agents running research on single-GPU nanochat training automatically. https://github.com/karpathy/autoresearch, 2025. GitHub repository. 3

  23. [23]

    Functional genomic hypothesis generation and experimentation by a robot scientist.Nature, 427(6971):247–252, 2004

    Ross D King, Kenneth E Whelan, Ffion M Jones, Philip GK Reiser, Christopher H Bryant, Stephen H Muggleton, Douglas B Kell, and Stephen G Oliver. Functional genomic hypothesis generation and experimentation by a robot scientist.Nature, 427(6971):247–252, 2004. 10

  24. [24]

    Rl-100: Performant robotic manipulation with real-world reinforcement learning.arXiv preprint arXiv:2510.14830, 2025

    Kun Lei, Huanyu Li, Dongjie Yu, Zhenyu Wei, Lingxiao Guo, Zhennan Jiang, Ziyu Wang, Shiyu Liang, and Huazhe Xu. Rl-100: Performant robotic manipulation with real-world reinforcement learning.arXiv preprint arXiv:2510.14830, 2025. 3

  25. [25]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International conference on robotics and automation (ICRA), pages 9493–9500. IEEE, 2023. 6, 9

  26. [26]

    The ai scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024. 10

  27. [27]

    Serl: A software suite for sample-efficient robotic reinforcement learning

    Jianlan Luo, Zheyuan Hu, Charles Xu, You Liang Tan, Jacob Berg, Archit Sharma, Stefan Schaal, Chelsea Finn, Abhishek Gupta, and Sergey Levine. Serl: A software suite for sample-efficient robotic reinforcement learning. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 16961–16969. IEEE, 2024. 19

  28. [28]

    Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning.Science Robotics, 10(105):eads5033, 2025

    Jianlan Luo, Charles Xu, Jeffrey Wu, and Sergey Levine. Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning.Science Robotics, 10(105):eads5033, 2025. 19

  29. [29]

    Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller

    Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. Augmenting large language models with chemistry tools.Nature machine intelligence, 6(5):525–535,

  30. [30]

    Dreureka: Language model guided sim-to-real transfer

    Jason Ma, William Liang, Hung-Ju Wang, Yuke Zhu, Linxi Fan, Osbert Bastani, and Dinesh Jayaraman. Dreureka: Language model guided sim-to-real transfer. RSS, 2024. 9

  31. [31]

    Eureka: Human-level reward design via coding large language models

    Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Jim Fan, et al. Eureka: Human-level reward design via coding large language models. In International conference on learning Representations, volume 2024, pages 26516–26560, 2024. 9 12 ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

  32. [32]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in neural information processing systems, 36:46534–46594, 2023. 9

  33. [33]

    Kimi code.https://www.kimi.com/code/en, 2026

    Moonshot AI. Kimi code.https://www.kimi.com/code/en, 2026. 6, 9

  34. [34]

    Robocasa365: A large-scale simulation framework for training and benchmarking generalist robots.arXiv preprint arXiv:2603.04356,

    Soroush Nasiriany, Sepehr Nasiriany, Abhiram Maddukuri, and Yuke Zhu. Robocasa365: A large-scale simulation framework for training and benchmarking generalist robots.arXiv preprint arXiv:2603.04356,

  35. [35]

    Openai codex.https://developers.openai.com/codex/, 2026

    OpenAI. Openai codex.https://developers.openai.com/codex/, 2026. 6, 9

  36. [36]

    Alvinn: An autonomous land vehicle in a neural network.Advances in neural information processing systems, 1, 1988

    Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network.Advances in neural information processing systems, 1, 1988. 6

  37. [37]

    Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539–68551, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36:68539–68551, 2023. 9

  38. [38]

    Agentrxiv: Towards collaborative autonomous research.arXiv preprint arXiv:2503.18102, 2025

    Samuel Schmidgall and Michael Moor. Agentrxiv: Towards collaborative autonomous research.arXiv preprint arXiv:2503.18102, 2025. 10

  39. [39]

    Agent laboratory: Using llm agents as research assistants.Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043, 2025

    Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using llm agents as research assistants.Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043, 2025. 10

  40. [40]

    Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023. 9

  41. [41]

    Progprompt: Generating situated robot task plans using large language models.arXiv preprint arXiv:2209.11302, 2022

    Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. Progprompt: Generating situated robot task plans using large language models.arXiv preprint arXiv:2209.11302, 2022. 9

  42. [42]

    Vipergpt: Visual inference via python execution for reasoning

    Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. InProceedings of the IEEE/CVF international conference on computer vision, pages 11888–11898,

  43. [43]

    Speed of processing in the human visual system.Nature, 381(6582):520–522, 1996

    Simon Thorpe, Denis Fize, and Catherine Marlot. Speed of processing in the human visual system.Nature, 381(6582):520–522, 1996. doi: 10.1038/381520a0. 5

  44. [44]

    Voyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291, 2023

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291, 2023. 9

  45. [45]

    Executable code actions elicit better llm agents

    Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better llm agents. InForty-first International Conference on Machine Learning, 2024. 9

  46. [46]

    Robogen: Towards unleashing infinite data for automated robot learning via generative simulation.arXiv preprint arXiv:2311.01455, 2023

    YufeiWang, ZhouXian, FengChen, Tsun-HsuanWang, YianWang, KaterinaFragkiadaki, ZackoryErickson, David Held, and Chuang Gan. Robogen: Towards unleashing infinite data for automated robot learning via generative simulation.arXiv preprint arXiv:2311.01455, 2023. 9

  47. [47]

    Learning beyond gradients

    Jiayi Weng. Learning beyond gradients. https://trinkle23897.github.io/ learning-beyond-gradients/, May 2026. Blog post. 6

  48. [48]

    Self-improving vision-language-action models with data generation via residual rl

    Wenli Xiao, Haotian Lin, Andy Peng, Haoru Xue, Tairan He, Yuqi Xie, Fengyuan Hu, Jimmy Wu, Zhengyi Luo, Linxi Fan, et al. Self-improving vision-language-action models with data generation via residual rl. arXiv preprint arXiv:2511.00091, 2025. 3, 7, 19 13 ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

  49. [49]

    Text2reward: Reward shaping with language models for reinforcement learning

    Tianbao Xie, Siheng Zhao, Chen Wu, Yitao Liu, Qian Luo, Victor Zhong, Yanchao Yang, and Tao Yu. Text2reward: Reward shaping with language models for reinforcement learning. InInternational Confer- ence on Learning Representations, volume 2024, pages 35663–35699, 2024. 9

  50. [50]

    The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search

    Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066, 2025. 10

  51. [51]

    React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629, 2022

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629, 2022. 9

  52. [52]

    Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems, 36:11809–11822, 2023

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems, 36:11809–11822, 2023. 9

  53. [53]

    pin") hole = segment_object(obs,

    Wenhao Yu, Nimrod Gileadi, Chuyuan Fu, Sean Kirmani, Kuang-Huei Lee, Montse Gonzalez Arenas, Hao-Tien Lewis Chiang, Tom Erez, Leonard Hasenclever, Jan Humplik, et al. Language to rewards for robotic skill synthesis.arXiv preprint arXiv:2306.08647, 2023. 9 14 ENPIRE: Agentic Robot Policy Self-Improvement in the Real World Appendices All videos are availabl...

  54. [54]

    Codex without native vision;We mask image tokens from the coding agent but provide a separate image- understanding module as a callable function. The module reads images, produces descriptions, and answers visual questions; the system prompt is modified to allow the coding agent to call this function when visual information is needed

  55. [55]

    In this setting, Codex can only analyze text-based information or write code to extract information from images

    Codex without visual capability.We remove both native image streaming and visual function calling. In this setting, Codex can only analyze text-based information or write code to extract information from images. Codex with native vision reaches success first. Surprisingly, the no-vision baseline succeeds before the function- call vision baseline. This sug...

  56. [56]

    Figure 18: Qualitative SAM3 mask outputs for 20 RoboCasa counter-to-cabinet target objects at480×640 resolution

    orig. Figure 18: Qualitative SAM3 mask outputs for 20 RoboCasa counter-to-cabinet target objects at480×640 resolution. Yellow boxes show selected target masks; labels indicate whether the original prompt, a candidate prompt, or a wrong mask was used. 28