CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation

arxiv: 2603.22435 · v2 · pith:OPU5XVSMnew · submitted 2026-03-23 · 💻 cs.RO · cs.AI

Letian Fu, Justin Yu, Karim El-Refai, Ethan Kou, Haoru Xue, Huang Huang, Wenli Xiao, Guanzhi Wang, Dantong Niu, Fei-Fei Li, Guanya Shi, Jiajun Wu, Shankar Sastry, Yuke Zhu, Ken Goldberg, Linxi "Jim" Fan

keywords: agents, manipulation, cap-x, framework, improves, across, cap-bench, code-as-policy

"Code-as-Policy" approaches use executable code to complement data-intensive Vision-Language-Action (VLA) methods, yet their effectiveness as autonomous controllers for embodied manipulation remains underexplored. We present CaP-X, an open-access framework for systematically studying Code-as-Policy agents in robot manipulation. At its core is CaP-Gym, an interactive environment in which agents control robots by synthesizing and executing programs that compose perception and control primitives. Building on this foundation, CaP-Bench evaluates frontier language and vision-language models across varying levels of abstraction, interaction, and perceptual grounding. Across 12 models, CaP-Bench reveals a consistent trend: performance improves with human-crafted abstractions but degrades as these priors are removed, exposing a dependence on designer scaffolding. At the same time, we observe that this gap can be mitigated by scaling agentic test-time computation: multi-turn interaction, structured execution feedback, visual differencing, automatic skill synthesis, and ensembled reasoning substantially improve robustness even when agents operate over low-level primitives. From these findings we derive CaP-Agent0, a training-free framework that recovers human-level reliability on several manipulation tasks in simulation and on real embodiments. We further introduce CaP-RL, showing that reinforcement learning with verifiable rewards improves success rates and transfers from simulation to real robots with a minimal gap. Together, CaP-X provides a principled, open-access platform for advancing embodied coding agents.
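
The multi-turn interaction with structured execution feedback described in the abstract can be illustrated with a toy sketch. Everything below is hypothetical and is not the CaP-Gym API: the `ToyArm` world, the primitives `detect`, `move_to`, `grasp`, and `release`, and the `scripted_proposer` that stands in for a language model are all assumptions made for illustration. The sketch also assumes the environment is reset on every turn, which the abstract does not specify.

```python
class ToyArm:
    """Hypothetical 1-D world: a gripper and named objects on a line."""

    def __init__(self):
        self.objects = {"cube": 3, "bowl": 7}
        self.gripper = 0
        self.held = None

    def detect(self, name):
        # Perception primitive: return the object's position.
        return self.objects[name]

    def move_to(self, x):
        # Control primitive: move the gripper, carrying any held object.
        self.gripper = x
        if self.held is not None:
            self.objects[self.held] = x

    def grasp(self, name):
        # Control primitive: fails unless the gripper is over the object.
        if self.objects[name] != self.gripper:
            raise RuntimeError(f"{name} is not under the gripper")
        self.held = name

    def release(self):
        self.held = None


def run_program(source, env):
    """Execute a candidate program and return structured feedback."""
    api = {"detect": env.detect, "move_to": env.move_to,
           "grasp": env.grasp, "release": env.release}
    try:
        exec(source, dict(api))  # illustration only, not a security sandbox
        return {"ok": True, "error": None}
    except Exception as exc:
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}


def multi_turn(propose, env_factory, success, max_turns=3):
    """Propose, execute, and retry with the previous turn's feedback."""
    feedback = None
    for turn in range(1, max_turns + 1):
        env = env_factory()  # assumption: fresh environment each turn
        source = propose(feedback)
        feedback = run_program(source, env)
        if feedback["ok"] and success(env):
            return turn, source
        if feedback["ok"]:
            feedback = {"ok": False, "error": "task check failed"}
    return None, None


def scripted_proposer(feedback):
    """Stand-in for a model: a buggy first attempt, then a repaired one."""
    if feedback is None:
        return "grasp('cube')\nmove_to(detect('bowl'))\nrelease()"
    return ("move_to(detect('cube'))\ngrasp('cube')\n"
            "move_to(detect('bowl'))\nrelease()")


def cube_in_bowl(env):
    return env.objects["cube"] == env.objects["bowl"]


if __name__ == "__main__":
    turn, source = multi_turn(scripted_proposer, ToyArm, cube_in_bowl)
    print(f"succeeded on turn {turn}")
```

In this sketch the first program fails because it grasps before moving; the error string is passed back to the proposer, which then returns a corrected program. A real agent would condition a model on that feedback rather than switch between two fixed scripts.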

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models
cs.CV 2026-05 unverdicted novelty 8.0

Flame3D enables zero-shot compositional 3D scene reasoning by representing scenes as editable visual-textual memories exposed to agentic MLLMs through composable and synthesizable spatial tools.
Recover, Discover, Plan: Learning Skills and Concepts from Robot Failures
cs.RO 2026-06 unverdicted novelty 7.0

ReSYNC learns recovery skills via RL then discovers and refines relational predicates to enable abstract planning that generalizes failure avoidance to unseen long-horizon tasks, outperforming baselines by over 50% in...
Improving Robotic Generalist Policies via Flow Reversal Steering
cs.RO 2026-06 unverdicted novelty 7.0

Flow Reversal Steering steers flow matching generalist policies by reversing suboptimal actions to nearby better modes, enabling improved zero-shot control, quick distillation, and RL bootstrapping in robotic manipulation.
VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation
cs.RO 2026-06 unverdicted novelty 7.0

VoLoAgent uses a VLM to steer heterogeneous robot capabilities as interruptible tools for long-horizon manipulation and introduces the RoboVoLo benchmark, claiming substantial outperformance over single VLA/VLM or too...
Sequential Planning via Anchored Robotic Keypoints
cs.RO 2026-06 unverdicted novelty 6.0

SPARK reaches 43.7% success on six LIBERO-PRO cells by LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling CaP-Agent0 and VLA baselines.
Guava: An Effective and Universal Harness for Embodied Manipulation
cs.RO 2026-06 unverdicted novelty 6.0

Guava harness enables 4B open-source models to achieve performance comparable to frontier models on embodied manipulation tasks by distilling capabilities from under 2K simulation trajectories using three identified d...
APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies
cs.RO 2026-06 unverdicted novelty 6.0

APT pretrains the action expert as a vision-action prior on frozen VLM features then adds language through gated fusion to improve OOD instruction generalization in continuous-action VLA policies.
GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping
cs.RO 2026-05 unverdicted novelty 6.0

GraspGen-X extends diffusion 6-DOF grasping to cross-embodiment via swept-volume gripper encoding, trained on procedural grippers and 2B grasps, claiming best zero-shot generalization to novel grippers in sim and real tests.
From Question Answering to Task Completion: A Survey on Agent System and Harness Design
cs.AI 2026-06 unverdicted novelty 4.0

Survey framing LLM agents as model-plus-harness systems, decomposing harness responsibilities, mapping them to tasks, and highlighting open challenges in evaluation, safety, and co-evolution.
ABot-Claw: A Foundation for Persistent, Cooperative, and Self-Evolving Robotic Agents
cs.CV 2026-04 unverdicted novelty 4.0

ABot-Claw is an embodied software layer that adds unified robot scheduling, cross-embodiment visual memory, and critic-driven replanning on top of OpenClaw to support persistent multi-robot execution from natural-lang...