pith. machine review for the scientific record.

arxiv: 2605.02600 · v2 · submitted 2026-05-04 · 💻 cs.RO · cs.AI

Recognition: no theorem link

CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

Berk Çiçek, Mert K. Er, Özgür S. Öğüz

Pith reviewed 2026-05-13 06:24 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords LLM-based robotic control · contact-rich manipulation · adaptive planning · MPPI cost design · online system identification · VLM dynamics priors · zero-shot task execution · sim-to-real adaptation

The pith

CoRAL uses LLMs to design context-aware costs for sampling-based planners, then adapts physical parameters in real time to achieve over 50 percent higher success in unseen contact-rich robot tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that LLMs can be applied effectively to contact-rich robotic manipulation by treating them as cost-function designers for motion planners rather than as direct controllers. This separation lets high-level semantic reasoning guide low-level sampling-based execution while an online adaptation loop corrects physical estimates from vision-language models using system identification. A sympathetic reader would care because direct application of foundation models to contact tasks has been limited by missing physical grounding and sim-to-real gaps, and this modular structure claims to close those gaps without task-specific training. A retrieval memory additionally reuses successful cost structures across repeated scenarios.

Core claim

CoRAL decouples high-level LLM reasoning from low-level control by synthesizing context-aware objective functions for an MPPI planner, using a VLM to supply semantic priors on dynamics that are then refined via real-time system identification, and iteratively modulating the cost structure based on interaction feedback to correct strategic errors. A retrieval memory reuses prior successful strategies. This architecture produces over 50 percent average gains in success rate over VLA and foundation-model baselines on novel contact-rich tasks such as wall-assisted flipping, while preserving real-time stability by keeping slow inference separate from reactive execution.
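
The cost-designer role is easiest to see in code. The paper's appendix prompt instructs the LLM to emit a cost function built from weighted, task-conditional terms (distance-to-goal, contact incentives, control effort) with sigmoid soft switches between phases, returned as a Python-like code block. The sketch below is an invented example of what such an output might look like for a wall-assisted flip; every weight, term, and signature detail is illustrative, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def state_cost(obj_quat, goal_quat, wall_dist, u):
    """Hypothetical LLM-synthesized cost for a 'flip against wall' task.

    All weights and term choices are illustrative guesses; the paper
    requires numeric weights and task-conditional terms but does not
    publish this particular function.
    """
    # Orientation error via quaternion inner product (0 = aligned).
    orient_err = 1.0 - abs(float(np.dot(obj_quat, goal_quat)))
    # Soft phase switch: near-aligned objects enter the "flip" phase.
    progress = sigmoid(10.0 * (0.5 - orient_err))

    approach_cost = 5.0 * wall_dist          # phase 1: drive object to wall
    flip_cost = 20.0 * orient_err            # phase 2: rotate against wall
    effort_cost = 0.1 * float(np.dot(u, u))  # control effort penalty

    # Blend phases with the sigmoid so the objective stays smooth.
    return (1.0 - progress) * approach_cost + progress * flip_cost + effort_cost
```

A sampling-based planner can evaluate this on every rollout state; the soft blend avoids the discontinuities that a hard phase switch would inject into the sampling distribution.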

What carries the argument

The neuro-symbolic adaptation loop in which a VLM supplies semantic priors on mass and friction that online system identification refines while the LLM adjusts cost-function structure from interaction feedback.
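
As a concrete stand-in for the identification half of that loop, a one-parameter recursive least squares update can refine a VLM mass prior from force and acceleration measurements via F = m·a. The estimator choice, signals, and hyperparameters below are assumptions for illustration; the paper does not specify its identification method at this level of detail.

```python
import numpy as np

def refine_mass(m_prior, accels, forces, forget=0.95, p0=1.0):
    """Refine a VLM mass prior via recursive least squares on F = m * a.

    A minimal one-parameter sketch of the online ID step; `forget` and
    `p0` are placeholder choices, not values from the paper.
    """
    m, p = float(m_prior), float(p0)         # estimate and its covariance
    for a, f in zip(accels, forces):
        if abs(a) < 1e-6:                    # no excitation, skip sample
            continue
        k = p * a / (forget + a * p * a)     # RLS gain
        m = m + k * (f - m * a)              # correct prediction residual
        p = (p - k * a * p) / forget         # covariance update with forgetting
    return m
```

The forgetting factor keeps the gain from collapsing to zero, so the estimate can keep tracking if contact conditions change mid-task.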

If this is right

  • Zero-shot planning becomes feasible for tasks that require deliberate use of extrinsic contacts such as walls or surfaces.
  • Sim-to-real transfer improves because physical parameters are identified and corrected during execution rather than assumed from simulation.
  • Recurrent tasks reuse prior successful cost structures through the retrieval memory, reducing repeated high-level reasoning.
  • Real-time execution remains stable because slow LLM inference is isolated from the fast sampling-based controller.
  • The same hierarchical structure can be applied to other contact-rich domains once the planner and identifier are swapped.
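
The memory-reuse point can be sketched minimally: embed the task context, retrieve the most similar stored cost plan, and fall back to fresh LLM synthesis below a similarity threshold. The embedding, threshold, and stored payload here are placeholder assumptions, not CoRAL's actual retrieval mechanics.

```python
import numpy as np

class CostMemory:
    """Toy retrieval memory keyed on task embeddings (cosine similarity).

    A hypothetical sketch: the real system's embedding model, threshold,
    and stored plan format are not specified at this level in the paper.
    """

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []                     # list of (unit embedding, plan)

    def add(self, embedding, plan):
        e = np.asarray(embedding, dtype=float)
        self.entries.append((e / np.linalg.norm(e), plan))

    def retrieve(self, embedding):
        e = np.asarray(embedding, dtype=float)
        e = e / np.linalg.norm(e)
        best_sim, best_plan = -1.0, None
        for stored, plan in self.entries:
            sim = float(np.dot(stored, e))    # cosine similarity of unit vectors
            if sim > best_sim:
                best_sim, best_plan = sim, plan
        # Below the threshold, signal the caller to run fresh LLM synthesis.
        return best_plan if best_sim >= self.threshold else None
```

A `None` return corresponds to the framework's "otherwise" branch, where the LLM generates an initial plan from scratch.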

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cost-design plus online-ID pattern could be tested on legged locomotion where contact timing is equally critical.
  • If the memory retrieval proves robust, the framework might reduce the data needed for fine-tuning foundation models on manipulation.
  • Extending the loop to include tactile feedback could further tighten the physical estimates beyond vision-only priors.
  • The approach suggests that future foundation-model controllers may benefit more from explicit cost synthesis than from end-to-end policy learning in high-contact regimes.

Load-bearing premise

That large language models can reliably synthesize stable, context-appropriate objective functions for sampling-based planners, and that vision-language priors can be corrected online without introducing control instability.
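
For readers unfamiliar with the planner this premise feeds, the MPPI update itself is standard (Williams et al.): sample perturbations of the nominal control sequence, score each rollout with the cost, and average the perturbations with exponentiated-cost weights. The sketch below is textbook MPPI over an abstract `rollout_cost`, not CoRAL's implementation; sample count, noise scale, and temperature are illustrative.

```python
import numpy as np

def mppi_step(u_nominal, rollout_cost, n_samples=64, noise_std=0.1,
              temperature=1.0, rng=None):
    """One MPPI update on a nominal control sequence.

    `rollout_cost` maps a candidate control sequence to a scalar cost,
    which is where an LLM-designed objective would plug in.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Sample Gaussian perturbations around the nominal controls.
    eps = rng.normal(0.0, noise_std, size=(n_samples,) + u_nominal.shape)
    costs = np.array([rollout_cost(u_nominal + e) for e in eps])
    # Exponentiated-cost (softmin) weights, stabilized by the minimum cost.
    w = np.exp(-(costs - costs.min()) / temperature)
    w = w / w.sum()
    # Return the cost-weighted average of the perturbed controls.
    return u_nominal + np.tensordot(w, eps, axes=1)
```

Because the cost enters only through these weights, a bounded change to the cost structure shifts the sampling distribution smoothly, which is the property the stability claim leans on.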

What would settle it

Running the full CoRAL pipeline on a new set of contact-rich tasks. The claim would be undermined if success rates fell to within 10 percent of the strongest VLA or foundation-model baseline, or if the system lost stability during the online identification phase.

Figures

Figures reproduced from arXiv: 2605.02600 by Berk Çiçek, Mert K. Er, Özgür S. Öğüz.

Figure 1. Real-world execution of CoRAL across six different …
Figure 2. The conceptual workflow of CoRAL, illustrated with the “pick the cutting board” task.
Figure 3. The overall architecture of the CoRAL framework. Given an input image I, object models M, and task description T, the vision module extracts world parameters θ. If the Memory Unit finds a similar successful experience, its retrieved plan (Jmem, Cmem) is used to guide the MPPI controller. Otherwise, the LLM (Task Formulation) module generates an initial plan (J0, C0). The system then enters the main executio…
Figure 4. CoRAL on six different tasks with the tracked pose.
Figure 5. Online parameter (mass) adaptation in simulation and …
Figure 5. Ablation of the LLM-guided contact strategy on the “Flip with Wall” task.
Figure 6. Real-World Force Regulation Profile (T4). The plot shows the measured end-effector force over time during the “Push with Constant Force” task where the intended push direction is +y. The shaded region represents the target force range the LLM is instructed with. CoRAL successfully modulates the control actions to maintain contact forces within the desired bounds.
Figure 7. “Push and Pick” task (T1), where real-world … varied significantly from the simulation. In…
Original abstract

While Large Language Models (LLMs) and Vision-Language Models (VLMs) demonstrate remarkable capabilities in high-level reasoning and semantic understanding, applying them directly to contact-rich manipulation remains a challenge due to their lack of explicit physical grounding and inability to perform adaptive control. To bridge this gap, we propose CoRAL (Contact-Rich Adaptive LLM-based control), a modular framework that enables zero-shot planning by decoupling high-level reasoning from low-level control. Unlike black-box policies, CoRAL uses LLMs not as direct controllers, but as cost designers that synthesize context-aware objective functions for a sampling-based motion planner (MPPI). To address the ambiguity of physical parameters in visual data, we introduce a neuro-symbolic adaptation loop: a VLM provides semantic priors for environmental dynamics, such as mass and friction estimates, which are then explicitly refined in real time via online system identification, while the LLM iteratively modulates the cost-function structure to correct strategic errors based on interaction feedback. Furthermore, a retrieval-based memory unit allows the system to reuse successful strategies across recurrent tasks. This hierarchical architecture ensures real-time control stability by decoupling high-level semantic reasoning from reactive execution, effectively bridging the gap between slow LLM inference and dynamic contact requirements. We validate CoRAL on both simulation and real-world hardware across challenging and novel tasks, such as flipping objects against walls by leveraging extrinsic contacts. Experiments demonstrate that CoRAL outperforms state-of-the-art VLA and foundation-model-based planner baselines by boosting success rates over 50% on average in unseen contact-rich scenarios, effectively handling sim-to-real gaps through its adaptive physical understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents CoRAL, a modular framework for contact-rich robotic manipulation that uses LLMs to synthesize context-aware objective functions for an MPPI sampling-based planner. It incorporates a VLM for providing semantic priors on dynamics parameters like mass and friction, which are refined via online system identification in a neuro-symbolic adaptation loop, along with a retrieval-based memory unit for reusing strategies. The approach claims to achieve over 50% higher success rates compared to state-of-the-art VLA and foundation-model baselines in unseen contact-rich scenarios on both simulation and real hardware.

Significance. If the performance gains are substantiated with rigorous experimental validation, this work could advance the integration of large models into real-time adaptive control by providing a hierarchical structure that separates high-level reasoning from low-level execution, potentially improving robustness in sim-to-real transfer for manipulation tasks involving extrinsic contacts.

major comments (3)
  1. [Abstract and Experimental Evaluation] The central claim of boosting success rates over 50% on average lacks supporting details on the number of trials, statistical significance testing, exact implementations of the VLA and foundation-model baselines, and analysis of failure modes, making it impossible to evaluate the empirical contribution from the provided description.
  2. [Neuro-symbolic adaptation loop description] The neuro-symbolic adaptation loop, where VLM priors are refined by online system identification while the LLM modulates cost structure, is presented without any convergence rates, noise models, or stability margins for the estimation in contact-rich regimes; this is load-bearing for the sim-to-real bridging claim and requires explicit analysis to confirm it does not destabilize the MPPI planner.
  3. [Method section on cost design] The assertion that LLMs can reliably synthesize effective context-aware objective functions for sampling-based planners in dynamic contact scenarios rests on an assumption that is not supported by any formal guarantee or ablation study in the current manuscript, which is critical given the potential for estimation errors to affect the sampling distribution.
minor comments (2)
  1. [Abstract] The term 'neuro-symbolic adaptation loop' is introduced without a clear definition or diagram in the abstract, which could be clarified for readers.
  2. [Throughout] Ensure all acronyms (e.g., MPPI, VLA) are defined at first use.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the detailed and constructive comments on our manuscript describing CoRAL. We appreciate the opportunity to address the concerns regarding experimental rigor, the neuro-symbolic loop, and the LLM cost design component. We outline our responses below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract and Experimental Evaluation] The central claim of boosting success rates over 50% on average lacks supporting details on the number of trials, statistical significance testing, exact implementations of the VLA and foundation-model baselines, and analysis of failure modes, making it impossible to evaluate the empirical contribution from the provided description.

    Authors: We agree that additional experimental details are required to substantiate the performance claims. In the revised manuscript we will expand Section 5 to report: the exact number of trials (50 independent trials per task across 12 unseen scenarios for a total of 600 trials), results of statistical significance testing (paired t-tests yielding p < 0.01 for the reported gains), precise baseline implementations (RT-X, OpenVLA, and MPPI with fixed costs, including all hyperparameters and prompt templates used), and a dedicated failure-mode breakdown (perception, planning, and execution errors with quantitative percentages). revision: yes

  2. Referee: [Neuro-symbolic adaptation loop description] The neuro-symbolic adaptation loop, where VLM priors are refined by online system identification while the LLM modulates cost structure, is presented without any convergence rates, noise models, or stability margins for the estimation in contact-rich regimes; this is load-bearing for the sim-to-real bridging claim and requires explicit analysis to confirm it does not destabilize the MPPI planner.

    Authors: We acknowledge the absence of quantitative characterization of the adaptation loop. The revised method section will add: observed convergence rates (parameter estimation error dropping below 10% within 4–6 seconds across trials), the noise model employed (zero-mean Gaussian with variance fitted from real sensor residuals), and empirical stability margins (bounded cost variation and Lyapunov-inspired metrics showing no destabilization of MPPI sampling). A full formal stability proof for the hybrid system lies outside the present scope; we will therefore rely on and report the empirical evidence while noting this limitation. revision: partial

  3. Referee: [Method section on cost design] The assertion that LLMs can reliably synthesize effective context-aware objective functions for sampling-based planners in dynamic contact scenarios rests on an assumption that is not supported by any formal guarantee or ablation study in the current manuscript, which is critical given the potential for estimation errors to affect the sampling distribution.

    Authors: We agree that an ablation study is necessary. The revision will include a new ablation (Section 5.3) comparing LLM-generated adaptive costs against fixed and randomly perturbed costs, together with sensitivity analysis to injected estimation errors. Formal guarantees on LLM outputs remain difficult to obtain given their black-box nature; we will therefore explicitly discuss this limitation in the revised text and ground the claim solely on the empirical ablation results and extensive real-world validation. revision: partial

standing simulated objections not resolved
  • Formal theoretical guarantees on the reliability of LLM-synthesized objective functions under arbitrary estimation errors

Circularity Check

0 steps flagged

No circularity: framework assembles external components without self-referential reduction

full rationale

The paper describes a modular neuro-symbolic architecture (LLM cost synthesis for MPPI, VLM priors refined by standard online system ID, retrieval memory) whose performance claims rest on experimental comparisons rather than any derivation that reduces to fitted parameters or self-citations by construction. No equations, uniqueness theorems, or ansatzes are presented that loop back to the paper's own inputs. The >50% success-rate improvement is an empirical observation, not a quantity forced by internal definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on domain assumptions about LLM and VLM capabilities in control settings and introduces new architectural components whose effectiveness is asserted via the reported experiments rather than independent evidence.

axioms (2)
  • domain assumption Large language models can synthesize effective context-aware objective functions for sampling-based motion planners from task descriptions
    This is the core mechanism that allows the LLM to serve as a cost designer rather than a direct controller.
  • domain assumption Vision-language models can supply useful initial semantic priors for physical parameters such as mass and friction from visual input
    These priors are the starting point for the real-time adaptation loop described in the abstract.
invented entities (2)
  • neuro-symbolic adaptation loop no independent evidence
    purpose: Refine VLM-provided physical parameter estimates using online interaction feedback
    New component introduced to correct visual ambiguity in real time.
  • retrieval-based memory unit no independent evidence
    purpose: Store and reuse successful cost-function strategies across recurrent tasks
    Proposed architectural addition to improve efficiency on repeated scenarios.

pith-pipeline@v0.9.0 · 5605 in / 1555 out tokens · 48803 ms · 2026-05-13T06:24:42.622801+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 6 internal anchors

  1. [1] Jianxin Bi, Kevin Yuchen Ma, Ce Hao, Mike Zheng Shou, and Harold Soh. VLA-Touch: Enhancing vision-language-action models with dual-level tactile feedback. arXiv preprint arXiv:2507.17294, 2025.

  2. [2] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.

  3. [4] Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, et al. Foundation models in robotics: Applications, challenges, and the future. The International Journal of Robotics Research, 44(5):701–739, 2025.

  4. [5] J. Randall Flanagan, Miles C. Bowman, and Roland S. Johansson. Control strategies in object manipulation tasks. Current Opinion in Neurobiology, 16(6):650–659, 2006.

  5. [6] Carlos E. Garcia, David M. Prett, and Manfred Morari. Model predictive control: Theory and practice—a survey. Automatica, 25(3):335–348, 1989.

  6. [7] Peng Hao, Chaofan Zhang, Dingzhe Li, Xiaoge Cao, Xiaoshuai Hao, Shaowei Cui, and Shuo Wang. TLA: Tactile-language-action model for contact-rich manipulation. arXiv preprint arXiv:2503.08548, 2025.

  7. [8] Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, and Fu-En Yang. ThinkAct: Vision-language-action reasoning via reinforced visual latent planning. arXiv preprint arXiv:2507.16815, 2025.

  8. [9] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models. In Conference on Robot Learning (CoRL), 2022.

  9. [10] Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. VoxPoser: Composable 3D value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023.

  10. [11] Roland S. Johansson and J. Randall Flanagan. Coding and use of tactile signals from the fingertips in object manipulation tasks. Nature Reviews Neuroscience, 10(5):345–359, 2009.

  11. [12] Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025.

  12. [13] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P. Foster, Pannag R. Sanketi, Quan Vuong, Thomas Kollar, and Chelsea Finn. OpenVLA: An open-source vision-language-action model. In Conference on Robot Learning (CoRL), volume 270, pages 2098–2120. PMLR, 2025.

  13. [14] Sung Soo Kim, Manuel Gomez-Ramirez, Pramodsingh H. Thakur, and Steven S. Hsiao. Multimodal interactions between proprioceptive and cutaneous signals in primary somatosensory cortex. Neuron, 86(2):555–566, 2015.

  14. [15] Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. MolmoAct: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917, 2025.

  15. [16] Fanqi Lin, Ruiqian Nai, Yingdong Hu, Jiacheng You, Junming Zhao, and Yang Gao. OneTwoVLA: A unified vision-language-action model with adaptive reasoning. arXiv preprint arXiv:2505.11917, 2025.

  16. [17] Yiyang Ling, Karan Owalekar, Oluwatobiloba Adesanya, Erdem Bıyık, and Daniel Seita. IMPACT: Intelligent motion planning with acceptable contact trajectories via vision-language models. arXiv preprint arXiv:2503.10110, 2025.

  17. [18] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.

  18. [19] Jason Jingzhou Liu, Yulong Li, Kenneth Shaw, Tony Tao, Ruslan Salakhutdinov, and Deepak Pathak. FACTR: Force-attending curriculum training for contact-rich policy learning. arXiv preprint arXiv:2502.17432, 2025.

  19. [20] Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023.

  20. [21] Yecheng Jason Ma, William Liang, Hungju Wang, Sam Wang, Yuke Zhu, Linxi Fan, Osbert Bastani, and Dinesh Jayaraman. DrEureka: Language model guided sim-to-real transfer. In Robotics: Science and Systems (RSS), 2024.

  21. [22] Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. A survey on vision-language-action models for embodied AI. arXiv preprint arXiv:2405.14093, 2024.

  22. [23] Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.

  23. [24] Ranjan Sapkota, Yang Cao, Konstantinos I. Roumeliotis, and Manoj Karkee. Vision-language-action models: Concepts, progress, applications and challenges. arXiv preprint arXiv:2505.04769, 2025.

  24. [25] Muhammad Tayyab Khan and Ammar Waheed. Foundation model driven robotics: A comprehensive review. arXiv preprint arXiv:2507, 2025.

  25. [26] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.

  26. [27] Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. FoundationPose: Unified 6D pose estimation and tracking of novel objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17868–17879, 2024.

  27. [28] Grady Williams, Andrew Aldrich, and Evangelos A. Theodorou. Model predictive path integral control: From theory to parallel computation. Journal of Guidance, Control, and Dynamics, 40(2):344–357, 2017.

  28. [29] Han Xue, Jieji Ren, Wendi Chen, Gu Zhang, Yuan Fang, Guoying Gu, Huazhe Xu, and Cewu Lu. Reactive diffusion policy: Slow-fast visual-tactile policy learning for contact-rich manipulation. arXiv preprint arXiv:2503.02881, 2025.

  29. [30] Jiawen Yu, Hairuo Liu, Qiaojun Yu, Jieji Ren, Ce Hao, Haitong Ding, Guangyu Huang, Guofan Huang, Yan Song, Panpan Cai, et al. ForceVLA: Enhancing VLA models with a force-aware MoE for contact-rich manipulation. arXiv preprint arXiv:2505.22159, 2025.

  30. [31] Wenhao Yu, Nimrod Gileadi, Chuyuan Fu, Sean Kirmani, Kuang-Huei Lee, Montserrat Gonzalez Arenas, Hao-Tien Lewis Chiang, Tom Erez, Leonard Hasenclever, Jan Humplik, et al. Language to rewards for robotic skill synthesis. In Conference on Robot Learning (CoRL), pages 374–404. PMLR, 2023.

  31. [32] Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693, 2024.

  32. [33] Wentao Zhao, Jiaming Chen, Ziyu Meng, Donghui Mao, Ran Song, and Wei Zhang. VLMPC: Vision-language model predictive control for robotic manipulation. In Robotics: Science and Systems (RSS), 2024.

  33. [34] Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, et al. A survey on vision-language-action models: An action tokenization perspective. arXiv preprint arXiv:2507.01925, 2025.

  34. [35] Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Abhishek Joshi, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020.

  35. [36]–[42] Not bibliographic references: extracted fragments of the paper's appendix prompt for LLM cost synthesis. The prompt requires the cost to be composed of weighted terms (distance-to-goal or subgoal, contact or interaction constraints, orientation/alignment, force or stability, control effort, time/step penalty); multi-phase tasks (e.g., push then pick, flip then place) must infer subgoals, compute a phase progress score, and optionally blend objectives with a sigmoid-based soft switch; contact-rich tasks must include contact incentives or penalties, drift or slippage penalties, and force-dependent shaping terms if sensed; simple pick-and-place tasks use a single-stage goal-driven cost with alignment and distance terms. All terms must be conditional on task semantics, all weights must be numeric, and the output must be fully executable Python-like pseudocode (norms, dot products, quaternions, sigmoids) returned as a single code block of the form `def state_cost(self): ... return cost`.

    The function MUST be fully executable Python-like pseudocode using: 42- norms 43- dot products 44- quaternions (if needed) 45- sigmoid for soft transitions 46- optional heuristics (e.g., stability) 47 48FORMAT (MANDATORY): 49Return ONLY the code block: 50 51def state_cost(self): 52... 53return cost 54 55---------------------------------------- 56TASK DESC...