arxiv: 2606.27268 · v1 · pith:LKCX7K53new · submitted 2026-06-25 · 💻 cs.RO · cs.AI

E-TTS: A New Embodied Test-Time Scaling Framework for Robotic Manipulation

Pith reviewed 2026-06-26 04:45 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords test-time scaling · robotic manipulation · vision-language verifiers · history buffer · iterative refinement · vision-language-action models · closed-loop feedback · embodied tasks

The pith

E-TTS improves robotic manipulation by scaling reasoning and actions at test time using history buffers and vision-language verifiers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces E-TTS as a modular framework to scale performance in robotic manipulation during inference without any retraining or new expert data. It addresses two gaps in prior test-time scaling work by jointly handling reasoning and action generation while incorporating historical context for long-horizon tasks. The method samples reasoning-action pairs, scores them pairwise with vision-language verifiers that draw on a stored history buffer, and refines candidates through closed-loop feedback. Experiments across multiple benchmarks, environments, and base models show consistent gains reaching 33.14 percent in simulation and 26.62 percent in real settings. Each part of the system operates as an independent module that users can configure for different tasks.

Core claim

E-TTS unifies reasoning and action scaling for robotic manipulation through pairwise joint sampling and scoring, a history buffer that supplies context to vision-language verifiers, and feedback generation that creates closed-loop iterative refinement, yielding performance increases up to 33.14 percent in simulation and 26.62 percent in real-world scenarios without additional training.

What carries the argument

The history-aware iterative refinement process that performs joint reasoning-action sampling, scores pairs with vision-language verifiers informed by a history buffer, and feeds results back to generate improved candidates in a closed loop.
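The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Candidate`, `HistoryBuffer`, and `refine` are hypothetical, the action is a single float standing in for an action vector, the feedback is a numeric hint rather than generated text, and the joint score is assumed to be the product of the reasoning and action verifier scores (the page does not say how the two are combined).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    reasoning: str  # sampled reasoning text
    action: float   # stand-in for an action vector (assumption)


@dataclass
class HistoryBuffer:
    max_len: int = 8
    entries: List[Candidate] = field(default_factory=list)

    def add(self, cand: Candidate) -> None:
        # Keep only the most recent max_len executed candidates.
        self.entries = (self.entries + [cand])[-self.max_len:]


def refine(
    sample: Callable[[Optional[float]], Candidate],
    score_reasoning: Callable[[Candidate, HistoryBuffer], float],
    score_action: Callable[[Candidate, HistoryBuffer], float],
    make_feedback: Callable[[Candidate, float], float],
    history: HistoryBuffer,
    n_samples: int = 4,
    max_iters: int = 3,
    accept: float = 0.9,
) -> Tuple[Candidate, float]:
    """Sample reasoning-action pairs, score them jointly, refine with feedback."""
    feedback: Optional[float] = None
    best: Optional[Candidate] = None
    best_score = float("-inf")
    for _ in range(max_iters):
        for _ in range(n_samples):
            cand = sample(feedback)
            # Assumption: joint score is the product of both verifier scores.
            joint = score_reasoning(cand, history) * score_action(cand, history)
            if joint > best_score:
                best, best_score = cand, joint
        if best_score >= accept:
            break  # good enough, stop spending samples
        # Closed loop: the next round of sampling is conditioned on feedback.
        feedback = make_feedback(best, best_score)
    assert best is not None, "n_samples and max_iters must be positive"
    history.add(best)  # the chosen pair becomes context for later steps
    return best, best_score


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    best, score = refine(
        sample=lambda fb: Candidate("reach", (fb or 0.0) + rng.uniform(-0.3, 0.3)),
        score_reasoning=lambda c, h: 1.0,
        score_action=lambda c, h: max(0.0, 1.0 - abs(c.action - 0.5)),
        make_feedback=lambda c, s: c.action,
        history=HistoryBuffer(),
    )
    print(best, score)
```

In a real system the two scoring callables would wrap vision-language model calls that receive the history buffer as context. The early-exit threshold `accept` is one plausible way to obtain the inference-efficiency benefit the paper attributes to closed-loop feedback.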

If this is right

  • Robotic policies achieve higher success rates without collecting extra expert data or retraining base models.
  • Historical context stored in a buffer improves handling of sequential, long-horizon embodied tasks.
  • Closed-loop feedback from verifiers raises both inference efficiency and adaptability to changing environments.
  • Independent modules allow flexible configuration depending on task requirements.
  • Gains observed across simulation benchmarks transfer to real-world robot embodiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same history-buffer and verifier loop could extend to other sequential robot tasks such as navigation or multi-step assembly.
  • Test-time scaling of this form may reduce the need for ever-larger training datasets in robotics by shifting compute to inference.
  • Combining E-TTS with stronger or task-specific verifiers could produce further measurable lifts in success rate.
  • The pairwise sampling approach might generalize to other domains where both internal reasoning and external actions must be scaled together.

Load-bearing premise

Vision-language models can reliably score and rank sampled reasoning-action pairs using only the history buffer, and this scoring produces genuine task improvement rather than noise or verifier bias.

What would settle it

An experiment that replaces the vision-language verifier scoring with random selection among the sampled pairs, at the same sampling budget, would settle it: if the reported performance gains still appear under random selection, the central mechanism is falsified.
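A comparison harness for that control might look like the sketch below. Everything in it is illustrative: the helper names are hypothetical, candidates are plain floats, and the toy verifier is a perfect oracle, so the toy numbers say nothing about E-TTS itself. The only point is the shape of the comparison: the same candidate sets, with the selection rule as the single thing that differs.

```python
import random
from typing import Callable, List, Sequence


def pick_by_verifier(cands: Sequence[float], score: Callable[[float], float]) -> float:
    # Verifier arm: execute the highest-scoring candidate.
    return max(cands, key=score)


def pick_at_random(cands: Sequence[float], rng: random.Random) -> float:
    # Control arm: ignore scores, choose uniformly among the same candidates.
    return rng.choice(list(cands))


def success_rate(
    episodes: List[List[float]],
    pick: Callable[[Sequence[float]], float],
    succeeds: Callable[[float], bool],
) -> float:
    wins = sum(1 for cands in episodes if succeeds(pick(cands)))
    return wins / len(episodes)


def make_toy_episodes(n_episodes: int, n_cands: int, seed: int) -> List[List[float]]:
    # Synthetic candidate sets; a real study would log the sampled pairs instead.
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n_cands)] for _ in range(n_episodes)]


if __name__ == "__main__":
    episodes = make_toy_episodes(n_episodes=200, n_cands=4, seed=0)
    succeeds = lambda a: abs(a - 0.5) < 0.1   # toy success criterion
    oracle_score = lambda a: -abs(a - 0.5)    # toy verifier (perfect oracle)
    rng = random.Random(1)
    verifier_rate = success_rate(episodes, lambda c: pick_by_verifier(c, oracle_score), succeeds)
    random_rate = success_rate(episodes, lambda c: pick_at_random(c, rng), succeeds)
    print(f"verifier: {verifier_rate:.3f}  random: {random_rate:.3f}")
```

Because both arms draw from identical candidate sets, any gap between the two rates is attributable to selection alone. A gap near zero in the real system would indicate that sampling volume, not verification, drives the gains.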

Figures

Figures reproduced from arXiv: 2606.27268 by Chaoyang Zhao, Jing Liu, Liang Wang, Nianfeng Liu, Peiyan Li, Tingyu Yuan, Wen Ye, Xiangnan Wu, Yan Huang, Yuan Xu.

Figure 1: Overview. E-TTS is an embodied test-time scaling framework that integrates reasoning and action scaling for robotic manipulation through history-aware, closed-loop interactions with vision-language verifiers. When combined with standard VLA models, E-TTS consistently enhances performance, achieving up to a 33.14% improvement in simulation and 26.62% in real-world scenarios. Test-time scaling (TTS) has gai…
Figure 2: Overview of the proposed E-TTS framework.
Figure 3: Success rate across different proportions. The ‘1/1’ proportion achieves peak performance across all tasks. One of the main contributions of this paper is the proposal to scale both reasoning and action. A natural question that arises is how to allocate computational resources when they are limited. We test five distribution ratios using the E-CoT setup from Sec. 4.1, with a 60-step sampling limit. Re…
Figure 4: Real-Robot Experiments. Overview of the real-world robotic setup and evaluation results. Left: Hardware configuration including ZED 2i camera, Franka Research 3 arm, and RealSense D405. Middle: Representative manipulation tasks. Right: Success rate comparison, where our method (blue) significantly outperforms the MolmoAct-Finetuned baseline, achieving an average improvement of 26.62%.
Figure 5: Visualization of spatial reasoning traces. Each example shows the overlaid keypoints representing the predicted robot trajectory, where blue and red dots indicate the start and end positions, respectively. Well-aligned traces correspond to physically plausible reasoning, while misaligned traces indicate inconsistent or infeasible spatial grounding.
Figure 6: Example of the selection process. The figure demonstrates three iterations. The first two are rejected due to incomplete object detection (missing the spoon and the cloth, respectively), while the third successfully identifies all objects and achieves the highest joint score for execution. A.8 Examples on Feedback-Guided Iterative Refinement
Figure 7: Qualitative comparison between E-CoT and E-CoT+Ours (E-TTS) on SimplerEnv across three manipulation tasks: “Put the Spoon on the Towel,” “Put the Eggplant in the Basket,” and “Put the Carrot on the Plate.” Each row shows temporal execution frames. While E-CoT often fails to complete the placement action, our approach achieves precise object manipulation with shorter completion time and higher success ra…
Figure 8: Visualization of self-corrective behaviors exhibited by E-CoT + Ours in the task “Put the carrot on the plate.” Each row shows consecutive attempts from the same rollout. After failing to align the carrot with the plate in the first and second tries, the model re-evaluates its reasoning and action candidates through closed-loop feedback and successfully completes the task in the third attempt. This demonst…
Figure 9: Qualitative comparison between ER-1 and ER-1+Ours (E-TTS) on SimplerEnv on three manipulation tasks: “Put the Carrot on the Plate,” “Put the Eggplant in the Basket,” and “Put the Spoon on the Towel.” While ER-1 often fails to complete fine-grained placement due to inaccurate spatial reasoning, our E-TTS-enhanced model achieves more consistent object alignment and successful task completion.
Figure 10: Qualitative comparison between MolmoAct and MolmoAct+Ours on LIBERO on four manipulation tasks: “Open the Middle Drawer,” “Put the Alphabet Soup in the Basket,” “Pick up the Black Bowl Next to the Ramekin and Place it on the Plate,” and “Pick up the Cream Cheese and Place it in the Basket.” While MolmoAct often fails to accomplish fine-grained actions or misinterprets spatial relations, our method consis…
Figure 11: Visualization of real-world rollouts on two representative tasks: Put the toy lion on the shelf (top) and Press the sanitizer (bottom). Each row illustrates sequential observations from the robot during execution, where green boxes denote successful interactions and red boxes indicate failed attempts. The results highlight that base model + E-TTS can recover from previous failures and successfully complet…
Original abstract

Recently, a few works have made early attempts to study test-time scaling for embodied tasks. However, two major challenges remain unsolved: (1) reasoning can effectively improve the performance of the policy, but its scaling mechanism has seldom been studied; (2) historical information is essential, as embodied tasks are inherently long-horizon and sequential, making sole reliance on current observations for action scaling inadequate due to the lack of historical context utilization. To address these challenges, we introduce E-TTS, a modular and plug-and-play Embodied Test-Time Scaling framework that unifies reasoning and action scaling for robotic manipulation via history-aware iterative refinement with vision-language verifiers. To support joint reasoning-action scaling, E-TTS performs reasoning-action joint sampling and scoring in a pairwise manner. To better utilize historical information, E-TTS uses a history buffer to store historical context, which is then used by reasoning and action verifiers to evaluate the sampled candidates. Unlike conventional open-loop TTS methods, E-TTS introduces feedback generation into the sampling process to form a closed-loop iterative refinement mechanism, enhancing both inference efficiency and environmental adaptability. Each component functions as an independent and composable module, allowing flexible and adaptive configuration depending on task requirements. To evaluate the advantages of our framework, we conduct experiments across 4 different benchmarks, 6 environments, 3 embodiments, and 4 base vision-language-action models. The experimental results demonstrate that, without requiring additional expert data collection or retraining, E-TTS consistently improves performance, achieving up to a 33.14% increase in simulation and 26.62% in real-world scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces E-TTS, a modular plug-and-play framework for embodied test-time scaling in robotic manipulation. It unifies reasoning and action scaling through history-aware iterative refinement: joint sampling of reasoning-action pairs, pairwise scoring by vision-language verifiers that consult a history buffer, and closed-loop feedback generation. Experiments across 4 benchmarks, 6 environments, 3 embodiments, and 4 base VLA models report consistent gains (up to 33.14% simulation, 26.62% real-world) without retraining or extra expert data.

Significance. If the reported gains are attributable to the verifier-driven refinement rather than sampling volume or verifier bias, the work would offer a practical, composable inference-time method to improve existing VLA policies on long-horizon tasks. The modular design and breadth of evaluation (multiple embodiments and base models) are strengths; the absence of parameter-free derivations or machine-checked proofs is noted but does not detract from the empirical focus.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): the central claim that VLM verifiers produce genuine improvement via pairwise scoring on the history buffer is load-bearing, yet no correlation analysis, human agreement metrics, or random-scoring control is described to isolate the verifier's contribution from sampling noise. Without this, the 33.14%/26.62% gains cannot be confidently linked to the proposed mechanism.
  2. [§4.1 and Table 1] §4.1 and Table 1: the experimental setup does not report whether the reasoning and action verifiers were tuned or prompted on the same task distributions used for evaluation; if so, the 'no additional expert data' claim is weakened and the improvements may reflect verifier overfitting rather than generalizable refinement.
  3. [§4.3] §4.3 (Ablations): the paper does not present an ablation that disables the closed-loop feedback generation while keeping sampling budget fixed, leaving open whether the iterative refinement itself, rather than simply evaluating more candidates, drives the reported gains.
minor comments (2)
  1. [§3] Notation for the history buffer and verifier scoring functions is introduced without a compact mathematical definition; a single equation summarizing the joint scoring step would improve clarity.
  2. [Figure 2] Figure 2 (framework diagram) would benefit from explicit arrows showing how verifier scores feed back into the next sampling iteration.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications on our experimental design and commit to targeted revisions that strengthen the isolation of our proposed mechanisms without altering the core claims.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the central claim that VLM verifiers produce genuine improvement via pairwise scoring on the history buffer is load-bearing, yet no correlation analysis, human agreement metrics, or random-scoring control is described to isolate the verifier's contribution from sampling noise. Without this, the 33.14%/26.62% gains cannot be confidently linked to the proposed mechanism.

    Authors: We agree that stronger isolation of the verifier's role would be valuable. Our §4.3 ablations already compare configurations with and without the verifier and history buffer, showing performance drops when these are removed. To directly address the concern, the revised manuscript will add a random-scoring control (replacing verifier scores with uniform random values at the same sampling budget) and report any available correlation between verifier scores and task success rates. This will help attribute gains to the proposed mechanism rather than sampling volume. revision: yes

  2. Referee: [§4.1 and Table 1] §4.1 and Table 1: the experimental setup does not report whether the reasoning and action verifiers were tuned or prompted on the same task distributions used for evaluation; if so, the 'no additional expert data' claim is weakened and the improvements may reflect verifier overfitting rather than generalizable refinement.

    Authors: The verifiers are off-the-shelf VLMs prompted with fixed, general instructions for evaluating reasoning quality and action feasibility; these prompts contain no task-specific examples, fine-tuning, or data from the evaluation benchmarks. No additional expert data or task-specific adaptation was used. We will revise §4.1 and Table 1 captions to explicitly document the prompt templates and confirm the absence of any tuning on the target distributions. revision: yes

  3. Referee: [§4.3] §4.3 (Ablations): the paper does not present an ablation that disables the closed-loop feedback generation while keeping sampling budget fixed, leaving open whether the iterative refinement itself, rather than simply evaluating more candidates, drives the reported gains.

    Authors: This is a fair observation. While existing ablations vary the history buffer and verifier usage, they do not hold total evaluations constant while removing the closed-loop feedback step. In the revision we will add this ablation: a non-iterative baseline that draws the same total number of reasoning-action candidates in a single open-loop pass (no feedback) and compares it directly to the closed-loop version. This will isolate the contribution of iterative refinement. revision: yes
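The fixed-budget ablation promised in the third response can be sketched as follows. This is an illustrative harness only: `CountingSampler`, `open_loop`, and `closed_loop` are hypothetical names, candidates are plain floats, and the toy sampler is deliberately built so that feedback helps, which means the toy outcome carries no evidential weight. What it does show is how to confirm that both arms consume the same number of samples.

```python
from typing import Callable, Optional


class CountingSampler:
    """Wraps a sampler and counts calls so the two arms can be budget-matched."""

    def __init__(self, fn: Callable[[Optional[float]], float]) -> None:
        self.fn = fn
        self.calls = 0

    def __call__(self, feedback: Optional[float]) -> float:
        self.calls += 1
        return self.fn(feedback)


def open_loop(sample, score: Callable[[float], float], budget: int) -> float:
    # Single pass: draw the whole budget without feedback, keep the best.
    cands = [sample(None) for _ in range(budget)]
    return max(cands, key=score)


def closed_loop(sample, score, make_feedback, per_iter: int, iters: int) -> float:
    # Iterative: per_iter samples per round, feedback between rounds.
    # No early stopping here, so the total stays at per_iter * iters.
    feedback: Optional[float] = None
    best: Optional[float] = None
    best_score = float("-inf")
    for _ in range(iters):
        for _ in range(per_iter):
            cand = sample(feedback)
            s = score(cand)
            if s > best_score:
                best, best_score = cand, s
        feedback = make_feedback(best, best_score)
    assert best is not None, "per_iter and iters must be positive"
    return best


if __name__ == "__main__":
    toy = lambda fb: 0.0 if fb is None else fb + 0.1  # toy sampler (assumption)
    score = lambda a: -abs(a - 0.5)                   # toy verifier score
    s_open, s_closed = CountingSampler(toy), CountingSampler(toy)
    a_open = open_loop(s_open, score, budget=12)
    a_closed = closed_loop(s_closed, score, lambda b, s: b, per_iter=3, iters=4)
    print(s_open.calls, s_closed.calls, a_open, a_closed)
```

If the real closed-loop system stops early once a candidate is accepted, the open-loop arm should be given the number of samples the closed-loop arm actually used rather than its nominal maximum. Otherwise the comparison favors the open-loop baseline on compute.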

Circularity Check

0 steps flagged

Empirical framework with no derivational circularity

full rationale

The paper presents E-TTS as a modular, plug-and-play framework for embodied test-time scaling, relying on history buffers, pairwise VLM scoring of reasoning-action pairs, and closed-loop refinement. All reported gains (up to 33.14% simulation, 26.62% real-world) are framed as experimental outcomes across benchmarks, environments, and base models, with no mathematical derivations, uniqueness theorems, fitted parameters renamed as predictions, or self-citation chains that reduce the central claims to their own inputs by construction. The structure is self-contained as an engineering contribution validated externally via task performance metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that off-the-shelf vision-language models can serve as accurate verifiers for both reasoning quality and action suitability when given historical context; this assumption is not independently validated in the provided abstract.

axioms (1)
  • domain assumption: Vision-language models can evaluate and rank reasoning-action candidates using history
    This is invoked to justify the scoring step that drives iterative refinement.

pith-pipeline@v0.9.1-grok · 5851 in / 1244 out tokens · 37120 ms · 2026-06-26T04:45:40.078041+00:00 · methodology

