ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models
Pith reviewed 2026-05-10 16:31 UTC · model grok-4.3
The pith
ProGAL-VLA conditions robot actions on verified 3D goal embeddings from prospective reasoning to increase instruction sensitivity and ambiguity awareness.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that constructing a 3D entity-centric graph, deriving symbolic sub-goals through prospective planning, and aligning them via a Grounding Alignment Contrastive loss yields a verified goal embedding that conditions every action. This embedding increases mutual information between language and actions, producing higher robustness under robot perturbations, lower language ignorance, improved entity retrieval, and calibrated selective prediction on ambiguous inputs without harming success rates on clear instructions.
What carries the argument
The verified goal embedding g_t generated from the 3D entity-centric graph (GSM), slow symbolic planner, and Grounding Alignment Contrastive loss, which conditions actions and supplies an attention-entropy ambiguity signal.
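The attention-entropy ambiguity signal mentioned above is straightforward to compute. The following is an editor's sketch, not the authors' code: it assumes the goal embedding's attention over candidate entities is available as a normalized weight vector, and the entropy normalization and clarification threshold are illustrative choices.

```python
import math

def attention_entropy(weights, eps=1e-12):
    """Shannon entropy of a normalized attention distribution.

    `weights` is assumed to be the goal embedding's attention over
    candidate entities (non-negative, summing to 1). High entropy means
    attention is spread across entities, i.e. the instruction is
    ambiguous about which entity is meant.
    """
    return -sum(w * math.log(w + eps) for w in weights)

def should_clarify(weights, threshold=0.8):
    """Hypothetical selective-prediction rule: request clarification when
    entropy, normalized by its maximum log(K), exceeds a threshold."""
    k = len(weights)
    if k <= 1:
        return False
    return attention_entropy(weights) / math.log(k) > threshold

# A peaked distribution (clear instruction) vs. a near-flat one (ambiguous):
print(should_clarify([0.94, 0.03, 0.03]))  # peaked  -> False, act directly
print(should_clarify([0.34, 0.33, 0.33]))  # flat    -> True, ask for clarification
```

The same scalar can then be thresholded for selective prediction, which is how the paper's calibrated clarification behavior would operate.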
If this is right
- Robustness under robot perturbations rises from 30.3 to 71.5 percent on LIBERO-Plus.
- Language ignorance drops by a factor of three to four.
- Entity retrieval improves to 0.71 Recall@1.
- Ambiguity detection reaches 0.81 AUROC while success on unambiguous cases stays intact.
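AUROC figures like the 0.81 above can be reproduced mechanically from any scalar ambiguity score via the rank-sum (Mann-Whitney) identity. The sketch below uses hypothetical entropy scores, not the paper's data:

```python
def auroc(scores, labels):
    """AUROC via the rank-sum identity: the probability that a randomly
    chosen positive (ambiguous, label 1) example scores higher than a
    randomly chosen negative (clear, label 0) one, with ties counted as
    half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical attention-entropy scores; 1 marks an ambiguous instruction.
scores = [0.9, 0.8, 0.75, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(auroc(scores, labels))  # 8/9 ~ 0.889: entropy mostly separates the classes
```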
Where Pith is reading between the lines
- The verified-embedding approach may extend to non-robotic multimodal agents that must handle changing instructions.
- Attention entropy could serve as a general-purpose uncertainty signal in other grounded planning systems.
- Real-time graph construction may require further engineering for high-speed or resource-limited deployments.
- The method points toward tighter integration of symbolic planning inside end-to-end neural policies for long-horizon tasks.
Load-bearing premise
A reliable 3D entity-centric graph and slow symbolic planner can be built in real time from visual input and the contrastive loss can align entities without introducing new biases or overfitting to the benchmarks.
What would settle it
Ablating the verification step or the contrastive loss and observing no gains in robustness or ambiguity metrics on the same LIBERO-Plus and Custom Ambiguity Benchmark would falsify the claim.
Figures
Original abstract
Vision language action (VLA) models enable generalist robotic agents but often exhibit language ignorance, relying on visual shortcuts and remaining insensitive to instruction changes. We present Prospective Grounding and Alignment VLA (ProGAL-VLA), which constructs a 3D entity-centric graph (GSM), uses a slow planner to produce symbolic sub-goals, and aligns them with grounded entities via a Grounding Alignment Contrastive (GAC) loss. All actions are conditioned on a verified goal embedding $g_t$, whose attention entropy provides an intrinsic ambiguity signal. On LIBERO-Plus, ProGAL-VLA increases robustness under robot perturbations from 30.3 to 71.5 percent, reduces language ignorance by 3x-4x, and improves entity retrieval from 0.41 to 0.71 Recall@1. On the Custom Ambiguity Benchmark, it reaches AUROC 0.81 (vs. 0.52), AUPR 0.79, and raises clarification on ambiguous inputs from 0.09 to 0.81 without harming unambiguous success. The verification bottleneck increases mutual information of language-actions, the GAC loss imposes an entity-level InfoNCE bound, and attention entropy yields calibrated selective prediction, indicating that explicit verified grounding is an effective path toward instruction-sensitive, ambiguity-aware agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ProGAL-VLA, a vision-language-action model for robotic agents that constructs a 3D entity-centric graph (GSM) from visual input, employs a slow symbolic planner to generate sub-goals, and aligns entities using a Grounding Alignment Contrastive (GAC) loss. Actions are conditioned on a verified goal embedding g_t whose attention entropy signals ambiguity. It reports gains on LIBERO-Plus (robustness 30.3% to 71.5%, language ignorance reduced 3-4x, entity retrieval Recall@1 0.41 to 0.71) and on a Custom Ambiguity Benchmark (AUROC 0.81, clarification rate 0.09 to 0.81), attributing these to increased mutual information and an entity-level InfoNCE bound from the verification step.
Significance. If the results and mechanisms hold after verification, the work indicates that explicit prospective grounding and alignment can reduce reliance on visual shortcuts in VLA models, yielding more instruction-sensitive and ambiguity-aware agents. The reported numerical improvements and use of attention entropy for selective prediction are concrete contributions, though their attribution to the proposed components requires stronger evidence.
major comments (2)
- [Abstract and Methods] The central claim that verified grounding via GSM, slow planner, and GAC loss produces the reported robustness and ambiguity-handling gains depends on reliable real-time construction of the 3D entity-centric graph and execution of the symbolic planner from visual input. The manuscript provides no implementation details, runtime measurements, latency ablations, or extraction protocol for GSM (abstract and methods description), making it impossible to attribute improvements to the mechanism rather than offline processing or benchmark-specific tuning.
- [Abstract] The abstract asserts that the verification bottleneck increases mutual information of language-actions and that the GAC loss imposes an entity-level InfoNCE bound, yet no derivation, proof sketch, or analysis of introduced biases/overfitting to the Custom Ambiguity Benchmark is supplied. This is load-bearing for the theoretical contribution, as the properties may simply restate the design choices without independent support.
minor comments (2)
- [Abstract] The notation for the verified goal embedding g_t is introduced in the abstract without a clear prior definition or equation reference for its construction and verification process.
- [Experiments] No statistical significance tests, variance across runs, or ablation tables are referenced for the benchmark gains (e.g., robustness, Recall@1), which would strengthen the experimental claims.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, providing clarifications where possible and committing to revisions that strengthen the manuscript's reproducibility and theoretical grounding.
Point-by-point responses
-
Referee: [Abstract and Methods] The central claim that verified grounding via GSM, slow planner, and GAC loss produces the reported robustness and ambiguity-handling gains depends on reliable real-time construction of the 3D entity-centric graph and execution of the symbolic planner from visual input. The manuscript provides no implementation details, runtime measurements, latency ablations, or extraction protocol for GSM (abstract and methods description), making it impossible to attribute improvements to the mechanism rather than offline processing or benchmark-specific tuning.
Authors: We agree that the current abstract and methods description are insufficient for full attribution and reproducibility. While Section 3 outlines the GSM as an entity-centric 3D graph constructed via off-the-shelf detection and tracking, and the planner as a symbolic PDDL-based module, we will expand the methods with a new subsection providing: the precise real-time extraction protocol from RGB-D streams, hardware-specific runtime measurements (e.g., per-frame latency), latency ablations isolating the slow planner's contribution, and confirmation that all components run online during benchmark evaluation. These additions will directly support that performance gains arise from the online mechanisms rather than offline or tuned processing. revision: yes
-
Referee: [Abstract] The abstract asserts that the verification bottleneck increases mutual information of language-actions and that the GAC loss imposes an entity-level InfoNCE bound, yet no derivation, proof sketch, or analysis of introduced biases/overfitting to the Custom Ambiguity Benchmark is supplied. This is load-bearing for the theoretical contribution, as the properties may simply restate the design choices without independent support.
Authors: We acknowledge the need for explicit support of these claims. The manuscript provides an intuitive account in Section 4 and a derivation of the GAC loss as an entity-level InfoNCE bound in the appendix, but we will add a concise proof sketch to the main text demonstrating how the verification bottleneck increases mutual information between language and actions. We will also include an analysis of potential biases and overfitting risks on the Custom Ambiguity Benchmark, reporting results on a held-out validation split to show generalization beyond the benchmark itself. revision: yes
Circularity Check
Verification bottleneck and GAC loss properties asserted without derivation, reducing to definitional restatements of the architecture
specific steps
-
Self-definitional
[Abstract]
"The verification bottleneck increases mutual information of language-actions, the GAC loss imposes an entity-level InfoNCE bound, and attention entropy yields calibrated selective prediction, indicating that explicit verified grounding is an effective path toward instruction-sensitive, ambiguity-aware agents."
The quoted sentence presents increases in mutual information and an InfoNCE bound as consequences of the verification bottleneck and GAC loss. Because the paper defines the architecture precisely as using a verified goal embedding g_t and a contrastive GAC loss, these properties follow by construction from the definitions (InfoNCE is the standard contrastive objective; verification by construction conditions on g_t) rather than from any separate derivation or falsifiable analysis shown in the text.
full rationale
The paper's central claim that explicit verified grounding produces instruction-sensitive agents rests on two load-bearing assertions in the abstract: that the verification step increases mutual information and that the GAC loss imposes an entity-level InfoNCE bound. These are presented as explanatory outcomes of the method, yet the provided text supplies no independent derivation, proof, or external benchmark separating them from the design choices themselves. The GSM construction and slow planner are invoked as prerequisites but receive no runtime or extraction details, leaving the reported metric gains (robustness 30.3%→71.5%, Recall@1 0.41→0.71) unattributed to mechanism versus tuning. This yields partial circularity: the explanatory narrative collapses into the inputs by construction rather than emerging from them.
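The circularity concern is easiest to see concretely: any contrastive loss with the GAC shape is an instance of InfoNCE, whose negative expected value lower-bounds the mutual information between the paired views up to log of the batch size, so an "entity-level InfoNCE bound" holds by construction for any such loss. A minimal sketch with hypothetical similarity scores (not the paper's model):

```python
import math

def info_nce(sim_matrix, temperature=0.1):
    """Standard InfoNCE loss over a batch of (sub-goal, entity) pairs.

    sim_matrix[i][j] is a similarity score between sub-goal i and entity j;
    the diagonal holds the positive (correctly grounded) pairs. The loss is
    the mean negative log-softmax of each positive against its in-batch
    negatives -- exactly the objective whose negative value lower-bounds
    I(sub-goal; entity) up to log(batch size).
    """
    n = len(sim_matrix)
    loss = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim_matrix[i]]
        log_denom = math.log(sum(math.exp(z) for z in logits))
        loss += -(logits[i] - log_denom)  # -log softmax of the positive pair
    return loss / n

# Correctly grounded pairs give low loss; misaligned similarities give high loss.
aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
assert info_nce(aligned) < info_nce(shuffled)
```

The open question the review raises is therefore not whether the bound holds, but whether it does any independent explanatory work beyond restating the training objective.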
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: visual observations can be parsed into a reliable 3D entity-centric graph (GSM).
invented entities (2)
-
Grounding Alignment Contrastive (GAC) loss
no independent evidence
-
verified goal embedding g_t
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Eureka: Evaluating and Understanding Large Foundation Models
Vidhisha Balachandran, Jingya Chen, Neel Joshi, Besmira Nushi, Hamid Palangi, Eduardo Salinas, Vibhav Vineet, James Woffinden-Luey, and Safoora Yousefi. Eureka: Evaluating and understanding large foundation models. arXiv preprint arXiv:2409.10566, 2024.
-
[2]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
-
[3]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
-
[4]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. Do as I can, not as I say: Grounding language in robotic affordances. In Conference on Robot Learning, pages 287–318. PMLR, 2023.
-
[5]
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. UniVLA: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025.
-
[6]
WorldVLA: Towards Autoregressive Action World Model
Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. WorldVLA: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025.
-
[7]
A Comprehensive Survey of Scene Graphs: Generation and Application
Xiaojun Chang, Pengzhen Ren, Pengfei Xu, Zhihui Li, Xiaojiang Chen, and Alex Hauptmann. A comprehensive survey of scene graphs: Generation and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):1–26, 2021.
-
[8]
PaLM-E: An Embodied Multimodal Language Model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, et al. PaLM-E: An embodied multimodal language model. 2023.
-
[9]
LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models
Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. LIBERO-Plus: In-depth robustness analysis of vision-language-action models. arXiv preprint arXiv:2510.13626, 2025.
-
[10]
LLM-Personalize: Aligning LLM Planners with Human Preferences via Reinforced Self-Training for Housekeeping Robots
Dongge Han, Trevor McInroe, Adam Jelley, Stefano V. Albrecht, Peter Bell, and Amos Storkey. LLM-Personalize: Aligning LLM planners with human preferences via reinforced self-training for housekeeping robots, 2024.
-
[11]
Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision
Xiaofeng Han, Shunpeng Chen, Zenghuang Fu, Zhe Feng, Lue Fan, Dong An, Changwei Wang, Li Guo, Weiliang Meng, Xiaopeng Zhang, Rongtao Xu, and Shibiao Xu. Multimodal fusion and vision-language models: A survey for robot vision, 2025.
-
[12]
GraphCoT-VLA: A 3D Spatial-Aware Reasoning Vision-Language-Action Model for Robotic Manipulation with Ambiguous Instructions
Helong Huang, Min Cen, Kai Tan, Xingyue Quan, Guowei Huang, and Hong Zhang. GraphCoT-VLA: A 3D spatial-aware reasoning vision-language-action model for robotic manipulation with ambiguous instructions. arXiv preprint arXiv:2508.07650, 2025.
-
[13]
NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks
Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U-Xuan Tan, Navonil Majumder, and Soujanya Poria. NORA: A small open-sourced generalist vision language action model for embodied tasks, 2025.
-
[14]
$\pi_{0.5}$: A Vision-Language-Action Model with Open-World Generalization
Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.
-
[15]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
-
[16]
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025.
-
[17]
Gp-nerf: Generalized perception nerf for context-aware 3d scene understanding
Hao Li, Dingwen Zhang, Yalun Dai, Nian Liu, Lechao Cheng, Jingfeng Li, Jingdong Wang, and Junwei Han. GP-NeRF: Generalized perception NeRF for context-aware 3D scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21708–21718,
-
[18]
Survey of Vision-Language-Action Models for Embodied Manipulation
Haoran Li, Yuhui Chen, Wenbo Cui, Weiheng Liu, Kai Liu, Mingcai Zhou, Zhengtao Zhang, and Dongbin Zhao. Survey of vision-language-action models for embodied manipulation, 2025.
-
[19]
Code as Policies: Language Model Programs for Embodied Control
Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. arXiv preprint arXiv:2209.07753, 2022.
-
[20]
Evaluation and Enhancement of Semantic Grounding in Large Vision-Language Models
Jiaying Lu, Jinmeng Rao, Kezhen Chen, Xiaoyuan Guo, Yawen Zhang, Baochen Sun, Carl Yang, and Jie Yang. Evaluation and enhancement of semantic grounding in large vision-language models, 2023.
-
[21]
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. A survey on vision-language-action models for embodied AI. arXiv preprint arXiv:2405.14093, 2024.
-
[22]
Loc-NeRF: Monte Carlo Localization Using Neural Radiance Fields
Dominic Maggio, Marcus Abate, Jingnan Shi, Courtney Mario, and Luca Carlone. Loc-NeRF: Monte Carlo localization using neural radiance fields. arXiv preprint arXiv:2209.09050, 2022.
-
[23]
Grounded Situation Models for Robots: Where Words and Percepts Meet
Nikolaos Mavridis and Deb Roy. Grounded situation models for robots: Where words and percepts meet. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4690–4697. IEEE, 2006.
-
[24]
R3M: A Universal Visual Representation for Robot Manipulation
Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3M: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601, 2022.
-
[25]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025.
-
[26]
A Roadmap to Guide the Integration of LLMs in Hierarchical Planning
Israel Puerta-Merino, Carlos Núñez-Molina, Pablo Mesejo, and Juan Fernández-Olivares. A roadmap to guide the integration of LLMs in hierarchical planning, 2025.
-
[27]
3D-MVP: 3D Multi-View Pretraining for Robotic Manipulation
Shengyi Qian, Kaichun Mo, Valts Blukis, David F Fouhey, Dieter Fox, and Ankit Goyal. 3D-MVP: 3D multi-view pretraining for robotic manipulation. arXiv preprint arXiv:2406.18158, 2024.
-
[28]
Learning Transferable Visual Models from Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–
-
[29]
Real-World Robot Learning with Masked Visual Pre-Training
Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, and Trevor Darrell. Real-world robot learning with masked visual pre-training. In Conference on Robot Learning, pages 416–426. PMLR, 2023.
-
[30]
A Generalist Agent
Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175,
-
[31]
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. MemoryVLA: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236, 2025.
-
[32]
Interactive Post-Training for Vision-Language-Action Models
Shuhan Tan, Kairan Dou, Yue Zhao, and Philipp Krähenbühl. Interactive post-training for vision-language-action models, 2025.
-
[33]
Embodying Pre-Trained Word Embeddings through Robot Actions
Minori Toyoda, Kanata Suzuki, Hiroki Mori, Yoshihiko Hayashi, and Tetsuya Ogata. Embodying pre-trained word embeddings through robot actions,
-
[34]
Pure Vision Language Action (VLA) Models: A Comprehensive Survey
Dapeng Zhang, Jin Sun, Chenghui Hu, Xiaoyan Wu, Zhenlong Yuan, Rui Zhou, Fei Shen, and Qingguo Zhou. Pure vision language action (VLA) models: A comprehensive survey. arXiv preprint arXiv:2509.19012, 2025.
-
[35]
Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models
Tinghui Zhu, Qin Liu, Fei Wang, Zhengzhong Tu, and Muhao Chen. Unraveling cross-modality knowledge conflicts in large vision-language models, 2024.
-
[36]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.
-
[37]
Theoretical Derivations and Proofs. This section expands the theoretical analysis from Section 4 and provides fully detailed derivations for the results used in the main paper. 7.1. Proof of Proposition 1 (Language Influence). Proposition 1. Under the Verification Bottleneck assumption, $I(L; a_t \mid O_t, q_t) = I(L; g_t \mid O_t, q_t) - I(L; g_t \mid a_t, O_t, q_t)$ (28). Proof. We first int...
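The proof excerpt is truncated, but the identity in Proposition 1 follows from the chain rule for mutual information. The following is an editor's reconstruction under the stated Verification Bottleneck assumption (the action $a_t$ depends on the language $L$ only through the verified goal $g_t$), not necessarily the authors' exact derivation:

```latex
% Expand I(L; g_t, a_t | O_t, q_t) by the chain rule in both orders:
I(L; g_t, a_t \mid O_t, q_t)
  = I(L; g_t \mid O_t, q_t) + I(L; a_t \mid g_t, O_t, q_t)
  = I(L; a_t \mid O_t, q_t) + I(L; g_t \mid a_t, O_t, q_t).
% The Verification Bottleneck makes a_t conditionally independent of L
% given g_t (and O_t, q_t), so the second term of the first expansion vanishes:
I(L; a_t \mid g_t, O_t, q_t) = 0.
% Equating the two expansions and rearranging yields Eq. (28):
I(L; a_t \mid O_t, q_t) = I(L; g_t \mid O_t, q_t) - I(L; g_t \mid a_t, O_t, q_t).
```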
-
[38]
Model Details
This section summarizes the architectures of all policies evaluated on LIBERO-Plus [9]. We focus on the high-level design of the backbones, modality encoders, and action parameterizations. OpenVLA and OpenVLA-OFT family: OpenVLA [15] is built around the Prismatic-7B vision-language backbone. Visual observations are encoded by a dual-strea...
-
[39]
Implementation Details: Architecture Hyperparameters
We summarize the instantiated components of ProGAL-VLA. During inference, the total parameter count is dominated by the OpenVLA backbone; the prospective planner $\pi_{slow}$ operates asynchronously and does not affect control-time latency
-
[40]
Extended Experimental Results
We provide detailed breakdowns that complement the aggregate results in the main text and expose specific robustness properties and failure behaviors of ProGAL-VLA. 10.1. Granular Robustness Analysis. Table 7 decomposes failures according to the underlying perturbation type. This isolates whether degradation is caused by rob...
-
[41]
Formal Specification of the Verification Bottleneck
For completeness, we formalize the architectural constraint referred to as the Verification Bottleneck in the main paper. At each timestep $t$, let the prospective planner $\pi_{slow}$ output a symbolic sub-goal $s_t$ from the language-vision model. The Grounded State Module (GSM) maps the observation $O_t$ into a...
-
[42]
Entity Memory Update Mechanism in GSM
The Grounded State Module maintains a bounded entity memory $M_t = \{e_t^{(1)}, \dots, e_t^{(K_t)}\}$ with capacity $N_{max} = 16$. Given a new observation $O_t$, YOLO-World provides 2D detections and Metric3D provides depth estimates. Each detection is converted into a 3D entity embedding with appearance, geometry, and positional attri...
-
[43]
Extract candidate entities from the current frame
-
[44]
If $|M_t| < N_{max}$, append all entities directly
-
[45]
If memory is full, remove the oldest entries (FIFO) and insert new ones
-
[46]
The resulting memory $M_t$ forms the node set of the temporal 3D entity graph. This memory is not used for long-horizon temporal reasoning; instead, it provides short-range stability and allows $\pi_{fast}$ to operate on a temporally smoothed representation that suppresses frame-level noise
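The bounded-memory update described in the steps above can be sketched directly. This is illustrative, not the authors' implementation: entity records are opaque placeholders, and the capacity is shrunk so the FIFO eviction is visible.

```python
from collections import deque

class EntityMemory:
    """Minimal sketch of the GSM's bounded entity memory
    (capacity N_max = 16 in the paper; names here are illustrative)."""

    def __init__(self, capacity=16):
        # A deque with maxlen gives FIFO eviction for free: appending to
        # a full deque silently drops the oldest entry.
        self.memory = deque(maxlen=capacity)

    def update(self, candidate_entities):
        """Insert entities extracted from the current frame, evicting the
        oldest entries first when the memory is full, and return the
        resulting node set of the temporal 3D entity graph."""
        for entity in candidate_entities:
            self.memory.append(entity)
        return list(self.memory)

# Shrunk capacity to make the FIFO behavior visible:
mem = EntityMemory(capacity=3)
mem.update(["mug", "plate"])
nodes = mem.update(["bowl", "knife"])  # "mug" is the oldest and gets evicted
print(nodes)  # ['plate', 'bowl', 'knife']
```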
-
[47]
Verified Goal Conditioning in $\pi_{fast}$
The action policy $\pi_{fast}$ conditions exclusively on the verified goal $g_t$ produced by the GSM and does not receive direct language input. Let $h(g_t)$ be the learned projection of the grounded entity embedding into a 4096-dimensional vector. The policy input at timestep $t$ is $x_t = h(g_t)$, and the control distribution is $a_t$ ...
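The conditioning scheme can be sketched as follows. The dimensions and the random matrix standing in for the learned projection h(·) are illustrative assumptions (the paper uses a learned 4096-dimensional projection), and the policy head is stubbed out:

```python
import random

random.seed(0)

def make_linear(in_dim, out_dim):
    """Random linear map standing in for the learned projection h(.)
    of the grounded entity embedding (4096-d in the paper; small
    dimensions here for illustration)."""
    w = [[random.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return lambda x: [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

h = make_linear(in_dim=8, out_dim=16)

def pi_fast(g_t):
    """The fast policy sees only the verified goal embedding g_t, never
    the raw language instruction: x_t = h(g_t) is its entire input.
    A real policy head would map x_t to an action distribution; here we
    return the conditioning vector itself."""
    x_t = h(g_t)
    return x_t

g_t = [0.1] * 8          # hypothetical verified goal embedding
x_t = pi_fast(g_t)
assert len(x_t) == 16    # the policy input is the projected goal, nothing else
```

The point of the sketch is the information flow, not the arithmetic: language can influence the action only through whatever survives in g_t, which is exactly the Verification Bottleneck.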
-
[48]
Symbolic Template Resolution
The prospective planner $\pi_{slow}$ produces symbolic templates such as grasp green mug. These templates are not executed directly; instead, they index the grounding step. Given a template $s_t$, a corresponding attribute filter is applied to the entities in memory: $\Gamma(s_t) = \{e \in M_t : e \text{ matches attributes in } s_t\}$. If multiple entities satisfy ...
-
[49]
Detailed Results on LIBERO-Plus
This section provides a detailed characterization of generalization under distribution shifts in the LIBERO-Plus benchmark. Figure 4 reports success rates under seven perturbation dimensions (Camera, Robot Initialization, Language Instruction, Lighting, Background, Sensor Noise, and Scene Layout) for a wide range of VLA...