EmbodimentSemantic: A Spatial Scene-Graph Dataset and Benchmark for Vision-Language Models on Embodied Manipulation Trajectories

Christopher E. Mower; Haitham Bou-Ammar; Hassan Jaber; Luca Cagliero; Refinath S N

arxiv: 2607.00020 · v1 · pith:72RVQLYDnew · submitted 2026-06-06 · 💻 cs.RO

Pith reviewed 2026-07-02 22:44 UTC · model grok-4.3

keywords: scene graphs · spatial relations · vision-language models · robotic manipulation · embodied AI · benchmark · dataset · LIBERO

The pith

A spatial scene-graph benchmark shows vision-language models predict plausible relations but fail on exact depth-aware and viewpoint-dependent structure in manipulation scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EmbodimentSemantic, a dataset and benchmark that represents manipulation scenes as directed object-relation-object triplets drawn from a fixed vocabulary of spatial relations. It supplies both real-world observations from the low-cost SO101 robot arm and a simulator-grounded LIBERO benchmark whose ground-truth labels are derived automatically from MuJoCo geometry, world coordinates, camera projections, and visibility constraints. Experiments on open-source and commercial VLMs show that current models often predict plausible relations but struggle with exact depth-aware and viewpoint-dependent structure. The work also tests whether injecting the graphs into VLA policy prompts improves downstream control.

Core claim

EmbodimentSemantic represents scenes as directed object-relation-object triplets that enable direct evaluation of object binding, relation prediction, and spatial consistency. The accompanying simulator-grounded benchmark supplies over 60K frames and more than 120K camera-specific scene graphs, with ground-truth relations derived from MuJoCo geometry, world coordinates, camera projections, and visibility constraints. Evaluated on it, current VLMs often predict plausible relations but struggle with exact depth-aware and viewpoint-dependent spatial structure.

What carries the argument

Directed object-relation-object triplets that encode spatial relations between ordered object pairs, evaluated for binding accuracy, relation correctness, and cross-view consistency.
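
As a rough illustration of how directed triplets support exact-match scoring, the sketch below stores each edge as a (subject, relation, object) tuple and computes precision, recall, and F1 by set intersection. The `Triplet` type, the `triplet_prf` helper, the relation names, and the toy scene are all assumptions made for this example; the paper's actual vocabulary and matching rules are not given in the text above.

```python
from typing import NamedTuple, Set, Tuple


class Triplet(NamedTuple):
    """A directed (subject, relation, object) edge in a scene graph."""
    subject: str
    relation: str
    object: str


def triplet_prf(predicted: Set[Triplet], gold: Set[Triplet]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over two sets of triplets."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical scene; relation names are illustrative only.
gold = {
    Triplet("bowl", "on", "plate"),
    Triplet("mug", "left_of", "bowl"),
    Triplet("mug", "in_front_of", "plate"),
}
predicted = {
    Triplet("bowl", "on", "plate"),
    Triplet("mug", "left_of", "bowl"),
    # Same objects and relation, but the direction is flipped,
    # so exact matching counts it as an error.
    Triplet("plate", "in_front_of", "mug"),
}

precision, recall, f1 = triplet_prf(predicted, gold)
```

Because the edges are directed, a prediction that names the right objects and relation in the wrong order is scored as both a false positive and a false negative, which is one way a binding error can surface in the metric.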

If this is right

Scene graphs provide an explicit diagnostic for object binding and relation errors in VLM perception pipelines.
Paired third-person and wrist-view graphs allow controlled measurement of viewpoint dependence.
Injecting scene graphs into existing VLA prompts supplies a direct test of whether explicit spatial structure improves downstream control.
The real-world SO101 trajectories extend the benchmark beyond simulation to practical robotic settings.
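
One plausible way to inject a scene graph into a VLA policy prompt is to serialize the triplets as plain text ahead of the instruction. The sketch below assumes that format; the function names, the prompt template, and the example objects are hypothetical and are not taken from the paper.

```python
def scene_graph_to_text(triplets, view):
    """Render (subject, relation, object) tuples as one bullet per edge."""
    lines = [f"Scene graph ({view} view):"]
    # Sorting keeps the serialization deterministic across runs.
    for subject, relation, obj in sorted(triplets):
        lines.append(f"- {subject} {relation.replace('_', ' ')} {obj}")
    return "\n".join(lines)


def build_prompt(instruction, triplets, view="third-person"):
    """Prepend the serialized scene graph to the language instruction."""
    return f"{scene_graph_to_text(triplets, view)}\nInstruction: {instruction}"


# Hypothetical triplets and instruction, for illustration only.
triplets = {("mug", "left_of", "bowl"), ("bowl", "on", "plate")}
prompt = build_prompt("put the mug in the bowl", triplets)
```

Running the same policy with and without the graph prefix, on the same instruction and observations, is the kind of controlled comparison the last two bullets describe.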

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The triplet format could be used to generate synthetic training data that targets specific depth and occlusion failures observed in the models.
Extending the relation vocabulary or adding temporal edges across frames might expose whether current limitations are static or motion-related.
The benchmark setup offers a template for testing whether other relational representations, such as graphs with probabilistic edges, close the observed gaps.

Load-bearing premise

Ground-truth relations derived automatically from simulator geometry and projections match the spatial facts that determine successful manipulation.
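
To make the premise concrete, here is a minimal sketch of how lateral and depth relations could be read off object positions once they are transformed into a camera frame. Everything in it is an assumption for illustration: the camera convention (x to the right, z pointing away from the camera), the `margin` threshold, the relation names, and the helper functions. It does not reproduce the paper's labeling rules and ignores visibility checks entirely.

```python
def world_to_camera(point, rotation, translation):
    """Apply p_cam = R @ p_world + t with plain Python lists."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def view_relations(name_a, pos_a, name_b, pos_b, rotation, translation, margin=0.02):
    """Return directed lateral and depth triplets for A relative to B.

    Assumed convention: camera x points right, camera z points away
    from the camera, so a smaller z means closer to the viewer.
    """
    xa, _, za = world_to_camera(pos_a, rotation, translation)
    xb, _, zb = world_to_camera(pos_b, rotation, translation)
    relations = []
    if xa < xb - margin:
        relations.append((name_a, "left_of", name_b))
    elif xa > xb + margin:
        relations.append((name_a, "right_of", name_b))
    if za < zb - margin:
        relations.append((name_a, "in_front_of", name_b))
    elif za > zb + margin:
        relations.append((name_a, "behind", name_b))
    return relations


mug, bowl = (-0.1, 0.0, 0.5), (0.1, 0.0, 0.8)  # hypothetical world positions

# Camera 1: at the world origin, looking along +z.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
front_view = view_relations("mug", mug, "bowl", bowl, identity, (0, 0, 0))

# Camera 2: on the opposite side of the scene, rotated 180 degrees about y.
flipped = [[-1, 0, 0], [0, 1, 0], [0, 0, -1]]
back_view = view_relations("mug", mug, "bowl", bowl, flipped, (0, 0, 1.3))
```

The same two world positions give opposite lateral and depth labels from the two cameras. That is why labels have to be camera-specific, and also why the choice of threshold and convention can shape what counts as a model error.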

What would settle it

An experiment in which VLA policies prompted with the derived scene graphs achieve no higher task success rate than identical policies without the graphs, or in which human judges systematically disagree with the simulator-derived relations on the same frames.
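
The human-judgment half of this test could be quantified with a chance-corrected agreement statistic. The sketch below uses Cohen's kappa, a standard choice that the source does not itself name, to compare simulator-derived relation labels with human labels on the same object pairs. The label lists are made-up toy data, not results from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labeled independently
    # according to their own marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy data: one relation label per object pair, same pair order in both lists.
simulator = ["in_front_of", "behind", "in_front_of", "behind"]
human = ["in_front_of", "behind", "behind", "behind"]

kappa = cohens_kappa(simulator, human)
```

A kappa near 1 would support the load-bearing premise. A low value on depth relations specifically would suggest the labeling rules, rather than the models, account for part of the reported gap.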

Figures

Figures reproduced from arXiv: 2607.00020 by Christopher E. Mower, Haitham Bou-Ammar, Hassan Jaber, Luca Cagliero, Refinath S N.

**Figure 2.** Real-world SO101 scene-graph prediction examples. Generated scene graphs are overlaid ...

**Figure 3.** Teleoperation interface used to collect real-robot demonstrations for the bowl-placement ...

**Figure 4.** Additional aggregate F1 results for VLM scene-graph prediction.

**Figure 5.** Additional precision and recall results for VLM scene-graph prediction.

**Figure 6.** Diagnostic metrics for VLM scene-graph prediction.

**Figure 7.** Per-relation F1 for lateral and depth relations.

**Figure 8.** Per-relation F1 for support and containment relations.

Original abstract

Spatial grounding remains a key limitation of vision-language-action (VLA) systems for robotic manipulation. While current models can recognize objects and follow language instructions, they often lack an explicit representation of how objects are arranged in space, including support, containment, ordering, occlusion, and depth-sensitive relations. We introduce EmbodimentSemantic, a spatial scene-graph dataset and benchmark for evaluating relational grounding in embodied manipulation. EmbodimentSemantic represents scenes as directed object-relation-object triplets, where each triplet specifies a spatial relation between an ordered pair of objects using a fixed set of relations. This representation enables direct evaluation of object binding, relation prediction, and spatial consistency. The dataset includes real-world manipulation observations collected with the low-cost SO101 robot arm, together with generated scene graphs for studying spatial grounding in practical robotic settings. To provide controlled validation, we also introduce a simulator-grounded LIBERO benchmark with over 60K manipulation frames and more than 120K camera-specific scene graphs across paired third-person and wrist views, where ground-truth relations are derived automatically from MuJoCo geometry, world coordinates, camera projections, and visibility constraints. We further test whether scene graphs improve downstream control by injecting them into existing VLA policy prompts. Experiments across open-source and commercial VLMs show that current models often predict plausible relations but struggle with exact depth-aware and viewpoint-dependent spatial structure. EmbodimentSemantic provides a unified framework for diagnosing spatial grounding in VLM perception and testing its utility for VLA manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

EmbodimentSemantic supplies a new dataset and benchmark of directed spatial scene graphs on real and simulated manipulation trajectories, but the MuJoCo-derived labels lack any validation against human judgment or task outcomes.

This paper's key offering is a new spatial scene-graph dataset and benchmark called EmbodimentSemantic, built on manipulation trajectories with real robot data and a large simulator set, to test VLMs on depth-aware and viewpoint-dependent relations.

What is new is the use of directed triplets for relations, a collection protocol that combines real SO101 arm observations with MuJoCo-derived graphs on LIBERO, and the focus on camera-specific views across more than 120K graphs. The work does a good job of laying out how ground truth is derived automatically from geometry and projections, and it tests whether scene graphs help downstream VLA policies.

That gives the community a practical tool for measuring spatial grounding beyond loose object descriptions.

The soft spot is the unverified assumption that these MuJoCo-derived labels match the spatial facts relevant to manipulation. There is no mention of human annotation agreement or correlation with task success rates, so the finding that models predict plausible but not exact relations could reflect the specific labeling rules rather than a deeper limitation. The abstract reports that models struggle, but without actual numbers or analysis it is difficult to judge the strength of that evidence.

This is aimed at the embodied robotics and VLM research community. Readers working on improving spatial understanding in robot policies would get direct value from the benchmark and data.

It has enough of a concrete contribution with the dataset and evaluation setup to deserve a serious referee, who could push on the label validation and request more detailed results.

I would recommend sending it for peer review.

Referee Report

1 major / 1 minor

Summary. The paper introduces EmbodimentSemantic, a spatial scene-graph dataset and benchmark for evaluating relational grounding in vision-language models (VLMs) and vision-language-action (VLA) systems on embodied manipulation. It represents scenes as directed object-relation-object triplets using a fixed relation vocabulary, provides real-world trajectories from an SO101 robot arm, and introduces a simulator-grounded LIBERO benchmark with over 60K frames and >120K camera-specific scene graphs whose ground-truth relations are derived automatically from MuJoCo geometry, world coordinates, camera projections, and visibility constraints. Experiments on open-source and commercial VLMs are reported to show that models predict plausible relations but struggle with exact depth-aware and viewpoint-dependent structure; the paper also tests whether injecting scene graphs into VLA policy prompts improves downstream control.

Significance. If the automatically derived ground-truth relations prove reliable proxies for manipulation-relevant spatial facts, the benchmark would offer a concrete, scalable tool for diagnosing spatial grounding failures in VLMs and for measuring whether explicit scene-graph representations aid VLA policies. The dual real/simulated construction and evaluation across multiple model classes are positive features. The absence of any reported validation of the automatic labels against human judgments or task-success correlation, however, leaves the central diagnostic claim unsupported at present.

major comments (1)

[Abstract] Abstract (benchmark construction paragraph): The headline claim that current VLMs 'struggle with exact depth-aware and viewpoint-dependent spatial structure' depends on the validity of the >120K MuJoCo-derived triplets as proxies for the spatial facts that determine manipulation success. No human-annotation agreement study, correlation with task success rates, or ablation of specific relation types is described to confirm that the automatic labels capture intended semantics rather than projection/visibility artifacts.

minor comments (1)

[Abstract] Abstract: The phrasing 'over 60K manipulation frames and more than 120K camera-specific scene graphs across paired third-person and wrist views' leaves the exact counting convention (whether 120K counts both views separately or includes only valid visible triplets) unclear; a short clarifying sentence would help.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and the opportunity to address concerns about the validity of our automatically derived ground-truth relations. We respond to the major comment below.

Referee: [Abstract] Abstract (benchmark construction paragraph): The headline claim that current VLMs 'struggle with exact depth-aware and viewpoint-dependent spatial structure' depends on the validity of the >120K MuJoCo-derived triplets as proxies for the spatial facts that determine manipulation success. No human-annotation agreement study, correlation with task success rates, or ablation of specific relation types is described to confirm that the automatic labels capture intended semantics rather than projection/visibility artifacts.

Authors: The >120K triplets are derived deterministically from MuJoCo's exact 3D geometry, object poses, world coordinates, camera intrinsics/extrinsics, and visibility/occlusion computations. This produces objective, viewpoint-specific relations that directly encode the spatial facts (including depth ordering and occlusion) present in the simulated environment; projection and visibility are not artifacts but explicit components of the intended semantics for testing embodied perception. The benchmark therefore evaluates whether VLMs recover these precise structures rather than approximate human-like judgments. No human agreement study or task-success correlation is reported because the focus is on objective geometric grounding in a controlled simulator setting (with real-world SO101 trajectories providing a separate, manually annotated complement). An ablation across relation types is not included but the aggregate results already demonstrate consistent failures on depth- and viewpoint-sensitive relations across models.

revision: no
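
For readers who want to see what a breakdown by relation type would involve, the sketch below groups exact-match errors by relation and reports one F1 score per relation. The helper name, relation labels, and toy graphs are hypothetical, and the code makes no claim about how the paper computes its per-relation figures.

```python
from collections import defaultdict


def per_relation_f1(predicted, gold):
    """Exact-match F1 for each relation label, from two sets of triplets."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for triplet in predicted:
        relation = triplet[1]
        if triplet in gold:
            tp[relation] += 1
        else:
            fp[relation] += 1
    for triplet in gold:
        if triplet not in predicted:
            fn[triplet[1]] += 1
    scores = {}
    for relation in set(tp) | set(fp) | set(fn):
        # F1 = 2TP / (2TP + FP + FN); the denominator is never zero here
        # because every listed relation has at least one count.
        denom = 2 * tp[relation] + fp[relation] + fn[relation]
        scores[relation] = 2 * tp[relation] / denom
    return scores


# Toy graphs: the model confuses a depth relation but gets support right.
gold_graph = {
    ("bowl", "on", "plate"),
    ("mug", "in_front_of", "plate"),
    ("mug", "behind", "bowl"),
}
predicted_graph = {
    ("bowl", "on", "plate"),
    ("mug", "in_front_of", "plate"),
    ("mug", "in_front_of", "bowl"),
}

scores = per_relation_f1(predicted_graph, gold_graph)
```

In this toy case the support relation scores perfectly while both depth relations lose points. A real breakdown would need to show that kind of pattern for the depth-sensitivity claim to rest on more than aggregate numbers.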

Circularity Check

0 steps flagged

No circularity: dataset/benchmark definition with no equations or derivations

The paper introduces EmbodimentSemantic as a new spatial scene-graph dataset and benchmark. Ground-truth relations are defined by construction from MuJoCo geometry, coordinates, projections, and visibility rules, but this is an explicit labeling procedure for the benchmark itself rather than a claimed derivation of a result from first principles. No equations, fitted parameters, predictions, or self-citation chains appear in the provided text. The experimental claims are direct evaluations of VLMs on this benchmark, with no reduction of outputs to inputs by construction. This matches the default case of a self-contained dataset contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Dataset and benchmark paper; contains no mathematical derivations, fitted parameters, or postulated entities. The fixed relation vocabulary and automatic ground-truth extraction are design choices rather than free parameters or axioms.

pith-pipeline@v0.9.1-grok · 5819 in / 1106 out tokens · 24067 ms · 2026-07-02T22:44:30.593768+00:00 · methodology

[1] [1]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems, volume 30, 2017

2017

[2] [2]

Ichter, A

B. Ichter, A. Brohan, Y . Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Irpan, E. Jang, R. Julian, D. Kalashnikov, S. Levine, Y . Lu, C. Parada, K. Rao, P. Sermanet, A. T. Toshev, V . Vanhoucke, F. Xia, T. Xiao, P. Xu, M. Yan, N. Brown, M. Ahn, O. Cortes, N. Sievers, C. Tan, S. Xu, D. Reyes, J. Rettinghouse, J. Quiambao, P. Pastor, L. Luu,...

[3] [3]

URLhttps://proceedings.mlr.press/v205/ichter23a.html

[4] [4]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, L. Smith, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky.π0: A vision- language-action flow model for general robot control. InProceedings ...

work page doi:10.15607/rss.2025.xxi.010 2025

[5] [5]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, Q. Vuong, V . Vanhoucke, H. Tran, R. Soricut, A. Singh, J. Singh, P. Sermanet, P. R. Sanketi, G. Salazar, M. S. Ryoo, K. Reymann, K. Rao, K. Pertsch, I. Mordatch, H. Michalewski, Y . Lu, S. Levine, L. Lee, T.-W. E. Lee, I. Leal, Y . Kuang, D. Kalashnikov, R. Julia...

2023

[6] [6]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model. InProceedings of The 8th Conference on Robot Learning, volume 270 ofProceedings of Machine Learni...

2025

[7] [7]

Y . Xing, X. Luo, J. Xie, L. Gao, H. T. Shen, and J. Song. Shortcut learning in generalist robot policies: The role of dataset diversity and fragmentation. In J. Lim, S. Song, and H.-W. Park, editors,Proceedings of The 9th Conference on Robot Learning, volume 305 of Proceedings of Machine Learning Research, pages 3239–3266. PMLR, 27–30 Sep 2025. URL https...

2025

[8] [8]

Zhang, S

J. Zhang, S. Wu, X. Luo, H. Wu, L. Gao, H. T. Shen, and J. Song. InSpire: Vision-language- action models with intrinsic spatial reasoning, 2025

2025

[9] [9]

I. Fang, J. Zhang, S. Tong, and C. Feng. From intention to execution: Probing the generalization boundaries of vision-language-action models.arXiv preprint arXiv:2506.09930, 2025. doi: 10.48550/arXiv.2506.09930

work page doi:10.48550/arxiv.2506.09930 2025

[10] [10]

B. Chen, Z. Xu, S. Kirmani, B. Ichter, D. Sadigh, L. Guibas, and F. Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14455–14465, 2024. doi:10.1109/ CVPR52733.2024.01370

[11]

C. H. Song, V. Blukis, J. Tremblay, S. Tyree, Y. Su, and S. Birchfield. RoboSpatial: Teaching spatial understanding to 2D and 3D vision-language models for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15768–15780, 2025

[12]

D. Qu, H. Song, Q. Chen, Y. Yao, X. Ye, J. Gu, Z. Wang, Y. Ding, B. Zhao, D. Wang, and X. Li. SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Models. In Proceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2025. doi:10.15607/RSS.2025.XXI.011

[13]

P. W. Battaglia, D. Kersten, and P. R. Schrater. How haptic size sensations improve distance perception. PLoS Computational Biology, 7(6):e1002080, 2011. doi:10.1371/journal.pcbi.1002080

[14]

Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. In Proceedings of Robotics: Science and Systems, 2018. doi:10.15607/RSS.2018.XIV.019

[15]

J. Tremblay, T. To, B. Sundaralingam, Y. Xiang, D. Fox, and S. Birchfield. Deep object pose estimation for semantic robotic grasping of household objects. In Proceedings of The 2nd Conference on Robot Learning, volume 87 of Proceedings of Machine Learning Research, pages 306–316. PMLR, 2018

[16]

T. Hodan, F. Michel, E. Brachmann, W. Kehl, A. Glent Buch, D. Kraft, B. Drost, J. Vidal, S. Ihrke, X. Zabulis, C. Sahin, F. Manhardt, F. Tombari, T.-K. Kim, J. Matas, and C. Rother. BOP: Benchmark for 6D object pose estimation. In Proceedings of the European Conference on Computer Vision, pages 19–34, 2018

[17]

S. Hutchinson, G. D. Hager, and P. I. Corke. A tutorial on visual servo control. IEEE Transactions on Robotics and Automation, 12(5):651–670, 1996. doi:10.1109/70.538972

[18]

F. Chaumette and S. Hutchinson. Visual servo control. I. Basic approaches. IEEE Robotics & Automation Magazine, 13(4):82–90, 2006. doi:10.1109/MRA.2006.250573

[19]

M. Shridhar, L. Manuelli, and D. Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Proceedings of The 6th Conference on Robot Learning, volume 205 of Proceedings of Machine Learning Research, pages 785–799. PMLR, 2023

[20]

A. Goyal, J. Xu, Y. Guo, V. Blukis, Y.-W. Chao, and D. Fox. RVT: Robotic view transformer for 3D object manipulation. In Proceedings of The 7th Conference on Robot Learning, volume 229 of Proceedings of Machine Learning Research, pages 694–710. PMLR, 2023

[21]

X. Li, M. Zhang, Y. Geng, H. Geng, Y. Long, Y. Shen, R. Zhang, J. Liu, and H. Dong. ManipLLM: Embodied multimodal large language model for object-centric robotic manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18061–18070, 2024

[22]

H. Li, Q. Feng, Z. Zheng, J. Feng, Z. Chen, and A. Knoll. Language-guided object-centric diffusion policy for generalizable and collision-aware manipulation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 12834–12841, 2025. doi:10.1109/ICRA55743.2025.11127231

[23]

C.-C. Hsu, B. Wen, J. Xu, Y. Narang, X. Wang, Y. Zhu, J. Biswas, and S. Birchfield. Spot: Se(3) pose trajectory diffusion for object-centric manipulation. In IEEE International Conference on Robotics and Automation (ICRA), 2025

[24]

X. Li, L. Heng, J. Liu, Y. Shen, C. Gu, Z. Liu, H. Chen, N. Han, R. Zhang, H. Tang, S. Zhang, and H. Dong. 3DS-VLA: A 3D spatial-aware vision language action model for robust multi-task manipulation. In Proceedings of The 9th Conference on Robot Learning, volume 305 of Proceedings of Machine Learning Research, pages 2344–2359. PMLR, 2025.

[25]

R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73, 2017

[26]

K. Yang, O. Russakovsky, and J. Deng. SpatialSense: An adversarially crowdsourced benchmark for spatial relation recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019

[27]

F. Liu, G. Emerson, and N. Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 06 2023. ISSN 2307-387X. doi:10.1162/tacl_a_00566. URL https://doi.org/10.1162/tacl_a_00566

[28]

T. Thrush, R. Jiang, M. Bartolo, A. Singh, A. Williams, D. Kiela, and C. Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5228–5238, 2022. doi:10.1109/CVPR52688.2022.00517

[29]

J. Wang, Y. Ming, Z. Shi, V. Vineet, X. Wang, Y. Li, and N. Joshi. Is a picture worth a thousand words? Delving into spatial reasoning for vision language models, 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/89cc5e613d34f90de90c21e996e60b30-Paper-Conference.pdf

[30]

B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 44776–44791. Curran Associates, Inc., 2023. URL https://proceedings.neur...

[31]

A. Zeng, M. Attarian, B. Ichter, K. M. Choromanski, A. Wong, S. Welker, F. Tombari, A. Purohit, M. S. Ryoo, V. Sindhwani, J. Lee, V. Vanhoucke, and P. Florence. Socratic models: Composing zero-shot multimodal reasoning with language. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=G2Q2Mh3avow

[32]

A.-C. Cheng, H. Yin, Y. Fu, Q. Guo, R. Yang, J. Kautz, X. Wang, and S. Liu. SpatialRGPT: Grounded spatial reasoning in vision-language models. In Advances in Neural Information Processing Systems, 2024

[33]

M. Du, B. Wu, Z. Li, X. Huang, and Z. Wei. EmbSpatial-Bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Volume 2: Short Papers, pages 346–355, 2024

[34]

Z. Feng, Z. Kang, Q. Wang, Z. Du, J. Yan, S. Shubin, C. Yuan, H. Liang, Y. Deng, Q. Li, R. Yang, R. An, L. Zheng, W. Wang, S. Chen, S. Xu, Y. Liang, J. Yang, and B. Guo. Seeing across views: Benchmarking spatial reasoning of vision-language models in robotic scenes. In The Fourteenth International Conference on Learning Representations, 2026. URL https:...

[35]

S. James, Z. Ma, D. R. Arrojo, and A. J. Davison. RLBench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020. doi:10.1109/LRA.2020.2974707

[36]

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, A. Tung, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Gupta, A. Wang, A. Singh, A. Garg, A. Kembhavi, A. Xie, A. Brohan, A. Raffin, A. Sharma, A. Yavary, A. Jain, A. Balakrishna, A. Wahid, B. Burgess-Limerick, B. Kim, B. Schölkopf, 1...

doi:10.1109/icra57147.2024.10611477

[37]

A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, K.-H. Lee, S. Levine, Y. Lu, U. Malla, D. Manjunath, I. Mordatch, O. Nachum, C. Parada, J. Peralta, E. Perez, K. Pertsch, J....

doi:10.15607/rss.2023.xix.025

[38]

D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence. PaLM-E: An embodied multimodal language model. In A. Krause, E. Brunskill, K. Cho, B. Engelhar...

2023

[39]

Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, T.-Y. Lin, G. Wetzstein, M.-Y. Liu, and D. Xiang. CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1702–1713, 2025. doi:10.1109/CVPR52734.2025.00166

[40]

H. Huang, M. Cen, K. Tan, X. Quan, G. Huang, and H. Zhang. GraphCoT-VLA: A 3D spatial-aware reasoning vision-language-action model for robotic manipulation with ambiguous instructions. Proceedings of the AAAI Conference on Artificial Intelligence, 40(22):18324–18332, 2026. doi:10.1609/aaai.v40i22.38896

[41]

X. Zhou, Y. Xu, G. Tie, Y. Chen, G. Zhang, D. Chu, P. Zhou, and L. Sun. LIBERO-PRO: Towards robust and fair evaluation of vision-language-action models beyond memorization, 2026. URL https://arxiv.org/abs/2510.03827

[42]

S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, J. Li, X. He, S. Zhang, Z. Fei, et al. LIBERO-Plus: In-depth robustness analysis of vision-language-action models, 2025

[43]

T. Hanyu, N. Chung, H. Le, T. Nguyen, Y. Ikebe, A. Gunderman, D. N. H. Minh, K. Vo, T. Kieu, K. Yamazaki, C. Rainwater, A. Nguyen, and N. Le. SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation. arXiv e-prints, art. arXiv:2511.06754, Nov

[44]

doi:10.48550/arXiv.2511.06754

[45]

W. Liang, G. Sun, Y. He, J. Dong, S. Dai, I. Laptev, S. Khan, and Y. Cong. PixelVLA: Advancing pixel-level understanding in vision-language-action model. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=7M6ryCABIc

[46]

R. Cadene, S. Alibert, F. Capuano, M. Aractingi, A. Zouitine, P. Kooijmans, J. Choghari, M. Russi, C. Pascal, S. Palma, D. Aubakirova, M. Shukor, J. Moss, A. Soare, Q. Lhoest, Q. Gallouédec, and T. Wolf. LeRobot: An open-source library for end-to-end robot learning. In The Fourteenth International Conference on Learning Representations, 2026. URL https://o...
