GraspMolmo: Generalizable task-oriented grasping via large-scale synthetic data generation. arXiv preprint arXiv:2505.13441

Abhay Deshpande, Yuquan Deng, Arijit Ray, Jordi Salvador, Winson Han, Jiafei Duan, Kuo-Hao Zeng, Yuke Zhu, Ranjay Krishna, Rose Hendrix · 2025 · arXiv 2505.13441

3 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

cs.CV · 2026-05-28 · conditional · novelty 7.0

VLMs exhibit consistent vertical-distance entanglement in embeddings from perspective bias in natural images, producing accuracy gaps that a new synthetic benchmark SpatialTunnel exposes as model-intrinsic.

AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

cs.RO · 2026-06-04 · unverdicted · novelty 6.0

AffordanceVLA proposes a VLA model with affordance-aware modules (Which2Act, Where2Act, How2Act) in a Mixture-of-Transformer trained in three stages to improve robotic manipulation.

Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

cs.RO · 2026-06-09 · unverdicted · novelty 5.0

Embodied-R1.5 is an 8B-parameter embodied foundation model (EFM) that achieves state-of-the-art results on 16 of 24 embodied VLM benchmarks, can be fine-tuned to outperform leading VLAs, and is claimed to generalize zero-shot to real robots.

citing papers explorer


Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

cs.CV · 2026-05-28 · conditional · none · ref 19
