GROW² hierarchically grounds open-world tool affordances by using VLMs for semantic selection of objects and parts followed by geometric localization with vision foundation models.
arXiv preprint arXiv:2310.13065 , year=
5 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 5representative citing papers
Any-ttach shows that rapid end-effector swapping combined with demonstration collection and task planning enables reliable multi-tool skills in long-horizon tasks such as sandwich making.
RoboWits benchmark with 238 tasks shows pre-trained VLAs succeed on seed tasks but fail on mutated ones, highlighting brittleness in reasoning.
ChemBot adds dual-layer memory and future-state asynchronous inference to VLA models, enabling better long-horizon success in chemical lab automation on collaborative robots.
The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.
citing papers explorer
-
Agent AI: Surveying the Horizons of Multimodal Interaction
The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.