Affogato: Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale
Affordance grounding aims to localize where to interact with an object, a fundamental capability for embodied agents. Yet progress is bottlenecked by data: manual annotation is prohibitively expensive and confines existing datasets to a narrow set of predefined object and affordance categories. We introduce Affogato, a framework for open-vocabulary affordance grounding centered on Affogato-750K, a large-scale dataset of 750K 3D affordance heatmaps paired with natural language queries. We build the dataset with a fully automated pipeline that orchestrates foundation models to generate these heatmap-query pairs at scale without human labeling. The dataset covers a significantly more diverse set of categories than any existing dataset. For reliable evaluation, we further provide 5K human-verified test pairs. We also present Espresso-3D and Espresso-2D, simple yet effective models that share a unified architecture across both modalities. Pretraining on Affogato-750K improves both Espresso and prior methods and yields the largest gains on unseen object and affordance categories, showing that the dataset provides broadly transferable supervision across architectures.
Forward citations
Cited by 4 Pith papers
- CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects
  CompassAD benchmark and CompassNet framework for intent-driven affordance prediction on the appropriate object within multi-object 3D point clouds conditioned on natural language intent.
- Affostruction: 3D Affordance Grounding with Generative Reconstruction
  Affostruction reconstructs full 3D object geometry from partial RGBD views and grounds text-based affordances on both visible and unobserved surfaces, reporting large gains over prior methods.
- SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding
  SceneParser introduces hierarchical scene parsing as object-part-affordance chains, a VLM trained with pseudo labels and curriculum learning, and SceneParser-Bench with 1.74M affordance annotations, showing better str...
- AffordVLA: Injecting Affordance Representations into Vision-Language-Action Models via Implicit Feature Alignment
  AffordVLA improves VLA models for robotic manipulation by implicitly injecting affordance perception through feature alignment with a zero-shot teacher, claiming SOTA results in simulation and real-world tests.