Affogato: Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale

Chunghyun Park; Dahyun Kang; Eunha Park; Junha Lee; Minsu Cho

arxiv: 2506.12009 · v2 · pith:7EXM3PXBnew · submitted 2025-06-13 · 💻 cs.CV

Junha Lee, Eunha Park, Chunghyun Park, Dahyun Kang, Minsu Cho

classification 💻 cs.CV

keywords affordance, categories, grounding, object, across, affogato, affogato-750k, automated

Affordance grounding aims to localize where to interact with an object, a fundamental capability for embodied agents. Yet progress is bottlenecked by data: manual annotation is prohibitively expensive and confines existing datasets to a narrow set of predefined object and affordance categories. We introduce Affogato, a framework for open-vocabulary affordance grounding centered on Affogato-750K, a large-scale dataset of 750K 3D affordance heatmaps paired with natural language queries. We build the dataset with a fully automated pipeline that orchestrates foundation models to generate these heatmap-query pairs at scale without human labeling. The dataset covers significantly more diverse categories than any existing dataset. For reliable evaluation, we further provide 5K human-verified test pairs. We also present Espresso-3D and Espresso-2D, simple yet effective models with a unified architecture across both modalities. Pretraining on Affogato-750K improves both Espresso and prior methods and yields the largest gains on unseen object and affordance categories, showing that the dataset provides broadly transferable supervision across architectures.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects
cs.CV 2026-04 unverdicted novelty 7.0

CompassAD benchmark and CompassNet framework for intent-driven affordance prediction on the appropriate object within multi-object 3D point clouds conditioned on natural language intent.
Affostruction: 3D Affordance Grounding with Generative Reconstruction
cs.CV 2026-01 unverdicted novelty 7.0

Affostruction reconstructs full 3D object geometry from partial RGBD views and grounds text-based affordances on both visible and unobserved surfaces, reporting large gains over prior methods.
SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding
cs.CV 2026-05 unverdicted novelty 6.0

SceneParser introduces hierarchical scene parsing as object-part-affordance chains, a VLM trained with pseudo labels and curriculum learning, and SceneParser-Bench with 1.74M affordance annotations, showing better str...
AffordVLA: Injecting Affordance Representations into Vision-Language-Action Models via Implicit Feature Alignment
cs.RO 2026-05 unverdicted novelty 5.0

AffordVLA improves VLA models for robotic manipulation by implicitly injecting affordance perception through feature alignment with a zero-shot teacher, claiming SOTA results in simulation and real-world tests.