Videoafford: Grounding 3d affordance from human-object-interaction videos via multimodal large language model.arXiv preprint arXiv:2602.09638

· 2026 · arXiv 2602.09638

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

representative citing papers

Affordance2Action: Task-Conditioned Scene-level Affordance Grounding for Real-Time Manipulation

cs.RO · 2026-06-02 · unverdicted · novelty 7.0

Affordance2Action introduces A2A-Bench, a manipulation-oriented benchmark for scene-level task-conditioned affordance grounding covering single- and multi-region correspondences, plus an annotation pipeline, and reports gaps in existing segmentation and VLM baselines.

CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects

cs.CV · 2026-04-02 · unverdicted · novelty 7.0

CompassAD benchmark and CompassNet framework for intent-driven affordance prediction on the appropriate object within multi-object 3D point clouds conditioned on natural language intent.

AFUN: Towards an Affordance Foundation Model for Functionality Understanding

cs.RO · 2026-06-01 · unverdicted · novelty 6.0

AFUN predicts task-conditional functional masks and 3D post-contact motion curves from RGB-D and language, trained via a standardized multi-source data pipeline, and reports large gains over baselines on segmentation, contact prediction, and motion tasks.

citing papers explorer

Showing 3 of 3 citing papers.

Affordance2Action: Task-Conditioned Scene-level Affordance Grounding for Real-Time Manipulation cs.RO · 2026-06-02 · unverdicted · none · ref 18
Affordance2Action introduces A2A-Bench, a manipulation-oriented benchmark for scene-level task-conditioned affordance grounding covering single- and multi-region correspondences, plus an annotation pipeline, and reports gaps in existing segmentation and VLM baselines.
CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects cs.CV · 2026-04-02 · unverdicted · none · ref 15
CompassAD benchmark and CompassNet framework for intent-driven affordance prediction on the appropriate object within multi-object 3D point clouds conditioned on natural language intent.
AFUN: Towards an Affordance Foundation Model for Functionality Understanding cs.RO · 2026-06-01 · unverdicted · none · ref 71
AFUN predicts task-conditional functional masks and 3D post-contact motion curves from RGB-D and language, trained via a standardized multi-source data pipeline, and reports large gains over baselines on segmentation, contact prediction, and motion tasks.

Videoafford: Grounding 3d affordance from human-object-interaction videos via multimodal large language model.arXiv preprint arXiv:2602.09638

fields

years

verdicts

representative citing papers

citing papers explorer