VoxAfford: Multi-Scale Voxel-Token Fusion for Open-Vocabulary 3D Affordance Detection
Pith reviewed 2026-05-09 15:24 UTC · model grok-4.3
The pith
Injecting multi-scale voxel features from a frozen 3D VQVAE into MLLM output tokens supplies missing spatial information for accurate open-vocabulary 3D affordance detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VoxAfford bypasses the spatial poverty of autoregressively generated MLLM output tokens by injecting multi-scale geometric features from a frozen pre-trained 3D VQVAE encoder into those tokens after generation. Each token serves as a query that retrieves relevant voxel patterns at its paired scale via cross-attention, with a learned compatibility gate regulating injection strength. The enhanced tokens are aggregated into a spatially-aware affordance prompt through semantic-conditioned attention and propagated together with per-point features to yield the output segmentation mask.
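To make the fusion step concrete, here is a minimal PyTorch-style sketch of one injection step, under assumed shapes and with module names (GatedVoxelTokenFusion, d_model, n_heads) that are ours rather than the authors': each output token cross-attends to the frozen VQVAE features at its paired scale, and a learned compatibility gate scales the retrieved geometry before it is added back to the token.

```python
# Minimal sketch (not the authors' code): one gated voxel-token fusion step.
# Shapes, dimensions, and names are illustrative assumptions.
import torch
import torch.nn as nn

class GatedVoxelTokenFusion(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Compatibility gate: scores how well the token matches the retrieved
        # geometry and squashes the injection strength into (0, 1).
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.GELU(),
                                  nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, token: torch.Tensor, voxel_feats: torch.Tensor) -> torch.Tensor:
        # token:       (B, 1, d)  one affordance output token from the MLLM
        # voxel_feats: (B, V, d)  frozen VQVAE features at this token's paired scale
        retrieved, _ = self.cross_attn(query=token, key=voxel_feats, value=voxel_feats)
        g = self.gate(torch.cat([token, retrieved], dim=-1))  # (B, 1, 1)
        return token + g * retrieved                          # gated residual injection
```

Applied once per output token at its corresponding scale, this keeps the language-model semantics in the residual path while the gate decides how much geometry to admit.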
What carries the argument
Multi-scale voxel-token fusion, in which each affordance output token acts as a cross-attention query to retrieve geometric patterns from its corresponding voxel scale of a frozen 3D VQVAE, regulated by a learned compatibility gate.
If this is right
- The method reaches state-of-the-art results on open-vocabulary 3D affordance benchmarks, with an approximately 8% improvement in mIoU.
- Real-robot trials confirm zero-shot transfer of the learned masks to previously unseen objects.
- The compatibility gate permits selective injection so that semantic content from the language model is preserved while geometric detail is added.
- The same enhanced tokens can be aggregated via semantic-conditioned attention before being decoded with per-point features (see the sketch after this list).
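The aggregation and decoding step mentioned in the last bullet can be sketched as follows. This is our illustration under assumed shapes; names such as PromptAggregatorAndDecoder, semantic_emb, and point_feats are invented for the example rather than taken from the paper.

```python
# Minimal sketch (illustrative, not the authors' implementation): aggregate the
# enhanced tokens into one affordance prompt conditioned on a semantic embedding,
# then score per-point features against the prompt to form a mask.
import torch
import torch.nn as nn

class PromptAggregatorAndDecoder(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.to_score = nn.Linear(d_model, 1)      # semantic-conditioned attention weights
        self.point_proj = nn.Linear(d_model, d_model)

    def forward(self, enhanced_tokens, semantic_emb, point_feats):
        # enhanced_tokens: (B, T, d)  tokens after voxel injection
        # semantic_emb:    (B, d)     pooled affordance-text embedding
        # point_feats:     (B, N, d)  per-point features from the point encoder
        cond = enhanced_tokens * semantic_emb.unsqueeze(1)        # condition on semantics
        w = torch.softmax(self.to_score(cond), dim=1)             # (B, T, 1)
        prompt = (w * enhanced_tokens).sum(dim=1)                 # (B, d) spatially-aware prompt
        logits = (self.point_proj(point_feats) * prompt.unsqueeze(1)).sum(-1)  # (B, N)
        return torch.sigmoid(logits)                              # per-point affordance mask
```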
Where Pith is reading between the lines
- The same post-generation cross-attention pattern could be tested on other 3D vision-language tasks where autoregressive models must produce spatially precise outputs.
- Replacing the frozen VQVAE with an end-to-end trainable encoder might reveal whether the current performance ceiling is limited by the frozen representation.
- Extending the multi-scale injection to temporal sequences of point clouds could support affordance detection in dynamic scenes.
Load-bearing premise
That the autoregressive generation of MLLM output tokens leaves them semantically rich but spatially impoverished, and that cross-attention to multi-scale features from a frozen 3D VQVAE can reliably supply the missing spatial neighborhood relations needed for 3D localization.
What would settle it
An ablation that removes or randomizes the voxel-feature injection while keeping all other components identical: comparable mIoU would show that the claimed spatial enrichment is not required for the reported gains, whereas a substantial drop would confirm that the injection carries them.
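For concreteness, a minimal sketch of the metric such an ablation would report: standard mean IoU over thresholded per-point masks, computed identically for the full model and the injection-removed variant. The 0.5 threshold and tensor shapes are assumptions for illustration.

```python
# Mean IoU for binary per-point affordance masks (illustrative sketch).
import torch

def mean_iou(pred_probs: torch.Tensor, gt_masks: torch.Tensor, thresh: float = 0.5) -> float:
    # pred_probs, gt_masks: (num_samples, num_points), gt values in {0, 1}
    pred = (pred_probs > thresh).float()
    inter = (pred * gt_masks).sum(dim=1)
    union = ((pred + gt_masks) > 0).float().sum(dim=1).clamp(min=1.0)
    return (inter / union).mean().item()

# The ablation would compare mean_iou(full_model_probs, gt) against
# mean_iou(no_injection_probs, gt): comparable values would undercut the claim,
# a clear gap would support it.
```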
original abstract
Open-vocabulary 3D affordance detection requires localizing interaction regions on point clouds given novel affordance descriptions. Recent methods extend multimodal large language models (MLLMs) with special output tokens that are decoded into segmentation masks. However, these tokens are produced through autoregressive generation, which models sequential dependencies rather than spatial neighborhood relations, leaving them semantically rich but spatially impoverished for 3D localization. We propose Voxel-enhanced Affordance detection (VoxAfford), which bypasses this bottleneck by injecting multi-scale geometric features from a frozen pre-trained 3D VQVAE encoder into the output tokens after generation. Each output token uses its affordance semantics as a query to retrieve relevant geometric patterns from its paired voxel scale via cross-attention, with a learned compatibility gate controlling the injection strength. The enhanced tokens are then aggregated into a spatially-aware affordance prompt through semantic-conditioned attention and propagated alongside per-point features to generate the final mask. Experiments on open-vocabulary affordance detection tasks show that VoxAfford achieves state-of-the-art performance with approximately an 8% improvement in mIoU, and real robot experiments confirm zero-shot transfer to novel objects.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VoxAfford for open-vocabulary 3D affordance detection on point clouds. It posits that autoregressive MLLM output tokens are semantically rich but spatially impoverished, and addresses this by injecting multi-scale geometric features from a frozen pre-trained 3D VQVAE encoder into the tokens post-generation. Each token uses its affordance semantics as a query in cross-attention to retrieve patterns from the corresponding voxel scale, modulated by a learned compatibility gate; the enhanced tokens are then aggregated via semantic-conditioned attention and propagated with per-point features to produce the final mask. The authors claim state-of-the-art performance with an approximately 8% mIoU gain and successful zero-shot transfer to novel objects in real-robot experiments.
Significance. If the experimental claims hold under rigorous validation, the approach demonstrates a viable route to augmenting the spatial deficiencies of autoregressive MLLM tokens with structured 3D geometric priors while keeping the VQVAE frozen, which supports efficiency. The zero-shot robot transfer provides a concrete test of open-vocabulary generalization beyond standard benchmarks.
major comments (1)
- The central claim of an 8% mIoU improvement and zero-shot transfer rests on the cross-attention successfully retrieving geometrically relevant patterns from the independently pretrained, frozen VQVAE latent space. Because the VQVAE and MLLM are trained separately, the paper must supply evidence (e.g., feature-space alignment metrics, t-SNE visualizations, or an ablation removing the VQVAE injection) that the affordance-semantic queries can linearly access affordance-discriminative geometry; without this, the gains could be driven by the base MLLM rather than the proposed voxel-token fusion. This issue is load-bearing for the method's novelty and is not addressed by the current description of the compatibility gate.
minor comments (2)
- Abstract: the phrase 'approximately an 8% improvement in mIoU' should specify the exact baseline method and dataset split for immediate context.
- Notation: ensure the compatibility gate is consistently denoted (e.g., as a scalar or vector) and its training objective is stated explicitly in the method section.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We agree that the central claim requires stronger empirical support for the cross-attention mechanism's ability to align affordance-semantic queries with geometrically relevant patterns from the frozen VQVAE. We will revise the manuscript accordingly to address this load-bearing issue.
point-by-point responses
Referee: The central claim of an 8% mIoU improvement and zero-shot transfer rests on the cross-attention successfully retrieving geometrically relevant patterns from the independently pretrained, frozen VQVAE latent space. Because the VQVAE and MLLM are trained separately, the paper must supply evidence (e.g., feature-space alignment metrics, t-SNE visualizations, or an ablation removing the VQVAE injection) that the affordance-semantic queries can linearly access affordance-discriminative geometry; without this, the gains could be driven by the base MLLM rather than the proposed voxel-token fusion. This issue is load-bearing for the method's novelty and is not addressed by the current description of the compatibility gate.
Authors: We acknowledge the validity of this concern and agree that the current manuscript does not provide sufficient direct evidence of feature-space alignment or the specific contribution of the voxel injection. The compatibility gate modulates injection strength but does not itself demonstrate that the queries retrieve affordance-discriminative geometry. In the revised manuscript we will add: (1) a dedicated ablation that removes the entire VQVAE cross-attention injection (replacing it with a simple learned bias or no injection), to quantify the mIoU change relative to the full model and thereby isolate the fusion's contribution beyond the base MLLM; (2) t-SNE visualizations of the MLLM output token queries alongside the multi-scale VQVAE features, with affordance labels overlaid to illustrate semantic-geometric clustering; and (3) quantitative alignment metrics (e.g., mean cosine similarity between matched query-voxel pairs for positive vs. negative affordance cases, plus retrieval precision at k). These additions will directly substantiate that the cross-attention accesses relevant geometric patterns and that the reported gains and zero-shot transfer are attributable to the proposed mechanism rather than the underlying MLLM alone. We will also expand the method section to clarify the role of the compatibility gate in this alignment process.
Revision: yes
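As a reading aid, here is a minimal sketch of the alignment diagnostics the rebuttal promises: mean cosine similarity for matched versus mismatched query-voxel pairs, and retrieval precision@k over voxel cells overlapping the ground-truth affordance region. Function names, shapes, and the relevance criterion are our assumptions, not the authors' protocol.

```python
# Illustrative alignment diagnostics (our sketch, not the authors' protocol).
import torch
import torch.nn.functional as F

def matched_vs_mismatched_cosine(queries, voxels, mismatch_idx):
    # queries: (M, d) token queries; voxels: (M, d) their paired voxel features
    # mismatch_idx: a permutation giving mismatched pairings for the negative case
    pos = F.cosine_similarity(queries, voxels, dim=-1).mean()
    neg = F.cosine_similarity(queries, voxels[mismatch_idx], dim=-1).mean()
    return pos.item(), neg.item()

def precision_at_k(query, voxel_feats, relevant_mask, k: int = 10) -> float:
    # query: (d,); voxel_feats: (V, d); relevant_mask: (V,) bool, True if the
    # voxel cell overlaps the ground-truth affordance region
    sims = F.cosine_similarity(query.unsqueeze(0), voxel_feats, dim=-1)  # (V,)
    topk = sims.topk(k).indices
    return relevant_mask[topk].float().mean().item()
```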
Circularity Check
No circularity detected; the derivation relies on independently pre-trained components and standard attention.
full rationale
The paper describes a fusion method using cross-attention between autoregressive MLLM tokens and multi-scale features from a frozen, independently pre-trained 3D VQVAE, with a learned compatibility gate and semantic-conditioned attention for mask generation. No equation or step reduces by construction to a fitted parameter renamed as prediction, nor does any load-bearing claim rest on self-citation chains or ansatzes smuggled from prior author work. The central performance claims are empirical (mIoU gains and robot transfer) evaluated against external benchmarks, with the architecture drawing on standard, externally validated components rather than tautological redefinitions of its own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- compatibility gate
axioms (1)
- Domain assumption: a frozen pre-trained 3D VQVAE encoder supplies useful multi-scale geometric patterns relevant to affordance localization.