Grounded 3D-LLM with Referent Tokens
7 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CV (7) · Years: 2026 (7) · Verdicts: UNVERDICTED (7)

Representative citing papers
- VoxAfford: Multi-Scale Voxel-Token Fusion for Open-Vocabulary 3D Affordance Detection
VoxAfford fuses multi-scale voxel features into MLLM output tokens via cross-attention with a learned compatibility gate, achieving state-of-the-art open-vocabulary 3D affordance detection (~8% mIoU gain) and zero-shot robot transfer.
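The summary gives enough of the mechanism to sketch. A minimal, hypothetical PyTorch version of gated cross-attention fusion; the class name, dimensions, and gate form (GatedVoxelFusion, d_model, d_voxel) are assumptions, not the paper's actual design:

```python
import torch
import torch.nn as nn

class GatedVoxelFusion(nn.Module):
    """Hypothetical sketch: fuse voxel features into MLLM token states
    via cross-attention, scaled by a learned compatibility gate."""
    def __init__(self, d_model=1024, d_voxel=256, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(d_voxel, d_model)  # lift voxel features to token width
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * d_model, 1), nn.Sigmoid())

    def forward(self, tokens, voxel_feats):
        # tokens:      (B, T, d_model)  MLLM output token states
        # voxel_feats: (B, V, d_voxel)  multi-scale voxel features, flattened
        kv = self.proj(voxel_feats)
        attended, _ = self.attn(tokens, kv, kv)  # tokens attend to voxels
        g = self.gate(torch.cat([tokens, attended], dim=-1))  # per-token gate in [0, 1]
        return tokens + g * attended  # gated residual fusion

fused = GatedVoxelFusion()(torch.randn(2, 16, 1024), torch.randn(2, 512, 256))
print(fused.shape)  # torch.Size([2, 16, 1024])
```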
- Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs
GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.
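A minimal sketch of what gated, layer-wise prior injection could look like, assuming one injector per early layer; the module names, dimensions, and the coarse-to-fine schedule are guesses rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class GeometricPriorInjector(nn.Module):
    """Hypothetical sketch: add a gated geometric prior to the hidden
    states of one early transformer layer (one injector per layer/scale)."""
    def __init__(self, d_model=1024, d_prior=128):
        super().__init__()
        self.proj = nn.Linear(d_prior, d_model)
        self.gate = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, hidden, prior):
        # hidden: (B, T, d_model) layer hidden states
        # prior:  (B, T, d_prior) geometric prior at this layer's granularity
        return hidden + self.gate(hidden) * self.proj(prior)

# Unrolled over the first K layers, one prior granularity per layer
# (assumed ordering; the paper's actual schedule may differ).
injectors = nn.ModuleList(GeometricPriorInjector() for _ in range(4))
h = torch.randn(2, 16, 1024)
for inj in injectors:
    h = inj(h, torch.randn(2, 16, 128))
```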
- EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs
EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench, and SPBench with only 5K SFT and 20K RL samples.
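EgoMind's method is linguistic rather than architectural, so a sketch here is just a prompt scaffold. The wording of both stages below is illustrative; the actual templates are not given in the summary:

```python
# Hypothetical prompt scaffold for the two linguistic stages.
ROLE_PLAY_CAPTION = (
    "You are the camera wearer. Describe the scene from your own viewpoint, "
    "stating where each object is relative to you (left/right, near/far)."
)
PROGRESSIVE_SPATIAL_ANALYSIS = (
    "Step 1: List the objects you see. "
    "Step 2: Give pairwise spatial relations between them. "
    "Step 3: Answer the question using those relations.\nQuestion: {question}"
)

prompt = ROLE_PLAY_CAPTION + "\n" + PROGRESSIVE_SPATIAL_ANALYSIS.format(
    question="Which object is closest to the door?")
print(prompt)
```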
- Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five benchmarks using pre-trained encoders.
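A toy sketch of serializing a scene as an identifier-token object sequence; the <OBJxxx> token format and the object schema are illustrative assumptions, not Chat-Scene++'s actual representation:

```python
# Hypothetical sketch: serialize a 3D scene as identifier tokens plus
# per-object context produced by pre-trained encoders (assumed schema).
def scene_to_sequence(objects):
    parts = []
    for i, obj in enumerate(objects):
        parts.append(f"<OBJ{i:03d}> {obj['label']}, {obj['attrs']}")
    return " ".join(parts)

scene = [
    {"label": "sofa",  "attrs": "grey, against the wall"},
    {"label": "table", "attrs": "wooden, center of room"},
]
print(scene_to_sequence(scene))
# <OBJ000> sofa, grey, against the wall <OBJ001> table, wooden, center of room
```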
- Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs
Efficient3D prunes visual tokens in 3D MLLMs via DVTIE and ATR modules, reporting better performance than unpruned baselines on Scan2Cap and other benchmarks.
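The summary names the DVTIE and ATR modules without detail, so the sketch below shows only generic importance-based token reduction of the kind such modules perform; the scoring signal and keep ratio are placeholders:

```python
import torch

def prune_tokens(tokens, importance, keep_ratio=0.5):
    """Hypothetical sketch of importance-based visual token reduction."""
    # tokens:     (B, N, D) visual tokens
    # importance: (B, N)    e.g., attention received from text tokens
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = importance.topk(k, dim=1).indices        # top-k per sample
    idx = idx.sort(dim=1).values                   # restore original order
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))

pruned = prune_tokens(torch.randn(2, 256, 1024), torch.rand(2, 256))
print(pruned.shape)  # torch.Size([2, 128, 1024])
```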
- 3D-IDE: 3D Implicit Depth Emergent
3D awareness emerges implicitly in MLLMs via self-supervised geometric constraints that create an information bottleneck, removing depth and pose dependencies at inference and cutting latency by 55%.
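A heavily hedged sketch of one way a training-only bottleneck could be set up: a low-dimensional head enforces geometric consistency during training and is dropped at inference, so depth/pose inputs are no longer needed. The loss form and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class BottleneckHead(nn.Module):
    """Hypothetical training-only head: squeeze features through a low-dim
    bottleneck and enforce a self-supervised geometric consistency loss.
    Dropped at inference, so no depth or pose input is required then."""
    def __init__(self, d_model=1024, d_bottleneck=32):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)

    def forward(self, feats_a, feats_b):
        # feats_a/feats_b: (B, T, d_model) features of two geometric views
        za, zb = self.down(feats_a), self.down(feats_b)
        return (za - zb).pow(2).mean()  # consistency through the bottleneck

loss = BottleneckHead()(torch.randn(2, 16, 1024), torch.randn(2, 16, 1024))
```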
- Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment
Geometric Reward Credit Assignment assigns disentangled rewards to geometric tokens and adds a reprojection-consistency objective, boosting 3D keypoint accuracy from 0.64 to 0.93 and bounding-box IoU to 0.686 on a ShapeNetCore benchmark while preserving 2D performance.
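A sketch of the two ingredients as they might look in code: routing reward to geometric output tokens only, and a reprojection-consistency reward. The token mask, intrinsics handling (identity pose), and reward shaping are assumptions:

```python
import torch

def geometric_token_rewards(rewards, is_geometric):
    """Hypothetical sketch: route task reward only to geometric tokens
    (e.g., coordinate tokens) instead of smearing it over the sequence."""
    # rewards:      (B,)    scalar reward per sampled response
    # is_geometric: (B, T)  bool mask marking geometric output tokens
    per_token = rewards[:, None] * is_geometric.float()
    return per_token / is_geometric.float().sum(dim=1, keepdim=True).clamp(min=1)

def reprojection_consistency(points_3d, points_2d, K):
    # points_3d: (N, 3) predicted keypoints; points_2d: (N, 2) observed pixels
    # K: (3, 3) camera intrinsics; identity pose assumed for brevity
    proj = (K @ points_3d.T).T
    proj = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    return -(proj - points_2d).norm(dim=1).mean()  # higher is better

r = geometric_token_rewards(torch.tensor([1.0, 0.5]), torch.rand(2, 8) > 0.5)
```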