VoxAfford fuses multi-scale voxel features into MLLM output tokens via cross-attention with a learned compatibility gate, achieving state-of-the-art open-vocabulary 3D affordance detection with a ~8% mIoU gain and zero-shot robot transfer.
hub
Grounded 3D-LLM with Referent Tokens
16 Pith papers cite this work.
representative citing papers
SpatialMosaic introduces a 2M-pair multi-view QA dataset and 1M-pair benchmark for MLLMs on spatial reasoning under partial visibility, plus a hybrid baseline that integrates 3D reconstruction models as geometry encoders.
PAR3D is a part-aware 3D-MLLM framework with ScenePart dataset, Part-Aware 3D Representation Learning, and Hierarchical Segmentation Query Generation to improve part-level 3D scene understanding.
KeyVT improves zero-shot 3D question answering by hierarchically selecting semantically and geometrically relevant views and using optimal transport to extract representative tokens from them.
APEIRIA distills neuro-symbolic 3D reasoning programs into 3D MLLMs through a curriculum that transfers stepwise verification patterns to achieve transparent yet flexible spatial reasoning.
SSR3D-LLM improves fine-grained 3D grounding in unified 3D-LLMs by generating and scoring sequences of latent spatial reasoning steps from the query using fixed Mask3D proposals.
GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.
Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five benchmarks using pre-trained encoders.
Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at higher speed.
DEGround presents a unified homogeneous framework for 3D visual grounding with shared queries and two plug-in modules for better instruction alignment, reporting a 7.52% improvement on the EmbodiedScan benchmark.
Spatial-MLLM adds a 3D spatial encoder initialized from a visual geometry model and space-aware frame sampling to MLLMs to improve spatial understanding and reasoning from purely 2D visual inputs.
Efficient3D prunes visual tokens in 3D MLLMs via DVTIE and ATR modules, reporting better performance than unpruned baselines on Scan2Cap and other benchmarks.
3D awareness emerges implicitly in MLLMs via self-supervised geometric constraints that create an information bottleneck, removing depth and pose dependencies at inference and cutting latency by 55%.
Geometric Reward Credit Assignment disentangles rewards to geometric tokens and adds reprojection consistency to boost 3D keypoint accuracy from 0.64 to 0.93 and bounding box IoU to 0.686 on a ShapeNetCore benchmark while preserving 2D performance.
The paper "Open-Architecture End-to-End System for Real-World Autonomous Robot Navigation" presents an open ROS2-based end-to-end navigation system for quadruped robots, achieving over 88% success in zero-shot real-world indoor navigation tasks using semantic scene graphs and LLM planning.