Cubify Anything: Scaling Indoor 3D Object Detection

Afshin Dehghan; David Griffiths; Francisco Crespo; Gefen Kohavi; Justin Lazarow

arxiv: 2412.04458 · v1 · pith:4PMZOKJYnew · submitted 2024-12-05 · 💻 cs.CV

Cubify Anything: Scaling Indoor 3D Object Detection

Justin Lazarow , David Griffiths , Gefen Kohavi , Francisco Crespo , Afshin Dehghan This is my paper

classification 💻 cs.CV

keywords ca-1mcubifycutrdetectionobjectobjectswhileanything

0 comments

read the original abstract

We consider indoor 3D object detection with respect to a single RGB(-D) frame acquired from a commodity handheld device. We seek to significantly advance the status quo with respect to both data and modeling. First, we establish that existing datasets have significant limitations to scale, accuracy, and diversity of objects. As a result, we introduce the Cubify-Anything 1M (CA-1M) dataset, which exhaustively labels over 400K 3D objects on over 1K highly accurate laser-scanned scenes with near-perfect registration to over 3.5K handheld, egocentric captures. Next, we establish Cubify Transformer (CuTR), a fully Transformer 3D object detection baseline which rather than operating in 3D on point or voxel-based representations, predicts 3D boxes directly from 2D features derived from RGB(-D) inputs. While this approach lacks any 3D inductive biases, we show that paired with CA-1M, CuTR outperforms point-based methods - accurately recalling over 62% of objects in 3D, and is significantly more capable at handling noise and uncertainty present in commodity LiDAR-derived depth maps while also providing promising RGB only performance without architecture changes. Furthermore, by pre-training on CA-1M, CuTR can outperform point-based methods on a more diverse variant of SUN RGB-D - supporting the notion that while inductive biases in 3D are useful at the smaller sizes of existing datasets, they fail to scale to the data-rich regime of CA-1M. Overall, this dataset and baseline model provide strong evidence that we are moving towards models which can effectively Cubify Anything.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

WildDet3D: Scaling Promptable 3D Detection in the Wild
cs.CV 2026-04 unverdicted novelty 7.0

WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.
Reinforcing Dual-Path Reasoning in Spatial Vision Language Models
cs.CV 2026-06 unverdicted novelty 6.0

SR-REAL equips spatial VLMs with dual LOR and DTR reasoning paths trained via RL, achieving better benchmark performance through mutual reinforcement and generalization without per-task tuning.
Boxer: Robust Lifting of Open-World 2D Bounding Boxes to 3D
cs.CV 2026-04 unverdicted novelty 6.0

BoxerNet lifts 2D bounding boxes to metric 3D boxes via transformer regression with aleatoric uncertainty and median depth encoding, then fuses multi-view results to outperform CuTR by large margins on open-world benchmarks.