arxiv: 2412.04458 · v1 · pith:4PMZOKJYnew · submitted 2024-12-05 · 💻 cs.CV

Cubify Anything: Scaling Indoor 3D Object Detection

classification 💻 cs.CV
keywords ca-1m · cubify · cutr · detection · object · objects · while · anything

We consider indoor 3D object detection with respect to a single RGB(-D) frame acquired from a commodity handheld device. We seek to significantly advance the status quo with respect to both data and modeling. First, we establish that existing datasets have significant limitations in scale, accuracy, and diversity of objects. As a result, we introduce the Cubify-Anything 1M (CA-1M) dataset, which exhaustively labels over 400K 3D objects on over 1K highly accurate laser-scanned scenes with near-perfect registration to over 3.5K handheld, egocentric captures. Next, we establish Cubify Transformer (CuTR), a fully Transformer-based 3D object detection baseline which, rather than operating in 3D on point- or voxel-based representations, predicts 3D boxes directly from 2D features derived from RGB(-D) inputs. While this approach lacks any 3D inductive biases, we show that paired with CA-1M, CuTR outperforms point-based methods - accurately recalling over 62% of objects in 3D, and is significantly more capable at handling noise and uncertainty present in commodity LiDAR-derived depth maps while also providing promising RGB-only performance without architecture changes. Furthermore, by pre-training on CA-1M, CuTR can outperform point-based methods on a more diverse variant of SUN RGB-D - supporting the notion that while inductive biases in 3D are useful at the smaller sizes of existing datasets, they fail to scale to the data-rich regime of CA-1M. Overall, this dataset and baseline model provide strong evidence that we are moving towards models which can effectively Cubify Anything.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WildDet3D: Scaling Promptable 3D Detection in the Wild

    cs.CV 2026-04 unverdicted novelty 7.0

    WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.

  2. Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

    cs.CV 2026-06 unverdicted novelty 6.0

    SR-REAL equips spatial VLMs with dual LOR and DTR reasoning paths trained via RL, achieving better benchmark performance through mutual reinforcement and generalization without per-task tuning.

  3. Boxer: Robust Lifting of Open-World 2D Bounding Boxes to 3D

    cs.CV 2026-04 unverdicted novelty 6.0

    BoxerNet lifts 2D bounding boxes to metric 3D boxes via transformer regression with aleatoric uncertainty and median depth encoding, then fuses multi-view results to outperform CuTR by large margins on open-world benchmarks.