Streaming Dense Voxel Representations for 3D Occupancy Prediction
In this paper, we explore dense voxel streaming for accurate and efficient 3D occupancy prediction. While dense voxel representations offer fine-grained spatial detail and the streaming paradigm enables efficient temporal processing, naively combining the two introduces two key challenges: (i) warping-induced distortions caused by the interpolation used for temporal alignment, and (ii) degraded dynamic-object representations due to motion misalignment and detail loss in image-to-voxel projection. To address these, we propose StreamOcc, a novel framework that uses two aggregation strategies. Specifically, it first refines propagated voxel features to reduce warping artifacts before temporal accumulation, and then selectively injects instance-level query features encoding dynamic-object semantics into the corresponding occupied voxel regions, preserving temporally consistent modeling while strengthening dynamic-object representations. By unlocking effective dense voxel streaming, StreamOcc achieves state-of-the-art performance on the SurroundOcc benchmark and Occ3D-nuScenes under real-time constraints, outperforming the prior best methods by +1.3/+2.5 and +1.5/+2.0 mIoU (overall/dynamic-object), respectively, while running at 83.3 ms per frame with only 2.8 GB of memory. The project page is available at https://moonseokha.github.io/StreamOcc/.
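The two steps named in the abstract can be illustrated with a toy NumPy sketch. It is not the authors' implementation: the function names (warp_x, inject_instances, stream_step), the translation-only warp along one axis, the plain averaging used for temporal accumulation, and the instance masks are all assumptions made for illustration, and the learned refinement module is left as an identity placeholder. The sketch shows why interpolation-based alignment smears a sharp feature across neighbouring voxels, and what injecting a per-instance feature only into occupied voxels could look like.

```python
import numpy as np

def warp_x(prev, dx):
    """Shift a (C, X, Y, Z) voxel grid by dx voxels along X using linear
    interpolation (toy stand-in for ego-motion alignment; zero padding)."""
    n_x = prev.shape[1]
    src = np.arange(n_x) + dx          # source coordinate for each target voxel
    x0 = np.floor(src).astype(int)     # left neighbour index
    w = src - x0                       # interpolation weight of right neighbour
    out = np.zeros_like(prev)
    for i in range(n_x):
        a, b = x0[i], x0[i] + 1
        if 0 <= a < n_x:
            out[:, i] += (1.0 - w[i]) * prev[:, a]
        if 0 <= b < n_x:
            out[:, i] += w[i] * prev[:, b]
    return out

def inject_instances(feat, occupied, inst_masks, inst_feats):
    """Add each instance feature (C,) to voxels that lie inside that
    instance's mask AND are marked occupied. Masks are hypothetical inputs."""
    out = feat.copy()
    for mask, q in zip(inst_masks, inst_feats):
        sel = mask & occupied          # boolean (X, Y, Z) selection
        out[:, sel] += q[:, None]      # broadcast the feature over selected voxels
    return out

def stream_step(prev, cur, dx, occupied, inst_masks, inst_feats):
    """One streaming update: warp, refine (placeholder), accumulate, inject."""
    warped = warp_x(prev, dx)
    refined = warped                   # assumption: identity in place of a learned refiner
    fused = 0.5 * (refined + cur)      # assumption: simple averaging as accumulation
    return inject_instances(fused, occupied, inst_masks, inst_feats)

if __name__ == "__main__":
    prev = np.zeros((2, 4, 2, 2))
    prev[:, 2] = 1.0                   # one sharp slab of features at x = 2
    # A half-voxel shift spreads the slab over x = 1 and x = 2 at half strength.
    print(warp_x(prev, 0.5)[0, :, 0, 0])
```

With a half-voxel shift, the single slab of ones is split evenly between two neighbouring slabs, which is the kind of interpolation blur that accumulates when warped features are propagated frame after frame without refinement.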