VGGT-Occ: Geometry-Grounded and Density-Aware Gated Fusion for 3D Occupancy Prediction

arxiv: 2605.16911 · v1 · pith:3BCWRAFN · submitted 2026-05-16 · 💻 cs.CV

Xun Chen, Tianchen Deng, Rui Wang, Fangjinhua Wang, Junyi Ma, Hongming Shen, Hesheng Wang, Danwei Wang

Pith reviewed 2026-05-19 21:15 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D semantic occupancy prediction · geometric tokens · projection-aware deformable attention · gated fusion · cross-view consistency · nuScenes · coarse-to-fine decoder

The pith

Embedding camera geometry into every attention and fusion step produces more accurate 3D semantic occupancy from multi-view images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that current 3D occupancy methods stop using camera geometry after the first projection, leaving later steps like offset learning and feature aggregation without physical constraints. It introduces a way to carry geometric information forward by projecting 3D offsets back to image planes and adding the Jacobian of that projection as a bias term in attention. A view-quality gate then combines features across cameras while a coarse-to-fine decoder allocates work according to feature density. If these changes work, the result is higher accuracy on standard benchmarks together with lower decoder cost and fewer parameters in the occupancy head.

Core claim

VGGT-Occ embeds geometric tokens throughout the pipeline by means of Projection-Aware Deformable Attention that projects 3D offsets to image planes and uses the projection Jacobian as an additive bias, followed by a view-quality semantic gate and sequential coarse-to-fine gated fusion that refines low-resolution features while respecting information density.

What carries the argument

Projection-Aware Deformable Attention (PA-DA), which projects learned 3D offsets back to image planes and adds the projection Jacobian as a bias to suppress unreliable observations during attention.
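The projection step named here can be illustrated with a minimal sketch, assuming a plain pinhole camera and points already expressed in the camera frame (extrinsics are omitted). The function names and the intrinsics below are hypothetical and are not taken from the paper. The sketch computes the 2×3 projection Jacobian and uses it to map a small 3D offset to a first-order pixel displacement.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point (X, Y, Z) to pixels (u, v)."""
    X, Y, Z = point
    return (fx * X / Z + cx, fy * Y / Z + cy)


def projection_jacobian(point, fx, fy):
    """2x3 Jacobian d(u, v) / d(X, Y, Z) of the pinhole projection."""
    X, Y, Z = point
    return [
        [fx / Z, 0.0, -fx * X / Z**2],
        [0.0, fy / Z, -fy * Y / Z**2],
    ]


def project_offset(point, offset, fx, fy):
    """First-order image-plane displacement caused by a small 3D offset."""
    J = projection_jacobian(point, fx, fy)
    return tuple(sum(J[r][c] * offset[c] for c in range(3)) for r in range(2))


# Hypothetical intrinsics and a point 10 m in front of the camera.
fx = fy = 1000.0
cx, cy = 800.0, 450.0
p = (2.0, -1.0, 10.0)
d = (0.05, 0.02, -0.1)  # small 3D offset in metres (illustrative)

du, dv = project_offset(p, d, fx, fy)
u0, v0 = project(p, fx, fy, cx, cy)
u1, v1 = project(tuple(a + b for a, b in zip(p, d)), fx, fy, cx, cy)

print("linearized pixel offset:", (du, dv))
print("exact pixel offset:     ", (u1 - u0, v1 - v0))
```

For small offsets the linearized and exact displacements agree closely, which is the property that makes the Jacobian a usable local description of how 3D motion appears in each image.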

If this is right

The occupancy head uses only about 41 million trainable parameters while reaching 33.00 percent IoU and 21.08 percent mIoU on SurroundOcc-nuScenes with one frame.
Two-frame inference raises the scores to 33.64 percent IoU and 21.43 percent mIoU.
Low-resolution features are refined into higher resolutions only where information density justifies the cost, lowering overall decoder computation.
Cross-view consistency is enforced by the view-quality semantic gate before final fusion.
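The two metrics quoted in this list can be sketched as follows. This is a toy illustration under an assumed convention: IoU is class-agnostic (occupied versus empty) and mIoU averages per-class IoU over the semantic classes that occur. The exact evaluation protocol of the benchmark (class list, handling of absent classes, ignore labels) is not described on this page, so those details are assumptions.

```python
def occupancy_iou(pred, gt, empty=0):
    """Geometric IoU: occupied vs. empty voxels, ignoring semantic class."""
    inter = sum(1 for p, g in zip(pred, gt) if p != empty and g != empty)
    union = sum(1 for p, g in zip(pred, gt) if p != empty or g != empty)
    return inter / union if union else float("nan")


def semantic_miou(pred, gt, classes):
    """Mean of per-class IoU; classes absent from both inputs are skipped (assumption)."""
    ious = []
    for c in classes:
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy flattened voxel labels: 0 = empty, 1 and 2 = hypothetical semantic classes.
gt = [0, 1, 1, 2, 2, 0, 1, 2]
pred = [0, 1, 2, 2, 2, 1, 1, 0]

iou = occupancy_iou(pred, gt)
miou = semantic_miou(pred, gt, classes=[1, 2])
print(f"IoU = {iou:.3f}, mIoU = {miou:.3f}")
```

The toy case shows why the two numbers differ: a voxel predicted as the wrong class still counts as correctly occupied for IoU but hurts mIoU.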

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same projection-and-Jacobian bias could be inserted into other multi-view tasks such as depth completion or 3D object detection to enforce geometric consistency without extra supervision.
If the density-aware gating generalizes, similar coarse-to-fine schedules might reduce memory use in other dense prediction networks that currently process full-resolution volumes.
Testing whether the view-quality gate still works under strong lighting changes or partial camera failure would show how far the cross-view consistency claim extends beyond the nuScenes recording conditions.

Load-bearing premise

That projecting 3D offsets to image planes and adding the Jacobian as a bias term will reliably down-weight unreliable observations without introducing new inconsistencies across views.
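To make the premise concrete, here is a minimal sketch of how an additive log-bias can down-weight samples inside a softmax. It assumes that the smallest singular value of the projection Jacobian serves as the quality score and that a larger value means a better-conditioned observation. The paper's actual normalization and scaling are not given on this page, so the helper names and the use of a raw `log(sigma_min)` bias are illustrative assumptions.

```python
import math


def pinhole_jacobian(X, Y, Z, f):
    """2x3 Jacobian of a pinhole projection with focal length f (camera frame)."""
    return [[f / Z, 0.0, -f * X / Z**2],
            [0.0, f / Z, -f * Y / Z**2]]


def sigma_min_2x3(J):
    """Smallest singular value of a 2x3 matrix via the eigenvalues of J J^T."""
    a = sum(x * x for x in J[0])
    b = sum(x * y for x, y in zip(J[0], J[1]))
    c = sum(y * y for y in J[1])
    mean = 0.5 * (a + c)
    disc = math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    return math.sqrt(max(mean - disc, 0.0))


def biased_softmax(logits, biases):
    """Softmax over logits plus an additive bias, computed stably."""
    scores = [l + b for l, b in zip(logits, biases)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


f = 1000.0  # hypothetical focal length in pixels
samples = [(0.0, 0.0, 5.0), (0.0, 0.0, 50.0)]  # a near and a far sample point
sigmas = [sigma_min_2x3(pinhole_jacobian(X, Y, Z, f)) for X, Y, Z in samples]

content_logits = [0.0, 0.0]  # equal content scores, so only geometry differs
weights = biased_softmax(content_logits, [math.log(s) for s in sigmas])
print("sigma_min:", sigmas)
print("attention weights:", weights)
```

With equal content logits, the resulting weights are proportional to the singular values, so the far sample receives roughly a tenth of the weight of the near one. Whether this kind of bias stays consistent across cameras is exactly what the premise leaves open.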

What would settle it

An ablation on the SurroundOcc-nuScenes validation set in which the Jacobian bias is removed from PA-DA and the IoU and mIoU scores remain unchanged or improve.

Figures

Figures reproduced from arXiv: 2605.16911 by Danwei Wang, Fangjinhua Wang, Hesheng Wang, Hongming Shen, Junyi Ma, Rui Wang, Tianchen Deng, Xun Chen.

**Figure 1.** VGGT-Occ overview. (a) Prior methods restrict camera geometry to initial projection, leaving subsequent attention stages geometry-blind. (b) VGGT-Occ injects projection geometry into all attention stages via PA-DA, and allocates computation by voxel density via coarse-guided gated fusion.

**Figure 2.** VGGT-Occ architecture. VGGT unified encoding produces multi-scale 2D features. PA-DA injects projection geometry into three stages of cross-attention at coarse scales. Density-aware decoder uses convolutions only at fine scale, with coarse-guided gated fusion bridging scales.

**Figure 3.** PA-DA: three-stage projection-aware deformable attention. Stage 1 learns 3D offsets and projects them to each camera's image plane for cross-view consistency. Stage 2 decomposes the projection Jacobian to extract σ_min, encoding per-point observation quality as an additive log-bias. Stage 3 embeds the full 2×3 Jacobian for per-camera, per-channel gated fusion.

**Figure 4.** Visualization of the coarse-to-fine gated fusion. (Left) Cascaded fusion pipeline: base features (L0, L1) fused with intermediate predictions (Pre-L1, Pre-L2) via learned gates. (Right) Multi-view RGB inputs, final prediction (L2), and ground truth. Warmer gate colors indicate stronger coarse-level reliance.

**Figure 5.** Gate heatmap under challenging conditions. Daytime clutter (top), heavy rain (middle), and nighttime (bottom). Warmer colors indicate higher reliance on coarse-level semantic information, while cooler colors represent a shift toward fine-scale structural details. The gating mechanism dynamically adapts to both environmental noise and local geometric complexity.

**Figure 6.** Qualitative comparison on SurroundOcc-nuScenes. Qualitative results of VGGT-Occ compared with state-of-the-art methods. VGGT-Occ produces finer geometric structures and more accurate semantic boundaries, aligning much more closely with the ground truth in complex scenarios.

Abstract

3D semantic occupancy prediction requires accurate 2D-to-3D feature lifting, yet current methods restrict camera geometry to initial projections. Subsequent operations like offset learning, attention weighting, and cross-camera aggregation remain geometry-agnostic, ignoring essential physical constraints. We propose VGGT-Occ, a framework that embeds geometric tokens throughout the entire pipeline. We introduce Projection-Aware Deformable Attention (PA-DA) to inject geometry into all attention stages. PA-DA projects 3D offsets back to image planes and leverages the projection Jacobian as an additive bias to suppress unreliable observations. Features are then integrated through a view-quality semantic gate for cross-view consistency. To optimize both efficiency and performance, we employ a sequential coarse-to-fine decoder with gated fusion, where low-resolution features are refined into higher resolutions, allocating computation by information density while substantially reducing decoder cost. Extensive evaluations demonstrate the effectiveness and accuracy of our approach. On SurroundOcc-nuScenes, VGGT-Occ achieves 33.00% IoU and 21.08% mIoU (T=1), and 33.64% IoU and 21.43% mIoU with T=2 inference, outperforming existing methods, with only ~41M trainable parameters in the occupancy head. Code will be released publicly.
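The gated fusion step of the coarse-to-fine decoder can be sketched on a 1D toy signal. This is only an illustration of a sigmoid-gated blend between upsampled coarse features and fine features. The scalar gate parameters (`w_coarse`, `w_fine`, `bias`) are hypothetical stand-ins for learned layers, and the sketch does not model the density-based allocation of computation described in the abstract.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def upsample_nearest(coarse, factor):
    """Nearest-neighbour upsampling of a 1D feature list."""
    return [v for v in coarse for _ in range(factor)]


def gated_fusion(coarse, fine, w_coarse, w_fine, bias):
    """Blend upsampled coarse features with fine features through a sigmoid gate.

    A gate near 1 relies on the coarse level, a gate near 0 on the fine level.
    """
    up = upsample_nearest(coarse, len(fine) // len(coarse))
    gates = [sigmoid(w_coarse * c + w_fine * f + bias) for c, f in zip(up, fine)]
    fused = [g * c + (1.0 - g) * f for g, c, f in zip(gates, up, fine)]
    return fused, gates


coarse = [1.0, -1.0]
fine = [0.8, 1.2, 0.0, -2.0]

# Neutral gate: every position averages the two levels.
fused_neutral, gates_neutral = gated_fusion(coarse, fine, 0.0, 0.0, 0.0)

# Strongly positive bias: the gate saturates and the coarse level dominates.
fused_coarse, gates_coarse = gated_fusion(coarse, fine, 0.0, 0.0, 10.0)

print("neutral:", fused_neutral)
print("coarse-leaning:", fused_coarse)
```

In a real decoder the gate would be predicted per voxel from the features themselves, which is what lets the fusion lean on coarse context in some regions and on fine detail in others.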

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VGGT-Occ keeps geometry in the attention loop via PA-DA and a density-aware decoder, but the Jacobian bias contribution remains unverified in the reported results.

The main takeaway is that this paper tries to fix a real gap: most occupancy methods use camera geometry only for the initial lift and then treat everything else as geometry-free. VGGT-Occ adds Projection-Aware Deformable Attention, which reprojects 3D offsets to image space and adds the projection Jacobian as a bias term, plus a view-quality semantic gate and a sequential coarse-to-fine gated fusion decoder that allocates compute by information density. That package is the actual novelty. It produces 33.00% IoU and 21.08% mIoU on SurroundOcc-nuScenes with roughly 41M trainable parameters in the head, which beats the numbers the authors cite from prior work while staying light on the decoder side. The efficiency claim and the explicit coarse-to-fine structure are the parts that look practically useful for robotics pipelines that need to run on limited hardware. The motivation section also correctly identifies that later stages in deformable attention and cross-view aggregation usually ignore physical constraints, so the direction is coherent.

The soft spot is that the central mechanism, the Jacobian bias suppressing unreliable observations, has no isolating ablation or attention-weight analysis in the abstract or summary. Without seeing how the bias is normalized, scaled, or compared against a plain learned bias, it is hard to tell whether the reported gains come from the geometry injection or from the rest of the architecture. The view-quality gate faces the same issue: it is described as enforcing consistency, but there is no direct evidence that it avoids introducing new errors. If the full paper contains those checks and the numbers hold under standard splits, the contribution is solid.

This is for groups already working on camera-based 3D scene understanding for driving or robotics. A reader who needs concrete baselines and an efficiency-focused decoder would get something usable from it. I would send it to peer review because the performance targets are specific and the design choices are stated clearly enough for referees to test the missing pieces.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes VGGT-Occ, a 3D semantic occupancy prediction framework that embeds geometric tokens throughout the pipeline. It introduces Projection-Aware Deformable Attention (PA-DA) which projects 3D offsets back to image planes and adds the projection Jacobian as an additive bias to suppress unreliable observations, a view-quality semantic gate for cross-view consistency, and a sequential coarse-to-fine decoder with gated fusion that allocates computation according to information density. On the SurroundOcc-nuScenes benchmark the method reports 33.00% IoU and 21.08% mIoU (T=1) and 33.64% IoU and 21.43% mIoU (T=2) while using only ~41 M trainable parameters in the occupancy head, outperforming prior approaches.

Significance. If the geometry-grounded mechanisms and efficiency gains hold under rigorous verification, the work could meaningfully advance camera-based 3D occupancy prediction for autonomous driving and robotics. The explicit performance numbers, parameter-efficiency claim, and stated intention to release code publicly are concrete strengths that support potential impact.

major comments (2)

[Method (PA-DA)] Method section (PA-DA): the central claim that projecting 3D offsets and adding the projection Jacobian as an additive bias reliably suppresses unreliable 2D observations lacks any explicit formulation, normalization details, or derivation showing how the bias term alters attention weights relative to standard deformable attention. This mechanism is load-bearing for attributing the reported IoU/mIoU gains to geometry grounding rather than other factors.
[Experiments] Experiments section: no ablations isolate the contribution of the Jacobian bias versus the view-quality gate or the coarse-to-fine fusion, and no error bars or statistical significance tests are reported for the 33.00% IoU / 21.08% mIoU figures. Without these, the performance advantage over prior methods cannot be confidently linked to the proposed components.

minor comments (2)

[Abstract / Method] The abstract and method description refer to T=1 and T=2 inference without defining T or explaining its relation to the sequential decoder in the main text.
[Method] Notation for the Jacobian bias term and the view-quality semantic gate should be introduced with explicit equations rather than descriptive prose only.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our manuscript. We appreciate the acknowledgment of the potential impact of our geometry-grounded approach for camera-based 3D occupancy prediction. We address each major comment point by point below and indicate the revisions we will make to strengthen the paper.

Referee: [Method (PA-DA)] Method section (PA-DA): the central claim that projecting 3D offsets and adding the projection Jacobian as an additive bias reliably suppresses unreliable 2D observations lacks any explicit formulation, normalization details, or derivation showing how the bias term alters attention weights relative to standard deformable attention. This mechanism is load-bearing for attributing the reported IoU/mIoU gains to geometry grounding rather than other factors.

Authors: We agree that the manuscript would benefit from a more rigorous and explicit mathematical treatment of the PA-DA mechanism. While the current text describes the high-level operation of projecting 3D offsets and using the Jacobian as an additive bias, it does not include the full formulation, normalization procedure, or derivation of its effect on attention weights. In the revised version we will expand the Method section to provide these details, including the precise equations for the bias term, its normalization relative to standard deformable attention, and a short derivation showing how it modulates attention scores to down-weight unreliable projections. This addition will clarify the geometry-grounding contribution.

revision: yes

Referee: [Experiments] Experiments section: no ablations isolate the contribution of the Jacobian bias versus the view-quality gate or the coarse-to-fine fusion, and no error bars or statistical significance tests are reported for the 33.00% IoU / 21.08% mIoU figures. Without these, the performance advantage over prior methods cannot be confidently linked to the proposed components.

Authors: We acknowledge that the current experimental section does not contain component-wise ablations or statistical analysis of the reported metrics. We will add a dedicated ablation study that isolates the Jacobian bias term, the view-quality semantic gate, and the sequential coarse-to-fine gated fusion. In addition, we will rerun the main experiments with multiple random seeds and report mean IoU/mIoU values together with standard deviations; we will also include a brief statistical significance assessment (e.g., paired t-test) against the strongest baseline. These changes will be included in the revised manuscript.

revision: yes
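The multi-seed comparison promised in this response could look like the following sketch of a paired t-test. The scores are made-up placeholders chosen only to exercise the code. They are not results from the paper or from any baseline, and the helper name `paired_t` is hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for matched runs (same seeds)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Placeholder per-seed scores (NOT real results): one value per random seed.
method = [50.5, 50.4, 50.3, 50.5, 50.6]
baseline = [50.0, 50.0, 50.0, 50.0, 50.0]

t_stat, dof = paired_t(method, baseline)

# Two-sided critical value of Student's t at alpha = 0.05 for 4 degrees of freedom.
T_CRIT_DF4 = 2.776
print(f"t = {t_stat:.2f}, dof = {dof}, significant = {abs(t_stat) > T_CRIT_DF4}")
```

Pairing by seed removes run-to-run variance that is shared between the two methods, which matters when the reported gaps are well under one point.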

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper introduces VGGT-Occ as a new architecture that embeds geometric tokens via Projection-Aware Deformable Attention (PA-DA), which projects 3D offsets and adds the projection Jacobian as a bias term, followed by a view-quality semantic gate and coarse-to-fine gated fusion. These components are described as novel integrations grounded in standard camera projection geometry rather than derived from prior fitted parameters or self-citations within the paper. The abstract and method description present the approach as an empirical proposal with reported benchmark results (33.00% IoU, 21.08% mIoU), without any equations or steps that reduce the claimed performance gains to quantities defined by construction from the inputs. No self-definitional loops, fitted-input-as-prediction patterns, or load-bearing self-citations are evident in the provided text. The derivation chain remains self-contained as a proposed method evaluated externally on SurroundOcc-nuScenes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on standard camera projection assumptions and introduces new algorithmic modules; no explicit free parameters or invented physical entities are named in the abstract.

axioms (1)

domain assumption: The standard pinhole camera projection model accurately maps 3D points to image planes.
Invoked when PA-DA projects 3D offsets back to image planes and uses the projection Jacobian.

invented entities (2)

Projection-Aware Deformable Attention (PA-DA) (no independent evidence)
purpose: Inject geometry into all attention stages by re-projection and Jacobian bias.
New component introduced to address geometry-agnostic stages in prior methods.

view-quality semantic gate (no independent evidence)
purpose: Enforce cross-view consistency during feature integration.
New gating mechanism for multi-view fusion.
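As a rough illustration of what a gate for multi-view fusion does, the sketch below contrasts gated fusion with naive cross-camera averaging. It assumes per-camera, per-channel gate values in [0, 1] and a visibility flag per camera. In the paper the gates would be learned, whereas here they are supplied by hand, and the function names are hypothetical.

```python
def fuse_views(features, gates, visible):
    """Gate-weighted fusion across cameras.

    features[c][k]: feature of camera c, channel k
    gates[c][k]:    gate in [0, 1] for camera c, channel k (hand-set here)
    visible[c]:     whether the 3D point projects inside camera c
    """
    n_channels = len(features[0])
    fused = []
    for k in range(n_channels):
        num = sum(g[k] * f[k] for f, g, v in zip(features, gates, visible) if v)
        den = sum(g[k] for g, v in zip(gates, visible) if v)
        fused.append(num / den if den > 0 else 0.0)
    return fused


def naive_average(features, visible):
    """Plain mean over the cameras that see the point."""
    seen = [f for f, v in zip(features, visible) if v]
    return [sum(col) / len(seen) for col in zip(*seen)]


features = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
gates = [[0.9, 0.5], [0.1, 0.5], [1.0, 1.0]]
visible = [True, True, False]  # the third camera does not see the point

gated = fuse_views(features, gates, visible)
averaged = naive_average(features, visible)
print("gated:   ", gated)
print("averaged:", averaged)
```

In channel 0 the first camera is trusted far more than the second, so the gated result stays close to its feature, while the naive average treats both views equally.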

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "PA-DA projects 3D offsets back to image planes and leverages the projection Jacobian as an additive bias to suppress unreliable observations... view-quality semantic gate for cross-view consistency"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "sequential coarse-to-fine decoder with gated fusion... allocating computation by information density"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 2 internal anchors

[1] Maxim Berman, Amal Rannen Triki, and Matthew B. Blaschko. The Lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In CVPR, 2018.
[2] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
[3] Anh-Quan Cao and Raoul de Charette. MonoScene: Monocular 3D semantic scene completion. In CVPR, 2022.
[4] Loïck Chambon, Eloi Zablocki, Alexandre Boulch, Mickaël Chen, and Matthieu Cord. GaussRender: Learning 3D occupancy with Gaussian rendering. In ICCV, 2025.
[5] Tianchen Deng, Yaohui Chen, Leyan Zhang, Jianfei Yang, Shenghai Yuan, Jiuming Liu, Danwei Wang, Hesheng Wang, and Weidong Chen. Compact 3D Gaussian splatting for dense visual SLAM. arXiv preprint arXiv:2403.11247, 2024.
[6] Tianchen Deng, Xun Chen, Ziming Li, Hongming Shen, Danwei Wang, Javier Civera, and Hesheng Wang. UniPR-3D: Towards universal visual place recognition with visual geometry grounded transformer. arXiv preprint arXiv:2512.21078, 2025.
[7] Tianchen Deng, Yue Pan, Shenghai Yuan, Dong Li, Chen Wang, Mingrui Li, Long Chen, Lihua Xie, Danwei Wang, Jingchuan Wang, Javier Civera, Hesheng Wang, and Weidong Chen. What is the best 3D scene representation for robotics? from geometric to foundation models. arXiv preprint arXiv:2512.03422, 2025.
[8] Tianchen Deng, Wenhua Wu, Kunzhen Wu, Guangming Wang, Siting Zhu, Shenghai Yuan, Xun Chen, Guole Shen, Zhe Liu, and Hesheng Wang. Reloc-VGGT: Visual re-localization with geometry grounded transformer. arXiv preprint arXiv:2512.21883, 2025.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[10] Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. BEVDet: High-performance multi-camera 3D object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021.
[11] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3D semantic occupancy prediction. In CVPR, 2023.
[12] Yuanhui Huang, Wenzhao Zheng, Borui Zhang, Jie Zhou, and Jiwen Lu. SelfOcc: Self-supervised vision-based 3D occupancy prediction. In CVPR, 2024.
[13] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. GaussianFormer: Scene as Gaussians for vision-based 3D semantic occupancy prediction. In ECCV, 2024.
[14] Yuanhui Huang, Amonnut Thammatadatrakoon, Wenzhao Zheng, Yunpeng Zhang, Dalong Du, and Jiwen Lu. GaussianFormer-2: Probabilistic Gaussian superposition for efficient 3D occupancy prediction. In CVPR, 2025.
[15] Xiaohui Jiang, Shuailin Li, Yingfei Liu, Shihao Wang, Fan Jia, Tiancai Wang, Lijin Han, and Xiangyu Zhang. Far3D: Expanding the horizon for surround-view 3D object detection. In AAAI, 2024.
[16] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4), 2023.
[17] Yiming Li, Sihang Li, Xinhao Liu, Moonjun Gong, Kenan Li, Nuo Chen, Zijun Wang, Zhiheng Li, Tao Jiang, Fisher Yu, Yue Wang, Hang Zhao, Zhiding Yu, and Chen Feng. SSCBench: A large-scale 3D semantic scene completion benchmark for autonomous driving. arXiv preprint arXiv:2306.09001, 2023.
[18] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, 2022.
[19] Zhiqi Li, Zhiding Yu, David Austin, Mingsheng Fang, Shiyi Lan, Jan Kautz, and Jose M Alvarez. FB-OCC: 3D occupancy prediction based on forward-backward view transformation. In CVPR Workshop on End-to-End Autonomous Driving, 2023.
[20] Xuewu Lin, Tianwei Lin, Zixiang Pei, Lichao Huang, and Zhizhong Su. Sparse4D: Multi-view 3D object detection with sparse spatial-temporal fusion. arXiv preprint arXiv:2211.10581, 2022.
[21] Haisong Liu, Yang Chen, Haiguang Wang, Zetong Yang, Tianyu Li, Jia Zeng, Li Chen, Hongyang Li, and Limin Wang. Fully sparse 3D occupancy prediction. In ECCV, 2024.
[22] Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. PETR: Position embedding transformation for multi-view 3D object detection. In ECCV, 2022.
[23] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In CVPR, 2022.
[24] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[25] Junyi Ma, Xieyuanli Chen, Jiawei Huang, Jingyi Xu, Zhen Luo, Jintao Xu, Weihao Gu, Rui Ai, and Hesheng Wang. Cam4DOcc: Benchmark for camera-only 4D occupancy forecasting in autonomous driving applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21486–21495, 2024.
[26] Gyeongrok Oh, Sungjune Kim, Heeju Ko, Hyung-gun Chi, Jinkyu Kim, Dongwook Lee, Daehyun Ji, Sungjoon Choi, Sujin Jang, and Sangpil Kim. 3D occupancy prediction with low-resolution queries via prototype-aware view transformation. In CVPR, 2025.
[27] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. TMLR, 2024.
[28] Mingjie Pan, Jiaming Liu, Renrui Zhang, Peixiang Huang, Xiaoqi Li, Hongwei Xie, Bing Wang, Li Liu, and Shanghang Zhang. RenderOcc: Vision-centric 3D occupancy prediction with 2D rendering supervision. In ICRA, 2024.
[29] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In ECCV, 2020.
[30] Rui Qian, Haozhi Cao, Tianchen Deng, Tianxin Hu, Weixiang Guo, Shenghai Yuan, and Lihua Xie. TGSFormer: Scalable temporal Gaussian splatting for embodied semantic scene completion. arXiv preprint arXiv:2512.00300, 2025.
[31] Rui Qian, Haozhi Cao, Tianchen Deng, Shenghai Yuan, and Lihua Xie. SplatSSC: Decoupled depth-guided Gaussian splatting for semantic scene completion. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 8520–8528, 2026.
[32] Thomas Roddick, Alex Kendall, and Roberto Cipolla. Orthographic feature transform for monocular 3D object detection. In BMVC, 2019.
[33] Yunxiao Shi, Hong Cai, Jisoo Jeong, Yinhao Zhu, Shizhong Han, Amin Ansari, and Fatih Porikli. BePo: Dual representation for 3D occupancy prediction. In CVPR Workshop on Autonomous Driving, 2026.
[34] Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, and Hang Zhao. CTF-Occ: Coarse-to-fine 3D occupancy prediction. In NeurIPS, 2023.
[35] Xiaoyu Tian, Tao Jiang, Longfei Yun, Yue Wang, Yilun Wang, and Hang Zhao. Occ3D: A large-scale 3D occupancy prediction benchmark for autonomous driving. In NeurIPS, 2023.
[36] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In CVPR, 2025.
[37] Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. DETR3D: 3D object detection from multi-view images via 3D-to-2D queries. In CoRL, 2021.
[38] Yuqi Wang, Yuntao Chen, Xingyu Liao, Lue Fan, and Zhaoxiang Zhang. PanoOcc: Unified occupancy representation for camera-based 3D panoptic segmentation. In CVPR, 2024.
[39] Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. SurroundOcc: Multi-camera 3D occupancy prediction for autonomous driving. In ICCV, 2023.
[40] Huaiyuan Xu, Junliang Chen, Shiyu Meng, Yi Wang, and Lap-Pui Chau. A survey on occupancy perception for autonomous driving: The information fusion perspective. Information Fusion, 114:102671, 2025.
[41]

Depth anything: Unleashing the power of large-scale unlabeled data

Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. InCVPR, 2024

work page 2024
[42] Zichen Yu, Changyong Shu, Jiajun Deng, Kangjie Lu, Zongdai Liu, Jiangyong Yu, Dawei Yang, Hui Li, and Yan Chen. FlashOcc: Fast and memory-efficient occupancy prediction via channel-to-height plugin. arXiv preprint arXiv:2311.12058, 2023.
[43] Haiming Zhang, Yiyao Zhu, Wending Zhou, Xu Yan, Yingjie Cai, Bingbing Liu, Shuguang Cui, and Zhen Li. SQS: Enhancing sparse perception models via query-based splatting in autonomous driving. In NeurIPS, 2025.
[44] Yanan Zhang, Jinqing Zhang, Zengran Wang, Junhao Xu, and Di Huang. Vision-based 3D occupancy prediction in autonomous driving: a review and outlook. Frontiers of Computer Science, 20:2001301, 2026.
[45] Yunpeng Zhang, Zheng Zhu, and Dalong Du. OccFormer: Dual-path transformer for vision-based 3D semantic occupancy prediction. In ICCV, 2023.
[46] Lingjun Zhao, Sizhe Wei, James Hays, and Lu Gan. GaussianFormer3D: Multi-modal Gaussian-based semantic occupancy prediction with 3D deformable attention. In ICRA, 2026.
[47] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In ICLR, 2021.
[48] Xubo Zhu, Haoyang Zhang, Fei He, Rui Wu, Yanhu Shan, Wen Yang, and Huai Yu. Dr.Occ: Depth- and region-guided 3D occupancy from surround-view cameras for autonomous driving. In CVPR, 2026.
[49] Sicheng Zuo, Wenzhao Zheng, Xiaoyong Han, Longchao Yang, Yong Pan, and Jiwen Lu. QuadricFormer: Scene as superquadrics for 3D semantic occupancy prediction. In NeurIPS, 2025.
[50] Sicheng Zuo, Wenzhao Zheng, Yuanhui Huang, Jie Zhou, and Jiwen Lu. GaussianWorld: Gaussian world model for streaming 3D occupancy prediction. In CVPR, 2025.

[1] Maxim Berman, Amal Rannen Triki, and Matthew B. Blaschko. The Lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In CVPR, 2018.

[2] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In CVPR, 2020.

[3] Anh-Quan Cao and Raoul de Charette. MonoScene: Monocular 3D semantic scene completion. In CVPR, 2022.

[4] Loïck Chambon, Eloi Zablocki, Alexandre Boulch, Mickaël Chen, and Matthieu Cord. GaussRender: Learning 3D occupancy with Gaussian rendering. In ICCV, 2025.

[5] Tianchen Deng, Yaohui Chen, Leyan Zhang, Jianfei Yang, Shenghai Yuan, Jiuming Liu, Danwei Wang, Hesheng Wang, and Weidong Chen. Compact 3D Gaussian splatting for dense visual SLAM. arXiv preprint arXiv:2403.11247, 2024.

[6] Tianchen Deng, Xun Chen, Ziming Li, Hongming Shen, Danwei Wang, Javier Civera, and Hesheng Wang. UniPR-3D: Towards universal visual place recognition with visual geometry grounded transformer. arXiv preprint arXiv:2512.21078, 2025.

[7] Tianchen Deng, Yue Pan, Shenghai Yuan, Dong Li, Chen Wang, Mingrui Li, Long Chen, Lihua Xie, Danwei Wang, Jingchuan Wang, Javier Civera, Hesheng Wang, and Weidong Chen. What is the best 3D scene representation for robotics? From geometric to foundation models. arXiv preprint arXiv:2512.03422, 2025.

[8] Tianchen Deng, Wenhua Wu, Kunzhen Wu, Guangming Wang, Siting Zhu, Shenghai Yuan, Xun Chen, Guole Shen, Zhe Liu, and Hesheng Wang. Reloc-VGGT: Visual re-localization with geometry grounded transformer. arXiv preprint arXiv:2512.21883, 2025.

[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.

[10] Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. BEVDet: High-performance multi-camera 3D object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021.

[11] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3D semantic occupancy prediction. In CVPR, 2023.

[12] Yuanhui Huang, Wenzhao Zheng, Borui Zhang, Jie Zhou, and Jiwen Lu. SelfOcc: Self-supervised vision-based 3D occupancy prediction. In CVPR, 2024.

[13] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. GaussianFormer: Scene as Gaussians for vision-based 3D semantic occupancy prediction. In ECCV, 2024.

[14] Yuanhui Huang, Amonnut Thammatadatrakoon, Wenzhao Zheng, Yunpeng Zhang, Dalong Du, and Jiwen Lu. GaussianFormer-2: Probabilistic Gaussian superposition for efficient 3D occupancy prediction. In CVPR, 2025.

[15] Xiaohui Jiang, Shuailin Li, Yingfei Liu, Shihao Wang, Fan Jia, Tiancai Wang, Lijin Han, and Xiangyu Zhang. Far3D: Expanding the horizon for surround-view 3D object detection. In AAAI, 2024.

[16] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4), 2023.

[17] Yiming Li, Sihang Li, Xinhao Liu, Moonjun Gong, Kenan Li, Nuo Chen, Zijun Wang, Zhiheng Li, Tao Jiang, Fisher Yu, Yue Wang, Hang Zhao, Zhiding Yu, and Chen Feng. SSCBench: A large-scale 3D semantic scene completion benchmark for autonomous driving. arXiv preprint arXiv:2306.09001, 2023.

[18] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. BEVFormer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, 2022.

[19] Zhiqi Li, Zhiding Yu, David Austin, Mingsheng Fang, Shiyi Lan, Jan Kautz, and Jose M Alvarez. FB-OCC: 3D occupancy prediction based on forward-backward view transformation. In CVPR Workshop on End-to-End Autonomous Driving, 2023.

[20] Xuewu Lin, Tianwei Lin, Zixiang Pei, Lichao Huang, and Zhizhong Su. Sparse4D: Multi-view 3D object detection with sparse spatial-temporal fusion. arXiv preprint arXiv:2211.10581, 2022.

[21] Haisong Liu, Yang Chen, Haiguang Wang, Zetong Yang, Tianyu Li, Jia Zeng, Li Chen, Hongyang Li, and Limin Wang. Fully sparse 3D occupancy prediction. In ECCV, 2024.

[22] Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. PETR: Position embedding transformation for multi-view 3D object detection. In ECCV, 2022.

[23] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In CVPR, 2022.

[24] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.

[25] Junyi Ma, Xieyuanli Chen, Jiawei Huang, Jingyi Xu, Zhen Luo, Jintao Xu, Weihao Gu, Rui Ai, and Hesheng Wang. Cam4DOcc: Benchmark for camera-only 4D occupancy forecasting in autonomous driving applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21486–21495, 2024.

[26] Gyeongrok Oh, Sungjune Kim, Heeju Ko, Hyung-gun Chi, Jinkyu Kim, Dongwook Lee, Daehyun Ji, Sungjoon Choi, Sujin Jang, and Sangpil Kim. 3D occupancy prediction with low-resolution queries via prototype-aware view transformation. In CVPR, 2025.

[27] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. TMLR, 2024.

[28] Mingjie Pan, Jiaming Liu, Renrui Zhang, Peixiang Huang, Xiaoqi Li, Hongwei Xie, Bing Wang, Li Liu, and Shanghang Zhang. RenderOcc: Vision-centric 3D occupancy prediction with 2D rendering supervision. In ICRA, 2024.

[29] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In ECCV, 2020.

[30] Rui Qian, Haozhi Cao, Tianchen Deng, Tianxin Hu, Weixiang Guo, Shenghai Yuan, and Lihua Xie. TGSFormer: Scalable temporal Gaussian splatting for embodied semantic scene completion. arXiv preprint arXiv:2512.00300, 2025.

[31] Rui Qian, Haozhi Cao, Tianchen Deng, Shenghai Yuan, and Lihua Xie. SplatSSC: Decoupled depth-guided Gaussian splatting for semantic scene completion. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 8520–8528, 2026.

[32] Thomas Roddick, Alex Kendall, and Roberto Cipolla. Orthographic feature transform for monocular 3D object detection. In BMVC, 2019.

[33] Yunxiao Shi, Hong Cai, Jisoo Jeong, Yinhao Zhu, Shizhong Han, Amin Ansari, and Fatih Porikli. BePo: Dual representation for 3D occupancy prediction. In CVPR Workshop on Autonomous Driving, 2026.

[34] Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, and Hang Zhao. CTF-Occ: Coarse-to-fine 3D occupancy prediction. In NeurIPS, 2023.

[35] Xiaoyu Tian, Tao Jiang, Longfei Yun, Yue Wang, Yilun Wang, and Hang Zhao. Occ3D: A large-scale 3D occupancy prediction benchmark for autonomous driving. In NeurIPS, 2023.
