Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 20:17 UTC · model grok-4.3
The pith
VoxSAMNet uses a dummy shortcut to skip empty voxels and foreground modulation to improve monocular 3D semantic scene completion.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The core discovery is that the Dummy Shortcut for Feature Refinement (DSFR) module bypasses empty voxels via a shared dummy node while refining occupied ones with deformable attention, and the Foreground Modulation Strategy with Foreground Dropout and Text-Guided Image Filter alleviates overfitting on long-tailed classes. Together they enable state-of-the-art results of 18.2% mIoU on SemanticKITTI and 20.2% on SSCBench-KITTI-360, beating earlier monocular and stereo approaches.
What carries the argument
DSFR module using a shared dummy node to handle voxel sparsity in attention, together with Foreground Modulation Strategy to address semantic imbalance.
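As a rough illustration of the dummy-node idea (not the paper's implementation: the function name, shapes, and the plain softmax self-attention standing in for deformable attention are all assumptions), empty voxels can all be routed to one shared feature while only the occupied minority pays for attention:

```python
import numpy as np

def dummy_shortcut_attention(voxels, occupied_mask, dummy_feature):
    """Route empty voxels to a shared dummy feature; refine occupied ones.

    voxels:        (N, C) voxel features
    occupied_mask: (N,) bool, True where a voxel is occupied
    dummy_feature: (C,) shared placeholder (learnable in a real model)
    """
    out = np.empty_like(voxels)
    out[~occupied_mask] = dummy_feature          # empty voxels skip attention
    occ = voxels[occupied_mask]                  # (M, C), M << N in SSC scenes
    scores = occ @ occ.T / np.sqrt(occ.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    out[occupied_mask] = weights @ occ           # softmax self-attention
    return out

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4))
mask = np.array([True, False] * 4)
refined = dummy_shortcut_attention(feats, mask, dummy_feature=np.zeros(4))
```

With over 93% of voxels empty, the attention cost scales with the small occupied count M rather than the full grid size N, which is the efficiency claim the DSFR design rests on.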
If this is right
- Delivers higher mIoU than previous methods on two KITTI-based benchmarks.
- Minimizes processing of the vast majority of empty voxels.
- Improves performance on rare foreground semantic classes.
- Provides evidence that sparsity and imbalance must be explicitly modeled in voxel-based SSC.
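The Foreground Dropout half of the modulation strategy can be sketched as masking rare-class voxels at training time. This is a guess at the mechanism from the abstract alone; the function, its drop probability, and the demo shapes are hypothetical:

```python
import numpy as np

def foreground_dropout(features, foreground_mask, p, rng):
    """Zero a random subset of foreground voxel features during training.

    features:        (N, C) voxel features
    foreground_mask: (N,) bool, True for voxels of rare foreground classes
    p:               drop probability, applied to foreground voxels only
    """
    drop = foreground_mask & (rng.random(features.shape[0]) < p)
    out = features.copy()
    out[drop] = 0.0      # hidden this step, so the network cannot simply
    return out, drop     # memorize the few rare-class instances it sees

rng = np.random.default_rng(1)
feats = np.ones((1000, 3))
fg = np.zeros(1000, dtype=bool)
fg[:100] = True          # ~10% foreground, mimicking the class imbalance
dropped_feats, dropped = foreground_dropout(feats, fg, p=0.5, rng=rng)
```

Background voxels are never touched; only the over-fitted foreground signal is intermittently withheld.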
Where Pith is reading between the lines
- Similar dummy node tricks could simplify attention in other sparse 3D data structures like point clouds.
- The text-guided filter suggests a way to incorporate language priors into vision models for better class balance.
- Lower compute from skipping empties may enable real-time SSC on embedded hardware.
Load-bearing premise
That the reported performance gains result directly from the DSFR module and Foreground Modulation Strategy rather than from choices in training, augmentation, or the underlying network architecture.
What would settle it
Reproducing the baseline methods with identical training settings and finding that adding the proposed modules does not produce the claimed mIoU gains on SemanticKITTI would disprove the contribution of those components.
Figures
Original abstract
Monocular Semantic Scene Completion (SSC) aims to reconstruct complete 3D semantic scenes from a single RGB image, offering a cost-effective solution for autonomous driving and robotics. However, the inherently imbalanced nature of voxel distributions, where over 93% of voxels are empty and foreground classes are rare, poses significant challenges. Existing methods often suffer from redundant emphasis on uninformative voxels and poor generalization to long-tailed categories. To address these issues, we propose VoxSAMNet (Voxel Sparsity-Aware Modulation Network), a unified framework that explicitly models voxel sparsity and semantic imbalance. Our approach introduces: (1) a Dummy Shortcut for Feature Refinement (DSFR) module that bypasses empty voxels via a shared dummy node while refining occupied ones with deformable attention; and (2) a Foreground Modulation Strategy combining Foreground Dropout (FD) and Text-Guided Image Filter (TGIF) to alleviate overfitting and enhance class-relevant features. Extensive experiments on the public benchmarks SemanticKITTI and SSCBench-KITTI-360 demonstrate that VoxSAMNet achieves state-of-the-art performance, surpassing prior monocular and stereo baselines with mIoU scores of 18.2% and 20.2%, respectively. Our results highlight the importance of sparsity-aware and semantics-guided design for efficient and accurate 3D scene completion, offering a promising direction for future research.
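The Text-Guided Image Filter is described only at a high level. One plausible reading, sketched with made-up shapes and a cosine-similarity gate (CLIP-style class embeddings are an assumption, not something the abstract confirms):

```python
import numpy as np

def text_guided_filter(image_feats, text_embeds):
    """Gate image features by their best cosine similarity to class texts.

    image_feats: (P, C) per-pixel image features
    text_embeds: (K, C) one embedding per semantic class name
    """
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sim = img @ txt.T                     # (P, K) cosine similarities
    gate = (sim.max(axis=1) + 1.0) / 2.0  # map [-1, 1] to [0, 1]
    return image_feats * gate[:, None]

# Two classes along the first two axes: a pixel aligned with a class
# passes through unchanged, an anti-aligned pixel is attenuated.
texts = np.eye(3)[:2]
pixels = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
filtered = text_guided_filter(pixels, texts)
```

The point of the sketch is only the shape of the idea: language embeddings supply a per-pixel relevance score that emphasizes class-relevant features before voxelization.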
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes VoxSAMNet for monocular 3D semantic scene completion, addressing the challenges of extreme voxel sparsity (over 93% empty voxels) and long-tailed foreground classes. It introduces the DSFR module, which uses a dummy shortcut to bypass empty voxels and deformable attention on occupied ones, along with a Foreground Modulation Strategy combining Foreground Dropout (FD) and Text-Guided Image Filter (TGIF). Experiments on SemanticKITTI and SSCBench-KITTI-360 report SOTA mIoU scores of 18.2% and 20.2%, outperforming prior monocular and stereo baselines.
Significance. If the reported gains prove robustly attributable to the sparsity-aware and foreground-modulation components rather than implementation details, the work would offer a practical advance in efficient SSC for autonomous driving by reducing redundant computation on empty space and improving rare-class performance.
Major comments (1)
- [Experiments] Experiments section (likely §4): The central attribution of the 18.2%/20.2% mIoU gains to DSFR and FD+TGIF is load-bearing but unsupported without matched re-implementations of baselines under identical training schedules, augmentations, optimizers, and backbones. Table 1 or 2 reports overall results but provides no ablation isolating these modules from confounders, undermining the claim that sparsity modeling and text-guided filtering are the sources of improvement.
Minor comments (1)
- [Abstract] The abstract and introduction could more precisely define the DSFR dummy node and TGIF text embedding process to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below and commit to revisions that strengthen the experimental validation.
Point-by-point responses
Referee: The central attribution of the 18.2%/20.2% mIoU gains to DSFR and FD+TGIF is load-bearing but unsupported without matched re-implementations of baselines under identical training schedules, augmentations, optimizers, and backbones. Table 1 or 2 reports overall results but provides no ablation isolating these modules from confounders, undermining the claim that sparsity modeling and text-guided filtering are the sources of improvement.
Authors: We agree that clear isolation of the DSFR module and Foreground Modulation Strategy (FD + TGIF) is essential to support the attribution of gains. The manuscript already contains ablation studies (Tables 3 and 4) that remove each proposed component while holding training schedule, augmentations, optimizer, and backbone fixed, showing consistent drops in mIoU. To directly address the concern about matched baseline re-implementations, we will add a new set of experiments in the revised version that re-train the strongest prior monocular and stereo baselines under identical conditions to our method. These results will be reported alongside the existing tables to demonstrate that the observed improvements stem from the sparsity-aware and foreground-modulation designs rather than implementation differences.
Revision: yes
Circularity Check
No circularity: empirical architecture evaluated on external benchmarks
full rationale
The paper proposes VoxSAMNet with a DSFR module (dummy shortcut + deformable attention) and Foreground Modulation (FD + TGIF), then reports mIoU results from training and evaluation on SemanticKITTI and SSCBench-KITTI-360. These are standard empirical outcomes on external data under standard training procedures, not predictions or derivations that reduce to the paper's own inputs or equations by construction. There are no mathematical first-principles claims, no fitted parameters renamed as predictions, and no self-citation chains bearing the central result. The work is evaluated directly against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "a Dummy Shortcut for Feature Refinement (DSFR) module that bypasses empty voxels via a shared dummy node while refining occupied ones with deformable attention"
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "over 93% of voxels are empty and foreground classes are rare"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Anunay, Pankaj, and Chhavi Dhiman. DepthNet: A monocular depth estimation framework. In 2021 International Conference on Engineering and Emerging Technologies (ICEET), pages 1–6, 2021.
- [2] Jongseong Bae, Junwoo Ha, and Ha Young Kim. Three cars approaching within 100m! Enhancing distant geometry by tri-axis voxel scanning for camera-based semantic scene completion. 2025.
- [3] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9297–9307, 2019.
- [4] Anh-Quan Cao and Raoul de Charette. MonoScene: Monocular 3D semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3991–4001, 2022.
- [5] Özgün Çiçek, Ahmed Abdulkadir, Soeren S. Lienkamp, Thomas Brox, and Olaf Ronneberger. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 424–432. Springer, 2016.
- [6] Yubo Cui, Zhiheng Li, Jiaqiang Wang, and Zheng Fang. LOMA: Language-assisted semantic occupancy network via triplane Mamba. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2609–2617, 2025.
- [7] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
- [8] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
- [9] Yu He, Kang Zhou, and Lifang Tian. Multi-modal scene graph inspired policy for visual navigation. The Journal of Supercomputing, 81(1):107, 2025.
- [10] Haoyi Jiang, Tianheng Cheng, Naiyu Gao, Haoyang Zhang, Tianwei Lin, Wenyu Liu, and Xinggang Wang. Symphonize 3D semantic scene completion with contextual instance queries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20258–20267, 2024.
- [11] Hyo-Jun Lee, Yeong Jun Koh, Hanul Kim, Hyunseop Kim, Yonguk Lee, and Jinu Lee. SOAP: Vision-centric 3D semantic scene completion with scene-adaptive decoder and occluded region-aware view projection. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17145–17154, 2025.
- [12] Bohan Li, Yasheng Sun, Xin Jin, Wenjun Zeng, Zheng Zhu, Xiaofeng Wang, Yunpeng Zhang, James Okae, Hang Xiao, and Dalong Du. StereoScene: BEV-assisted stereo matching empowers 3D semantic scene completion. arXiv preprint arXiv:2303.13959, 2023.
- [13] Hongyang Li, Hao Zhang, Zhaoyang Zeng, Shilong Liu, Feng Li, Tianhe Ren, and Lei Zhang. DFA3D: 3D deformable attention for 2D-to-3D feature lifting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6684–6693, 2023.
- [14] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022.
- [15] Yiming Li, Zhiding Yu, Christopher Choy, Chaowei Xiao, Jose M. Alvarez, Sanja Fidler, Chen Feng, and Anima Anandkumar. VoxFormer: Sparse voxel transformer for camera-based 3D semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9087–9098, 2023.
- [16] Yiming Li, Sihang Li, Xinhao Liu, Moonjun Gong, Kenan Li, Nuo Chen, Zijun Wang, Zhiheng Li, Tao Jiang, Fisher Yu, et al. SSCBench: A large-scale 3D semantic scene completion benchmark for autonomous driving. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 13333–13340. IEEE, 2024.
- [17] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. arXiv preprint arXiv:2203.17270, 2022.
- [18] Zhiqi Li, Zhiding Yu, David Austin, Mingsheng Fang, Shiyi Lan, Jan Kautz, and Jose M. Alvarez. FB-OCC: 3D occupancy prediction based on forward-backward view transformation. arXiv preprint arXiv:2307.01492, 2023.
- [19] Jing Liang, He Yin, Xuewei Qi, Jong Jin Park, Min Sun, Rajasimman Madhivanan, and Dinesh Manocha. ET-Former: Efficient triplane deformable attention for 3D semantic scene completion from monocular camera. arXiv preprint arXiv:2410.11019, 2024.
- [20] Li Liang, Naveed Akhtar, Jordan Vice, Xiangrui Kong, and Ajmal Saeed Mian. Skip Mamba diffusion for monocular 3D semantic scene completion. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5155–5163, 2025.
- [21] Yiyi Liao, Jun Xie, and Andreas Geiger. KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2D and 3D. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3292–3310, 2022.
- [22] Liang Lin, Zhihao Xu, Junhao Dong, Jian Zhao, Yuchen Yuan, Guibin Zhang, Miao Yu, Yiming Zhang, Zhengtao Yao, Huahui Yi, et al. OrthAlign: Orthogonal subspace decomposition for non-interfering multi-objective alignment. arXiv preprint arXiv:2509.24610, 2025.
- [23] Liang Lin, Miao Yu, Kaiwen Luo, Yibo Zhang, Lilan Peng, Dexian Wang, Xuehai Tang, Yuanhe Zhang, Xikang Yang, Zhenhong Zhou, et al. Hidden in the noise: Unveiling backdoors in audio LLMs alignment through latent acoustic pattern triggers. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 32015–32023, 2026.
- [24] Enyu Liu, En Yu, Sijia Chen, and Wenbing Tao. Disentangling instance and scene contexts for 3D semantic scene completion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 26999–27009, 2025.
- [25] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
- [26] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024.
- [27] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
- [28] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [29] Haoang Lu, Yuanqi Su, Xiaoning Zhang, Longjun Gao, Yu Xue, and Le Wang. VisHall3D: Monocular semantic scene completion from reconstructing the visible regions to hallucinating the invisible regions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 28674–28684, 2025.
- [30] Jianbiao Mei, Yu Yang, Mengmeng Wang, Junyu Zhu, Jongwon Ra, Yukai Ma, Laijian Li, and Yong Liu. Camera-based 3D semantic scene completion with sparse guidance network. IEEE Transactions on Image Processing, 33:5468–5481, 2024.
- [31] Ruihang Miao, Weizhou Liu, Mingrui Chen, Zheng Gong, Weixin Xu, Chen Hu, and Shuchang Zhou. OccDepth: A depth-aware method for 3D semantic scene completion. arXiv preprint arXiv:2302.13540, 2023.
- [32] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In European Conference on Computer Vision, pages 194–210. Springer, 2020.
- [33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [34] Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [35] Meng Wang, Yan Ding, Yumeng Liu, Yunchuan Qin, Ruihui Li, and Zhuo Tang. MixSSC: Forward-backward mixture for vision-based 3D semantic scene completion. IEEE Transactions on Circuits and Systems for Video Technology, 35(6):5684–5696, 2025.
- [36] Meng Wang, Huilong Pi, Ruihui Li, Yunchuan Qin, Zhuo Tang, and Kenli Li. VLScene: Vision-language guidance distillation for camera-based 3D semantic scene completion. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7808–7816, 2025.
- [37] Meng Wang, Fan Wu, Ruihui Li, Yunchuan Qin, Zhuo Tang, and Kenli Li. Learning temporal 3D semantic scene completion via optical flow guidance. arXiv preprint arXiv:2502.14520, 2025.
- [38] Meng Wang, Fan Wu, Yunchuan Qin, Ruihui Li, Zhuo Tang, and Kenli Li. Vision-based 3D semantic scene completion via capture dynamic representations. Knowledge-Based Systems, page 114550, 2025.
- [39] Song Wang, Jiawei Yu, Wentong Li, Wenyu Liu, Xiaolu Liu, Junbo Chen, and Jianke Zhu. Not all voxels are equal: Hardness-aware semantic scene completion with self-distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14792–14801, 2024.
- [40] Yu Wang and Chao Tong. H2GFormer: Horizontal-to-global voxel transformer for 3D semantic scene completion. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5722–5730, 2024.
- [41] Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. DETR3D: 3D object detection from multi-view images via 3D-to-2D queries. In Conference on Robot Learning, pages 180–191. PMLR, 2022.
- [42] Jiawei Yao and Jusheng Zhang. DepthSSC: Depth-spatial alignment and dynamic voxel resolution for monocular 3D semantic scene completion. arXiv preprint arXiv:2311.17084, 2023.
- [43] Jiawei Yao, Chuming Li, Keqiang Sun, Yingjie Cai, Hao Li, Wanli Ouyang, and Hongsheng Li. NDC-Scene: Boost monocular 3D semantic scene completion in normalized device coordinates space. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9421–9431. IEEE, 2023.
- [44] Zhu Yu, Runmin Zhang, Jiacheng Ying, Junchen Yu, Xiaohai Hu, Lun Luo, Si-Yuan Cao, and Hui-Liang Shen. Context and geometry aware voxel transformer for semantic scene completion. Advances in Neural Information Processing Systems, 37:1531–1555, 2024.
- [45] Dongxu Zhang, Ning Yang, Jihua Zhu, Jinnan Yang, Miao Xin, and Baoliang Tian. AsCoT: An adaptive self-correction chain-of-thought method for late-stage fragility in LLMs. arXiv preprint arXiv:2508.05282, 2025.
- [46] Dongxu Zhang, Hongqiang Lin, Yiding Sun, Pengyu Wang, Qirui Wang, Ning Yang, and Jihua Zhu. Not all queries need deep thought: CoFiCoT for adaptive coarse-to-fine stateful refinement. arXiv preprint arXiv:2603.08251, 2026.
- [47] Dongxu Zhang, Yiding Sun, Pengcheng Li, Yumou Liu, Hongqiang Lin, Haoran Xu, Xiaoxuan Mu, Liang Lin, Wenbiao Yan, Ning Yang, et al. PointCoT: A multi-modal benchmark for explicit 3D geometric reasoning. arXiv preprint arXiv:2602.23945, 2026.
- [48] Dongxu Zhang, Yiding Sun, Cheng Tan, Wenbiao Yan, Ning Yang, Jihua Zhu, and Haijun Zhang. Chain-of-thought compression should not be blind: V-Skip for efficient multi-modal reasoning via dual-path anchoring. arXiv preprint arXiv:2601.13879, 2026.
- [49] Dongxu Zhang, Yingsen Wang, Yiding Sun, Haoran Xu, Peilin Fan, and Jihua Zhu. CMHANet: A cross-modal hybrid attention network for point cloud registration. Neurocomputing, page 133318, 2026.
- [50] Dongxu Zhang, Jihua Zhu, Shiqi Li, Wenbiao Yan, Haoran Xu, Peilin Fan, and Huimin Lu. IGASA: Integrated geometry-aware and skip-attention modules for enhanced point cloud registration. IEEE Transactions on Circuits and Systems for Video Technology, 2026.
- [51] Renrui Zhang, Han Qiu, Tai Wang, Ziyu Guo, Ziteng Cui, Yu Qiao, Hongsheng Li, and Peng Gao. MonoDETR: Depth-guided transformer for monocular 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9155–9166, 2023.
- [52] Yunpeng Zhang, Zheng Zhu, and Dalong Du. OccFormer: Dual-path transformer for vision-based 3D semantic occupancy prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9433–9443, 2023.
- [53] Yuheng Zhang, Mengfei Duan, Kunyu Peng, Yuhang Wang, Ruiping Liu, Fei Teng, Kai Luo, Zhiyong Li, and Kailun Yang. Out-of-distribution semantic occupancy prediction.
- [54] Yuchen Zhang, Yaxiong Wang, Yujiao Wu, Lianwei Wu, Li Zhu, and Zhedong Zheng. The coherence trap: When MLLM-crafted narratives exploit manipulated visual contexts. 2026.
- [55] Yupeng Zheng, Xiang Li, Pengfei Li, Yuhang Zheng, Bu Jin, Chengliang Zhong, Xiaoxiao Long, Hao Zhao, and Qichao Zhang. MonoOcc: Digging into monocular semantic occupancy prediction. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 18398–18405. IEEE, 2024.
- [56] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.