pith. machine review for the scientific record.

arxiv: 2604.18831 · v1 · submitted 2026-04-20 · 💻 cs.CV · cs.RO

Recognition: unknown

Feasibility of Indoor Frame-Wise Lidar Semantic Segmentation via Distillation from Visual Foundation Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:45 UTC · model grok-4.3

classification 💻 cs.CV · cs.RO
keywords indoor lidar · semantic segmentation · visual foundation models · cross-modal distillation · pseudo-labeling · frame-wise processing · 3D scene understanding · SLAM datasets

The pith

Distilling labels from visual foundation models to lidar points enables frame-wise indoor semantic segmentation without manual 3D annotations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether visual foundation models that segment camera images can supply training labels for a lidar semantic segmentation model in indoor scenes by projecting those labels onto corresponding 3D points. This cross-modal approach sidesteps the expense of creating ground-truth lidar labels, which has limited progress on indoor data compared with outdoor driving scenes. The authors couple each lidar scan with a VFM-processed image and evaluate on indoor SLAM sequences using both pseudo-labels and a small set of real annotations. Success here would make scalable training of indoor lidar models practical for robotics and mapping tasks.

Core claim

A 2D-to-3D distillation pipeline that transfers semantic labels from visual foundation models onto indoor lidar points produces a segmentation model reaching up to 56 percent mIoU under pseudo-label evaluation and around 36 percent mIoU on a manually annotated validation set, establishing that the cross-modal method is feasible for frame-wise indoor lidar segmentation without manual annotations.
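
Both headline numbers are mean intersection-over-union (mIoU) scores. For orientation, here is a minimal sketch of the standard metric as computed from point-wise label arrays; this is the textbook definition, not code from the paper:

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=-1):
    """Mean intersection-over-union for point-wise class labels.

    pred, gt: 1-D integer arrays of per-point class ids in [0, num_classes).
    Points labeled `ignore_index` in gt (e.g. unannotated points) are skipped.
    Returns (mIoU, per-class IoU array).
    """
    valid = gt != ignore_index
    pred, gt = pred[valid], gt[valid]
    # Confusion matrix: rows = ground truth, columns = prediction.
    cm = np.bincount(gt * num_classes + pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)   # guard against empty classes
    present = union > 0                  # average only over classes that occur
    return iou[present].mean(), iou
```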

What carries the argument

The 2D-to-3D projection step that aligns VFM-generated image labels with lidar points to create pseudo-training data for a 3D segmentation network.
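
The abstract does not spell out the projection step, so the following is only a generic sketch of the pinhole version it most likely resembles, assuming a known 4x4 lidar-to-camera extrinsic `T_cam_lidar`, a 3x3 intrinsic matrix `K`, and a per-pixel class map `label_img` from the VFM (all names hypothetical):

```python
import numpy as np

def project_labels(points, T_cam_lidar, K, label_img, ignore_index=-1):
    """Transfer per-pixel VFM labels to lidar points by pinhole projection.

    points:      (N, 3) lidar points in the lidar frame.
    T_cam_lidar: (4, 4) extrinsic transform, lidar frame -> camera frame.
    K:           (3, 3) camera intrinsic matrix.
    label_img:   (H, W) integer class ids from the VFM segmentation.
    Returns (N,) pseudo-labels; points behind the camera or outside
    the image get `ignore_index`.
    """
    # Homogeneous transform into the camera frame.
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    z = cam[:, 2]

    # Pinhole projection to pixel coordinates (z <= 0 is masked out below).
    uv = (K @ cam.T).T
    u = uv[:, 0] / np.where(z != 0, z, 1)
    v = uv[:, 1] / np.where(z != 0, z, 1)

    h, w = label_img.shape
    ui, vi = np.round(u).astype(int), np.round(v).astype(int)
    in_view = (z > 0) & (ui >= 0) & (ui < w) & (vi >= 0) & (vi < h)

    labels = np.full(len(points), ignore_index, dtype=int)
    labels[in_view] = label_img[vi[in_view], ui[in_view]]
    return labels
```

Note that this naive version also labels points the camera cannot actually see; occlusion and calibration drift are exactly the indoor failure modes flagged in the load-bearing premise below.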

If this is right

  • Indoor SLAM datasets become usable for training lidar models through VFM pseudo-labels.
  • Frame-wise processing becomes viable by pairing each lidar scan with a synchronized camera image (a pairing sketch follows this list).
  • The cost of creating large indoor lidar segmentation datasets decreases because manual 3D labeling is no longer required.
  • A small real-labeled set can still serve as validation even when the bulk of training uses distillation.
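
On the frame-pairing bullet above: in practice, synchronization usually reduces to nearest-timestamp matching between the lidar and camera streams. A minimal sketch; the function name and tolerance are illustrative, not from the paper:

```python
import numpy as np

def pair_frames(lidar_ts, cam_ts, max_dt=0.05):
    """Match each lidar scan to its nearest camera frame in time.

    lidar_ts, cam_ts: 1-D arrays of timestamps (seconds), sorted ascending.
    max_dt: illustrative tolerance; pairs further apart are dropped.
    Returns a list of (lidar_index, camera_index) pairs.
    """
    idx = np.searchsorted(cam_ts, lidar_ts)
    pairs = []
    for i, t in enumerate(lidar_ts):
        # Candidates: the camera frames immediately before and after t.
        cands = [j for j in (idx[i] - 1, idx[i]) if 0 <= j < len(cam_ts)]
        j = min(cands, key=lambda k: abs(cam_ts[k] - t))
        if abs(cam_ts[j] - t) <= max_dt:
            pairs.append((i, j))
    return pairs
```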

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach may extend to other indoor environments where camera-lidar pairs are available but full 3D annotations are scarce.
  • Combining the distilled model with temporal consistency across frames could improve performance in dynamic indoor scenes.
  • If projection accuracy holds across more sensor setups, the method could support large-scale collection of indoor 3D semantic data for downstream mapping applications.

Load-bearing premise

The projection of 2D image labels onto 3D lidar points stays accurate enough in indoor settings despite differences in calibration, occlusions, and lighting.

What would settle it

A substantial drop in mIoU on the real-labeled indoor validation set when the same pipeline is applied to datasets with different sensor calibrations or denser occlusions would indicate the projection step fails to support usable training.
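
Short of collecting new datasets, one way to approximate this test is to perturb the extrinsic calibration and measure how quickly agreement with the manual labels decays. A hedged sketch reusing the `project_labels` helper above; the perturbation magnitudes are purely illustrative:

```python
import numpy as np

def perturb_extrinsics(T, rot_deg=1.0, trans_m=0.02, rng=None):
    """Apply a small random rotation (about the camera z-axis) and
    translation to a 4x4 extrinsic; magnitudes are illustrative only.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    a = np.deg2rad(rot_deg) * rng.standard_normal()
    Rz = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0,        0.0,       1.0]])
    T_p = T.copy()
    T_p[:3, :3] = Rz @ T[:3, :3]
    T_p[:3, 3] += trans_m * rng.standard_normal(3)
    return T_p

# Comparing mean_iou of labels projected with T versus perturb_extrinsics(T)
# against the manual annotations would put a number on calibration sensitivity.
```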

Figures

Figures reproduced from arXiv:2604.18831 by George Vosselman, Haiyang Wu, Juan J. Gonzales Torres, and Ville Lehtola.

Figure 1: Workflow of ScaLR adaptation to indoor datasets.
Figure 2: Comparison of lidar scan projections in different …
Figure 3: Example of cross-modal alignment between …
Figure 4: Manually labeled results on the ITC dataset.
Figure 5: Qualitative comparison of segmentation outputs from …
Original abstract

Frame-wise semantic segmentation of indoor lidar scans is a fundamental step toward higher-level 3D scene understanding and mapping applications. However, acquiring frame-wise ground truth for training deep learning models is costly and time-consuming. This challenge is largely addressed, for imagery, by Visual Foundation Models (VFMs) which segment image frames. The same VFMs may be used to train a lidar scan frame segmentation model via a 2D-to-3D distillation pipeline. The success of such distillation has been shown for autonomous driving scenes, but not yet for indoor scenes. Here, we study the feasibility of repeating this success for indoor scenes, in a frame-wise distillation manner by coupling each lidar scan with a VFM-processed camera image. The evaluation is done using indoor SLAM datasets, where pseudo-labels are used for downstream evaluation. Also, a small manually annotated lidar dataset is provided for validation, as there are no other lidar frame-wise indoor datasets with semantics. Results show that the distilled model achieves up to 56% mIoU under pseudo-label evaluation and around 36% mIoU with real-label, demonstrating the feasibility of cross-modal distillation for indoor lidar semantic segmentation without manual annotations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents a study on the feasibility of using visual foundation models (VFMs) to generate pseudo-labels for training a LiDAR semantic segmentation network on indoor frame-wise scans through 2D-to-3D distillation. It couples LiDAR scans with camera images processed by VFMs, projects the labels to 3D points, and trains a LiDAR model. Evaluation on indoor SLAM datasets yields up to 56% mIoU with pseudo-labels and 36% mIoU on a small, newly provided set of manual annotations, supporting the claim that such distillation is feasible for indoor scenes without manual training annotations.

Significance. If the projection accuracy holds, this work is significant as it addresses the scarcity of annotated indoor LiDAR data by extending outdoor cross-modal distillation techniques indoors, which could enable scalable training for robotics and 3D mapping applications. The provision of a small manually annotated LiDAR dataset is a valuable contribution for benchmarking and validation.

Major comments (1)
  1. [Abstract and Evaluation] The feasibility claim depends on the 2D-to-3D projection of VFM labels serving as usable supervision. The manuscript reports 56% pseudo-label mIoU and 36% real-label mIoU but does not quantify projection error or label agreement between the projected VFM labels and the manual annotations. This leaves open whether the performance gap reflects successful transfer or label noise from indoor challenges such as occlusions, calibration drift, and viewpoint variations.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the value of the manually annotated indoor LiDAR dataset. We address the major comment below and will revise the manuscript to strengthen the evaluation.

Point-by-point responses
  1. Referee: [Abstract and Evaluation] The feasibility claim depends on the 2D-to-3D projection of VFM labels serving as usable supervision. The manuscript reports 56% pseudo-label mIoU and 36% real-label mIoU but does not quantify projection error or label agreement between the projected VFM labels and the manual annotations. This leaves open whether the performance gap reflects successful transfer or label noise from indoor challenges such as occlusions, calibration drift, and viewpoint variations.

    Authors: We agree that explicitly quantifying the agreement between projected VFM pseudo-labels and the manual annotations would better isolate projection errors from distillation performance and strengthen the feasibility claim. In the revised manuscript we will add this analysis by computing mIoU (and per-class IoU) between the projected labels and the provided ground-truth annotations on the small manually labeled set. We will also expand the discussion of indoor-specific projection challenges such as occlusions, calibration drift, and viewpoint variations, including any observed error patterns.
    Revision: yes.
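
For concreteness, the promised audit is only a few lines on top of the `mean_iou` helper sketched earlier; the file names and class count here are hypothetical:

```python
import numpy as np
# mean_iou is the helper sketched in the metric example above.

pseudo = np.load("pseudo_labels.npy")   # projected VFM labels, one id per point
manual = np.load("manual_labels.npy")   # ground-truth ids on the same points
miou, per_class = mean_iou(pseudo, manual, num_classes=13)  # class count assumed

print(f"pseudo-vs-manual mIoU: {miou:.3f}")
for c, iou in enumerate(per_class):
    print(f"  class {c:2d}: IoU {iou:.3f}")
```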

Circularity Check

0 steps flagged

No circularity: empirical feasibility claim rests on independent datasets and new annotations

Full rationale

The manuscript reports mIoU numbers obtained by training a lidar segmentation model on 2D-to-3D projected VFM pseudo-labels and then measuring performance on both those pseudo-labels and a newly supplied small set of manual lidar annotations drawn from indoor SLAM sequences. No equations, fitted parameters, or self-citations are presented that would make any reported performance number equivalent to its own input by construction. The central feasibility statement is therefore an ordinary empirical observation that can be falsified by re-running the pipeline on the released annotations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters or axioms; the approach implicitly relies on standard supervised distillation assumptions and accurate 2D-3D calibration.

pith-pipeline@v0.9.0 · 5522 in / 1178 out tokens · 36344 ms · 2026-05-10T04:45:51.252233+00:00 · methodology

