Resolving Long-Tail Ambiguity in Unsupervised 3D Point Cloud Segmentation with Language Priors

Chun Li; Feng Xiao; Hongbin Xu; Ming Li; Qiuxia Wu; Siqi Wei; Tian Lan

arxiv: 2605.20737 · v1 · pith:5KIPWHEYnew · submitted 2026-05-20 · 💻 cs.CV

Siqi Wei, Hongbin Xu, Feng Xiao, Tian Lan, Chun Li, Ming Li, Qiuxia Wu

Pith reviewed 2026-05-21 04:47 UTC · model grok-4.3

classification 💻 cs.CV

keywords unsupervised 3D segmentation · point cloud · long-tail ambiguity · language priors · hierarchical clustering · contrastive alignment · semantic priors · minority class representation

The pith

Language priors drawn from language models resolve long-tail ambiguity by guiding hierarchical clustering in unsupervised 3D point cloud segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Purely visual clustering in unsupervised 3D segmentation absorbs features from rare classes into dominant groups, producing imbalanced scene labels. LangTail counters this by extracting balanced category knowledge from language models as entity-level semantic priors. These priors are aligned at multiple levels with the visual features inside a hierarchical clustering process. The alignment steers the formation of distinct groups for minor classes rather than letting them merge away. This yields more accurate unsupervised segmentation across indoor and outdoor 3D scenes without any manual labels.

Core claim

LangTail constructs an entity-level semantic prior from language models that captures balanced and fine-grained world knowledge across categories. These priors are injected into a hierarchical clustering framework via contrastive alignment. This guides multi-granularity semantic structure formation and prevents minor classes from being absorbed by dominant clusters, yielding more discriminative representations for underrepresented categories.
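To make the contrastive alignment concrete, here is a minimal sketch of a class-weighted InfoNCE loss between region features and text embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function name `entity_alignment_loss`, the `pseudo_labels` argument (which text embedding counts as the positive for each region), the per-entity `class_weights`, and the temperature default are all hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def entity_alignment_loss(features, pseudo_labels, text_bank, class_weights, tau=0.07):
    """Class-weighted InfoNCE between region features and text embeddings.

    features:      M region feature vectors (hypothetical visual features)
    pseudo_labels: index of the positive text embedding for each region (assumed given)
    text_bank:     one text embedding per entity
    class_weights: per-entity weight, e.g. larger for rare entities (assumption)
    """
    bank = [normalize(b) for b in text_bank]
    total = 0.0
    for feat, label in zip(features, pseudo_labels):
        f = normalize(feat)
        logits = [dot(f, b) / tau for b in bank]
        # Stable log-sum-exp over the positive and all negatives.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -class_weights[label] * (logits[label] - log_denom)
    return total / len(features)

# Toy example: two entities, two regions, the second entity weighted more heavily.
text_bank = [[1.0, 0.0], [0.0, 1.0]]
features = [[0.9, 0.1], [0.2, 0.8]]
aligned = entity_alignment_loss(features, [0, 1], text_bank, [1.0, 2.0])
swapped = entity_alignment_loss(features, [1, 0], text_bank, [1.0, 2.0])
print(aligned < swapped)
```

The loss is low when each region sits close to its assigned text embedding and high when the assignment is swapped. Because the weight multiplies each region's term, a misaligned rare entity costs more than a misaligned frequent one, which is one plausible way a balanced prior could push back against dominant clusters.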

What carries the argument

Entity-level semantic prior extracted from language models and injected via contrastive alignment into a hierarchical clustering framework to enforce multi-level associations with visually underrepresented classes.

If this is right

Multi-level associations between language priors and visual features compensate for the biased attention of clustering toward dominant classes.
Hierarchical clustering produces more discriminative representations for underrepresented categories instead of absorbing them.
The approach yields measurable gains in mean intersection-over-union on real-world 3D datasets such as ScanNet-v2, S3DIS, and nuScenes.
Unsupervised segmentation becomes viable for scenes that follow natural long-tail class distributions without requiring labels.
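Since the headline metric is mean intersection-over-union, a short sketch shows why an absorbed tail class hurts mIoU far more than it hurts overall accuracy. The confusion matrix below is a made-up toy, and the sketch assumes predicted clusters have already been matched to ground-truth classes; the class names are illustrative only.

```python
def per_class_iou(confusion):
    """confusion[g][p] = number of points with ground truth g predicted as p."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[g][c] for g in range(n)) - tp
        union = tp + fp + fn
        ious.append(tp / union if union else float("nan"))
    return ious

def mean_iou(confusion):
    # Skip classes that never occur in ground truth or prediction (NaN != NaN).
    ious = [v for v in per_class_iou(confusion) if v == v]
    return sum(ious) / len(ious)

# Toy scene: "wall" dominates; the tail class "lamp" is fully absorbed into "wall".
confusion = [
    [900, 50, 0],  # wall
    [20, 80, 0],   # chair
    [10, 0, 0],    # lamp: all 10 points predicted as wall
]
ious = per_class_iou(confusion)
accuracy = sum(confusion[c][c] for c in range(3)) / sum(map(sum, confusion))
print(ious, mean_iou(confusion), accuracy)
```

In this toy case overall accuracy stays above 0.92 while the lamp class has IoU 0, pulling mIoU below 0.5. Every class counts equally in the mean, so recovering even a few tail classes can move mIoU by several points.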

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same language-prior mechanism could be tested on other unsupervised tasks such as 2D image clustering to isolate the contribution of the hierarchical structure.
Controlled experiments that vary the granularity of the language priors would reveal the minimal level of semantic detail needed to protect minority classes.
Deployment on streaming 3D data from moving sensors would show whether the alignment remains stable when new rare objects appear over time.

Load-bearing premise

Language models hold balanced world knowledge that can be transferred to 3D visual features through contrastive alignment without creating new semantic mismatches or biases.

What would settle it

Apply the method to a controlled 3D dataset where language model categories are deliberately shifted away from the visual object distributions and check whether minority-class separation still improves or reverts to the baseline absorption pattern.

Figures

Figures reproduced from arXiv: 2605.20737 by Chun Li, Feng Xiao, Hongbin Xu, Ming Li, Qiuxia Wu, Siqi Wei, Tian Lan.

**Figure 1.** Analysis of the long-tail ambiguity problem in unsupervised 3D semantic segmentation. (a) visualizes the per-category mIoU results (in blue) and the per-category point counts (in orange). The long-tail ambiguity problem can be observed, where near-zero predictions occur at the tail class. (b) visualizes the confusion matrix between the predicted results and the ground truth under each category. It can be ob…

**Figure 2.** Overview of our LangTail. The Entity Branch injects balanced language priors into the …

**Figure 3.** Qualitative comparison on the ScanNet (first row) and S3DIS (second row) datasets. The …

**Figure 4.** Qualitative comparison of point cloud segmentation on the nuScenes dataset.

Original abstract

Existing approaches for unsupervised 3D point cloud segmentation predominantly rely on a purely visual similarity-based learning-by-clustering paradigm, which suffers from a fundamental limitation: long-tail ambiguity. In such a paradigm, features of minor classes are consistently absorbed by dominant clusters, leading to severely imbalanced predictions. To address this issue, we propose LangTail, a language-guided hierarchical learning framework that leverages the balanced world knowledge encoded in language models to mitigate long-tail ambiguity in unsupervised 3D segmentation. The key idea is to establish multi-level associations between language-derived semantic priors and visually underrepresented minor classes, thereby compensating for the biased attention of purely visual clustering toward dominant classes. Specifically, LangTail first constructs an entity-level semantic prior from language models, capturing balanced and fine-grained world knowledge across categories. These priors are injected into a hierarchical clustering framework via contrastive alignment. This guides multi-granularity semantic structure formation and prevents minor classes from being absorbed by dominant clusters, yielding more discriminative representations for underrepresented categories. Extensive experiments on ScanNet-v2, S3DIS, and nuScenes demonstrate that LangTail consistently outperforms existing methods by significant margins, i.e., +13.5, +12.9, and +8.9 mIoU, respectively. These results demonstrate the effectiveness of language priors in improving the representation of minority classes in 3D point clouds. The code will be released at: https://github.com/Whisky0129/langtail_official.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LangTail injects entity-level language priors via contrastive alignment into hierarchical clustering to reduce minor-class absorption in unsupervised 3D segmentation, with reported mIoU gains that need baseline and ablation checks to confirm the source.

LangTail uses language model priors to counter the tendency of pure visual clustering to swallow minor classes into dominant ones in 3D point clouds. The approach builds entity-level semantic priors from LMs and feeds them into a hierarchical contrastive setup so that underrepresented categories get better separation at multiple granularities. This is the concrete step beyond standard visual-only methods cited in the abstract.

The reported lifts of +13.5 mIoU on ScanNet-v2, +12.9 on S3DIS, and +8.9 on nuScenes are the main empirical claim, and they target a limitation that shows up in robotics and scene understanding pipelines. If the full experiments include clean ablations that isolate the prior contribution and fair re-implementations of the baselines, the result would be a practical increment for people working on label-free 3D segmentation.

The mechanism itself is straightforward: the priors supply the balanced category knowledge that visual features alone lack, and the multi-level alignment is meant to keep that signal from being drowned out during clustering. That framing is reasonable given how skewed the class distributions are in these datasets.

The softer parts sit in the experimental section. The abstract gives headline numbers but leaves baseline details, statistical tests, and exact loss weighting between clustering and alignment implicit. Without those, it is hard to rule out that some of the lift comes from implementation choices rather than the language component. There is also the open question of whether the LM priors carry their own frequency skews or 3D-domain mismatches that the contrastive step does not fully cancel. The paper promises code release, which would help settle these points.

This work is aimed at researchers who already follow unsupervised 3D segmentation and want a concrete way to improve tail-class recall without adding labels. A reader who cares about vision-language transfer in clustering settings will find the specific association mechanism worth examining. It is solid enough on the problem statement and the proposed fix to merit peer review, even if the validation needs tightening in revision.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes LangTail, a language-guided hierarchical learning framework for unsupervised 3D point cloud segmentation. It extracts entity-level semantic priors from language models and injects them via contrastive alignment into a hierarchical clustering pipeline to mitigate long-tail ambiguity, where features of minor classes are absorbed by dominant visual clusters. The work reports large mIoU gains of +13.5 on ScanNet-v2, +12.9 on S3DIS, and +8.9 on nuScenes, attributing the improvements to multi-level associations between language priors and visually underrepresented classes.

Significance. If the central claims hold after verification, the result would be significant for unsupervised 3D segmentation by showing that external language priors can compensate for visual long-tail biases in clustering. The planned code release supports reproducibility and allows direct inspection of the alignment mechanism. The approach is a clear attempt to move beyond purely visual similarity-based methods.

major comments (2)

[§3] §3 (Method), contrastive alignment description: no equation or loss term is given for how the language-prior alignment objective is balanced against the hierarchical clustering objective, so it is impossible to verify whether the mechanism corrects visual dominance or simply adds an external bias source as feared in the stress-test note.
[Experimental results] Experimental results section and Table 1: the headline mIoU gains are reported without any description of baseline re-implementations, ablation on the language-prior component, data-split details, or statistical significance across runs; this directly undermines attribution of the +13.5/+12.9/+8.9 improvements to the proposed multi-level association rather than implementation differences.

minor comments (1)

[Abstract] The abstract and method overview use the term 'multi-granularity semantic structure' without defining the specific granularity levels or how they map to the entity-level prior.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of major revision. We address each major comment below and will update the manuscript to improve clarity and experimental rigor.

Referee: [§3] §3 (Method), contrastive alignment description: no equation or loss term is given for how the language-prior alignment objective is balanced against the hierarchical clustering objective, so it is impossible to verify whether the mechanism corrects visual dominance or simply adds an external bias source as feared in the stress-test note.

Authors: We agree that an explicit formulation is needed. The current manuscript describes the injection of language priors via contrastive alignment into the hierarchical clustering pipeline but does not provide the combined objective. In the revision we will add a formal loss equation in §3 that defines the total objective as a weighted sum of the hierarchical clustering loss and the language-prior contrastive alignment loss, including the balancing hyperparameter. This will allow readers to verify the relative influence of each term.

revision: yes
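A weighted sum of the kind this response describes might take the following form. This is only an illustrative sketch: the symbols $\mathcal{L}_{\text{cluster}}$, $\mathcal{L}_{\text{entity}}$, and $\lambda$ are assumed notation and do not come from the manuscript.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{cluster}}
  + \lambda \, \mathcal{L}_{\text{entity}},
  \qquad \lambda \ge 0
```

Under this assumed form, setting $\lambda = 0$ recovers purely visual clustering, so a sweep over $\lambda$ would be one direct way to examine the relative influence of the language-prior term that the referee asks about.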
Referee: [Experimental results] Experimental results section and Table 1: the headline mIoU gains are reported without any description of baseline re-implementations, ablation on the language-prior component, data-split details, or statistical significance across runs; this directly undermines attribution of the +13.5/+12.9/+8.9 improvements to the proposed multi-level association rather than implementation differences.

Authors: We acknowledge the need for greater transparency. The revised Experimental results section will include: (i) details on baseline re-implementations and any adaptations made, (ii) a new ablation study that isolates the language-prior component, (iii) explicit train/val/test split information for ScanNet-v2, S3DIS, and nuScenes, and (iv) mean mIoU and standard deviation computed over multiple independent runs with different random seeds. These additions will support the attribution of gains to the proposed multi-level language-visual associations.

revision: yes
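Item (iv) in this response, reporting mean and standard deviation over seeds, can be sketched in a few lines. The helper names `summarize_runs` and `gain_exceeds_noise`, the threshold `k`, and all numbers below are hypothetical placeholders for illustration; none of them are results from the paper.

```python
import statistics

def summarize_runs(miou_per_seed):
    """Return (mean, sample standard deviation) of mIoU over independent runs."""
    mean = statistics.mean(miou_per_seed)
    std = statistics.stdev(miou_per_seed) if len(miou_per_seed) > 1 else 0.0
    return mean, std

def gain_exceeds_noise(baseline_runs, method_runs, k=2.0):
    """Crude check: is the mean gain larger than k times the combined spread?"""
    b_mean, b_std = summarize_runs(baseline_runs)
    m_mean, m_std = summarize_runs(method_runs)
    return (m_mean - b_mean) > k * (b_std ** 2 + m_std ** 2) ** 0.5

# Placeholder mIoU values for three seeds each (illustration only).
baseline = [30.0, 31.0, 29.0]
method = [40.0, 42.0, 41.0]
print(summarize_runs(baseline), summarize_runs(method))
print(gain_exceeds_noise(baseline, method))
```

The spread check here is a rough heuristic rather than a formal significance test; with enough seeds, a standard two-sample test would be the more defensible choice.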

Circularity Check

0 steps flagged

No significant circularity; method relies on external pre-trained LMs and visual extractors

The derivation chain begins with external language models providing entity-level semantic priors, which are then contrastively aligned into a hierarchical clustering pipeline on 3D features. Performance is evaluated via mIoU on held-out benchmarks (ScanNet-v2, S3DIS, nuScenes) rather than any quantity fitted from the same data and re-labeled as a prediction. No equations reduce the reported gains to self-referential fits, and no load-bearing uniqueness theorem or ansatz is imported from the authors' own prior work. The central claim therefore remains independent of the target datasets' labels or statistics.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the domain assumption that language models supply unbiased category knowledge and on standard contrastive-learning assumptions; no free parameters are explicitly named in the abstract, and no new physical or mathematical entities are postulated.

axioms (2)

domain assumption: Language models encode balanced and fine-grained world knowledge across categories that can compensate for visual clustering bias.
Invoked to justify injection of entity-level semantic priors into the hierarchical clustering process.

domain assumption: Contrastive alignment between language priors and visual features produces more discriminative representations for underrepresented classes.
Central to the claim that minor classes are prevented from absorption by dominant clusters.

pith-pipeline@v0.9.0 · 5815 in / 1512 out tokens · 52380 ms · 2026-05-21T04:47:32.628280+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

LangTail first constructs an entity-level semantic prior from language models... injected into a hierarchical clustering framework via contrastive alignment... $\mathcal{L}_{\text{entity}} = -w_{c_i} \frac{1}{M} \sum \log \frac{\exp(f_\theta(P_i) \cdot b^{+} / \tau)}{\exp(\dots) + \sum \exp(\dots)}$

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Ward-based hierarchical clustering... $K = \{120, 80, K_{\text{prim}}\}$... dual-branch (Local/Global) with spectral graph Fourier basis
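The passage above mentions Ward-based hierarchical clustering cut at several cluster counts. As a rough illustration of that idea, the sketch below runs a naive Ward agglomeration in pure Python and records the labels at each requested level. The function `ward_levels`, the toy points, and the levels `[3, 2]` are hypothetical; the sketch assumes a non-empty `levels` list, and it does not reproduce the paper's pipeline, features, or its cluster counts.

```python
def ward_levels(points, levels):
    """Greedy Ward agglomeration; returns {k: labels} for each k in levels.

    Each merge picks the pair of clusters whose union increases the
    within-cluster sum of squares the least:
        cost(A, B) = |A||B| / (|A| + |B|) * ||mean(A) - mean(B)||^2
    """
    # cluster id -> (member indices, centroid)
    clusters = {i: ([i], list(p)) for i, p in enumerate(points)}
    wanted = set(levels)
    out = {}

    def snapshot():
        labels = [0] * len(points)
        for new_id, (members, _) in enumerate(clusters.values()):
            for m in members:
                labels[m] = new_id
        return labels

    if len(clusters) in wanted:
        out[len(clusters)] = snapshot()
    while len(clusters) > 1 and len(clusters) > min(wanted):
        best = None
        ids = list(clusters)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
                cost = len(ma) * len(mb) / (len(ma) + len(mb)) * d2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ma, ca), (mb, cb) = clusters.pop(a), clusters.pop(b)
        n = len(ma) + len(mb)
        centroid = [(len(ma) * x + len(mb) * y) / n for x, y in zip(ca, cb)]
        clusters[a] = (ma + mb, centroid)
        if len(clusters) in wanted:
            out[len(clusters)] = snapshot()
    return out

# Toy data: two tight pairs plus one isolated point standing in for a rare class.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [5.0, 9.0]]
levels = ward_levels(points, levels=[3, 2])
print(levels[3], levels[2])
```

In this toy run the isolated point keeps its own cluster at the finer level and is merged into its nearest neighbour group at the coarser level. That is the absorption effect in miniature, and it suggests why supervising the finer levels could matter for minority classes.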

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 7 internal anchors

[1]

Driving on point clouds: Motion planning, trajectory optimization, and terrain assessment in generic nonplanar environments,

P. Krüsi, P. Furgale, M. Bosse, and R. Siegwart, “Driving on point clouds: Motion planning, trajectory optimization, and terrain assessment in generic nonplanar environments,”Journal of Field Robotics, vol. 34, no. 5, pp. 940–984, 2017

work page 2017
[2]

Deep learning for image and point cloud fusion in autonomous driving: A review,

Y . Cui, R. Chen, W. Chu, L. Chen, D. Tian, Y . Li, and D. Cao, “Deep learning for image and point cloud fusion in autonomous driving: A review,”IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 2, pp. 722–739, 2021

work page 2021
[3]

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu, “3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations,”arXiv preprint arXiv:2403.03954, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

arXiv preprint arXiv:2601.03782 , year=

W. Huang, Y .-W. Chao, A. Mousavian, M.-Y . Liu, D. Fox, K. Mo, and L. Fei-Fei, “Pointworld: Scaling 3d world models for in-the-wild robotic manipulation,”arXiv preprint arXiv:2601.03782, 2026

work page arXiv 2026
[5]

Randla-net: Efficient semantic segmentation of large-scale point clouds,

Q. Hu, B. Yang, L. Xie, S. Rosa, Y . Guo, Z. Wang, N. Trigoni, and A. Markham, “Randla-net: Efficient semantic segmentation of large-scale point clouds,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 108–11 117

work page 2020
[6]

Point transformer v3: Simpler faster stronger,

X. Wu, L. Jiang, P.-S. Wang, Z. Liu, X. Liu, Y . Qiao, W. Ouyang, T. He, and H. Zhao, “Point transformer v3: Simpler faster stronger,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 4840–4851

work page 2024
[7]

Oneformer3d: One transformer for unified point cloud segmentation,

M. Kolodiazhnyi, A. V orontsova, A. Konushin, and D. Rukhovich, “Oneformer3d: One transformer for unified point cloud segmentation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 20 943–20 953

work page 2024
[8]

Semantickitti: A dataset for semantic scene understanding of lidar sequences,

J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, “Semantickitti: A dataset for semantic scene understanding of lidar sequences,” inProceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9297–9307

work page 2019
[9]

Towards semantic segmentation of urban-scale 3d point clouds: A dataset, benchmarks and challenges,

Q. Hu, B. Yang, S. Khalid, W. Xiao, N. Trigoni, and A. Markham, “Towards semantic segmentation of urban-scale 3d point clouds: A dataset, benchmarks and challenges,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 4977–4987

work page 2021
[10]

Growsp: Unsupervised semantic segmentation of 3d point clouds,

Z. Zhang, B. Yang, B. Wang, and B. Li, “Growsp: Unsupervised semantic segmentation of 3d point clouds,” inProceedings of the IEEE/CVF Conference on computer vision and pattern recognition, 2023, pp. 17 619–17 629

work page 2023
[11]

Pointdc: Unsupervised semantic segmentation of 3d point clouds via cross-modal distillation and super-voxel clustering,

Z. Chen, H. Xu, W. Chen, Z. Zhou, H. Xiao, B. Sun, X. Xieet al., “Pointdc: Unsupervised semantic segmentation of 3d point clouds via cross-modal distillation and super-voxel clustering,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 14 290–14 299

work page 2023
[12]

Logosp: Local-global grouping of superpoints for unsupervised semantic segmentation of 3d point clouds,

Z. Zhang, W. Dai, H. Wen, and B. Yang, “Logosp: Local-global grouping of superpoints for unsupervised semantic segmentation of 3d point clouds,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 1374–1384

work page 2025
[13]

Seed1.5-VL Technical Report

D. Guo, F. Wu, F. Zhu, F. Leng, G. Shi, H. Chen, H. Fan, J. Wang, J. Jiang, J. Wanget al., “Seed1. 5-vl technical report,”arXiv preprint arXiv:2505.07062, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Seed1.8 Model Card: Towards Generalized Real-World Agency

B. Seed, “Seed1. 8 model card: Towards generalized real-world agency,”arXiv preprint arXiv:2603.20633, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

Qwen3 Technical Report

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lvet al., “Qwen3 technical report,”arXiv preprint arXiv:2505.09388, 2025. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Qwen3-VL Technical Report

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Geet al., “Qwen3-vl technical report,”arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PmLR, 2021, pp. 8748–8763

work page 2021
[18]

Picie: Unsupervised semantic segmentation using invariance and equivariance in clustering,

J. H. Cho, U. Mall, K. Bala, and B. Hariharan, “Picie: Unsupervised semantic segmentation using invariance and equivariance in clustering,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 16 794–16 804

work page 2021
[19]

Invariant information clustering for unsupervised image classifica- tion and segmentation,

X. Ji, J. F. Henriques, and A. Vedaldi, “Invariant information clustering for unsupervised image classifica- tion and segmentation,” inProceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9865–9874

work page 2019
[20]

Deep clustering for unsupervised learning of visual features,

M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” inProceedings of the European conference on computer vision (ECCV), 2018, pp. 132–149

work page 2018
[21]

Pointcontrast: Unsupervised pre-training for 3d point cloud understanding,

S. Xie, J. Gu, D. Guo, C. R. Qi, L. Guibas, and O. Litany, “Pointcontrast: Unsupervised pre-training for 3d point cloud understanding,” inEuropean conference on computer vision. Springer, 2020, pp. 574–591

work page 2020
[22]

Exploring data-efficient 3d scene understanding with contrastive scene contexts,

J. Hou, B. Graham, M. Nießner, and S. Xie, “Exploring data-efficient 3d scene understanding with contrastive scene contexts,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 15 587–15 597

work page 2021
[23]

U3ds3: Unsupervised 3d semantic scene segmentation,

J. Liu, Z. Yu, T. P. Breckon, and H. P. Shum, “U3ds3: Unsupervised 3d semantic scene segmentation,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 3759–3768

work page 2024
[24]

P-slcr: Unsupervised point cloud semantic segmentation via prototypes structure learning and consistent reasoning,

L. Zhan, J. Jie, T. Zhou, Y . Du, Y . Zheng, and X. Duan, “P-slcr: Unsupervised point cloud semantic segmentation via prototypes structure learning and consistent reasoning,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 15, 2026, pp. 12 349–12 357

work page 2026
[25]

Growsp++: Growing superpoints and primitives for unsupervised 3d semantic segmentation,

Z. Zhang, W. Dai, B. Wang, B. Li, and B. Yang, “Growsp++: Growing superpoints and primitives for unsupervised 3d semantic segmentation,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

work page 2026
[26]

Scan: Learning to classify images without labels,

W. Van Gansbeke, S. Vandenhende, S. Georgoulis, M. Proesmans, and L. Van Gool, “Scan: Learning to classify images without labels,” inEuropean conference on computer vision. Springer, 2020, pp. 268–285

work page 2020
[27]

Unsupervised semantic segmenta- tion by distilling feature correspondences,

M. Hamilton, Z. Zhang, B. Hariharan, N. Snavely, and W. T. Freeman, “Unsupervised semantic segmenta- tion by distilling feature correspondences,”arXiv preprint arXiv:2203.08414, 2022

work page arXiv 2022
[28]

Controlrm: Fast and controllable 3d generation via large reconstruction model,

H. Xu, W. Chen, Z. Zhou, F. Xiao, B. Sun, M. Z. Shou, and W. Kang, “Controlrm: Fast and controllable 3d generation via large reconstruction model,”arXiv preprint arXiv:2410.09592, 2024

work page arXiv 2024
[29]

Cyc3d: Fine-grained controllable 3d generation via cycle consistency regularization,

H. Xu, C. Yu, F. Xiao, J. Xing, H. Ci, W. Chen, F. Wang, and M. Li, “Cyc3d: Fine-grained controllable 3d generation via cycle consistency regularization,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 21, 2026, pp. 17 895–17 903

work page 2026
[30]

Self-supervised multi-view stereo via effective co- segmentation and data-augmentation,

H. Xu, Z. Zhou, Y . Qiao, W. Kang, and Q. Wu, “Self-supervised multi-view stereo via effective co- segmentation and data-augmentation,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 3030–3038

work page 2021
[31]

Digging into uncertainty in self-supervised multi-view stereo,

H. Xu, Z. Zhou, Y . Wang, W. Kang, B. Sun, H. Li, and Y . Qiao, “Digging into uncertainty in self-supervised multi-view stereo,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 6078–6087

work page 2021
[32]

Costformer: Cost transformer for cost aggregation in multi-view stereo,

W. Chen, H. Xu, Z. Zhou, Y . Liu, B. Sun, W. Kang, and X. Xie, “Costformer: Cost transformer for cost aggregation in multi-view stereo,”arXiv preprint arXiv:2305.10320, 2023

work page arXiv 2023
[33]

Semi-supervised deep multi-view stereo,

H. Xu, W. Chen, Y . Liu, Z. Zhou, H. Xiao, B. Sun, X. Xie, and W. Kang, “Semi-supervised deep multi-view stereo,” inProceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 4616–4625

work page 2023
[34]

Robustmvs: Single domain generalized deep multi-view stereo,

H. Xu, W. Chen, B. Sun, X. Xie, and W. Kang, “Robustmvs: Single domain generalized deep multi-view stereo,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 10, pp. 9181–9194, 2024. 11

work page 2024
[35]

Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning,

X. Zhu, R. Zhang, B. He, Z. Guo, Z. Zeng, Z. Qin, S. Zhang, and P. Gao, “Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 2639–2650

work page 2023
[36]

Openscene: 3d scene understanding with open vocabularies,

S. Peng, K. Genova, C. Jiang, A. Tagliasacchi, M. Pollefeys, T. Funkhouseret al., “Openscene: 3d scene understanding with open vocabularies,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 815–824

work page 2023
[37]

Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding,

L. Xue, M. Gao, C. Xing, R. Martín-Martín, J. Wu, C. Xiong, R. Xu, J. C. Niebles, and S. Savarese, “Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 1179– 1189

work page 2023
[38]

Openshape: Scaling up 3d shape representation towards open-world understanding,

M. Liu, R. Shi, K. Kuang, Y . Zhu, X. Li, S. Han, H. Cai, F. Porikli, and H. Su, “Openshape: Scaling up 3d shape representation towards open-world understanding,”Advances in neural information processing systems, vol. 36, pp. 44 860–44 879, 2023

work page 2023
[39]

Foundational models for 3d point clouds: A survey and outlook,

V . Thengane, X. Zhu, S. Bouzerdoum, S. L. Phung, and Y . Li, “Foundational models for 3d point clouds: A survey and outlook,”arXiv preprint arXiv:2501.18594, 2025

work page arXiv 2025
[40]

Regionplc: Regional point-language contrastive learning for open-world 3d scene understanding,

J. Yang, R. Ding, W. Deng, Z. Wang, and X. Qi, “Regionplc: Regional point-language contrastive learning for open-world 3d scene understanding,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 19 823–19 832

work page 2024
[41]

SAM 2: Segment Anything in Images and Videos

N. Ravi, V . Gabeur, Y .-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafsonet al., “Sam 2: Segment anything in images and videos,”arXiv preprint arXiv:2408.00714, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Scannet: Richly-annotated 3d reconstructions of indoor scenes,

A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5828–5839

work page 2017
[43]

Joint 2D-3D-Semantic Data for Indoor Scene Understanding

I. Armeni, S. Sax, A. R. Zamir, and S. Savarese, “Joint 2d-3d-semantic data for indoor scene understanding,” arXiv preprint arXiv:1702.01105, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[44]

nuscenes: A multimodal dataset for autonomous driving,

H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 621–11 631

work page 2020
[45]

Pointnet: Deep learning on point sets for 3d classification and segmentation,

C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 652–660

work page 2017
[46]

Pointnet++: Deep hierarchical feature learning on point sets in a metric space,

C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,”Advances in neural information processing systems, vol. 30, 2017

work page 2017
[47]

3d semantic segmentation with submanifold sparse convolutional networks,

B. Graham, M. Engelcke, and L. Van Der Maaten, “3d semantic segmentation with submanifold sparse convolutional networks,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 9224–9232

[48]

Dynamic graph cnn for learning on point clouds

Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph cnn for learning on point clouds,” ACM Transactions on Graphics (TOG), vol. 38, no. 5, pp. 1–12, 2019

[49]

Pointcnn: Convolution on x-transformed points

Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “Pointcnn: Convolution on x-transformed points,” Advances in neural information processing systems, vol. 31, 2018

APPENDIX

A Broader Impacts

Our work significantly advances unsupervised 3D semantic segmentation by enabling accurate, label-free discovery of complex scene structures directly from raw po...
