pith. machine review for the scientific record.

arxiv: 2604.17949 · v1 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

ZSG-IAD: A Multimodal Framework for Zero-Shot Grounded Industrial Anomaly Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot anomaly detection · multimodal anomaly detection · industrial inspection · vision-language grounding · anomaly segmentation · explainable AI · grounded detection · sensor fusion

The pith

ZSG-IAD detects industrial anomalies zero-shot by grounding language descriptions in multimodal sensor data to produce masks and explanatory reports.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a vision-language framework that processes RGB images, sensor images, and 3D point clouds to identify defects in factory settings without any task-specific training data. It targets the black-box limitation of existing detectors by outputting both pixel-level masks and structured reports that tie decisions to physical evidence. A language-guided two-hop process first selects relevant feature slots from the inputs and then refines them into detailed localizations through modulation. Additional rule-based optimization encourages consistent and coherent outputs across steps. Sympathetic readers would see value in systems that let manufacturers justify anomaly calls with traceable sensor evidence rather than opaque scores.

Core claim

ZSG-IAD is a multimodal vision-language framework for zero-shot grounded industrial anomaly detection. Given RGB images, sensor images, and 3D point clouds, it generates structured anomaly reports and pixel-level masks via a language-guided two-hop grounding module that first selects evidence-like latent slots for coarse spatial support and then modulates feature maps with channel-spatial gating and a lightweight decoder to produce fine masks. The pipeline is further stabilized by Executable-Rule GRPO, which enforces output structure, anomaly-region consistency, and reasoning coherence.

What carries the argument

Language-guided two-hop grounding module that uses anomaly-related sentences to select latent slots from multimodal features for coarse support, then applies channel-spatial gating to modulate those features into fine-grained anomaly masks.
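
A minimal sketch of how such a two-hop module could be wired together, assuming a PyTorch implementation; the class name TwoHopGrounding, the dimensions, the sigmoid gating form, and the decoder are illustrative assumptions, not the authors' code.

```python
# Hedged sketch of the language-guided two-hop grounding described above.
# All shapes, layer choices, and names are assumptions for illustration.
import torch
import torch.nn as nn

class TwoHopGrounding(nn.Module):
    def __init__(self, dim=256, num_slots=16):
        super().__init__()
        self.slot_proj = nn.Linear(dim, dim)     # project latent slots
        self.text_proj = nn.Linear(dim, dim)     # project anomaly-sentence embedding
        self.channel_gate = nn.Linear(dim, dim)  # hop 2: channel gating
        self.spatial_gate = nn.Conv2d(dim, 1, kernel_size=1)  # hop 2: spatial gating
        self.decoder = nn.Sequential(            # lightweight mask decoder
            nn.Conv2d(dim, dim // 2, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim // 2, 1, 1),
        )

    def forward(self, slots, text_emb, feat_map):
        # slots:    (B, K, D) evidence-like latent slots distilled from fused features
        # text_emb: (B, D)    embedding of the anomaly-related sentence
        # feat_map: (B, D, H, W) fused multimodal feature map
        # Hop 1: sentence-to-slot attention selects evidence and yields coarse support.
        attn = torch.einsum('bkd,bd->bk', self.slot_proj(slots), self.text_proj(text_emb))
        selected = torch.einsum('bk,bkd->bd', attn.softmax(dim=-1), slots)  # (B, D)

        # Hop 2: selected evidence modulates the feature map via channel-spatial gating.
        gated = feat_map * torch.sigmoid(self.channel_gate(selected))[:, :, None, None]
        gated = gated * torch.sigmoid(self.spatial_gate(gated))             # (B, 1, H, W) gate

        # Lightweight decoder produces the fine-grained anomaly mask.
        return torch.sigmoid(self.decoder(gated))                           # (B, 1, H, W)

# Example shapes only:
# mask = TwoHopGrounding()(torch.randn(2, 16, 256), torch.randn(2, 256), torch.randn(2, 256, 64, 64))
```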

If this is right

  • Zero-shot operation permits detection of previously unseen defect types across multiple benchmarks using only the provided multimodal inputs.
  • Structured reports and masks supply physically grounded explanations that increase transparency over prior black-box detectors.
  • Multimodal fusion of RGB, sensor, and 3D data improves both localization accuracy and explanatory coherence.
  • Executable-Rule GRPO yields outputs that maintain anomaly-region consistency and reasoning-conclusion alignment without extra labeled data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-hop grounding pattern could be tested on other sensor-rich domains such as autonomous driving or medical imaging where explanations are required.
  • Integration with robotic arms might allow the generated masks to trigger automated repair actions directly from the detected regions.
  • The method suggests a path toward reducing reliance on large labeled defect datasets in quality-control pipelines by shifting the burden to language-based feature selection.

Load-bearing premise

Anomaly-related language sentences can reliably select and modulate multimodal features into accurate pixel masks and coherent reports without any task-specific training or fine-tuning.

What would settle it

A result on an industrial anomaly benchmark in which the model produces masks with low overlap with ground-truth defect regions, or reports that contradict the visible evidence in the input data, would undermine the core claim.
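
A concrete way to run that check is to measure overlap between predicted anomaly maps and ground-truth masks; the sketch below uses a simple pixel IoU with an assumed binarization threshold, whereas the paper's own evaluation likely reports metrics such as AUROC or AUPRO.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray, thresh: float = 0.5) -> float:
    """Pixel-level IoU between a predicted anomaly map and a binary ground-truth mask."""
    pred_bin = pred >= thresh       # binarize the predicted anomaly map
    gt_bin = gt.astype(bool)
    union = np.logical_or(pred_bin, gt_bin).sum()
    if union == 0:
        return 1.0                  # no defect predicted and none present
    return float(np.logical_and(pred_bin, gt_bin).sum()) / float(union)

# Consistently near-zero mean IoU on defective samples across a benchmark
# would be the kind of result that settles the question against the claim.
```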

Figures

Figures reproduced from arXiv: 2604.17949 by Jiaxiang Song, Qiuhui Chen, Shuai Tan, Weimin Zhong.

Figure 1. From vague anomaly descriptions (generic VLM, GPT-4o [1]) to …
Figure 2. Overview of ZSG-IAD: frozen 2D/3D encoders with lightweight adapters, multimodal fusion, slot-based evidence tokens for structured reporting, and …
Figure 3. Qualitative examples of multimodal grounded reporting. Each example is arranged from left to right as: RGB, sensor, point cloud, coarse map, …
Figure 4. Qualitative visualization of language-guided two-hop grounding. Each …
Original abstract

Deep learning-based industrial anomaly detectors often behave as black boxes, making it hard to justify decisions with physically meaningful defect evidence. We propose ZSG-IAD, a multimodal vision-language framework for zero-shot grounded industrial anomaly detection. Given RGB images, sensor images, and 3D point clouds, ZSG-IAD generates structured anomaly reports and pixel-level anomaly masks. ZSG-IAD introduces a language-guided two-hop grounding module: (1) anomaly-related sentences select evidence-like latent slots distilled from multimodal features, yielding coarse spatial support; (2) selected slots modulate feature maps via channel-spatial gating and a lightweight decoder to produce fine-grained masks. To improve reliability, we further apply Executable-Rule GRPO with verifiable rewards to promote structured outputs, anomaly-region consistency, and reasoning-conclusion coherence. Experiments across multiple industrial anomaly benchmarks show strong zero-shot performance and more transparent, physically grounded explanations than prior methods. We will release code and annotations to support future research on trustworthy industrial anomaly detection systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes ZSG-IAD, a multimodal vision-language framework for zero-shot grounded industrial anomaly detection. It takes RGB images, sensor images, and 3D point clouds as input and outputs structured anomaly reports together with pixel-level masks. The core technical contribution is a language-guided two-hop grounding module: anomaly-related sentences first select evidence-like latent slots from multimodal features to produce coarse spatial support; these slots then modulate feature maps through channel-spatial gating and a lightweight decoder to yield fine-grained masks. An Executable-Rule GRPO stage with verifiable rewards is added to enforce output structure, anomaly-region consistency, and reasoning-conclusion coherence. The abstract states that experiments on multiple industrial anomaly benchmarks demonstrate strong zero-shot performance and more transparent, physically grounded explanations than prior methods.

Significance. If the zero-shot claims hold with rigorous quantitative support, the work would advance trustworthy industrial inspection by replacing black-box detectors with interpretable, multimodal systems that link decisions to physically meaningful evidence without task-specific fine-tuning. Releasing code and annotations would further strengthen its utility for the community.

major comments (3)
  1. [Abstract] The central claim of 'strong zero-shot performance' is asserted without any quantitative metrics, benchmark scores, ablation tables, or implementation details. This absence prevents evaluation of whether the two-hop grounding and GRPO components actually deliver the promised gains over baselines.
  2. [Method (language-guided two-hop grounding)] The claim that pre-trained vision-language models can reliably select anomaly-related sentences and modulate multimodal (RGB/sensor/3D) features into accurate masks rests on an untested assumption that natural-image alignments transfer to subtle, texture-specific, or geometry-dependent industrial defects. No evidence is supplied that this selection step produces usable coarse support without domain adaptation.
  3. [Method (Executable-Rule GRPO)] The use of verifiable rewards for structure and coherence is presented as preserving strict zero-shot operation, yet the reward computation itself may constitute a form of adaptation or optimization; the paper should clarify whether any task-specific data or tuning is involved.
minor comments (2)
  1. The promise to release code and annotations is noted positively and should be retained.
  2. Notation for multimodal feature distillation and slot selection could be made more explicit with a diagram or pseudocode to aid reproducibility.
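
As an editorial illustration of the pseudocode the second minor comment asks for, the distillation of fused multimodal tokens into evidence-like latent slots could follow a slot-attention-style update (reference [27] below); the class name SlotDistiller, the dimensions, and the iteration count are assumptions, not the authors' implementation.

```python
# Hedged sketch: distilling fused RGB / sensor / point-cloud tokens into latent slots.
import torch
import torch.nn as nn

class SlotDistiller(nn.Module):
    def __init__(self, dim=256, num_slots=16, iters=3):
        super().__init__()
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)  # recurrent slot update, as in slot attention
        self.iters = iters
        self.scale = dim ** -0.5

    def forward(self, tokens):
        # tokens: (B, N, D) concatenated multimodal feature tokens
        B, N, D = tokens.shape
        slots = self.slots_init.expand(B, -1, -1).contiguous()
        k, v = self.to_k(tokens), self.to_v(tokens)
        for _ in range(self.iters):
            q = self.to_q(slots)                                          # (B, K, D)
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)    # slots compete per token
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)         # normalize over tokens
            updates = attn @ v                                            # (B, K, D)
            slots = self.update(updates.reshape(-1, D), slots.reshape(-1, D)).view(B, -1, D)
        return slots  # (B, K, D) evidence-like latent slots
```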

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The central claim of 'strong zero-shot performance' is asserted without any quantitative metrics, benchmark scores, ablation tables, or implementation details. This absence prevents evaluation of whether the two-hop grounding and GRPO components actually deliver the promised gains over baselines.

    Authors: We agree that the abstract would benefit from quantitative support. In the revised manuscript we have updated the abstract to report key zero-shot metrics (average AUROC and AUPRO across the evaluated industrial benchmarks) and to note the contributions of the two-hop grounding and GRPO stages. Full tables, ablations, and implementation details remain in the main text. revision: yes

  2. Referee: [Method (language-guided two-hop grounding)] The claim that pre-trained vision-language models can reliably select anomaly-related sentences and modulate multimodal (RGB/sensor/3D) features into accurate masks rests on an untested assumption that natural-image alignments transfer to subtle, texture-specific, or geometry-dependent industrial defects. No evidence is supplied that this selection step produces usable coarse support without domain adaptation.

    Authors: The framework is designed to operate strictly zero-shot, using off-the-shelf pre-trained VLMs with no domain adaptation or fine-tuning on industrial data. While transfer from natural-image pre-training is an assumption, the paper's experiments on multiple industrial benchmarks show that the selected latent slots yield effective coarse support, as reflected in the final mask accuracy and qualitative results. We have added a short discussion in the method section on this design choice together with additional qualitative examples of the intermediate coarse support on industrial images. revision: partial

  3. Referee: [Method (Executable-Rule GRPO)] The use of verifiable rewards for structure and coherence is presented as preserving strict zero-shot operation, yet the reward computation itself may constitute a form of adaptation or optimization; the paper should clarify whether any task-specific data or tuning is involved.

    Authors: The Executable-Rule GRPO stage applies deterministic, rule-based verifiable rewards that evaluate output structure, anomaly-region consistency, and reasoning coherence directly from the generated text and masks. No task-specific data, labels, or parameter tuning are used; the rules are fixed and general. We have revised the method section to state this explicitly and to confirm that the procedure remains strictly zero-shot. revision: yes
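
To make this description concrete, a minimal sketch of deterministic, verifiable rule rewards of the kind the rebuttal describes is given below; the JSON report schema, the box-based region check, and the thresholds are illustrative assumptions rather than the paper's actual rules.

```python
import json
import numpy as np

def structure_reward(report_text: str, required=("anomaly_present", "defect_type", "evidence")) -> float:
    """1.0 if the report parses as JSON and contains the required fields, else 0.0."""
    try:
        report = json.loads(report_text)
    except json.JSONDecodeError:
        return 0.0
    return float(all(k in report for k in required))

def region_consistency_reward(report_box, mask: np.ndarray, thresh: float = 0.1) -> float:
    """Reward if the region cited in the report overlaps the predicted mask."""
    x0, y0, x1, y1 = report_box                       # box referenced in the report (assumed format)
    region = np.zeros_like(mask, dtype=bool)
    region[y0:y1, x0:x1] = True
    inter = np.logical_and(region, mask > 0.5).sum()
    union = np.logical_or(region, mask > 0.5).sum()
    iou = inter / union if union else 0.0
    return float(iou >= thresh)

def coherence_reward(report: dict) -> float:
    """Reasoning-conclusion coherence: an 'anomalous' verdict must cite at least one evidence item."""
    anomalous = bool(report.get("anomaly_present", False))
    has_evidence = len(report.get("evidence", [])) > 0
    return float((not anomalous) or has_evidence)

# A GRPO-style update would then weight sampled outputs by the sum of these
# deterministic rewards, with no labels or task-specific tuning involved.
```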

Circularity Check

0 steps flagged

No mathematical derivation chain present; framework is descriptive with experimental validation

Full rationale

The paper describes a multimodal vision-language framework for zero-shot grounded industrial anomaly detection, including a language-guided two-hop grounding module and Executable-Rule GRPO for structured outputs. No equations, derivations, or first-principles predictions appear in the provided text. Central claims rest on experimental results across industrial anomaly benchmarks rather than any self-referential reductions, fitted parameters renamed as predictions, or self-citation chains that collapse the argument. The approach is self-contained as a proposed architecture whose performance is evaluated externally, with no load-bearing step equivalent to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract contains no mathematical derivations, fitted parameters, or new postulated entities; the contribution is an architectural framework.

pith-pipeline@v0.9.0 · 5481 in / 1088 out tokens · 42446 ms · 2026-05-10T05:37:10.777563+00:00 · methodology


Reference graph

Works this paper leans on

40 extracted references · 11 canonical work pages · 6 internal anchors

  1. [1]

    GPT-4o System Card

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford et al., “GPT-4o system card,” arXiv preprint arXiv:2410.21276, 2024

  2. [2]

    Learning transferable visual models from natural language supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh et al., “Learning transferable visual models from natural language supervision,” in Proceedings of the 38th International Conference on Machine Learning (ICML), 2021, pp. 8748–8763

  3. [3]

    GPT-4 technical report

    OpenAI, “GPT-4 technical report,” 2023

  4. [4]

    MVTec AD – a comprehensive real-world dataset for unsupervised anomaly detection

    P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, “MVTec AD – a comprehensive real-world dataset for unsupervised anomaly detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019, pp. 9592–9600

  5. [5]

    CutPaste: Self-supervised learning for anomaly detection and localization

    C.-L. Li, K. Sohn, J. Yoon, and T. Pfister, “CutPaste: Self-supervised learning for anomaly detection and localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 9664–9674

  6. [6]

    Towards total recall in industrial anomaly detection

    K. Roth, L. Pemula, J. Zepeda, B. Schölkopf, T. Brox, and P. Gehler, “Towards total recall in industrial anomaly detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 14318–14328

  7. [7]

    Multimodal industrial anomaly detection via hybrid fusion

    Y. Wang, J. Peng, J. Zhang, R. Yi, Y. Wang, and C. Wang, “Multimodal industrial anomaly detection via hybrid fusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 8032–8041

  8. [8]

    Real-IAD D3: A real-world 2D/pseudo-3D/3D dataset for industrial anomaly detection

    W. Zhu, L. Wang, Z. Zhou, C. Wang et al., “Real-IAD D3: A real-world 2D/pseudo-3D/3D dataset for industrial anomaly detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 15214–15223

  9. [9]

    Multi-sensor object anomaly detection: Unifying appearance, geometry, and internal properties

    W. Li, B. Zheng, X. Xu, J. Gan, F. Lu, X. Li, N. Ni, Z. Tian, X. Huang, S. Gao, and Y. Wu, “Multi-sensor object anomaly detection: Unifying appearance, geometry, and internal properties,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 9984–9993

  10. [10]

    Enhancing transparency and trust in AI-powered manufacturing: A survey of explainable AI (XAI) applications in smart manufacturing in the era of Industry 4.0/5.0

    L. Nikiforidis, M. Kyrtsoglou, and A. Vafeiadis, “Enhancing transparency and trust in AI-powered manufacturing: A survey of explainable AI (XAI) applications in smart manufacturing in the era of Industry 4.0/5.0,” ICT Express, vol. 11, no. 1, pp. 135–148, 2025

  11. [11]

    Explainable AI for industrial fault diagnosis: A systematic review

    J. Cação, J. Santos, and M. Antunes, “Explainable AI for industrial fault diagnosis: A systematic review,” Journal of Industrial Information Integration, vol. 47, p. 100905, Sep. 2025

  12. [12]

    A semantic framework for condition monitoring in Industry 4.0 based on evolving knowledge bases

    F. Giustozzi, J. Saunier, and C. Zanni-Merk, “A semantic framework for condition monitoring in Industry 4.0 based on evolving knowledge bases,” Semantic Web, vol. 15, no. 3, pp. 583–611, 2024. [Online]. Available: https://doi.org/10.3233/SW-233481

  13. [13]

    Grounded language-image pre-training

    L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang, K.-W. Chang, and J. Gao, “Grounded language-image pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 10965–10975

  14. [14]

    Image segmentation using text and image prompts

    T. Lüddecke and A. Ecker, “Image segmentation using text and image prompts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 7086–7096

  15. [15]

    DenseCLIP: Language-guided dense prediction with context-aware prompting

    Y. Rao, W. Zhao, G. Chen, Y. Tang, Z. Zhu, G. Huang, J. Zhou, and J. Lu, “DenseCLIP: Language-guided dense prediction with context-aware prompting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 18082–18091

  16. [16]

    LAVT: Language-aware vision transformer for referring image segmentation

    Z. Yang, J. Wang, Y. Tang, K. Chen, H. Zhao, and P. H. Torr, “LAVT: Language-aware vision transformer for referring image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 18155–18165

  17. [17]

    Towards open set deep networks

    A. Bendale and T. E. Boult, “Towards open set deep networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  18. [18]

    Recent advances in open set recognition: A survey

    C. Geng, S.-J. Huang, and S. Chen, “Recent advances in open set recognition: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 10, pp. 3614–3631, 2021

  19. [19]

    WinCLIP: Zero-/few-shot anomaly classification and segmentation

    J. Jeong, Y. Zou, T. Kim, D. Zhang, A. Ravichandran, and O. Dabeer, “WinCLIP: Zero-/few-shot anomaly classification and segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 19606–19616

  20. [20]

    AA-CLIP: Enhancing zero-shot anomaly detection via anomaly-aware CLIP

    W. Ma, X. Zhang, Q. Yao, F. Tang, C. Wu, Y. Li, R. Yan, Z. Jiang, and S. Zhou, “AA-CLIP: Enhancing zero-shot anomaly detection via anomaly-aware CLIP,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 4744–4754

  21. [21]

    Learning to dispatch for job shop scheduling via deep reinforcement learning

    C. Zhang, W. Song, Z. Cao, J. Zhang, P. S. Tan, and C. Xu, “Learning to dispatch for job shop scheduling via deep reinforcement learning,” in Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020. [Online]. Available: https://proceedings.neurips.cc/paper/2020/hash/11958dfee29b6709f48a9ba0387a2431-Abstract.html

  22. [22]

    Delay-aware microservice coordination in mobile edge computing: A reinforcement learning approach

    S. Wang, Y. Guo, N. Zhang, P. Yang, A. Zhou, and X. Shen, “Delay-aware microservice coordination in mobile edge computing: A reinforcement learning approach,” IEEE Transactions on Mobile Computing, vol. 20, no. 3, pp. 939–951, 2021

  23. [23]

    Training language models to follow instructions with human feedback

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” in Advances in Neural Information Processing Systems (NeurIPS), 2022

  24. [24]

    Direct preference optimization: Your language model is secretly a reward model

    R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” in Advances in Neural Information Processing Systems (NeurIPS), 2023

  25. [25]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017

  26. [26]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo, “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.03300

  27. [27]

    Object-centric learning with slot attention

    F. Locatello, D. Weissenborn, T. Unterthiner, A. Mahendran, G. Heigold, J. Uszkoreit, A. Dosovitskiy, and T. Kipf, “Object-centric learning with slot attention,” Advances in Neural Information Processing Systems, vol. 33, pp. 11525–11538, 2020

  28. [28]

    DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

    P. He, J. Gao, and W. Chen, “DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing,” arXiv preprint arXiv:2111.09543, 2021

  29. [29]

    BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International Conference on Machine Learning. PMLR, 2022, pp. 12888–12900

  30. [30]

    The MVTec 3D-AD dataset for unsupervised 3D anomaly detection and localization

    P. Bergmann, X. Jin, D. Sattlegger, and C. Steger, “The MVTec 3D-AD dataset for unsupervised 3D anomaly detection and localization,” arXiv preprint arXiv:2112.09045, 2021

  31. [31]

    The Eyecandies dataset for unsupervised multimodal anomaly detection and localization

    L. Bonfiglioli, M. Toschi, D. Silvestri, N. Fioraio, and D. De Gregorio, “The Eyecandies dataset for unsupervised multimodal anomaly detection and localization,” in Proceedings of the 16th Asian Conference on Computer Vision (ACCV), 2022

  32. [32]

    LLaVA-OneVision: Easy Visual Task Transfer

    B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu, and C. Li, “LLaVA-OneVision: Easy visual task transfer,” arXiv preprint arXiv:2408.03326, 2024. [Online]. Available: https://arxiv.org/abs/2408.03326

  33. [33]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    P. Wang, S. Bai, S. Tan et al., “Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024. [Online]. Available: https://arxiv.org/abs/2409.12191

  34. [34]

    InternVL2: Better than the best—expanding performance boundaries of open-source multimodal models with the progressive scaling strategy

    OpenGVLab Team, “InternVL2: Better than the best—expanding performance boundaries of open-source multimodal models with the progressive scaling strategy,” Jul. 2024, accessed: 2025-12-24. [Online]. Available: https://internvl.github.io/blog/2024-07-02-InternVL-2.0/

  35. [36]

    Laurençon, L.

    [Online]. Available: https://arxiv.org/abs/2405.02246

  36. [37]

    FastFlow: Unsupervised anomaly detection and localization via 2D normalizing flows

    J. Yu, Y. Zheng, X. Wang, W. Li, Y. Wu, R. Zhao, and L. Wu, “FastFlow: Unsupervised anomaly detection and localization via 2D normalizing flows,” arXiv preprint arXiv:2111.07677, 2021

  37. [38]

    EfficientAD: Accurate visual anomaly detection at millisecond-level latencies

    K. Batzner, L. Heckler, and R. König, “EfficientAD: Accurate visual anomaly detection at millisecond-level latencies,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024

  38. [39]

    The Llama 3 Herd of Models

    A. G. et al., “The Llama 3 herd of models,” 2024. [Online]. Available: https://arxiv.org/abs/2407.21783

  39. [40]

    LoRA: Low-rank adaptation of large language models

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “LoRA: Low-rank adaptation of large language models,” ICLR, vol. 1, no. 2, p. 3, 2022

  40. [41]

    Point transformer

    H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun, “Point transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16259–16268