pith. machine review for the scientific record.

arxiv: 2604.17949 · v1 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

ZSG-IAD: A Multimodal Framework for Zero-Shot Grounded Industrial Anomaly Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot anomaly detection · multimodal anomaly detection · industrial inspection · vision-language grounding · anomaly segmentation · explainable AI · grounded detection · sensor fusion

The pith

ZSG-IAD detects industrial anomalies zero-shot by grounding language descriptions in multimodal sensor data to produce masks and explanatory reports.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a vision-language framework that processes RGB images, sensor images, and 3D point clouds to identify defects in factory settings without any task-specific training data. It targets the black-box limitation of existing detectors by outputting both pixel-level masks and structured reports that tie decisions to physical evidence. A language-guided two-hop process first selects relevant feature slots from the inputs and then refines them into detailed localizations through modulation. Additional rule-based optimization encourages consistent and coherent outputs across steps. Sympathetic readers would see value in systems that let manufacturers justify anomaly calls with traceable sensor evidence rather than opaque scores.

Core claim

ZSG-IAD is a multimodal vision-language framework for zero-shot grounded industrial anomaly detection. Given RGB images, sensor images, and 3D point clouds, it generates structured anomaly reports and pixel-level masks via a language-guided two-hop grounding module that first selects evidence-like latent slots for coarse spatial support and then modulates feature maps with channel-spatial gating and a lightweight decoder to produce fine masks. The pipeline is further stabilized by Executable-Rule GRPO, which enforces output structure, anomaly-region consistency, and reasoning coherence.

What carries the argument

Language-guided two-hop grounding module that uses anomaly-related sentences to select latent slots from multimodal features for coarse support, then applies channel-spatial gating to modulate those features into fine-grained anomaly masks.
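
A minimal sketch of how such a two-hop module could be wired together, assuming a PyTorch implementation; the class name TwoHopGrounding, the dimensions, the sigmoid gating form, and the decoder are illustrative assumptions, not the authors' code.

```python
# Hedged sketch of the language-guided two-hop grounding described above.
# All shapes, layer choices, and names are assumptions for illustration.
import torch
import torch.nn as nn

class TwoHopGrounding(nn.Module):
    def __init__(self, dim=256, num_slots=16):
        super().__init__()
        self.slot_proj = nn.Linear(dim, dim)     # project latent slots
        self.text_proj = nn.Linear(dim, dim)     # project anomaly-sentence embedding
        self.channel_gate = nn.Linear(dim, dim)  # hop 2: channel gating
        self.spatial_gate = nn.Conv2d(dim, 1, kernel_size=1)  # hop 2: spatial gating
        self.decoder = nn.Sequential(            # lightweight mask decoder
            nn.Conv2d(dim, dim // 2, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim // 2, 1, 1),
        )

    def forward(self, slots, text_emb, feat_map):
        # slots:    (B, K, D) evidence-like latent slots distilled from fused features
        # text_emb: (B, D)    embedding of the anomaly-related sentence
        # feat_map: (B, D, H, W) fused multimodal feature map
        # Hop 1: sentence-to-slot attention selects evidence and yields coarse support.
        attn = torch.einsum('bkd,bd->bk', self.slot_proj(slots), self.text_proj(text_emb))
        selected = torch.einsum('bk,bkd->bd', attn.softmax(dim=-1), slots)  # (B, D)

        # Hop 2: selected evidence modulates the feature map via channel-spatial gating.
        gated = feat_map * torch.sigmoid(self.channel_gate(selected))[:, :, None, None]
        gated = gated * torch.sigmoid(self.spatial_gate(gated))             # (B, 1, H, W) gate

        # Lightweight decoder produces the fine-grained anomaly mask.
        return torch.sigmoid(self.decoder(gated))                           # (B, 1, H, W)

# Example shapes only:
# mask = TwoHopGrounding()(torch.randn(2, 16, 256), torch.randn(2, 256), torch.randn(2, 256, 64, 64))
```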

If this is right

  • Zero-shot operation permits detection of previously unseen defect types across multiple benchmarks using only the provided multimodal inputs.
  • Structured reports and masks supply physically grounded explanations that increase transparency over prior black-box detectors.
  • Multimodal fusion of RGB, sensor, and 3D data improves both localization accuracy and explanatory coherence.
  • Executable-Rule GRPO yields outputs that maintain anomaly-region consistency and reasoning-conclusion alignment without extra labeled data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-hop grounding pattern could be tested on other sensor-rich domains such as autonomous driving or medical imaging where explanations are required.
  • Integration with robotic arms might allow the generated masks to trigger automated repair actions directly from the detected regions.
  • The method suggests a path toward reducing reliance on large labeled defect datasets in quality-control pipelines by shifting the burden to language-based feature selection.

Load-bearing premise

Anomaly-related language sentences can reliably select and modulate multimodal features into accurate pixel masks and coherent reports without any task-specific training or fine-tuning.

What would settle it

A result on an industrial anomaly benchmark in which the model produces masks with low overlap with ground-truth defect regions, or reports that contradict the visible evidence in the input data, would undermine the core claim.
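
A concrete way to run that check is to measure overlap between predicted anomaly maps and ground-truth masks; the sketch below uses a simple pixel IoU with an assumed binarization threshold, whereas the paper's own evaluation likely reports metrics such as AUROC or AUPRO.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray, thresh: float = 0.5) -> float:
    """Pixel-level IoU between a predicted anomaly map and a binary ground-truth mask."""
    pred_bin = pred >= thresh       # binarize the predicted anomaly map
    gt_bin = gt.astype(bool)
    union = np.logical_or(pred_bin, gt_bin).sum()
    if union == 0:
        return 1.0                  # no defect predicted and none present
    return float(np.logical_and(pred_bin, gt_bin).sum()) / float(union)

# Consistently near-zero mean IoU on defective samples across a benchmark
# would be the kind of result that settles the question against the claim.
```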

Figures

Figures reproduced from arXiv: 2604.17949 by Jiaxiang Song, Qiuhui Chen, Shuai Tan, Weimin Zhong.

Figure 1. From vague anomaly descriptions (generic VLM, GPT-4o [1]) to …
Figure 2. Overview of ZSG-IAD: frozen 2D/3D encoders with lightweight adapters, multimodal fusion, slot-based evidence tokens for structured reporting, and …
Figure 3. Qualitative examples of multimodal grounded reporting. Each example is arranged from left to right as: RGB, sensor, point cloud, coarse map, …
Figure 4. Qualitative visualization of language-guided two-hop grounding. Each …
Original abstract

Deep learning-based industrial anomaly detectors often behave as black boxes, making it hard to justify decisions with physically meaningful defect evidence. We propose ZSG-IAD, a multimodal vision-language framework for zero-shot grounded industrial anomaly detection. Given RGB images, sensor images, and 3D point clouds, ZSG-IAD generates structured anomaly reports and pixel-level anomaly masks. ZSG-IAD introduces a language-guided two-hop grounding module: (1) anomaly-related sentences select evidence-like latent slots distilled from multimodal features, yielding coarse spatial support; (2) selected slots modulate feature maps via channel-spatial gating and a lightweight decoder to produce fine-grained masks. To improve reliability, we further apply Executable-Rule GRPO with verifiable rewards to promote structured outputs, anomaly-region consistency, and reasoning-conclusion coherence. Experiments across multiple industrial anomaly benchmarks show strong zero-shot performance and more transparent, physically grounded explanations than prior methods. We will release code and annotations to support future research on trustworthy industrial anomaly detection systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes ZSG-IAD, a multimodal vision-language framework for zero-shot grounded industrial anomaly detection. It takes RGB images, sensor images, and 3D point clouds as input and outputs structured anomaly reports together with pixel-level masks. The core technical contribution is a language-guided two-hop grounding module: anomaly-related sentences first select evidence-like latent slots from multimodal features to produce coarse spatial support; these slots then modulate feature maps through channel-spatial gating and a lightweight decoder to yield fine-grained masks. An Executable-Rule GRPO stage with verifiable rewards is added to enforce output structure, anomaly-region consistency, and reasoning-conclusion coherence. The abstract states that experiments on multiple industrial anomaly benchmarks demonstrate strong zero-shot performance and more transparent, physically grounded explanations than prior methods.

Significance. If the zero-shot claims hold with rigorous quantitative support, the work would advance trustworthy industrial inspection by replacing black-box detectors with interpretable, multimodal systems that link decisions to physically meaningful evidence without task-specific fine-tuning. Releasing code and annotations would further strengthen its utility for the community.

major comments (3)
  1. [Abstract] The central claim of 'strong zero-shot performance' is asserted without any quantitative metrics, benchmark scores, ablation tables, or implementation details. This absence prevents evaluation of whether the two-hop grounding and GRPO components actually deliver the promised gains over baselines.
  2. [Method (language-guided two-hop grounding)] The claim that pre-trained vision-language models can reliably select anomaly-related sentences and modulate multimodal (RGB/sensor/3D) features into accurate masks rests on an untested assumption that natural-image alignments transfer to subtle, texture-specific, or geometry-dependent industrial defects. No evidence is supplied that this selection step produces usable coarse support without domain adaptation.
  3. [Method (Executable-Rule GRPO)] The use of verifiable rewards for structure and coherence is presented as preserving strict zero-shot operation, yet the reward computation itself may constitute a form of adaptation or optimization; the paper should clarify whether any task-specific data or tuning is involved.
minor comments (2)
  1. The promise to release code and annotations is noted positively and should be retained.
  2. Notation for multimodal feature distillation and slot selection could be made more explicit with a diagram or pseudocode to aid reproducibility.
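
As an editorial illustration of the pseudocode the second minor comment asks for, the distillation of fused multimodal tokens into evidence-like latent slots could follow a slot-attention-style update (reference [27] below); the class name SlotDistiller, the dimensions, and the iteration count are assumptions, not the authors' implementation.

```python
# Hedged sketch: distilling fused RGB / sensor / point-cloud tokens into latent slots.
import torch
import torch.nn as nn

class SlotDistiller(nn.Module):
    def __init__(self, dim=256, num_slots=16, iters=3):
        super().__init__()
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)  # recurrent slot update, as in slot attention
        self.iters = iters
        self.scale = dim ** -0.5

    def forward(self, tokens):
        # tokens: (B, N, D) concatenated multimodal feature tokens
        B, N, D = tokens.shape
        slots = self.slots_init.expand(B, -1, -1).contiguous()
        k, v = self.to_k(tokens), self.to_v(tokens)
        for _ in range(self.iters):
            q = self.to_q(slots)                                          # (B, K, D)
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)    # slots compete per token
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)         # normalize over tokens
            updates = attn @ v                                            # (B, K, D)
            slots = self.update(updates.reshape(-1, D), slots.reshape(-1, D)).view(B, -1, D)
        return slots  # (B, K, D) evidence-like latent slots
```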

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The central claim of 'strong zero-shot performance' is asserted without any quantitative metrics, benchmark scores, ablation tables, or implementation details. This absence prevents evaluation of whether the two-hop grounding and GRPO components actually deliver the promised gains over baselines.

    Authors: We agree that the abstract would benefit from quantitative support. In the revised manuscript we have updated the abstract to report key zero-shot metrics (average AUROC and AUPRO across the evaluated industrial benchmarks) and to note the contributions of the two-hop grounding and GRPO stages. Full tables, ablations, and implementation details remain in the main text. revision: yes

  2. Referee: [Method (language-guided two-hop grounding)] The claim that pre-trained vision-language models can reliably select anomaly-related sentences and modulate multimodal (RGB/sensor/3D) features into accurate masks rests on an untested assumption that natural-image alignments transfer to subtle, texture-specific, or geometry-dependent industrial defects. No evidence is supplied that this selection step produces usable coarse support without domain adaptation.

    Authors: The framework is designed to operate strictly zero-shot, using off-the-shelf pre-trained VLMs with no domain adaptation or fine-tuning on industrial data. While transfer from natural-image pre-training is an assumption, the paper's experiments on multiple industrial benchmarks show that the selected latent slots yield effective coarse support, as reflected in the final mask accuracy and qualitative results. We have added a short discussion in the method section on this design choice together with additional qualitative examples of the intermediate coarse support on industrial images. revision: partial

  3. Referee: [Method (Executable-Rule GRPO)] The use of verifiable rewards for structure and coherence is presented as preserving strict zero-shot operation, yet the reward computation itself may constitute a form of adaptation or optimization; the paper should clarify whether any task-specific data or tuning is involved.

    Authors: The Executable-Rule GRPO stage applies deterministic, rule-based verifiable rewards that evaluate output structure, anomaly-region consistency, and reasoning coherence directly from the generated text and masks. No task-specific data, labels, or parameter tuning are used; the rules are fixed and general. We have revised the method section to state this explicitly and to confirm that the procedure remains strictly zero-shot. revision: yes
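
To make this description concrete, a minimal sketch of deterministic, verifiable rule rewards of the kind the rebuttal describes is given below; the JSON report schema, the box-based region check, and the thresholds are illustrative assumptions rather than the paper's actual rules.

```python
import json
import numpy as np

def structure_reward(report_text: str, required=("anomaly_present", "defect_type", "evidence")) -> float:
    """1.0 if the report parses as JSON and contains the required fields, else 0.0."""
    try:
        report = json.loads(report_text)
    except json.JSONDecodeError:
        return 0.0
    return float(all(k in report for k in required))

def region_consistency_reward(report_box, mask: np.ndarray, thresh: float = 0.1) -> float:
    """Reward if the region cited in the report overlaps the predicted mask."""
    x0, y0, x1, y1 = report_box                       # box referenced in the report (assumed format)
    region = np.zeros_like(mask, dtype=bool)
    region[y0:y1, x0:x1] = True
    inter = np.logical_and(region, mask > 0.5).sum()
    union = np.logical_or(region, mask > 0.5).sum()
    iou = inter / union if union else 0.0
    return float(iou >= thresh)

def coherence_reward(report: dict) -> float:
    """Reasoning-conclusion coherence: an 'anomalous' verdict must cite at least one evidence item."""
    anomalous = bool(report.get("anomaly_present", False))
    has_evidence = len(report.get("evidence", [])) > 0
    return float((not anomalous) or has_evidence)

# A GRPO-style update would then weight sampled outputs by the sum of these
# deterministic rewards, with no labels or task-specific tuning involved.
```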

Circularity Check

0 steps flagged

No mathematical derivation chain present; framework is descriptive with experimental validation

Full rationale

The paper describes a multimodal vision-language framework for zero-shot grounded industrial anomaly detection, including a language-guided two-hop grounding module and Executable-Rule GRPO for structured outputs. No equations, derivations, or first-principles predictions appear in the provided text. Central claims rest on experimental results across industrial anomaly benchmarks rather than any self-referential reductions, fitted parameters renamed as predictions, or self-citation chains that collapse the argument. The approach is self-contained as a proposed architecture whose performance is evaluated externally, with no load-bearing step equivalent to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract contains no mathematical derivations, fitted parameters, or new postulated entities; the contribution is an architectural framework.

pith-pipeline@v0.9.0 · 5481 in / 1088 out tokens · 42446 ms · 2026-05-10T05:37:10.777563+00:00 · methodology


Reference graph

Works this paper leans on

40 extracted references · 11 canonical work pages · 6 internal anchors

  1. [1]

    GPT-4o System Card

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford et al., “GPT-4o system card,” arXiv preprint arXiv:2410.21276, 2024

  2. [2]

    Learning transferable visual models from natural language supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh et al., “Learning transferable visual models from natural language supervision,” in Proceedings of the 38th International Conference on Machine Learning (ICML), 2021, pp. 8748–8763

  3. [3]

    GPT-4 technical report

    OpenAI, “GPT-4 technical report,” 2023

  4. [4]

    MVTec AD – a comprehensive real-world dataset for unsupervised anomaly detection

    P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, “MVTec AD – a comprehensive real-world dataset for unsupervised anomaly detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019, pp. 9592–9600

  5. [5]

    CutPaste: Self-supervised learning for anomaly detection and localization

    C.-L. Li, K. Sohn, J. Yoon, and T. Pfister, “CutPaste: Self-supervised learning for anomaly detection and localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 9664–9674

  6. [6]

    Towards total recall in industrial anomaly detection

    K. Roth, L. Pemula, J. Zepeda, B. Schölkopf, T. Brox, and P. Gehler, “Towards total recall in industrial anomaly detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 14318–14328

  7. [7]

    Multimodal industrial anomaly detection via hybrid fusion

    Y. Wang, J. Peng, J. Zhang, R. Yi, Y. Wang, and C. Wang, “Multimodal industrial anomaly detection via hybrid fusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 8032–8041

  8. [8]

    Real-IAD D3: A real-world 2D/pseudo-3D/3D dataset for industrial anomaly detection

    W. Zhu, L. Wang, Z. Zhou, C. Wang et al., “Real-IAD D3: A real-world 2D/pseudo-3D/3D dataset for industrial anomaly detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 15214–15223

  9. [9]

    Multi-sensor object anomaly detection: Unifying appearance, geometry, and internal properties

    W. Li, B. Zheng, X. Xu, J. Gan, F. Lu, X. Li, N. Ni, Z. Tian, X. Huang, S. Gao, and Y. Wu, “Multi-sensor object anomaly detection: Unifying appearance, geometry, and internal properties,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 9984–9993

  10. [10]

    Enhancing transparency and trust in AI-powered manufacturing: A survey of explainable AI (XAI) applications in smart manufacturing in the era of Industry 4.0/5.0

    L. Nikiforidis, M. Kyrtsoglou, and A. Vafeiadis, “Enhancing transparency and trust in AI-powered manufacturing: A survey of explainable AI (XAI) applications in smart manufacturing in the era of Industry 4.0/5.0,” ICT Express, vol. 11, no. 1, pp. 135–148, 2025

  11. [11]

    Explainable AI for industrial fault diagnosis: A systematic review

    J. Cação, J. Santos, and M. Antunes, “Explainable AI for industrial fault diagnosis: A systematic review,” Journal of Industrial Information Integration, vol. 47, p. 100905, Sep. 2025

  12. [12]

    A semantic framework for condition monitoring in Industry 4.0 based on evolving knowledge bases

    F. Giustozzi, J. Saunier, and C. Zanni-Merk, “A semantic framework for condition monitoring in Industry 4.0 based on evolving knowledge bases,” Semantic Web, vol. 15, no. 3, pp. 583–611, 2024. [Online]. Available: https://doi.org/10.3233/SW-233481

  13. [13]

    Grounded language-image pre-training

    L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang, K.-W. Chang, and J. Gao, “Grounded language-image pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 10965–10975

  14. [14]

    Image segmentation using text and image prompts

    T. Lüddecke and A. Ecker, “Image segmentation using text and image prompts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 7086–7096

  15. [15]

    DenseCLIP: Language-guided dense prediction with context-aware prompting

    Y. Rao, W. Zhao, G. Chen, Y. Tang, Z. Zhu, G. Huang, J. Zhou, and J. Lu, “DenseCLIP: Language-guided dense prediction with context-aware prompting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 18082–18091

  16. [16]

    LAVT: Language-aware vision transformer for referring image segmentation

    Z. Yang, J. Wang, Y. Tang, K. Chen, H. Zhao, and P. H. Torr, “LAVT: Language-aware vision transformer for referring image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 18155–18165

  17. [17]

    Towards open set deep networks

    A. Bendale and T. E. Boult, “Towards open set deep networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  18. [18]

    Recent advances in open set recognition: A survey

    C. Geng, S.-J. Huang, and S. Chen, “Recent advances in open set recognition: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 10, pp. 3614–3631, 2021

  19. [19]

    WinCLIP: Zero-/few-shot anomaly classification and segmentation

    J. Jeong, Y. Zou, T. Kim, D. Zhang, A. Ravichandran, and O. Dabeer, “WinCLIP: Zero-/few-shot anomaly classification and segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 19606–19616

  20. [20]

    AA-CLIP: Enhancing zero-shot anomaly detection via anomaly-aware CLIP

    W. Ma, X. Zhang, Q. Yao, F. Tang, C. Wu, Y. Li, R. Yan, Z. Jiang, and S. Zhou, “AA-CLIP: Enhancing zero-shot anomaly detection via anomaly-aware CLIP,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 4744–4754

  21. [21]

    Learning to dispatch for job shop scheduling via deep reinforcement learning

    C. Zhang, W. Song, Z. Cao, J. Zhang, P. S. Tan, and C. Xu, “Learning to dispatch for job shop scheduling via deep reinforcement learning,” in Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020. [Online]. Available: https://proceedings.neurips.cc/paper/2020/hash/11958dfee29b6709f48a9ba0387a2431-Abstract.html

  22. [22]

    Delay-aware microservice coordination in mobile edge computing: A reinforcement learning approach

    S. Wang, Y. Guo, N. Zhang, P. Yang, A. Zhou, and X. Shen, “Delay-aware microservice coordination in mobile edge computing: A reinforcement learning approach,” IEEE Transactions on Mobile Computing, vol. 20, no. 3, pp. 939–951, 2021

  23. [23]

    Training language models to follow instructions with human feedback

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” in Advances in Neural Information Processing Systems (NeurIPS), 2022

  24. [24]

    Direct preference optimization: Your language model is secretly a reward model

    R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” in Advances in Neural Information Processing Systems (NeurIPS), 2023

  25. [25]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017

  26. [26]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo, “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.03300

  27. [27]

    Object-centric learning with slot attention

    F. Locatello, D. Weissenborn, T. Unterthiner, A. Mahendran, G. Heigold, J. Uszkoreit, A. Dosovitskiy, and T. Kipf, “Object-centric learning with slot attention,” Advances in Neural Information Processing Systems, vol. 33, pp. 11525–11538, 2020

  28. [28]

    DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

    P. He, J. Gao, and W. Chen, “DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing,” arXiv preprint arXiv:2111.09543, 2021

  29. [29]

    BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International Conference on Machine Learning. PMLR, 2022, pp. 12888–12900

  30. [30]

    The MVTec 3D-AD dataset for unsupervised 3D anomaly detection and localization

    P. Bergmann, X. Jin, D. Sattlegger, and C. Steger, “The MVTec 3D-AD dataset for unsupervised 3D anomaly detection and localization,” arXiv preprint arXiv:2112.09045, 2021

  31. [31]

    The Eyecandies dataset for unsupervised multimodal anomaly detection and localization

    L. Bonfiglioli, M. Toschi, D. Silvestri, N. Fioraio, and D. De Gregorio, “The Eyecandies dataset for unsupervised multimodal anomaly detection and localization,” in Proceedings of the 16th Asian Conference on Computer Vision (ACCV), 2022

  32. [32]

    LLaVA-OneVision: Easy Visual Task Transfer

    B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu, and C. Li, “LLaVA-OneVision: Easy visual task transfer,” arXiv preprint arXiv:2408.03326, 2024. [Online]. Available: https://arxiv.org/abs/2408.03326

  33. [33]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    P. Wang, S. Bai, S. Tan et al., “Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024. [Online]. Available: https://arxiv.org/abs/2409.12191

  34. [34]

    InternVL2: Better than the best—expanding performance boundaries of open-source multimodal models with the progressive scaling strategy

    OpenGVLab Team, “InternVL2: Better than the best—expanding performance boundaries of open-source multimodal models with the progressive scaling strategy,” Jul. 2024, accessed: 2025-12-24. [Online]. Available: https://internvl.github.io/blog/2024-07-02-InternVL-2.0/

  35. [36]

    Laurençon, L.

    [Online]. Available: https://arxiv.org/abs/2405.02246

  36. [37]

    FastFlow: Unsupervised anomaly detection and localization via 2D normalizing flows

    J. Yu, Y. Zheng, X. Wang, W. Li, Y. Wu, R. Zhao, and L. Wu, “FastFlow: Unsupervised anomaly detection and localization via 2D normalizing flows,” arXiv preprint arXiv:2111.07677, 2021

  37. [38]

    EfficientAD: Accurate visual anomaly detection at millisecond-level latencies

    K. Batzner, L. Heckler, and R. König, “EfficientAD: Accurate visual anomaly detection at millisecond-level latencies,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024

  38. [39]

    The Llama 3 Herd of Models

    A. G. et al., “The Llama 3 herd of models,” 2024. [Online]. Available: https://arxiv.org/abs/2407.21783

  39. [40]

    LoRA: Low-rank adaptation of large language models

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “LoRA: Low-rank adaptation of large language models,” ICLR, vol. 1, no. 2, p. 3, 2022

  40. [41]

    Point transformer

    H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun, “Point transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16259–16268