Fingerprinting Inference Systems of Large Language Models

Anna Wimbauer; Erik Imgrund; Jonas Möller; Konrad Rieck

arxiv: 2605.29979 · v1 · pith:XBB2X5IMnew · submitted 2026-05-28 · 💻 cs.CR · cs.LG


Pith reviewed 2026-06-29 06:57 UTC · model grok-4.3

classification 💻 cs.CR cs.LG

keywords LLM fingerprinting · inference system identification · numerical deviations · side-channel analysis · LLM security · hardware platform detection · attention backend · prompt-response analysis


The pith

LLM prompt responses can identify the inference engine, attention backend, and hardware platform.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that running the same model on different inference systems produces small numerical differences that affect the generated text. These differences arise from the engine, attention implementation, and hardware, and they remain detectable even when temperature sampling introduces randomness. A fingerprinting method is presented that uses prompt-response pairs to classify which components are in use. The result matters because any user who can query the model can learn details about its underlying system. The authors conclude that complete prevention would require removing all numerical differences across stacks, which is impractical.

Core claim

Numerical deviations induced by the inference engine, attention backend, and hardware platform are characteristic of those components and propagate to observable textual outputs, allowing reliable identification of the inference system from prompt-response behavior even at non-zero temperature.
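
To make the mechanism concrete, here is a minimal sketch, not taken from the paper, of how reduction order alone can change a greedy decoding decision. Floating-point addition is not associative, so two kernels that sum the same terms in different orders can return slightly different logits. The two-token vocabulary, the logit values, and the function names are illustrative assumptions.

```python
# Illustrative sketch (assumption, not the paper's code): the same three
# terms summed in two different orders, as two kernels might reduce them.

def reduce_in_order(terms):
    """Accumulate left to right, like one hypothetical kernel would."""
    total = 0.0
    for t in terms:
        total += t
    return total

def greedy_token(logit_a, logit_b):
    """Greedy decoding over a toy two-token vocabulary."""
    return "A" if logit_a > logit_b else "B"

terms = [0.1, 0.2, 0.3]   # hypothetical partial sums feeding token "A"
logit_b = 0.6             # hypothetical competing logit for token "B"

logit_a_forward = reduce_in_order(terms)            # (0.1 + 0.2) + 0.3
logit_a_reverse = reduce_in_order(reversed(terms))  # (0.3 + 0.2) + 0.1

token_forward = greedy_token(logit_a_forward, logit_b)
token_reverse = greedy_token(logit_a_reverse, logit_b)

print(logit_a_forward, token_forward)
print(logit_a_reverse, token_reverse)
```

The two sums differ only in the last bit, yet the near-tie with the competing logit is resolved differently. Real logits are rarely this close, which is presumably why prompt choice matters for a practical attack.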

What carries the argument

Fingerprinting method that analyzes prompt-response behavior to detect component-specific numerical deviations.
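
The summary does not describe the classifier itself, so the following is only a sketch of one simple way such a method could work under deterministic decoding: record reference responses on known systems, then pick the candidate whose references agree with the observed responses on the most prompts. The system names, prompts, responses, and the `identify_system` helper are all hypothetical.

```python
# Hypothetical sketch: identify an inference system by comparing observed
# responses against reference responses recorded on known systems.
# All system names, prompts, and responses below are made up.

def identify_system(observed, references):
    """Return the candidate whose reference responses match most prompts.

    observed:   dict mapping prompt -> response from the unknown system
    references: dict mapping system name -> {prompt: response}
    """
    scores = {}
    for system, ref in references.items():
        scores[system] = sum(
            1 for prompt, response in observed.items()
            if ref.get(prompt) == response
        )
    best = max(scores, key=scores.get)
    return best, scores

references = {
    "engine-x/backend-1": {"p1": "alpha", "p2": "beta", "p3": "gamma"},
    "engine-x/backend-2": {"p1": "alpha", "p2": "beta", "p3": "delta"},
    "engine-y/backend-1": {"p1": "alpha", "p2": "zeta", "p3": "delta"},
}
observed = {"p1": "alpha", "p2": "beta", "p3": "delta"}

best, scores = identify_system(observed, references)
print(best, scores)
```

In this toy data, prompt `p1` yields the same response everywhere and contributes nothing to the decision, while `p2` and `p3` each separate a different pair of systems.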

If this is right

Any party that can query an LLM can identify its inference engine, attention backend, and hardware.
Preventing fingerprinting requires eliminating numerical differences between hardware and software stacks.
Partial mitigations are possible but cannot fully remove the exposure.
The security of LLM deployments is affected because internal implementation choices become observable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same approach could be tested on whether it distinguishes between different quantization schemes or compiler optimizations.
Service operators might need to standardize on a single inference stack to limit exposure.
This form of side information could interact with existing attacks that rely on knowing the exact model implementation.

Load-bearing premise

The small numerical deviations from different inference components remain unique and detectable in text outputs despite temperature sampling and other variables.
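
One way to see why this premise can survive sampling noise is to aggregate many responses to the same prompt and score them against per-system reference distributions. The sketch below assumes that setup and uses a plain log-likelihood comparison; the distributions, sample counts, and function names are invented for illustration and are not the paper's procedure.

```python
import math
from collections import Counter

# Hypothetical sketch: with sampling enabled, one response is noisy, so
# we aggregate many samples and score them under per-system reference
# distributions. The distributions below are invented for illustration.

def log_likelihood(samples, dist, floor=1e-9):
    """Sum of log-probabilities of the samples under one reference."""
    counts = Counter(samples)
    return sum(n * math.log(dist.get(tok, floor)) for tok, n in counts.items())

def classify(samples, reference_dists):
    """Pick the system whose reference distribution best explains samples."""
    return max(reference_dists,
               key=lambda name: log_likelihood(samples, reference_dists[name]))

reference_dists = {
    "system-a": {"yes": 0.55, "no": 0.45},
    "system-b": {"yes": 0.45, "no": 0.55},
}

few_samples = ["no", "yes"]                   # balanced, hence uninformative
many_samples = ["yes"] * 58 + ["no"] * 42     # aggregated evidence

tie = (log_likelihood(few_samples, reference_dists["system-a"])
       == log_likelihood(few_samples, reference_dists["system-b"]))
print(tie, classify(many_samples, reference_dists))
```

Two balanced samples score identically under both references, whereas a hundred samples with a modest skew favor one system clearly. The premise fails if the per-system distributions are too close for any realistic number of queries to separate.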

What would settle it

A test set of prompts where responses from two different inference systems show no distinguishable statistical patterns in token probabilities or output distributions.
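
A check of this kind could be run as a two-sample test on responses collected from both systems. The sketch below uses a permutation test on the total variation distance between empirical response distributions; the choice of statistic, the number of permutations, and the toy samples are assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical sketch of a distinguishability check: a permutation test on
# the total variation distance between two sets of sampled responses.

def tv_distance(xs, ys):
    """Total variation distance between two empirical distributions."""
    cx, cy = Counter(xs), Counter(ys)
    keys = set(cx) | set(cy)
    return 0.5 * sum(abs(cx[k] / len(xs) - cy[k] / len(ys)) for k in keys)

def permutation_p_value(xs, ys, n_perm=200, seed=0):
    """Estimate how often shuffled labels give a distance at least as large."""
    rng = random.Random(seed)
    observed = tv_distance(xs, ys)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if tv_distance(pooled[:len(xs)], pooled[len(xs):]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

same = permutation_p_value(["x", "y"] * 10, ["x", "y"] * 10)
different = permutation_p_value(["x"] * 20, ["y"] * 20)
print(same, different)
```

A large p-value across a broad prompt set would be evidence that the two systems are indistinguishable at that sample size; a consistently small one supports the paper's claim.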

Figures

Figures reproduced from arXiv: 2605.29979 by Anna Wimbauer, Erik Imgrund, Jonas Möller, Konrad Rieck.

**Figure 1.** An LLM inference system, which comprises several interacting layers.

**Figure 2.** Reduction of prompts for deterministic decoding.

**Figure 3.** Fingerprinting accuracy as a function of the number of prompts for deterministic decoding.

**Figure 4.** Fingerprinting accuracy as a function of the number of aggregated samples.

**Figure 5.** Fingerprinting accuracy (%) at deterministic decoding where the ...

Original abstract

The behavior of LLMs does not depend solely on the model itself. Components of the inference system, such as the inference engine, attention backend, and hardware platform, subtly influence how inputs are processed. These components differ in their implementations and thereby induce small numerical deviations across systems when running the same model. While prior work has established the theoretical existence of such deviations, their security implications have remained unexplored. In this paper, we show that these deviations are characteristic of specific components and propagate to observable textual outputs, exposing the inference system to any party that can query the model. Building on this observation, we introduce a fingerprinting method that analyzes the prompt-response behavior of LLMs to identify components of the inference system. Our empirical evaluation demonstrates that the inference engine, attention backend, and underlying hardware platform can be identified reliably, even when the LLM is operated at non-zero temperature. We show that preventing fingerprinting is fundamentally hard, as it would require eliminating numerical differences between hardware and software stacks. We therefore propose partial mitigations and discuss their impact.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Numerical deviations in LLM inference let you fingerprint the engine, backend, and hardware from black-box queries.


The main thing to know is that this paper turns known numerical differences across inference stacks into a practical fingerprinting attack that remains effective at non-zero temperature.

They start from prior theoretical results on these deviations and show they propagate to distinguishable text outputs. The new element is the practical method plus the security framing: any querier can learn about the underlying system. They also make the case that complete prevention would require eliminating all hardware and software differences, which is unrealistic, and sketch partial mitigations.

The empirical claim is that identification of the three components is reliable. If the controls and numbers hold, that is a concrete result for the security side of deployed models.

The soft spot is the lack of visible detail on prompt construction, accuracy numbers, and the handling of sampling variance. Without those, it is hard to judge whether the signal stays distinguishable under realistic conditions or whether certain model choices mask it. The abstract positions the work as empirical observation rather than fitted parameters, which is the right framing.

This is for people working on ML security, privacy of inference services, or side-channel issues. A reader who cares about what leaks from deployed LLMs will get a clear attack surface and mitigation discussion. It is worth sending to a serious referee because the angle is new and the basic observation is grounded in real system differences.

Referee Report

0 major / 1 minor

Summary. The manuscript claims that numerical deviations induced by different inference engines, attention backends, and hardware platforms when running the same LLM produce characteristic prompt-response behaviors that enable reliable fingerprinting of the inference system, even at non-zero temperature. The authors introduce a fingerprinting method based on this observation, present an empirical evaluation demonstrating reliable component identification, argue that complete prevention is fundamentally difficult because it would require eliminating all numerical differences across stacks, and propose partial mitigations.

Significance. If the empirical results hold under appropriate controls, the work identifies a previously unexplored attack surface for remote identification of LLM backend components, extending prior theoretical observations on numerical deviations into a practical security implication. The demonstration that fingerprinting remains feasible despite temperature sampling and the discussion of why mitigation is hard constitute the main contributions.

minor comments (1)

The abstract would benefit from including at least one quantitative result (e.g., identification accuracy or number of trials) to convey the strength of the empirical claim without requiring the reader to reach the evaluation section.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their review and for recommending minor revision. The report accurately captures the core claim that numerical deviations across inference engines, attention backends, and hardware platforms produce observable, characteristic prompt-response behaviors that enable fingerprinting even at non-zero temperature. No major comments were provided in the report.

Circularity Check

0 steps flagged

Empirical fingerprinting with no circular derivation


The paper presents an empirical demonstration that system-specific numerical deviations produce distinguishable outputs, evaluated via prompt-response behavior. No equations, derivations, or fitted parameters are described that reduce to self-definition or self-citation by construction. The central claim rests on experimental observation rather than a load-bearing theoretical chain that collapses to its inputs. This is the expected non-finding for an empirical security measurement paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based on abstract only; no specific free parameters, axioms, or invented entities are detailed. The claim rests on the domain assumption that numerical deviations exist and are characteristic of components.

axioms (1)

Domain assumption: Numerical deviations between different inference systems exist and are characteristic of specific components. Invoked as the foundation for the fingerprinting approach, referenced as established by prior work.



