Fingerprinting Inference Systems of Large Language Models
Pith reviewed 2026-06-29 06:57 UTC · model grok-4.3
The pith
LLM prompt responses can identify the inference engine, attention backend, and hardware platform.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Numerical deviations induced by the inference engine, attention backend, and hardware platform are characteristic of those components and propagate to observable textual outputs, allowing reliable identification of the inference system from prompt-response behavior even at non-zero temperature.
What carries the argument
A fingerprinting method that analyzes prompt-response behavior to detect the numerical deviations specific to each component of the inference system.
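To make the idea concrete, here is a minimal sketch of fingerprinting as nearest-profile matching. It is not the paper's method: the system names, the per-system token probabilities, and the `query` stand-in for real API calls are all invented for illustration. Each hypothetical system emits token "A" with a slightly different probability for one near-tie prompt, and an unknown endpoint is attributed to whichever reference profile its observed frequency is closest to.

```python
import random

# Hypothetical probabilities of emitting token "A" first for one near-tie
# prompt. These values are invented; real deviations would be measured.
SYSTEMS = {"engine_x": 0.50, "engine_y": 0.58, "engine_z": 0.42}


def query(system, n, rng):
    """Stand-in for n sampled API calls; returns the frequency of token 'A'."""
    p = SYSTEMS[system]
    return sum(rng.random() < p for _ in range(n)) / n


def build_profiles(n, rng):
    """Collect one reference frequency per known system."""
    return {name: query(name, n, rng) for name in SYSTEMS}


def identify(observed_freq, profiles):
    """Attribute an observation to the closest reference profile."""
    return min(profiles, key=lambda name: abs(profiles[name] - observed_freq))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
profiles = build_profiles(10_000, rng)
guess = identify(query("engine_y", 10_000, rng), profiles)
print(guess)
```

A single prompt and a single feature keep the sketch short; a practical classifier would presumably combine many prompts and richer features.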
If this is right
- Any party that can query an LLM can identify its inference engine, attention backend, and hardware.
- Preventing fingerprinting requires eliminating numerical differences between hardware and software stacks.
- Partial mitigations are possible but cannot fully remove the exposure.
- The security of LLM deployments is affected because internal implementation choices become observable.
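One general reason numerical differences between stacks are hard to eliminate is that floating-point addition is not associative, so two kernels that sum the same values in a different order can return different results. The snippet below is a generic illustration of that fact and is not taken from the paper; the three values are arbitrary.

```python
import math

values = [0.1, 0.2, 0.3]

# Same operands, two reduction orders, as two different kernels might use.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right)                    # 0.6000000000000001
print(right_to_left)                    # 0.6
print(left_to_right == right_to_left)   # False

# math.fsum tracks partial sums exactly, so its result does not depend on
# the order of the inputs, at the cost of extra work per element.
print(math.fsum(values) == math.fsum(reversed(values)))  # True
```

Order-independent summation of this kind is possible but slower, which hints at why removing such differences across every operator, library, and device is expensive.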
Where Pith is reading between the lines
- The same approach could be tested on whether it distinguishes between different quantization schemes or compiler optimizations.
- Service operators might need to standardize on a single inference stack to limit exposure.
- This form of side information could interact with existing attacks that rely on knowing the exact model implementation.
Load-bearing premise
The small numerical deviations from different inference components remain unique and detectable in text outputs despite temperature sampling and other variables.
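The premise can be illustrated with temperature-scaled softmax: sampling adds noise to individual responses, but a small shift in the logits still changes the distribution being sampled from. The sketch below uses invented logits and an invented deviation of 0.01 on one logit; it only shows that the two sampling distributions differ (positive KL divergence) at non-zero temperature, not how large real deviations are.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax with the usual max-subtraction for stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for the same prompt on two inference systems.
logits_a = [2.00, 1.99, 0.50]
logits_b = [2.01, 1.99, 0.50]  # assumed deviation of 0.01 on one logit

divergences = {}
for temperature in (0.5, 1.0):
    p = softmax(logits_a, temperature)
    q = softmax(logits_b, temperature)
    divergences[temperature] = kl_divergence(p, q)
    print(temperature, divergences[temperature])
```

Because the divergence is small, telling the two distributions apart would require many samples, which is consistent with the premise being an empirical question rather than a given.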
What would settle it
A test set of prompts where responses from two different inference systems show no distinguishable statistical patterns in token probabilities or output distributions.
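One way such a test could be run is a chi-square test of homogeneity on the tokens two systems emit for the same prompt. The sketch below is an assumption about how to operationalize "no distinguishable statistical patterns", not a procedure from the paper, and the token counts are invented.

```python
def chi_square_homogeneity(counts_a, counts_b):
    """Chi-square statistic and degrees of freedom for two token-count dicts."""
    tokens = sorted(set(counts_a) | set(counts_b))
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    total = n_a + n_b
    stat = 0.0
    for tok in tokens:
        obs_a = counts_a.get(tok, 0)
        obs_b = counts_b.get(tok, 0)
        column = obs_a + obs_b
        if column == 0:
            continue  # token never observed; contributes nothing
        for observed, n_row in ((obs_a, n_a), (obs_b, n_b)):
            expected = n_row * column / total
            stat += (observed - expected) ** 2 / expected
    return stat, len(tokens) - 1


# Invented first-token counts from 1000 samples per system.
system_a = {"yes": 520, "no": 480}
system_b = {"yes": 450, "no": 550}

stat, df = chi_square_homogeneity(system_a, system_b)
CRITICAL_DF1_ALPHA_05 = 3.841  # chi-square critical value for df=1, alpha=0.05
print(stat, df, stat > CRITICAL_DF1_ALPHA_05)
```

If the statistic stayed below the critical value across a broad prompt set and large sample sizes, that would count as evidence against the claim; a value above it, as with these made-up counts, indicates the two output distributions differ.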
Original abstract
The behavior of LLMs does not depend solely on the model itself. Components of the inference system, such as the inference engine, attention backend, and hardware platform, subtly influence how inputs are processed. These components differ in their implementations and thereby induce small numerical deviations across systems when running the same model. While prior work has established the theoretical existence of such deviations, their security implications have remained unexplored. In this paper, we show that these deviations are characteristic of specific components and propagate to observable textual outputs, exposing the inference system to any party that can query the model. Building on this observation, we introduce a fingerprinting method that analyzes the prompt-response behavior of LLMs to identify components of the inference system. Our empirical evaluation demonstrates that the inference engine, attention backend, and underlying hardware platform can be identified reliably, even when the LLM is operated at non-zero temperature. We show that preventing fingerprinting is fundamentally hard, as it would require eliminating numerical differences between hardware and software stacks. We therefore propose partial mitigations and discuss their impact.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that numerical deviations induced by different inference engines, attention backends, and hardware platforms when running the same LLM produce characteristic prompt-response behaviors that enable reliable fingerprinting of the inference system, even at non-zero temperature. The authors introduce a fingerprinting method based on this observation, present an empirical evaluation demonstrating reliable component identification, argue that complete prevention is fundamentally difficult because it would require eliminating all numerical differences across stacks, and propose partial mitigations.
Significance. If the empirical results hold under appropriate controls, the work identifies a previously unexplored attack surface for remote identification of LLM backend components, extending prior theoretical observations on numerical deviations into a practical security implication. The demonstration that fingerprinting remains feasible despite temperature sampling and the discussion of why mitigation is hard constitute the main contributions.
minor comments (1)
- The abstract would benefit from including at least one quantitative result (e.g., identification accuracy or number of trials) to convey the strength of the empirical claim without requiring the reader to reach the evaluation section.
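As an illustration of what such a quantitative statement could look like, the sketch below reports an identification accuracy together with a 95% Wilson score interval. The counts (470 correct out of 500 trials) are placeholders and are not results from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Placeholder numbers for illustration only.
correct, trials = 470, 500
low, high = wilson_interval(correct, trials)
print(f"accuracy {correct / trials:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Reporting the interval alongside the point estimate lets a reader judge how much the number of trials supports the word "reliably".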
Simulated Author's Rebuttal
We thank the referee for their review and for recommending minor revision. The report accurately captures the core claim that numerical deviations across inference engines, attention backends, and hardware platforms produce observable, characteristic prompt-response behaviors that enable fingerprinting even at non-zero temperature. No major comments were provided in the report.
Circularity Check
Empirical fingerprinting with no circular derivation
full rationale
The paper presents an empirical demonstration that system-specific numerical deviations produce distinguishable outputs, evaluated via prompt-response behavior. No equations, derivations, or fitted parameters are described that reduce to self-definition or self-citation by construction. The central claim rests on experimental observation rather than a load-bearing theoretical chain that collapses to its inputs. This is the expected non-finding for an empirical security measurement paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Numerical deviations between different inference systems exist and are characteristic of specific components.
Appendix A: Societal Impact
This paper introduces a fingerprinting technique for large language model (LLM) components that leverages subtle numerical variations across hardware platforms as a side-channel for component identification. A potential concern is that adversaries could repurpose this method to profile or target...