arxiv: 2605.29979 · v1 · pith:XBB2X5IMnew · submitted 2026-05-28 · 💻 cs.CR · cs.LG

Fingerprinting Inference Systems of Large Language Models

Pith reviewed 2026-06-29 06:57 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords LLM fingerprinting · inference system identification · numerical deviations · side-channel analysis · LLM security · hardware platform detection · attention backend · prompt-response analysis

The pith

LLM prompt responses can identify the inference engine, attention backend, and hardware platform.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that running the same model on different inference systems produces small numerical differences that affect the generated text. These differences arise from the engine, attention implementation, and hardware, and they remain detectable even when temperature sampling introduces randomness. A fingerprinting method is presented that uses prompt-response pairs to classify which components are in use. The result matters because any user who can query the model can learn details about its underlying system. The authors conclude that complete prevention would require removing all numerical differences across stacks, which is impractical.
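To make the mechanism concrete, the following minimal sketch shows how a change in accumulation order alone can flip a greedy token choice. It relies only on the fact that floating-point addition is not associative. The token names and contribution values are hypothetical, and the two accumulation orders merely stand in for two inference stacks; nothing here is taken from the paper's experiments.

```python
# Hypothetical per-token contributions that a kernel would sum into a logit.
CONTRIBUTIONS = {
    "token_a": [0.1, 0.2, 0.3],
    "token_b": [0.3, 0.2, 0.1],
}

def accumulate(values, reverse=False):
    """Sum values sequentially, optionally in reverse order."""
    total = 0.0
    for v in (reversed(values) if reverse else values):
        total += v
    return total

def greedy_token(reverse):
    """Pick the token with the largest accumulated logit."""
    logits = {tok: accumulate(vals, reverse) for tok, vals in CONTRIBUTIONS.items()}
    return max(logits, key=logits.get)

# Mathematically both logits are 0.6, but rounding depends on the order.
print(accumulate(CONTRIBUTIONS["token_a"]))        # forward order
print(accumulate(CONTRIBUTIONS["token_a"], True))  # reverse order
print(greedy_token(reverse=False), greedy_token(reverse=True))
```

Real kernels differ in far more than summation order (tiling, fused operations, reduced-precision accumulators), so this toy case only illustrates why two correct implementations of the same computation need not agree bit for bit.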

Core claim

Numerical deviations induced by the inference engine, attention backend, and hardware platform are characteristic of those components and propagate to observable textual outputs, allowing reliable identification of the inference system from prompt-response behavior even at non-zero temperature.

What carries the argument

A fingerprinting method that analyzes prompt-response behavior to detect component-specific numerical deviations.
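One way to picture such a method under deterministic decoding is a nearest-match lookup against responses recorded from systems with known configurations. The sketch below is an illustration under that assumption, not the authors' procedure; the system labels, prompt identifiers, and response strings are all hypothetical placeholders.

```python
# Hypothetical reference table: responses recorded from systems whose
# engine, attention backend, and hardware are known.
REFERENCE = {
    "engine_x/backend_1/gpu_a": {"p1": "alpha", "p2": "beta", "p3": "gamma"},
    "engine_x/backend_2/gpu_a": {"p1": "alpha", "p2": "beta", "p3": "delta"},
    "engine_y/backend_1/gpu_b": {"p1": "alpha", "p2": "zeta", "p3": "delta"},
}

def classify(observed):
    """Return the best-matching label and the per-label agreement counts."""
    scores = {
        label: sum(observed.get(prompt) == resp for prompt, resp in responses.items())
        for label, responses in REFERENCE.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores

# Responses collected from the system under test (hypothetical).
observed = {"p1": "alpha", "p2": "beta", "p3": "delta"}
label, scores = classify(observed)
print(label, scores)
```

In this toy table, prompt `p1` yields the same response on every system and therefore carries no information; only prompts on which systems disagree help to separate them.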

If this is right

  • Any party that can query an LLM can identify its inference engine, attention backend, and hardware.
  • Preventing fingerprinting requires eliminating numerical differences between hardware and software stacks.
  • Partial mitigations are possible but cannot fully remove the exposure.
  • The security of LLM deployments is affected because internal implementation choices become observable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same approach could be tested for its ability to distinguish quantization schemes or compiler optimizations.
  • Service operators might need to standardize on a single inference stack to limit exposure.
  • This form of side information could interact with existing attacks that rely on knowing the exact model implementation.

Load-bearing premise

The small numerical deviations from different inference components remain unique and detectable in text outputs despite temperature sampling and other variables.

What would settle it

A test set of prompts where responses from two different inference systems show no distinguishable statistical patterns in token probabilities or output distributions.
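A check of this kind could be sketched as a permutation test on sampled responses to a single prompt. The code below is a generic statistical sketch, not the paper's evaluation protocol; the sample data, the choice of total variation distance, and the function names are assumptions made for illustration.

```python
import random
from collections import Counter

def total_variation(xs, ys):
    """Total variation distance between two empirical response distributions."""
    cx, cy = Counter(xs), Counter(ys)
    support = set(cx) | set(cy)
    return 0.5 * sum(abs(cx[s] / len(xs) - cy[s] / len(ys)) for s in support)

def permutation_p_value(xs, ys, trials=200, seed=0):
    """Estimate how often a random relabeling is at least as separated as observed."""
    rng = random.Random(seed)
    observed = total_variation(xs, ys)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        if total_variation(pooled[:len(xs)], pooled[len(xs):]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)

# Hypothetical sampled responses to one prompt at non-zero temperature.
system_a = ["yes"] * 25 + ["no"] * 5
system_b = ["yes"] * 10 + ["no"] * 20
print(total_variation(system_a, system_b), permutation_p_value(system_a, system_b))
```

If p-values stayed large across a broad prompt set and a generous sample budget, that would count as evidence against the claim; a small p-value on even a few prompts would support it.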

Figures

Figures reproduced from arXiv: 2605.29979 by Anna Wimbauer, Erik Imgrund, Jonas Möller, Konrad Rieck.

Figure 1: An LLM inference system.
Figure 2: Reduction of prompts for deterministic decoding.
Figure 3: Fingerprinting accuracy as a function of the number of prompts for deterministic decoding
Figure 4: Fingerprinting accuracy as a function of the number of aggregated samples
Figure 5: Fingerprinting accuracy (%) at deterministic decoding where the ...
Original abstract

The behavior of LLMs does not depend solely on the model itself. Components of the inference system, such as the inference engine, attention backend, and hardware platform, subtly influence how inputs are processed. These components differ in their implementations and thereby induce small numerical deviations across systems when running the same model. While prior work has established the theoretical existence of such deviations, their security implications have remained unexplored. In this paper, we show that these deviations are characteristic of specific components and propagate to observable textual outputs, exposing the inference system to any party that can query the model. Building on this observation, we introduce a fingerprinting method that analyzes the prompt-response behavior of LLMs to identify components of the inference system. Our empirical evaluation demonstrates that the inference engine, attention backend, and underlying hardware platform can be identified reliably, even when the LLM is operated at non-zero temperature. We show that preventing fingerprinting is fundamentally hard, as it would require eliminating numerical differences between hardware and software stacks. We therefore propose partial mitigations and discuss their impact.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The manuscript claims that numerical deviations induced by different inference engines, attention backends, and hardware platforms when running the same LLM produce characteristic prompt-response behaviors that enable reliable fingerprinting of the inference system, even at non-zero temperature. The authors introduce a fingerprinting method based on this observation, present an empirical evaluation demonstrating reliable component identification, argue that complete prevention is fundamentally difficult because it would require eliminating all numerical differences across stacks, and propose partial mitigations.

Significance. If the empirical results hold under appropriate controls, the work identifies a previously unexplored attack surface for remote identification of LLM backend components, extending prior theoretical observations on numerical deviations into a practical security implication. The demonstration that fingerprinting remains feasible despite temperature sampling and the discussion of why mitigation is hard constitute the main contributions.

minor comments (1)
  1. The abstract would benefit from including at least one quantitative result (e.g., identification accuracy or number of trials) to convey the strength of the empirical claim without requiring the reader to reach the evaluation section.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their review and for recommending minor revision. The report accurately captures the core claim that numerical deviations across inference engines, attention backends, and hardware platforms produce observable, characteristic prompt-response behaviors that enable fingerprinting even at non-zero temperature. No major comments were provided in the report.

Circularity Check

0 steps flagged

Empirical fingerprinting with no circular derivation

The paper presents an empirical demonstration that system-specific numerical deviations produce distinguishable outputs, evaluated via prompt-response behavior. No equations, derivations, or fitted parameters are described that reduce to self-definition or self-citation by construction. The central claim rests on experimental observation rather than a load-bearing theoretical chain that collapses to its inputs. This is the expected non-finding for an empirical security measurement paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review based on abstract only; no specific free parameters, axioms, or invented entities are detailed. The claim rests on the domain assumption that numerical deviations exist and are characteristic of components.

axioms (1)
  • Domain assumption: numerical deviations between different inference systems exist and are characteristic of specific components.
    Invoked as the foundation for the fingerprinting approach and referenced as established by prior work.

A Societal Impact

This paper introduces a fingerprinting technique for large language model (LLM) components that leverages subtle numerical variations across hardware platforms as a side-channel for component identification. A potential concern is that adversaries could repurpose this method to profile or target...