pith. machine review for the scientific record.

arxiv: 2604.18179 · v1 · submitted 2026-04-20 · 💻 cs.CR · cs.AI

Recognition: unknown

Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:47 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords model substitution detection · sparse autoencoders · commitment protocols · Merkle trees · LLM auditing · adversarial robustness · hosted language models

The pith

A Merkle-tree commitment to per-position sparse-autoencoder feature traces lets verifiers detect silent model substitution in hosted LLMs even when the provider knows the audit rules in advance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Hosted LLM providers can advertise a strong model while routing ordinary users to a cheaper substitute because probe-after-return checks leave a parallel-serve side-channel. The paper closes that channel with a commit-open protocol: the provider first commits via Merkle tree to a sparse-autoencoder feature-trace sketch of every output token at a published probe layer. Verifiers later open random positions, score the traces against a public probe library using cross-backend noise calibration, and apply a fixed-threshold joint-consistency z-score rule. Experiments show that seventeen attackers, including same-family lifts, cross-family substitutes, and rank-≤128 adaptive LoRA, all fail this rule at one scale-stable threshold while the same attackers pass a matched parallel-serve baseline; a white-box backpropagation attack through the frozen encoder does not close the margin, and a forgery attack that never runs the honest model is bounded by an intrinsic-dimension argument. The commitment step adds at most 2.1 percent to forward-only wall-clock time at batch size 32.

Core claim

By forcing the provider to commit to sparse-autoencoder feature traces before any opening request, the protocol creates a session-wide record that a verifier can check for joint consistency at random positions, rejecting substitution even against protocol-aware attackers.

What carries the argument

The commit-open protocol that uses a Merkle tree to bind the provider to per-position sparse-autoencoder (SAE) feature-trace sketches at a fixed probe layer, scored by a public named-circuit probe library and a fixed-threshold joint-consistency z-score rule.
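The commitment step can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the leaf layout H(meta, t, top32_t) follows the Figure 2 caption, while the serialization, the SHA-256 hash choice, and the helper names (`leaf_hash`, `commit_session`) are assumptions.

```python
# Hypothetical sketch of the provider's commit step. The leaf contents
# (metadata, position index, top-32 SAE feature sketch) follow the paper's
# Figure 2; everything else here is an illustrative assumption.
import hashlib
import json

TOP_K = 32  # the paper keeps the top-32 active SAE features per position


def leaf_hash(meta: dict, t: int, top_features: list) -> bytes:
    """Hash one per-position sketch: session metadata, position index,
    and the top-K (feature_id, activation) pairs, serialized canonically."""
    payload = json.dumps([meta, t, top_features], sort_keys=True).encode()
    return hashlib.sha256(payload).digest()


def merkle_root(leaves: list) -> bytes:
    """Fold leaves pairwise into a binary Merkle tree; duplicate the last
    node whenever a level has odd length."""
    level = leaves
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]


def commit_session(meta: dict, traces: list):
    """traces[t] is the sparse sketch for served output position t.
    Returns the Merkle root published alongside (x, y), plus the leaves
    the provider must retain to answer opening requests."""
    leaves = [leaf_hash(meta, t, trace[:TOP_K])
              for t, trace in enumerate(traces)]
    return merkle_root(leaves), leaves
```

Because the root is published before any opening request, the provider cannot later swap in traces from a different model without breaking a Merkle path.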

If this is right

  • All seventeen tested attackers, spanning same-family lifts, cross-family substitutes, and rank-≤128 adaptive LoRA, are rejected at one shared scale-stable threshold.
  • The identical attackers evade a matched SVIP-style parallel-serve baseline.
  • A white-box end-to-end attack that backpropagates through the frozen SAE encoder does not close the detection margin.
  • A feature-forgery attacker that never runs the honest model is bounded in closed form by an intrinsic-dimension argument.
  • The commitment step adds no more than 2.1 percent to forward-only wall-clock time at batch size 32.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread use would require providers to publish their probe libraries in advance so independent parties can perform audits.
  • The same commitment structure could apply to other detectable deviations such as safety-filter bypasses if distinctive feature traces for those behaviors exist.
  • Periodic updates to the probe library may be needed as new model families appear, otherwise the distinctiveness assumption could weaken over time.

Load-bearing premise

That a public probe library calibrated across backends produces feature traces distinctive enough between different models that the fixed z-score rule still works when the provider knows the protocol ahead of time.

What would settle it

Finding a substitute model or fine-tune that produces outputs a user would accept as coming from the advertised model yet still matches the committed SAE traces closely enough to pass the joint-consistency z-score test at the published threshold.
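For concreteness, here is a minimal sketch of the verifier's side of such a test, assuming a binary Merkle tree and per-probe Gaussian noise calibration. The function names, the (mu, sigma) calibration vectors, and pooling by simple averaging are illustrative stand-ins; the default τ_pool = 1.509 echoes the value quoted in the Figure 4 caption, not a prescribed constant.

```python
# Hedged sketch of the verifier's decision: Merkle-opening check plus a
# pooled z-score rule. Calibration inputs (mu, sigma) and the pooling
# scheme are placeholders for the paper's cross-backend noise model.
import hashlib
import statistics


def verify_opening(root: bytes, leaf: bytes, path: list) -> bool:
    """Walk a Merkle authentication path from leaf to root. Each path
    element is (sibling_hash, sibling_is_left)."""
    node = leaf
    for sibling, sibling_is_left in path:
        pair = sibling + node if sibling_is_left else node + sibling
        node = hashlib.sha256(pair).digest()
    return node == root


def joint_z_score(probe_scores: list, mu: list, sigma: list) -> float:
    """Standardize each probe's score with its calibrated noise model,
    then pool across probes and opened positions (here: a plain mean)."""
    zs = [(s - m) / sd for s, m, sd in zip(probe_scores, mu, sigma)]
    return statistics.mean(zs)


def accept(probe_scores, mu, sigma, tau_pool=1.509) -> bool:
    # Accept the session as honest only when the pooled z-score stays
    # at or below the fixed threshold; above it, substitution is flagged.
    return joint_z_score(probe_scores, mu, sigma) <= tau_pool
```

A substitute that "settles it" in the paper's sense would have to pass both checks at once: valid openings against the pre-committed root and a pooled z-score under the published threshold.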

Figures

Figures reproduced from arXiv: 2604.18179 by Ziyang Liu.

Figure 1: Commit-open verification. Left: four provider strategies — (A) honest, (B) substitute + substitute's trace (detected), (C) substitute + M-trace (pays honest compute), (D) adaptive φ (bounded by SAE capacity). Right: the provider commits a Merkle root of per-position SAE traces before the verifier's probe, closing SVIP's after-the-fact side-channel (dashed).

Figure 2: Commit-open mechanics. (1) The provider forwards (x, y) through M and encodes the layer-L residual stream h_t with a public SAE encoder E, keeping the top-32 active features per position. (2) Per-position leaves leaf_t = H(meta, t, top32_t) are assembled into a Merkle tree with root R; the provider publishes (x, y, R). (3) The verifier samples random positions V and requests Merkle openings.

Figure 3: E2 same-family separability (Qwen3-1.7B vs. lifted Qwen3-0.6B).

Figure 4: Rank-64 LoRA attacker frontier on the Qwen3-1.7B target. All four operating points sit above τ_pool = 1.509; STAGEA is the strongest we evaluate (2.78× τ_pool at 1.55× Pile perplexity). Adding utility regularisation moves the attacker up and to the right. This slice of the frontier is an operating range, not a ceiling (Section 6). Detection margin grows with SAE width.

Figure 5: SVIP parallel-serve vs. commit-open.

Figure 6: Two-backbone separability on the per-probe Mahalanobis diagnostic scale.

Figure 7: Aggregator N-probe sweep: mean AUC across four attacker centers at each α-weakened operating point. ΔAUC between N=1 and N=96 is largest for small α, and AUC plateaus by N≈32.

Figure 8: E9 batched commit overhead. (a) Per-batch latency for forward-only and forward-plus-commit serving.

Figure 9: Fit and held-out R² of the public-corpus linear alignment map φ for each cross-family attacker, alongside the verifier-side joint z-score. Two attackers (Phi-3.5-mini, OLMo-2-7B) exhibit strongly negative held-out R², confirming that the public-corpus φ does not generalise across model families. Partial explanatory analysis for C3; detection is established independently by the joint z-score.

Figure 10: E4 rank-constrained LoRA diagnostic.

Figure 11: E10 circuit-ablation effect matrix.

Figure 12: E6 per-category attackability under the E3 attack suite. Policy: rotate attackable classes.

Figure 13: Feature-forgery attack ladder and library-rotation transfer gap for Qwen3 + Gemma.
read the original abstract

Hosted-LLM providers have a silent-substitution incentive: advertise a stronger model while serving cheaper replies. Probe-after-return schemes such as SVIP leave a parallel-serve side-channel, since a dishonest provider can route the verifier's probe to the advertised model while serving ordinary users from a substitute. We propose a commit-open protocol that closes this gap. Before any opening request, the provider commits via a Merkle tree to a per-position sparse-autoencoder (SAE) feature-trace sketch of its served output at a published probe layer. A verifier opens random positions, scores them against a public named-circuit probe library calibrated with cross-backend noise, and decides with a fixed-threshold joint-consistency z-score rule. We instantiate the protocol on three backbones -- Qwen3-1.7B, Gemma-2-2B, and a 4.5x scale-up to Gemma-2-9B with a 131k-feature SAE. Of 17 attackers spanning same-family lifts, cross-family substitutes, and rank-≤128 adaptive LoRA, all are rejected at a shared, scale-stable threshold; the same attackers all evade a matched SVIP-style parallel-serve baseline. A white-box end-to-end attack that backpropagates through the frozen SAE encoder does not close the margin, and a feature-forgery attacker that never runs M_hon is bounded in closed form by an intrinsic-dimension argument. Commitment adds ≤2.1% to forward-only wall-clock at batch 32.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a commit-open protocol to detect silent model substitution by hosted LLM providers. Providers commit via Merkle tree to per-position SAE feature-trace sketches of outputs at a published probe layer before any verifier opening. Verifiers open random positions, score traces against a public named-circuit probe library (calibrated on cross-backend noise), and apply a fixed-threshold joint-consistency z-score rule. Experiments on Qwen3-1.7B, Gemma-2-2B, and a 4.5x Gemma-2-9B scale-up with 131k-feature SAE show all 17 tested attackers (same-family lifts, cross-family substitutes, rank-≤128 adaptive LoRA) rejected at one shared scale-stable threshold; the same attackers evade a matched SVIP-style parallel-serve baseline. A white-box end-to-end attack backpropagating through the frozen SAE encoder fails to close the margin, while a feature-forgery attacker never running the honest model is bounded in closed form by an intrinsic-dimension argument. Commitment overhead is ≤2.1% at batch 32.

Significance. If the empirical and theoretical results hold, the work supplies a concrete, low-overhead mechanism for auditing hosted LLMs that closes the parallel-serve side-channel left by probe-after-return schemes. The combination of cryptographic commitment, SAE feature traces, and a public probe library is a novel contribution to the model-auditing literature. Explicit credit is due for testing a diverse attacker suite, including adaptive LoRA and white-box gradient attacks, plus the closed-form intrinsic-dimension bound on forgery. These elements would strengthen practical trust in commercial LLM services if the calibration independence and statistical robustness concerns are resolved.

major comments (3)
  1. [Abstract] The central empirical claim that 'all 17 attackers are rejected at a shared, scale-stable threshold' is presented without error bars, the exact numerical threshold, details of how the probe library was constructed, or statistical justification for the z-score rule. This claim is load-bearing for the paper's main result.
  2. [Protocol description and §4 (experiments)] The detection rule relies on an empirically calibrated noise model and a public probe library, yet no demonstration is given that the library construction and calibration are independent of the 17 test attackers or of the specific backbones used. This raises a circularity risk for the fixed-threshold rule against protocol-aware adversaries.
  3. [Attacker evaluation] While the white-box backprop attack through the frozen SAE and the intrinsic-dimension bound on feature forgery are reported, the manuscript neither quantifies the margin by which these attacks fail nor shows that the bound remains non-vacuous once the provider knows the exact opening distribution and z-score rule in advance.
minor comments (2)
  1. [Abstract and results] The abstract and results would benefit from a table listing the exact 17 attackers, their categories, and the per-attacker z-scores or distances to threshold.
  2. [Notation] Notation for the 'per-position SAE-feature-trace sketch' and the joint-consistency z-score should be defined once in a dedicated notation subsection rather than introduced piecemeal.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major point below and will revise the manuscript to incorporate the requested details, clarifications, and additional analysis.

read point-by-point responses
  1. Referee: [Abstract] The central empirical claim that 'all 17 attackers are rejected at a shared, scale-stable threshold' is presented without error bars, the exact numerical threshold, details of how the probe library was constructed, or statistical justification for the z-score rule. This claim is load-bearing for the paper's main result.

    Authors: We agree that the abstract would benefit from greater specificity on these load-bearing elements. In revision we will state the exact shared threshold value, include error bars derived from the repeated experimental runs, briefly describe the probe library construction from the public named-circuit set, and point to the statistical justification of the joint-consistency z-score rule (which appears in §3). These additions will be kept concise while making the central claim fully transparent. revision: yes

  2. Referee: [Protocol description and §4 (experiments)] The detection rule relies on an empirically calibrated noise model and a public probe library, yet no demonstration is given that the library construction and calibration are independent of the 17 test attackers or of the specific backbones used. This raises a circularity risk for the fixed-threshold rule against protocol-aware adversaries.

    Authors: The calibration uses cross-backend noise collected from a diverse set of models and the probe library is built from fixed, publicly documented named circuits; neither step incorporates the 17 evaluation attackers. To remove any ambiguity we will add an explicit independence check in the revised §4: we re-calibrate the noise model while deliberately excluding the test backbones and attackers, then verify that the same fixed threshold continues to separate all 17 attackers. This directly demonstrates that the rule is not circular with respect to the reported evaluation set. revision: yes

  3. Referee: [Attacker evaluation] While the white-box backprop attack through the frozen SAE and the intrinsic-dimension bound on feature forgery are reported, the manuscript neither quantifies the margin by which these attacks fail nor shows that the bound remains non-vacuous once the provider knows the exact opening distribution and z-score rule in advance.

    Authors: We will expand the attacker-evaluation subsection to report the precise margins: the z-score gap between the white-box back-propagation attack and the detection threshold, and the numerical probability bound obtained from the intrinsic-dimension argument. We will also add a short analysis showing that the forgery bound remains non-vacuous even under full knowledge of the opening-position distribution and the z-score rule, because the bound is derived from the SAE feature-space dimensionality and the sparsity constraint rather than from the specific opening schedule. revision: yes

Circularity Check

0 steps flagged

No significant circularity in protocol derivation or empirical claims

full rationale

The paper presents a commit-open protocol whose core steps (Merkle commitment to SAE feature-trace sketches, random-position opening, and fixed-threshold z-score decision) are defined independently of the test outcomes. Empirical results on 17 attackers are reported as validation rather than as a derivation that reduces to fitted parameters or self-citations by construction. The public probe library and cross-backend noise calibration are described as external inputs, with no equations or claims showing that the rejection threshold or distinctiveness argument collapses to the input data or prior self-work. The intrinsic-dimension bound on feature-forgery is presented as a closed-form argument separate from the fitted elements. This is the normal case of a self-contained protocol paper whose central claims do not reduce to their own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 1 invented entities

The protocol rests on the assumption that SAE features at a fixed probe layer capture model identity sufficiently to survive cross-backend noise and adaptive attacks; the z-score threshold and probe library are calibrated quantities whose independence from the evaluation set is not demonstrated in the abstract.

free parameters (2)
  • joint-consistency z-score threshold
    Fixed threshold chosen to separate honest from all tested attacker traces; value not stated but described as scale-stable.
  • probe-library calibration parameters
    Cross-backend noise model used to score opened positions; treated as public but empirically fitted.
axioms (1)
  • domain assumption SAE features extracted at the published probe layer remain distinctive across model families and scales even after cross-backend noise.
    Required for the z-score rule to reject substitutions while accepting honest runs.
invented entities (1)
  • per-position SAE-feature-trace sketch no independent evidence
    purpose: Compact representation committed via Merkle tree before verification to prevent post-hoc forgery.
    New construct introduced to bind the provider to the served output without revealing full tokens.

pith-pipeline@v0.9.0 · 5570 in / 1606 out tokens · 46934 ms · 2026-05-10T04:47:50.457417+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

27 extracted references · 11 canonical work pages · 3 internal anchors

  1. [1]

    LegoSNARK: Modular design and composition of succinct zero-knowledge proofs

    Matteo Campanelli, Dario Fiore, and Anaïs Querol. LegoSNARK: Modular design and composition of succinct zero-knowledge proofs. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 2075–2092. ACM, 2019. doi: 10.1145/3319535.3339820

  2. [2]

    LLMmap: Fingerprinting for large language models

    Dario Pasquini, Evgenios M. Kornaropoulos, and Giuseppe Ateniese. LLMmap: Fingerprinting for large language models. In 34th USENIX Security Symposium, pages 299–318. USENIX Association, 2025

  3. [3]

    Instructional fingerprinting of large language models

    Jiashu Xu, Fei Wang, Mingyu Derek Ma, Pang Wei Koh, Chaowei Xiao, and Muhao Chen. Instructional fingerprinting of large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), pages 3277–3306. Association for Computational Linguistics, 2024

  4. [4]

    SVIP: Towards verifiable inference of open-source large language models

    Yifan Sun, Yuhang Li, Yue Zhang, Yuchen Jin, and Huan Zhang. SVIP: Towards verifiable inference of open-source large language models. CoRR, abs/2410.22307, 2024. doi: 10.48550/ARXIV.2410.22307

  5. [5]

    LoRA: Low-rank adaptation of large language models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR). OpenReview.net, 2022

  6. [6]

    TOPLOC: A locality sensitive hashing scheme for trustless verifiable inference

    Jack Min Ong, Matthew Di Ferrante, Aaron Pazdera, Ryan Garner, Sami Jaghouar, Manveer Basra, Max Ryabinin, and Johannes Hagemann. TOPLOC: A locality sensitive hashing scheme for trustless verifiable inference. In International Conference on Machine Learning (ICML). PMLR / OpenReview.net, 2025

  7. [7]

    Compact proofs of model performance via mechanistic interpretability

    Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, and Lawrence Chan. Compact proofs of model performance via mechanistic interpretability. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  8. [8]

    Interpretability in the wild: a circuit for indirect object identification in GPT-2 small

    Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In International Conference on Learning Representations (ICLR). OpenReview.net, 2023

  9. [9]

    In-context Learning and Induction Heads

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, et al. In-context learning and induction heads. Transformer Circuits Thread, 2022

  10. [10]

    Towards monosemanticity: Decomposing language models with dictionary learning

    Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023

  11. [11]

    Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet

    Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, et al. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024

  12. [12]

    SAELens: Training and analyzing sparse autoencoders

    Joseph Bloom, Curt Tigges, Anthony Duong, and David Chanin. SAELens: Training and analyzing sparse autoencoders, 2024. Software package, https://github.com/jbloomAus/SAELens

  13. [13]

    Transcoders find interpretable LLM feature circuits

    Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable LLM feature circuits. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  14. [14]

    Proofs of useful work

    Marshall Ball, Alon Rosen, Manuel Sabin, and Prashant Nalini Vasudevan. Proofs of useful work. Cryptology ePrint Archive, Paper 2017/203, 2017. URL https://eprint.iacr.org/2017/203

  15. [15]

    Intel TDX demystified: A top-down approach

    Pau-Chen Cheng, Wojciech Ozga, Enriquillo Valdez, Salman Ahmed, Zhongshu Gu, Hani Jamjoom, Hubertus Franke, and James Bottomley. Intel TDX demystified: A top-down approach. ACM Computing Surveys, 56(9):238:1–238:33, 2024. doi: 10.1145/3652597

  16. [17]

    Design and implementation of a TCG-based integrity measurement architecture

    Reiner Sailer, Xiaolan Zhang, Trent Jaeger, and Leendert van Doorn. Design and implementation of a TCG-based integrity measurement architecture. In 13th USENIX Security Symposium, pages 223–238. USENIX Association, 2004

  17. [18]

    Efficient data structures for tamper-evident logging

    Scott A. Crosby and Dan S. Wallach. Efficient data structures for tamper-evident logging. In 18th USENIX Security Symposium, pages 317–334. USENIX Association, 2009

  18. [19]

    Certificate transparency

    Ben Laurie, Adam Langley, and Emilia Käsper. Certificate transparency. RFC 6962, IETF, 2013

  19. [20]

    Trusting what you cannot see: Auditable fine-tuning and inference for proprietary AI

    Heng Jin, Chaoyu Zhang, Hexuan Yu, Shanghao Shi, Ning Zhang, Y. Thomas Hou, and Wenjing Lou. Trusting what you cannot see: Auditable fine-tuning and inference for proprietary AI. CoRR, abs/2603.07466, 2026

  20. [21]

    Verifiable fine-tuning for LLMs: Zero-knowledge training proofs bound to data provenance and policy

    Hasan Akgul, Daniel Borg, Arta Berisha, Amina Rahimova, Andrej Novak, and Mila Petrov. Verifiable fine-tuning for LLMs: Zero-knowledge training proofs bound to data provenance and policy. CoRR, abs/2510.16830, 2025

  21. [22]

    A digital signature based on a conventional encryption function

    Ralph C. Merkle. A digital signature based on a conventional encryption function. In Carl Pomerance, editor, Advances in Cryptology - CRYPTO '87, Lecture Notes in Computer Science, pages 369–378. Springer, 1987. doi: 10.1007/3-540-48184-2_32

  22. [23]

    Sequential tests of statistical hypotheses

    Abraham Wald. Sequential tests of statistical hypotheses. Annals of Mathematical Statistics, 16(2):117–186, 1945

  23. [24]

    A simple sequentially rejective multiple test procedure

    Sture Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70, 1979

  24. [30]

    The Pile: An 800GB dataset of diverse text for language modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800GB dataset of diverse text for language modeling. CoRR, abs/2101.00027, 2021

  25. [31]

    On the generalised distance in statistics

    Prasanta Chandra Mahalanobis. On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India, 2(1):49–55, 1936

  26. [32]

    The use of confidence or fiducial limits illustrated in the case of the binomial

    C. J. Clopper and E. S. Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4):404–413, 1934. doi: 10.1093/biomet/26.4.404

  27. [33]

    Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2

    Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca D. Dragan, Rohin Shah, and Neel Nanda. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. CoRR, abs/2408.05147, 2024. doi: 10.48550/ARXIV.2408.05147