Learning the Signature of Memorization in Autoregressive Language Models
Pith reviewed 2026-05-13 19:49 UTC · model grok-4.3
The pith
Fine-tuning language models creates an invariant memorization signature that transfers across architectures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fine-tuning language models produces an invariant signature of memorization detectable across architectural families and data domains. Training a membership inference classifier exclusively on transformer-based models yields zero-shot transfer to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936 respectively, each higher than the 0.908 AUC obtained on held-out transformers. The same signature appears in simple likelihood methods, confirming that it exists independently of the detection approach.
What carries the argument
The Learned Transfer Membership Inference Attack (LT-MIA), which reframes membership inference as sequence classification over per-token distributional statistics extracted from the model's output probabilities.
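The exact feature set is not spelled out in this summary. As a rough, hypothetical sketch of what "per-token distributional statistics" could look like (function name and feature choices are ours, not the paper's), each token position might contribute its log-likelihood, its rank within the model's predicted distribution, and the distribution's entropy:

```python
import numpy as np

def per_token_stats(probs, target_ids):
    """Per-token distributional statistics for one sequence.

    probs: (T, V) array of next-token probability distributions.
    target_ids: (T,) array of the tokens actually observed.
    Returns a (T, 3) feature matrix: log-prob, rank, entropy.
    """
    T = probs.shape[0]
    target_p = probs[np.arange(T), target_ids]
    log_prob = np.log(target_p)                      # token log-likelihood
    # rank of the observed token in the distribution (0 = most likely)
    rank = (probs > target_p[:, None]).sum(axis=1)
    entropy = -(probs * np.log(probs)).sum(axis=1)   # distribution entropy
    return np.stack([log_prob, rank.astype(float), entropy], axis=1)

# toy example: 2 tokens over a 4-word vocabulary
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.25, 0.25, 0.25, 0.25]])
features = per_token_stats(probs, np.array([0, 3]))
```

A sequence classifier in the LT-MIA style would then read this (T, d) feature sequence and emit a membership score, rather than thresholding a single scalar as the heuristic baselines do.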
If this is right
- Fine-tuning supplies unlimited labeled training data for the classifier because membership labels are known by construction.
- LT-MIA raises true-positive rate at 0.1 percent false-positive rate by a factor of 2.8 over the strongest prior baseline on transformer models.
- The same classifier trained only on natural-language data still reaches 0.865 AUC on code-generation models.
- Even non-learned likelihood baselines exhibit strong cross-architecture transfer, showing the signature is not an artifact of the classifier architecture.
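For reference, the TPR-at-low-FPR metric behind the 2.8x claim can be computed directly from attack scores: pick the score threshold that admits at most the allowed fraction of non-members, then measure what fraction of members clear it. A minimal sketch on synthetic scores (not the paper's data or code):

```python
import numpy as np

def tpr_at_fpr(scores, labels, max_fpr=0.001):
    """True-positive rate at the threshold whose FPR is at most max_fpr.

    scores: higher = more likely a training member.
    labels: 1 for members, 0 for non-members.
    """
    neg = np.sort(scores[labels == 0])[::-1]   # non-member scores, descending
    k = int(max_fpr * len(neg))                # false positives allowed
    # threshold just above the (k+1)-th highest non-member score
    threshold = neg[k] if k < len(neg) else -np.inf
    return float((scores[labels == 1] > threshold).mean())

# synthetic example: partially separated member / non-member score distributions
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(2.0, 1.0, 1000),   # members
                         rng.normal(0.0, 1.0, 1000)])  # non-members
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
rate = tpr_at_fpr(scores, labels, max_fpr=0.001)
```

At 0.1 percent FPR the threshold is set by the very top of the non-member score distribution, which is why this regime is far more demanding than aggregate AUC.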
Where Pith is reading between the lines
- If the signature is produced by any gradient-based update on cross-entropy, similar classifiers could be trained for vision or reinforcement-learning models.
- Model developers could run the classifier internally to audit whether fine-tuning has memorized private user data before deployment.
- Unlearning techniques might be evaluated by whether they reduce the detectable signature rather than only by downstream accuracy.
- The existence of the signature suggests that memorization is a low-level consequence of the training objective rather than a high-level architectural choice.
Load-bearing premise
The memorization pattern learned from transformer fine-tuning will generalize to any architecture trained by gradient descent on cross-entropy loss, even when the architectures share no computational mechanisms.
What would settle it
Train an autoregressive model with an optimization procedure other than gradient descent on cross-entropy loss and measure whether the LT-MIA classifier still achieves AUC above 0.9 on held-out fine-tuned examples from that model.
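The AUC bar in this proposed test is a rank statistic: the probability that a randomly chosen member example outscores a randomly chosen non-member. A minimal sketch of the computation (toy scores, not the paper's data):

```python
import numpy as np

def mia_auc(scores, labels):
    """AUC: probability a random member scores above a random non-member.

    Equivalent to the normalized Mann-Whitney U statistic.
    """
    members = scores[labels == 1]
    nonmembers = scores[labels == 0]
    # pairwise comparisons; ties count half
    greater = (members[:, None] > nonmembers[None, :]).sum()
    ties = (members[:, None] == nonmembers[None, :]).sum()
    return (greater + 0.5 * ties) / (len(members) * len(nonmembers))

# perfect separation; the proposed experiment's bar is AUC above 0.9
scores = np.array([0.9, 0.8, 0.7, 0.2, 0.1, 0.0])
labels = np.array([1, 1, 1, 0, 0, 0])
auc = mia_auc(scores, labels)  # -> 1.0
```

An uninformative attack (identical score distributions for members and non-members) lands at 0.5 under this definition, so 0.9 on a non-gradient-descent model would be strong evidence against the load-bearing premise.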
Original abstract
All prior membership inference attacks for fine-tuned language models use hand-crafted heuristics (e.g., loss thresholding, Min-K%, reference calibration), each bounded by the designer's intuition. We introduce the first transferable learned attack, enabled by the observation that fine-tuning any model on any corpus yields unlimited labeled data, since membership is known by construction. This removes the shadow model bottleneck and brings membership inference into the deep learning era: learning what matters rather than designing it, with generalization through training diversity and scale. We discover that fine-tuning language models produces an invariant signature of memorization detectable across architectural families and data domains. We train a membership inference classifier exclusively on transformer-based models. It transfers zero-shot to Mamba (state-space), RWKV-4 (linear attention), and RecurrentGemma (gated recurrence), achieving 0.963, 0.972, and 0.936 AUC respectively. Each evaluation combines an architecture and dataset never seen during training, yet all three exceed performance on held-out transformers (0.908 AUC). These four families share no computational mechanisms; their only commonality is gradient descent on cross-entropy loss. Even simple likelihood-based methods exhibit strong transfer, confirming the signature exists independently of the detection method. Our method, Learned Transfer MIA (LT-MIA), captures this signal most effectively by reframing membership inference as sequence classification over per-token distributional statistics. On transformers, LT-MIA achieves 2.8x higher TPR at 0.1% FPR than the strongest baseline. The method also transfers to code (0.865 AUC) despite training only on natural language texts. Code and trained classifier available at https://github.com/JetBrains-Research/learned-mia.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Learned Transfer MIA (LT-MIA), a learned membership inference attack for fine-tuned autoregressive language models. A classifier is trained exclusively on transformer models using per-token distributional statistics from fine-tuning runs (where membership is known by construction), then evaluated zero-shot on unseen architectures (Mamba, RWKV-4, RecurrentGemma) and datasets, reporting AUCs of 0.963, 0.972, and 0.936 respectively—exceeding held-out transformer performance (0.908 AUC). The work also reports transfer to code data (0.865 AUC) and a 2.8× improvement in TPR at 0.1% FPR over baselines on transformers, attributing the transferable signal to gradient descent on cross-entropy loss.
Significance. If the transfer results hold under the reported protocols, the paper demonstrates that fine-tuning produces an architecture-invariant memorization signature detectable by data-driven methods rather than hand-crafted heuristics. The unlimited labeled data from fine-tuning removes the shadow-model requirement and enables scaling via training diversity. Releasing code and the trained classifier is a clear strength that supports reproducibility and follow-on work. The observation that even simple likelihood methods transfer provides independent evidence for the signature's existence.
Major comments (1)
- [Experiments] Experiments section: the central claim that the signature's only necessary condition is gradient descent on cross-entropy loss (with no shared computational mechanisms across families) is not fully isolated. All evaluated models share comparable optimizer families, learning-rate schedules, and fine-tuning durations; without ablations that vary these factors while holding architecture fixed, the high zero-shot transfer (0.963–0.972 AUC) could be driven by shared optimization dynamics rather than the stated minimal commonality.
Minor comments (1)
- [Abstract] Abstract: the claim of 2.8× higher TPR at 0.1% FPR should include a parenthetical reference to the exact baseline and the section/table where the comparison is reported.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and the recommendation for minor revision. We address the major comment below.
Point-by-point responses
Referee: [Experiments] Experiments section: the central claim that the signature's only necessary condition is gradient descent on cross-entropy loss (with no shared computational mechanisms across families) is not fully isolated. All evaluated models share comparable optimizer families, learning-rate schedules, and fine-tuning durations; without ablations that vary these factors while holding architecture fixed, the high zero-shot transfer (0.963–0.972 AUC) could be driven by shared optimization dynamics rather than the stated minimal commonality.
Authors: We appreciate the referee's point on potential confounds. The models do employ comparable optimizers (primarily AdamW variants), learning-rate schedules, and fine-tuning durations, as is standard practice for each family. However, their core mechanisms remain fundamentally distinct: self-attention, selective state-space models, linear attention with recurrence, and gated recurrence share no computational primitives. The classifier, trained exclusively on transformers, transfers zero-shot and even exceeds held-out transformer performance, which we interpret as evidence that the signal originates from the shared gradient-descent process on cross-entropy loss. We agree that dedicated ablations varying only the optimizer or schedule while fixing the architecture would provide stronger isolation. We will add a concise limitations paragraph in the discussion section acknowledging this and proposing such ablations as future work.
Revision status: partial
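The ablation both sides agree on could be organized as a small sweep: hold the architecture fixed and vary only the optimization recipe, then check whether transfer AUC survives each cell. A hypothetical sketch (optimizer and schedule names are illustrative choices of ours, not drawn from the paper):

```python
from itertools import product

# Hypothetical ablation grid isolating optimization dynamics from
# architecture: fix the architecture, vary only optimizer and LR schedule,
# and re-measure LT-MIA transfer AUC in every cell.
optimizers = ["adamw", "sgd", "lion", "adafactor"]
schedules = ["cosine", "linear", "constant"]
grid = [{"arch": "transformer", "optimizer": opt, "schedule": sched}
        for opt, sched in product(optimizers, schedules)]
# 12 fine-tuning runs per dataset; stable AUC across cells would support
# the claim that the signature comes from the loss, not the optimizer.
```

If the signature degrades for, say, plain SGD but not AdamW, the "gradient descent on cross-entropy" premise would need to be narrowed to specific optimization dynamics.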
Circularity Check
No circularity: classifier trained on independently labeled fine-tune runs transfers to held-out architectures
Full rationale
The paper constructs labeled training data for the LT-MIA classifier by fine-tuning transformers on known corpora (membership known by construction from the training split). It then evaluates zero-shot transfer on entirely separate architectures (Mamba, RWKV-4, RecurrentGemma) and datasets never seen in classifier training, reporting AUCs of 0.963/0.972/0.936. No equation or claim reduces a prediction to a fitted parameter defined on the same data; the central result is an empirical generalization test whose inputs (training runs) are independent of the test models. No self-citations are load-bearing for the transfer claim, and no ansatz or uniqueness theorem is invoked to force the outcome.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: gradient descent on cross-entropy loss produces consistent memorization patterns across architectures that share no computational mechanisms.
Reference graph
Works this paper leans on
- [1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016.
- [2] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- [3] Aleksandar Botev, Soham De, Samuel L. Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, et al. RecurrentGemma: Moving past transformers for efficient open language models. 2024.
- [4] Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramer. Membership inference attacks from first principles. In IEEE Symposium on Security and Privacy, pages 1897–1914, 2022.
- [5] Yuetian Chen, Yuntao Du, Kaiyuan Zhang, Ashish Kundu, Charles Fleming, Bruno Ribeiro, and Ninghui Li. Window-based membership inference attacks against fine-tuned large language models. arXiv preprint arXiv:2601.02751.
- [6] Nolan Dey, Gurpreet Gosal, Zhiming Chen, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, and Joel Hestness. Cerebras-GPT: Open compute-optimal language models trained on the Cerebras wafer-scale cluster. arXiv preprint arXiv:2304.03208, 2023.
- [7] Michael Duan, Anshuman Suri, Niloofar Mireshghallah, Sewon Min, Weijia Shi, Luke Zettlemoyer, Yulia Tsvetkov, Yejin Choi, David Evans, and Hannaneh Hajishirzi. Do membership inference attacks work on large language models? arXiv preprint arXiv:2402.07841, 2024.
- [8] Gemma Team. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
- [9] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- [10] David Ilić, David Stanojević, and Kostadin Cvejoski. Powerful training-free membership inference against autoregressive language models. arXiv preprint arXiv:2601.12104.
- [11] Justus Mattern, Fatemehsadat Mireshghallah, Zhijing Jin, Bernhard Schölkopf, Mrinmaya Sachan, and Taylor Berg-Kirkpatrick. Membership inference attacks against language models via neighbourhood comparison. In Findings of the Association for Computational Linguistics: ACL 2023, pages 11330–11343, 2023.
- [12] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- [13] Rishabh Misra. News category dataset. arXiv preprint arXiv:2209.11429, 2022.
- [14] Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807, 2018.
- [15] Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
- [16] Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, et al. RWKV: Reinventing RNNs for the transformer era. 2023.
- [17] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
- [18] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
- [19] Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789, 2023.
- [20] Jingyang Zhang, Jingwei Sun, Eric Yeats, Yang Ouyang, Martin Kuo, Jianyi Zhang, Hao Yang, and Hai Li. Min-K%++: Improved baseline for detecting pre-training data from large language models. arXiv preprint arXiv:2404.02936, 2024.
- [21] Appendix A, Table 4 (excerpt): full results on held-out transformers, AUC (Tables 4–9 cover all model–dataset combinations).
  Model  Dataset   Loss   Min-K%++  Zlib   RefLoss  EZ-MIA  LT-MIA
  GPT-2  AG News   0.745  0.704     0.717  0.790    0.960   0.945
  GPT-2  WikiText  0.745  0.696     0.713  0.814    0.971   0.980
  GPT-2  XSum      0.768  0.719     0.760  0.956    0.994   0.991
  GPT-2  Code      0.618  (remaining rows truncated in source)
- [22] Appendix D, classifier architecture ablation (Table 11): all variants are trained on identical features from 30 model–dataset combinations (540,000 samples total); only the classifier architecture differs. Sequence modeling contributes 5.0 AUC points over pooling (0.925 vs. 0.… truncated in source).