Learning the Signature of Memorization in Autoregressive Language Models
Pith reviewed 2026-05-13 19:49 UTC · model grok-4.3
The pith
Fine-tuning language models creates an invariant memorization signature that transfers across architectures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fine-tuning language models produces an invariant signature of memorization detectable across architectural families and data domains. Training a membership inference classifier exclusively on transformer-based models yields zero-shot transfer to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936 respectively, each higher than the 0.908 AUC obtained on held-out transformers. The same signature appears in simple likelihood methods, confirming that it exists independently of the detection approach.
What carries the argument
The Learned Transfer Membership Inference Attack (LT-MIA), which reframes membership inference as sequence classification over per-token distributional statistics extracted from the model's output probabilities.
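The exact feature set is not spelled out in this summary. As a rough, hypothetical sketch of what "per-token distributional statistics" could look like (function name and feature choices are ours, not the paper's), each token position might contribute its log-likelihood, its rank within the model's predicted distribution, and the distribution's entropy:

```python
import numpy as np

def per_token_stats(probs, target_ids):
    """Per-token distributional statistics for one sequence.

    probs: (T, V) array of next-token probability distributions.
    target_ids: (T,) array of the tokens actually observed.
    Returns a (T, 3) feature matrix: log-prob, rank, entropy.
    """
    T = probs.shape[0]
    target_p = probs[np.arange(T), target_ids]
    log_prob = np.log(target_p)                      # token log-likelihood
    # rank of the observed token in the distribution (0 = most likely)
    rank = (probs > target_p[:, None]).sum(axis=1)
    entropy = -(probs * np.log(probs)).sum(axis=1)   # distribution entropy
    return np.stack([log_prob, rank.astype(float), entropy], axis=1)

# toy example: 2 tokens over a 4-word vocabulary
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.25, 0.25, 0.25, 0.25]])
features = per_token_stats(probs, np.array([0, 3]))
```

A sequence classifier in the LT-MIA style would then read this (T, d) feature sequence and emit a membership score, rather than thresholding a single scalar as the heuristic baselines do.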
If this is right
- Fine-tuning supplies unlimited labeled training data for the classifier because membership labels are known by construction.
- LT-MIA raises true-positive rate at 0.1 percent false-positive rate by a factor of 2.8 over the strongest prior baseline on transformer models.
- The same classifier trained only on natural-language data still reaches 0.865 AUC on code-generation models.
- Even non-learned likelihood baselines exhibit strong cross-architecture transfer, showing the signature is not an artifact of the classifier architecture.
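For reference, the TPR-at-low-FPR metric behind the 2.8x claim can be computed directly from attack scores: pick the score threshold that admits at most the allowed fraction of non-members, then measure what fraction of members clear it. A minimal sketch on synthetic scores (not the paper's data or code):

```python
import numpy as np

def tpr_at_fpr(scores, labels, max_fpr=0.001):
    """True-positive rate at the threshold whose FPR is at most max_fpr.

    scores: higher = more likely a training member.
    labels: 1 for members, 0 for non-members.
    """
    neg = np.sort(scores[labels == 0])[::-1]   # non-member scores, descending
    k = int(max_fpr * len(neg))                # false positives allowed
    # threshold just above the (k+1)-th highest non-member score
    threshold = neg[k] if k < len(neg) else -np.inf
    return float((scores[labels == 1] > threshold).mean())

# synthetic example: partially separated member / non-member score distributions
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(2.0, 1.0, 1000),   # members
                         rng.normal(0.0, 1.0, 1000)])  # non-members
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
rate = tpr_at_fpr(scores, labels, max_fpr=0.001)
```

At 0.1 percent FPR the threshold is set by the very top of the non-member score distribution, which is why this regime is far more demanding than aggregate AUC.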
Where Pith is reading between the lines
- If the signature is produced by any gradient-based update on cross-entropy, similar classifiers could be trained for vision or reinforcement-learning models.
- Model developers could run the classifier internally to audit whether fine-tuning has memorized private user data before deployment.
- Unlearning techniques might be evaluated by whether they reduce the detectable signature rather than only by downstream accuracy.
- The existence of the signature suggests that memorization is a low-level consequence of the training objective rather than a high-level architectural choice.
Load-bearing premise
The memorization pattern learned from transformer fine-tuning will generalize to any architecture trained by gradient descent on cross-entropy loss, even when the architectures share no computational mechanisms.
What would settle it
Train an autoregressive model with an optimization procedure other than gradient descent on cross-entropy loss and measure whether the LT-MIA classifier still achieves AUC above 0.9 on held-out fine-tuned examples from that model.
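The AUC bar in this proposed test is a rank statistic: the probability that a randomly chosen member example outscores a randomly chosen non-member. A minimal sketch of the computation (toy scores, not the paper's data):

```python
import numpy as np

def mia_auc(scores, labels):
    """AUC: probability a random member scores above a random non-member.

    Equivalent to the normalized Mann-Whitney U statistic.
    """
    members = scores[labels == 1]
    nonmembers = scores[labels == 0]
    # pairwise comparisons; ties count half
    greater = (members[:, None] > nonmembers[None, :]).sum()
    ties = (members[:, None] == nonmembers[None, :]).sum()
    return (greater + 0.5 * ties) / (len(members) * len(nonmembers))

# perfect separation; the proposed experiment's bar is AUC above 0.9
scores = np.array([0.9, 0.8, 0.7, 0.2, 0.1, 0.0])
labels = np.array([1, 1, 1, 0, 0, 0])
auc = mia_auc(scores, labels)  # -> 1.0
```

An uninformative attack (identical score distributions for members and non-members) lands at 0.5 under this definition, so 0.9 on a non-gradient-descent model would be strong evidence against the load-bearing premise.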
Original abstract
All prior membership inference attacks for fine-tuned language models use hand-crafted heuristics (e.g., loss thresholding, Min-K%, reference calibration), each bounded by the designer's intuition. We introduce the first transferable learned attack, enabled by the observation that fine-tuning any model on any corpus yields unlimited labeled data, since membership is known by construction. This removes the shadow model bottleneck and brings membership inference into the deep learning era: learning what matters rather than designing it, with generalization through training diversity and scale. We discover that fine-tuning language models produces an invariant signature of memorization detectable across architectural families and data domains. We train a membership inference classifier exclusively on transformer-based models. It transfers zero-shot to Mamba (state-space), RWKV-4 (linear attention), and RecurrentGemma (gated recurrence), achieving 0.963, 0.972, and 0.936 AUC respectively. Each evaluation combines an architecture and dataset never seen during training, yet all three exceed performance on held-out transformers (0.908 AUC). These four families share no computational mechanisms; their only commonality is gradient descent on cross-entropy loss. Even simple likelihood-based methods exhibit strong transfer, confirming the signature exists independently of the detection method. Our method, Learned Transfer MIA (LT-MIA), captures this signal most effectively by reframing membership inference as sequence classification over per-token distributional statistics. On transformers, LT-MIA achieves 2.8x higher TPR at 0.1% FPR than the strongest baseline. The method also transfers to code (0.865 AUC) despite training only on natural language texts. Code and trained classifier available at https://github.com/JetBrains-Research/learned-mia.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Learned Transfer MIA (LT-MIA), a learned membership inference attack for fine-tuned autoregressive language models. A classifier is trained exclusively on transformer models using per-token distributional statistics from fine-tuning runs (where membership is known by construction), then evaluated zero-shot on unseen architectures (Mamba, RWKV-4, RecurrentGemma) and datasets, reporting AUCs of 0.963, 0.972, and 0.936 respectively—exceeding held-out transformer performance (0.908 AUC). The work also reports transfer to code data (0.865 AUC) and a 2.8× improvement in TPR at 0.1% FPR over baselines on transformers, attributing the transferable signal to gradient descent on cross-entropy loss.
Significance. If the transfer results hold under the reported protocols, the paper demonstrates that fine-tuning produces an architecture-invariant memorization signature detectable by data-driven methods rather than hand-crafted heuristics. The unlimited labeled data from fine-tuning removes the shadow-model requirement and enables scaling via training diversity. Releasing code and the trained classifier is a clear strength that supports reproducibility and follow-on work. The observation that even simple likelihood methods transfer provides independent evidence for the signature's existence.
Major comments (1)
- [Experiments] Experiments section: the central claim that the signature's only necessary condition is gradient descent on cross-entropy loss (with no shared computational mechanisms across families) is not fully isolated. All evaluated models share comparable optimizer families, learning-rate schedules, and fine-tuning durations; without ablations that vary these factors while holding architecture fixed, the high zero-shot transfer (0.963–0.972 AUC) could be driven by shared optimization dynamics rather than the stated minimal commonality.
Minor comments (1)
- [Abstract] Abstract: the claim of 2.8× higher TPR at 0.1% FPR should include a parenthetical reference to the exact baseline and the section/table where the comparison is reported.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and the recommendation for minor revision. We address the major comment below.
Point-by-point responses
Referee: [Experiments] Experiments section: the central claim that the signature's only necessary condition is gradient descent on cross-entropy loss (with no shared computational mechanisms across families) is not fully isolated. All evaluated models share comparable optimizer families, learning-rate schedules, and fine-tuning durations; without ablations that vary these factors while holding architecture fixed, the high zero-shot transfer (0.963–0.972 AUC) could be driven by shared optimization dynamics rather than the stated minimal commonality.
Authors: We appreciate the referee's point on potential confounds. The models do employ comparable optimizers (primarily AdamW variants), learning-rate schedules, and fine-tuning durations, as is standard practice for each family. However, their core mechanisms remain fundamentally distinct: self-attention, selective state-space models, linear attention with recurrence, and gated recurrence share no computational primitives. The classifier, trained exclusively on transformers, transfers zero-shot and even exceeds held-out transformer performance, which we interpret as evidence that the signal originates from the shared gradient-descent process on cross-entropy loss. We agree that dedicated ablations varying only the optimizer or schedule while fixing the architecture would provide stronger isolation. We will add a concise limitations paragraph in the discussion section acknowledging this and proposing such ablations as future work.
Revision status: partial
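The ablation both sides agree on could be organized as a small sweep: hold the architecture fixed and vary only the optimization recipe, then check whether transfer AUC survives each cell. A hypothetical sketch (optimizer and schedule names are illustrative choices of ours, not drawn from the paper):

```python
from itertools import product

# Hypothetical ablation grid isolating optimization dynamics from
# architecture: fix the architecture, vary only optimizer and LR schedule,
# and re-measure LT-MIA transfer AUC in every cell.
optimizers = ["adamw", "sgd", "lion", "adafactor"]
schedules = ["cosine", "linear", "constant"]
grid = [{"arch": "transformer", "optimizer": opt, "schedule": sched}
        for opt, sched in product(optimizers, schedules)]
# 12 fine-tuning runs per dataset; stable AUC across cells would support
# the claim that the signature comes from the loss, not the optimizer.
```

If the signature degrades for, say, plain SGD but not AdamW, the "gradient descent on cross-entropy" premise would need to be narrowed to specific optimization dynamics.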
Circularity Check
No circularity: classifier trained on independently labeled fine-tune runs transfers to held-out architectures
Full rationale
The paper constructs labeled training data for the LT-MIA classifier by fine-tuning transformers on known corpora (membership known by construction from the training split). It then evaluates zero-shot transfer on entirely separate architectures (Mamba, RWKV-4, RecurrentGemma) and datasets never seen in classifier training, reporting AUCs of 0.963/0.972/0.936. No equation or claim reduces a prediction to a fitted parameter defined on the same data; the central result is an empirical generalization test whose inputs (training runs) are independent of the test models. No self-citations are load-bearing for the transfer claim, and no ansatz or uniqueness theorem is invoked to force the outcome.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: gradient descent on cross-entropy loss produces consistent memorization patterns across architectures that share no computational mechanisms.
Reference graph
Works this paper leans on
- [1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016.
- [2] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- [3] Aleksandar Botev, Soham De, Samuel L. Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, et al. RecurrentGemma: Moving past transformers for efficient open language models. 2024.
- [4] Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramer. Membership inference attacks from first principles. In IEEE Symposium on Security and Privacy, pages 1897–1914, 2022.
- [5] Yuetian Chen, Yuntao Du, Kaiyuan Zhang, Ashish Kundu, Charles Fleming, Bruno Ribeiro, and Ninghui Li. Window-based membership inference attacks against fine-tuned large language models. arXiv preprint arXiv:2601.02751.
- [6] Nolan Dey, Gurpreet Gosal, Zhiming Chen, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, and Joel Hestness. Cerebras-GPT: Open compute-optimal language models trained on the Cerebras wafer-scale cluster. arXiv preprint arXiv:2304.03208, 2023.
- [7] Michael Duan, Anshuman Suri, Niloofar Mireshghallah, Sewon Min, Weijia Shi, Luke Zettlemoyer, Yulia Tsvetkov, Yejin Choi, David Evans, and Hannaneh Hajishirzi. Do membership inference attacks work on large language models? arXiv preprint arXiv:2402.07841, 2024.
- [8] Gemma Team. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
- [9] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- [10] David Ilić, David Stanojević, and Kostadin Cvejoski. Powerful training-free membership inference against autoregressive language models. arXiv preprint arXiv:2601.12104.
- [11] Justus Mattern, Fatemehsadat Mireshghallah, Zhijing Jin, Bernhard Schölkopf, Mrinmaya Sachan, and Taylor Berg-Kirkpatrick. Membership inference attacks against language models via neighbourhood comparison. In Findings of the Association for Computational Linguistics: ACL 2023, pages 11330–11343, 2023.
- [12] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- [13] Rishabh Misra. News category dataset. arXiv preprint arXiv:2209.11429, 2022.
- [14] Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807, 2018.
- [15] Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
- [16] Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, et al. RWKV: Reinventing RNNs for the transformer era. 2023.
- [17] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
- [18] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
- [19] Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789, 2023.
- [20] Jingyang Zhang, Jingwei Sun, Eric Yeats, Yang Ouyang, Martin Kuo, Jianyi Zhang, Hao Yang, and Hai Li. Min-K%++: Improved baseline for detecting pre-training data from large language models. arXiv preprint arXiv:2404.02936, 2024.
- [21] Appendix A, Table 4 (excerpt): full results on held-out transformers, AUC (Tables 4–9 cover all model–dataset combinations).
  Model  Dataset   Loss   Min-K%++  Zlib   RefLoss  EZ-MIA  LT-MIA
  GPT-2  AG News   0.745  0.704     0.717  0.790    0.960   0.945
  GPT-2  WikiText  0.745  0.696     0.713  0.814    0.971   0.980
  GPT-2  XSum      0.768  0.719     0.760  0.956    0.994   0.991
  GPT-2  Code      0.618  (remaining rows truncated in source)
- [22] Appendix D, classifier architecture ablation (Table 11): all variants are trained on identical features from 30 model–dataset combinations (540,000 samples total); only the classifier architecture differs. Sequence modeling contributes 5.0 AUC points over pooling (0.925 vs. 0.… truncated in source).