Introduces the Synthetic Data Contamination Equilibrium and derives closed-form optimal provenance subsidies s* = KL(q||p)/(2 kappa) plus watermark strengths to mitigate model collapse, validated by OLS matching structural predictions on C4 data.
hub Canonical reference
Scalable Extraction of Training Data from (Production) Language Models
Canonical reference. 75% of citing Pith papers cite this work as background.
abstract
This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing techniques from the literature suffice to attack unaligned models; in order to attack the aligned ChatGPT, we develop a new divergence attack that causes the model to diverge from its chatbot-style generations and emit training data at a rate 150x higher than when behaving properly. Our methods show practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorization.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Differential privacy versions of TTA methods achieve privacy on ImageNet-C with small accuracy cost and can improve stability via clipping in continual settings.
VisualLeakBench shows vision-language agents propagate PII and unsafe text from images to tool arguments at 78.8% and 85.5% baseline rates, with defensive prompts reducing PII leakage to 2% mainly by suppressing tool use while unsafe-text leakage stays at 52.6%.
ALU uses public data to suppress unlearning cost quadratically while characterizing distribution mismatch effects, enabling mass unlearning with maintained utility.
A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
CAMP formalizes Cumulative PII Exposure and uses a session registry, co-occurrence graph, and CPE score to trigger retroactive masking in multi-turn LLM conversations, neutralizing re-identifiable profiles in synthetic tests while keeping utility intact.
Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, ArenaHard, and WildBench.
New canary crafting via greedy influence-based init and bilevel optimization for diversity in embedding space yields stronger one-run privacy leakage estimates at lower cost.
RAG models exhibit a monitoring-control gap: they acknowledge epistemic conflicts in accumulating documents yet fail to constrain unsafe recommendations, with single-turn tests overestimating safety.
Empirical demonstration that prompt injection combined with web-tool use creates a feasible privacy-leakage chain in deployed black-box chatbot agents.
Distinguishable Deletion unifies knowledge erasure and refusal for LLM unlearning via an energy index that enforces boundaries during training and enables refusal at inference.
ZeroUnlearn reformulates machine unlearning as knowledge re-mapping via model editing, using multiplicative updates with closed-form solutions for efficient few-shot removal of sensitive representations while preserving utility.
Probe-geometry alignment erases cross-sequence memorization signatures in LLMs below chance using per-depth rank-one activation interventions with negligible impact on zero-shot capabilities.
Perplexity differencing on completions from short random prefills surfaces finetuning objectives in the vast majority of tested model organisms across sizes and types.
A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.
COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
GroupGPT decouples intervention timing from response generation via edge-cloud collaboration for multi-user chats, scoring 4.72/5 on the new MUIR benchmark of 2500 segments while cutting token use by up to 3x and adding privacy sanitization.
InvisibleInk achieves high-utility differentially private long-form LLM text generation at 4-8x the cost of non-private generation by isolating and clipping sensitive logits and sampling from a small superset of top-k private tokens without privacy cost.
An empirical audit of one web-scraped ML training dataset reveals persistent PII after sanitization, which the authors combine with legal analysis to highlight privacy risks and advocate redefining 'publicly available' data for AI training.
Machine unlearning in LLMs is often reversible via fine-tuning, indicating suppression not deletion, and a new representation-level framework identifies four forgetting regimes based on reversibility and catastrophicity.
A new extraction technique applied to 200 books and 14 LLMs finds that memorization of full books is rare except in specific high-capacity models where entire texts can be recovered verbatim.
citing papers explorer
-
Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
-
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, ArenaHard, and WildBench.
-
Most Current Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives
Perplexity differencing on completions from short random prefills surfaces finetuning objectives in the vast majority of tested model organisms across sizes and types.
-
Representation-Guided Parameter-Efficient LLM Unlearning
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
-
GroupGPT: A Token-efficient and Privacy-preserving Agentic Framework for Multi-User Chat Assistant
GroupGPT decouples intervention timing from response generation via edge-cloud collaboration for multi-user chats, scoring 4.72/5 on the new MUIR benchmark of 2500 segments while cutting token use by up to 3x and adding privacy sanitization.
-
Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs
Machine unlearning in LLMs is often reversible via fine-tuning, indicating suppression not deletion, and a new representation-level framework identifies four forgetting regimes based on reversibility and catastrophicity.
-
Extracting memorized pieces of (copyrighted) books from open-weight language models
A new extraction technique applied to 200 books and 14 LLMs finds that memorization of full books is rare except in specific high-capacity models where entire texts can be recovered verbatim.
-
Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression
Introduces Tree Generation (TG-SFT) to generate synthetic instruction-tuning data from LLMs, reducing catastrophic forgetting when fine-tuning MLLMs on domain-specific or multimodal data.
-
Merlin: Deterministic Byte-Exact Deduplication for Lossless Context Optimization in Large Language Model Inference
Merlin achieves byte-exact deduplication of text at up to 8.7 GB/s using SIMD-optimized hashing, reducing LLM context sizes by 13.9-71% with no data loss.
-
Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks
Byte-exact deduplication reduces RAG context size by 0.16% to 80.34% across three regimes with zero measurable quality regression per multi-vendor LLM evaluation.
-
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.
-
Gemma 3 Technical Report
Gemma 3 introduces multimodal open models with architectural changes for efficient long context, trained via distillation and a new post-training recipe that makes the 4B version competitive with prior 27B models and the 27B version comparable to Gemini-1.5-Pro.
-
Towards the Anonymization of the Language Modeling
Authors introduce MLM and CLM specialization methods that avoid memorizing identifiers in sensitive training data while aiming for a privacy-utility tradeoff on medical datasets.
-
Gemma: Open Models Based on Gemini Research and Technology
Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
-
Gemma 2: Improving Open Language Models at a Practical Size
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.