A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.
arXiv preprint arXiv:2402.07841 , year=
12 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
MRMMIA is a multi-recall-probe membership inference attack that extracts signals from chat agent memory and outperforms baselines in black-, gray-, and white-box settings.
DistractMIA performs output-only black-box membership inference on vision-language models by inserting semantic distractors and measuring shifts in generated text responses.
Controlled experiments across 96 LoRA adapters show that reduced optimizer updates explain nearly all observed memorization drops in DP-SGD fine-tuning, HMAC pseudonymization cuts exposure 40-61% without creating new targets, and 1-3B models achieve only 0.19-0.28 F1 under the tested budget.
Black-box membership inference on text-to-music models reaches up to 98.6% accuracy by training an auditor on semantic alignment patterns extracted from shadow-model generations.
TRACER presents a semantic-aware framework and the first benchmark for fine-grained code contamination detection across three levels of overlap, reporting F1 scores of 0.91-0.92 and large gains over prior methods.
Distinguishable Deletion unifies knowledge erasure and refusal for LLM unlearning via an energy index that enforces boundaries during training and enables refusal at inference.
Discriminative factorization distinguishes high-quality query sets for black-box model classification, with chance-level error decaying exponentially in query budget and parameters predicting empirical decay rates on auditing tasks.
PopQuiz Attack infers LLM training data membership by turning examples into quiz questions and measuring answer accuracy, reaching 0.873 average ROC-AUC across six models and outperforming prior methods by 20.6%.
CoLA reveals that subset training creates new privacy leakage surfaces via side-channel metadata and model outputs, enabling training-membership and selection-participation membership inference attacks.
RSA prompting enables LLMs to automatically create functional exploits for CVEs in Odoo ERP, succeeding on all tested cases in 3-5 rounds and removing the need for manual effort.
CatShift detects training data membership in LLMs by comparing output shifts induced by fine-tuning on member versus non-member data, relying on catastrophic forgetting without requiring logit access.
citing papers explorer
-
TRACER: A Semantic-Aware Framework for Fine-Grained Contamination Detection in Code LLMs
TRACER presents a semantic-aware framework and the first benchmark for fine-grained code contamination detection across three levels of overlap, reporting F1 scores of 0.91-0.92 and large gains over prior methods.
-
From Rookie to Expert: Manipulating LLMs for Automated Vulnerability Exploitation in Enterprise Software
RSA prompting enables LLMs to automatically create functional exploits for CVEs in Odoo ERP, succeeding on all tested cases in 3-5 rounds and removing the need for manual effort.