Yongchan Kwon, Eric Wu, Kevin Wu, and James Zou
2 Pith papers cite this work. Polarity classification is still indexing.
Pith papers citing it: 2
fields: cs.LG (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers
PRISM weights target examples by model preference to build an improved direction for influence-based data selection in LLM fine-tuning.
citing papers explorer
- Symbolic Mechanistic Data Attribution: Tracing Training Influence to Learned Behavioral Policies
SMDA fits ridge regression on SAE features to distill symbolic policies then decomposes each SFT example's influence via feature-activation and output-probability deltas, demonstrated on refusal behavior in Llama-3.2-3B-Instruct.
- PRISM: Preference-Aware Influence Function Based Data Selection Method for Efficient Fine-Tuning
PRISM weights target examples by model preference to build an improved direction for influence-based data selection in LLM fine-tuning.