Fine-tuning on new knowledge induces propagating hallucinations in LLMs by weakening attention to key entities, with mitigation via reintroducing known knowledge during later training stages.
5 Pith papers cite this work.
citing papers explorer
- Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis and Interpretation
  Fine-tuning on new knowledge induces propagating hallucinations in LLMs by weakening attention to key entities, with mitigation via reintroducing known knowledge during later training stages.
- Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal
  Sparse autoencoders plus greedy filtering and factorization-machine interaction modeling identify minimal sets of features in Gemma-2-2B-IT and LLaMA-3.1-8B-IT whose ablation produces jailbreaks by flipping refusal to compliance.
- Uncovering the Internet's Hidden Values: An Empirical Study of Desirable Behavior Using Highly-Upvoted Content on Reddit
  LLM analysis of highly-upvoted Reddit comments yields 64-72 macro/meso/micro values per year; existing prosocial measures capture only 18% on average while the method also recovers and extends prior qualitative taxonomies.
- LifeAlign: Lifelong Alignment for Large Language Models with Memory-Augmented Focalized Preference Optimization
  LifeAlign uses focalized preference optimization and short-to-long memory consolidation via dimensionality reduction to let LLMs align with new preferences while retaining prior knowledge.
- The Ratchet Effect in Silico: How Interaction Drives Cumulative Intelligence in Large Language Models