X-Risk Analysis for AI Research
6 Pith papers cite this work.
citing papers explorer
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
  GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
- Ignore Previous Prompt: Attack Techniques For Language Models
  PromptInject shows that simple adversarial prompts can cause goal hijacking and prompt leaking in GPT-3, exploiting its stochastic behavior.
- Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem
  AI value alignment is reconceptualized as a pluralistic governance problem arising along three axes (objectives, information, and principals), making it inherently context-dependent and unsolvable by technical design alone.
- Designing Incident Reporting Systems for Harms from General-Purpose AI
  A framework with seven dimensions for AI incident reporting systems is developed from literature and case studies in safety-critical industries to guide institutional design choices.
- LLM-Safety Evaluations Lack Robustness
  LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.
- An Overview of Catastrophic AI Risks
  The paper categorizes sources of catastrophic AI risks into malicious use, AI race, organizational risks, and rogue AIs, providing illustrative stories and mitigation suggestions for each.