X-risk analysis for ai research

Dan Hendrycks, Mantas Mazeika · 2022 · arXiv 2206.05862

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 3 other 1

citation-polarity summary

background 2 support 1 unclear 1

representative citing papers

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

cs.LG · 2022-11-01 · conditional · novelty 8.0

GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.

Ignore Previous Prompt: Attack Techniques For Language Models

cs.CL · 2022-11-17 · unverdicted · novelty 6.0

PromptInject shows that simple adversarial prompts can cause goal hijacking and prompt leaking in GPT-3, exploiting its stochastic behavior.

Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem

cs.CY · 2026-04-22 · unverdicted · novelty 5.0

AI value alignment is reconceptualized as a pluralistic governance problem arising along three axes—objectives, information, and principals—making it inherently context-dependent and unsolvable by technical design alone.

Designing Incident Reporting Systems for Harms from General-Purpose AI

cs.CY · 2025-11-08 · conditional · novelty 4.0

A framework with seven dimensions for AI incident reporting systems is developed from literature and case studies in safety-critical industries to guide institutional design choices.

LLM-Safety Evaluations Lack Robustness

cs.CR · 2025-03-04 · unverdicted · novelty 4.0

LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.

An Overview of Catastrophic AI Risks

cs.CY · 2023-06-21 · accept · novelty 3.0

The paper categorizes sources of catastrophic AI risks into malicious use, AI race, organizational risks, and rogue AIs, providing illustrative stories and mitigation suggestions for each.

citing papers explorer

Showing 6 of 6 citing papers.

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small cs.LG · 2022-11-01 · conditional · none · ref 8
GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
Ignore Previous Prompt: Attack Techniques For Language Models cs.CL · 2022-11-17 · unverdicted · none · ref 9
PromptInject shows that simple adversarial prompts can cause goal hijacking and prompt leaking in GPT-3, exploiting its stochastic behavior.
Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem cs.CY · 2026-04-22 · unverdicted · none · ref 42
AI value alignment is reconceptualized as a pluralistic governance problem arising along three axes—objectives, information, and principals—making it inherently context-dependent and unsolvable by technical design alone.
Designing Incident Reporting Systems for Harms from General-Purpose AI cs.CY · 2025-11-08 · conditional · none · ref 9
A framework with seven dimensions for AI incident reporting systems is developed from literature and case studies in safety-critical industries to guide institutional design choices.
LLM-Safety Evaluations Lack Robustness cs.CR · 2025-03-04 · unverdicted · none · ref 27
LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.
An Overview of Catastrophic AI Risks cs.CY · 2023-06-21 · accept · none · ref 102
The paper categorizes sources of catastrophic AI risks into malicious use, AI race, organizational risks, and rogue AIs, providing illustrative stories and mitigation suggestions for each.

X-risk analysis for ai research

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer