Is power-seeking AI an existential risk?
9 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: background (2)
citing papers explorer
- Most Current Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives
  Perplexity differencing on completions from short random prefills surfaces finetuning objectives in the vast majority of tested model organisms across sizes and types.
- AI Loss of Control Incident Management: Response & Resilience
  Presents a taxonomy for AI loss of control incident management that distinguishes extremely costly versus impossible regaining of control and accidental versus adversarial scenarios.
- Reframing AGI Confrontation with Off Earth Autonomy
  An off-Earth autonomy pathway can reduce AGI confrontation incentives by making early cooperation preferable to power-seeking on Earth.
- Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance
  AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.
- AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions
  The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.
- Civilizational Metamaterials: Engineering Coordination Under Capability Gradients and Structural Turbulence
  Introduces phenomenological model R_eff = β(1-ρ)(1-τ)(1-γρτ) for coordination under AGI decision velocity, with phase transition and proposed randomized trial.
- Position: Anthropomorphic Misalignment Research Needs Stronger Evidence
  Position paper calling for stronger evidentiary standards and a diagnostic checklist in anthropomorphic misalignment research.
- Deconstructing Superintelligence: Identity, Self-Modification and Différance
- Cognitive Comparability and the Limits of Governance: Evaluating Authority Under Radical Capability Asymmetry