pith. sign in

hub Canonical reference

Artificial Intelligence, Values and Alignment

Canonical reference. 88% of citing Pith papers cite this work as background.

19 Pith papers citing it
572 external citations · Crossref
Background 88% of classified citations
abstract

This paper looks at philosophical questions that arise in the context of AI alignment. It defends three propositions. First, normative and technical aspects of the AI alignment problem are interrelated, creating space for productive engagement between people working in both domains. Second, it is important to be clear about the goal of alignment. There are significant differences between AI that aligns with instructions, intentions, revealed preferences, ideal preferences, interests and values. A principle-based approach to AI alignment, which combines these elements in a systematic way, has considerable advantages in this context. Third, the central challenge for theorists is not to identify 'true' moral principles for AI; rather, it is to identify fair principles for alignment, that receive reflective endorsement despite widespread variation in people's moral beliefs. The final part of the paper explores three ways in which fair principles for AI alignment could potentially be identified.

hub tools

citation-role summary

background 8

citation-polarity summary

roles

background 8

polarities

background 7 support 1

representative citing papers

A Roadmap to Pluralistic Alignment

cs.AI · 2024-02-07 · unverdicted · novelty 6.0

The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

Ethical and social risks of harm from Language Models

cs.CL · 2021-12-08 · accept · novelty 6.0

The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.

How Value Induction Reshapes LLM Behaviour

cs.CL · 2026-05-08 · unverdicted · novelty 4.0

Inducing targeted values in LLMs through fine-tuning causes spillover to related or opposing values, boosts safety metrics, and increases anthropomorphic and sycophantic language across all tested values.

Open Problems in Frontier AI Risk Management

cs.LG · 2026-04-28 · unverdicted · novelty 3.0

The paper maps unresolved challenges in frontier AI risk management, classifies them into lack of consensus, framework misalignment, or implementation shortfalls, and identifies actors best positioned to address each.

citing papers explorer

Showing 19 of 19 citing papers.