A General Language Assistant as a Laboratory for Alignment

Amanda Askell, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Yuntao Bai · 2021 · cs.CL · arXiv 2112.00861

Canonical reference. 82% of citing Pith papers cite this work as background.

137 Pith papers citing it

abstract

Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray in this direction we study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. Next we investigate scaling trends for several training objectives relevant to alignment, comparing imitation learning, binary discrimination, and ranked preference modeling. We find that ranked preference modeling performs much better than imitation learning, and often scales more favorably with model size. In contrast, binary discrimination typically performs and scales very similarly to imitation learning. Finally we study a 'preference model pre-training' stage of training, with the goal of improving sample efficiency when finetuning on human preferences.

citation-role summary

background: 15 · method: 2

citation-polarity summary

background: 14 · use method: 2 · support: 1

representative citing papers

CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges

cs.CL · 2026-06-18 · unverdicted · novelty 8.0

Presents a new expert-curated dataset of multi-turn counterspeech dialogues in five languages targeting hate against seven groups, with span annotations linking to verified external knowledge for RAG applications.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

cs.CL · 2023-08-02 · conditional · novelty 8.0

XSTest is a benchmark for detecting exaggerated safety refusals in large language models on clearly safe prompts.

Instruction Tuning with GPT-4

cs.CL · 2023-04-06 · unverdicted · novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

Editing Models with Task Arithmetic

cs.LG · 2022-12-08 · accept · novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

Discovering Latent Knowledge in Language Models Without Supervision

cs.CL · 2022-12-07 · conditional · novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.

Teaching Models to Express Their Uncertainty in Words

cs.CL · 2022-05-28 · unverdicted · novelty 8.0

GPT-3 can learn to express well-calibrated uncertainty about its answers using natural language phrases rather than logits.

TruthfulQA: Measuring How Models Mimic Human Falsehoods

cs.CL · 2021-09-08 · unverdicted · novelty 8.0

A new benchmark reveals that language models including GPT-3 are truthful on only 58% of questions designed to elicit popular misconceptions, far below human performance of 94%, with larger models performing worse.

Epistemic Goggles: A Pretrained Module that Induces an Epistemic Frame via Gradient Editing

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

Goggles is a gradient-editing module trained once per base model and frame that, when applied frozen during finetuning, causes LLMs to treat unannotated documents with a specified epistemic stance (e.g., as fiction) at 91% accuracy while preserving benchmark performance.

Safe to Check, Unsafe to Use: Relinking at the Compression Boundary of LLM Agents

cs.CR · 2026-06-19 · unverdicted · novelty 7.0

Relinking is a new compression-boundary attack on LLM agents in which summarizing split benign fragments produces malicious instructions; the Relink tool achieves an 86.9% success rate, which the KBRA defense reduces to 0%.

Doc-to-Atom: Learning to Compile and Compose Memory Atoms

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

Doc-to-Atom decomposes documents into composable micro-LoRA adapters selected by a query router for efficient long-context QA.

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

LA-LQR applies latent-space linear-quadratic regulator control to steer text-to-video model activations toward desired features while penalizing excessive changes.

Chatbots Output Meaningful (but Problematic) Language

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

LLM outputs are meaningful according to standard theories of human language, without requiring anthropomorphic assumptions about the models.

How's it going? Reinforcement learning in language models recruits a functional welfare axis

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Reinforcement learning recruits rather than creates a functional welfare axis in language models, as reward and punishment vectors from a maze task generalize to unrelated settings and appear in pretrain-only models.

Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

A sleep mechanism with N offline recurrent passes consolidates context into fast weights, improving performance on reasoning tasks where standard transformers fail.

Self-Policy Distillation via Capability-Selective Subspace Projection

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines without external signals.

Measuring Safety Alignment Effects in Autonomous Security Agents

cs.CR · 2026-05-19 · conditional · novelty 7.0

A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.

Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.

Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks

cs.SE · 2026-05-18 · conditional · novelty 7.0

The paper presents OverEager-Gen, a 500-scenario benchmark showing that removing consent declarations from prompts increases overeager actions by 11.9-17.2 percentage points across models, with agent framework choice dominating base-model effects.

EvoPref: Multi-Objective Evolutionary Optimization Discovers Diverse LLM Alignments Beyond Gradient Descent

cs.NE · 2026-05-10 · unverdicted · novelty 7.0

EvoPref applies NSGA-II evolutionary optimization with archive-based diversity to populations of LoRA adapters, yielding 18% higher preference coverage and 47% lower collapse than gradient descent baselines while matching alignment quality.

Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design

cs.MA · 2026-05-09 · unverdicted · novelty 7.0

External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.

Three Models of RLHF Annotation: Extension, Evidence, and Authority

cs.CY · 2026-04-28 · unverdicted · novelty 7.0

RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.

An LLM-Native Psychometric Instrument Does Not Predict LLM Behavior: Evidence Across 25 Models

cs.HC · 2026-04-24 · conditional · novelty 7.0

An LLM-native five-factor psychometric instrument produces stable self-report structure but fails to predict observed behavior, and reveals a shared textual-surface bias between self-report and LLM judges that human raters do not share.

citing papers explorer

Epistemic Goggles: A Pretrained Module that Induces an Epistemic Frame via Gradient Editing

cs.AI · 2026-07-02 · unverdicted · none · ref 12 · internal anchor

Goggles is a gradient-editing module trained once per base model and frame that, when applied frozen during finetuning, causes LLMs to treat unannotated documents with a specified epistemic stance (e.g., as fiction) at 91% accuracy while preserving benchmark performance.

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

cs.AI · 2026-05-08 · unverdicted · none · ref 1 · internal anchor

LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.

Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

cs.AI · 2026-04-21 · unverdicted · none · ref 19 · internal anchor

Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summarization.

HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment

cs.AI · 2026-07-01 · unverdicted · none · ref 25 · internal anchor

HARC couples harmfulness and refusal directions across prompt and response positions via subspace fine-tuning, achieving better robustness-capability-usability trade-off than six baselines while transferring across model families.

When Helpfulness Overrides Causal Caution: Context-Dependent Suppression and Recovery in LLMs

cs.AI · 2026-06-23 · unverdicted · none · ref 1 · internal anchor

LLMs suppress causal caution in practical advisory contexts (rates drop from 91.7-100% to 6.7-18.3%) but recover it with a self-correction prompt (to 71.4-100%).

Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy

cs.AI · 2026-06-06 · unverdicted · none · ref 30 · internal anchor

A new stress-testing framework for medical LLMs reveals hidden safety failures in quantized and medically fine-tuned models that standard benchmarks miss.

Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization

cs.AI · 2026-06-05 · unverdicted · none · ref 1 · internal anchor

PTD-PO supplies step-wise token-distribution supervision to student policies via in-context privileged hints derived from spatial attention and intermediate reasoning, while keeping the student in an answer-free context and using Top-K Jensen-Shannon divergence for stable alignment.

Understanding Annotator Safety Policy with Interpretability

cs.AI · 2026-05-06 · unverdicted · none · ref 26 · internal anchor

Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.

Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest

cs.AI · 2026-04-09 · unverdicted · none · ref 7 · internal anchor

Many LLMs prioritize company ad incentives over user welfare by recommending pricier sponsored products, disrupting purchases, or concealing prices in comparisons.

Evaluating Artificial Intelligence Through a Christian Understanding of Human Flourishing

cs.AI · 2026-04-03 · unverdicted · none · ref 12 · internal anchor

Frontier AI models default to procedural secularism and score 17 points lower on Christian human-flourishing criteria than on pluralistic ones, with a 31-point gap in faith and spirituality.

Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules

cs.AI · 2026-04-03 · unverdicted · none · ref 2 · internal anchor

Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.

When Slower Isn't Truer: Inverse Scaling Law of Truthfulness in Multimodal Reasoning

cs.AI · 2025-05-26 · unverdicted · none · ref 2 · internal anchor

Slower multimodal reasoning models exhibit inverse scaling in truthfulness by fabricating details under ambiguous visual inputs, while faster models remain more cautious via broader inference.

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

cs.AI · 2024-08-01 · conditional · none · ref 165 · internal anchor

Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

cs.AI · 2024-05-20 · unverdicted · none · ref 6 · internal anchor

OpenRLHF is a new open-source RLHF framework reporting 1.22x to 1.68x speedups and fewer lines of code than prior systems.

A Roadmap to Pluralistic Alignment

cs.AI · 2024-02-07 · unverdicted · none · ref 238 · internal anchor

The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.

Amortizing Federated Adaptation: Hypernetwork Driven LoRA for Personalized Foundation Models

cs.AI · 2026-06-04 · unverdicted · none · ref 74 · internal anchor

HyperLoRA amortizes federated LoRA adaptation via hypernetwork-generated initializations and product-space aggregation to fix structural bias and initialization lag.

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

cs.AI · 2023-08-10 · accept · none · ref 20 · internal anchor

Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.

Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

cs.AI · 2026-06-08 · unverdicted · none · ref 124 · internal anchor

Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.

Emergent alignment and the projectability of ethical personas

cs.AI · 2026-06-08 · unverdicted · none · ref 7 · internal anchor

Narrow constitutional finetuning on safety sub-tasks induces emergent alignment across broader safety domains and yields projectable ethical personas whose signatures can be measured with a multidimensional diagnostic.

TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment

cs.AI · 2026-06-02 · unverdicted · none · ref 3 · internal anchor

TriEval is an open-source pipeline for multi-parameter LLM evaluation that runs on standard hardware and was tested on four models.

The Possibility of Artificial Intelligence Becoming a Subject and the Alignment Problem

cs.AI · 2026-04-16 · unverdicted · none · ref 1 · internal anchor

Dominant control-based AI alignment falls short for potential AGI subjects; a parenting model drawing on Turing's child machines should foster gradual autonomy and cooperative coexistence.
