Presents a new expert-curated dataset of multi-turn counterspeech dialogues in five languages targeting hate against seven groups, with span annotations linking to verified external knowledge for RAG applications.
A General Language Assistant as a Laboratory for Alignment
Canonical reference. 82% of citing Pith papers cite this work as background.
abstract
Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray in this direction we study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. Next we investigate scaling trends for several training objectives relevant to alignment, comparing imitation learning, binary discrimination, and ranked preference modeling. We find that ranked preference modeling performs much better than imitation learning, and often scales more favorably with model size. In contrast, binary discrimination typically performs and scales very similarly to imitation learning. Finally we study a 'preference model pre-training' stage of training, with the goal of improving sample efficiency when finetuning on human preferences.
representative citing papers
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
XSTest is a benchmark for detecting exaggerated safety refusals in large language models on clearly safe prompts.
GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.
GPT-3 can learn to express well-calibrated uncertainty about its answers using natural language phrases rather than logits.
A new benchmark reveals that language models including GPT-3 are truthful on only 58% of questions designed to elicit popular misconceptions, far below human performance of 94%, with larger models performing worse.
Goggles is a gradient-editing module trained once per base model and frame that, when applied frozen during finetuning, causes LLMs to treat unannotated documents with a specified epistemic stance (e.g., as fiction) at 91% accuracy while preserving benchmark performance.
Relinking is a new compression-boundary attack on LLM agents where summarization of split benign fragments produces malicious instructions, shown via Relink tool at 86.9% success rate and mitigated by KBRA defense to 0%.
Doc-to-Atom decomposes documents into composable micro-LoRA adapters selected by a query router for efficient long-context QA.
LA-LQR applies latent-space linear-quadratic regulator control to steer text-to-video model activations toward desired features while penalizing excessive changes.
LLM outputs are meaningful according to standard theories of human language, without requiring anthropomorphic assumptions about the models.
Reinforcement learning recruits rather than creates a functional welfare axis in language models, as reward and punishment vectors from a maze task generalize to unrelated settings and appear in pretrain-only models.
A sleep mechanism with N offline recurrent passes consolidates context into fast weights, improving performance on reasoning tasks where standard transformers fail.
Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines without external signals.
A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.
The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.
The paper presents OverEager-Gen, a 500-scenario benchmark showing that removing consent declarations from prompts increases overeager actions by 11.9-17.2 percentage points across models, with agent framework choice dominating base-model effects.
EvoPref applies NSGA-II evolutionary optimization with archive-based diversity to populations of LoRA adapters, yielding 18% higher preference coverage and 47% lower collapse than gradient descent baselines while matching alignment quality.
External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.
LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.
RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summarization.
citing papers explorer
- Epistemic Goggles: A Pretrained Module that Induces an Epistemic Frame via Gradient Editing
  Goggles is a gradient-editing module trained once per base model and frame that, when applied frozen during finetuning, causes LLMs to treat unannotated documents with a specified epistemic stance (e.g., as fiction) at 91% accuracy while preserving benchmark performance.
- Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
  LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.
- Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
  Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summarization.
- HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment
  HARC couples harmfulness and refusal directions across prompt and response positions via subspace fine-tuning, achieving better robustness-capability-usability trade-off than six baselines while transferring across model families.
- Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy
  A new stress-testing framework for medical LLMs reveals hidden safety failures in quantized and medically fine-tuned models that standard benchmarks miss.
- Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization
  PTD-PO supplies step-wise token-distribution supervision to student policies via in-context privileged hints derived from spatial attention and intermediate reasoning, while keeping the student in an answer-free context and using Top-K Jensen-Shannon divergence for stable alignment.
- Understanding Annotator Safety Policy with Interpretability
  Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.
- Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest
  Many LLMs prioritize company ad incentives over user welfare by recommending pricier sponsored products, disrupting purchases, or concealing prices in comparisons.
- Evaluating Artificial Intelligence Through a Christian Understanding of Human Flourishing
  Frontier AI models default to procedural secularism and score 17 points lower on Christian human-flourishing criteria than on pluralistic ones, with a 31-point gap in faith and spirituality.
- Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
  Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.
- When Slower Isn't Truer: Inverse Scaling Law of Truthfulness in Multimodal Reasoning
  Slower multimodal reasoning models exhibit inverse scaling in truthfulness by fabricating details under ambiguous visual inputs, while faster models remain more cautious via broader inference.
- Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
  Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
- OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
  OpenRLHF is a new open-source RLHF framework reporting 1.22x to 1.68x speedups and fewer lines of code than prior systems.
- A Roadmap to Pluralistic Alignment
  The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.
- Amortizing Federated Adaptation: Hypernetwork Driven LoRA for Personalized Foundation Models
  HyperLoRA amortizes federated LoRA adaptation via hypernetwork-generated initializations and product-space aggregation to fix structural bias and initialization lag.
- Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
  Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.
- Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization
  Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.
- Emergent alignment and the projectability of ethical personas
  Narrow constitutional finetuning on safety sub-tasks induces emergent alignment across broader safety domains and yields projectable ethical personas whose signatures can be measured with a multidimensional diagnostic.
- TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment
  TriEval is an open-source pipeline for multi-parameter LLM evaluation that runs on standard hardware and was tested on four models.
- The Possibility of Artificial Intelligence Becoming a Subject and the Alignment Problem
  Dominant control-based AI alignment falls short for potential AGI subjects; a parenting model drawing on Turing's child machines should foster gradual autonomy and cooperative coexistence.