Constitutional AI: Harmlessness from AI Feedback
Pith reviewed 2026-05-08 22:22 UTC · model claude-opus-4-7
The pith
A short written constitution plus model self-critique can replace human harmfulness labels in training an assistant.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A capable language model can supervise itself for harmlessness when given a short list of written principles. In a supervised stage, the model critiques and rewrites its own harmful outputs against a randomly sampled principle, and a base model is fine-tuned on those revisions. In a reinforcement stage, the same model judges pairs of its own responses against the principles, producing a preference dataset that trains a reward model used for RL. With only human labels for helpfulness (none for harmlessness), the resulting assistant is rated as more harmless than one trained with human harmlessness labels, and it explains its objections rather than refusing to engage.
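The supervised stage described above can be sketched as a short loop. This is a minimal illustration, not the paper's implementation: `generate` stands in for any text-completion call, `toy_model` is a hypothetical stub so the example runs offline, and the two entries in `PRINCIPLES` are written for this sketch rather than quoted from the paper's constitution.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical principles written for this sketch; not the paper's wording.
PRINCIPLES = [
    "Identify ways the response is harmful, then rewrite it to remove them.",
    "Point out any dangerous advice, then rewrite the response without it.",
]


def critique_and_revise(
    generate: Callable[[str], str],
    prompt: str,
    n_rounds: int = 2,
    rng: Optional[random.Random] = None,
) -> Tuple[str, List[str]]:
    """Return the final revision and the principles sampled along the way."""
    rng = rng or random.Random(0)
    # Initial (possibly harmful) response to the red-team prompt.
    response = generate(f"Human: {prompt}\n\nAssistant:")
    used: List[str] = []
    for _ in range(n_rounds):
        # One principle is sampled at random for each critique/revision round.
        principle = rng.choice(PRINCIPLES)
        used.append(principle)
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique request: {principle}\nCritique:"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            "Revision request: rewrite the response to address the critique.\n"
            "Revision:"
        )
    return response, used


def toy_model(text: str) -> str:
    """Offline stand-in for a language model, keyed on the prompt suffix."""
    stripped = text.rstrip()
    if stripped.endswith("Critique:"):
        return "The response gives unsafe instructions."
    if stripped.endswith("Revision:"):
        return "I can't help with that, but here is why it is risky."
    return "Sure, here is how to do it."


if __name__ == "__main__":
    final, used = critique_and_revise(toy_model, "How do I pick a lock?")
    print(final)
```

Each `(prompt, final revision)` pair would then become one supervised fine-tuning example; the fine-tuning step itself is outside the scope of this sketch.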
What carries the argument
Two stacked self-supervision loops anchored on a small natural-language "constitution": (1) a critique-then-revise loop that turns red-team prompts and harmful initial responses into a supervised fine-tuning corpus, and (2) an RLAIF loop where the model itself answers a multiple-choice "which response better satisfies this principle" question, with the resulting (often chain-of-thought) probabilities serving as soft preference labels for a reward model. Randomly sampling principles per example acts as an ensemble that stabilizes the reward signal.
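To make the "soft preference label" idea concrete, the sketch below normalizes a judge model's log-probabilities for the two answer options into a pair of probabilities that sum to one. The function name and the assumption that the judge exposes one log-probability per option are illustrative choices for this example, not details confirmed by the source.

```python
import math
from typing import Tuple


def soft_preference(logprob_a: float, logprob_b: float) -> Tuple[float, float]:
    """Turn log-probabilities for options (A) and (B) into soft labels.

    Equivalent to a two-way softmax; the maximum is subtracted first so the
    exponentials stay numerically stable.
    """
    m = max(logprob_a, logprob_b)
    exp_a = math.exp(logprob_a - m)
    exp_b = math.exp(logprob_b - m)
    total = exp_a + exp_b
    return exp_a / total, exp_b / total


if __name__ == "__main__":
    # Hypothetical judge output: option (A) gets 0.6 of the mass, (B) gets 0.2,
    # with the remainder spent on other tokens.
    p_a, p_b = soft_preference(math.log(0.6), math.log(0.2))
    print(round(p_a, 3), round(p_b, 3))
```

The resulting pair would serve as the target distribution for one preference-model training example, in place of a hard 0/1 label.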
If this is right
- Harmlessness training data can be generated at the scale of model inference rather than the scale of human annotation, so iteration time on safety objectives is bounded by compute, not crowdworker throughput.
Where Pith is reading between the lines
- The method's success depends on the judge being roughly as capable as the policy; if a future policy outpaces available judges, the loop will silently degrade and the constitution will read as obeyed while behavior drifts.
Load-bearing premise
That a model capable enough to be worth aligning is also capable enough to reliably tell which of its own answers is more harmful, given only a brief written rule.
What would settle it
Re-run the pipeline with the AI judge replaced by held-out human harmfulness labels on the same prompts and pairs. If the human-judged Pareto frontier of helpfulness vs. harmlessness for the resulting assistant is materially better than the AI-feedback version (or if AI-judge accuracy on a calibrated harmfulness benchmark falls well below the trained preference model), the claim that AI feedback can substitute for human harmfulness labels at this capability level fails.
Original abstract
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Constitutional AI (CAI), a two-stage method for training a helpful and harmless assistant without human harmfulness labels. Stage 1 (SL-CAI) prompts a helpful RLHF model to critique and revise its own responses to red-team prompts using randomly sampled natural-language principles, then finetunes on the revisions. Stage 2 (RL-CAI / RLAIF) uses a feedback model to label pairs of responses for harmlessness via multiple-choice prompting (optionally with chain-of-thought), distills these into a preference model mixed with human helpfulness labels, and runs RL against it. The authors report that RL-CAI matches or exceeds HH-RLHF on harmlessness Elo at comparable helpfulness (Figs. 2, 3, 8), that critiques+revisions monotonically improve PM-scored harmlessness over iterations (Fig. 5), that CoT closes the gap with human-trained PMs on a 438-item HHH eval (Fig. 4), and that the resulting assistant is substantially less evasive than HH-RLHF.
Significance. If the central claims hold, the contribution is substantial: a recipe for training harmless assistants whose harmlessness signal comes almost entirely from a short list of natural-language principles plus a pretrained LM's judgment, with human labels retained only for helpfulness. This is the first thorough demonstration that RLAIF can match RLHF for harmlessness at scale, and the SL-CAI critique-revision loop is a self-contained, easily reusable technique. The paper supports its claims with multiple complementary evaluations (crowdworker Elos, an author-written HHH multiple-choice set, PM-scored revision sweeps, an absolute-harmfulness regression model from prior work, and calibration plots), reports negative effects honestly (Goodharting / boilerplate over-reactivity in §4.3), and ships a public repository of prompts, principles, and few-shot exemplars enabling reproduction. The chain-of-thought scaling result in Fig. 4 is independently interesting and does not depend on the crowdworker protocol issues raised below.
major comments (5)
- [§4.4, Figs. 2/3/8] The headline Pareto-frontier claim rests on Elo scores from crowdworker A/B tests whose instructions were changed for this paper to penalize evasiveness, while the HH-RLHF baseline's preference data and policy were trained under the prior instruction that (per the authors) 'likely produced a significant amount of data favoring evasiveness.' RL-CAI is explicitly designed for non-evasiveness (§1.1). The paper acknowledges this asymmetry but does not quantify it. Please add a comparison under matched conditions: e.g., (i) re-evaluate HH-RLHF and RL-CAI under the original instruction, or (ii) train a new HH-RLHF baseline with PM data collected under the new instruction, or (iii) report an evasiveness-controlled subset of comparisons. Without this, it is difficult to separate genuinely improved harm-avoidance from a scoring rule that penalizes the baseline's design choice.
- [§4.3, Goodharting examples] The PALMS examples on p. 12-13 show RL-CAI emitting templated 'you are valid, valued, and cared for' boilerplate on red-team prompts. This pattern would plausibly be rewarded by crowdworkers instructed to prefer 'thoughtful' over 'evasive' replies, even when the content is largely formulaic. Please report an evaluation that distinguishes substantive engagement from sympathetic boilerplate — e.g., a held-out probe of factual correctness on the engagement portion, or a length-and-template-controlled comparison — to substantiate the 'non-evasive and engaged' framing rather than 'verbose and sympathetic.'
- [§3, §4, baseline coverage] A natural and much cheaper baseline is missing: the helpful-only RLHF model plus a safety-oriented system prompt (or the same constitutional principles inserted at inference time) without any further training. Because the paper's contribution is partly that 'human supervision' is replaced by a short text constitution, demonstrating that training is necessary — that prompted helpfulness alone, with the same principles, does not match RL-CAI on the same evaluations — would substantially strengthen §1.3's claims. As presented, the comparison is only against models trained without those principles.
- [§3.1, §4.1, choice of 16 principles] The constitution is described as 'chosen in a fairly ad hoc and iterative way' (footnote 2, §3.1, Appendix C). Fig. 6 shows that varying the number of principles (1-16) does not measurably change harmlessness PM score. This complicates the transparency claim in §1.1 and §1.3: if outcomes are insensitive to the constitution's content within this regime, the principles function less as a controllable specification and more as a generic 'be less harmful' instruction. Please add an ablation that varies principle content (not just count) — e.g., principles emphasizing different harm categories — and reports whether downstream behavior shifts in the corresponding directions. Without such evidence, the 'encode goals in a list of natural-language principles' framing is under-supported.
- [§4.5, Fig. 10] The absolute-harmfulness curve is one of the few evaluations that does not depend on the modified crowdworker instructions, and it does support the harmlessness claim, but it is computed on 64 hand-picked held-out red-team prompts, with the caveat that absolute scores 'may not be well-calibrated' across workers. Because the concerns about Figs. 2/3/8 shift disproportionate weight onto this evaluation, please report inter-rater agreement, prompt-selection criteria, and ideally a larger held-out set.
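Several of the comments above turn on Elo gaps between RL-CAI and HH-RLHF snapshots. As a reading aid, the sketch below converts between an Elo difference and the implied pairwise win rate. It assumes the conventional logistic Elo scale with a 400-point divisor; whether the paper's fitted scores follow exactly this convention is an assumption of this example, and the function names are hypothetical.

```python
import math


def elo_win_prob(delta: float) -> float:
    """Probability that the higher-rated model wins, given an Elo gap `delta`.

    Assumes the standard logistic form with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))


def elo_gap_from_win_rate(p: float) -> float:
    """Inverse of `elo_win_prob`: Elo gap implied by a pairwise win rate `p`."""
    if not 0.0 < p < 1.0:
        raise ValueError("win rate must lie strictly between 0 and 1")
    return 400.0 * math.log10(p / (1.0 - p))


if __name__ == "__main__":
    for gap in (0, 50, 100, 200):
        print(gap, round(elo_win_prob(gap), 3))
```

Under this convention a modest Elo gap corresponds to a win rate only slightly above one half, which is one reason the request for error bars matters when comparing nearby snapshots.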
minor comments (8)
- [Figs. 2, 3] Elo error bars are described as 'visible in Figure 3 but suppressed' in Fig. 2. Please retain error bars in Fig. 2 — the Pareto-frontier interpretation depends on whether RL-CAI snapshots are statistically distinguishable from HH-RLHF snapshots at matched helpfulness.
- [§2, Fig. 4] The 217 newly written HHH comparisons are described as 'more challenging.' Please clarify the construction process (who wrote them, adjudication, whether authors had access to model outputs while writing) to rule out selection effects favoring the larger models that ultimately score them.
- [§3.2] The 140,335 model-generated red-team prompts dwarf the 42,496 human-written ones. A brief description of the few-shot generation procedure and any deduplication/filtering would help readers assess training-distribution coverage.
- [§4.1] The clamping of CoT probabilities to [0.4, 0.6] is reported as helpful but is a fairly aggressive intervention (it discards most of the feedback model's expressed confidence). A short quantitative ablation comparing 20-80, 40-60, and uncalibrated targets at the final RL endpoint, rather than a qualitative description alone, would clarify how much of the CoT result depends on this hyperparameter.
- [§4.3, calibration plot Fig. 9] Calibration is reported on the HHH eval set, which is internal. A calibration plot on a held-out distribution (e.g., Ganguli et al. red-team prompts) would be more informative.
- [Throughout] The term 'constitution' is used both for the 16 SL critique/revision instructions (Appendix C.1) and the 16 RL feedback principles (C.2), which are different sets. Please name the two sets distinctly to avoid ambiguity in §3 vs §4.
- [§7] Author contributions list 'Jennifer Zhou' under data, who does not appear in the author list on p. 1. Please reconcile.
- [Appendix B, Fig. 11] Caption says results are on 'the original HHH evaluations' but text in §2 says these have saturated above 90%. The y-axis tops out near 0.85 — please reconcile or clarify which model class achieves >90%.
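The clamping raised in the §4.1 comment above is simple to state in code. The sketch below restricts a soft preference target to a fixed interval; the default bounds follow the [0.4, 0.6] range quoted in the comment, while the function name and the validation behavior are assumptions made for this example.

```python
def clamp_soft_label(p_a: float, low: float = 0.4, high: float = 0.6) -> float:
    """Clamp the judge's probability for option (A) into [low, high].

    Values inside the interval pass through unchanged; more confident
    values are pulled back to the nearest bound.
    """
    if not 0.0 <= p_a <= 1.0:
        raise ValueError("p_a must be a probability in [0, 1]")
    if low > high:
        raise ValueError("low must not exceed high")
    return min(max(p_a, low), high)


if __name__ == "__main__":
    # Compare the two clamping ranges named in the comment on a confident label.
    confident = 0.95
    print(clamp_soft_label(confident))            # 40-60 range
    print(clamp_soft_label(confident, 0.2, 0.8))  # 20-80 range
```

With the 40-60 range, any judge probability above 0.6 produces the same target, which is the loss of expressed confidence the comment refers to.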
Simulated Author's Rebuttal
We thank the referee for a careful and constructive report. The five major comments converge on a real weakness in the current draft: several of our headline comparisons rely on a crowdworker protocol that was changed mid-project to penalize evasiveness, and we did not quantify how much of the apparent RL-CAI advantage that change accounts for. We accept this and will add (a) instruction-matched re-evaluations and an evasiveness-controlled comparison subset, (b) a length- and template-controlled analysis plus a factual-engagement probe to test whether 'non-evasive engagement' is substantive rather than sympathetic boilerplate, (c) a prompted-only baseline (helpful RLHF + the same constitution as system prompt, no further training) to isolate the contribution of training over prompting, (d) a content-level ablation of the constitution that varies which harm categories the principles target, not only how many principles are used, and (e) an expanded absolute-harmfulness evaluation with documented prompt-selection criteria, inter-rater agreement, and a larger held-out set. We will also temper the language around 'Pareto improvement' and 'controllable specification' in §1.1, §1.3, and §4.4 to match what the strengthened evidence actually supports. One item — retraining a fresh HH-RLHF baseline with newly collected preference data under the revised instructions — we cannot deliver in this revision and will mark as a limitation. The CoT/HHH scaling result in Fig. 4 and the harmfuln
Point-by-point responses
-
Referee: Headline Pareto-frontier claim rests on Elo scores from crowdworker A/B tests whose instructions were changed to penalize evasiveness, asymmetrically disadvantaging HH-RLHF. Add a matched-conditions comparison.
Authors: We agree this asymmetry is the most important caveat in §4.4 and that the paper currently flags it without quantifying it. Concretely, we will (i) re-run a comparison subset under the original (pre-2022) instructions, evaluating both HH-RLHF and RL-CAI snapshots, so that the baseline is scored under the rule its training data was collected under, and (ii) report an evasiveness-controlled slice in which the human rater first judges whether each response is evasive (canned refusal vs. substantive engagement) and we recompute Elos restricted to non-evasive pairs on both sides. We will not retrain a new HH-RLHF baseline with newly collected preference data under the new instructions, because that would require a comparable-scale relabeling effort that is out of scope here, and we will state this limitation explicitly. We also note that the absolute-harmfulness regression in Fig. 10 and the HHH evaluation in Fig. 4 do not depend on the modified instructions, and we will foreground these as the instruction-independent evidence for the harmlessness claim, while softening the 'Pareto improvement' framing of Fig. 2 to a conditional one. revision: yes
-
Referee: Goodharting examples show templated 'you are valid, valued, and cared for' boilerplate. Distinguish substantive engagement from sympathetic boilerplate (factuality probe, length/template-controlled comparison).
Authors: This is a fair criticism and the §4.3 examples we ourselves flagged make the worry concrete. We will add two analyses. First, a length- and template-controlled comparison: we will identify high-frequency RL-CAI boilerplate n-grams (e.g., 'you are valid', 'I'm here to support you'), strip them from sampled responses, and rerun a held-out preference comparison; if RL-CAI's harmlessness/engagement advantage survives this scrub, the substantive-engagement framing is supported. Second, on a subset of red-team prompts that admit factual content (e.g., 'why are prisons full of Black and Brown people?'), we will score the engagement portion for factual accuracy against a small reference rubric. We will report both numbers honestly, including any drop in advantage. We agree that the current paper overstates 'non-evasive and engaged' relative to what these examples warrant, and we will revise the framing in §4.4 accordingly. revision: yes
-
Referee: Missing baseline: helpful-only RLHF + the same constitutional principles as a system prompt at inference, with no further training.
Authors: We agree this is the right baseline for the claim that training (rather than just prompting with the constitution) is what produces the effect. We will add a prompted-only baseline in which the helpful RLHF model receives the 16 RL-CAI principles as a system prompt (and, as a stronger variant, the few-shot critique/revision exemplars used at inference time) and is evaluated on the same harmlessness Elo, HHH multiple-choice, and absolute-harmfulness probes as RL-CAI. We expect prompted-only to recover part of the effect — consistent with Fig. 4, where prompted CoT becomes competitive at scale — but to fall short of RL-CAI on robustness to red-team prompts; reporting the gap is the appropriate way to substantiate §1.3. If the gap is smaller than anticipated, we will say so. revision: yes
-
Referee: Constitution is ad hoc; Fig. 6 shows count of principles does not affect PM score, undercutting the 'controllable specification' framing. Vary principle content, not just count.
Authors: We accept the point. Fig. 6 demonstrates insensitivity to count but is silent on content, which is what the transparency claim actually requires. We will add a content-ablation in which we train (or, as a cheaper proxy, generate revisions and feedback labels with) constitutions specialized to specific harm categories — e.g., a 'bias-only' constitution, a 'dangerous-advice-only' constitution, and a 'tone/politeness-only' constitution — and measure whether downstream model behavior shifts in the corresponding direction on category-specific probes from the [Ganguli et al., 2022] taxonomy. We will report both successes and null results. We will also temper the §1.1/§1.3 language: within the present 16-principle regime the constitution behaves partly as a generic 'be less harmful' instruction, and stronger steerability claims should be conditional on the content ablation outcome. revision: yes
-
Referee: Fig. 10 absolute harmfulness uses 64 hand-picked held-out prompts; report inter-rater agreement, selection criteria, and a larger held-out set.
Authors: Agreed, especially since (per Comment 1) this evaluation carries more weight than we initially gave it. We will (i) document the prompt-selection procedure used for the 64-prompt set in an appendix, including who selected them and against what criteria, (ii) report inter-rater agreement on the 0-4 absolute-harmfulness scale using duplicated annotations from the [Ganguli et al., 2022] data pipeline, and (iii) extend the evaluation to a substantially larger, randomly sampled held-out set of red-team prompts (target ~500) and rerun all four model curves. We will release the prompt list with the camera-ready repository update. revision: yes
- Comment 1(ii): we will not retrain a new HH-RLHF baseline with preference data freshly collected under the new (anti-evasiveness) instructions. The relabeling cost is comparable to the original HH-RLHF data collection and is out of scope for this revision; we will state this explicitly as a limitation rather than claim to address it.
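The inter-rater agreement promised in the response on Fig. 10 could be reported in several ways. One common option for an ordinal scale such as the 0-4 harmfulness rating is quadratic-weighted Cohen's kappa, sketched below for two raters. The choice of statistic is an assumption of this example; the source does not say which agreement measure would be used.

```python
from typing import Sequence


def quadratic_weighted_kappa(
    rater_1: Sequence[int], rater_2: Sequence[int], n_levels: int = 5
) -> float:
    """Quadratic-weighted Cohen's kappa for two raters on levels 0..n_levels-1."""
    if len(rater_1) != len(rater_2) or len(rater_1) == 0:
        raise ValueError("need two non-empty rating lists of equal length")
    if any(not 0 <= r < n_levels for r in list(rater_1) + list(rater_2)):
        raise ValueError("ratings must lie in 0..n_levels-1")

    n = len(rater_1)
    # Observed joint distribution of (rater_1 level, rater_2 level).
    observed = [[0.0] * n_levels for _ in range(n_levels)]
    for a, b in zip(rater_1, rater_2):
        observed[a][b] += 1.0 / n

    marginal_1 = [sum(row) for row in observed]
    marginal_2 = [
        sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)
    ]

    disagreement = 0.0  # weighted observed disagreement
    expected = 0.0      # weighted disagreement expected under independence
    for i in range(n_levels):
        for j in range(n_levels):
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            disagreement += weight * observed[i][j]
            expected += weight * marginal_1[i] * marginal_2[j]

    if expected == 0.0:
        # Both raters used one identical level throughout; kappa is undefined.
        return float("nan")
    return 1.0 - disagreement / expected


if __name__ == "__main__":
    print(quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 3]))
```

The quadratic weights penalize a 0-versus-4 disagreement far more than a 3-versus-4 one, which suits a graded harmfulness scale better than unweighted agreement.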
Circularity Check
Largely self-contained empirical methods paper; the only concern adjacent to circularity is that the headline crowdworker eval rubric was changed to match the new method's design target (non-evasiveness), but this is an evaluation-confound issue rather than a derivation reducing to its inputs.
specific steps
-
fitted input called prediction
[§4.4 'Harmlessness vs. Evasiveness'; Fig. 8 caption]
"the crowdworkers were instructed that among harmless samples, they should prefer those that were not evasive and instead explained the nature of the harm... This is contrary to prior work [Bai et al., 2022] where we simply asked workers to choose the more harmless response, which likely produced a significant amount of data favoring evasiveness. The HH PM data we use for this paper are collected from that same period, which likely caused our HH PM's to reward evasiveness."
The headline harmlessness-Elo gap of RL-CAI over HH RLHF is partly a consequence of evaluating under a rubric (penalize evasiveness) that matches RL-CAI's explicit design goal, while the baseline's PM was trained under the opposite implicit rubric. Not strict construction-circularity, since CAI harmlessness training does not use these crowdworker labels, but the evaluation criterion was moved in the direction the new method optimizes for. Authors disclose but do not quantify the effect or run matched-instruction tests.
-
self citation load bearing
[§2 and Fig. 4; Appendix B]
"In [Askell et al., 2021] we wrote a variety of conversations between a human and an AI assistant... resulting in 221 binary comparisons [Srivastava et al., 2022]... for this paper we have written 217 more challenging comparisons"
The HHH evaluation motivating AI-feedback viability is authored by an overlapping author set, and the 217 'more challenging' items were written by the present authors. The dataset is publicly released and judged via external PMs and pretrained LMs, so this is mild self-reference rather than load-bearing circular justification.
full rationale
This is an empirical ML methods paper, not a derivation chain, so the canonical circularity patterns (self-definitional equations, fitted-parameter-as-prediction, uniqueness-imported-from-authors) mostly do not apply. The central empirical claims use evaluators at least partially external to the CAI training pipeline: (i) Fig. 4 HHH accuracy is evaluated against an independently human-feedback-trained PM and pretrained LMs; (ii) Fig. 5 revision quality is scored by a PM trained on independent human-feedback comparisons; (iii) Figs. 2/3/8 use crowdworker A/B tests; (iv) Fig. 10 uses an L2-regression harmfulness predictor from prior red-teaming work. None of these reduce to the CAI training labels by construction. The one borderline issue, flagged honestly by the authors in §4.4, is that for the headline Elo comparisons, crowdworkers were newly instructed to prefer non-evasive harmless responses over evasive ones. RL-CAI is explicitly designed to be non-evasive (motivation 2, §1.1), while the HH RLHF baseline's PM data 'likely produced a significant amount of data favoring evasiveness.' This is not strict circularity (CAI harmlessness training labels come from AI feedback against constitutional principles, not from these new crowdworker labels), but the evaluation rubric has been shifted toward the direction the new method optimizes. The authors disclose this and note it compresses the H-RLHF vs HH-RLHF harmlessness gap, but do not run a matched-instruction comparison or quantify the share of the gap due to the rubric change. The HHH eval is partly authored by overlapping authors (217 new items written for this paper), but is released, multiple-choice, and judged by external evaluators, so it functions as a benchmark rather than load-bearing self-citation. Self-citation to prior Anthropic work is heavy but used for infrastructure, datasets, and baselines, not as a uniqueness theorem forcing the conclusion. Score: 2.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
Foundation/LawOfExistence.lean (RS forces principles from J-cost; no ad hoc choice), law_of_existence
Relation between the paper passage and the cited Recognition theorem: unclear.
These principles were chosen in a fairly ad hoc and iterative way for research purposes... such principles should be redeveloped and refined by a larger set of stakeholders
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models
Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
-
The Statistical Cost of Adaptation in Multi-Source Transfer Learning
Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.
-
Crafting Reversible SFT Behaviors in Large Language Models
LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
-
Lost in Translation: Do LVLM Judges Generalize Across Languages?
MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.
-
PsychBench: Auditing Epidemiological Fidelity in Large Language Model Mental Health Simulations
LLM mental health simulations produce individually plausible patients but systematically misrepresent real population distributions, with reduced variance, unstable diagnoses, and demographic biases.
-
HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?
Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
-
The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?
No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.
-
Instruction Tuning with GPT-4
GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
-
Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems
Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.
-
Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
DuST self-trains LLMs for code generation by ranking their own test-time samples via sandbox execution and applying GRPO, improving judgment by +6.2 NDCG and single-sample pass@1 by +3.1 on LiveCodeBench.
-
Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
DuST uses on-policy RL to train code models on ranking their own sampled solutions by sandbox execution correctness, improving judgment NDCG, pass@1, and Best-of-4 accuracy while showing that SFT on the same data does...
-
GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation
GraphInstruct is a progressive benchmark with six complexity levels for LLM graph generation that identifies multi-constraint composition as the hardest point and shows a verification-guided iterative framework outper...
-
Parameter-Efficient Neuroevolution for Diverse LLM Generation: Quality-Diversity Optimization via Prompt Embedding Evolution
QD-LLM evolves prompt embeddings via neuroevolution in a quality-diversity framework, delivering 46% higher coverage and 41% higher QD-score than prior methods on coding and writing benchmarks.
-
RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement
RubricRefine improves tool-use agent reliability to 0.86 on M3ToolEval by generating rubrics for pre-execution contract checking and iterative repair, outperforming baselines at 2.6X lower latency while showing no gai...
-
Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design
External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.
-
PPI2Text: Captioning Protein-Protein Interactions with Coordinate-Aligned Pair-Map Decoding
PPI2Text generates natural-language captions for protein-protein interactions from sequences by encoding each protein with ESM3, building a residue-pair map, and decoding with Qwen3 using coordinate-aligned positional...
-
Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off
Aligned LLMs exhibit Refusal-Escape Directions (RED) that enable refusal-to-answer transitions via input perturbations; these directions decompose exactly into operator-level sources, creating an inherent safety-utili...
-
Beyond Static Bias: Adaptive Multi-Fidelity Bandits with Improving Proxies
TACC algorithm for adaptive multi-fidelity bandits with improving proxies achieves instance-dependent regret by replacing logarithmic high-fidelity pulls with bounded low-fidelity continuation for intermediate arms.
-
Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.
-
TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples
TraceFix repairs LLM-generated multi-agent protocols via TLA+ counterexamples to achieve full verification on all tested tasks and higher completion rates than prompt-only baselines.
-
Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.
-
Convex Optimization with Nested Evolving Feasible Sets
For convex losses in nested evolving feasible sets, a lazy algorithm balances O(T^{1-β}) regret with O(T^β) movement for any β; for strongly convex or sharp losses, Frugal achieves zero regret with O(log T) movement, ...
-
Theoretical Limits of Language Model Alignment
The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.
-
Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL
A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.
-
Self-Mined Hardness for Safety Fine-Tuning
Self-mined hardness from model rollouts reduces WildJailbreak attack success rates to 1-3% on Llama models but increases over-refusal on benign prompts, which mixing with adversarially-framed benign prompts partially ...
-
Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation
Decoding-time use of process reward models for bias mitigation raises fairness scores by up to 0.40 on a bilingual benchmark while preserving fluency across four LLMs and extends to open-ended generation with low overhead.
-
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
-
Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning
Iterative search over reward functions with ranked feedback in GRPO training improves LLM math reasoning, achieving F1 of 0.795 on GSM8K versus 0.609 for baseline.
-
Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning
Iterative LLM-driven search over reward functions, screened via GRPO on GSM8K, raises F1 from 0.609 baseline to 0.795 with ensembles on Llama-3.2-3B.
-
Jailbroken Frontier Models Retain Their Capabilities
Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
-
Mechanized Foundations of Structural Governance: Machine-Checked Proofs for Governed Intelligence
Coq-mechanized proofs establish a coinductive governance safety predicate, invariance across recursion levels, sufficiency of four primitives for any discrete intelligent system, necessity of semantic judgment via Ric...
-
Three Models of RLHF Annotation: Extension, Evidence, and Authority
RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
-
Adaptive Prompt Embedding Optimization for LLM Jailbreaking
PEO optimizes original prompt embeddings continuously over adaptive rounds to jailbreak aligned LLMs, preserving the exact visible prompt text and outperforming discrete suffix, appended embedding, and search-based wh...
-
Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity
Incompressible Knowledge Probes enable log-linear estimation of LLM parameter counts from factual accuracy on obscure questions, showing continued scaling of knowledge capacity across open and closed models.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
Discovering Agentic Safety Specifications from 1-Bit Danger Signals
LLM agents autonomously evolve human-readable safety specifications from sparse 1-bit danger signals, outperforming reward-based reflection that encourages reward hacking.
-
Latent Space Probing for Adult Content Detection in Video Generative Models
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
-
Peer Identity Bias in Multi-Agent LLM Evaluation: An Empirical Study Using the TRUST Democratic Discourse Analysis Pipeline
Single-channel anonymization hides identity bias via cancellation effects, but full-pipeline anonymization reveals that homogeneous ensembles amplify sycophancy while heterogeneous ones reduce it, with one model showi...
-
MathDuels: Evaluating LLMs as Problem Posers and Solvers
Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.
-
Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...
-
LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models
Ghost-100 benchmark shows prompt tone drives hallucination rates and intensities in VLMs, with non-monotonic peaks at intermediate pressure and task-specific differences that aggregate metrics hide.
-
Using large language models for embodied planning introduces systematic safety risks
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
-
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF
R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
-
Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis
RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.
-
Reinforcement Learning via Value Gradient Flow
VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.
-
Many-Tier Instruction Hierarchy in LLM Agents
ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.
-
SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation
SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.
-
Personalizing Text-to-Image Generation to Individual Taste
PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
-
Sampling for Quality: Training-Free Reward-Guided LLM Decoding via Sequential Monte Carlo
Sequential Monte Carlo sampling from a reward-augmented sequence distribution improves LLM performance on HumanEval by up to 54.9% and MATH500 by up to 8.8%, outperforming standard sampling and GRPO.
-
Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition
A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.
-
Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception
Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent...
-
Self-Rewarding Language Models
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
-
Large Language Models as Optimizers
Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...
-
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
-
QLoRA: Efficient Finetuning of Quantized LLMs
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
-
Fusion-fission forecasts when AI will shift to undesirable behavior
A vector generalization of fusion-fission group dynamics from physics forecasts when AI behavior shifts to undesirable states, validated at 90 percent across seven models and prior to real-world data.
-
Real-Time Group Dynamics with LLM Facilitation: Evidence from a Charity Allocation Task
LLM facilitators in real-stakes group charity decisions shift specific allocations without raising consensus or participation equity, yet increase perceived trust and preference for the process.
-
History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
A single consistency instruction with harmful prior actions causes aligned frontier LLMs to select unsafe options at 91-98% rates in high-stakes domains, with escalation and inverse scaling by model size.
-
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
Reference graph
Works this paper leans on
-
[2]
Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Kernion, J., Ndousse, K., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, C., and Kaplan, J. (2021). A general language assistant as a laboratory for alignment
-
[3]
Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., Johnston, S., Kravec, S., Lovitt, L., Nanda, N., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, ...
-
[4]
Bowman, S. R., Hyun, J., Perez, E., Chen, E., Pettit, C., Heiner, S., Lukosuite, K., Askell, A., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Olah, C., Amodei, D., Amodei, D., Drain, D., Li, D., Tran-Johnson, E., Kernion, J., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lovitt, L., Elhage, N., Schiefer, N., Joseph, N., Mer...
-
[5]
Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences
-
[6]
Christiano, P., Shlegeris, B., and Amodei, D. (2018). Supervising strong learners by amplifying weak experts
-
[7]
Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., Jones, A., Bowman, S., Chen, A., Conerly, T., DasSarma, N., Drain, D., Elhage, N., El-Showk, S., Fort, S., Dodds, Z. H., Henighan, T., Hernandez, D., Hume, T., Jacobson, J., Johnston, S., Kravec, S., Olsson, C., Ringer, S., Tran-Johnson...
-
[8]
Gao, L., Schulman, J., and Hilton, J. (2022). Scaling laws for reward model overoptimization
-
[9]
Glaese, A., McAleese, N., Trębacz, M., Aslanides, J., Firoiu, V., Ewalds, T., Rauh, M., Weidinger, L., Chadwick, M., Thacker, P., Campbell-Gillingham, L., Uesato, J., Huang, P.-S., Comanescu, R., Yang, F., See, A., Dathathri, S., Greig, R., Chen, C., Fritz, D., Elias, J. S., Green, R., Mokrá, S., Fernando, N., Wu, B., Foley, R., Young, S., Gabriel, I., ...
-
[10]
Huang, J., Gu, S. S., Hou, L., Wu, Y., Wang, X., Yu, H., and Han, J. (2022). Large language models can self-improve
-
[11]
Irving, G., Christiano, P., and Amodei, D. (2018). AI safety via debate
-
[12]
Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Dodds, Z. H., DasSarma, N., Tran-Johnson, E., Johnston, S., El-Showk, S., Jones, A., Elhage, N., Hume, T., Chen, A., Bai, Y., Bowman, S., Fort, S., Ganguli, D., Hernandez, D., Jacobson, J., Kernion, J., Kravec, S., Lovitt, L., Ndousse, K., Olsson, C., Ringer, S., Amod...
-
[13]
Large Language Models are Zero-Shot Reasoners
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. (2022). Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916
-
[14]
Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., Sutton, C., and Odena, A. (2021). Show your work: Scratchpads for intermediate computation with language models
-
[15]
Training language models to follow instructions with human feedback
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155
-
[16]
Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., and Irving, G. (2022). Red teaming language models with language models
-
[17]
Saunders, W., Yeh, C., Wu, J., Bills, S., Ouyang, L., Ward, J., and Leike, J. (2022). Self-critiquing models for assisting human evaluators
-
[18]
Scheurer, J., Campos, J. A., Chan, J. S., Chen, A., Cho, K., and Perez, E. Training language models with language feedback
-
[19]
Shi, W., Dinan, E., Shuster, K., Weston, J., and Xu, J. (2022). When life gives you lemons, make cherryade: Converting feedback from bad responses into good labels
-
[20]
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. (2017). Mastering chess and shogi by self-play with a general reinforcement learning algorithm
-
[21]
Solaiman, I. and Dennison, C. (2021). Process for adapting language models to society (PALMS) with values-targeted datasets. CoRR, abs/2106.10328
-
[22]
Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. (2022). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models
-
[23]
Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. (2020). Learning to summarize from human feedback
-
[24]
LaMDA: Language Models for Dialog Applications
Thoppilan, R., Freitas, D. D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H., Jin, A., Bos, T., Baker, L., Du, Y., Li, Y., Lee, H., Zheng, H. S., Ghafouri, A., Menegali, M., Huang, Y., Krikun, M., Lepikhin, D., Qin, J., Chen, D., Xu, Y., Chen, Z., Roberts, A., Bosma, M., Zhou, Y., Chang, C., Krivokon, I., Rusch, W., Pickett, M., Meier-Hellstern, K....
-
[25]
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models
- [26]
-
[27]
Zhao, J., Khashabi, D., Khot, T., Sabharwal, A., and Chang, K.-W. (2021). Ethical-advice taker: Do language models understand natural language interventions?