pith. machine review for the scientific record.

arxiv: 2212.08073 · v1 · submitted 2022-12-15 · 💻 cs.CL · cs.AI

Recognition: 3 theorem links

· Lean Theorem

Constitutional AI: Harmlessness from AI Feedback


Pith reviewed 2026-05-08 22:22 UTC · model claude-opus-4-7

keywords: RLHF, RLAIF, AI feedback, alignment, harmlessness, self-critique, chain-of-thought, preference modeling

The pith

A short written constitution plus model self-critique can replace human harmfulness labels in training an assistant.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether human labels for harmful behavior are actually necessary to train a non-toxic chat assistant, and answers no. The authors hand-write a short list of natural-language principles (16 for each training stage) and use them in two ways: prompting the model to critique and rewrite its own harmful answers (yielding fine-tuning data), and prompting the model to pick the better of two of its own responses (yielding preference data for RL). Human input is reduced to helpfulness ratings plus the principles themselves. The trained model lands above human-feedback baselines on harmlessness while remaining willing to discuss sensitive topics and explain its objections, partially dissolving the usual helpfulness-versus-harmlessness tradeoff. Chain-of-thought prompting of the judge model both raises agreement with human labels and exposes the reasoning behind each preference.

Core claim

A capable language model can supervise itself for harmlessness when given a short list of written principles. In a supervised stage, the model critiques and rewrites its own harmful outputs against a randomly sampled principle, and a base model is fine-tuned on those revisions. In a reinforcement stage, the same model judges pairs of its own responses against the principles, producing a preference dataset that trains a reward model used for RL. With only human labels for helpfulness (none for harmlessness), the resulting assistant is rated as more harmless than one trained with human harmlessness labels, and it explains its objections rather than refusing to engage.
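The supervised stage described above can be sketched as a short loop. This is a minimal illustration, not the paper's implementation: the `generate` callable, the prompt templates, and `toy_generate` are all hypothetical stand-ins for a real language model call and the paper's actual prompts.

```python
import random
from typing import Callable, Dict, List, Optional


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str],
    n_rounds: int = 1,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Return one (prompt, revised response) pair for supervised fine-tuning."""
    rng = rng or random.Random(0)
    context = f"Human: {prompt}\n\nAssistant:"
    response = generate(context)  # initial, possibly harmful, answer
    for _ in range(n_rounds):
        principle = rng.choice(principles)  # one principle sampled per round
        critique = generate(
            f"{context} {response}\n\nCritique request: {principle}\n\nCritique:"
        )
        response = generate(
            f"{context} {response}\n\nCritique: {critique}\n\n"
            "Revision request: rewrite the response to address the critique."
            "\n\nRevision:"
        )
    # Only the final revision is kept as the fine-tuning target.
    return {"prompt": prompt, "completion": response}


def toy_generate(text: str) -> str:
    """Stand-in for a language model call (hypothetical, for illustration)."""
    if text.endswith("Revision:"):
        return "I can't help with that, but here is why it is risky."
    if text.endswith("Critique:"):
        return "The response gives dangerous instructions."
    return "Sure, here is how to do it."


example = critique_and_revise(
    "How do I pick a lock?", toy_generate, ["Identify harmful content."]
)
```

With `n_rounds=0` the function returns the unrevised initial answer, which makes the effect of the revision step easy to compare.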

What carries the argument

Two stacked self-supervision loops anchored on a small natural-language "constitution": (1) a critique-then-revise loop that turns red-team prompts and harmful initial responses into a supervised fine-tuning corpus, and (2) an RLAIF loop where the model itself answers a multiple-choice "which response better satisfies this principle" question, with the resulting (often chain-of-thought) probabilities serving as soft preference labels for a reward model. Randomly sampling principles per example acts as an ensemble that stabilizes the reward signal.
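To make the "soft preference label" idea concrete, the sketch below normalizes the judge's log-probabilities for the two answer options into a target and scores a reward model against it. It assumes a Bradley-Terry style preference model, where the probability that A is preferred is the sigmoid of the reward difference; the function names are illustrative and not taken from the paper.

```python
import math
from typing import Tuple


def soft_preference(logp_a: float, logp_b: float) -> Tuple[float, float]:
    """Normalize the judge's log-probs for options (A) and (B) into a soft label."""
    m = max(logp_a, logp_b)  # subtract the max for numerical stability
    ea, eb = math.exp(logp_a - m), math.exp(logp_b - m)
    z = ea + eb
    return ea / z, eb / z


def soft_label_loss(reward_a: float, reward_b: float, target_a: float) -> float:
    """Cross-entropy between the soft target and a Bradley-Terry prediction.

    Assumption: the reward model predicts P(A preferred) = sigmoid(r_a - r_b).
    """
    p_a = 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))
    eps = 1e-12  # guards against log(0)
    return -(
        target_a * math.log(p_a + eps)
        + (1.0 - target_a) * math.log(1.0 - p_a + eps)
    )


# Judge puts probability 0.6 on "(A)" and 0.2 on "(B)"; the rest is on other tokens.
p_a, p_b = soft_preference(math.log(0.6), math.log(0.2))
loss = soft_label_loss(reward_a=0.0, reward_b=0.0, target_a=p_a)
```

A soft target keeps the judge's uncertainty in the training signal instead of collapsing every comparison to a hard 0 or 1.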

If this is right

  • Harmlessness training data can be generated at the scale of model inference rather than the scale of human annotation, so iteration time on safety objectives is bounded by compute, not crowdworker throughput.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method's success depends on the judge being roughly as capable as the policy; if a future policy outpaces available judges, the loop will silently degrade and the constitution will read as obeyed while behavior drifts.

Load-bearing premise

That a model capable enough to be worth aligning is also capable enough to reliably tell which of its own answers is more harmful, given only a brief written rule.

What would settle it

Re-run the pipeline with the AI judge replaced by held-out human harmfulness labels on the same prompts and pairs. If the human-judged Pareto frontier of helpfulness vs. harmlessness for the resulting assistant is materially better than the AI-feedback version (or if AI-judge accuracy on a calibrated harmfulness benchmark falls well below the trained preference model), the claim that AI feedback can substitute for human harmfulness labels at this capability level fails.

Original abstract

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

5 major / 8 minor

Summary. The paper introduces Constitutional AI (CAI), a two-stage method for training a helpful and harmless assistant without human harmfulness labels. Stage 1 (SL-CAI) prompts a helpful RLHF model to critique and revise its own responses to red-team prompts using randomly sampled natural-language principles, then finetunes on the revisions. Stage 2 (RL-CAI / RLAIF) uses a feedback model to label pairs of responses for harmlessness via multiple-choice prompting (optionally with chain-of-thought), distills these into a preference model mixed with human helpfulness labels, and runs RL against it. The authors report that RL-CAI matches or exceeds HH-RLHF on harmlessness Elo at comparable helpfulness (Figs. 2, 3, 8), that critiques+revisions monotonically improve PM-scored harmlessness over iterations (Fig. 5), that CoT closes the gap with human-trained PMs on a 438-item HHH eval (Fig. 4), and that the resulting assistant is substantially less evasive than HH-RLHF.
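For readers less familiar with Elo scores, the helper below converts an Elo gap into an expected win rate. It assumes the conventional base-10, 400-point Elo scale; this page does not state which scale the paper's figures use, so treat the numbers as illustrative.

```python
def elo_win_probability(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Expected probability that A is preferred over B under standard Elo.

    Assumption: base-10 logistic with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / scale))


even = elo_win_probability(100.0, 100.0)      # equal ratings
favoured = elo_win_probability(400.0, 0.0)    # 400-point advantage
```

Under this convention a gap of a few tens of Elo points corresponds to a win rate only slightly above one half, which is why error bars matter when comparing nearby snapshots.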

Significance. If the central claims hold, the contribution is substantial: a recipe for training harmless assistants whose harmlessness signal comes almost entirely from a short list of natural-language principles plus a pretrained LM's judgment, with human labels retained only for helpfulness. This is the first thorough demonstration that RLAIF can match RLHF for harmlessness at scale, and the SL-CAI critique-revision loop is a self-contained, easily reusable technique. The paper supports its claims with multiple complementary evaluations (crowdworker Elos, an author-written HHH multiple-choice set, PM-scored revision sweeps, an absolute-harmfulness regression model from prior work, and calibration plots), reports negative effects honestly (Goodharting / boilerplate over-reactivity in §4.3), and ships a public repository of prompts, principles, and few-shot exemplars enabling reproduction. The chain-of-thought scaling result in Fig. 4 is independently interesting and does not depend on the crowdworker protocol issues raised below.

major comments (5)
  1. [§4.4, Figs. 2/3/8] The headline Pareto-frontier claim rests on Elo scores from crowdworker A/B tests whose instructions were changed for this paper to penalize evasiveness, while the HH-RLHF baseline's preference data and policy were trained under the prior instruction that (per the authors) 'likely produced a significant amount of data favoring evasiveness.' RL-CAI is explicitly designed for non-evasiveness (§1.1). The paper acknowledges this asymmetry but does not quantify it. Please add a comparison under matched conditions: e.g., (i) re-evaluate HH-RLHF and RL-CAI under the original instruction, or (ii) train a new HH-RLHF baseline with PM data collected under the new instruction, or (iii) report an evasiveness-controlled subset of comparisons. Without this, it is difficult to separate genuinely improved harm-avoidance from a scoring rule that penalizes the baseline's design choice.
  2. [§4.3, Goodharting examples] The PALMS examples on p. 12-13 show RL-CAI emitting templated 'you are valid, valued, and cared for' boilerplate on red-team prompts. This pattern would plausibly be rewarded by crowdworkers instructed to prefer 'thoughtful' over 'evasive' replies, even when the content is largely formulaic. Please report an evaluation that distinguishes substantive engagement from sympathetic boilerplate — e.g., a held-out probe of factual correctness on the engagement portion, or a length-and-template-controlled comparison — to substantiate the 'non-evasive and engaged' framing rather than 'verbose and sympathetic.'
  3. [§3, §4, baseline coverage] A natural and much cheaper baseline is missing: the helpful-only RLHF model plus a safety-oriented system prompt (or the same constitutional principles inserted at inference time) without any further training. Because the paper's contribution is partly that 'human supervision' is replaced by a short text constitution, demonstrating that training is necessary — that prompted helpfulness alone, with the same principles, does not match RL-CAI on the same evaluations — would substantially strengthen §1.3's claims. As presented, the comparison is only against models trained without those principles.
  4. [§3.1, §4.1, choice of 16 principles] The constitution is described as 'chosen in a fairly ad hoc and iterative way' (footnote 2, §3.1, Appendix C). Fig. 6 shows that varying the number of principles (1-16) does not measurably change harmlessness PM score. This complicates the transparency claim in §1.1 and §1.3: if outcomes are insensitive to the constitution's content within this regime, the principles function less as a controllable specification and more as a generic 'be less harmful' instruction. Please add an ablation that varies principle content (not just count) — e.g., principles emphasizing different harm categories — and reports whether downstream behavior shifts in the corresponding directions. Without such evidence, the 'encode goals in a list of natural-language principles' framing is under-supported.
  5. [§4.5, Fig. 10] The absolute-harmfulness curve is one of the few evaluations that does not depend on the modified crowdworker instructions, and it does support the harmlessness claim. However, it is computed on 64 hand-picked held-out red-team prompts, with the caveat that absolute scores 'may not be well-calibrated' across workers. Because this evaluation carries disproportionate weight in light of the concerns about Figs. 2/3/8, please report inter-rater agreement, prompt-selection criteria, and ideally a larger held-out set.
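One common way to report the inter-rater agreement requested in comment 5 is Cohen's kappa for two raters. The sketch below is a plain, unweighted version; the ratings are invented for illustration, and the choice of statistic is an assumption of this example rather than anything the paper or referee specifies.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_1: Sequence[Hashable], rater_2: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa: agreement between two raters beyond chance."""
    if len(rater_1) != len(rater_2) or len(rater_1) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_1)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(counts_1[label] * counts_2[label] for label in counts_1) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical harmfulness ratings from two crowdworkers on six prompts.
identical = cohens_kappa([0, 1, 2, 3, 4, 4], [0, 1, 2, 3, 4, 4])
chance_level = cohens_kappa([0, 0, 1, 1], [0, 1, 0, 1])
```

For an ordinal harmfulness scale, a weighted kappa or Krippendorff's alpha would give partial credit for near-misses and may be the better choice.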
minor comments (8)
  1. [Figs. 2, 3] Elo error bars are described as 'visible in Figure 3 but suppressed' in Fig. 2. Please retain error bars in Fig. 2 — the Pareto-frontier interpretation depends on whether RL-CAI snapshots are statistically distinguishable from HH-RLHF snapshots at matched helpfulness.
  2. [§2, Fig. 4] The 217 newly written HHH comparisons are described as 'more challenging.' Please clarify the construction process (who wrote them, adjudication, whether authors had access to model outputs while writing) to rule out selection effects favoring the larger models that ultimately score them.
  3. [§3.2] The 140,335 model-generated red-team prompts dwarf the 42,496 human-written ones. A brief description of the few-shot generation procedure and any deduplication/filtering would help readers assess training-distribution coverage.
  4. [§4.1] The clamping of CoT probabilities to [0.4, 0.6] is reported as helpful but is a fairly aggressive intervention (it discards most of the feedback model's expressed confidence). A short ablation comparing 20-80, 40-60, and uncalibrated targets at the final RL endpoint, not just qualitatively, would clarify how much of the CoT result depends on this hyperparameter.
  5. [§4.3, calibration plot Fig. 9] Calibration is reported on the HHH eval set, which is internal. A calibration plot on a held-out distribution (e.g., Ganguli et al. red-team prompts) would be more informative.
  6. [Throughout] The term 'constitution' is used both for the 16 SL critique/revision instructions (Appendix C.1) and the 16 RL feedback principles (C.2), which are different sets. Please name the two sets distinctly to avoid ambiguity in §3 vs §4.
  7. [§7] Author contributions list 'Jennifer Zhou' under data, who does not appear in the author list on p. 1. Please reconcile.
  8. [Appendix B, Fig. 11] Caption says results are on 'the original HHH evaluations' but text in §2 says these have saturated above 90%. The y-axis tops out near 0.85 — please reconcile or clarify which model class achieves >90%.
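Minor comments 4 and 5 concern label clamping and calibration. The sketch below shows both operations in their simplest form: clamping a soft label into a range such as [0.4, 0.6], and a binned expected calibration error. The binning scheme and function names are assumptions made for this example, and the data are made up.

```python
from typing import List, Sequence


def clamp_label(p: float, low: float = 0.4, high: float = 0.6) -> float:
    """Clamp a soft preference probability into [low, high]."""
    return min(max(p, low), high)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE: weighted mean of |mean confidence - empirical accuracy| per bin."""
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        confidence = sum(p for p, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(confidence - accuracy)
    return ece


# A confident judge label of 0.95 becomes 0.6 after clamping to [0.4, 0.6].
clamped = [clamp_label(p) for p in (0.05, 0.5, 0.95)]
well_calibrated = expected_calibration_error([0.9] * 10, [1] * 9 + [0])
overconfident = expected_calibration_error([1.0, 1.0], [1, 0])
```

The clamped values show why the referee calls the intervention aggressive: every label outside the range collapses onto one of two endpoints.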

Simulated Author's Rebuttal

5 responses · 1 unresolved

We thank the referee for a careful and constructive report. The five major comments converge on a real weakness in the current draft: several of our headline comparisons rely on a crowdworker protocol that was changed mid-project to penalize evasiveness, and we did not quantify how much of the apparent RL-CAI advantage that change accounts for. We accept this and will add (a) instruction-matched re-evaluations and an evasiveness-controlled comparison subset, (b) a length- and template-controlled analysis plus a factual-engagement probe to test whether 'non-evasive engagement' is substantive rather than sympathetic boilerplate, (c) a prompted-only baseline (helpful RLHF + the same constitution as system prompt, no further training) to isolate the contribution of training over prompting, (d) a content-level ablation of the constitution that varies which harm categories the principles target, not only how many principles are used, and (e) an expanded absolute-harmfulness evaluation with documented prompt-selection criteria, inter-rater agreement, and a larger held-out set. We will also temper the language around 'Pareto improvement' and 'controllable specification' in §1.1, §1.3, and §4.4 to match what the strengthened evidence actually supports. One item — retraining a fresh HH-RLHF baseline with newly collected preference data under the revised instructions — we cannot deliver in this revision and will mark as a limitation. The CoT/HHH scaling result in Fig. 4 and the harmfuln

Point-by-point responses
  1. Referee: Headline Pareto-frontier claim rests on Elo scores from crowdworker A/B tests whose instructions were changed to penalize evasiveness, asymmetrically disadvantaging HH-RLHF. Add a matched-conditions comparison.

    Authors: We agree this asymmetry is the most important caveat in §4.4 and that the paper currently flags it without quantifying it. Concretely, we will (i) re-run a comparison subset under the original (pre-2022) instructions, evaluating both HH-RLHF and RL-CAI snapshots, so that the baseline is scored under the rule its training data was collected under, and (ii) report an evasiveness-controlled slice in which the human rater first judges whether each response is evasive (canned refusal vs. substantive engagement) and we recompute Elos restricted to non-evasive pairs on both sides. We will not retrain a new HH-RLHF baseline with newly collected preference data under the new instructions, because that would require a comparable-scale relabeling effort that is out of scope here, and we will state this limitation explicitly. We also note that the absolute-harmfulness regression in Fig. 10 and the HHH evaluation in Fig. 4 do not depend on the modified instructions, and we will foreground these as the instruction-independent evidence for the harmlessness claim, while softening the 'Pareto improvement' framing of Fig. 2 to a conditional one. revision: yes

  2. Referee: Goodharting examples show templated 'you are valid, valued, and cared for' boilerplate. Distinguish substantive engagement from sympathetic boilerplate (factuality probe, length/template-controlled comparison).

    Authors: This is a fair criticism and the §4.3 examples we ourselves flagged make the worry concrete. We will add two analyses. First, a length- and template-controlled comparison: we will identify high-frequency RL-CAI boilerplate n-grams (e.g., 'you are valid', 'I'm here to support you'), strip them from sampled responses, and rerun a held-out preference comparison; if RL-CAI's harmlessness/engagement advantage survives this scrub, the substantive-engagement framing is supported. Second, on a subset of red-team prompts that admit factual content (e.g., 'why are prisons full of Black and Brown people?'), we will score the engagement portion for factual accuracy against a small reference rubric. We will report both numbers honestly, including any drop in advantage. We agree that the current paper overstates 'non-evasive and engaged' relative to what these examples warrant, and we will revise the framing in §4.4 accordingly. revision: yes

  3. Referee: Missing baseline: helpful-only RLHF + the same constitutional principles as a system prompt at inference, with no further training.

    Authors: We agree this is the right baseline for the claim that training (rather than just prompting with the constitution) is what produces the effect. We will add a prompted-only baseline in which the helpful RLHF model receives the 16 RL-CAI principles as a system prompt (and, as a stronger variant, the few-shot critique/revision exemplars used at inference time) and is evaluated on the same harmlessness Elo, HHH multiple-choice, and absolute-harmfulness probes as RL-CAI. We expect prompted-only to recover part of the effect — consistent with Fig. 4, where prompted CoT becomes competitive at scale — but to fall short of RL-CAI on robustness to red-team prompts; reporting the gap is the appropriate way to substantiate §1.3. If the gap is smaller than anticipated, we will say so. revision: yes

  4. Referee: Constitution is ad hoc; Fig. 6 shows count of principles does not affect PM score, undercutting the 'controllable specification' framing. Vary principle content, not just count.

    Authors: We accept the point. Fig. 6 demonstrates insensitivity to count but is silent on content, which is what the transparency claim actually requires. We will add a content-ablation in which we train (or, as a cheaper proxy, generate revisions and feedback labels with) constitutions specialized to specific harm categories — e.g., a 'bias-only' constitution, a 'dangerous-advice-only' constitution, and a 'tone/politeness-only' constitution — and measure whether downstream model behavior shifts in the corresponding direction on category-specific probes from the [Ganguli et al., 2022] taxonomy. We will report both successes and null results. We will also temper the §1.1/§1.3 language: within the present 16-principle regime the constitution behaves partly as a generic 'be less harmful' instruction, and stronger steerability claims should be conditional on the content ablation outcome. revision: yes

  5. Referee: Fig. 10 absolute harmfulness uses 64 hand-picked held-out prompts; report inter-rater agreement, selection criteria, and a larger held-out set.

    Authors: Agreed, especially since (per Comment 1) this evaluation carries more weight than we initially gave it. We will (i) document the prompt-selection procedure used for the 64-prompt set in an appendix, including who selected them and against what criteria, (ii) report inter-rater agreement on the 0-4 absolute-harmfulness scale using duplicated annotations from the [Ganguli et al., 2022] data pipeline, and (iii) extend the evaluation to a substantially larger, randomly sampled held-out set of red-team prompts (target ~500) and rerun all four model curves. We will release the prompt list with the camera-ready repository update. revision: yes
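The template-controlled comparison proposed in response 2 needs a way to find and strip high-frequency boilerplate. The sketch below is one simple way to do that with word n-grams; the tokenization, the document-frequency threshold, and the sample responses are all assumptions for illustration, not the authors' procedure.

```python
import re
from collections import Counter
from typing import Iterable, List, Set


def frequent_ngrams(responses: List[str], n: int = 3, min_fraction: float = 0.5) -> Set[str]:
    """Return word n-grams that occur in at least min_fraction of the responses."""
    doc_counts: Counter = Counter()
    for text in responses:
        tokens = re.findall(r"[a-z']+", text.lower())
        grams = {" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}
        doc_counts.update(grams)  # each response counts an n-gram at most once
    threshold = min_fraction * len(responses)
    return {gram for gram, count in doc_counts.items() if count >= threshold}


def strip_phrases(text: str, phrases: Iterable[str]) -> str:
    """Remove each phrase (case-insensitive) plus any punctuation right after it."""
    for phrase in sorted(phrases, key=len, reverse=True):
        pattern = re.escape(phrase) + r"[\s.,;:!]*"
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()


# Invented sample responses, for illustration only.
samples = [
    "You are valid. Here is some info.",
    "You are valid. Prisons are complex.",
    "Other text entirely here.",
]
boilerplate = frequent_ngrams(samples)
scrubbed = strip_phrases(samples[1], boilerplate)
```

Rerunning the preference comparison on scrubbed responses would then show how much of the advantage survives once the template is removed.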

standing simulated objections not resolved
  • Comment 1(ii): we will not retrain a new HH-RLHF baseline with preference data freshly collected under the new (anti-evasiveness) instructions. The relabeling cost is comparable to the original HH-RLHF data collection and is out of scope for this revision; we will state this explicitly as a limitation rather than claim to address it.

Circularity Check

2 steps flagged

Largely self-contained empirical methods paper; the only concern adjacent to circularity is that the headline crowdworker eval rubric was changed to match the new method's design target (non-evasiveness), but this is an evaluation-confound issue rather than a derivation reducing to its inputs.

specific steps
  1. fitted input called prediction [§4.4 'Harmlessness vs. Evasiveness'; Fig. 8 caption]
    "the crowdworkers were instructed that among harmless samples, they should prefer those that were not evasive and instead explained the nature of the harm... This is contrary to prior work [Bai et al., 2022] where we simply asked workers to choose the more harmless response, which likely produced a significant amount of data favoring evasiveness. The HH PM data we use for this paper are collected from that same period, which likely caused our HH PM's to reward evasiveness."

    The headline harmlessness-Elo gap of RL-CAI over HH RLHF is partly a consequence of evaluating under a rubric (penalize evasiveness) that matches RL-CAI's explicit design goal, while the baseline's PM was trained under the opposite implicit rubric. Not strict construction-circularity, since CAI harmlessness training does not use these crowdworker labels, but the evaluation criterion was moved in the direction the new method optimizes for. Authors disclose but do not quantify the effect or run matched-instruction tests.

  2. self citation load bearing [§2 and Fig. 4; Appendix B]
    "In [Askell et al., 2021] we wrote a variety of conversations between a human and an AI assistant... resulting in 221 binary comparisons [Srivastava et al., 2022]... for this paper we have written 217 more challenging comparisons"

    The HHH evaluation motivating AI-feedback viability is authored by an overlapping author set, and the 217 'more challenging' items were written by the present authors. The dataset is publicly released and judged via external PMs and pretrained LMs, so this is mild self-reference rather than load-bearing circular justification.

full rationale

This is an empirical ML methods paper, not a derivation chain, so the canonical circularity patterns (self-definitional equations, fitted-parameter-as-prediction, uniqueness-imported-from-authors) mostly do not apply. The central empirical claims use evaluators at least partially external to the CAI training pipeline: (i) Fig. 4 HHH accuracy is evaluated against an independently human-feedback-trained PM and pretrained LMs; (ii) Fig. 5 revision quality is scored by a PM trained on independent human-feedback comparisons; (iii) Figs. 2/3/8 use crowdworker A/B tests; (iv) Fig. 10 uses an L2-regression harmfulness predictor from prior red-teaming work. None of these reduce to the CAI training labels by construction. The one borderline issue, flagged honestly by the authors in §4.4, is that for the headline Elo comparisons, crowdworkers were newly instructed to prefer non-evasive harmless responses over evasive ones. RL-CAI is explicitly designed to be non-evasive (motivation 2, §1.1), while the HH RLHF baseline's PM data 'likely produced a significant amount of data favoring evasiveness.' This is not strict circularity (CAI harmlessness training labels come from AI feedback against constitutional principles, not from these new crowdworker labels), but the evaluation rubric has been shifted toward the direction the new method optimizes. The authors disclose this and note it compresses the H-RLHF vs HH-RLHF harmlessness gap, but do not run a matched-instruction comparison or quantify the share of the gap due to the rubric change. The HHH eval is partly authored by overlapping authors (217 new items written for this paper), but is released, multiple-choice, and judged by external evaluators, so it functions as a benchmark rather than load-bearing self-citation. Self-citation to prior Anthropic work is heavy but used for infrastructure, datasets, and baselines, not as a uniqueness theorem forcing the conclusion. Score: 2.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Model omitted the axiom ledger; defaulted for pipeline continuity.

pith-pipeline@v0.9.0 · 10018 in / 6967 out tokens · 115344 ms · 2026-05-08T22:22:50.304091+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

    cs.CR 2026-05 conditional novelty 8.0

    Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

  2. The Statistical Cost of Adaptation in Multi-Source Transfer Learning

    math.ST 2026-05 unverdicted novelty 8.0

    Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.

  3. Crafting Reversible SFT Behaviors in Large Language Models

    cs.LG 2026-05 unverdicted novelty 8.0

    LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

  4. Lost in Translation: Do LVLM Judges Generalize Across Languages?

    cs.CL 2026-04 unverdicted novelty 8.0

    MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

  5. PsychBench: Auditing Epidemiological Fidelity in Large Language Model Mental Health Simulations

    cs.CY 2026-04 unverdicted novelty 8.0

    LLM mental health simulations produce individually plausible patients but systematically misrepresent real population distributions, with reduced variance, unstable diagnoses, and demographic biases.

  6. HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

    cs.CR 2026-04 unverdicted novelty 8.0

    Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

  7. The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

    cs.CR 2026-04 unverdicted novelty 8.0 full

    No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.

  8. Instruction Tuning with GPT-4

    cs.CL 2023-04 unverdicted novelty 8.0

    GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

  9. Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems

    cs.CR 2026-05 unverdicted novelty 7.0

    Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.

  10. Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling

    cs.LG 2026-05 conditional novelty 7.0

    DuST self-trains LLMs for code generation by ranking their own test-time samples via sandbox execution and applying GRPO, improving judgment by +6.2 NDCG and single-sample pass@1 by +3.1 on LiveCodeBench.

  11. Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling

    cs.LG 2026-05 unverdicted novelty 7.0

    DuST uses on-policy RL to train code models on ranking their own sampled solutions by sandbox execution correctness, improving judgment NDCG, pass@1, and Best-of-4 accuracy while showing that SFT on the same data does...

  12. GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation

    cs.SI 2026-05 unverdicted novelty 7.0

    GraphInstruct is a progressive benchmark with six complexity levels for LLM graph generation that identifies multi-constraint composition as the hardest point and shows a verification-guided iterative framework outper...

  13. Parameter-Efficient Neuroevolution for Diverse LLM Generation: Quality-Diversity Optimization via Prompt Embedding Evolution

    cs.NE 2026-05 unverdicted novelty 7.0

    QD-LLM evolves prompt embeddings via neuroevolution in a quality-diversity framework, delivering 46% higher coverage and 41% higher QD-score than prior methods on coding and writing benchmarks.

  14. RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement

    cs.LG 2026-05 unverdicted novelty 7.0

    RubricRefine improves tool-use agent reliability to 0.86 on M3ToolEval by generating rubrics for pre-execution contract checking and iterative repair, outperforming baselines at 2.6X lower latency while showing no gai...

  15. Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design

    cs.MA 2026-05 unverdicted novelty 7.0

    External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.

  16. PPI2Text: Captioning Protein-Protein Interactions with Coordinate-Aligned Pair-Map Decoding

    cs.CE 2026-05 unverdicted novelty 7.0

    PPI2Text generates natural-language captions for protein-protein interactions from sequences by encoding each protein with ESM3, building a residue-pair map, and decoding with Qwen3 using coordinate-aligned positional...

  17. Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

    cs.CR 2026-05 unverdicted novelty 7.0

    Aligned LLMs exhibit Refusal-Escape Directions (RED) that enable refusal-to-answer transitions via input perturbations; these directions decompose exactly into operator-level sources, creating an inherent safety-utili...

  18. Beyond Static Bias: Adaptive Multi-Fidelity Bandits with Improving Proxies

    cs.LG 2026-05 unverdicted novelty 7.0

    TACC algorithm for adaptive multi-fidelity bandits with improving proxies achieves instance-dependent regret by replacing logarithmic high-fidelity pulls with bounded low-fidelity continuation for intermediate arms.

  19. Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

    cs.AI 2026-05 unverdicted novelty 7.0

    LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.

  20. TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples

    cs.AI 2026-05 conditional novelty 7.0

    TraceFix repairs LLM-generated multi-agent protocols via TLA+ counterexamples to achieve full verification on all tested tasks and higher completion rates than prompt-only baselines.

  21. Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences

    cs.LG 2026-05 unverdicted novelty 7.0

    Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.

  22. Convex Optimization with Nested Evolving Feasible Sets

    cs.LG 2026-05 unverdicted novelty 7.0

    For convex losses in nested evolving feasible sets, a lazy algorithm balances O(T^{1-β}) regret with O(T^β) movement for any β; for strongly convex or sharp losses, Frugal achieves zero regret with O(log T) movement, ...

  23. Theoretical Limits of Language Model Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.

  24. Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL

    cs.LG 2026-05 conditional novelty 7.0

    A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.

  25. Self-Mined Hardness for Safety Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 7.0

    Self-mined hardness from model rollouts reduces WildJailbreak attack success rates to 1-3% on Llama models but increases over-refusal on benign prompts, which mixing with adversarially-framed benign prompts partially ...

  26. Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation

    cs.CL 2026-05 unverdicted novelty 7.0

    Decoding-time use of process reward models for bias mitigation raises fairness scores by up to 0.40 on a bilingual benchmark while preserving fluency across four LLMs and extends to open-ended generation with low overhead.

  27. When Alignment Isn't Enough: Response-Path Attacks on LLM Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.

  28. Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    Iterative search over reward functions with ranked feedback in GRPO training improves LLM math reasoning, achieving F1 of 0.795 on GSM8K versus 0.609 for baseline.

  29. Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning

    cs.CL 2026-05 accept novelty 7.0

    Iterative LLM-driven search over reward functions, screened via GRPO on GSM8K, raises F1 from 0.609 baseline to 0.795 with ensembles on Llama-3.2-3B.

  30. Jailbroken Frontier Models Retain Their Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.

  31. Mechanized Foundations of Structural Governance: Machine-Checked Proofs for Governed Intelligence

    cs.AI 2026-04 unverdicted novelty 7.0

    Coq-mechanized proofs establish a coinductive governance safety predicate, invariance across recursion levels, sufficiency of four primitives for any discrete intelligent system, necessity of semantic judgment via Ric...

  32. Three Models of RLHF Annotation: Extension, Evidence, and Authority

    cs.CY 2026-04 unverdicted novelty 7.0

    RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.

  33. Adaptive Prompt Embedding Optimization for LLM Jailbreaking

    cs.AI 2026-04 unverdicted novelty 7.0

    PEO optimizes original prompt embeddings continuously over adaptive rounds to jailbreak aligned LLMs, preserving the exact visible prompt text and outperforming discrete suffix, appended embedding, and search-based wh...

  34. Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity

    cs.LG 2026-04 unverdicted novelty 7.0

    Incompressible Knowledge Probes enable log-linear estimation of LLM parameter counts from factual accuracy on obscure questions, showing continued scaling of knowledge capacity across open and closed models.

  35. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  36. Discovering Agentic Safety Specifications from 1-Bit Danger Signals

    cs.AI 2026-04 unverdicted novelty 7.0

    LLM agents autonomously evolve human-readable safety specifications from sparse 1-bit danger signals, outperforming reward-based reflection that encourages reward hacking.

  37. Latent Space Probing for Adult Content Detection in Video Generative Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.

  38. Peer Identity Bias in Multi-Agent LLM Evaluation: An Empirical Study Using the TRUST Democratic Discourse Analysis Pipeline

    cs.CY 2026-04 unverdicted novelty 7.0

    Single-channel anonymization hides identity bias via cancellation effects, but full-pipeline anonymization reveals that homogeneous ensembles amplify sycophancy while heterogeneous ones reduce it, with one model showi...

  39. MathDuels: Evaluating LLMs as Problem Posers and Solvers

    cs.CL 2026-04 unverdicted novelty 7.0

    Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.

  40. Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...

  41. LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Ghost-100 benchmark shows prompt tone drives hallucination rates and intensities in VLMs, with non-monotonic peaks at intermediate pressure and task-specific differences that aggregate metrics hide.

  42. Using large language models for embodied planning introduces systematic safety risks

    cs.AI 2026-04 unverdicted novelty 7.0

    LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.

  43. Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

    cs.CL 2026-04 unverdicted novelty 7.0

    R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.

  44. Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

    cs.LG 2026-04 unverdicted novelty 7.0

    RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.

  45. Reinforcement Learning via Value Gradient Flow

    cs.LG 2026-04 unverdicted novelty 7.0

    VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.

  46. Many-Tier Instruction Hierarchy in LLM Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.

  47. SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation

    cs.CL 2026-04 accept novelty 7.0

    SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.

  48. Personalizing Text-to-Image Generation to Individual Taste

    cs.CV 2026-04 unverdicted novelty 7.0

    PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.

  49. Sampling for Quality: Training-Free Reward-Guided LLM Decoding via Sequential Monte Carlo

    cs.LG 2026-04 unverdicted novelty 7.0

    Sequential Monte Carlo sampling from a reward-augmented sequence distribution improves LLM performance on HumanEval by up to 54.9% and MATH500 by up to 8.8%, outperforming standard sampling and GRPO.

  50. Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition

    cs.AI 2026-04 unverdicted novelty 7.0

    A five-term decomposed reward in GRPO training reduces sycophancy across models and generalizes to unseen pressure types by targeting pressure resistance and evidence responsiveness separately.

  51. Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception

    cs.AI 2026-04 unverdicted novelty 7.0

    Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent...

  52. Self-Rewarding Language Models

    cs.CL 2024-01 conditional novelty 7.0

    Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

  53. Large Language Models as Optimizers

    cs.LG 2023-09 unverdicted novelty 7.0

    Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...

  54. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    cs.RO 2023-07 unverdicted novelty 7.0

    VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

  55. QLoRA: Efficient Finetuning of Quantized LLMs

    cs.LG 2023-05 conditional novelty 7.0

    QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

  56. Fusion-fission forecasts when AI will shift to undesirable behavior

    cs.AI 2026-05 unverdicted novelty 6.0

    A vector generalization of fusion-fission group dynamics from physics forecasts when AI behavior shifts to undesirable states, validated at 90 percent across seven models and prior to real-world data.

  57. Real-Time Group Dynamics with LLM Facilitation: Evidence from a Charity Allocation Task

    cs.HC 2026-05 unverdicted novelty 6.0

    LLM facilitators in real-stakes group charity decisions shift specific allocations without raising consensus or participation equity, yet increase perceived trust and preference for the process.

  58. History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions

    cs.AI 2026-05 unverdicted novelty 6.0

    A single consistency instruction with harmful prior actions causes aligned frontier LLMs to select unsafe options at 91-98% rates in high-stakes domains, with escalation and inverse scaling by model size.

  59. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

  60. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · cited by 209 Pith papers · 2 internal anchors

  2. [2]

    Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Kernion, J., Ndousse, K., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, C., and Kaplan, J. (2021). A general language assistant as a laboratory for alignment

  3. [3]

    Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., Johnston, S., Kravec, S., Lovitt, L., Nanda, N., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, ...

  4. [4]

    Bowman, S. R., Hyun, J., Perez, E., Chen, E., Pettit, C., Heiner, S., Lukosiute, K., Askell, A., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Olah, C., Amodei, D., Amodei, D., Drain, D., Li, D., Tran-Johnson, E., Kernion, J., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lovitt, L., Elhage, N., Schiefer, N., Joseph, N., Mer...

  5. [5]

    Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences

  6. [6]

    Christiano, P., Shlegeris, B., and Amodei, D. (2018). Supervising strong learners by amplifying weak experts

  7. [7]

    Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., Jones, A., Bowman, S., Chen, A., Conerly, T., DasSarma, N., Drain, D., Elhage, N., El-Showk, S., Fort, S., Dodds, Z. H., Henighan, T., Hernandez, D., Hume, T., Jacobson, J., Johnston, S., Kravec, S., Olsson, C., Ringer, S., Tran-Johnson...

  8. [8]

    Gao, L., Schulman, J., and Hilton, J. (2022). Scaling laws for reward model overoptimization

  9. [9]

    Glaese, A., McAleese, N., Trębacz, M., Aslanides, J., Firoiu, V., Ewalds, T., Rauh, M., Weidinger, L., Chadwick, M., Thacker, P., Campbell-Gillingham, L., Uesato, J., Huang, P.-S., Comanescu, R., Yang, F., See, A., Dathathri, S., Greig, R., Chen, C., Fritz, D., Elias, J. S., Green, R., Mokrá, S., Fernando, N., Wu, B., Foley, R., Young, S., Gabriel, I., ...

  10. [10]

    Huang, J., Gu, S. S., Hou, L., Wu, Y., Wang, X., Yu, H., and Han, J. (2022). Large language models can self-improve

  11. [11]

    Irving, G., Christiano, P., and Amodei, D. (2018). AI safety via debate

  12. [12]

    Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Dodds, Z. H., DasSarma, N., Tran-Johnson, E., Johnston, S., El-Showk, S., Jones, A., Elhage, N., Hume, T., Chen, A., Bai, Y., Bowman, S., Fort, S., Ganguli, D., Hernandez, D., Jacobson, J., Kernion, J., Kravec, S., Lovitt, L., Ndousse, K., Olsson, C., Ringer, S., Amod...

  13. [13]

    Large Language Models are Zero-Shot Reasoners

    Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. (2022). Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916

  14. [14]

    Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., Sutton, C., and Odena, A. (2021). Show your work: Scratchpads for intermediate computation with language models

  15. [15]

    Training language models to follow instructions with human feedback

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155

  16. [16]

    Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., and Irving, G. (2022). Red teaming language models with language models

  17. [17]

    Saunders, W., Yeh, C., Wu, J., Bills, S., Ouyang, L., Ward, J., and Leike, J. (2022). Self-critiquing models for assisting human evaluators

  18. [18]

    Scheurer, J., Campos, J. A., Chan, J. S., Chen, A., Cho, K., and Perez, E. (2022). Training language models with language feedback

  19. [19]

    Shi, W., Dinan, E., Shuster, K., Weston, J., and Xu, J. (2022). When life gives you lemons, make cherryade: Converting feedback from bad responses into good labels

  20. [20]

    Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. (2017). Mastering chess and shogi by self-play with a general reinforcement learning algorithm

  21. [21]

    Solaiman, I. and Dennison, C. (2021). Process for adapting language models to society (PALMS) with values-targeted datasets. CoRR, abs/2106.10328

  22. [22]

    Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. (2022). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

  23. [23]

    Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. (2020). Learning to summarize from human feedback

  24. [24]

    LaMDA: Language Models for Dialog Applications

    Thoppilan, R., Freitas, D. D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H., Jin, A., Bos, T., Baker, L., Du, Y., Li, Y., Lee, H., Zheng, H. S., Ghafouri, A., Menegali, M., Huang, Y., Krikun, M., Lepikhin, D., Qin, J., Chen, D., Xu, Y., Chen, Z., Roberts, A., Bosma, M., Zhou, Y., Chang, C., Krivokon, I., Rusch, W., Pickett, M., Meier-Hellstern, K....

  25. [25]

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models

  26. [26]

    Xu, J., Ju, D., Li, M., Boureau, Y.-L., Weston, J., and Dinan, E. (2020). Recipes for safety in open-domain chatbots. arXiv preprint arXiv:2010.07079

  27. [27]

    Zhao, J., Khashabi, D., Khot, T., Sabharwal, A., and Chang, K.-W. (2021). Ethical-advice taker: Do language models understand natural language interventions?