pith. machine review for the scientific record.

arxiv: 2203.02155 · v1 · submitted 2022-03-04 · 💻 cs.CL · cs.AI · cs.LG


Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe

Pith reviewed 2026-05-10 16:43 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords language models · human feedback · reinforcement learning · instruction following · model alignment · GPT-3 · truthfulness · toxicity

The pith

Fine-tuning GPT-3 on human demonstrations and output rankings produces InstructGPT models that humans prefer over the original 175B GPT-3 even at 1.3B parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that language models can be aligned more closely with user intent by first training them on human-written examples of desired responses to prompts and then further adjusting them using human rankings of different model outputs. This two-step process applied to GPT-3 yields InstructGPT, which human evaluators rate higher than the base model on the authors' prompt set. The aligned models also produce more truthful text and fewer toxic outputs while showing only small drops on standard language benchmarks. The result matters because it indicates that careful use of human feedback can improve reliability without requiring ever-larger models.

Core claim

The authors collect labeler demonstrations of desired behavior on a mix of written prompts and API-submitted prompts, use them for supervised fine-tuning of GPT-3, then gather rankings of model outputs and apply reinforcement learning from human feedback to obtain InstructGPT. In human evaluations on their prompt distribution, the 1.3B InstructGPT is preferred to the 175B GPT-3, with gains in truthfulness, reductions in toxic generation, and minimal regressions on public NLP datasets.

What carries the argument

Two-stage fine-tuning that begins with supervised learning on human demonstrations of desired outputs and continues with reinforcement learning from human rankings of model responses.
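The ranking stage is usually cast as training a reward model with a Bradley-Terry style pairwise loss. The summary above does not spell out the objective, so the following is a minimal sketch under that standard assumption, with the scalar rewards taken as already computed by the model:

```python
import math

def pairwise_ranking_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Bradley-Terry style reward-model loss: -log sigmoid(r_w - r_l).

    Pushes the reward model to score the human-preferred completion
    above the rejected one; the penalty shrinks as the margin grows.
    """
    margin = reward_preferred - reward_rejected
    # log1p(exp(-margin)) is the numerically safer form of -log(sigmoid(margin))
    return math.log1p(math.exp(-margin))
```

At a margin of zero the loss equals ln 2, a reward model that cannot tell the pair apart; minimizing this loss over many ranked pairs is what converts human rankings into the scalar objective used in the RL stage.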

If this is right

  • Smaller models aligned this way can outperform much larger unaligned models on human preference judgments.
  • The resulting models generate more truthful content and fewer toxic outputs.
  • Standard public NLP benchmarks show only minimal performance regressions after the alignment steps.
  • Fine-tuning with human feedback offers a practical route to making language models follow user instructions more reliably.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same collection and ranking process could be applied to other base models to test whether the preference gains hold beyond the GPT-3 family.
  • If human feedback can be gathered at scale for more complex or domain-specific prompts, the method might reduce reliance on raw parameter count for capability gains.
  • Extending the ranking step to capture longer-term user satisfaction rather than single-turn preferences could further tighten alignment.

Load-bearing premise

The preferences expressed by the human labelers on the prompts they saw accurately capture what a wide range of future users will want in real applications.

What would settle it

A new human evaluation on a fresh collection of prompts drawn from actual user interactions where InstructGPT outputs are not rated higher than those from the base GPT-3.

read the original abstract

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces InstructGPT models obtained by first performing supervised fine-tuning of GPT-3 on a dataset of human-written demonstrations of desired behavior, then further training via reinforcement learning from human feedback (RLHF) using a reward model trained on human preference rankings of model outputs. On a held-out set of prompts drawn from the same distribution (labeler-written and API-submitted), human evaluators prefer outputs from the 1.3B InstructGPT over those from the 175B GPT-3; the aligned models also exhibit higher truthfulness and lower toxicity with only small regressions on public NLP benchmarks.

Significance. If the reported human-preference results hold, the work supplies direct empirical evidence that RLHF can produce substantial alignment gains on instruction-following tasks, including the striking result that a 100x smaller model can be preferred to its much larger base model. The approach is grounded in independent human evaluations rather than circular derivations, and the public benchmarks provide a useful check against capability regression. This strengthens the case for human feedback as a practical alignment technique beyond pure scaling.

major comments (2)
  1. [§4] §4 (Human evaluations): The central preference comparison (1.3B InstructGPT preferred to 175B GPT-3) is reported without confidence intervals, sample sizes per comparison, or inter-rater agreement statistics. Because the main claim rests entirely on these human judgments, the absence of uncertainty quantification leaves open the possibility that the observed win rates are sensitive to sampling variability or labeler idiosyncrasies.
  2. [§3.3] §3.3 (RLHF stage): The reward model and PPO training both involve multiple free hyperparameters (learning rates, KL coefficient, etc.). While the paper lists the chosen values, it provides no ablation or sensitivity analysis showing that the reported preference gains are robust to reasonable changes in these choices; this weakens the case that the gains are attributable to the RLHF procedure itself rather than a narrow hyperparameter sweet spot.
minor comments (2)
  1. [Table 2] Table 2 and Figure 3: the public-benchmark regressions are described as “minimal,” but the absolute deltas (e.g., on MMLU or TruthfulQA) should be stated numerically in the text for quick assessment.
  2. [§2.2] §2.2: the prompt distribution is described only at a high level (“labeler-written and API-submitted”); a short appendix table characterizing prompt length, topic diversity, or task type would aid readers in judging external validity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive assessment of the work. We address each major comment below, proposing revisions where they strengthen the manuscript without requiring new large-scale experiments.

read point-by-point responses
  1. Referee: [§4] §4 (Human evaluations): The central preference comparison (1.3B InstructGPT preferred to 175B GPT-3) is reported without confidence intervals, sample sizes per comparison, or inter-rater agreement statistics. Because the main claim rests entirely on these human judgments, the absence of uncertainty quantification leaves open the possibility that the observed win rates are sensitive to sampling variability or labeler idiosyncrasies.

    Authors: We agree that uncertainty quantification would improve the reporting of the human preference results. The evaluations were performed on a held-out set of prompts with multiple labelers, and we have the underlying data to compute bootstrap confidence intervals, exact sample sizes (prompts and pairwise comparisons), and inter-rater agreement (e.g., Fleiss' kappa). We will add these statistics to Section 4 and the appendix in the revised manuscript. revision: yes

  2. Referee: [§3.3] §3.3 (RLHF stage): The reward model and PPO training both involve multiple free hyperparameters (learning rates, KL coefficient, etc.). While the paper lists the chosen values, it provides no ablation or sensitivity analysis showing that the reported preference gains are robust to reasonable changes in these choices; this weakens the case that the gains are attributable to the RLHF procedure itself rather than a narrow hyperparameter sweet spot.

    Authors: The manuscript does not contain ablations on the RLHF hyperparameters; values were chosen via small-scale preliminary tuning informed by prior RLHF literature. We cannot conduct full sensitivity analyses without substantial new compute and human data collection. In revision we will expand Section 3.3 to better motivate the selected values, note the limitation, and point out that preference gains were observed consistently across model scales (1.3B, 6B, and 175B InstructGPT). revision: partial
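The uncertainty quantification promised in the first response is cheap to compute once the raw pairwise comparisons are available. A minimal percentile-bootstrap sketch for the head-to-head win rate (illustrative, not the authors' evaluation code):

```python
import random

def bootstrap_win_rate_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a pairwise win rate.

    outcomes: list of 0/1 comparison results (1 = model A preferred).
    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    n = len(outcomes)
    point = sum(outcomes) / n
    # Resample comparisons with replacement and recompute the win rate.
    stats = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lower, upper
```

A fuller revision would also bootstrap over labelers (not just prompts) to capture the rater-idiosyncrasy concern the referee raises.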

Circularity Check

0 steps flagged

No significant circularity in the empirical results or method

full rationale

The paper presents an empirical pipeline—collecting labeler demonstrations for supervised fine-tuning of GPT-3, followed by collecting output rankings for reinforcement learning from human feedback—whose final performance claims rest on separate human preference evaluations conducted on held-out prompts from the authors' distribution. These evaluations directly compare the resulting 1.3B InstructGPT model against the 175B GPT-3 baseline and are not derived from or equivalent to the training objective itself. No equations, fitted parameters, or self-citations are invoked in a manner that reduces the reported preference gains, truthfulness improvements, or toxicity reductions to the input data by construction. The central result is therefore an independent measurement rather than a renaming or tautological restatement of the training process.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that human rankings can be modeled as a reward function and that the collected prompts and labelers are representative; no new physical entities are introduced.

free parameters (2)
  • reward model training hyperparameters
    Architecture size, learning rate, and batch size for the reward model are chosen and fitted to the ranking data.
  • PPO hyperparameters
    Clip range, learning rate, and KL coefficient in the reinforcement learning stage are tuned on the reward model.
axioms (1)
  • domain assumption: Human preferences over text outputs can be accurately represented by a scalar reward function trained on pairwise rankings
    Invoked when the reward model is trained and then used as the objective in PPO.
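Where the axiom is invoked can be made concrete: the fitted scalar reward becomes the maximand, tempered by a KL term toward the SFT policy so the RL stage does not drift into regions where the reward model is unreliable. A per-sequence sketch with illustrative names and an illustrative β (not the paper's exact coefficient):

```python
def rlhf_objective(reward: float,
                   policy_logprobs: list[float],
                   sft_logprobs: list[float],
                   beta: float = 0.02) -> float:
    """Reward-model score minus a KL penalty toward the SFT policy.

    The per-token KL estimate is log pi(y_t|x) - log pi_SFT(y_t|x),
    summed over the sampled completion.
    """
    kl = sum(p - q for p, q in zip(policy_logprobs, sft_logprobs))
    return reward - beta * kl
```

When the policy has not moved from the SFT model the penalty is zero and the objective reduces to the reward alone; the β coefficient is exactly the kind of free parameter the ledger flags.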

pith-pipeline@v0.9.0 · 5601 in / 1382 out tokens · 71200 ms · 2026-05-10T16:43:46.878664+00:00 · methodology



Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    cs.CL 2022-01 accept novelty 9.0

    Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

  2. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    cs.CL 2023-10 conditional novelty 8.0

    DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

  3. Generative Agents: Interactive Simulacra of Human Behavior

    cs.HC 2023-04 accept novelty 8.0

    Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.

  4. Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics

    cond-mat.stat-mech 2026-05 unverdicted novelty 7.0

    LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.

  5. Select-then-differentiate: Solving Bilevel Optimization with Manifold Lower-level Solution Sets

    math.OC 2026-05 unverdicted novelty 7.0

    Optimistic bilevel optimization with manifold lower-level minimizers is differentiable if the optimistic selection is unique, yielding a pseudoinverse hyper-gradient and a convergent HG-MS algorithm whose rate depends...

  6. Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion

    cs.LG 2026-05 unverdicted novelty 7.0

    Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reaso...

  7. ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

    cs.CL 2026-05 unverdicted novelty 7.0

    ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.

  8. VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

    cs.CV 2026-05 unverdicted novelty 7.0

    VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

  9. Political Bias Audits of LLMs Capture Sycophancy to the Inferred Auditor

    cs.AI 2026-04 conditional novelty 7.0

    Political bias audits of LLMs largely capture sycophantic accommodation to the inferred political identity of the asker rather than any fixed model ideology.

  10. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  11. Latent Space Probing for Adult Content Detection in Video Generative Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.

  12. Rates of forgetting for the sequentially Markov coalescent

    math.PR 2026-04 unverdicted novelty 7.0

    SMC forgets its initial condition geometrically in the jump chain and as 1/ℓ in continuous genetic distance, justifying independent-locus approximations.

  13. R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling

    cs.LG 2026-04 unverdicted novelty 7.0

    R2IF improves LLM function-calling accuracy by up to 34.62% on BFCL using a composite reward system with CER and SMV components optimized via GRPO, while increasing interpretability through positive CoT effectiveness.

  14. HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

    cs.AI 2026-04 unverdicted novelty 7.0

    HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.

  15. S-GRPO: Unified Post-Training for Large Vision-Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.

  16. Reinforcement Learning via Value Gradient Flow

    cs.LG 2026-04 unverdicted novelty 7.0

    VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.

  17. Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

    cs.CV 2026-04 conditional novelty 7.0

    Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.

  18. Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.

  19. SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation

    cs.CL 2026-04 accept novelty 7.0

    SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.

  20. MCP-DPT: A Defense-Placement Taxonomy and Coverage Analysis for Model Context Protocol Security

    cs.CR 2026-04 conditional novelty 7.0

    MCP-DPT creates a defense-placement taxonomy that organizes MCP threats and defenses across six architectural layers, revealing mostly tool-centric protections and gaps at orchestration, transport, and supply-chain layers.

  21. Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception

    cs.AI 2026-04 unverdicted novelty 7.0

    Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent...

  22. STEER: Structured Event Evidence for Video Reasoning via Multi-Objective Reinforcement Learning

    cs.CL 2026-04 unverdicted novelty 7.0

    STEER represents videos as time-ordered event schemas and uses Pareto-Frontier guided Advantage Balancing in RL to train a 4B model that matches 7B baselines on video tasks with half the frames.

  23. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  24. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  25. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    cs.RO 2023-07 unverdicted novelty 7.0

    VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

  26. Let's Verify Step by Step

    cs.LG 2023-05 accept novelty 7.0

    Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.

  27. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    cs.CV 2023-03 accept novelty 7.0

    Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.

  28. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  29. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  30. Early Data Exposure Improves Robustness to Subsequent Fine-Tuning

    cs.LG 2026-05 conditional novelty 6.0

    Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.

  31. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

  32. Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training

    cs.LG 2026-05 unverdicted novelty 6.0

    A new RL objective adapts trust-region and off-policy handling automatically via normalized effective sample size of batch policy ratios, matching tuned baselines without new hyperparameters.

  33. Evaluating the False Trust engendered by LLM Explanations

    cs.HC 2026-05 unverdicted novelty 6.0

    A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.

  34. Rotation-Preserving Supervised Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.

  35. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  36. Why Does Agentic Safety Fail to Generalize Across Tasks?

    cs.LG 2026-05 conditional novelty 6.0

    Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...

  37. Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

    cs.SE 2026-05 accept novelty 6.0

    A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.

  38. Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.

  39. RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

    cs.CL 2026-05 unverdicted novelty 6.0

    RLearner-LLM achieves up to 6x gains in NLI entailment over standard fine-tuning by using an automated hybrid DPO pipeline that balances logic and fluency across multiple model sizes and domains.

  40. RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

    cs.CL 2026-05 unverdicted novelty 6.0

    RLearner-LLM's Hybrid-DPO fuses DeBERTa NLI and LLM verifier scores to deliver up to 6x higher NLI entailment than standard SFT while preserving answer coverage across academic domains.

  41. A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

  42. Iterative Finetuning is Mostly Idempotent

    cs.AI 2026-05 unverdicted novelty 6.0

    Iterative self-finetuning of LLMs mostly fails to amplify seeded behavioral traits, with amplification limited to specific DPO setups and often harming coherence.

  43. AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    AEM lifts entropy analysis to the response level and uses a derived uncertainty proxy to rescale advantages, enabling better exploration-exploitation balance and consistent gains over RL baselines on agent benchmarks.

  44. Diversity in Large Language Models under Supervised Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 6.0

    TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.

  45. TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

    cs.CR 2026-04 unverdicted novelty 6.0

    TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.

  46. What Did They Mean? How LLMs Resolve Ambiguous Social Situations across Perspectives and Roles

    cs.HC 2026-04 unverdicted novelty 6.0

    LLMs produce interpretive closure in 87.5% of ambiguous social scenarios through narrative alignment, reversal, or normative advice, with first-person perspectives increasing alignment tendencies.

  47. Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models

    cs.CR 2026-04 unverdicted novelty 6.0

    Transient Turn Injection is a new attack that evades LLM moderation by spreading harmful intent over multiple isolated turns using automated agents.

  48. Structural Quality Gaps in Practitioner AI Governance Prompts: An Empirical Study Using a Five-Principle Evaluation Framework

    cs.SE 2026-04 unverdicted novelty 6.0

    A new five-principle framework applied to 34 practitioner AI governance prompts finds 37% lack key structural elements such as data classification and rubrics.

  49. Vibrotactile Preference Learning: Uncertainty-Aware Preference Learning for Personalized Vibration Feedback

    cs.HC 2026-04 unverdicted novelty 6.0

    VPL learns individualized vibrotactile preferences efficiently via uncertainty-aware Gaussian process models and active query selection in a 13-participant user study on an Xbox controller.

  50. Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

    cs.CR 2026-04 unverdicted novelty 6.0

    Different LLM jailbreak techniques achieve similar harmful compliance but lead to distinct behavioral side effects and mechanistic changes.

  51. ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

    cs.CR 2026-04 unverdicted novelty 6.0

    ClawGuard enforces deterministic, user-derived access constraints at tool boundaries to block indirect prompt injection without changing the underlying LLM.

  52. ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

    cs.CR 2026-04 unverdicted novelty 6.0

    ClawGuard enforces user-derived access constraints at tool-call boundaries to block indirect prompt injection in tool-augmented LLM agents across web, MCP, and skill injection channels.

  53. Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

    cs.CL 2026-04 unverdicted novelty 6.0

    Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.

  54. Pioneer Agent: Continual Improvement of Small Language Models in Production

    cs.AI 2026-04 unverdicted novelty 6.0

    Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...

  55. C$^2$T: Captioning-Structure and LLM-Aligned Common-Sense Reward Learning for Traffic--Vehicle Coordination

    cs.MA 2026-04 unverdicted novelty 6.0

    C2T learns an LLM-derived common-sense reward function to improve cooperative multi-intersection traffic control policies, outperforming standard MARL baselines on efficiency, safety, and energy proxies while allowing...

  56. Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...

  57. TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense

    cs.CR 2026-04 unverdicted novelty 6.0

    TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.

  58. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

    cs.AI 2026-04 unverdicted novelty 6.0

    AI models exhibit identity-contingent withholding, providing better clinical guidance on benzodiazepine tapering to physicians than laypeople in identical scenarios, with a measured decoupling gap of +0.38 and 13.1 pe...

  59. SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics

    cs.SE 2026-04 unverdicted novelty 6.0

    SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.

  60. Unmasking Hallucinations: A Causal Graph-Attention Perspective on Factual Reliability in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    GCAN cuts LLM hallucination rates by 27.8% and raises factual accuracy by 16.4% on TruthfulQA and HotpotQA by using causal token graphs and a new Causal Contribution Score.
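Several of the entries above describe decoding-time mechanisms only at summary depth. As a concrete illustration of the general pattern behind item 57 (TrajGuard's streaming hidden-state monitoring), the sketch below flags a generation when the running mean of per-token hidden states drifts toward a pre-fit high-risk direction. This is a minimal hypothetical sketch, not the paper's actual detector: the `TrajectoryMonitor` class, the cosine-similarity score, the risk direction, and the threshold are all invented here for illustration.

```python
import numpy as np

class TrajectoryMonitor:
    """Hypothetical sketch of decoding-time hidden-state monitoring:
    flags a generation when the running mean of per-token hidden states
    drifts toward a pre-fit high-risk direction. (Illustrative only;
    not TrajGuard's actual detector.)"""

    def __init__(self, risk_direction, threshold=0.8):
        # risk_direction: vector pointing at the assumed high-risk region
        self.risk_direction = risk_direction / np.linalg.norm(risk_direction)
        self.threshold = threshold
        self.running_sum = None
        self.steps = 0

    def update(self, hidden_state):
        """Consume one decoding step's hidden state; return True to halt."""
        self.steps += 1
        if self.running_sum is None:
            self.running_sum = np.array(hidden_state, dtype=float)
        else:
            self.running_sum += hidden_state
        mean = self.running_sum / self.steps
        # cosine similarity between the trajectory mean and the risk direction
        score = float(mean @ self.risk_direction) / (np.linalg.norm(mean) + 1e-9)
        return score > self.threshold

# Toy usage: a trajectory drifting toward the risk direction trips the monitor.
rng = np.random.default_rng(0)
risk = np.array([1.0, 0.0, 0.0])
monitor = TrajectoryMonitor(risk, threshold=0.8)
flagged_at = None
for t in range(50):
    # synthetic states increasingly aligned with the risk direction
    state = risk * (t / 10.0) + rng.normal(scale=0.05, size=3)
    if monitor.update(state):
        flagged_at = t
        break
```

Because the monitor only maintains a running sum and one dot product per step, its per-token overhead is constant, which is the property that makes this family of defenses viable during streaming decoding.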

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · cited by 100 Pith papers
