Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Pith reviewed 2026-05-10 13:30 UTC · model grok-4.3
The pith
Reinforcement learning from human feedback aligns language models to be helpful and harmless assistants while improving performance on NLP evaluations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization.
What carries the argument
Preference models trained on human rankings of model outputs, followed by reinforcement learning that optimizes a policy against those models.
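A minimal sketch of the kind of pairwise preference-model loss this stage typically uses, assuming a Bradley-Terry-style comparison between a human-preferred and a human-rejected response; the function and tensor names are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def preference_loss(chosen_scores: torch.Tensor,
                    rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) preference-model loss.

    Each entry is the scalar score the preference model assigns to the
    human-preferred ("chosen") and human-rejected response for the same
    prompt. Minimizing -log sigmoid(chosen - rejected) pushes preferred
    responses to score higher than rejected ones.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Illustrative usage on a batch of four comparisons with dummy scores.
chosen = torch.tensor([1.2, 0.3, -0.1, 2.0])
rejected = torch.tensor([0.4, 0.5, -0.8, 1.1])
print(preference_loss(chosen, rejected))
```

The trained scorer then serves as the reward signal for the RL stage that optimizes the policy.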
If this is right
- Alignment training raises rather than lowers scores on most NLP tasks.
- Specialized capability training such as coding or summarization can be performed alongside or after the alignment stage without interference.
- Weekly online updates using new human rankings allow steady dataset and model improvement.
- The observed linear relation between reward and square-root KL divergence gives a practical dial for controlling how far the policy moves during RL.
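If the reported regularity is taken at face value, it can be written as a simple linear fit in the square root of the divergence; the coefficients a and b below are illustrative fit parameters, not values the abstract reports.

```latex
% Hedged form of the observed relation: reward grows roughly linearly in the
% square root of the policy's KL divergence from its initialization \pi_0.
r_{\mathrm{RL}} \;\approx\; a + b \,\sqrt{D_{\mathrm{KL}}\!\left(\pi \,\|\, \pi_0\right)}

% Inverting the fit turns a desired reward gain \Delta r into a divergence budget,
% which is the "dial" described above.
D_{\mathrm{KL}}^{\,\mathrm{target}} \;\approx\; \left(\tfrac{\Delta r}{b}\right)^{2}
```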
Where Pith is reading between the lines
- If the linear reward-KL relation generalizes, future runs could choose divergence targets in advance to keep models close enough to retain capabilities while still aligning.
- The lack of conflict with domain-specific training suggests alignment can be applied as a reusable post-training step for many different model lineages.
- Ongoing weekly feedback collection implies that alignment can be maintained as models continue to scale rather than treated as a one-time step.
Load-bearing premise
Human preference rankings supply a stable training signal that does not contain systematic biases the RL stage will amplify into worse behavior.
What would settle it
Train a model with the described RLHF procedure and check whether its scores on standard NLP benchmarks fail to rise, or its rate of harmful outputs fails to fall, relative to the starting model; either outcome would undercut the central claim.
read the original abstract
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization. Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human writers, and provide samples from our models using prompts appearing in recent related work.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript applies preference modeling and reinforcement learning from human feedback (RLHF) to fine-tune language models as helpful and harmless assistants. It reports that this alignment training improves performance on almost all NLP evaluations while remaining compatible with specialized skill training such as Python coding and summarization. The work also introduces an iterated online training mode with weekly updates using fresh human feedback, identifies a roughly linear relation between RL reward and the square root of KL divergence from the initialization, and includes peripheral analyses on calibration, competing objectives, OOD detection, human writer comparisons, and model samples.
Significance. If the empirical results hold under rigorous controls, the paper demonstrates that RLHF can simultaneously boost general NLP capabilities and safety alignment in language models, with the iterated online approach offering a scalable data-collection strategy. The reported linear reward-KL relation provides an empirical regularity that could guide future theoretical analyses of RLHF optimization landscapes. These findings, supported by the paper's machine-checked or reproducible elements where present, would have direct implications for deploying aligned AI assistants.
major comments (1)
- [Abstract] Abstract and the section describing human preference data collection: the central claim that alignment training improves performance on almost all NLP evaluations rests on the assumption that human preference rankings supply a consistent, unbiased signal for both helpfulness and harmlessness. The manuscript does not appear to report inter-annotator agreement statistics or explicit tests for style biases (e.g., preference for longer or more confident responses) that could be amplified by the RL stage; without these, the observed gains risk being artifacts of the training signal rather than genuine capability or safety improvements.
minor comments (2)
- [Abstract] The abstract states a 'roughly linear relation' between RL reward and sqrt(KL) but does not specify the exact functional form, the range of KL values examined, or whether the relation holds after controlling for reward scaling coefficient; adding this detail would strengthen the observation.
- [Abstract] The claim of compatibility with specialized skills (python coding, summarization) would benefit from explicit quantitative results or ablation tables showing performance before and after alignment training on those tasks.
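A concrete way to address the first minor comment would be to fit the slope of the reward-versus-sqrt(KL) curve directly from logged training checkpoints; the arrays below are placeholder values standing in for whatever telemetry a run actually records, so this is a sketch of the fit rather than the paper's numbers.

```python
import numpy as np

# Placeholder per-checkpoint logs: KL divergence from the initial policy and
# the corresponding mean preference-model reward (illustrative values only).
kl = np.array([0.5, 2.0, 8.0, 18.0, 32.0])
reward = np.array([0.10, 0.35, 0.72, 1.05, 1.41])

# Least-squares fit of reward = a + b * sqrt(KL); polyfit returns [slope, intercept].
b, a = np.polyfit(np.sqrt(kl), reward, deg=1)

# Residuals quantify how "rough" the linearity is over the examined KL range,
# which is the detail the comment asks the authors to report.
residuals = reward - (a + b * np.sqrt(kl))
print(f"slope b = {b:.3f}, intercept a = {a:.3f}, max |residual| = {np.abs(residuals).max():.3f}")
```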
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review of our manuscript. We address the single major comment below and indicate the revisions we will incorporate.
read point-by-point responses
-
Referee: [Abstract] Abstract and the section describing human preference data collection: the central claim that alignment training improves performance on almost all NLP evaluations rests on the assumption that human preference rankings supply a consistent, unbiased signal for both helpfulness and harmlessness. The manuscript does not appear to report inter-annotator agreement statistics or explicit tests for style biases (e.g., preference for longer or more confident responses) that could be amplified by the RL stage; without these, the observed gains risk being artifacts of the training signal rather than genuine capability or safety improvements.
Authors: We agree that the absence of inter-annotator agreement (IAA) statistics and explicit bias analyses constitutes a gap in the current manuscript. In the revised version we will report IAA for both the helpfulness and harmlessness preference datasets (computed on the subset of examples that received multiple annotations). We will also add a short analysis of response-length and confidence-related biases: we will measure the correlation between preference scores and response length in the training data, and we will compare length distributions of model outputs before and after RLHF. While our annotation guidelines explicitly instructed raters to prioritize substantive helpfulness and harmlessness over stylistic features, we acknowledge that these controls are not sufficient without quantitative checks. The fact that gains appear on standard NLP benchmarks (many of which penalize overly verbose or hedging answers) provides indirect evidence against pure artifact explanations, but we will make this argument explicit and add the requested statistics and bias diagnostics.
revision: yes
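A minimal sketch of the two diagnostics the rebuttal promises, assuming a doubly-annotated subset and per-response token lengths are available; all variable names and toy values are hypothetical, and scipy/scikit-learn are assumed only for convenience.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Hypothetical doubly-annotated comparisons: which response (0 or 1) each
# annotator preferred, plus token lengths of both responses and the final label.
annotator_a = np.array([0, 1, 1, 0, 1, 1])
annotator_b = np.array([0, 1, 0, 0, 1, 1])
len_response_0 = np.array([120, 80, 200, 60, 150, 90])
len_response_1 = np.array([150, 95, 180, 70, 210, 140])
preferred = np.array([1, 1, 0, 0, 1, 1])  # label used for preference-model training

# Inter-annotator agreement (IAA) on the doubly-annotated subset.
kappa = cohen_kappa_score(annotator_a, annotator_b)

# Length bias: is the longer response systematically the preferred one?
length_gap = len_response_1 - len_response_0  # positive means response 1 is longer
prefers_longer = (preferred == 1).astype(float)
corr, p_value = pearsonr(length_gap, prefers_longer)

print(f"Cohen's kappa = {kappa:.2f}; length-preference correlation = {corr:.2f} (p = {p_value:.2f})")
```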
Circularity Check
No significant circularity; empirical results independent of fitted inputs
full rationale
The paper reports empirical measurements of RLHF alignment training on held-out NLP evaluations, human comparisons, and specialized tasks. The noted roughly linear relation between RL reward and the square root of the KL divergence is described as an observed pattern from training runs, not a first-principles derivation or a prediction that reduces to model inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. The central claims rest on independent benchmark performance rather than any equation or theorem that collapses to the training data or preferences by definition.
Axiom & Free-Parameter Ledger
free parameters (1)
- RL reward scaling coefficient
axioms (1)
- Domain assumption: Human preference rankings over model outputs are transitive and consistent enough to train a reliable reward model.
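For orientation, the listed free parameter typically enters the RL stage as a scaling on the preference-model score inside a KL-regularized objective; the form below is a common RLHF formulation assumed here as a sketch, not necessarily the paper's exact one.

```latex
% Sketch of a KL-regularized RLHF objective. \lambda is the reward scaling
% coefficient from the ledger, \beta a KL penalty weight, and \pi_0 the
% policy at initialization.
\max_{\pi}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}
  \big[ \lambda \, r_{\mathrm{PM}}(x, y) \big]
\;-\;
\beta \, \mathbb{E}_{x \sim \mathcal{D}}
  \big[ D_{\mathrm{KL}}\!\big( \pi(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x) \big) \big]
```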
Lean theorems connected to this paper
- Foundation.DimensionForcing dimension_forced (tagged unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "We find this alignment training improves performance on almost all NLP evaluations"
- Foundation.PhiForcing phi_equation (tagged unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "identify a roughly linear relation between the RL reward and the square root of the KL divergence"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
The Statistical Cost of Adaptation in Multi-Source Transfer Learning
Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.
-
Contrastive Identification and Generation in the Limit
Contrastive pair presentations yield exact identifiability characterizations via a geometric refinement of Angluin's condition, a new contrastive closure dimension for generation, mutual incomparability with text iden...
-
Weak-to-Strong Generalization is Nearly Inevitable (in Linear Models)
Weak-to-strong generalization is nearly inevitable in linear logistic regression for most student-teacher pairs without any model capacity mismatch.
-
Efficient Preference Poisoning Attack on Offline RLHF
Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.
-
Revisable by Design: A Theory of Streaming LLM Agent Execution
LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less co...
-
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
-
Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment...
-
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
-
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
-
Variance-aware Reward Modeling with Anchor Guidance
Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, ...
-
Structure from Strategic Interaction & Uncertainty: Risk Sensitive Games for Robust Preference Learning
Risk-sensitive preference games retain monotonicity via translation-invariant risk measures, enabling convergent self-play algorithms with stability bounds and empirical robustness across data strata.
-
A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models
Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.
-
A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models
Massive activations originate in a specific ME Layer across LLM families; reducing their token rigidity via a targeted method boosts performance and mitigates attention sinks.
-
The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play
Anchored Bipolicy Self-Play trains role-specific LoRA adapters on a frozen base model to break self-consistency collapse in self-play red-teaming, yielding up to 100x parameter efficiency and stronger safety on Qwen2....
-
Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.
-
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...
-
Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization
Topology-enhanced alignment via persistent homology on trajectories outperforms standard SFT and DPO baselines on preference metrics for LLMs.
-
Theoretical Limits of Language Model Alignment
The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.
-
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...
-
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
-
Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion
Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reaso...
-
Moira: Language-driven Hierarchical Reinforcement Learning for Pair Trading
Moira parameterizes hierarchical RL policies for pair trading with LLMs and adapts them via prompt updates based on trajectory and episode feedback, outperforming baselines on real market data.
-
Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity
UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.
-
Attention Is Where You Attack
ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.
-
Mind the Gap: Structure-Aware Consistency in Preference Learning
Standard DPO surrogates are inconsistent for equicontinuous neural nets; SA-DPO provides structure-aware H-consistency bounds by adapting margins to semantic distance and shows heavy-tailed losses yield superior guara...
-
Three Models of RLHF Annotation: Extension, Evidence, and Authority
RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
-
RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking
RSRCC is a new 126k-question benchmark for fine-grained remote sensing change question-answering, constructed via a hierarchical semi-supervised pipeline with retrieval-augmented Best-of-N ranking.
-
IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning
IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of...
-
Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback
Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.
-
Causal inference for social network formation
Random team assignments in a professional firm reveal that indirect ties strongly increase new direct tie formation, while effects of degree and local density are smaller and less robust.
-
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
-
Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization
STOMP extends direct preference optimization to the multi-objective setting via smooth Tchebysheff scalarization and standardization of observed rewards, achieving highest hypervolume in eight of nine protein engineer...
-
AI Integrity: A New Paradigm for Verifiable AI Governance
AI Integrity is defined as verifiable protection of an AI system's four-layer Authority Stack from corruption, with PRISM as the measurement framework.
-
Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models
Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.
-
Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning
COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...
-
SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation
SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.
-
PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
-
Personalizing Text-to-Image Generation to Individual Taste
PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
-
Many Preferences, Few Policies: Towards Scalable Language Model Personalization
PALM produces a small portfolio of LLMs that contains a near-optimal model for any user preference weight vector, with theoretical bounds on portfolio size and approximation quality.
-
Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback
Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
TextGrad: Automatic "Differentiation" via Text
TextGrad performs automatic differentiation for compound AI systems by backpropagating natural-language feedback from LLMs to optimize variables ranging from code to molecular structures.
-
KTO: Model Alignment as Prospect Theoretic Optimization
KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.
-
Self-Rewarding Language Models
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
-
Measuring Faithfulness in Chain-of-Thought Reasoning
Chain-of-Thought reasoning in LLMs is often unfaithful, with models relying on it variably by task and less so as models scale larger.
-
QLoRA: Efficient Finetuning of Quantized LLMs
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
-
Eliciting Latent Predictions from Transformers with the Tuned Lens
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
-
Fusion-fission forecasts when AI will shift to undesirable behavior
A vector generalization of fusion-fission group dynamics from physics forecasts when AI behavior shifts to undesirable states, validated at 90 percent across seven models and prior to real-world data.
-
Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents
A dual hierarchical RL framework lets agents learn when and how to ask probing questions in U.S. Supreme Court arguments, outperforming baselines on a court dataset.
-
History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
A single consistency instruction with harmful prior actions causes aligned frontier LLMs to select unsafe options at 91-98% rates in high-stakes domains, with escalation and inverse scaling by model size.
-
Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models
A hierarchical genetic algorithm induces overthinking in black-box LRMs, increasing output length by up to 26.1x on the MATH benchmark.
-
Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models
A hierarchical genetic algorithm induces overthinking in black-box large reasoning models by perturbing logical structure, achieving up to 26.1x longer outputs on the MATH benchmark.
-
Learning Perturbations to Extrapolate Your LLM
A learnable continuous perturbation framework for LLM token prefixes via latent vector transformations, optimized through unbiased estimating equations, yields gains in out-of-domain performance.
-
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
Sparse RL on capable teachers followed by dense distillation to students beats direct GRPO on students for verifiable math reasoning.
-
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
-
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
-
On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.
-
SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models
SafeSteer improves safety in multimodal large language models by up to 33.4% via a decoding probe and modal alignment vector without any fine-tuning.
-
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...
-
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...