A General Language Assistant as a Laboratory for Alignment

Amanda Askell; Andy Jones; Anna Chen; Ben Mann; Catherine Olsson; Chris Olah; Danny Hernandez; Dario Amodei; Dawn Drain; Deep Ganguli

arXiv: 2112.00861 · v3 · submitted 2021-12-01 · cs.CL · cs.LG

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Jared Kaplan

Pith reviewed 2026-05-11 14:17 UTC · model grok-4.3

keywords: language model alignment, preference modeling, imitation learning, helpful honest harmless, scaling trends, human feedback, prompting, alignment evaluations

The pith

Ranked preference modeling outperforms imitation learning and scales better with model size when aligning language models to human values.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates simple methods to turn large language models into general assistants that are helpful, honest, and harmless. It shows that basic prompting interventions produce bigger gains on alignment measures as models increase in size and do not reduce general performance. Comparing training objectives reveals that ranked preference modeling, which trains on human orderings of possible outputs, beats straightforward imitation of human text and often improves more rapidly with scale. Binary discrimination of good versus bad responses performs and scales much like imitation. A pre-training stage on preferences is also tested to lower the amount of human feedback needed during fine-tuning.

Core claim

The authors establish that ranked preference modeling performs much better than imitation learning on alignment evaluations and frequently scales more favorably with model size, while binary discrimination typically performs and scales similarly to imitation learning. Modest prompting interventions yield benefits that grow with model size, generalize across alignment tests, and leave large-model capabilities intact.

What carries the argument

Ranked preference modeling, which trains the model to predict human rankings of alternative responses rather than simply copying desired text or making binary good/bad judgments.
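To make the contrast between the three objectives concrete, here is a minimal sketch of one per-example loss for each. It assumes a model that emits either per-token log-probabilities (for imitation) or a single scalar score per response (for the two discriminative objectives). The function names, the scalar-score setup, and the exact loss forms are illustrative assumptions, not details taken from this page or confirmed as the authors' implementation.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def imitation_loss(token_log_probs: Sequence[float]) -> float:
    """Imitation: mean negative log-likelihood of a demonstrated response."""
    return -sum(token_log_probs) / len(token_log_probs)


def binary_discrimination_loss(score: float, is_good: bool) -> float:
    """Binary discrimination: cross-entropy on one good/bad label."""
    # -log(sigmoid(score)) for a good response, -log(1 - sigmoid(score)) for a bad one.
    return softplus(-score) if is_good else softplus(score)


def ranked_preference_loss(score_better: float, score_worse: float) -> float:
    """Ranked preference: pairwise loss log(1 + exp(score_worse - score_better))."""
    return softplus(score_worse - score_better)
```

In this sketch the ranked loss depends only on the difference between the two scores, so the model is penalized whenever it orders a pair against the human ranking; the binary loss sees each response in isolation.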

If this is right

Alignment interventions such as prompting become more effective as model size grows.
Ranked preference training can deliver stronger alignment without sacrificing the model's core capabilities.
Binary discrimination methods offer little improvement over basic imitation learning.
A preference-model pre-training stage can reduce the volume of human preference data required for fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These scaling patterns suggest alignment may become easier to achieve with future, larger models if ranked preferences remain the superior objective.
The results point toward using preference pre-training as a way to make alignment more data-efficient across different model families.
The setup provides a controllable testbed for studying how different objectives interact with model scale on the same set of alignment metrics.
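One simple way to compare how objectives interact with scale is to fit, for each objective, a straight line of evaluation score against the logarithm of parameter count and compare slopes. The sketch below assumes exactly that; the function name and the choice of a log-linear fit are illustrative assumptions, not a method described on this page.

```python
import math
from typing import Sequence


def slope_per_decade(param_counts: Sequence[float], scores: Sequence[float]) -> float:
    """Least-squares slope of score against log10(parameter count)."""
    xs = [math.log10(n) for n in param_counts]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(scores) / len(scores)
    # Ordinary least squares for a single predictor: cov(x, y) / var(x).
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, scores))
    variance = sum((x - x_mean) ** 2 for x in xs)
    return covariance / variance
```

Calling the function once per objective on the same model sizes gives two slopes; under this reading, the objective with the larger slope is the one that "scales more favorably".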

Load-bearing premise

The proxy evaluations chosen for helpfulness, honesty, and harmlessness sufficiently represent the full range of alignment properties needed in real-world use.

What would settle it

Training a substantially larger model with imitation learning alone and finding that it matches or exceeds the alignment scores of an equivalent model trained with ranked preference modeling on the same HHH evaluations.

Original abstract

Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray in this direction we study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. Next we investigate scaling trends for several training objectives relevant to alignment, comparing imitation learning, binary discrimination, and ranked preference modeling. We find that ranked preference modeling performs much better than imitation learning, and often scales more favorably with model size. In contrast, binary discrimination typically performs and scales very similarly to imitation learning. Finally we study a 'preference model pre-training' stage of training, with the goal of improving sample efficiency when finetuning on human preferences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Ranked preference modeling beats imitation learning on alignment benchmarks and scales better with size, but the gap may trace to uneven supervision volume rather than the loss itself.

The central result is that ranked preference modeling pulls ahead of imitation learning on the helpfulness-honesty-harmlessness proxies and improves more as models scale up, while binary discrimination looks basically like imitation. The paper also shows that a preference-model pre-training stage can help later fine-tuning use fewer human labels. These are direct comparisons across model sizes on the same base models, which gives a clearer picture than isolated runs of one method or another.

The work is straightforward about starting with simple baselines and checking that the alignment steps do not hurt general capabilities much at larger scales. That part is useful to see laid out with trends rather than single-point claims. The authors are clear that this is an early look, not a complete solution.

The main soft spot is the data-volume question. Imitation trains on positive demonstrations while ranked preferences use human-ranked pairs, and the paper does not report or match the total number of human judgments or effective tokens across conditions. If the preference runs simply see more labeled data, the performance and scaling edge could be an artifact of richer supervision instead of the objective. Binary discrimination tracking imitation is consistent with that possibility. The proxy evaluations are reasonable starting points but remain limited, as the authors note.

This paper is for researchers working on scaling alignment techniques for general language models, especially those who want empirical trends on preference objectives. Readers who follow RLHF-style methods or scaling laws for safety will find the comparisons worth looking at. It has enough new empirical content and clear enough framing to deserve a serious referee, even though the controls around data matching will need tightening in revision. I would send it out for peer review after asking the authors to clarify annotation counts and token volumes per condition.

Referee Report

2 major / 2 minor

Summary. The paper studies simple baselines for aligning large language models to be helpful, honest, and harmless. It first examines prompting interventions and finds that their benefits grow with model size without harming capabilities. It then compares scaling trends across three training objectives on human feedback data: imitation learning (SFT on positive demonstrations), binary discrimination, and ranked preference modeling. The central empirical claim is that ranked preference modeling substantially outperforms imitation learning and often scales more favorably with model size, while binary discrimination performs and scales similarly to imitation. The work also introduces a preference-model pre-training stage intended to improve sample efficiency when fine-tuning on human preferences. All results are obtained from independent training runs evaluated on held-out data.

Significance. If the central comparisons hold after controlling for supervision volume, the results would be a useful empirical contribution to alignment research by showing that preference-based objectives can be more effective and scale better than pure imitation. The independent training runs and held-out evaluations are a strength that supports the reliability of the reported scaling trends. The work also provides a laboratory-style exploration of alignment techniques that could inform later studies on larger models.

major comments (2)

[Section 4 (Scaling Trends for Alignment Objectives)] The central claim that ranked preference modeling outperforms imitation learning and scales more favorably rests on comparisons whose supervision budgets are not matched or reported. The manuscript does not state the total number of human annotations (demonstrations vs. ranked pairs) or effective training tokens supplied to each objective. If ranked preference modeling receives substantially more labeled data, the performance gap and favorable scaling could be artifacts of data volume rather than intrinsic properties of the loss. Binary discrimination performing similarly to imitation is consistent with this alternative explanation. A matched-budget ablation or explicit reporting of annotation counts per condition is required to isolate the effect of the objective.
[Section 5 (Evaluations)] The proxy evaluations for helpfulness, honesty, and harmlessness are used to support all scaling claims, yet the manuscript provides insufficient detail on data splits, statistical controls, and error analysis. Without these, it is not possible to verify that post-hoc evaluation choices do not influence the reported trends. The weakest assumption—that these proxies adequately capture the alignment properties needed for deployment—remains untested.
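The matched-budget ablation requested in the first major comment could be set up roughly as follows: count the human annotations available to each objective, then subsample every condition down to the smallest count before training. This is a minimal sketch under the assumption that one list element corresponds to one human annotation; `match_annotation_budget` and the dictionary layout are hypothetical names for illustration, not part of the paper.

```python
import random
from typing import Dict, List, Sequence


def match_annotation_budget(
    datasets: Dict[str, Sequence[object]], seed: int = 0
) -> Dict[str, List[object]]:
    """Subsample every condition to the size of the smallest one.

    Assumes each element of a dataset is exactly one human annotation
    (a demonstration, a binary label, or a ranked pair).
    """
    rng = random.Random(seed)
    budget = min(len(examples) for examples in datasets.values())
    # Sampling without replacement keeps each retained annotation unique.
    return {
        name: rng.sample(list(examples), budget)
        for name, examples in datasets.items()
    }
```

Whether a ranked pair and a single demonstration should count as equal units of annotation effort is itself a design choice; matching on tokens or on annotator time would need a different budget function.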

minor comments (2)

[Section 3 (Methods)] Notation for the three objectives (imitation, binary discrimination, ranked preference) is introduced clearly but could be summarized in a single table for quick reference when reading the scaling plots.
[Section 4] Figure captions for the scaling plots should explicitly state the number of independent runs and any error bars used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive suggestions. We address each major comment below and will make revisions to improve clarity and rigor.

Referee: [Section 4 (Scaling Trends for Alignment Objectives)] The central claim that ranked preference modeling outperforms imitation learning and scales more favorably rests on comparisons whose supervision budgets are not matched or reported. The manuscript does not state the total number of human annotations (demonstrations vs. ranked pairs) or effective training tokens supplied to each objective. If ranked preference modeling receives substantially more labeled data, the performance gap and favorable scaling could be artifacts of data volume rather than intrinsic properties of the loss. Binary discrimination performing similarly to imitation is consistent with this alternative explanation. A matched-budget ablation or explicit reporting of annotation counts per condition is required to isolate the effect of the objective.

Authors: We agree that explicit reporting of supervision budgets is essential. The revised manuscript will include a new table (or expanded methods subsection) detailing the exact number of human annotations and effective training tokens for each objective. All data originates from the same human feedback collection pipeline: imitation learning uses positive demonstrations, while ranked preference modeling uses the corresponding ranked pairs (typically 2–4 comparisons per prompt). Binary discrimination uses the same pairs but with binary labels. Although the number of ranked pairs exceeds the number of single demonstrations, the performance advantage and scaling trends for ranked preference modeling persist even when normalizing for annotation effort. We will also add a brief discussion of this point and note that a fully matched-budget ablation is planned for follow-up work. revision: yes
Referee: [Section 5 (Evaluations)] The proxy evaluations for helpfulness, honesty, and harmlessness are used to support all scaling claims, yet the manuscript provides insufficient detail on data splits, statistical controls, and error analysis. Without these, it is not possible to verify that post-hoc evaluation choices do not influence the reported trends. The weakest assumption—that these proxies adequately capture the alignment properties needed for deployment—remains untested.

Authors: We will expand Section 5 with the requested details: explicit descriptions of train/validation/test splits for each proxy task, any statistical controls (e.g., bootstrapped confidence intervals or significance tests on scaling trends), and a short error analysis of the proxy metrics. We acknowledge that these proxies are imperfect stand-ins for real-world alignment and do not claim they fully capture deployment requirements. The revised text will add an explicit limitations paragraph stating that further validation through deployment studies or more comprehensive human evaluations would be needed, positioning the current results as an initial laboratory exploration rather than a definitive demonstration. revision: yes
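As an illustration of the bootstrapped confidence intervals mentioned in the response above, the sketch below computes a percentile bootstrap interval for the mean of per-example evaluation outcomes (for instance 1 for a correct choice, 0 otherwise). The function name, the default of 2000 resamples, and the percentile method are assumptions chosen for illustration, not details given by the authors.

```python
import random
from typing import Sequence, Tuple


def bootstrap_mean_ci(
    outcomes: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `outcomes`."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower_index = max(0, int((alpha / 2) * n_resamples))
    upper_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lower_index], means[upper_index]
```

Computing one interval per model size and per objective would give the error bars that the referee asks to see on the scaling plots.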

Circularity Check

0 steps flagged

No significant circularity; empirical results are independent

The paper's core claims rest on direct experimental comparisons of training objectives (imitation learning, binary discrimination, ranked preference modeling) via independent runs and held-out evaluations. No equations, fitted parameters, or self-citations reduce the reported performance gaps or scaling trends to inputs by construction. The analysis uses external benchmarks and does not invoke uniqueness theorems or ansatzes from prior self-work as load-bearing justification.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on empirical training runs and evaluation metrics rather than new mathematical derivations or postulated entities. Standard machine-learning assumptions about generalization from preference data are used.

axioms (1)

Domain assumption: Human preference rankings collected for the study are consistent and representative of desired alignment properties.
Invoked when interpreting ranked preference modeling results as alignment progress.

pith-pipeline@v0.9.0 · 5525 in / 1107 out tokens · 86702 ms · 2026-05-11T14:17:58.602248+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We find that ranked preference modeling performs much better than imitation learning, and often scales more favorably with model size. In contrast, binary discrimination typically performs and scales very similarly to imitation learning."

Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
cs.CL 2023-09 unverdicted novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
cs.CL 2023-08 conditional novelty 8.0

XSTest is a benchmark for detecting exaggerated safety refusals in large language models on clearly safe prompts.
Instruction Tuning with GPT-4
cs.CL 2023-04 unverdicted novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
Editing Models with Task Arithmetic
cs.LG 2022-12 accept novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
Discovering Latent Knowledge in Language Models Without Supervision
cs.CL 2022-12 conditional novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...
Teaching Models to Express Their Uncertainty in Words
cs.CL 2022-05 unverdicted novelty 8.0

GPT-3 can learn to express well-calibrated uncertainty about its answers using natural language phrases rather than logits.
TruthfulQA: Measuring How Models Mimic Human Falsehoods
cs.CL 2021-09 unverdicted novelty 8.0

A new benchmark reveals that language models including GPT-3 are truthful on only 58% of questions designed to elicit popular misconceptions, far below human performance of 94%, with larger models performing worse.
Self-Policy Distillation via Capability-Selective Subspace Projection
cs.CL 2026-05 unverdicted novelty 7.0

Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines...
Measuring Safety Alignment Effects in Autonomous Security Agents
cs.CR 2026-05 conditional novelty 7.0

A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security...
Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents
cs.CL 2026-05 unverdicted novelty 7.0

The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.
Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks
cs.SE 2026-05 conditional novelty 7.0

The paper presents OverEager-Gen, a 500-scenario benchmark showing that removing consent declarations from prompts increases overeager actions by 11.9-17.2 percentage points across models, with agent framework choice ...
Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design
cs.MA 2026-05 unverdicted novelty 7.0

External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.
Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
cs.AI 2026-05 unverdicted novelty 7.0

LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.
Three Models of RLHF Annotation: Extension, Evidence, and Authority
cs.CY 2026-04 unverdicted novelty 7.0

RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
cs.AI 2026-04 unverdicted novelty 7.0

Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...
Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback
cs.LG 2026-04 unverdicted novelty 7.0

Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.
Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
cs.LG 2026-04 conditional novelty 7.0

Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation
cs.CL 2026-04 unverdicted novelty 7.0

EuropeMedQA is presented as the first comprehensive multilingual and multimodal medical examination dataset drawn from official regulatory exams in four European countries.
SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation
cs.CL 2026-04 accept novelty 7.0

SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.
Hidden Elo: Private Matchmaking through Encrypted Rating Systems
cs.CR 2026-03 unverdicted novelty 7.0

H-Elo is an FHE-based protocol that enables private rating-based matchmaking while achieving accuracy comparable to plaintext implementations.
HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench
cs.LG 2026-01 unverdicted novelty 7.0

HE-SNR is a high-entropy signal-to-noise ratio metric derived from the Entropy Compression Hypothesis to better guide LLM mid-training on complex software engineering benchmarks.
Incentivizing High-Quality Human Annotations with Golden Questions
cs.GT 2025-05 unverdicted novelty 7.0

The paper derives a Θ(1/√(n log n)) hypothesis testing rate under strategic annotator behavior and shows that high-certainty, format-similar golden questions better reveal annotation quality than standard checks.
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
cs.CL 2023-10 conditional novelty 7.0

Varying decoding strategies such as temperature and sampling methods jailbreaks safety alignments in open-source LLMs, raising misalignment from 0% to over 95% at 30x lower cost than prior attacks.
Let's Verify Step by Step
cs.LG 2023-05 accept novelty 7.0

Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.
QLoRA: Efficient Finetuning of Quantized LLMs
cs.LG 2023-05 conditional novelty 7.0

QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
Visual Instruction Tuning
cs.CV 2023-04 unverdicted novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
In-context Learning and Induction Heads
cs.LG 2022-09 unverdicted novelty 7.0

Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning i...
Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation
cs.CL 2026-05 conditional novelty 6.0

LLMs generate adequate counterspeech for co-occurring hate and misinformation in 40% of cases, with a mixed knowledge strategy from fact-checkers and NGOs proving most effective after expert revision.
When Vision Speaks for Sound
cs.CV 2026-05 unverdicted novelty 6.0

Video MLLMs show an audio-visual Clever Hans effect relying on visual-acoustic correlations rather than audio verification; Thud interventions diagnose it and a 10K-sample preference alignment improves intervention pe...
Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models
cs.CL 2026-05 unverdicted novelty 6.0

Steering Dark Triad features in an LLM increases exploitative and aggressive behavior while leaving strategic deception and cognitive empathy unchanged, indicating dissociable antisocial pathways.
Understanding Annotator Safety Policy with Interpretability
cs.AI 2026-05 unverdicted novelty 6.0

Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.
Multilingual Safety Alignment via Self-Distillation
cs.LG 2026-05 unverdicted novelty 6.0

MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment
cs.LG 2026-04 unverdicted novelty 6.0

MGDA-Decoupled applies geometry-based multi-objective optimization within the DPO framework to find shared descent directions that account for each objective's convergence dynamics, yielding higher win rates on UltraFeedback.
AlignCultura: Towards Culturally Aligned Large Language Models?
cs.CL 2026-04 unverdicted novelty 6.0

Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.
The Triadic Loop: A Framework for Negotiating Alignment in AI Co-hosted Livestreaming
cs.HC 2026-04 unverdicted novelty 6.0

The Triadic Loop reconceptualizes AI alignment in livestreaming as a temporally reinforced process of bidirectional adaptation among streamer, AI co-host, and audience.
CoAct: Co-Active LLM Preference Learning with Human-AI Synergy
cs.CL 2026-04 unverdicted novelty 6.0

CoAct synergistically merges self-rewarding and active learning via self-consistency to select reliable AI labels and oracle-needed samples, delivering 8-13% gains on GSM8K, MATH, and WebInstruct.
Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest
cs.AI 2026-04 unverdicted novelty 6.0

Many LLMs prioritize company ad incentives over user welfare by recommending pricier sponsored products, disrupting purchases, or concealing prices in comparisons.
Human Values Matter: Investigating How Misalignment Shapes Collective Behaviors in LLM Agent Communities
cs.CL 2026-04 unverdicted novelty 6.0

Misalignment with structurally critical human values in LLM agent communities produces macro-level collapses and micro-level emergent behaviors such as deception.
Evaluating Artificial Intelligence Through a Christian Understanding of Human Flourishing
cs.AI 2026-04 unverdicted novelty 6.0

Frontier AI models default to procedural secularism and score 17 points lower on Christian human-flourishing criteria than on pluralistic ones, with a 31-point gap in faith and spirituality.
Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
cs.AI 2026-04 unverdicted novelty 6.0

Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.
VC-Soup: Value-Consistency Guided Multi-Value Alignment for Large Language Models
cs.LG 2026-03 unverdicted novelty 6.0

VC-Soup uses a cosine-similarity consistency metric to filter data, trains value-consistent policies, and applies linear merging with Pareto filtering to improve multi-value LLM alignment trade-offs.
Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection
cs.LG 2026-02 conditional novelty 6.0

OGPSA projects safety gradients orthogonal to a low-rank subspace from general capability gradients, improving safety-utility trade-offs in SFT and DPO pipelines on Qwen2.5-7B and Llama3.1-8B.
Factored Causal Representation Learning for Robust Reward Modeling in RLHF
cs.LG 2026-01 unverdicted novelty 6.0

A factored causal representation learning method improves robustness of reward models in RLHF by isolating causal factors from biases like length and sycophancy using adversarial gradient reversal.
Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models
cs.CR 2025-12 unverdicted novelty 6.0

A meta-prompt and hierarchical detection framework automates LLM red-teaming, achieving 3.9 times higher vulnerability discovery rate than manual methods with 89% accuracy on GPT-OSS-20B.
When Slower Isn't Truer: Inverse Scaling Law of Truthfulness in Multimodal Reasoning
cs.AI 2025-05 unverdicted novelty 6.0

Slower multimodal reasoning models exhibit inverse scaling in truthfulness by fabricating details under ambiguous visual inputs, while faster models remain more cautious via broader inference.
Supervising the search process produces reliable and generalizable information-seeking agents
cs.CL 2025-02 unverdicted novelty 6.0

Process supervision via RAG-Gym produces more reliable and generalizable search agents, with gains driven by higher-quality queries on out-of-domain multi-hop tasks.
How Humans Help LLMs: Assessing and Incentivizing Human Preference Annotators
cs.LG 2025-02 unverdicted novelty 6.0

Develops self-consistency monitoring for preference annotators and derives sample-complexity bounds showing linear contracts achieve near-ideal performance faster than binary ones under continuous actions.
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
cs.AI 2024-08 conditional novelty 6.0

Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
TouchAI: Exploring human-AI perceptual alignment in touch through language model representations
cs.CL 2024-06 unverdicted novelty 6.0

LLMs show partial and variable perceptual alignment with human touch on textiles, succeeding on samples like silk satin but failing on cotton denim when matching descriptive language to embedding similarity.
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
cs.AI 2024-05 unverdicted novelty 6.0

OpenRLHF is a new open-source RLHF framework reporting 1.22x to 1.68x speedups and fewer lines of code than prior systems.
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
cs.CR 2024-04 unverdicted novelty 6.0

Training LLMs on data that enforces priority levels for instructions makes models robust to prompt injection attacks, including unseen ones, with little loss on standard tasks.
Laissez-Faire Harms: Algorithmic Biases in Generative Language Models
cs.CL 2024-04 unverdicted novelty 6.0

Generative LMs in laissez-faire open-ended prompting settings disproportionately generate subordinated portrayals of minoritized race, gender, and sexual orientation identities at rates hundreds to thousands of times ...
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
cs.CL 2024-04 conditional novelty 6.0

MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
A Roadmap to Pluralistic Alignment
cs.AI 2024-02 unverdicted novelty 6.0

The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.
Steering Llama 2 via Contrastive Activation Addition
cs.CL 2023-12 unverdicted novelty 6.0

Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.
Directly Fine-Tuning Diffusion Models on Differentiable Rewards
cs.CV 2023-09 conditional novelty 6.0

DRaFT fine-tunes diffusion models by differentiating through sampling to maximize rewards, outperforming RL baselines and improving aesthetics on Stable Diffusion 1.4.
Simple synthetic data reduces sycophancy in large language models
cs.CL 2023-08 unverdicted novelty 6.0

Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
cs.CV 2023-06 unverdicted novelty 6.0

LLaVA-Med is created via curriculum fine-tuning on PubMed figure-caption pairs and GPT-4 self-instructed data, achieving competitive or better results than prior supervised models on three biomedical VQA benchmarks.
Aligning Text-to-Image Models using Human Feedback
cs.LG 2023-02 unverdicted novelty 6.0

A three-stage fine-tuning process uses human ratings to train a reward model and then improves text-to-image alignment by maximizing reward-weighted likelihood.
Measuring Progress on Scalable Oversight for Large Language Models
cs.HC 2022-11 unverdicted novelty 6.0

Humans chatting with an unreliable LLM assistant outperform both the model alone and unaided humans on MMLU and time-limited QuALITY tasks.

Reference graph

Works this paper leans on

241 extracted references · 241 canonical work pages · cited by 90 Pith papers · 45 internal anchors

