Recognition: unknown
GPT-4 Technical Report
read the original abstract
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
This paper has not been read by Pith yet.
Forward citations
Cited by 60 Pith papers
-
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...
-
Pretraining Exposure Explains Popularity Judgments in Large Language Models
LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.
-
Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts
An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.
-
Approximation Error Upper and Lower Bounds for H\"{o}lder Class with Transformers
A standard Transformer with O(ε^{-d0/α}) blocks can approximate any bounded d0-dimensional Hölder function of smoothness α to accuracy ε, but at least Ω(ε^{-d0/(4α)}) blocks are required.
-
When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
-
LLM Translation of Compiler Intermediate Representation
IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.
-
Nearly Optimal Attention Coresets
ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.
-
Efficient Preference Poisoning Attack on Offline RLHF
Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.
-
Characterizing the Expressivity of Local Attention in Transformers
Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressive...
-
From Context to Skills: Can Language Models Learn from Context Skillfully?
Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.
-
Revisable by Design: A Theory of Streaming LLM Agent Execution
LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less co...
-
RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering
RespondeoQA is the first benchmark dataset for question answering and translation between Latin and English, with 7,800 pairs from pedagogical sources and initial LLM evaluations.
-
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmen...
-
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
-
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
-
PhysInOne: Visual Physics Learning and Reasoning in One Suite
PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and m...
-
HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
-
Disentangling MLP Neuron Weights in Vocabulary Space
ROTATE disentangles MLP neurons into faithful vocabulary channels by optimizing weight rotations to maximize vocabulary-space kurtosis, outperforming activation-based baselines for neuron descriptions.
-
ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos
ActivityForensics is the first large-scale benchmark for temporally localizing activity-level forgeries in videos, paired with a diffusion-based baseline called TADiff.
-
FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations
FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in...
-
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems
DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.
-
AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks
AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction parado...
-
Adaptive Stopping for Multi-Turn LLM Reasoning
MiCP is the first conformal prediction method for multi-turn LLM pipelines that allocates per-turn error budgets to enable adaptive stopping with an overall coverage guarantee, shown to reduce turns and cost on RAG an...
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
-
RULER: What's the Real Context Size of Your Long-Context Language Models?
RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
-
The Linear Representation Hypothesis and the Geometry of Large Language Models
Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
-
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.
-
GHGbench: A Unified Multi-Entity, Multi-Task Benchmark for Carbon Emission Prediction
GHGbench is a new multi-entity benchmark for company- and building-level carbon emission prediction that shows building tasks are harder, out-of-distribution gaps dominate, and multimodal data aids generalization.
-
Sampling from Flow Language Models via Marginal-Conditioned Bridges
Marginal-conditioned bridges enable training-free sampling from Flow Language Models by drawing clean one-hot endpoints from factorized posteriors and using Ornstein-Uhlenbeck bridges, preserving token marginals and r...
-
Decoupled and Divergence-Conditioned Prompt for Multi-domain Dynamic Graph Foundation Models
DyGFM introduces decoupled pre-training and divergence-conditioned prompts to create the first multi-domain dynamic graph foundation model that outperforms baselines on node classification and link prediction.
-
Query-Conditioned Test-Time Self-Training for Large Language Models
QueST lets LLMs create query-conditioned problem-solution pairs at inference time and use them for parameter-efficient self-training, outperforming prior test-time baselines on math and science benchmarks.
-
A Hybrid Framework for Natural Language Querying of IFC Models with Relational and Graph Representations
IfcLLM combines relational and graph representations of IFC models with iterative LLM reasoning to deliver 93.3-100% first-attempt accuracy on natural language queries across three test models.
-
STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition
STAR improves 1-shot action recognition by up to 8.1% on SSv2-Full through semantic-temporal alignment and Mamba-based prototype refinement.
-
CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large VIsion-Language Models
LiteLVLM prunes visual tokens for pixel grounding by reversing CLIP visual-text similarity to retain referent region tokens, outperforming prior methods by over 5% with 22% speedup and 2.3x memory reduction without an...
-
OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems
OxyEcomBench is a unified multimodal benchmark covering 6 capability areas and 29 tasks with authentic e-commerce data to measure how well foundation models handle real platform, merchant, and customer challenges.
-
ImageAttributionBench: How Far Are We from Generalizable Attribution?
ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
-
Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
-
The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge
In two-layer networks, weak-to-strong training elicits the target feature direction from pre-trained subspaces and preserves correlated off-target features, unlike standard fine-tuning.
-
State-Centric Decision Process
SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.
-
G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models
G²TR reduces visual tokens and prefill computation by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency while preserving reasoning accuracy and editing quality.
-
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
-
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...
-
CR^2: Cost-Aware Risk-Controlled Routing for Wireless Device-Edge LLM Inference
CR^2 matches full-information routing performance for device-edge LLM inference using only device-side signals and cuts normalized deployment cost by up to 16.9% at matched accuracy.
-
Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters
Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.
-
Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
-
Online Continual Learning with Dynamic Label Hierarchies
HALO improves online continual learning under evolving label hierarchies by adaptively combining classification heads regularized with organized learnable prototypes for better adaptation and reduced forgetting.
-
SoK: Unlearnability and Unlearning for Model Dememorization
The first integrated taxonomy, empirical study of interplay and shallow dememorization, plus a theoretical guarantee on dememorization depth for certified unlearning.
-
PRISM: : Planning and Reasoning with Intent in Simulated Embodied Environments
PRISM is a tiered benchmark with 300 human-verified tasks across five photorealistic apartments that diagnoses embodied agent failures in basic ability, reasoning ability, and long-horizon ability using an agent-agnostic API.
-
Kairos: A Scalable Serving System for Physical AI
Kairos is the first multi-robot serving system that treats the generate-execute loop as a first-class citizen and reduces average task latency by 31.8-66.5% versus digital AI serving systems.
-
Neural Statistical Functions
Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.
-
EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales
EVOCHAMBER enables test-time co-evolution of multi-agent systems across three scales, producing emergent niche specialists and performance gains of up to 32% relative on math tasks with Qwen3-8B.
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
-
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
StereoTales shows that LLMs produce harmful, culturally adapted stereotypes in open-ended multilingual stories, with patterns consistent across providers and aligned human-LLM harm judgments.
-
SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation
SciVQR is a new benchmark dataset for evaluating multimodal AI models on complex scientific reasoning tasks across six disciplines, including expert solutions for nearly half the items.
-
V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning
V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.
-
PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
PlantMarkerBench is a new multi-species benchmark with 5,550 evidence instances for evaluating language models on literature-grounded plant marker gene reasoning across expression, localization, function, indirect, an...
-
PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
PlantMarkerBench supplies 5,550 literature sentences annotated for plant marker gene evidence validity and type across Arabidopsis, maize, rice and tomato, showing frontier LLMs handle direct expression evidence but s...
-
GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation
GraphInstruct is a progressive benchmark with six complexity levels for LLM graph generation that identifies multi-constraint composition as the hardest point and shows a verification-guided iterative framework outper...
Reference graph
Works this paper leans on
-
[1]
Understanding the Capabilities, Limita- tions, and Societal Impact of Large Language Models,
A. Tamkin, M. Brundage, J. Clark, and D. Ganguli, “Understanding the Capabilities, Limita- tions, and Societal Impact of Large Language Models,” Feb. 2021
work page 2021
- [2]
-
[3]
WebGPT: Improving the factual accuracy of language models through web browsing
J. Hilton, R. Nakano, S. Balaji, and J. Schulman, “WebGPT: Improving the factual accuracy of language models through web browsing. ” https://openai.com/research/webgpt, Dec. 2021
work page 2021
-
[4]
ACT-1: Transformer for Actions – Adept
“ACT-1: Transformer for Actions – Adept. ” https://www.adept.ai/blog/act-1
-
[5]
Evaluating Large Language Models Trained on Code,
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Vo...
work page 2021
-
[6]
Ethical and social risks of harm from Language Models,
L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, P.-S. Huang, M. Cheng, M. Glaese, B. Balle, A. Kasirzadeh, Z. Kenton, S. Brown, W. Hawkins, T. Stepleton, C. Biles, A. Birhane, J. Haas, L. Rimell, L. A. Hendricks, W. Isaac, S. Legassick, G. Irving, and I. Gabriel, “Ethical and social risks of harm from Language Models,” Dec. 2021
work page 2021
-
[7]
Release Strategies and the Social Impacts of Language Models,
I. Solaiman, M. Brundage, J. Clark, A. Askell, A. Herbert-Voss, J. Wu, A. Radford, G. Krueger, J. W. Kim, S. Kreps, M. McCain, A. Newhouse, J. Blazakis, K. McGuffie, and J. Wang, “Release Strategies and the Social Impacts of Language Models,” Nov. 2019
work page 2019
-
[8]
Improving language understanding with unsupervised learning
A. Radford, “Improving language understanding with unsupervised learning. ” https://ope- nai.com/research/language-unsupervised, June 2018
work page 2018
-
[9]
Better language models and their implications
A. Radford, J. Wu, D. Amodei, D. Amodei, J. Clark, M. Brundage, I. Sutskever, A. Askell, D. Lansky, D. Hernandez, and D. Luan, “Better language models and their implications. ” https://openai.com/research/better-language-models, Feb. 2019
work page 2019
-
[10]
Language Models are Few-Shot Learners,
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amod...
work page 2020
-
[11]
S. Altman, “Planning for AGI and beyond. ” https://openai.com/blog/planning-for-agi-and- beyond, Feb. 2023
work page 2023
-
[12]
Training language models to follow instructions with human feedback,
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” Mar. 2022. 71
work page 2022
-
[13]
Deep reinforcement learning from human preferences,
P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” Feb. 2023
work page 2023
-
[14]
Model Cards for Model Reporting,
M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru, “Model Cards for Model Reporting,” in Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 220–229, Jan. 2019
work page 2019
-
[15]
System Cards, a new resource for under- standing how AI systems work
N. Green, C. Procope, A. Cheema, and A. Adediji, “System Cards, a new resource for under- standing how AI systems work. ” https://ai.facebook.com/blog/system-cards-a-new-resource- for-understanding-how-ai-systems-work/, Feb. 2022
work page 2022
-
[16]
DALL ·E 2 Preview - Risks and Limitations
“DALL ·E 2 Preview - Risks and Limitations. ” OpenAI, Apr. 2022
work page 2022
-
[17]
J. Sandbrink, H. Hobbs, J. Swett, A. Dafoe, and A. Sandberg, “Differential Technology Development: A Responsible Innovation Principle for Navigating Technology Risks,” Sept. 2022
work page 2022
-
[18]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback,
Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Gan- guli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Ka- plan...
work page 2022
-
[19]
Discovering Language Model Behaviors with Model-Written Evaluations,
E. Perez, S. Ringer, K. Lukoši¯ ut˙ e, K. Nguyen, E. Chen, S. Heiner, C. Pettit, C. Olsson, S. Kundu, S. Kadavath, A. Jones, A. Chen, B. Mann, B. Israel, B. Seethor, C. McKinnon, C. Olah, D. Yan, D. Amodei, D. Amodei, D. Drain, D. Li, E. Tran-Johnson, G. Khundadze, J. Kernion, J. Landis, J. Kerr, J. Mueller, J. Hyun, J. Landau, K. Ndousse, L. Goldberg, L....
work page 2022
-
[20]
B. P. Kehoe, Zen and the Art of the Internet . Project Gutenberg, June 1992
work page 1992
-
[21]
Lessons learned on language model safety and misuse
M. Brundage, K. Mayer, T. Eloundou, S. Agarwal, S. Adler, G. Krueger, J. Leike, and P. Mishkin, “Lessons learned on language model safety and misuse. ” https://ope- nai.com/research/language-model-safety-and-misuse, Mar. 2022
work page 2022
-
[22]
Language Models are Unsupervised Multitask Learners,
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” 2019
work page 2019
-
[23]
G. C. Bowker and S. L. Star, Sorting Things Out . MIT Press, Aug. 2000
work page 2000
-
[24]
Taxonomy of Risks posed by Language Models,
L. Weidinger, J. Uesato, M. Rauh, C. Griffin, P.-S. Huang, J. Mellor, A. Glaese, M. Cheng, B. Balle, A. Kasirzadeh, C. Biles, S. Brown, Z. Kenton, W. Hawkins, T. Stepleton, A. Birhane, L. A. Hendricks, L. Rimell, W. Isaac, J. Haas, S. Legassick, G. Irving, and I. Gabriel, “Taxonomy of Risks posed by Language Models,” in 2022 ACM Conference on Fairness, Acco...
work page 2022
-
[25]
Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets,
I. Solaiman and C. Dennison, “Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets,” Nov. 2021
work page 2021
-
[26]
Toward Comprehensive Risk Assessments and Assurance of AI-Based Systems,
H. Khlaaf, “Toward Comprehensive Risk Assessments and Assurance of AI-Based Systems,” Trail of Bits , 2023
work page 2023
-
[27]
Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims,
M. Brundage, S. A vin, J. Wang, H. Belfield, G. Krueger, G. Hadfield, H. Khlaaf, J. Yang, H. Toner, R. Fong, T. Maharaj, P. W. Koh, S. Hooker, J. Leung, A. Trask, E. Bluemke, J. Lebensold, C. O’Keefe, M. Koren, T. Ryffel, J. B. Rubinovitz, T. Besiroglu, F. Carugati, J. Clark, P. Eckersley, S. de Haas, M. Johnson, B. Laurie, A. Ingerman, I. Krawczuk, A. Askel...
work page 2020
-
[28]
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned,
D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Conerly, N. DasSarma, D. Drain, N. Elhage, S. El-Showk, S. Fort, Z. Hatfield-Dodds, T. Henighan, D. Hernandez, T. Hume, J. Jacobson, S. Johnston, S. Kravec, C. Olsson, S. Ringer, E. Tran-Johnson, D. Amodei, T. Brow...
work page 2022
-
[29]
Red Teaming Language Models with Language Models,
E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving, “Red Teaming Language Models with Language Models,” Feb. 2022
work page 2022
-
[30]
A Hazard Analysis Framework for Code Synthesis Large Language Models,
H. Khlaaf, P. Mishkin, J. Achiam, G. Krueger, and M. Brundage, “A Hazard Analysis Framework for Code Synthesis Large Language Models,” July 2022
work page 2022
-
[31]
On Faithfulness and Factuality in Abstractive Summarization,
J. Maynez, S. Narayan, B. Bohnet, and R. McDonald, “On Faithfulness and Factuality in Abstractive Summarization,” May 2020
work page 2020
-
[32]
TruthfulQA: Measuring How Models Mimic Human False- hoods,
S. Lin, J. Hilton, and O. Evans, “TruthfulQA: Measuring How Models Mimic Human False- hoods,” May 2022
work page 2022
-
[33]
Forecasting potential misuses of language models for disinformation campaigns and how to reduce risk
J. A. Goldstein, G. Sastry, M. Musser, R. DiResta, M. Gentzel, and K. Sedova, “Forecasting potential misuses of language models for disinformation campaigns and how to reduce risk. ” https://openai.com/research/forecasting-misuse, Jan. 2023
work page 2023
-
[34]
Truthful AI: Developing and governing AI that does not lie,
O. Evans, O. Cotton-Barratt, L. Finnveden, A. Bales, A. Balwit, P. Wills, L. Righetti, and W. Saunders, “Truthful AI: Developing and governing AI that does not lie,” Oct. 2021
work page 2021
-
[35]
Detoxifying Language Models Risks Marginalizing Minority Voices,
A. Xu, E. Pathak, E. Wallace, S. Gururangan, M. Sap, and D. Klein, “Detoxifying Language Models Risks Marginalizing Minority Voices,” Apr. 2021
work page 2021
-
[36]
Measuring and Mitigating Unintended Bias in Text Classification,
L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman, “Measuring and Mitigating Unintended Bias in Text Classification,” in Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society , AIES ’18, (New York, NY, USA), pp. 67–73, Association for Computing Machinery, Dec. 2018
work page 2018
-
[37]
A Holistic Approach to Undesired Content Detection in the Real World,
T. Markov, C. Zhang, S. Agarwal, T. Eloundou, T. Lee, S. Adler, A. Jiang, and L. Weng, “A Holistic Approach to Undesired Content Detection in the Real World,” Feb. 2023. 73
2023
-
[38]
How should AI systems behave, and who should decide?
OpenAI, “How should AI systems behave, and who should decide?. ” https://ope- nai.com/blog/how-should-ai-systems-behave, Feb. 2023
work page 2023
-
[39]
Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models,
M. Rauh, J. Mellor, J. Uesato, P.-S. Huang, J. Welbl, L. Weidinger, S. Dathathri, A. Glaese, G. Irving, I. Gabriel, W. Isaac, and L. A. Hendricks, “Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models,” Oct. 2022
work page 2022
-
[40]
L., Barocas, S., Daum \'e , III, H., and Wallach, H
S. L. Blodgett, S. Barocas, H. Daumé III, and H. Wallach, “Language (Technology) is Power: A Critical Survey of "Bias" in NLP. ” https://arxiv.org/abs/2005.14050v2, May 2020
-
[41]
On Measures of Biases and Harms in NLP,
S. Dev, E. Sheng, J. Zhao, A. Amstutz, J. Sun, Y. Hou, M. Sanseverino, J. Kim, A. Nishi, N. Peng, and K.-W. Chang, “On Measures of Biases and Harms in NLP,” in Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022 , (Online only), pp. 246–267, Association for Computational Linguistics, Nov. 2022
work page 2022
-
[42]
Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings,
T. Bolukbasi, K.-W. Chang, J. Zou, V. Saligrama, and A. Kalai, “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings,” July 2016
work page 2016
-
[43]
H. Gonen and Y. Goldberg, “Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , (Minneapolis, Minnesota), pp. 609...
work page 2019
-
[44]
Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns,
K. Webster, M. Recasens, V. Axelrod, and J. Baldridge, “Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns,” Oct. 2018
work page 2018
-
[45]
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? ,
E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell, “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? ,” in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency , (Virtual Event Canada), pp. 610–623, ACM, Mar. 2021
work page 2021
-
[46]
On the Opportunities and Risks of Foundation Models,
R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. Go...
work page 2021
-
[47]
S. U. Noble, Algorithms of Oppression . NYU Press, Feb. 2018
work page 2018
-
[48]
R. Richardson, J. Schultz, and K. Crawford, “Dirty Data, Bad Predictions: How Civil Rights Violations Impact Police Data, Predictive Policing Systems, and Justice,” Feb. 2019. 74
work page 2019
-
[49]
MacAskill, What We Owe The Future
W. MacAskill, What We Owe The Future . Basic Books, Aug. 2022
work page 2022
-
[50]
OpenAI, “GPT-2: 1.5B release. ” https://openai.com/research/gpt-2-1-5b-release, Nov. 2019
work page 2019
-
[51]
All the News That’s Fit to Fabricate: AI- Generated Text as a Tool of Media Misinformation,
S. Kreps, R. M. McCain, and M. Brundage, “All the News That’s Fit to Fabricate: AI- Generated Text as a Tool of Media Misinformation,” Journal of Experimental Political Science , vol. 9, no. 1, pp. 104–117, 2022/ed
work page 2022
-
[52]
B. Buchanan, A. Lohn, M. Musser, and K. Sedova, “Truth, Lies, and Automation,” tech. rep., Center for Security and Emerging Technology, May 2021
work page 2021
-
[53]
AI’s Powers of Political Persuasion
A. Myers, “AI’s Powers of Political Persuasion. ” https://hai.stanford.edu/news/ais-powers- political-persuasion, Feb. 2023
work page 2023
-
[54]
Artificial intelligence can persuade humans on political issues,
H. Bai, J. Voelkel, J. Eichstaedt, and R. Willer, “Artificial intelligence can persuade humans on political issues,” 2023
work page 2023
-
[55]
On the Horizon: Interactive and Compositional Deepfakes,
E. Horvitz, “On the Horizon: Interactive and Compositional Deepfakes,” in INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION , pp. 653–661, Nov. 2022
work page 2022
-
[56]
Deep Fakes: A Looming Challenge for Privacy, Democracy, and National Security,
R. Chesney and D. K. Citron, “Deep Fakes: A Looming Challenge for Privacy, Democracy, and National Security,” July 2018
work page 2018
-
[57]
U.S. Department of Commerce, “Dual use export licenses,” March 13 2023. accessed 2023-03-13
work page 2023
-
[58]
Arms control, disarmament and non-proliferation in nato,
NATO, “Arms control, disarmament and non-proliferation in nato,” February 27 2023. accessed 2023-02-27
work page 2023
-
[59]
Extracting Training Data from Large Language Models,
N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson, A. Oprea, and C. Raffel, “Extracting Training Data from Large Language Models,” June 2021
work page 2021
-
[60]
Quantifying Memo- rization Across Neural Language Models,
N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, and C. Zhang, “Quantifying Memo- rization Across Neural Language Models,” Mar. 2023
work page 2023
-
[61]
Predictability and Surprise in Large Generative Models,
D. Ganguli, D. Hernandez, L. Lovitt, N. DasSarma, T. Henighan, A. Jones, N. Joseph, J. Kernion, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, D. Drain, N. Elhage, S. E. Showk, S. Fort, Z. Hatfield-Dodds, S. Johnston, S. Kravec, N. Nanda, K. Ndousse, C. Olsson, D. Amodei, D. Amodei, T. Brown, J. Kaplan, S. McCandlish, C. Olah, and J. Clark, “Predictabili...
work page 2022
-
[62]
Emergent Abilities of Large Language Models,
J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus, “Emergent Abilities of Large Language Models,” Oct. 2022
work page 2022
-
[63]
The alignment problem from a deep learning perspec- tive,
R. Ngo, L. Chan, and S. Mindermann, “The alignment problem from a deep learning perspec- tive,” Feb. 2023
work page 2023
-
[64]
Bostrom, Superintelligence: Paths, Dangers, Strategies
N. Bostrom, Superintelligence: Paths, Dangers, Strategies . United Kingdom: Oxford University Press, Sept. 2014. 75
work page 2014
-
[65]
Harms from Increasingly Agentic Algorithmic Systems,
A. Chan, R. Salganik, A. Markelius, C. Pang, N. Rajkumar, D. Krasheninnikov, L. Langosco, Z. He, Y. Duan, M. Carroll, M. Lin, A. Mayhew, K. Collins, M. Molamohammadi, J. Burden, W. Zhao, S. Rismani, K. Voudouris, U. Bhatt, A. Weller, D. Krueger, and T. Maharaj, “Harms from Increasingly Agentic Algorithmic Systems,” Feb. 2023
work page 2023
-
[66]
Language Models as Agent Models,
J. Andreas, “Language Models as Agent Models,” Dec. 2022
work page 2022
-
[67]
Emergent Deception and Emergent Optimization
J. Steinhardt, “Emergent Deception and Emergent Optimization. ” https://bounded- regret.ghost.io/emergent-deception-optimization/, Feb. 2023
work page 2023
-
[68]
S. M. Omohundro, “The Basic AI Drives,” in Proceedings of the 2008 Conference on Artificial General Intelligence 2008 , (NLD), pp. 483–492, IOS Press, June 2008
work page 2008
-
[69]
The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents,
N. Bostrom, “The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents,” Minds and Machines , vol. 22, pp. 71–85, May 2012
work page 2012
-
[70]
Optimal Policies Tend to Seek Power,
A. M. Turner, L. Smith, R. Shah, A. Critch, and P. Tadepalli, “Optimal Policies Tend to Seek Power,” Jan. 2023
work page 2023
-
[71]
Parametrically Retargetable Decision-Makers Tend To Seek Power,
A. M. Turner and P. Tadepalli, “Parametrically Retargetable Decision-Makers Tend To Seek Power,” Oct. 2022
work page 2022
-
[72]
Power-seeking can be probable and predictive for trained agents,
V. Krakovna and janos, “Power-seeking can be probable and predictive for trained agents,” Mar. 2023
work page 2023
-
[73]
Russell, Human Compatible: Artificial Intelligence and the Problem of Control
S. Russell, Human Compatible: Artificial Intelligence and the Problem of Control . Cham: Springer International Publishing, 2022
work page 2022
-
[74]
Is Power-Seeking AI an Existential Risk?,
J. Carlsmith, “Is Power-Seeking AI an Existential Risk?,” June 2022
work page 2022
-
[75]
Update on arc’s recent eval efforts,
Alignment Research Center, “Update on arc’s recent eval efforts,” March 2023 2023. accessed 2023-03-17
work page 2023
-
[76]
E. Karpas, O. Abend, Y. Belinkov, B. Lenz, O. Lieber, N. Ratner, Y. Shoham, H. Bata, Y. Levine, K. Leyton-Brown, D. Muhlgay, N. Rozen, E. Schwartz, G. Shachaf, S. Shalev- Shwartz, A. Shashua, and M. Tenenholtz, “MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning,” May 2022
work page 2022
-
[77]
Toolformer: Language Models Can Teach Themselves to Use Tools,
T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language Models Can Teach Themselves to Use Tools,” Feb. 2023
work page 2023
-
[78]
Augmented Language Models: A Survey,
G. Mialon, R. Dessì, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Rozière, T. Schick, J. Dwivedi-Yu, A. Celikyilmaz, E. Grave, Y. LeCun, and T. Scialom, “Augmented Language Models: A Survey,” Feb. 2023
work page 2023
-
[79]
TALM: Tool Augmented Language Models,
A. Parisi, Y. Zhao, and N. Fiedel, “TALM: Tool Augmented Language Models,” May 2022
work page 2022
-
[80]
D. Weininger, “Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules,” Journal of chemical information and computer sciences , vol. 28, no. 1, pp. 31–36, 1988
work page 1988
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.