pith. machine review for the scientific record.

arxiv: 2303.08774 · v6 · submitted 2023-03-15 · 💻 cs.CL · cs.AI

Recognition: unknown

GPT-4 Technical Report

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida
Janko Altenschmidt Sam Altman Shyamal Anadkat Red Avila Igor Babuschkin Suchir Balaji Valerie Balcom Paul Baltescu Haiming Bao Mohammad Bavarian Jeff Belgum Irwan Bello Jake Berdine Gabriel Bernadett-Shapiro Christopher Berner Lenny Bogdonoff Oleg Boiko Madelaine Boyd Anna-Luisa Brakman Greg Brockman Tim Brooks Miles Brundage Kevin Button Trevor Cai Rosie Campbell Andrew Cann Brittany Carey Chelsea Carlson Rory Carmichael Brooke Chan Che Chang Fotis Chantzis Derek Chen Sully Chen Ruby Chen Jason Chen Mark Chen Ben Chess Chester Cho Casey Chu Hyung Won Chung Dave Cummings Jeremiah Currier Yunxing Dai Cory Decareaux Thomas Degry Noah Deutsch Damien Deville Arka Dhar David Dohan Steve Dowling Sheila Dunning Adrien Ecoffet Atty Eleti Tyna Eloundou David Farhi Liam Fedus Niko Felix Simón Posada Fishman Juston Forte Isabella Fulford Leo Gao Elie Georges Christian Gibson Vik Goel Tarun Gogineni Gabriel Goh Rapha Gontijo-Lopes Jonathan Gordon Morgan Grafstein Scott Gray Ryan Greene Joshua Gross Shixiang Shane Gu Yufei Guo Chris Hallacy Jesse Han Jeff Harris Yuchen He Mike Heaton Johannes Heidecke Chris Hesse Alan Hickey Wade Hickey Peter Hoeschele Brandon Houghton Kenny Hsu Shengli Hu Xin Hu Joost Huizinga Shantanu Jain Shawn Jain Joanne Jang Angela Jiang Roger Jiang Haozhun Jin Denny Jin Shino Jomoto Billie Jonn Heewoo Jun Tomer Kaftan Łukasz Kaiser Ali Kamali Ingmar Kanitscheider Nitish Shirish Keskar Tabarak Khan Logan Kilpatrick Jong Wook Kim Christina Kim Yongjik Kim Jan Hendrik Kirchner Jamie Kiros Matt Knight Daniel Kokotajlo Łukasz Kondraciuk Andrew Kondrich Aris Konstantinidis Kyle Kosic Gretchen Krueger Vishal Kuo Michael Lampe Ikai Lan Teddy Lee Jan Leike Jade Leung Daniel Levy Chak Ming Li Rachel Lim Molly Lin Stephanie Lin Mateusz Litwin Theresa Lopez Ryan Lowe Patricia Lue Anna Makanju Kim Malfacini Sam Manning Todor Markov Yaniv Markovski Bianca Martin Katie Mayer Andrew Mayne Bob McGrew Scott Mayer McKinney Christine McLeavey Paul McMillan Jake McNeil David Medina Aalok Mehta Jacob Menick Luke Metz Andrey Mishchenko Pamela Mishkin Vinnie Monaco Evan Morikawa Daniel Mossing Tong Mu Mira Murati Oleg Murk David Mély Ashvin Nair Reiichiro Nakano Rajeev Nayak Arvind Neelakantan Richard Ngo Hyeonwoo Noh Long Ouyang Cullen O'Keefe Jakub Pachocki Alex Paino Joe Palermo Ashley Pantuliano Giambattista Parascandolo Joel Parish Emy Parparita Alex Passos Mikhail Pavlov Andrew Peng Adam Perelman Filipe de Avila Belbute Peres Michael Petrov Henrique Ponde de Oliveira Pinto Michael (Rai) Pokorny Michelle Pokrass Vitchyr H. Pong Tolly Powell Alethea Power Boris Power Elizabeth Proehl Raul Puri Alec Radford Jack Rae Aditya Ramesh Cameron Raymond Francis Real Kendra Rimbach Carl Ross Bob Rotsted Henri Roussez Nick Ryder Mario Saltarelli Ted Sanders Shibani Santurkar Girish Sastry Heather Schmidt David Schnurr John Schulman Daniel Selsam Kyla Sheppard Toki Sherbakov Jessica Shieh Sarah Shoker Pranav Shyam Szymon Sidor Eric Sigler Maddie Simens Jordan Sitkin Katarina Slama Ian Sohl Benjamin Sokolowsky Yang Song Natalie Staudacher Felipe Petroski Such Natalie Summers Ilya Sutskever Jie Tang Nikolas Tezak Madeleine B. Thompson Phil Tillet Amin Tootoonchian Elizabeth Tseng Preston Tuggle Nick Turley Jerry Tworek Juan Felipe Cerón Uribe Andrea Vallone Arun Vijayvergiya Chelsea Voss Carroll Wainwright Justin Jay Wang Alvin Wang Ben Wang Jonathan Ward Jason Wei CJ Weinmann Akila Welihinda Peter Welinder Jiayi Weng Lilian Weng Matt Wiethoff Dave Willner Clemens Winter Samuel Wolrich Hannah Wong Lauren Workman Sherwin Wu Jeff Wu Michael Wu Kai Xiao Tao Xu Sarah Yoo Kevin Yu Qiming Yuan Wojciech Zaremba Rowan Zellers Chong Zhang Marvin Zhang Shengjia Zhao Tianhao Zheng Juntang Zhuang William Zhuk Barret Zoph
Authors on Pith: no claims yet
classification 💻 cs.CL cs.AI
keywords gpt-4 · performance · model · predict · report · text · academic · accept
read the original abstract

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
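The claim about predictable scaling amounts to fitting a scaling law on the results of much smaller training runs and extrapolating to the full compute budget. A minimal sketch of that idea in Python, assuming a simple power law L(C) = a·C^(-b) + c; the compute and loss values, the chosen functional form, and the extrapolation point are illustrative assumptions, not numbers from the report:

    # Hedged sketch: fit a power law L(C) = a * C**(-b) + c to final losses from
    # small training runs, then extrapolate to ~1,000x more compute.
    # All numbers are illustrative assumptions, not values from the GPT-4 report.
    import numpy as np
    from scipy.optimize import curve_fit

    def power_law(compute, a, b, c):
        return a * compute ** (-b) + c

    # Hypothetical (training compute in FLOPs, final loss) pairs from small runs.
    compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
    loss = np.array([3.10, 2.95, 2.80, 2.68, 2.57])

    # Fit the three power-law parameters and extrapolate well beyond the fitted range.
    params, _ = curve_fit(power_law, compute, loss, p0=[20.0, 0.05, 0.5], maxfev=20000)
    predicted = power_law(1e23, *params)  # ~1,000x the largest fitted run
    print(f"extrapolated loss at 1e23 FLOPs: {predicted:.3f}")

The report applies the same fit-small-then-extrapolate idea to metrics beyond loss (for example, pass rates on coding problems), but the mechanics are the same: fit on cheap runs, predict the full-scale run before it is trained.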

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ViMU: Benchmarking Video Metaphorical Understanding

    cs.CV 2026-05 unverdicted novelty 8.0

    ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

  2. CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

    cs.CL 2026-05 accept novelty 8.0

    CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...

  3. Pretraining Exposure Explains Popularity Judgments in Large Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.

  4. Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts

    cs.CV 2026-05 unverdicted novelty 8.0

    An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.

  5. Approximation Error Upper and Lower Bounds for Hölder Class with Transformers

    cs.LG 2026-05 unverdicted novelty 8.0

    A standard Transformer with O(ε^{-d0/α}) blocks can approximate any bounded d0-dimensional Hölder function of smoothness α to accuracy ε, but at least Ω(ε^{-d0/(4α)}) blocks are required.

  6. When and Why SignSGD Outperforms SGD: A Theoretical Study Based on ℓ1-norm Lower Bounds

    cs.LG 2026-05 unverdicted novelty 8.0

    SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

  7. LLM Translation of Compiler Intermediate Representation

    cs.PL 2026-05 unverdicted novelty 8.0

    IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.

  8. Nearly Optimal Attention Coresets

    cs.DS 2026-05 unverdicted novelty 8.0

    ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.

  9. Efficient Preference Poisoning Attack on Offline RLHF

    cs.LG 2026-05 unverdicted novelty 8.0

    Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.

  10. Characterizing the Expressivity of Local Attention in Transformers

    cs.CL 2026-05 unverdicted novelty 8.0

    Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressive...

  11. From Context to Skills: Can Language Models Learn from Context Skillfully?

    cs.AI 2026-04 unverdicted novelty 8.0

    Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.

  12. Revisable by Design: A Theory of Streaming LLM Agent Execution

    cs.LG 2026-04 unverdicted novelty 8.0

    LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less co...

  13. RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering

    cs.CL 2026-04 unverdicted novelty 8.0

    RespondeoQA is the first benchmark dataset for question answering and translation between Latin and English, with 7,800 pairs from pedagogical sources and initial LLM evaluations.

  14. MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

    cs.AI 2026-04 accept novelty 8.0

    MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmen...

  15. ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

    cs.CL 2026-04 unverdicted novelty 8.0

    ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

  16. VoxSafeBench: Not Just What Is Said, but Who, How, and Where

    cs.SD 2026-04 unverdicted novelty 8.0

    VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

  17. RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

    cs.RO 2026-04 unverdicted novelty 8.0

    RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.

  18. PhysInOne: Visual Physics Learning and Reasoning in One Suite

    cs.CV 2026-04 unverdicted novelty 8.0

    PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and m...

  19. HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

    cs.CV 2026-04 accept novelty 8.0

    HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

  20. Disentangling MLP Neuron Weights in Vocabulary Space

    cs.CL 2026-04 unverdicted novelty 8.0

    ROTATE disentangles MLP neurons into faithful vocabulary channels by optimizing weight rotations to maximize vocabulary-space kurtosis, outperforming activation-based baselines for neuron descriptions.

  21. ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos

    cs.CV 2026-04 unverdicted novelty 8.0

    ActivityForensics is the first large-scale benchmark for temporally localizing activity-level forgeries in videos, paired with a diffusion-based baseline called TADiff.

  22. FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations

    physics.chem-ph 2026-04 conditional novelty 8.0

    FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in...

  23. Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

    cs.CR 2026-04 unverdicted novelty 8.0

    DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.

  24. AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks

    cs.AI 2026-04 unverdicted novelty 8.0

    AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction parado...

  25. Adaptive Stopping for Multi-Turn LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 8.0

    MiCP is the first conformal prediction method for multi-turn LLM pipelines that allocates per-turn error budgets to enable adaptive stopping with an overall coverage guarantee, shown to reduce turns and cost on RAG an...

  26. When Does Content-Based Routing Work? Representation Requirements for Selective Attention in Hybrid Sequence Models

    cs.LG 2026-03 conditional novelty 8.0

    Content-based routing succeeds only when models provide bidirectional context and perform pairwise comparisons, with bidirectional Mamba plus rank-1 projection reaching 99.7% precision at linear inference cost.

  27. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  28. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  29. Learning to (Learn at Test Time): RNNs with Expressive Hidden States

    cs.LG 2024-07 conditional novelty 8.0

    TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.

  30. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    cs.AI 2024-04 accept novelty 8.0

    OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

  31. RULER: What's the Real Context Size of Your Long-Context Language Models?

    cs.CL 2024-04 accept novelty 8.0

    RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

  32. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  33. The Linear Representation Hypothesis and the Geometry of Large Language Models

    cs.CL 2023-11 conditional novelty 8.0

    Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.

  34. Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    cs.CL 2023-05 accept novelty 8.0

    Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.

  35. MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs

    cs.CR 2026-05 unverdicted novelty 7.0

    MetaBackdoor shows that LLMs can be backdoored using positional triggers like sequence length, enabling stealthy activation on clean inputs to leak system prompts or trigger malicious behavior.

  36. Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution

    cs.CV 2026-05 conditional novelty 7.0

    SIRA mitigates hallucinations in LVLMs by internally contrasting full visual access against a masked late-layer branch that retains shared context but lacks fine-grained visual evidence.

  37. GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolutio...

  38. What if Tomorrow is the World Cup Final? Counterfactual Time Series Forecasting with Textual Conditions

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces the task of counterfactual time series forecasting with textual conditions plus a text-attribution mechanism that improves accuracy by distinguishing mutable from immutable factors.

  39. Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

    cs.CL 2026-05 unverdicted novelty 7.0

    New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.

  40. BOOKMARKS: Efficient Active Storyline Memory for Role-playing

    cs.CL 2026-05 unverdicted novelty 7.0

    BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.

  41. GHGbench: A Unified Multi-Entity, Multi-Task Benchmark for Carbon Emission Prediction

    cs.LG 2026-05 unverdicted novelty 7.0

    GHGbench is a new multi-entity benchmark for company- and building-level carbon emission prediction that shows building tasks are harder, out-of-distribution gaps dominate, and multimodal data aids generalization.

  42. Sampling from Flow Language Models via Marginal-Conditioned Bridges

    cs.LG 2026-05 unverdicted novelty 7.0

    Marginal-conditioned bridges enable training-free sampling from Flow Language Models by drawing clean one-hot endpoints from factorized posteriors and using Ornstein-Uhlenbeck bridges, preserving token marginals and r...

  43. Decoupled and Divergence-Conditioned Prompt for Multi-domain Dynamic Graph Foundation Models

    cs.LG 2026-05 conditional novelty 7.0

    DyGFM introduces decoupled pre-training and divergence-conditioned prompts to create the first multi-domain dynamic graph foundation model that outperforms baselines on node classification and link prediction.

  44. Query-Conditioned Test-Time Self-Training for Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    QueST lets LLMs create query-conditioned problem-solution pairs at inference time and use them for parameter-efficient self-training, outperforming prior test-time baselines on math and science benchmarks.

  45. Query-Conditioned Test-Time Self-Training for Large Language Models

    cs.CL 2026-05 conditional novelty 7.0

    QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.

  46. A Hybrid Framework for Natural Language Querying of IFC Models with Relational and Graph Representations

    cs.CL 2026-05 unverdicted novelty 7.0

    IfcLLM combines relational and graph representations of IFC models with iterative LLM reasoning to deliver 93.3-100% first-attempt accuracy on natural language queries across three test models.

  47. STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition

    cs.CV 2026-05 conditional novelty 7.0

    STAR improves 1-shot action recognition by up to 8.1% on SSv2-Full through semantic-temporal alignment and Mamba-based prototype refinement.

  48. CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

    cs.CV 2026-05 conditional novelty 7.0

    LiteLVLM prunes visual tokens for pixel grounding by reversing CLIP visual-text similarity to retain referent region tokens, outperforming prior methods by over 5% with 22% speedup and 2.3x memory reduction without an...

  49. OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems

    cs.DB 2026-05 conditional novelty 7.0

    OxyEcomBench is a unified multimodal benchmark covering 6 capability areas and 29 tasks with authentic e-commerce data to measure how well foundation models handle real platform, merchant, and customer challenges.

  50. ImageAttributionBench: How Far Are We from Generalizable Attribution?

    cs.CV 2026-05 unverdicted novelty 7.0

    ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.

  51. Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

    cs.CV 2026-05 unverdicted novelty 7.0

    Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.

  52. The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge

    stat.ML 2026-05 unverdicted novelty 7.0

    In two-layer networks, weak-to-strong training elicits the target feature direction from pre-trained subspaces and preserves correlated off-target features, unlike standard fine-tuning.

  53. State-Centric Decision Process

    cs.AI 2026-05 unverdicted novelty 7.0

    SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.

  54. G²TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

    cs.CV 2026-05 unverdicted novelty 7.0

    G²TR reduces visual tokens and prefill computation by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency while preserving reasoning accuracy and editing quality.

  55. Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

  56. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 7.0

    TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.

  57. Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...

  58. CR^2: Cost-Aware Risk-Controlled Routing for Wireless Device-Edge LLM Inference

    cs.IT 2026-05 unverdicted novelty 7.0

    CR^2 matches full-information routing performance for device-edge LLM inference using only device-side signals and cuts normalized deployment cost by up to 16.9% at matched accuracy.

  59. Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters

    cs.CV 2026-05 unverdicted novelty 7.0

    Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.

  60. Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters

    cs.LG 2026-05 unverdicted novelty 7.0

    Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).

Reference graph

Works this paper leans on

128 extracted references · 127 canonical work pages · cited by 933 Pith papers

  1. [1]

    Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models,

    A. Tamkin, M. Brundage, J. Clark, and D. Ganguli, “Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models,” Feb. 2021

  2. [2]

    Introducing the new Bing

    “Introducing the new Bing. ” https://www.bing.com/new

  3. [3]

    WebGPT: Improving the factual accuracy of language models through web browsing

    J. Hilton, R. Nakano, S. Balaji, and J. Schulman, “WebGPT: Improving the factual accuracy of language models through web browsing. ” https://openai.com/research/webgpt, Dec. 2021

  4. [4]

    ACT-1: Transformer for Actions – Adept

    “ACT-1: Transformer for Actions – Adept. ” https://www.adept.ai/blog/act-1

  5. [5]

    Evaluating Large Language Models Trained on Code,

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Vo...

  6. [6]

    Ethical and social risks of harm from Language Models,

    L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, P.-S. Huang, M. Cheng, M. Glaese, B. Balle, A. Kasirzadeh, Z. Kenton, S. Brown, W. Hawkins, T. Stepleton, C. Biles, A. Birhane, J. Haas, L. Rimell, L. A. Hendricks, W. Isaac, S. Legassick, G. Irving, and I. Gabriel, “Ethical and social risks of harm from Language Models,” Dec. 2021

  7. [7]

    Release Strategies and the Social Impacts of Language Models,

    I. Solaiman, M. Brundage, J. Clark, A. Askell, A. Herbert-Voss, J. Wu, A. Radford, G. Krueger, J. W. Kim, S. Kreps, M. McCain, A. Newhouse, J. Blazakis, K. McGuffie, and J. Wang, “Release Strategies and the Social Impacts of Language Models,” Nov. 2019

  8. [8]

    Improving language understanding with unsupervised learning

    A. Radford, “Improving language understanding with unsupervised learning. ” https://openai.com/research/language-unsupervised, June 2018

  9. [9]

    Better language models and their implications

    A. Radford, J. Wu, D. Amodei, D. Amodei, J. Clark, M. Brundage, I. Sutskever, A. Askell, D. Lansky, D. Hernandez, and D. Luan, “Better language models and their implications. ” https://openai.com/research/better-language-models, Feb. 2019

  10. [10]

    Language Models are Few-Shot Learners,

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amod...

  11. [11]

    Planning for AGI and beyond

    S. Altman, “Planning for AGI and beyond. ” https://openai.com/blog/planning-for-agi-and-beyond, Feb. 2023

  12. [12]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” Mar. 2022

  13. [13]

    Deep reinforcement learning from human preferences,

    P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” Feb. 2023

  14. [14]

    Model Cards for Model Reporting,

    M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru, “Model Cards for Model Reporting,” in Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 220–229, Jan. 2019

  15. [15]

    System Cards, a new resource for understanding how AI systems work

    N. Green, C. Procope, A. Cheema, and A. Adediji, “System Cards, a new resource for understanding how AI systems work. ” https://ai.facebook.com/blog/system-cards-a-new-resource-for-understanding-how-ai-systems-work/, Feb. 2022

  16. [16]

    DALL·E 2 Preview - Risks and Limitations

    “DALL·E 2 Preview - Risks and Limitations. ” OpenAI, Apr. 2022

  17. [17]

    Differential Technology Development: A Responsible Innovation Principle for Navigating Technology Risks,

    J. Sandbrink, H. Hobbs, J. Swett, A. Dafoe, and A. Sandberg, “Differential Technology Development: A Responsible Innovation Principle for Navigating Technology Risks,” Sept. 2022

  18. [18]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback,

    Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan...

  19. [19]

    Discovering Language Model Behaviors with Model-Written Evaluations,

    E. Perez, S. Ringer, K. Lukošiūtė, K. Nguyen, E. Chen, S. Heiner, C. Pettit, C. Olsson, S. Kundu, S. Kadavath, A. Jones, A. Chen, B. Mann, B. Israel, B. Seethor, C. McKinnon, C. Olah, D. Yan, D. Amodei, D. Amodei, D. Drain, D. Li, E. Tran-Johnson, G. Khundadze, J. Kernion, J. Landis, J. Kerr, J. Mueller, J. Hyun, J. Landau, K. Ndousse, L. Goldberg, L....

  20. [20]

    B. P. Kehoe, Zen and the Art of the Internet. Project Gutenberg, June 1992

  21. [21]

    Lessons learned on language model safety and misuse

    M. Brundage, K. Mayer, T. Eloundou, S. Agarwal, S. Adler, G. Krueger, J. Leike, and P. Mishkin, “Lessons learned on language model safety and misuse. ” https://openai.com/research/language-model-safety-and-misuse, Mar. 2022

  22. [22]

    Language Models are Unsupervised Multitask Learners,

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” 2019

  23. [23]

    G. C. Bowker and S. L. Star, Sorting Things Out. MIT Press, Aug. 2000

  24. [24]

    Taxonomy of Risks posed by Language Models,

    L. Weidinger, J. Uesato, M. Rauh, C. Griffin, P.-S. Huang, J. Mellor, A. Glaese, M. Cheng, B. Balle, A. Kasirzadeh, C. Biles, S. Brown, Z. Kenton, W. Hawkins, T. Stepleton, A. Birhane, L. A. Hendricks, L. Rimell, W. Isaac, J. Haas, S. Legassick, G. Irving, and I. Gabriel, “Taxonomy of Risks posed by Language Models,” in 2022 ACM Conference on Fairness, Acco...

  25. [25]

    Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets,

    I. Solaiman and C. Dennison, “Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets,” Nov. 2021

  26. [26]

    Toward Comprehensive Risk Assessments and Assurance of AI-Based Systems,

    H. Khlaaf, “Toward Comprehensive Risk Assessments and Assurance of AI-Based Systems,” Trail of Bits, 2023

  27. [27]

    Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims,

    M. Brundage, S. Avin, J. Wang, H. Belfield, G. Krueger, G. Hadfield, H. Khlaaf, J. Yang, H. Toner, R. Fong, T. Maharaj, P. W. Koh, S. Hooker, J. Leung, A. Trask, E. Bluemke, J. Lebensold, C. O’Keefe, M. Koren, T. Ryffel, J. B. Rubinovitz, T. Besiroglu, F. Carugati, J. Clark, P. Eckersley, S. de Haas, M. Johnson, B. Laurie, A. Ingerman, I. Krawczuk, A. Askel...

  28. [28]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned,

    D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Conerly, N. DasSarma, D. Drain, N. Elhage, S. El-Showk, S. Fort, Z. Hatfield-Dodds, T. Henighan, D. Hernandez, T. Hume, J. Jacobson, S. Johnston, S. Kravec, C. Olsson, S. Ringer, E. Tran-Johnson, D. Amodei, T. Brow...

  29. [29]

    Red Teaming Language Models with Language Models,

    E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving, “Red Teaming Language Models with Language Models,” Feb. 2022

  30. [30]

    A Hazard Analysis Framework for Code Synthesis Large Language Models,

    H. Khlaaf, P. Mishkin, J. Achiam, G. Krueger, and M. Brundage, “A Hazard Analysis Framework for Code Synthesis Large Language Models,” July 2022

  31. [31]

    On Faithfulness and Factuality in Abstractive Summarization,

    J. Maynez, S. Narayan, B. Bohnet, and R. McDonald, “On Faithfulness and Factuality in Abstractive Summarization,” May 2020

  32. [32]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods,

    S. Lin, J. Hilton, and O. Evans, “TruthfulQA: Measuring How Models Mimic Human Falsehoods,” May 2022

  33. [33]

    Forecasting potential misuses of language models for disinformation campaigns and how to reduce risk

    J. A. Goldstein, G. Sastry, M. Musser, R. DiResta, M. Gentzel, and K. Sedova, “Forecasting potential misuses of language models for disinformation campaigns and how to reduce risk. ” https://openai.com/research/forecasting-misuse, Jan. 2023

  34. [34]

    Truthful AI: Developing and governing AI that does not lie,

    O. Evans, O. Cotton-Barratt, L. Finnveden, A. Bales, A. Balwit, P. Wills, L. Righetti, and W. Saunders, “Truthful AI: Developing and governing AI that does not lie,” Oct. 2021

  35. [35]

    Detoxifying Language Models Risks Marginalizing Minority Voices,

    A. Xu, E. Pathak, E. Wallace, S. Gururangan, M. Sap, and D. Klein, “Detoxifying Language Models Risks Marginalizing Minority Voices,” Apr. 2021

  36. [36]

    Measuring and Mitigating Unintended Bias in Text Classification,

    L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman, “Measuring and Mitigating Unintended Bias in Text Classification,” in Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’18, (New York, NY, USA), pp. 67–73, Association for Computing Machinery, Dec. 2018

  37. [37]

    A Holistic Approach to Undesired Content Detection in the Real World,

    T. Markov, C. Zhang, S. Agarwal, T. Eloundou, T. Lee, S. Adler, A. Jiang, and L. Weng, “A Holistic Approach to Undesired Content Detection in the Real World,” Feb. 2023

  38. [38]

    How should AI systems behave, and who should decide?

    OpenAI, “How should AI systems behave, and who should decide? ” https://openai.com/blog/how-should-ai-systems-behave, Feb. 2023

  39. [39]

    Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models,

    M. Rauh, J. Mellor, J. Uesato, P.-S. Huang, J. Welbl, L. Weidinger, S. Dathathri, A. Glaese, G. Irving, I. Gabriel, W. Isaac, and L. A. Hendricks, “Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models,” Oct. 2022

  40. [40]

    Language (Technology) is Power: A Critical Survey of "Bias" in NLP

    S. L. Blodgett, S. Barocas, H. Daumé III, and H. Wallach, “Language (Technology) is Power: A Critical Survey of "Bias" in NLP. ” https://arxiv.org/abs/2005.14050v2, May 2020

  41. [41]

    On Measures of Biases and Harms in NLP,

    S. Dev, E. Sheng, J. Zhao, A. Amstutz, J. Sun, Y. Hou, M. Sanseverino, J. Kim, A. Nishi, N. Peng, and K.-W. Chang, “On Measures of Biases and Harms in NLP,” in Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022, (Online only), pp. 246–267, Association for Computational Linguistics, Nov. 2022

  42. [42]

    Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings,

    T. Bolukbasi, K.-W. Chang, J. Zou, V. Saligrama, and A. Kalai, “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings,” July 2016

  43. [43]

    Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them,

    H. Gonen and Y. Goldberg, “Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), (Minneapolis, Minnesota), pp. 609...

  44. [44]

    Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns,

    K. Webster, M. Recasens, V. Axelrod, and J. Baldridge, “Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns,” Oct. 2018

  45. [45]

    On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?,

    E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell, “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?,” in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, (Virtual Event Canada), pp. 610–623, ACM, Mar. 2021

  46. [46]

    On the Opportunities and Risks of Foundation Models,

    R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. Go...

  47. [47]

    S. U. Noble, Algorithms of Oppression. NYU Press, Feb. 2018

  48. [48]

    Dirty Data, Bad Predictions: How Civil Rights Violations Impact Police Data, Predictive Policing Systems, and Justice,

    R. Richardson, J. Schultz, and K. Crawford, “Dirty Data, Bad Predictions: How Civil Rights Violations Impact Police Data, Predictive Policing Systems, and Justice,” Feb. 2019

  49. [49]

    What We Owe The Future

    W. MacAskill, What We Owe The Future. Basic Books, Aug. 2022

  50. [50]

    GPT-2: 1.5B release

    OpenAI, “GPT-2: 1.5B release. ” https://openai.com/research/gpt-2-1-5b-release, Nov. 2019

  51. [51]

    All the News That’s Fit to Fabricate: AI-Generated Text as a Tool of Media Misinformation,

    S. Kreps, R. M. McCain, and M. Brundage, “All the News That’s Fit to Fabricate: AI-Generated Text as a Tool of Media Misinformation,” Journal of Experimental Political Science, vol. 9, no. 1, pp. 104–117, 2022

  52. [52]

    Truth, Lies, and Automation,

    B. Buchanan, A. Lohn, M. Musser, and K. Sedova, “Truth, Lies, and Automation,” tech. rep., Center for Security and Emerging Technology, May 2021

  53. [53]

    AI’s Powers of Political Persuasion

    A. Myers, “AI’s Powers of Political Persuasion. ” https://hai.stanford.edu/news/ais-powers-political-persuasion, Feb. 2023

  54. [54]

    Artificial intelligence can persuade humans on political issues,

    H. Bai, J. Voelkel, J. Eichstaedt, and R. Willer, “Artificial intelligence can persuade humans on political issues,” 2023

  55. [55]

    On the Horizon: Interactive and Compositional Deepfakes,

    E. Horvitz, “On the Horizon: Interactive and Compositional Deepfakes,” in International Conference on Multimodal Interaction, pp. 653–661, Nov. 2022

  56. [56]

    Deep Fakes: A Looming Challenge for Privacy, Democracy, and National Security,

    R. Chesney and D. K. Citron, “Deep Fakes: A Looming Challenge for Privacy, Democracy, and National Security,” July 2018

  57. [57]

    Dual use export licenses,

    U.S. Department of Commerce, “Dual use export licenses,” March 13 2023. accessed 2023-03-13

  58. [58]

    Arms control, disarmament and non-proliferation in nato,

    NATO, “Arms control, disarmament and non-proliferation in nato,” February 27 2023. accessed 2023-02-27

  59. [59]

    Extracting Training Data from Large Language Models,

    N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson, A. Oprea, and C. Raffel, “Extracting Training Data from Large Language Models,” June 2021

  60. [60]

    Quantifying Memorization Across Neural Language Models,

    N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, and C. Zhang, “Quantifying Memorization Across Neural Language Models,” Mar. 2023

  61. [61]

    Predictability and Surprise in Large Generative Models,

    D. Ganguli, D. Hernandez, L. Lovitt, N. DasSarma, T. Henighan, A. Jones, N. Joseph, J. Kernion, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, D. Drain, N. Elhage, S. E. Showk, S. Fort, Z. Hatfield-Dodds, S. Johnston, S. Kravec, N. Nanda, K. Ndousse, C. Olsson, D. Amodei, D. Amodei, T. Brown, J. Kaplan, S. McCandlish, C. Olah, and J. Clark, “Predictabili...

  62. [62]

    Emergent Abilities of Large Language Models,

    J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus, “Emergent Abilities of Large Language Models,” Oct. 2022

  63. [63]

    The alignment problem from a deep learning perspective,

    R. Ngo, L. Chan, and S. Mindermann, “The alignment problem from a deep learning perspective,” Feb. 2023

  64. [64]

    Superintelligence: Paths, Dangers, Strategies

    N. Bostrom, Superintelligence: Paths, Dangers, Strategies. United Kingdom: Oxford University Press, Sept. 2014

  65. [65]

    Harms from Increasingly Agentic Algorithmic Systems,

    A. Chan, R. Salganik, A. Markelius, C. Pang, N. Rajkumar, D. Krasheninnikov, L. Langosco, Z. He, Y. Duan, M. Carroll, M. Lin, A. Mayhew, K. Collins, M. Molamohammadi, J. Burden, W. Zhao, S. Rismani, K. Voudouris, U. Bhatt, A. Weller, D. Krueger, and T. Maharaj, “Harms from Increasingly Agentic Algorithmic Systems,” Feb. 2023

  66. [66]

    Language Models as Agent Models,

    J. Andreas, “Language Models as Agent Models,” Dec. 2022

  67. [67]

    Emergent Deception and Emergent Optimization

    J. Steinhardt, “Emergent Deception and Emergent Optimization. ” https://bounded-regret.ghost.io/emergent-deception-optimization/, Feb. 2023

  68. [68]

    The Basic AI Drives,

    S. M. Omohundro, “The Basic AI Drives,” in Proceedings of the 2008 Conference on Artificial General Intelligence 2008, (NLD), pp. 483–492, IOS Press, June 2008

  69. [69]

    The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents,

    N. Bostrom, “The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents,” Minds and Machines, vol. 22, pp. 71–85, May 2012

  70. [70]

    Optimal Policies Tend to Seek Power,

    A. M. Turner, L. Smith, R. Shah, A. Critch, and P. Tadepalli, “Optimal Policies Tend to Seek Power,” Jan. 2023

  71. [71]

    Parametrically Retargetable Decision-Makers Tend To Seek Power,

    A. M. Turner and P. Tadepalli, “Parametrically Retargetable Decision-Makers Tend To Seek Power,” Oct. 2022

  72. [72]

    Power-seeking can be probable and predictive for trained agents,

    V. Krakovna and janos, “Power-seeking can be probable and predictive for trained agents,” Mar. 2023

  73. [73]

    Human Compatible: Artificial Intelligence and the Problem of Control

    S. Russell, Human Compatible: Artificial Intelligence and the Problem of Control. Cham: Springer International Publishing, 2022

  74. [74]

    Is Power-Seeking AI an Existential Risk?,

    J. Carlsmith, “Is Power-Seeking AI an Existential Risk?,” June 2022

  75. [75]

    Update on arc’s recent eval efforts,

    Alignment Research Center, “Update on arc’s recent eval efforts,” March 2023. accessed 2023-03-17

  76. [76]

    MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning,

    E. Karpas, O. Abend, Y. Belinkov, B. Lenz, O. Lieber, N. Ratner, Y. Shoham, H. Bata, Y. Levine, K. Leyton-Brown, D. Muhlgay, N. Rozen, E. Schwartz, G. Shachaf, S. Shalev-Shwartz, A. Shashua, and M. Tenenholtz, “MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning,” May 2022

  77. [77]

    Toolformer: Language Models Can Teach Themselves to Use Tools,

    T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language Models Can Teach Themselves to Use Tools,” Feb. 2023

  78. [78]

    Augmented Language Models: A Survey,

    G. Mialon, R. Dessì, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Rozière, T. Schick, J. Dwivedi-Yu, A. Celikyilmaz, E. Grave, Y. LeCun, and T. Scialom, “Augmented Language Models: A Survey,” Feb. 2023

  79. [79]

    TALM: Tool Augmented Language Models,

    A. Parisi, Y. Zhao, and N. Fiedel, “TALM: Tool Augmented Language Models,” May 2022

  80. [80]

    Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules,

    D. Weininger, “Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules,” Journal of chemical information and computer sciences, vol. 28, no. 1, pp. 31–36, 1988

Showing first 80 references.