GPT-4 Technical Report

Aalok Mehta, Adam Perelman, Aditya Ramesh, Adrien Ecoffet, Akila Welihinda, Alan Hickey, Alec Radford, Alethea Power, Alex Paino, Alex Passos, Ali Kamali, Alvin Wang, Amin Tootoonchian, Andrea Vallone, Andrew Cann, Andrew Kondrich, Andrew Mayne, Andrew Peng, Andrey Mishchenko, Angela Jiang, Anna-Luisa Brakman, Anna Makanju, Aris Konstantinidis, Arka Dhar, Arun Vijayvergiya, Arvind Neelakantan, Ashley Pantuliano, Ashvin Nair, Atty Eleti, Barret Zoph, Ben Chess, Benjamin Sokolowsky, Ben Wang, Bianca Martin, Billie Jonn, Bob McGrew, Bob Rotsted, Boris Power, Brandon Houghton, Brittany Carey, Brooke Chan, Cameron Raymond, Carl Ross, Carroll Wainwright, Casey Chu, Chak Ming Li, Che Chang, Chelsea Carlson, Chelsea Voss, Chester Cho, Chong Zhang, Chris Hallacy, Chris Hesse, Christian Gibson, Christina Kim, Christine McLeavey, Christopher Berner, CJ Weinmann, Clemens Winter, Cory Decareaux, Cullen O'Keefe, Damien Deville, Daniel Kokotajlo, Daniel Levy, Daniel Mossing, Daniel Selsam, Dave Cummings, Dave Willner, David Dohan, David Farhi, David Medina, David Mély, David Schnurr, Denny Jin, Derek Chen, Diogo Almeida, Elie Georges, Elizabeth Proehl, Elizabeth Tseng, Emy Parparita, Eric Sigler, Evan Morikawa, Felipe Petroski Such, Filipe de Avila Belbute Peres, Florencia Leoni Aleman, Fotis Chantzis, Francis Real, Gabriel Bernadett-Shapiro, Gabriel Goh, Giambattista Parascandolo, Girish Sastry, Greg Brockman, Gretchen Krueger, Haiming Bao, Hannah Wong, Haozhun Jin, Heather Schmidt, Heewoo Jun, Henrique Ponde de Oliveira Pinto, Henri Roussez, Hyeonwoo Noh, Hyung Won Chung, Ian Sohl, Igor Babuschkin, Ikai Lan, Ilge Akkaya, Ilya Sutskever, Ingmar Kanitscheider, Irwan Bello, Isabella Fulford, Jack Rae, Jacob Menick, Jade Leung, Jake Berdine, Jake McNeil, Jakub Pachocki, Jamie Kiros, Jan Hendrik Kirchner, Janko Altenschmidt, Jan Leike, Jason Chen, Jason Wei, Jeff Belgum, Jeff Harris, Jeff Wu, Jeremiah Currier, Jerry Tworek, Jesse Han, Jessica Shieh, Jiayi Weng, Jie Tang, Joanne Jang, Joel Parish, Joe Palermo, Johannes Heidecke, John Schulman, Jonathan Gordon, Jonathan Ward, Jong Wook Kim, Joost Huizinga, Jordan Sitkin, Josh Achiam, Joshua Gross, Juan Felipe Cerón Uribe, Juntang Zhuang, Justin Jay Wang, Juston Forte, Kai Xiao, Katarina Slama, Katie Mayer, Kendra Rimbach, Kenny Hsu, Kevin Button, Kevin Yu, Kim Malfacini, Kyla Sheppard, Kyle Kosic, Lama Ahmad, Lauren Workman, Lenny Bogdonoff, Leo Gao, Liam Fedus, Lilian Weng, Logan Kilpatrick, Long Ouyang, Łukasz Kaiser, Łukasz Kondraciuk, Luke Metz, Maddie Simens, Madelaine Boyd, Madeleine B. Thompson, Mario Saltarelli, Mark Chen, Marvin Zhang, Mateusz Litwin, Matt Knight, Matt Wiethoff, Michael Lampe, Michael Petrov, Michael (Rai) Pokorny, Michael Wu, Michelle Pokrass, Mike Heaton, Mikhail Pavlov, Miles Brundage, Mira Murati, Mohammad Bavarian, Molly Lin, Morgan Grafstein, Natalie Staudacher, Natalie Summers, Nick Ryder, Nick Turley, Niko Felix, Nikolas Tezak, Nitish Shirish Keskar, Noah Deutsch, Oleg Boiko, Oleg Murk, OpenAI, Pamela Mishkin, Patricia Lue, Paul Baltescu, Paul McMillan, Peter Hoeschele, Peter Welinder, Phil Tillet, Pranav Shyam, Preston Tuggle, Qiming Yuan, Rachel Lim, Rajeev Nayak, Rapha Gontijo-Lopes, Raul Puri, Red Avila, Reiichiro Nakano, Richard Ngo, Roger Jiang, Rory Carmichael, Rosie Campbell, Rowan Zellers, Ruby Chen, Ryan Greene, Ryan Lowe, Sam Altman, Sam Manning, Samuel Wolrich, Sandhini Agarwal, Sarah Shoker, Sarah Yoo, Scott Gray, Scott Mayer McKinney, Shantanu Jain, Shawn Jain, Sheila Dunning, Shengjia Zhao, Shengli Hu, Sherwin Wu, Shibani Santurkar, Shino Jomoto, Shixiang Shane Gu, Shyamal Anadkat, Simón Posada Fishman, Stephanie Lin, Steve Dowling, Steven Adler, Suchir Balaji, Sully Chen, Szymon Sidor, Tabarak Khan, Tao Xu, Tarun Gogineni, Teddy Lee, Ted Sanders, Theresa Lopez, Thomas Degry, Tianhao Zheng, Tim Brooks, Todor Markov, Toki Sherbakov, Tolly Powell, Tomer Kaftan, Tong Mu, Trevor Cai, Tyna Eloundou, Valerie Balcom, Vik Goel, Vinnie Monaco, Vishal Kuo, Vitchyr H. Pong, Wade Hickey, William Zhuk, Wojciech Zaremba, Xin Hu, Yang Song, Yaniv Markovski, Yongjik Kim, Yuchen He, Yufei Guo, Yunxing Dai


classification cs.CL cs.AI

keywords gpt-4, performance, model, predict, report, text, academic, accept



We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.

This paper has not been read by Pith yet.


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
cs.CL 2026-05 accept novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...
Pretraining Exposure Explains Popularity Judgments in Large Language Models
cs.CL 2026-05 unverdicted novelty 8.0

LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.
Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts
cs.CV 2026-05 unverdicted novelty 8.0

An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.
Approximation Error Upper and Lower Bounds for Hölder Class with Transformers
cs.LG 2026-05 unverdicted novelty 8.0

A standard Transformer with O(ε^{-d0/α}) blocks can approximate any bounded d0-dimensional Hölder function of smoothness α to accuracy ε, but at least Ω(ε^{-d0/(4α)}) blocks are required.
When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
cs.LG 2026-05 unverdicted novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
LLM Translation of Compiler Intermediate Representation
cs.PL 2026-05 unverdicted novelty 8.0

IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.
Nearly Optimal Attention Coresets
cs.DS 2026-05 unverdicted novelty 8.0

ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.
Efficient Preference Poisoning Attack on Offline RLHF
cs.LG 2026-05 unverdicted novelty 8.0

Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.
Characterizing the Expressivity of Local Attention in Transformers
cs.CL 2026-05 unverdicted novelty 8.0

Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressive...
From Context to Skills: Can Language Models Learn from Context Skillfully?
cs.AI 2026-04 unverdicted novelty 8.0

Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.
Revisable by Design: A Theory of Streaming LLM Agent Execution
cs.LG 2026-04 unverdicted novelty 8.0

LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less co...
RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering
cs.CL 2026-04 unverdicted novelty 8.0

RespondeoQA is the first benchmark dataset for question answering and translation between Latin and English, with 7,800 pairs from pedagogical sources and initial LLM evaluations.
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
cs.AI 2026-04 accept novelty 8.0

MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmen...
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
cs.CL 2026-04 unverdicted novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
cs.SD 2026-04 unverdicted novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
PhysInOne: Visual Physics Learning and Reasoning in One Suite
cs.CV 2026-04 unverdicted novelty 8.0

PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and m...
HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
cs.CV 2026-04 accept novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
Disentangling MLP Neuron Weights in Vocabulary Space
cs.CL 2026-04 unverdicted novelty 8.0

ROTATE disentangles MLP neurons into faithful vocabulary channels by optimizing weight rotations to maximize vocabulary-space kurtosis, outperforming activation-based baselines for neuron descriptions.
ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos
cs.CV 2026-04 unverdicted novelty 8.0

ActivityForensics is the first large-scale benchmark for temporally localizing activity-level forgeries in videos, paired with a diffusion-based baseline called TADiff.
FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations
physics.chem-ph 2026-04 conditional novelty 8.0

FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in...
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems
cs.CR 2026-04 unverdicted novelty 8.0

DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.
AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks
cs.AI 2026-04 unverdicted novelty 8.0

AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction parado...
Adaptive Stopping for Multi-Turn LLM Reasoning
cs.CL 2026-04 unverdicted novelty 8.0

MiCP is the first conformal prediction method for multi-turn LLM pipelines that allocates per-turn error budgets to enable adaptive stopping with an overall coverage guarantee, shown to reduce turns and cost on RAG an...
Large Language Diffusion Models
cs.CL 2025-02 unverdicted novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
cs.AI 2024-04 accept novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
RULER: What's the Real Context Size of Your Long-Context Language Models?
cs.CL 2024-04 accept novelty 8.0

RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
The Linear Representation Hypothesis and the Geometry of Large Language Models
cs.CL 2023-11 conditional novelty 8.0

Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
cs.CL 2023-05 accept novelty 8.0

Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.
GHGbench: A Unified Multi-Entity, Multi-Task Benchmark for Carbon Emission Prediction
cs.LG 2026-05 unverdicted novelty 7.0

GHGbench is a new multi-entity benchmark for company- and building-level carbon emission prediction that shows building tasks are harder, out-of-distribution gaps dominate, and multimodal data aids generalization.
Sampling from Flow Language Models via Marginal-Conditioned Bridges
cs.LG 2026-05 unverdicted novelty 7.0

Marginal-conditioned bridges enable training-free sampling from Flow Language Models by drawing clean one-hot endpoints from factorized posteriors and using Ornstein-Uhlenbeck bridges, preserving token marginals and r...
Decoupled and Divergence-Conditioned Prompt for Multi-domain Dynamic Graph Foundation Models
cs.LG 2026-05 conditional novelty 7.0

DyGFM introduces decoupled pre-training and divergence-conditioned prompts to create the first multi-domain dynamic graph foundation model that outperforms baselines on node classification and link prediction.
Query-Conditioned Test-Time Self-Training for Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0

QueST lets LLMs create query-conditioned problem-solution pairs at inference time and use them for parameter-efficient self-training, outperforming prior test-time baselines on math and science benchmarks.
A Hybrid Framework for Natural Language Querying of IFC Models with Relational and Graph Representations
cs.CL 2026-05 unverdicted novelty 7.0

IfcLLM combines relational and graph representations of IFC models with iterative LLM reasoning to deliver 93.3-100% first-attempt accuracy on natural language queries across three test models.
STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition
cs.CV 2026-05 conditional novelty 7.0

STAR improves 1-shot action recognition by up to 8.1% on SSv2-Full through semantic-temporal alignment and Mamba-based prototype refinement.
CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models
cs.CV 2026-05 conditional novelty 7.0

LiteLVLM prunes visual tokens for pixel grounding by reversing CLIP visual-text similarity to retain referent region tokens, outperforming prior methods by over 5% with 22% speedup and 2.3x memory reduction without an...
OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems
cs.DB 2026-05 conditional novelty 7.0

OxyEcomBench is a unified multimodal benchmark covering 6 capability areas and 29 tasks with authentic e-commerce data to measure how well foundation models handle real platform, merchant, and customer challenges.
ImageAttributionBench: How Far Are We from Generalizable Attribution?
cs.CV 2026-05 unverdicted novelty 7.0

ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
cs.CV 2026-05 unverdicted novelty 7.0

Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge
stat.ML 2026-05 unverdicted novelty 7.0

In two-layer networks, weak-to-strong training elicits the target feature direction from pre-trained subspaces and preserves correlated off-target features, unlike standard fine-tuning.
State-Centric Decision Process
cs.AI 2026-05 unverdicted novelty 7.0

SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.
G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models
cs.CV 2026-05 unverdicted novelty 7.0

G²TR reduces visual tokens and prefill computation by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency while preserving reasoning accuracy and editing quality.
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
cs.CV 2026-05 unverdicted novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
cs.CV 2026-05 unverdicted novelty 7.0

Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...
CR^2: Cost-Aware Risk-Controlled Routing for Wireless Device-Edge LLM Inference
cs.IT 2026-05 unverdicted novelty 7.0

CR^2 matches full-information routing performance for device-edge LLM inference using only device-side signals and cuts normalized deployment cost by up to 16.9% at matched accuracy.
Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters
cs.CV 2026-05 unverdicted novelty 7.0

Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.
Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
cs.LG 2026-05 unverdicted novelty 7.0

Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
Online Continual Learning with Dynamic Label Hierarchies
cs.LG 2026-05 unverdicted novelty 7.0

HALO improves online continual learning under evolving label hierarchies by adaptively combining classification heads regularized with organized learnable prototypes for better adaptation and reduced forgetting.
SoK: Unlearnability and Unlearning for Model Dememorization
cs.LG 2026-05 conditional novelty 7.0

The first integrated taxonomy, empirical study of interplay and shallow dememorization, plus a theoretical guarantee on dememorization depth for certified unlearning.
PRISM: Planning and Reasoning with Intent in Simulated Embodied Environments
cs.RO 2026-05 unverdicted novelty 7.0

PRISM is a tiered benchmark with 300 human-verified tasks across five photorealistic apartments that diagnoses embodied agent failures in basic ability, reasoning ability, and long-horizon ability using an agent-agnostic API.
Kairos: A Scalable Serving System for Physical AI
cs.RO 2026-05 unverdicted novelty 7.0

Kairos is the first multi-robot serving system that treats the generate-execute loop as a first-class citizen and reduces average task latency by 31.8-66.5% versus digital AI serving systems.
Neural Statistical Functions
cs.LG 2026-05 unverdicted novelty 7.0

Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.
EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales
cs.AI 2026-05 unverdicted novelty 7.0

EVOCHAMBER enables test-time co-evolution of multi-agent systems across three scales, producing emergent niche specialists and performance gains of up to 32% relative on math tasks with Qwen3-8B.
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
cs.CY 2026-05 accept novelty 7.0

StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
cs.CY 2026-05 unverdicted novelty 7.0

StereoTales shows that LLMs produce harmful, culturally adapted stereotypes in open-ended multilingual stories, with patterns consistent across providers and aligned human-LLM harm judgments.
SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation
cs.CV 2026-05 unverdicted novelty 7.0

SciVQR is a new benchmark dataset for evaluating multimodal AI models on complex scientific reasoning tasks across six disciplines, including expert solutions for nearly half the items.
V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.
PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
cs.CL 2026-05 unverdicted novelty 7.0

PlantMarkerBench is a new multi-species benchmark with 5,550 evidence instances for evaluating language models on literature-grounded plant marker gene reasoning across expression, localization, function, indirect, an...
PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
cs.CL 2026-05 unverdicted novelty 7.0

PlantMarkerBench supplies 5,550 literature sentences annotated for plant marker gene evidence validity and type across Arabidopsis, maize, rice and tomato, showing frontier LLMs handle direct expression evidence but s...
GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation
cs.SI 2026-05 unverdicted novelty 7.0

GraphInstruct is a progressive benchmark with six complexity levels for LLM graph generation that identifies multi-constraint composition as the hardest point and shows a verification-guided iterative framework outper...

Reference graph

Works this paper leans on

128 extracted references · 127 canonical work pages · cited by 874 Pith papers

[1]

Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models,

A. Tamkin, M. Brundage, J. Clark, and D. Ganguli, “Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models,” Feb. 2021

work page 2021
[2]

Introducing the new Bing

“Introducing the new Bing.” https://www.bing.com/new

work page
[3]

WebGPT: Improving the factual accuracy of language models through web browsing

J. Hilton, R. Nakano, S. Balaji, and J. Schulman, “WebGPT: Improving the factual accuracy of language models through web browsing.” https://openai.com/research/webgpt, Dec. 2021

work page 2021
[4]

ACT-1: Transformer for Actions – Adept

“ACT-1: Transformer for Actions – Adept.” https://www.adept.ai/blog/act-1

work page
[5]

Evaluating Large Language Models Trained on Code,

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Vo...

work page 2021
[6]

Ethical and social risks of harm from Language Models,

L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, P.-S. Huang, M. Cheng, M. Glaese, B. Balle, A. Kasirzadeh, Z. Kenton, S. Brown, W. Hawkins, T. Stepleton, C. Biles, A. Birhane, J. Haas, L. Rimell, L. A. Hendricks, W. Isaac, S. Legassick, G. Irving, and I. Gabriel, “Ethical and social risks of harm from Language Models,” Dec. 2021

work page 2021
[7]

Release Strategies and the Social Impacts of Language Models,

I. Solaiman, M. Brundage, J. Clark, A. Askell, A. Herbert-Voss, J. Wu, A. Radford, G. Krueger, J. W. Kim, S. Kreps, M. McCain, A. Newhouse, J. Blazakis, K. McGuffie, and J. Wang, “Release Strategies and the Social Impacts of Language Models,” Nov. 2019

work page 2019
[8]

Improving language understanding with unsupervised learning

A. Radford, “Improving language understanding with unsupervised learning.” https://openai.com/research/language-unsupervised, June 2018

work page 2018
[9]

Better language models and their implications

A. Radford, J. Wu, D. Amodei, D. Amodei, J. Clark, M. Brundage, I. Sutskever, A. Askell, D. Lansky, D. Hernandez, and D. Luan, “Better language models and their implications.” https://openai.com/research/better-language-models, Feb. 2019

work page 2019
[10]

Language Models are Few-Shot Learners,

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amod...

work page 2020
[11]

Planning for AGI and beyond

S. Altman, “Planning for AGI and beyond.” https://openai.com/blog/planning-for-agi-and-beyond, Feb. 2023

work page 2023
[12]

Training language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” Mar. 2022

work page 2022
[13]

Deep reinforcement learning from human preferences,

P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” Feb. 2023

work page 2023
[14]

Model Cards for Model Reporting,

M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru, “Model Cards for Model Reporting,” in Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 220–229, Jan. 2019

work page 2019
[15]

System Cards, a new resource for understanding how AI systems work

N. Green, C. Procope, A. Cheema, and A. Adediji, “System Cards, a new resource for understanding how AI systems work.” https://ai.facebook.com/blog/system-cards-a-new-resource-for-understanding-how-ai-systems-work/, Feb. 2022

work page 2022
[16]

DALL·E 2 Preview - Risks and Limitations

“DALL·E 2 Preview - Risks and Limitations.” OpenAI, Apr. 2022

work page 2022
[17]

Differential Technology Development: A Responsible Innovation Principle for Navigating Technology Risks,

J. Sandbrink, H. Hobbs, J. Swett, A. Dafoe, and A. Sandberg, “Differential Technology Development: A Responsible Innovation Principle for Navigating Technology Risks,” Sept. 2022

work page 2022
[18]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback,

Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan...

work page 2022
[19]

Discovering Language Model Behaviors with Model-Written Evaluations,

E. Perez, S. Ringer, K. Lukošiūtė, K. Nguyen, E. Chen, S. Heiner, C. Pettit, C. Olsson, S. Kundu, S. Kadavath, A. Jones, A. Chen, B. Mann, B. Israel, B. Seethor, C. McKinnon, C. Olah, D. Yan, D. Amodei, D. Amodei, D. Drain, D. Li, E. Tran-Johnson, G. Khundadze, J. Kernion, J. Landis, J. Kerr, J. Mueller, J. Hyun, J. Landau, K. Ndousse, L. Goldberg, L....

work page 2022
[20]

B. P. Kehoe, Zen and the Art of the Internet. Project Gutenberg, June 1992

work page 1992
[21]

Lessons learned on language model safety and misuse

M. Brundage, K. Mayer, T. Eloundou, S. Agarwal, S. Adler, G. Krueger, J. Leike, and P. Mishkin, “Lessons learned on language model safety and misuse.” https://openai.com/research/language-model-safety-and-misuse, Mar. 2022

work page 2022
[22]

Language Models are Unsupervised Multitask Learners,

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” 2019

work page 2019
[23]

G. C. Bowker and S. L. Star, Sorting Things Out. MIT Press, Aug. 2000

work page 2000
[24]

Taxonomy of Risks posed by Language Models,

L. Weidinger, J. Uesato, M. Rauh, C. Griffin, P.-S. Huang, J. Mellor, A. Glaese, M. Cheng, B. Balle, A. Kasirzadeh, C. Biles, S. Brown, Z. Kenton, W. Hawkins, T. Stepleton, A. Birhane, L. A. Hendricks, L. Rimell, W. Isaac, J. Haas, S. Legassick, G. Irving, and I. Gabriel, “Taxonomy of Risks posed by Language Models,” in 2022 ACM Conference on Fairness, Acco...

work page 2022
[25]

Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets,

I. Solaiman and C. Dennison, “Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets,” Nov. 2021

work page 2021
[26]

Toward Comprehensive Risk Assessments and Assurance of AI-Based Systems,

H. Khlaaf, “Toward Comprehensive Risk Assessments and Assurance of AI-Based Systems,” Trail of Bits, 2023

work page 2023
[27]

Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims,

M. Brundage, S. Avin, J. Wang, H. Belfield, G. Krueger, G. Hadfield, H. Khlaaf, J. Yang, H. Toner, R. Fong, T. Maharaj, P. W. Koh, S. Hooker, J. Leung, A. Trask, E. Bluemke, J. Lebensold, C. O’Keefe, M. Koren, T. Ryffel, J. B. Rubinovitz, T. Besiroglu, F. Carugati, J. Clark, P. Eckersley, S. de Haas, M. Johnson, B. Laurie, A. Ingerman, I. Krawczuk, A. Askel...

work page 2020
[28]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned,

D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Conerly, N. DasSarma, D. Drain, N. Elhage, S. El-Showk, S. Fort, Z. Hatﬁeld-Dodds, T. Henighan, D. Hernandez, T. Hume, J. Jacobson, S. Johnston, S. Kravec, C. Olsson, S. Ringer, E. Tran-Johnson, D. Amodei, T. Brow...

work page 2022
[29]

Red Teaming Language Models with Language Models,

E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving, “Red Teaming Language Models with Language Models,” Feb. 2022

work page 2022
[30]

A Hazard Analysis Framework for Code Synthesis Large Language Models,

H. Khlaaf, P. Mishkin, J. Achiam, G. Krueger, and M. Brundage, “A Hazard Analysis Framework for Code Synthesis Large Language Models,” July 2022

work page 2022
[31]

On Faithfulness and Factuality in Abstractive Summarization,

J. Maynez, S. Narayan, B. Bohnet, and R. McDonald, “On Faithfulness and Factuality in Abstractive Summarization,” May 2020

work page 2020
[32]

TruthfulQA: Measuring How Models Mimic Human False- hoods,

S. Lin, J. Hilton, and O. Evans, “TruthfulQA: Measuring How Models Mimic Human False- hoods,” May 2022

work page 2022
[33]

Forecasting potential misuses of language models for disinformation campaigns and how to reduce risk

J. A. Goldstein, G. Sastry, M. Musser, R. DiResta, M. Gentzel, and K. Sedova, “Forecasting potential misuses of language models for disinformation campaigns and how to reduce risk. ” https://openai.com/research/forecasting-misuse, Jan. 2023

work page 2023
[34]

Truthful AI: Developing and governing AI that does not lie,

O. Evans, O. Cotton-Barratt, L. Finnveden, A. Bales, A. Balwit, P. Wills, L. Righetti, and W. Saunders, “Truthful AI: Developing and governing AI that does not lie,” Oct. 2021

work page 2021
[35]

Detoxifying Language Models Risks Marginalizing Minority Voices,

A. Xu, E. Pathak, E. Wallace, S. Gururangan, M. Sap, and D. Klein, “Detoxifying Language Models Risks Marginalizing Minority Voices,” Apr. 2021

work page 2021
[36]

Measuring and Mitigating Unintended Bias in Text Classiﬁcation,

L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman, “Measuring and Mitigating Unintended Bias in Text Classiﬁcation,” in Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society , AIES ’18, (New York, NY, USA), pp. 67–73, Association for Computing Machinery, Dec. 2018

work page 2018
[37]

A Holistic Approach to Undesired Content Detection in the Real World,

T. Markov, C. Zhang, S. Agarwal, T. Eloundou, T. Lee, S. Adler, A. Jiang, and L. Weng, “A Holistic Approach to Undesired Content Detection in the Real World,” Feb. 2023. 73

2023
[38]

How should AI systems behave, and who should decide?

OpenAI, “How should AI systems behave, and who should decide?. ” https://ope- nai.com/blog/how-should-ai-systems-behave, Feb. 2023

work page 2023
[39]

Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models,

M. Rauh, J. Mellor, J. Uesato, P.-S. Huang, J. Welbl, L. Weidinger, S. Dathathri, A. Glaese, G. Irving, I. Gabriel, W. Isaac, and L. A. Hendricks, “Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models,” Oct. 2022

work page 2022
[40]

L., Barocas, S., Daum \'e , III, H., and Wallach, H

S. L. Blodgett, S. Barocas, H. Daumé III, and H. Wallach, “Language (Technology) is Power: A Critical Survey of "Bias" in NLP. ” https://arxiv.org/abs/2005.14050v2, May 2020

work page arXiv 2005
[41]

On Measures of Biases and Harms in NLP,

S. Dev, E. Sheng, J. Zhao, A. Amstutz, J. Sun, Y. Hou, M. Sanseverino, J. Kim, A. Nishi, N. Peng, and K.-W. Chang, “On Measures of Biases and Harms in NLP,” in Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022 , (Online only), pp. 246–267, Association for Computational Linguistics, Nov. 2022

work page 2022
[42]

Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings,

T. Bolukbasi, K.-W. Chang, J. Zou, V. Saligrama, and A. Kalai, “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings,” July 2016

work page 2016
[43]

Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them,

H. Gonen and Y. Goldberg, “Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , (Minneapolis, Minnesota), pp. 609...

work page 2019
[44]

Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns,

K. Webster, M. Recasens, V. Axelrod, and J. Baldridge, “Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns,” Oct. 2018

work page 2018
[45]

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? ,

E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell, “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? ,” in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency , (Virtual Event Canada), pp. 610–623, ACM, Mar. 2021

work page 2021
[46]

On the Opportunities and Risks of Foundation Models,

R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. Go...

work page 2021
[47]

S. U. Noble, Algorithms of Oppression . NYU Press, Feb. 2018

work page 2018
[48]

Dirty Data, Bad Predictions: How Civil Rights Violations Impact Police Data, Predictive Policing Systems, and Justice,

R. Richardson, J. Schultz, and K. Crawford, “Dirty Data, Bad Predictions: How Civil Rights Violations Impact Police Data, Predictive Policing Systems, and Justice,” Feb. 2019. 74

work page 2019
[49]

MacAskill, What We Owe The Future

W. MacAskill, What We Owe The Future . Basic Books, Aug. 2022

work page 2022
[50]

GPT-2: 1.5B release

OpenAI, “GPT-2: 1.5B release. ” https://openai.com/research/gpt-2-1-5b-release, Nov. 2019

work page 2019
[51]

All the News That’s Fit to Fabricate: AI- Generated Text as a Tool of Media Misinformation,

S. Kreps, R. M. McCain, and M. Brundage, “All the News That’s Fit to Fabricate: AI- Generated Text as a Tool of Media Misinformation,” Journal of Experimental Political Science , vol. 9, no. 1, pp. 104–117, 2022/ed

work page 2022
[52]

Truth, Lies, and Automation,

B. Buchanan, A. Lohn, M. Musser, and K. Sedova, “Truth, Lies, and Automation,” tech. rep., Center for Security and Emerging Technology, May 2021

work page 2021
[53]

AI’s Powers of Political Persuasion

A. Myers, “AI’s Powers of Political Persuasion. ” https://hai.stanford.edu/news/ais-powers- political-persuasion, Feb. 2023

work page 2023
[54]

Artiﬁcial intelligence can persuade humans on political issues,

H. Bai, J. Voelkel, J. Eichstaedt, and R. Willer, “Artiﬁcial intelligence can persuade humans on political issues,” 2023

work page 2023
[55]

On the Horizon: Interactive and Compositional Deepfakes,

E. Horvitz, “On the Horizon: Interactive and Compositional Deepfakes,” in INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION , pp. 653–661, Nov. 2022

work page 2022
[56]

Deep Fakes: A Looming Challenge for Privacy, Democracy, and National Security,

R. Chesney and D. K. Citron, “Deep Fakes: A Looming Challenge for Privacy, Democracy, and National Security,” July 2018

work page 2018
[57]

Dual use export licenses,

U.S. Department of Commerce, “Dual use export licenses,” March 13 2023. accessed 2023-03-13

work page 2023
[58]

Arms control, disarmament and non-proliferation in nato,

NATO, “Arms control, disarmament and non-proliferation in nato,” February 27 2023. accessed 2023-02-27

work page 2023
[59]

Extracting Training Data from Large Language Models,

N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson, A. Oprea, and C. Raﬀel, “Extracting Training Data from Large Language Models,” June 2021

work page 2021
[60]

Quantifying Memo- rization Across Neural Language Models,

N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, and C. Zhang, “Quantifying Memo- rization Across Neural Language Models,” Mar. 2023

work page 2023
[61]

Predictability and Surprise in Large Generative Models,

D. Ganguli, D. Hernandez, L. Lovitt, N. DasSarma, T. Henighan, A. Jones, N. Joseph, J. Kernion, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, D. Drain, N. Elhage, S. E. Showk, S. Fort, Z. Hatﬁeld-Dodds, S. Johnston, S. Kravec, N. Nanda, K. Ndousse, C. Olsson, D. Amodei, D. Amodei, T. Brown, J. Kaplan, S. McCandlish, C. Olah, and J. Clark, “Predictabili...

work page 2022
[62]

Emergent Abilities of Large Language Models,

J. Wei, Y. Tay, R. Bommasani, C. Raﬀel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus, “Emergent Abilities of Large Language Models,” Oct. 2022

work page 2022
[63]

The alignment problem from a deep learning perspec- tive,

R. Ngo, L. Chan, and S. Mindermann, “The alignment problem from a deep learning perspec- tive,” Feb. 2023

work page 2023
[64]

Bostrom, Superintelligence: Paths, Dangers, Strategies

N. Bostrom, Superintelligence: Paths, Dangers, Strategies . United Kingdom: Oxford University Press, Sept. 2014. 75

work page 2014
[65]

Harms from Increasingly Agentic Algorithmic Systems,

A. Chan, R. Salganik, A. Markelius, C. Pang, N. Rajkumar, D. Krasheninnikov, L. Langosco, Z. He, Y. Duan, M. Carroll, M. Lin, A. Mayhew, K. Collins, M. Molamohammadi, J. Burden, W. Zhao, S. Rismani, K. Voudouris, U. Bhatt, A. Weller, D. Krueger, and T. Maharaj, “Harms from Increasingly Agentic Algorithmic Systems,” Feb. 2023

work page 2023
[66]

Language Models as Agent Models,

J. Andreas, “Language Models as Agent Models,” Dec. 2022

work page 2022
[67]

Emergent Deception and Emergent Optimization

J. Steinhardt, “Emergent Deception and Emergent Optimization. ” https://bounded- regret.ghost.io/emergent-deception-optimization/, Feb. 2023

work page 2023
[68]

The Basic AI Drives,

S. M. Omohundro, “The Basic AI Drives,” in Proceedings of the 2008 Conference on Artiﬁcial General Intelligence 2008 , (NLD), pp. 483–492, IOS Press, June 2008

work page 2008
[69]

The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artiﬁcial Agents,

N. Bostrom, “The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artiﬁcial Agents,” Minds and Machines , vol. 22, pp. 71–85, May 2012

work page 2012
[70]

Optimal Policies Tend to Seek Power,

A. M. Turner, L. Smith, R. Shah, A. Critch, and P. Tadepalli, “Optimal Policies Tend to Seek Power,” Jan. 2023

work page 2023
[71]

Parametrically Retargetable Decision-Makers Tend To Seek Power,

A. M. Turner and P. Tadepalli, “Parametrically Retargetable Decision-Makers Tend To Seek Power,” Oct. 2022

work page 2022
[72]

Power-seeking can be probable and predictive for trained agents,

V. Krakovna and janos, “Power-seeking can be probable and predictive for trained agents,” Mar. 2023

work page 2023
[73]

Russell, Human Compatible: Artiﬁcial Intelligence and the Problem of Control

S. Russell, Human Compatible: Artiﬁcial Intelligence and the Problem of Control . Cham: Springer International Publishing, 2022

work page 2022
[74]

Is Power-Seeking AI an Existential Risk?,

J. Carlsmith, “Is Power-Seeking AI an Existential Risk?,” June 2022

work page 2022
[75]

Update on arc’s recent eval eﬀorts,

Alignment Research Center, “Update on arc’s recent eval eﬀorts,” March 2023 2023. accessed 2023-03-17

work page 2023
[76]

MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning,

E. Karpas, O. Abend, Y. Belinkov, B. Lenz, O. Lieber, N. Ratner, Y. Shoham, H. Bata, Y. Levine, K. Leyton-Brown, D. Muhlgay, N. Rozen, E. Schwartz, G. Shachaf, S. Shalev- Shwartz, A. Shashua, and M. Tenenholtz, “MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning,” May 2022

work page 2022
[77]

Toolformer: Language Models Can Teach Themselves to Use Tools,

T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language Models Can Teach Themselves to Use Tools,” Feb. 2023

work page 2023
[78]

Augmented Language Models: A Survey,

G. Mialon, R. Dessì, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Rozière, T. Schick, J. Dwivedi-Yu, A. Celikyilmaz, E. Grave, Y. LeCun, and T. Scialom, “Augmented Language Models: A Survey,” Feb. 2023

work page 2023
[79]

TALM: Tool Augmented Language Models,

A. Parisi, Y. Zhao, and N. Fiedel, “TALM: Tool Augmented Language Models,” May 2022

work page 2022
[80]

Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules,

D. Weininger, “Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules,” Journal of chemical information and computer sciences , vol. 28, no. 1, pp. 31–36, 1988

work page 1988
