arxiv: 2311.12983 · v1 · submitted 2023-11-21 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

GAIA: a benchmark for General AI Assistants

Clémentine Fourrier, Craig Swift, Grégoire Mialon, Thomas Scialom, Thomas Wolf, Yann LeCun

Pith reviewed 2026-05-12 15:40 UTC · model grok-4.3

keywords GAIA benchmark · general AI assistants · AGI milestone · tool use proficiency · multi-modality · web browsing · human-AI performance gap · real-world questions

The pith

GAIA benchmark shows humans at 92 percent accuracy on general questions where GPT-4 with plugins reaches only 15 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GAIA as a benchmark for general AI assistants made of 466 real-world questions that test core abilities including reasoning, multi-modal data handling, web browsing, and tool use. These questions are designed to be straightforward for humans, who score 92 percent, but difficult for current AI systems, which reach only 15 percent even when GPT-4 is equipped with plugins. This gap reverses the recent pattern in which large language models exceed human experts on narrow professional tasks such as those in law or chemistry. The authors argue that matching average human robustness on GAIA would mark a milestone toward artificial general intelligence because it requires flexible, everyday performance rather than superhuman skill in one domain. They release the questions and retain answers for 300 of them to support an ongoing public leaderboard.

Core claim

GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. Human respondents obtain 92 percent accuracy versus 15 percent for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills. The advent of AGI hinges on a system's capability to exhibit similar robustness as the average human does on such questions. Using GAIA's methodology, 466 questions and answers were created, with questions released and answers to 300 retained for a leaderboard.
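
As a quick arithmetic check on the figures above, the following Python sketch derives how many questions have answers that are not withheld and how large the human-to-model gap is in percentage points. The constant and function names are illustrative and do not come from the paper. The expected-correct counts assume, purely for illustration, that each reported accuracy applies uniformly to all 466 questions, which the abstract does not state.

```python
# Figures quoted in the summary above.
TOTAL_QUESTIONS = 466
RETAINED_ANSWERS = 300          # answers withheld to power the leaderboard
HUMAN_ACCURACY = 0.92
GPT4_PLUGINS_ACCURACY = 0.15


def public_answer_count(total: int, retained: int) -> int:
    """Number of questions whose answers are not withheld."""
    return total - retained


def gap_in_points(human: float, model: float) -> float:
    """Absolute accuracy gap, expressed in percentage points."""
    return round((human - model) * 100, 1)


def expected_correct(accuracy: float, n_questions: int) -> int:
    """Illustrative only: assumes the accuracy holds uniformly over n_questions."""
    return round(accuracy * n_questions)


if __name__ == "__main__":
    print(public_answer_count(TOTAL_QUESTIONS, RETAINED_ANSWERS))
    print(gap_in_points(HUMAN_ACCURACY, GPT4_PLUGINS_ACCURACY))
    print(expected_correct(HUMAN_ACCURACY, TOTAL_QUESTIONS))
    print(expected_correct(GPT4_PLUGINS_ACCURACY, TOTAL_QUESTIONS))
```

Under that uniformity assumption, the gap corresponds to several hundred questions that a human respondent answers correctly and the model does not.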

What carries the argument

The GAIA benchmark itself: a set of 466 real-world questions that are conceptually simple for humans but demand combined reasoning, multi-modal processing, browsing, and tool-use skills to answer correctly.

Load-bearing premise

That matching the robustness of an average human on these everyday questions is the key requirement for achieving AGI.

What would settle it

An AI system reaching near 90 percent accuracy on the GAIA set through narrow memorization or task-specific tuning, yet still failing on new but similar real-world tasks outside the benchmark, would show that solving GAIA does not indicate general intelligence.
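
One way to make this test concrete is to score the same system on the benchmark and on a fresh set of comparable tasks, then compare the two accuracies. The Python sketch below assumes normalized exact-match scoring, which is an assumption made here because the abstract does not specify a scoring rule. All function names and the toy data are hypothetical.

```python
from typing import Sequence


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplistic normalization)."""
    return " ".join(answer.strip().lower().split())


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that match the gold answer after normalization."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("at least one question is required")
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, gold))
    return correct / len(gold)


def generalization_gap(benchmark_acc: float, fresh_acc: float) -> float:
    """Positive values mean the system does worse on fresh tasks than on the benchmark."""
    return benchmark_acc - fresh_acc


if __name__ == "__main__":
    # Hypothetical toy data, not drawn from GAIA.
    bench = accuracy(["Paris", "42", "blue"], ["paris", "42", "Blue"])
    fresh = accuracy(["Rome", "7", "red"], ["Rome", "8", "green"])
    print(generalization_gap(bench, fresh))
```

A large positive gap on genuinely similar fresh tasks would be the kind of evidence described above; a gap near zero would not by itself establish general intelligence.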

Original abstract

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions. Using GAIA's methodology, we devise 466 questions and their answer. We release our questions while retaining answers to 300 of them to power a leader-board available at https://huggingface.co/gaia-benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces GAIA, a benchmark for General AI Assistants consisting of 466 real-world questions that require fundamental abilities including reasoning, multi-modality handling, web browsing, and tool-use proficiency. It reports human performance at 92% accuracy versus 15% for GPT-4 equipped with plugins, contrasting this with trends where LLMs outperform humans on professional tasks. The authors argue that AGI progress depends on achieving robustness comparable to the average human on such questions and release the questions with answers to 300 retained for a leaderboard.

Significance. If the reported human-AI performance gap is substantiated, GAIA offers a significant contribution by providing a benchmark that targets general robustness rather than specialized or superhuman capabilities. This could help steer AI research towards more practical, real-world assistant systems. The authors deserve credit for making the questions publicly available and setting up a leaderboard to enable ongoing assessment and reproducibility.

major comments (2)

[Abstract] The performance disparity (92% human vs. 15% GPT-4+plugins) is central to the paper's claim of a notable gap and the departure from harder-than-human benchmarks. However, the abstract provides no details on the human respondent selection process, number of participants, demographics, or any validation of the 92% figure, which undermines the ability to interpret it as representative of 'average human' robustness as stated in the AGI posit.
[Benchmark construction description] The methodology for devising, selecting, and balancing the 466 questions to ensure they test the claimed set of fundamental abilities (reasoning, multi-modality, web browsing, tool-use) while remaining conceptually simple for humans is not described. This is load-bearing for establishing the benchmark's validity and the reported performance numbers.
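
To illustrate why the number of human respondents matters for the first comment, the Python sketch below computes a Wilson score interval for an observed proportion. The sample sizes in the usage loop are hypothetical, since the abstract does not report how many responses underlie the 92% figure, and the function name is an illustrative choice rather than anything from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Hypothetical sample sizes, each with an observed accuracy of 92%.
    for n in (25, 100, 400):
        low, high = wilson_interval(round(0.92 * n), n)
        print(f"n={n}: [{low:.3f}, {high:.3f}]")
```

The interval narrows as the sample grows, which is why the same 92% point estimate supports a much weaker claim about the 'average human' when it comes from a small group of respondents.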

minor comments (1)

[Abstract] The abstract states 'their answer' in reference to the 466 questions; this should be corrected to 'their answers' for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential value of GAIA as a benchmark focused on general robustness. We address each major comment below and indicate the revisions we will make to the manuscript.

Referee: [Abstract] The performance disparity (92% human vs. 15% GPT-4+plugins) is central to the paper's claim of a notable gap and the departure from harder-than-human benchmarks. However, the abstract provides no details on the human respondent selection process, number of participants, demographics, or any validation of the 92% figure, which undermines the ability to interpret it as representative of 'average human' robustness as stated in the AGI posit.

Authors: We agree that the abstract should concisely summarize the human evaluation methodology to support the reported performance gap and the reference to average human robustness. We will revise the abstract to include key details on the participant selection process, number of respondents, and validation approach used to obtain the 92% figure. revision: yes

Referee: [Benchmark construction description] The methodology for devising, selecting, and balancing the 466 questions to ensure they test the claimed set of fundamental abilities (reasoning, multi-modality, web browsing, tool-use) while remaining conceptually simple for humans is not described. This is load-bearing for establishing the benchmark's validity and the reported performance numbers.

Authors: We acknowledge that a more detailed account of the benchmark construction process is needed to fully substantiate the design choices and performance claims. We will expand the relevant section of the manuscript with a step-by-step description of how the questions were devised, selected, and balanced across the targeted abilities, including the criteria used to maintain conceptual simplicity for humans. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct performance reporting

The paper introduces GAIA as a new benchmark consisting of 466 real-world questions and reports empirical accuracy scores (92% human vs. 15% GPT-4+plugins) without any mathematical derivation, parameter fitting, or predictive modeling. The central posit that AGI requires average-human robustness on these questions is presented as a philosophical stance rather than a derived result from equations or self-referential data. No self-citations, ansatzes, or uniqueness theorems are invoked to justify core claims, and the reported disparity is a direct measurement rather than a fitted or renamed input. The work is self-contained as an empirical proposal with no load-bearing steps that reduce to their own definitions or prior author outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review limited to abstract; no explicit free parameters, invented entities, or detailed axioms visible. The central claim rests on the assumption that the chosen questions adequately probe fundamental abilities required for general AI.

axioms (1)

domain assumption: The selected real-world questions require fundamental abilities such as reasoning, multi-modality handling, web browsing, and tool-use proficiency.
Stated directly in the abstract as the basis for the benchmark design.

pith-pipeline@v0.9.0 · 5508 in / 1293 out tokens · 53150 ms · 2026-05-12T15:40:38.187830+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.RealityFromDistinction reality_from_one_distinction unclear
GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. ... human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills.
IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear
We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions.

Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search
cs.SD 2026-05 unverdicted novelty 8.0

Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
cs.CL 2026-04 unverdicted novelty 8.0

OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perf...
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
cs.AI 2024-04 accept novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
cs.AI 2026-05 conditional novelty 7.0

BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
Learning Agentic Policy from Action Guidance
cs.CL 2026-05 unverdicted novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems
cs.CR 2026-05 unverdicted novelty 7.0

Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.
Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?
cs.CL 2026-05 unverdicted novelty 7.0

Goal clarifications lose nearly all value after 10% of execution while input clarifications retain value until roughly 50%, and asking any type past mid-trajectory hurts performance more than never asking.
AcademiClaw: When Students Set Challenges for AI Agents
cs.AI 2026-05 unverdicted novelty 7.0

AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
cs.AI 2026-05 accept novelty 7.0

NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...
Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference
cs.AI 2026-05 unverdicted novelty 7.0

TokenArena is a continuous benchmark for AI inference endpoints that measures output speed, time to first token, blended price, effective context, quality, and modeled energy to produce composites of joules per correc...
AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment
cs.AI 2026-04 conditional novelty 7.0

AgentPulse is a continuous multi-signal framework that scores AI agents on benchmark performance, adoption, sentiment and ecosystem health, showing these factors are complementary and that benchmark-plus-sentiment pre...
AEL: Agent Evolving Learning for Open-Ended Environments
cs.CL 2026-04 conditional novelty 7.0

AEL uses a fast-timescale bandit for memory policy selection and slow-timescale LLM reflection for causal insights, achieving a Sharpe ratio of 2.13 on a 208-episode portfolio benchmark while showing that added mechan...
Waking Up Blind: Cold-Start Optimization of Supervision-Free Agentic Trajectories for Grounded Visual Perception
cs.AI 2026-04 unverdicted novelty 7.0

SPECTRA enables supervision-free bootstrapping of agentic capabilities in SVLMs via cascaded tool rollout alignment, multi-objective rewards, and the TIU metric, yielding up to 5% higher task accuracy and 9% better to...
KAIJU: An Executive Kernel for Intent-Gated Execution of LLM Agents
cs.SE 2026-03 accept novelty 7.0

KAIJU decouples LLM reasoning from execution using a specialized kernel and Intent-Gated Execution to enable parallel tool scheduling and robust security.
gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy
gr-qc 2026-05 unverdicted novelty 6.0

LLM coding agents cannot reach the 10^{-4} relative accuracy required for gravitational wave modeling tasks and show systematic failures including metric misuse and result fabrication.
EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems
cs.AI 2026-05 unverdicted novelty 6.0

EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...
Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents
cs.MA 2026-05 unverdicted novelty 6.0

Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.
CL-bench Life: Can Language Models Learn from Real-Life Context?
cs.CL 2026-04 unverdicted novelty 6.0

CL-bench Life shows frontier language models achieve only 13.8% average success on real-life context tasks, with the best model at 19.3%.
AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum
cs.AI 2026-04 unverdicted novelty 6.0

AIT Academy introduces a tripartite curriculum for AI agents across natural science, humanities, and social science domains, with reported gains of 15.9 points in security and 7 points in social reasoning under specif...
CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation
cs.AI 2026-04 unverdicted novelty 6.0

CADMAS-CTX replaces static skill profiles with context-conditioned Beta posteriors and uncertainty-penalized routing, yielding higher accuracy on GAIA (0.442) and SWE-bench (31.4%) than static baselines.
Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
cs.AI 2026-04 unverdicted novelty 6.0

Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...
TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving
cs.CL 2026-04 unverdicted novelty 6.0

TEC is a new public dataset of detailed human trial-and-error trajectories and reflections on web tasks, with humans showing substantially higher accuracy than LLMs.
Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning
cs.PF 2026-04 unverdicted novelty 6.0

PTE is a hardware-aware metric that better predicts actual inference latency in tool-integrated reasoning than token counts and reveals that high-PTE trajectories often have lower correctness.
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
cs.CL 2024-10 unverdicted novelty 6.0

OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
cs.LG 2024-10 accept novelty 6.0

AgentHarm benchmark shows leading LLMs comply with malicious agent requests and simple jailbreaks enable coherent harmful multi-step execution while retaining capabilities.
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
cs.AI 2026-05 conditional novelty 5.0

Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
Toward a Science of Intent: Closure Gaps and Delegation Envelopes for Open-World AI Agents
cs.AI 2026-04 unverdicted novelty 5.0

Intent compilation turns vague human goals into verifiable artifacts, using closure-gap vectors and delegation envelopes to separate open-world agent challenges from closed-world solvers and to benchmark closure fixes...
MTRouter: Cost-Aware Multi-Turn LLM Routing with History-Model Joint Embeddings
cs.CL 2026-04 unverdicted novelty 5.0

MTRouter learns turn-level model utility predictors from logged trajectories using history-model joint embeddings, delivering 58.7% cost reduction on ScienceWorld and 43.4% on HLE while matching or exceeding GPT-5 per...
Mind DeepResearch Technical Report
cs.AI 2026-04 unverdicted novelty 5.0

MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.
Agentic Performance at the Edge: Insights from Benchmarking
cs.AI 2026-05 unverdicted novelty 4.0

Edge agentic AI quality is not a simple function of model size; robust results require joint design of model selection and tool integration, as revealed by domain-conditioned benchmarks showing accuracy-latency Pareto fronts.
Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces
cs.CL 2026-05 unverdicted novelty 4.0

This survey organizes RL for LLM multi-agent systems into reward families, credit units, and five orchestration sub-decisions, notes the absence of explicit stopping-decision training in its paper pool, and releases a...

Reference graph

Works this paper leans on

200 extracted references · 200 canonical work pages · cited by 31 Pith papers · 22 internal anchors

[1]

2023 , eprint=

Levels of AGI: Operationalizing Progress on the Path to AGI , author=. 2023 , eprint=

work page 2023
[2]

2023 , eprint=

On the Tool Manipulation Capability of Open-source Large Language Models , author=. 2023 , eprint=

work page 2023
[3]

2023 , eprint=

API-Bank: A Benchmark for Tool-Augmented LLMs , author=. 2023 , eprint=

work page 2023
[4]

2023 , journal =

AgentBench: Evaluating LLMs as Agents , author =. 2023 , journal =

work page 2023
[5]

and Friedman, Batya

Bender, Emily M. and Friedman, Batya. Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. Transactions of the Association for Computational Linguistics. 2018

work page 2018
[7]

2019 , eprint=

On the Measure of Intelligence , author=. 2019 , eprint=

work page 2019
[8]

2022 , eprint=

Training language models to follow instructions with human feedback , author=. 2022 , eprint=

work page 2022
[9]

2023 , eprint=

Efficient Benchmarking (of Language Models) , author=. 2023 , eprint=

work page 2023
[10]

2023 , eprint=

Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models , author=. 2023 , eprint=

work page 2023
[12]

Pricing via Processing or Combatting Junk Mail

Dwork, Cynthia and Naor, Moni. Pricing via Processing or Combatting Junk Mail. Advances in Cryptology --- CRYPTO' 92. 1993

work page 1993
[13]

AUTOMATION, PARTIAL AND FULL , volume=

Growiec, Jakub , year=. AUTOMATION, PARTIAL AND FULL , volume=. Macroeconomic Dynamics , publisher=

work page
[14]

2023 , eprint=

Making Large Language Models Better Reasoners with Step-Aware Verifier , author=. 2023 , eprint=

work page 2023
[18]

CodeBERT: A Pre-Trained Model for Programming and Natural Languages

Codebert: A pre-trained model for programming and natural languages , author=. arXiv preprint arXiv:2002.08155 , year=

work page internal anchor Pith review arXiv 2002
[19]

The Journal of Machine Learning Research , volume=

Exploring the limits of transfer learning with a unified text-to-text transformer , author=. The Journal of Machine Learning Research , volume=. 2020 , publisher=

work page 2020
[20]

2023 , eprint=

Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification , author=. 2023 , eprint=

work page 2023
[21]

2023 , url =

Model Card and Evaluations for Claude Models , author =. 2023 , url =

work page 2023
[22]

Advances in Neural Information Processing Systems , editor=

Flamingo: a Visual Language Model for Few-Shot Learning , author=. Advances in Neural Information Processing Systems , editor=. 2022 , url=

work page 2022
[23]

2023 , eprint=

How is ChatGPT's behavior changing over time? , author=. 2023 , eprint=

work page 2023
[24]

2023 , eprint=

Capabilities of GPT-4 on Medical Challenge Problems , author=. 2023 , eprint=

work page 2023
[26]

2023 , eprint=

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena , author=. 2023 , eprint=

work page 2023
[27]

2023 , eprint=

Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=

work page 2023
[28]

2023 , eprint=

GPT-4 Technical Report , author=. 2023 , eprint=

work page 2023
[29]

2020 , eprint=

Language Models are Few-Shot Learners , author=. 2020 , eprint=

work page 2020
[30]

2021 , eprint=

Measuring Massive Multitask Language Understanding , author=. 2021 , eprint=

work page 2021
[31]

2022 , eprint=

Holistic Evaluation of Language Models , author=. 2022 , eprint=

work page 2022
[32]

2023 , eprint=

LIMA: Less Is More for Alignment , author=. 2023 , eprint=

work page 2023
[33]

2023 , eprint=

PaLM 2 Technical Report , author=. 2023 , eprint=

work page 2023
[34]

2023 , eprint=

Augmented Language Models: a Survey , author=. 2023 , eprint=

work page 2023
[35]

2023 , eprint=

LLaMA: Open and Efficient Foundation Language Models , author=. 2023 , eprint=

work page 2023
[36]

2021 , eprint=

Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=

work page 2021
[37]

2019 , eprint=

HellaSwag: Can a Machine Really Finish Your Sentence? , author=. 2019 , eprint=

work page 2019
[38]

2019 , eprint=

WinoGrande: An Adversarial Winograd Schema Challenge at Scale , author=. 2019 , eprint=

work page 2019
[39]

2021 , eprint=

Evaluating Large Language Models Trained on Code , author=. 2021 , eprint=

work page 2021
[40]

2022 , eprint=

PaLM: Scaling Language Modeling with Pathways , author=. 2022 , eprint=

work page 2022
[41]

2023 , eprint=

Progressive-Hint Prompting Improves Reasoning in Large Language Models , author=. 2023 , eprint=

work page 2023
[42]

2023 , eprint=

Self-Consistency Improves Chain of Thought Reasoning in Language Models , author=. 2023 , eprint=

work page 2023
[43]

2018 , eprint=

Know What You Don't Know: Unanswerable Questions for SQuAD , author=. 2018 , eprint=

work page 2018
[44]

Semantic

Microsoft , year =. Semantic

work page
[45]

2023 , month = sep, urldate =

work page 2023
[46]

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Chan, Chi-Min and Chen, Weize and Su, Yusheng and Yu, Jianxuan and Xue, Wei and Zhang, Shanghang and Fu, Jie and Liu, Zhiyuan , year =. doi:10.48550/arXiv.2308.07201 , urldate =. 2308.07201 , publisher =

work page internal anchor Pith review doi:10.48550/arxiv.2308.07201
[47]

Chase, Harrison , year =

work page
[48]

Dong, Xin Luna and Moon, Seungwhan and Xu, Yifan Ethan and Malik, Kshitiz and Yu, Zhou , year =. Towards. Proceedings of the 29th. doi:10.1145/3580305.3599572 , urldate =

work page doi:10.1145/3580305.3599572
[49]

doi:10.48550/arXiv.2306.08640 , urldate =

Gao, Difei and Ji, Lei and Zhou, Luowei and Lin, Kevin Qinghong and Chen, Joya and Fan, Zihan and Shou, Mike Zheng , year =. doi:10.48550/arXiv.2306.08640 , urldate =. 2306.08640 , publisher =

work page doi:10.48550/arxiv.2306.08640
[50]

2023 , eprint=

OpenAGI: When LLM Meets Domain Experts , author=. 2023 , eprint=

work page 2023
[51]

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Hong, Sirui and Zheng, Xiawu and Chen, Jonathan and Cheng, Yuheng and Wang, Jinlin and Zhang, Ceyao and Wang, Zili and Yau, Steven Ka Shing and Lin, Zijuan and Zhou, Liyang and Ran, Chenyu and Xiao, Lingfeng and Wu, Chenglin , year =. doi:10.48550/arXiv.2308.00352 , urldate =. 2308.00352 , publisher =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2308.00352
[52]

doi:10.48550/arXiv.2304.08244 , urldate =

Li, Minghao and Song, Feifan and Yu, Bowen and Yu, Haiyang and Li, Zhoujun and Huang, Fei and Li, Yongbin , year =. doi:10.48550/arXiv.2304.08244 , urldate =. 2304.08244 , publisher =

work page doi:10.48550/arxiv.2304.08244
[53]

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

Li, Guohao and Hammoud, Hasan Abed Al Kader and Itani, Hani and Khizbullin, Dmitrii and Ghanem, Bernard , year =. doi:10.48550/arXiv.2303.17760 , urldate =. 2303.17760 , publisher =

work page internal anchor Pith review doi:10.48550/arxiv.2303.17760
[54]

AgentSims: An Open-Source Sandbox for Large Language Model Evaluation

Lin, Jiaju and Zhao, Haoran and Zhang, Aochi and Wu, Yiting and Ping, Huqiuyue and Chen, Qin , year =. doi:10.48550/arXiv.2308.04026 , urldate =. 2308.04026 , publisher =

work page doi:10.48550/arxiv.2308.04026
[55]

AgentBench: Evaluating LLMs as Agents

Liu, Xiao and Yu, Hao and Zhang, Hanchen and Xu, Yifan and Lei, Xuanyu and Lai, Hanyu and Gu, Yu and Ding, Hangliang and Men, Kaiwen and Yang, Kejuan and Zhang, Shudan and Deng, Xiang and Zeng, Aohan and Du, Zhengxiao and Zhang, Chenhui and Shen, Sheng and Zhang, Tianjun and Su, Yu and Sun, Huan and Huang, Minlie and Dong, Yuxiao and Tang, Jie , year =. d...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2308.03688
[56]

doi:10.48550/arXiv.2308.05960 , urldate =

Liu, Zhiwei and Yao, Weiran and Zhang, Jianguo and Xue, Le and Heinecke, Shelby and Murthy, Rithesh and Feng, Yihao and Chen, Zeyuan and Niebles, Juan Carlos and Arpit, Devansh and Xu, Ran and Mui, Phil and Wang, Huan and Xiong, Caiming and Savarese, Silvio , year =. doi:10.48550/arXiv.2308.05960 , urldate =. 2308.05960 , publisher =

work page doi:10.48550/arxiv.2308.05960
[57]

Nakajima, Yohei , year =

work page
[58]

Task-Driven

Nakajima, Yohei , year =. Task-Driven

work page
[59]

WebGPT: Browser-assisted question-answering with human feedback

Nakano, Reiichiro and Hilton, Jacob and Balaji, Suchir and Wu, Jeff and Ouyang, Long and Kim, Christina and Hesse, Christopher and Jain, Shantanu and Kosaraju, Vineet and Saunders, William and Jiang, Xu and Cobbe, Karl and Eloundou, Tyna and Krueger, Gretchen and Button, Kevin and Knight, Matthew and Chess, Benjamin and Schulman, John , year =. doi:10.485...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2112.09332
[60]

Osika, Anton , year =

work page
[61]

Gorilla: Large Language Model Connected with Massive APIs

Patil, Shishir G. and Zhang, Tianjun and Wang, Xin and Gonzalez, Joseph E. , year =. Gorilla:. doi:10.48550/arXiv.2305.15334 , urldate =. 2305.15334 , publisher =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.15334
[62]

Rush, Sasha , year =

work page
[63]

Toolformer: Language Models Can Teach Themselves to Use Tools

Schick, Timo and. Toolformer:. 2023 , month = feb, number =. doi:10.48550/arXiv.2302.04761 , urldate =. 2302.04761 , publisher =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2302.04761 2023
[64]

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Shen, Yongliang and Song, Kaitao and Tan, Xu and Li, Dongsheng and Lu, Weiming and Zhuang, Yueting , year =. doi:10.48550/arXiv.2303.17580 , urldate =. 2303.17580 , publisher =

work page internal anchor Pith review doi:10.48550/arxiv.2303.17580
[65]

arXiv preprint arXiv:2303.08128 , year=

Sur. 2023 , month = mar, number =. doi:10.48550/arXiv.2303.08128 , urldate =. 2303.08128 , publisher =

work page doi:10.48550/arxiv.2303.08128 2023
[66]

Talebirad, Yashar and Nadiri, Amirhossein , year =. Multi-. doi:10.48550/arXiv.2306.03314 , urldate =. 2306.03314 , publisher =

work page doi:10.48550/arxiv.2306.03314
[67]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Wu, Qingyun and Bansal, Gagan and Zhang, Jieyu and Wu, Yiran and Zhang, Shaokun and Zhu, Erkang and Li, Beibin and Jiang, Li and Zhang, Xiaoyun and Wang, Chi , year =. doi:10.48550/arXiv.2308.08155 , urldate =. 2308.08155 , publisher =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2308.08155
[68]

Wu, Chenfei and Yin, Shengming and Qi, Weizhen and Wang, Xiaodong and Tang, Zecheng and Duan, Nan. Visual... arXiv:2303.04671.

doi:10.48550/arXiv.2303.04671
[69]

Gentopia: A collaborative platform for tool-augmented LLMs

Xu, Binfeng and Liu, Xukun and Shen, Hua and Han, Zeyu and Li, Yuhan and Yue, Murong and Peng, Zhiyuan and Liu, Yuchen and Yao, Ziyu and Xu, Dongkuan. arXiv:2308.04030.

doi:10.48550/arXiv.2308.04030
[70]

Yang, Hui and Yue, Sifu and He, Yunzhong. Auto-... arXiv:2306.02224.

doi:10.48550/arXiv.2306.02224
[71]

Socratic models: Composing zero-shot multimodal reasoning with language

Zeng, Andy and Attarian, Maria and Ichter, Brian and Choromanski, Krzysztof and Wong, Adrian and Welker, Stefan and Tombari, Federico and Purohit, Aveek and Ryoo, Michael and Sindhwani, Vikas and Lee, Johnny and Vanhoucke, Vincent and Florence, Pete. arXiv:2204.00598.

doi:10.48550/arXiv.2204.00598
[72]

ToolQA: A dataset for LLM question answering with external tools

Zhuang, Yuchen and Yu, Yue and Wang, Kuan and Sun, Haotian and Zhang, Chao. arXiv:2306.13304.

doi:10.48550/arXiv.2306.13304
[73]

Alex, Neel and Lifland, Eli and Tunstall, Lewis and Thakur, Abhishek and Maham, Pegah and Riedel, C. Jess and Hine, Emmie and Ashurst, Carolyn and Sedille, Paul and Carlier, Alexis and Noetel, Michael and Stuhlm... Proceedings of the 35th..., January 2022.

doi:10.48550/arXiv.2109.14076
[74]

Austin, Jacob and Odena, Augustus and Nye, Maxwell and Bosma, Maarten and Michalewski, Henryk and Dohan, David and Jiang, Ellen and Cai, Carrie and Terry, Michael and Le, Quoc and Sutton, Charles. Program... arXiv:2108.07732.

doi:10.48550/arXiv.2108.07732
[75]

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

Bajaj, Payal and Campos, Daniel and Craswell, Nick and Deng, Li and Gao, Jianfeng and Liu, Xiaodong and Majumder, Rangan and McNamara, Andrew and Mitra, Bhaskar and Nguyen, Tri and Rosenberg, Mir and Song, Xia and Stoica, Alina and Tiwary, Saurabh and Wang, Tong. arXiv:1611.09268.

doi:10.48550/arXiv.1611.09268
[76]

Semantic...

Berant, Jonathan and Chou, Andrew and Frostig, Roy and Liang, Percy. Proceedings of the 2013..., 2013.
[77]

Beeching, Edward and Fourrier, Clémentine and Habib, Nathan and Han, Sheon and Lambert, Nathan and Rajani, Nazneen and Sanseviero, Omar and Tunstall, Lewis and Wolf, Thomas. 2023.
[78]

Bisk, Yonatan and Zellers, Rowan and Bras, Ronan Le and Gao, Jianfeng and Choi, Yejin. The...
[79]

Borkan, Daniel and Dixon, Lucas and Sorensen, Jeffrey and Thain, Nithum and Vasserman, Lucy. Nuanced... Companion...

doi:10.1145/3308560.3317593
[80]

Language...

Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and... Advances in..., 2020.
[81]

Buchanan, Ben and Lohn, Andrew and Musser, Micah and Sedova, Katerina. Truth,...
[82]

Evaluating Large Language Models Trained on Code

Chen, Mark and Tworek, Jerry and Jun, Heewoo and Yuan, Qiming and Pinto, Henrique Ponde de Oliveira and Kaplan, Jared and Edwards, Harri and Burda, Yuri and Joseph, Nicholas and Brockman, Greg and Ray, Alex and Puri, Raul and Krueger, Gretchen and Petrov, Michael and Khlaaf, Heidy and Sastry, Girish and Mishkin, Pamela and Chan, Brooke and Gray, Scott and...

doi:10.48550/arXiv.2107.03374, 2021
[83]

Choi, Eunsol and He, He and Iyyer, Mohit and Yatskar, Mark and Yih, Wen-tau and Choi, Yejin and Liang, Percy and Zettlemoyer, Luke. Proceedings of the 2018..., 2018.

doi:10.18653/v1/D18-1241
[84]

PaLM: Scaling Language Modeling with Pathways

Chowdhery, Aakanksha and Narang, Sharan and Devlin, Jacob and Bosma, Maarten and Mishra, Gaurav and Roberts, Adam and Barham, Paul and Chung, Hyung Won and Sutton, Charles and Gehrmann, Sebastian and Schuh, Parker and Shi, Kensen and Tsvyashchenko, Sasha and Maynez, Joshua and Rao, Abhishek and Barnes, Parker and Tay, Yi and Shazeer, Noam and Prabhakaran,...

doi:10.48550/arXiv.2204.02311, 2022
[85]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Clark, Peter and Cowhey, Isaac and Etzioni, Oren and Khot, Tushar and Sabharwal, Ashish and Schoenick, Carissa and Tafjord, Oyvind. arXiv:1803.05457.

doi:10.48550/arXiv.1803.05457
[86]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Clark, Christopher and Lee, Kenton and Chang, Ming-Wei and Kwiatkowski, Tom and Collins, Michael and Toutanova, Kristina. Proceedings of the 2019..., 2019.

doi:10.18653/v1/N19-1300

Showing first 80 references.