pith. machine review for the scientific record.

arxiv: 2311.12983 · v1 · submitted 2023-11-21 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

· Lean Theorem

GAIA: a benchmark for General AI Assistants

Clémentine Fourrier, Craig Swift, Grégoire Mialon, Thomas Scialom, Thomas Wolf, Yann LeCun

Pith reviewed 2026-05-12 15:40 UTC · model grok-4.3

keywords GAIA benchmark · general AI assistants · AGI milestone · tool use proficiency · multi-modality · web browsing · human-AI performance gap · real-world questions

The pith

GAIA benchmark shows humans at 92 percent accuracy on general questions where GPT-4 with plugins reaches only 15 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GAIA as a benchmark for general AI assistants made of 466 real-world questions that test core abilities including reasoning, multi-modal data handling, web browsing, and tool use. These questions are designed to be straightforward for humans, who score 92 percent, but difficult for current AI systems, which reach only 15 percent even when GPT-4 is equipped with plugins. This gap reverses the recent pattern in which large language models exceed human experts on narrow professional tasks such as those in law or chemistry. The authors argue that matching average human robustness on GAIA would mark a milestone toward artificial general intelligence because it requires flexible, everyday performance rather than superhuman skill in one domain. They release the questions and retain answers for 300 of them to support an ongoing public leaderboard.

Core claim

GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and tool-use proficiency more generally. Human respondents obtain 92 percent accuracy versus 15 percent for GPT-4 equipped with plugins. This disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills. The authors posit that the advent of AGI hinges on a system's capability to exhibit robustness similar to that of the average human on such questions. Using GAIA's methodology, the authors created 466 questions and answers; the questions are released, while answers to 300 are retained for a leaderboard.
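The figures in this claim imply a few derived quantities that are not stated outright. A minimal sketch of that arithmetic follows, using only numbers quoted from the abstract; the variable names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract.
TOTAL_QUESTIONS = 466
WITHHELD_ANSWERS = 300        # answers retained for the leaderboard
HUMAN_ACCURACY = 0.92
GPT4_PLUGINS_ACCURACY = 0.15

# Questions whose answers are not withheld for the leaderboard.
answers_not_withheld = TOTAL_QUESTIONS - WITHHELD_ANSWERS

# Absolute gap in percentage points, and the human-to-model accuracy ratio.
gap_points = round((HUMAN_ACCURACY - GPT4_PLUGINS_ACCURACY) * 100)
accuracy_ratio = HUMAN_ACCURACY / GPT4_PLUGINS_ACCURACY

print(answers_not_withheld)      # 166
print(gap_points)                # 77
print(round(accuracy_ratio, 1))  # 6.1
```

In other words, the reported gap is 77 percentage points, with humans answering correctly roughly six times as often as the GPT-4 plus plugins baseline.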

What carries the argument

The GAIA benchmark itself: a set of 466 real-world questions that are conceptually simple for humans but demand combined reasoning, multi-modal processing, browsing, and tool-use skills to answer correctly.

Load-bearing premise

That matching the robustness of an average human on these everyday questions is the key requirement for achieving AGI.

What would settle it

An AI system reaching near 90 percent accuracy on the GAIA set through narrow memorization or task-specific tuning, yet still failing on new but similar real-world tasks outside the benchmark, would show that solving GAIA does not indicate general intelligence.
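The test described above amounts to comparing accuracy on the benchmark with accuracy on fresh tasks of the same kind. The snippet below is a minimal sketch under stated assumptions: the scorer is a simple case-insensitive exact match chosen for brevity (the source does not describe GAIA's scoring rule), the function names are hypothetical, and all question data is made-up toy data rather than anything drawn from GAIA.

```python
def accuracy(predictions, references):
    """Fraction of predictions that equal the reference after trimming and lowercasing.

    Assumption: plain normalized exact match, used here only for illustration.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one question")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


def generalization_gap(benchmark_accuracy, held_out_accuracy):
    """Positive values mean the system does better on the benchmark than on new tasks."""
    return benchmark_accuracy - held_out_accuracy


# Toy data (hypothetical): a system that has memorized the benchmark answers...
bench_preds = ["paris", "42", "blue", "7"]
bench_refs = ["Paris", "42", "Blue", "7"]
# ...but struggles on new, similar questions.
new_preds = ["rome", "10", "green", "3"]
new_refs = ["Rome", "12", "red", "5"]

bench_acc = accuracy(bench_preds, bench_refs)   # 1.0
new_acc = accuracy(new_preds, new_refs)         # 0.25
gap = generalization_gap(bench_acc, new_acc)    # 0.75
print(bench_acc, new_acc, gap)
```

A large positive gap of this kind is the pattern that would undercut the premise; a gap near zero on genuinely new tasks would be consistent with it.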

Original abstract

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions. Using GAIA's methodology, we devise 466 questions and their answer. We release our questions while retaining answers to 300 of them to power a leader-board available at https://huggingface.co/gaia-benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces GAIA, a benchmark for General AI Assistants consisting of 466 real-world questions that require fundamental abilities including reasoning, multi-modality handling, web browsing, and tool-use proficiency. It reports human performance at 92% accuracy versus 15% for GPT-4 equipped with plugins, contrasting this with trends where LLMs outperform humans on professional tasks. The authors argue that AGI progress depends on achieving robustness comparable to the average human on such questions and release the questions with answers to 300 retained for a leaderboard.

Significance. If the reported human-AI performance gap is substantiated, GAIA offers a significant contribution by providing a benchmark that targets general robustness rather than specialized or superhuman capabilities. This could help steer AI research towards more practical, real-world assistant systems. The authors deserve credit for making the questions publicly available and setting up a leaderboard to enable ongoing assessment and reproducibility.

major comments (2)
  1. [Abstract] The performance disparity (92% human vs. 15% GPT-4+plugins) is central to the paper's claim of a notable gap and the departure from harder-than-human benchmarks. However, the abstract provides no details on the human respondent selection process, number of participants, demographics, or any validation of the 92% figure, which undermines the ability to interpret it as representative of 'average human' robustness as stated in the AGI posit.
  2. [Benchmark construction description] The methodology for devising, selecting, and balancing the 466 questions to ensure they test the claimed set of fundamental abilities (reasoning, multi-modality, web browsing, tool-use) while remaining conceptually simple for humans is not described. This is load-bearing for establishing the benchmark's validity and the reported performance numbers.
minor comments (1)
  1. [Abstract] The abstract states 'their answer' in reference to the 466 questions; this should be corrected to 'their answers' for clarity.
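
The first major comment turns on sample size: how much a 92% figure can be trusted depends on how many observations stand behind it. As a hedged illustration, the sketch below computes a Wilson score interval for an observed accuracy of about 92% at several sample sizes. The sample sizes are hypothetical (the source does not report the number of respondents or the unit of analysis), and treating each observation as an independent trial is a simplifying assumption.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical sample sizes; none of these is reported in the source.
for n in (25, 100, 466):
    successes = round(0.92 * n)
    low, high = wilson_interval(successes, n)
    print(f"n={n:4d}  observed={successes / n:.3f}  interval=({low:.3f}, {high:.3f})")
```

The interval narrows as the sample grows, which is why the missing participant count matters for reading 92% as 'average human' performance.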

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential value of GAIA as a benchmark focused on general robustness. We address each major comment below and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The performance disparity (92% human vs. 15% GPT-4+plugins) is central to the paper's claim of a notable gap and the departure from harder-than-human benchmarks. However, the abstract provides no details on the human respondent selection process, number of participants, demographics, or any validation of the 92% figure, which undermines the ability to interpret it as representative of 'average human' robustness as stated in the AGI posit.

    Authors: We agree that the abstract should concisely summarize the human evaluation methodology to support the reported performance gap and the reference to average human robustness. We will revise the abstract to include key details on the participant selection process, number of respondents, and validation approach used to obtain the 92% figure.

    Revision: yes

  2. Referee: [Benchmark construction description] The methodology for devising, selecting, and balancing the 466 questions to ensure they test the claimed set of fundamental abilities (reasoning, multi-modality, web browsing, tool-use) while remaining conceptually simple for humans is not described. This is load-bearing for establishing the benchmark's validity and the reported performance numbers.

    Authors: We acknowledge that a more detailed account of the benchmark construction process is needed to fully substantiate the design choices and performance claims. We will expand the relevant section of the manuscript with a step-by-step description of how the questions were devised, selected, and balanced across the targeted abilities, including the criteria used to maintain conceptual simplicity for humans.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct performance reporting

Full rationale

The paper introduces GAIA as a new benchmark consisting of 466 real-world questions and reports empirical accuracy scores (92% human vs. 15% GPT-4+plugins) without any mathematical derivation, parameter fitting, or predictive modeling. The central posit that AGI requires average-human robustness on these questions is presented as a philosophical stance rather than a derived result from equations or self-referential data. No self-citations, ansatzes, or uniqueness theorems are invoked to justify core claims, and the reported disparity is a direct measurement rather than a fitted or renamed input. The work is self-contained as an empirical proposal with no load-bearing steps that reduce to their own definitions or prior author outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review limited to abstract; no explicit free parameters, invented entities, or detailed axioms visible. The central claim rests on the assumption that the chosen questions adequately probe fundamental abilities required for general AI.

axioms (1)
  • domain assumption: The selected real-world questions require fundamental abilities such as reasoning, multi-modality handling, web browsing, and tool-use proficiency.
    Stated directly in the abstract as the basis for the benchmark design.

pith-pipeline@v0.9.0 · 5508 in / 1293 out tokens · 53150 ms · 2026-05-12T15:40:38.187830+00:00 · methodology

Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

    cs.SD 2026-05 unverdicted novelty 8.0

    Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

  2. OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

    cs.CL 2026-04 unverdicted novelty 8.0

    OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perf...

  3. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    cs.AI 2024-04 accept novelty 8.0

    OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

  4. Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

    cs.AI 2026-05 conditional novelty 7.0

    BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

  5. Learning Agentic Policy from Action Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

  6. Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems

    cs.CR 2026-05 unverdicted novelty 7.0

    Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.

  7. Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?

    cs.CL 2026-05 unverdicted novelty 7.0

    Goal clarifications lose nearly all value after 10% of execution while input clarifications retain value until roughly 50%, and asking any type past mid-trajectory hurts performance more than never asking.

  8. AcademiClaw: When Students Set Challenges for AI Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.

  9. NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

    cs.AI 2026-05 accept novelty 7.0

    NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...

  10. Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference

    cs.AI 2026-05 unverdicted novelty 7.0

    TokenArena is a continuous benchmark for AI inference endpoints that measures output speed, time to first token, blended price, effective context, quality, and modeled energy to produce composites of joules per correc...

  11. AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment

    cs.AI 2026-04 conditional novelty 7.0

    AgentPulse is a continuous multi-signal framework that scores AI agents on benchmark performance, adoption, sentiment and ecosystem health, showing these factors are complementary and that benchmark-plus-sentiment pre...

  12. AEL: Agent Evolving Learning for Open-Ended Environments

    cs.CL 2026-04 conditional novelty 7.0

    AEL uses a fast-timescale bandit for memory policy selection and slow-timescale LLM reflection for causal insights, achieving a Sharpe ratio of 2.13 on a 208-episode portfolio benchmark while showing that added mechan...

  13. Waking Up Blind: Cold-Start Optimization of Supervision-Free Agentic Trajectories for Grounded Visual Perception

    cs.AI 2026-04 unverdicted novelty 7.0

    SPECTRA enables supervision-free bootstrapping of agentic capabilities in SVLMs via cascaded tool rollout alignment, multi-objective rewards, and the TIU metric, yielding up to 5% higher task accuracy and 9% better to...

  14. KAIJU: An Executive Kernel for Intent-Gated Execution of LLM Agents

    cs.SE 2026-03 accept novelty 7.0

    KAIJU decouples LLM reasoning from execution using a specialized kernel and Intent-Gated Execution to enable parallel tool scheduling and robust security.

  15. gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy

    gr-qc 2026-05 unverdicted novelty 6.0

    LLM coding agents cannot reach the 10^{-4} relative accuracy required for gravitational wave modeling tasks and show systematic failures including metric misuse and result fabrication.

  16. EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...

  17. Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents

    cs.MA 2026-05 unverdicted novelty 6.0

    Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.

  18. CL-bench Life: Can Language Models Learn from Real-Life Context?

    cs.CL 2026-04 unverdicted novelty 6.0

    CL-bench Life shows frontier language models achieve only 13.8% average success on real-life context tasks, with the best model at 19.3%.

  19. AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum

    cs.AI 2026-04 unverdicted novelty 6.0

    AIT Academy introduces a tripartite curriculum for AI agents across natural science, humanities, and social science domains, with reported gains of 15.9 points in security and 7 points in social reasoning under specif...

  20. CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation

    cs.AI 2026-04 unverdicted novelty 6.0

    CADMAS-CTX replaces static skill profiles with context-conditioned Beta posteriors and uncertainty-penalized routing, yielding higher accuracy on GAIA (0.442) and SWE-bench (31.4%) than static baselines.

  21. Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...

  22. TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving

    cs.CL 2026-04 unverdicted novelty 6.0

    TEC is a new public dataset of detailed human trial-and-error trajectories and reflections on web tasks, with humans showing substantially higher accuracy than LLMs.

  23. Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning

    cs.PF 2026-04 unverdicted novelty 6.0

    PTE is a hardware-aware metric that better predicts actual inference latency in tool-integrated reasoning than token counts and reveals that high-PTE trajectories often have lower correctness.

  24. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    cs.CL 2024-10 unverdicted novelty 6.0

    OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.

  25. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

    cs.LG 2024-10 accept novelty 6.0

    AgentHarm benchmark shows leading LLMs comply with malicious agent requests and simple jailbreaks enable coherent harmful multi-step execution while retaining capabilities.

  26. From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

    cs.AI 2026-05 conditional novelty 5.0

    Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.

  27. Toward a Science of Intent: Closure Gaps and Delegation Envelopes for Open-World AI Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    Intent compilation turns vague human goals into verifiable artifacts, using closure-gap vectors and delegation envelopes to separate open-world agent challenges from closed-world solvers and to benchmark closure fixes...

  28. MTRouter: Cost-Aware Multi-Turn LLM Routing with History-Model Joint Embeddings

    cs.CL 2026-04 unverdicted novelty 5.0

    MTRouter learns turn-level model utility predictors from logged trajectories using history-model joint embeddings, delivering 58.7% cost reduction on ScienceWorld and 43.4% on HLE while matching or exceeding GPT-5 per...

  29. Mind DeepResearch Technical Report

    cs.AI 2026-04 unverdicted novelty 5.0

    MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.

  30. Agentic Performance at the Edge: Insights from Benchmarking

    cs.AI 2026-05 unverdicted novelty 4.0

    Edge agentic AI quality is not a simple function of model size; robust results require joint design of model selection and tool integration, as revealed by domain-conditioned benchmarks showing accuracy-latency Pareto fronts.

  31. Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

    cs.CL 2026-05 unverdicted novelty 4.0

    This survey organizes RL for LLM multi-agent systems into reward families, credit units, and five orchestration sub-decisions, notes the absence of explicit stopping-decision training in its paper pool, and releases a...

Reference graph

Works this paper leans on

200 extracted references · 200 canonical work pages · cited by 31 Pith papers · 22 internal anchors

  1. [1]

    2023 , eprint=

    Levels of AGI: Operationalizing Progress on the Path to AGI , author=. 2023 , eprint=

  2. [2]

    2023 , eprint=

    On the Tool Manipulation Capability of Open-source Large Language Models , author=. 2023 , eprint=

  3. [3]

    2023 , eprint=

    API-Bank: A Benchmark for Tool-Augmented LLMs , author=. 2023 , eprint=

  4. [4]

    2023 , journal =

    AgentBench: Evaluating LLMs as Agents , author =. 2023 , journal =

  5. [5]

    and Friedman, Batya

    Bender, Emily M. and Friedman, Batya. Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. Transactions of the Association for Computational Linguistics. 2018

  6. [7]

    2019 , eprint=

    On the Measure of Intelligence , author=. 2019 , eprint=

  7. [8]

    2022 , eprint=

    Training language models to follow instructions with human feedback , author=. 2022 , eprint=

  8. [9]

    2023 , eprint=

    Efficient Benchmarking (of Language Models) , author=. 2023 , eprint=

  9. [10]

    2023 , eprint=

    Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models , author=. 2023 , eprint=

  10. [12]

    Pricing via Processing or Combatting Junk Mail

    Dwork, Cynthia and Naor, Moni. Pricing via Processing or Combatting Junk Mail. Advances in Cryptology --- CRYPTO' 92. 1993

  11. [13]

    AUTOMATION, PARTIAL AND FULL , volume=

    Growiec, Jakub , year=. AUTOMATION, PARTIAL AND FULL , volume=. Macroeconomic Dynamics , publisher=

  12. [14]

    2023 , eprint=

    Making Large Language Models Better Reasoners with Step-Aware Verifier , author=. 2023 , eprint=

  13. [18]

    CodeBERT: A Pre-Trained Model for Programming and Natural Languages

    Codebert: A pre-trained model for programming and natural languages , author=. arXiv preprint arXiv:2002.08155 , year=

  14. [19]

    The Journal of Machine Learning Research , volume=

    Exploring the limits of transfer learning with a unified text-to-text transformer , author=. The Journal of Machine Learning Research , volume=. 2020 , publisher=

  15. [20]

    2023 , eprint=

    Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification , author=. 2023 , eprint=

  16. [21]

    2023 , url =

    Model Card and Evaluations for Claude Models , author =. 2023 , url =

  17. [22]

    Advances in Neural Information Processing Systems , editor=

    Flamingo: a Visual Language Model for Few-Shot Learning , author=. Advances in Neural Information Processing Systems , editor=. 2022 , url=

  18. [23]

    2023 , eprint=

    How is ChatGPT's behavior changing over time? , author=. 2023 , eprint=

  19. [24]

    2023 , eprint=

    Capabilities of GPT-4 on Medical Challenge Problems , author=. 2023 , eprint=

  20. [26]

    2023 , eprint=

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena , author=. 2023 , eprint=

  21. [27]

    2023 , eprint=

    Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=

  22. [28]

    2023 , eprint=

    GPT-4 Technical Report , author=. 2023 , eprint=

  23. [29]

    2020 , eprint=

    Language Models are Few-Shot Learners , author=. 2020 , eprint=

  24. [30]

    2021 , eprint=

    Measuring Massive Multitask Language Understanding , author=. 2021 , eprint=

  25. [31]

    2022 , eprint=

    Holistic Evaluation of Language Models , author=. 2022 , eprint=

  26. [32]

    2023 , eprint=

    LIMA: Less Is More for Alignment , author=. 2023 , eprint=

  27. [33]

    2023 , eprint=

    PaLM 2 Technical Report , author=. 2023 , eprint=

  28. [34]

    2023 , eprint=

    Augmented Language Models: a Survey , author=. 2023 , eprint=

  29. [35]

    2023 , eprint=

    LLaMA: Open and Efficient Foundation Language Models , author=. 2023 , eprint=

  30. [36]

    2021 , eprint=

    Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=

  31. [37]

    2019 , eprint=

    HellaSwag: Can a Machine Really Finish Your Sentence? , author=. 2019 , eprint=

  32. [38]

    2019 , eprint=

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale , author=. 2019 , eprint=

  33. [39]

    2021 , eprint=

    Evaluating Large Language Models Trained on Code , author=. 2021 , eprint=

  34. [40]

    2022 , eprint=

    PaLM: Scaling Language Modeling with Pathways , author=. 2022 , eprint=

  35. [41]

    2023 , eprint=

    Progressive-Hint Prompting Improves Reasoning in Large Language Models , author=. 2023 , eprint=

  36. [42]

    2023 , eprint=

    Self-Consistency Improves Chain of Thought Reasoning in Language Models , author=. 2023 , eprint=

  37. [43]

    2018 , eprint=

    Know What You Don't Know: Unanswerable Questions for SQuAD , author=. 2018 , eprint=

  38. [44]

    Semantic

    Microsoft , year =. Semantic

  39. [45]

    2023 , month = sep, urldate =

  40. [46]

    ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

    Chan, Chi-Min and Chen, Weize and Su, Yusheng and Yu, Jianxuan and Xue, Wei and Zhang, Shanghang and Fu, Jie and Liu, Zhiyuan , year =. doi:10.48550/arXiv.2308.07201 , urldate =. 2308.07201 , publisher =

  41. [47]

    Chase, Harrison , year =

  42. [48]

    Dong, Xin Luna and Moon, Seungwhan and Xu, Yifan Ethan and Malik, Kshitiz and Yu, Zhou , year =. Towards. Proceedings of the 29th. doi:10.1145/3580305.3599572 , urldate =

  43. [49]

    doi:10.48550/arXiv.2306.08640 , urldate =

    Gao, Difei and Ji, Lei and Zhou, Luowei and Lin, Kevin Qinghong and Chen, Joya and Fan, Zihan and Shou, Mike Zheng , year =. doi:10.48550/arXiv.2306.08640 , urldate =. 2306.08640 , publisher =

  44. [50]

    2023 , eprint=

    OpenAGI: When LLM Meets Domain Experts , author=. 2023 , eprint=

  45. [51]

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    Hong, Sirui and Zheng, Xiawu and Chen, Jonathan and Cheng, Yuheng and Wang, Jinlin and Zhang, Ceyao and Wang, Zili and Yau, Steven Ka Shing and Lin, Zijuan and Zhou, Liyang and Ran, Chenyu and Xiao, Lingfeng and Wu, Chenglin , year =. doi:10.48550/arXiv.2308.00352 , urldate =. 2308.00352 , publisher =

  46. [52]

    doi:10.48550/arXiv.2304.08244 , urldate =

    Li, Minghao and Song, Feifan and Yu, Bowen and Yu, Haiyang and Li, Zhoujun and Huang, Fei and Li, Yongbin , year =. doi:10.48550/arXiv.2304.08244 , urldate =. 2304.08244 , publisher =

  47. [53]

    CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

    Li, Guohao and Hammoud, Hasan Abed Al Kader and Itani, Hani and Khizbullin, Dmitrii and Ghanem, Bernard , year =. doi:10.48550/arXiv.2303.17760 , urldate =. 2303.17760 , publisher =

  48. [54]

    AgentSims: An Open-Source Sandbox for Large Language Model Evaluation

    Lin, Jiaju and Zhao, Haoran and Zhang, Aochi and Wu, Yiting and Ping, Huqiuyue and Chen, Qin , year =. doi:10.48550/arXiv.2308.04026 , urldate =. 2308.04026 , publisher =

  49. [55]

    AgentBench: Evaluating LLMs as Agents

    Liu, Xiao and Yu, Hao and Zhang, Hanchen and Xu, Yifan and Lei, Xuanyu and Lai, Hanyu and Gu, Yu and Ding, Hangliang and Men, Kaiwen and Yang, Kejuan and Zhang, Shudan and Deng, Xiang and Zeng, Aohan and Du, Zhengxiao and Zhang, Chenhui and Shen, Sheng and Zhang, Tianjun and Su, Yu and Sun, Huan and Huang, Minlie and Dong, Yuxiao and Tang, Jie , year =. d...

  50. [56]

    doi:10.48550/arXiv.2308.05960 , urldate =

    Liu, Zhiwei and Yao, Weiran and Zhang, Jianguo and Xue, Le and Heinecke, Shelby and Murthy, Rithesh and Feng, Yihao and Chen, Zeyuan and Niebles, Juan Carlos and Arpit, Devansh and Xu, Ran and Mui, Phil and Wang, Huan and Xiong, Caiming and Savarese, Silvio , year =. doi:10.48550/arXiv.2308.05960 , urldate =. 2308.05960 , publisher =

  51. [57]

    Nakajima, Yohei , year =

  52. [58]

    Task-Driven

    Nakajima, Yohei , year =. Task-Driven

  53. [59]

    WebGPT: Browser-assisted question-answering with human feedback

    Nakano, Reiichiro and Hilton, Jacob and Balaji, Suchir and Wu, Jeff and Ouyang, Long and Kim, Christina and Hesse, Christopher and Jain, Shantanu and Kosaraju, Vineet and Saunders, William and Jiang, Xu and Cobbe, Karl and Eloundou, Tyna and Krueger, Gretchen and Button, Kevin and Knight, Matthew and Chess, Benjamin and Schulman, John , year =. doi:10.485...

  54. [60]

    Osika, Anton , year =

  55. [61]

    Gorilla: Large Language Model Connected with Massive APIs

    Patil, Shishir G. and Zhang, Tianjun and Wang, Xin and Gonzalez, Joseph E. , year =. Gorilla:. doi:10.48550/arXiv.2305.15334 , urldate =. 2305.15334 , publisher =

  56. [62]

    Rush, Sasha , year =

  57. [63]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Schick, Timo and. Toolformer:. 2023 , month = feb, number =. doi:10.48550/arXiv.2302.04761 , urldate =. 2302.04761 , publisher =

  58. [64]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    Shen, Yongliang and Song, Kaitao and Tan, Xu and Li, Dongsheng and Lu, Weiming and Zhuang, Yueting , year =. doi:10.48550/arXiv.2303.17580 , urldate =. 2303.17580 , publisher =

  59. [65]

    arXiv preprint arXiv:2303.08128 , year=

    Sur. 2023 , month = mar, number =. doi:10.48550/arXiv.2303.08128 , urldate =. 2303.08128 , publisher =

  60. [66]

    Talebirad, Yashar and Nadiri, Amirhossein , year =. Multi-. doi:10.48550/arXiv.2306.03314 , urldate =. 2306.03314 , publisher =

  61. [67]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Wu, Qingyun and Bansal, Gagan and Zhang, Jieyu and Wu, Yiran and Zhang, Shaokun and Zhu, Erkang and Li, Beibin and Jiang, Li and Zhang, Xiaoyun and Wang, Chi , year =. doi:10.48550/arXiv.2308.08155 , urldate =. 2308.08155 , publisher =

  62. [68]

    Wu, Chenfei and Yin, Shengming and Qi, Weizhen and Wang, Xiaodong and Tang, Zecheng and Duan, Nan , year =. Visual. doi:10.48550/arXiv.2303.04671 , urldate =. 2303.04671 , publisher =

  63. [69]

    Gentopia: A collaborative platform for tool-augmented llms,

    Xu, Binfeng and Liu, Xukun and Shen, Hua and Han, Zeyu and Li, Yuhan and Yue, Murong and Peng, Zhiyuan and Liu, Yuchen and Yao, Ziyu and Xu, Dongkuan , year =. Gentopia:. doi:10.48550/arXiv.2308.04030 , urldate =. 2308.04030 , publisher =

  64. [70]

    Yang, Hui and Yue, Sifu and He, Yunzhong , year =. Auto-. doi:10.48550/arXiv.2306.02224 , urldate =. 2306.02224 , publisher =

  65. [71]

    So- cratic models: Composing zero-shot multimodal reasoning with language

    Zeng, Andy and Attarian, Maria and Ichter, Brian and Choromanski, Krzysztof and Wong, Adrian and Welker, Stefan and Tombari, Federico and Purohit, Aveek and Ryoo, Michael and Sindhwani, Vikas and Lee, Johnny and Vanhoucke, Vincent and Florence, Pete , year =. Socratic. doi:10.48550/arXiv.2204.00598 , urldate =. 2204.00598 , publisher =

  66. [72]

    ToolQA: A dataset for LLM question answering with external tools

    Zhuang, Yuchen and Yu, Yue and Wang, Kuan and Sun, Haotian and Zhang, Chao. doi:10.48550/arXiv.2306.13304. arXiv:2306.13304

  67. [73]

    Alex, Neel and Lifland, Eli and Tunstall, Lewis and Thakur, Abhishek and Maham, Pegah and Riedel, C. Jess and Hine, Emmie and Ashurst, Carolyn and Sedille, Paul and Carlier, Alexis and Noetel, Michael and Stuhlm. Proceedings of the 35th. January 2022. doi:10.48550/arXiv.2109.14076

  68. [74]

    Austin, Jacob and Odena, Augustus and Nye, Maxwell and Bosma, Maarten and Michalewski, Henryk and Dohan, David and Jiang, Ellen and Cai, Carrie and Terry, Michael and Le, Quoc and Sutton, Charles. Program. doi:10.48550/arXiv.2108.07732. arXiv:2108.07732

  69. [75]

    MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

    Bajaj, Payal and Campos, Daniel and Craswell, Nick and Deng, Li and Gao, Jianfeng and Liu, Xiaodong and Majumder, Rangan and McNamara, Andrew and Mitra, Bhaskar and Nguyen, Tri and Rosenberg, Mir and Song, Xia and Stoica, Alina and Tiwary, Saurabh and Wang, Tong. doi:10.48550/arXiv.1611.09268. arXiv:1611.09268

  70. [76]

    Semantic

    Berant, Jonathan and Chou, Andrew and Frostig, Roy and Liang, Percy. Semantic. Proceedings of the 2013

  71. [77]

    Beeching, Edward and Fourrier, Clémentine and Habib, Nathan and Han, Sheon and Lambert, Nathan and Rajani, Nazneen and Sanseviero, Omar and Tunstall, Lewis and Wolf, Thomas. 2023

  72. [78]

    Bisk, Yonatan and Zellers, Rowan and Bras, Ronan Le and Gao, Jianfeng and Choi, Yejin. The

  73. [79]

    Borkan, Daniel and Dixon, Lucas and Sorensen, Jeffrey and Thain, Nithum and Vasserman, Lucy. Nuanced. Companion. doi:10.1145/3308560.3317593

  74. [80]

    Language

    Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and... Language. Advances in. 2020

  75. [81]

    Buchanan, Ben and Lohn, Andrew and Musser, Micah and Sedova, Katerina. Truth,

  76. [82]

    Evaluating Large Language Models Trained on Code

    Chen, Mark and Tworek, Jerry and Jun, Heewoo and Yuan, Qiming and Pinto, Henrique Ponde de Oliveira and Kaplan, Jared and Edwards, Harri and Burda, Yuri and Joseph, Nicholas and Brockman, Greg and Ray, Alex and Puri, Raul and Krueger, Gretchen and Petrov, Michael and Khlaaf, Heidy and Sastry, Girish and Mishkin, Pamela and Chan, Brooke and Gray, Scott and...

  77. [83]

    Proceedings of the 2018

    Choi, Eunsol and He, He and Iyyer, Mohit and Yatskar, Mark and Yih, Wen-tau and Choi, Yejin and Liang, Percy and Zettlemoyer, Luke. Proceedings of the 2018. doi:10.18653/v1/D18-1241

  78. [84]

    PaLM: Scaling Language Modeling with Pathways

    Chowdhery, Aakanksha and Narang, Sharan and Devlin, Jacob and Bosma, Maarten and Mishra, Gaurav and Roberts, Adam and Barham, Paul and Chung, Hyung Won and Sutton, Charles and Gehrmann, Sebastian and Schuh, Parker and Shi, Kensen and Tsvyashchenko, Sasha and Maynez, Joshua and Rao, Abhishek and Barnes, Parker and Tay, Yi and Shazeer, Noam and Prabhakaran,...

  79. [85]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Clark, Peter and Cowhey, Isaac and Etzioni, Oren and Khot, Tushar and Sabharwal, Ashish and Schoenick, Carissa and Tafjord, Oyvind. doi:10.48550/arXiv.1803.05457. arXiv:1803.05457

  80. [86]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    Clark, Christopher and Lee, Kenton and Chang, Ming-Wei and Kwiatkowski, Tom and Collins, Michael and Toutanova, Kristina. Proceedings of the 2019. doi:10.18653/v1/N19-1300

Showing first 80 references.