GAIA: a benchmark for General AI Assistants
Pith reviewed 2026-05-12 15:40 UTC · model grok-4.3
The pith
On GAIA's general-assistant questions, humans reach 92 percent accuracy while GPT-4 with plugins reaches only 15 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GAIA proposes real-world questions that require a set of fundamental abilities: reasoning, multi-modality handling, web browsing, and general tool-use proficiency. Human respondents obtain 92 percent accuracy versus 15 percent for GPT-4 equipped with plugins. This disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills. The authors posit that the advent of AGI hinges on a system's ability to show the same robustness as the average human on such questions. Using GAIA's methodology, they created 466 questions with answers; the questions are released, and the answers to 300 of them are withheld to power a leaderboard.
What carries the argument
The GAIA benchmark itself: a set of 466 real-world questions that are conceptually simple for humans but demand combined reasoning, multi-modal processing, browsing, and tool-use skills to answer correctly.
Load-bearing premise
That matching the robustness of an average human on these everyday questions is the key requirement for achieving AGI.
What would settle it
An AI system reaching near 90 percent accuracy on the GAIA set through narrow memorization or task-specific tuning, yet still failing on new but similar real-world tasks outside the benchmark, would show that solving GAIA does not indicate general intelligence.
Original abstract
We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions. Using GAIA's methodology, we devise 466 questions and their answer. We release our questions while retaining answers to 300 of them to power a leader-board available at https://huggingface.co/gaia-benchmark.
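The question split described in the abstract can be made concrete with a little arithmetic and a scoring sketch. The snippet below is an illustration only: the normalized exact-match rule, the function names, and the toy records are assumptions introduced here, not the scoring procedure of the GAIA leaderboard, which the abstract does not describe. The count of 166 follows from the abstract's figures (466 questions, answers to 300 withheld).

```python
# Illustrative sketch only. The abstract gives the counts (466 questions,
# answers to 300 withheld) but not the scoring rule. Normalized exact match
# and every name below are assumptions made for this example.

TOTAL_QUESTIONS = 466
WITHHELD_ANSWERS = 300
PUBLIC_ANSWERS = TOTAL_QUESTIONS - WITHHELD_ANSWERS  # questions whose answers are not withheld


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of gold questions answered correctly; a missing prediction counts as wrong."""
    if not gold:
        raise ValueError("gold answer set is empty")
    correct = sum(
        1
        for qid, answer in gold.items()
        if normalize(predictions.get(qid, "")) == normalize(answer)
    )
    return correct / len(gold)


# Toy records (hypothetical, not GAIA data).
gold = {"q1": "Paris", "q2": "42", "q3": "blue whale"}
predictions = {"q1": " paris ", "q2": "41"}

print(PUBLIC_ANSWERS)
print(accuracy(predictions, gold))
```

Counting an unanswered question as wrong, rather than dropping it from the denominator, keeps scores comparable between systems that attempt different numbers of questions.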
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GAIA, a benchmark for General AI Assistants consisting of 466 real-world questions that require fundamental abilities including reasoning, multi-modality handling, web browsing, and tool-use proficiency. It reports human performance at 92% accuracy versus 15% for GPT-4 equipped with plugins, contrasting this with trends where LLMs outperform humans on professional tasks. The authors argue that AGI progress depends on achieving robustness comparable to the average human on such questions. They release the questions while withholding the answers to 300 of them for a leaderboard.
Significance. If the reported human-AI performance gap is substantiated, GAIA offers a significant contribution by providing a benchmark that targets general robustness rather than specialized or superhuman capabilities. This could help steer AI research towards more practical, real-world assistant systems. The authors deserve credit for making the questions publicly available and setting up a leaderboard to enable ongoing assessment and reproducibility.
major comments (2)
- [Abstract] The performance disparity (92% human vs. 15% GPT-4+plugins) is central to the paper's claim of a notable gap and the departure from harder-than-human benchmarks. However, the abstract provides no details on the human respondent selection process, number of participants, demographics, or any validation of the 92% figure, which undermines the ability to interpret it as representative of 'average human' robustness as stated in the AGI posit.
- [Benchmark construction description] The methodology for devising, selecting, and balancing the 466 questions to ensure they test the claimed set of fundamental abilities (reasoning, multi-modality, web browsing, tool-use) while remaining conceptually simple for humans is not described. This is load-bearing for establishing the benchmark's validity and the reported performance numbers.
minor comments (1)
- [Abstract] The abstract states 'their answer' in reference to the 466 questions; this should be corrected to 'their answers' for clarity.
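The first major comment turns on sample size: without the number of respondents or judged answers, the uncertainty around the 92% figure cannot be assessed. As a rough illustration, the sketch below computes an approximate 95% Wilson score interval for an observed proportion near 0.92 at several sample sizes. The sample sizes are hypothetical, chosen only to show how the interval narrows as the sample grows; the abstract does not report the actual counts.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical sample sizes; the abstract does not report the real ones.
for n in (25, 100, 466):
    successes = round(0.92 * n)
    low, high = wilson_interval(successes, n)
    print(f"n={n:4d}  observed={successes / n:.3f}  95% CI=({low:.3f}, {high:.3f})")
```

The Wilson interval is used here instead of the simpler normal approximation because it behaves better when the proportion is close to 1, as it is in this case.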
Simulated Author's Rebuttal
We thank the referee for their constructive review and for recognizing the potential value of GAIA as a benchmark focused on general robustness. We address each major comment below and indicate the revisions we will make to the manuscript.
Point-by-point responses
-
Referee: [Abstract] The performance disparity (92% human vs. 15% GPT-4+plugins) is central to the paper's claim of a notable gap and the departure from harder-than-human benchmarks. However, the abstract provides no details on the human respondent selection process, number of participants, demographics, or any validation of the 92% figure, which undermines the ability to interpret it as representative of 'average human' robustness as stated in the AGI posit.
Authors: We agree that the abstract should concisely summarize the human evaluation methodology to support the reported performance gap and the reference to average human robustness. We will revise the abstract to include key details on the participant selection process, number of respondents, and validation approach used to obtain the 92% figure.
Revision: yes
-
Referee: [Benchmark construction description] The methodology for devising, selecting, and balancing the 466 questions to ensure they test the claimed set of fundamental abilities (reasoning, multi-modality, web browsing, tool-use) while remaining conceptually simple for humans is not described. This is load-bearing for establishing the benchmark's validity and the reported performance numbers.
Authors: We acknowledge that a more detailed account of the benchmark construction process is needed to fully substantiate the design choices and performance claims. We will expand the relevant section of the manuscript with a step-by-step description of how the questions were devised, selected, and balanced across the targeted abilities, including the criteria used to maintain conceptual simplicity for humans.
Revision: yes
Circularity Check
No circularity: empirical benchmark with direct performance reporting
full rationale
The paper introduces GAIA as a new benchmark consisting of 466 real-world questions and reports empirical accuracy scores (92% human vs. 15% GPT-4+plugins) without any mathematical derivation, parameter fitting, or predictive modeling. The central posit that AGI requires average-human robustness on these questions is presented as a philosophical stance rather than a derived result from equations or self-referential data. No self-citations, ansatzes, or uniqueness theorems are invoked to justify core claims, and the reported disparity is a direct measurement rather than a fitted or renamed input. The work is self-contained as an empirical proposal with no load-bearing steps that reduce to their own definitions or prior author outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The selected real-world questions require fundamental abilities such as reasoning, multi-modality handling, web browsing, and tool-use proficiency.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.RealityFromDistinction · reality_from_one_distinction · unclear
GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. ... human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills.
-
IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear
We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions.
Forward citations
Cited by 31 Pith papers
-
Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search
Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.
-
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perf...
-
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
-
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
-
Learning Agentic Policy from Action Guidance
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
-
Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems
Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.
-
Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?
Goal clarifications lose nearly all value after 10% of execution while input clarifications retain value until roughly 50%, and asking any type past mid-trajectory hurts performance more than never asking.
-
AcademiClaw: When Students Set Challenges for AI Agents
AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.
-
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...
-
Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference
TokenArena is a continuous benchmark for AI inference endpoints that measures output speed, time to first token, blended price, effective context, quality, and modeled energy to produce composites of joules per correc...
-
AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment
AgentPulse is a continuous multi-signal framework that scores AI agents on benchmark performance, adoption, sentiment and ecosystem health, showing these factors are complementary and that benchmark-plus-sentiment pre...
-
AEL: Agent Evolving Learning for Open-Ended Environments
AEL uses a fast-timescale bandit for memory policy selection and slow-timescale LLM reflection for causal insights, achieving a Sharpe ratio of 2.13 on a 208-episode portfolio benchmark while showing that added mechan...
-
Waking Up Blind: Cold-Start Optimization of Supervision-Free Agentic Trajectories for Grounded Visual Perception
SPECTRA enables supervision-free bootstrapping of agentic capabilities in SVLMs via cascaded tool rollout alignment, multi-objective rewards, and the TIU metric, yielding up to 5% higher task accuracy and 9% better to...
-
KAIJU: An Executive Kernel for Intent-Gated Execution of LLM Agents
KAIJU decouples LLM reasoning from execution using a specialized kernel and Intent-Gated Execution to enable parallel tool scheduling and robust security.
-
gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy
LLM coding agents cannot reach the 10^{-4} relative accuracy required for gravitational wave modeling tasks and show systematic failures including metric misuse and result fabrication.
-
EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems
EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...
-
Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents
Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.
-
CL-bench Life: Can Language Models Learn from Real-Life Context?
CL-bench Life shows frontier language models achieve only 13.8% average success on real-life context tasks, with the best model at 19.3%.
-
AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum
AIT Academy introduces a tripartite curriculum for AI agents across natural science, humanities, and social science domains, with reported gains of 15.9 points in security and 7 points in social reasoning under specif...
-
CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation
CADMAS-CTX replaces static skill profiles with context-conditioned Beta posteriors and uncertainty-penalized routing, yielding higher accuracy on GAIA (0.442) and SWE-bench (31.4%) than static baselines.
-
Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...
-
TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving
TEC is a new public dataset of detailed human trial-and-error trajectories and reflections on web tasks, with humans showing substantially higher accuracy than LLMs.
-
Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning
PTE is a hardware-aware metric that better predicts actual inference latency in tool-integrated reasoning than token counts and reveals that high-PTE trajectories often have lower correctness.
-
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
-
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
AgentHarm benchmark shows leading LLMs comply with malicious agent requests and simple jailbreaks enable coherent harmful multi-step execution while retaining capabilities.
-
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
-
Toward a Science of Intent: Closure Gaps and Delegation Envelopes for Open-World AI Agents
Intent compilation turns vague human goals into verifiable artifacts, using closure-gap vectors and delegation envelopes to separate open-world agent challenges from closed-world solvers and to benchmark closure fixes...
-
MTRouter: Cost-Aware Multi-Turn LLM Routing with History-Model Joint Embeddings
MTRouter learns turn-level model utility predictors from logged trajectories using history-model joint embeddings, delivering 58.7% cost reduction on ScienceWorld and 43.4% on HLE while matching or exceeding GPT-5 per...
-
Mind DeepResearch Technical Report
MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.
-
Agentic Performance at the Edge: Insights from Benchmarking
Edge agentic AI quality is not a simple function of model size; robust results require joint design of model selection and tool integration, as revealed by domain-conditioned benchmarks showing accuracy-latency Pareto fronts.
-
Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces
This survey organizes RL for LLM multi-agent systems into reward families, credit units, and five orchestration sub-decisions, notes the absence of explicit stopping-decision training in its paper pool, and releases a...