pith. machine review for the scientific record.

arxiv: 2604.20148 · v1 · submitted 2026-04-22 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:45 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords tool adaptation · few-shot learning · hypernetworks · LoRA · small language models · prompt engineering · negative result · tool use

The pith

Hypernetworks for adapting small language models to tools provide no benefit over few-shot prompting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether a hypernetwork that generates adaptation weights can help small language models use tools effectively. Through experiments on multiple benchmarks, it demonstrates that this approach yields no improvement over carefully designed few-shot prompts and documentation: few-shot examples account for the majority of the performance gains, while the hypernetwork contributes nothing. The finding matters because it indicates that simple prompting strategies can enable small models to perform well on tool-use tasks without large additional networks, and that a 3-billion-parameter model can reach a substantial fraction of the performance of much larger systems at significantly reduced latency.
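To make the winning recipe concrete, here is a minimal sketch of a few-shot tool-use prompt builder of the kind the paper credits: retrieve the k training examples most similar to the query, prepend the tool documentation, and format everything into a single prompt. The names here (Example, build_prompt, the embed callable) are illustrative assumptions, not the paper's actual code.

```python
from dataclasses import dataclass

@dataclass
class Example:
    query: str
    tool_call: str  # gold tool invocation, e.g. 'search_api(q="...")' (hypothetical)

def cosine(u, v):
    # Plain cosine similarity over two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv + 1e-9)

def build_prompt(query, pool, tool_docs, embed, k=5):
    """Pick the k pool examples most similar to the query, then format
    documentation + examples + query into one prompt string."""
    q_vec = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(embed(ex.query), q_vec), reverse=True)
    shots = "\n\n".join(f"Q: {ex.query}\nCall: {ex.tool_call}" for ex in ranked[:k])
    return (f"Tool documentation:\n{tool_docs}\n\n"
            f"Examples:\n{shots}\n\n"
            f"Q: {query}\nCall:")
```

Under the paper's ablations, it is this retrieval-plus-formatting step, not any learned adaptation, that carries the reported +21.5%.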

Core claim

Using a Llama-3.2-3B-Instruct backbone, the study compares four adaptation mechanisms across four benchmarks and finds that the 227.8M-parameter hypernetwork generates non-trivial weights but adds zero measurable performance improvement over few-shot prompting alone. Ablations quantify the contributions as +21.5% from few-shot examples, +5.0% from documentation, and 0% from the hypernetwork. With well-designed prompts, the 3B model attains 79.7% of GPT-5's average performance at 10 times lower latency. Analysis of 722 failures indicates that at 5-shot, errors are mostly semantic on schema-heavy tasks and format-related on others.

What carries the argument

Hypernetwork-generated LoRA weights for task-specific adaptation of the base small language model, evaluated against few-shot prompting and documentation encoding baselines.
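For readers unfamiliar with the mechanism under test, the sketch below shows one conventional way a hypernetwork can emit LoRA factors for a single frozen linear layer: a trunk maps a task embedding to low-rank matrices A and B, which are added to the base weight at inference. Shapes, dimensions, and the conditioning scheme are assumptions for illustration; the paper's 227.8M-parameter design is not reproduced here.

```python
import torch
import torch.nn as nn

class LoRAHyperNet(nn.Module):
    """Maps a task embedding to LoRA factors (A, B) for one frozen linear layer."""
    def __init__(self, task_dim=512, d_in=3072, d_out=3072, rank=16, hidden=1024):
        super().__init__()
        self.rank, self.d_in, self.d_out = rank, d_in, d_out
        self.trunk = nn.Sequential(nn.Linear(task_dim, hidden), nn.ReLU())
        self.to_A = nn.Linear(hidden, rank * d_in)   # emits A with shape (rank, d_in)
        self.to_B = nn.Linear(hidden, d_out * rank)  # emits B with shape (d_out, rank)

    def forward(self, task_emb):
        h = self.trunk(task_emb)
        A = self.to_A(h).view(self.rank, self.d_in)
        B = self.to_B(h).view(self.d_out, self.rank)
        return A, B

def adapted_forward(x, W, A, B, scale=1.0):
    # Frozen base weight W of shape (d_out, d_in) plus the generated low-rank update.
    return x @ (W + scale * (B @ A)).T
```

The paper's finding is that plugging such generated (A, B) pairs into the 3B backbone moves the benchmark numbers by nothing that the few-shot prompt had not already delivered.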

Load-bearing premise

That the few-shot prompting baseline was implemented at its full potential and the hypernetwork training regime was sufficient to produce useful adaptations.

What would settle it

Training the hypernetwork on additional tool-use data or with improved optimization and then measuring whether it surpasses the few-shot baseline on the same benchmarks would test the claim.

read the original abstract

Can small language models achieve strong tool-use performance without complex adaptation mechanisms? This paper investigates this question through Meta-Tool, a controlled empirical study comparing hypernetwork-based LoRA adaptation against carefully designed few-shot prompting. Using a Llama-3.2-3B-Instruct backbone, we evaluate four adaptation mechanisms (few-shot prompting, documentation encoding, hypernetwork-generated LoRA weights, and value-guided beam search) across four diverse benchmarks: Gorilla APIBench, Spider 2.0, WebArena, and InterCode. Our central finding is a well-supported negative result: despite generating non-trivial weight matrices, the 227.8M-parameter hypernetwork provides no measurable improvement over few-shot prompting alone. Comprehensive ablation studies reveal that few-shot examples contribute +21.5% to performance and documentation contributes +5.0%, while the hypernetwork adds 0%. A 3B model with well-designed prompts achieves 79.7% of GPT-5's average performance at 10× lower latency. Error analysis across 722 failure cases spanning all shot counts (0-5) shows that at the 5-shot configuration (106 failures), failure modes are task-dependent: schema-heavy tasks (Spider 2.0, WebArena) show near-zero format errors with remaining failures semantic, while format errors dominate on Gorilla (100%) and InterCode (70%). These findings redirect practitioners toward prompt engineering and example curation rather than complex adaptation architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that a 227.8M-parameter hypernetwork generating LoRA weights for a Llama-3.2-3B-Instruct model provides no measurable improvement over carefully designed few-shot prompting on tool-use benchmarks (Gorilla APIBench, Spider 2.0, WebArena, InterCode). Ablations attribute +21.5% performance to few-shot examples and +5.0% to documentation, with the hypernetwork contributing 0%; a 3B model reaches 79.7% of GPT-5 average performance at 10x lower latency. Error analysis on 722 failures shows task-dependent modes (format errors dominate on Gorilla/InterCode; semantic on Spider/WebArena).

Significance. If the negative result holds after verification of training adequacy, it is significant for efficient tool-use adaptation in small LMs: it quantifies that prompt engineering and example curation outperform hypernetwork-based meta-adaptation, with concrete ablations and a large-scale error analysis (722 cases) that could redirect research away from complex architectures toward simpler, lower-latency methods.

major comments (2)
  1. [Methods (hypernetwork training)] Hypernetwork training subsection: the manuscript states that the hypernetwork generates 'non-trivial weight matrices' yet provides no details on meta-training steps, learning rate, loss formulation, or convergence diagnostics on the training tasks. This is load-bearing for the central 0% contribution claim, as inadequate optimization could produce the observed null result even if the architecture is capable of task-specific adaptation.
  2. [Ablation studies] Ablation studies: the +21.5% attribution to few-shot examples and 0% to hypernetwork assumes the few-shot baseline was implemented at full potential (example selection, formatting, and prompting strategy). Without explicit controls (e.g., hypernetwork with task-agnostic conditioning or random weights), it is unclear whether the generated matrices differ meaningfully across tasks as asserted.
minor comments (2)
  1. [Error analysis] The error analysis references 722 failure cases across shot counts 0-5 but does not describe sampling, annotation protocol, or inter-annotator agreement, limiting interpretability of the task-dependent failure mode claims.
  2. [Results] Table or figure reporting the 79.7% GPT-5 relative performance should include per-benchmark breakdowns and latency measurements to support the efficiency claim.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments on hypernetwork training transparency and ablation rigor are well-taken and help strengthen the central negative result. We respond to each major comment below and will revise the manuscript to incorporate additional details and controls where they directly address the concerns.

read point-by-point responses
  1. Referee: [Methods (hypernetwork training)] Hypernetwork training subsection: the manuscript states that the hypernetwork generates 'non-trivial weight matrices' yet provides no details on meta-training steps, learning rate, loss formulation, or convergence diagnostics on the training tasks. This is load-bearing for the central 0% contribution claim, as inadequate optimization could produce the observed null result even if the architecture is capable of task-specific adaptation.

    Authors: We agree that the hypernetwork training procedure requires more explicit documentation to support the claim of a null result. The manuscript will be revised to include a dedicated paragraph specifying the meta-training configuration: number of steps, optimizer and learning rate, loss formulation, and convergence behavior on the meta-training tasks. These details will confirm that the hypernetwork was trained to a stable point and that the observed 0% contribution is not attributable to under-optimization. revision: yes

  2. Referee: [Ablation studies] Ablation studies: the +21.5% attribution to few-shot examples and 0% to hypernetwork assumes the few-shot baseline was implemented at full potential (example selection, formatting, and prompting strategy). Without explicit controls (e.g., hypernetwork with task-agnostic conditioning or random weights), it is unclear whether the generated matrices differ meaningfully across tasks as asserted.

    Authors: The few-shot baseline was constructed with semantic example retrieval and iterative prompt formatting, as described in Section 3.2, which we consider to represent a strong implementation. We acknowledge, however, that explicit controls would make the task-specificity claim more robust. We will therefore add two new ablation rows: (1) hypernetwork conditioned on task-agnostic inputs and (2) random LoRA weights. These will be reported alongside the existing ablations to demonstrate that the generated matrices are task-dependent yet still yield no performance gain over the curated few-shot setting. revision: yes
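A minimal sketch of the two controls the rebuttal promises, assuming a hypernetwork interface like the LoRAHyperNet sketch above; both isolate whether task-conditioned weight generation matters at all. Function names, dimensions, and the initialization scale are illustrative assumptions, not the authors' protocol.

```python
import torch

def task_agnostic_weights(hypernet, task_dim=512):
    """Control 1: condition the hypernetwork on a constant vector,
    removing all task information from the generated LoRA factors."""
    return hypernet(torch.zeros(task_dim))

def random_lora_weights(d_in=3072, d_out=3072, rank=16, std=0.02):
    """Control 2: shape-matched random LoRA factors with no learning at all."""
    A = torch.randn(rank, d_in) * std
    B = torch.randn(d_out, rank) * std
    return A, B
```

If the trained hypernetwork beats neither control, the generated matrices are task-dependent in form but inert in effect, which is exactly what the 0% ablation row asserts.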

Circularity Check

0 steps flagged

No circularity: purely empirical negative result with no derivations or load-bearing self-citations

full rationale

The paper is an empirical comparison of adaptation methods (few-shot prompting, documentation, hypernetwork LoRA, beam search) on four benchmarks using a fixed Llama-3.2-3B backbone. The central claim rests on measured performance deltas and ablations (+21.5% from shots, +5% from docs, 0% from hypernetwork) rather than any derivation, equation, or fitted parameter that reduces to its own inputs. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the architecture or results; the negative finding is presented as a direct observation from the experiments. This is the common case of a self-contained empirical study with no derivational chain to inspect.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that the chosen benchmarks and prompting setups provide a fair test of adaptation methods without systematic bias favoring the baseline.

axioms (1)
  • domain assumption The four benchmarks (Gorilla APIBench, Spider 2.0, WebArena, InterCode) are representative of practical tool-use scenarios.
    The study extrapolates from these specific tasks to general tool adaptation without additional validation.

pith-pipeline@v0.9.0 · 5561 in / 1235 out tokens · 31837 ms · 2026-05-10T00:45:17.730908+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 24 canonical work pages · 6 internal anchors

  1. [1]

    A Practical Guide to Building Agents.

  2. [2]

    Proceedings of the 1st Workshop for Research on Agent Language Models. 2025.

  3. [3]

    ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents. 2025.

  4. [4]

    Gaurav Verma, Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Tucker Balch, and Manuela Veloso. 2024. AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations. Computing Research Repository.

  5. [5]

    A Closer Look at the Limitations of Instruction Tuning. 2024.

  6. [6]

    Agent Learning via Early Experience. 2025.

  7. [8]

    RustEvo^2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation. 2025.

  8. [9]

    Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, and Tao Yu. 2025. Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows.

  9. [10]

    Zhaoxuan Tan, Zixuan Zhang, Haoyang Wen, Zheng Li, Rongzhi Zhang, Pei Chen, Fengran Mo, Zheyuan Liu, Qingkai Zeng, Qingyu Yin, and Meng Jiang. 2025. Computing Research Repository.

  10. [12]

    Toolformer: Language Models Can Teach Themselves to Use Tools. 2023.

  11. [13]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing Reasoning and Acting in Language Models. Computing Research Repository.

  12. [14]

    Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, and Michal Shmueli-Scheuer. 2025. Computing Research Repository.

  13. [15]

    Lilian Weng. 2023. LLM Powered Autonomous Agents.

  14. [16]

    Mahmoud Mohammadi, Yipeng Li, Jane Lo, and Wendy Yip. 2025. Evaluation and Benchmarking of LLM Agents: A Survey.

  15. [17]

    Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. 2017.

  16. [18]

    Zhyper: Factorized Hypernetworks for Conditioned LLM Fine-Tuning. 2025.

  17. [19]

    LoRA: Low-Rank Adaptation of Large Language Models. 2021.

  18. [20]

    Hypernetworks for Perspectivist Adaptation. 2025.

  19. [21]

    The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents. 2025.

  20. [22]

    Yuchen Deng, Shichen Fan, Naibo Wang, Xinkui Zhao, and See-Kiong Ng. 2025. doi:10.18653/v1/2025.emnlp-main.506.

  21. [23]

    WebArena: A Realistic Web Environment for Building Autonomous Agents. 2024.

  22. [24]

    Search Self-play: Pushing the Frontier of Agent Capability without Supervision. 2025.

  23. [25]

    Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs. 2025.

  24. [26]

    InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback. 2023.

  25. [27]

    AgentTuning: Enabling Generalized Agent Abilities for LLMs. 2023.

  26. [28]

    Gorilla: Large Language Model Connected with Massive APIs. 2023.

  27. [29]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. 2023.

  28. [30]

    API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs. 2023.

  29. [31]

    RestGPT: Connecting Large Language Models with Real-World RESTful APIs. 2023.

  30. [32]

    M. H. I. Abdalla, Zhipin Wang, Christian Frey, Steffen Eger, and Josif Grabocka. 2025. https://arxiv.org/abs/2510.19733 Zhyper: Factorized hypernetworks for conditioned llm fine-tuning . Preprint, arXiv:2510.19733

  31. [33]

    Anthropic . 2024. https://modelcontextprotocol.io/ Model Context Protocol

  32. [34]

    Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. https://arxiv.org/abs/1703.03400 Model-agnostic meta-learning for fast adaptation of deep networks . Preprint, arXiv:1703.03400

  33. [35]

    Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Ramaneswaran S, Deepali Aneja, Zeyu Jin, Ramani Duraiswami, and Dinesh Manocha. 2024. https://arxiv.org/abs/2402.05119 A closer look at the limitations of instruction tuning . Preprint, arXiv:2402.05119

  34. [36]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. https://arxiv.org/abs/2106.09685 Lora: Low-rank adaptation of large language models . Preprint, arXiv:2106.09685

  35. [37]

    Daniil Ignatev, Denis Paperno, and Massimo Poesio. 2025. https://arxiv.org/abs/2510.13259 Hypernetworks for perspectivist adaptation . Preprint, arXiv:2510.13259

  36. [38]

    Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, and Tao Yu. 2025. https://openreview.net/forum?id=XmProj9cPs Spider 2.0: Evaluating language models on real-world enterprise text-to- SQL workflows . In Proceeding...

  37. [39]

    Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. 2023. https://arxiv.org/abs/2304.08244 Api-bank: A comprehensive benchmark for tool-augmented llms . Preprint, arXiv:2304.08244

  38. [40]

    Linxi Liang, Jing Gong, Mingwei Liu, Chong Wang, Guangsheng Ou, Yanlin Wang, Xin Peng, and Zibin Zheng. 2025. https://arxiv.org/abs/2503.16922 Rustevo ^2 : An evolving benchmark for api evolution in llm-based rust code generation . Preprint, arXiv:2503.16922

  39. [41]

    Hongliang Lu, Yuhang Wen, Pengyu Cheng, Ruijin Ding, Jiaqi Guo, Haotian Xu, Chutian Wang, Haonan Chen, Xiaoxi Jiang, and Guanjun Jiang. 2025. https://arxiv.org/abs/2510.18821 Search self-play: Pushing the frontier of agent capability without supervision . Preprint, arXiv:2510.18821

  40. [42]

    Chuancheng Lv, Lei Li, Shitou Zhang, Gang Chen, Fanchao Qi, Ningyu Zhang, and Hai-Tao Zheng. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.956 HyperLoRA : Efficient cross-task generalization via constrained low-rank adapters generation . In Findings of the Association for Computational Linguistics: EMNLP 2024 , pages 16376--16393, Miami, Florida, ...

  41. [43]

    Mahmoud Mohammadi, Yipeng Li, Jane Lo, and Wendy Yip. 2025. https://doi.org/10.1145/3711896.3736570 Evaluation and benchmarking of LLM agents: A survey . In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining ( KDD '25) , Toronto, ON, Canada. Association for Computing Machinery

  42. [44]

    Gorilla: Large Language Model Connected with Massive APIs

    Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2023. https://arxiv.org/abs/2305.15334 Gorilla: Large language model connected with massive apis . Preprint, arXiv:2305.15334

  43. [45]

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2023. https://arxiv.org/abs/2307.16789 Toolllm: Facilitating large language models to master 16000+ real-world apis . Preprint, ...

  44. [46]

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. https://arxiv.org/abs/2302.04761 Toolformer: Language models can teach themselves to use tools . Preprint, arXiv:2302.04761

  45. [47]

    Haiyang Shen, Yue Li, Desong Meng, Dongqi Cai, Sheng Qi, Li Zhang, Mengwei Xu, and Yun Ma. 2025. https://arxiv.org/abs/2407.00132 Shortcutsbench: A large-scale real-world benchmark for api-based agents . Preprint, arXiv:2407.00132

  46. [48]

    Yifan Song, Weimin Xiong, Dawei Zhu, Wenhao Wu, Han Qian, Mingbo Song, Hailiang Huang, Cheng Li, Ke Wang, Rong Yao, Ye Tian, and Sujian Li. 2023. https://arxiv.org/abs/2306.06624 Restgpt: Connecting large language models with real-world restful apis . Preprint, arXiv:2306.06624

  47. [49]

    Gaurav Verma, Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Tucker Balch, and Manuela Veloso. 2024. https://arxiv.org/abs/2411.13451 AdaptAgent : Adapting multimodal web agents with few-shot learning from human demonstrations . Computing Research Repository, arXiv:2411.13451. Version 1

  48. [50]

    Xingyao Wang, Simon Rosenberg, Juan Michelini, Calvin Smith, Hoang Tran, Engel Nyst, Rohit Malhotra, Xuhui Zhou, Valerie Chen, Robert Brennan, and Graham Neubig. 2025. https://arxiv.org/abs/2511.03690 The openhands software agent sdk: A composable and extensible foundation for production agents . Preprint, arXiv:2511.03690

  49. [51]

    Lilian Weng. 2023. https://lilianweng.github.io/posts/2023-06-23-agent/ LLM powered autonomous agents

  50. [52]

    John Yang, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. 2023. https://arxiv.org/abs/2306.14898 Intercode: Standardizing and benchmarking interactive coding with execution feedback . Preprint, arXiv:2306.14898

  51. [53]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. https://arxiv.org/abs/2210.03629 ReAct : Synergizing reasoning and acting in language models . Computing Research Repository, arXiv:2210.03629. Version 3

  52. [54]

    Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, and Bill Yuchen Lin. 2024. https://doi.org/10.18653/v1/2024.acl-long.670 Agent Lumos : Unified and modular training for open-source language agents . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...

  53. [55]

    Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. 2023. https://arxiv.org/abs/2310.12823 Agenttuning: Enabling generalized agent abilities for llms . Preprint, arXiv:2310.12823

  54. [56]

    Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, and 11 others. 2025. https://arxiv.org/abs/2510.08558 Agent learning via early experience . Preprint, arXiv:2510.08558

  55. [57]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2024. https://arxiv.org/abs/2307.13854 Webarena: A realistic web environment for building autonomous agents . Preprint, arXiv:2307.13854