Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs

Guangyu Feng; Huanzhi Mao; Joseph E. Gonzalez; Prabal Dutta

arXiv: 2605.15077 · v1 · pith:MQTZIXSInew · submitted 2026-05-14 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-06-30 20:25 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords asynchronous function calling · LLM tool use · symbolic futures · concurrency · model-tool interaction · function calling benchmarks · software engineering

The pith

LLMs can natively reason over symbolic futures to enable asynchronous function calling without any model or protocol changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AsyncFC as a way to make LLM function calling asynchronous by using symbolic futures for unresolved results. This allows the model to continue decoding while functions execute and to run independent functions in parallel. It requires no changes to the LLM or the standard synchronous calling protocol. On function calling and software engineering benchmarks, it cuts completion time substantially while accuracy stays the same. The results indicate LLMs have this capability built in for better tool interaction.

Core claim

AsyncFC is an execution-layer framework that decouples LLM decoding from function execution by representing unresolved results as symbolic futures. This enables overlap of decoding and execution as well as inter-function parallelism. It works with existing models and unmodified function implementations using the standard synchronous protocol. Experiments show significant reductions in end-to-end task completion time with no loss in task accuracy, revealing LLMs' native ability to reason over such futures.

What carries the argument

Symbolic futures representing unresolved execution results, allowing the LLM to proceed without waiting for results.
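A minimal sketch of the idea, not the paper's implementation: the runtime starts the tool call, hands back a placeholder ID right away, and resolves it only when a value is actually needed. `FutureTable`, `dispatch`, `resolve`, `slow_tool`, and the `future_N` ID format are hypothetical names chosen for illustration.

```python
import asyncio
import itertools


class FutureTable:
    """Hypothetical registry mapping placeholder IDs to in-flight tool calls."""

    def __init__(self):
        self._tasks = {}
        self._ids = itertools.count(1)

    def dispatch(self, coro):
        # Start the call in the background and return a placeholder at once.
        # Must be called while an event loop is running.
        fid = f"future_{next(self._ids)}"
        self._tasks[fid] = asyncio.create_task(coro)
        return fid

    async def resolve(self, fid):
        # Block only at the point where the concrete value is required.
        return await self._tasks[fid]


async def slow_tool(name, delay):
    # Stand-in for an unmodified tool implementation.
    await asyncio.sleep(delay)
    return f"{name}-result"


async def demo():
    table = FutureTable()
    f1 = table.dispatch(slow_tool("F1", 0.05))
    f2 = table.dispatch(slow_tool("F2", 0.05))
    # The model could keep decoding here while F1 and F2 run concurrently.
    return f1, f2, await table.resolve(f1), await table.resolve(f2)
```

Running `asyncio.run(demo())` returns the two placeholder IDs followed by the two resolved results; the placeholders exist before either tool has finished.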

If this is right

Overlap between model decoding and function execution reduces end-to-end latency.
Inter-function parallelism occurs when dependencies permit.
The method requires no fine-tuning or changes to the synchronous protocol.
Task accuracy is maintained on both standard and adapted benchmarks.
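The first two consequences can be illustrated with a toy timing model. This is a simplification under stated assumptions, not the paper's cost model: each step is a decode interval followed by one function call, the step format `(decode_seconds, tool_seconds, deps)` is invented here, and the time the model spends reading results at the end is ignored.

```python
def sync_latency(steps):
    # Synchronous calling: decoding waits for every function to finish.
    return sum(decode + tool for decode, tool, _ in steps)


def async_latency(steps):
    # Asynchronous calling: the model keeps decoding after each dispatch.
    # A call starts once it has been emitted and its dependencies are done.
    clock = 0.0
    finish = []
    for decode, tool, deps in steps:
        clock += decode
        start = max([clock] + [finish[j] for j in deps])
        finish.append(start + tool)
    return max([clock] + finish)


# Hypothetical numbers: F1 and F2 are independent, F3 depends on F2 (index 1).
EXAMPLE_STEPS = [(1.0, 5.0, []), (1.0, 5.0, []), (1.0, 5.0, [1])]

print(sync_latency(EXAMPLE_STEPS))   # 18.0
print(async_latency(EXAMPLE_STEPS))  # 12.0
```

In this toy case the saving comes from both effects: F1 and F2 overlap with each other, and decoding of the next call overlaps with execution of the previous one.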

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could improve efficiency in real-time LLM agent applications.
It may generalize to other external tool or API interactions.
Greater speedups are possible in tasks with higher degrees of independent calls.

Load-bearing premise

The standard synchronous function-calling protocol remains usable without modification and benchmark tasks contain sufficient independent calls to permit measurable parallelism without affecting accuracy.

What would settle it

A benchmark consisting only of strictly sequential dependent function calls where AsyncFC shows no time reduction or causes accuracy drops.
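Why a strictly sequential benchmark is the hard case can be seen from a critical-path calculation. The sketch below is illustrative only; `critical_path` and `parallelism_bound` are hypothetical helpers, and the bound covers inter-function parallelism alone, not any overlap between decoding and execution.

```python
def critical_path(durations, deps):
    # deps[i] lists the indices (all smaller than i) that call i waits for.
    finish = []
    for i, duration in enumerate(durations):
        start = max((finish[j] for j in deps[i]), default=0.0)
        finish.append(start + duration)
    return max(finish, default=0.0)


def parallelism_bound(durations, deps):
    # Total work divided by the critical path: the best speedup that
    # running independent calls concurrently could deliver.
    path = critical_path(durations, deps)
    return sum(durations) / path if path > 0 else 1.0


chain = parallelism_bound([2.0, 2.0, 2.0], [[], [0], [1]])
independent = parallelism_bound([2.0, 2.0, 2.0], [[], [], []])
print(chain, independent)  # 1.0 3.0
```

A pure dependency chain has a bound of 1.0, so any remaining gain on such a benchmark would have to come from decode-execution overlap.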

Figures

Figures reproduced from arXiv: 2605.15077 by Guangyu Feng, Huanzhi Mao, Joseph E. Gonzalez, Prabal Dutta.

**Figure 1.** Timeline of synchronous and asynchronous function calling. F1 and F2 are independent function calls, while F3 depends on the result of F2. The example is illustrated using a sequential function-calling API. (a) Under synchronous function calling, decoding is blocked until each function execution completes. (b) AsyncFC returns a future placeholder immediately after dispatch, allowing decoding and function e…

**Figure 2.** AsyncFC runtime design. Left: Example of dependency and output structure annotation. Dependency annotations specify the read and write sets, and the runtime automatically infers future ID structure from example values in output schema annotations. Right: Overview of the AsyncFC execution pipeline. Model-emitted function calls are synchronously dispatched to the scheduler and enqueued with metadata. The sch…
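The caption says dependency annotations specify read and write sets but does not show how they are turned into ordering constraints. One classic rule, assumed here rather than taken from the paper, is that a later call must wait for an earlier one whenever either of them writes a resource the other touches. `Call`, `conflicts`, `infer_deps`, and the `"fs"` resource label are hypothetical, and call names are assumed unique.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def conflicts(earlier, later):
    return bool(
        earlier.writes & later.reads      # read-after-write
        or earlier.reads & later.writes   # write-after-read
        or earlier.writes & later.writes  # write-after-write
    )


def infer_deps(calls):
    # Map each call to the earlier calls it has to wait for.
    return {
        later.name: [e.name for e in calls[:i] if conflicts(e, later)]
        for i, later in enumerate(calls)
    }


calls = [
    Call("ls", reads=frozenset({"fs"})),
    Call("cat", reads=frozenset({"fs"})),
    Call("write_file", writes=frozenset({"fs"})),
]
print(infer_deps(calls))
# {'ls': [], 'cat': [], 'write_file': ['ls', 'cat']}
```

Two reads of the same resource stay independent and can run in parallel, while the write is ordered after both.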

**Figure 3.** Main BFCL Results. Results are reported on BFCL v3 Multi-Turn (n = 150, 5s delay, GPT-4o) and BFCL v4 Web Search with real backend latency (matched non-overflow composed workloads: n = 31 for Sequential FC and n = 29 for Parallel FC, GPT-4o). AsyncFC shows no evidence of statistically significant accuracy difference (p_acc > 0.05) and achieves speedups in all settings, with statistically significant latency…
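For paired correct/incorrect outcomes such as the accuracy comparison in Figure 3, McNemar's test is one standard choice. Whether the paper uses this exact variant is an assumption; the sketch below implements the exact two-sided form, and the counts in the usage line are invented for illustration, not taken from the paper.

```python
from math import comb


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value.

    b: cases the baseline solved but the asynchronous run did not.
    c: cases the asynchronous run solved but the baseline did not.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) tail probability of a split at least this uneven.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(mcnemar_exact(3, 5))  # 0.7265625
```

Only the discordant cases enter the test; tasks that both configurations solve, or both fail, carry no information about an accuracy difference.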

**Figure 4.** BFCL latency-sweep analysis. Results are reported on BFCL v3 Multi-Turn (n = 150, GPT-4o) while varying injected per-function delay. The left panel shows mean task end-to-end latency, and the right panel decomposes AsyncFC(S) savings over Sequential FC into decode–execution overlap and inter-function parallelism. Error bars denote 95% bootstrap confidence intervals obtained by resampling matched cases. Asy…
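The error bars described in this caption can be approximated with a percentile bootstrap over matched cases. The following is a generic sketch, not the authors' script; `paired_bootstrap_ci`, its defaults, and the latencies in the usage lines are assumptions.

```python
import random


def paired_bootstrap_ci(baseline, treated, n_boot=2000, alpha=0.05, seed=0):
    # Resample per-task latency differences so each draw keeps the pairing
    # between the synchronous and asynchronous run of the same task.
    rng = random.Random(seed)
    diffs = [b - t for b, t in zip(baseline, treated)]
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_boot)
    )
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical per-task latencies in seconds.
sync_times = [10.0, 12.0, 11.0, 13.0]
async_times = [6.0, 7.0, 6.0, 8.0]
low, high = paired_bootstrap_ci(sync_times, async_times)
```

An interval for the mean saving that excludes zero would support a latency reduction on the sampled tasks.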

**Figure 5.** Cross-model and downstream application evaluations. The three panels report cross-model transfer (n = 150, Gemini 3.1 Pro, 10s delay), software-engineering results (n = 300, GPT-5.2, 2× function latency), and asynchronous-thinking results (composed workloads, n = 50, GPT-4o), respectively. Speedups are relative to the corresponding synchronous baseline. For SWE-Bench Lite, displayed speedups are relative t…

**Figure 6.** Real workload execution traces under distinct AsyncFC operation regimes. Left: Balanced model decode and function critical-path times (T_LLM ≈ T_cp), where decoding with futures enables continuous decoding. Right: High inter-function parallelism (T_tool ≫ T_cp), where short-latency function calls allow decoding to proceed.

Original abstract

Function calling, also known as tool use, is a core capability of modern LLM agents but is typically constrained by synchronous execution semantics. Under these semantics, LLM decoding is blocked until each function call completes, resulting in increasing end-to-end latency. In this work, we introduce AsyncFC, a pure execution-layer framework that decouples LLM decoding from function execution, enabling overlap between model decoding and function execution as well as inter-function parallelism when dependencies permit. AsyncFC layers over existing models and unmodified function implementations, requiring no fine-tuning or changes to the standard synchronous function-calling protocol. Across standard function-calling benchmarks and adapted software engineering benchmarks, AsyncFC significantly reduces end-to-end task completion time while preserving task accuracy. Furthermore, these results reveal that LLMs possess a native capability to reason over symbolic futures that represent unresolved execution results, enabling an asynchronous paradigm for model-tool interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AsyncFC is a clean execution-layer wrapper for overlapping LLM decoding with tool runs that claims no protocol changes, but the native future-reasoning interpretation needs the full methods to hold up.

AsyncFC is an execution-layer addition that lets LLM tool calls run asynchronously so decoding overlaps with execution and independent calls can parallelize. The paper keeps the standard synchronous function-calling interface untouched and reports lower end-to-end times on both ordinary function-calling benchmarks and adapted software-engineering tasks while accuracy stays the same.

The practical piece is straightforward: you insert the framework between the model and the tools, feed back symbolic placeholders for unfinished results, and let the model continue generating calls that reference those placeholders. If the benchmarks contain enough independent work, the latency drop follows directly. That part looks like a useful systems contribution for anyone shipping agents where wall-clock time matters.

The softer spot is the stronger claim that this setup reveals a native LLM ability to reason over symbolic futures. The abstract asserts that AsyncFC needs no fine-tuning and no changes to the standard synchronous function-calling protocol, yet it gives no concrete example of how a future is represented in a tool response or how the model is expected to emit a reference to it. If the adapted benchmarks consist mostly of loosely coupled calls, the measured gains could come from simple scheduling rather than from any symbolic reasoning. The stress-test note correctly flags this as the load-bearing assumption; without the methods section it is hard to judge whether the protocol really stayed unmodified.

The work is aimed at people building and deploying tool-using agents who already have a working synchronous stack and want lower latency. A reader focused on agent infrastructure would find the framework description and the reported numbers worth seeing. It is not a theoretical advance, but the execution approach is simple enough that the empirical results, if they replicate, are worth having in the literature.

I would send it to peer review. The core idea is narrow but cleanly scoped, and the practical payoff is clear enough to justify referee time even if the native-capability framing needs tightening.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces AsyncFC, a pure execution-layer framework that enables asynchronous function calling for LLMs by decoupling decoding from execution via symbolic futures representing unresolved results. It claims this permits overlap between model decoding and function execution plus inter-function parallelism when dependencies allow, all while layering over existing models and unmodified function implementations with no fine-tuning or changes to the standard synchronous function-calling protocol. The work reports significant reductions in end-to-end task completion time on standard function-calling benchmarks and adapted software engineering benchmarks while preserving task accuracy, and interprets the results as evidence that LLMs possess a native capability to reason over symbolic futures.

Significance. If the results hold, the framework could improve efficiency of LLM agents by enabling concurrency without model or protocol modifications, a strength given the emphasis on no changes to the synchronous interface. The potential to reveal native symbolic-future reasoning would be of interest if the empirical support is made verifiable.

major comments (2)

[Abstract] The central claims of benchmark improvements (reduced latency, preserved accuracy) and native symbolic-future reasoning are asserted without any methods, error bars, dataset details, or statistical tests, rendering the claims unverifiable from the provided text and undermining soundness of the load-bearing empirical results.
[Abstract] The load-bearing assumption that the unmodified synchronous function-calling protocol suffices for symbolic future reasoning (i.e., that the execution layer can substitute future placeholders into tool responses such that subsequent model-generated calls can reference them via IDs or tokens without any JSON schema, prompt, or decoding changes) receives no concrete representation details, leaving open whether the reported latency gains and accuracy preservation actually demonstrate the claimed native capability.

minor comments (1)

[Abstract] The abstract would benefit from a brief parenthetical on the exact symbolic representation used for futures to allow readers to assess the 'no model changes' claim immediately.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments on the abstract. We address each major comment below and will revise the manuscript to improve verifiability of the claims while preserving the paper's core contributions.

Referee: [Abstract] The central claims of benchmark improvements (reduced latency, preserved accuracy) and native symbolic-future reasoning are asserted without any methods, error bars, dataset details, or statistical tests, rendering the claims unverifiable from the provided text and undermining soundness of the load-bearing empirical results.

Authors: We agree the abstract, as a high-level summary, omits these specifics. The full manuscript details the methods, benchmarks (standard function-calling and adapted software engineering tasks), error bars, accuracy metrics, and statistical tests in the Evaluation section. We will revise the abstract to include brief references to the evaluation setup, observed latency reductions, and accuracy preservation to enhance immediate verifiability.
Revision: yes

Referee: [Abstract] The load-bearing assumption that the unmodified synchronous function-calling protocol suffices for symbolic future reasoning (i.e., that the execution layer can substitute future placeholders into tool responses such that subsequent model-generated calls can reference them via IDs or tokens without any JSON schema, prompt, or decoding changes) receives no concrete representation details, leaving open whether the reported latency gains and accuracy preservation actually demonstrate the claimed native capability.

Authors: Section 3 of the manuscript specifies the mechanism: the execution layer inserts symbolic future placeholders as standard IDs or tokens into tool responses using the existing synchronous protocol format. The LLM then references these via the protocol's native ID mechanism in follow-up calls, with no schema, prompt, or decoding modifications required. This substitution is handled purely at the execution layer, directly supporting the native reasoning claim. We will add a concise clarifying phrase and example reference to the revised abstract.
Revision: yes
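To make the mechanism in the second response concrete, here is a sketch of the last step: before a dependent call runs, the execution layer swaps any placeholder the model wrote into its arguments for the resolved value. The `$future_N` string format and the `substitute_futures` helper are assumptions for illustration; the page does not state the actual representation.

```python
def substitute_futures(value, resolved):
    # Walk the argument structure and replace placeholder strings
    # with the results that have since become available.
    if isinstance(value, str) and value in resolved:
        return resolved[value]
    if isinstance(value, dict):
        return {k: substitute_futures(v, resolved) for k, v in value.items()}
    if isinstance(value, list):
        return [substitute_futures(v, resolved) for v in value]
    return value


# Hypothetical arguments emitted by the model while results were pending.
args = {"path": "$future_2", "opts": ["-n", "$future_1"], "limit": 3}
resolved = {"$future_1": "10", "$future_2": "/tmp/a.txt"}
print(substitute_futures(args, resolved))
# {'path': '/tmp/a.txt', 'opts': ['-n', '10'], 'limit': 3}
```

Because the swap happens outside the model, the tool itself receives ordinary arguments and needs no changes.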

Circularity Check

0 steps flagged

No significant circularity; the framework is an execution-layer addition with no derivations or self-referential claims.

The paper introduces AsyncFC as a pure execution-layer framework that layers over existing models and unmodified synchronous function-calling protocols without fine-tuning or changes. No equations, fitted parameters, derivations, or self-referential structures are described in the abstract or claims. The inference that LLMs possess a native capability to reason over symbolic futures is presented as an empirical revelation from benchmark results, not as a load-bearing assumption that reduces to itself or prior self-citations. No patterns matching self-definitional, fitted-input-called-prediction, self-citation-load-bearing, or related circularity kinds are present. The work is self-contained as an engineering framework evaluated on standard benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the unelaborated premise that LLMs can natively handle symbolic futures and that benchmark tasks expose sufficient parallelism; no free parameters or formal axioms are stated.

invented entities (1)

symbolic futures · no independent evidence
purpose: Placeholders representing unresolved function results that LLMs can reason over without waiting for execution
Introduced to enable decoupling without model changes; no independent evidence provided beyond the framework description.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools
cs.CR · 2026-06 · unverdicted · novelty 6.0

Ghost tool calls from speculative dispatch create persistent intent leaks that only issue-time policies changing or suppressing call arguments or destinations can reduce, per evaluations of twelve policies on three corpora.

