Lost in Execution: On the Multilingual Robustness of Tool Calling in Large Language Models

Hua Wei; Ogochukwu N Okoani; T Pranav Kutralingam; Wanpeng Xu; Xiyang Hu; Zheng Luo

arxiv: 2601.05366 · v2 · pith:PVRSATCQnew · submitted 2026-01-08 · 💻 cs.CL · cs.AI · cs.LG

Zheng Luo, T Pranav Kutralingam, Ogochukwu N Okoani, Wanpeng Xu, Hua Wei, Xiyang Hu

classification 💻 cs.CL · cs.AI · cs.LG

keywords: language, tool, calling, execution, models, multilingual, large, parameter

Large Language Models (LLMs) are increasingly deployed as agents that invoke external tools through structured function calls. While recent work reports strong tool-calling performance under standard English-centric evaluations, the robustness of tool calling under multilingual user interactions remains underexplored. In this work, we introduce MLCL, a diagnostic benchmark, and conduct a systematic evaluation of multilingual tool calling across Chinese, Hindi, and the low-resource language Igbo. Through fine-grained error analysis, we show that many failures occur despite correct intent understanding and tool selection. We identify parameter value language mismatch as a dominant failure mode, where models generate semantically appropriate parameter values in the user's language, violating language-invariant execution conventions. We further evaluate several inference-time system strategies and find that while these strategies substantially reduce language-induced execution errors, none of them can fully recover English-level performance.
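
The parameter value language mismatch described in the abstract can be illustrated with a minimal sketch. It is not the MLCL benchmark or the authors' error-analysis method; it only assumes a hypothetical tool whose arguments are expected in English (Latin script), and flags string arguments that contain letters from other scripts. The argument names (`city`, `unit`) and the helper functions are invented for illustration.

```python
import unicodedata


def non_latin_letters(value: str) -> list[str]:
    """Return the alphabetic characters in `value` that are not Latin script."""
    return [
        ch
        for ch in value
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
    ]


def find_script_mismatches(arguments: dict) -> dict:
    """Map each string argument name to the non-Latin letters in its value.

    Assumption: the (hypothetical) tool expects English values, so any
    non-Latin letter is treated as a sign of a language mismatch.
    """
    flagged = {}
    for name, value in arguments.items():
        if isinstance(value, str):
            offending = non_latin_letters(value)
            if offending:
                flagged[name] = offending
    return flagged


# Hypothetical tool-call arguments: same intent, same tool, different value language.
call_en = {"city": "Beijing", "unit": "celsius"}
call_zh = {"city": "北京", "unit": "celsius"}

print(find_script_mismatches(call_en))
print(find_script_mismatches(call_zh))
```

A script check like this is only a rough proxy. It would catch Chinese or Hindi (Devanagari) values, but not Igbo, which is written in Latin script with diacritics, so detecting mismatches there would need language identification rather than script detection.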

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation
cs.CL 2026-04 unverdicted novelty 7.0

A new workflow for multilingual agent benchmark adaptation using functional, cultural, and difficulty alignments improves non-English agent success rates by up to 32.7% over simple machine translation, indicating subs...

What to Format and How: A Benchmark and Workflow Approach for Document Formatting
cs.CL 2026-06 unverdicted novelty 6.0

Presents DocFormBench benchmark and DocFormFlow workflow for content-aware LLM document formatting, claiming higher accuracy and lower token use via decoupled localization and modification.

The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective
cs.AI 2026-06 unverdicted novelty 4.0

The paper proposes a unified MDP-based research agenda for addressing sim-to-real gaps in foundation model agents and advocates adopting classical solutions such as domain randomization.