pith. machine review for the scientific record.

arxiv: 2604.11322 · v1 · submitted 2026-04-13 · 💻 cs.CL · cs.AI

Recognition: unknown

Do LLMs Know Tool Irrelevance? Demystifying Structural Alignment Bias in Tool Invocations

Fang Fang, Ge Zhang, Pengfei Cao, Xixun Lin, Yanan Cao, Yilong Liu

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:00 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords structural alignment bias · tool invocation · LLM tool use · semantic relevance · attention attribution · tool refusal · bias mitigation

The pith

LLMs invoke tools whenever query attributes fit tool parameters, even if the tool cannot serve the query goal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often call external tools even when those tools are irrelevant to the user's request. The paper isolates a specific flaw called structural alignment bias, in which models check whether query details can be assigned to a tool's parameters and treat that match as sufficient reason to invoke the tool. To measure this cleanly, they created SABEval, a dataset that varies structural fit independently of whether the tool actually helps. Experiments show the bias produces many erroneous calls that existing benchmarks overlook. Internal probing with Contrastive Attention Attribution identifies two opposing pathways in the model—one checking semantics and one checking structure—and shows that a simple rebalancing of their strengths reduces the errors while leaving normal tool use intact.

Core claim

Structural alignment bias is the tendency of LLMs to invoke a tool as soon as its parameters can receive valid values from the query, regardless of whether the tool advances the user's goal. SABEval decouples this structural factor from semantic relevance and reveals that the bias drives most refusal failures. Contrastive Attention Attribution demonstrates that invocation decisions result from the relative strength of a semantic-checking pathway versus a structural-matching pathway. Rebalancing these pathways through targeted attention adjustment corrects the bias without harming performance on relevant tool calls.
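The rebalancing idea can be sketched numerically. This is a minimal illustration of the concept as described in the review, not the paper's implementation: attention mass assigned to "structural" tokens (e.g. the tool's parameter schema) is scaled by a coefficient and the distribution renormalized, shifting relative pathway strength toward semantic checking. The token grouping and coefficient value are illustrative assumptions.

```python
# Sketch: down-weight attention to structural tokens by a coefficient alpha,
# then renormalize so the weights again sum to 1. The indices marked as
# "structural" here are invented for illustration.

def rebalance(attn, structural_idx, alpha):
    """Scale attention on structural tokens by alpha and renormalize."""
    scaled = [w * alpha if i in structural_idx else w
              for i, w in enumerate(attn)]
    total = sum(scaled)
    return [w / total for w in scaled]

# Attention over four tokens: indices 0-1 semantic (query goal),
# indices 2-3 structural (parameter schema).
attn = [0.2, 0.1, 0.4, 0.3]          # sums to 1.0
structural_idx = {2, 3}

rebalanced = rebalance(attn, structural_idx, alpha=0.5)
semantic_mass = rebalanced[0] + rebalanced[1]   # rises from 0.30 to about 0.46
```

Under this reading, choosing alpha trades off suppression of false invocations against preserving legitimate tool calls, which is what the paper's scaling-coefficient sweeps (Figures 7 and 19-23) appear to chart.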

What carries the argument

Structural alignment bias, the decision rule that triggers tool invocation on the basis of valid parameter assignment from the query rather than on goal relevance.
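The decision rule can be made concrete with a toy example. The tool, query, and helper below are invented for illustration, not drawn from the paper: the query's attributes fill every parameter of a flight-booking tool, so the biased rule fires even though the user wants a weather report.

```python
# Hypothetical illustration of the biased decision rule: invoke whenever every
# tool parameter can receive a valid value from the query, ignoring the goal.

def structurally_aligned(query_attrs: dict, tool_params: set) -> bool:
    """Biased rule: all required parameters are fillable from the query."""
    return tool_params.issubset(query_attrs.keys())

# Tool: book_flight(city, date) -- goal is flight booking.
tool_params = {"city", "date"}
tool_goal = "flight_booking"

# Query: "What will the weather be in Paris on June 5?"
# Its attributes fill the flight tool's slots, but the goal is a weather report.
query_attrs = {"city": "Paris", "date": "June 5"}
query_goal = "weather_report"

invoke_biased = structurally_aligned(query_attrs, tool_params)    # fires
invoke_correct = invoke_biased and (query_goal == tool_goal)      # refuses
```

The gap between `invoke_biased` and `invoke_correct` on such cases is, in effect, what SABEval is built to measure.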

If this is right

  • Standard tool-use benchmarks that do not control for structural similarity will systematically underestimate refusal errors.
  • Rebalancing the relative strength of semantic and structural attention pathways reduces false invocations across tested models.
  • The bias and its mitigation generalize without degrading performance on cases where tools are relevant.
  • Contrastive attribution can be used to diagnose other decision biases that pit surface form against intent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same structural-matching shortcut may explain other LLM failures where surface patterns override intended meaning.
  • Future tool benchmarks should routinely include structurally matched but semantically irrelevant distractors.
  • Attention rebalancing techniques could apply to other internal conflicts between syntax and semantics in language models.

Load-bearing premise

That SABEval successfully isolates structural alignment from semantic relevance without other confounds and that the contrastive attribution method accurately identifies the two competing pathways.
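The premise amounts to a 2x2 design: structural fit and semantic relevance varied independently. The cell contents below are invented examples sketching that design as we read it, not SABEval items; the bias shows up as invocations in the fit-but-irrelevant cell.

```python
# Sketch of the 2x2 grid SABEval is taken to approximate. Queries and the
# book_flight(city, date) tool are hypothetical examples.
cases = {
    ("structural_fit", "relevant"):
        "query: book a flight to Paris on June 5",
    ("structural_fit", "irrelevant"):
        "query: what is the weather in Paris on June 5",
    ("no_fit", "relevant"):
        "query: book me a flight somewhere warm (no city, no date)",
    ("no_fit", "irrelevant"):
        "query: tell me a joke",
}

# Only the fit-and-relevant cell warrants invocation. If the dataset is
# confound-free, the error rate in the fit-but-irrelevant cell can be read
# directly as structural alignment bias.
should_invoke = {k: k == ("structural_fit", "relevant") for k in cases}
```

If lexical overlap or parameter-type matching covaried with the structural-fit axis, the off-diagonal error rate would no longer isolate the bias, which is exactly the confound the referee raises.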

What would settle it

If models continue to show high rates of erroneous invocations on structurally aligned but semantically irrelevant cases even after the rebalancing intervention, the claim that the bias is both measurable and mitigable would be falsified.

Figures

Figures reproduced from arXiv: 2604.11322 by Fang Fang, Ge Zhang, Pengfei Cao, Xixun Lin, Yanan Cao, Yilong Liu.

Figure 1: Illustration of erroneous tool invocation driven by the model's strong reliance on structural alignment. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png]
Figure 2: SABEval construction with three steps. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]
Figure 3: Tool invocation rate (%) on SABEval subsets. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]
Figure 4: Tool invocation rate (%) on original samples. [PITH_FULL_IMAGE:figures/full_fig_p005_4.png]
Figure 6: Comparison of pathway strengths between invocation and non-invocation cases for Qwen3-8B.
Figure 7: TIR under different scaling coefficients. [PITH_FULL_IMAGE:figures/full_fig_p007_7.png]
Figure 8: Distribution of derived class counts (left) and … [PITH_FULL_IMAGE:figures/full_fig_p012_8.png]
Figure 9: Tool invocation rate (%) across 10 randomly sampled tool templates. [PITH_FULL_IMAGE:figures/full_fig_p014_9.png]
Figure 10: Pathways in Qwen3-4B.
Figure 11: Pathways in Qwen3-14B.
Figure 12: Pathways in ToolACE-2.5-8B. [PITH_FULL_IMAGE:figures/full_fig_p017_12.png]
Figure 13: Pathways in Watt-Tool-8B.
Figure 14: Comparison of pathway strengths between invocation and non-invocation cases for Qwen3-4B. [PITH_FULL_IMAGE:figures/full_fig_p018_14.png]
Figure 15: Comparison of pathway strengths between invocation and non-invocation cases for Qwen3-14B. [PITH_FULL_IMAGE:figures/full_fig_p018_15.png]
Figure 16: Comparison of pathway strengths between invocation and non-invocation cases for ToolACE-2.5-8B. [PITH_FULL_IMAGE:figures/full_fig_p019_16.png]
Figure 17: Comparison of pathway strengths between invocation and non-invocation cases for Watt-Tool-8B. [PITH_FULL_IMAGE:figures/full_fig_p019_17.png]
Figure 18: Pathway strengths across degrees of structural alignment for Qwen3-8B and ToolACE-2.5-8B. [PITH_FULL_IMAGE:figures/full_fig_p019_18.png]
Figure 19: TIR at different top-k thresholds under varying scaling coefficients for Qwen3-4B.
Figure 20: TIR at different top-k thresholds under varying scaling coefficients for Qwen3-8B.
Figure 21: TIR at different top-k thresholds under varying scaling coefficients for Qwen3-14B.
Figure 22: TIR at different top-k thresholds under varying scaling coefficients for ToolACE-2.5-8B. [PITH_FULL_IMAGE:figures/full_fig_p020_22.png]
Figure 23: TIR at different top-k thresholds under varying scaling coefficients for Watt-Tool-8B.
Figure 24: An example annotation illustrating a base class, derived classes, and a tool template. [PITH_FULL_IMAGE:figures/full_fig_p022_24.png]
Figure 25: Query-generation prompt used during dataset construction. [PITH_FULL_IMAGE:figures/full_fig_p023_25.png]
Figure 26: Prompt for adding additional parameters during dataset extension. [PITH_FULL_IMAGE:figures/full_fig_p024_26.png]
Figure 27: Default system prompt for Qwen3-4B, Qwen3-8B, and Qwen3-14B. [PITH_FULL_IMAGE:figures/full_fig_p025_27.png]
Figure 28: Default system prompt for ToolACE-8B and Watt-Tool-8B. [PITH_FULL_IMAGE:figures/full_fig_p025_28.png]
Original abstract

Large language models (LLMs) have demonstrated impressive capabilities in utilizing external tools. In practice, however, LLMs are often exposed to tools that are irrelevant to the user's query, in which case the desired behavior is to refrain from invocations. In this work, we identify a widespread yet overlooked mechanistic flaw in tool refusal, which we term structural alignment bias: Even when a tool fails to serve the user's goal, LLMs still tend to invoke it whenever query attributes can be validly assigned to tool parameters. To systematically study this bias, we introduce SABEval, a new dataset that decouples structural alignment from semantic relevance. Our analysis shows that structural alignment bias induces severe tool-invocation errors in LLMs, yet remains largely unaccounted for in existing evaluations. To investigate the internal mechanisms underlying this bias, we propose Contrastive Attention Attribution, which reveals two competing pathways for semantic checking and structural matching. The relative strength of these pathways drives LLMs' tool invocation decisions. Based on these findings, we further introduce a rebalancing strategy that effectively mitigates structural alignment bias, as demonstrated by extensive experiments, without degrading general tool-use capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs exhibit a structural alignment bias in tool invocation: they tend to call tools when query attributes can be mapped to tool parameters structurally, even if the tool is semantically irrelevant to the query goal. To study this, the authors introduce SABEval, a dataset that decouples structural alignment from semantic relevance. Analysis via a new Contrastive Attention Attribution technique identifies two competing pathways (semantic checking and structural matching) whose relative strengths determine invocation decisions. Based on this, they propose a rebalancing strategy that mitigates the bias in experiments while preserving general tool-use performance.

Significance. If the empirical findings and mechanistic account hold, the work is significant for highlighting an overlooked failure mode in LLM tool use that existing benchmarks miss. SABEval is a useful new resource for evaluating tool irrelevance. The attention-based pathway analysis and rebalancing mitigation offer both diagnostic insight and a practical fix. Credit is given for the systematic dataset construction and the extensive experiments demonstrating mitigation without capability degradation.

major comments (2)
  1. [§4.2] Contrastive Attention Attribution: The identification of two competing pathways rests on correlational attention patterns from contrastive pairs in SABEval. Without intervention experiments (e.g., ablating or patching the attributed attention heads or layers to measure changes in refusal rates), the account does not establish that these pathways causally drive invocation decisions. This directly underpins the motivation and design of the rebalancing strategy in §6.
  2. [§3] SABEval construction: The dataset is presented as successfully isolating structural alignment from semantic relevance, but the paper does not report controls or ablations for potential confounds such as lexical overlap between query and tool descriptions or parameter-type matching. This affects whether the reported error rates can be attributed specifically to structural alignment bias rather than to other factors.
minor comments (2)
  1. [Figure 4] The attention attribution visualizations would benefit from quantitative summaries (e.g., average attribution scores per pathway) in addition to the qualitative examples.
  2. [Related Work] The discussion of prior tool-use evaluations could more explicitly contrast SABEval with existing irrelevance benchmarks to clarify the incremental contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the significance of identifying structural alignment bias in LLM tool use. We address each major comment below, outlining our responses and planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [§4.2] Contrastive Attention Attribution: The identification of two competing pathways rests on correlational attention patterns from contrastive pairs in SABEval. Without intervention experiments (e.g., ablating or patching the attributed attention heads or layers to measure changes in refusal rates), the account does not establish that these pathways causally drive invocation decisions. This directly underpins the motivation and design of the rebalancing strategy in §6.

    Authors: We agree that the Contrastive Attention Attribution analysis relies on correlational patterns observed in attention weights across SABEval contrastive pairs, and that this does not constitute direct causal evidence via interventions such as head ablation or patching. The rebalancing strategy in §6 is motivated by these patterns and functions as an indirect test by down-weighting the structural matching pathway, which empirically reduces invocation errors while preserving tool-use performance. In the revision, we will update §4.2 to explicitly discuss the correlational nature of the findings as a limitation and clarify the empirical support provided by the rebalancing experiments. We will also add a brief discussion of potential future causal interventions. revision: partial

  2. Referee: [§3] SABEval construction: The dataset is presented as successfully isolating structural alignment from semantic relevance, but the paper does not report controls or ablations for potential confounds such as lexical overlap between query and tool descriptions or parameter-type matching. This affects whether the reported error rates can be attributed specifically to structural alignment bias rather than to other factors.

    Authors: We thank the referee for highlighting these potential confounds. SABEval was constructed to isolate structural alignment (parameter compatibility) from semantic relevance (query-tool goal mismatch), but we did not include explicit ablations for lexical overlap or parameter-type matching in the reported results. In the revised manuscript, we will add controls and ablations in §3 (or a new appendix) that systematically vary lexical similarity and parameter-type consistency while holding other factors fixed, to better attribute the error rates to structural alignment bias specifically. revision: yes

Circularity Check

0 steps flagged

No circularity detected; the derivation is empirically grounded in a new dataset and new methods.

Full rationale

The paper defines structural alignment bias from observed LLM behavior with irrelevant tools, introduces the independent SABEval dataset to decouple structural alignment from semantic relevance, proposes Contrastive Attention Attribution as an analysis technique to identify competing pathways, and derives the rebalancing strategy from those experimental findings. No steps reduce by construction to self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations. All central claims rest on new empirical measurements rather than tautological or self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The work is empirical and introduces one new conceptual entity (structural alignment bias) and one new dataset (SABEval). No free parameters or ad-hoc axioms are described in the abstract.

invented entities (1)
  • structural alignment bias (no independent evidence)
    purpose: to name and isolate the tendency of LLMs to invoke tools based on parameter matching rather than goal relevance
    Defined in the abstract as the core flaw; no independent evidence is provided beyond the new dataset and analysis.

pith-pipeline@v0.9.0 · 5518 in / 1163 out tokens · 27524 ms · 2026-05-10T15:00:02.955457+00:00 · methodology

discussion (0)

