pith. machine review for the scientific record.

arxiv: 2605.00472 · v1 · submitted 2026-05-01 · 💻 cs.SE

Recognition: unknown

Q-ARE: An Evaluation Dataset for Query Based API Recommendation

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 18:58 UTC · model grok-4.3

classification 💻 cs.SE
keywords API recommendation · evaluation dataset · query-based methods · invocation chain · API call depth · invocation density · large language models · Java projects

The pith

Q-ARE dataset shows API recommenders and LLMs weaken on deep, low-density invocation chains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper builds Q-ARE, a dataset from GitHub Java projects, to test whether query-based API recommendation methods can correctly identify the third-party APIs needed to fulfill a developer's natural-language query. It does this by tracing method invocations, recursively expanding the chains to collect all directly and indirectly called APIs, and packaging them as recommendation targets. Two metrics are added: API Call Depth tracks how many layers separate the query from the target API, while Invocation Density measures how much of the chain's code relates to that API. Tests on existing methods and general large language models find clear drops in accuracy as depth rises and density falls.

Core claim

Q-ARE is built by extracting methods from open-source Java projects, analyzing their invocation chains, and recursively expanding those chains to unify all third-party APIs invoked at any level into a single target set per query. The authors define API Call Depth as the distance in the call hierarchy from the query method to a target API, and Invocation Density as the proportion of code lines in the chain tied to the target API. When several query-based recommendation methods and large language models are evaluated on the resulting dataset, their performance declines sharply with greater call depth and lower invocation density, showing that current approaches still have trouble with multi-level method invocation structures.
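The construction step can be sketched in miniature: given a call graph, recursively expand a method's invocations and record each reachable third-party API together with its API Call Depth. A minimal Python sketch, assuming a toy hand-written call graph (the paper derives its graph from static analysis of real Java projects, and the `is_api` flags here are illustrative):

```python
# Toy call graph: method -> list of (callee, is_third_party_api).
# The graph and the third-party flags are assumptions for illustration.
CALL_GRAPH = {
    "saveReport": [("formatRows", False), ("gson.toJson", True)],
    "formatRows": [("csv.CSVPrinter.print", True), ("pad", False)],
    "pad":        [("commons.StringUtils.rightPad", True)],
}

def expand_targets(method, depth=1, seen=None):
    """Recursively collect every third-party API reachable from `method`,
    tagging each with its API Call Depth (distance in the call hierarchy).
    Returns a dict mapping API name -> shallowest observed depth."""
    if seen is None:
        seen = set()
    targets = {}
    for callee, is_api in CALL_GRAPH.get(method, []):
        if is_api:
            # Keep the shallowest depth if an API is reachable more than once.
            targets[callee] = min(depth, targets.get(callee, depth))
        elif callee not in seen:
            seen.add(callee)  # avoid revisiting methods (and cycles)
            for api, d in expand_targets(callee, depth + 1, seen).items():
                targets[api] = min(d, targets.get(api, d))
    return targets

print(expand_targets("saveReport"))
```

Here `gson.toJson` would land at depth 1 while `commons.StringUtils.rightPad`, reached only through two intermediate helpers, lands at depth 3 — exactly the kind of indirectly invoked target the paper reports current methods missing.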

What carries the argument

The Q-ARE dataset, formed by recursive expansion of method invocation chains from GitHub Java projects, together with the two metrics, API Call Depth and Invocation Density, that measure the structural complexity of those chains.
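Invocation Density, as the abstract defines it, reduces to a line-count ratio over the invocation chain. A sketch of that ratio, assuming a simple name-match criterion for which lines are "associated with" the target API (the paper's exact criterion is not specified in the text available here):

```python
def invocation_density(chain_lines, target_api):
    """Proportion of code lines in an invocation chain associated with the
    target API. Association is approximated here by a substring match on
    the API's receiver name -- an assumption, not the paper's criterion."""
    if not chain_lines:
        return 0.0
    related = sum(1 for line in chain_lines if target_api in line)
    return related / len(chain_lines)

# Hypothetical four-line chain using Apache Commons CSV.
chain = [
    "List<String> rows = formatRows(data);",
    "CSVPrinter printer = new CSVPrinter(out, CSVFormat.DEFAULT);",
    "printer.printRecords(rows);",
    "printer.flush();",
]
print(invocation_density(chain, "printer"))  # 3 of 4 lines touch the API
```

Under this reading, a low density means the target API occupies only a small fraction of the chain, which is the regime where the paper reports recommenders degrading.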

If this is right

  • Query-based API recommendation methods must incorporate handling of indirect, multi-level invocations to remain useful in realistic codebases.
  • Large language models exhibit the same performance degradation on deep call structures as traditional specialized recommenders.
  • Q-ARE provides a concrete benchmark that future algorithms can use to measure progress on hierarchical API usage.
  • Developers working with complex Java libraries will continue to receive incomplete or irrelevant suggestions until depth and density are addressed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Recommendation engines could add static call-graph traversal as a post-processing step to surface indirectly invoked APIs that current query matching misses.
  • The recursive chain-expansion technique used to build Q-ARE could be applied to other languages or to web-service APIs to create comparable test collections.
  • Training data that explicitly labels APIs by their depth and density in call chains might help models learn to prioritize the most relevant ones.
  • A side-by-side comparison of Q-ARE targets against APIs actually chosen by human developers for the same tasks would test whether the automatic targets match real intent.
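The first extension above could take the form of a post-processing pass over a recommender's output, assuming a precomputed static call graph for the library is available. A hedged sketch (the graph contents and the recommender output are hypothetical):

```python
# Hypothetical library call graph: API -> APIs it transitively invokes.
LIBRARY_CALL_GRAPH = {
    "Files.readAllLines": ["Files.newBufferedReader", "Charset.defaultCharset"],
    "Files.newBufferedReader": ["Files.newInputStream"],
}

def expand_recommendations(recommended, call_graph, max_depth=2):
    """Breadth-first expansion of recommended APIs through the call graph,
    so indirectly invoked APIs are surfaced alongside direct hits."""
    surfaced = list(recommended)
    frontier = list(recommended)
    for _ in range(max_depth):
        nxt = []
        for api in frontier:
            for callee in call_graph.get(api, []):
                if callee not in surfaced:
                    surfaced.append(callee)
                    nxt.append(callee)
        frontier = nxt
    return surfaced

print(expand_recommendations(["Files.readAllLines"], LIBRARY_CALL_GRAPH))
```

The design choice here is to bound the traversal depth, since unbounded expansion would reintroduce the incidental-API problem the referee report raises below.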

Load-bearing premise

The target API sets created by recursively expanding invocation chains from GitHub projects accurately reflect the semantic requirements a developer would state in a natural-language query.

What would settle it

A controlled test in which the same functional task is expressed once as a query and once as code with systematically varied call depths, showing no drop in recommendation accuracy as depth increases.
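The analysis side of such a controlled test reduces to binning per-query outcomes by call depth and comparing accuracy across bins. A sketch with purely illustrative numbers (not the paper's results):

```python
from statistics import mean

def accuracy_by_depth(results):
    """Group per-query hit/miss outcomes by API Call Depth and report mean
    accuracy per depth bucket. `results` is a hypothetical list of
    (depth, hit) pairs such an experiment would produce."""
    buckets = {}
    for depth, hit in results:
        buckets.setdefault(depth, []).append(1.0 if hit else 0.0)
    return {d: mean(v) for d, v in sorted(buckets.items())}

# Illustrative outcomes only: a flat curve here would support the null
# result described above; a falling curve would support the paper.
results = [(1, True), (1, True), (2, True), (2, False), (3, False), (3, False)]
print(accuracy_by_depth(results))
```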

Figures

Figures reproduced from arXiv: 2605.00472 by Shenglong Wu, Tao Wang, Xunhui Zhang.

Figure 1. Example of multi-level invocation of a third-party API
Figure 2. Dataset Construction Pipeline
Figure 3. Performance comparison of baseline methods un…
Figure 4. Performance comparison of baseline methods un…
Figure 5. Semantic similarity across different Depth groups
Figure 6. Semantic similarity across different Density groups
Original abstract

As software systems grow in scale, developers face increasing difficulty in selecting appropriate Application Programming Interfaces (APIs) from numerous options. Efficiently identifying APIs that satisfy functional requirements has become a key challenge. To evaluate the semantic understanding of existing query-based API recommendation methods, this paper constructs Q-ARE (Query-based API Recommendation Evaluation), a dataset based on open-source Java projects from GitHub. Methods and their invocation chains are analyzed to identify third-party APIs directly or indirectly invoked by target methods, recursively expanding multi-level invocations to unify hierarchical call structures into API recommendation target sets. Furthermore, we introduce two metrics: API Call Depth, measuring the invocation distance between a query method and a target API, and Invocation Density, quantifying the proportion of code lines associated with the target API in the invocation chain. Based on Q-ARE, we systematically evaluate several query-based API recommendation methods and general Large Language Models (LLMs). Results show that performance drops significantly as API Call Depth increases and invocation density decreases, indicating that existing methods still struggle with multi-level method invocation structures. Q-ARE and its metrics provide a new benchmark for assessing semantic understanding in API recommendation and offer insights for improving future algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents Q-ARE, a new evaluation dataset for query-based API recommendation constructed from GitHub Java projects. Target sets are formed by recursively expanding invocation chains of methods to include directly and indirectly invoked third-party APIs. New metrics API Call Depth and Invocation Density are introduced to characterize the complexity of these chains. Evaluations of existing recommendation methods and LLMs on Q-ARE demonstrate significant performance degradation with increasing call depth and decreasing density, suggesting limitations in handling multi-level invocations.

Significance. Should the target sets prove to align with genuine developer query intents, the dataset would provide an important benchmark for testing semantic understanding in API recommendation, highlighting areas where current methods and LLMs fall short on complex code structures. The metrics offer quantifiable ways to assess invocation complexity, which could drive targeted improvements in the field.

major comments (2)
  1. [§3] §3 (Dataset Construction): The recursive expansion of invocation chains to form target sets includes all third-party APIs appearing anywhere in the chain without any described validation step (e.g., manual audit, developer ratings, or usage-log comparison) to confirm semantic alignment with the natural-language query. This assumption is load-bearing for the central claim that performance drops indicate struggles with semantic understanding of multi-level structures rather than inclusion of incidental APIs.
  2. [§5] §5 (Evaluation): The reported performance drops as API Call Depth increases and Invocation Density decreases are presented at a high level with no error analysis, failure-mode breakdown, or qualitative examples of queries where targets include non-intended APIs. This makes it difficult to attribute the drops specifically to limitations in semantic handling of hierarchical invocations.
minor comments (1)
  1. The abstract and construction description would benefit from a concrete example showing a sample query, its target method, and the resulting recursively expanded target set to illustrate the unification process.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the opportunity to address the concerns raised regarding dataset construction and evaluation. Below we respond point by point to the major comments.

Point-by-point responses
  1. Referee: [§3] §3 (Dataset Construction): The recursive expansion of invocation chains to form target sets includes all third-party APIs appearing anywhere in the chain without any described validation step (e.g., manual audit, developer ratings, or usage-log comparison) to confirm semantic alignment with the natural-language query. This assumption is load-bearing for the central claim that performance drops indicate struggles with semantic understanding of multi-level structures rather than inclusion of incidental APIs.

    Authors: The Q-ARE dataset is constructed by extracting invocation chains directly from real GitHub Java projects, where the target APIs are those actually invoked by the methods associated with each query. This data-driven approach grounds the targets in observable code behavior rather than curated intent. We acknowledge that the manuscript does not describe an explicit validation step such as manual audit, and that incidental APIs could be included in longer chains. In the revised manuscript we will add a dedicated subsection in §3 discussing this assumption, its implications for the central claims, and illustrative examples of chain expansion to clarify semantic alignment where possible. revision: partial

  2. Referee: [§5] §5 (Evaluation): The reported performance drops as API Call Depth increases and Invocation Density decreases are presented at a high level with no error analysis, failure-mode breakdown, or qualitative examples of queries where targets include non-intended APIs. This makes it difficult to attribute the drops specifically to limitations in semantic handling of hierarchical invocations.

    Authors: We agree that the current evaluation would be strengthened by more granular analysis. In the revised manuscript we will expand §5 to include error analysis across methods and LLMs, a breakdown of failure modes stratified by call depth and density, and qualitative examples of queries (including cases where target sets may contain APIs that are not the most direct match for the query). These additions will better support attribution of performance degradation to challenges with multi-level invocation structures. revision: yes

Circularity Check

0 steps flagged

No circularity: dataset construction and empirical evaluation are self-contained

Full rationale

The paper describes construction of Q-ARE by extracting methods and recursively collecting third-party APIs from their invocation chains in GitHub Java projects, followed by evaluation of existing API recommendation methods and LLMs using two new metrics (API Call Depth, Invocation Density). No equations, parameter fitting, predictions derived from fits, or load-bearing self-citations appear in the provided text. The central claim (performance drop with increasing depth and decreasing density) is an empirical observation on the constructed dataset rather than a derivation that reduces to its own inputs by construction. The work is therefore independent of the circularity patterns listed.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that static analysis of call chains in open-source Java code yields representative ground-truth targets for natural-language API queries; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: Recursive expansion of multi-level method invocations produces accurate and complete sets of third-party APIs that satisfy the functional requirements implied by a query method.
    This assumption underpins the entire target-set construction described in the abstract.

pith-pipeline@v0.9.0 · 5505 in / 1141 out tokens · 44229 ms · 2026-05-09T18:58:27.046073+00:00 · methodology

