Recognition: unknown
Tail-aware N-version Machine Learning Models for Reliable API Recommendation
Pith reviewed 2026-05-07 07:52 UTC · model grok-4.3
The pith
N-version ML models with tail-aware performance profiles can filter unreliable API recommendations to reach an 83.8% true accept rate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NvRec profiles each available ML model on every API method in the training data according to the method's tail properties, then uses those profiles at inference time to exclude unreliable recommendations before applying majority voting restricted to highly reliable candidates. On the public Java benchmark the three-model ensemble of CodeT5, MulaRec and UniXcoder yields the highest true accept rate of 83.8% paired with an 80.7% rejection rate, while the five-model configuration yields 83.1% true accept rate with a lower 69.0% rejection rate under simple majority voting, demonstrating a practical trade-off between accuracy and coverage.
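To make the decision rule above concrete, here is a minimal sketch of the inference-time step, assuming a precomputed profile that maps each model and API method to an offline accuracy, plus an assumed reliability threshold; the names, the per-sequence aggregation, and the threshold value are illustrative assumptions rather than details taken from the paper.

```python
from collections import Counter

# Assumed shape: profile[model_name][api_method] = accuracy observed during
# offline profiling. RELIABILITY_THRESHOLD is a free parameter (assumed value).
RELIABILITY_THRESHOLD = 0.6

def nvrec_decide(candidates, profile, threshold=RELIABILITY_THRESHOLD):
    """candidates: {model_name: recommended API sequence (list of method names)}.
    Keep only candidates whose model is profiled as reliable on every method in
    its sequence, then majority-vote over the survivors. Returns the winning
    sequence, or None to reject the recommendation outright."""
    reliable = []
    for model_name, api_sequence in candidates.items():
        # Methods unseen during profiling default to 0.0, i.e. unreliable (assumption).
        scores = [profile.get(model_name, {}).get(api, 0.0) for api in api_sequence]
        if scores and min(scores) >= threshold:
            reliable.append(tuple(api_sequence))
    if not reliable:
        return None  # no sufficiently reliable candidate: reject
    winner, votes = Counter(reliable).most_common(1)[0]
    # Strict majority among the reliable candidates (assumed reading of
    # "majority voting restricted to highly reliable candidates").
    return list(winner) if votes * 2 > len(reliable) else None
```

With three models, for instance, a suggestion would be accepted only when at least two reliable candidates agree on the same sequence (or a single surviving candidate stands unopposed).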
What carries the argument
The model performance profile, which records each model's accuracy on individual API methods grouped by their frequency in the training distribution and is consulted at inference to decide which model outputs are trustworthy enough to participate in voting.
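A complementary sketch of how such a profile might be assembled offline, assuming per-example predictions from each model on a held-out profiling set; the frequency cutoff used to mark tail APIs and the data layout are assumptions for illustration, not the paper's exact procedure.

```python
from collections import defaultdict

def build_profile(examples, predictions_by_model, tail_cutoff=10):
    """Build a per-model, per-API accuracy profile from a held-out profiling set.

    examples:             list of (context, gold_api_method) pairs
    predictions_by_model: {model_name: [predicted_api_method, ...]}, aligned with examples
    tail_cutoff:          assumed frequency below which a method counts as a tail API
    """
    # How often each gold API method occurs in the profiling data.
    freq = defaultdict(int)
    for _, gold in examples:
        freq[gold] += 1
    tail_apis = {api for api, n in freq.items() if n < tail_cutoff}

    # Per-model accuracy on each individual API method.
    profile = {}
    for model_name, preds in predictions_by_model.items():
        hits, totals = defaultdict(int), defaultdict(int)
        for (_, gold), pred in zip(examples, preds):
            totals[gold] += 1
            hits[gold] += int(pred == gold)
        profile[model_name] = {api: hits[api] / totals[api] for api in totals}
    return profile, tail_apis
```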
If this is right
- A three-model ensemble restricted to highly reliable candidates can achieve a higher rejection rate than a five-model ensemble while preserving nearly the same true accept rate.
- Simple majority voting across all five models reduces the rejection rate to 69.0% while the true accept rate slips only slightly, from 83.8% to 83.1%.
- The specific trio of CodeT5, MulaRec and UniXcoder outperforms other three-model combinations on the Java benchmark when reliability filtering is applied.
- Five-version NvRec under simple voting provides a better overall balance between acceptance and rejection than the three-version configuration.
Where Pith is reading between the lines
- The same profiling-plus-filtering idea could be applied to other code-generation tasks that also suffer from long-tail distributions, such as method-name suggestion or bug-fix patch generation.
- Periodic re-profiling on recent developer code could allow NvRec to adapt when new APIs become popular or old ones fall out of use.
- Integration into an IDE could let the system surface only recommendations that survive the reliability filter, thereby lowering the chance a developer accepts a mismatched API call.
Load-bearing premise
That each model's observed accuracy on tail APIs during offline profiling will continue to predict its reliability when the same model is applied to new code that was not seen during profiling.
What would settle it
Running the same NvRec configurations on a fresh collection of Java projects whose API frequency distribution differs markedly from the original benchmark and checking whether the reported true accept and rejection rates fall below the published figures.
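One way to check that a candidate replication corpus "differs markedly" in its API frequency distribution is to compare how much of its probability mass sits in tail methods; the tail-mass measure and cutoff below are assumptions for illustration, not a protocol from the paper.

```python
from collections import Counter

def tail_mass(api_occurrences, tail_cutoff=10):
    """Fraction of all API-call occurrences that belong to methods observed
    fewer than `tail_cutoff` times (an assumed notion of tail mass)."""
    freq = Counter(api_occurrences)
    total = sum(freq.values())
    return sum(n for n in freq.values() if n < tail_cutoff) / total

# Usage sketch: a large gap between the benchmark and the new project collection
# would indicate a markedly different API frequency distribution.
# shift = abs(tail_mass(new_corpus_apis) - tail_mass(benchmark_apis))
```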
read the original abstract
Machine learning (ML)-based API recommendation helps developers efficiently identify suitable APIs to complement the application code. However, code datasets used to train ML models often exhibit a long-tail distribution, leading to unreliable API recommendations, especially for infrequently used API methods at the tail of the distribution. To address this issue, we propose N-version API Recommendation (NvRec), which leverages N different versions of ML models to enhance the reliability of API sequence recommendations by suppressing unreliable outputs entailing tail APIs. NvRec leverages a set of available ML models and profiles their performance on individual API methods with their tail properties. The generated model profile is used at inference time to filter out unreliable API recommendations and determine the final output. We implement NvRec using five API recommendation models, including CodeBERT, CodeT5, MulaRec, UniXcoder, and CodeT5+, and evaluate it on a public benchmark dataset constructed from compilable Java projects. For the three-version NvRec, we find that the combination of CodeT5, MulaRec, and UniXcoder achieves the highest true accept rate of 83.8%, with a rejection rate of 80.7%, when majority voting is restricted to highly reliable candidates. In contrast, the five-version configuration achieves its highest true accept rate of 83.1% with simple majority voting, while reducing the rejection rate to 69.0%. Overall, the five-version configuration offers a better balance between true accept rate and rejection rate.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes NvRec, an N-version programming approach for ML-based API recommendation that builds offline performance profiles for each model (CodeBERT, CodeT5, MulaRec, UniXcoder, CodeT5+) on individual APIs with emphasis on tail properties from a public Java benchmark, then applies these profiles at inference to suppress unreliable outputs via majority voting restricted to reliable candidates. On the benchmark, the three-model ensemble (CodeT5 + MulaRec + UniXcoder) reaches 83.8% true accept rate at 80.7% rejection under restricted voting, while the five-model configuration reaches 83.1% true accept rate at 69.0% rejection under simple majority voting.
Significance. If the offline profiles are shown to be strictly disjoint from evaluation data and predictive on unseen code, the work would offer a practical, training-free way to improve reliability of API recommenders in long-tail settings, which is a known pain point in software engineering ML tools. Strengths include the use of multiple off-the-shelf models on a public benchmark and the reporting of concrete trade-off numbers between true accept rate and rejection rate. The absence of partitioning details and validation experiments, however, prevents assessing whether the gains reflect genuine reliability improvements or benchmark-specific artifacts.
major comments (2)
- [Abstract and Evaluation] Abstract and Evaluation section: the manuscript reports headline numbers (83.8% true accept rate at 80.7% rejection for the three-version system) but supplies no description of the train/profile/test partitioning used to build the per-API performance profiles. Without explicit confirmation that profiling data is disjoint from the instances used to compute the reported metrics, the filtering step risks data leakage and cannot be guaranteed to operate as described at true inference time on new code.
- [Evaluation] Evaluation section: no cross-validation results, no out-of-distribution experiments, and no analysis of whether the benchmark's tail distribution and context patterns match real developer usage are provided. The central claim that offline tail-aware profiles remain predictive of reliability at inference therefore rests on an untested assumption; this is load-bearing for the assertion that NvRec improves reliability rather than re-weighting toward well-covered APIs in the evaluation set.
minor comments (3)
- [Abstract] Abstract: the terms 'true accept rate' and 'rejection rate' are used without explicit definitions or formulas; a short paragraph or equation clarifying their computation (e.g., how 'highly reliable candidates' are identified) would improve clarity. One plausible reading of the metrics is sketched after this list.
- [Abstract] Abstract: the three-version and five-version configurations employ different voting strategies ('majority voting restricted to highly reliable candidates' vs. 'simple majority voting'); the rationale for this difference and the precise reliability threshold should be stated.
- [Evaluation] Evaluation: individual model performances on the benchmark (especially on tail APIs) are not reported as baselines; adding a table comparing single-model true accept rates against the N-version results would strengthen the comparison.
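Since, as the first minor comment notes, the manuscript does not define these metrics, the following is only one plausible reading of how they could be computed from per-instance accept/reject decisions; it is an assumption for illustration, not the paper's definition.

```python
def rates(decisions):
    """decisions: list of (accepted, correct) booleans, one pair per test instance.
    One assumed reading of the metrics:
      rejection rate   = rejected instances / all instances
      true accept rate = correct accepted recommendations / all accepted recommendations
    """
    accepted = [correct for was_accepted, correct in decisions if was_accepted]
    rejection_rate = 1 - len(accepted) / len(decisions)
    true_accept_rate = sum(accepted) / len(accepted) if accepted else 0.0
    return true_accept_rate, rejection_rate
```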
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the partitioning details and validation of the offline profiles in NvRec. We address each major comment below and indicate the revisions we will make to the manuscript.
read point-by-point responses
- Referee: [Abstract and Evaluation] Abstract and Evaluation section: the manuscript reports headline numbers (83.8% true accept rate at 80.7% rejection for the three-version system) but supplies no description of the train/profile/test partitioning used to build the per-API performance profiles. Without explicit confirmation that profiling data is disjoint from the instances used to compute the reported metrics, the filtering step risks data leakage and cannot be guaranteed to operate as described at true inference time on new code.
Authors: We agree that the current manuscript does not provide an explicit description of the train/profile/test partitioning. The per-API performance profiles were generated offline on a dedicated profiling subset drawn from the public Java benchmark that is strictly disjoint from the test instances used to compute the reported true accept and rejection rates. This separation was enforced to ensure the reliability filters operate without leakage at inference time. We will revise the Evaluation section to include a clear description of the partitioning (including split ratios and confirmation of disjoint sets), along with a workflow diagram illustrating how profiles are built and applied (a sketch of such a disjoint split follows these responses). revision: yes
- Referee: [Evaluation] Evaluation section: no cross-validation results, no out-of-distribution experiments, and no analysis of whether the benchmark's tail distribution and context patterns match real developer usage are provided. The central claim that offline tail-aware profiles remain predictive of reliability at inference therefore rests on an untested assumption; this is load-bearing for the assertion that NvRec improves reliability rather than re-weighting toward well-covered APIs in the evaluation set.
Authors: We concur that cross-validation, out-of-distribution experiments, and direct comparison of the benchmark's tail distribution against real developer usage patterns would provide stronger support for the generalizability of the offline profiles. The present evaluation demonstrates the practical trade-offs achieved by NvRec on the standard public Java benchmark. We will add a Limitations subsection discussing the reliance on the benchmark's observed tail properties and the assumption of profile predictiveness on unseen code, while outlining directions for future OOD and cross-validation studies. The reported gains remain valid within the benchmark setting used. revision: partial
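A minimal sketch of the kind of disjoint split the first response describes, assuming the partition is made at project granularity so no project contributes code to both the profiling and test sets; the split ratio, the project-level grouping, and the example schema are assumptions, not details confirmed by the paper.

```python
import random

def split_by_project(examples, profile_ratio=0.5, seed=0):
    """Partition examples into disjoint profiling and test sets by project,
    so per-API profiles are never built on code that is later evaluated.

    examples: list of dicts with at least a 'project' key (assumed schema).
    """
    projects = sorted({ex["project"] for ex in examples})
    random.Random(seed).shuffle(projects)
    cut = int(len(projects) * profile_ratio)
    profiling_projects = set(projects[:cut])
    profiling_set = [ex for ex in examples if ex["project"] in profiling_projects]
    test_set = [ex for ex in examples if ex["project"] not in profiling_projects]
    # Sanity check: no project appears on both sides of the split.
    assert not ({ex["project"] for ex in profiling_set}
                & {ex["project"] for ex in test_set})
    return profiling_set, test_set
```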
Circularity Check
No circularity: empirical evaluation on an external benchmark, with no derivations that force the reported metrics by construction
full rationale
The paper describes an empirical N-version system (NvRec) that profiles off-the-shelf models (CodeBERT, CodeT5, etc.) on per-API performance including tail properties from a public Java benchmark dataset, then applies majority voting plus reliability filtering at inference. Reported figures such as 83.8% true accept rate at 80.7% rejection are direct outcome metrics from running the method on that dataset; no equations, fitted parameters, or self-citations are invoked that define or force these metrics by construction from the same experiment's inputs. The evaluation remains self-contained against the external benchmark and models, satisfying the default expectation of non-circularity.
Axiom & Free-Parameter Ledger
free parameters (2)
- N (number of model versions)
- Reliability threshold for candidate filtering
axioms (1)
- domain assumption: Majority agreement among models known to perform well on a given API improves recommendation reliability