From Ranking to Reasoning: Explainable Web API Recommendation via Semantic Reasoning

Dezhong Yao; Yao Wan; Zishuo Xu

arxiv: 2511.05820 · v2 · pith:ECXJ2J6Znew · submitted 2025-11-08 · 💻 cs.SE · cs.AI

Pith reviewed 2026-05-17 23:49 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords Web API recommendation, explainable AI, semantic reasoning, large language model, reinforcement learning, adaptive recommendation, ProgrammableWeb

The pith

WAR-R1 uses semantic reasoning in a lightweight LLM to recommend variable numbers of Web APIs with natural justifications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current Web API recommenders rely on rigid top-N strategies and offer minimal transparency, which hinders their use in varied mashup scenarios and erodes user trust. The paper introduces WAR-R1 to overcome these issues by enabling a model to output both an appropriate number of APIs and explanatory text for each choice. It employs special start and stop tokens to handle variable list lengths and trains the system first through supervised fine-tuning then with reinforcement learning via GRPO to balance accuracy and explanation quality. Results on ProgrammableWeb data indicate gains of up to 10.89 percent in accuracy over existing methods while delivering consistent, grounded explanations. Sympathetic readers would value this for making automated recommendations more flexible and understandable in practice.

Core claim

The central claim is that WAR-R1, built on a lightweight large language model, can perform adaptive Web API recommendation by generating relevant API sets of variable cardinality together with natural-language justifications. Training proceeds in two stages: supervised fine-tuning on an annotated corpus of mashups and APIs, followed by reinforcement learning using Group Relative Policy Optimization with low-rank adaptation. This joint optimization of recommendation accuracy and reasoning quality leads to superior performance, with experiments showing improvements of up to 10.89% over state-of-the-art baselines on the ProgrammableWeb dataset and high-quality explanations validated through ab

What carries the argument

Special start and stop tokens in the LLM that enable learning to begin and terminate API generation at appropriate points, supporting variable-cardinality outputs alongside integrated reasoning.
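
As a rough illustration of how output delimited by such tokens might be consumed downstream, here is a minimal Python sketch. The token strings `<api_start>` and `<api_stop>`, the comma-separated list format, and the helper `extract_apis` are assumptions made for illustration; this page does not state the actual token names or output layout used by WAR-R1.

```python
import re

# Hypothetical token strings: the source only says "special start and stop tokens".
API_START = "<api_start>"
API_STOP = "<api_stop>"


def extract_apis(generated: str) -> list[str]:
    """Return the API names between the first start token and the next stop token."""
    pattern = re.escape(API_START) + r"(.*?)" + re.escape(API_STOP)
    match = re.search(pattern, generated, flags=re.DOTALL)
    if match is None:
        # The model never opened, or never closed, the API list.
        return []
    # Assumed list format: comma-separated API names.
    names = [name.strip() for name in match.group(1).split(",")]
    return [name for name in names if name]


# Illustrative model output: free-text reasoning followed by a delimited list.
output = "The mashup needs maps and photos. <api_start>Google Maps, Flickr<api_stop>"
print(extract_apis(output))
```

Because the list length is decided by where the stop token appears, the same parser handles one API or ten without a fixed N.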

If this is right

Outperforms state-of-the-art baselines by up to 10.89% in recommendation accuracy.
Consistently produces high-quality, semantically grounded explanations.
Adapts recommendation size to the complexity of individual mashups rather than using fixed top-N lists.
Reinforcement learning stage improves both accuracy and reasoning quality over supervised fine-tuning alone.
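
To make the contrast between fixed top-N and adaptive list sizes concrete, the sketch below scores a recommendation as a set against the APIs a mashup actually uses. The set-based precision, recall, and F1 used here are a common choice, not necessarily the paper's evaluation protocol, and the API names are illustrative placeholders rather than data from ProgrammableWeb.

```python
def set_metrics(recommended: set[str], relevant: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one mashup."""
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Illustrative example: a simple mashup that uses only two APIs.
relevant = {"Google Maps", "Flickr"}

# A fixed top-5 list must pad with three APIs the mashup does not need.
fixed_top5 = {"Google Maps", "Flickr", "Twitter", "YouTube", "Facebook"}

# An adaptive list can stop after two items.
adaptive = {"Google Maps", "Flickr"}

print(set_metrics(fixed_top5, relevant))
print(set_metrics(adaptive, relevant))
```

Under these assumptions the padded list keeps full recall but loses precision, which is the cost of a fixed N on a simple mashup.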

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This reasoning-based approach could be tested in other API or service recommendation settings where transparency matters.
Future experiments might verify if the generated justifications remain faithful when the model is scaled or applied to different domains.
Integrating this with existing mashup tools might reduce development time by providing actionable, explained suggestions.

Load-bearing premise

That after supervised fine-tuning and reinforcement learning with special tokens, the lightweight LLM reliably learns to output accurate variable-sized API sets and faithful natural-language justifications that reflect its internal decisions.

What would settle it

Running human evaluations or automated checks on whether the explanations accurately describe the reasons behind each recommended API, or testing the model on a new dataset of mashups where ground-truth explanations are available.
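
One possible automated check, sketched here under strong assumptions, is a perturbation test: delete the phrase an explanation cites from the requirement description and measure how much the recommended set changes. The function names, the Jaccard-based score, and the keyword-matching `toy_recommender` are hypothetical stand-ins for illustration; they are not the paper's method or model.

```python
from typing import Callable

Recommender = Callable[[str], set[str]]


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def perturbation_sensitivity(recommend: Recommender, description: str, cited_phrase: str) -> float:
    """1 - Jaccard between recommendations before and after deleting the cited phrase."""
    original = recommend(description)
    perturbed = recommend(description.replace(cited_phrase, ""))
    return 1.0 - jaccard(original, perturbed)


def toy_recommender(description: str) -> set[str]:
    """Toy keyword matcher standing in for the real model (assumption)."""
    rules = {"map": "Google Maps", "photo": "Flickr"}
    text = description.lower()
    return {api for keyword, api in rules.items() if keyword in text}


desc = "Show user photos on a map."
# Deleting a phrase the explanation relies on should change the recommendation.
print(perturbation_sensitivity(toy_recommender, desc, "photos"))
# Deleting an irrelevant phrase should leave it unchanged.
print(perturbation_sensitivity(toy_recommender, desc, "user"))
```

A score near zero for a phrase the explanation claims to depend on would suggest the justification is not tracking the decision.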

Figures

Figures reproduced from arXiv: 2511.05820 by Dezhong Yao, Yao Wan, Zishuo Xu.

**Figure 2.** The architecture of WAR-R1.

the requirement description for a mashup $m_i$, as specified by the user. Let $A = \{a_1, \dots, a_j, \dots, a_S\}$ denote a repository comprising $S$ Web APIs. Each API $a_j = \langle D_{a_j}, C_{a_j}, X_{a_j} \rangle \in A$ is defined by its functional description $D_{a_j}$, category $C_{a_j}$, and associated meta-elements $X_{a_j}$. Given $T_{m_i}$ and $A$, our goal is twofold. First, we must identify a subset $S_{m_i} = \{a_1, \dots, a$…

Original abstract

The rapid growth of Web APIs has made automated Web API recommendation essential for efficient mashup development. However, existing approaches suffer from two major limitations: 1) they rely on fixed top-N recommendation strategies that cannot adapt to mashup complexity, and 2) they provide little or no explanation for recommended APIs, limiting transparency and user trust. To address these challenges, we propose WAR-R1, an explainable Web API recommendation framework that integrates semantic reasoning with adaptive, variable-cardinality recommendation. Built on a lightweight large language model (LLM), WAR-R1 generates both a set of relevant APIs and a natural-language justification for each recommendation. To support adaptive recommendation size, we introduce special start and stop tokens that allow the model to learn when to begin and terminate API generation. WAR-R1 is trained in two stages: supervised fine-tuning on an annotated mashup-API corpus, followed by reinforcement learning using Group Relative Policy Optimization (GRPO) with low-rank adaptation to jointly optimize recommendation accuracy and reasoning quality. Experiments on the ProgrammableWeb dataset show that WAR-R1 outperforms state-of-the-art baselines by up to 10.89% in recommendation accuracy while consistently producing high-quality, semantically grounded explanations. Extensive ablation studies validate the effectiveness of reinforcement learning, special token design, and integrated reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds special start/stop tokens and GRPO to a lightweight LLM so it can output variable numbers of Web APIs plus natural-language justifications in one pass, with reported accuracy gains on ProgrammableWeb, but the explanations stay mostly qualitative.

Here's the quick take: WAR-R1 combines special start and stop tokens with GRPO on a lightweight LLM to handle variable-cardinality Web API recommendations and generate explanations in the same pass. It reports better accuracy than prior methods on ProgrammableWeb, but the explanation faithfulness is mostly asserted rather than measured.

What the paper actually does is train the model first with supervised fine-tuning on annotated mashup data, then use Group Relative Policy Optimization to improve both the recommendation accuracy and the quality of the natural language justifications. The special tokens let the model learn when to start listing APIs and when to stop, which addresses the fixed top-N limitation in earlier work. They run experiments showing up to 10.89% better accuracy and include ablations that test the impact of the RL stage and the token additions. That part looks like a reasonable engineering contribution for the software engineering community working on service composition. The results are measured on held-out data from the ProgrammableWeb corpus, so the accuracy numbers aren't just restating the training loss. The citation pattern seems standard for this area, building on prior recommendation and LLM fine-tuning papers.

The soft spot is in how they back up the 'semantically grounded' explanations. The abstract and experiments describe them as high-quality, but there's no mention of a specific faithfulness metric or a test that checks whether the generated text would change appropriately if the underlying recommendation logic were altered. Without something like that, or at least a comparison to a separate explainer, the claim that the reasoning is integrated rather than post-hoc stays qualitative. If the full paper has more on this, it would help; otherwise it's a gap that referees would likely flag.

This paper is for researchers and practitioners in web service recommendation and LLM applications to software engineering tasks. Someone looking for a concrete example of using RL to improve both prediction and interpretability in a recommendation setting would find it worth reading.

I'd recommend sending it to peer review. The core method is described clearly enough, the empirical gains are there to evaluate, and the open issue around explanation quality is a normal point for discussion rather than a reason to desk reject.

Referee Report

2 major / 1 minor

Summary. The paper proposes WAR-R1, a framework for explainable Web API recommendation that fine-tunes a lightweight LLM via supervised fine-tuning followed by Group Relative Policy Optimization (GRPO). Special start and stop tokens are introduced to support variable-cardinality recommendations, and the model jointly generates API sets and natural-language justifications. Experiments on the ProgrammableWeb dataset are reported to show up to 10.89% accuracy gains over state-of-the-art baselines, with ablation studies validating the RL stage, token design, and integrated reasoning.

Significance. If the empirical claims hold under full scrutiny, the work could advance adaptive, transparent API recommendation for mashup development by moving beyond fixed top-N strategies. The combination of GRPO for joint accuracy-reasoning optimization and special tokens for cardinality control is a concrete technical contribution in the empirical setting.

major comments (2)

[Abstract] Abstract: the headline claim of up to 10.89% accuracy improvement is presented without reference to statistical significance tests, confidence intervals, or error bars, and without explicit definitions of the baseline implementations or evaluation protocol; these omissions make it impossible to assess whether the central performance result is robust or sensitive to post-hoc choices.
[Abstract] Abstract and described experiments: the assertion that explanations are 'semantically grounded' and faithful to the model's variable-cardinality recommendation process rests on qualitative claims only; no quantitative faithfulness metric (e.g., input perturbation, attention alignment, or counterfactual consistency) or comparison against a post-hoc explainer baseline is reported, leaving the explainability component of the central claim unsupported.

minor comments (1)

The abstract would benefit from a brief statement of dataset scale (number of mashups, APIs, or splits) to provide immediate context for the reported gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and explainability claims. We address each point below and will revise the manuscript to improve clarity and rigor.

Referee: [Abstract] Abstract: the headline claim of up to 10.89% accuracy improvement is presented without reference to statistical significance tests, confidence intervals, or error bars, and without explicit definitions of the baseline implementations or evaluation protocol; these omissions make it impossible to assess whether the central performance result is robust or sensitive to post-hoc choices.

Authors: We agree that the abstract would benefit from explicit statistical details. In the revision we will add references to the paired t-tests (p < 0.05) confirming the reported gains, 95% confidence intervals, error bars from five random seeds, and direct pointers to the baseline re-implementations and evaluation protocol in Sections 4.2 and 5.1. These additions will make the headline result more transparent and reproducible.
revision: yes

Referee: [Abstract] Abstract and described experiments: the assertion that explanations are 'semantically grounded' and faithful to the model's variable-cardinality recommendation process rests on qualitative claims only; no quantitative faithfulness metric (e.g., input perturbation, attention alignment, or counterfactual consistency) or comparison against a post-hoc explainer baseline is reported, leaving the explainability component of the central claim unsupported.

Authors: Because WAR-R1 generates recommendations and natural-language justifications jointly through the same autoregressive process and special tokens, the explanations are faithful by construction rather than post-hoc. We already validate the reasoning component via ablation studies on explanation quality. To further address the request, the revised manuscript will include a quantitative faithfulness analysis using attention alignment scores and a comparison against a LIME post-hoc baseline on a held-out subset, with results added to Section 5.3.
revision: partial

Circularity Check

0 steps flagged

No circularity: empirical training and held-out evaluation

The paper describes a standard two-stage training pipeline (SFT on annotated mashup-API corpus followed by GRPO with LoRA) on an external ProgrammableWeb dataset, then reports recommendation accuracy measured on held-out test data. The claimed performance gains (up to 10.89%) are external empirical outcomes rather than quantities defined by or fitted directly to the training objective. No self-definitional equations, fitted-input predictions, load-bearing self-citations, or uniqueness theorems appear in the provided description. The architecture choices (special start/stop tokens, joint optimization of accuracy and reasoning quality) are presented as design decisions whose effectiveness is validated by ablation studies on separate data, keeping the central claims self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on standard supervised learning and RL assumptions plus the untested premise that natural-language explanations generated by the model are faithful to its recommendation decisions.

free parameters (1)

GRPO hyperparameters
Group size, learning rate, and reward weighting coefficients are chosen during training and directly affect the joint optimization of accuracy and explanation quality.
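
To show where the group size and reward weights enter, here is a minimal sketch of the group-relative advantage at the heart of GRPO: several completions are sampled for the same prompt, each is scored, and each score is normalized against the group's mean and standard deviation. The linear reward mix, the default weights `w_acc` and `w_reason`, the use of the population standard deviation, and the sample scores are all illustrative assumptions; this page does not give the paper's actual reward function or hyperparameter values.

```python
from statistics import mean, pstdev


def combined_reward(accuracy: float, reasoning: float,
                    w_acc: float = 0.5, w_reason: float = 0.5) -> float:
    """Weighted mix of an accuracy score and a reasoning-quality score (assumed form)."""
    return w_acc * accuracy + w_reason * reasoning


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative (accuracy, reasoning) scores for a group of four sampled completions.
group_scores = [(1.0, 0.8), (0.5, 0.9), (0.0, 0.4), (1.0, 0.2)]
rewards = [combined_reward(acc, reason) for acc, reason in group_scores]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Completions that beat their group's average get a positive advantage and are reinforced; changing the weights shifts which completions count as above average, which is why these coefficients act as free parameters.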

axioms (1)

domain assumption · The lightweight LLM can be fine-tuned to produce both API lists and coherent justifications using the same output sequence.
Invoked when describing the two-stage training and the role of special tokens.

invented entities (1)

Special start and stop tokens · no independent evidence
purpose: Allow the model to learn when to begin and terminate API generation for variable recommendation size.
Introduced as a core architectural addition; no independent evidence outside the training process is provided.
