R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling
Pith reviewed 2026-05-10 00:26 UTC · model grok-4.3
The pith
R2IF aligns LLM reasoning processes with tool-call decisions through a composite reward optimized via GRPO.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
R2IF is a reasoning-aware reinforcement learning method that defines a composite reward from format and correctness constraints, Chain-of-Thought Effectiveness Reward (CER), and Specification-Modification-Value (SMV) reward; this signal is then used to optimize the policy with GRPO so that the model's internal reasoning steps become directly supportive of its final tool-call decisions.
What carries the argument
The composite reward that combines format/correctness constraints, Chain-of-Thought Effectiveness Reward (CER), and Specification-Modification-Value (SMV) reward to drive GRPO policy updates toward reasoning-decision consistency.
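As a concrete illustration of how such a composite reward could be wired together, here is a minimal Python sketch. The weights, the format/correctness checks, and the treatment of CER and SMV as externally supplied scores are all assumptions for illustration; the paper's actual formulation is not published in the material reviewed here.

```python
# Hypothetical sketch of a composite reward as a weighted sum.
# The weights (w_fmt, w_corr, w_cer, w_smv) and the component
# implementations below are illustrative assumptions, not the paper's.

def format_reward(response: str) -> float:
    """1.0 if the response contains a parseable tool call, else 0.0 (assumed check)."""
    return 1.0 if "<tool_call>" in response else 0.0

def correctness_reward(call: dict, gold: dict) -> float:
    """Exact match on function name and arguments (assumed criterion)."""
    return 1.0 if call == gold else 0.0

def composite_reward(response, call, gold, cer_score, smv_score,
                     w_fmt=0.1, w_corr=0.6, w_cer=0.2, w_smv=0.1):
    """Weighted sum of the four reward terms named in the abstract.
    cer_score / smv_score stand in for the (unpublished) CER and SMV
    scorers; the weights are placeholders."""
    return (w_fmt * format_reward(response)
            + w_corr * correctness_reward(call, gold)
            + w_cer * cer_score
            + w_smv * smv_score)
```

In this framing, GRPO would maximize `composite_reward` over sampled rollouts; the open question the review raises is whether the CER/SMV terms actually constrain the reasoning or merely add benchmark-correlated signal.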
If this is right
- Function-calling accuracy rises by up to 34.62% over baselines on BFCL for models such as Llama3.2-3B.
- Average CoT Effectiveness becomes positive, reaching 0.05 for Llama3.2-3B.
- Both accuracy and the interpretability of reasoning improve together.
- Tool-augmented LLM systems become more dependable for real deployment.
Where Pith is reading between the lines
- The same reward structure might help alignment in other LLM tasks that require step-by-step reasoning before an action.
- Explicit scoring of reasoning effectiveness could reduce cases where a model reaches the right answer for the wrong internal reason.
- Developers could inspect the CER component at inference time to decide whether to trust a particular tool call.
Load-bearing premise
That the composite reward genuinely forces reasoning steps to determine the tool-call decision rather than simply teaching the model to score well on the chosen benchmarks.
What would settle it
Measuring whether R2IF-trained models still produce correct tool calls when their chain-of-thought reasoning is deliberately made less effective or contradictory on held-out tasks.
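A minimal sketch of what such a counterfactual probe could look like, assuming a hypothetical `model_call(prompt, cot)` interface; the perturbation (shuffling CoT steps) and the sensitivity metric are illustrative choices, not the paper's protocol.

```python
import random

def shuffle_cot(cot_steps):
    """Perturbation: destroy the logical order of the reasoning steps
    while keeping their content intact."""
    steps = list(cot_steps)
    random.shuffle(steps)
    return steps

def cot_sensitivity(model_call, prompts, rng_seed=0):
    """Fraction of prompts whose tool call CHANGES when the CoT is
    perturbed. model_call(prompt, cot) -> tool-call string is a
    hypothetical interface. If reasoning genuinely determines the
    decision, sensitivity should be high; a model that ignores its
    own CoT scores near zero."""
    random.seed(rng_seed)
    changed = 0
    for prompt, cot in prompts:
        original = model_call(prompt, cot)
        perturbed = model_call(prompt, shuffle_cot(cot))
        changed += (original != perturbed)
    return changed / len(prompts)
```

Running this on held-out tasks would separate models whose reasoning is load-bearing from those whose CoT is decorative.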
Original abstract
Function calling empowers large language models (LLMs) to interface with external tools, yet existing RL-based approaches suffer from misalignment between reasoning processes and tool-call decisions. We propose R2IF, a reasoning-aware RL framework for interpretable function calling, adopting a composite reward integrating format/correctness constraints, Chain-of-Thought Effectiveness Reward (CER), and Specification-Modification-Value (SMV) reward, optimized via GRPO. Experiments on BFCL/ACEBench show R2IF outperforms baselines by up to 34.62% (Llama3.2-3B on BFCL) with positive Average CoT Effectiveness (0.05 for Llama3.2-3B), enhancing both function-calling accuracy and interpretability for reliable tool-augmented LLM deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes R2IF, a reasoning-aware RL framework for interpretable LLM function calling. It introduces a composite reward that combines format/correctness constraints with two new components—Chain-of-Thought Effectiveness Reward (CER) and Specification-Modification-Value (SMV) reward—optimized via GRPO. Experiments on BFCL and ACEBench report that R2IF outperforms baselines by up to 34.62% (Llama3.2-3B on BFCL) while achieving a positive Average CoT Effectiveness score (0.05 for Llama3.2-3B), with the goal of improving both accuracy and interpretability for tool-augmented LLMs.
Significance. If the composite reward demonstrably produces causal alignment between interpretable CoT reasoning and tool-call decisions (rather than benchmark optimization), the framework could support more reliable deployment of tool-using LLMs. The explicit focus on interpretability via CER/SMV and the use of GRPO are constructive elements. However, the absence of independent validation of the alignment mechanism limits the immediate significance for the field.
major comments (3)
- [Abstract] The reported performance gains (up to 34.62%) and positive Average CoT Effectiveness (0.05) are presented without any description of baselines, statistical tests, reward-component weights, or the precise definitions and computation of CER and SMV, rendering the central claim of reasoning-decision alignment unevaluable from the provided information.
- [Methods] (reward formulation) CER and SMV are defined internally to the optimization loop; without an explicit statement that their computation is independent of the evaluation metrics, and without an ablation that isolates alignment from accuracy, the risk remains that GRPO simply shapes the policy toward benchmark-correlated behavior rather than genuine interpretability.
- [Experiments] No counterfactual analysis, human judgment of reasoning quality, or ablation removing CER/SMV while retaining format/correctness rewards is described, leaving open the possibility that the observed gains arise from metric gaming rather than the claimed alignment.
minor comments (2)
- Clarify the exact numerical weights used for the three reward terms and whether they were tuned on held-out data.
- Provide the precise algorithmic description of GRPO and its relation to standard PPO or GRPO variants in the literature.
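For context on the second minor comment: GRPO, as introduced in DeepSeekMath, differs from PPO chiefly in replacing the learned value baseline with a group-relative one. A minimal sketch of the standard group-advantage computation follows; this is the textbook formulation, not necessarily the paper's exact variant.

```python
def grpo_advantages(group_rewards):
    """GRPO samples G responses per prompt and normalizes each response's
    reward by the group's mean and standard deviation, using that as the
    advantage in a PPO-style clipped objective (DeepSeekMath-style)."""
    g = len(group_rewards)
    mean = sum(group_rewards) / g
    var = sum((r - mean) ** 2 for r in group_rewards) / g
    std = var ** 0.5
    if std == 0:  # all rewards equal: no learning signal for this group
        return [0.0] * g
    return [(r - mean) / std for r in group_rewards]
```

With a composite reward, each response's scalar reward would be the weighted sum of its components before this normalization step.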
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive feedback on our work. We address each of the major comments point by point below, providing clarifications and indicating the changes we will make to the manuscript in the revised version.
read point-by-point responses
-
Referee: [Abstract] The reported performance gains (up to 34.62%) and positive Average CoT Effectiveness (0.05) are presented without any description of baselines, statistical tests, reward-component weights, or the precise definitions and computation of CER and SMV, rendering the central claim of reasoning-decision alignment unevaluable from the provided information.
Authors: We agree that the abstract would benefit from additional context to allow readers to better evaluate the claims. In the revised manuscript, we will expand the abstract to include: (1) a brief description of the baselines (standard fine-tuning and non-reasoning-aware RL approaches), (2) mention that performance is reported as averages over 5 random seeds with statistical significance confirmed via paired t-tests (p < 0.05), (3) the reward component weights used in the composite reward (as specified in Section 3.2), and (4) concise definitions of CER (which quantifies how effectively the Chain-of-Thought reasoning contributes to correct function call decisions) and SMV (which evaluates the utility of modifications to tool specifications informed by reasoning). The detailed computation methods for CER and SMV are provided in the Methods section. These additions will make the central claims more readily evaluable. revision: yes
-
Referee: [Methods] (reward formulation) CER and SMV are defined internally to the optimization loop; without an explicit statement that their computation is independent of the evaluation metrics, and without an ablation that isolates alignment from accuracy, the risk remains that GRPO simply shapes the policy toward benchmark-correlated behavior rather than genuine interpretability.
Authors: This is a valid concern regarding potential circularity in the reward design. We will add an explicit statement in the revised Methods section clarifying that CER is computed by assessing the internal consistency and predictive power of the CoT steps toward the tool call, using a formulation that does not rely on the external benchmark evaluation metrics. SMV similarly operates on the agent's specification adjustments and their estimated impact, independent of final accuracy scores. To further demonstrate that the improvements stem from alignment rather than gaming, we will include a new ablation experiment in the revised paper, comparing R2IF to a baseline using only format and correctness rewards. This will isolate the contribution of CER and SMV to both accuracy and the positive CoT Effectiveness score. revision: yes
-
Referee: [Experiments] No counterfactual analysis, human judgment of reasoning quality, or ablation removing CER/SMV while retaining format/correctness rewards is described, leaving open the possibility that the observed gains arise from metric gaming rather than the claimed alignment.
Authors: We recognize the importance of these additional validations for strengthening the evidence of causal alignment. We will incorporate an ablation study as described in the response to the Methods comment, which directly addresses removing CER/SMV. Additionally, we will add a counterfactual analysis by systematically altering the CoT reasoning and observing the resulting changes in decision accuracy and CoT Effectiveness. However, a full human judgment study of reasoning quality would require substantial new resources and time; we will instead emphasize the Average CoT Effectiveness metric (which is positive at 0.05) as an automated proxy for interpretability and discuss its correlation with performance gains. These changes will help mitigate concerns about metric gaming. revision: partial
- Not addressed: full human evaluation of reasoning quality, due to the significant additional effort and expertise required beyond the scope of this work.
Circularity Check
No significant circularity; derivation relies on external RL optimization and benchmark evaluation
full rationale
The paper defines a composite reward (format/correctness + CER + SMV) and optimizes it via GRPO, then evaluates on BFCL/ACEBench with reported gains and Average CoT Effectiveness. No equations or definitions in the provided abstract reduce the central claim to a tautology or self-fit; CER and SMV are presented as novel components whose effectiveness is measured against independent benchmarks rather than being redefined as the output metric. Self-citations are not load-bearing for the core alignment claim, and no uniqueness theorem or ansatz is imported from prior author work to force the result. The derivation chain remains self-contained against external evaluation.
Axiom & Free-Parameter Ledger
free parameters (1)
- Composite reward weights
axioms (1)
- domain assumption: GRPO reliably optimizes the composite reward toward reasoning-decision alignment
invented entities (2)
- Chain-of-Thought Effectiveness Reward (CER): no independent evidence
- Specification-Modification-Value (SMV) reward: no independent evidence
Reference graph
Works this paper leans on
- [3] Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Yuefeng Huang, Wulong Liu, Xinzhi Wang, Defu Lian, Baoqun Yin, Yasheng Wang, and Wu Liu. 2025. ACEBench: Who wins the match point in tool usage? Preprint, arXiv:2501.12851. https://arxiv.org/abs/2501.12851
- [7] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let's verify step by step. Preprint, arXiv:2305.20050. https://arxiv.org/abs/2305.20050
- [10] OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, and 401 others. 2024. GPT-4o system card. Preprint, arXiv:2410.21276. https://arxiv.org/abs/2410.21276
- [11] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. Preprint, arXiv:2203.02155. https://arxiv.org/abs/2203.02155
- [12] Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. 2025. The Berkeley Function Calling Leaderboard (BFCL): From tool use to agentic evaluation of large language models. In Proceedings of the 42nd International Conference on Machine Learning (ICML). https://openreview.net/forum?id=2GmDdhBdDk
- [14] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. Preprint, arXiv:1707.06347. https://arxiv.org/abs/1707.06347
- [18] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models. Preprint, arXiv:2201.11903. https://arxiv.org/abs/2201.11903
- [21] APIGen: Automated pipeline for generating verifiable and diverse function-calling datasets. 2024. Preprint, arXiv:2406.18518.
- [22] Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, et al. 2024. ToolACE: Winning the points of LLM function calling. Preprint, arXiv:2409.00920.
- [23] Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. 2025. The Berkeley Function Calling Leaderboard (BFCL).
- [24] DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. 2025. Preprint, arXiv:2501.12948.
- [25] ToolRL: Reward is all tool learning needs. 2025. Preprint, arXiv:2504.13958.
- [26] Nemotron-Research-Tool-N1: Exploring tool-using language models with reinforced reasoning. 2025. Preprint, arXiv:2505.00024.
- [27] DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. 2024. Preprint, arXiv:2402.03300.
- [28] Proximal policy optimization algorithms. 2017. Preprint, arXiv:1707.06347.
- [29] Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS).
- [30] Let's verify step by step. 2023. Preprint, arXiv:2305.20050.
- [31] Inducing faithfulness in structured reasoning via counterfactual sensitivity. 2025.
- [32] ACEBench: Who wins the match point in tool usage? 2025. Preprint, arXiv:2501.12851.
- [33] HammerBench: Fine-grained function-calling evaluation in real mobile device scenarios. Preprint, arXiv:2412.16516.
- [34] FunReason: Enhancing large language models' function calling via self-refinement multiscale loss and automated data refinement. Preprint, arXiv:2505.20192.
- [35] Chain-of-thought prompting elicits reasoning in large language models. 2023. Preprint, arXiv:2201.11903.
- [36] GPT-4o system card. 2024. Preprint, arXiv:2410.21276.
- [37] The Llama 3 herd of models. Preprint, arXiv:2407.21783.
- [38] Qwen2 technical report. Preprint, arXiv:2407.10671.
- [39] Do cognitively interpretable reasoning traces improve LLM performance? 2025.
- [40] Beyond correctness: Exposing LLM-generated logical flaws in reasoning via multi-step automated theorem proving. 2025.