{"total":15,"items":[{"citing_arxiv_id":"2606.10875","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation","primary_cat":"cs.CL","submitted_at":"2026-06-09T13:51:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Instance-level experiential knowledge provides strong gains for LLM tool calling, parallel sampling activates it more effectively than deeper reasoning, and RL-based internalization outperforms SFT, yielding the KATE framework with consistent benchmark improvements.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09371","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Capability-Aligned Hierarchical Learning for Tool-Augmented LLMs","primary_cat":"cs.AI","submitted_at":"2026-06-08T11:48:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"CAHL jointly optimizes hierarchical policies for tool-augmented LLMs via RLVR and reports improved results on API-Bank, BFCL, and Bamboogle.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03892","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments","primary_cat":"cs.CL","submitted_at":"2026-06-02T16:52:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PROVE trains LLMs on multi-step tool calls using 20 live MCP servers with 343 tools, state-grounded synthesis, and adaptive efficiency rewards, delivering gains of up to 10.2 points on BFCL Multi-Turn and similar on 
other benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00135","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On Effectiveness and Efficiency of Agentic Tool-calling and RL Training","primary_cat":"cs.LG","submitted_at":"2026-05-28T22:21:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Tool-calling evaluations for LLM agents are highly sensitive to implementation details such as random seeds and history handling, and two new techniques accelerate RL training with wall-clock speedup and no performance degradation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29303","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-28T03:36:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"EKSFT masks high-entropy or high-KL tokens in low-data SFT to preserve pre-trained distribution and improve downstream RL performance on math reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14126","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)","primary_cat":"cs.LG","submitted_at":"2026-05-13T21:27:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RL post-training lifts answer correctness on FHIR-AgentBench from 50% (o4-mini) to 77% with a cheaper Qwen3-8B 
CodeAct agent.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11775","ref_index":30,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control","primary_cat":"cs.LG","submitted_at":"2026-05-12T08:47:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetry between high- and low-probability tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09730","ref_index":20,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement","primary_cat":"cs.LG","submitted_at":"2026-05-10T19:57:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RubricRefine is a training-free pre-execution method that creates rubrics to score and fix inter-tool contract violations in agent code, reaching 0.86 average on M3ToolEval across seven models with zero executions and lower latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03476","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge 
Verification","primary_cat":"cs.CL","submitted_at":"2026-05-05T08:05:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tuned Qwen3-14B model and 50% relative improvement over baselines.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Agent functionality is implemented using the LangChain framework [18], with GraphRAG [15] employed for knowledge graph construction and retrieval. Model deployment supports both API calls and local inference, using large-scale commercial models and lightweight open-source models respectively. Text processing employs the SaT (Segment any Text) model [44] combined with rule-based methods for sentence segmentation; graph construction uses community detection algorithms for hierarchical organization. 4.2 Hallucination Generation Agent 4.2.1 Design Principles The hallucination generation agent aims to systematically generate diverse medical error patterns while maintaining medical plausibility. 
We established four core design principles."},{"citing_arxiv_id":"2604.20316","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling","primary_cat":"cs.LG","submitted_at":"2026-04-22T08:13:24+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17739","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Democratizing Tool Learning with Environments Fully Simulated by a Free 8B Language Model","primary_cat":"cs.LG","submitted_at":"2026-04-20T02:54:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TRUSTEE uses an 8B LM to simulate complete dynamic environments for RL-based tool learning and outperforms baselines that require extra external resources.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09813","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-04-10T18:38:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementary to 
SFT.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09712","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-08T06:28:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LAST augments MLLMs with a tool-abstraction sandbox and three-stage training to deliver around 20% gains on spatial reasoning tasks, outperforming closed-source models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Typical examples integrate calculators [25, 44], code executors [10, 26] and symbolic solvers [20, 31, 32, 50], leveraging their reliability to handle complex reasoning beyond the native capacity of language models [30]. In the multimodal setting [11, 14], tools are extended to visual operations such as cropping, masking, or adjusting image attributes [46, 48], sometimes coordinated through reinforcement learning for tool selection and sequencing [22, 49]. Spatial reasoning marks another important direction. 
Text-based systems increasingly adopt logic engines or ASP solvers for multi-hop inference [35, 41], while multimodal spatial reasoning [5, 12] begins to exploit expert models for grounding"},{"citing_arxiv_id":"2511.07833","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation","primary_cat":"cs.LG","submitted_at":"2025-11-11T05:03:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MURPHY improves code generation pass rates by up to 6% through retrospective credit assignment on multi-turn feedback trees using max or mean reward propagation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.06499","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels","primary_cat":"cs.CL","submitted_at":"2025-10-07T22:30:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Webscale-RL generates 1.2M verifiable QA pairs from pretraining corpora, enabling RL training that matches continual pretraining performance with up to 100x fewer tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}