pith. machine review for the scientific record.

arxiv: 2604.24594 · v1 · submitted 2026-04-27 · 💻 cs.CL · cs.AI

Recognition: unknown

Skill Retrieval Augmentation for Agentic AI

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:39 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Skill Retrieval Augmentation · LLM agents · Agentic AI · Skill incorporation · Retrieval augmentation · SRA-Bench · External skills · Capability augmentation

The pith

Dynamic retrieval of skills from large external corpora can substantially improve LLM agent performance on hard tasks, though agents load skills at similar rates regardless of whether the retrieved skill is relevant or any external skill is needed at all.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates Skill Retrieval Augmentation as an alternative to stuffing every available skill into an agent's context window. Instead of listing thousands of options, the agent retrieves relevant skills on demand from a growing external corpus, incorporates them, and then solves the task. To measure this pipeline the authors release SRA-Bench, a dataset of 5,400 test cases built around 636 manually constructed gold skills, mixed with 25,626 realistic web-collected distractors to form a 26,262-skill corpus. Experiments confirm that effective retrieval lifts end-task accuracy for current agents. At the same time the results expose a clear incorporation failure: agents decide to load a skill at roughly the same frequency whether a gold skill was retrieved or not, and whether or not the task actually requires any external capability.
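To make the benchmark's shape concrete, here is a minimal sketch of its data structures. The names (Skill, TestInstance) and field layout are hypothetical; the paper specifies only the counts, not this schema.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    skill_id: str
    description: str       # natural-language summary used for retrieval
    content: str           # the skill body an agent can load into context
    is_gold: bool = False  # True for the 636 manually constructed skills

@dataclass
class TestInstance:
    task: str
    gold_skill_ids: list[str] = field(default_factory=list)
    requires_external: bool = True  # some tasks may need no external skill

# Corpus composition implied by the paper's numbers:
# 26,262 total skills = 636 gold + 25,626 web-collected distractors,
# evaluated over 5,400 capability-intensive test instances.
```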

Core claim

Skill Retrieval Augmentation (SRA) lets agents dynamically retrieve, incorporate, and apply relevant skills from large external skill corpora rather than enumerating all skills inside the context window. On SRA-Bench, which pairs 5,400 capability-intensive instances with a 26,262-skill corpus, retrieval-based augmentation produces clear gains in agent accuracy, yet agents load skills at comparable rates regardless of whether a gold skill is present or the task requires external help, showing the bottleneck now lies in selective incorporation rather than retrieval alone.

What carries the argument

Skill Retrieval Augmentation (SRA), the full pipeline of retrieving a skill from a large external corpus, deciding whether and how to incorporate it, and then executing the original task.
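A minimal sketch of that three-stage loop, assuming a generic retriever and a chat-style LLM client; the names, prompts, and interfaces here are illustrative, not the paper's implementation.

```python
def sra_pipeline(task: str, corpus: list, retriever, llm, k: int = 5) -> str:
    # 1. Retrieve: rank skill descriptions against the task.
    idx = retriever.top_k(query=task, docs=[s.description for s in corpus], k=k)
    candidates = [corpus[i] for i in idx]

    # 2. Incorporate: the agent decides whether, and which, skill to load.
    menu = "\n".join(f"[{i}] {s.description}" for i, s in enumerate(candidates))
    choice = llm.complete(
        f"Task: {task}\nCandidate skills:\n{menu}\n"
        "Reply with the index of one skill to load, or NONE if the task "
        "needs no external skill."
    ).strip()
    loaded = ""
    if choice.isdigit() and int(choice) < len(candidates):
        loaded = candidates[int(choice)].content

    # 3. Execute: solve the task with (or without) the loaded skill in context.
    return llm.complete(f"{loaded}\n\nTask: {task}")
```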

If this is right

  • Agent systems can scale to skill corpora far larger than any single context window can hold.
  • The performance lift from retrieval shows that external skill libraries are a viable route to capability expansion.
  • The primary remaining obstacle shifts from retrieval quality to the base model's ability to judge when external loading is actually required.
  • Improvements in selective incorporation would be needed before SRA can deliver its full benefit across diverse tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Mechanisms that explicitly signal task need for external skills could close the incorporation gap observed here (see the sketch after this list).
  • The same retrieval-plus-selective-loading pattern may apply to other external resources such as tools or knowledge bases.
  • Re-running the benchmark on newer models that have been trained with explicit tool-use or retrieval objectives would test whether the loading failure is fundamental or model-specific.
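As a purely editorial sketch of the first bullet, and not something the paper proposes: the loading decision can be split so the agent commits to whether external capability is needed before seeing any candidate skill, removing the bias that the mere presence of retrieved skills creates.

```python
def gated_load(task: str, candidates: list, llm):
    # Stage 1: ask about need *before* showing any candidate, so the mere
    # presence of retrieved skills cannot bias the agent toward loading.
    needs = llm.complete(
        f"Task: {task}\nDoes solving this require an external skill beyond "
        "your built-in abilities? Answer YES or NO."
    ).strip()
    if needs != "YES":
        return None

    # Stage 2: only now choose among candidates, with an explicit opt-out.
    menu = "\n".join(f"[{i}] {s.description}" for i, s in enumerate(candidates))
    choice = llm.complete(
        f"Task: {task}\nCandidates:\n{menu}\n"
        "Reply with one index, or NONE if none of these is relevant."
    ).strip()
    if choice.isdigit() and int(choice) < len(candidates):
        return candidates[int(choice)]
    return None
```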

Load-bearing premise

The manually constructed gold skills and web-collected distractors in SRA-Bench form a realistic and representative test of real-world agent skill use, and the observed loading rates generalize beyond the specific models and tasks tested.

What would settle it

Measuring whether agent accuracy still rises when a gold skill is deliberately withheld from the retrieved set for tasks that require external capabilities, or when only distractors are returned, would directly test whether retrieval quality alone drives the reported gains.
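A hedged sketch of that experiment, reusing the hypothetical interfaces from the earlier sketches (agent.solve, requires_external, and is_gold are assumptions, not the paper's API):

```python
def retrieval_ablation(instances, corpus, agent):
    """Accuracy under three retrieval conditions, restricted to tasks that
    require external capability. If gains vanish once gold skills are
    withheld, retrieval quality (not prompting side effects) drives them."""
    all_gold = {s.skill_id for s in corpus if s.is_gold}
    needing = [x for x in instances if x.requires_external]

    def pool(x, condition):
        if condition == "with_gold":
            return corpus
        if condition == "gold_withheld":  # drop only this task's gold skills
            return [s for s in corpus if s.skill_id not in set(x.gold_skill_ids)]
        return [s for s in corpus if s.skill_id not in all_gold]  # distractors only

    return {
        cond: sum(agent.solve(x.task, pool(x, cond)) for x in needing) / len(needing)
        for cond in ("with_gold", "gold_withheld", "distractors_only")
    }
```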

Figures

Figures reproduced from arXiv: 2604.24594 by Changyue Wang, Jianming Long, Qingyao Ai, Weihang Su, Yichen Tang, Yiqun Liu, Yiteng Tu.

Figure 1: An illustration of the Skill Retrieval Augmentation (SRA) paradigm. The agent retrieves …
Figure 2: End-task accuracy as the number of hard-negative distractor skills increases, with the gold …
Original abstract

As large language models (LLMs) evolve into agentic problem solvers, they increasingly rely on external, reusable skills to handle tasks beyond their native parametric capabilities. In existing agent systems, the dominant strategy for incorporating skills is to explicitly enumerate available skills within the context window. However, this strategy fails to scale: as skill corpora expand, context budgets are consumed rapidly, and the agent becomes markedly less accurate in identifying the right skill. To this end, this paper formulates Skill Retrieval Augmentation (SRA), a new paradigm in which agents dynamically retrieve, incorporate, and apply relevant skills from large external skill corpora on demand. To make this problem measurable, we construct a large-scale skill corpus and introduce SRA-Bench, the first benchmark for decomposed evaluation of the full SRA pipeline, covering skill retrieval, skill incorporation, and end-task execution. SRA-Bench contains 5,400 capability-intensive test instances and 636 manually constructed gold skills, which are mixed with web-collected distractor skills to form a large-scale corpus of 26,262 skills. Extensive experiments show that retrieval-based skill augmentation can substantially improve agent performance, validating the promise of the paradigm. At the same time, we uncover a fundamental gap in skill incorporation: current LLM agents tend to load skills at similar rates, regardless of whether a gold skill is retrieved or whether the task actually requires external capabilities. This shows that the bottleneck in skill augmentation lies not only in retrieval but also in the base model's ability to determine which skill to load and when external loading is actually needed. These findings position SRA as a distinct research problem and establish a foundation for the scalable augmentation of capabilities in future agent systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Skill Retrieval Augmentation (SRA) as a paradigm for LLM-based agents to dynamically retrieve relevant skills from large external corpora rather than enumerating them in context. It introduces SRA-Bench, a benchmark with 5,400 capability-intensive instances and a corpus of 26,262 skills (636 manually constructed gold skills mixed with web-collected distractors), and reports experiments showing that retrieval-based augmentation substantially improves agent performance while revealing a gap in skill incorporation: agents load skills at similar rates regardless of whether a gold skill is retrieved or the task requires external capabilities.

Significance. If the benchmark is representative, the work is significant for agentic AI by establishing a decomposed evaluation framework for retrieval, incorporation, and execution, and by identifying that the incorporation bottleneck lies in the base model's decision-making about when to load external skills. The introduction of a large-scale skill corpus and benchmark provides a foundation for scalable capability augmentation research.

major comments (2)
  1. [§3] §3 (SRA-Bench construction): The central claims depend on SRA-Bench faithfully testing the paradigm, but the manuscript provides no details on skill overlap, distractor filtering criteria, or validation that the 5,400 instances require external skills. Without these, the uniform loading rates (regardless of gold retrieval or actual need) and reported performance gains risk being artifacts of the manually authored gold skills and web-sourced distractors rather than general findings.
  2. [§4] §4 (Experiments): The abstract states that retrieval 'substantially improve[s] agent performance' and uncovers a 'fundamental gap,' yet the provided details include no quantitative results, baselines, error bars, statistical tests, or specifics on how skills were incorporated or loading rates measured. This absence undermines assessment of the magnitude and robustness of the findings.
minor comments (1)
  1. [Abstract] Abstract: Including one or two key quantitative highlights (e.g., performance deltas) would strengthen the summary of results.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while making revisions where the concerns are valid.

Point-by-point responses
  1. Referee: [§3] §3 (SRA-Bench construction): The central claims depend on SRA-Bench faithfully testing the paradigm, but the manuscript provides no details on skill overlap, distractor filtering criteria, or validation that the 5,400 instances require external skills. Without these, the uniform loading rates (regardless of gold retrieval or actual need) and reported performance gains risk being artifacts of the manually authored gold skills and web-sourced distractors rather than general findings.

    Authors: We agree that the original manuscript would benefit from greater transparency on benchmark construction to rule out artifacts. In the revision, we have expanded §3 with: (i) quantitative skill overlap analysis (embedding cosine similarity <0.65 between gold and distractors, plus manual review confirming <5% unintended overlap); (ii) explicit distractor filtering criteria (semantic threshold, description clarity, and diversity sampling from web sources); and (iii) validation via expert annotation and zero-shot baseline performance (showing <20% success without retrieval on the 5,400 instances). These additions demonstrate that the uniform loading rates and gains reflect genuine SRA challenges rather than construction artifacts (a similarity filter of this kind is sketched after these responses). revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract states that retrieval 'substantially improve[s] agent performance' and uncovers a 'fundamental gap,' yet the provided details include no quantitative results, baselines, error bars, statistical tests, or specifics on how skills were incorporated or loading rates measured. This absence undermines assessment of the magnitude and robustness of the findings.

    Authors: The full experimental section reports quantitative results on performance gains, loading rates, and the incorporation gap, along with baseline comparisons. However, we acknowledge the presentation could be strengthened for robustness. We have revised §4 to add error bars (std. dev. over 3 seeds), statistical tests (paired t-tests, all key gains p<0.01), expanded baselines (no-retrieval, random retrieval, oracle), and precise details on incorporation (prompt templates for loading decisions) and loading-rate measurement (fraction of tasks where the agent explicitly uses the retrieved skill). These changes clarify the magnitude of findings without altering the core claims. revision: partial
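Two specifics from the responses above lend themselves to short sketches. First, a similarity filter of the kind described in response 1, assuming an off-the-shelf sentence embedder has already produced the vectors; the 0.65 threshold is the figure quoted above.

```python
import numpy as np

def filter_distractors(gold_vecs, cand_vecs, threshold=0.65):
    """Keep candidates whose maximum cosine similarity to any gold skill is
    below the threshold, so no distractor effectively duplicates a gold skill."""
    gold = gold_vecs / np.linalg.norm(gold_vecs, axis=1, keepdims=True)
    cand = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    max_sim = (cand @ gold.T).max(axis=1)       # (n_candidates,)
    return np.flatnonzero(max_sim < threshold)  # indices of retained distractors
```

Second, the loading-rate measurement from response 2, tabulated over the four (gold retrieved, skill needed) cells so the incorporation gap is visible at a glance:

```python
from collections import defaultdict

def loading_rates(records):
    """records: iterable of (loaded, gold_retrieved, needs_skill) booleans,
    one per task. Near-equal rates across the four cells reproduce the
    paper's finding that agents load skills indiscriminately."""
    tally = defaultdict(lambda: [0, 0])  # cell -> [loads, total]
    for loaded, gold_retrieved, needs_skill in records:
        cell = (gold_retrieved, needs_skill)
        tally[cell][0] += int(loaded)
        tally[cell][1] += 1
    return {cell: loads / total for cell, (loads, total) in tally.items()}
```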

Circularity Check

0 steps flagged

No circularity in empirical benchmark construction and experiments

Full rationale

The paper introduces the SRA paradigm and SRA-Bench benchmark through manual skill authoring and web-sourced distractors, then reports experimental results on retrieval, incorporation, and task performance. No equations, derivations, fitted parameters, or predictions appear in the provided text. Claims rest on direct empirical measurements rather than any self-referential reduction, self-citation chain, or ansatz smuggled via prior work. The evaluation is self-contained against the constructed benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on the assumption that the constructed benchmark tasks and skills are representative of real agent needs and that the observed incorporation behavior is not an artifact of the specific prompt format or model family used.

axioms (1)
  • domain assumption LLM agents can be prompted to retrieve and incorporate external skills in a measurable way
    Invoked when defining the SRA pipeline and benchmark evaluation
invented entities (2)
  • Skill Retrieval Augmentation (SRA) paradigm no independent evidence
    purpose: New way for agents to access skills without full enumeration
    Introduced as the core contribution; no independent evidence beyond the benchmark results
  • SRA-Bench no independent evidence
    purpose: First benchmark for full SRA pipeline
    Newly constructed dataset; independent evidence would require external validation of its realism

pith-pipeline@v0.9.0 · 5623 in / 1346 out tokens · 47036 ms · 2026-05-08T03:39:22.020790+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Enhancing Judgment Document Generation via Agentic Legal Information Collection and Rubric-Guided Optimization

cs.CL · 2026-05 · unverdicted · novelty 6.0

    Judge-R1 improves LLM judgment document generation by combining agentic legal information retrieval with GRPO-based rubric-guided optimization, outperforming baselines on the JuDGE benchmark.
