Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 01:46 UTC · model grok-4.3
The pith
Bian Que enables LLM agents to handle online system operations through a flexible Skill Arrangement that selects precisely the data and knowledge relevant to each event.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Bian Que abstracts routine O&M actions into three patterns (release interception, proactive inspection, and alert root cause analysis) and introduces a flexible Skill Arrangement in which each Skill specifies the exact data and knowledge needed for its context. Skills are generated and updated automatically by LLM agents and can be refined by engineers in natural language; a self-evolving mechanism turns each correction signal into both distilled knowledge and further Skill refinements.
What carries the argument
The flexible Skill Arrangement: each predefined Skill explicitly defines the data and operational knowledge required in its specific context, which lets Skills be generated and updated automatically by agents and iteratively optimized through natural-language instructions.
Load-bearing premise
LLM agents using the flexible Skill Arrangement can reliably select the precise data and knowledge for each event without dilution or hallucination, while Skills can be automatically generated and iteratively optimized with minimal ongoing human curation.
What would settle it
A sustained live deployment where root cause analysis accuracy drops below 80 percent or alert reductions fall short of 50 percent would indicate the framework does not deliver as claimed.
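The falsification criterion reduces to a simple predicate over deployment metrics (a sketch; the argument names are placeholders, not the paper's instrumentation):

```python
def claims_hold(rca_accuracy: float, alert_reduction: float) -> bool:
    """True while sustained deployment metrics meet the headline claims:
    root-cause-analysis accuracy of at least 0.80 and an alert-volume
    reduction of at least 0.50. Sustained readings below either bound
    would count as the falsifying evidence described above."""
    return rca_accuracy >= 0.80 and alert_reduction >= 0.50
```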
Original abstract
Operating and maintaining (O&M) large-scale online engine systems (eg, search, recommendation and advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. Despite the inherent suitability of LLM-based agents for such operational scenarios, the critical bottleneck impeding their practical deployment lies not in reasoning, but in orchestration capability - specifically, the precise selection of relevant data (encompassing metrics, logs, and change events) and applicable knowledge (including handbook-defined rules and empirically derived practitioner experience) tailored to each individual operational event. Feeding all signals indiscriminately causes dilution and hallucination, while manually curating the event-to-(data, knowledge) mapping is intractable under dozens of daily releases. Here we present Bian Que, an agentic operating framework with three contributions: (i) The unified operational paradigm, which abstracts routine daily O&M actions into three canonical patterns: release interception, proactive inspection, and alert root cause analysis; (ii) The flexible Skill Arrangement, each predefined Skill explicitly defines the requisite data and operational knowledge for each specific context. Such Skills can be automatically generated and updated by LLM agents, and can also be iteratively optimized by on-call engineers via natural language instructions. (iii) The unified self-evolving mechanism, where each correction signal enables two parallel evolutionary pathways: distilling event memory into knowledge, and targeted refinement of Skills. Deployed on the e-commerce search engine of KuaiShou, Bian Que reduces alert volume by 75%, achieves 80% root-cause analysis accuracy, cuts mean time to resolution by over 50%, and attains a 99.0% pass rate on offline evaluations. Codes are at https://github.com/benchen4395/BianQue_Assistant.
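The abstract's self-evolving mechanism routes one correction signal down two parallel pathways. A minimal sketch of that loop, assuming dict-shaped structures (the paper specifies no concrete API, and every field name here is hypothetical):

```python
def apply_correction(signal: dict, memory: list, skills: dict) -> None:
    """Sketch of the dual-pathway self-evolution loop: one correction
    signal (a) distills the corrected event into a reusable knowledge
    entry and (b) refines the Skill that handled the event."""
    memory.append(signal["event"])
    skill = skills[signal["skill_id"]]
    # Pathway 1: memory-to-knowledge distillation (here: record the lesson).
    skill.setdefault("knowledge", []).append(signal["lesson"])
    # Pathway 2: targeted Skill refinement (here: widen its data selection).
    for source in signal.get("missing_data", []):
        if source not in skill["data"]:
            skill["data"].append(source)
```

In the deployed system both pathways would be LLM-driven; the point of the sketch is only that a single feedback signal updates two stores at once.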
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Bian Que, an agentic LLM framework for online system O&M that abstracts operations into three patterns (release interception, proactive inspection, alert root cause analysis). Its core is the flexible Skill Arrangement mechanism, in which Skills explicitly bundle relevant data (metrics/logs/changes) and knowledge (handbooks/practitioner experience) for each event; Skills are LLM-generated, LLM-updated, and iteratively refined via natural-language instructions from engineers. A unified self-evolving loop distills corrections into memory and Skill updates. Deployed on KuaiShou’s e-commerce search engine, the system is reported to reduce alert volume by 75%, achieve 80% RCA accuracy, cut MTTR by >50%, and reach 99.0% offline pass rate. Code is released at https://github.com/benchen4395/BianQue_Assistant.
Significance. If the deployment claims can be substantiated with transparent methodology, baselines, and error analysis, the work would provide concrete evidence that LLM agents can be orchestrated reliably enough for production O&M at scale, directly addressing the data/knowledge selection bottleneck that currently limits such systems. The open-sourced code is a clear strength for reproducibility. The flexible Skill design and self-evolution loop are conceptually appealing and could generalize beyond the reported deployment.
major comments (3)
- [Abstract and §4] Abstract and §4 (Deployment Results): The headline metrics (75% alert-volume reduction, 80% RCA accuracy, >50% MTTR reduction, 99.0% offline pass rate) are stated without any baseline system, pre-deployment measurements, statistical error bars, data-collection window, or alert-counting definition. Because the central claim is that the framework itself produces these gains, the absence of this information is load-bearing and prevents attribution to the Skill Arrangement rather than other factors.
- [§3.2] §3.2 (Flexible Skill Arrangement): The paper asserts that Skills are automatically generated, updated, and NL-optimized by LLMs with a self-evolving loop, yet supplies no quantitative data on Skill-selection error rates, hallucination frequency during live events, or the volume/frequency of human interventions required over the deployment period. This directly bears on the weakest assumption that the agent reliably maps events to the precise data/knowledge subset without dilution or heavy curation.
- [§4.3] §4.3 (Offline Evaluation): The 99.0% pass rate is reported without test-set size, definition of a “pass,” task distribution, or comparison against non-agent baselines. This leaves the offline validation of the Skill mechanism and self-evolution loop unanchored and weakens the supporting evidence for the deployment claims.
minor comments (2)
- [Abstract] The abstract is unusually dense with performance numbers; moving the quantitative claims to a dedicated results paragraph or table would improve readability.
- [§3] Notation for “Skill” is introduced without an explicit formal definition or pseudocode; a small diagram or boxed definition in §3 would clarify the interface between LLM generation and engineer NL optimization.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We appreciate the referee's recognition of the potential of the flexible Skill design and self-evolution loop for production O&M. We address each major comment below and will incorporate revisions to improve transparency and substantiation of the results.
Point-by-point responses
-
Referee: [Abstract and §4] The headline metrics (75% alert-volume reduction, 80% RCA accuracy, >50% MTTR reduction, 99.0% offline pass rate) are stated without any baseline system, pre-deployment measurements, statistical error bars, data-collection window, or alert-counting definition. This prevents attribution to the Skill Arrangement.
Authors: We agree that additional context is needed to strengthen attribution of the gains to the framework. In the revised manuscript, we will expand §4 (and update the abstract accordingly) to describe the pre-deployment baseline system, the data collection window (pre- and post-deployment periods), the precise definition of alert volume and counting methodology, and any available statistical measures such as variance in MTTR where computable from logs. While certain production details remain subject to internal confidentiality, we will provide sufficient information to allow readers to evaluate the role of Skill Arrangement versus other factors. revision: yes
-
Referee: [§3.2] The paper asserts that Skills are automatically generated, updated, and NL-optimized by LLMs with a self-evolving loop, yet supplies no quantitative data on Skill-selection error rates, hallucination frequency during live events, or the volume/frequency of human interventions required over the deployment period.
Authors: This point is well-taken, as quantitative indicators of Skill reliability would better support the claims. We will revise §3.2 to include deployment-derived statistics: the volume and frequency of LLM-generated Skill updates, observed rates of selection errors or hallucinations (mitigated by the self-evolving loop), and the number and nature of human interventions via natural-language instructions. These will be summarized in a new table or paragraph based on production logs, quantifying the human effort and the loop's effectiveness without revealing proprietary details. revision: yes
-
Referee: [§4.3] The 99.0% pass rate is reported without test-set size, definition of a “pass,” task distribution, or comparison against non-agent baselines.
Authors: We acknowledge the need for more detail to anchor the offline results. In the revised §4.3, we will specify the test-set size, the exact definition of a 'pass' (e.g., successful task completion per pattern criteria), the distribution of evaluated tasks across release interception, proactive inspection, and alert RCA, and direct comparisons against non-agent baselines such as direct LLM prompting without Skills or rule-based approaches. This will better substantiate the offline validation of the Skill mechanism and self-evolution. revision: yes
Circularity Check
No circularity: empirical deployment results independent of internal definitions or self-citations
full rationale
The paper describes an agentic O&M framework (unified paradigm, flexible Skill Arrangement, self-evolving mechanism) and reports concrete deployment metrics from KuaiShou (75% alert reduction, 80% RCA accuracy, >50% MTTR reduction, 99% offline pass rate). No equations, fitted parameters, or 'predictions' appear that reduce by construction to inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The central claims rest on observed system outcomes rather than any derivation chain that loops back to its own definitions or fitted data. This is a standard engineering/deployment paper with externally measurable results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM agents can precisely select relevant data and knowledge for each operational event when guided by predefined Skills.
invented entities (1)
- Skill (no independent evidence)
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
"Flexible Skill Arrangement... each Skill specifies the relevant data and knowledge... LLM-generated, LLM-updated... unified self-evolving mechanism... one feedback signal simultaneously drives memory-to-knowledge distillation and Skill refinement"
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction and recovery · tag: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
"unified operational paradigm... three canonical patterns: release interception, proactive inspection, and alert root cause analysis"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
-
[2]
Anthropic. Claude code overview. https://code.claude.com/docs/en/overview, 2026. Official documentation, accessed 2026-03-10
work page 2026
-
[3]
Harness engineering: leveraging codex in an agent-first world
OpenAI. Harness engineering: leveraging codex in an agent-first world. Engineering blog, 2026. URL https://openai.com/index/harness-engineering/. Published 2026-02, accessed 2026-03-13
work page 2026
-
[6]
Rcagent: Cloud root cause analysis by autonomous agents with tool-augmented large language models
Zefan Wang, Zichuan Liu, Yingying Zhang, Aoxiao Zhong, Jihong Wang, Fengbin Yin, Lunting Fan, Lingfei Wu, and Qingsong Wen. Rcagent: Cloud root cause analysis by autonomous agents with tool-augmented large language models. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pages 4966–4974, 2024
work page 2024
-
[7]
Lingzhe Zhang, Tong Jia, Mengxi Jia, Yifan Wu, Aiwei Liu, Yong Yang, Zhonghai Wu, Xuming Hu, Philip S Yu, and Ying Li. A survey of aiops for failure management in the era of large language models. arXiv preprint arXiv:2406.11213, 2024
-
[8]
Onesearch: A preliminary exploration of the unified end-to-end generative framework for e-commerce search
Ben Chen, Xian Guo, Siyuan Wang, Zihan Liang, Yue Lv, Yufei Ma, Xinlong Xiao, Bowen Xue, Xuxin Zhang, Ying Yang, et al. Onesearch: A preliminary exploration of the unified end-to-end generative framework for e-commerce search. arXiv preprint arXiv:2509.03236, 2025
-
[9]
Ben Chen, Siyuan Wang, Yufei Ma, Zihan Liang, Xuxin Zhang, Yue Lv, Ying Yang, Huangyu Dai, Lingtao Mao, Tong Zhao, et al. Onesearch-v2: The latent reasoning enhanced self-distillation generative search framework. arXiv preprint arXiv:2603.24422, 2026
work page internal anchor Pith review arXiv 2026
-
[10]
Uniecs: Unified multimodal e-commerce search framework with gated cross-modal fusion
Zihan Liang, Yufei Ma, ZhiPeng Qian, Huangyu Dai, Zihan Wang, Ben Chen, Chenyi Lei, Yuqing Ding, and Han Li. Uniecs: Unified multimodal e-commerce search framework with gated cross-modal fusion. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pages 1788–1797, 2025
work page 2025
-
[11]
CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval
Zhipeng Qian, Zihan Liang, Yufei Ma, Ben Chen, Huangyu Dai, Yiwei Ma, Jiayi Ji, Chenyi Lei, Han Li, and Xiaoshuai Sun. Csmcir: Cot-enhanced symmetric alignment with memory bank for composed image retrieval. arXiv preprint arXiv:2601.03728, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[12]
OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment
Jiaxin Deng, Shiyao Wang, Kuo Cai, Lejian Ren, Qigen Hu, Weifeng Ding, Qiang Luo, and Guorui Zhou. Onerec: Unifying retrieve and rank with generative recommender and iterative preference alignment. arXiv preprint arXiv:2502.18965, 2025
work page internal anchor Pith review arXiv 2025
-
[13]
Paolo Notaro, Jorge Cardoso, and Michael Gerndt. A survey of aiops methods for failure management. ACM Transactions on Intelligent Systems and Technology (TIST), 12(6):1–45, 2021
work page 2021
-
[14]
Yinfang Chen, Jiaqi Pan, Jackson Clark, Yiming Su, Noah Zheutlin, Bhavya Bhavya, Rohan Arora, Yu Deng, Saurabh Jha, and Tianyin Xu. Stratus: A multi-agent system for autonomous reliability engineering of modern clouds. arXiv preprint arXiv:2506.02009, 2025
-
[15]
Evelien Riddell, James Riddell, Gengyi Sun, Michał Antkiewicz, and Krzysztof Czarnecki. Stalled, biased, and confused: Uncovering reasoning failures in llms for cloud-based root cause analysis. arXiv preprint arXiv:2601.22208, 2026
-
[16]
Arthur Vitui and Tse-Hsun Chen. Empowering aiops: Leveraging large language models for it operations management. arXiv preprint arXiv:2501.12461, 2025
-
[17]
Mohammad Braei and Sebastian Wagner. Anomaly detection in univariate time-series: A survey on the state-of-the-art. arXiv preprint arXiv:2004.00433, 2020
-
[18]
Drain: An online log parsing approach with fixed depth tree
Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R Lyu. Drain: An online log parsing approach with fixed depth tree. In 2017 IEEE International Conference on Web Services (ICWS), pages 33–40. IEEE, 2017
work page 2017
-
[19]
Predicting node failure in cloud service systems
Qingwei Lin, Ken Hsieh, Yingnong Dang, Hongyu Zhang, Kaixin Sui, Yong Xu, Jian-Guang Lou, Chenggang Li, Youjiang Wu, Randolph Yao, et al. Predicting node failure in cloud service systems. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 480–490, 2018
work page 2018
-
[20]
Jacopo Soldani and Antonio Brogi. Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey. ACM Computing Surveys (CSUR), 55(3):1–39, 2022
work page 2022
-
[21]
Llm-based event log analysis techniques: A survey
Siraaj Akhtar, Saad Khan, and Simon Parkinson. Llm-based event log analysis techniques: A survey. arXiv preprint arXiv:2502.00677, 2025
-
[22]
Paulina Toro Isaza, Michael Nidd, Noah Zheutlin, Jae-wook Ahn, Chidansh Amitkumar Bhatt, Yu Deng, Ruchi Mahindru, Martin Franz, Hans Florian, and Salim Roukos. Retrieval augmented generation-based incident resolution recommendation system for it support. arXiv preprint arXiv:2409.13707, 2024
-
[23]
Exploring llm-based agents for root cause analysis
Devjeet Roy, Xuchao Zhang, Rashi Bhave, Chetan Bansal, Pedro Las-Casas, Rodrigo Fonseca, and Saravan Rajmohan. Exploring llm-based agents for root cause analysis. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pages 208–219, 2024
work page 2024
-
[24]
Raglog: Log anomaly detection using retrieval augmented generation
Jonathan Pan, Wong Swee Liang, and Yuan Yidi. Raglog: Log anomaly detection using retrieval augmented generation. In 2024 IEEE World Forum on Public Safety Technology (WFPST), pages 169–174. IEEE, 2024
work page 2024
-
[25]
A survey of aiops in the era of large language models
Lingzhe Zhang, Tong Jia, Mengxi Jia, Yifan Wu, Aiwei Liu, Yong Yang, Zhonghai Wu, Xuming Hu, Philip Yu, and Ying Li. A survey of aiops in the era of large language models. ACM Computing Surveys, 58(2):1–35, 2025
work page 2025
-
[26]
React: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022
work page 2022
-
[27]
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023
work page 2023
-
[28]
Hui Yang, Sifu Yue, and Yunzhong He. Auto-gpt for online decision making: Benchmarks and additional opinions. arXiv preprint arXiv:2306.02224, 2023
-
[29]
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36:38154–38180, 2023
work page 2023
-
[30]
Voyager: An Open-Ended Embodied Agent with Large Language Models
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[31]
TaskWeaver: A code-first agent framework
Bo Qiao, Liqun Li, Xu Zhang, Shilin He, Yu Kang, Chaoyun Zhang, Fangkai Yang, Hang Dong, Jue Zhang, Lu Wang, et al. Taskweaver: A code-first agent framework. arXiv preprint arXiv:2311.17541, 2023
-
[32]
IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning
Zihan Liang, Yufei Ma, Ben Chen, Zhipeng Qian, Huangyu Dai, Lingtao Mao, Xuxin Zhang, Chenyi Lei, and Wenwu Ou. Ig-search: Step-level information gain rewards for search-augmented reasoning. arXiv preprint arXiv:2604.15148, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[33]
The anatomy of an agent harness
LangChain. The anatomy of an agent harness. Engineering blog, 2026. URL https://blog.langchain.com/the-anatomy-of-an-agent-harness/. Published: 2026-03-10. Accessed: 2026-03-12
work page 2026
-
[34]
MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools
Wenhao Wang, Peizhi Niu, Zhao Xu, Zhaoyu Chen, Jian Du, Yaxin Du, Xianghe Pang, Keduan Huang, Yanfeng Wang, Qiang Yan, et al. Mcp-flow: Facilitating llm agents to master real-world, diverse and scaling mcp tools. arXiv preprint arXiv:2510.24284, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
Llma4itops: A lightweight llm-based multi-agent framework for it operations and maintenance
Zhuoxuan Jiang, Tianyang Zhang, Haotian Zhang, Yinong Xun, Yang Liu, Dehua Feng, Wen Si, and Shaohua Zhang. Llma4itops: A lightweight llm-based multi-agent framework for it operations and maintenance. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 471–482. Springer, 2025
work page 2025
-
[36]
Pei Yang, Wanyi Chen, Yuxi Zheng, Xueqian Li, Xiang Li, Haoqin Tu, Jie Xiao, Yifan Pang, Bill Shi, Lynn Ai, et al. Aoi: Turning failed trajectories into training signals for autonomous cloud diagnosis. arXiv preprint arXiv:2603.03378, 2026
-
[37]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020
work page 2020
-
[38]
Self-rag: Learning to retrieve, generate, and critique through self-reflection
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations, 2023
work page 2023
-
[39]
Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C Park. Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7036...
work page 2024
-
[40]
Memorybank: Enhancing large language models with long-term memory
Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731, 2024
work page 2024
-
[41]
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023
work page 2023
-
[42]
Generative agents: Interactive simulacra of human behavior
Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023
work page 2023
-
[43]
A-MEM: Agentic Memory for LLM Agents
Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[44]
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023
work page 2023
-
[45]
A survey on self-evolution of large language models
Zhengwei Tao, Ting-En Lin, Xiancai Chen, Hangyu Li, Yuchuan Wu, Yongbin Li, Zhi Jin, Fei Huang, Dacheng Tao, and Jingren Zhou. A survey on self-evolution of large language models. arXiv preprint arXiv:2404.14387, 2024
-
[46]
MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents
Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. Memskill: Learning and evolving memory skills for self-evolving agents. arXiv preprint arXiv:2602.02474, 2026
work page internal anchor Pith review arXiv 2026
-
[47]
Yaolun Zhang, Yiran Wu, Yijiong Yu, Qingyun Wu, and Huazheng Wang. Live-evo: Online evolution of agentic memory from continuous feedback. arXiv preprint arXiv:2602.02369, 2026
-
[48]
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H Chi, et al. Evo-memory: Benchmarking llm agent test-time learning with self-evolving memory. arXiv preprint arXiv:2511.20857, 2025
work page internal anchor Pith review arXiv 2025