Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 01:46 UTC · model grok-4.3
The pith
Bian Que enables LLM agents to handle online system operations through a flexible Skill Arrangement that selects precisely the data and knowledge relevant to each event.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Bian Que abstracts routine O&M actions into three patterns (release interception, proactive inspection, and alert root cause analysis) and introduces a flexible Skill Arrangement in which each Skill specifies the exact data and knowledge needed for its context. Skills are generated and updated automatically by LLM agents and can be refined by engineers in natural language; a self-evolving mechanism turns each correction signal into both distilled knowledge and further Skill refinements.
What carries the argument
The flexible Skill Arrangement: each predefined Skill explicitly defines the data and operational knowledge required in its specific context, which lets Skills be generated and updated automatically by agents and iteratively optimized through natural-language instructions.
Load-bearing premise
LLM agents using the flexible Skill Arrangement can reliably select the precise data and knowledge for each event without dilution or hallucination, while Skills can be automatically generated and iteratively optimized with minimal ongoing human curation.
What would settle it
A sustained live deployment where root cause analysis accuracy drops below 80 percent or alert reductions fall short of 50 percent would indicate the framework does not deliver as claimed.
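The falsification criterion reduces to a simple predicate over deployment metrics (a sketch; the argument names are placeholders, not the paper's instrumentation):

```python
def claims_hold(rca_accuracy: float, alert_reduction: float) -> bool:
    """True while sustained deployment metrics meet the headline claims:
    root-cause-analysis accuracy of at least 0.80 and an alert-volume
    reduction of at least 0.50. Sustained readings below either bound
    would count as the falsifying evidence described above."""
    return rca_accuracy >= 0.80 and alert_reduction >= 0.50
```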
Original abstract
Operating and maintaining (O&M) large-scale online engine systems (eg, search, recommendation and advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. Despite the inherent suitability of LLM-based agents for such operational scenarios, the critical bottleneck impeding their practical deployment lies not in reasoning, but in orchestration capability - specifically, the precise selection of relevant data (encompassing metrics, logs, and change events) and applicable knowledge (including handbook-defined rules and empirically derived practitioner experience) tailored to each individual operational event. Feeding all signals indiscriminately causes dilution and hallucination, while manually curating the event-to-(data, knowledge) mapping is intractable under dozens of daily releases. Here we present Bian Que, an agentic operating framework with three contributions: (i) The unified operational paradigm, which abstracts routine daily O&M actions into three canonical patterns: release interception, proactive inspection, and alert root cause analysis; (ii) The flexible Skill Arrangement, each predefined Skill explicitly defines the requisite data and operational knowledge for each specific context. Such Skills can be automatically generated and updated by LLM agents, and can also be iteratively optimized by on-call engineers via natural language instructions. (iii) The unified self-evolving mechanism, where each correction signal enables two parallel evolutionary pathways: distilling event memory into knowledge, and targeted refinement of Skills. Deployed on the e-commerce search engine of KuaiShou, Bian Que reduces alert volume by 75%, achieves 80% root-cause analysis accuracy, cuts mean time to resolution by over 50%, and attains a 99.0% pass rate on offline evaluations. Codes are at https://github.com/benchen4395/BianQue_Assistant.
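The abstract's self-evolving mechanism routes one correction signal down two parallel pathways. A minimal sketch of that loop, assuming dict-shaped structures (the paper specifies no concrete API, and every field name here is hypothetical):

```python
def apply_correction(signal: dict, memory: list, skills: dict) -> None:
    """Sketch of the dual-pathway self-evolution loop: one correction
    signal (a) distills the corrected event into a reusable knowledge
    entry and (b) refines the Skill that handled the event."""
    memory.append(signal["event"])
    skill = skills[signal["skill_id"]]
    # Pathway 1: memory-to-knowledge distillation (here: record the lesson).
    skill.setdefault("knowledge", []).append(signal["lesson"])
    # Pathway 2: targeted Skill refinement (here: widen its data selection).
    for source in signal.get("missing_data", []):
        if source not in skill["data"]:
            skill["data"].append(source)
```

In the deployed system both pathways would be LLM-driven; the point of the sketch is only that a single feedback signal updates two stores at once.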
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Bian Que, an agentic LLM framework for online system O&M that abstracts operations into three patterns (release interception, proactive inspection, alert root cause analysis). Its core is the flexible Skill Arrangement mechanism, in which Skills explicitly bundle relevant data (metrics/logs/changes) and knowledge (handbooks/practitioner experience) for each event; Skills are LLM-generated, LLM-updated, and iteratively refined via natural-language instructions from engineers. A unified self-evolving loop distills corrections into memory and Skill updates. Deployed on KuaiShou’s e-commerce search engine, the system is reported to reduce alert volume by 75%, achieve 80% RCA accuracy, cut MTTR by >50%, and reach 99.0% offline pass rate. Code is released at https://github.com/benchen4395/BianQue_Assistant.
Significance. If the deployment claims can be substantiated with transparent methodology, baselines, and error analysis, the work would provide concrete evidence that LLM agents can be orchestrated reliably enough for production O&M at scale, directly addressing the data/knowledge selection bottleneck that currently limits such systems. The open-sourced code is a clear strength for reproducibility. The flexible Skill design and self-evolution loop are conceptually appealing and could generalize beyond the reported deployment.
major comments (3)
- [Abstract and §4] Abstract and §4 (Deployment Results): The headline metrics (75% alert-volume reduction, 80% RCA accuracy, >50% MTTR reduction, 99.0% offline pass rate) are stated without any baseline system, pre-deployment measurements, statistical error bars, data-collection window, or alert-counting definition. Because the central claim is that the framework itself produces these gains, the absence of this information is load-bearing and prevents attribution to the Skill Arrangement rather than other factors.
- [§3.2] §3.2 (Flexible Skill Arrangement): The paper asserts that Skills are automatically generated, updated, and NL-optimized by LLMs with a self-evolving loop, yet supplies no quantitative data on Skill-selection error rates, hallucination frequency during live events, or the volume/frequency of human interventions required over the deployment period. This directly bears on the weakest assumption that the agent reliably maps events to the precise data/knowledge subset without dilution or heavy curation.
- [§4.3] §4.3 (Offline Evaluation): The 99.0% pass rate is reported without test-set size, definition of a “pass,” task distribution, or comparison against non-agent baselines. This leaves the offline validation of the Skill mechanism and self-evolution loop unanchored and weakens the supporting evidence for the deployment claims.
minor comments (2)
- [Abstract] The abstract is unusually dense with performance numbers; moving the quantitative claims to a dedicated results paragraph or table would improve readability.
- [§3] Notation for “Skill” is introduced without an explicit formal definition or pseudocode; a small diagram or boxed definition in §3 would clarify the interface between LLM generation and engineer NL optimization.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We appreciate the referee's recognition of the potential of the flexible Skill design and self-evolution loop for production O&M. We address each major comment below and will incorporate revisions to improve transparency and substantiation of the results.
Point-by-point responses
-
Referee: [Abstract and §4] The headline metrics (75% alert-volume reduction, 80% RCA accuracy, >50% MTTR reduction, 99.0% offline pass rate) are stated without any baseline system, pre-deployment measurements, statistical error bars, data-collection window, or alert-counting definition. This prevents attribution to the Skill Arrangement.
Authors: We agree that additional context is needed to strengthen attribution of the gains to the framework. In the revised manuscript, we will expand §4 (and update the abstract accordingly) to describe the pre-deployment baseline system, the data collection window (pre- and post-deployment periods), the precise definition of alert volume and counting methodology, and any available statistical measures such as variance in MTTR where computable from logs. While certain production details remain subject to internal confidentiality, we will provide sufficient information to allow readers to evaluate the role of Skill Arrangement versus other factors. revision: yes
-
Referee: [§3.2] The paper asserts that Skills are automatically generated, updated, and NL-optimized by LLMs with a self-evolving loop, yet supplies no quantitative data on Skill-selection error rates, hallucination frequency during live events, or the volume/frequency of human interventions required over the deployment period.
Authors: This point is well-taken, as quantitative indicators of Skill reliability would better support the claims. We will revise §3.2 to include deployment-derived statistics: the volume and frequency of LLM-generated Skill updates, observed rates of selection errors or hallucinations (mitigated by the self-evolving loop), and the number and nature of human interventions via natural-language instructions. These will be summarized in a new table or paragraph based on production logs, quantifying the human effort and the loop's effectiveness without revealing proprietary details. revision: yes
-
Referee: [§4.3] The 99.0% pass rate is reported without test-set size, definition of a “pass,” task distribution, or comparison against non-agent baselines.
Authors: We acknowledge the need for more detail to anchor the offline results. In the revised §4.3, we will specify the test-set size, the exact definition of a 'pass' (e.g., successful task completion per pattern criteria), the distribution of evaluated tasks across release interception, proactive inspection, and alert RCA, and direct comparisons against non-agent baselines such as direct LLM prompting without Skills or rule-based approaches. This will better substantiate the offline validation of the Skill mechanism and self-evolution. revision: yes
Circularity Check
No circularity: empirical deployment results independent of internal definitions or self-citations
full rationale
The paper describes an agentic O&M framework (unified paradigm, flexible Skill Arrangement, self-evolving mechanism) and reports concrete deployment metrics from KuaiShou (75% alert reduction, 80% RCA accuracy, >50% MTTR reduction, 99% offline pass rate). No equations, fitted parameters, or 'predictions' appear that reduce by construction to inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The central claims rest on observed system outcomes rather than any derivation chain that loops back to its own definitions or fitted data. This is a standard engineering/deployment paper with externally measurable results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM agents can precisely select relevant data and knowledge for each operational event when guided by predefined Skills.
invented entities (1)
- Skill (no independent evidence)
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
"Flexible Skill Arrangement... each Skill specifies the relevant data and knowledge... LLM-generated, LLM-updated... unified self-evolving mechanism... one feedback signal simultaneously drives memory-to-knowledge distillation and Skill refinement"
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction and recovery · tag: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
"unified operational paradigm... three canonical patterns: release interception, proactive inspection, and alert root cause analysis"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
-
[2]
Anthropic. Claude code overview. https://code.claude.com/docs/en/overview, 2026. Official documentation, accessed 2026-03-10
work page 2026
-
[3]
Harness engineering: leveraging codex in an agent-first world
OpenAI. Harness engineering: leveraging codex in an agent-first world. Engineering blog, 2026. URL https://openai.com/index/harness-engineering/. Published 2026-02, accessed 2026-03-13
work page 2026
-
[6]
Rcagent: Cloud root cause analysis by autonomous agents with tool-augmented large language models
Zefan Wang, Zichuan Liu, Yingying Zhang, Aoxiao Zhong, Jihong Wang, Fengbin Yin, Lunting Fan, Lingfei Wu, and Qingsong Wen. Rcagent: Cloud root cause analysis by autonomous agents with tool-augmented large language models. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pages 4966–4974, 2024
work page 2024
-
[7]
Lingzhe Zhang, Tong Jia, Mengxi Jia, Yifan Wu, Aiwei Liu, Yong Yang, Zhonghai Wu, Xuming Hu, Philip S Yu, and Ying Li. A survey of aiops for failure management in the era of large language models. arXiv preprint arXiv:2406.11213, 2024
-
[8]
Onesearch: A preliminary exploration of the unified end-to-end generative framework for e-commerce search
Ben Chen, Xian Guo, Siyuan Wang, Zihan Liang, Yue Lv, Yufei Ma, Xinlong Xiao, Bowen Xue, Xuxin Zhang, Ying Yang, et al. Onesearch: A preliminary exploration of the unified end-to-end generative framework for e-commerce search. arXiv preprint arXiv:2509.03236, 2025
-
[9]
Ben Chen, Siyuan Wang, Yufei Ma, Zihan Liang, Xuxin Zhang, Yue Lv, Ying Yang, Huangyu Dai, Lingtao Mao, Tong Zhao, et al. Onesearch-v2: The latent reasoning enhanced self-distillation generative search framework. arXiv preprint arXiv:2603.24422, 2026
work page internal anchor Pith review arXiv 2026
-
[10]
Uniecs: Unified multimodal e-commerce search framework with gated cross-modal fusion
Zihan Liang, Yufei Ma, ZhiPeng Qian, Huangyu Dai, Zihan Wang, Ben Chen, Chenyi Lei, Yuqing Ding, and Han Li. Uniecs: Unified multimodal e-commerce search framework with gated cross-modal fusion. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pages 1788–1797, 2025
work page 2025
-
[11]
CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval
Zhipeng Qian, Zihan Liang, Yufei Ma, Ben Chen, Huangyu Dai, Yiwei Ma, Jiayi Ji, Chenyi Lei, Han Li, and Xiaoshuai Sun. Csmcir: Cot-enhanced symmetric alignment with memory bank for composed image retrieval. arXiv preprint arXiv:2601.03728, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[12]
OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment
Jiaxin Deng, Shiyao Wang, Kuo Cai, Lejian Ren, Qigen Hu, Weifeng Ding, Qiang Luo, and Guorui Zhou. Onerec: Unifying retrieve and rank with generative recommender and iterative preference alignment. arXiv preprint arXiv:2502.18965, 2025
work page internal anchor Pith review arXiv 2025
-
[13]
Paolo Notaro, Jorge Cardoso, and Michael Gerndt. A survey of aiops methods for failure management. ACM Transactions on Intelligent Systems and Technology (TIST), 12(6):1–45, 2021
work page 2021
-
[14]
Yinfang Chen, Jiaqi Pan, Jackson Clark, Yiming Su, Noah Zheutlin, Bhavya Bhavya, Rohan Arora, Yu Deng, Saurabh Jha, and Tianyin Xu. Stratus: A multi-agent system for autonomous reliability engineering of modern clouds. arXiv preprint arXiv:2506.02009, 2025
-
[15]
Evelien Riddell, James Riddell, Gengyi Sun, Michał Antkiewicz, and Krzysztof Czarnecki. Stalled, biased, and confused: Uncovering reasoning failures in llms for cloud-based root cause analysis. arXiv preprint arXiv:2601.22208, 2026
-
[16]
Arthur Vitui and Tse-Hsun Chen. Empowering aiops: Leveraging large language models for it operations management. arXiv preprint arXiv:2501.12461, 2025
-
[17]
Mohammad Braei and Sebastian Wagner. Anomaly detection in univariate time-series: A survey on the state-of-the-art. arXiv preprint arXiv:2004.00433, 2020
-
[18]
Drain: An online log parsing approach with fixed depth tree
Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R Lyu. Drain: An online log parsing approach with fixed depth tree. In 2017 IEEE International Conference on Web Services (ICWS), pages 33–40. IEEE, 2017
work page 2017
-
[19]
Predicting node failure in cloud service systems
Qingwei Lin, Ken Hsieh, Yingnong Dang, Hongyu Zhang, Kaixin Sui, Yong Xu, Jian-Guang Lou, Chenggang Li, Youjiang Wu, Randolph Yao, et al. Predicting node failure in cloud service systems. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 480–490, 2018
work page 2018
-
[20]
Jacopo Soldani and Antonio Brogi. Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey. ACM Computing Surveys (CSUR), 55(3):1–39, 2022
work page 2022
-
[21]
Llm-based event log analysis techniques: A survey
Siraaj Akhtar, Saad Khan, and Simon Parkinson. Llm-based event log analysis techniques: A survey. arXiv preprint arXiv:2502.00677, 2025
-
[22]
Paulina Toro Isaza, Michael Nidd, Noah Zheutlin, Jae-wook Ahn, Chidansh Amitkumar Bhatt, Yu Deng, Ruchi Mahindru, Martin Franz, Hans Florian, and Salim Roukos. Retrieval augmented generation-based incident resolution recommendation system for it support. arXiv preprint arXiv:2409.13707, 2024
-
[23]
Exploring llm-based agents for root cause analysis
Devjeet Roy, Xuchao Zhang, Rashi Bhave, Chetan Bansal, Pedro Las-Casas, Rodrigo Fonseca, and Saravan Rajmohan. Exploring llm-based agents for root cause analysis. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pages 208–219, 2024
work page 2024
-
[24]
Raglog: Log anomaly detection using retrieval augmented generation
Jonathan Pan, Wong Swee Liang, and Yuan Yidi. Raglog: Log anomaly detection using retrieval augmented generation. In 2024 IEEE World Forum on Public Safety Technology (WFPST), pages 169–174. IEEE, 2024
work page 2024
-
[25]
A survey of aiops in the era of large language models
Lingzhe Zhang, Tong Jia, Mengxi Jia, Yifan Wu, Aiwei Liu, Yong Yang, Zhonghai Wu, Xuming Hu, Philip Yu, and Ying Li. A survey of aiops in the era of large language models. ACM Computing Surveys, 58(2):1–35, 2025
work page 2025
-
[26]
React: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022
work page 2022
-
[27]
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023
work page 2023
-
[28]
Hui Yang, Sifu Yue, and Yunzhong He. Auto-gpt for online decision making: Benchmarks and additional opinions. arXiv preprint arXiv:2306.02224, 2023
-
[29]
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36:38154–38180, 2023
work page 2023
-
[30]
Voyager: An Open-Ended Embodied Agent with Large Language Models
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[31]
TaskWeaver: A code-first agent framework
Bo Qiao, Liqun Li, Xu Zhang, Shilin He, Yu Kang, Chaoyun Zhang, Fangkai Yang, Hang Dong, Jue Zhang, Lu Wang, et al. Taskweaver: A code-first agent framework. arXiv preprint arXiv:2311.17541, 2023
-
[32]
IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning
Zihan Liang, Yufei Ma, Ben Chen, Zhipeng Qian, Huangyu Dai, Lingtao Mao, Xuxin Zhang, Chenyi Lei, and Wenwu Ou. Ig-search: Step-level information gain rewards for search-augmented reasoning. arXiv preprint arXiv:2604.15148, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[33]
The anatomy of an agent harness
LangChain. The anatomy of an agent harness. Engineering blog, 2026. URL https://blog.langchain.com/the-anatomy-of-an-agent-harness/. Published: 2026-03-10. Accessed: 2026-03-12
work page 2026
-
[34]
MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools
Wenhao Wang, Peizhi Niu, Zhao Xu, Zhaoyu Chen, Jian Du, Yaxin Du, Xianghe Pang, Keduan Huang, Yanfeng Wang, Qiang Yan, et al. Mcp-flow: Facilitating llm agents to master real-world, diverse and scaling mcp tools. arXiv preprint arXiv:2510.24284, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
Llma4itops: A lightweight llm-based multi-agent framework for it operations and maintenance
Zhuoxuan Jiang, Tianyang Zhang, Haotian Zhang, Yinong Xun, Yang Liu, Dehua Feng, Wen Si, and Shaohua Zhang. Llma4itops: A lightweight llm-based multi-agent framework for it operations and maintenance. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 471–482. Springer, 2025
work page 2025
-
[36]
Pei Yang, Wanyi Chen, Yuxi Zheng, Xueqian Li, Xiang Li, Haoqin Tu, Jie Xiao, Yifan Pang, Bill Shi, Lynn Ai, et al. Aoi: Turning failed trajectories into training signals for autonomous cloud diagnosis. arXiv preprint arXiv:2603.03378, 2026
-
[37]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020
work page 2020
-
[38]
Self-rag: Learning to retrieve, generate, and critique through self-reflection
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations, 2023
work page 2023
-
[39]
Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C Park. Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7036...
work page 2024
-
[40]
Memorybank: Enhancing large language models with long-term memory
Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731, 2024
work page 2024
-
[41]
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023
work page 2023
-
[42]
Generative agents: Interactive simulacra of human behavior
Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023
work page 2023
-
[43]
A-MEM: Agentic Memory for LLM Agents
Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[44]
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023
work page 2023
-
[45]
A survey on self-evolution of large language models
Zhengwei Tao, Ting-En Lin, Xiancai Chen, Hangyu Li, Yuchuan Wu, Yongbin Li, Zhi Jin, Fei Huang, Dacheng Tao, and Jingren Zhou. A survey on self-evolution of large language models. arXiv preprint arXiv:2404.14387, 2024
-
[46]
MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents
Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. Memskill: Learning and evolving memory skills for self-evolving agents. arXiv preprint arXiv:2602.02474, 2026
work page internal anchor Pith review arXiv 2026
-
[47]
Yaolun Zhang, Yiran Wu, Yijiong Yu, Qingyun Wu, and Huazheng Wang. Live-evo: Online evolution of agentic memory from continuous feedback. arXiv preprint arXiv:2602.02369, 2026
-
[48]
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H Chi, et al. Evo-memory: Benchmarking llm agent test-time learning with self-evolving memory. arXiv preprint arXiv:2511.20857, 2025
work page internal anchor Pith review arXiv 2025