Unsupervised Skill Discovery for Agentic Data Analysis

Huajun Chen; Kangqi Song; Lei Liang; Shengwei Tang; Shumin Deng; Shuofei Qiao; Zhisong Qiu

arXiv: 2606.06416 · v1 · pith:MUOJIBK7new · submitted 2026-06-04 · 💻 cs.AI · cs.CL · cs.LG · cs.MA

Zhisong Qiu, Kangqi Song, Shengwei Tang, Shuofei Qiao, Lei Liang, Huajun Chen, Shumin Deng

Pith reviewed 2026-06-28 01:10 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG · cs.MA

keywords unsupervised skill discovery · data-analytic agents · verifier-guided learning · contrastive distillation · report-style analysis · reasoning-style analysis · adaptive checklist verifier · answer agreement verifier

The pith

DataCOPE discovers reusable data-analysis skills from unlabeled trajectories by extracting verifier signals that indicate relative quality or agreement, then distilling them contrastively to improve agent performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DataCOPE as a framework that lets data-analytic agents acquire reusable procedural skills at inference time without any labeled supervision or parameter updates. It generates exploration trajectories with an agent, extracts quality or agreement signals using an unsupervised verifier, and distills effective skills through contrastive learning in a closed loop. Separate verifier designs handle report-style tasks via adaptive checklists and reasoning-style tasks via answer agreement. Experiments on two benchmarks show consistent gains over baselines across multiple models.
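The closed loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `discover_skill`, the three callables, and the `rounds` and `samples_per_task` parameters are all assumptions introduced here, and a trajectory is reduced to a plain string.

```python
from typing import Callable, List, Tuple

# Placeholder type: a real trajectory would hold code, tool calls, and outputs.
Trajectory = str


def discover_skill(
    tasks: List[str],
    run_agent: Callable[[str, str], Trajectory],
    verify: Callable[[List[Trajectory]], Tuple[List[Trajectory], List[Trajectory]]],
    distill: Callable[[str, List[Trajectory], List[Trajectory]], str],
    rounds: int = 3,
    samples_per_task: int = 4,
) -> str:
    """Hypothetical agent -> verifier -> skill-manager loop.

    run_agent(task, skill) samples one trajectory under the current skill.
    verify(trajectories) splits them into (preferred, rejected) without labels.
    distill(skill, preferred, rejected) returns an updated skill text.
    """
    skill = ""  # start with no skill; model parameters are never touched
    for _ in range(rounds):
        for task in tasks:
            trajectories = [run_agent(task, skill) for _ in range(samples_per_task)]
            preferred, rejected = verify(trajectories)
            # A contrast needs both sides; otherwise keep the current skill.
            if preferred and rejected:
                skill = distill(skill, preferred, rejected)
    return skill
```

Passing the three components as callables keeps the loop independent of the analysis format, so a checklist-based or agreement-based verifier could be swapped in without changing the control flow.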

Core claim

DataCOPE derives verifier signals from the exploration trajectories and uses them to characterize relative quality or agreement among trajectories. It iteratively coordinates a Data-Analytic Agent for trajectory generation, an Unsupervised Verifier for signal extraction, and a Skill Manager for contrastive skill distillation. For report-style analysis the verifier is instantiated as an Adaptive Checklist Verifier that derives task-specific criteria, scores reports by verifiable coverage, and iteratively refines the checklist. For reasoning-style analysis it is instantiated as an Answer Agreement Verifier that groups trajectories by answer agreement and uses self-consistency as an auxiliary signal.
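To make "scores reports by verifiable coverage" concrete, one plausible reading is the fraction of checklist items a report satisfies. The sketch below assumes that reading; `coverage_score`, `rank_reports`, and the toy `keyword_judge` are hypothetical names, and the keyword match merely stands in for whatever judge the paper actually uses. Checklist derivation and refinement are not shown.

```python
from typing import Callable, List, Sequence


def coverage_score(
    report: str,
    checklist: Sequence[str],
    is_covered: Callable[[str, str], bool],
) -> float:
    """Fraction of checklist items the report is judged to cover."""
    if not checklist:
        return 0.0  # no criteria yet, so no signal
    hits = sum(1 for item in checklist if is_covered(report, item))
    return hits / len(checklist)


def rank_reports(
    reports: Sequence[str],
    checklist: Sequence[str],
    is_covered: Callable[[str, str], bool],
) -> List[int]:
    """Indices of reports sorted from highest to lowest coverage."""
    scores = [coverage_score(r, checklist, is_covered) for r in reports]
    return sorted(range(len(reports)), key=lambda i: scores[i], reverse=True)


def keyword_judge(report: str, item: str) -> bool:
    """Toy stand-in for a judge: case-insensitive substring match."""
    return item.lower() in report.lower()
```

The ranking gives relative quality only; it says nothing about whether the top report is actually correct, which is the concern the referee report raises.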

What carries the argument

The unsupervised verifier (Adaptive Checklist Verifier or Answer Agreement Verifier) that extracts relative-quality or agreement signals from unlabeled trajectories to drive contrastive skill distillation.
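For the reasoning-style case, grouping by answer agreement with a self-consistency signal might look like the following. This is a sketch under stated assumptions: the normalization rule (strip and lowercase), the function names, and the definition of self-consistency as the majority group's share of samples are illustrative choices, not details confirmed by the abstract.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def normalize(answer: str) -> str:
    """Assumed normalization: trim whitespace and lowercase."""
    return answer.strip().lower()


def group_by_answer(answers: List[str]) -> Dict[str, List[int]]:
    """Map each normalized final answer to the indices of trajectories giving it."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for idx, answer in enumerate(answers):
        groups[normalize(answer)].append(idx)
    return dict(groups)


def self_consistency(groups: Dict[str, List[int]]) -> Tuple[str, float]:
    """Return the most common answer and its share of all samples."""
    total = sum(len(members) for members in groups.values())
    if total == 0:
        return "", 0.0
    # Ties resolve to the answer seen first, since dicts keep insertion order.
    top = max(groups, key=lambda key: len(groups[key]))
    return top, len(groups[top]) / total
```

Trajectories in the majority group and those outside it could then serve as the two sides of the contrast. Note that agreement measures frequency, not correctness, so a confidently repeated wrong answer would still score high.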

If this is right

Averaged across four model settings, DataCOPE raises mean score by 9.71 percent on report-style tasks.
Averaged across four model settings, DataCOPE raises mean score by 32.30 percent on reasoning-style tasks.
The same iterative loop of agent, verifier, and skill manager improves held-out performance over baselines in both analysis formats.
Skills are acquired without updating model parameters, remaining usable at inference time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same signal-extraction loop could be applied to other agent domains that produce long, variable-quality trajectories.
If the verifiers prove reliable, the approach could lower the cost of curating training data for agentic systems.
Discovered skills might transfer across datasets that share similar analytical structure even if surface features differ.

Load-bearing premise

Signals extracted by the unsupervised verifier from exploration trajectories reliably indicate relative quality or agreement without any external supervision or task-specific labels.

What would settle it

A controlled run in which skills distilled using the verifier signals produce no improvement or a performance drop on held-out tasks compared with the base agent.

Figures

Figures reproduced from arXiv: 2606.06416 by Huajun Chen, Kangqi Song, Lei Liang, Shengwei Tang, Shumin Deng, Shuofei Qiao, Zhisong Qiu.

**Figure 2.** Overview of the DataCOPE framework. The data-analytic agent samples trajectories from an unlabeled exploration set under the current skill, while an unsupervised verifier derives signals and groups trajectories without gold answers or task success labels. The Skill Manager contrasts the grouped trajectories to distill reusable procedures and create or update the skill iteratively. For report-style tasks, t…

**Figure 3.** Iteration Analysis. (a) Checklist-Score Dynamics on Refinement. We track the scores of reports generated by the Data-Analytic Agent on checklists generated by the Checklist Agent throughout the refinement process. Hollow markers denote refinement steps that fail to produce a valid skill update. (b) Report-Task Iteration Analysis. We evaluate the last valid skill after each Data-Analytic Agent refinement ro…

**Figure 4.** Further Analysis of DataCOPE. (a) Skill Granularity Analysis. We evaluate different skill granularities on DABStep and show that proper granularity is crucial for effective skill discovery. (b) Data-Analytic Agent Analysis. We replace the Data-Analytic Agent in DataCOPE and find that DataCOPE consistently improves the performance of skill discovery. (c) Supervised Skill Discovery Analysis. We compare Dat…

Original abstract

Inference-time skill augmentation provides a lightweight way to improve data-analytic agents by injecting reusable procedural knowledge without updating model parameters. However, discovering effective skills for data analysis remains challenging, as reliable supervision is expensive and success criteria vary across analytical formats. This raises the key question of how to discover reusable data-analysis skills from unlabeled exploration alone. We propose DataCOPE, an unsupervised verifier-guided skill discovery framework for data-analytic agents. DataCOPE derives verifier signals from the exploration trajectories and uses them to characterize relative quality or aggreement among trajectories. It iteratively coordinates a Data-Analytic Agent for trajectory generation, an Unsupervised Verifier for signal extraction, and a Skill Manager for contrastive skill distillation. For report-style analysis, we instantiate the verifier as an Adaptive Checklist Verifier that derives task-specific criteria, scores reports by verifiable coverage, and iteratively refines the checklist. For reasoning-style analysis, we instantiate it as an Answer Agreement Verifier that groups trajectories by answer agreement and uses self-consistency as an auxiliary signal. We evaluate DataCOPE on report-style analysis from Deep Data Research and reasoning-style analysis from DABStep. Across both settings, DataCOPE consistently improves held-out performance over baselines. Averaged across four model settings, DataCOPE improves the mean score by 9.71% and 32.30% on report-style and reasoning-style tasks respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

DataCOPE gives a practical unsupervised loop for skill discovery in data agents but the verifier signals are the untested load-bearing piece.

The main takeaway is that DataCOPE runs an iterative loop of trajectory generation, unsupervised verifier signals, and contrastive skill distillation, then shows average gains of 9.71% on report-style tasks and 32.30% on reasoning-style tasks across four model settings on held-out data from Deep Data Research and DABStep.

The framework is new in its split verifiers: an adaptive checklist that builds task-specific criteria for reports, and an answer-agreement verifier that uses self-consistency for reasoning. That handles the different output formats without labels, and the paper does a clean job of keeping the whole thing unsupervised while still claiming held-out generalization.

The soft spot is exactly the verifier reliability the stress-test flagged. The gains rest on those signals actually ranking trajectory quality rather than length, frequency, or surface patterns, yet the abstract supplies no ablations, human checks, or external validation of the signals. Without that, it is hard to know whether the distilled skills are doing the work or whether something simpler is driving the numbers. Experimental details on baseline construction and statistical tests are also absent, which keeps the evidence thin.

This is for people working on lightweight skill augmentation for data-analytic agents. A reader in that area would get value from the two verifier designs and the reported cross-task results, even if they plan to test the signals themselves.

The work shows clear engagement with the supervision problem and stays within its scoped claims. It deserves a serious referee to examine the methods section and the verifier validation.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DataCOPE, an unsupervised verifier-guided framework for skill discovery in data-analytic agents. It coordinates a Data-Analytic Agent to generate exploration trajectories, an Unsupervised Verifier (instantiated as Adaptive Checklist Verifier for report-style tasks or Answer Agreement Verifier for reasoning-style tasks) to extract relative quality or agreement signals, and a Skill Manager for contrastive skill distillation. The central empirical claim is that DataCOPE yields average held-out performance gains of 9.71% on report-style tasks (Deep Data Research) and 32.30% on reasoning-style tasks (DABStep) across four model settings, outperforming baselines.

Significance. If the verifier-derived signals reliably rank trajectories by substantive quality rather than surface artifacts, the method would offer a practical route to label-free skill acquisition for agentic data analysis, lowering supervision costs while improving generalization on held-out tasks. The iterative coordination loop and dual verifier instantiations represent a concrete advance over purely supervised or heuristic skill-discovery approaches in this domain.

major comments (2)

[Evaluation] Evaluation section: The reported mean improvements (9.71% and 32.30%) are stated without accompanying details on the number of runs, standard deviations, statistical significance tests, or data-exclusion criteria, preventing assessment of whether the gains are robust or attributable to the method rather than experimental variance.
[Unsupervised Verifier] Unsupervised Verifier subsection (Adaptive Checklist Verifier and Answer Agreement Verifier): No correlation analysis, ablation, or human validation is presented showing that the extracted signals (checklist coverage scores or answer-agreement/self-consistency) track actual trajectory quality or correctness on held-out data, rather than proxies such as trajectory length or answer frequency. Because skill distillation and all downstream coordination inherit rankings directly from these signals, this assumption is load-bearing for the performance claims.
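The second major comment can be turned into a quick diagnostic: correlate verifier scores with a suspected proxy such as trajectory length. The sketch below implements the standard Pearson coefficient by hand so that it has no dependencies. The function name is an assumption, and any inputs would come from the user's own runs; the paper reports no such analysis.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

A coefficient near 1 between verifier score and trajectory length would suggest the signal may be rewarding verbosity. A value near 0 would not prove the signal tracks quality, only that length alone does not explain it.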

minor comments (2)

[Abstract] Abstract contains the typo "aggreement" (should be "agreement").
[Abstract] The phrase "consistently improves held-out performance over baselines" is imprecise; it should specify the exact metrics and held-out splits used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and commit to revisions that strengthen the empirical claims.

Referee: [Evaluation] Evaluation section: The reported mean improvements (9.71% and 32.30%) are stated without accompanying details on the number of runs, standard deviations, statistical significance tests, or data-exclusion criteria, preventing assessment of whether the gains are robust or attributable to the method rather than experimental variance.

Authors: We agree that these statistical details are necessary. In the revised manuscript we will report results from 5 independent runs per model-task setting, include standard deviations, conduct paired t-tests against baselines with p-values, and explicitly state data-exclusion criteria (none were applied beyond standard formatting filters). The updated tables and text will appear in Section 4. revision: yes

Referee: [Unsupervised Verifier] Unsupervised Verifier subsection (Adaptive Checklist Verifier and Answer Agreement Verifier): No correlation analysis, ablation, or human validation is presented showing that the extracted signals (checklist coverage scores or answer-agreement/self-consistency) track actual trajectory quality or correctness on held-out data, rather than proxies such as trajectory length or answer frequency. Because skill distillation and all downstream coordination inherit rankings directly from these signals, this assumption is load-bearing for the performance claims.

Authors: We acknowledge the load-bearing nature of this assumption. We will add (i) Pearson correlations between verifier scores and trajectory length / answer frequency to rule out surface artifacts, (ii) an ablation that replaces the unsupervised verifier with random or length-based ranking, and (iii) a small-scale human validation study (n=50 trajectories per verifier type) measuring agreement with expert quality judgments on held-out data. These analyses will be reported in a new subsection of Section 3 and the appendix. revision: yes
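The paired t-test promised in the first response reduces to a short calculation over per-run score differences. The sketch below computes the test statistic and degrees of freedom using only the standard library; the function name is an assumption, and turning the statistic into a p-value would need a t-distribution table or a statistics package such as SciPy's `scipy.stats.ttest_rel`.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t(
    method_scores: Sequence[float],
    baseline_scores: Sequence[float],
) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched runs.

    Each position must be the same run condition (same seed, model, task)
    scored once with the method and once with the baseline.
    """
    if len(method_scores) != len(baseline_scores):
        raise ValueError("scores must be paired one-to-one")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired runs")
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1
```

With only five runs per setting the test has four degrees of freedom, so a gain would need to be large relative to run-to-run variation to reach conventional significance.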

Circularity Check

0 steps flagged

No significant circularity; framework relies on empirical held-out evaluation without self-referential reductions

The paper describes an empirical framework (DataCOPE) that generates trajectories, extracts unsupervised verifier signals (Adaptive Checklist Verifier or Answer Agreement Verifier), and distills skills via contrastive learning, then reports mean gains on held-out tasks from Deep Data Research and DABStep. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the central claims rest on external benchmark improvements rather than any derivation that reduces to its own inputs by construction. The unsupervised verifier assumption is a methodological hypothesis subject to falsification on held-out data, not a definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no free parameters, axioms, or invented entities are identifiable or extractable from the provided text.

Reference graph

Works this paper leans on

65 extracted references · 30 canonical work pages · 11 internal anchors

[1]

A survey of data agents: Emerging paradigm or overstated hype?

Y . Zhu, L. Wang, C. Yang, X. Lin, B. Li, W. Zhou, X. Liu, Z. Peng, T. Luo, Y . Li, C. Chai, C. Chen, S. Di, J. Fan, J. Sun, N. Tang, F. Tsung, J. Wang, C. Wu, Y . Xu, S. Zhang, Y . Zhang, X. Zhou, G. Li, and Y . Luo, “A survey of data agents: Emerging paradigm or overstated hype?”CoRR, vol. abs/2510.23587, 2025. [Online]. Available: https://doi.org/10.48...

work page doi:10.48550/arxiv.2510.23587 2025
[2]

Large language model-based data science agent: A survey,

P. Wang, Y . Yu, K. Chen, X. Zhan, and H. Wang, “Large language model-based data science agent: A survey,”CoRR, vol. abs/2508.02744,

arXiv
[3]

Available: https://doi.org/10.48550/arXiv.2508.02744

[Online]. Available: https://doi.org/10.48550/arXiv.2508.02744

work page doi:10.48550/arxiv.2508.02744
[4]

Deep research: A survey of autonomous research agents,

W. Zhang, X. Li, Y . Zhang, P. Jia, Y . Wang, H. Guo, Y . Liu, and X. Zhao, “Deep research: A survey of autonomous research agents,”CoRR, vol. abs/2508.12752, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2508.12752

work page doi:10.48550/arxiv.2508.12752 2025
[5]

Data interpreter: An LLM agent for data science,

S. Hong, Y . Lin, B. Liu, B. Liu, B. Wu, C. Zhang, D. Li, J. Chen, J. Zhang, J. Wang, L. Zhang, L. Zhang, M. Yang, M. Zhuge, T. Guo, T. Zhou, W. Tao, R. Tang, X. Lu, X. Zheng, X. Liang, Y . Fei, Y . Cheng, Y . Ni, Z. Gou, Z. Xu, Y . Luo, and C. Wu, “Data interpreter: An LLM agent for data science,” inFindings of the Association for Computational Linguisti...

2025
[6]

Agenticdata: An agentic data analytics system for heterogeneous data,

J. Sun, G. Li, P. Zhou, Y . Ma, J. Xu, and Y . Li, “Agenticdata: An agentic data analytics system for heterogeneous data,”CoRR, vol. abs/2508.05002, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2508.05002

work page doi:10.48550/arxiv.2508.05002 2025
[7]

DS-STAR: data science agent via iterative planning and verification,

J. Nam, J. Yoon, J. Chen, and T. Pfister, “DS-STAR: data science agent via iterative planning and verification,”CoRR, vol. abs/2509.21825,

arXiv
[8]

Available: https://doi.org/10.48550/arXiv.2509.21825

[Online]. Available: https://doi.org/10.48550/arXiv.2509.21825

work page doi:10.48550/arxiv.2509.21825
[9]

Agentada: Skill-adaptive data analytics for tailored insight discovery,

A. Abaskohi, A. V . Ramesh, S. Nanisetty, C. Goel, D. V ´azquez, C. Pal, S. Gella, G. Carenini, and I. H. Laradji, “Agentada: Skill-adaptive data analytics for tailored insight discovery,”CoRR, vol. abs/2504.07421,

arXiv
[10]

Available: https://doi.org/10.48550/arXiv.2504.07421

[Online]. Available: https://doi.org/10.48550/arXiv.2504.07421

work page doi:10.48550/arxiv.2504.07421
[11]

Datawiseagent: A notebook-centric LLM agent framework for automated data science,

Z. You, Y . Zhang, D. Xu, Y . Lou, Y . Yan, W. Wang, H. Zhang, and Y . Huang, “Datawiseagent: A notebook-centric LLM agent framework for automated data science,”CoRR, vol. abs/2503.07044, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2503.07044

work page doi:10.48550/arxiv.2503.07044 2025
[12]

Data-copilot: Bridging billions of data and humans with autonomous workflow,

W. Zhang, Y . Shen, W. Lu, and Y . Zhuang, “Data-copilot: Bridging billions of data and humans with autonomous workflow,”CoRR, vol. abs/2306.07209, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2306.07209

work page doi:10.48550/arxiv.2306.07209 2023
[13]

Dagent: A relational database-driven data analysis report generation agent,

W. Xu, Y . Mao, X. Zhang, C. Zhang, X. Dong, M. Zhang, and Y . Gao, “Dagent: A relational database-driven data analysis report generation agent,”CoRR, vol. abs/2503.13269, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2503.13269

work page doi:10.48550/arxiv.2503.13269 2025
[14]

Scaling generalist data- analytic agents,

S. Qiao, Y . Zhao, Z. Qiu, X. Wang, J. Zhang, Z. Bin, N. Zhang, Y . Jiang, P. Xie, F. Huang, and H. Chen, “Scaling generalist data- analytic agents,”CoRR, vol. abs/2509.25084, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2509.25084

work page doi:10.48550/arxiv.2509.25084 2025
[15]

Deepanalyze: Agentic large language models for autonomous data science,

S. Zhang, J. Fan, M. Fan, G. Li, and X. Du, “Deepanalyze: Agentic large language models for autonomous data science,”CoRR, vol. abs/2510.16872, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2510.16872

work page doi:10.48550/arxiv.2510.16872 2025
[16]

What are skills?

Anthropic, “What are skills?” Claude Help Cen- ter, 2026, accessed: 2026-05-03. [Online]. Available: https://support.claude.com/en/articles/12512176-what-are-skills

arXiv 2026
[17]

SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

Y . Jiang, D. Li, H. Deng, B. Ma, X. Wang, Q. Wang, and G. Yu, “Sok: Agentic skills - beyond tool use in LLM agents,”CoRR, vol. abs/2602.20867, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2602.20867

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.20867 2026
[18]

Agent skills: A data- driven analysis of claude skills for extending large language model functionality,

G. F. Ling, S. Zhong, and R. L. Huang, “Agent skills: A data- driven analysis of claude skills for extending large language model functionality,”ArXiv, vol. abs/2602.08004, 2026. [Online]. Available: https://api.semanticscholar.org/CorpusID:285453033

arXiv 2026
[19]

Skillx: Automatically constructing skill knowledge bases for agents,

C. Wang, Z. Yu, X. Xie, W. Yao, R. Fang, S. Qiao, K. Cao, G. Zheng, X. Qi, P. Zhang, and S. Deng, “Skillx: Automatically constructing skill knowledge bases for agents,” 2026. [Online]. Available: https://api.semanticscholar.org/CorpusID:287204111

2026
[20]

Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

J. Ni, Y . Liu, X. Liu, Y . Sun, M. Zhou, P. Cheng, D. Wang, E. Zhao, X. Jiang, and G. Jiang, “Trace2skill: Distill trajectory-local lessons into transferable agent skills,”CoRR, vol. abs/2603.25158, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2603.25158

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2603.25158 2026
[21]

EvoSkill: Automated Skill Discovery for Multi-Agent Systems

S. Alzubi, N. Provenzano, J. Bingham, W. Chen, and T. Vu, “Evoskill: Automated skill discovery for multi-agent systems,”CoRR, vol. abs/2603.02766, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2603.02766

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2603.02766 2026
[22]

Coevoskills: Self- evolving agent skills via co-evolutionary verification,

H. Zhang, S. Fan, H. P. Zou, Y . Chen, Z. Wang, J. Zhou, C. Li, W.-C. Huang, Y . Yao, K. Zheng, X. Liu, X. Li, and P. S. Yu, “Coevoskills: Self- evolving agent skills via co-evolutionary verification,” 2026. [Online]. Available: https://api.semanticscholar.org/CorpusID:287071917

2026
[23]

Skillopt: Executive strategy for self-evolving agent skills,

Y . Yang, Z. Gong, W. Huang, Q. Yang, Z. Zhou, Z. Huang, Y . Li, X. Gao, Q. Dai, B. Liu, K. Qiu, Y . Yang, D. Chen, X.-T. Yang, and C. Luo, “Skillopt: Executive strategy for self-evolving agent skills,” 2026. [Online]. Available: https://api.semanticscholar.org/CorpusID:288652900

2026
[24]

Skillclaw: Let skills evolve collectively with agentic evolver,

Z. Ma, S. Yang, Y . Ji, X. Wang, Y . Wang, Y . Hu, T. Huang, and X. Chu, “Skillclaw: Let skills evolve collectively with agentic evolver,” 2026. [Online]. Available: https://api.semanticscholar.org/CorpusID:287256390

2026
[25]

Skillos: Learning skill curation for self-evolving agents,

S. Ouyang, J. Yan, Y . Chen, R. Han, Z. Wang, B. Dalvi, R. Meng, C.-L. Li, Y . Jiao, K. Zha, M. Shen, V . Tirumalashetty, G. Lee, J. Han, T. Pfister, and C.-Y . Lee, “Skillos: Learning skill curation for self-evolving agents,” 2026. [Online]. Available: https://api.semanticscholar.org/CorpusID:288014414

2026
[26]

Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models

W. Liu, P. Yu, M. Orini, Y . Du, and Y . He, “Hunt instead of wait: Evaluating deep data research on large language models,”CoRR, vol. abs/2602.02039, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2602.02039

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.02039 2026
[27]

Dabstep: Data agent benchmark for multi-step reasoning,

A. Egg, M. I. Goyanes, F. Kingma, A. Mora, L. von Werra, and T. Wolf, “Dabstep: Data agent benchmark for multi-step reasoning,”CoRR, vol. abs/2506.23719, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2506.23719

work page doi:10.48550/arxiv.2506.23719 2025
[28]

React: Synergizing reasoning and acting in language models,

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y . Cao, “React: Synergizing reasoning and acting in language models,” inThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. [Online]. Available: https://openreview.net/forum?id=WE vluYUL-X

2023
[29]

Openai GPT-5 system card,

OpenAI, “Openai GPT-5 system card,”CoRR, vol. abs/2601.03267,

Pith/arXiv arXiv
[30]

OpenAI GPT-5 System Card

[Online]. Available: https://doi.org/10.48550/arXiv.2601.03267

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.03267
[31]

Skill Creator,

Anthropic, “Skill Creator,” 2026, accessed: 2026-06-03. [Online]. Available: https://github.com/anthropics/skills/blob/main/skills/skill- creator/SKILL.md

2026
[32]

System Card: Claude Sonnet 4.6,

——, “System Card: Claude Sonnet 4.6,” Anthropic Model System Cards, 2026, accessed: 2026-06-03. [Online]. Available: https://www- cdn.anthropic.com/bbd8ef16d70b7a1665f14f306ee88b53f686aa75.pdf

2026
[33]

System Card: Claude Sonnet 4.5,

——, “System Card: Claude Sonnet 4.5,” Anthropic Model System Cards, 2025, accessed: 2026-05-17. [Online]. Available: https://www- cdn.anthropic.com/963373e433e489a87a10c823c52a0a013e9172dd.pdf

2025
[34]

Deepseek-v4: Towards highly efficient million-token context intelligence,

DeepSeek-AI, “Deepseek-v4: Towards highly efficient million-token context intelligence,” 2026

2026
[35]

Qwen3.5: Accelerating productivity with native multimodal agents,

Q. Team, “Qwen3.5: Accelerating productivity with native multimodal agents,” February 2026. [Online]. Available: https://qwen.ai/blog?id=qwen3.5

2026
[36]

Dsgym: A holistic framework for evaluating and training data science agents,

F. Nie, J. Wang, H. Hua, F. Bianchi, Y . Kwon, Z. Qi, O. Queen, S. Zhu, and J. Zou, “Dsgym: A holistic framework for evaluating and training data science agents,”CoRR, vol. abs/2601.16344, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2601.16344

work page doi:10.48550/arxiv.2601.16344 2026
[37]

Kramabench: A benchmark for AI systems on data-to-insight pipelines over data lakes,

E. Lai, G. Vitagliano, Z. Zhang, S. Sudhir, O. Chabra, A. Zeng, A. A. Zabreyko, C. Li, F. Kossmann, J. Ding, J. Chen, M. Markakis, M. Russo, W. Wang, Z. Wu, M. J. Cafarella, L. Cao, S. Madden, and T. Kraska, “Kramabench: A benchmark for AI systems on data-to-insight pipelines over data lakes,”CoRR, vol. abs/2506.06541,

arXiv
[38]

Available: https://doi.org/10.48550/arXiv.2506.06541

[Online]. Available: https://doi.org/10.48550/arXiv.2506.06541

work page doi:10.48550/arxiv.2506.06541
[39]

Sanity checks for agentic data science,

Z. T. Rewolinski, A. Zane, H. Huang, C. Singh, C. Wang, J. Gao, and B. Yu, “Sanity checks for agentic data science,” 2026. [Online]. Available: https://api.semanticscholar.org/CorpusID:287433410

2026
[40]

Matplotagent: Method and evaluation for llm-based agentic scientific data visualization,

Z. Yang, Z. Zhou, S. Wang, X. Cong, X. Han, Y . Yan, Z. Liu, Z. Tan, P. Liu, D. Yu, Z. Liu, X. Shi, and M. Sun, “Matplotagent: Method and evaluation for llm-based agentic scientific data visualization,” inFindings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, L. Ku, A. Martins, and V...

work page doi:10.18653/v1/2024.findings-acl.701 2024
[41]

Insightpilot: An llm-empowered automated data exploration system,

P. Ma, R. Ding, S. Wang, S. Han, and D. Zhang, “Insightpilot: An llm-empowered automated data exploration system,” inProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 - System Demonstrations, Singapore, December 6-10, 2023, Y . Feng and E. Lefever, Eds. Association for Computational Linguistics, 2023, pp. 3...

work page doi:10.18653/v1/2023.emnlp-demo.31 2023
[42]

Datastorm: Deep research on large-scale databases using exploratory data analysis and data storytelling,

S. Liu, Y . Jiang, S. Farook, C. N. Sanchez, D. F. C. Pena, and M. S. Lam, “Datastorm: Deep research on large-scale databases using exploratory data analysis and data storytelling,” 2026. [Online]. Available: https://api.semanticscholar.org/CorpusID:287248168

2026
[43]

Datacross: A unified benchmark and agent framework for cross-modal heterogeneous data analysis,

R. Qi, Z. Liu, and W. Zhang, “Datacross: A unified benchmark and agent framework for cross-modal heterogeneous data analysis,”ArXiv, vol. abs/2601.21403, 2026. [Online]. Available: https://api.semanticscholar.org/CorpusID:285140426

arXiv 2026
[44]

DeepEye-SQL: A Software-Engineering-Inspired Text-to-SQL Framework

B. Li, C. Chen, Z. Xue, Y . Mei, and Y . Luo, “Deepeye-sql: A software-engineering-inspired text-to-sql frame- work,”CoRR, vol. abs/2510.17586, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2510.17586

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2510.17586 2025
[45]

Why do open-source llms struggle with data analysis? A systematic empirical study,

Y . Zhu, Y . Zhong, J. Zhang, Z. Zhang, S. Qiao, Y . Luo, L. Du, D. Zheng, H. Chen, and N. Zhang, “Why do open-source llms struggle with data analysis? A systematic empirical study,”CoRR, vol. abs/2506.19794,

arXiv
[46]

Available: https://doi.org/10.48550/arXiv.2506.19794

[Online]. Available: https://doi.org/10.48550/arXiv.2506.19794

work page doi:10.48550/arxiv.2506.19794
[47]

Welcome to the era of experience

D. Silver and R. Sutton, “Welcome to the era of experience.” [Online]. Available: https://api.semanticscholar.org/CorpusID:277919528
[48]

Agent skills for large language models: Architecture, acquisition, security, and the path forward,

R. Xu and Y . Yan, “Agent skills for large language models: Architecture, acquisition, security, and the path forward,”CoRR, vol. abs/2602.12430,

Pith/arXiv arXiv
[49]

Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

[Online]. Available: https://doi.org/10.48550/arXiv.2602.12430

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.12430
[50]

Inducing programmatic skills for agentic tasks,

Z. Z. Wang, A. Gandhi, G. Neubig, and D. Fried, “Inducing programmatic skills for agentic tasks,”CoRR, vol. abs/2504.06821,

arXiv
[51]

arXiv preprint arXiv:2504.06821 , year=

[Online]. Available: https://doi.org/10.48550/arXiv.2504.06821

work page doi:10.48550/arxiv.2504.06821
[52]

Reinforcement Learning for Self-Improving Agent with Skill Library

J. Wang, Q. Yan, Y . Wang, Y . Tian, S. S. Mishra, Z. Xu, M. Gandhi, P. Xu, and L. L. Cheong, “Reinforcement learning for self-improving agent with skill library,”CoRR, vol. abs/2512.17102, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2512.17102

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2512.17102 2025
[53]

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

H. Zhang, Q. Long, J. Bao, T. Feng, W. Zhang, H. Yue, and W. Wang, “Memskill: Learning and evolving memory skills for self-evolving agents,”CoRR, vol. abs/2602.02474, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2602.02474

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.02474 2026
[54]

Memp: Exploring Agent Procedural Memory

R. Fang, Y . Liang, X. Wang, J. Wu, S. Qiao, P. Xie, F. Huang, H. Chen, and N. Zhang, “Memp: Exploring agent procedural memory,”CoRR, vol. abs/2508.06433, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2508.06433

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2508.06433 2025
[55]

Thinking with Reasoning Skills: Fewer Tokens, More Accuracy

G. Zhao, Q. Shi, X. Xiao, X. Zhang, T. Yang, and L. Sun, “Thinking with reasoning skills: Fewer tokens, more accuracy,”CoRR, vol. abs/2604.21764, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2604.21764

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.21764 2026
[56]

Skillrl: Evolving agents via recursive skill-augmented reinforcement learning

P. Xia, J. Chen, H. Wang, J. Liu, K. Zeng, Y. Wang, S. Han, Y. Zhou, X. Zhao, H. Chen, Z. Zheng, C. Xie, and H. Yao, “Skillrl: Evolving agents via recursive skill-augmented reinforcement learning,” ArXiv, vol. abs/2602.08234, 2026. [Online]. Available: https://api.semanticscholar.org/CorpusID:285452037
[57]

Skillforge: Forging domain-specific, self-evolving agent skills in cloud technical support

X. Liu, X. Luo, L. Li, G. Huang, J. Liu, and H. Qiao, “Skillforge: Forging domain-specific, self-evolving agent skills in cloud technical support,” 2026. [Online]. Available: https://api.semanticscholar.org/CorpusID:287351631
[58]

Memento-skills: Let agents design agents

H. Zhou, S. Guo, A. Liu, Z. Yu, Z. Gong, B. Zhao, Z. Chen, M. Zhang, Y. Chen, J. Li, R. Yang, Q. Liu, X. Yu, J. Zhou, N. Wang, C. Sun, and J. Wang, “Memento-skills: Let agents design agents,” 2026. [Online]. Available: https://api.semanticscholar.org/CorpusID:286673350
[59]

Autoskill: Experience-driven lifelong learning via skill self-evolution

Y. Yang, J. Li, Q. Pan, B. Zhan, Y. Cai, L. Du, J. Zhou, K. Chen, Q. Chen, X. Li, B. Zhang, and L. He, “Autoskill: Experience-driven lifelong learning via skill self-evolution,” ArXiv, vol. abs/2603.01145, 2026. [Online]. Available: https://api.semanticscholar.org/CorpusID:286224498
[60]

From context to skills: Can language models learn from context skillfully?

S. Si, H. Zhao, Y. Lei, Q. Wang, D. Chen, Z. Wang, Z. Wang, K. Luo, Z. Wang, G. Chen, F. Qi, M. Zhang, and M. Sun, “From context to skills: Can language models learn from context skillfully?” 2026. [Online]. Available: https://api.semanticscholar.org/CorpusID:287915777
[61]

Mmskills: Towards multimodal skills for general visual agents

K. Zhang, S. Shao, Q. Li, J. Lin, L. Fu, S. Wang, W. Jiao, Y. Lu, W. Liu, W. Zhang, and Y. Yu, “Mmskills: Towards multimodal skills for general visual agents,” 2026. [Online]. Available: https://api.semanticscholar.org/CorpusID:288254572
[62]

Skillsvote: Lifecycle governance of agent skills from collection, recommendation to evolution

H. Liu, H. Yang, T. Jiang, B. Tang, F. Xiong, and Z. Li, “Skillsvote: Lifecycle governance of agent skills from collection, recommendation to evolution,” 2026. [Online]. Available: https://api.semanticscholar.org/CorpusID:288651284
[63]

arXiv preprint arXiv:2603.04448, 2026. doi:10.48550/arXiv.2603.04448

Y. Liang, R. Zhong, H. Xu, C. Jiang, Y. Zhong, R. Fang, J. Gu, S. Deng, Y. Yao, M. Wang, S. Qiao, X. Xu, T. Wu, K. Wang, Y. Liu, Z. Bi, J. Lou, Y. E. Jiang, H. Zhu, G. Yu, H. Hong, L. Huang, H. Xue, C. Wang, Y. Wang, Z. Shan, X. Chen, Z. Tu, F. Xiong, X. Xie, P. Zhang, Z. Gui, L. Liang, J. Zhou, C. Wu, J. Shang, Y. Gong, J. Lin, C. Xu, H. Deng, W. ...
[64]

Skillrouter: Skill routing for llm agents at scale

Y. Zheng, Z. Zhang, C. Ma, Y. Yu, J. Zhu, Y. Wu, T. Xu, B. Dong, H. Zhu, R. Huang, and G. Yu, “Skillrouter: Skill routing for llm agents at scale,” 2026. [Online]. Available: https://api.semanticscholar.org/CorpusID:286770530
[65]

Organizing, orchestrating, and benchmarking agent skills at ecosystem scale

H. Li, C. Mu, J. Chen, S. Ren, Z. Cui, Y. Zhang, L. Bai, and S. Hu, “Organizing, orchestrating, and benchmarking agent skills at ecosystem scale,” 2026. [Online]. Available: https://api.semanticscholar.org/CorpusID:286222444
