pith. sign in

arxiv: 2606.01185 · v1 · pith:3FERRCFBnew · submitted 2026-05-31 · 💻 cs.AI

"Skill issues'': data-centric optimization of lakehouse agents

Pith reviewed 2026-06-28 17:24 UTC · model grok-4.3

classification 💻 cs.AI
keywords coding agentslakehouseskill optimizationdata-centric optimizationagent evaluationbranching data workflowssandbox execution
0
0 comments X

The pith

A data-centric pipeline optimizes skills for branching lakehouse agents and raises accuracy by 31.9 percent on 25 tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how coding agents interact with data infrastructure through a branching lakehouse system. Success depends on both the model and the skills that teach agents to use the system via headless APIs and Git-like primitives. The key shift is that agent-generated code produces inspectable lakehouse state changes, turning evaluation into state verification rather than output matching. A pipeline generates task-verifier pairs, runs candidate skills in sandboxes, and scores trajectories with trace signals plus programmatic lakehouse checks. In preliminary tests on 25 tasks this yields a 31.9 percent accuracy gain, indicating that write-path data workflows supply a practical substrate for skill improvement beyond read-only settings.

Core claim

The central claim is that a branching lakehouse converts data-agent evaluation into a state-verification problem. This enables a data-centric optimization pipeline that generates task-verifier pairs, executes candidate skills inside isolated sandboxes, and scores trajectories with both trace-level signals and programmatic checks over lakehouse state. When applied to 25 tasks the optimized skills improve accuracy by 31.9 percent. The results indicate that write-path data workflows provide a useful substrate for optimizing agent skills beyond read-only tasks.

What carries the argument

The data-centric optimization pipeline that generates task-verifier pairs, executes candidate skills in isolated sandboxes, and scores trajectories using trace-level signals and programmatic checks over lakehouse state.

If this is right

  • Agent evaluation on a branching lakehouse becomes a state-verification problem instead of output matching.
  • Optimized skills raise accuracy by 31.9 percent on the 25 evaluated tasks.
  • Write-path data workflows supply a substrate for skill optimization that read-only tasks lack.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pipeline could be tested on other data systems that expose branching or commit primitives to agents.
  • Skill optimization might allow smaller models to reach comparable performance when the environment files are tuned.
  • Extending the sandbox checks to include merge-conflict detection or branch-history queries would test the approach on more complex workflows.

Load-bearing premise

The 25 tasks and generated task-verifier pairs are representative of real data workflows and sandbox execution faithfully measures production lakehouse behavior without post-hoc selection effects.

What would settle it

Run the optimized skills on a larger collection of tasks drawn from actual production lakehouse environments and measure whether the accuracy gain holds.

Figures

Figures reproduced from arXiv: 2606.01185 by Davide Ghilardi, Giacomo Piccinini, Jacopo Tagliabue, Nicole Rose Schneider.

Figure 1
Figure 1. Figure 1: Coding agent setup. The user chooses a task and [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Pipeline code maps to verifiable lakehouse changes. A one-node DAG transforming source into table-1 (Listing 1) runs on a data branch: a successful run will alter the lake state in a predictable and verifiable way. Here, commits in the branch correspond to writes (delete, create, import), and one commit in main corresponds to the merge, which atomically updates only the snapshot of table-1. The same APIs c… view at source ↗
Figure 3
Figure 3. Figure 3: High-level system architecture. An LLM-powered [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Mean agent cost (USD/task) and wall-clock time [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: GEPA optimization curves for the skill components that produced a committed candidate beyond the seed: validation [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
read the original abstract

Coding agents are becoming users of data infrastructure, but their success depends not only on model quality: it also depends on the skills and environment files that teach agents how to use a system. We study how to optimize these artifacts for agents operating on a branching lakehouse, Bauplan. In our setting, headless APIs and Git-like data primitives expose data workflows through code, branches, commits, and merges. Our central observation is that a branching lakehouse turns data-agent evaluation from an output-matching problem into a state-verification problem: agent-generated pipeline code induces concrete, inspectable lakehouse changes. We present a data-centric optimization pipeline that generates task-verifier pairs, executes candidate skills in isolated sandboxes, and scores trajectories using both trace-level signals and programmatic checks over lakehouse state. In a preliminary evaluation on 25 tasks, optimized skills improve accuracy by 31.9%. These results suggest that write-path data workflows provide a useful substrate for optimizing agent skills beyond read-only tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a data-centric optimization pipeline for agent 'skills' in a branching lakehouse (Bauplan) that exposes data workflows via headless APIs and Git-like primitives. The central observation is that branching lakehouses convert agent evaluation from output-matching to state-verification. The pipeline generates task-verifier pairs, executes candidate skills in isolated sandboxes, and scores trajectories using trace-level signals plus programmatic checks on lakehouse state. A preliminary evaluation on 25 tasks reports that optimized skills improve accuracy by 31.9%, suggesting write-path data workflows as a useful substrate for skill optimization beyond read-only tasks.

Significance. If the empirical result can be substantiated with independent tasks, proper controls, and production-fidelity sandboxes, the work would offer a concrete method for improving coding-agent performance on data infrastructure by optimizing skills and environments rather than models alone. It identifies a novel evaluation substrate (branching lakehouse state changes) that could generalize to other write-heavy agent settings. The preliminary status and absence of baseline comparisons currently constrain the significance to a proof-of-concept level.

major comments (2)
  1. [Abstract] Abstract: the headline claim of a 31.9% accuracy improvement on 25 tasks is presented without baselines, error bars, control conditions, or any description of how the optimization was performed; this directly undermines assessment of the central empirical result.
  2. [Evaluation] Evaluation section (implied by the preliminary results paragraph): the 25 task-verifier pairs are generated by the pipeline itself, yet no evidence is supplied that they are drawn from an independent distribution of real lakehouse workloads or that sandbox state checks replicate production behavior (concurrency, scale, schema evolution); this selection/measurement bias risk is load-bearing for the generalization claim.
minor comments (2)
  1. [Introduction] The manuscript would benefit from an explicit early definition of 'skills' and 'environment files' and how they differ from standard prompt engineering or tool-use setups.
  2. [Pipeline description] Figure or table captions describing the optimization pipeline should include the exact scoring formula that combines trace signals and lakehouse state checks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the preliminary nature of our evaluation. We address each major comment below and will revise the manuscript accordingly to better contextualize the results and limitations.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim of a 31.9% accuracy improvement on 25 tasks is presented without baselines, error bars, control conditions, or any description of how the optimization was performed; this directly undermines assessment of the central empirical result.

    Authors: We agree the abstract is too terse. The 31.9% figure reflects the lift from initial to optimized skills on the identical 25 task-verifier pairs generated and scored by the pipeline; no external baselines or error bars are reported because the study is framed as a proof-of-concept. We will revise the abstract to (1) briefly describe the optimization loop (task generation, sandbox execution, state-verification scoring), (2) state that the comparison is against unoptimized skills on the same tasks, and (3) qualify the result as preliminary. revision: yes

  2. Referee: [Evaluation] Evaluation section (implied by the preliminary results paragraph): the 25 task-verifier pairs are generated by the pipeline itself, yet no evidence is supplied that they are drawn from an independent distribution of real lakehouse workloads or that sandbox state checks replicate production behavior (concurrency, scale, schema evolution); this selection/measurement bias risk is load-bearing for the generalization claim.

    Authors: The tasks are indeed produced by the same pipeline under study, and the sandboxes are isolated rather than production-scale. The manuscript already labels the evaluation “preliminary” and does not assert generalization. We will expand the evaluation section to (a) describe the task-generation procedure, (b) explicitly note the risk of distribution shift and lack of independent real-world workloads, and (c) list the sandbox limitations (no concurrency, scale, or schema-evolution testing) as open constraints on the current claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical result with no derivation chain

full rationale

The paper reports an empirical accuracy improvement of 31.9% on 25 tasks via a pipeline that generates task-verifier pairs and evaluates skills in sandboxes. No equations, derivations, or mathematical reductions are present in the provided text. The central claim does not reduce to its inputs by construction, nor does it rely on self-citations for uniqueness, ansatzes, or load-bearing premises. The evaluation is presented as preliminary without any self-definitional or fitted-input patterns that would force the result. This is a standard empirical study whose validity rests on external questions of task representativeness rather than internal circularity in any claimed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities; all fields left empty.

pith-pipeline@v0.9.1-grok · 5707 in / 1018 out tokens · 21921 ms · 2026-06-28T17:24:58.417021+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 3 canonical work pages

  1. [1]

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alex Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. 2025. GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. InFirst Workshop on Foun...

  2. [2]

    Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. 2026. Evoskill: Automated skill discovery for multi-agent systems.arXiv preprint arXiv:2603.02766(2026)

  3. [3]

    Anthropic. 2025. The Complete Guide to Building Skills for Claude. https://resources.anthropic.com/hubfs/The-Complete-Guide-to-Building- Skill-for-Claude.pdf

  4. [4]

    Apache. 2024. Iceberg. https://github.com/apache/iceberg

  5. [5]

    Chris L Baker, Rebecca Saxe, and Joshua B Tenenbaum. 2009. Action understand- ing as inverse planning.Cognition113, 3 (2009), 329–349

  6. [6]

    Ke Chen, Peiran Wang, Yaoning Yu, Xianyang Zhan, and Haohan Wang

  7. [7]

    arXiv:2508.02744 [cs.AI] https://arxiv.org/abs/2508.02744

    Large Language Model-based Data Science Agent: A Survey. arXiv:2508.02744 [cs.AI] https://arxiv.org/abs/2508.02744

  8. [8]

    Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Matei Zaharia, James Zou, and Ion Stoica. 2025. Optimizing Model Selection for Compound AI 6 Systems. arXiv:2502.14815 [cs.AI] https://arxiv.org/abs/2502.14815

  9. [9]

    Ruirui Chen, Weifeng Jiang, Chengwei Qin, and Cheston Tan. 2025. The- ory of Mind in Large Language Models: Assessment and Enhancement. arXiv:2505.00026 [cs.CL] https://arxiv.org/abs/2505.00026

  10. [10]

    Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, and Hamza Harkous

    Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, and Hamza Harkous. 2026. Reasoning-Driven Synthetic Data Generation and Evaluation. arXiv:2603.29791 [cs.AI] https://arxiv.org/abs/2603.29791

  11. [11]

    Yudong Gao, Zongjie Li, Zimo Ji, Pingchuan Ma, Shuai Wang, et al . 2026. Skillreducer: Optimizing llm agent skills for token efficiency.arXiv preprint arXiv:2603.29919(2026)

  12. [12]

    Gelpí, Eric Xue, and William A

    Rebekah A. Gelpí, Eric Xue, and William A. Cunningham. 2025. Towards Ma- chine Theory of Mind with Large Language Model-Augmented Inverse Planning. arXiv:2507.03682 [cs.AI] https://arxiv.org/abs/2507.03682

  13. [13]

    Jingzhi Gong, Ruizhen Gu, Zhiwei Fei, Yazhuo Cao, Lukas Twist, Alina Geiger, Shuo Han, Dominik Sobania, Federica Sarro, and Jie M Zhang. 2026. SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering.arXiv preprint arXiv:2604.09297(2026)

  14. [14]

    2026.Harbor: A framework for evaluating and opti- mizing agents and models in container environments

    Harbor Framework Team. 2026.Harbor: A framework for evaluating and opti- mizing agents and models in container environments. https://github.com/harbor- framework/harbor

  15. [15]

    Chenyi Huang, Haoting Zhang, Jingxu Xu, Zeyu Zheng, and Yunduan Lin. 2026. Bilevel Optimization of Agent Skills via Monte Carlo Tree Search.arXiv preprint arXiv:2604.15709(2026)

  16. [16]

    Ruanqianqian Huang, Avery Reyna, Sorin Lerner, Haijun Xia, and Brian Hempel

  17. [17]

    arXiv:2512.14012 [cs.SE] https://arxiv.org/abs/2512.14012

    Professional Software Developers Don’t Vibe, They Control: AI Agent Use for Coding in 2025. arXiv:2512.14012 [cs.SE] https://arxiv.org/abs/2512.14012

  18. [18]

    Hyunmin Hwang, Jaemin Kim, Choonghan Kim, Hangeol Chang, and Jong Chul Ye. 2026. AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization.arXiv preprint arXiv:2605.08704(2026)

  19. [19]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. InThe Twelfth International Conference on Learning Representations

  20. [20]

    Pan, Guilin Qi, Haofen Wang, and Huajun Chen

    Yuan Liang, Ruobin Zhong, Haoming Xu, Chen Jiang, Yi Zhong, Runnan Fang, Jia-Chen Gu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Xin Xu, Tongtong Wu, Kun Wang, Yang Liu, Zhen Bi, Jungang Lou, Yuchen Eleanor Jiang, Hangcheng Zhu, Gang Yu, Haiwen Hong, Longtao Huang, Hui Xue, Chenxi Wang, Yijun Wang, Zifei Shan, Xi Chen, Zhaopeng Tu, Feiyu Xiong, X...

  21. [21]

    arXiv:2603.04448 [cs.AI] https://arxiv.org/abs/2603.04448

    SkillNet: Create, Evaluate, and Connect AI Skills. arXiv:2603.04448 [cs.AI] https://arxiv.org/abs/2603.04448

  22. [22]

    Jiacheng Liu, Xiaohan Zhao, Xinyi Shang, and Zhiqiang Shen. 2026. Dive into Claude Code: The Design Space of Today’s and Future AI Agent Systems. arXiv:2604.14228 [cs.SE] https://arxiv.org/abs/2604.14228

  23. [23]

    Pan, Alexander Du, Kurt Keutzer, Alvin Cheung, Alexandros G

    Shu Liu, Shubham Agarwal, Monishwaran Maheswaran, Mert Cemri, Zhifei Li, Qiuyang Mang, Ashwin Naren, Ethan Boneh, Audrey Cheng, Melissa Z. Pan, Alexander Du, Kurt Keutzer, Alvin Cheung, Alexandros G. Dimakis, Koushik Sen, Matei Zaharia, and Ion Stoica. 2026. EvoX: Meta-Evolution for Automated Discovery. arXiv:2602.23413 [cs.LG] https://arxiv.org/abs/2602.23413

  24. [24]

    Gonzalez, and Aditya G

    Shu Liu, Soujanya Ponnapalli, Shreya Shankar, Sepanta Zeighami, Alan Zhu, Shubham Agarwal, Ruiqi Chen, Samion Suwito, Shuo Yuan, Ion Stoica, Matei Zaharia, Alvin Cheung, Natacha Crooks, Joseph E. Gonzalez, and Aditya G. Parameswaran. 2025. Supporting Our AI Overlords: Redesigning Data Systems to be Agent-First. arXiv:2509.00997 [cs.AI] https://arxiv.org/a...

  25. [25]

    Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuxin Zhang, Ju Fan, Guoliang Li, Nan Tang, and Yuyu Luo. 2025. A Survey of Text-to-SQL in the Era of LLMs: Where are we, and where are we going? arXiv:2408.05109 [cs.DB] https://arxiv.org/abs/2408.05109

  26. [26]

    Alexander Novikov, Ngân V ˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. 2025. AlphaEvolve: A coding agent for scientific an...

  27. [27]

    Weiming Sheng, Jinlang Wang, Manuel Barros, Aldrin Montana, Jacopo Tagli- abue, and Luca Bigon. 2026. Building a Correct-by-Design Lakehouse. Data Contracts, Versioning, and Transactional Pipelines for Humans and Agents. arXiv:2602.02335 [cs.DC] https://arxiv.org/abs/2602.02335

  28. [28]

    Jacopo Tagliabue, Tyler Caraza-Harter, and Ciro Greco. 2024. Bauplan: Zero-copy, Scale-up FaaS for Data Pipelines. InProceedings of the 10th International Workshop on Serverless Computing(Hong Kong, Hong Kong)(WoSC10 ’24). Association for Computing Machinery, New York, NY, USA, 31–36. https://doi.org/10.1145/ 3702634.3702955

  29. [29]

    Jacopo Tagliabue, Ryan Curtin, and Ciro Greco. 2024. FaaS and Furious: abstrac- tions and differential caching for efficient data pre-processing . In2024 IEEE Inter- national Conference on Big Data (BigData). IEEE Computer Society, Los Alamitos, CA, USA, 3562–3567. https://doi.org/10.1109/BigData62323.2024.10825377

  30. [30]

    Jacopo Tagliabue and Ciro Greco. 2024. Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie. InProceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning(Santiago, AA, Chile)(DEEM ’24). Association for Computing Machinery, New York, NY, USA, 67–71. https://doi.org/10.1145/3650203.3663335

  31. [31]

    Yash Vishe, Rohan Surana, Xunyi Jiang, Zihan Huang, Xintong Li, Nikki Li- jing Kuang, Tong Yu, Ryan A Rossi, Jingbo Shang, Julian McAuley, et al. 2026. Skill-R1: Agent Skill Evolution via Reinforcement Learning.arXiv preprint arXiv:2605.09359(2026)

  32. [32]

    On the calibration of large language models and alignment

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-Instruct: Aligning Language Mod- els with Self-Generated Instructions. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics. https://doi.org/10.18653/v1/2023. acl-long.754

  33. [33]

    Bowman, He He, and Shi Feng

    Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, and Shi Feng. 2024. Language Models Learn to Mislead Humans via RLHF. arXiv:2409.12822 [cs.CL] https://arxiv.org/abs/ 2409.12822

  34. [34]

    Renjun Xu and Yang Yan. 2026. Agent Skills for Large Language Models: Archi- tecture, Acquisition, Security, and the Path Forward. arXiv:2602.12430 [cs.MA] https://arxiv.org/abs/2602.12430

  35. [35]

    Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 2025. 𝜏- bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. InThe Thirteenth International Conference on Learning Representations

  36. [36]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629 [cs.CL] https://arxiv.org/abs/2210.03629

  37. [37]

    Zaharia, Ali Ghodsi, Reynold Xin, and Michael Armbrust

    Matei A. Zaharia, Ali Ghodsi, Reynold Xin, and Michael Armbrust. 2021. Lake- house: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. InConference on Innovative Data Systems Research

  38. [38]

    Daochen Zha, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, Zhimeng Jiang, Shaochen Zhong, and Xia Hu. 2023. Data-centric Artificial Intelligence: A Survey. arXiv:2303.10158 [cs.LG] https://arxiv.org/abs/2303.10158

  39. [39]

    task_name

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT- Bench and Chatbot Arena. InAdvances in Neural Information Processing Systems. A EXAMPLES OF TASKS We reproduce here some task entries from ...