TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data
Pith reviewed 2026-05-09 20:58 UTC · model grok-4.3
The pith
An agentic system with twelve domain tools turns mixed drilling reports and measurements into evidence-based answers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TADI formalizes agent behavior as sequential tool selection over a dual-store architecture and shows that this produces grounded analytical intelligence from heterogeneous wellsite data, with the Evidence Grounding Score serving as a compliance check based on measurements, attributed quotations, and required answer sections.
What carries the argument
Twelve domain-specialized tools orchestrated by iterative LLM function calling across a DuckDB structured store and a ChromaDB semantic store.
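As described, the orchestration is an iterative function-calling loop: the model picks a tool, observes its result, and repeats until it judges the evidence sufficient. A minimal sketch of that loop, with placeholder tools standing in for the DuckDB and ChromaDB stores (the tool names and the `select_tool` helper are illustrative stand-ins, not the paper's actual API):

```python
# Minimal sketch of iterative tool calling over a dual store.
# Tool names and select_tool() are illustrative placeholders.

def run_agent(question, tools, select_tool, max_steps=8):
    """Repeatedly pick a tool, call it, and accumulate evidence."""
    evidence = []
    for _ in range(max_steps):
        choice = select_tool(question, evidence)  # in TADI, an LLM function call
        if choice is None:  # model decides it has enough evidence
            break
        name, args = choice
        result = tools[name](**args)
        evidence.append((name, args, result))
    return evidence

# Toy stand-ins for the two stores:
tools = {
    "query_structured": lambda sql: f"rows for: {sql}",      # DuckDB-style table query
    "search_reports": lambda text: f"passages for: {text}",  # ChromaDB-style semantic search
}

def select_tool(question, evidence):
    # Deterministic stand-in for the LLM: one structured query, then one search.
    if not evidence:
        return ("query_structured", {"sql": "SELECT depth FROM ddr"})
    if len(evidence) == 1:
        return ("search_reports", {"text": question})
    return None

trace = run_agent("What caused the stuck pipe on well F-15?", tools, select_tool)
```

The point of the sketch is the control flow, not the tools: grounding comes from forcing every answer to be assembled from the accumulated `trace` rather than from the model's parametric memory.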
If this is right
- The system parses every daily drilling report XML file without errors and reconciles three incompatible well naming conventions automatically.
- It is backed by 95 automated tests and a 130-question stress taxonomy spanning six operational categories.
- Analytical quality stems primarily from the design of the twelve domain tools rather than from increasing the size of the underlying language model.
- The full implementation is reproducible from the public Volve dataset plus an API key.
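Reconciling three naming conventions presumably reduces to mapping each variant onto one canonical key before joining across stores. A minimal sketch, assuming Volve-style variants such as "NO 15/9-F-15", "15/9-F-15", and "F-15" (these specific conventions are guesses for illustration, not the paper's actual rules):

```python
import re

# Illustrative reconciliation of well-name variants to one canonical key.
# The three conventions shown are assumed Volve-style forms, not the
# paper's documented rules.

def canonical_well(name: str) -> str:
    """Map variants like 'NO 15/9-F-15', '15/9-F-15', or 'f-15' to 'F-15'."""
    name = name.strip().upper()
    name = re.sub(r"^NO\s+", "", name)  # drop country prefix
    name = re.sub(r"^15/9-", "", name)  # drop quadrant/block prefix
    return name

for variant in ["NO 15/9-F-15", "15/9-F-15", "f-15"]:
    assert canonical_well(variant) == "F-15"
```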
Where Pith is reading between the lines
- Similar tool-augmented setups could be adapted to other technical domains that combine numeric logs with free-text reports.
- Real-time extensions might allow the same orchestration to run against live streaming data feeds during active drilling.
- Explicit comparison runs against larger models on the same question set would provide a quantitative test of whether tool specialization truly dominates scale.
Load-bearing premise
The language model will consistently pick and chain the right tools for multi-step questions without adding ungrounded claims.
What would settle it
A new set of drilling queries where the system frequently selects wrong tools or produces answers lacking required measurements and report quotes would show the approach does not hold.
original abstract
We present TADI (Tool-Augmented Drilling Intelligence), an agentic AI system that transforms drilling operational data into evidence-based analytical intelligence. Applied to the Equinor Volve Field dataset, TADI integrates 1,759 daily drilling reports, selected WITSML real-time objects, 15,634 production records, formation tops, and perforations into a dual-store architecture: DuckDB for structured queries over 12 tables with 65,447 rows, and ChromaDB for semantic search over 36,709 embedded documents. Twelve domain-specialized tools, orchestrated by a large language model via iterative function calling, support multi-step evidence gathering that cross-references structured drilling measurements with daily report narratives. The system parses all 1,759 DDR XML files with zero errors, handles three incompatible well naming conventions, and is backed by 95 automated tests plus a 130-question stress-question taxonomy spanning six operational categories. We formalize the agent's behavior as a sequential tool-selection problem and propose the Evidence Grounding Score (EGS) as a simple grounding-compliance proxy based on measurements, attributed DDR quotations, and required answer sections. The complete 6,084-line, framework-free implementation is reproducible given the public Volve download and an API key, and the case studies and qualitative ablation analysis suggest that domain-specialized tool design, rather than model scale alone, is the primary driver of analytical quality in technical operations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents TADI, an agentic LLM-orchestrated system that integrates 1,759 daily drilling reports (DDRs), WITSML objects, production records, and other Volve Field data into a dual-store architecture (DuckDB for 12 structured tables and ChromaDB for semantic search). Twelve domain-specialized tools are called iteratively by the LLM to support multi-step evidence gathering that cross-references measurements with narrative text. The work reports zero XML parsing errors, a fully reproducible 6,084-line implementation, 95 automated tests, a 130-question stress taxonomy across six operational categories, and the Evidence Grounding Score (EGS) as a grounding-compliance proxy. Based on case studies and qualitative ablation analysis, the authors suggest that domain-specialized tool design, rather than model scale, is the primary driver of analytical quality.
Significance. If the qualitative findings hold, the manuscript offers a concrete, reproducible demonstration of agentic LLM systems applied to heterogeneous technical data in drilling operations. Strengths include explicit handling of naming/format incompatibilities, public code, and the introduction of EGS as a simple proxy metric. This could inform tool-augmented agent design in other engineering domains where structured and unstructured data must be combined without parameter fitting.
major comments (2)
- [Case Studies and Qualitative Ablation Analysis] The central suggestion that domain-specialized tool design is the primary driver rests on case studies and qualitative ablation, yet the manuscript reports no numerical EGS values, tool-selection success rates, or ablation deltas across the 130-question taxonomy. Without these, the magnitude and robustness of the claimed effect cannot be assessed from the provided evidence.
- [Tool Orchestration and Stress-Question Taxonomy] The system description assumes the LLM will consistently select and chain the twelve tools correctly without ungrounded content, but no quantitative evaluation of tool-calling accuracy or failure modes (e.g., over the stress-question taxonomy) is supplied. This leaves the weakest assumption untested in the reported results.
minor comments (2)
- [Abstract and Methods] The abstract states that the system 'parses all 1,759 DDR XML files with zero errors' and is 'backed by 95 automated tests,' but neither the methods nor results sections detail the coverage of those tests or the specific failure modes they address.
- [Methods] The definition of the Evidence Grounding Score (EGS) is described only at a high level as a 'simple grounding-compliance proxy based on measurements, attributed DDR quotations, and required answer sections.' A precise formula or pseudocode would improve reproducibility.
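One plausible reading of EGS, consistent with the abstract's description but with the weights, regex patterns, and section names invented here for illustration (none come from the paper), is an equally weighted fraction of passed grounding checks:

```python
import re

# Hypothetical sketch of an Evidence Grounding Score: the fraction of
# grounding checks an answer passes. Patterns and section names are
# assumptions, not the paper's definition.

REQUIRED_SECTIONS = ("Answer", "Evidence", "Sources")  # assumed names

def evidence_grounding_score(answer: str) -> float:
    checks = [
        bool(re.search(r"\d+(\.\d+)?\s*(m|ft|bar|psi)", answer)),  # has a measurement
        bool(re.search(r'"[^"]+"\s*\(DDR', answer)),               # attributed DDR quote
        all(s in answer for s in REQUIRED_SECTIONS),               # required sections
    ]
    return sum(checks) / len(checks)

sample = (
    "Answer: Losses began at 2345 m.\n"
    'Evidence: "mud losses observed" (DDR 2008-06-14).\n'
    "Sources: DDR archive."
)
```

Under this reading, `sample` scores 1.0 and an answer with no measurement, quote, or sections scores 0.0; a published formula or pseudocode would pin down which variant the authors actually compute.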
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment below and will strengthen the manuscript with additional quantitative evaluations as outlined.
point-by-point responses
Referee: [Case Studies and Qualitative Ablation Analysis] The central suggestion that domain-specialized tool design is the primary driver rests on case studies and qualitative ablation, yet the manuscript reports no numerical EGS values, tool-selection success rates, or ablation deltas across the 130-question taxonomy. Without these, the magnitude and robustness of the claimed effect cannot be assessed from the provided evidence.
Authors: We agree that the absence of numerical EGS values, tool-selection success rates, and ablation deltas limits the ability to quantify the effect size. The current manuscript presents only qualitative ablation and case studies to support the suggestion that tool design is the primary driver. In the revised version, we will compute and report EGS scores across the full 130-question taxonomy, include tool-selection success rates, and provide ablation deltas (e.g., performance with vs. without specific tools) to allow readers to assess the magnitude and robustness of the findings. revision: yes
Referee: [Tool Orchestration and Stress-Question Taxonomy] The system description assumes the LLM will consistently select and chain the twelve tools correctly without ungrounded content, but no quantitative evaluation of tool-calling accuracy or failure modes (e.g., over the stress-question taxonomy) is supplied. This leaves the weakest assumption untested in the reported results.
Authors: The manuscript does not currently include quantitative metrics on tool-calling accuracy or failure modes over the stress-question taxonomy. We acknowledge this leaves an important assumption untested in the reported results. In the revision, we will add a quantitative evaluation of tool-selection accuracy, including success rates and categorized failure modes (e.g., incorrect tool choice, chaining errors, or ungrounded outputs) evaluated against the 130-question taxonomy. revision: yes
Circularity Check
No significant circularity identified
full rationale
The paper describes a fully specified, reproducible implementation (6,084-line codebase, zero-error parsing of 1,759 public Volve DDR files, 95 tests, dual-store architecture with explicit incompatibility handling) whose central suggestion—that domain-specialized tool design drives quality—is drawn from case studies and qualitative ablation rather than any fitted parameter, self-defined metric, or load-bearing self-citation. The proposed EGS is introduced as an observable proxy based on measurements, quotations, and required sections, not derived from the system's own outputs or prior author results. No equations, uniqueness theorems, or ansatzes reduce the claims to their inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models can reliably perform iterative tool selection and function calling for multi-step evidence gathering in technical domains.
invented entities (1)
- Evidence Grounding Score (EGS): no independent evidence
Reference graph
Works this paper leans on
- [1] Equinor. Volve field data set. Equinor Open Data, 2018. Multi-terabyte dataset from the Volve field, Norwegian North Sea, comprising approximately 40,000 files.
- [2] Eugenio Ferrigno, M. Rodriguez, and E. Davidsson. Revolutionizing drilling operations: Next-gen LLM-AI for real-time support in well construction control rooms. In SPE Annual Technical Conference and Exhibition, New Orleans, Louisiana, USA, 2024. Society of Petroleum Engineers. SPE-220798-MS.
- [3] Prateek Kumar and Sanjay Kathuria. Large language models (LLMs) for natural language processing (NLP) of oil and gas drilling data. In SPE Annual Technical Conference and Exhibition, San Antonio, Texas, USA, 2023. Society of Petroleum Engineers.
- [4] G. Bhatia, A. Yadav, D. Nanda, D. Goyal, S. Perumalla, A. Shinde, B. C. Jha, and D. Upreti. Digitization of daily drilling reports using LLMs. In SPE Middle East Oil, Gas and Geosciences Show, Manama, Bahrain, 2025. Society of Petroleum Engineers. SPE-227059-MS.
- [5] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.
- [6] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [7] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive APIs. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
- [8] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [9] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789, 2023.
- [10] Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-Bank: A comprehensive benchmark for tool-augmented LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3102–3116, Singapore, 2023. Association for Computational Linguistics.
- [11] Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, and Yang Liu. StableToolBench: Towards stable large-scale benchmarking on tool learning of large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 11143–11156, Bangkok, Thailand, 2024. Association for Computational Linguistics.
- [13] Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, et al. Tool learning with foundation models. arXiv preprint arXiv:2304.08354, 2023.
- [14] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024.
- [15] Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O'Sullivan, and Hoang D. Nguyen. Multi-agent collaboration mechanisms: A survey of LLMs. arXiv preprint arXiv:2501.06322, 2025.
- [16] Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680, 2024.
- [17] Maria Antoniak, Jeff Dalgliesh, Marc Verkruyse, and Jonathan Lo. Natural language processing techniques on oil and gas drilling data. In SPE Intelligent Energy International Conference and Exhibition, Aberdeen, Scotland, UK, 2016. Society of Petroleum Engineers. SPE-181015-MS.
- [18] Júlio Hoffimann, Youli Mao, Avinash Wesley, and Aimee Taylor. Sequence mining and pattern analysis in drilling reports with deep natural language processing. In SPE Annual Technical Conference and Exhibition, Dallas, Texas, USA, 2018. Society of Petroleum Engineers. SPE-191505-MS.
- [19] Michael Yi, Kamil Ceglinski, Pradeepkumar Ashok, Michael Behounek, Spencer White, Trey Peroyea, and Taylor Thetford. Applications of large language models in well construction planning and real-time operation. In IADC/SPE International Drilling Conference and Exhibition, Galveston, Texas, USA, 2024. Society of Petroleum Engineers. IADC/SPE-217700-MS.
- [20] Felix J. Pacis, Sergey Alyaev, Gilles Pelfrene, and Tomasz Wiktorski. Enhancing information retrieval in the drilling domain: Zero-shot learning with large language models for question-answering. In IADC/SPE International Drilling Conference and Exhibition, Galveston, Texas, USA, 2024. Society of Petroleum Engineers. IADC/SPE-217671-MS.
- [21] Liang Zhang, Felix James Pacis, Sergey Alyaev, and Tomasz Wiktorski. Cloud-free question answering from internal knowledge bases: Building an AI for drilling applications. First Break, 43(2):43–49, 2025.
- [22] Oluwatosin Ogundare, Srinath Madasu, and Nathanial Wiggins. Industrial engineering with large language models: A case study of ChatGPT's performance on oil & gas problems. arXiv preprint arXiv:2304.14354, 2023.
- [23] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 9459–9474, 2020.
- [24] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023.
- [25] Shailja Gupta, Rajesh Ranjan, and Surya Narayan Singh. A comprehensive survey of retrieval-augmented generation (RAG): Evolution, current landscape and future directions. arXiv preprint arXiv:2410.12837, 2024.
- [26] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511, 2023.
- [27] Bhaskarjit Sarmah, Benika Hall, Rohan Rao, Sunil Patel, Stefano Pasquali, and Dhagash Mehta. HybridRAG: Integrating knowledge graphs and vector retrieval augmented generation for efficient information extraction. arXiv preprint arXiv:2408.04948, 2024.
- [28] Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 2014–2037, Dubrovnik, Croatia, 2023.
- [29] Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuxin Zhang, Ju Fan, Guoliang Li, Nan Tang, and Yuyu Luo. A survey of text-to-SQL in the era of LLMs: Where are we, and where are we going? arXiv preprint arXiv:2408.05109, 2024.
- [30] Andrzej T. Tunkiel, Tomasz Wiktorski, and Dan Sui. Drilling dataset exploration, processing and interpretation using Volve field data. In Proceedings of the ASME 2020 39th International Conference on Ocean, Offshore and Arctic Engineering (OMAE), volume 11, page V011T11A076, Virtual, Online, 2020. ASME.
- [31] Nikolay O. Nikitin, Ilia Revin, Alexander Hvatov, Pavel Vychuzhanin, and Anna V. Kalyuzhnaya. Hybrid and automated machine learning approaches for oil fields development: The case study of Volve field, North Sea. Computers & Geosciences, 161:105061, 2022.
- [32] Cuthbert Shang Wui Ng, Ashkan Jahanbani Ghahfarokhi, and Menad Nait Amar. Well production forecast in Volve field: Application of rigorous machine learning techniques and metaheuristic algorithm. Journal of Petroleum Science and Engineering, 208:109468, 2022.
- [33] Sankhajit Saha, Vikram Vishal, Bankim Mahanta, and Sarada Prasad Pradhan. Geomechanical model construction to resolve field stress profile and reservoir rock properties of Jurassic Hugin Formation, Volve field, North Sea. Geomechanics and Geophysics for Geo-Energy and Geo-Resources, 8(2):59, 2022.
- [34] Olalere Oloruntobi et al. Petrophysical property prediction from seismic inversion attributes using rock physics and machine learning: Volve field, North Sea. Applied Sciences, 14(4):1345, 2024.
- [35] Energistics. WITSML data standards. Energistics Consortium, 2011. Version 1.4.1.1. Wellsite Information Transfer Standard Markup Language.
- [36] Suranga C. H. Geekiyanage, Andrzej T. Tunkiel, and Dan Sui. Drilling data quality improvement and information extraction with case studies. Journal of Petroleum Exploration and Production Technology, 11:819–837, 2021.
- [37] Mark Raasveldt and Hannes Mühleisen. DuckDB: An embeddable analytical database. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD), pages 1981–1984, Amsterdam, Netherlands, 2019.
- [38] Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, HyoJung Han, Sevien Schulhoff, et al. The prompt report: A systematic survey of prompting techniques. arXiv preprint arXiv:2406.06608, 2024.
- [39] Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv preprint arXiv:2402.07927, 2024.
- [40] Saibo Geng et al. JSONSchemaBench: A rigorous benchmark of structured outputs for language models. arXiv preprint arXiv:2501.10868, 2025.
- [41] Botao Lin, Yan Jin, Qianwen Cao, Han Meng, Huiwen Pang, and Shiming Wei. Developing a large language model for oil- and gas-related rock mechanics: Progress and challenges. Natural Gas Industry B, 12(2):110–122, 2025.
- [42] Edwin Benito Mitacc Meza, Dalton Garcia Borges de Souza, Alessandro Copetti, Ana Paula Barbosa Sobral, Guido Vaz Silva, Iara Tammela, and Rodolfo Cardoso. Tools, technologies and frameworks for digital twins in the oil and gas industry: An in-depth analysis. Sensors, 24(19):6457, 2024.