arxiv: 2604.06474 · v1 · submitted 2026-04-07 · 💻 cs.CL

DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling

Pith reviewed 2026-05-10 18:48 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM agents · deep research · exploratory data analysis · data storytelling · structured databases · agentic AI · InsightBench · ACLED dataset

The pith

DataSTORM reframes deep research on structured databases as an autonomous thesis-driven process using exploratory data analysis and storytelling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DataSTORM, an LLM agent that conducts deep research on large-scale structured databases and web sources. It treats analysis as discovering candidate theses from data, validating them through iterative cross-source checks, and turning them into coherent narratives. The method draws on exploratory data analysis and data storytelling to meet the demands of quantitative reasoning over structured schemas. On InsightBench it sets a new state of the art, and on a new ACLED-based dataset it outperforms proprietary systems such as ChatGPT Deep Research in both automated metrics and human evaluations.

Core claim

DataSTORM is an LLM-based agentic system that autonomously performs deep research across large-scale structured databases and internet sources by discovering candidate theses from data, validating them iteratively, and developing them into analytical narratives grounded in exploratory data analysis and data storytelling principles.

What carries the argument

The thesis-driven analytical process that discovers candidate theses from data, validates them through iterative cross-source investigation, and develops them into coherent narratives.
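
The discover, validate, develop cycle described above can be sketched as a small control loop. This is a minimal illustration only, not the authors' implementation: the `Thesis` class, `research_loop`, and the three callbacks are hypothetical names, and the stub functions stand in for the LLM calls, SQL queries, and web lookups a real system would make.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Thesis:
    """Hypothetical container: a claim plus the findings gathered for it."""
    statement: str
    evidence: List[str] = field(default_factory=list)


def research_loop(
    discover: Callable[[], Thesis],
    investigate: Callable[[Thesis], List[str]],
    refine: Callable[[Thesis, List[str]], Thesis],
    max_rounds: int = 3,
) -> Thesis:
    """Discover a candidate thesis, then validate and refine it iteratively."""
    thesis = discover()
    for _ in range(max_rounds):
        findings = investigate(thesis)
        if not findings:
            break  # no new findings to check the thesis against; stop early
        thesis = refine(thesis, findings)
    return thesis


# Stub callbacks (assumptions for illustration; no database or LLM involved).
def _discover() -> Thesis:
    return Thesis("Event counts rose over the period")


def _investigate(thesis: Thesis) -> List[str]:
    # Yield one new finding per round until two have been collected.
    if len(thesis.evidence) >= 2:
        return []
    return [f"finding {len(thesis.evidence) + 1}"]


def _refine(thesis: Thesis, findings: List[str]) -> Thesis:
    # Keep the statement and attach the new findings as supporting evidence.
    return Thesis(thesis.statement, thesis.evidence + findings)


final = research_loop(_discover, _investigate, _refine)
print(final.statement, final.evidence)
```

Passing the three stages in as callbacks keeps the loop itself independent of any particular model or data source, which is one plausible way to test the control flow in isolation.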

If this is right

  • DataSTORM achieves a 19.4% relative improvement in insight-level recall and 7.2% in summary-level score on InsightBench.
  • It outperforms ChatGPT Deep Research on a new ACLED-based dataset in automated metrics and human evaluations.
  • The system handles both structured databases and unstructured internet sources in a unified way.
  • Effective data research requires iterative hypothesis generation and quantitative reasoning over schemas rather than just retrieval.
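
To make the "relative improvement" figures above concrete, the sketch below shows how a relative gain differs from an absolute one. The baseline and new recall values are hypothetical numbers chosen only so that the arithmetic yields 19.4%; they are not taken from the paper, and `relative_improvement` is an illustrative helper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return (new - baseline) / baseline, the gain relative to the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


# Hypothetical scores, NOT reported in the paper.
baseline_recall = 0.50
new_recall = 0.597

absolute_gain = new_recall - baseline_recall  # gain in raw recall
relative_gain = relative_improvement(baseline_recall, new_recall)

print(f"absolute gain: {absolute_gain:.3f}")
print(f"relative gain: {relative_gain:.1%}")
```

Under these assumed values, a 19.4% relative gain corresponds to fewer than ten points of absolute recall, which is why the baseline score matters when reading a relative figure.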

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If LLM agents can execute this process reliably, they could automate much of the initial exploratory phase in data analysis projects.
  • This approach might extend to scientific databases where hypothesis testing from large datasets is key.
  • Integration with more advanced quantitative tools could further strengthen the validation step.
  • Human oversight might still be needed for final narrative refinement in high-stakes domains.

Load-bearing premise

LLM agents can perform reliable iterative hypothesis generation, quantitative reasoning, and narrative convergence on structured data without substantial human guidance.

What would settle it

Running DataSTORM autonomously on the ACLED dataset or a similar complex database and finding that it produces incoherent narratives or quantitative insights that contradict expert analysis.

Figures

Figures reproduced from arXiv: 2604.06474 by Camila Nicollier Sanchez, David Fernando Castro Pena, Monica S. Lam, Sajid Farook, Shicheng Liu, Yucheng Jiang.

Figure 1: Overview of the DataSTORM system. A complete research workflow consists of three ...
Figure 2: Overview of the Final Report Generation Module
Figure 3: Pairwise preference rates in human evaluation. For each comparison, the bar shows the ...
Figure 4: Consent page shown to participants before the human evaluation. Identifying information ...
Figure 5: Screenshot of the custom web interface used for human evaluation. Participants reviewed ...
Original abstract

Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi-step information discovery, synthesis, and analysis. However, existing approaches primarily focus on unstructured web data, while the challenges of conducting deep research over large-scale structured databases remain relatively underexplored. Unlike web-based research, effective data-centric research requires more than retrieval and summarization and demands iterative hypothesis generation, quantitative reasoning over structured schemas, and convergence toward a coherent analytical narrative. In this paper, we present DataSTORM, an LLM-based agentic system capable of autonomously conducting research across both large-scale structured databases and internet sources. Grounded in principles from Exploratory Data Analysis and Data Storytelling, DataSTORM reframes deep research over structured data as a thesis-driven analytical process: discovering candidate theses from data, validating them through iterative cross-source investigation, and developing them into coherent analytical narratives. We evaluate DataSTORM on InsightBench, where it achieves a new state-of-the-art result with a 19.4% relative improvement in insight-level recall and 7.2% in summary-level score. We further introduce a new dataset built on ACLED, a real-world complex database, and demonstrate that DataSTORM outperforms proprietary systems such as ChatGPT Deep Research across both automated metrics and human evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DataSTORM, an LLM-based agentic system for autonomous deep research over large-scale structured databases and web sources. Grounded in Exploratory Data Analysis and Data Storytelling, it reframes the task as a thesis-driven process of discovering candidate theses from data, validating them via iterative cross-source investigation, and synthesizing them into coherent analytical narratives. The central empirical claims are a new state-of-the-art on InsightBench (19.4% relative gain in insight-level recall, 7.2% in summary-level score) and outperformance versus ChatGPT Deep Research on a newly introduced ACLED-derived dataset, measured by both automated metrics and human evaluations.

Significance. If the results hold under rigorous scrutiny, the work would be a meaningful contribution by extending LLM agents beyond unstructured web retrieval into quantitative reasoning over structured schemas, an underexplored area. The release of a new ACLED-based dataset is a concrete positive that could support future benchmarking. The framing around EDA and data storytelling provides a principled conceptual anchor, though its translation into reliable agent behavior remains to be demonstrated.

major comments (2)
  1. [Abstract / Evaluation] Abstract and Evaluation section: The SOTA claim of a 19.4% relative improvement in insight-level recall on InsightBench is presented without any description of the experimental setup, baselines, number of trials, statistical significance testing, or controls for prompt sensitivity and model version. This information is load-bearing for the central performance claim and must be supplied before the result can be assessed.
  2. [Method / Evaluation] The manuscript's core assumption—that the agentic loop (hypothesis generation, quantitative schema traversal, cross-source validation, and narrative convergence) operates reliably without substantial human guidance or post-hoc tuning—is not supported by any ablation, failure-mode analysis, or quantitative checks on aggregation accuracy and statistical validity across iterations. This directly affects the validity of both the InsightBench and ACLED results.
minor comments (2)
  1. [Abstract] The abstract would be clearer if it briefly stated the underlying LLM(s) and any external tools or verification mechanisms used by the agent.
  2. [Dataset section] Key characteristics of the new ACLED dataset (size, schema complexity, query types) should be introduced earlier to help readers contextualize the human-evaluation results.
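
As an illustration of the statistical significance testing requested in major comment 1, one common option is a paired bootstrap over per-task score differences. The sketch below is an assumption about how such a check could look, not a description of the paper's evaluation: the function name, the per-task scores, and the choice of a 95% percentile interval are all hypothetical.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    system: Sequence[float],
    baseline: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean paired score difference."""
    if len(system) != len(baseline) or not system:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = [s - b for s, b in zip(system, baseline)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample tasks with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-task scores, NOT results from the paper.
system_scores = [0.60, 0.70, 0.50, 0.80, 0.65, 0.75]
baseline_scores = [0.50, 0.60, 0.45, 0.70, 0.60, 0.65]

lower, upper = paired_bootstrap_ci(system_scores, baseline_scores)
print(f"95% interval for mean difference: [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero would suggest the gain is not explained by which tasks happened to be sampled. It does not address run-to-run variance from prompt sensitivity or model version, which the referee raises separately.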

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and outline the revisions we plan to make.

  1. Referee: [Abstract / Evaluation] Abstract and Evaluation section: The SOTA claim of a 19.4% relative improvement in insight-level recall on InsightBench is presented without any description of the experimental setup, baselines, number of trials, statistical significance testing, or controls for prompt sensitivity and model version. This information is load-bearing for the central performance claim and must be supplied before the result can be assessed.

    Authors: We agree that the abstract, being a high-level summary, does not include these details, and the Evaluation section would benefit from greater elaboration. In the revised version, we will expand the Evaluation section to provide a full description of the experimental setup, including the specific baselines compared against, the number of trials or runs performed, results of statistical significance testing, and measures taken to control for prompt sensitivity and model version variations. This will allow readers to better assess the robustness of the reported SOTA results. revision: yes

  2. Referee: [Method / Evaluation] The manuscript's core assumption—that the agentic loop (hypothesis generation, quantitative schema traversal, cross-source validation, and narrative convergence) operates reliably without substantial human guidance or post-hoc tuning—is not supported by any ablation, failure-mode analysis, or quantitative checks on aggregation accuracy and statistical validity across iterations. This directly affects the validity of both the InsightBench and ACLED results.

    Authors: This is a valid concern. The current manuscript emphasizes the end-to-end performance on the benchmarks but does not include dedicated ablations or failure analyses. We will add a new subsection in the Evaluation or Method section that includes ablations on the key components of the agentic loop (e.g., impact of hypothesis generation and cross-source validation), a discussion of observed failure modes with examples, and quantitative metrics on the accuracy of data aggregation and statistical validity checks across iterations. This will strengthen the evidence for the reliability of the system. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents DataSTORM as an LLM agent system for database research, grounded in established EDA and data storytelling principles. Its central claims consist of empirical SOTA results on InsightBench (a 19.4% relative gain in insight-level recall) and outperformance on a new ACLED dataset versus baselines including ChatGPT Deep Research, supported by automated metrics and human evaluations. No equations, fitted parameters, derivations, or predictions appear in the text. No self-citations are invoked as load-bearing justifications for the method or results. The evaluation relies on external benchmarks and independent human assessment rather than any self-referential reduction of outputs to inputs. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted from the given text.

pith-pipeline@v0.9.0 · 5554 in / 1147 out tokens · 34650 ms · 2026-05-10T18:48:20.587353+00:00 · methodology

