Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

Batu EI; James Zou; Kevin Qinghong Lin; Pan Lu; Philip Torr; Yuhong Shi

arxiv: 2606.11176 · v1 · pith:ONU4LWHFnew · submitted 2026-06-09 · 💻 cs.CV · cs.CL · cs.CY · cs.HC

Kevin Qinghong Lin, Batu EI, Yuhong Shi, Pan Lu, Philip Torr, James Zou

Pith reviewed 2026-06-27 13:41 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.CY · cs.HC

keywords data journalism · multi-agent systems · verifiable generation · multimodal stories · evidence grounding · AI agents · news automation · transparency

The pith

A multi-agent framework produces evidence-grounded multimodal data stories that match expert work in verifiability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Data2Story as a system that assembles specialized agents into one workflow to handle the full data journalism process from raw data to finished articles. Every claim, number, and visual element is tied back to its source by an Inspector component, while the system decides when to generate interactive maps, audio, or other non-text formats. Tests on 18 real published articles compare the outputs to the original human pieces using angle coverage checks, ratings from 53 readers, computer-use judges, and automated code verification. The results show the agent matches or exceeds human performance on transparency and auditability, though human work still leads in creative angle and design. This setup would let reporting teams produce auditable drafts more quickly while keeping every assertion traceable.

Core claim

Data2Story coordinates multiple agents to act as a complete virtual newsroom that converts raw data into published multimedia stories. An Inspector links every number, angle, and asset directly to the underlying data, code, or an external reference, and the system generates interactive or audio elements when it judges that readers would benefit from them. On 18 paired articles, the outputs are competitive with the expert human versions across four axes: human-agent angle coverage, a 53-participant rubric, computer-use judges, and a coding verifier that re-executes statements and checks references.

What carries the argument

The Inspector agent, which enforces evidence-grounding by linking every element back to data or references, together with multimodal tool selection that chooses interactive maps, audio, or other formats based on content.
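
The evidence-linking idea can be pictured with a minimal sketch. It is not the paper's implementation: the `Evidence` and `Claim` classes, the `ungrounded` helper, the locator strings, and the two example claims are all hypothetical, chosen only to show what "every element links back to data, code, or a reference" could mean as a check.

```python
from dataclasses import dataclass, field

# The three source types named in the text; the string labels are an assumption.
ALLOWED_KINDS = {"data", "code", "reference"}


@dataclass
class Evidence:
    kind: str     # one of ALLOWED_KINDS
    locator: str  # hypothetical pointer: a file path, function name, or URL


@dataclass
class Claim:
    text: str
    evidence: list[Evidence] = field(default_factory=list)


def ungrounded(claims: list[Claim]) -> list[Claim]:
    """Return the claims that carry no evidence link of an allowed kind."""
    return [
        c for c in claims
        if not any(e.kind in ALLOWED_KINDS for e in c.evidence)
    ]


# Invented example claims, for illustration only.
claims = [
    Claim("Rents rose 12% in 2023.", [Evidence("code", "analysis.py#rent_change")]),
    Claim("This is the steepest rise on record."),  # nothing attached
]

flagged = ungrounded(claims)
print([c.text for c in flagged])
```

In this toy version a claim passes as soon as one link exists. Whether the link actually supports the claim is a separate question that a verifier would have to answer.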

If this is right

Every generated story can be audited by re-running the linked code and checking external references.
Multimodal elements such as interactive maps are added only when the agent determines they aid reader understanding.
The system serves as a starting draft that journalists can edit while retaining the built-in traceability.
Verifiability scores rise because the Inspector prevents unsupported claims from reaching the final article.
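
Auditing by re-running linked code might look roughly like the sketch below. It assumes each numeric claim is stored with a function that recomputes it from the source rows. The function names, the toy rent data, and the 1% relative tolerance are illustrative assumptions, not details taken from the paper.

```python
import math


def verify_numeric_claim(stated, recompute, rows, rel_tol=0.01):
    """Re-run `recompute` on the source rows and compare with the stated number.

    `rel_tol` is an assumed tolerance for rounding in the published text.
    """
    actual = recompute(rows)
    return math.isclose(stated, actual, rel_tol=rel_tol)


def pct_change(rows):
    """Percentage change in rent between the first and last row."""
    first, last = rows[0]["rent"], rows[-1]["rent"]
    return (last - first) / first * 100.0


# Toy source data, invented for illustration only.
rows = [{"year": 2022, "rent": 1000.0}, {"year": 2023, "rent": 1120.0}]

ok = verify_numeric_claim(12.0, pct_change, rows)   # matches the recomputed value
bad = verify_numeric_claim(15.0, pct_change, rows)  # overstated figure
print(ok, bad)
```

A tolerance is needed because published figures are usually rounded. Choosing it too loosely would let overstated numbers pass.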

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same evidence-linking structure could apply to other domains that require traceable outputs, such as policy reports or scientific summaries.
If extended to live data streams, the framework might support rapid updates with automatic re-verification of changed numbers.
Combining the Inspector with external fact-checking databases could reduce the remaining gap in editorial angle selection.

Load-bearing premise

The 18 chosen articles and the four evaluation axes capture the full range of data journalism tasks and reader requirements without selection bias or missing metrics that would change the competitiveness result.

What would settle it

Re-running the coding verifier on a new set of 20 articles finds that more than 25 percent of Data2Story claims cannot be re-executed against the source data or matched to references.
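
The arithmetic behind this test is simple enough to sketch. The code assumes verifier outcomes are pooled across articles as one list of booleans (`True` for a claim that re-executed or matched a reference, `False` otherwise). The function names and the sample outcomes are invented for illustration.

```python
def failure_rate(results):
    """Share of claims that could not be re-executed or matched (False = failed)."""
    if not results:
        raise ValueError("no claims to evaluate")
    return results.count(False) / len(results)


def refutes_claim(results, threshold=0.25):
    """True when the failure rate is strictly above the 25 percent threshold."""
    return failure_rate(results) > threshold


# Invented outcomes: 7 failed claims out of 20 checked.
outcomes = [True] * 13 + [False] * 7
print(failure_rate(outcomes), refutes_claim(outcomes))
```

Pooling claims treats a long article with many claims as weightier than a short one. Averaging per-article rates would be an equally defensible reading of the test.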

Original abstract

Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual steps well: data-science agents close the analysis loop, while design agents synthesize beautiful websites. But can an agent serve as a data journalist end to end? We introduce Data Journalist Agent (Data2Story), a multi-agent framework that orchestrates specialized roles into a single virtual newsroom. Data2Story contributes two innovations. (i) Claims are evidence-grounded: an Inspector links every number, angle, and asset back to data, code, or an external reference. (ii) Articles are multimodally generative: rather than defaulting to plain text and static charts, Data2Story reasons about what readers will want to see, then deploys multimodal tools, such as interactive maps for geography and audio for music. We evaluate Data2Story on 18 articles, each paired with the originally published expert piece, along four axes: (a) human-agent angle coverage; (b) rubric evaluation with 53 participants across five dimensions; (c) computer-use agents as judges, a cost-saving proxy for how readers navigate interactive articles; and (d) verifiability, where a coding verifier re-executes statements against the data and checks claims against references. Data2Story produces competitive, evidence-traceable multimedia stories, with particular strength in transparency and auditability. Human articles retain an edge in editorial angle, creative design, and presentation. We position Data2Story as a collaborator for journalists, enabling more evidence-based, transparent, and verifiable reporting. Code and demos are available at https://data2story.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Data2Story gives a concrete multi-agent pipeline with an Inspector for claim grounding and multimodal output, but the 18-article evaluation leaves the competitiveness claim hard to assess.

The main thing to know is that this paper describes a multi-agent system called Data2Story that tries to handle the full data journalism workflow in one framework. It breaks the task into roles and adds an Inspector agent that ties every claim back to data, code, or references, while also reasoning about when to add interactive maps, audio, or other media instead of defaulting to text and charts.

The architecture itself is straightforward and the fact that they released code and demos is helpful for anyone who wants to inspect or extend the role orchestration. The verifiability check, where a separate verifier re-executes statements against the source data, is a practical step that directly targets a common weakness in generated content.

The evaluation is the weaker part. They tested on 18 articles paired with published expert versions and used four axes including human ratings from 53 participants and computer-use agents as judges. No selection protocol or diversity metrics are given for the articles, so it is difficult to tell whether the results would hold on messier or more time-sensitive stories. Human pieces still score higher on editorial angle and presentation, and the proxy judges are an efficiency move but may not capture real reader navigation or engagement. The abstract-level description also leaves open how the coding verifier handles ambiguous references or edge cases.

This is aimed at people working on agent systems for specialized content tasks or on tools that could assist journalists. A reader looking for working examples of grounded multimodal generation would find usable ideas here.

I would send it to peer review. The system is explicit enough and the verifiability focus is worth referee scrutiny even if the current experiments need more detail and breadth.

Referee Report

4 major / 2 minor

Summary. The manuscript introduces Data2Story, a multi-agent framework that orchestrates specialized roles to automate end-to-end data journalism. It contributes an Inspector agent that grounds every claim, angle, and asset to data, code, or external references, plus multimodal generation that reasons about reader needs to produce interactive maps, audio, and other assets rather than static text/charts. The system is evaluated on 18 articles (each paired with the originally published expert piece) along four axes: human-agent angle coverage, a rubric scored by 53 participants across five dimensions, computer-use agents as judges for interactive navigation, and a coding verifier that re-executes statements against the original data and checks references. The authors conclude that Data2Story produces competitive, evidence-traceable multimedia stories with particular strength in transparency and auditability, while human articles retain advantages in editorial angle, creative design, and presentation; they position the system as a collaborator for journalists.

Significance. If the evaluation holds, the work would be significant for showing that agentic systems can handle the full pipeline of data journalism with explicit verifiability mechanisms, moving beyond isolated tools for analysis or design. The open release of code and demos at https://data2story.github.io supports reproducibility and allows direct inspection of the Inspector and multimodal components. This could influence future agent frameworks that require auditability in high-stakes domains like news.

major comments (4)

Evaluation section (abstract and implied §4): the claim that Data2Story produces 'competitive' stories rests on the 18 selected articles, yet the manuscript provides no explicit selection protocol, diversity metrics, or justification that these cases represent the range of data journalism challenges (e.g., time pressure, ethical framing, long-form engagement). This selection bias risk directly affects the generalizability of the competitiveness conclusion.
Evaluation section: no information is given on how the 53 participants were recruited, what statistical tests were applied to the rubric scores, inter-rater reliability, or data exclusion rules. These omissions make it impossible to confirm that the reported human evaluation results support the competitiveness claim.
Verifiability axis (abstract): the coding verifier is presented as a strength for re-executing statements against data and checking references, but the manuscript supplies no description of how it handles edge cases, ambiguous claims, or complex multimodal assets, which is load-bearing for the auditability advantage asserted.
Method section (abstract): the Inspector agent is introduced as a core innovation for evidence-grounding, yet the paper does not detail its implementation, failure modes, or how it avoids missing references, leaving the 'evidence-traceable' claim underspecified.

minor comments (2)

The abstract would be strengthened by including at least one quantitative result (e.g., average rubric scores or verifiability pass rate) rather than the qualitative statement 'produces competitive' stories.
The four evaluation axes are listed but their precise mapping to the five rubric dimensions is not clarified in the provided text.

Simulated Authors' Rebuttal

4 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each of the four major comments below. Where information was missing or underspecified, we agree that revisions are needed and will incorporate the requested details in the next version.

Referee: Evaluation section (abstract and implied §4): the claim that Data2Story produces 'competitive' stories rests on the 18 selected articles, yet the manuscript provides no explicit selection protocol, diversity metrics, or justification that these cases represent the range of data journalism challenges (e.g., time pressure, ethical framing, long-form engagement). This selection bias risk directly affects the generalizability of the competitiveness conclusion.

Authors: We agree that the selection protocol requires explicit description to support generalizability claims. The 18 articles were chosen from publicly available data journalism pieces with accompanying datasets. In the revised manuscript we will add a subsection in §4 specifying the selection criteria (public data availability, topic variety across politics, environment, health, and economics), diversity metrics (e.g., distribution of data types and story lengths), and limitations regarding real-time reporting or ethically complex cases.
revision: yes

Referee: Evaluation section: no information is given on how the 53 participants were recruited, what statistical tests were applied to the rubric scores, inter-rater reliability, or data exclusion rules. These omissions make it impossible to confirm that the reported human evaluation results support the competitiveness claim.

Authors: We acknowledge the omission of these methodological details. The revised manuscript will expand the human evaluation description to include recruitment procedures, the statistical tests applied to rubric scores, inter-rater reliability metrics, and data exclusion rules. These additions will allow readers to evaluate the robustness of the reported results.
revision: yes

Referee: Verifiability axis (abstract): the coding verifier is presented as a strength for re-executing statements against data and checking references, but the manuscript supplies no description of how it handles edge cases, ambiguous claims, or complex multimodal assets, which is load-bearing for the auditability advantage asserted.

Authors: The current manuscript does not detail the verifier's handling of edge cases. We will add a description in the revised Evaluation section covering tolerance thresholds for numerical matches, flagging of ambiguous claims for review, and verification procedures for multimodal assets via associated data sources and code. This will strengthen the auditability claims.
revision: yes

Referee: Method section (abstract): the Inspector agent is introduced as a core innovation for evidence-grounding, yet the paper does not detail its implementation, failure modes, or how it avoids missing references, leaving the 'evidence-traceable' claim underspecified.

Authors: We agree that the Inspector agent's implementation requires more detail. The revised Method section will include its prompting approach, mechanisms for linking claims to sources, documented failure modes such as missed references, and mitigation steps including multi-agent cross-verification. These additions will better substantiate the evidence-grounding contribution.
revision: yes

Circularity Check

0 steps flagged

No circularity in derivation or evaluation chain

The paper's central claims rest on external evaluation: 18 articles are compared to originally published expert pieces, scored by 53 human participants on a rubric, judged by separate computer-use agents, and verified by a coding verifier that re-executes statements against the original data and references. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided abstract or evaluation description. The derivation chain (multi-agent orchestration with Inspector for grounding) is assessed against independent benchmarks rather than quantities defined by the system itself, making the result self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review; no numerical free parameters are mentioned. The framework introduces the Inspector as a new component whose reliability is assumed rather than derived from prior results.

axioms (1)

domain assumption: Specialized agents can be orchestrated into a reliable end-to-end pipeline for complex creative tasks without coordination failures that would break evidence links.
Invoked by the claim that the multi-agent framework functions as a single virtual newsroom.

invented entities (1)

Inspector agent · no independent evidence
purpose: Links every number, angle, and asset back to data, code, or external reference for verifiability.
New component introduced to enforce evidence grounding; no independent falsifiable evidence provided in abstract.

pith-pipeline@v0.9.1-grok · 5883 in / 1492 out tokens · 27258 ms · 2026-06-27T13:41:25.574340+00:00


[1] Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, and Dong Yu. DSBench: How far are data science agents from becoming data science experts? arXiv preprint arXiv:2409.07703, 2024.

[2] Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, et al. ScienceAgentBench: Toward rigorous assessment of language agents for data-driven scientific discovery. arXiv preprint arXiv:2410.05080, 2024.

[3] Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. MLE-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024.

[4] Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench: Evaluating language agents on machine learning experimentation. arXiv preprint arXiv:2310.03302, 2023.

[5] Zhiyu Yang, Zihan Zhou, Shuo Wang, Xin Cong, Xu Han, Yukun Yan, Zhenghao Liu, Zhixing Tan, Pengyuan Liu, Dong Yu, et al. MatPlotAgent: Method and evaluation for LLM-based agentic scientific data visualization. In Findings of the Association for Computational Linguistics: ACL 2024, pages 11789–11804, 2024.

[6] Victor Dibia. LIDA: A tool for automatic generation of grammar-agnostic visualizations and infographics using large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 113–126, 2023.

[7] Zichen Chen, Jiefeng Chen, Sercan Ö Arik, Misha Sra, Tomas Pfister, and Jinsung Yoon. Coda: Agentic systems for collaborative data visualization. arXiv preprint arXiv:2510.03194, 2025.

[8] Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, and Diyi Yang. Design2Code: Benchmarking multimodal code generation for automated front-end engineering. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 39...

[9] Holly Rusch. Should AI cover your city council meeting? Prevalence of AI-generated articles summarizing public meetings grows in San Mateo County. San Mateo Daily Journal, 2025. Accessed: 2026-06-08.

[10] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.

[11] Zehui Chen, Kuikun Liu, Qiuchen Wang, Jiangning Liu, Wenwei Zhang, Kai Chen, and Feng Zhao. MindSearch: Mimicking human minds elicits deep AI searcher. In International Conference on Learning Representations (ICLR), 2025. arXiv:2407.20183.

[12] Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Guanglu Song, Peng Gao, Yu Liu, Chunyuan Li, and Hongsheng Li. MMSearch: Benchmarking the potential of large models as multi-modal search engines. In International Conference on Learning Representations (ICLR), 2025. arXiv:2409.12959.

[13] Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G Finlayson, David Sontag, et al. DR Tulu: Reinforcement learning with evolving rubrics for deep research. arXiv preprint arXiv:2511.19399, 2025.

[14] Fan Nie, Junlin Wang, Harper Hua, Federico Bianchi, Yongchan Kwon, Zhenting Qi, Owen Queen, Shang Zhu, and James Zou. DSGym: A holistic framework for evaluating and training data science agents. arXiv preprint arXiv:2601.16344, 2026.

[15] Sirui Hong, Yizhang Lin, Bang Liu, Bangbang Liu, Binhao Wu, Ceyao Zhang, Chenxing Wei, Danyang Li, Jiaqi Chen, Jiayi Zhang, Jinlin Wang, Li Zhang, Lingyao Zhang, Min Yang, Mingchen Zhuge, Taicheng Guo, Tuo Zhou, Wei Tao, Robert Tang, Xiangtao Lu, Xiawu Zheng, Xinbing Liang, Yaying Fei, Yuheng Cheng, Yongxin Ni, Zhibin Gou, Zongze Xu, Yuyu Luo, and Chengli... Data Interpreter: An LLM agent for data science. 2025.

[16] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024.

[17] Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066, 2025.

[18] Alexander Spangher, Nanyun Peng, Sebastian Gehrmann, and Mark Dredze. Do LLMs plan like human writers? Comparing journalist coverage of press releases with LLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.

[19] Jonathan Gray, Lucy Chambers, and Liliana Bounegru. The data journalism handbook: How journalists can use data to improve the news. O’Reilly Media, Inc., 2012.

[20] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

[21] OpenAI. Introducing deep research. https://openai.com/index/introducing-deep-research/, 2025.

[22] Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. BrowseComp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516, 2025.

[23] Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. DeepResearcher: Scaling deep research via reinforcement learning in real-world environments. arXiv preprint arXiv:2504.03160, 2025.

[24] Zhuofeng Li, Dongfu Jiang, Xueguang Ma, Haoxiang Zhang, Ping Nie, Yuyu Zhang, Kai Zou, Jianwen Xie, Yu Zhang, and Wenhu Chen. Openresearcher: A fully open pipeline for long-horizon deep research trajectory synthesis. arXiv preprint arXiv:2603.20278, 2026.

[25] Mohammed Saidul Islam, Md Tahmid Rahman Laskar, Md Rizwan Parvez, Enamul Hoque, and Shafiq Joty. DataNarrative: Automated data-driven storytelling with visualizations and texts. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 19253–19286, 2024.

[26] Shaolei Zhang, Ju Fan, Meihao Fan, Guoliang Li, and Xiaoyong Du. DeepAnalyze: Agentic large language models for autonomous data science. arXiv preprint arXiv:2510.16872, 2025.

[27] Sina Montazeri, Yunhe Feng, and Kewei Sha. PublicAgent: Multi-agent design principles from an LLM-based open data analysis framework. arXiv preprint arXiv:2511.03023, 2025.

[28] Natalie Grace Brigham, Chongjiu Gao, Tadayoshi Kohno, Franziska Roesner, and Niloofar Mireshghallah. Developing story: Case studies of generative AI’s use in journalism. arXiv preprint arXiv:2406.13706, 2024.

[29] Sophia Cheng. When journalism meets AI: Risk or opportunity? Digital Government: Research and Practice, 6(1):1–12, 2025.

[30] Alexander Spangher, Tenghao Huang, Yiqin Huang, Lucas Spangher, Sewon Min, and Mark Dredze. A novel multi-document retrieval benchmark: Journalist source-selection in newswriting. In Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing, pages 180–204, 2025.

[31] Milad Alshomary, Grace Li, Anubhav Jangra, Yufang Hou, Kathleen McKeown, and Smaranda Muresan. LLMs as science journalists: Supporting early-stage researchers in communicating their science to the public. arXiv preprint arXiv:2601.05821, 2026.

[32] Leixian Shen, Haotian Li, Yun Wang, and Huamin Qu. From data to story: Towards automatic animated data video creation with LLM-based multi-agent systems. In IEEE VIS Workshop on Generative AI for Data Storytelling (Gen4DS), 2024. arXiv:2408.03876.

[33] Liliana Bounegru and Jonathan Gray. The Data Journalism Handbook 2: Towards a Critical Data Practice. Amsterdam University Press, 2021.

[34] Edward R. Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT, 2nd edition, 2001.

[35] Colin Ware. Information Visualization: Perception for Design. Morgan Kaufmann, San Francisco, CA, 2nd edition, 2004.

[36] Edward Segel and Jeffrey Heer. Narrative visualization: Telling stories with data. IEEE Transactions on Visualization and Computer Graphics, 16(6):1139–1148, 2010.

[37] Cole Nussbaumer Knaflic. Storytelling with data: A data visualization guide for business professionals. John Wiley & Sons, 2025.

[38] Sarah Cohen, James T Hamilton, and Fred Turner. Computational journalism. Communications of the ACM, 54(10):66–71, 2011.

[39] Nicholas Diakopoulos. Algorithmic accountability: Journalistic investigation of computational power structures. Digital Journalism, 3(3):398–415, 2015.

[40] Andrew Gelman and Eric Loken. The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Department of Statistics, Columbia University, 2013. Unpublished manuscript.

[41] Alberto Cairo. The truthful art: Data, charts, and maps for communication. New Riders, 2016.

[42] H. Paul Grice. Logic and conversation. In Peter Cole and Jerry L. Morgan, editors, Syntax and Semantics, Vol. 3: Speech Acts, pages 41–58. Academic Press, New York, 1975.

[43] Chris North. Toward measuring visualization insight. IEEE Computer Graphics and Applications, 26(3):6–9, 2006.

[44] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.

[45] Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, et al. Agent-as-a-judge: Evaluate agents with agents. arXiv preprint arXiv:2410.10934, 2024.

[46] Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. MLLM-as-a-judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark. In Forty-first International Conference on Machine Learning, 2024.

[47] Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. WebArena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations, volume 2024, pages 15585–15606, 2024.

Appendix A Model Settings

Data Journalist Agent is based o...