
arXiv: 2606.11176 · v1 · pith:ONU4LWHFnew · submitted 2026-06-09 · 💻 cs.CV · cs.CL · cs.CY · cs.HC

Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

Pith reviewed 2026-06-27 13:41 UTC · model grok-4.3

classification 💻 cs.CV cs.CL cs.CY cs.HC
keywords data journalism, multi-agent systems, verifiable generation, multimodal stories, evidence grounding, AI agents, news automation, transparency

The pith

A multi-agent framework produces evidence-grounded multimodal data stories that are competitive with expert work, particularly in verifiability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Data2Story, a system that assembles specialized agents into one workflow covering the full data journalism process, from raw data to finished articles. An Inspector component ties every claim, number, and visual element back to its source, while the system decides when to generate interactive maps, audio, or other non-text formats. The evaluation covers 18 published articles, comparing each output with the original human piece using angle coverage checks, ratings from 53 readers, computer-use judges, and automated code verification. According to the abstract, the agent's stories are competitive and particularly strong in transparency and auditability, while human work still leads in editorial angle, creative design, and presentation. This setup would let reporting teams produce auditable drafts more quickly while keeping every assertion traceable.

Core claim

Data2Story coordinates multiple agents to act as a complete virtual newsroom that converts raw data into published multimedia stories. An Inspector ensures every number, angle, and asset links directly to the underlying data, code, or an external reference, and the system generates interactive or audio elements when it judges that readers would want them. On 18 paired articles, the outputs are competitive with the expert human versions across four axes: human-agent angle coverage, a 53-participant rubric, computer-use judges, and a coding verifier that re-executes statements and checks references.

What carries the argument

The Inspector agent, which enforces evidence-grounding by linking every element back to data or references, together with multimodal tool selection that chooses interactive maps, audio, or other formats based on content.
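
The abstract does not describe how the Inspector stores its links, so the following is only a minimal sketch of what evidence-grounding could look like as a data structure. The class names `Evidence` and `Claim`, the function `ungrounded`, and the sample claims are all hypothetical illustrations, not taken from the paper or its code release.

```python
from dataclasses import dataclass, field

# The three evidence types named in the abstract: data, code, or an external reference.
ALLOWED_KINDS = {"data", "code", "reference"}

@dataclass
class Evidence:
    kind: str      # one of ALLOWED_KINDS
    locator: str   # hypothetical pointer: a file path, a code cell id, or a URL

@dataclass
class Claim:
    text: str
    evidence: list[Evidence] = field(default_factory=list)

def ungrounded(claims):
    """Return the claims that carry no evidence link of an allowed kind."""
    return [
        c for c in claims
        if not any(e.kind in ALLOWED_KINDS for e in c.evidence)
    ]

# Made-up example claims, purely for illustration.
claims = [
    Claim("Ridership fell 12% in 2023.", [Evidence("code", "analysis.py#cell-4")]),
    Claim("The decline was the steepest on record."),
]
flagged = ungrounded(claims)  # claims an Inspector-like check would send back
```

In this sketch a claim passes only if at least one link of a recognized kind is attached; whether the linked evidence actually supports the claim is a separate question that a verifier would have to answer.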

If this is right

  • Every generated story can be audited by re-running the linked code and checking external references.
  • Multimodal elements such as interactive maps are added only when the agent determines they aid reader understanding.
  • The system serves as a starting draft that journalists can edit while retaining the built-in traceability.
  • Verifiability scores rise because the Inspector prevents unsupported claims from reaching the final article.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same evidence-linking structure could apply to other domains that require traceable outputs, such as policy reports or scientific summaries.
  • If extended to live data streams, the framework might support rapid updates with automatic re-verification of changed numbers.
  • Combining the Inspector with external fact-checking databases could reduce the remaining gap in editorial angle selection.

Load-bearing premise

The 18 chosen articles and the four evaluation axes capture the full range of data journalism tasks and reader requirements without selection bias or missing metrics that would change the competitiveness result.

What would settle it

The verifiability claim would be undermined if re-running the coding verifier on a new set of 20 articles found that more than 25 percent of Data2Story claims could not be re-executed against the source data or matched to references.
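
The 25 percent threshold is simple arithmetic over per-claim verifier outcomes. A minimal sketch follows, assuming each claim is reduced to a single pass/fail boolean; the function names and the sample counts are hypothetical and do not come from the paper.

```python
def verification_failure_rate(results):
    """Fraction of claims that could not be re-executed or matched.

    `results` is a list of booleans, True when a claim was verified.
    """
    if not results:
        raise ValueError("no claims to evaluate")
    failures = sum(1 for ok in results if not ok)
    return failures / len(results)

def claim_is_undermined(results, threshold=0.25):
    """True when the failure rate exceeds the threshold (25 percent by default)."""
    return verification_failure_rate(results) > threshold

# Made-up outcome counts for illustration: 70 verified claims, 30 failed.
sample = [True] * 70 + [False] * 30
rate = verification_failure_rate(sample)
undermined = claim_is_undermined(sample)
```

Pooling all claims across articles, as done here, is one design choice; averaging per-article failure rates instead would weight short and long articles equally and could give a different answer.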

Original abstract

Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual steps well: data-science agents close the analysis loop, while design agents synthesize beautiful websites. But can an agent serve as a data journalist end to end? We introduce Data Journalist Agent (Data2Story), a multi-agent framework that orchestrates specialized roles into a single virtual newsroom. Data2Story contributes two innovations. (i) Claims are evidence-grounded: an Inspector links every number, angle, and asset back to data, code, or an external reference. (ii) Articles are multimodally generative: rather than defaulting to plain text and static charts, Data2Story reasons about what readers will want to see, then deploys multimodal tools, such as interactive maps for geography and audio for music. We evaluate Data2Story on 18 articles, each paired with the originally published expert piece, along four axes: (a) human-agent angle coverage; (b) rubric evaluation with 53 participants across five dimensions; (c) computer-use agents as judges, a cost-saving proxy for how readers navigate interactive articles; and (d) verifiability, where a coding verifier re-executes statements against the data and checks claims against references. Data2Story produces competitive, evidence-traceable multimedia stories, with particular strength in transparency and auditability. Human articles retain an edge in editorial angle, creative design, and presentation. We position Data2Story as a collaborator for journalists, enabling more evidence-based, transparent, and verifiable reporting. Code and demos are available at https://data2story.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The manuscript introduces Data2Story, a multi-agent framework that orchestrates specialized roles to automate end-to-end data journalism. It contributes an Inspector agent that grounds every claim, angle, and asset to data, code, or external references, plus multimodal generation that reasons about reader needs to produce interactive maps, audio, and other assets rather than static text/charts. The system is evaluated on 18 articles (each paired with the originally published expert piece) along four axes: human-agent angle coverage, a rubric scored by 53 participants across five dimensions, computer-use agents as judges for interactive navigation, and a coding verifier that re-executes statements against the original data and checks references. The authors conclude that Data2Story produces competitive, evidence-traceable multimedia stories with particular strength in transparency and auditability, while human articles retain advantages in editorial angle, creative design, and presentation; they position the system as a collaborator for journalists.

Significance. If the evaluation holds, the work would be significant for showing that agentic systems can handle the full pipeline of data journalism with explicit verifiability mechanisms, moving beyond isolated tools for analysis or design. The release of code and demos at https://data2story.github.io supports reproducibility and allows direct inspection of the Inspector and multimodal components. This could influence future agent frameworks that require auditability in high-stakes domains such as news.

major comments (4)
  1. [Evaluation section] Evaluation section (abstract and implied §4): the claim that Data2Story produces 'competitive' stories rests on the 18 selected articles, yet the manuscript provides no explicit selection protocol, diversity metrics, or justification that these cases represent the range of data journalism challenges (e.g., time pressure, ethical framing, long-form engagement). This selection bias risk directly affects the generalizability of the competitiveness conclusion.
  2. [Evaluation section] Evaluation section: no information is given on how the 53 participants were recruited, what statistical tests were applied to the rubric scores, inter-rater reliability, or data exclusion rules. These omissions make it impossible to confirm that the reported human evaluation results support the competitiveness claim.
  3. [Verifiability axis] Verifiability axis (abstract): the coding verifier is presented as a strength for re-executing statements against data and checking references, but the manuscript supplies no description of how it handles edge cases, ambiguous claims, or complex multimodal assets, which is load-bearing for the auditability advantage asserted.
  4. [Method section] Method section (abstract): the Inspector agent is introduced as a core innovation for evidence-grounding, yet the paper does not detail its implementation, failure modes, or how it avoids missing references, leaving the 'evidence-traceable' claim underspecified.
minor comments (2)
  1. The abstract would be strengthened by including at least one quantitative result (e.g., average rubric scores or verifiability pass rate) rather than the qualitative statement 'produces competitive' stories.
  2. The four evaluation axes are listed but their precise mapping to the five rubric dimensions is not clarified in the provided text.

Simulated Author's Rebuttal

4 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each of the four major comments below. Where information was missing or underspecified, we agree that revisions are needed and will incorporate the requested details in the next version.

Point-by-point responses
  1. Referee: [Evaluation section] Evaluation section (abstract and implied §4): the claim that Data2Story produces 'competitive' stories rests on the 18 selected articles, yet the manuscript provides no explicit selection protocol, diversity metrics, or justification that these cases represent the range of data journalism challenges (e.g., time pressure, ethical framing, long-form engagement). This selection bias risk directly affects the generalizability of the competitiveness conclusion.

    Authors: We agree that the selection protocol requires explicit description to support generalizability claims. The 18 articles were chosen from publicly available data journalism pieces with accompanying datasets. In the revised manuscript we will add a subsection in §4 specifying the selection criteria (public data availability, topic variety across politics, environment, health, and economics), diversity metrics (e.g., distribution of data types and story lengths), and limitations regarding real-time reporting or ethically complex cases. revision: yes

  2. Referee: [Evaluation section] Evaluation section: no information is given on how the 53 participants were recruited, what statistical tests were applied to the rubric scores, inter-rater reliability, or data exclusion rules. These omissions make it impossible to confirm that the reported human evaluation results support the competitiveness claim.

    Authors: We acknowledge the omission of these methodological details. The revised manuscript will expand the human evaluation description to include recruitment procedures, the statistical tests applied to rubric scores, inter-rater reliability metrics, and data exclusion rules. These additions will allow readers to evaluate the robustness of the reported results. revision: yes

  3. Referee: [Verifiability axis] Verifiability axis (abstract): the coding verifier is presented as a strength for re-executing statements against data and checking references, but the manuscript supplies no description of how it handles edge cases, ambiguous claims, or complex multimodal assets, which is load-bearing for the auditability advantage asserted.

    Authors: The current manuscript does not detail the verifier's handling of edge cases. We will add a description in the revised Evaluation section covering tolerance thresholds for numerical matches, flagging of ambiguous claims for review, and verification procedures for multimodal assets via associated data sources and code. This will strengthen the auditability claims. revision: yes

  4. Referee: [Method section] Method section (abstract): the Inspector agent is introduced as a core innovation for evidence-grounding, yet the paper does not detail its implementation, failure modes, or how it avoids missing references, leaving the 'evidence-traceable' claim underspecified.

    Authors: We agree that the Inspector agent's implementation requires more detail. The revised Method section will include its prompting approach, mechanisms for linking claims to sources, documented failure modes such as missed references, and mitigation steps including multi-agent cross-verification. These additions will better substantiate the evidence-grounding contribution. revision: yes
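
The third response mentions tolerance thresholds for numerical matches and flagging of ambiguous claims. One way such a check might work is sketched below using Python's standard `math.isclose`. The function name `check_numeric_claim`, the 1 percent default tolerance, and the three status labels are assumptions made for illustration; the manuscript does not specify any of them.

```python
import math

def check_numeric_claim(stated, recomputed, rel_tol=0.01, abs_tol=0.0):
    """Compare a number stated in the article with the re-executed value.

    Returns "ambiguous" when no value could be recomputed (flag for review),
    "verified" when the two agree within tolerance, and "mismatch" otherwise.
    """
    if recomputed is None:
        return "ambiguous"
    if math.isclose(stated, recomputed, rel_tol=rel_tol, abs_tol=abs_tol):
        return "verified"
    return "mismatch"

# Made-up values: an article states 12.0 and re-execution yields 12.04.
status_close = check_numeric_claim(12.0, 12.04)
status_far = check_numeric_claim(12.0, 15.0)
status_missing = check_numeric_claim(12.0, None)
```

A relative tolerance accommodates rounding in prose (for example, "about 12 percent"), but it behaves poorly for values near zero, which is why `math.isclose` also accepts an absolute tolerance.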

Circularity Check

0 steps flagged

No circularity in derivation or evaluation chain

full rationale

The paper's central claims rest on external evaluation: 18 articles are compared to originally published expert pieces, scored by 53 human participants on a rubric, judged by separate computer-use agents, and verified by a coding verifier that re-executes statements against the original data and references. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided abstract or evaluation description. The derivation chain (multi-agent orchestration with Inspector for grounding) is assessed against independent benchmarks rather than quantities defined by the system itself, making the result self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review; no numerical free parameters are mentioned. The framework introduces the Inspector as a new component whose reliability is assumed rather than derived from prior results.

axioms (1)
  • domain assumption: Specialized agents can be orchestrated into a reliable end-to-end pipeline for complex creative tasks without coordination failures that would break evidence links.
    Invoked by the claim that the multi-agent framework functions as a single virtual newsroom.
invented entities (1)
  • Inspector agent (no independent evidence)
    purpose: Links every number, angle, and asset back to data, code, or external reference for verifiability.
    New component introduced to enforce evidence grounding; no independent falsifiable evidence provided in abstract.

pith-pipeline@v0.9.1-grok · 5883 in / 1492 out tokens · 27258 ms · 2026-06-27T13:41:25.574340+00:00 · methodology

