pith. machine review for the scientific record.

arxiv: 2604.24819 · v1 · submitted 2026-04-27 · 💻 cs.SE · cs.AI

Recognition: unknown

Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

Cheng Tan, Chenkai Pan, Conghui He, Jingxuan Wei, Jintao Chen, Siyuan Li, Xinglong Xu, Yuhang Xu, Yujun Wu

Pith reviewed 2026-05-08 03:10 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords test-driven data engineering · LLM fine-tuning · structured knowledge extraction · failure-driven repair · data as code · self-improving models · domain adaptation · knowledge base construction

The pith

When training data and evaluation share a structured knowledge base extracted from raw text, model failures become traceable data deficiencies that can be repaired like software bugs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that extracting a structured knowledge representation from source corpora creates a shared foundation for both generating training data and building evaluation benchmarks. Under this setup the data-engineering process maps directly onto the software development lifecycle, with training examples functioning as source code, model updates as compilation, benchmarks as unit tests, and error analysis as debugging. Model mistakes break down into missing concepts or broken reasoning chains that point back to specific gaps in the extracted knowledge, allowing targeted data patches that improve performance consistently across model sizes and types. This correspondence turns the otherwise opaque process of knowledge transfer into a repeatable, repairable engineering workflow. The authors demonstrate the approach across sixteen scientific and social-science domains and release the resulting knowledge base, benchmarks, and corpora.
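The lifecycle analogy can be made concrete with a toy sketch (a hypothetical illustration, not the authors' implementation): one shared knowledge base generates both training items and test items, so a failing test points back at the knowledge node that produced it.

```python
# Toy sketch of the "Programming with Data" loop described above.
# All names and schemas are hypothetical, not the paper's actual API.

knowledge_base = {
    "kb:ohms-law": "V = I * R relates voltage, current, and resistance.",
    "kb:kirchhoff-current": "Currents into a circuit node sum to zero.",
}

def generate_training_item(node_id):
    # "Source code": a training example derived from one knowledge node.
    return {"node": node_id, "prompt": f"Explain: {knowledge_base[node_id]}"}

def generate_test_item(node_id):
    # "Unit test": an evaluation item derived from the same node.
    return {"node": node_id, "question": f"State the fact: {knowledge_base[node_id]}"}

def diagnose(failed_tests):
    # "Debugging": each failure traces to the node that produced the test,
    # so the repair is a targeted data patch, not indiscriminate scaling.
    return [t["node"] for t in failed_tests]

tests = [generate_test_item(n) for n in knowledge_base]
failed = [tests[0]]                      # pretend the model missed Ohm's law
patch_targets = diagnose(failed)
patches = [generate_training_item(n) for n in patch_targets]
print(patch_targets)                     # ['kb:ohms-law']
```

Because training and evaluation items carry the same node IDs, the "compile, test, debug" cycle closes without any manual attribution step.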

Core claim

When a structured knowledge representation extracted from the source corpus serves as the shared foundation for both training data and evaluation, the complete data-engineering lifecycle maps onto the software development lifecycle in a precise and operative way: training data becomes source code specifying what the model should learn, model training becomes compilation, benchmarking becomes unit testing, and failure-driven data repair becomes debugging. Under this correspondence, model failures decompose into concept-level gaps and reasoning-chain breaks that can be traced back to specific deficiencies in the data and repaired through targeted patches, with each repair cycle producing consistent improvements across model scales and architectures without degrading general capabilities.

What carries the argument

The structured knowledge representation extracted from raw corpora that serves as the single source for both training examples and evaluation benchmarks, enabling the full mapping of data engineering onto the software development lifecycle.

If this is right

  • Each debugging cycle improves domain performance across different model scales and architectures without harming general capabilities.
  • The same structured knowledge base can generate both the training corpus and the test suite, closing the loop between data creation and verification.
  • Specialized human knowledge from any text corpus can be transferred into models through repeated, measurable repair steps rather than blind data scaling.
  • The method applies uniformly across natural sciences, engineering, biomedicine, and social sciences once the initial knowledge extraction is performed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the mapping holds, organizations could maintain version-controlled knowledge bases whose updates automatically propagate to model behavior through the same debugging workflow used for software.
  • The approach suggests a route to partial interpretability: each model error can be linked to a concrete, human-readable knowledge item rather than remaining inside opaque parameter changes.
  • Extending the method to multimodal or agentic systems would require only that their failures also decompose into traceable gaps in a shared structured representation.

Load-bearing premise

Model failures can be decomposed into specific concept gaps or reasoning breaks that reliably trace back to particular missing or incorrect items in the extracted knowledge base.
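Operationally, this premise demands that every failure record resolve to either a missing concept or a broken reasoning step, each carrying a knowledge-base ID. A minimal sketch (the two categories follow the paper's split, but the record schema here is an assumption):

```python
# Illustrative failure triage under the load-bearing premise above.
# Field names ("missing_concepts", "broken_chain_steps") are assumed
# for the sketch; they are not the paper's schema.

def triage(failure):
    """Map one benchmark failure to a repairable knowledge deficiency."""
    if failure["missing_concepts"]:
        return {"type": "concept_gap", "targets": failure["missing_concepts"]}
    if failure["broken_chain_steps"]:
        return {"type": "reasoning_break", "targets": failure["broken_chain_steps"]}
    # If neither applies, the failure is untraceable -- exactly the case
    # that would undermine the software-debugging analogy.
    return {"type": "untraceable", "targets": []}

f = {"missing_concepts": ["L1:electrochemical-series"], "broken_chain_steps": []}
print(triage(f))  # {'type': 'concept_gap', 'targets': ['L1:electrochemical-series']}
```

The premise holds only to the extent that the "untraceable" branch stays empty in practice.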

What would settle it

A controlled experiment in which targeted data patches derived from failure analysis produce no measurable improvement on the corresponding benchmark items would refute the core claim; repeated cycles showing targeted gains with general capabilities unchanged would support it.
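Such a test could be run as a simple paired comparison (a sketch; the metric names and the 0.02 drift tolerance are assumptions, not the paper's protocol): measure accuracy on the patched items' benchmark slice and on a held-out general slice, before and after one repair cycle.

```python
# Sketch of the settling experiment: the core claim predicts a gain on
# the targeted slice with no regression on the general slice. The 0.02
# tolerance and the example accuracies are illustrative assumptions.

def evaluate_repair(before, after, tol=0.02):
    targeted_gain = after["targeted"] - before["targeted"]
    general_drift = after["general"] - before["general"]
    if targeted_gain <= 0:
        return "claim refuted: patch produced no targeted improvement"
    if general_drift < -tol:
        return "claim weakened: general capabilities degraded"
    return "claim supported on this cycle"

before = {"targeted": 0.41, "general": 0.72}
after = {"targeted": 0.58, "general": 0.71}
print(evaluate_repair(before, after))  # claim supported on this cycle
```

Running this over many cycles, models, and disciplines is what "consistent improvements across model scales and architectures" would require.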

Figures

Figures reproduced from arXiv: 2604.24819 by Cheng Tan, Chenkai Pan, Conghui He, Jingxuan Wei, Jintao Chen, Siyuan Li, Xinglong Xu, Yuhang Xu, Yujun Wu.

Figure 1. The conceptual correspondence between test-driven engineering in software and the Pro…
Figure 2. From open-loop data engineering to Programming with Data. a, The pre-training playbook breaks for domain fine-tuning because failures cannot be traced back to data without a shared structure. b, Software engineering solved this by deriving code and tests from a shared specification. c, Programming with Data applies the same principle, using a shared knowledge structure to close the loop between training d…
Figure 3. Structured knowledge extracted from 16 disciplines. a, Corpus distillation pipeline. Successive filtering reduces 117,000 raw documents (~15B tokens) to 48,000 high-quality chunks, from which 43,953 L3 reasoning chains, 186,784 L2 relational statements, and 227,869 L1 atomic concepts are extracted top-down. Percentages indicate retention rates. b, Representative knowledge subgraph for a single corpus chunk…
Figure 4. Meta-evaluation of the ProDa-16 benchmark. a, Spearman rank correlation between ProDa-16 and 11 established benchmarks across models (dashed line, ρ = 0.80; red dotted line, mean ρ = 0.847). b, Overall accuracy by model, ranked in descending order; error bars denote 95% bootstrap confidence intervals across disciplines. c, Per-discipline accuracy distributions across all models; thick bars, interquartile r…
Figure 5. Performance comparison of data synthesis methods. Average benchmark scores of Qwen2.5-7B fine-tuned on data generated by Alpaca, EasyDataset, DataFlow, and our ProDa framework across 1K–10K data scales. R and F denote random sampling and heuristic filtering baselines. ProDa V2, leveraging closed-loop diagnostic repair, exhibits exceptional sample efficiency at 1K and consistently outperforms all conventio…
Figure 6. Three diagnostic-repair case studies. Each row shows the question (left), the relevant knowledge structure (centre), and the diagnostic report with V1/V2 responses (right). Cases span Physics (concept gap), Economics (capability deficit), and Medicine (concept gap). In all cases the V2 model corrects the V1 error after training on patches anchored to the diagnosed knowledge nodes.
Figure 7. The ProDa Studio integrated development environment. a, Knowledge extraction interface showing L3 chains, L2 statements, and L1 concepts. b, Data generation interface displaying individual training instances with their type, source chain, linked knowledge nodes, and generation metadata. c, Fine-tuning console with real-time loss and learning rate monitoring. d, Evaluation dashboard with per-discipline scor…
Figure 8. Prompt template used for corpus-level document triage in ProDa's preprocessing stage.
Figure 9. Distribution of documents by discipline and academic level after document-level curation.
Figure 10. Reasoning type analysis of the raw corpus during document-level curation.
Figure 11. Score distributions of the six-dimensional quality matrix across all corpus chunks.
Figure 12. Prompt template used for L3 Reasoning Chain extraction from high-quality corpus chunks. Each chunk yields exactly one chain representing its primary inferential pathway.
Figure 13. Prompt template used for L2 atomic statement decomposition. Each adjacent step pair in an L3 chain is converted into a single typed relational triple with textual evidence.
Figure 14. Prompt template used for L1 Key Concept extraction. Concepts are harvested from L2 statement subjects and objects, then deduplicated, defined in context, and linked back to their source statements for traceability.
Figure 15. Knowledge hierarchy example from Molecular Biology. An L3 reasoning chain captures the multi-step mechanism of chromatin activation. L2 decomposes it into atomic subject–predicate–object statements. L1 harvests and defines the key concept Histone Proteins (H3 and H4) with full traceability.
Figure 16. Knowledge hierarchy example from Chemistry. The L3 chain captures the full electrolysis mechanism from ion dissociation through Faraday's Laws. L2 isolates the causal link between ion migration and competitive discharge. L1 extracts Electrochemical Series as the governing framework.
Figure 17. Knowledge hierarchy example from Sociology. The L3 chain captures how visual framing mediates representation viewing through physical boundaries, psychological oscillation, and meta-representational exposure. L2 isolates the link between frame establishment and physical mediation. L1 extracts Parergon (Frame), demonstrating that ProDa generalizes to interpretive social-science disciplines.
Figure 18. Prompt for choice question generation. This prompt directs the model to synthesize single-choice and multiple-choice questions from atomic L1 concepts and L2 factual statements. It enforces strict distractor construction rules, maintains a specified question distribution, and mandates detailed scientific reasoning for the answers without revealing internal metadata.
Figure 19. Prompt for instructional QA pair generation. This prompt directs the model to translate atomic L2 statements into natural, classroom-style question-answer pairs. It mandates comprehensive coverage, diverse question styles, and contextually unambiguous natural language, ensuring the dataset is suitable for high-quality LLM instruction tuning.
Figure 20. Prompt for true/false statement generation. This prompt directs the model to generate diverse, educationally valuable true and false statements from atomic L1 concepts and L2 facts. It mandates detailed scientific explanations for both correct and incorrect statements, ensuring high-quality reasoning data for downstream LLM instruction tuning.
Figure 21. Prompt for complex MCQ generation. This prompt directs the model to design high-quality, deep-reasoning multiple-choice questions based on academic process chains. It enforces strict guidelines on distractor plausibility, question depth, and language consistency, ensuring the output avoids trivial recall and accurately tests causal and logical comprehension.
Figure 22. Distractor generation strategy. This extracted guideline details the strict constraints for constructing plausible distractors. It focuses on testing deep comprehension through logical fallacies while enforcing rigorous structural consistency across all options.
Figure 23. Global distribution of question types in the SFT_v1 dataset: open ended 60.0%, single choice 23.2%, true/false 10.0%, multiple choice 6.8%.
Figure 24. Diagnosis prompt. This prompt instructs the model to act as an evaluation expert, analyzing error samples to categorize failures as either conceptual gaps or reasoning deficits, while enforcing a strict JSON output schema.
Figure 25. Concept gap prompt. Specifically designed to address the "Concept Gap" error type, this prompt uses diagnostic reports to generate targeted, high-fidelity training samples that correct specific conceptual misunderstandings and reinforce precise knowledge boundaries.
Figure 26. Capability deficit prompt. Specifically designed to address "Capability Deficit" errors, this prompt configures the model as an Elite Reasoning Scaffolding Specialist. It uses diagnostic insights to generate high-quality Chain-of-Thought (CoT) training samples aimed at systematically building multi-step reasoning abilities and eliminating logical gaps.
Figure 27. Data mixing and replay strategy. This protocol details the transition from a uniform baseline to an error-proportional data allocation. It outlines the generation of multi-format repair samples (at a 6:3:1 ratio) and introduces an L2 ID-disjoint experience replay mechanism to fill category quotas while actively preventing catastrophic forgetting.
Figure 28. Automated diagnostic report example. This extracted report showcases the performance of Qwen2.5-7B-SFT, providing a quantitative breakdown of error patterns and qualitative diagnoses for specific failure cases to guide the subsequent refinement process.
read the original abstract

Reliably transferring specialized human knowledge from text into large language models remains a fundamental challenge in artificial intelligence. Fine-tuning on domain corpora has enabled substantial capability gains, but the process operates without feedback: when a model fails on a domain task, there is no method to diagnose what is deficient in the training data, and the only recourse is to add more data indiscriminately. Here we show that when a structured knowledge representation extracted from the source corpus serves as the shared foundation for both training data and evaluation, the complete data-engineering lifecycle maps onto the software development lifecycle in a precise and operative way: training data becomes source code specifying what the model should learn, model training becomes compilation, benchmarking becomes unit testing, and failure-driven data repair becomes debugging. Under this correspondence, model failures decompose into concept-level gaps and reasoning-chain breaks that can be traced back to specific deficiencies in the data and repaired through targeted patches, with each repair cycle producing consistent improvements across model scales and architectures without degrading general capabilities. We formalize this principle as Programming with Data and instantiate it across sixteen disciplines spanning the natural sciences, engineering, biomedicine, and the social sciences, releasing a structured knowledge base, benchmark suite, and training corpus as open resources. By demonstrating that the relationship between training data and model behaviour is structurally traceable and systematically repairable, this work establishes a principled foundation for the reliable engineering of human expertise into language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the 'Programming with Data' framework, which extracts a structured knowledge representation from raw source corpora to serve as the shared foundation for both training data and evaluation benchmarks. It claims that this correspondence maps the full data-engineering lifecycle onto the software development lifecycle in a precise and operative manner: training data becomes source code, model training becomes compilation, benchmarking becomes unit testing, and failure-driven data repair becomes debugging. Under this mapping, model failures are said to decompose into concept-level gaps and reasoning-chain breaks traceable to specific data deficiencies, which can be repaired via targeted patches. The authors assert that each repair cycle yields consistent improvements across model scales and architectures in sixteen disciplines (natural sciences, engineering, biomedicine, social sciences) without degrading general capabilities, and they release the associated knowledge base, benchmark suite, and training corpus as open resources.

Significance. If the traceability of failures and the reported improvements hold under rigorous controls, the work would establish a principled, feedback-driven methodology for domain adaptation of LLMs that treats data engineering as a debuggable engineering discipline rather than an ad-hoc process. The explicit analogy to software lifecycles and the release of open resources could enable reproducible, iterative improvement of specialized model capabilities and reduce reliance on indiscriminate data scaling.

major comments (2)
  1. Abstract: The assertion that 'each repair cycle producing consistent improvements across model scales and architectures' is load-bearing for the central claim that the mapping is 'precise and operative,' yet the abstract supplies no quantitative results, performance deltas, error analysis, or controls. Without these, the effectiveness of the debugging analogy cannot be assessed.
  2. Abstract: The claim that failures 'decompose into concept-level gaps and reasoning-chain breaks that can be traced back to specific deficiencies in the data' assumes the extracted structured representation captures the relevant causal structure of model behavior. No formal criterion, attribution method, or validation that this decomposition is unambiguous (rather than post-hoc) is provided, which is required for the analogy to hold.
minor comments (1)
  1. Abstract: The sixteen disciplines are referenced but not enumerated, and no concrete example of a structured knowledge unit, a benchmark item, or a single repair cycle is given; adding one would clarify the framework for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight opportunities to strengthen the abstract's support for the central claims. We address each point below and indicate revisions to the manuscript.

read point-by-point responses
  1. Referee: Abstract: The assertion that 'each repair cycle producing consistent improvements across model scales and architectures' is load-bearing for the central claim that the mapping is 'precise and operative,' yet the abstract supplies no quantitative results, performance deltas, error analysis, or controls. Without these, the effectiveness of the debugging analogy cannot be assessed.

    Authors: We agree that the abstract would benefit from including concise quantitative indicators to allow readers to assess the claims directly. The full manuscript reports these results in the experimental sections, including performance deltas across model scales and architectures, error breakdowns, and controls confirming no degradation in general capabilities. We will revise the abstract to incorporate a brief summary of these findings. revision: yes

  2. Referee: Abstract: The claim that failures 'decompose into concept-level gaps and reasoning-chain breaks that can be traced back to specific deficiencies in the data' assumes the extracted structured representation captures the relevant causal structure of model behavior. No formal criterion, attribution method, or validation that this decomposition is unambiguous (rather than post-hoc) is provided, which is required for the analogy to hold.

    Authors: The manuscript formalizes the decomposition criteria, attribution method, and validation in the 'Programming with Data' framework section, using mismatches against the structured knowledge representation to identify gaps and breaks, with empirical support from repair success and generalization across models. We will revise the abstract to include a short reference to this formalization and validation to address the presentation concern. revision: partial

Circularity Check

0 steps flagged

No circularity: conceptual mapping introduced independently and demonstrated empirically

full rationale

The paper defines a principle called 'Programming with Data' by positing that a structured knowledge representation extracted from the source corpus can serve as the shared foundation for training data and evaluation, thereby mapping the data-engineering lifecycle onto the software development lifecycle. This mapping is presented as a formalizable principle that is then instantiated across sixteen disciplines, with open resources released to support the claim of traceable, repairable failures. No equations, fitted parameters, or self-citations appear in the provided text that would reduce the central claims to prior inputs by construction. The decomposition of failures into concept-level gaps and reasoning-chain breaks is asserted as a consequence of the correspondence rather than defined into existence, and the reported improvements are framed as empirical outcomes rather than tautological results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Ledger populated from abstract claims only; full paper would likely reveal additional parameters and assumptions around knowledge extraction and failure tracing.

axioms (1)
  • domain assumption A structured knowledge representation can be reliably extracted from the source corpus to serve as shared foundation for both training data and evaluation
    This extraction step is required for the lifecycle mapping to function as described.
invented entities (1)
  • Programming with Data framework (no independent evidence)
    purpose: To provide a traceable, repairable correspondence between data engineering and software development for LLMs
    Newly introduced principle that organizes the claimed process.

pith-pipeline@v0.9.0 · 5583 in / 1351 out tokens · 23301 ms · 2026-05-08T03:10:05.370718+00:00 · methodology



  32. [32]

    Explainaboard: An explainable leaderboard for nlp

    Pengfei Liu, Jinlan Fu, Yang Xiao, Weizhe Yuan, Shuaichen Chang, Junqi Dai, Yixin Liu, Zihuiwen Ye, and Graham Neubig. Explainaboard: An explainable leaderboard for nlp. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations...

  33. [33]

    Easy dataset: A unified and extensible framework for synthesizing llm fine-tuning data from unstructured documents

    Ziyang Miao, Qiyu Sun, Jingyuan Wang, Yuchen Gong, Yaowei Zheng, Shiqi Li, and Richong Zhang. Easy dataset: A unified and extensible framework for synthesizing llm fine-tuning data from unstructured documents. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 960–968, 2025. 23 Programmin...

  34. [34]

    Mukherjee, A

    Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4.arXiv preprint arXiv:2306.02707, 2023

  35. [35]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  36. [36]

    Adaptive testing and debugging of nlp models

    Marco Tulio Ribeiro and Scott Lundberg. Adaptive testing and debugging of nlp models. InProceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), pages 3253–3267, 2022

  37. [37]

    Managing the development of large software systems: concepts and techniques

    Winston W Royce. Managing the development of large software systems: concepts and techniques. In Proceedings of the 9th international conference on Software Engineering, pages 328–338, 1987

  38. [38]

    Brown, and et al

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, and et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview. net/forum?id=uyTL5Bvosj. Featured Certification

  39. [39]

    Improving data efficiency for LLM reinforcement fine-tuning through difficulty-targeted online data selection and rollout replay

    Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, Mingyuan Zhou, and Huan Zhang. Improving data efficiency for LLM reinforcement fine-tuning through difficulty-targeted online data selection and rollout replay. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=uwUkETPIJN

  40. [40]

    Alpaca: A strong, replicable instruction-following model.Stanford Center for Research on Foundation Models

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model.Stanford Center for Research on Foundation Models. https://crfm. stanford. edu/2023/03/13/alpaca. html, 3(6):7, 2023

  41. [41]

    Language models as continuous self-evolving data engineers

    Peidong Wang, Ming Wang, Zhiming Ma, Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang, and Kaisong Song. Language models as continuous self-evolving data engineers. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18108–18127, 2025

  42. [42]

    Self-instruct: Aligning language models with self-generated instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. InProceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 13484–13508, 2023

  43. [43]

    WizardLM: Empowering large pre-trained language models to follow complex instructions

    Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions.arXiv preprint arXiv:2304.12244, 2023

  44. [44]

    Self-rewarding language models

    Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason E Weston. Self-rewarding language models. InForty-first International Conference on Machine Learning, 2024

  45. [45]

    STar: Bootstrapping reasoning with reasoning

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STar: Bootstrapping reasoning with reasoning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems, 2022. URLhttps://openreview.net/forum?id=_3ELRdg2sgI

  46. [46]

    arXiv preprint arXiv:2502.05605 , year =

    Yongcheng Zeng, Xinyu Cui, Xuanfa Jin, Guoqing Liu, Zexu Sun, Dong Li, Ning Yang, Jianye Hao, Haifeng Zhang, and Jun Wang. Evolving llms’ self-refinement capability via iterative preference optimization.arXiv preprint arXiv:2502.05605, 2025

  47. [47]

    Absolute Zero: Reinforced Self-play Reasoning with Zero Data

    Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data.arXiv preprint arXiv:2505.03335, 2025

  48. [48]

    Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

  49. [49]

    Lima: Less is more for alignment.Advances in Neural Information Processing Systems, 36: 55006–55021, 2023

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment.Advances in Neural Information Processing Systems, 36: 55006–55021, 2023. 24 Programming with Data

Appendix A Corpus Curation

A.1 Document classification prompt

Document Classification & Curation Judge
Role. You are a Scientifi...

1. Domain Analysis. Determine the domain of the text (e.g., Legal Argument, Chemical Reaction, Medical Diagnosis, Historical Sequence). Adapt extraction logic to fit the domain's reasoning style.

2. Chain Extraction. Identify distinct, multi-step processes or logical arguments:
   • Causal chains: A causes B, which causes C.
   • Procedural chains: Step 1, then Step 2, ...

3. Validation. Ensure the chain is continuous. Every step must have a direct connection to the next. Do not skip intermediate steps mentioned in the text.

4. Narrative Synthesis. For each chain, write a paragraph-length summary explaining the mechanism or logic behind it.

5. Step-by-Step Breakdown. List the exact sequence of nodes in text format.

Key Constraints
• Strict Logic: Every step must connect directly to the next.
• No Disconnected Facts: Do not list static facts (e.g., "Water is H2O"). Only extract flows...

1. Process Chains. Iterate through every chain_id in the input. All chains must be processed with equal attention.

2. Sliding Window Decomposition. For every adjacent pair of steps (Step[i] and Step[i+1]), generate one Factual Statement. Example chain: A→B→C. Target output: Statement(A→B) AND Statement(B→C).

3. Formulate the Statement. Each statement must contain:
   • Subject: The concept or entity in Step[i].
   • Object: The concept or entity in Step[i+1].
   • Predicate: A specific verb...

4. Contextualization. If a step contains a pronoun or vague reference (e.g., "It expands"), replace it with the specific noun based on preceding steps (e.g., "The lung expands"). Both Subject and Object must be understandable in isolation.

Key Constraints
• Atomicity: Each statement must describe exactly one logical step.
• Strict Adjacency: Only link immediat...
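The sliding-window step is purely structural and can be sketched directly; only the predicate requires a model. A minimal sketch, with `statement_id` and the `"leads_to"` placeholder predicate as assumed names (the prompt asks the model to supply a specific verb):

```python
# Sliding-window decomposition: each adjacent pair of chain steps
# (Step[i], Step[i+1]) becomes one subject->predicate->object statement,
# so a chain of n steps yields exactly n-1 statements.

def decompose(chain_id: str, steps: list) -> list:
    statements = []
    for i in range(len(steps) - 1):
        statements.append({
            "statement_id": "%s-s%d" % (chain_id, i),
            "subject": steps[i],
            "predicate": "leads_to",  # placeholder; the model refines this verb
            "object": steps[i + 1],
        })
    return statements

stmts = decompose("c1", ["A", "B", "C"])
# A->B->C produces Statement(A->B) and Statement(B->C), per the prompt.
assert [(s["subject"], s["object"]) for s in stmts] == [("A", "B"), ("B", "C")]
```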

1. Term Collection. Scan all subject and object fields in the L2 statements. Collect every unique term.

2. Deduplication and Canonicalization. Merge lexical variations into a single canonical entry (e.g., "The defendant" and "Defendant party" → "Defendant"). Choose the most standard, professional name for the given domain.

3. Context-Aware Definition. Provide a concise definition (1–2 sentences) for each concept. The definition must fit the context of the source statements. For example, if the domain is Law, define "Bond" as a financial instrument, not a chemical connection. Use the predicate and source_quote fields in L2 as context clues.

4. Typing. Assign a category type relevant to the domain (e.g., Legal Entity, Chemical Element, Physiological Structure, Abstract Concept).

5. Traceability Annotation. For each concept, record:
   • parent_statement_ids: All statement_id values from input statements where this term appears as subject or object.
   • CID: All unique CID values from the corresponding so...
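Term collection and traceability annotation are mechanical once a canonicalization mapping exists. A minimal sketch, where `canonical_map` is a hand-written stand-in for the model's deduplication step and the statement field names follow the assumed L2 schema:

```python
from collections import defaultdict

# Collect concepts from L2 statements: scan subject/object fields,
# fold lexical variants into a canonical name via canonical_map, and
# record parent_statement_ids for traceability back to the statements.

def collect_concepts(statements: list, canonical_map: dict) -> dict:
    concepts = defaultdict(lambda: {"parent_statement_ids": []})
    for s in statements:
        for term in (s["subject"], s["object"]):
            name = canonical_map.get(term, term)
            concepts[name]["parent_statement_ids"].append(s["statement_id"])
    return dict(concepts)

stmts = [{"statement_id": "s0", "subject": "The defendant", "object": "Bond"},
         {"statement_id": "s1", "subject": "Defendant party", "object": "Court"}]
concepts = collect_concepts(stmts, {"The defendant": "Defendant",
                                    "Defendant party": "Defendant"})
assert concepts["Defendant"]["parent_statement_ids"] == ["s0", "s1"]
```

The traceability links are what later make failure-driven repair possible: a wrong answer on a benchmark item points back through its statement IDs to the concepts and source passages that generated it.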

Question Distribution
Generate approximately {SINGLE_CHOICE_RATIO}% single-choice and the rest multiple-choice questions. Total questions MUST reach or exceed {MAX_QUESTIONS}.

Single-Choice Question Construction
• Pick one L2 statement (Subject→Predicate→Object).
• Use the Subject as the question stem.
• Correct answer: the actual Object/Predicate combination.
• Distractors (3): extract from other unrelated L2 statements or generate plausible alternatives.
• Example: "What is the primary function of the diaphragm?" Options: A. ...

Multiple-Choice Question Construction
• Pick one L1 concept as the core topic.
• Find 2–4 related L2 statements involving this concept.
• Question stem: "Which of the following are correct about [the concept]? (Select all that apply)".
• Correct options: real L2 facts (2–3 correct answers).
• Distractors: slightly modified incorrect statements.
• Answer f...

Coverage Requirement
Cover at least 70% of the provided L2 statements. If {MAX_QUESTIONS} is large, generate multiple questions per statement using different angles.

Natural Language
Write questions in fluent, educational language. Avoid mechanical JSON-to-sentence conversion.

Explanation (CRITICAL)
Every question MUST include a detailed explanation in the metadata, explaining why the answer is correct and why distractors are wrong.
IMPORTANT - Explanation Quality Rules:
• Write explanations in natural, educational language using domain knowledge.
• DO NOT reference internal identifiers like "stmt-XXX" or "concept-XXX" in expla...
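The 70% coverage requirement is verifiable after generation if each question records the statement it was built from. A minimal sketch, assuming a `source_statement_id` field on generated questions (an assumed name, chosen to mirror the traceability annotations above):

```python
# Coverage check for the 70% requirement: coverage is the fraction of
# L2 statements referenced by at least one generated question.

def coverage(statement_ids: set, questions: list) -> float:
    covered = {q["source_statement_id"] for q in questions} & statement_ids
    return len(covered) / len(statement_ids) if statement_ids else 0.0

ids = {"s0", "s1", "s2", "s3"}
qs = [{"source_statement_id": "s0"}, {"source_statement_id": "s1"},
      {"source_statement_id": "s2"}]
# 3 of 4 statements covered -> 0.75, which meets the 0.70 threshold.
assert coverage(ids, qs) == 0.75
```

A batch falling below the threshold would be regenerated with the uncovered statement IDs fed back into the prompt.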

Coverage Requirement (CRITICAL)
You MUST generate questions from as many L2 statements as possible. MANDATORY: Cover at least 70–90% of the provided L2 statements. If {MAX_QUESTIONS} is large, generate 2–3 questions per statement using different question styles to ensure comprehensive coverage. IMPORTANT: The total number of questions MUST reach or exceed...

Atomic Focus
Each QA pair must strictly focus on ONE L2 statement (Subject→Predicate→Object).

Natural Language Refinement (Critical)
Do NOT simply reformat the JSON into a sentence.
• Bad Example: "Q: What does Diaphragm do? A: The Diaphragm contracts increasing volume." (Robotic)
• Good Example: "Q: What is the immediate mechanical effect of diaphragm contraction? A: When the diaphragm contracts, it flattens out, which di...

Contextualization
If the L2 statement uses a pronoun (e.g., "It increases pressure"), replace "It" with the specific noun in the Question. Ensure the Question provides enough context to be unambiguous.

Variety
Use different question styles to maximize coverage:
• Definition: "Define X in the context of..."
• Function: "What is the role of X?"
• Mechanistic: "How does X lead to Y?"
• True/False explanation: "Is it true that X causes Y? Explain why."
• Comparison: "What is the difference between X and Y?"
• Application: "In what scenario would X occur?"
Output ...

Statement Distribution
Generate approximately {TRUE_RATIO}% true statements and the rest false statements. Total questions MUST reach or exceed {MAX_QUESTIONS}.

True Statement Construction
• Base on actual L2 statements (Subject→Predicate→Object).
• Rephrase in natural language while maintaining factual accuracy.
• Example: "The diaphragm contracts to increase thoracic volume." (True)

False Statement Construction
• Modify real L2 statements to create plausible but incorrect statements.
• Change relationships, add misconceptions, or invert facts.
• Ensure false statements are educationally valuable.
• Example: "The diaphragm contracts to decrease thoracic volume." (False)

Coverage Requirement
Cover at least 70% of the provided L2 statements. Generate multiple variations per statement when needed.

Natural Language
Write statements in fluent, natural language. Avoid obvious giveaways of truth/falsity.

Explanation (CRITICAL)
Every statement MUST include a detailed explanation of why it is true or false.
IMPORTANT - Explanation Quality Rules:
• Explain using domain knowledge and scientific reasoning.
• For TRUE: Explain why the statement is correct.
• For FALSE: Explain what the correct fact is and why the statement is wrong.
• Use natural, educational l...
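Filling the {TRUE_RATIO} and {MAX_QUESTIONS} template variables reduces to a small arithmetic split when the prompt is instantiated. A minimal sketch; the function name is illustrative:

```python
# Compute how many true vs. false statements to request, given the
# total count and the true-statement ratio from the prompt template.

def split_counts(max_questions: int, true_ratio: float) -> tuple:
    n_true = round(max_questions * true_ratio)
    return n_true, max_questions - n_true

# e.g. 100 statements at a 60% true ratio -> 60 true, 40 false.
assert split_counts(100, 0.6) == (60, 40)
```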

Depth and Complexity
The question must test deep understanding of the reasoning process. Focus on causal relationships and synthesis across multiple steps. Avoid simple memorization.

Question Type (Priority Order)
• Priority 1 – Process Reasoning: Questions about "why" a step follows another or "how" a mechanism works.
• Priority 2 – Causal Analysis: Cause-and-effect relationships.
• Priority 3 – Critical Understanding: Significance, implications, or principles.
• Priority 4 – Application: Applying reasoning to a new but related scenario.

Question Length and Detail
Questions MUST be comprehensive (typically 40–100 words). Include specific context and use direct quotations from the steps or summary. CRITICAL: When referencing specific concepts, you MUST include the complete content, not vague references.
