
arxiv: 2605.25038 · v1 · pith:U43GRQVTnew · submitted 2026-05-24 · 💻 cs.CL · cs.LG · cs.SE

TRACE: A taxonomy-grounded synthetic dataset for teaching-program generation and session interpretation in Applied Behavior Analysis

Pith reviewed 2026-06-30 11:57 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · cs.SE
keywords synthetic dataset · Applied Behavior Analysis · instruction tuning · teaching program generation · behavioral interpretation · ABA taxonomy · clinical examples

The pith

TRACE supplies a 2,999-example synthetic dataset for teaching-program generation and behavioral interpretation in Applied Behavior Analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates TRACE because real ABA clinical data cannot be released due to privacy protections. It uses a deterministic generator based on a taxonomy from standard literature to produce examples for two main tasks. One task is generating teaching programs using Discrete Trial Training, Natural Environment Teaching, and Task Analysis. The other is interpreting multi-session behavioral data across twelve patterns and thirteen behaviors. The dataset includes full provenance for each example and is split into train, validation, test, and sanity sets for research use.

Core claim

We present TRACE, a 2,999-example synthetic instruction-tuning dataset for teaching-program generation across Discrete Trial Training, Natural Environment Teaching, and Task Analysis, and multi-session behavioral interpretation across twelve trajectory patterns and thirteen target behaviors. Every example is produced by a deterministic taxonomy-driven generator grounded in the canonical ABA literature, and carries complete sampling provenance.

What carries the argument

A deterministic taxonomy-driven generator that creates examples from a taxonomy extracted from canonical ABA literature, providing each with traceable provenance.
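To make the idea concrete, here is a minimal sketch of what a deterministic, taxonomy-driven generator with provenance could look like. It is not the authors' generator: the function name `generate_example`, the record layout, and the `skill_domain` and `prompt_level` axes are assumptions for illustration. Only the three teaching methods come from the paper.

```python
import random

# Toy taxonomy. Only the three teaching methods are named in the paper;
# the other axes and their values are assumptions for illustration.
TAXONOMY = {
    "method": [
        "Discrete Trial Training",
        "Natural Environment Teaching",
        "Task Analysis",
    ],
    "skill_domain": ["communication", "self-care", "play"],
    "prompt_level": ["full physical", "gestural", "independent"],
}


def generate_example(seed: int) -> dict:
    """Sample one taxonomy cell per axis from a seeded RNG and record it."""
    rng = random.Random(seed)  # same seed -> same draws -> same example
    cells = {axis: rng.choice(values) for axis, values in TAXONOMY.items()}
    instruction = (
        f"Write a {cells['method']} teaching program for a "
        f"{cells['skill_domain']} skill at the {cells['prompt_level']} prompt level."
    )
    # The provenance block stores exactly what produced the instruction.
    return {"instruction": instruction, "provenance": {"seed": seed, "cells": cells}}


example = generate_example(seed=7)
print(example["provenance"])
```

Because every draw comes from a generator seeded per example, rerunning with the same seed reproduces the same record, and the stored cells let a reader trace any example back to the taxonomy entries that produced it.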

If this is right

  • Models can be trained on the provided train split of 2,549 examples for the two ABA tasks.
  • The validation, test, and sanity splits allow for evaluation of model performance on the generated examples.
  • Full sampling provenance enables analysis of how taxonomy cells map to specific examples.
  • The CC BY-NC 4.0 license for data permits non-commercial research applications.
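The provenance point above can be illustrated with a short sketch that groups examples by the taxonomy cells recorded on them. The records and field names (`id`, `provenance`, `cells`, `task`, `method`) are hypothetical and do not reflect the released schema.

```python
from collections import defaultdict

# Hypothetical records; the field names are assumed, not the released schema.
examples = [
    {"id": "ex-001", "provenance": {"cells": {"task": "teaching_program", "method": "Task Analysis"}}},
    {"id": "ex-002", "provenance": {"cells": {"task": "teaching_program", "method": "Discrete Trial Training"}}},
    {"id": "ex-003", "provenance": {"cells": {"task": "teaching_program", "method": "Task Analysis"}}},
]


def index_by_cell(records):
    """Map each (axis, value) taxonomy cell to the ids of examples that used it."""
    index = defaultdict(list)
    for record in records:
        for axis, value in record["provenance"]["cells"].items():
            index[(axis, value)].append(record["id"])
    return dict(index)


cell_index = index_by_cell(examples)
for cell, ids in sorted(cell_index.items()):
    print(cell, ids)
```

An index like this makes it easy to spot taxonomy cells that are over- or under-represented in a split.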

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar taxonomy-driven synthetic data generation could help other medical or clinical fields facing data privacy barriers.
  • Evaluating TRACE-trained models on actual clinical cases would test how well the taxonomy captures real-world variability.
  • Extensions might include adding more ABA tasks or refining the taxonomy based on expert feedback.

Load-bearing premise

That examples produced by the taxonomy-driven generator are representative of real clinical ABA practice in a way that makes them useful for training models.

What would settle it

Evaluate a model fine-tuned on TRACE on a held-out set of real, de-identified ABA session records; significantly lower accuracy on real data than on the synthetic test split would challenge the dataset's utility.
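A comparison of this kind could be scored roughly as follows. This is a sketch under stated assumptions: the label names, the toy predictions, and the helper names `accuracy` and `transfer_gap` are all hypothetical, and no real or synthetic results are implied.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match their labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def transfer_gap(synthetic_accuracy, real_accuracy):
    """Positive values mean the model does worse on real data than on synthetic data."""
    return synthetic_accuracy - real_accuracy


# Toy, made-up labels for a trajectory-pattern task (names are assumptions).
synthetic_labels = ["increasing", "stable", "decreasing", "stable"]
synthetic_preds = ["increasing", "stable", "decreasing", "increasing"]
real_labels = ["stable", "increasing", "decreasing", "stable"]
real_preds = ["stable", "decreasing", "decreasing", "increasing"]

gap = transfer_gap(
    accuracy(synthetic_preds, synthetic_labels),
    accuracy(real_preds, real_labels),
)
print(f"synthetic-to-real accuracy gap: {gap:.2f}")
```

A large positive gap on a sufficiently sized real sample would be the kind of evidence that challenges the load-bearing premise.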

Original abstract

Applied Behavior Analysis (ABA) is a clinical discipline whose documentation (teaching programs and multi-session behavioral logs) is formulaic and high-volume, yet real session data is HIPAA-protected and bound by professional confidentiality rules, blocking the release of a training corpus. We present TRACE (Taxonomy-Referenced ABA Clinical Examples), a 2,999-example synthetic instruction-tuning dataset covering two ABA tasks: teaching-program generation across Discrete Trial Training, Natural Environment Teaching, and Task Analysis; and multi-session behavioral interpretation across twelve trajectory patterns and thirteen target behaviors. Every example is produced by a deterministic taxonomy-driven generator grounded in the canonical ABA literature, and every example carries complete sampling provenance: the exact taxonomy cells that produced it. The dataset is released under CC BY-NC 4.0 for data and MIT for code, with stratified train (2,549), validation (149), test (281), and sanity (20) splits. TRACE is a research artifact and has not been clinically validated.
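As a quick arithmetic check, the four split sizes quoted in the abstract can be summed and converted to shares of the whole. The variable names are illustrative; the counts are the ones stated above.

```python
# Split sizes as stated in the abstract.
SPLITS = {"train": 2549, "validation": 149, "test": 281, "sanity": 20}

total = sum(SPLITS.values())
assert total == 2999, "split sizes should add up to the stated dataset size"

# Share of each split as a percentage of the full dataset, to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in SPLITS.items()}

for name, count in SPLITS.items():
    print(f"{name:<10} {count:>5}  {shares[name]:>5}%")
```

The counts are consistent with the stated total of 2,999, with the train split making up roughly 85% of the dataset.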

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 0 minor

Summary. The paper presents TRACE, a 2,999-example synthetic instruction-tuning dataset for two ABA tasks: teaching-program generation across Discrete Trial Training, Natural Environment Teaching, and Task Analysis; and multi-session behavioral interpretation across twelve trajectory patterns and thirteen target behaviors. Every example is produced by a deterministic taxonomy-driven generator grounded in canonical ABA literature, each carrying complete sampling provenance. The release includes stratified train (2,549), validation (149), test (281), and sanity (20) splits under CC BY-NC 4.0 (data) and MIT (code) licenses, with an explicit statement that it is a research artifact that has not been clinically validated.

Significance. If the dataset supports downstream model training on these formulaic ABA documentation tasks, it would provide a valuable public resource in a domain where real clinical data cannot be released because of privacy constraints. The deterministic construction from an external taxonomy, the full provenance metadata for every example, and the explicit non-validation disclaimer are clear strengths: they promote transparency and reproducibility without introducing circularity or unstated assumptions about real-world fidelity. These features allow users to assess and extend the artifact directly.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their thorough and positive review, which accurately summarizes the TRACE dataset and its contributions. We appreciate the recommendation to accept and the recognition of the dataset's transparency, provenance, and non-validation disclaimer as strengths.

Circularity Check

0 steps flagged

No significant circularity identified


The paper's central contribution is the release of a 2,999-example synthetic dataset generated deterministically from a taxonomy extracted from external canonical ABA literature, with explicit provenance metadata and a disclaimer of no clinical validation. No equations, fitted parameters, predictions, or self-citations are load-bearing; the generation process is described as grounded in independent prior literature rather than any self-referential definition or reduction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the adequacy of an external ABA taxonomy to generate representative examples; no free parameters or invented entities are mentioned.

axioms (1)
  • domain assumption: A canonical ABA taxonomy from the literature is sufficient to generate representative examples for teaching-program generation and behavioral interpretation tasks.
    The deterministic generator is explicitly grounded in this taxonomy to produce all 2,999 examples.


