TRACE: A taxonomy-grounded synthetic dataset for teaching-program generation and session interpretation in Applied Behavior Analysis

Festus Kahunla

arxiv: 2605.25038 · v1 · pith:U43GRQVTnew · submitted 2026-05-24 · 💻 cs.CL · cs.LG· cs.SE

TRACE: A taxonomy-grounded synthetic dataset for teaching-program generation and session interpretation in Applied Behavior Analysis

Festus Kahunla This is my paper

Pith reviewed 2026-06-30 11:57 UTC · model grok-4.3

classification 💻 cs.CL cs.LGcs.SE

keywords synthetic datasetApplied Behavior Analysisinstruction tuningteaching program generationbehavioral interpretationABA taxonomyclinical examples

0 comments

The pith

TRACE supplies a 2,999-example synthetic dataset for teaching-program generation and behavioral interpretation in Applied Behavior Analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates TRACE because real ABA clinical data cannot be released due to privacy protections. It uses a deterministic generator based on a taxonomy from standard literature to produce examples for two main tasks. One task is generating teaching programs using Discrete Trial Training, Natural Environment Teaching, and Task Analysis. The other is interpreting multi-session behavioral data across twelve patterns and thirteen behaviors. The dataset includes full provenance for each example and is split into train, validation, test, and sanity sets for research use.

Core claim

We present TRACE, a 2,999-example synthetic instruction-tuning dataset for teaching-program generation across Discrete Trial Training, Natural Environment Teaching, and Task Analysis, and multi-session behavioral interpretation across twelve trajectory patterns and thirteen target behaviors. Every example is produced by a deterministic taxonomy-driven generator grounded in the canonical ABA literature, and carries complete sampling provenance.

What carries the argument

A deterministic taxonomy-driven generator that creates examples from a taxonomy extracted from canonical ABA literature, providing each with traceable provenance.

If this is right

Models can be trained on the provided train split of 2,549 examples for the two ABA tasks.
The validation, test, and sanity splits allow for evaluation of model performance on the generated examples.
Full sampling provenance enables analysis of how taxonomy cells map to specific examples.
The CC BY-NC 4.0 license for data permits non-commercial research applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar taxonomy-driven synthetic data generation could help other medical or clinical fields facing data privacy barriers.
Performance of models trained on TRACE on actual clinical cases would test how well the taxonomy captures real-world variability.
Extensions might include adding more ABA tasks or refining the taxonomy based on expert feedback.

Load-bearing premise

That examples produced by the taxonomy-driven generator are representative of real clinical ABA practice in a way that makes them useful for training models.

What would settle it

Evaluating a model fine-tuned on TRACE against performance on a held-out set of real, de-identified ABA session records; significantly lower accuracy on real data would challenge the dataset's utility.

read the original abstract

Applied Behavior Analysis (ABA) is a clinical discipline whose documentation, teaching programs and multi-session behavioral logs, is formulaic and high-volume, yet real session data is HIPAA-protected and bound by professional confidentiality rules, blocking the release of a training corpus. We present TRACE (Taxonomy-Referenced ABA Clinical Examples), a 2,999-example synthetic instruction-tuning dataset covering two ABA tasks: teaching-program generation across Discrete Trial Training, Natural Environment Teaching, and Task Analysis; and multi-session behavioral interpretation across twelve trajectory patterns and thirteen target behaviors. Every example is produced by a deterministic taxonomy-driven generator grounded in the canonical ABA literature, and every example carries complete sampling provenance, the exact taxonomy cells that produced it. The dataset is released under CC BY-NC 4.0 for data and MIT for code, with stratified train (2,549), validation (149), test (281), and sanity (20) splits. TRACE is a research artifact and has not been clinically validated.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TRACE is a clean release of a 2,999-example synthetic ABA dataset with full provenance tracking, but its usefulness for training still rests on an untested assumption about real-world match.

read the letter

The paper's core move is to publish TRACE, a synthetic instruction-tuning set for two ABA tasks: generating teaching programs (DTT, NET, task analysis) and interpreting multi-session trajectories. Every example comes from a deterministic generator tied to a taxonomy pulled from standard ABA literature, and each carries exact sampling metadata. They split it into train/val/test/sanity sets, license the data CC BY-NC and code MIT, and state plainly that it is not clinically validated.

What stands out is the provenance detail and the explicit disclaimer. In a field where HIPAA blocks real session logs, this gives researchers something they can actually download and use without privacy issues. The dual-task coverage and the way they ground everything in external taxonomy rather than fitted parameters keeps the construction transparent.

The soft spot is the lack of any quality or coverage checks in the description. No metrics on how well the generated examples reflect actual clinical variation, no small human review, and no downstream test showing that models trained on TRACE improve on real (or even held-out) ABA data. The utility claim is left open, which is honest but also means the dataset's value is still hypothetical.

This is for people building models for clinical documentation in restricted domains, or anyone who needs a small, fully traceable synthetic corpus for ABA-style tasks. It is narrow but the construction is straightforward enough that a referee could give useful feedback on the generator and taxonomy choices. I would send it to review rather than desk reject.

Referee Report

0 major / 0 minor

Summary. The paper presents TRACE, a 2,999-example synthetic instruction-tuning dataset for two ABA tasks: teaching-program generation across Discrete Trial Training, Natural Environment Teaching, and Task Analysis; and multi-session behavioral interpretation across twelve trajectory patterns and thirteen target behaviors. Every example is produced by a deterministic taxonomy-driven generator grounded in canonical ABA literature, each carrying complete sampling provenance. The release includes stratified train (2,549), validation (149), test (281), and sanity (20) splits under CC BY-NC 4.0 (data) and MIT (code) licenses, with an explicit statement that it is a research artifact that has not been clinically validated.

Significance. If the dataset supports downstream model training on these formulaic ABA documentation tasks, it would provide a valuable public resource where real clinical data cannot be released due to privacy constraints. The deterministic construction from an external taxonomy, full provenance metadata for every example, and explicit non-validation disclaimer are explicit strengths that promote transparency and reproducibility without circularity or unstated assumptions about real-world fidelity. These features allow users to assess and extend the artifact directly.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their thorough and positive review, which accurately summarizes the TRACE dataset and its contributions. We appreciate the recommendation to accept and the recognition of the dataset's transparency, provenance, and non-validation disclaimer as strengths.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central contribution is the release of a 2,999-example synthetic dataset generated deterministically from a taxonomy extracted from external canonical ABA literature, with explicit provenance metadata and a disclaimer of no clinical validation. No equations, fitted parameters, predictions, or self-citations are load-bearing; the generation process is described as grounded in independent prior literature rather than any self-referential definition or reduction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the adequacy of an external ABA taxonomy to generate representative examples; no free parameters or invented entities are mentioned.

axioms (1)

domain assumption A canonical ABA taxonomy from the literature is sufficient to generate representative examples for teaching-program generation and behavioral interpretation tasks.
The deterministic generator is explicitly grounded in this taxonomy to produce all 2,999 examples.

pith-pipeline@v0.9.1-grok · 5701 in / 1143 out tokens · 35095 ms · 2026-06-30T11:57:43.514460+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 23 canonical work pages · 7 internal anchors

[1]

Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Qui \ n onero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. HealthBench : Evaluating large language models towards improved human health. arXiv preprint, 2025. URL https://arxiv.org/abs/2505.08775

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Position statement on the use of restraint and seclusion

Association for Behavior Analysis International . Position statement on the use of restraint and seclusion. https://www.abainternational.org/about-us/policies-and-positions/restraint-and-seclusion,-2010.aspx, 2010

2010
[3]

Ethics code for behavior analysts

Behavior Analyst Certification Board . Ethics code for behavior analysts. https://www.bacb.com/ethics-information/ethics-codes/, 2020

2020
[4]

BACB certificant data

Behavior Analyst Certification Board . BACB certificant data. https://www.bacb.com/bacb-certificant-data/, 2026. Certification counts as of 2026-04-01; accessed 2026-05-23

2026
[5]

Bender and Batya Friedman

Emily M. Bender and Batya Friedman. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6: 0 587--604, 2018. URL https://aclanthology.org/Q18-1041/

2018
[6]

Carr and V

Edward G. Carr and V. Mark Durand. Reducing behavior problems through functional communication training. Journal of Applied Behavior Analysis, 18 0 (2): 0 111--126, 1985. doi:10.1901/jaba.1985.18-111

work page doi:10.1901/jaba.1985.18-111 1985
[7]

The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation

Davide Chicco and Giuseppe Jurman. The advantages of the Matthews correlation coefficient ( MCC ) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21: 0 6, 2020. doi:10.1186/s12864-019-6413-7

work page doi:10.1186/s12864-019-6413-7 2020
[8]

Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit

Jacob Cohen. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70 0 (4): 0 213--220, 1968

1968
[9]

Cooper, Timothy E

John O. Cooper, Timothy E. Heron, and William L. Heward. Applied Behavior Analysis. Pearson Education Limited, Harlow, England, 3rd, global edition edition, 2020. ISBN 978-1-292-32463-0

2020
[10]

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA : Efficient finetuning of quantized LLMs . In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://arxiv.org/abs/2305.14314

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

The rise of small language models in healthcare: A comprehensive survey

Muskan Garg, Shaina Raza, Shebuti Rayana, Xingyi Liu, and Sunghwan Sohn. The rise of small language models in healthcare: A comprehensive survey. arXiv preprint, 2025. URL https://arxiv.org/abs/2504.17119

work page arXiv 2025
[12]

Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daum \'e III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64 0 (12): 0 86--92, 2021. URL https://arxiv.org/abs/1803.09010

work page arXiv 2021
[13]

MedGemma Technical Report

Google Research and Google DeepMind . MedGemma technical report. Technical report, Google DeepMind, 2025. URL https://arxiv.org/abs/2507.05201

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

On Calibration of Modern Neural Networks

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the International Conference on Machine Learning (ICML), 2017. URL https://arxiv.org/abs/1706.04599

work page internal anchor Pith review Pith/arXiv arXiv 2017
[15]

Hanley, Brian A

Gregory P. Hanley, Brian A. Iwata, and Brandon E. McCord. Functional analysis of problem behavior: A review. Journal of Applied Behavior Analysis, 36 0 (2): 0 147--185, 2003. doi:10.1901/jaba.2003.36-147

work page doi:10.1901/jaba.2003.36-147 2003
[16]

Iwata, Michael F

Brian A. Iwata, Michael F. Dorsey, Keith J. Slifer, Kenneth E. Bauman, and Gina S. Richman. Toward a functional analysis of self-injury. Journal of Applied Behavior Analysis, 27 0 (2): 0 197--209, 1994. doi:10.1901/jaba.1994.27-197. Reprint of the 1982 article in Analysis and Intervention in Developmental Disabilities, 2(1), 3--20

work page doi:10.1901/jaba.1994.27-197 1994
[17]

Jennings and David J

Adrienne M. Jennings and David J. Cox. Starting the conversation around the ethical use of artificial intelligence in applied behavior analysis. Behavior Analysis in Practice, 17: 0 107--122, 2024. URL https://pmc.ncbi.nlm.nih.gov/articles/PMC10891004/

2024
[18]

Prometheus 2: An open source language model specialized in evaluating other language models,

Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2 : An open source language model specialized in evaluating other language models. arXiv preprint, 2024. URL https://arxiv.org/abs/2405.01535

work page arXiv 2024
[19]

Personalized- ABA : Personalized treatment plan generation for applied behavior analysis using natural language processing

Aman Kumar, Mareiko Au, Raj Semlawat, Malavica Sridhar, and Hitesh Gurnani. Personalized- ABA : Personalized treatment plan generation for applied behavior analysis using natural language processing. In Proceedings of the 1st Workshop on Natural Language Processing for Science (NLP4Science), pages 188--196. Association for Computational Linguistics, 2024....

2024
[20]

Ivar Lovaas

O. Ivar Lovaas. Behavioral treatment and normal educational and intellectual functioning in young autistic children. Journal of Consulting and Clinical Psychology, 55 0 (1): 0 3--9, 1987. doi:10.1037/0022-006X.55.1.3

work page doi:10.1037/0022-006x.55.1.3 1987
[21]

Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. SelfCheckGPT : Zero-resource black-box hallucination detection for generative large language models. arXiv preprint, 2023. URL https://arxiv.org/abs/2303.08896

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

How to design, create, and evaluate an instruction-tuning dataset for large language model training in health care: Tutorial from a clinical perspective

Wojciech Nazar, Grzegorz Nazar, Aleksandra Kami \'n ska, and Ludmila Danilowicz-Szymanowicz. How to design, create, and evaluate an instruction-tuning dataset for large language model training in health care: Tutorial from a clinical perspective. Journal of Medical Internet Research, 27: 0 e70481, 2025. doi:10.2196/70481. URL https://www.jmir.org/2025/1/e70481

work page doi:10.2196/70481 2025
[23]

Parsons, Jeannia H

Marsha B. Parsons, Jeannia H. Rollyson, and Dennis H. Reid. Evidence-based staff training: A guide for practitioners. Behavior Analysis in Practice, 5 0 (2): 0 2--11, 2012

2012
[24]

S. Peck, C. O'Brien, J. Bourret, and D. Agostinelli. ChatGPT versus clinician responses to questions in ABA : Preference, identification, and level of agreement. Journal of Applied Behavior Analysis, 58 0 (4): 0 731--743, 2025. doi:10.1002/jaba.70029. URL https://onlinelibrary.wiley.com/doi/10.1002/jaba.70029

work page doi:10.1002/jaba.70029 2025
[25]

A structured review of the validity of BLEU

Ehud Reiter. A structured review of the validity of BLEU . Computational Linguistics, 44 0 (3): 0 393--401, 2018. URL https://direct.mit.edu/coli/article/44/3/393/

2018
[26]

TN-Eval : Rubric and evaluation protocols for measuring the quality of behavioral therapy notes

Raj Sanjay Shah, Lei Xu, Qianchu Liu, Jon Burnsky, Drew Bertagnolli, and Chaitanya Shivade. TN-Eval : Rubric and evaluation protocols for measuring the quality of behavioral therapy notes. arXiv preprint, 2025. URL https://arxiv.org/abs/2503.20648

work page arXiv 2025
[27]

Singhal, T

Karan Singhal, Tao Tu, Juraj Gottweis, et al. Toward expert-level medical question answering with large language models. Nature Medicine, 31: 0 943--950, 2025. doi:10.1038/s41591-024-03423-7. URL https://www.nature.com/articles/s41591-024-03423-7

work page doi:10.1038/s41591-024-03423-7 2025
[28]

Discrete trial training in the treatment of autism

Tristram Smith. Discrete trial training in the treatment of autism. Focus on Autism and Other Developmental Disabilities, 16 0 (2): 0 86--92, 2001

2001
[29]

Implementation of generative AI for the assessment and treatment of autism spectrum disorders: A scoping review

Jun-Seok Sohn, Eojin Lee, Jae-Jin Kim, Hyang-Kyeong Oh, and Eunjoo Kim. Implementation of generative AI for the assessment and treatment of autism spectrum disorders: A scoping review. Frontiers in Psychiatry, 16: 0 1628216, 2025. doi:10.3389/fpsyt.2025.1628216. URL https://pmc.ncbi.nlm.nih.gov/articles/PMC12322814/

work page doi:10.3389/fpsyt.2025.1628216 2025
[30]

Stokes and Donald M

Trevor F. Stokes and Donald M. Baer. An implicit technology of generalization. Journal of Applied Behavior Analysis, 10 0 (2): 0 349--367, 1977. doi:10.1901/jaba.1977.10-349

work page doi:10.1901/jaba.1977.10-349 1977
[31]

Tiger, Gregory P

Jeffrey H. Tiger, Gregory P. Hanley, and Jennifer Bruzek. Functional communication training: A review and practical guide. Behavior Analysis in Practice, 1 0 (1): 0 16--23, 2008

2008
[32]

Touchette and Jane S

Paul E. Touchette and Jane S. Howard. Errorless learning: Reinforcement contingencies and stimulus control transfer in delayed prompting. Journal of Applied Behavior Analysis, 17 0 (2): 0 175--188, 1984

1984
[33]

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct : Aligning language models with self-generated instructions. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2023. URL https://arxiv.org/abs/2212.10560

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Menta: A small language model for on-device mental health prediction

Tianyi Zhang, Xiangyuan Xue, Lingyan Ruan, Shiya Fu, Feng Xia, Simon D'Alfonso, Vassilis Kostakos, Ting Dang, and Hong Jia. Menta: A small language model for on-device mental health prediction. arXiv preprint, 2025. URL https://arxiv.org/abs/2512.02716

work page arXiv 2025
[35]

AlpaCare : Instruction fine-tuned large language models for medical applications

Xinlu Zhang, Chenxin Tian, Xianjun Yang, Lichang Chen, Zekun Li, and Linda Ruth Petzold. AlpaCare : Instruction fine-tuned large language models for medical applications. arXiv preprint, 2023. URL https://arxiv.org/abs/2310.14558

work page arXiv 2023
[36]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and chatbot arena. In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://arxiv.org/abs/2306.05685

work page internal anchor Pith review Pith/arXiv arXiv 2023

[1] [1]

Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Qui \ n onero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. HealthBench : Evaluating large language models towards improved human health. arXiv preprint, 2025. URL https://arxiv.org/abs/2505.08775

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Position statement on the use of restraint and seclusion

Association for Behavior Analysis International . Position statement on the use of restraint and seclusion. https://www.abainternational.org/about-us/policies-and-positions/restraint-and-seclusion,-2010.aspx, 2010

2010

[3] [3]

Ethics code for behavior analysts

Behavior Analyst Certification Board . Ethics code for behavior analysts. https://www.bacb.com/ethics-information/ethics-codes/, 2020

2020

[4] [4]

BACB certificant data

Behavior Analyst Certification Board . BACB certificant data. https://www.bacb.com/bacb-certificant-data/, 2026. Certification counts as of 2026-04-01; accessed 2026-05-23

2026

[5] [5]

Bender and Batya Friedman

Emily M. Bender and Batya Friedman. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6: 0 587--604, 2018. URL https://aclanthology.org/Q18-1041/

2018

[6] [6]

Carr and V

Edward G. Carr and V. Mark Durand. Reducing behavior problems through functional communication training. Journal of Applied Behavior Analysis, 18 0 (2): 0 111--126, 1985. doi:10.1901/jaba.1985.18-111

work page doi:10.1901/jaba.1985.18-111 1985

[7] [7]

The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation

Davide Chicco and Giuseppe Jurman. The advantages of the Matthews correlation coefficient ( MCC ) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21: 0 6, 2020. doi:10.1186/s12864-019-6413-7

work page doi:10.1186/s12864-019-6413-7 2020

[8] [8]

Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit

Jacob Cohen. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70 0 (4): 0 213--220, 1968

1968

[9] [9]

Cooper, Timothy E

John O. Cooper, Timothy E. Heron, and William L. Heward. Applied Behavior Analysis. Pearson Education Limited, Harlow, England, 3rd, global edition edition, 2020. ISBN 978-1-292-32463-0

2020

[10] [10]

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA : Efficient finetuning of quantized LLMs . In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://arxiv.org/abs/2305.14314

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [11]

The rise of small language models in healthcare: A comprehensive survey

Muskan Garg, Shaina Raza, Shebuti Rayana, Xingyi Liu, and Sunghwan Sohn. The rise of small language models in healthcare: A comprehensive survey. arXiv preprint, 2025. URL https://arxiv.org/abs/2504.17119

work page arXiv 2025

[12] [12]

Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daum \'e III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64 0 (12): 0 86--92, 2021. URL https://arxiv.org/abs/1803.09010

work page arXiv 2021

[13] [13]

MedGemma Technical Report

Google Research and Google DeepMind . MedGemma technical report. Technical report, Google DeepMind, 2025. URL https://arxiv.org/abs/2507.05201

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

On Calibration of Modern Neural Networks

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the International Conference on Machine Learning (ICML), 2017. URL https://arxiv.org/abs/1706.04599

work page internal anchor Pith review Pith/arXiv arXiv 2017

[15] [15]

Hanley, Brian A

Gregory P. Hanley, Brian A. Iwata, and Brandon E. McCord. Functional analysis of problem behavior: A review. Journal of Applied Behavior Analysis, 36 0 (2): 0 147--185, 2003. doi:10.1901/jaba.2003.36-147

work page doi:10.1901/jaba.2003.36-147 2003

[16] [16]

Iwata, Michael F

Brian A. Iwata, Michael F. Dorsey, Keith J. Slifer, Kenneth E. Bauman, and Gina S. Richman. Toward a functional analysis of self-injury. Journal of Applied Behavior Analysis, 27 0 (2): 0 197--209, 1994. doi:10.1901/jaba.1994.27-197. Reprint of the 1982 article in Analysis and Intervention in Developmental Disabilities, 2(1), 3--20

work page doi:10.1901/jaba.1994.27-197 1994

[17] [17]

Jennings and David J

Adrienne M. Jennings and David J. Cox. Starting the conversation around the ethical use of artificial intelligence in applied behavior analysis. Behavior Analysis in Practice, 17: 0 107--122, 2024. URL https://pmc.ncbi.nlm.nih.gov/articles/PMC10891004/

2024

[18] [18]

Prometheus 2: An open source language model specialized in evaluating other language models,

Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2 : An open source language model specialized in evaluating other language models. arXiv preprint, 2024. URL https://arxiv.org/abs/2405.01535

work page arXiv 2024

[19] [19]

Personalized- ABA : Personalized treatment plan generation for applied behavior analysis using natural language processing

Aman Kumar, Mareiko Au, Raj Semlawat, Malavica Sridhar, and Hitesh Gurnani. Personalized- ABA : Personalized treatment plan generation for applied behavior analysis using natural language processing. In Proceedings of the 1st Workshop on Natural Language Processing for Science (NLP4Science), pages 188--196. Association for Computational Linguistics, 2024....

2024

[20] [20]

Ivar Lovaas

O. Ivar Lovaas. Behavioral treatment and normal educational and intellectual functioning in young autistic children. Journal of Consulting and Clinical Psychology, 55 0 (1): 0 3--9, 1987. doi:10.1037/0022-006X.55.1.3

work page doi:10.1037/0022-006x.55.1.3 1987

[21] [21]

Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. SelfCheckGPT : Zero-resource black-box hallucination detection for generative large language models. arXiv preprint, 2023. URL https://arxiv.org/abs/2303.08896

work page internal anchor Pith review Pith/arXiv arXiv 2023

[22] [22]

How to design, create, and evaluate an instruction-tuning dataset for large language model training in health care: Tutorial from a clinical perspective

Wojciech Nazar, Grzegorz Nazar, Aleksandra Kami \'n ska, and Ludmila Danilowicz-Szymanowicz. How to design, create, and evaluate an instruction-tuning dataset for large language model training in health care: Tutorial from a clinical perspective. Journal of Medical Internet Research, 27: 0 e70481, 2025. doi:10.2196/70481. URL https://www.jmir.org/2025/1/e70481

work page doi:10.2196/70481 2025

[23] [23]

Parsons, Jeannia H

Marsha B. Parsons, Jeannia H. Rollyson, and Dennis H. Reid. Evidence-based staff training: A guide for practitioners. Behavior Analysis in Practice, 5 0 (2): 0 2--11, 2012

2012

[24] [24]

S. Peck, C. O'Brien, J. Bourret, and D. Agostinelli. ChatGPT versus clinician responses to questions in ABA : Preference, identification, and level of agreement. Journal of Applied Behavior Analysis, 58 0 (4): 0 731--743, 2025. doi:10.1002/jaba.70029. URL https://onlinelibrary.wiley.com/doi/10.1002/jaba.70029

work page doi:10.1002/jaba.70029 2025

[25] [25]

A structured review of the validity of BLEU

Ehud Reiter. A structured review of the validity of BLEU . Computational Linguistics, 44 0 (3): 0 393--401, 2018. URL https://direct.mit.edu/coli/article/44/3/393/

2018

[26] [26]

TN-Eval : Rubric and evaluation protocols for measuring the quality of behavioral therapy notes

Raj Sanjay Shah, Lei Xu, Qianchu Liu, Jon Burnsky, Drew Bertagnolli, and Chaitanya Shivade. TN-Eval : Rubric and evaluation protocols for measuring the quality of behavioral therapy notes. arXiv preprint, 2025. URL https://arxiv.org/abs/2503.20648

work page arXiv 2025

[27] [27]

Singhal, T

Karan Singhal, Tao Tu, Juraj Gottweis, et al. Toward expert-level medical question answering with large language models. Nature Medicine, 31: 0 943--950, 2025. doi:10.1038/s41591-024-03423-7. URL https://www.nature.com/articles/s41591-024-03423-7

work page doi:10.1038/s41591-024-03423-7 2025

[28] [28]

Discrete trial training in the treatment of autism

Tristram Smith. Discrete trial training in the treatment of autism. Focus on Autism and Other Developmental Disabilities, 16 0 (2): 0 86--92, 2001

2001

[29] [29]

Implementation of generative AI for the assessment and treatment of autism spectrum disorders: A scoping review

Jun-Seok Sohn, Eojin Lee, Jae-Jin Kim, Hyang-Kyeong Oh, and Eunjoo Kim. Implementation of generative AI for the assessment and treatment of autism spectrum disorders: A scoping review. Frontiers in Psychiatry, 16: 0 1628216, 2025. doi:10.3389/fpsyt.2025.1628216. URL https://pmc.ncbi.nlm.nih.gov/articles/PMC12322814/

work page doi:10.3389/fpsyt.2025.1628216 2025

[30] [30]

Stokes and Donald M

Trevor F. Stokes and Donald M. Baer. An implicit technology of generalization. Journal of Applied Behavior Analysis, 10 0 (2): 0 349--367, 1977. doi:10.1901/jaba.1977.10-349

work page doi:10.1901/jaba.1977.10-349 1977

[31] [31]

Tiger, Gregory P

Jeffrey H. Tiger, Gregory P. Hanley, and Jennifer Bruzek. Functional communication training: A review and practical guide. Behavior Analysis in Practice, 1 0 (1): 0 16--23, 2008

2008

[32] [32]

Touchette and Jane S

Paul E. Touchette and Jane S. Howard. Errorless learning: Reinforcement contingencies and stimulus control transfer in delayed prompting. Journal of Applied Behavior Analysis, 17 0 (2): 0 175--188, 1984

1984

[33] [33]

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct : Aligning language models with self-generated instructions. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2023. URL https://arxiv.org/abs/2212.10560

work page internal anchor Pith review Pith/arXiv arXiv 2023

[34] [34]

Menta: A small language model for on-device mental health prediction

Tianyi Zhang, Xiangyuan Xue, Lingyan Ruan, Shiya Fu, Feng Xia, Simon D'Alfonso, Vassilis Kostakos, Ting Dang, and Hong Jia. Menta: A small language model for on-device mental health prediction. arXiv preprint, 2025. URL https://arxiv.org/abs/2512.02716

work page arXiv 2025

[35] [35]

AlpaCare : Instruction fine-tuned large language models for medical applications

Xinlu Zhang, Chenxin Tian, Xianjun Yang, Lichang Chen, Zekun Li, and Linda Ruth Petzold. AlpaCare : Instruction fine-tuned large language models for medical applications. arXiv preprint, 2023. URL https://arxiv.org/abs/2310.14558

work page arXiv 2023

[36] [36]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and chatbot arena. In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://arxiv.org/abs/2306.05685

work page internal anchor Pith review Pith/arXiv arXiv 2023