pith. machine review for the scientific record.

arxiv: 2605.11091 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI


ASD-Bench: A Four-Axis Comprehensive Benchmark of AI Models for Autism Spectrum Disorder


Pith reviewed 2026-05-13 07:18 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords autism spectrum disorder · machine learning benchmark · ASD screening · tabular classification · model calibration · age-specific performance · interpretability · adversarial robustness

The pith

A benchmark of AI models for autism screening shows near-perfect adult performance but lower results for adolescents, with accuracy often unrelated to calibration quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors create ASD-Bench to test a range of machine learning models on autism questionnaire data from children, adolescents, and adults. They evaluate each model on accuracy, how well its confidence matches actual correctness, feature importance for explanation, and resistance to small data changes. Results indicate adults are classified most reliably while adolescents are the most difficult group, and that some models reach high accuracy yet produce poorly calibrated probabilities. This approach matters because single-number scores can mislead when deciding which tools to use in practice for identifying autism spectrum disorder.

Core claim

The paper claims that applying a four-axis benchmark to 4,068 AQ-10 records across three age cohorts demonstrates four things: near-ceiling adult classification, with many models achieving perfect F1 and AUC; a lower ceiling for adolescents (F1 = 0.837); dominant features that shift by age group; and a clear dissociation between accuracy and calibration, exemplified by AdaBoost.

What carries the argument

The ASD-Bench framework, which assesses models on predictive performance, calibration, interpretability, and adversarial robustness, supported by the Heuristic Aggregate Penalty (HAP) metric, which weights false negatives more heavily and incorporates cross-validation variance.
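
The HAP formula itself is not reproduced on this page, so the sketch below is one plausible reading: a cost-sensitive mean penalty plus a cross-fold stability term. Only the weights (wFN = 10, wFP = 2, λ = 1.0) and the 5-fold stratified protocol come from the paper's Figure 10; the functional form and every name in the code are assumptions.

```python
import numpy as np

def heuristic_aggregate_penalty(fold_fn, fold_fp, fold_sizes,
                                w_fn=10.0, w_fp=2.0, lam=1.0):
    """Plausible reconstruction of HAP (lower = better).

    fold_fn, fold_fp, fold_sizes: per-fold false-negative counts,
    false-positive counts, and fold sizes from stratified k-fold CV.
    Weights follow the paper's Figure 10; the form itself is assumed.
    """
    fold_fn = np.asarray(fold_fn, dtype=float)
    fold_fp = np.asarray(fold_fp, dtype=float)
    fold_sizes = np.asarray(fold_sizes, dtype=float)
    # Cost-sensitive per-fold penalty: a false negative costs 5x a false positive.
    per_fold = (w_fn * fold_fn + w_fp * fold_fp) / fold_sizes
    # Mean penalty plus a variance term rewarding cross-fold stability.
    return per_fold.mean() + lam * per_fold.std()

# A model with zero weighted misclassifications in every fold scores
# HAP = 0.0, matching the twelve adult-cohort models in Figure 10.
print(heuristic_aggregate_penalty([0] * 5, [0] * 5, [160] * 5))  # 0.0
```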

If this is right

  • Deployment decisions for ASD screening should account for age-specific performance differences rather than aggregate scores.
  • Calibration metrics must be checked independently, because high accuracy does not ensure reliable probability estimates (see the sketch after this list).
  • Interpretability analysis reveals that different questionnaire items matter most for each age cohort.
  • Adversarial testing is required to confirm model stability before any real-world use.
  • Cohort-specific recommendations can guide selection of models for children, adolescents, or adults.
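
To make the calibration point concrete: expected calibration error (ECE) compares predicted probabilities with observed positive rates inside probability bins, and the Brier score is the mean squared error of the probabilities. A minimal sketch, assuming equal-width binning (the paper's binning choice is not given here); the toy data rank perfectly, so F1 = 1.0 at a 0.5 threshold, yet the probabilities are miscalibrated, the same dissociation reported for AdaBoost.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=15):
    """Binary ECE over equal-width probability bins (binning scheme assumed)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Assign each predicted probability to one of n_bins equal-width bins.
    idx = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            # |mean predicted prob - observed positive rate|, weighted by bin mass.
            ece += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
    return ece

# Perfect ranking (F1 = 1.0 at threshold 0.5) with miscalibrated probabilities.
y = np.array([0, 0, 1, 1])
p = np.array([0.40, 0.40, 0.60, 0.60])
print(expected_calibration_error(y, p))  # 0.40: large ECE despite perfect F1
print(np.mean((p - y) ** 2))             # Brier score = 0.16
```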

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • These age patterns might reflect developmental changes in how autism traits appear, suggesting screening questions could be adapted by age.
  • Extending similar multi-axis benchmarks to other diagnostic questionnaires could identify comparable limitations in single-metric evaluations.
  • If clinical validation confirms the calibration issues, hybrid human-AI systems may be needed to handle uncertain cases.
  • Future work could test whether adding more data sources beyond questionnaires improves robustness across all ages.

Load-bearing premise

That the AQ-10 questionnaire responses serve as dependable ground-truth labels for autism spectrum disorder, allowing conclusions about model suitability for clinical use.

What would settle it

Collecting model outputs on a new dataset where autism diagnoses come from independent clinical assessments instead of questionnaire scores, and checking whether the age performance gaps and calibration mismatches remain.

Figures

Figures reproduced from arXiv: 2605.11091 by Hassan Shaikh, Keshav Bulia, Kuldeep Raghuwanshi, Shubhankit Singh.

Figure 1: F1 score for all 17 models across three age cohorts (Adult, Child, Adolescent). Adults achieve perfect F1 for 10 of 17 models; child F1 peaks at 0.915, adolescent at 0.837. Ten of 17 models reach F1 = 1.000 and AUC = 1.000 on adults, confirming near-perfect separability of the adult AQ-10 feature space; XGBoost Baseline (F1 = 0.962) and TabNet Baseline (F1 = 0.940) are the only notable underperformers…
Figure 2: AUC-ROC for all 17 models across three age cohorts. Adults: 11 of 17 models at AUC = 1.000. TabPFN achieves the highest child AUC (0.963) and adolescent AUC (0.900).
Figure 3: Precision vs. recall scatter per cohort. Diagonal dashed line: equal precision/recall. TabNet Baseline achieves recall = 1.000 on adults at the cost of precision = 0.886. Children: F1 ranges 0.864–0.915; TabTransformer Tuned leads on F1 (0.915), TabPFN achieves the highest AUC (0.963); simpler models (AdaBoost Baseline, Logistic Regression) form the lower tier (F1 ≈ 0.867–0.870).
Figure 4: ECE vs. Brier score per cohort (bottom-left = ideal). Adult panel on log-log scale; child/adolescent panels on linear scale. AdaBoost is a clear outlier in both panels.
Figure 5: Consensus feature importance (averaged across 17 applicable models, normalised) per cohort. ⋆ = top-ranked feature per cohort. Note the distinct hierarchies: A9 dominates children, A5 leads adolescents, and adults show a flat multi-feature profile.
Figure 6: Feature-importance heatmap, cohort × AQ-10 feature. Bold values = cohort maximum. The three cohorts show distinct importance profiles.
Figure 7: Robustness scores ranked by cohort. Green: ≥ 0.88 (high); yellow: 0.82–0.88; red: < 0.82. Dashed line: score = 0.90.
Figure 8: Average accuracy drop under Gaussian noise injection. Negative values indicate noise-immune models (slight regularisation benefit). Transformer models degrade by about 24%; TabPFN shows a larger drop.
Figure 9: F1 score vs. robustness tradeoff per cohort. Dashed lines: F1 = 0.95 and robustness = 0.90. Ideal models appear top-right. The accuracy–robustness tradeoff is evident across all three cohorts.
Figure 10: HAP metric rankings per cohort (lower = better), computed via 5-fold stratified cross-validation with wFN = 10, wFP = 2, λ = 1.0. Colours indicate performance tiers: green (lowest HAP), yellow (middle), red (highest HAP). Twelve of 17 models achieve HAP = 0.000 on adults, meaning zero weighted misclassifications across all five cross-validation folds, consistent with the near-perfect F1 and AUC results…
Figure 11: Four-axis normalised scorecard for eight key models (A = Adult, C = Child, Ado = Adolescent). All axes normalised to [0, 1]; higher = better. Calibration plotted as 1 − ECE. The radar profiles highlight the recommended models per cohort.
Figure 12: Four-axis radar profiles for the three recommended models: TabTransformer Tuned (adult), XGBoost Tuned (child), TabPFN v2 (adolescent). Axes: F1, AUC, calibration (1 − ECE), robustness, feature clarity.
read the original abstract

Automated ASD screening tools remain limited by single-architecture evaluations, axis-restricted assessment, and near-exclusive focus on adult cohorts, obscuring age-specific diagnostic patterns critical for early intervention. We introduce ASD-Bench, a systematic tabular benchmark evaluating ML, deep learning, and foundation model configurations across three age cohorts (children 1-11 yr, adolescents 12-16 yr, adults 17-64 yr) on four axes: predictive performance, calibration, interpretability, and adversarial robustness. Applied to a curated v3 dataset of 4,068 AQ-10 records, our benchmark spans classical models (XGBoost, AdaBoost, Random Forest, Logistic Regression), neural networks (MLP), deep tabular transformers (TabNet, TabTransformer, FT-Transformer), and TabPFN v2. We introduce the Heuristic Aggregate Penalty (HAP): a cost-sensitive metric penalising false negatives more heavily and incorporating cross-validation variance for deployment stability. Adult classification yields high performance (10/17 models achieve perfect F1 and AUC), while adolescents present a harder task (F1 ceiling 0.837 vs. 0.915 for children). Feature hierarchies shift across cohorts: A9 (social motivation) dominates for children, A5 (pattern recognition) leads for adolescents, and adults exhibit a flatter importance profile consistent with developmental social masking. Accuracy and calibration are dissociated: AdaBoost achieves F1=1.000 on adults with ECE=0.302, confirming single-metric evaluation is insufficient for clinical AI. Cohort-specific deployment recommendations are provided. All findings should be interpreted as proof-of-concept evidence on questionnaire-derived labels rather than clinically validated diagnostic performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ASD-Bench, a systematic tabular benchmark evaluating 17 ML, neural, and foundation models (XGBoost, AdaBoost, TabNet, TabPFN v2, etc.) on ASD classification, using a curated dataset of 4,068 AQ-10 questionnaire records across three age cohorts (children 1-11, adolescents 12-16, adults 17-64). It assesses four axes (predictive performance, calibration, interpretability, and adversarial robustness) and introduces the Heuristic Aggregate Penalty (HAP) metric, which penalizes false negatives and incorporates CV variance. The manuscript reports perfect F1/AUC for 10/17 models on adults (F1 ceiling 0.837 on adolescents), notes shifting feature importances (A9 for children, A5 for adolescents) and an accuracy-calibration dissociation (AdaBoost F1=1.000 with ECE=0.302), and offers cohort-specific deployment recommendations while framing all results as proof-of-concept on questionnaire-derived labels.

Significance. If the label-construction issue is resolved, the benchmark would usefully demonstrate the value of multi-axis evaluation for clinical tabular tasks and the insufficiency of single-metric assessment. The age-cohort feature-importance shifts and explicit HAP formulation are concrete contributions that could inform more deployment-aware model selection in questionnaire-based screening.

major comments (2)
  1. [Abstract] Abstract and dataset section: the binary labels are a deterministic function of the sum of the same 10 AQ-10 items supplied as features (standard AQ-10 scoring threshold). Consequently the reported perfect F1=1.000 and AUC=1.000 for 10/17 models on adults, the cohort performance gap, and the feature-importance shifts are expected artifacts of recovering the fixed scoring rule rather than evidence of ASD detection capability. This circularity directly undermines the central claims of clinical utility and cohort-specific deployment recommendations.
  2. [§3] §3 (dataset and label construction): the manuscript must explicitly state the exact rule used to derive the binary ground-truth label from the AQ-10 responses and quantify how much of the reported performance is attributable to this deterministic mapping. Without that disclosure the HAP metric and calibration results cannot be interpreted as measures of model quality for the intended clinical task.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'v3 dataset' is undefined; clarify its provenance and any filtering steps applied to the 4,068 records.
  2. [Tables 2-4] Tables reporting F1/AUC/ECE: include per-fold standard deviations or confidence intervals so that 'perfect' scores can be assessed for stability rather than treated as point estimates.
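
The second minor point is cheap to implement. A minimal sketch of per-fold reporting with scikit-learn, where the data, model, and preprocessing are placeholders rather than the paper's pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for one AQ-10 cohort (10 features, binary target).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

scores = []
for train, test in StratifiedKFold(n_splits=5, shuffle=True,
                                   random_state=0).split(X, y):
    model = RandomForestClassifier(random_state=0).fit(X[train], y[train])
    scores.append(f1_score(y[test], model.predict(X[test])))

# Report the per-fold spread, not just a point estimate.
print(f"F1 = {np.mean(scores):.3f} +/- {np.std(scores):.3f} (5-fold)")
```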

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that the label construction must be made fully explicit and that the implications for interpreting the benchmark results need to be clarified to prevent overstatement of clinical applicability. We will revise the manuscript accordingly to address these points.

read point-by-point responses
  1. Referee: [Abstract] Abstract and dataset section: the binary labels are a deterministic function of the sum of the same 10 AQ-10 items supplied as features (standard AQ-10 scoring threshold). Consequently the reported perfect F1=1.000 and AUC=1.000 for 10/17 models on adults, the cohort performance gap, and the feature-importance shifts are expected artifacts of recovering the fixed scoring rule rather than evidence of ASD detection capability. This circularity directly undermines the central claims of clinical utility and cohort-specific deployment recommendations.

    Authors: We acknowledge this observation and agree that the task reduces to recovering the standard AQ-10 scoring rule (positive if sum of the 10 item scores meets or exceeds the established threshold) from the individual item responses provided as features. The perfect performance observed for several models on the adult cohort is indeed consistent with their capacity to approximate this deterministic function. However, the benchmark retains value in demonstrating architectural differences on other axes, such as calibration (where even perfect-accuracy models like AdaBoost show high ECE) and adversarial robustness, as well as cohort-specific variations in feature importance and task difficulty. We will revise the abstract to explicitly describe the label derivation process and to emphasize that all results are to be interpreted as proof-of-concept evaluations on questionnaire-derived labels, not as validated ASD diagnostic tools. This will also temper the deployment recommendations to reflect the benchmark nature of the study. revision: yes

  2. Referee: [§3] §3 (dataset and label construction): the manuscript must explicitly state the exact rule used to derive the binary ground-truth label from the AQ-10 responses and quantify how much of the reported performance is attributable to this deterministic mapping. Without that disclosure the HAP metric and calibration results cannot be interpreted as measures of model quality for the intended clinical task.

    Authors: We will revise §3 to include the precise label construction rule: the binary label is set to 1 (ASD positive screen) if the total AQ-10 score, computed as the sum of the 10 item responses (each binarized to 0 or 1), is greater than or equal to 6, following the standard AQ-10 protocol. To quantify the contribution of this mapping, we will add an analysis showing that a simple threshold-on-sum baseline achieves the same performance ceiling as the best models on adults, confirming that the high performance is largely due to rule recovery. For the adolescent cohort, where performance is lower, we will discuss potential factors such as data noise or developmental variability that prevent perfect recovery. This addition will allow readers to properly contextualize the HAP metric and calibration results as measures of how well models implement the screening rule rather than novel diagnostic capability. revision: yes
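
The threshold-on-sum analysis proposed above is easy to sketch. The code below uses random binary item responses in place of the real records (the 0/1 item scoring and the threshold of 6 follow the rebuttal's description of the AQ-10 rule; the sample size mirrors the paper's 4,068 but nothing else about the data is real). The labeling rule scores perfectly by construction, and a linear learner recovers it almost exactly, which is the referee's point about circularity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 10 binary AQ-10-style item responses per record.
X = rng.integers(0, 2, size=(4068, 10))
# Labels are a deterministic function of the features, as the referee notes.
y = (X.sum(axis=1) >= 6).astype(int)

# The trivial baseline is the labeling rule itself: perfect by construction.
print(f1_score(y, (X.sum(axis=1) >= 6).astype(int)))  # 1.0

# Any learner able to represent a linear threshold recovers the rule.
model = LogisticRegression(max_iter=1000).fit(X, y)
print(f1_score(y, model.predict(X)))                   # ~1.0
print(roc_auc_score(y, model.predict_proba(X)[:, 1]))  # ~1.0
```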

Circularity Check

0 steps flagged

No significant circularity; standard benchmark on external AQ-10 dataset with explicit caveats.

full rationale

The paper is an empirical benchmark evaluating 17 models across performance, calibration, interpretability, and robustness on a curated dataset of 4,068 AQ-10 records for three age cohorts. It introduces the HAP metric and reports cohort-specific results while explicitly stating all findings are proof-of-concept evidence on questionnaire-derived labels rather than clinically validated diagnoses. No derivation chain, equations, or central claims reduce by construction to self-defined quantities, fitted parameters renamed as predictions, or self-citation load-bearing steps. Evaluations rely on standard cross-validation applied to an external dataset; the setup is self-contained against external benchmarks with no load-bearing self-referential elements.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the assumption that AQ-10 responses serve as usable proxy labels for ASD and on standard supervised learning evaluation practices; no new physical entities or ad-hoc constants are introduced beyond the definition of HAP.

free parameters (1)
  • HAP penalty weights
    The Heuristic Aggregate Penalty incorporates tunable costs for false negatives and cross-validation variance; the values used (wFN = 10, wFP = 2, λ = 1.0) are reported in the figure captions but not in the abstract.
axioms (1)
  • domain assumption AQ-10 questionnaire responses provide sufficiently reliable labels for benchmarking ASD classification models
    Invoked when treating the curated dataset labels as ground truth for all performance, calibration, and robustness measurements.

pith-pipeline@v0.9.0 · 5618 in / 1338 out tokens · 55573 ms · 2026-05-13T07:18:30.133265+00:00 · methodology

