CoDaS: AI Co-Data-Scientist for Biomarker Discovery via Wearable Sensors
Pith reviewed 2026-05-10 10:58 UTC · model grok-4.3
The pith
A multi-agent AI system discovers 66 candidate digital biomarkers from wearable sensor data across three cohorts totaling 9,279 participant-observations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoDaS is a multi-agent system that treats biomarker discovery as an iterative workflow of hypothesis generation, statistical analysis, adversarial validation, literature-grounded reasoning, and human oversight. When applied to three wearable datasets totaling 9,279 participant-observations, the system surfaced 41 candidate features for mental health outcomes and 25 for metabolic outcomes. Notable outputs include replicated links between sleep variability and depression scores, a steps-to-resting-heart-rate fitness index, and recovery of the known AST/ALT ratio as a correlate of insulin resistance. Adding the derived features to demographic baselines produced modest cross-validated gains (ΔR² = 0.040 for depression, 0.021 for insulin resistance).
What carries the argument
CoDaS, the multi-agent AI co-data-scientist that structures biomarker discovery as an iterative loop of hypothesis generation, statistical analysis, adversarial validation, literature reasoning, and human oversight.
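The loop just described can be sketched in outline. The agent bodies below are hypothetical placeholders (the paper does not publish its implementation), so only the control flow — generate, analyze, adversarially check, human review — is meaningful:

```python
# Outline of a CoDaS-style discovery round. Every function body is a
# hypothetical stand-in; only the sequencing reflects the described workflow.

def generate_hypotheses(dataset):
    # Placeholder hypothesis-generation agent: propose feature/outcome pairs.
    return [("sleep_duration_std", "depression_score"),
            ("steps_per_resting_hr", "insulin_resistance")]

def analyze(hypothesis, dataset):
    # Placeholder statistical-analysis agent: return an effect-size record.
    return {"hypothesis": hypothesis, "rho": -0.3, "p": 1e-4}

def adversarial_check(result):
    # Placeholder adversarial validator: replication/stability/robustness gates.
    return result["p"] < 1e-3

def human_review(result):
    # Placeholder for the human-oversight step.
    return True

def discovery_round(dataset):
    accepted = []
    for hyp in generate_hypotheses(dataset):
        result = analyze(hyp, dataset)
        if adversarial_check(result) and human_review(result):
            accepted.append(result)
    return accepted

candidates = discovery_round(dataset=None)
print(len(candidates))  # both toy hypotheses survive this round
```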
If this is right
- Sleep duration and onset variability features correlate with depression scores and replicate across separate cohorts.
- A cardiovascular fitness index derived from step count and resting heart rate correlates negatively with insulin resistance markers.
- The known hepatic function ratio AST/ALT is recovered as a correlate of metabolic health.
- CoDaS-derived features produce modest but consistent increases in cross-validated predictive performance when added to demographic variables.
- Each candidate biomarker passes an internal battery of replication, stability, robustness, and discriminative-power checks.
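The correlational claims above can be reproduced in form, though not in value, on synthetic data. The feature definitions (per-participant sleep-duration variability, steps divided by resting heart rate) follow the abstract, but the data below is random with planted effects, so the coefficients are illustrative only:

```python
import numpy as np

def spearman(x, y):
    # Spearman rho as the Pearson correlation of ranks
    # (no tie handling needed for continuous synthetic data).
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(0)
n = 500

# Synthetic per-participant aggregates with planted effects (NOT the study data).
sleep_std = rng.gamma(2.0, 0.5, n)                    # sleep-duration variability
depression = 5 + 3 * sleep_std + rng.normal(0, 2, n)  # planted positive link

steps = rng.normal(8000, 2000, n)
resting_hr = rng.normal(65, 8, n)
fitness_index = steps / resting_hr                    # steps / resting HR feature
insulin_res = 3 - 0.01 * fitness_index + rng.normal(0, 0.5, n)

rho_sleep = spearman(sleep_std, depression)           # positive by construction
rho_fit = spearman(fitness_index, insulin_res)        # negative by construction
print(f"sleep variability vs depression: rho = {rho_sleep:+.2f}")
print(f"fitness index vs insulin resistance: rho = {rho_fit:+.2f}")
```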
Where Pith is reading between the lines
- The same workflow could be applied to other sensor-rich domains such as respiratory or neurological conditions without redesigning the core agents.
- Widespread use might shift research effort from manual feature engineering toward validating AI-generated hypotheses at scale.
- Deploying CoDaS on streaming data from consumer devices could enable earlier detection of health shifts in real time.
- The modest performance lifts suggest the method is best viewed as a hypothesis generator that still requires traditional clinical confirmation.
Load-bearing premise
The multi-agent process with adversarial validation and human oversight can produce novel, clinically relevant hypotheses without systematic bias or overfitting to the datasets examined.
What would settle it
A new independent wearable cohort in which none of the 66 candidate biomarkers show significant correlations with the target outcomes or yield predictive gains would indicate the system did not generalize.
read the original abstract
Scientific discovery in digital health requires converting continuous physiological signals from wearable devices into clinically actionable biomarkers. We introduce CoDaS (AI Co-Data-Scientist), a multi-agent system that structures biomarker discovery as an iterative process combining hypothesis generation, statistical analysis, adversarial validation, and literature-grounded reasoning with human oversight using large-scale wearable datasets. Across three cohorts totaling 9,279 participant-observations, CoDaS identified 41 candidate digital biomarkers for mental health and 25 for metabolic outcomes, each subjected to an internal validation battery spanning replication, stability, robustness, and discriminative power. Across two independent depression cohorts, CoDaS surfaced circadian instability-related features in both datasets, reflected in sleep duration variability (DWB, ρ = 0.252, p < 0.001) and sleep onset variability (GLOBEM, ρ = 0.126, p < 0.001). In a metabolic cohort, CoDaS derived a cardiovascular fitness index (steps/resting heart rate; ρ = -0.374, p < 0.001), and recovered established clinical associations, including the hepatic function ratio (AST/ALT; ρ = -0.375, p < 0.001), a known correlate of insulin resistance. Incorporating CoDaS-derived features alongside demographic variables led to modest but consistent improvements in predictive performance, with cross-validated ΔR² increases of 0.040 for depression and 0.021 for insulin resistance. These findings suggest that CoDaS enables systematic and traceable hypothesis generation and prioritization for biomarker discovery from large-scale wearable data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CoDaS, a multi-agent AI system for biomarker discovery from wearable sensors. It structures the process with hypothesis generation, statistical analysis, adversarial validation, and human oversight. Applied to three cohorts with 9,279 observations, it identifies 41 mental health and 25 metabolic candidate biomarkers. Key findings include replicated circadian instability features in two depression cohorts (e.g., sleep variability correlations), recovery of the AST/ALT ratio as a metabolic correlate, and modest predictive gains (ΔR² of 0.040 for depression and 0.021 for insulin resistance) when adding CoDaS features to demographics.
Significance. If the methodological transparency supports the claims, this work could advance AI-assisted scientific discovery in digital health by offering a structured, traceable framework for hypothesis generation from large wearable datasets. The cross-cohort replication of circadian instability features and recovery of the established AST/ALT-insulin resistance association are notable strengths that provide direct empirical support for the system's ability to surface non-spurious candidates.
major comments (3)
- [Methods (CoDaS system)] Methods section describing the CoDaS architecture: The multi-agent system with adversarial validation is load-bearing for the claim of unbiased, clinically meaningful hypothesis prioritization, yet the manuscript provides insufficient detail on agent roles, the specific mechanisms of adversarial challenge, and integration with human oversight, preventing assessment of whether systematic bias or dataset-specific overfitting is mitigated.
- [Results (biomarker identification)] Results section on candidate biomarker identification: The identification of 41 mental health and 25 metabolic biomarkers relies on an internal validation battery, but the text does not specify correction for multiple comparisons or the exact criteria/thresholds used to select and prioritize candidates from the large feature space, which is critical given the risk of spurious correlations in high-dimensional wearable data.
- [Results (predictive modeling)] Results section on predictive performance: The reported cross-validated ΔR² gains (0.040 for depression, 0.021 for insulin resistance) are central to demonstrating utility, but without details on whether feature selection occurred inside or outside the CV loop, the baseline model specification, or handling of cohort-specific tuning, it is unclear if these increments reflect generalizable improvements or optimistic bias.
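The multiple-comparison concern raised above is mechanical to address. The manuscript does not state which procedure, if any, it applies; as one standard instance, a Benjamini-Hochberg FDR step over a vector of candidate p-values looks like:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest k with p_(k) <= (k/m) * alpha, then reject all
    # hypotheses with the k smallest p-values.
    thresh = (np.arange(1, m + 1) / m) * alpha
    below = ranked <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.2, 0.6]
print(benjamini_hochberg(pvals))  # only the first two survive FDR control
```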
minor comments (3)
- [Abstract and Results] Clarify the correlation coefficient type (Pearson vs. Spearman) for all reported ρ values, including the sleep variability and fitness index examples.
- [Methods] Provide explicit cohort demographics, inclusion/exclusion criteria, and data exclusion details to support reproducibility and evaluation of generalizability across the 9,279 observations.
- [Results] Define and tabulate the components of the internal validation battery (replication, stability, robustness, discriminative power) with per-biomarker results for transparency.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which highlights important areas for improving the transparency of our work. We address each major comment below and have revised the manuscript to incorporate the suggested clarifications.
read point-by-point responses
-
Referee: [Methods (CoDaS system)] Methods section describing the CoDaS architecture: The multi-agent system with adversarial validation is load-bearing for the claim of unbiased, clinically meaningful hypothesis prioritization, yet the manuscript provides insufficient detail on agent roles, the specific mechanisms of adversarial challenge, and integration with human oversight, preventing assessment of whether systematic bias or dataset-specific overfitting is mitigated.
Authors: We agree that additional detail on the CoDaS architecture is required to allow full evaluation of the claims regarding unbiased prioritization. In the revised manuscript, we have expanded the Methods section with explicit descriptions of each agent's responsibilities, the concrete mechanisms used for adversarial challenge (such as counter-hypothesis generation and robustness testing protocols), and the structured integration of human oversight including review stages and decision criteria. These revisions should enable readers to assess potential biases and overfitting risks. revision: yes
-
Referee: [Results (biomarker identification)] Results section on candidate biomarker identification: The identification of 41 mental health and 25 metabolic biomarkers relies on an internal validation battery, but the text does not specify correction for multiple comparisons or the exact criteria/thresholds used to select and prioritize candidates from the large feature space, which is critical given the risk of spurious correlations in high-dimensional wearable data.
Authors: The referee is correct that the original text lacks explicit reporting of multiple-comparison correction and selection criteria. We have revised the Results section to include these details, specifying the multiple testing correction applied and the exact thresholds and prioritization rules derived from the validation battery components used to select the candidate biomarkers. revision: yes
-
Referee: [Results (predictive modeling)] Results section on predictive performance: The reported cross-validated ΔR² gains (0.040 for depression, 0.021 for insulin resistance) are central to demonstrating utility, but without details on whether feature selection occurred inside or outside the CV loop, the baseline model specification, or handling of cohort-specific tuning, it is unclear if these increments reflect generalizable improvements or optimistic bias.
Authors: We appreciate the referee's emphasis on this methodological clarity. In the revised manuscript, we have added explicit information in the Methods and Results sections stating that feature selection was performed inside the cross-validation loop via nested CV, describing the baseline demographic-only model, and explaining the consistent cross-validation and tuning procedures applied across cohorts. These additions confirm the reported performance gains are not attributable to optimistic bias. revision: yes
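Keeping feature selection inside the CV loop, as the response describes, amounts to embedding the selector in the modeling pipeline so each training fold re-selects features and the held-out fold never influences the selection. The sketch below uses scikit-learn on synthetic data and is illustrative, not the authors' code:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

rng = np.random.default_rng(1)
n, d = 400, 50
X = rng.normal(size=(n, d))        # candidate derived features (mostly noise)
demo = rng.normal(size=(n, 2))     # demographic baseline covariates
y = 0.8 * X[:, 0] + 0.5 * demo[:, 0] + rng.normal(0, 1.0, n)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Baseline: demographics only.
r2_base = cross_val_score(LinearRegression(), demo, y, cv=cv, scoring="r2").mean()

# Full model: SelectKBest runs inside each training fold via the pipeline,
# so the test fold cannot leak into feature selection.
pipe = Pipeline([
    ("select", SelectKBest(f_regression, k=5)),
    ("model", LinearRegression()),
])
r2_full = cross_val_score(pipe, np.hstack([demo, X]), y, cv=cv, scoring="r2").mean()

print(f"delta R^2 = {r2_full - r2_base:.3f}")
```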
Circularity Check
No significant circularity detected
full rationale
The paper describes an empirical workflow: a multi-agent AI system (CoDaS) is applied to three independent wearable datasets (total N=9,279) to surface candidate biomarkers, followed by statistical tests, replication checks, and cross-validated predictive increments. These outputs are data-driven observations (e.g., reported ρ values, ΔR² gains) rather than quantities derived from the system's own equations or parameters by construction. No self-definitional loops, fitted-input-as-prediction, or load-bearing self-citation chains appear in the reported chain; the internal validation battery and cross-cohort consistency supply external falsifiability. The central claims therefore remain self-contained against the input data.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: wearable sensor data from consumer devices contains extractable features that can serve as clinically actionable biomarkers for mental health and metabolic outcomes.
invented entities (1)
- CoDaS multi-agent system (no independent evidence)