pith. machine review for the scientific record.

arxiv: 2604.14615 · v1 · submitted 2026-04-16 · 💻 cs.AI

Recognition: unknown

CoDaS: AI Co-Data-Scientist for Biomarker Discovery via Wearable Sensors

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 10:58 UTC · model grok-4.3

classification 💻 cs.AI
keywords AI multi-agent system · digital biomarkers · wearable sensors · mental health · metabolic outcomes · biomarker discovery · physiological signals · depression prediction

The pith

A multi-agent AI system discovers 66 candidate digital biomarkers from wearable sensor data across large cohorts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CoDaS as a structured way to turn continuous signals from wearables into potential clinical biomarkers for mental health and metabolic conditions. It organizes discovery into repeated cycles of idea generation, statistical testing, adversarial checks, literature alignment, and human review. Tests on three cohorts with 9,279 observations produced 41 mental-health candidates and 25 metabolic ones, with some features, such as sleep timing variability, appearing in separate depression datasets. These features, when added to basic demographics, raised cross-validated variance explained by small but steady margins for depression and insulin resistance. The approach aims to make biomarker search more repeatable and less dependent on single researchers sifting through data manually.

Core claim

CoDaS is a multi-agent system that treats biomarker discovery as an iterative workflow of hypothesis generation, statistical analysis, adversarial validation, literature-grounded reasoning, and human oversight. When applied to three wearable datasets totaling 9,279 participant-observations, the system surfaced 41 candidate features for mental health outcomes and 25 for metabolic outcomes. Notable outputs include replicated links between sleep variability and depression scores, a steps-to-resting-heart-rate fitness index, and recovery of the known AST/ALT ratio as a correlate of insulin resistance. Adding the derived features to demographic baselines produced cross-validated ΔR² gains of 0.040 for depression and 0.021 for insulin resistance.
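The derived features named here are simple arithmetic over daily aggregates, and the associations are reported as Spearman correlations. A minimal sketch of computing the fitness index and testing its association, on synthetic data (all variable names, distributions, and effect sizes below are invented for illustration, not the paper's):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for daily wearable aggregates (illustrative only).
resting_hr = rng.normal(65, 8, n)
daily_steps = rng.normal(8000, 2500, n).clip(min=500)

# The derived cardiovascular fitness index: steps / resting heart rate.
fitness_index = daily_steps / resting_hr

# A synthetic insulin-resistance proxy, constructed here to be negatively
# related to fitness (mimicking the sign of the reported rho = -0.374).
homa_ir = 3.0 - 0.0002 * daily_steps + 0.02 * resting_hr + rng.normal(0, 0.5, n)

rho, p = spearmanr(fitness_index, homa_ir)
print(f"Spearman rho = {rho:.3f}, p = {p:.2e}")
```

Because the proxy is built to depend on both inputs, the sketch recovers a negative correlation by construction; the paper's claim is that CoDaS surfaces such ratios from real cohorts without them being planted.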

What carries the argument

CoDaS, the multi-agent AI co-data-scientist that structures biomarker discovery as an iterative loop of hypothesis generation, statistical analysis, adversarial validation, literature reasoning, and human oversight.

If this is right

  • Sleep duration and onset variability features correlate with depression scores and replicate across separate cohorts.
  • A cardiovascular fitness index derived from step count and resting heart rate correlates negatively with insulin resistance markers.
  • The known hepatic function ratio AST/ALT is recovered as a correlate of metabolic health.
  • CoDaS-derived features produce modest but consistent increases in cross-validated predictive performance when added to demographic variables.
  • Each candidate biomarker passes an internal battery of replication, stability, robustness, and discriminative-power checks.
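The internal battery named in the last bullet is not fully specified in this summary. One common form of a stability check is a bootstrap test of whether a candidate's correlation keeps its sign and nominal significance under resampling; a self-contained sketch on synthetic data (feature names and effect sizes are invented for illustration):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n = 400

# Synthetic candidate biomarker and outcome with a modest true association
# (names and effect sizes are illustrative, not taken from the paper).
sleep_var = rng.normal(0, 1, n)
phq9 = 0.25 * sleep_var + rng.normal(0, 1, n)

def bootstrap_stability(x, y, n_boot=1000):
    """Fraction of bootstrap resamples in which the Spearman correlation
    keeps its original sign and stays nominally significant (p < 0.05)."""
    rho0, _ = spearmanr(x, y)
    sign = np.sign(rho0)
    stable = 0
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), len(x))
        rho, p = spearmanr(x[idx], y[idx])
        if np.sign(rho) == sign and p < 0.05:
            stable += 1
    return stable / n_boot

stability = bootstrap_stability(sleep_var, phq9)
print(f"bootstrap stability = {stability:.2f}")
```

A candidate passing such a check at, say, stability ≥ 0.9 would be consistent with the "stability" criterion above, though the paper's actual checks and thresholds may differ.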

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same workflow could be applied to other sensor-rich domains such as respiratory or neurological conditions without redesigning the core agents.
  • Widespread use might shift research effort from manual feature engineering toward validating AI-generated hypotheses at scale.
  • Deploying CoDaS on streaming data from consumer devices could enable earlier detection of health shifts in real time.
  • The modest performance lifts suggest the method is best viewed as a hypothesis generator that still requires traditional clinical confirmation.

Load-bearing premise

The multi-agent process with adversarial validation and human oversight can produce novel, clinically relevant hypotheses without systematic bias or overfitting to the datasets examined.

What would settle it

A new independent wearable cohort in which none of the 66 candidate biomarkers show significant correlations with the target outcomes or yield predictive gains would indicate the system did not generalize.

read the original abstract

Scientific discovery in digital health requires converting continuous physiological signals from wearable devices into clinically actionable biomarkers. We introduce CoDaS (AI Co-Data-Scientist), a multi-agent system that structures biomarker discovery as an iterative process combining hypothesis generation, statistical analysis, adversarial validation, and literature-grounded reasoning with human oversight using large-scale wearable datasets. Across three cohorts totaling 9,279 participant-observations, CoDaS identified 41 candidate digital biomarkers for mental health and 25 for metabolic outcomes, each subjected to an internal validation battery spanning replication, stability, robustness, and discriminative power. Across two independent depression cohorts, CoDaS surfaced circadian instability-related features in both datasets, reflected in sleep duration variability (DWB, ρ = 0.252, p < 0.001) and sleep onset variability (GLOBEM, ρ = 0.126, p < 0.001). In a metabolic cohort, CoDaS derived a cardiovascular fitness index (steps/resting heart rate; ρ = -0.374, p < 0.001), and recovered established clinical associations, including the hepatic function ratio (AST/ALT; ρ = -0.375, p < 0.001), a known correlate of insulin resistance. Incorporating CoDaS-derived features alongside demographic variables led to modest but consistent improvements in predictive performance, with cross-validated ΔR² increases of 0.040 for depression and 0.021 for insulin resistance. These findings suggest that CoDaS enables systematic and traceable hypothesis generation and prioritization for biomarker discovery from large-scale wearable data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces CoDaS, a multi-agent AI system for biomarker discovery from wearable sensors. It structures the process with hypothesis generation, statistical analysis, adversarial validation, and human oversight. Applied to three cohorts with 9,279 observations, it identifies 41 mental health and 25 metabolic candidate biomarkers. Key findings include replicated circadian instability features in two depression cohorts (e.g., sleep variability correlations), recovery of the AST/ALT ratio as a metabolic correlate, and modest predictive gains (ΔR² of 0.040 for depression and 0.021 for insulin resistance) when adding CoDaS features to demographics.

Significance. If the methodological transparency supports the claims, this work could advance AI-assisted scientific discovery in digital health by offering a structured, traceable framework for hypothesis generation from large wearable datasets. The cross-cohort replication of circadian instability features and recovery of the established AST/ALT-insulin resistance association are notable strengths that provide direct empirical support for the system's ability to surface non-spurious candidates.

major comments (3)
  1. [Methods (CoDaS system)] Methods section describing the CoDaS architecture: The multi-agent system with adversarial validation is load-bearing for the claim of unbiased, clinically meaningful hypothesis prioritization, yet the manuscript provides insufficient detail on agent roles, the specific mechanisms of adversarial challenge, and integration with human oversight, preventing assessment of whether systematic bias or dataset-specific overfitting is mitigated.
  2. [Results (biomarker identification)] Results section on candidate biomarker identification: The identification of 41 mental health and 25 metabolic biomarkers relies on an internal validation battery, but the text does not specify correction for multiple comparisons or the exact criteria/thresholds used to select and prioritize candidates from the large feature space, which is critical given the risk of spurious correlations in high-dimensional wearable data.
  3. [Results (predictive modeling)] Results section on predictive performance: The reported cross-validated ΔR² gains (0.040 for depression, 0.021 for insulin resistance) are central to demonstrating utility, but without details on whether feature selection occurred inside or outside the CV loop, the baseline model specification, or handling of cohort-specific tuning, it is unclear if these increments reflect generalizable improvements or optimistic bias.
minor comments (3)
  1. [Abstract and Results] Clarify the correlation coefficient type (Pearson vs. Spearman) for all reported ρ values, including the sleep variability and fitness index examples.
  2. [Methods] Provide explicit cohort demographics, inclusion/exclusion criteria, and data exclusion details to support reproducibility and evaluation of generalizability across the 9,279 observations.
  3. [Results] Define and tabulate the components of the internal validation battery (replication, stability, robustness, discriminative power) with per-biomarker results for transparency.
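The concern about multiple comparisons (major comment 2) matters because CoDaS screens a large feature space. The paper's exact correction is not specified in this summary; the Benjamini-Hochberg procedure is one standard choice, sketched here self-contained (the example p-values are invented):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of discoveries under the Benjamini-Hochberg procedure,
    controlling the false discovery rate at `alpha`."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    # The i-th smallest p-value is compared against alpha * i / m.
    thresholds = alpha * (np.arange(1, m + 1) / m)
    below = p[order] <= thresholds
    passed = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest rank meeting its threshold
        passed[order[: k + 1]] = True      # reject all hypotheses up to that rank
    return passed

# Illustrative p-values from a hypothetical screen of candidate features.
pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
passed = benjamini_hochberg(pvals)
print(passed)
```

Under this procedure only the two smallest p-values survive at alpha = 0.05, illustrating how a correction prunes a screen that uncorrected thresholds would let through.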

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important areas for improving the transparency of our work. We address each major comment below and have revised the manuscript to incorporate the suggested clarifications.

read point-by-point responses
  1. Referee: [Methods (CoDaS system)] Methods section describing the CoDaS architecture: The multi-agent system with adversarial validation is load-bearing for the claim of unbiased, clinically meaningful hypothesis prioritization, yet the manuscript provides insufficient detail on agent roles, the specific mechanisms of adversarial challenge, and integration with human oversight, preventing assessment of whether systematic bias or dataset-specific overfitting is mitigated.

    Authors: We agree that additional detail on the CoDaS architecture is required to allow full evaluation of the claims regarding unbiased prioritization. In the revised manuscript, we have expanded the Methods section with explicit descriptions of each agent's responsibilities, the concrete mechanisms used for adversarial challenge (such as counter-hypothesis generation and robustness testing protocols), and the structured integration of human oversight including review stages and decision criteria. These revisions should enable readers to assess potential biases and overfitting risks. revision: yes

  2. Referee: [Results (biomarker identification)] Results section on candidate biomarker identification: The identification of 41 mental health and 25 metabolic biomarkers relies on an internal validation battery, but the text does not specify correction for multiple comparisons or the exact criteria/thresholds used to select and prioritize candidates from the large feature space, which is critical given the risk of spurious correlations in high-dimensional wearable data.

    Authors: The referee is correct that the original text lacks explicit reporting of multiple-comparison correction and selection criteria. We have revised the Results section to include these details, specifying the multiple testing correction applied and the exact thresholds and prioritization rules derived from the validation battery components used to select the candidate biomarkers. revision: yes

  3. Referee: [Results (predictive modeling)] Results section on predictive performance: The reported cross-validated ΔR² gains (0.040 for depression, 0.021 for insulin resistance) are central to demonstrating utility, but without details on whether feature selection occurred inside or outside the CV loop, the baseline model specification, or handling of cohort-specific tuning, it is unclear if these increments reflect generalizable improvements or optimistic bias.

    Authors: We appreciate the referee's emphasis on this methodological clarity. In the revised manuscript, we have added explicit information in the Methods and Results sections stating that feature selection was performed inside the cross-validation loop via nested CV, describing the baseline demographic-only model, and explaining the consistent cross-validation and tuning procedures applied across cohorts. These additions confirm the reported performance gains are not attributable to optimistic bias. revision: yes
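The nested-CV point in the response above can be made concrete: wrapping feature selection in a pipeline guarantees it is refit on each training fold, so held-out folds never influence which features are kept. A sketch with synthetic data (dimensions, effect sizes, and k are invented for illustration; here selection runs over the full augmented matrix, demographics included):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
n = 600

# Synthetic demographics and a pool of candidate wearable features;
# only two features carry real signal (all names are illustrative).
demo = rng.normal(size=(n, 3))
feats = rng.normal(size=(n, 40))
y = demo @ [0.3, -0.2, 0.1] + 0.4 * feats[:, 0] + 0.3 * feats[:, 1] + rng.normal(0, 1, n)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Baseline: demographics only.
r2_base = cross_val_score(LinearRegression(), demo, y, cv=cv, scoring="r2").mean()

# Augmented: SelectKBest runs INSIDE each training fold via the pipeline,
# so the held-out fold cannot leak into feature selection.
augmented = np.hstack([demo, feats])
model = make_pipeline(SelectKBest(f_regression, k=8), LinearRegression())
r2_aug = cross_val_score(model, augmented, y, cv=cv, scoring="r2").mean()

print(f"delta R^2 = {r2_aug - r2_base:.3f}")
```

Selecting features on the full dataset before splitting would inflate the increment; the pipeline form is what makes a reported ΔR² interpretable as an out-of-fold gain.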

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical workflow: a multi-agent AI system (CoDaS) is applied to three independent wearable datasets (total N=9,279) to surface candidate biomarkers, followed by statistical tests, replication checks, and cross-validated predictive increments. These outputs are data-driven observations (e.g., reported ρ values, ΔR² gains) rather than quantities derived from the system's own equations or parameters by construction. No self-definitional loops, fitted-input-as-prediction, or load-bearing self-citation chains appear in the reported chain; the internal validation battery and cross-cohort consistency supply external falsifiability. The central claims therefore remain self-contained against the input data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the domain assumption that wearable physiological signals contain extractable, clinically relevant biomarker information and that the AI agents can perform reliable hypothesis generation and validation. No free parameters or invented physical entities are introduced; the CoDaS system itself is the primary contribution.

axioms (1)
  • domain assumption: Wearable sensor data from consumer devices contains extractable features that can serve as clinically actionable biomarkers for mental health and metabolic outcomes.
    Invoked throughout the abstract as the foundation for the entire discovery pipeline.
invented entities (1)
  • CoDaS multi-agent system (no independent evidence)
    purpose: To automate and structure the biomarker discovery process
    The system is the proposed method rather than an independently evidenced physical or mathematical entity.

pith-pipeline@v0.9.0 · 5723 in / 1465 out tokens · 52486 ms · 2026-05-10T10:58:17.263194+00:00 · methodology

