pith. machine review for the scientific record.

arxiv: 2604.14356 · v1 · submitted 2026-04-15 · 💻 cs.CL · cs.AI

Recognition: unknown

When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 13:32 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords PCOS · eating disorders · explainable AI · social media · language models · comorbidity detection · natural language processing

The pith

Fine-tuned small language models can detect the overlapping presence of PCOS, eating disorders, and related issues in social media with built-in explanations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops small open-source language models to automatically identify when social media posts show polycystic ovary syndrome together with signs of disordered eating and body image distress. Researchers gathered one thousand posts from relevant online forums, had them labeled by trained annotators, and adapted the models using low-rank fine-tuning so each output includes both a structured detection and the exact text passages that support it. A sympathetic reader would care because these conditions often occur together yet current automated tools cannot reliably spot the combination or show their reasoning. The strongest model reached 75.3 percent exact match on held-out posts while preserving explainability, although results weaken as more conditions appear in a single post.

Core claim

By fine-tuning small language models on annotated social media data, it is possible to automatically detect the triple burden of PCOS with eating disorders and metabolic challenges, while generating structured explanations that cite specific textual evidence from the post for each identified condition.
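The paper's actual output schema is not reproduced on this page, but a detector that must cite its evidence plausibly emits something like the structure below. The field names and the sample post are illustrative assumptions, not the authors' format; the point is that every flagged condition carries verbatim spans that can be checked against the source text.

```python
import json

# Hypothetical post standing in for an annotated Reddit post.
post = ("Diagnosed with PCOS last year. I've been skipping meals to control "
        "my weight and I hate how my body looks.")

# Assumed structured detection: one flag per condition, each paired with
# verbatim evidence spans copied from the post.
detection = {
    "conditions": {
        "pcos": True,
        "disordered_eating": True,
        "body_image_distress": True,
    },
    "evidence": {
        "pcos": ["Diagnosed with PCOS last year"],
        "disordered_eating": ["I've been skipping meals to control my weight"],
        "body_image_distress": ["I hate how my body looks"],
    },
}

def evidence_is_grounded(det, text):
    """Check that every cited span appears verbatim in the source post."""
    return all(span in text
               for spans in det["evidence"].values()
               for span in spans)

print(evidence_is_grounded(detection, post))  # True for this example
```

A grounding check like this is what makes the explanations falsifiable: a model that hallucinates evidence fails it mechanically, without any human judgment.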

What carries the argument

Low-rank adaptation of small language models to produce structured outputs including detections and supporting textual evidence.
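Low-rank adaptation itself is compact enough to sketch: instead of updating a full weight matrix W, training fits two small matrices A and B whose scaled product perturbs W. A minimal numpy illustration, with dimensions and hyperparameters chosen arbitrarily rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 512, 512, 8          # hidden dims and LoRA rank (illustrative)
alpha = 16                     # LoRA scaling hyperparameter

W = rng.standard_normal((d, k))         # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection (zero init)

# Effective weight during fine-tuning: only A and B receive gradients,
# and the zero-initialized B means training starts from the base model.
W_eff = W + (alpha / r) * (B @ A)

full_params = W.size
lora_params = A.size + B.size
print(lora_params / full_params)  # 0.03125: ~3% of the full matrix
```

This parameter ratio is why the approach suits small open models: the adapters are cheap to train and to distribute separately from the frozen base weights.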

Load-bearing premise

Annotations by two trained annotators on posts from specific online communities accurately capture the true presence of the triple burden, and the resulting models generalize beyond that collection.

What would settle it

A fresh collection of posts from other platforms labeled independently by clinicians that yields accuracy well below 75 percent would falsify the claim of reliable detection and generalization.

Figures

Figures reproduced from arXiv: 2604.14356 by Apoorv Prasad, Susan McRoy.

Figure 1. Training and validation loss curves for DeepSeek-R1-Distill-Qwen-1.5B.
Original abstract

Women with polycystic ovary syndrome (PCOS) face substantially elevated risks of body image distress, disordered eating, and metabolic challenges, yet existing natural language processing approaches for detecting these conditions lack transparency and cannot identify co-occurring presentations. We developed small, open-source language models to automatically detect this triple burden in social media posts with grounded explainability. We collected 1,000 PCOS-related posts from six subreddits, with two trained annotators labeling posts using guidelines operationalizing Lee et al. (2017) clinical framework. Three models (Gemma-2-2B, Qwen3-1.7B, DeepSeek-R1-Distill-Qwen-1.5B) were fine-tuned using Low-Rank Adaptation to generate structured explanations with textual evidence. The best model achieved 75.3 percent exact match accuracy on 150 held-out posts, with robust comorbidity detection and strong explainability. Performance declined with diagnostic complexity, indicating their best use is for screening rather than autonomous diagnosis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that fine-tuned small open-source LLMs (Gemma-2-2B, Qwen3-1.7B, DeepSeek-R1-Distill-Qwen-1.5B) with LoRA can detect the triple burden of PCOS, body-image distress, and disordered eating in Reddit posts. Using 1,000 posts from six subreddits labeled by two trained annotators via the Lee et al. (2017) framework, the best model reaches 75.3% exact-match accuracy on a 150-post held-out set, produces structured explanations with textual evidence, shows robust comorbidity detection, and exhibits declining performance with increasing diagnostic complexity, positioning the models as screening rather than diagnostic tools.

Significance. If the central performance and explainability claims hold after addressing annotation reliability, the work would be significant for computational linguistics and health NLP: it demonstrates that small, open-source models can deliver grounded, explainable detection of complex comorbid conditions in social media text, which is more accessible than large proprietary systems. The emphasis on textual evidence in explanations and the observation that performance varies with diagnostic complexity provide a practical, falsifiable framing for screening applications. These elements, combined with the reproducible fine-tuning setup, could support follow-on studies in digital phenotyping.

major comments (2)
  1. [Methods (data annotation)] Methods section on data annotation: the labeling of the 1,000 posts by exactly two trained annotators using the Lee et al. (2017) operationalization is described without any inter-annotator agreement statistic (Cohen's kappa, percentage agreement, or adjudication details). Because the headline 75.3% exact-match accuracy and comorbidity-detection claims are evaluated directly against these labels, the absence of agreement metrics leaves the reliability of the ground truth unquantified and makes it impossible to assess whether the reported performance exceeds label noise.
  2. [Results (held-out evaluation)] Evaluation and results section: the 150-post held-out set is obtained via a single internal split of the 1,000-post corpus collected from only six subreddits in one collection period. No external validation set, temporal hold-out, or cross-subreddit evaluation is reported, which directly limits the strength of the generalization claim that the models are suitable for screening beyond the sampled data.
minor comments (2)
  1. [Methods] The abstract and methods should explicitly state the exact train/validation/held-out split ratios and any stratification by diagnostic complexity to allow replication.
  2. [Results] Figure or table presenting per-model exact-match accuracy and comorbidity F1 scores would improve clarity; currently the 75.3% figure is given only in aggregate.
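The agreement statistic requested in major comment 1 is straightforward once both annotators' raw labels are retained. A minimal Cohen's kappa sketch over hypothetical binary labels for a single condition (the labels below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled alike.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal label rates.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[l] * cb[l] for l in set(labels_a) | set(labels_b)) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical annotations for one condition across ten posts.
ann1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
ann2 = [1, 0, 0, 1, 0, 0, 1, 1, 1, 1]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.583
```

Reporting kappa per condition (rather than one pooled figure) would also show whether label reliability degrades for exactly the high-complexity posts where model performance drops.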

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their thorough review and constructive feedback on our manuscript. We address each of the major comments below and have made revisions to improve the paper's transparency and discussion of limitations.

Point-by-point responses
  1. Referee: Methods section on data annotation: the labeling of the 1,000 posts by exactly two trained annotators using the Lee et al. (2017) operationalization is described without any inter-annotator agreement statistic (Cohen's kappa, percentage agreement, or adjudication details). Because the headline 75.3% exact-match accuracy and comorbidity-detection claims are evaluated directly against these labels, the absence of agreement metrics leaves the reliability of the ground truth unquantified and makes it impossible to assess whether the reported performance exceeds label noise.

    Authors: We agree that inter-annotator agreement metrics are important for assessing label reliability. The two annotators were trained on the Lee et al. (2017) framework and labeled the posts independently, with any disagreements resolved via discussion and consensus. However, the individual annotation decisions were not archived separately, preventing the calculation of Cohen's kappa or percentage agreement at this stage. In the revised manuscript, we will expand the Methods section to include a more detailed description of the annotator training, the operationalization of the framework, and the adjudication process. We will also note this as a limitation and suggest that future work should report agreement statistics. revision: partial

  2. Referee: Evaluation and results section: the 150-post held-out set is obtained via a single internal split of the 1,000-post corpus collected from only six subreddits in one collection period. No external validation set, temporal hold-out, or cross-subreddit evaluation is reported, which directly limits the strength of the generalization claim that the models are suitable for screening beyond the sampled data.

    Authors: We concur that the current evaluation relies on a single random internal split and does not include external or temporal validation, which restricts the generalizability claims. The held-out set was created by randomly partitioning the 1,000 posts, ensuring no overlap with training data. We have revised the manuscript to clarify the split procedure in the Results section and added an expanded discussion of limitations, including potential subreddit biases and the importance of future multi-site or temporal validation for screening applications. We position the work as an initial demonstration rather than a fully generalizable tool. revision: yes
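The split procedure the rebuttal describes (a random partition of the 1,000 posts into 850 training and 150 held-out posts with no overlap) can be stated precisely in a few lines. This is a sketch over a placeholder corpus, since the actual data, seed, and any stratification details are not public:

```python
import random

# Placeholder corpus standing in for the 1,000 annotated posts.
corpus = [f"post_{i}" for i in range(1000)]

rng = random.Random(42)        # fixed seed so the split is reproducible
shuffled = corpus[:]
rng.shuffle(shuffled)

held_out = shuffled[:150]      # evaluation set, as in the paper
train = shuffled[150:]         # remaining 850 posts for fine-tuning

assert len(held_out) == 150 and len(train) == 850
assert not set(held_out) & set(train)   # no overlap between splits
```

Publishing the seed and split indices alongside the adapters would let the held-out evaluation be reproduced exactly, which is the minimum the referee's replication comment asks for.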

standing simulated objections not resolved
  • Inter-annotator agreement statistics cannot be reported, as the annotators' separate label sets were not retained for computation.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper collects fresh Reddit posts, applies an external clinical labeling framework (Lee et al. 2017) via two annotators, fine-tunes open models with LoRA, and reports exact-match accuracy on a held-out split. This empirical pipeline does not reduce any claimed result to its own inputs by definition, fitted-parameter renaming, or self-citation chains; the performance metric is computed against independent annotations rather than being tautological. No equations, uniqueness theorems, or ansatzes are smuggled in, and the cited framework originates from unrelated authors, leaving the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claim depends on the assumption that social media text contains detectable markers of the triple burden as defined by the external clinical framework and that the 1,000-post sample is representative; no new physical entities are introduced.

free parameters (1)
  • LoRA adaptation parameters
    Trainable low-rank matrices fitted during fine-tuning on the labeled Reddit posts.
axioms (1)
  • domain assumption: Reddit posts in PCOS subreddits contain linguistic patterns that reliably indicate the triple burden when labeled according to Lee et al. (2017).
    The entire detection pipeline rests on this mapping from text to clinical constructs.

pith-pipeline@v0.9.0 · 5475 in / 1454 out tokens · 44286 ms · 2026-05-10T13:32:29.344792+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

2 extracted references · 1 canonical work page · 1 internal anchor

  1. [1] J. Gomula, M. Warner, and A. Blandford, "Women's use of online health and social media resources to make sense of their polycystic ovary syndrome (PCOS) diagnosis: a qualitative study," BMC Womens Health, vol. 24, no. 1, Art. no. 157, Mar. 2024.

  2. [7] S. Naroji, J. John, and V. Gomez-Lobo, "Understanding PCOS-related content across social media platforms — ...

  3. [2] "A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA."

  4. S. Chancellor and M. De Choudhury, "Methods in predictive techniques for mental health status on social media: a critical review," npj Digit. Med., vol. 3, no. 1, p. 43, 2020.

  5. [21] I. Lee, L. G. Cooney, S. Saini, M. D. Sammel, K. C. Allison, and A. Dokras, "Increased odds of disordered eating in polycystic ovary syndrome: a systematic review and meta-anal...