When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden
Pith reviewed 2026-05-10 13:32 UTC · model grok-4.3
The pith
Fine-tuned small language models can detect the overlapping burden of disordered eating, body image distress, and metabolic challenges in PCOS-related social media posts, with built-in explanations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By fine-tuning small language models on annotated social media data, it is possible to automatically detect the triple burden of disordered eating, body image distress, and metabolic challenges that accompanies PCOS, while generating structured explanations that cite specific textual evidence from the post for each identified condition.
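As a concrete illustration, such a structured output might look like the following sketch, rendered as a Python dict. The field names and quoted evidence are hypothetical, not taken from the paper.

```python
# A hypothetical example of the structured output the core claim describes;
# condition names, schema, and evidence strings are illustrative assumptions.
prediction = {
    "conditions": {
        "disordered_eating": True,
        "body_image_distress": True,
        "metabolic_challenges": False,
    },
    "evidence": {
        "disordered_eating": "I've been skipping meals to undo the weight gain",
        "body_image_distress": "I can't stand looking at myself since the diagnosis",
    },
}
```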
What carries the argument
Low-rank adaptation of small language models to produce structured outputs including detections and supporting textual evidence.
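A minimal sketch of this adaptation step, assuming the Hugging Face transformers and peft libraries; the base model name, rank, scaling factor, and target modules below are illustrative choices, not the paper's reported configuration.

```python
# Sketch: wrap a small causal LM with LoRA adapters so that only the
# low-rank factors are trained during fine-tuning.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "google/gemma-2-2b"  # one of the three bases named in the review
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,                                  # adapter rank (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable
```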
Load-bearing premise
Annotations by two trained annotators on posts from specific online communities accurately capture the true presence of the triple burden, and the resulting models generalize beyond that collection.
What would settle it
If a fresh collection of posts from other platforms, labeled independently by clinicians, yielded accuracy well below 75 percent, it would falsify the claim of reliable detection and generalization.
Original abstract
Women with polycystic ovary syndrome (PCOS) face substantially elevated risks of body image distress, disordered eating, and metabolic challenges, yet existing natural language processing approaches for detecting these conditions lack transparency and cannot identify co-occurring presentations. We developed small, open-source language models to automatically detect this triple burden in social media posts with grounded explainability. We collected 1,000 PCOS-related posts from six subreddits, with two trained annotators labeling posts using guidelines operationalizing the Lee et al. (2017) clinical framework. Three models (Gemma-2-2B, Qwen3-1.7B, DeepSeek-R1-Distill-Qwen-1.5B) were fine-tuned using Low-Rank Adaptation to generate structured explanations with textual evidence. The best model achieved 75.3 percent exact match accuracy on 150 held-out posts, with robust comorbidity detection and strong explainability. Performance declined with diagnostic complexity, indicating the models' best use is screening rather than autonomous diagnosis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that fine-tuned small open-source LLMs (Gemma-2-2B, Qwen3-1.7B, DeepSeek-R1-Distill-Qwen-1.5B) with LoRA can detect the triple burden of body-image distress, disordered eating, and metabolic challenges in PCOS-related Reddit posts. Using 1,000 posts from six subreddits labeled by two trained annotators via the Lee et al. (2017) framework, the best model reaches 75.3% exact-match accuracy on a 150-post held-out set, produces structured explanations with textual evidence, shows robust comorbidity detection, and exhibits declining performance with increasing diagnostic complexity, positioning the models as screening rather than diagnostic tools.
Significance. If the central performance and explainability claims hold after addressing annotation reliability, the work would be significant for computational linguistics and health NLP: it demonstrates that small, open-source models can deliver grounded, explainable detection of complex comorbid conditions in social media text, which is more accessible than large proprietary systems. The emphasis on textual evidence in explanations and the observation that performance varies with diagnostic complexity provide a practical, falsifiable framing for screening applications. These elements, combined with the reproducible fine-tuning setup, could support follow-on studies in digital phenotyping.
major comments (2)
- [Methods (data annotation)] The labeling of the 1,000 posts by exactly two trained annotators using the Lee et al. (2017) operationalization is described without any inter-annotator agreement statistic (Cohen's kappa, percentage agreement, or adjudication details). Because the headline 75.3% exact-match accuracy and comorbidity-detection claims are evaluated directly against these labels, the absence of agreement metrics leaves the reliability of the ground truth unquantified and makes it impossible to assess whether the reported performance exceeds label noise; a sketch of the missing computation follows this list.
- [Results (held-out evaluation)] The 150-post held-out set is obtained via a single internal split of the 1,000-post corpus, which was collected from only six subreddits in one collection period. No external validation set, temporal hold-out, or cross-subreddit evaluation is reported, which directly limits the strength of the generalization claim that the models are suitable for screening beyond the sampled data.
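Following up on the first major comment, a minimal sketch of the requested agreement statistic, assuming per-post binary labels from each annotator were retained; it uses sklearn's implementation of Cohen's kappa, and the toy labels are illustrative.

```python
# Sketch: chance-corrected inter-annotator agreement for one condition.
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 1, 1, 0, 1]  # toy per-post labels from annotator A
annotator_b = [1, 0, 0, 1, 0, 1]  # toy per-post labels from annotator B

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```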
minor comments (2)
- [Methods] The abstract and methods should explicitly state the exact train/validation/held-out split ratios and any stratification by diagnostic complexity to allow replication.
- [Results] A figure or table presenting per-model exact-match accuracy and comorbidity F1 scores would improve clarity; currently the 75.3% figure is given only in aggregate. A sketch of both metrics follows this list.
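Following up on the second minor comment, a sketch of the two metrics, assuming predictions and gold labels are dicts of per-condition booleans; the condition names are assumptions carried over from the hypothetical schema above.

```python
# Sketch: exact-match accuracy over the full label set, plus per-condition F1.
from sklearn.metrics import f1_score

CONDITIONS = ["disordered_eating", "body_image_distress", "metabolic_challenges"]

def exact_match(preds, golds):
    # a post counts as correct only if every condition flag matches the gold label
    hits = sum(all(p[c] == g[c] for c in CONDITIONS) for p, g in zip(preds, golds))
    return hits / len(golds)

def per_condition_f1(preds, golds):
    return {
        c: f1_score([g[c] for g in golds], [p[c] for p in preds])
        for c in CONDITIONS
    }
```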
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive feedback on our manuscript. We address each of the major comments below and have made revisions to improve the paper's transparency and discussion of limitations.
Point-by-point responses
- Referee: The labeling of the 1,000 posts by exactly two trained annotators using the Lee et al. (2017) operationalization is described without any inter-annotator agreement statistic (Cohen's kappa, percentage agreement, or adjudication details). Because the headline 75.3% exact-match accuracy and comorbidity-detection claims are evaluated directly against these labels, the absence of agreement metrics leaves the reliability of the ground truth unquantified and makes it impossible to assess whether the reported performance exceeds label noise.
  Authors: We agree that inter-annotator agreement metrics are important for assessing label reliability. The two annotators were trained on the Lee et al. (2017) framework and labeled the posts independently, with any disagreements resolved via discussion and consensus. However, the individual annotation decisions were not archived separately, preventing the calculation of Cohen's kappa or percentage agreement at this stage. In the revised manuscript, we will expand the Methods section to include a more detailed description of the annotator training, the operationalization of the framework, and the adjudication process. We will also note this as a limitation and suggest that future work should report agreement statistics. Revision: partial
- Referee: The 150-post held-out set is obtained via a single internal split of the 1,000-post corpus, which was collected from only six subreddits in one collection period. No external validation set, temporal hold-out, or cross-subreddit evaluation is reported, which directly limits the strength of the generalization claim that the models are suitable for screening beyond the sampled data.
  Authors: We concur that the current evaluation relies on a single random internal split and does not include external or temporal validation, which restricts the generalizability claims. The held-out set was created by randomly partitioning the 1,000 posts, ensuring no overlap with training data (a sketch of such a split follows this list). We have revised the manuscript to clarify the split procedure in the Results section and added an expanded discussion of limitations, including potential subreddit biases and the importance of future multi-site or temporal validation for screening applications. We now position the work as an initial demonstration rather than a fully generalizable tool. Revision: yes
- Not addressed: an inter-annotator agreement statistic, as the separate annotations were not retained for computation.
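A minimal sketch of the single random split described in the response above; the 150-post held-out size comes from the text, while the stand-in corpus and fixed seed are illustrative assumptions.

```python
# Sketch: one random, reproducible partition of the 1,000-post corpus.
import pandas as pd
from sklearn.model_selection import train_test_split

posts = pd.DataFrame({"text": [f"post {i}" for i in range(1000)]})  # stand-in corpus

train_posts, heldout_posts = train_test_split(
    posts,
    test_size=150,    # the 150-post held-out set
    random_state=42,  # fixed seed so the partition is reproducible (assumed)
    shuffle=True,
)
# verify there is no overlap between training and held-out posts
assert set(train_posts.index).isdisjoint(set(heldout_posts.index))
```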
Circularity Check
No significant circularity detected
full rationale
The paper collects fresh Reddit posts, applies an external clinical labeling framework (Lee et al. 2017) via two annotators, fine-tunes open models with LoRA, and reports exact-match accuracy on a held-out split. This empirical pipeline does not reduce any claimed result to its own inputs through definitional sleight, fitted-parameter renaming, or self-citation chains; the performance metric is computed against independently produced annotations rather than being tautological. No equations, uniqueness theorems, or ansatzes are smuggled in, and the cited framework originates with unrelated authors, leaving the claims anchored to external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- LoRA adaptation parameters
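For scale, a worked example of the parameter count this ledger entry refers to: adapting a weight matrix of shape (d, k) at rank r trains two factors of shapes (d, r) and (r, k), i.e. r * (d + k) free parameters per matrix. The dimensions and rank below are assumed, not the paper's configuration.

```python
# Sketch: trainable parameters LoRA adds for one adapted weight matrix.
d, k, r = 2048, 2048, 16
lora_params = r * (d + k)
print(lora_params)  # 65536 trainable parameters for this single matrix
```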
axioms (1)
- domain assumption Reddit posts in PCOS subreddits contain linguistic patterns that reliably indicate the triple burden when labeled according to Lee et al. (2017)
Reference graph
Works this paper leans on
- [1] J. Gomula, M. Warner, and A. Blandford, "Women's use of online health and social media resources to make sense of their polycystic ovary syndrome (PCOS) diagnosis: a qualitative study," BMC Women's Health, vol. 24, no. 1, Art. no. 157, Mar. 2024.
- [7] S. Naroji, J. John, and V. Gomez-Lobo, "Understanding PCOS-related content across social media platforms — ..."
- [21] I. Lee, L. G. Cooney, S. Saini, M. D. Sammel, K. C. Allison, and A. Dokras, "Increased odds of disordered eating in polycystic ovary syndrome: a systematic review and meta-analysis," 2017.
- S. Chancellor and M. De Choudhury, "Methods in predictive techniques for mental health status on social media: a critical review," npj Digit. Med., vol. 3, no. 1, p. 43, 2020.
- "A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA," arXiv.