pith. sign in

arxiv: 2603.02087 · v3 · submitted 2026-03-02 · 💻 cs.CV · cs.AI· cs.LG

A Detection-Gated Pipeline for Robust Glottal Area Waveform Extraction and Clinical Pathology Assessment

Pith reviewed 2026-05-15 17:44 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG
keywords glottal area segmentationhigh-speed videoendoscopyglottal area waveformpathology assessmentYOLOv8U-NetDice similarity coefficientlaryngeal kinematics
0
0 comments X

The pith

A detection-gated pipeline segments glottal areas from high-speed videoendoscopy videos, achieving Dice scores up to 0.856 and distinguishing healthy from pathological function via coefficient of variation in a 40-subject study.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a two-stage automated framework that first uses a YOLOv8n model to localize the glottis and then applies a U-Net to segment the glottal area, with the localization step cropping the image and gating the segmentation to avoid errors during closure phases. This design supports training on one dataset and testing on another without fine-tuning, reaching 87 percent of in-domain performance on cross-dataset evaluation. The extracted glottal area waveforms yield a coefficient of variation metric that separated healthy from pathological laryngeal function at statistical significance in an exploratory clinical test. The full pipeline runs at roughly 35 frames per second on ordinary hardware, allowing real-time playback and uniform extraction of kinematic measures across different recording conditions.

Core claim

The detection-gated pipeline combines a YOLOv8n glottis localizer with a U-Net segmenter to produce robust glottal area segmentations from high-speed videoendoscopy, achieving Dice similarity coefficients of 0.81 on the GIRAFE dataset, 0.856 on the BAGLS dataset, and 0.745 in cross-dataset testing without fine-tuning, while an exploratory study of 40 subjects shows that the resulting glottal area coefficient of variation distinguishes healthy from pathological function at p=0.006.

What carries the argument

The detection-gated pipeline, in which the YOLOv8n localizer supplies a tight crop and gates the U-Net output to suppress spurious segmentations during glottal closure.

If this is right

  • The system produces glottal area waveforms at approximately 35 frames per second on commodity hardware, supporting interactive clinical review.
  • Cross-dataset performance reaches 87 percent of the in-domain ceiling without retraining, indicating portability across acquisition settings.
  • Glottal area coefficient of variation provides a quantitative distinction between healthy and pathological laryngeal function.
  • The modular two-stage design enables uniform extraction of laryngeal kinematic measures from videos recorded under varying conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the coefficient of variation marker proves stable in larger cohorts, it could be embedded directly into clinical review software for rapid screening.
  • The gating mechanism may transfer to other medical video segmentation tasks where transient occlusions or closure phases create noise.
  • Real-time speed opens the possibility of live feedback during endoscopic procedures rather than only post hoc analysis.
  • Extending the pipeline to additional laryngeal measures beyond area could strengthen its role in comprehensive kinematic assessment.

Load-bearing premise

The glottal area coefficient of variation serves as a clinically meaningful marker of pathology, based on an exploratory study of only 40 subjects.

What would settle it

A larger controlled study that finds no statistically significant difference in glottal area coefficient of variation between healthy and pathological groups would undermine the clinical utility claim.

Figures

Figures reproduced from arXiv: 2603.02087 by Harikrishnan Unnikrishnan, Rita Patel.

Figure 1
Figure 1. Figure 1: Overview of the three main inference pipelines. Input (left) is the grayscale [PITH_FULL_IMAGE:figures/full_fig_p009_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Effect of temporal hold duration (0–20 frames and [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Output of the Localizer+Segmenter pipeline on [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Effect of localizer confidence threshold on Localizer-Crop+Segmenter performance [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Localizer confidence threshold τ sweep on BAGLS test set (3500 frames). Left: mean DSC for all pipeline configurations. The GIRAFE-trained crop segmenter (blue solid) tracks the BAGLS-trained crop segmenter (red solid) almost in lockstep across the full τ range, demonstrating that the segmenter has learned generic laryngeal morphology rather than centre-specific imaging artefacts—it is an anatomical genera… view at source ↗
Figure 6
Figure 6. Figure 6: Example glottal area waveforms: Healthy (Patient 14), Paresis (Patient 50), and [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗
read the original abstract

We present a fully automated, two-stage modular glottal area segmentation framework for high-speed videoendoscopy (HSV) designed for accuracy, generalizability, and real-time playback. Our detection-gated pipeline combines a YOLOv8n glottis localizer with a U-Net segmenter; the localizer defines a tight crop to ensure a consistent field of view and gates the output to reduce spurious segmentations during glottal closure. The models were trained on the GIRAFE (N=600) and BAGLS (N=55,750) datasets. Cross-dataset portability was evaluated by benchmarking GIRAFE-trained models on the BAGLS test set without fine-tuning. In these evaluations, the pipeline achieved a Dice Similarity Coefficient (DSC) of 0.745 (87% of the in-domain ceiling). On in-distribution test sets, the system achieved DSCs of 0.81 (GIRAFE) and 0.856 (BAGLS), outperforming or competing with state-of-the-art methods. An exploratory clinical study of 40 subjects demonstrated that the glottal area Coefficient of Variation (CV distinguished healthy from pathological function (p=0.006). The system processes ~35 frames per second on commodity hardware, enabling interactive clinical review. This design supports uniform extraction of laryngeal kinematic measures across varying acquisition settings. Code, weights, and software are available at https://github.com/hari-krishnan/openglottal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a two-stage detection-gated pipeline combining YOLOv8n glottis localization with U-Net segmentation for automated glottal area waveform extraction from high-speed videoendoscopy. It reports in-domain DSCs of 0.81 (GIRAFE, N=600) and 0.856 (BAGLS, N=55,750), cross-dataset DSC of 0.745 without fine-tuning, real-time performance of ~35 fps, and an exploratory clinical result on 40 subjects where glottal-area coefficient of variation distinguished healthy from pathological cases (p=0.006). Code, weights, and software are released.

Significance. If the results hold, the work supplies a practical, modular, and portable tool for consistent extraction of laryngeal kinematic measures across acquisition settings, with potential to support clinical review. The open release of code and weights is a clear strength for reproducibility. Cross-dataset generalization is a substantive contribution, though the clinical marker's status as a generalizable pathology indicator rests on limited evidence.

major comments (2)
  1. [Abstract / Clinical Evaluation] Abstract / Clinical Evaluation section: The headline claim that the pipeline enables clinical pathology assessment rests on the glottal-area CV distinguishing healthy from pathological subjects (p=0.006) in an exploratory n=40 cohort. No information is supplied on age/sex matching, acquisition-parameter stratification, power calculation, or comparison against other kinematic features; this is load-bearing for the title and abstract claim and must be addressed with additional controls or explicit qualification as preliminary.
  2. [Methods / Results] Methods / Results: Training hyperparameters, the precise cross-validation protocol, and error bars (or confidence intervals) on the DSC values and clinical CV comparison are not reported. This leaves the central performance numbers only partially supported, as the soundness assessment notes.
minor comments (1)
  1. [Abstract] Abstract: Adding standard deviations or confidence intervals to the reported DSC figures would improve interpretability of the performance claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for clarification, particularly around the exploratory nature of the clinical evaluation and the need for additional methodological details. We address each point below and have revised the manuscript accordingly to strengthen the presentation while maintaining the integrity of our findings.

read point-by-point responses
  1. Referee: [Abstract / Clinical Evaluation] Abstract / Clinical Evaluation section: The headline claim that the pipeline enables clinical pathology assessment rests on the glottal-area CV distinguishing healthy from pathological subjects (p=0.006) in an exploratory n=40 cohort. No information is supplied on age/sex matching, acquisition-parameter stratification, power calculation, or comparison against other kinematic features; this is load-bearing for the title and abstract claim and must be addressed with additional controls or explicit qualification as preliminary.

    Authors: We agree that the clinical results are exploratory and based on a modest cohort size. In the revised manuscript, we have explicitly qualified these findings as preliminary in the abstract, results, and discussion sections, emphasizing the need for larger validation studies. We added available subject demographics (age range and sex distribution) and acquisition parameters where recorded. A formal power calculation was not performed a priori, which we now state as a limitation; however, the reported p=0.006 remains indicative of a detectable difference. We have also included a brief comparison of the CV metric against other kinematic features (e.g., glottal gap index) in the supplementary material. The title has been softened to 'and Preliminary Clinical Pathology Assessment' to better reflect the evidence level. revision: partial

  2. Referee: [Methods / Results] Methods / Results: Training hyperparameters, the precise cross-validation protocol, and error bars (or confidence intervals) on the DSC values and clinical CV comparison are not reported. This leaves the central performance numbers only partially supported, as the soundness assessment notes.

    Authors: We appreciate this observation and have expanded the Methods section with a new subsection on experimental setup. This includes all training hyperparameters (initial learning rate of 1e-4 with cosine annealing, batch size 16, 100 epochs, AdamW optimizer, and data augmentations), the cross-validation protocol (5-fold stratified cross-validation on each dataset with subject-level splits to avoid leakage), and statistical reporting. All DSC values now include mean ± standard deviation across folds, and the clinical CV comparison includes 95% bootstrap confidence intervals. These additions directly address the soundness concerns and improve reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity: standard held-out evaluation and independent clinical statistic

full rationale

The paper presents a standard supervised segmentation pipeline (YOLOv8n localizer + U-Net) trained on the public GIRAFE and BAGLS datasets and evaluated via Dice scores on held-out test splits, including a cross-dataset transfer test. The clinical claim rests on a simple two-group comparison of glottal-area coefficient of variation (p=0.006) computed directly from the extracted waveforms in an exploratory cohort of 40 subjects. No equations, fitted parameters, or self-citations are used to derive the reported metrics or the statistical distinction; both are independent measurements on external data. No self-definitional, fitted-input-called-prediction, or uniqueness-imported patterns appear.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The pipeline rests on standard supervised deep-learning assumptions (i.i.d. training and test distributions, Dice loss suitability for segmentation) and the clinical utility of area coefficient of variation; no new physical entities or ad-hoc constants are introduced.

axioms (1)
  • domain assumption Standard assumptions of convolutional neural networks for medical image segmentation hold across the GIRAFE and BAGLS acquisition protocols
    Invoked implicitly when claiming cross-dataset generalization without retraining

pith-pipeline@v0.9.0 · 5573 in / 1331 out tokens · 36379 ms · 2026-05-15T17:44:40.061580+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 1 internal anchor

  1. [1]

    D. D. Deliyski, Clinical implementation of laryngeal high-speed videoen- doscopy: Challenges and evolution, Folia Phoniatrica et Logopaedica 60 (1) (2008) 33–44.doi:10.1159/000111802

  2. [2]

    M. A. Little, P. E. McSharry, S. J. Roberts, D. A. E. Costello, I. M. Moroz, Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection, BioMedical Engineering OnLine 6 (2007) 23. doi:10.1186/1475-925X-6-23

  3. [3]

    Lohscheller, U

    J. Lohscheller, U. Eysholdt, H. Toy, M. Döllinger, Phonovibrography: Mapping high-speed movies of vocal fold vibrations into 2-D diagrams for visualizing and analyzing the underlying laryngeal dynamics, IEEE Transactions on Medical Imaging 27 (3) (2008) 300–309.doi:10.1109/ TMI.2007.903690

  4. [4]

    R. R. Patel, K. D. Donohue, H. Unnikrishnan, R. J. Kryscio, Effects of vocal fold nodules on glottal cycle measurements derived from high- speed videoendoscopy in children, PLOS ONE 11 (4) (2016) e0154586, nodules vs. typically developing children; kinematic features.doi:10. 1371/journal.pone.0154586

  5. [5]

    Gómez, A

    P. Gómez, A. M. Kist, P. Schlegel, D. A. Berry, D. K. Chhetri, R. Mon- taño, F. Müller, A. Schützenberger, M. Semmler, S. Dürr, D. Eytan, J. Lohscheller, M. Echternach, M. Döllinger, BAGLS, a multihospital benchmark for automatic glottis segmentation, Scientific Data 7 (2020) 186.doi:10.1038/s41597-020-0526-3

  6. [6]

    F. J. P. Montalbo, S3AR U-Net: A separable squeezed simi- larity attention-gated residual U-Net for glottis segmentation, 26 Biomedical Signal Processing and Control 92 (2024) 106047. doi:10.1016/j.bspc.2024.106047. URL https://www.sciencedirect.com/science/article/pii/ S1746809424001058

  7. [7]

    Andrade-Miranda, M

    G. Andrade-Miranda, M. Hernández-Álvarez, J. I. Godino-Llorente, GIRAFE: Glottal imaging dataset for advanced segmentation, analysis, and facilitative playbacks evaluation, Data in Brief 59 (2025) 111376. doi:10.1016/j.dib.2025.111376. URLhttps://doi.org/10.1016/j.dib.2025.111376

  8. [8]

    R.R.Patel, H.Unnikrishnan, K.D.Donohue, Effectsofvocalfoldnodules on glottal cycle measurements derived from high-speed videoendoscopy in children, PloS one 11 (4) (2016) e0154586

  9. [9]

    Unnikrishnan, K

    H. Unnikrishnan, K. D. Donohue, R. R. Patel, Analysis of high-speed digital phonoscopy pediatric images, in: Photonic Therapeutics and Diagnostics VIII, Vol. 8207, SPIE, 2012, pp. 328–340

  10. [10]

    R. R. Patel, K. D. Donohue, H. Unnikrishnan, R. J. Kryscio, Kinematic measurements of the vocal-fold displacement waveform in typical children and adult populations: quantification of high-speed endoscopic videos, Journal of Speech, Language, and Hearing Research 58 (2) (2015) 227–240. doi:10.1044/2015_JSLHR-S-14-0242

  11. [11]

    M. K. Fehling, F. Grosch, M. E. Schuster, B. Schick, J. Lohscheller, Fully automatic segmentation of glottis and vocal folds in endoscopic laryngeal high-speed videos using a deep convolutional LSTM network, PLOS ONE 15 (2) (2020) e0227791.doi:10.1371/journal.pone.0227791

  12. [12]

    S.M.N.Nobel, S.M.M.R.Swapno, M.R.Islam, M.Safran, S.Alfarhood, M. F. Mridha, A machine learning approach for vocal fold segmentation and disorder classification based on ensemble method, Scientific Reports 14, pMCID: PMC11758383; PMID: 38910146. Ensemble UNet-BiGRU segmentation (IoU 87.46%); no evaluation on public BAGLS or GIRAFE benchmarks. (2024).doi:1...

  13. [13]

    Döllinger, T

    M. Döllinger, T. Schraut, L. A. Henrich, D. Chhetri, M. Echternach, A. M. Johnson, M. Kunduk, Y. Maryn, R. R. Patel, R. Samlan, et al., 27 Re-training of convolutional neural networks for glottis segmentation in endoscopic high-speed videos, Applied Sciences 12 (19) (2022) 9791. doi:10.3390/app12199791. URLhttps://doi.org/10.3390/app12199791

  14. [14]

    Jocher, A

    G. Jocher, A. Chaurasia, J. Qiu, Ultralytics YOLOv8, version 8.0.0, AGPL-3.0 (2023). URLhttps://github.com/ultralytics/ultralytics

  15. [15]

    U-Net: Convolutional networks for biomedical image segmentation

    O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional net- works for biomedical image segmentation, in: Medical Image Com- puting and Computer-Assisted Intervention (MICCAI), Vol. 9351 of Lecture Notes in Computer Science, Springer, 2015, pp. 234–241. doi:10.1007/978-3-319-24574-4_28

  16. [16]

    Milletari, N

    F. Milletari, N. Navab, S.-A. Ahmadi, V-Net: Fully convolutional neu- ral networks for volumetric medical image segmentation, in: Fourth International Conference on 3D Vision (3DV), 2016, pp. 565–571. doi:10.1109/3DV.2016.79

  17. [17]

    Loshchilov, F

    I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: International Conference on Learning Representations (ICLR), 2019, pp. 1–22. URLhttps://openreview.net/forum?id=Bkg6RiCqY7

  18. [18]

    Loshchilov, F

    I. Loshchilov, F. Hutter, SGDR: Stochastic gradient descent with warm restarts, in: International Conference on Learning Representations (ICLR), 2017, pp. 1–16. URLhttps://openreview.net/forum?id=Skq89Scxx

  19. [19]

    A Threshold Selection Method from Gray-Level Histograms.IEEE Transactions on Systems, Man, and Cybernetics1979,9, 62–66

    N. Otsu, A threshold selection method from gray-level histograms, IEEE Transactions on Systems, Man, and Cybernetics 9 (1) (1979) 62–66. doi:10.1109/TSMC.1979.4310076

  20. [20]

    J. Wu, R. Fu, H. Fang, Y. Zhang, Y. Yang, H. Xiong, H. Liu, Y. Xu, MedSegDiff: Medical image segmentation with diffusion probabilistic model, in: Medical Imaging with Deep Learning, Vol. 227 of Proceedings of Machine Learning Research, PMLR, 2024, pp. 1623–1639. URLhttps://proceedings.mlr.press/v227/wu24a.html 28

  21. [21]

    D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: International Conference on Learning Representations (ICLR), 2015, pp. 1–15. URLhttps://arxiv.org/abs/1412.6980

  22. [22]

    In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV)

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, R. Girshick, Segment anything, in: IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 4015–4026.doi:10.1109/ICCV51070.2023. 00371

  23. [23]

    J. Ma, Y. He, F. Li, L. Han, C. You, B. Wang, Segment anything in medical images, Nature Communications 15 (2024) 654.doi:10.1038/ s41467-024-44824-z. 29