Neck-Learn: Attention-Based Multiple Instance Learning and Ensemble Framework for Ecological Momentary Assessment
Pith reviewed 2026-05-08 02:17 UTC · model grok-4.3
The pith
A hybrid model with attention-based multiple instance learning improves detection of vocal hyperfunction from daily neck accelerometer recordings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the Neck-Learn hybrid architecture, which combines gradient-boosted trees on day-level distributional features with an attention-based CNN multiple instance learning framework, preserves and learns from within-day temporal dynamics in week-long neck-surface accelerometer recordings. On the held-out test set it achieves AUCs of 0.879 for phonotraumatic vocal hyperfunction (PVH) and 0.848 for non-phonotraumatic vocal hyperfunction (NPVH), exceeding the challenge baselines while also yielding clinically relevant insights into both pathologies.
What carries the argument
An attention-based multiple instance learning framework in which each day acts as a bag of temporal instances from the accelerometer signal, allowing the model to attend to informative segments without discarding within-day structure.
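As an illustrative sketch (not the authors' code), the attention pooling at the heart of such a MIL model can be written as a softmax-weighted average of a day's instance embeddings; the parameters `V` and `w` below are random stand-ins for learned weights:

```python
import numpy as np

def attention_mil_pool(instances, V, w):
    """Attention pooling over a bag of instances (Ilse et al., 2018 style).

    instances: (n_instances, d) embeddings for one day-long bag.
    V: (d, h) and w: (h,) play the role of learned attention parameters;
    here they are random stand-ins, purely for illustration.
    Returns the (d,) bag embedding and the (n_instances,) attention weights.
    """
    scores = np.tanh(instances @ V) @ w            # one score per instance
    scores -= scores.max()                         # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over the bag
    return alpha @ instances, alpha

rng = np.random.default_rng(0)
day = rng.normal(size=(120, 16))   # 120 temporal instances, 16-dim features
bag, alpha = attention_mil_pool(day, rng.normal(size=(16, 8)),
                                rng.normal(size=(8,)))
```

The bag embedding feeds a subject-level classifier, while `alpha` indicates which segments of the day the model attended to.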
If this is right
- The hybrid model exceeds the challenge baselines in AUC for detection of both phonotraumatic and non-phonotraumatic vocal hyperfunction.
- Attention within the multiple instance learning component focuses on clinically relevant temporal patterns that fixed-length feature vectors discard.
- The ensemble approach supplies interpretable insights into features associated with the two pathologies.
- Ecological momentary assessment data can retain its full temporal resolution and still support subject-level classification.
Where Pith is reading between the lines
- Similar attention-based bag structures could be tested on other continuous sensor streams such as wrist accelerometers for related movement disorders.
- The performance gain may depend on the specific day-level distributional features chosen; ablating those features on new data would clarify their contribution.
- If the temporal patterns generalize across recording devices, the framework could support longer-term passive monitoring in voice clinics.
Load-bearing premise
Within-day temporal dynamics in neck-surface accelerometer data contain clinically discriminative information about vocal hyperfunction that the multiple instance learning framework can extract without overfitting to training distributions.
What would settle it
A model that ignores within-day temporal structure or omits the attention-based multiple instance learning component achieving equal or higher AUC on the same held-out test set would falsify the necessity of preserving those dynamics.
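A toy contrast makes the falsification concrete: on synthetic days where only the within-day trend carries the label (invented data, not the challenge dataset), a mean-pooled fixed-length feature is uninformative while a trend-aware feature separates the classes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, T = 400, 50
t = np.linspace(-1.0, 1.0, T)

# Synthetic "days": positive bags trend upward, negatives downward,
# but both classes share the same expected day-level mean.
y = rng.integers(0, 2, size=n)
slope = np.where(y == 1, 1.0, -1.0)
X_seq = slope[:, None] * t[None, :] + rng.normal(0.0, 0.5, size=(n, T))

X_mean = X_seq.mean(axis=1, keepdims=True)         # collapses temporal structure
X_trend = (X_seq * t).mean(axis=1, keepdims=True)  # keeps within-day dynamics

tr, te = slice(0, 200), slice(200, n)
auc_mean = roc_auc_score(
    y[te], LogisticRegression().fit(X_mean[tr], y[tr]).predict_proba(X_mean[te])[:, 1])
auc_trend = roc_auc_score(
    y[te], LogisticRegression().fit(X_trend[tr], y[tr]).predict_proba(X_trend[te])[:, 1])
print(f"mean-pooled AUC ~ {auc_mean:.2f}; trend-aware AUC ~ {auc_trend:.2f}")
```

The mean-pooled feature lands near chance while the trend feature classifies almost perfectly, which is the pattern a successful ablation of temporal structure would have to overturn.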
Original abstract
Vocal hyperfunction (VH) is a prevalent voice disorder whose ambulatory detection remains challenging despite extensive daily voice data. Prior approaches capture week-long neck-surface accelerometer recordings but collapse them into fixed-length subject-level feature vectors, discarding within-day temporal dynamics encoding nuanced voicing feature interactions. We introduce a novel hybrid architecture combining gradient-boosted trees on day-level distributional features with a CNN-based multiple instance learning (MIL) framework that preserves and learns from from temporal dynamics throughout each day. On the held-out test set, our model exceeds the challenge baselines (AUC: 0.82 PVH, 0.77 NPVH), achieving AUCs of 0.879 for PVH (Rank 5) and 0.848 for NPVH (Rank 3), while also providing insights into clinically relevant information about both pathologies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Neck-Learn, a hybrid ensemble framework that pairs gradient-boosted trees operating on day-level distributional features with an attention-based CNN multiple-instance learning (MIL) model to preserve and exploit within-day temporal dynamics in neck-surface accelerometer recordings. The central claim is that this architecture yields superior held-out test performance for binary detection of vocal hyperfunction (PVH AUC 0.879, NPVH AUC 0.848) relative to challenge baselines (0.82 and 0.77) while also surfacing clinically interpretable patterns.
Significance. If the reported gains prove robust, the work usefully demonstrates that retaining intra-day temporal structure in ambulatory sensor data can improve detection of voice disorders over subject-level aggregation methods. The MIL-plus-ensemble design is a natural fit for the bag-of-daily-recordings structure and could inform future ecological momentary assessment pipelines in speech and health monitoring.
major comments (1)
- [§4] §4 (Experimental protocol): the manuscript provides no description of the cross-validation scheme, hyperparameter search procedure, feature-extraction pipeline, or statistical significance testing for the AUC differences. Without these details the claim that the hybrid model reliably exceeds the baselines cannot be evaluated and remains vulnerable to post-hoc selection or split-specific artifacts.
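One standard remedy the referee alludes to is a paired bootstrap confidence interval on the AUC difference (DeLong's test is the parametric alternative). The scores below are synthetic placeholders, not the paper's predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_diff(y, scores_a, scores_b, n_boot=2000, seed=0):
    """95% bootstrap CI for AUC(scores_a) - AUC(scores_b), paired resampling."""
    rng = np.random.default_rng(seed)
    n, diffs = len(y), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if y[idx].min() == y[idx].max():   # resample must contain both classes
            continue
        diffs.append(roc_auc_score(y[idx], scores_a[idx])
                     - roc_auc_score(y[idx], scores_b[idx]))
    return np.percentile(diffs, [2.5, 97.5])

# Synthetic held-out labels and scores (placeholders, for illustration only).
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=300)
baseline = y * 0.5 + rng.normal(0.0, 0.5, size=300)   # weaker scorer
model = y * 1.4 + rng.normal(0.0, 0.5, size=300)      # stronger scorer
lo, hi = bootstrap_auc_diff(y, model, baseline)
print(f"AUC difference 95% CI: [{lo:.3f}, {hi:.3f}]")
```

A CI that excludes zero would address the referee's concern that the reported gains could be split-specific artifacts.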
minor comments (2)
- [Abstract] Abstract: repeated word “from from” should be corrected.
- [Discussion] The clinical-insight claims in the final paragraph would be strengthened by explicit mapping from attention weights or feature importances to specific voicing parameters (e.g., shimmer, HNR) rather than qualitative statements.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The single major comment highlights a genuine gap in the current manuscript, which we will address through revision.
Point-by-point responses
Referee: [§4] §4 (Experimental protocol): the manuscript provides no description of the cross-validation scheme, hyperparameter search procedure, feature-extraction pipeline, or statistical significance testing for the AUC differences. Without these details the claim that the hybrid model reliably exceeds the baselines cannot be evaluated and remains vulnerable to post-hoc selection or split-specific artifacts.
Authors: We agree that §4 currently omits these critical details, which limits independent evaluation of the reported AUC gains. In the revised manuscript we will expand §4 with: (i) the exact cross-validation scheme (subject-wise partitioning to prevent leakage from multiple days per participant), (ii) the hyperparameter search procedure and search space for both the gradient-boosted trees and the attention-based MIL model, (iii) the complete feature-extraction pipeline (day-level distributional statistics for GBT and the raw 24-hour time-series preprocessing for the CNN-MIL branch), and (iv) the statistical procedure used to compare AUCs against the challenge baselines (bootstrap confidence intervals and DeLong tests). These additions will be placed in a new subsection titled “Experimental Protocol and Reproducibility” and will include pseudocode where appropriate. revision: yes
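The subject-wise partitioning described in (i) corresponds to grouped cross-validation, e.g. scikit-learn's GroupKFold. A minimal sketch with invented subjects and labels:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Invented toy setup: 6 subjects, 7 recorded days each.
n_subjects, days_per_subject = 6, 7
subject_id = np.repeat(np.arange(n_subjects), days_per_subject)
X = np.random.default_rng(0).normal(size=(n_subjects * days_per_subject, 4))
y = subject_id % 2          # placeholder labels, one diagnosis per subject

n_folds = 0
for train_idx, val_idx in GroupKFold(n_splits=3).split(X, y, groups=subject_id):
    # No subject contributes days to both sides of the split, so there is
    # no leakage from multiple days per participant.
    assert set(subject_id[train_idx]).isdisjoint(subject_id[val_idx])
    n_folds += 1
```

Grouping by subject is what makes the out-of-fold AUC an honest estimate when each participant contributes many correlated days.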
Circularity Check
No significant circularity
full rationale
The paper presents a standard supervised learning pipeline: a hybrid gradient-boosted trees plus CNN-MIL architecture is trained on day-level distributional features and temporal dynamics extracted from neck-surface accelerometer recordings, then evaluated via AUC on a held-out test set. No derivation step reduces to its own inputs by construction, no fitted parameter is relabeled as a prediction, and no self-citation or ansatz is invoked as load-bearing justification for the central claim. The reported performance numbers (AUC 0.879 PVH, 0.848 NPVH) are direct outputs of the train-test protocol rather than tautological restatements of the model definition or training data.
Axiom & Free-Parameter Ledger
free parameters (3)
- CNN-MIL architecture hyperparameters
- GBT hyperparameters
- Ensemble weighting or fusion parameters
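The third free parameter can be as simple as a convex combination of per-model probabilities; the weights below mirror the PVH ensemble weights quoted in the Results excerpt (CNN-MIL 0.45, XGBoost 0.35, LightGBM 0.20), applied to made-up scores:

```python
import numpy as np

def fuse(probs, weights):
    """Convex weighted average of per-model probabilities.

    probs: (n_models, n_subjects) predicted P(pathology) per model.
    weights: (n_models,) nonnegative fusion weights summing to 1.
    """
    w = np.asarray(weights, dtype=float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
    return w @ np.asarray(probs, dtype=float)

# Made-up scores for three subjects from the three models.
p_cnn_mil = [0.90, 0.20, 0.55]
p_xgb     = [0.80, 0.30, 0.60]
p_lgbm    = [0.70, 0.25, 0.50]
fused = fuse([p_cnn_mil, p_xgb, p_lgbm], [0.45, 0.35, 0.20])
# First subject: 0.45*0.90 + 0.35*0.80 + 0.20*0.70 = 0.825
```

The weights themselves are fitted on validation folds, which is why the ledger rightly counts them as free parameters.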
axioms (2)
- domain assumption: Within-day temporal dynamics in neck-surface accelerometer signals encode clinically relevant interactions among voicing features that are lost when data are collapsed to subject-level vectors.
- domain assumption: The held-out test set is representative of real-world variability and free of distribution shift relative to the training data.
Reference graph
Works this paper leans on
- [1] Introduction: "Vocal hyperfunction (VH) is among the most common causes of voice disorders, affecting an estimated 7.6% of the U.S. adult population at any given time [1]. VH manifests in two clinically distinct forms: phonotraumatic VH (PVH), characterized by excessive laryngeal forces that lead to structural vocal fold lesions such as nodules and polyp..."
- [2] "...extended this with machine learning on distributional features for PVH classification. For NPVH, Van Stan et al. [4] found that CPP mean and H1–H2 mode maximally differentiated NPVH from controls (AUC 0.78), reflecting a pathophysiological continuum of inefficient phonation. More recently, Cheema et al. [15] demonstrated that relative fundamenta..."
- [3] Methods 2.1 (Data and Preprocessing): "The NeckVibe Challenge dataset [18] comprises week-long neck-surface accelerometer recordings from 582 subjects collected using a smartphone-based ambulatory voice monitor [9, 10]. Each raw .mat file contains frame-level features at 50 ms resolution for a single day-long recording (typically 10+ hours). We apply the voic..."
- [4] Results 3.1 (Individual and Ensemble Performance): "Table 1 presents the mean out-of-fold (OOF) validation AUC for each model and the optimized ensemble on both classification tasks, alongside the held-out test set results from the NeckVibe Challenge leaderboard. The optimized ensemble weights for PVH are: CNN-MIL 0.45, XGBoost 0.35, LightGBM 0.20. For N..."
- [5] Discussion: "Voice is increasingly recognized as a biomarker where temporal dynamics, not just static acoustic snapshots, carry diagnostic information across conditions from Parkinson's disease to depression [23]. Yet ambulatory VH detection has previously relied exclusively on collapsing days of vocal data into fixed-length summary vectors. Our results ..."
- [6] Generative AI Use Disclosure: "GenAI was used only to assist with editing and polishing the manuscript text. All scientific content, experimental design, code, data analysis, interpretation of results, intellectual contributions and manuscript drafts are solely the work of the authors. The authors reviewed and take full responsibility for the final cont..."
- [7] N. Bhattacharyya, "The prevalence of voice problems among adults in the United States," Laryngoscope, vol. 124, no. 10, pp. 2359–2362, May 2014.
- [8] R. E. Hillman, C. E. Stepp, J. H. Van Stan, M. Zañartu, and D. D. Mehta, "An updated theoretical framework for vocal hyperfunction," American Journal of Speech-Language Pathology, vol. 29, no. 4, pp. 2254–2260, 2020. [Online]. Available: https://pubs.asha.org/doi/abs/10.1044/2020_AJSLP-20-00104
- [9] J. Oates and A. Winkworth, "Current knowledge, controversies and future directions in hyperfunctional voice disorders," International Journal of Speech-Language Pathology, vol. 10, no. 4, pp. 267–277, 2008. [Online]. Available: https://doi.org/10.1080/17549500802140153
- [10] J. H. Van Stan, A. J. Ortiz, J. P. Cortes, K. L. Marks, L. E. Toles, D. D. Mehta, J. A. Burns, T. Hron, T. Stadelman-Cohen, C. Krusemark, J. Muise, A. B. Fox-Galalis, C. Nudelman, S. Zeitels, and R. E. Hillman, "Differences in daily voice use measures between female patients with nonphonotraumatic vocal hyperfunction and matched controls," J. Speech Lang..., 2021.
- [11] A. I. Gillespie, J. Gartner-Schmidt, E. N. Rubinstein, and K. V. Abbott, "Aerodynamic profiles of women with muscle tension dysphonia/aphonia," J. Speech Lang. Hear. Res., vol. 56, no. 2, pp. 481–488, Apr. 2013.
- [12] Z. Zhu, J. H. Van Stan, H. Ghasemzadeh, A. J. Cheema, J. Wolfberg, R. E. Hillman, A. B. Fox, and D. D. Mehta, "Simplified vocal efficiency metrics normalize following voice therapy in subgroups of patients with nonphonotraumatic vocal hyperfunction," Am. J. Speech. Lang. Pathol., vol. 34, no. 5, pp. 2846–2863, Sep. 2025.
- [13] D. D. Mehta, J. H. Van Stan, M. Zañartu, M. Ghassemi, J. V. Guttag, V. M. Espinoza, J. P. Cortés, H. A. Cheyne, 2nd, and R. E. Hillman, "Using ambulatory voice monitoring to investigate common voice disorders: Research update," Front Bioeng Biotechnol, vol. 3, p. 155, Oct. 2015.
- [14] E. J. Hunter, L. C. Cantor-Cutiva, E. van Leer, M. van Mersbergen, C. D. Nanjundeswaran, P. Bottalico, M. J. Sandage, and S. Whitling, "Toward a consensus description of vocal effort, vocal load, vocal loading, and vocal fatigue," Journal of Speech, Language, and Hearing Research, vol. 63, no. 2, pp. 509–532.
- [15] [Online]. Available: https://pubs.asha.org/doi/abs/10.1044/2019_JSLHR-19-00057 (2019)
- [16] J. P. Cortés, V. M. Espinoza, M. Ghassemi, D. D. Mehta, J. H. Van Stan, R. E. Hillman, J. V. Guttag, and M. Zañartu, "Ambulatory assessment of phonotraumatic vocal hyperfunction using glottal airflow measures estimated from neck-surface acceleration," PLoS One, vol. 13, no. 12, p. e0209017, 2018.
- [17] D. D. Mehta, M. Zañartu, S. W. Feng, H. A. Cheyne, 2nd, and R. E. Hillman, "Mobile voice health monitoring using a wearable accelerometer sensor and a smartphone platform," IEEE Trans. Biomed. Eng., vol. 59, no. 11, pp. 3090–3096, Nov. 2012.
- [18] M. Zañartu, J. C. Ho, D. D. Mehta, R. E. Hillman, and G. R. Wodicka, "Subglottal impedance-based inverse filtering of voiced sounds using neck surface acceleration," IEEE Trans. Audio Speech Lang. Process., vol. 21, no. 9, pp. 1929–1941, 2013.
- [19] J. H. Van Stan, D. D. Mehta, A. J. Ortiz, J. A. Burns, L. E. Toles, K. L. Marks, M. Vangel, T. Hron, S. Zeitels, and R. E. Hillman, "Differences in weeklong ambulatory vocal behavior between female patients with phonotraumatic lesions and matched controls," J. Speech Lang. Hear. Res., vol. 63, no. 2, pp. 372–384, Feb. 2020.
- [20] P. S. Popolo, J. G. Svec, and I. R. Titze, "Adaptation of a pocket PC for use as a wearable voice dosimeter," J. Speech Lang. Hear. Res., vol. 48, no. 4, pp. 780–791, Aug. 2005.
- [21] M. Ghassemi, J. H. Van Stan, D. D. Mehta, M. Zañartu, H. A. Cheyne, 2nd, R. E. Hillman, and J. V. Guttag, "Learning to detect vocal hyperfunction from ambulatory neck-surface acceleration features: Initial results for vocal fold nodules," IEEE Trans. Biomed. Eng., vol. 61, no. 6, pp. 1668–1675, 2014.
- [22] A. J. Cheema, K. L. Marks, H. Ghasemzadeh, J. H. Van Stan, R. E. Hillman, and D. D. Mehta, "Characterizing vocal hyperfunction using ecological momentary assessment of relative fundamental frequency," J. Voice, 2024, in press.
- [23] E. S. Heller Murray, Y.-A. S. Lien, J. H. Van Stan, D. D. Mehta, R. E. Hillman, J. Pieter Noordzij, and C. E. Stepp, "Relative fundamental frequency distinguishes between phonotraumatic and non-phonotraumatic vocal hyperfunction," J Speech Lang Hear Res, vol. 60, no. 6, pp. 1507–1515, Jun. 2017.
- [24] C. E. Stepp, D. E. Sawin, and T. L. Eadie, "The relationship between perception of vocal effort and relative fundamental frequency during voicing offset and onset," J Speech Lang Hear Res, vol. 55, no. 6, pp. 1887–1896, May 2012.
- [25] NeckVibe Challenge Organizers, "NeckVibe Challenge: Voice disorder detection via real-world monitoring of neck-surface vibration," Interspeech 2026 Challenge, 2026. [Online]. Available: https://neckvibe.org
- [26] R. R. Patel, S. N. Awan, J. Barkmeier-Kraemer, M. Courey, D. Deliyski, T. Eadie, D. Paul, J. G. Švec, and R. Hillman, "Recommended protocols for instrumental assessment of voice: American Speech-Language-Hearing Association expert panel to develop a protocol for instrumental assessment of vocal function," American Journal of Speech-Language Pathology, vol...
- [27] J. H. Van Stan, M. Maffei, M. L. V. Masson, D. D. Mehta, J. A. Burns, and R. E. Hillman, "Self-ratings of vocal status in daily life: Reliability and validity for patients with vocal hyperfunction and a normative group," Am. J. Speech. Lang. Pathol., vol. 26, no. 4, pp. 1167–1177, Nov. 2017.
- [28] H. Ghasemzadeh, R. E. Hillman, and D. D. Mehta, "Toward generalizable machine learning models in speech, language, and hearing sciences: Estimating sample size and reducing overfitting," J. Speech Lang. Hear. Res., vol. 67, no. 3, pp. 753–781, Mar. 2024.
- [29] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
- [30] L. Verde, G. De Pietro, and G. Sannino, "Exploring the use of artificial intelligence techniques to detect the presence of coronavirus COVID-19 through speech and voice analysis," IEEE Access, vol. 9, pp. 65750–65757, 2021.
- [31] I. R. Titze, J. G. Svec, and P. S. Popolo, "Vocal dose measures: quantifying accumulated vibration exposure in vocal fold tissues," J. Speech Lang. Hear. Res., vol. 46, no. 4, pp. 919–932, Aug. 2003.
- [32] E. J. Hunter and I. R. Titze, "Quantifying vocal fatigue recovery: Dynamic vocal recovery trajectories after a vocal loading exercise," Annals of Otology, Rhinology & Laryngology, vol. 118, no. 6, pp. 449–460, 2009.
- [33] M. Ilse, J. M. Tomczak, and M. Welling, "Attention-based deep multiple instance learning," in Proceedings of the 35th International Conference on Machine Learning (ICML), ser. PMLR, vol. 80, 2018, pp. 2127–2136.
- [34] G. Campanella, M. G. Hanna, L. Geneslaw, A. Miraflor, V. Werneck Krauss Silva, K. J. Busam, E. Brogi, V. E. Reuter, D. S. Klimstra, and T. J. Fuchs, "Clinical-grade computational pathology using weakly supervised deep learning on whole slide images," Nature Medicine, vol. 25, no. 8, pp. 1301–1309, 2019.
- [35] H. Xu, A. Salekin, B. J. Lau, K. M. Stankovic, and J. Bhatt, "Deep multiple instance learning for foreground speech localization in ambient audio from wearable devices," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2021, no. 1, p. 3, 2021.
- [36] V. S. McKenna, J. M. Vojtech, M. Previtera, C. L. Kendall, and K. E. Carraro, "A scoping literature review of relative fundamental frequency (RFF) in individuals with and without voice disorders," Applied Sciences, vol. 12, no. 16, 2022. [Online]. Available: https://www.mdpi.com/2076-3417/12/16/8121
- [37] D. D. Vidulejs, J. Telicko, and A. Jakovics, "Temporal convolutional networks for cough detection using raw waveforms: Reducing false positive rates with noise augmentation," in 2023 3rd International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME), 2023, pp. 1–6.
- [38] J. Wang, J. Zhou, and B. Zhang, "Voice-AttentionNet: Voice-based multi-disease detection with lightweight attention-based temporal convolutional neural network," AI (Basel), vol. 6, no. 4, p. 68, Mar. 2025.
- [39] U. Akbar, N. Kilbertus, H. Shen, K. Muandet, and B. Dai, "An analysis of causal effect estimation using outcome invariant data augmentation," in NeurIPS 2025 Workshop: Reliable ML from Unreliable Data, 2025. [Online]. Available: https://openreview.net/forum?id=yM1awzzIdv
- [40] J. Wang, J. Zhang, and L.-R. Dai, "Real-time causal spectro-temporal voice activity detection based on convolutional encoding and residual decoding," in INTERSPEECH 2023. ISCA, Aug. 2023, pp. 5062–5066.