Mixed-Precision Information Bottlenecks for On-Device Trait-State Disentanglement in Bipolar Agitation Detection
Recognition: 3 Lean theorem links
Pith reviewed 2026-05-08 19:06 UTC · model grok-4.3
The pith
Mixed-precision quantization creates an information bottleneck that separates speaker identity from agitation states in voice without adversarial training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MP-IB treats mixed-precision quantization as an information bottleneck in which an FP16 trait head with 1,024 bits encodes speaker identity while an INT4 state head with 128 bits captures agitation, creating an 8x information asymmetry that achieves trait-state separation without adversarial losses. On the Bridge2AI-Voice dataset the approach reaches a correlation of rho = 0.117 and outperforms larger models and other disentanglement baselines, while also delivering zero-shot transfer to CREMA-D and near-random identity leakage metrics.
What carries the argument
A mixed-precision information bottleneck: an FP16 trait head paired with an INT4 state head, augmented by Dynamic Precision Scheduling and Multi-Scale Temporal Fusion, whose limited capacity enforces the separation.
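The capacity arithmetic behind the claimed 8x asymmetry can be made explicit. A minimal sketch, assuming head output dimensions of 64 (trait) and 32 (state), which this summary does not state but which are consistent with the reported bit totals:

```python
# Sketch of the capacity asymmetry implied by the paper's numbers.
# The head dimensions (64 and 32) are assumptions chosen to match the
# stated totals of 1,024 and 128 bits; the summary does not give them.

def head_capacity_bits(dim: int, bits_per_value: int) -> int:
    """Hard upper bound on a head's information capacity: d * b bits."""
    return dim * bits_per_value

trait_bits = head_capacity_bits(64, 16)  # FP16 trait head -> 1024 bits
state_bits = head_capacity_bits(32, 4)   # INT4 state head -> 128 bits
print(trait_bits, state_bits, trait_bits // state_bits)  # 1024 128 8
```

Any (dim, bits) pair with the same products would give the same nominal asymmetry, which is why the referee below asks for the explicit derivation.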
If this is right
- The model runs at 23.4 ms latency with a 617 KB footprint, enabling real-time use on devices costing under 20 dollars.
- Identity leakage drops to near-random levels with an EER of 0.42 and MIA-AUC of 0.52.
- Zero-shot performance on CREMA-D reaches an AUC of 0.817.
- Correlation gains of 2.8 to 15.9 points are obtained over a 94 M-parameter WavLM-Adapter, beta VAE, and hand-crafted prosody features.
Where Pith is reading between the lines
- The same capacity-asymmetry principle could be tested for disentangling stable versus transient signals in other sensor modalities such as accelerometer or ECG data.
- If the precision schedule generalizes, training pipelines for disentanglement tasks could become simpler by removing the need for adversarial objectives.
- Deployment on wearables might extend beyond bipolar monitoring to other longitudinal health states where identity must stay hidden.
Load-bearing premise
The chosen bit widths and lack of adversarial losses are enough by themselves to produce the observed trait-state separation through capacity limits.
What would settle it
A controlled ablation in which the FP16 and INT4 heads receive identical bit widths yet still match the reported correlation and leakage results would show that the precision asymmetry is not the operative mechanism.
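Such a control could be organized as a small configuration grid; a hypothetical sketch (variable names and bit-width choices are illustrative, not from the paper):

```python
from itertools import product

# Hypothetical ablation grid: crossing trait-head and state-head bit
# widths isolates whether the FP16/INT4 asymmetry is doing the work.
trait_widths = [16, 4]        # FP16 vs. matched-low control
state_widths = [16, 8, 4, 2]  # sweep around the reported INT4

grid = [
    {"trait_bits": t, "state_bits": s, "asymmetry": t / s}
    for t, s in product(trait_widths, state_widths)
]

# Decisive cells: symmetric configurations (asymmetry == 1). If these
# still match rho = 0.117 and EER = 0.42, asymmetry is not operative.
symmetric = [c for c in grid if c["asymmetry"] == 1.0]
print(len(grid), len(symmetric))  # 8 2
```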
Original abstract
Continuous monitoring of bipolar disorder agitation via voice biomarkers requires disentangling stable speaker traits from volatile affective states on resource-constrained edge devices. We introduce MP-IB, the first framework to treat mixed-precision quantization as an information bottleneck for clinical trait-state separation. The core insight is that numerical precision itself controls capacity: an FP16 trait head (1,024 bits) encodes speaker identity, while an INT4 state head (128 bits) captures agitation, yielding 8x information asymmetry without adversarial training. We augment this with Dynamic Precision Scheduling and Multi-Scale Temporal Fusion. On Bridge2AI-Voice (N=833, 4 sessions/participant, strict speaker-independent CV), MP-IB achieves rho = 0.117 (95% CI: [0.089, 0.145], p=0.003 vs. chance), outperforming a 94M-parameter WavLM-Adapter with in-domain SSL continuation (rho = -0.042), beta-VAE disentanglement (rho = 0.089), and hand-crafted prosody (rho = 0.031) by 2.8-15.9 points absolute. Zero-shot transfer to CREMA-D achieves AUC=0.817. Identity leakage is suppressed to near-random (EER=0.42, MIA-AUC=0.52). End-to-end latency is 23.4 ms with a 617 KB footprint, enabling real-time monitoring on sub-$20 devices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MP-IB, a mixed-precision information bottleneck framework that uses quantization precision as a capacity constraint to disentangle speaker traits (via FP16 trait head, 1024 bits) from affective states (via INT4 state head, 128 bits) for on-device bipolar agitation detection from voice, without adversarial losses. Augmented by Dynamic Precision Scheduling and Multi-Scale Temporal Fusion, it reports rho=0.117 (95% CI [0.089,0.145]) on Bridge2AI-Voice (N=833, speaker-independent CV), outperforming WavLM-Adapter, beta-VAE, and prosody baselines, plus zero-shot AUC=0.817 on CREMA-D, low identity leakage (EER=0.42), and a 617KB/23.4ms footprint.
Significance. If the precision-asymmetry mechanism is shown to be causal, the work provides a lightweight alternative to adversarial disentanglement for clinical voice biomarkers, with direct implications for real-time edge deployment in mental health monitoring. The concrete metrics, CIs, and baseline comparisons are strengths, though the absence of controls leaves the core novelty unverified.
major comments (3)
- [Experimental results] Experimental results section: performance gains (rho=0.117 vs. baselines) are reported on held-out splits but without ablations on uniform-precision controls, bit-width sweeps, or removal of Dynamic Precision Scheduling/Multi-Scale Temporal Fusion. This is load-bearing for the central claim that the 8x asymmetry (FP16 vs. INT4) alone enforces trait-state separation via capacity limits.
- [Method] Method and results: no probing accuracies, mutual information estimates, or representation analyses are provided to confirm that the INT4 state head cannot encode speaker identity (as required by the information-bottleneck framing) while the FP16 trait head ignores agitation.
- [Abstract] Abstract and method: the stated capacities (FP16 trait head = 1024 bits, INT4 state head = 128 bits) are presented without explicit derivation from layer dimensions or parameter counts, making it impossible to verify that quantization strictly caps mutual information as claimed.
minor comments (2)
- The manuscript would benefit from an appendix with full training hyperparameters, optimizer settings, and exact model architectures to support reproducibility of the reported EER and AUC values.
- [Results] Figure captions and tables should explicitly state the number of runs or seeds used for the 95% CIs to clarify statistical robustness.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. The comments identify key areas where additional evidence would strengthen the central claims about the mixed-precision information bottleneck. We agree that the manuscript would benefit from further controls and analyses, and we will incorporate revisions accordingly. Below we respond point-by-point to the major comments.
Point-by-point responses
- Referee: [Experimental results] Experimental results section: performance gains (rho=0.117 vs. baselines) are reported on held-out splits but without ablations on uniform-precision controls, bit-width sweeps, or removal of Dynamic Precision Scheduling/Multi-Scale Temporal Fusion. This is load-bearing for the central claim that the 8x asymmetry (FP16 vs. INT4) alone enforces trait-state separation via capacity limits.
Authors: We acknowledge that isolating the contribution of the precision asymmetry requires explicit controls. The current experiments demonstrate consistent gains over strong baselines under speaker-independent evaluation, but we agree that uniform-precision ablations, bit-width sweeps, and component removals are necessary to substantiate the causal role of the 8x capacity constraint. In the revised manuscript we will add these ablations to the Experimental Results section, including (i) all-FP16 and all-INT4 variants, (ii) sweeps over state-head bit widths from 2 to 8 bits, and (iii) versions without Dynamic Precision Scheduling and without Multi-Scale Temporal Fusion. These additions will directly test whether the reported performance depends on the mixed-precision bottleneck. revision: yes
- Referee: [Method] Method and results: no probing accuracies, mutual information estimates, or representation analyses are provided to confirm that the INT4 state head cannot encode speaker identity (as required by the information-bottleneck framing) while the FP16 trait head ignores agitation.
Authors: We agree that direct empirical verification of the information capacities would reinforce the information-bottleneck interpretation. Although the performance metrics and low identity leakage (EER=0.42) are consistent with the intended separation, we will add the requested analyses in the revision. Specifically, we will include linear probing accuracies for speaker identity on both heads, mutual-information estimates between representations and speaker labels (where computationally feasible), and qualitative representation visualizations (e.g., t-SNE) showing trait versus state clustering. These will be presented in a new subsection of the Results to confirm that the INT4 head is capacity-limited with respect to identity while the FP16 head retains trait information. revision: yes
- Referee: [Abstract] Abstract and method: the stated capacities (FP16 trait head = 1024 bits, INT4 state head = 128 bits) are presented without explicit derivation from layer dimensions or parameter counts, making it impossible to verify that quantization strictly caps mutual information as claimed.
Authors: We apologize for the omission of the explicit derivation. The stated bit capacities follow from the output dimensionality of each head multiplied by the effective bits per value after quantization (FP16 at 16 bits, INT4 at 4 bits), adjusted for the architectural dimensions of the respective heads. In the revised Method section we will provide the full calculation, including the precise layer dimensions, the formula used to obtain 1024 bits for the trait head and 128 bits for the state head, and a brief discussion of how quantization imposes an upper bound on mutual information under the information-bottleneck framework. This will make the 8x asymmetry verifiable from the architecture description. revision: yes
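The probing analysis promised in the rebuttal can be prototyped cheaply. A minimal sketch on synthetic embeddings (the speaker counts, dimensions, noise scale, and nearest-centroid probe are illustrative stand-ins, not the paper's setup):

```python
import numpy as np

# Synthetic stand-in for head embeddings: speaker-dependent means make
# these codes identity-bearing, so a probe should beat chance easily.
# In the paper's setting, X would be FP16- or INT4-head outputs.
rng = np.random.default_rng(0)
n_speakers, per_spk, dim = 10, 50, 16

means = rng.normal(size=(n_speakers, dim))
X = np.vstack([m + 0.3 * rng.normal(size=(per_spk, dim)) for m in means])
y = np.repeat(np.arange(n_speakers), per_spk)

# Nearest-centroid probe, a simple stand-in for a linear probe.
centroids = np.vstack([X[y == k].mean(axis=0) for k in range(n_speakers)])
pred = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(1)
acc = (pred == y).mean()
print(round(float(acc), 3))  # chance level would be 1 / n_speakers = 0.1
```

For a capacity-limited INT4 head, the paper's framing predicts the same probe should land near the 0.1 chance floor rather than well above it.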
Circularity Check
No significant circularity; central claims rest on empirical evaluation on held-out data rather than self-referential definitions or fits.
Full rationale
The paper presents MP-IB as using mixed-precision quantization (FP16 trait head at 1024 bits vs. INT4 state head at 128 bits) to create an 8x information asymmetry for trait-state disentanglement without adversarial losses, augmented by Dynamic Precision Scheduling and Multi-Scale Temporal Fusion. This is validated via rho=0.117 (p=0.003) on speaker-independent CV splits of Bridge2AI-Voice (N=833) and zero-shot AUC=0.817 on CREMA-D, with identity leakage near chance (EER=0.42). No load-bearing step reduces the reported gains or the capacity-control insight to a fitted parameter renamed as prediction, a self-citation chain, or an equation that is true by construction. The performance numbers and comparisons to WavLM-Adapter, beta VAE, and prosody baselines are external and falsifiable on the stated splits; the assumption that quantization strictly caps mutual information is stated as an insight, not derived from the paper's own outputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- Trait head precision = FP16 (1,024 bits)
- State head precision = INT4 (128 bits)
axioms (1)
- Domain assumption: numerical precision directly controls the information capacity of separate model heads.
Lean theorems connected to this paper
- Cost.FunctionalEquation / Foundation.LogicAsFunctionalEquation · washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  "Theorem 3.1 (Precision-Based Information Capacity). For z ∈ {−2^(b−1), …, 2^(b−1)−1}^d with dimension d and b-bit precision: I(x; z) ≤ H(z) ≤ d·b bits."
- Foundation.DimensionForcing / Foundation.AlexanderDuality · alexander_duality_circle_linking (D=3 ⇒ 2^D = 8) (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  "an FP16 trait head (1,024 bits) encodes speaker identity, while an INT4 state head (128 bits) captures agitation, yielding 8× information asymmetry without adversarial training."
- Foundation.BranchSelection / AlphaCoordinateFixation · branch_selection / alpha_pin_under_high_calibration (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  "We introduce MP-IB, the first framework to treat mixed-precision quantization as an information bottleneck for clinical trait-state separation."
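The bound in Theorem 3.1, H(z) ≤ d·b, can be sanity-checked numerically. A minimal stdlib sketch, where the Gaussian source and clip-and-round quantizer are illustrative choices, not the paper's:

```python
import math
import random
from collections import Counter

def entropy_bits(samples):
    """Empirical Shannon entropy (in bits) of a discrete sample."""
    counts, n = Counter(samples), len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

random.seed(0)
b, d, n = 4, 8, 10_000
lo, hi = -(2 ** (b - 1)), 2 ** (b - 1) - 1  # signed b-bit range: [-8, 7]

# The sum of per-coordinate entropies upper-bounds the joint H(z), and
# each coordinate has at most 2^b symbols, so the sum cannot exceed d*b.
total = sum(
    entropy_bits([max(lo, min(hi, round(random.gauss(0, 4)))) for _ in range(n)])
    for _ in range(d)
)
print(round(total, 2), "<=", d * b)
```

With b=4 and d=8 this checks the INT4-style bound of 32 bits; whether the FP16 head is meaningfully capped by 16 bits per value is exactly the point the referee flags as unverified.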
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.