pith. machine review for the scientific record.

arxiv: 2602.10666 · v1 · submitted 2026-02-11 · 📡 eess.AS · cs.LG · cs.SD

Recognition: no theorem link

From Diet to Free Lunch: Estimating Auxiliary Signal Properties using Dynamic Pruning Masks in Speech Enhancement Networks

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 03:45 UTC · model grok-4.3

classification 📡 eess.AS · cs.LG · cs.SD
keywords speech enhancement · dynamic channel pruning · voice activity detection · noise classification · fundamental frequency · auxiliary tasks · on-device inference · pruning masks

The pith

Pruning masks from speech enhancement networks encode voice activity, noise type, and pitch estimates with up to 93 percent accuracy via simple predictors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether the internal pruning masks produced by a dynamically pruned speech enhancement model already carry enough information to solve three auxiliary audio tasks. Instead of adding separate networks for voice activity detection, noise classification, and fundamental-frequency estimation, the authors train lightweight predictors directly on the masks. These predictors reach 93 percent accuracy on VAD, 84 percent on noise class, and an R-squared of 0.86 on F0, while binary masks collapse the computation to weighted sums with almost zero overhead. The work therefore treats dynamic channel pruning both as an efficiency tool and as an implicit multi-task feature extractor. A sympathetic reader would care because the approach removes the need for extra on-device models, lowering latency, power draw, and privacy risk in audio devices.
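The mask-to-predictor pipeline is simple enough to sketch end to end. The snippet below is a toy reconstruction, not the paper's code: the mask statistics are invented (a few channels assumed to co-fire with speech), and the predictor is a plain logistic regression fit on the frozen masks, standing in for the paper's lightweight predictors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented mask statistics for illustration (real masks would come from a
# trained DynCP enhancement network): 32 binary channels per frame, of which
# the first 8 are assumed to co-fire with speech activity.
C, N = 32, 2000
vad = rng.integers(0, 2, size=N)                  # ground-truth voice activity
p = np.full((N, C), 0.5)                          # uninformative channels
p[:, :8] = np.where(vad[:, None] == 1, 0.9, 0.1)  # speech-correlated channels
masks = (rng.random((N, C)) < p).astype(float)

# Lightweight post-hoc predictor: logistic regression fit on the frozen
# masks; the enhancement network itself is never retrained.
w, b = np.zeros(C), 0.0
for _ in range(300):
    pred = 1.0 / (1.0 + np.exp(-(masks @ w + b)))
    grad = pred - vad
    w -= 0.1 * masks.T @ grad / N
    b -= 0.1 * grad.mean()

acc = ((masks @ w + b > 0).astype(int) == vad).mean()
print(f"VAD accuracy from masks alone: {acc:.2f}")
```

The enhancement network never sees the VAD labels; only the readout is fit, which is what lets the same masks serve several auxiliary tasks at once.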

Core claim

A DynCP model trained only for speech enhancement generates pruning masks that already contain linearly extractable information about voice activity, noise class, and fundamental frequency; simple predictors fit post hoc on those masks recover the three properties at the stated accuracies, with no retraining of or architectural changes to the enhancement network.

What carries the argument

Dynamic Channel Pruning (DynCP) masks, which adaptively disable network channels according to the current audio input and thereby serve as compact carriers of auxiliary signal properties.

If this is right

  • On-device speech-enhancement hardware can output VAD, noise class, and pitch estimates at essentially no extra cost beyond the enhancement itself.
  • Binary masks turn every auxiliary prediction into a fixed weighted sum, removing any need for additional neural layers at inference time.
  • A single training run for enhancement simultaneously supervises representations useful for three downstream tasks.
  • Deployment of context-aware audio processing becomes feasible on microcontrollers that cannot host multiple separate models.
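The second bullet is literal arithmetic. Assuming a fitted linear predictor (the weights below are made up for illustration), a binary mask reduces inference to summing the weights of the channels the pruning step left active:

```python
import numpy as np

# Fitted weights for one auxiliary task (values are made up); with a binary
# mask, inference is just the sum of weights over channels left active.
w = np.array([0.8, -0.3, 0.5, -0.6])
b = -0.2
m = np.array([1, 0, 1, 0])           # binary pruning mask for this frame

score = w[m == 1].sum() + b          # weighted sum over active channels
assert np.isclose(score, w @ m + b)  # identical to the dense dot product
print(f"{score:.1f}")                # 1.1
```

No multiply-accumulate over dense activations is needed, which is where the negligible-overhead claim comes from.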

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dynamic pruning may function as an implicit regularizer that forces the network to discover input properties shared across enhancement and auxiliary tasks.
  • The same mask-extraction technique could be tested on other time-series domains where pruning decisions are already computed, such as video or sensor data.
  • If the masks prove sufficient, future enhancement models could be deliberately trained with auxiliary objectives baked into the pruning schedule rather than added as separate heads.
  • Manufacturers could ship a single lightweight model that simultaneously cleans speech and reports acoustic context, reducing cloud calls for privacy-sensitive applications.

Load-bearing premise

The pruning masks produced by a model trained solely for speech enhancement already contain enough information about voice activity, noise class, and pitch for simple linear or weighted-sum predictors to read them out, without retraining the enhancement network itself.

What would settle it

Train a new DynCP speech-enhancement model on a different dataset or with altered pruning hyperparameters, extract its masks, and check whether any simple predictor built from those masks falls below 70 percent accuracy on VAD; if it does, the claim is falsified.
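That protocol can be phrased as a small check. Everything below is synthetic stand-in data (masks from an actually retrained DynCP model would replace it), but it shows the shape of the procedure: fit a simple readout on the new masks and compare accuracy against the 70 percent floor.

```python
import numpy as np

# Sketch of the falsification check: fit a simple linear readout on masks
# from a (hypothetically retrained) DynCP model and test the 70% VAD floor.
def vad_floor_holds(masks, vad, floor=0.70):
    X = np.hstack([masks, np.ones((len(masks), 1))])  # features + bias column
    w, *_ = np.linalg.lstsq(X, vad, rcond=None)       # least-squares readout
    acc = ((X @ w > 0.5).astype(int) == vad).mean()
    return acc >= floor

# Synthetic stand-ins: one mask set that encodes voice activity, one that
# carries no information about it.
rng = np.random.default_rng(1)
N, C = 1000, 16
vad = rng.integers(0, 2, N)
informative = (rng.random((N, C)) < np.where(vad[:, None] == 1, 0.9, 0.1)).astype(float)
uninformative = (rng.random((N, C)) < 0.5).astype(float)

print(vad_floor_holds(informative, vad))    # claim survives
print(vad_floor_holds(uninformative, vad))  # claim falsified
```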

read the original abstract

Speech Enhancement (SE) in audio devices is often supported by auxiliary modules for Voice Activity Detection (VAD), SNR estimation, or Acoustic Scene Classification to ensure robust context-aware behavior and seamless user experience. Just like SE, these tasks often employ deep learning; however, deploying additional models on-device is computationally impractical, whereas cloud-based inference would introduce additional latency and compromise privacy. Prior work on SE employed Dynamic Channel Pruning (DynCP) to reduce computation by adaptively disabling specific channels based on the current input. In this work, we investigate whether useful signal properties can be estimated from these internal pruning masks, thus removing the need for separate models. We show that simple, interpretable predictors achieve up to 93% accuracy on VAD, 84% on noise classification, and an R2 of 0.86 on F0 estimation. With binary masks, predictions reduce to weighted sums, inducing negligible overhead. Our contribution is twofold: on one hand, we examine the emergent behavior of DynCP models through the lens of downstream prediction tasks, to reveal what they are learning; on the other, we repurpose and re-propose DynCP as a holistic solution for efficient SE and simultaneous estimation of signal properties.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that dynamic pruning masks from a Dynamic Channel Pruning (DynCP) speech enhancement network trained solely for SE already encode sufficient information about auxiliary signal properties to allow simple post-hoc predictors to achieve up to 93% accuracy on VAD, 84% on noise classification, and R²=0.86 on F0 estimation. Binary masks reduce the predictors to weighted sums with negligible overhead. The contribution is presented as both revealing emergent behavior in DynCP models and offering a holistic, efficient solution for simultaneous SE and auxiliary inference without separate models.

Significance. If the empirical results hold under rigorous validation, the work could be significant for on-device audio applications by eliminating the need for dedicated auxiliary models, thereby reducing latency, power consumption, and privacy risks. It also provides a lens into the internal representations learned by pruning-based SE networks. The negligible-overhead claim for binary masks, if substantiated, strengthens the practical appeal.

major comments (2)
  1. [Abstract / Experimental Results] The manuscript states concrete performance numbers (93% VAD accuracy, 84% noise classification, R²=0.86 for F0) but supplies no information on datasets, training details for the DynCP model, the architecture or training of the simple predictors, baselines, or statistical tests. These omissions make it impossible to evaluate whether the masks truly contain the claimed information without task-specific retraining.
  2. [Results] No ablation studies, cross-validation details, or comparisons to dedicated task-specific models are referenced, so it is unclear whether the reported accuracies are load-bearing for the central claim or could be achieved by simpler input features unrelated to the pruning masks.
minor comments (2)
  1. [Method] Clarify the precise mathematical form of the 'weighted sum' prediction for binary masks and how it induces negligible overhead (e.g., any reference to a specific equation or complexity analysis).
  2. [Abstract] The abstract would benefit from a single sentence summarizing the evaluation conditions or dataset characteristics to ground the numerical claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects of experimental clarity and validation that we will address through targeted revisions to improve reproducibility and strengthen the central claims.

read point-by-point responses
  1. Referee: [Abstract / Experimental Results] The manuscript states concrete performance numbers (93% VAD accuracy, 84% noise classification, R²=0.86 for F0) but supplies no information on datasets, training details for the DynCP model, the architecture or training of the simple predictors, baselines, or statistical tests. These omissions make it impossible to evaluate whether the masks truly contain the claimed information without task-specific retraining.

    Authors: We agree that the presentation of experimental details must be expanded for full reproducibility and to allow independent evaluation of the information content in the pruning masks. While the manuscript describes the overall setup, we will revise the Experimental Results section to add explicit subsections covering: the datasets used for DynCP training and evaluation, the full training procedure and hyperparameters for the DynCP model, the architecture (linear classifiers/regressors) and training protocol for the post-hoc predictors, the baselines employed, and the statistical validation methods (including cross-validation folds and significance testing). These additions will clarify that the predictors are trained post-hoc on fixed DynCP masks without retraining the enhancement network itself. revision: yes

  2. Referee: [Results] No ablation studies, cross-validation details, or comparisons to dedicated task-specific models are referenced, so it is unclear whether the reported accuracies are load-bearing for the central claim or could be achieved by simpler input features unrelated to the pruning masks.

    Authors: We acknowledge the value of these controls for isolating the contribution of the pruning masks. We will add ablation experiments that directly compare predictor performance when using the DynCP masks versus raw waveform or spectrogram features as input. The revised Results section will also include the cross-validation protocol and quantitative comparisons against separately trained task-specific models for VAD, noise classification, and F0 estimation. These studies will demonstrate that the reported performance levels are specifically enabled by information encoded in the masks rather than generic input statistics. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper trains a DynCP speech enhancement model solely on the SE objective, extracts the resulting pruning masks as a byproduct, and then trains separate, simple post-hoc predictors on those masks to estimate VAD, noise class, and F0. This is an empirical demonstration that the masks encode auxiliary information; the auxiliary predictors are trained independently on held-out data and do not feed back into the SE loss or architecture. No derivation reduces a claimed prediction to a fitted parameter by construction, no self-citation is load-bearing for the central result, and no ansatz or uniqueness theorem is invoked to force the outcome. The reported accuracies (93% VAD, 84% noise, R²=0.86 F0) are presented as direct experimental evidence rather than tautological outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that pruning masks encode auxiliary signal properties; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Dynamic channel pruning masks produced by an SE-trained model contain extractable information about VAD, noise type, and F0
    This is the load-bearing premise that allows simple predictors to achieve the reported accuracies.

pith-pipeline@v0.9.0 · 5545 in / 1305 out tokens · 64544 ms · 2026-05-16T03:45:12.429382+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 2 internal anchors

  1. [1]

    From Diet to Free Lunch: Estimating Auxiliary Signal Properties using Dynamic Pruning Masks in Speech Enhancement Networks

    INTRODUCTION Speech Enhancement (SE) is a fundamental part of many devices aimed at improving communication, collaboration, or quality of life, such as hearing aids, audio wearables, and voice-activated systems. Recent advances in deep learning (DL) have achieved state-of-the-art performance in SE. However, to ensure responsiveness, privacy, and offline o...

  2. [2]

    target” for a given signal characteristic or class label and “task

    METHODS The proposed system is illustrated in Fig. 1. At its core, our work involves estimating y_l^(t) ∀t ∈ T, where T is the set of prediction targets (summarized in Table 1) and l represents the current time step. Since the SE model considered here operates in the Short-Time Fourier Transform (STFT) domain, each time step l corresponds to a frame of win...

  3. [3]

    Artificial

    EXPERIMENTAL SETUP Data generation We use the speech utterances and noise excerpts from VoiceBank+DEMAND (VB+D) [30], combining all of its 88 English and non-English speakers. Since the speech data in VB+D is taken from VoiceBank (VB) [28], [29], we use its metadata for gender and accent labels. Noise classes are based on the noise types reported in V...

  4. [4]

    We apply a threshold τ = 0.005 to the standard deviation of the masks, resulting in C⋆ = 202 features (∼18% of all channels)

    with 0.25 target utilization and surrogate gradients. We apply a threshold τ = 0.005 to the standard deviation of the masks, resulting in C⋆ = 202 features (∼18% of all channels). Alternative features We included the noisy input STFT log-magnitude and the predicted suppression mask M̂ as baselines; both comprise 257 features. As additional experiments, we also...

  5. [5]

    3 compares predictors trained on features from DynCP against baseline data

    RESULTS Fig. 3 compares predictors trained on features from DynCP against baseline data. Regular binary masks (•) outperform both baselines across most classification and regression tasks, with the 64 most informative features (•) retaining the same performance as the full set (•), suggesting that auxiliary models can be very compact. Conversely, onl...

  6. [6]

    CONCLUSION We showed that the binary pruning masks learned by a DynCP SE model expose linearly accessible information about speech and acoustic properties of its input, hinting at local competition inside the model [27]. With as few as 64 features, we achieve 93 % accuracy on VAD and 59 % on noise classification; when predicting input PESQ and SI-SDR, ...

  7. [7]

    A Perceptually-Motivated Approach for Low-Complexity, Real-Time Enhancement of Fullband Speech,

    J.-M. Valin, U. Isik, N. Phansalkar, R. Giri, K. Helwani, and A. Krishnaswamy, “A Perceptually-Motivated Approach for Low-Complexity, Real-Time Enhancement of Fullband Speech,” in Interspeech 2020, Oct. 2020

  8. [8]

    PercepNet+: A Phase and SNR Aware PercepNet for Real-Time Speech Enhancement,

    X. Ge, J. Han, Y. Long, and H. Guan, “PercepNet+: A Phase and SNR Aware PercepNet for Real-Time Speech Enhancement,” in Interspeech 2022, Sep. 2022, arXiv:2203.02263 [eess]

  9. [9]

    Fullsubnet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement,

    X. Hao, X. Su, R. Horaud, and X. Li, “Fullsubnet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun. 2021

  10. [10]

    GTCRN: A Speech Enhancement Model Requiring Ultralow Computational Resources,

    X. Rong, T. Sun, X. Zhang, Y. Hu, C. Zhu, and J. Lu, “GTCRN: A Speech Enhancement Model Requiring Ultralow Computational Resources,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2024

  11. [11]

    DCCRN+: Channel-Wise Sub-band DCCRN with SNR Estimation for Speech Enhancement,

    S. Lv, Y. Hu, S. Zhang, and L. Xie, “DCCRN+: Channel-Wise Sub-band DCCRN with SNR Estimation for Speech Enhancement,” in Interspeech 2021, Aug. 2021, arXiv:2106.08672 [eess]

  12. [12]

    DeepFilterNet: Perceptually Motivated Real-Time Speech Enhancement,

    H. Schröter, A. N. Escalante-B, T. Rosenkranz, and A. Maier, “DeepFilterNet: Perceptually Motivated Real-Time Speech Enhancement,” in 24th Annual Conference of the International Speech Communication Association, Interspeech 2023, Dublin, Ireland, August 20-24, 2023, 2023, arXiv:2305.08227 [eess]

  13. [13]

    VSANet: Real-time Speech Enhancement Based on Voice Activity Detection and Causal Spatial Attention

    Y. Zhang, H. Zou, and J. Zhu, “VSANet: Real-time Speech Enhancement Based on Voice Activity Detection and Causal Spatial Attention,” arXiv:2310.07295 [eess], pre-published

  14. [14]

    TGIF: Talker Group-Informed Familiarization of Target Speaker Extraction

    T.-A. Hsieh and M. Kim, “TGIF: Talker Group-Informed Familiarization of Target Speaker Extraction,” arXiv:2507.14044 [eess], pre-published

  15. [15]

    Sparse Mixture of Local Experts for Efficient Speech Enhancement,

    A. Sivaraman and M. Kim, “Sparse Mixture of Local Experts for Efficient Speech Enhancement,” in Interspeech 2020, Oct. 2020

  16. [16]

    Speech Enhancement with Zero-Shot Model Selection,

    R. E. Zezario, C.-S. Fuh, H.-M. Wang, and Y. Tsao, “Speech Enhancement with Zero-Shot Model Selection,” in 2021 29th European Signal Processing Conference (EUSIPCO), Aug. 2021

  17. [17]

    DNSMOS Pro: A Reduced-Size DNN for Probabilistic MOS of Speech,

    F. Cumlin, X. Liang, V. Ungureanu, C. K. A. Reddy, C. Schüldt, and S. Chatterjee, “DNSMOS Pro: A Reduced-Size DNN for Probabilistic MOS of Speech,” in Interspeech 2024, Sep. 2024

  18. [18]

    Resource-Efficient Speech Quality Prediction through Quantization Aware Training and Binary Activation Maps,

    M. Nilsson, R. Miccini, C. Laroche, T. Piechowiak, and F. Zenke, “Resource-Efficient Speech Quality Prediction through Quantization Aware Training and Binary Activation Maps,” in Interspeech 2024, Sep. 2024, arXiv:2407.04578 [cs, eess]

  19. [19]

    Efficient Streaming Speech Quality Prediction with Spiking Neural Networks,

    M. Nilsson, R. Miccini, J. Rossbroich, C. Laroche, T. Piechowiak, and F. Zenke, “Efficient Streaming Speech Quality Prediction with Spiking Neural Networks,” in Interspeech 2025, Aug. 2025

  20. [20]

    Dynamic Neural Networks: A Survey,

    Y. Han, G. Huang, S. Song, L. Yang, H. Wang, and Y. Wang, “Dynamic Neural Networks: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Nov. 2022

  21. [21]

    Adaptive Slimming for Scalable and Efficient Speech Enhancement,

    R. Miccini, M. Kim, C. Laroche, L. Pezzarossa, and P. Smaragdis, “Adaptive Slimming for Scalable and Efficient Speech Enhancement,” in 2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 2025, arXiv:2507.04879 [eess]

  22. [22]

    Dynamic Slimmable Networks for Efficient Speech Separation

    M. Elminshawi, S. R. Chetupalli, and E. A. P. Habets, “Dynamic Slimmable Networks for Efficient Speech Separation,” arXiv:2507.06179 [eess], pre-published

  23. [23]

    Knowing When to Quit: Probabilistic Early Exits for Speech Separation

    K. F. Olsen et al., “Knowing When to Quit: Probabilistic Early Exits for Speech Separation,” arXiv:2507.09768 [cs], pre-published

  24. [24]

    PeakRNN and StatsRNN: Dynamic Pruning in Recurrent Neural Networks,

    Z. Jelčicová, R. Jones, D. T. Blix, M. Verhelst, and J. Sparsø, “PeakRNN and StatsRNN: Dynamic Pruning in Recurrent Neural Networks,” in 2021 29th European Signal Processing Conference (EUSIPCO), Aug. 2021

  25. [25]

    Dynamic gated recurrent neural network for compute-efficient speech enhancement,

    L. Cheng, A. Pandey, B. Xu, T. Delbruck, and S.-C. Liu, “Dynamic gated recurrent neural network for compute-efficient speech enhancement,” in Interspeech 2024, Sep. 2024

  26. [26]

    Scalable Speech Enhancement with Dynamic Channel Pruning,

    R. Miccini, C. Laroche, T. Piechowiak, and L. Pezzarossa, “Scalable Speech Enhancement with Dynamic Channel Pruning,” in 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2025

  27. [27]

    Impairments are Clustered in Latents of Deep Neural Network-based Speech Quality Models,

    F. Cumlin, X. Liang, V. Ungureanu, C. K. A. Reddy, C. Schüldt, and S. Chatterjee, “Impairments are Clustered in Latents of Deep Neural Network-based Speech Quality Models,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2025

  28. [28]

    Investigating the Sensitivity of Pre-trained Audio Embeddings to Common Effects,

    V. Deng, C. Wang, G. Richard, and B. McFee, “Investigating the Sensitivity of Pre-trained Audio Embeddings to Common Effects,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2025

  29. [29]

    Torchaudio-Squim: Reference-Less Speech Quality and Intelligibility Measures in Torchaudio,

    A. Kumar et al., “Torchaudio-Squim: Reference-Less Speech Quality and Intelligibility Measures in Torchaudio,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun. 2023

  30. [30]

    An End-To-End Non-Intrusive Model for Subjective and Objective Real-World Speech Assessment Using a Multi-Task Framework,

    Z. Zhang, P. Vyas, X. Dong, and D. S. Williamson, “An End-To-End Non-Intrusive Model for Subjective and Objective Real-World Speech Assessment Using a Multi-Task Framework,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun. 2021

  31. [31]

    Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features,

    R. E. Zezario, S.-W. Fu, F. Chen, C.-S. Fuh, H.-M. Wang, and Y. Tsao, “Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, arXiv:2111.02363 [cs, eess]

  32. [32]

    Speech Enhancement Aided End-To-End Multi-Task Learning for Voice Activity Detection,

    X. Tan and X.-L. Zhang, “Speech Enhancement Aided End-To-End Multi-Task Learning for Voice Activity Detection,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun. 2021

  33. [33]

    Understanding Locally Competitive Networks

    R. K. Srivastava, J. Masci, F. J. Gomez, and J. Schmidhuber, “Understanding Locally Competitive Networks,” in 3rd International Conference on Learning Representations, ICLR 2015, May 2015, arXiv:1410.1165 [cs]

  34. [34]

    The voice bank corpus: Design, collection and data analysis of a large regional accent speech database,

    C. Veaux, J. Yamagishi, and S. King, “The voice bank corpus: Design, collection and data analysis of a large regional accent speech database,” in 2013 International Conference Oriental COCOSDA Held Jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), Nov. 2013

  35. [35]

    CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit

    J. Yamagishi, C. Veaux, and K. MacDonald, CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit, version 0.92, Nov. 2019

  36. [36]

    Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech,

    C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech,” in 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9), Sep. 2016

  37. [37]

    Demand: A Collection Of Multi-Channel Recordings Of Acoustic Noise In Diverse Environments

    J. Thiemann, N. Ito, and E. Vincent, Demand: A Collection Of Multi-Channel Recordings Of Acoustic Noise In Diverse Environments, Jun. 2013

  38. [38]

    SDR – Half-baked or Well Done?

    J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – Half-baked or Well Done?” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019

  39. [39]

    Perceptual Evaluation of Speech Quality (PESQ) - a New Method for Speech Quality Assessment of Telephone Networks and Codecs,

    A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual Evaluation of Speech Quality (PESQ) - a New Method for Speech Quality Assessment of Telephone Networks and Codecs,” in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, May 2001

  40. [40]

    WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications,

    M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications,” IEICE Transactions on Information and Systems, 2016

  41. [41]

    Front-End Factor Analysis for Speaker Verification,

    N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-End Factor Analysis for Speaker Verification,” IEEE Transactions on Audio, Speech, and Language Processing, May 2011