Frame-Aligned Fusion of Canary and WavLM for Non-Intrusive Intelligibility Prediction of Hearing-Aid-Processed Speech
Pith reviewed 2026-05-25 02:32 UTC · model grok-4.3
The pith
Aligning WavLM to Canary's coarser timeline and fusing the two before pooling yields the lowest error among the compared systems for non-intrusive intelligibility prediction of hearing-aid-processed speech.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Among the compared systems, the best model temporally prepares WavLM with a learnable strided convolution and fuses it with Canary on the coarser Canary timeline before pooling, reaching Eval RMSE 24.96±0.06 and Eval Corr 0.796±0.001. The reported analyses indicate that coarse local temporal correspondence before pooling is a useful inductive bias for this task.
What carries the argument
Frame-aligned fusion, in which WavLM is temporally prepared by a learnable strided convolution and aligned to Canary's coarser timeline before pooling.
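The mechanism described above can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation: the 4:1 frame-rate ratio, the feature sizes, the use of concatenation as the fusion operator, mean pooling, and the random weights standing in for learned ones are all assumptions made for the example. The names `strided_conv1d` and `fuse_then_pool` are hypothetical.

```python
import random

def strided_conv1d(frames, weights, bias, stride):
    """Valid 1-D convolution over time with a given stride.

    frames:  list of T vectors, each of length d_in
    weights: nested list indexed [out_channel][kernel_tap][in_channel]
    bias:    list of d_out floats
    Returns floor((T - kernel) / stride) + 1 vectors of length d_out.
    """
    kernel = len(weights[0])
    out = []
    for start in range(0, len(frames) - kernel + 1, stride):
        window = frames[start:start + kernel]
        out.append([
            b + sum(w * x for tap, frame in zip(taps, window)
                    for w, x in zip(tap, frame))
            for taps, b in zip(weights, bias)
        ])
    return out

def fuse_then_pool(coarse, aligned):
    """Concatenate co-timed frames, then mean-pool over time."""
    steps = min(len(coarse), len(aligned))
    fused = [coarse[t] + aligned[t] for t in range(steps)]  # list concatenation
    return [sum(f[d] for f in fused) / steps for d in range(len(fused[0]))]

rng = random.Random(0)
STRIDE = KERNEL = 4                  # assumed 4:1 frame-rate ratio, not from the paper
D_WAVLM, D_OUT, D_CANARY = 3, 2, 2   # toy feature sizes

wavlm = [[rng.gauss(0, 1) for _ in range(D_WAVLM)] for _ in range(40)]
canary = [[rng.gauss(0, 1) for _ in range(D_CANARY)] for _ in range(10)]
# Random weights stand in for the learned convolution parameters.
weights = [[[rng.gauss(0, 1) for _ in range(D_WAVLM)] for _ in range(KERNEL)]
           for _ in range(D_OUT)]
bias = [0.0] * D_OUT

wavlm_aligned = strided_conv1d(wavlm, weights, bias, STRIDE)
pooled = fuse_then_pool(canary, wavlm_aligned)
print(len(wavlm_aligned), len(pooled))
```

With these toy sizes, 40 fine-rate frames are reduced to 10 frames on the coarser timeline, so each fused frame pairs one coarse frame with a learned summary of the fine frames covering the same time span.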
If this is right
- Single-backbone baselines are outperformed by the fused model.
- Frame-aligned fusion outperforms uniform score averaging, pool-late fusion, cross-attention, and reverse alignment.
- The left/right-preserving binaural framework maintains separate channel processing through the fusion stage.
- According to the reported analyses, the gains hold across severity levels and enhancement systems.
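Why the point of interaction matters can be shown with a toy example. The sketch below is an illustration under stated assumptions, not the paper's fusion operators: each stream is reduced to one scalar per frame, and a per-frame product stands in for any nonlinear interaction. Pooling each stream before combining discards the relative timing of the two streams, while combining co-timed frames before pooling keeps it. (With a purely linear combination followed by mean pooling the two orders would coincide, so the nonlinearity is essential to the example.)

```python
def mean(xs):
    return sum(xs) / len(xs)

def pool_late(a, b):
    # Pool each stream over time first, then combine the two summaries.
    return mean(a) * mean(b)

def frame_aligned(a, b):
    # Combine co-timed frames first, then pool over time.
    return mean([x * y for x, y in zip(a, b)])

def roll(xs, k):
    # Circularly shift a sequence by k frames.
    k %= len(xs)
    return xs[-k:] + xs[:-k]

# Hypothetical one-dimensional "features": a single active frame per stream.
a = [0.0, 1.0, 0.0, 0.0]
b = [0.0, 1.0, 0.0, 0.0]
b_shifted = roll(b, 1)

print(pool_late(a, b), pool_late(a, b_shifted))
print(frame_aligned(a, b), frame_aligned(a, b_shifted))
```

The pool-late value is the same before and after the shift, whereas the frame-aligned value changes once the active frames no longer coincide.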
Where Pith is reading between the lines
- The value of coarse temporal alignment may extend to other multi-encoder setups in audio quality or intelligibility tasks.
- Applying the same fusion pattern to additional datasets or to different pretrained encoders would show whether the inductive bias generalizes.
- The approach could inform practical systems that combine multiple frozen models for real-time hearing-aid processing pipelines.
Load-bearing premise
The specific fusion architectures tested under the shared binaural framework are sufficient to identify the optimal interaction point between the two encoders.
What would settle it
Demonstrating that an untested fusion architecture or alignment strategy achieves materially lower RMSE than 24.96 on the same evaluation set would undermine the conclusion that frame-aligned fusion on the Canary timeline is optimal among the compared options.
Original abstract
Non-intrusive intelligibility prediction estimates how well hearing-impaired listeners understand hearing-aid-processed speech without a clean reference. We study this task in the 3rd Clarity Prediction Challenge using two frozen speech encoders, Canary and WavLM. The central question is not only whether complementary pretrained representations should be combined, but where their interaction should occur. We compare single-backbone baselines, uniform score averaging, pool-late fusion, cross-attention, frame-aligned fusion, and reverse alignment under a shared left/right-preserving binaural framework. Among the compared systems, the best model temporally prepares WavLM with a learnable strided convolution and fuses it with Canary on the coarser Canary timeline before pooling, reaching Eval RMSE 24.96±0.06 and Eval Corr 0.796±0.001. Severity, enhancement-system, layer-window, and temporal-shift analyses indicate that coarse local temporal correspondence before pooling is a useful inductive bias for this task.
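For readers unfamiliar with the two reported metrics, the following sketch computes them in plain Python. It assumes that "Corr" denotes the Pearson correlation coefficient and that scores lie on a 0 to 100 scale; the abstract states neither explicitly. The prediction and target values are hypothetical and are not taken from the challenge data.

```python
import math

def rmse(pred, target):
    """Root-mean-square error between predictions and targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))

def pearson(pred, target):
    """Pearson correlation coefficient between predictions and targets."""
    n = len(target)
    mean_p, mean_t = sum(pred) / n, sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)

# Hypothetical intelligibility scores on an assumed 0-100 scale.
pred = [80.0, 55.0, 30.0, 95.0]
target = [100.0, 50.0, 20.0, 90.0]

print(f"RMSE = {rmse(pred, target):.2f}, Corr = {pearson(pred, target):.3f}")
```

Lower RMSE and higher correlation are both better. The two can disagree, since RMSE penalizes bias and scale errors that correlation ignores.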
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies non-intrusive intelligibility prediction of hearing-aid-processed speech on the 3rd Clarity Prediction Challenge dataset. It compares single-encoder baselines against five fusion strategies (uniform averaging, pool-late, cross-attention, frame-aligned, reverse alignment) that combine frozen Canary and WavLM representations under a shared left/right-preserving binaural framework. The best reported system applies a learnable strided convolution to prepare WavLM features before fusing them with Canary on the coarser Canary timeline, yielding Eval RMSE 24.96±0.06 and Eval Corr 0.796±0.001. Post-hoc analyses on severity, enhancement system, layer window, and temporal shift are used to argue that coarse local temporal correspondence before pooling constitutes a useful inductive bias.
Significance. If the empirical comparisons prove robust, the work supplies concrete evidence that the temporal granularity at which pretrained encoders interact matters for intelligibility prediction. The finding that alignment to the coarser Canary timeline outperforms both late pooling and cross-attention on this task could guide future multi-encoder designs in speech quality assessment. Use of frozen backbones and an external challenge corpus are practical strengths that facilitate reproducibility.
major comments (3)
- [Experiments] Experiments section: performance is reported with ± values (e.g., RMSE 24.96±0.06) yet the text supplies no description of the training procedure, hyper-parameter search, number of random seeds, or whether the standard deviation derives from cross-validation, multiple runs, or bootstrap resampling. Without these details the central empirical claim that frame-aligned fusion is superior cannot be evaluated for statistical reliability.
- [Results] Results section (comparison of fusion variants): the claim that coarse local temporal correspondence is a useful inductive bias rests on the observation that frame-aligned fusion outperforms the other four tested strategies. Because the architecture space is limited to uniform averaging, pool-late, cross-attention, frame-aligned, and reverse alignment, it remains possible that untested combinations (e.g., multi-resolution attention or learnable alignment at WavLM native rate) could match or exceed the reported scores without enforcing the coarse Canary timeline; the inductive-bias interpretation therefore does not yet follow from the comparison alone.
- [Analyses] Analyses subsection (severity/enhancement/layer/temporal-shift): these studies are performed after model selection and therefore cannot address whether the five fusion architectures exhaustively sample the space of possible interaction points. A broader search would be required to isolate temporal correspondence as the decisive factor.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on reproducibility and the scope of our architectural comparisons. We address each point below and will revise the manuscript accordingly.
- Referee: [Experiments] Experiments section: performance is reported with ± values (e.g., RMSE 24.96±0.06) yet the text supplies no description of the training procedure, hyper-parameter search, number of random seeds, or whether the standard deviation derives from cross-validation, multiple runs, or bootstrap resampling. Without these details the central empirical claim that frame-aligned fusion is superior cannot be evaluated for statistical reliability.
Authors: We agree that these procedural details are required for assessing reliability. In the revision we will add a dedicated paragraph in the Experiments section specifying: (i) five independent training runs with distinct random seeds, (ii) the hyper-parameter search (grid over learning rate, batch size, and selected layer indices), (iii) that the reported ± values are standard deviations across the five seeds, and (iv) that all models were trained with the same optimizer schedule and early-stopping criterion on the development set.
Revision: yes
- Referee: [Results] Results section (comparison of fusion variants): the claim that coarse local temporal correspondence is a useful inductive bias rests on the observation that frame-aligned fusion outperforms the other four tested strategies. Because the architecture space is limited to uniform averaging, pool-late, cross-attention, frame-aligned, and reverse alignment, it remains possible that untested combinations (e.g., multi-resolution attention or learnable alignment at WavLM native rate) could match or exceed the reported scores without enforcing the coarse Canary timeline; the inductive-bias interpretation therefore does not yet follow from the comparison alone.
Authors: We accept that the inductive-bias interpretation is scoped to the five strategies we compared. These strategies were chosen to isolate the effect of interaction timing while keeping the binaural and pooling stages fixed. We will revise the Results and Discussion sections to present the superiority of frame-aligned fusion as an empirical finding within the tested design space and will remove or qualify any phrasing that generalizes beyond the compared architectures.
Revision: partial
- Referee: [Analyses] Analyses subsection (severity/enhancement/layer/temporal-shift): these studies are performed after model selection and therefore cannot address whether the five fusion architectures exhaustively sample the space of possible interaction points. A broader search would be required to isolate temporal correspondence as the decisive factor.
Authors: The post-selection analyses were intended only to characterize the winning model, not to prove exhaustiveness. We will add a short paragraph in the revised manuscript that explicitly states the limitation and explains the rationale for the five chosen strategies (they systematically vary the temporal granularity and direction of alignment while controlling other factors). A full enumeration of all possible multi-encoder interaction mechanisms lies outside the scope of the present study.
Revision: partial
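The seed-based reporting described in the first response (mean and standard deviation over five runs) can be sketched as follows. The per-seed RMSE values are hypothetical placeholders, not results from the paper, and the choice of the sample standard deviation (n - 1 denominator) is an assumption, since the response does not say which estimator is used.

```python
import statistics

# Hypothetical per-seed Eval RMSE values for five training runs.
seed_rmse = [24.9, 25.0, 25.1, 24.9, 25.0]

mean_rmse = statistics.mean(seed_rmse)
std_sample = statistics.stdev(seed_rmse)       # n - 1 denominator (assumed here)
std_population = statistics.pstdev(seed_rmse)  # n denominator, for comparison

print(f"RMSE {mean_rmse:.2f}±{std_sample:.2f} over {len(seed_rmse)} seeds")
```

With only five runs the two estimators differ noticeably, so stating which one is reported helps readers judge whether a gap between two systems exceeds seed-to-seed variation.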
Circularity Check
No circularity: empirical comparisons on external challenge data
The paper reports performance metrics from training and evaluating multiple fusion architectures (uniform averaging, pool-late, cross-attention, frame-aligned, reverse alignment) on the external 3rd Clarity Prediction Challenge dataset. No equations, fitted parameters, or self-citations are used to derive the reported RMSE/Corr values; the metrics are direct outputs of model evaluation on held-out data. The inductive-bias interpretation follows from comparative results rather than any definitional reduction or self-referential construction. This matches the default expectation of a non-circular empirical study.
Axiom & Free-Parameter Ledger
free parameters (1)
- learnable strided convolution weights
axioms (1)
- Domain assumption: Canary and WavLM provide complementary representations whose interaction benefits from explicit temporal alignment.