
arxiv: 2605.23604 · v1 · pith:VA4NP7GDnew · submitted 2026-05-22 · 📡 eess.AS · cs.SD

Word-Level Modeling with Alignment-Aware Acoustic Fusion for Text-Assisted Intelligibility Prediction in Listeners with Hearing Loss

Pith reviewed 2026-05-25 02:34 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords speech intelligibility prediction · hearing loss · word-level modeling · acoustic fusion · Whisper model · text-assisted prediction · CPC3

The pith

Word-level correctness modeling with alignment-aware acoustic fusion improves text-assisted intelligibility prediction for hearing-impaired listeners.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats sentence intelligibility as the average of predicted word recognition outcomes rather than a direct sentence-level judgment. It runs a frozen Whisper encoder on the degraded speech and pairs it with a teacher-forced decoder that sees the canonical transcript, then adds a word-aligned local acoustic branch via character-level cross-attention and an utterance-level global acoustic branch. On the official evaluation set, joint fusion raises correlation from 0.795 to 0.806 and lowers RMSE from 24.92 to 24.39; the paper also reports incorrect-word F1 of 0.778 and MCC of 0.626 for the fused model. The paper attributes the gain to finer prediction granularity combined with the acoustic additions, and a similar pattern appears when the same fusion is applied to the Whisper medium model.

Core claim

The paper claims that reference-conditioned word-level correctness modeling, built around a teacher-forced decoder on the canonical transcript and augmented by word-aligned local acoustic features from character-level cross-attention plus an utterance-level global acoustic branch, yields more accurate sentence intelligibility estimates than the decoder baseline alone.

What carries the argument

Reference-conditioned word-level correctness modeling with character-level cross-attention alignment for acoustic fusion.
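
One way to picture the alignment-aware step is sketched below. It rests on assumptions the summary does not confirm: character-level cross-attention weights over encoder frames are summed within each word, renormalized, and used to pool frame-level acoustic states into one vector per word. The function name `pool_word_acoustics`, the sum-and-renormalize rule, and the toy shapes are hypothetical and are not taken from the paper.

```python
from typing import List


def pool_word_acoustics(
    char_attn: List[List[float]],  # [num_chars][num_frames] attention weights
    char_to_word: List[int],       # word index of each character
    frames: List[List[float]],     # [num_frames][dim] encoder states
) -> List[List[float]]:
    """Pool frame-level states into one vector per word (illustrative only)."""
    num_words = max(char_to_word) + 1
    num_frames = len(frames)
    dim = len(frames[0])

    # Accumulate each word's character attention rows over frames.
    word_attn = [[0.0] * num_frames for _ in range(num_words)]
    for row, w in zip(char_attn, char_to_word):
        for t, a in enumerate(row):
            word_attn[w][t] += a

    pooled = []
    for weights in word_attn:
        total = sum(weights)
        # Renormalize so the weights sum to 1; fall back to uniform weights.
        if total > 0:
            weights = [a / total for a in weights]
        else:
            weights = [1.0 / num_frames] * num_frames
        pooled.append(
            [sum(weights[t] * frames[t][d] for t in range(num_frames))
             for d in range(dim)]
        )
    return pooled


# Toy input: 3 characters forming 2 words, 4 frames, 2-dimensional states.
char_attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
char_to_word = [0, 0, 1]
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
word_vectors = pool_word_acoustics(char_attn, char_to_word, frames)
```

In this toy case the first word attends only to the first two frames and the second word only to the last two, so each word vector is the mean of its own frames.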

If this is right

  • Sentence intelligibility follows directly from averaging word correctness probabilities obtained under reference conditioning.
  • With the added acoustic branches, incorrect-word detection reaches F1 0.778 and MCC 0.626 on the evaluation set.
  • The same fusion pattern produces gains when the underlying model is switched to Whisper medium.
  • Prediction granularity at the word level plus alignment-aware fusion together outperform the transcript-conditioned decoder baseline.
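
For readers unfamiliar with the two word-level metrics quoted in the list above, the sketch below computes F1 and the Matthews correlation coefficient (MCC) from binary labels. It assumes that "incorrect word" is the positive class (label 1); that convention, the function name, and the toy labels are illustrative and do not reproduce the paper's evaluation code.

```python
import math
from typing import Sequence, Tuple


def incorrect_word_f1_mcc(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """F1 and MCC with label 1 (assumed: word misrecognized) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0

    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return f1, mcc


# Toy labels: 1 = listener got the word wrong, 0 = listener got it right.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
f1, mcc = incorrect_word_f1_mcc(y_true, y_pred)
```

Unlike F1, MCC also counts true negatives, which makes it less sensitive to the balance between correct and incorrect words.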

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The alignment step could be reused in other tasks that combine transcripts with noisy audio for per-word analysis.
  • Word-level scores might allow targeted feedback in hearing-aid fitting or communication training focused on difficult words.
  • If the averaging step holds across conditions, the same pipeline could support real-time monitoring of expected intelligibility in changing acoustic environments.

Load-bearing premise

Averaging predicted word-level correctness probabilities over valid reference words produces an accurate sentence-level intelligibility percentage.
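
The premise can be written down in a few lines. The sketch below assumes a boolean mask marks which reference words count as valid; the summary does not say how validity is decided, so the mask, the function name, and the example probabilities are hypothetical.

```python
from typing import Sequence


def sentence_intelligibility(
    word_probs: Sequence[float], valid: Sequence[bool]
) -> float:
    """Mean predicted correctness over valid reference words, in percent."""
    kept = [p for p, v in zip(word_probs, valid) if v]
    if not kept:
        raise ValueError("no valid reference words in this sentence")
    return 100.0 * sum(kept) / len(kept)


# Four reference words; the last is excluded by the (hypothetical) mask.
probs = [0.9, 0.2, 0.7, 0.5]
valid = [True, True, True, False]
score = sentence_intelligibility(probs, valid)  # about 60 percent
```

Because the sentence score is a plain mean, any systematic bias in the per-word probabilities carries straight through to the sentence-level percentage.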

What would settle it

Collect new listener data on the same sentences and test whether the model's averaged word correctness probabilities match the actual percentage of words correctly identified by hearing-impaired participants.
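
Such a comparison would typically be scored with the same two sentence-level metrics the paper reports. The sketch below computes RMSE and a correlation coefficient between predicted and measured percentages; it assumes the reported "correlation" is Pearson's, which the summary does not state, and the numbers are made up for illustration.

```python
import math
from typing import Sequence


def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Root-mean-square error between predictions and targets."""
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    )


def pearson(pred: Sequence[float], target: Sequence[float]) -> float:
    """Pearson correlation coefficient (assumed form of 'correlation')."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


# Illustrative percentages only; not data from the paper.
predicted = [60.0, 80.0, 20.0, 100.0]
measured = [50.0, 100.0, 25.0, 75.0]
error = rmse(predicted, measured)
corr = pearson(predicted, measured)
```

RMSE is in percentage points and penalizes calibration errors, while correlation only measures how well the ranking and linear trend are captured, so the two can move independently.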

Figures

Figures reproduced from arXiv: 2605.23604 by Kazushi Nakazawa.

Figure 1: Motivation of reference-conditioned word-level intelligibility predic…
Figure 2: Overview of alignment-aware multi-granular acoustic fusion. The frozen Whisper encoder provides frame-level acoustic states and an utterance-level…
Original abstract

We address text-assisted speech intelligibility prediction for hearing-impaired listeners in CPC3. Although the target is a sentence-level percentage, it is determined by reference-word recognition outcomes. We formulate prediction as reference-conditioned word-level correctness modeling: a frozen Whisper encoder analyzes degraded speech, a teacher-forced decoder conditions on the canonical transcript, and sentence intelligibility is obtained by averaging predicted correctness probabilities over valid reference words. To complement transcript-conditioned decoder states, we add a word-aligned local acoustic branch based on character-level cross-attention alignment and an utterance-level global acoustic branch for calibration. On the official evaluation set, the decoder baseline obtains RMSE 24.92 and correlation 0.795, while joint fusion improves to incorrect-word F1 0.778, MCC 0.626, correlation 0.806, and RMSE 24.39. A similar trend with Whisper medium suggests that the gain comes from prediction granularity and alignment-aware fusion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper addresses text-assisted speech intelligibility prediction for hearing-impaired listeners on the CPC3 task. It formulates the problem as reference-conditioned word-level correctness modeling: a frozen Whisper encoder processes degraded speech while a teacher-forced decoder conditions on the canonical transcript; sentence-level intelligibility is obtained by averaging predicted word correctness probabilities. An alignment-aware local acoustic branch (via character-level cross-attention) and an utterance-level global acoustic branch are added and fused with the decoder states. On the official evaluation set the joint-fusion model improves over the decoder baseline (RMSE 24.92 / correlation 0.795) to RMSE 24.39 / correlation 0.806 together with incorrect-word F1 0.778 and MCC 0.626; a similar trend is noted with Whisper-medium.
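
To make the fusion step in this summary concrete, here is a deliberately simple stand-in: each word's decoder state, its local acoustic vector, and the shared utterance-level vector are concatenated and passed through a single logistic layer. The paper's actual fusion mechanism is not described in the abstract, so the concatenation, the weights, and the function name `fuse_and_score` are assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def fuse_and_score(
    decoder_states: Sequence[Sequence[float]],  # one vector per word
    local_acoustic: Sequence[Sequence[float]],  # one vector per word
    global_acoustic: Sequence[float],           # one vector per utterance
    weights: Sequence[float],
    bias: float,
) -> List[float]:
    """Per-word correctness probabilities from concatenated features."""
    probs = []
    for dec, loc in zip(decoder_states, local_acoustic):
        # The same utterance-level vector is appended to every word.
        features = list(dec) + list(loc) + list(global_acoustic)
        if len(features) != len(weights):
            raise ValueError("weight vector does not match feature size")
        logit = bias + sum(w * x for w, x in zip(weights, features))
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs


# Toy inputs: 2 words, 2-dim decoder states, 1-dim local and global vectors.
decoder_states = [[1.0, 0.0], [0.0, 1.0]]
local_acoustic = [[0.5], [-0.5]]
global_acoustic = [1.0]
weights = [1.0, -1.0, 2.0, 0.0]
word_probs = fuse_and_score(
    decoder_states, local_acoustic, global_acoustic, weights, bias=0.0
)
```

The per-word probabilities produced this way would then be averaged to give the sentence-level score.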

Significance. If the reported gains prove robust, the work shows that transcript-conditioned word-level modeling plus alignment-aware acoustic fusion can yield modest but consistent improvements on an external held-out set. The use of a frozen pre-trained encoder and evaluation on the official CPC3 split are strengths that support reproducibility and direct comparability.

major comments (2)
  1. [Abstract] The numerical improvements (correlation 0.795→0.806, RMSE 24.92→24.39) are presented without error bars, confidence intervals, or statistical significance tests, and no ablation results isolate the contribution of the alignment-aware fusion; these omissions are load-bearing for the central empirical claim.
  2. [Abstract] The description of the character-level cross-attention alignment provides no information on how the alignment is trained, validated, or regularized, which directly affects the claimed benefit of the word-aligned local acoustic branch.
minor comments (1)
  1. [Abstract] The abstract states that sentence intelligibility is obtained by averaging over valid reference words; a brief clarification of how “valid” words are identified would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address the two major comments point by point below.

  1. Referee: [Abstract] The numerical improvements (correlation 0.795→0.806, RMSE 24.92→24.39) are presented without error bars, confidence intervals, or statistical significance tests, and no ablation results isolate the contribution of the alignment-aware fusion; these omissions are load-bearing for the central empirical claim.

    Authors: We agree that error bars, confidence intervals, and significance testing would strengthen the central claim. In the revision we will add bootstrap-derived 95% confidence intervals for all reported metrics on the CPC3 evaluation set and include a paired bootstrap significance test for the observed improvements. We will also add an ablation table that isolates the alignment-aware local branch (character-level cross-attention) from the global acoustic branch and the decoder baseline. revision: yes

  2. Referee: [Abstract] The description of the character-level cross-attention alignment provides no information on how the alignment is trained, validated, or regularized, which directly affects the claimed benefit of the word-aligned local acoustic branch.

    Authors: The abstract is space-constrained, but Section 3.2 of the manuscript specifies that the character-level cross-attention is trained end-to-end jointly with the word-correctness objective (binary cross-entropy) using the same optimizer and learning-rate schedule as the rest of the model; no task-specific regularization is applied beyond standard dropout (p=0.1) and the frozen Whisper encoder. We will revise the abstract to include a one-sentence summary of this joint training procedure. revision: partial
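
The bootstrap analysis proposed in the first response could look roughly like the sketch below. It resamples sentences with replacement, recomputes the RMSE difference between two systems on each resample, and reads off a percentile interval. The resample count, the percentile method, and all names are assumptions; nothing here is taken from the paper or the rebuttal beyond the phrase "paired bootstrap".

```python
import math
import random
from typing import Sequence, Tuple


def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    )


def paired_bootstrap_rmse(
    pred_a: Sequence[float],
    pred_b: Sequence[float],
    target: Sequence[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float, float]:
    """Return (observed RMSE_A - RMSE_B, CI low, CI high, share of resamples
    in which system B is not better than system A)."""
    rng = random.Random(seed)
    n = len(target)
    observed = rmse(pred_a, target) - rmse(pred_b, target)
    diffs = []
    for _ in range(n_resamples):
        # Use the same resampled sentences for both systems (paired design).
        idx = [rng.randrange(n) for _ in range(n)]
        t = [target[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(rmse(a, t) - rmse(b, t))
    diffs.sort()
    low = diffs[int(0.025 * n_resamples)]
    high = diffs[int(0.975 * n_resamples) - 1]
    not_better = sum(1 for d in diffs if d <= 0) / n_resamples
    return observed, low, high, not_better


# Toy data: system B is off by 1 point everywhere, system A by 3 points.
target = [5.0 * i for i in range(20)]
pred_a = [t + 3.0 for t in target]
pred_b = [t + 1.0 for t in target]
observed, low, high, not_better = paired_bootstrap_rmse(pred_a, pred_b, target)
```

If the 95% interval for the RMSE difference excludes zero, the improvement is unlikely to be an artifact of which sentences happened to be in the evaluation set.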

Circularity Check

0 steps flagged

No significant circularity identified

The paper describes an empirical ML pipeline: a frozen Whisper encoder plus teacher-forced decoder for reference-conditioned word-level correctness prediction, augmented by alignment-aware acoustic fusion branches. Sentence-level scores are obtained by explicit averaging of per-word probabilities, which the abstract states matches the target definition (reference-word recognition outcomes). All reported gains (RMSE 24.92→24.39, correlation 0.795→0.806) are measured on the official held-out CPC3 evaluation set. No equations, self-citations, or ansatzes reduce any claimed prediction to a fitted input by construction; the derivation chain consists of standard supervised training and aggregation steps whose validity is tested externally rather than assumed tautologically.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that word-level probabilities can be averaged to recover sentence intelligibility and on the unstated assumption that the Whisper encoder-decoder states remain informative when the input audio is degraded; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Sentence intelligibility percentage is obtained by averaging predicted correctness probabilities over valid reference words
    Explicitly stated in the abstract as the method for obtaining the sentence-level target from word-level modeling.

pith-pipeline@v0.9.0 · 5696 in / 1183 out tokens · 22098 ms · 2026-05-25T02:34:38.602825+00:00 · methodology
