pith. machine review for the scientific record.

arxiv: 2604.03689 · v1 · submitted 2026-04-04 · 📡 eess.AS

Recognition: 1 theorem link · Lean Theorem

MALEFA: Multi-grAnularity Learning and Effective False Alarm Suppression for Zero-shot Keyword Spotting

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 17:16 UTC · model grok-4.3

classification 📡 eess.AS
keywords zero-shot keyword spotting · multi-granularity contrastive learning · cross-attention alignment · false alarm suppression · user-defined wake words · lightweight speech model · phoneme-level representations

The pith

MALEFA jointly learns utterance- and phoneme-level alignments via cross-attention and multi-granularity contrastive learning to enable accurate zero-shot keyword spotting with very low false alarms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MALEFA as a lightweight framework for user-defined keyword spotting that operates without any domain-specific pre-labeled training data. It combines cross-attention to align entire utterances with individual phonemes and a multi-granularity contrastive objective that pulls matching keyword representations closer while pushing dissimilar ones apart at both coarse and fine scales. This dual-level approach targets the persistent problem of acoustically similar keywords triggering false activations in real deployments. The resulting system reaches 90 percent accuracy across four public benchmarks and drives the false alarm rate down to 0.007 percent on the AMI dataset while remaining efficient enough for on-device inference. A sympathetic reader would care because reliable zero-shot spotting removes the need for repeated data collection when users define their own wake words or commands.
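The dual-level objective described above can be illustrated with a toy version of the loss. This is an editorial sketch under stated assumptions, not the paper's formulation: the InfoNCE form, the temperature of 0.07, the equal weighting `alpha`, and in-batch negative sampling are all guesses at plausible choices.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """Batch InfoNCE: row i of `positives` is the match for row i of
    `anchors`; every other row in the batch serves as a negative."""
    logits = anchors @ positives.T / temperature              # (B, B) similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))                # -log p(positive pair)

def multi_granularity_loss(utt_audio, utt_text, ph_audio, ph_text, alpha=0.5):
    """Coarse (utterance-level) and fine (phoneme-level) contrastive terms, combined."""
    return alpha * info_nce(utt_audio, utt_text) + (1 - alpha) * info_nce(ph_audio, ph_text)
```

With L2-normalised embeddings, matched pairs drive the loss toward zero while confusable pairs keep it high, which is the mechanism claimed to separate acoustically similar keywords at both scales.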

Core claim

MALEFA jointly learns utterance- and phoneme-level alignments via cross-attention and a multi-granularity contrastive learning objective, allowing the model to distinguish acoustically similar keywords in zero-shot settings and thereby achieve high accuracy while driving false alarm rates to 0.007 percent on the AMI dataset.

What carries the argument

The multi-granularity contrastive learning objective paired with cross-attention layers that produce joint utterance- and phoneme-level alignments, which together capture both global context and local phonetic distinctions without requiring labeled domain data.
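A minimal sketch of the kind of cross-attention alignment invoked here, with phoneme embeddings as queries over audio frames. This is single-head, has no learned projections, and is purely illustrative, not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(phonemes, audio_frames):
    """Each phoneme query attends over all audio frames.

    phonemes:     (P, D) text-side phoneme embeddings (queries)
    audio_frames: (T, D) audio-side frame embeddings (keys and values)
    Returns (P, D) per-phoneme audio summaries and the (P, T) alignment.
    """
    scores = phonemes @ audio_frames.T / np.sqrt(phonemes.shape[-1])
    alignment = softmax(scores, axis=-1)      # rows are soft alignments over frames
    return alignment @ audio_frames, alignment
```

The (P, T) alignment matrix is where local phonetic distinctions would surface: two confusable keywords differ in only a few rows, which a fine-grained contrastive term can then penalise.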

If this is right

  • Voice interfaces can accept arbitrary user-defined keywords without collecting new labeled utterances for each one.
  • On-device deployment becomes practical on phones and embedded hardware because the model remains lightweight.
  • False activations drop enough to make always-on listening tolerable in everyday environments.
  • Personalized wake-word systems can be updated by changing only the target keyword embedding rather than retraining the entire network.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment mechanism might extend to open-vocabulary speech recognition by treating arbitrary phrases as keyword sequences.
  • Adding a small amount of unsupervised adaptation on the target device could further tighten phoneme boundaries for a user's accent.
  • The contrastive loss at multiple granularities suggests a general template for other audio tasks that need both coarse and fine discrimination without labels.

Load-bearing premise

That joint utterance- and phoneme-level alignments learned via cross-attention and multi-granularity contrastive learning will reliably distinguish acoustically similar keywords in zero-shot settings without domain-specific pre-labeled data.

What would settle it

A new zero-shot test set of acoustically confusable keyword pairs on which accuracy falls below 90 percent or false alarm rate rises above 0.01 percent while keeping the model size and inference latency unchanged.
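For concreteness, the two thresholds in this test reduce to standard detection metrics. The sketch below uses one common convention (FAR as a percentage of negative trials); the paper may instead report false alarms per hour, so treat the exact definition as an assumption.

```python
import numpy as np

def kws_metrics(scores, labels, threshold=0.5):
    """Accuracy and false alarm rate (%) for binary keyword detection.

    scores: detection scores; labels: 1 = keyword present, 0 = absent.
    Under this convention, FAR = 0.007% means roughly 7 activations per
    100,000 non-keyword segments.
    """
    preds = (np.asarray(scores, dtype=float) >= threshold).astype(int)
    labels = np.asarray(labels)
    accuracy = float((preds == labels).mean())
    negatives = labels == 0
    far = 100.0 * float(preds[negatives].mean()) if negatives.any() else 0.0
    return accuracy, far
```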

read the original abstract

User-defined keyword spotting (KWS) without resorting to domain-specific pre-labeled training data is of fundamental importance in building adaptable and personalized voice interfaces. However, such systems are still faced with arduous challenges, including constrained computational resources and limited annotated training data. Existing methods also struggle to distinguish acoustically similar keywords, often leading to a pesky false alarm rate (FAR) in real-world deployments. To mitigate these limitations, we put forward MALEFA, a novel lightweight zero-shot KWS framework that jointly learns utterance- and phoneme-level alignments via cross-attention and a multi-granularity contrastive learning objective. Evaluations on four public benchmark datasets show that MALEFA achieves a high accuracy of 90%, significantly reducing FAR to 0.007% on the AMI dataset. Beyond its strong performance, MALEFA demonstrates high computational efficiency and can readily support real-time deployment on resource-constrained devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MALEFA, a lightweight zero-shot keyword spotting (KWS) framework that jointly learns utterance- and phoneme-level alignments through cross-attention and a multi-granularity contrastive learning objective. It evaluates the approach on four public benchmark datasets, reporting 90% accuracy and a reduction of false alarm rate (FAR) to 0.007% on the AMI dataset, while emphasizing computational efficiency for real-time deployment on resource-constrained devices.

Significance. If the zero-shot performance claims hold without pretraining data overlap and with proper baselines, the work would be significant for enabling adaptable, personalized voice interfaces that handle acoustically similar keywords with low FAR, addressing key practical limitations in existing KWS systems.

major comments (2)
  1. [Section 3] The multi-granularity contrastive learning objective is presented as enabling distinction of unseen keywords, but no explicit check or analysis is reported confirming that phonetically similar sequences to the evaluation keywords are absent from the pretraining corpus. This verification is load-bearing for the zero-shot claim and the reported FAR reduction to 0.007% on AMI.
  2. [Evaluations] The abstract and results claim 90% accuracy and strong FAR reduction across four datasets, yet no details on baselines, error bars, data splits, or statistical significance are provided in the summary of results, preventing full verification of the central performance claims.
minor comments (1)
  1. [Abstract] The informal phrasing 'pesky false alarm rate' should be replaced with a more technical term such as 'elevated false alarm rate' for consistency with journal standards.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the zero-shot claims and result presentation in our manuscript. We address each major comment below and plan corresponding revisions.

read point-by-point responses
  1. Referee: [Section 3] The multi-granularity contrastive learning objective is presented as enabling distinction of unseen keywords, but no explicit check or analysis is reported confirming that phonetically similar sequences to the evaluation keywords are absent from the pretraining corpus. This verification is load-bearing for the zero-shot claim and the reported FAR reduction to 0.007% on AMI.

    Authors: We agree that explicit verification of no phonetic overlap with evaluation keywords in the pretraining corpus is essential to support the zero-shot setting and the reported FAR. In the revised manuscript, we will add a new subsection in Section 3 detailing the verification procedure (using phoneme-level Levenshtein distance and forced alignment checks against the pretraining transcripts) and confirming the absence of similar sequences for all evaluation keywords across the four datasets. This analysis will directly bolster the zero-shot claims. revision: yes
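The verification procedure proposed in this response can be sketched as a phoneme-level edit distance screen. `min_normalised_distance` is a hypothetical helper for illustration, not code from the manuscript, and the forced-alignment half of the check is omitted.

```python
def levenshtein(a, b):
    """Edit distance between two phoneme sequences (lists of phoneme strings)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        curr = [i]
        for j, pb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (pa != pb)))  # substitution
        prev = curr
    return prev[-1]

def min_normalised_distance(keyword, corpus_phrases):
    """Smallest length-normalised edit distance from a keyword's phoneme
    sequence to any phrase in the pretraining transcripts; values near 0
    flag potential phonetic overlap that would undermine the zero-shot claim."""
    return min(levenshtein(keyword, p) / max(len(keyword), len(p))
               for p in corpus_phrases)
```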

  2. Referee: [Evaluations] The abstract and results claim 90% accuracy and strong FAR reduction across four datasets, yet no details on baselines, error bars, data splits, or statistical significance are provided in the summary of results, preventing full verification of the central performance claims.

    Authors: We concur that additional methodological details are required for reproducibility and verification. In the revised manuscript, we will expand the Evaluations section (and update the abstract summary if needed) to explicitly describe: the full set of baselines with citations, error bars computed over multiple random seeds, precise train/validation/test splits for each dataset, and statistical significance testing (e.g., Wilcoxon signed-rank tests) against baselines. These will be added to the text, tables, and figure captions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical ML framework is self-contained

full rationale

The paper presents an empirical machine learning architecture (cross-attention for utterance/phoneme alignments plus multi-granularity contrastive loss) evaluated on four public benchmark datasets. Performance numbers are reported as experimental outcomes rather than predictions derived by construction from fitted inputs or self-referential definitions. No equations reduce to input data by definition, no uniqueness theorems are imported from the authors' prior work, and no ansatz is smuggled via self-citation. The zero-shot claim rests on the proposed training objective and benchmark results, which remain independently falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard assumptions from deep learning for audio tasks; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (2)
  • domain assumption Cross-attention mechanisms can produce useful alignments between utterance-level and phoneme-level audio features for keyword spotting.
    Invoked as the core mechanism for multi-granularity learning.
  • domain assumption Multi-granularity contrastive learning will suppress false alarms for acoustically similar keywords without task-specific labeled data.
    Central to the claimed false-alarm reduction in zero-shot regime.

pith-pipeline@v0.9.0 · 5474 in / 1298 out tokens · 44413 ms · 2026-05-13T17:16:45.754060+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 1 internal anchor

  1. [1]

    MALEFA: Multi-grAnularity Learning and Effective False Alarm Suppression for Zero-shot Keyword Spotting

    INTRODUCTION Keyword spotting (KWS) enables intuitive human-computer interaction, facilitating the activation of voice assistants or smart devices with spoken commands, especially in hands-busy situations such as driving or gaming. Conventional KWS systems typically operate under a closed-set paradigm (using predefined wake words like “Hey Siri”, “OK ...

  2. [2]

    Feature Extractor As schematically depicted in Fig

    METHODOLOGY 2.1. Feature Extractor As schematically depicted in Fig. 2, MALEFA employs a two-stream encoder with separate audio and text encoders. Both audio and text modalities are processed independently and later aligned in the pattern extractor. Audio encoder. Each utterance is passed through a pre-trained speech encoder [19] using a 775 ms window w...

  3. [3]

    Datasets We use the LibriPhrase train-clean-100 and train-clean-360 sets with MUSAN noise [23] for training

    EXPERIMENTAL SETUP 3.1. Datasets We use the LibriPhrase train-clean-100 and train-clean-360 sets with MUSAN noise [23] for training. Evaluation is conducted on four benchmarks: LibriPhrase Easy/Hard (LE/LH) from train-other-500 (low/high phonetic confusion), Google Speech Commands V2 (G) [24] (35 commands under diverse conditions), Qualcomm Keyword Speech...

  4. [4]

    bed” vs. “three

    EXPERIMENTAL RESULTS 4.1. Main Results Table 1 compares MALEFA with prior ZSKWS models and presents an ablation study. While CED [12] achieves strong accuracy, its Conformer-based encoder [13] incurs much higher complexity, limiting on-device usage. Compared with PhonMatchNet [8], on LPH, it suffers a significant drop (AUC = 88.52, EER = 18.82), whereas ...

  5. [5]

    CONCLUSION AND FUTURE WORK In this work, we have presented MALEFA, a lightweight ZSKWS framework that avoids reliance on large pre-trained models. By integrating multi-granularity contrastive learning with a novel false alarm-aware loss, MALEFA effectively captures global semantics and fine-grained pronunciations, and directly suppresses false triggers....

  6. [6]

    Any findings and implications in the paper do not necessarily reflect those of the sponsors

    ACKNOWLEDGMENTS This work was supported in part by Realtek Semiconductor Corporation under Grant Numbers 113KK01103 and 114KK01005. Any findings and implications in the paper do not necessarily reflect those of the sponsors.

  7. [7]

    Convolutional neural networks for small-footprint keyword spotting,

    Tara N. Sainath and Carolina Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Interspeech, 2015, pp. 1478–1482.

  8. [8]

    Small-footprint keyword spotting using deep neural networks,

    Guoguo Chen, Carolina Parada, and Georg Heigold, “Small-footprint keyword spotting using deep neural networks,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4087–4091.

  9. [9]

    Deep spoken keyword spotting: An overview,

    Iván López-Espejo, Zheng-Hua Tan, John H. L. Hansen, and Jesper Jensen, “Deep spoken keyword spotting: An overview,” IEEE Access, vol. 10, pp. 4169–4199, 2021.

  10. [10]

    End-to-end transformer-based open-vocabulary keyword spotting with location-guided local attention,

    Bo Wei, Meirong Yang, Tao Zhang, Xiao Tang, Xing Huang, Kyuhong Kim, Jaeyun Lee, Kiho Cho, and Sung-Un Park, “End-to-end transformer-based open-vocabulary keyword spotting with location-guided local attention,” in Proc. Interspeech, 2021, pp. 361–365.

  11. [11]

    Zero-shot keyword spotting for visual speech recognition in-the-wild,

    Themos Stafylakis and Georgios Tzimiropoulos, “Zero-shot keyword spotting for visual speech recognition in-the-wild,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 513–529

  12. [12]

    End-to-end open vocabulary keyword search,

    Bolaji Yusuf, Alican Gok, Batuhan Gundogdu, and Murat Saraclar, “End-to-end open vocabulary keyword search,” in Proc. Interspeech, 2021, pp. 4388–4392.

  13. [13]

    Learning audio-text agreement for open-vocabulary keyword spotting,

    Hyeon-Kyeong Shin, Hyewon Han, Doyeon Kim, Soo-Whan Chung, and Hong-Goo Kang, “Learning audio-text agreement for open-vocabulary keyword spotting,” arXiv preprint arXiv:2206.15400, 2022.

  14. [14]

    Phonmatchnet: Phoneme-guided zero-shot keyword spotting for user-defined keywords,

    Yong-Hyeok Lee and Namhyun Cho, “Phonmatchnet: Phoneme-guided zero-shot keyword spotting for user-defined keywords,” in Proceedings of Interspeech. IEEE, 2023.

  15. [15]

    U2-KWS: Unified two-pass open-vocabulary keyword spotting with keyword bias,

    Ao Zhang, Pan Zhou, Kaixun Huang, Yong Zou, Ming Liu, and Lei Xie, “U2-KWS: Unified two-pass open-vocabulary keyword spotting with keyword bias,” in Proc. IEEE ASRU Workshop, 2023.

  16. [16]

    Open-vocabulary keyword-spotting with adaptive instance normalization,

    Aviv Navon, Aviv Shamsian, Neta Glazer, Gill Hetz, and Joseph Keshet, “Open-vocabulary keyword-spotting with adaptive instance normalization,” arXiv preprint arXiv:2309.08561, 2023.

  17. [17]

    Mm-kws: Multi-modal prompts for multilingual user-defined keyword spotting,

    Zhiqi Ai, Zhiyong Chen, and Shugong Xu, “Mm-kws: Multi-modal prompts for multilingual user-defined keyword spotting,” arXiv preprint arXiv:2406.07310, 2024.

  18. [18]

    Flexible keyword spotting based on homogeneous audio-text embedding,

    Kumari Nishu, Minsik Cho, Paul Dixon, and Devang Naik, “Flexible keyword spotting based on homogeneous audio-text embedding,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 5050–5054.

  19. [19]

    Conformer: Convolution-augmented transformer for speech recognition,

    Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al., “Conformer: Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020.

  20. [20]

    Large-scale contrastive language-audio pretraining (CLAP),

    Yusong Wu, Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang, “Large-scale contrastive language-audio pretraining (CLAP),” arXiv preprint arXiv:2211.06687, 2022.

  21. [21]

    Contrastive learning with audio discrimination for customizable keyword spotting in continuous speech,

    Yu Xi, Baochen Yang, Hao Li, Jiaqi Guo, and Kai Yu, “Contrastive learning with audio discrimination for customizable keyword spotting in continuous speech,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024.

  22. [22]

    Learning transferable visual models from natural language supervision,

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning, 2021, pp. 8748–8763.

  23. [23]

    Phoneme-level contrastive learning for user-defined keyword spotting with flexible enrollment,

    Li Kewei, Zhou Hengshun, Shen Kai, Dai Yusheng, and Du Jun, “Phoneme-level contrastive learning for user-defined keyword spotting with flexible enrollment,” arXiv preprint arXiv:2412.20805, 2024.

  24. [24]

    Adversarial deep metric learning for cross-modal audio-text alignment in open-vocabulary keyword spotting,

    Youngmoon Jung, Yong-Hyeok Lee, Myunghun Jung, Jaeyoung Roh, Chang Woo Han, and Hoon-Young Cho, “Adversarial deep metric learning for cross-modal audio-text alignment in open-vocabulary keyword spotting,” arXiv preprint arXiv:2505.16735, 2025.

  25. [25]

    Training keyword spotters with limited and synthesized speech data,

    James Lin, Kevin Kilgour, Dominik Roblek, and Matthew Sharifi, “Training keyword spotters with limited and synthesized speech data,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7474–7478.

  26. [26]

    Kyubyong Park and Jongseok Kim, “g2pE,” https://github.com/Kyubyong/g2p, 2019.

  27. [27]

    Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,

    Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning (ICML '06), 2006, pp. 369–376.

  28. [28]

    Optimizing early warning classifiers to control false alarms via a minimum precision constraint,

    Preetish Rath and Michael Hughes, “Optimizing early warning classifiers to control false alarms via a minimum precision constraint,” in Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, Eds., 28–30 Mar 2022, vol. 151 of Proceedings of Machine Learning Re...

  29. [29]

    MUSAN: A Music, Speech, and Noise Corpus

    David Snyder, Guoguo Chen, and Daniel Povey, “MUSAN: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484, 2015.

  30. [30]

    Speech commands: A dataset for limited-vocabulary keyword spotting,

    Pete Warden, “Speech commands: A dataset for limited-vocabulary keyword spotting,” in Proceedings of Interspeech, 2018.

  31. [31]

    Query-by-example on-device keyword spotting,

    Byeonggeun Kim, Mingu Lee, Jinkyu Lee, Yeonseok Kim, and Kyuwoong Hwang, “Query-by-example on-device keyword spotting,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 532–538.

  32. [32]

    The AMI meeting corpus: A pre-announcement,

    Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Maël Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, et al., “The AMI meeting corpus: A pre-announcement,” in Proc. International Workshop on Machine Learning for Multimodal Interaction (MLMI). Springer, 2005, pp. 28–39.