Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition

Pete Warden · 2018 · cs.CL · arXiv 1804.03209

Tool reference. 83% of classified Pith citations use this work as a method, library, or software dependency, not as a substantive claim.

34 Pith papers citing it

abstract

Describes an audio dataset of spoken words designed to help train and evaluate keyword spotting systems. Discusses why this task is an interesting challenge, and why it requires a specialized dataset that is different from conventional datasets used for automatic speech recognition of full sentences. Suggests a methodology for reproducible and comparable accuracy metrics for this task. Describes how the data was collected and verified, what it contains, previous versions and properties. Concludes by reporting baseline results of models trained on this dataset.
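The abstract mentions a methodology for reproducible and comparable accuracy metrics but does not spell it out here. One common way to get there is to assign each recording to a split by hashing a stable part of its file name, so that every user ends up with the same partition. The sketch below illustrates that idea only; the function name `which_split`, the default percentages, the `_nohash_` file-name convention, and the example file names are assumptions for illustration, not details taken from this page.

```python
import hashlib
import re


def which_split(filename, validation_pct=10.0, testing_pct=10.0):
    """Deterministically map a file name to a split (illustrative sketch)."""
    # Assumption: everything from "_nohash_" onward is a per-take suffix,
    # so stripping it keeps all takes from one speaker in the same split.
    speaker_key = re.sub(r"_nohash_.*$", "", filename)
    digest = hashlib.sha1(speaker_key.encode("utf-8")).hexdigest()
    # Map the hash to a stable bucket in the range [0, 100).
    bucket = (int(digest, 16) % 10000) / 100.0
    if bucket < validation_pct:
        return "validation"
    if bucket < validation_pct + testing_pct:
        return "testing"
    return "training"


# Hypothetical file names: two takes from the same speaker.
print(which_split("speaker01_nohash_0.wav"))
print(which_split("speaker01_nohash_1.wav"))
```

Because the assignment depends only on the file name, adding new recordings later does not move existing files between splits, which is what keeps reported accuracies comparable across runs.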

citation-role summary

dataset: 5 · background: 1

citation-polarity summary

use dataset: 5 · unclear: 1
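The 83% headline figure appears consistent with these counts: 5 of the 6 classified citations fall in the dominant category. The short check below reproduces that arithmetic. It assumes that "classified" means the six citations that received a role or polarity label and that the headline is derived from these counts, which the page does not state explicitly; the helper name `share` is hypothetical.

```python
# Counts copied from the citation-role and citation-polarity summaries.
role_counts = {"dataset": 5, "background": 1}
polarity_counts = {"use dataset": 5, "unclear": 1}


def share(counts, label):
    """Return the label's share of all classified citations as a whole percent."""
    total = sum(counts.values())
    return round(100 * counts[label] / total)


dataset_share = share(role_counts, "dataset")
use_share = share(polarity_counts, "use dataset")
print(f"{dataset_share}% of {sum(role_counts.values())} classified citations")
```

Only 6 of the 34 citing papers carry a classification, so the percentage describes a small sample rather than the full citing set.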

representative citing papers

DiffWave: A Versatile Diffusion Model for Audio Synthesis

eess.AS · 2020-09-21 · unverdicted · novelty 8.0

DiffWave is a non-autoregressive diffusion model that generates high-fidelity audio waveforms from noise in constant steps, matching WaveNet vocoder quality while being orders of magnitude faster and outperforming prior models in unconditional generation.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

Efficiently Modeling Long Sequences with Structured State Spaces

cs.LG · 2021-10-31 · unverdicted · novelty 8.0

S4 is an efficient state space sequence model that captures long-range dependencies via structured parameterization of the SSM, achieving state-of-the-art results on the Long Range Arena and other benchmarks while being faster than Transformers for generation.

Covariance Estimation for Matrix-variate Data via Fixed-rank Core Covariance Geometry

math.DG · 2025-11-30 · unverdicted · novelty 7.0

The space of rank-r core covariances forms a smooth manifold except on a measure-zero set, enabling a partial-isotropy shrinkage estimator for matrix-variate data.

DASB - Discrete Audio and Speech Benchmark

cs.SD · 2024-06-20 · unverdicted · novelty 7.0

DASB is a new benchmark for discrete audio tokens showing semantic tokens outperform acoustic ones but discrete representations remain less robust than continuous features across domains.

Wake Vision: A Tailored Dataset and Benchmark Suite for TinyML Computer Vision Applications

cs.CV · 2024-05-01 · unverdicted · novelty 7.0

Wake Vision pipeline produces a 6M-image person detection dataset for TinyML with 2.2% label error, improving model accuracy up to 6.6% over prior VWW benchmark across architectures and subsets.

FiTS: Interpretable Spiking Neurons via Frequency Selectivity and Temporal Shaping

cs.NE · 2026-05-13 · unverdicted · novelty 7.0

FiTS spiking neurons improve auditory task performance over LIF baselines by factorizing computation into frequency selectivity and group-delay-based temporal shaping, yielding interpretable per-neuron parameters.

End-to-End Keyword Spotting on FPGA Using Graph Neural Networks with a Neuromorphic Auditory Sensor

cs.LG · 2026-05-10 · conditional · novelty 7.0

An FPGA implementation of a neuromorphic auditory sensor plus graph neural network achieves 87.43% accuracy on Google Speech Commands v2 with sub-35 µs latency and 1.12 W power.

MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

cs.IR · 2026-04-25 · unverdicted · novelty 7.0

MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

What changes after deployment? A survey on On-device Learning in TinyML

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

A survey of on-device learning in TinyML organized by distribution change regimes, highlighting influences on applications, hardware, and solutions plus a gap between benchmarks and deployments.

Plug-in Losses for Evidential Deep Learning: A Simplified Framework for Uncertainty Estimation that Includes the Softmax Classifier

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

Plug-in losses approximate EDL training objectives at the Dirichlet mean with decaying error as evidence grows, including softmax under a specific mapping, and match classical EDL performance on Google Speech Commands.

AudioMosaic: Contrastive Masked Audio Representation Learning

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

AudioMosaic learns general-purpose audio representations through contrastive pre-training with structured spectrogram masking, reaching state-of-the-art results on standard benchmarks and improving audio-language tasks.

ComMark: Covert and Robust Black-Box Model Watermarking with Compressed Samples

cs.CR · 2025-12-16 · unverdicted · novelty 6.0

ComMark embeds covert watermarks in models using frequency-domain compressed samples and simulated attacks, claiming state-of-the-art covertness and robustness across image, speech, text, and video tasks.

AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers

cs.SD · 2025-12-03 · unverdicted · novelty 6.0

AaSP learns aliasing-stable audio representations by augmenting patch tokens with adaptive subband features from alias-prone bands and using teacher-student masked modeling plus multi-mask contrastive regularization, reaching SOTA on AS-20K, ESC-50, and NSynth under fine-tuning.

SiLIF: Structured State Space Model Dynamics and Parametrization for Spiking Neural Networks

cs.NE · 2025-06-04 · unverdicted · novelty 6.0

SiLIF models apply SSM dynamics and parametrization to spiking neurons for stable training, reaching new SOTA on event-based and raw-audio speech datasets while using half the compute of SSMs via synaptic delays.

Simplified State Space Layers for Sequence Modeling

cs.LG · 2022-08-09 · accept · novelty 6.0

S5 uses a single MIMO state space model with S4-derived initialization to match S4 efficiency and reach 87.4% average accuracy on the Long Range Arena benchmark.

Sub-band Convolutional Neural Networks for Small-footprint Spoken Term Classification

eess.AS · 2019-07-02 · unverdicted · novelty 6.0

Sub-band CNN applies distinct kernels per frequency sub-band to reduce computation 39.7-49.3% versus full-band CNN on Speech Commands dataset while maintaining accuracy.

Federated Learning with Non-IID Data

cs.LG · 2018-06-02 · conditional · novelty 6.0

Non-IID data causes up to 55% accuracy loss in federated learning due to weight divergence measured by earth mover's distance; 5% globally shared data recovers 30% accuracy on CIFAR-10.

EdgeSpike: Spiking Neural Networks for Low-Power Autonomous Sensing in Edge IoT Architectures

cs.NE · 2026-04-29 · unverdicted · novelty 6.0

EdgeSpike delivers 91.4% mean accuracy on five sensing tasks with 31x lower energy on neuromorphic hardware and 6.3x longer battery life in a seven-month field deployment compared to conventional CNNs.

Effective User-defined Keyword Spotting with Dual-stage Matching, Multi-modal Enrollment, and Continual Adaptation

eess.AS · 2026-05-21 · unverdicted · novelty 5.0

DMA-KWS achieves 97.85% AUC and 6.13% EER on LibriPhrase Hard via dual-stage CTC/QbyT matching, multi-modal enrollment, and lightweight continual adaptation with 187k parameters.

How Class Ontology and Data Scale Affect Audio Transfer Learning

cs.LG · 2026-03-26 · unverdicted · novelty 5.0

Larger pre-training data scale and class diversity improve audio transfer learning performance, yet similarity between pre-training and target task has a stronger positive effect.

Prototype-Guided Robust Learning against Backdoor Attacks

cs.CR · 2025-09-03 · unverdicted · novelty 5.0

PGRL defends ML models from backdoor attacks by using a few verified clean samples to guide removal of suspicious training data and unlearning of backdoor features during fine-tuning, outperforming prior defenses in experiments.

Towards Debugging Deep Neural Networks by Generating Speech Utterances

cs.LG · 2019-07-06 · unverdicted · novelty 5.0

Activation maximization applied to a speech command DNN, followed by WaveNet synthesis, produces class-specific utterances that human evaluators can interpret, supporting its use for model debugging.

ShiftLIF: Efficient Multi-Level Spiking Neurons with Power-of-Two Quantization

cs.NE · 2026-05-03 · unverdicted · novelty 5.0

ShiftLIF maps membrane potentials to logarithmically spaced power-of-two spike levels, improving representational capacity in SNNs while keeping synaptic operations multiplier-free.

citing papers explorer

Showing 34 of 34 citing papers.

DiffWave: A Versatile Diffusion Model for Audio Synthesis eess.AS · 2020-09-21 · unverdicted · none · ref 18 · internal anchor
DiffWave is a non-autoregressive diffusion model that generates high-fidelity audio waveforms from noise in constant steps, matching WaveNet vocoder quality while being orders of magnitude faster and outperforming prior models in unconditional generation.
Mamba: Linear-Time Sequence Modeling with Selective State Spaces cs.LG · 2023-12-01 · unverdicted · none · ref 109
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
Efficiently Modeling Long Sequences with Structured State Spaces cs.LG · 2021-10-31 · unverdicted · none · ref 47
S4 is an efficient state space sequence model that captures long-range dependencies via structured parameterization of the SSM, achieving state-of-the-art results on the Long Range Arena and other benchmarks while being faster than Transformers for generation.
Covariance Estimation for Matrix-variate Data via Fixed-rank Core Covariance Geometry math.DG · 2025-11-30 · unverdicted · none · ref 2 · internal anchor
The space of rank-r core covariances forms a smooth manifold except on a measure-zero set, enabling a partial-isotropy shrinkage estimator for matrix-variate data.
DASB - Discrete Audio and Speech Benchmark cs.SD · 2024-06-20 · unverdicted · none · ref 55 · internal anchor
DASB is a new benchmark for discrete audio tokens showing semantic tokens outperform acoustic ones but discrete representations remain less robust than continuous features across domains.
Wake Vision: A Tailored Dataset and Benchmark Suite for TinyML Computer Vision Applications cs.CV · 2024-05-01 · unverdicted · none · ref 19 · internal anchor
Wake Vision pipeline produces a 6M-image person detection dataset for TinyML with 2.2% label error, improving model accuracy up to 6.6% over prior VWW benchmark across architectures and subsets.
FiTS: Interpretable Spiking Neurons via Frequency Selectivity and Temporal Shaping cs.NE · 2026-05-13 · unverdicted · none · ref 36
FiTS spiking neurons improve auditory task performance over LIF baselines by factorizing computation into frequency selectivity and group-delay-based temporal shaping, yielding interpretable per-neuron parameters.
End-to-End Keyword Spotting on FPGA Using Graph Neural Networks with a Neuromorphic Auditory Sensor cs.LG · 2026-05-10 · conditional · none · ref 34
An FPGA implementation of a neuromorphic auditory sensor plus graph neural network achieves 87.43% accuracy on Google Speech Commands v2 with sub-35 µs latency and 1.12 W power.
MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models cs.IR · 2026-04-25 · unverdicted · none · ref 34
MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
What changes after deployment? A survey on On-device Learning in TinyML cs.LG · 2026-05-29 · unverdicted · none · ref 104 · internal anchor
A survey of on-device learning in TinyML organized by distribution change regimes, highlighting influences on applications, hardware, and solutions plus a gap between benchmarks and deployments.
Plug-in Losses for Evidential Deep Learning: A Simplified Framework for Uncertainty Estimation that Includes the Softmax Classifier cs.LG · 2026-05-21 · unverdicted · none · ref 8 · internal anchor
Plug-in losses approximate EDL training objectives at the Dirichlet mean with decaying error as evidence grows, including softmax under a specific mapping, and match classical EDL performance on Google Speech Commands.
AudioMosaic: Contrastive Masked Audio Representation Learning cs.LG · 2026-05-14 · unverdicted · none · ref 14 · internal anchor
AudioMosaic learns general-purpose audio representations through contrastive pre-training with structured spectrogram masking, reaching state-of-the-art results on standard benchmarks and improving audio-language tasks.
ComMark: Covert and Robust Black-Box Model Watermarking with Compressed Samples cs.CR · 2025-12-16 · unverdicted · none · ref 68 · internal anchor
ComMark embeds covert watermarks in models using frequency-domain compressed samples and simulated attacks, claiming state-of-the-art covertness and robustness across image, speech, text, and video tasks.
AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers cs.SD · 2025-12-03 · unverdicted · none · ref 20 · internal anchor
AaSP learns aliasing-stable audio representations by augmenting patch tokens with adaptive subband features from alias-prone bands and using teacher-student masked modeling plus multi-mask contrastive regularization, reaching SOTA on AS-20K, ESC-50, and NSynth under fine-tuning.
SiLIF: Structured State Space Model Dynamics and Parametrization for Spiking Neural Networks cs.NE · 2025-06-04 · unverdicted · none · ref 42 · internal anchor
SiLIF models apply SSM dynamics and parametrization to spiking neurons for stable training, reaching new SOTA on event-based and raw-audio speech datasets while using half the compute of SSMs via synaptic delays.
Simplified State Space Layers for Sequence Modeling cs.LG · 2022-08-09 · accept · none · ref 150 · internal anchor
S5 uses a single MIMO state space model with S4-derived initialization to match S4 efficiency and reach 87.4% average accuracy on the Long Range Arena benchmark.
Sub-band Convolutional Neural Networks for Small-footprint Spoken Term Classification eess.AS · 2019-07-02 · unverdicted · none · ref 6 · internal anchor
Sub-band CNN applies distinct kernels per frequency sub-band to reduce computation 39.7-49.3% versus full-band CNN on Speech Commands dataset while maintaining accuracy.
Federated Learning with Non-IID Data cs.LG · 2018-06-02 · conditional · none · ref 20 · internal anchor
Non-IID data causes up to 55% accuracy loss in federated learning due to weight divergence measured by earth mover's distance; 5% globally shared data recovers 30% accuracy on CIFAR-10.
EdgeSpike: Spiking Neural Networks for Low-Power Autonomous Sensing in Edge IoT Architectures cs.NE · 2026-04-29 · unverdicted · none · ref 40
EdgeSpike delivers 91.4% mean accuracy on five sensing tasks with 31x lower energy on neuromorphic hardware and 6.3x longer battery life in a seven-month field deployment compared to conventional CNNs.
Effective User-defined Keyword Spotting with Dual-stage Matching, Multi-modal Enrollment, and Continual Adaptation eess.AS · 2026-05-21 · unverdicted · none · ref 58 · internal anchor
DMA-KWS achieves 97.85% AUC and 6.13% EER on LibriPhrase Hard via dual-stage CTC/QbyT matching, multi-modal enrollment, and lightweight continual adaptation with 187k parameters.
How Class Ontology and Data Scale Affect Audio Transfer Learning cs.LG · 2026-03-26 · unverdicted · none · ref 24 · internal anchor
Larger pre-training data scale and class diversity improve audio transfer learning performance, yet similarity between pre-training and target task has a stronger positive effect.
Prototype-Guided Robust Learning against Backdoor Attacks cs.CR · 2025-09-03 · unverdicted · none · ref 44 · internal anchor
PGRL defends ML models from backdoor attacks by using a few verified clean samples to guide removal of suspicious training data and unlearning of backdoor features during fine-tuning, outperforming prior defenses in experiments.
Towards Debugging Deep Neural Networks by Generating Speech Utterances cs.LG · 2019-07-06 · unverdicted · none · ref 26 · internal anchor
Activation maximization applied to a speech command DNN, followed by WaveNet synthesis, produces class-specific utterances that human evaluators can interpret, supporting its use for model debugging.
ShiftLIF: Efficient Multi-Level Spiking Neurons with Power-of-Two Quantization cs.NE · 2026-05-03 · unverdicted · none · ref 50
ShiftLIF maps membrane potentials to logarithmically spaced power-of-two spike levels, improving representational capacity in SNNs while keeping synaptic operations multiplier-free.
From Cortical Synchronous Rhythm to Brain Inspired Learning Mechanism: An Oscillatory Spiking Neural Network with Time-Delayed Coordination q-bio.NC · 2026-05-03 · unverdicted · none · ref 41
S2-Net is an oscillatory spiking neural network that uses time-delayed synchronization for bottom-up and top-down coordination to enable efficient, brain-inspired information processing across tasks like decoding and reasoning.
ULTRAS -- Unified Learning of Transformer Representations for Audio and Speech Signals eess.AS · 2026-04-08 · unverdicted · none · ref 21
ULTRAS unifies audio and speech representation learning in a single transformer by applying patch masking to log-mel spectrograms and using a joint spectral-temporal prediction loss.
Keyword Spotting for Hearing Assistive Devices Robust to External Speakers cs.SD · 2019-06-22 · unverdicted · none · ref 17 · internal anchor
A multi-task deep residual network jointly performs keyword spotting and own-voice detection, delivering around 32% relative KWS accuracy gain on a hearing-aid-emulated corpus derived from Google Speech Commands.
A Monaural Speech Enhancement Method for Robust Small-Footprint Keyword Spotting cs.SD · 2019-06-20 · unverdicted · none · ref 6 · internal anchor
Joint training of speech enhancement and KWS with a novel CRN and Mel features improves noise robustness for small-footprint devices.
minAction.net: Energy-First Neural Architecture Design -- From Biological Principles to Systematic Validation cs.LG · 2026-04-27 · conditional · none · ref 11
Large-scale experiments show architecture performance depends on task type, not universality, and a single-parameter energy penalty reduces computational energy by ~1000x with negligible accuracy cost.
Whisper-AuT: Domain-Adapted Audio Encoder for Efficient Audio-LLM Training cs.SD · 2026-04-12 · unverdicted · none · ref 18
Whisper-AuT is a domain-adapted audio encoder obtained by fine-tuning Whisper-large-v3 on mixed speech, environmental, and music data, yielding gains of +23% on ESC-50, +5% on GTZAN, and +0.7% on Speech Commands.
Practical Bayesian Inference for Speech SNNs: Uncertainty and Loss-Landscape Smoothing cs.LG · 2026-04-09 · unverdicted · none · ref 18
Bayesian weight learning in surrogate-gradient SNNs smooths the loss landscape and improves negative log-likelihood plus Brier score on Heidelberg Digits and Speech Commands datasets.
Attention Is not Everything: Efficient Alternatives for Vision cs.CV · 2026-04-19 · unverdicted · none · ref 12
A survey that taxonomizes non-Transformer vision models and evaluates their practical trade-offs across efficiency, scalability, and robustness.
Keyword spotting using convolutional neural network for speech recognition in Hindi cs.SD · 2026-04-26 · unverdicted · none · ref 13
CNNs using MFCC features achieve 91.79% accuracy for keyword spotting in Hindi speech on a 40,000-sample dataset.
Hardware-Software Co-Design of Scalable, Energy-Efficient Analog Recurrent Computations cs.AR · 2026-05-12 · unreviewed · ref 62 · internal anchor
