Speech commands: A dataset for limited-vocabulary speech recognition
14 Pith papers cite this work.
representative citing papers
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
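As a sketch of the selectivity idea, the minimal recurrence below makes the discretization step size a function of the input, so the model decides per token how much state to retain. This is an illustrative toy with scalar state and B = C = 1, not Mamba's actual parameterization (which uses learned projections for the step size and the SSM matrices):

```python
import math

def selective_ssm_scan(xs, a=-1.0):
    """Toy 1-D selective state-space recurrence: the step size dt depends on
    the input, so the state transition is input-dependent ("selective")."""
    h, ys = 0.0, []
    for x in xs:
        dt = math.log1p(math.exp(x))   # softplus: input-dependent step size
        a_bar = math.exp(dt * a)       # discretized decay in (0, 1)
        b_bar = (a_bar - 1.0) / a      # zero-order-hold discretization of B=1
        h = a_bar * h + b_bar * x      # linear recurrence: O(1) per step
        ys.append(h)                   # readout with C=1
    return ys
```

Because the recurrence is linear in the state, generation costs constant time and memory per step regardless of context length, which is where the throughput advantage over attention comes from.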
- Efficiently Modeling Long Sequences with Structured State Spaces
S4 is an efficient state space sequence model that captures long-range dependencies via structured parameterization of the SSM, achieving state-of-the-art results on the Long Range Arena and other benchmarks while being faster than Transformers for generation.
- FiTS: Interpretable Spiking Neurons via Frequency Selectivity and Temporal Shaping
FiTS spiking neurons improve auditory task performance over LIF baselines by factorizing computation into frequency selectivity and group-delay-based temporal shaping, yielding interpretable per-neuron parameters.
- End-to-End Keyword Spotting on FPGA Using Graph Neural Networks with a Neuromorphic Auditory Sensor
An FPGA implementation of a neuromorphic auditory sensor plus graph neural network achieves 87.43% accuracy on Google Speech Commands v2 with sub-35 µs latency and 1.12 W power.
- MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models
MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
- EdgeSpike: Spiking Neural Networks for Low-Power Autonomous Sensing in Edge IoT Architectures
EdgeSpike delivers 91.4% mean accuracy on five sensing tasks with 31x lower energy on neuromorphic hardware and 6.3x longer battery life in a seven-month field deployment compared to conventional CNNs.
- ShiftLIF: Efficient Multi-Level Spiking Neurons with Power-of-Two Quantization
ShiftLIF maps membrane potentials to logarithmically spaced power-of-two spike levels, improving representational capacity in SNNs while keeping synaptic operations multiplier-free.
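The power-of-two idea can be sketched as follows; the function names and the exact level mapping are hypothetical illustrations, not taken from the paper. Because each spike level is 2^k, a downstream synapse applies it with a bit shift instead of a multiply:

```python
def pow2_spike_level(v, n_levels=4):
    """Map a membrane potential to one of n_levels logarithmically spaced
    power-of-two spike levels 1, 2, 4, ... (None if below threshold 1.0).
    Returns the exponent k so synapses can apply 2**k with a shift."""
    if v < 1.0:
        return None                        # no spike emitted
    return min(int(v).bit_length() - 1, n_levels - 1)

def synapse_accumulate(weight_int, exp):
    """Multiplier-free synaptic update: weight * 2**exp as a left shift."""
    return 0 if exp is None else weight_int << exp
```

For example, a potential of 5.7 falls in [4, 8) and maps to exponent 2 (spike level 4), so an integer weight of 3 contributes `3 << 2 = 12` to the postsynaptic sum without any multiplication.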
- From Cortical Synchronous Rhythm to Brain Inspired Learning Mechanism: An Oscillatory Spiking Neural Network with Time-Delayed Coordination
S2-Net is an oscillatory spiking neural network that uses time-delayed synchronization for bottom-up and top-down coordination to enable efficient, brain-inspired information processing across tasks like decoding and reasoning.
- ULTRAS -- Unified Learning of Transformer Representations for Audio and Speech Signals
ULTRAS unifies audio and speech representation learning in a single transformer by applying patch masking to log-mel spectrograms and using a joint spectral-temporal prediction loss.
- minAction.net: Energy-First Neural Architecture Design -- From Biological Principles to Systematic Validation
Large-scale experiments show that architecture performance depends on task type rather than holding universally, and that a single-parameter energy penalty reduces computational energy by ~1000x with negligible accuracy cost.
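A single-parameter energy penalty can be as simple as the sketch below; the mean-absolute-activation energy proxy and the coefficient `lam` are assumptions for illustration, not the paper's actual energy measure:

```python
def energy_regularized_loss(task_loss, activations, lam=1e-3):
    """Augment a task loss with a single-parameter energy penalty, using
    mean absolute activation as a crude proxy for computational energy.
    Raising lam trades a little accuracy for sparser, cheaper activity."""
    energy = sum(abs(a) for a in activations) / len(activations)
    return task_loss + lam * energy
```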
- Whisper-AuT: Domain-Adapted Audio Encoder for Efficient Audio-LLM Training
Whisper-AuT is a domain-adapted audio encoder obtained by fine-tuning Whisper-large-v3 on mixed speech, environmental, and music data, yielding gains of +23% on ESC-50, +5% on GTZAN, and +0.7% on Speech Commands.
- Practical Bayesian Inference for Speech SNNs: Uncertainty and Loss-Landscape Smoothing
Bayesian weight learning in surrogate-gradient SNNs smooths the loss landscape and improves negative log-likelihood plus Brier score on Heidelberg Digits and Speech Commands datasets.
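The Brier score mentioned here is just the mean squared error between the predicted class probabilities and the one-hot true label (lower is better, so "improving" it means lowering it). A minimal per-example version:

```python
def brier_score(probs, label):
    """Multiclass Brier score for one example: mean squared error between
    predicted class probabilities and the one-hot true label."""
    onehot = [1.0 if i == label else 0.0 for i in range(len(probs))]
    return sum((p - t) ** 2 for p, t in zip(probs, onehot)) / len(probs)
```

A perfectly confident correct prediction scores 0, while a maximally uncertain two-class prediction scores 0.25, so the metric rewards calibration as well as accuracy.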
- Attention Is not Everything: Efficient Alternatives for Vision
A survey that taxonomizes non-Transformer vision models and evaluates their practical trade-offs across efficiency, scalability, and robustness.
- Keyword spotting using convolutional neural network for speech recognition in Hindi
CNNs using MFCC features achieve 91.79% accuracy for keyword spotting in Hindi speech on a 40,000-sample dataset.