Layer-wise probing of wav2vec2-base and Whisper-small shows both models distinguish reduced vs. canonical consonant clusters in AAE with high accuracy and retain cues to underlying stops, encoding CCR as gradient variation.
Transformer Transducer: A Streamable Speech Recognition Model
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 5verdicts
UNVERDICTED 5roles
background 1polarities
background 1representative citing papers
MIRAGE uses adaptive multimodal gating on native multimodal backbones plus a transformer encoder to achieve state-of-the-art whole-brain fMRI prediction for naturalistic audiovisual stimuli, outperforming post-hoc unimodal aggregation.
An encoding probe reconstructs transformer representations from acoustic, phonetic, syntactic, lexical and speaker features, showing independent syntactic/lexical contributions and training-dependent speaker effects.
WhisperPipe delivers 89 ms median latency and 48% lower peak GPU memory than standard Whisper while keeping word error rate within 2% of the offline model.
MOSS-Audio is an audio-language model using a 12.5 Hz encoder, DeepStack cross-layer injection, time markers, and an event-preserving annotation pipeline for unified audio understanding.
citing papers explorer
-
Layer-wise Probing of wav2vec 2.0 and Whisper for Consonant Cluster Reduction in African American English
Layer-wise probing of wav2vec2-base and Whisper-small shows both models distinguish reduced vs. canonical consonant clusters in AAE with high accuracy and retain cues to underlying stops, encoding CCR as gradient variation.
-
MIRAGE: Adaptive Multimodal Gating for Whole-Brain fMRI Encoding
MIRAGE uses adaptive multimodal gating on native multimodal backbones plus a transformer encoder to achieve state-of-the-art whole-brain fMRI prediction for naturalistic audiovisual stimuli, outperforming post-hoc unimodal aggregation.
-
Beyond Decodability: Reconstructing Language Model Representations with an Encoding Probe
An encoding probe reconstructs transformer representations from acoustic, phonetic, syntactic, lexical and speaker features, showing independent syntactic/lexical contributions and training-dependent speaker effects.
-
WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition
WhisperPipe delivers 89 ms median latency and 48% lower peak GPU memory than standard Whisper while keeping word error rate within 2% of the offline model.
-
MOSS-Audio Technical Report
MOSS-Audio is an audio-language model using a 12.5 Hz encoder, DeepStack cross-layer injection, time markers, and an event-preserving annotation pipeline for unified audio understanding.