Transformer Transducer: A Streamable Speech Recognition Model

2021 · arXiv 1503.2021

6 Pith papers cite this work. Polarity classification is still in progress.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Escaping the Linearity Trap: Manifold Detours for Black-Box Adversarial Attacks on Singing Audio Deepfake Detection

cs.CR · 2026-05-18 · unverdicted · novelty 7.0

MARS is a transfer-based black-box attack that uses bi-level optimization on semantic and artifact anchors to escape the linearity trap and improve attack success rates on SSL-SVDD by up to 36%.

Layer-wise Probing of wav2vec 2.0 and Whisper for Consonant Cluster Reduction in African American English

cs.CL · 2026-06-22 · unverdicted · novelty 6.0

Layer-wise probing of wav2vec2-base and Whisper-small shows both models distinguish reduced vs. canonical consonant clusters in AAE with high accuracy and retain cues to underlying stops, encoding CCR as gradient variation.

MIRAGE: Adaptive Multimodal Gating for Whole-Brain fMRI Encoding

cs.LG · 2026-05-28 · unverdicted · novelty 6.0

MIRAGE uses adaptive multimodal gating on native multimodal backbones plus a transformer encoder to achieve state-of-the-art whole-brain fMRI prediction for naturalistic audiovisual stimuli, outperforming post-hoc unimodal aggregation.

Beyond Decodability: Reconstructing Language Model Representations with an Encoding Probe

cs.CL · 2026-05-01 · unverdicted · novelty 6.0

An encoding probe reconstructs transformer representations from acoustic, phonetic, syntactic, lexical and speaker features, showing independent syntactic/lexical contributions and training-dependent speaker effects.

WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition

cs.CL · 2026-04-28 · unverdicted · novelty 5.0

WhisperPipe delivers 89 ms median latency and 48% lower peak GPU memory than standard Whisper while keeping word error rate within 2% of the offline model.

MOSS-Audio Technical Report

cs.SD · 2026-06-01 · unverdicted · novelty 4.0

MOSS-Audio is an audio-language model using a 12.5 Hz encoder, DeepStack cross-layer injection, time markers, and an event-preserving annotation pipeline for unified audio understanding.

citing papers explorer

WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition

cs.CL · 2026-04-28 · unverdicted · none · ref 13

WhisperPipe delivers 89 ms median latency and 48% lower peak GPU memory than standard Whisper while keeping word error rate within 2% of the offline model.