Speechtokenizer: Unified speech tokenizer for speech large language models

Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, Xipeng Qiu · 2024 · arXiv 2308.16692

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

read on arXiv browse 8 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Codec-Robust Attacks on Audio LLMs

cs.SD · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.

PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

PolySLGen generates contextually appropriate and temporally coherent multimodal speaking and listening reactions for polyadic interactions by fusing group motion and social cues.

TokenChain: A Discrete Speech Chain via Semantic Token Modeling

eess.AS · 2025-10-07 · unverdicted · novelty 7.0

TokenChain demonstrates that a discrete semantic-token interface can sustain effective chain learning between ASR and TTS, yielding faster convergence and lower error rates on LibriSpeech and TED-LIUM.

Two-Dimensional Quantization for Geometry-Aware Audio Coding

cs.SD · 2025-12-01 · unverdicted · novelty 6.0

Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

cs.CL · 2025-09-26 · unverdicted · novelty 6.0

StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.

Step-Audio 2 Technical Report

cs.CL · 2025-07-22 · unverdicted · novelty 6.0

Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

eess.AS · 2024-10-09 · unverdicted · novelty 5.0

F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

cs.LG · 2026-05-07

citing papers explorer

Showing 8 of 8 citing papers.

Codec-Robust Attacks on Audio LLMs cs.SD · 2026-05-19 · unverdicted · none · ref 56 · 2 links
CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.
PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction cs.CV · 2026-04-09 · unverdicted · none · ref 93
PolySLGen generates contextually appropriate and temporally coherent multimodal speaking and listening reactions for polyadic interactions by fusing group motion and social cues.
TokenChain: A Discrete Speech Chain via Semantic Token Modeling eess.AS · 2025-10-07 · unverdicted · none · ref 15
TokenChain demonstrates that a discrete semantic-token interface can sustain effective chain learning between ASR and TTS, yielding faster convergence and lower error rates on LibriSpeech and TED-LIUM.
Two-Dimensional Quantization for Geometry-Aware Audio Coding cs.SD · 2025-12-01 · unverdicted · none · ref 82
Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs cs.CL · 2025-09-26 · unverdicted · none · ref 91
StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.
Step-Audio 2 Technical Report cs.CL · 2025-07-22 · unverdicted · none · ref 84
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching eess.AS · 2024-10-09 · unverdicted · none · ref 154
F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization cs.LG · 2026-05-07 · unreviewed · ref 61

Speechtokenizer: Unified speech tokenizer for speech large language models

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer