ClariCodec achieves 3.55% WER on LibriSpeech test-clean at 300 bps by RL fine-tuning the encoder for intelligibility, yielding a 23% relative WER reduction while preserving perceptual quality.
hub
Bigcodec: Pushing the limits of low-bitrate neural speech codec
12 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.
FMelCodec is a three-stage mel-spectrogram codec using 640x VQ compression, conditional flow matching refinement, and HiFi-GAN reconstruction that reports higher quality than prior methods at 250 bps for 16 kHz speech.
LATTE creates a compact latent token bottleneck in audio tokenizers that aggregates global information and enables unsupervised editing of attributes like speaker identity via token swapping.
L3-SE reduces linguistic hallucination in LM-based speech enhancement by distilling noise-invariant acoustic-semantic representations from noisy inputs to condition an autoregressive decoder-only language model.
LLM-Codec augments audio codec training with multi-step token prediction and contrastive semantic alignment to improve both waveform reconstruction and autoregressive predictability for speech language models.
Pupu-Vocoder and Pupu-Codec integrate differentiable anti-aliasing into neural audio models to eliminate aliasing artifacts from non-linear activations and upsampling, yielding better results on music and singing voice.
Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.
HybridCodec combines discrete tokens with continuous residuals via a focal modulation codec and hybrid Transformer to improve speaker retention and reduce autoregressive steps in speech language models.
citing papers explorer
-
Step-Audio 2 Technical Report
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.