{"total":12,"items":[{"citing_arxiv_id":"2606.27627","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HybridCodec: Modeling Discrete and Continuous Representations for Efficient Speech Language Models","primary_cat":"cs.LG","submitted_at":"2026-06-26T00:53:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HybridCodec combines discrete tokens with continuous residuals via a focal modulation codec and hybrid Transformer to improve speaker retention and reduce autoregressive steps in speech language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25669","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Ultra-Low-Bitrate Mel-Spectrogram-based Neural Speech Coding with Flow-Matching-based Refinement and Vocoding-driven Reconstruction","primary_cat":"eess.AS","submitted_at":"2026-05-25T10:19:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FMelCodec is a three-stage mel-spectrogram codec using 640x VQ compression, conditional flow matching refinement, and HiFi-GAN reconstruction that reports higher quality than prior methods at 250 bps for 16 kHz speech.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19541","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Optimising Neural Speech Codecs for 300bps Communication using Reinforcement Learning","primary_cat":"cs.SD","submitted_at":"2026-05-19T08:40:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ClariCodec achieves 3.55% WER on LibriSpeech test-clean at 300 bps by RL fine-tuning the encoder for intelligibility, yielding a 23% 
relative WER reduction while preserving perceptual quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11192","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Exploring Token-Space Manipulation in Latent Audio Tokenizers","primary_cat":"cs.SD","submitted_at":"2026-05-11T19:58:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LATTE creates a compact latent token bottleneck in audio tokenizers that aggregates global information and enables unsupervised editing of attributes like speaker identity via token swapping.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11098","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling","primary_cat":"cs.SD","submitted_at":"2026-05-11T18:04:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08608","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation","primary_cat":"eess.AS","submitted_at":"2026-05-09T02:07:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"L3-SE reduces linguistic hallucination in LM-based speech enhancement by distilling noise-invariant acoustic-semantic 
representations from noisy inputs to condition an autoregressive decoder-only language model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"dicting clean speech tokens in a discrete latent space. SELM [51] combines WavLM with k-means clustering to obtain discrete speech tokens and then uses an LM to map noisy tokens to clean ones. LLaSE-G1 [20] instead conditions a decoder-only LM on continuous WavLM representations and predicts discrete clean-speech tokens for waveform reconstruction, while UniSE [57] follows a similar paradigm for autoregressive acoustic token generation. More recent methods employ staged or hierarchical generation strategies. GenSE [60] adopts a two-stage framework that first predicts enhanced semantic tokens and then generates clean acoustic tokens. OmniGSE [33] follows a continuous-to-discrete collaborative design, where pre-quantized codec-domain features are first"},{"citing_arxiv_id":"2605.06582","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization","primary_cat":"cs.LG","submitted_at":"2026-05-07T17:11:22+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Given encoder states $Z = [z_1, \\ldots, z_T]$, $z_t \\in \\mathbb{R}^{d}$, (52) we first compute a pooled encoder summary $\\bar{z} = \\frac{1}{T} \\sum_{t=1}^{T} z_t$, (53) or, when padding is present, the corresponding mask-normalized average over valid encoder positions. The summary is normalized and projected into the decoder state space: $c(Z) = W_c \\, \\mathrm{LayerNorm}(\\bar{z}) + b_c$, $c(Z) \\in \\mathbb{R}^{d_{\\mathrm{dec}}}$. (54) Let $e_m = E(\\tilde{u}_{m-1})$ (55) denote the token embedding at teacher-forcing input position $m$. 
In the decoder used in this work, positional information is represented by an additive learned positional embedding $p_m \\in \\mathbb{R}^{d_{\\mathrm{dec}}}$. (56) We therefore form the decoder input by adding the token embedding, positional embedding, and projected encoder summary: $\\tilde{e}_m = e_m + p_m + c(Z)$. (57) Thus, each prediction step receives three complementary signals: the previous token identity through $e_m$,"},{"citing_arxiv_id":"2604.26296","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding","primary_cat":"eess.AS","submitted_at":"2026-04-29T04:51:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17852","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLM-Codec: Neural Audio Codec Meets Language Model Objectives","primary_cat":"cs.SD","submitted_at":"2026-04-20T06:02:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM-Codec augments audio codec training with multi-step token prediction and contrastive semantic alignment to improve both waveform reconstruction and autoregressive predictability for speech language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.20211","ref_index":94,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Aliasing-Free Neural Audio 
Synthesis","primary_cat":"cs.SD","submitted_at":"2025-12-23T10:04:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pupu-Vocoder and Pupu-Codec integrate differentiable anti-aliasing into neural audio models to eliminate aliasing artifacts from non-linear activations and upsampling, yielding better results on music and singing voice.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.01537","ref_index":74,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Two-Dimensional Quantization for Geometry-Aware Audio Coding","primary_cat":"cs.SD","submitted_at":"2025-12-01T11:06:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.16632","ref_index":74,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Step-Audio 2 Technical Report","primary_cat":"cs.CL","submitted_at":"2025-07-22T14:23:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}