pith. machine review for the scientific record.

arxiv: 2210.13438 · v1 · submitted 2022-10-24 · 📡 eess.AS · cs.AI · cs.SD · stat.ML

Recognition: no theorem link

High Fidelity Neural Audio Compression

Alexandre Défossez, Gabriel Synnaeve, Jade Copet, Yossi Adi

Pith reviewed 2026-05-13 21:46 UTC · model grok-4.3

classification 📡 eess.AS · cs.AI · cs.SD · stat.ML
keywords neural audio codec · high-fidelity compression · streaming encoder-decoder · quantized latent space · multiscale spectrogram adversary · loss balancer · transformer compression · MUSHRA evaluation

The pith

A neural network audio codec with streaming encoder-decoder and quantized latents delivers higher fidelity than baselines at real-time speeds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a neural audio codec that compresses speech and music while keeping perceptual quality high enough to beat prior methods in listening tests. It trains an end-to-end streaming encoder-decoder whose latent space is quantized, using one multiscale spectrogram discriminator to suppress artifacts and a loss balancer that treats each loss weight as the target fraction of the total gradient. The same architecture supports optional lightweight Transformer layers that cut the final bitrate by up to 40 percent without losing real-time performance. These choices matter because real-time, high-quality compression is required for efficient streaming, storage, and transmission of audio on bandwidth-constrained devices. The authors show the gains hold for 24 kHz monophonic and 48 kHz stereophonic signals across clean speech, noisy-reverberant speech, and music.
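The loss balancer is concrete enough to sketch. Below is a minimal, illustrative implementation of the behavior described above: each weight sets the share of the total gradient a loss contributes, with an exponential moving average normalizing away each loss's raw scale. The class and attribute names (`LossBalancer`, `ema_norms`, `ref_norm`) are assumptions for illustration, not the released EnCodec API.

```python
# Hypothetical sketch of a gradient-balancing module in the spirit of the
# paper's loss balancer: weight w_i fixes the *fraction* of the total
# gradient that loss i contributes, regardless of the loss's raw scale.
import torch

class LossBalancer:
    def __init__(self, weights: dict[str, float], beta: float = 0.999,
                 ref_norm: float = 1.0):
        self.weights = weights            # target gradient fraction per loss
        self.beta = beta                  # EMA decay for gradient-norm tracking
        self.ref_norm = ref_norm          # total gradient-norm budget
        self.ema_norms = {name: 1.0 for name in weights}

    def backward(self, losses: dict[str, torch.Tensor],
                 model_output: torch.Tensor) -> None:
        total_weight = sum(self.weights.values())
        balanced_grad = torch.zeros_like(model_output)
        for name, loss in losses.items():
            # Gradient of this loss with respect to the decoder output only.
            (g,) = torch.autograd.grad(loss, [model_output], retain_graph=True)
            self.ema_norms[name] = (self.beta * self.ema_norms[name]
                                    + (1 - self.beta) * g.norm().item())
            # Rescale so this loss contributes its target share of ref_norm.
            scale = (self.ref_norm * self.weights[name]
                     / (total_weight * (self.ema_norms[name] + 1e-12)))
            balanced_grad += scale * g
        # One backward pass pushes the balanced gradient into the network.
        model_output.backward(balanced_grad)
```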

Core claim

The authors present a real-time high-fidelity neural audio codec built from a streaming encoder-decoder with a quantized latent space trained end-to-end. Training is stabilized by a single multiscale spectrogram adversary that reduces artifacts and by a novel loss-balancer module in which each loss weight directly sets the fraction of the overall gradient it contributes. Lightweight Transformer models can be stacked on the quantized representation to achieve up to 40 percent additional compression while remaining faster than real time. MUSHRA subjective tests across multiple bandwidths and domains establish superiority over existing codecs for both 24 kHz monophonic and 48 kHz stereophonic audio.
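The "up to 40 percent" figure rests on entropy coding the quantized indices under a learned model. The accounting is simple: if a small Transformer assigns probability p to the index that actually occurs, an ideal entropy coder spends about -log2(p) bits on it instead of the fixed log2(K) bits of a K-entry codebook. The function below is an illustrative sketch of that accounting, not code from the paper.

```python
# Hypothetical sketch: average bits per code index under a predictive model,
# versus the fixed-rate cost log2(K) of sending raw indices.
import math
import torch

def expected_bits_per_code(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """logits: (steps, K) predictions over codebook entries; targets: (steps,)
    indices that actually occurred. Returns mean bits per index under ideal
    entropy coding driven by the model's probabilities."""
    log_probs = torch.log_softmax(logits, dim=-1)
    nll_nats = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (nll_nats.mean() / math.log(2)).item()  # convert nats to bits

# A 1024-entry codebook costs log2(1024) = 10 bits per index at a fixed rate;
# any model averaging below 10 bits compresses the stream further.
```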

What carries the argument

Streaming encoder-decoder architecture with quantized latent space, trained using a single multiscale spectrogram adversary and a loss-balancer mechanism that decouples loss weights from gradient scale.
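For readers unfamiliar with quantized latent spaces of this kind, residual vector quantization (RVQ) is the standard construction: each codebook quantizes the residual left by the previous one, so codebooks can be added or dropped to trade bitrate against fidelity. A minimal sketch of the textbook formulation (not the released implementation):

```python
# Minimal residual vector quantization (RVQ) sketch.
import torch

def rvq_encode(z: torch.Tensor, codebooks: list[torch.Tensor]) -> list[torch.Tensor]:
    """z: (frames, dim); each codebook: (entries, dim).
    Each stage quantizes the residual left by the previous stage."""
    residual, indices = z, []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # (frames, entries) distances
        idx = dists.argmin(dim=-1)          # nearest codeword per frame
        indices.append(idx)
        residual = residual - cb[idx]       # hand the residual to the next stage
    return indices

def rvq_decode(indices: list[torch.Tensor], codebooks: list[torch.Tensor]) -> torch.Tensor:
    # Rebuild the latent by summing the selected codewords across stages.
    return sum(cb[idx] for idx, cb in zip(indices, codebooks))
```

Dropping trailing codebooks at decode time coarsens the reconstruction gracefully, which is what lets a codec of this kind serve several bitrates from one model.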

Load-bearing premise

The MUSHRA listening tests on the chosen audio domains and bandwidths are representative of real-world use, and the model does not overfit to the training distribution in ways that degrade on unseen content.

What would settle it

A new MUSHRA test on audio outside the training domains (for example, live concert recordings or rare speech accents) in which the neural codec is no longer rated higher than the baselines at the same bitrate.

read the original abstract

We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks. It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed-up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produce high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model including: training objective, architectural changes and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baselines methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio. Code and models are available at github.com/facebookresearch/encodec.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces EnCodec, a real-time neural audio codec using a streaming encoder-decoder with quantized latent representations trained end-to-end. Key contributions include a single multiscale spectrogram discriminator to reduce artifacts, a novel loss balancer that sets loss weights as target gradient fractions to stabilize training, and optional lightweight Transformer models for up to 40% further compression of the latent codes. The work reports extensive MUSHRA subjective evaluations across speech, noisy-reverberant speech, and music at multiple bandwidths for both 24 kHz mono and 48 kHz stereo audio, claiming consistent superiority over published baselines, with code and models released for reproducibility.

Significance. If the reported MUSHRA rankings hold, the work provides a practical advance in high-fidelity, low-latency neural audio compression with direct applicability to streaming and storage. Strengths include the public release of code and models, detailed ablation studies, and the loss-balancer formulation that decouples hyper-parameter choice from loss scale; these elements support replication and extension beyond the specific domains tested.

minor comments (3)
  1. [§3.3] The single multiscale spectrogram discriminator is described at a high level; adding the exact frequency scales and window sizes used would aid exact replication (a multi-scale STFT sketch follows this list).
  2. [Table 2] The MUSHRA scores for the 48 kHz stereo music condition would benefit from reported confidence intervals or standard deviations to quantify variability across listeners.
  3. [§4.2] The claim of 'parameter-free' behavior for certain loss terms is not fully supported by the listed free parameters (number of residual codebooks and balancer targets); a brief clarification on which quantities remain fixed would improve precision.
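On minor comment 1: the multi-scale idea itself is easy to pin down even without the paper's exact values. The sketch below computes magnitude spectrograms at several STFT resolutions, one per discriminator branch; the listed window sizes are common choices, not necessarily the ones the authors used.

```python
# Illustrative multi-scale spectrogram front-end for a discriminator.
import torch

def multiscale_spectrograms(wav: torch.Tensor,
                            n_ffts=(2048, 1024, 512, 256, 128)):
    """wav: (batch, samples). Returns one magnitude spectrogram per scale;
    each scale trades time resolution against frequency resolution."""
    specs = []
    for n_fft in n_ffts:
        window = torch.hann_window(n_fft, device=wav.device)
        stft = torch.stft(wav, n_fft=n_fft, hop_length=n_fft // 4,
                          window=window, return_complex=True)
        specs.append(stft.abs())            # (batch, n_fft // 2 + 1, frames)
    return specs
```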

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the practical contributions (including the loss balancer, public code release, and MUSHRA evaluations), and the recommendation to accept. We are pleased that the work's applicability to streaming audio was noted.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical neural audio codec (encoder-decoder with quantization, single multiscale spectrogram discriminator, and a proposed loss-balancer that normalizes gradient contributions by design). All central claims of superiority are grounded in external MUSHRA listening tests across speech, music, and stereo domains plus comparisons to published baselines, with code released for replication. No equation or training step reduces by construction to a fitted parameter renamed as a prediction, no self-citation chain is load-bearing for the architecture or results, and the loss-balancer is introduced as an explicit mechanism rather than derived from the target metric. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The model relies on standard assumptions of end-to-end differentiable training and perceptual loss functions; no new physical entities are postulated. Free parameters include the number of residual codebooks and the target gradient fractions in the loss balancer, chosen by hand or grid search.

free parameters (2)
  • number of residual codebooks
    Chosen to achieve target bitrate; directly controls the discrete representation size.
  • loss balancer target fractions
    Hyper-parameters that set the desired gradient contribution of each loss term; fitted to stabilize training.
axioms (1)
  • domain assumption Multiscale spectrogram discrimination is sufficient to suppress perceptual artifacts in audio reconstruction.
    Invoked in the training objective section to justify using a single discriminator instead of multiple.
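The first free parameter is easy to make concrete: with residual codebooks, bitrate is frame rate times bits per index times the number of codebooks. The arithmetic below assumes a 75 Hz latent frame rate and 1024-entry codebooks, the figures commonly quoted for this model's 24 kHz configuration.

```python
# Back-of-envelope bitrate arithmetic for the residual-codebook parameter,
# assuming 75 latent frames/s and 1024-entry (10-bit) codebooks.
frame_rate = 75             # latent frames per second (24000 / 320-sample hop)
bits_per_index = 10         # log2(1024) bits per codebook index

for n_codebooks in (2, 4, 8, 16, 32):
    kbps = frame_rate * bits_per_index * n_codebooks / 1000
    print(f"{n_codebooks:2d} codebooks -> {kbps:4.1f} kbps")
# 2 -> 1.5, 4 -> 3.0, 8 -> 6.0, 16 -> 12.0, 32 -> 24.0 kbps
```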

pith-pipeline@v0.9.0 · 5540 in / 1318 out tokens · 33298 ms · 2026-05-13T21:46:14.260988+00:00 · methodology


Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MusicLM: Generating Music From Text

    cs.SD 2023-01 conditional novelty 8.0

    MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.

  2. AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

    cs.SD 2026-05 unverdicted novelty 7.0

    AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

  3. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

    cs.CL 2026-05 unverdicted novelty 7.0

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

  4. PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

    cs.LG 2026-05 unverdicted novelty 7.0

    PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...

  5. Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

    eess.AS 2026-04 unverdicted novelty 7.0

    Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.

  6. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    cs.CV 2023-10 unverdicted novelty 7.0

    A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

  7. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

    cs.CL 2023-01 unverdicted novelty 7.0

    VALL-E is a neural codec language model trained on 60K hours of speech that performs zero-shot TTS, synthesizing natural speech that matches an unseen speaker's voice, emotion, and environment from a 3-second prompt.

  8. Seconds-Aligned PCA-DAC Latent Diffusion for Symbolic-to-Audio Drum Rendering

    cs.SD 2026-05 unverdicted novelty 6.0

    Sec2Drum-DAC renders drum audio from symbolic inputs via diffusion on PCA-reduced DAC latents, improving spectral and transient metrics over regression baselines on 1733 held-out windows.

  9. Compact Latent Manifold Translation: A Parameter-Efficient Foundation Model for Cross-Modal and Cross-Frequency Physiological Signal Synthesis

    eess.SP 2026-05 unverdicted novelty 6.0

    A compact 0.09B model using hierarchical discrete tokenization and prompted latent translation outperforms larger baselines in cross-modal PPG-to-ECG synthesis and cross-frequency super-resolution.

  10. MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

    cs.SD 2026-05 accept novelty 6.0

    MiniMind-O delivers a working 0.1B-scale open omni model with speech-native output, Thinker-Talker split, frozen encoders, and full release of code, checkpoints, and training data.

  11. Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

    eess.AS 2026-04 unverdicted novelty 6.0

    Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.

  12. LLM-Codec: Neural Audio Codec Meets Language Model Objectives

    cs.SD 2026-04 unverdicted novelty 6.0

    LLM-Codec augments audio codec training with multi-step token prediction and contrastive semantic alignment to improve both waveform reconstruction and autoregressive predictability for speech language models.

  13. HCFD: A Benchmark for Audio Deepfake Detection in Healthcare

    eess.AS 2026-04 unverdicted novelty 6.0

    HCFD is a new pathology-aware benchmark and dataset for codec-fake audio detection in healthcare, with PHOENIX-Mamba achieving up to 97% accuracy by modeling fakes as modes in hyperbolic space.

  14. Efficient Training for Cross-lingual Speech Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    CSLM achieves cross-modal and cross-lingual alignment in speech LLMs via continual pre-training on discrete tokens and speech-text interleaved instruction tuning, enabling scalability without massive speech datasets.

  15. Discrete Token Modeling for Multi-Stem Music Source Separation with Language Models

    eess.AS 2026-04 unverdicted novelty 6.0

    A Conformer-conditioned decoder-only language model generates discrete tokens via a neural audio codec to separate four music stems, reaching near state-of-the-art perceptual quality and top NISQA on vocals in MUSDB18...

  16. Diff-VS: Efficient Audio-Aware Diffusion U-Net for Vocals Separation

    eess.AS 2026-04 unverdicted novelty 6.0

    Diff-VS is an efficient audio-aware diffusion U-Net for vocal separation that matches discriminative baselines on objective metrics while achieving state-of-the-art perceptual quality via proxy measures.

  17. CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    cs.RO 2024-11 unverdicted novelty 6.0

    CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...

  18. Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    eess.AS 2023-11 unverdicted novelty 6.0

    Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.

  19. Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs

    cs.SD 2026-05 unverdicted novelty 5.0

    A Transformer predicts tokens from neural audio codecs (EnCodec, DAC, X-Codec) to convert expressive drum grids into audio, trained and evaluated on the E-GMD dataset using objective metrics.

  20. Diffusion Reconstruction towards Generalizable Audio Deepfake Detection

    cs.SD 2026-04 unverdicted novelty 5.0

    Diffusion reconstruction creates hard samples for audio deepfake detection training, and when paired with feature aggregation and RACL, it reduces average EER versus baselines.

  21. Adopting State-of-the-Art Pretrained Audio Representations for Music Recommender Systems

    cs.IR 2026-04 unverdicted novelty 5.0

    Pretrained audio models show large performance gaps between standard MIR tasks and music recommendation in both hot and cold-start settings.

  22. HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation

    cs.SD 2026-04 unverdicted novelty 5.0

    HAFM uses a hierarchical autoregressive model with dual-rate HuBERT and EnCodec tokens to generate coherent instrumental music from vocals, achieving FAD 2.08 on MUSDB18 while matching prior systems with fewer parameters.

  23. Woosh: A Sound Effects Foundation Model

    cs.SD 2026-04 accept novelty 5.0

    Woosh is a new publicly released foundation model optimized for high-quality sound effect generation from text or video, showing competitive or better results than open alternatives like Stable Audio Open.

  24. Movie Gen: A Cast of Media Foundation Models

    cs.CV 2024-10 unverdicted novelty 5.0

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 24 Pith papers · 4 internal anchors

  1. [1]

    Hifi++: a unified framework for neural vocoding, bandwidth extension and speech enhancement

    Pavel Andreev, Aibek Alanov, Oleg Ivanov, and Dmitry Vetrov. Hifi++: a unified framework for neural vocoding, bandwidth extension and speech enhancement. arXiv preprint arXiv:2203.13086,

  2. [2]

    Common voice: A massively-multilingual speech corpus,

    Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. Common voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670 ,

  3. [3]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450,

  4. [4]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation.arXiv preprint arXiv:1308.3432 ,

  5. [5]

    Single channel voice separation for unknown number of speakers under reverberant and noisy settings

    Shlomo E Chazan, Lior Wolf, Eliya Nachmani, and Yossi Adi. Single channel voice separation for unknown number of speakers under reverberant and noisy settings. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3730–3734. IEEE,

  6. [6]

    Global - 2021 forecast highlights - cisco

    Cisco. Global - 2021 forecast highlights - cisco. https://www.cisco.com/c/dam/m/en_us/solutions/service-provider/vni-forecast-highlights/pdf/Global_2021_Forecast_Highlights.pdf,

  7. [7]

    Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

    Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus).arXiv preprint arXiv:1511.07289 ,

  8. [8]

    Music source separation in the waveform domain

    Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach. Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254 ,

  9. [9]

    Real time speech enhancement in the waveform domain

    Alexandre Defossez, Gabriel Synnaeve, and Yossi Adi. Real time speech enhancement in the waveform domain. arXiv preprint arXiv:2006.12847 ,

  10. [10]

    Differentiable model compression via pseudo quantization noise

    Alexandre Défossez, Yossi Adi, and Gabriel Synnaeve. Differentiable model compression via pseudo quantization noise. arXiv preprint arXiv:2104.09987,

  11. [11]

    Jukebox: A generative model for music

    Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341,

  12. [12]

    ICASSP 2022 deep noise suppression challenge

    Harishchandra Dubey, Vishak Gopal, Ross Cutler, Sergiy Matusevych, Sebastian Braun, Emre Sefik Eskimez, Manthan Thakker, Takuya Yoshioka, Hannes Gamper, and Robert Aichner. ICASSP 2022 deep noise suppression challenge. In ICASSP,

  13. [13]

    Low bit-rate speech coding with vq-vae and a wavenet decoder

    Cristina Gârbacea, Aäron van den Oord, Yazhe Li, Felicia SC Lim, Alejandro Luebs, Oriol Vinyals, and Thomas C Walters. Low bit-rate speech coding with vq-vae and a wavenet decoder. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 735–739. IEEE,

  14. [14]

    Audio set: An ontology and human-labeled dataset for audio events

    Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 776–780. IEEE,

  15. [15]

    It’s raw! audio generation with state-space models

    Karan Goel, Albert Gu, Chris Donahue, and Christopher Ré. It’s raw! audio generation with state-space models. arXiv preprint arXiv:2202.09729 ,

  16. [16]

    Visqol: The virtual speech quality objective listener

    Andrew Hines, Jan Skoglund, Anil Kokaram, and Naomi Harte. Visqol: The virtual speech quality objective listener. In IWAENC 2012; International Workshop on Acoustic Signal Enhancement, pp. 1–4. VDE,

  17. [17]

    Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Technical Report 1502.03167, arXiv,

  18. [18]

    Architecture for variable bitrate neural speech codec with configurable computation complexity

    Tejas Jayashankar, Thilo Koehler, Kaustubh Kalgaonkar, Zhiping Xiu, Jilong Wu, Ju Lin, Prabhav Agrawal, and Qing He. Architecture for variable bitrate neural speech codec with configurable computation complexity. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 861–865. IEEE,

  19. [19]

    End-to-end neural speech coding for real-time communications

    Xue Jiang, Xiulian Peng, Chengyu Zheng, Huaying Xue, Yuan Zhang, and Yan Lu. End-to-end neural speech coding for real-time communications. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 866–870. IEEE,

  20. [20]

    Text-free prosody-aware generative spoken language modeling

    Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu-Anh Nguyen, Morgane Rivière, Abdelrahman Mohamed, Emmanuel Dupoux, et al. Text-free prosody-aware generative spoken language modeling. arXiv preprint arXiv:2109.03264,

  21. [21]

    Generative speech coding with predictive variance regularization

    W Bastiaan Kleijn, Andrew Storus, Michael Chinen, Tom Denton, Felicia SC Lim, Alejandro Luebs, Jan Skoglund, and Hengchin Yeh. Generative speech coding with predictive variance regularization. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6478–6482. IEEE,

  22. [22]

    Textless speech emotion conversion using decomposed and discrete representations

    Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu-Anh Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, and Yossi Adi. Textless speech emotion conversion using decomposed and discrete representations. arXiv preprint arXiv:2111.07402,

  23. [23]

    Direct speech-to-speech translation with discrete units

    Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Pino, et al. Direct speech-to-speech translation with discrete units. arXiv preprint arXiv:2107.05604, 2021a. Ann Lee, Hongyu Gong, Paul-Ambroise Duquenne, Holger Schwenk, Peng-Jen Chen, Changhan Wang, Sravya Popuri, Juan Pino, Jiatao Gu, and Wei-Ni...

  24. [24]

    Robust low rate speech coding based on cloned networks and wavenet

    Felicia SC Lim, W Bastiaan Kleijn, Michael Chinen, and Jan Skoglund. Robust low rate speech coding based on cloned networks and wavenet. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6769–6773. IEEE,

  25. [25]

    Speech enhancement for low bit rate speech codec

    Ju Lin, Kaustubh Kalgaonkar, Qing He, and Xin Lei. Speech enhancement for low bit rate speech codec. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7777–7781. IEEE,

  26. [26]

    Generative spoken dialogue language modeling

    Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, et al. Generative spoken dialogue language modeling. arXiv preprint arXiv:2203.16502 ,

  27. [27]

    Disentangling speech from surroundings in a neural audio codec

    Ahmed Omran, Neil Zeghidour, Zalán Borsos, Félix de Chaumont Quitry, Malcolm Slaney, and Marco Tagliasacchi. Disentangling speech from surroundings in a neural audio codec. arXiv preprint arXiv:2203.15578,

  28. [28]

    WaveNet: A Generative Model for Raw Audio

    Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499,

  29. [29]

    Speech resynthesis from discrete disentangled self-supervised representations

    Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, and Emmanuel Dupoux. Speech resynthesis from discrete disentangled self-supervised representations. arXiv preprint arXiv:2104.00355,

  30. [30]

    Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation

    Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, and Ann Lee. Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation. arXiv preprint arXiv:2204.02967 ,

  31. [31]

    Improving opus low bit rate quality with neural speech synthesis

    Jan Skoglund and Jean-Marc Valin. Improving opus low bit rate quality with neural speech synthesis. arXiv preprint arXiv:1905.04628,

  32. [32]

    Seanet: A multi-modal speech enhancement network

    Marco Tagliasacchi, Yunpeng Li, Karolis Misiunas, and Dominik Roblek. Seanet: A multi-modal speech enhancement network. arXiv preprint arXiv:2009.02095 ,

  33. [33]

    Lpcnet: Improving neural speech synthesis through linear prediction

    Jean-Marc Valin and Jan Skoglund. Lpcnet: Improving neural speech synthesis through linear prediction. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5891–5895. IEEE, 2019a. Jean-Marc Valin and Jan Skoglund. A real-time wideband neural vocoder at 1.6 kb/s using lpcnet. arXiv preprint arXiv:1903.1...

  34. [34]

    Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram

    Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199–6203. IEEE, 2020a. Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. Par...

  35. [35]

    Table A.1: Datasets description

    Table A.1: Datasets description. Licenses marked with an asterisk (*) vary across the dataset and are specific to each sample. Common Voice 7.0: speech, 48 kHz, 1 channel, 9,096 h, CC-0. DNS Challenge 4 (speech): speech, 48 kHz, 1 channel, 2,425 h, multiple*. AudioSet: general audio, 48 kHz, 2...

  36. [36]

    We used LeakyReLU as non-linear activation function

    was the only one that prevented the discriminator from diverging. We used LeakyReLU as non-linear activation function. Finally, training hyper-parameters are not shared either, so we use the same parameters as for our EnCodec model. A.2 Alternative quantizers. A.2.1 DiffQ Quantizer. Pseudo quantization noise. We perform scalar quantization of the latent represe...

  37. [37]

    We extend the DiffQ approach for latent space quantization, adding support for streamable rescaling, proper sparsity, and improved prior coding

    with a differentiable bandwidth estimate. We extend the DiffQ approach for latent space quantization, adding support for streamable rescaling, proper sparsity, and improved prior coding. Formally, we introduce a learnt parameter B ∈ R^D (with D the dimension of the latent space) such that B(i) represents the number of bits to use for the i-th dimension. In pract...

  38. [38]

    This gives us a differentiable approximately 1-hot vector over the codebooks, i.e., noting GS the gumbel-softmax, $z_{q,\mathrm{train}} = \sum_{i=1}^{N_C} \mathrm{GS}(\log(q_i(z)), \tau)^T C_i$

    with a temperature τ = 0.5. This gives us a differentiable approximately 1-hot vector over the codebooks, i.e., noting GS the gumbel-softmax, $z_{q,\mathrm{train}} = \sum_{i=1}^{N_C} \mathrm{GS}(\log(q_i(z)), \tau)^T C_i$ (8). At test time, we replace the gumbel-softmax with a sampling from the distribution q_i. We define for all i, p_i = softmax(l_i) the prior distribution over the codebook entries w...