pith. machine review for the scientific record.

arxiv: 2210.13438 · v1 · submitted 2022-10-24 · 📡 eess.AS · cs.AI · cs.SD · stat.ML

Recognition: no theorem link

High Fidelity Neural Audio Compression

Alexandre Défossez, Gabriel Synnaeve, Jade Copet, Yossi Adi

Pith reviewed 2026-05-13 21:46 UTC · model grok-4.3

classification 📡 eess.AS · cs.AI · cs.SD · stat.ML
keywords neural audio codec · high-fidelity compression · streaming encoder-decoder · quantized latent space · multiscale spectrogram adversary · loss balancer · transformer compression · MUSHRA evaluation

The pith

A neural network audio codec with streaming encoder-decoder and quantized latents delivers higher fidelity than baselines at real-time speeds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a neural audio codec that compresses speech and music while keeping perceptual quality high enough to beat prior methods in listening tests. It trains an end-to-end streaming encoder-decoder whose latent space is quantized, using one multiscale spectrogram discriminator to suppress artifacts and a loss balancer that treats each loss weight as the target fraction of the total gradient. The same architecture supports optional lightweight Transformer layers that cut the final bitrate by up to 40 percent without losing real-time performance. These choices matter because real-time, high-quality compression is required for efficient streaming, storage, and transmission of audio on bandwidth-constrained devices. The authors show the gains hold for 24 kHz monophonic and 48 kHz stereophonic signals across clean speech, noisy-reverberant speech, and music.
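The loss balancer is concrete enough to sketch. Below is a minimal, illustrative implementation of the behavior described above: each weight sets the share of the total gradient a loss contributes, with an exponential moving average normalizing away each loss's raw scale. The class and attribute names (`LossBalancer`, `ema_norms`, `ref_norm`) are assumptions for illustration, not the released EnCodec API.

```python
# Hypothetical sketch of a gradient-balancing module in the spirit of the
# paper's loss balancer: weight w_i fixes the *fraction* of the total
# gradient that loss i contributes, regardless of the loss's raw scale.
import torch

class LossBalancer:
    def __init__(self, weights: dict[str, float], beta: float = 0.999,
                 ref_norm: float = 1.0):
        self.weights = weights            # target gradient fraction per loss
        self.beta = beta                  # EMA decay for gradient-norm tracking
        self.ref_norm = ref_norm          # total gradient-norm budget
        self.ema_norms = {name: 1.0 for name in weights}

    def backward(self, losses: dict[str, torch.Tensor],
                 model_output: torch.Tensor) -> None:
        total_weight = sum(self.weights.values())
        balanced_grad = torch.zeros_like(model_output)
        for name, loss in losses.items():
            # Gradient of this loss with respect to the decoder output only.
            (g,) = torch.autograd.grad(loss, [model_output], retain_graph=True)
            self.ema_norms[name] = (self.beta * self.ema_norms[name]
                                    + (1 - self.beta) * g.norm().item())
            # Rescale so this loss contributes its target share of ref_norm.
            scale = (self.ref_norm * self.weights[name]
                     / (total_weight * (self.ema_norms[name] + 1e-12)))
            balanced_grad += scale * g
        # One backward pass pushes the balanced gradient into the network.
        model_output.backward(balanced_grad)
```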

Core claim

The authors present a real-time high-fidelity neural audio codec built from a streaming encoder-decoder with a quantized latent space trained end-to-end. Training is stabilized by a single multiscale spectrogram adversary that reduces artifacts and by a novel loss-balancer module in which each loss weight directly sets the fraction of the overall gradient it contributes. Lightweight Transformer models can be stacked on the quantized representation to achieve up to 40 percent additional compression while remaining faster than real time. MUSHRA subjective tests across multiple bandwidths and domains establish superiority over existing codecs for both 24 kHz monophonic and 48 kHz stereophonic audio.
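The "up to 40 percent" figure rests on entropy coding the quantized indices under a learned model. The accounting is simple: if a small Transformer assigns probability p to the index that actually occurs, an ideal entropy coder spends about -log2(p) bits on it instead of the fixed log2(K) bits of a K-entry codebook. The function below is an illustrative sketch of that accounting, not code from the paper.

```python
# Hypothetical sketch: average bits per code index under a predictive model,
# versus the fixed-rate cost log2(K) of sending raw indices.
import math
import torch

def expected_bits_per_code(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """logits: (steps, K) predictions over codebook entries; targets: (steps,)
    indices that actually occurred. Returns mean bits per index under ideal
    entropy coding driven by the model's probabilities."""
    log_probs = torch.log_softmax(logits, dim=-1)
    nll_nats = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (nll_nats.mean() / math.log(2)).item()  # convert nats to bits

# A 1024-entry codebook costs log2(1024) = 10 bits per index at a fixed rate;
# any model averaging below 10 bits compresses the stream further.
```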

What carries the argument

Streaming encoder-decoder architecture with quantized latent space, trained using a single multiscale spectrogram adversary and a loss-balancer mechanism that decouples loss weights from gradient scale.
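For readers unfamiliar with quantized latent spaces of this kind, residual vector quantization (RVQ) is the standard construction: each codebook quantizes the residual left by the previous one, so codebooks can be added or dropped to trade bitrate against fidelity. A minimal sketch of the textbook formulation (not the released implementation):

```python
# Minimal residual vector quantization (RVQ) sketch.
import torch

def rvq_encode(z: torch.Tensor, codebooks: list[torch.Tensor]) -> list[torch.Tensor]:
    """z: (frames, dim); each codebook: (entries, dim).
    Each stage quantizes the residual left by the previous stage."""
    residual, indices = z, []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # (frames, entries) distances
        idx = dists.argmin(dim=-1)          # nearest codeword per frame
        indices.append(idx)
        residual = residual - cb[idx]       # hand the residual to the next stage
    return indices

def rvq_decode(indices: list[torch.Tensor], codebooks: list[torch.Tensor]) -> torch.Tensor:
    # Rebuild the latent by summing the selected codewords across stages.
    return sum(cb[idx] for idx, cb in zip(indices, codebooks))
```

Dropping trailing codebooks at decode time coarsens the reconstruction gracefully, which is what lets a codec of this kind serve several bitrates from one model.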

Load-bearing premise

The MUSHRA listening tests on the chosen audio domains and bandwidths are representative of real-world use, and the model does not overfit to the training distribution in ways that degrade on unseen content.

What would settle it

A new MUSHRA test on audio outside the training domains (for example, live concert recordings or rare speech accents) in which the neural codec is no longer rated higher than the baselines at the same bitrate.

read the original abstract

We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks. It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed-up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produce high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model including: training objective, architectural changes and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baselines methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio. Code and models are available at github.com/facebookresearch/encodec.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces EnCodec, a real-time neural audio codec using a streaming encoder-decoder with quantized latent representations trained end-to-end. Key contributions include a single multiscale spectrogram discriminator to reduce artifacts, a novel loss balancer that sets loss weights as target gradient fractions to stabilize training, and optional lightweight Transformer models for up to 40% further compression of the latent codes. The work reports extensive MUSHRA subjective evaluations across speech, noisy-reverberant speech, and music at multiple bandwidths for both 24 kHz mono and 48 kHz stereo audio, claiming consistent superiority over published baselines, with code and models released for reproducibility.

Significance. If the reported MUSHRA rankings hold, the work provides a practical advance in high-fidelity, low-latency neural audio compression with direct applicability to streaming and storage. Strengths include the public release of code and models, detailed ablation studies, and the loss-balancer formulation that decouples hyper-parameter choice from loss scale; these elements support replication and extension beyond the specific domains tested.

minor comments (3)
  1. [§3.3] The single multiscale spectrogram discriminator is described at a high level; adding the exact frequency scales and window sizes used would aid exact replication (a multi-scale STFT sketch follows this list).
  2. [Table 2] The MUSHRA scores for the 48 kHz stereo music condition would benefit from reported confidence intervals or standard deviations to quantify variability across listeners.
  3. [§4.2] The claim of 'parameter-free' behavior for certain loss terms is not fully supported by the listed free parameters (number of residual codebooks and balancer targets); a brief clarification on which quantities remain fixed would improve precision.
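On minor comment 1: the multi-scale idea itself is easy to pin down even without the paper's exact values. The sketch below computes magnitude spectrograms at several STFT resolutions, one per discriminator branch; the listed window sizes are common choices, not necessarily the ones the authors used.

```python
# Illustrative multi-scale spectrogram front-end for a discriminator.
import torch

def multiscale_spectrograms(wav: torch.Tensor,
                            n_ffts=(2048, 1024, 512, 256, 128)):
    """wav: (batch, samples). Returns one magnitude spectrogram per scale;
    each scale trades time resolution against frequency resolution."""
    specs = []
    for n_fft in n_ffts:
        window = torch.hann_window(n_fft, device=wav.device)
        stft = torch.stft(wav, n_fft=n_fft, hop_length=n_fft // 4,
                          window=window, return_complex=True)
        specs.append(stft.abs())            # (batch, n_fft // 2 + 1, frames)
    return specs
```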

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the practical contributions (including the loss balancer, public code release, and MUSHRA evaluations), and the recommendation to accept. We are pleased that the work's applicability to streaming audio was noted.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical neural audio codec (encoder-decoder with quantization, single multiscale spectrogram discriminator, and a proposed loss-balancer that normalizes gradient contributions by design). All central claims of superiority are grounded in external MUSHRA listening tests across speech, music, and stereo domains plus comparisons to published baselines, with code released for replication. No equation or training step reduces by construction to a fitted parameter renamed as a prediction, no self-citation chain is load-bearing for the architecture or results, and the loss-balancer is introduced as an explicit mechanism rather than derived from the target metric. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The model relies on standard assumptions of end-to-end differentiable training and perceptual loss functions; no new physical entities are postulated. Free parameters include the number of residual codebooks and the target gradient fractions in the loss balancer, chosen by hand or grid search.

free parameters (2)
  • number of residual codebooks
    Chosen to achieve target bitrate; directly controls the discrete representation size.
  • loss balancer target fractions
    Hyper-parameters that set the desired gradient contribution of each loss term; fitted to stabilize training.
axioms (1)
  • domain assumption Multiscale spectrogram discrimination is sufficient to suppress perceptual artifacts in audio reconstruction.
    Invoked in the training objective section to justify using a single discriminator instead of multiple.
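The first free parameter is easy to make concrete: with residual codebooks, bitrate is frame rate times bits per index times the number of codebooks. The arithmetic below assumes a 75 Hz latent frame rate and 1024-entry codebooks, the figures commonly quoted for this model's 24 kHz configuration.

```python
# Back-of-envelope bitrate arithmetic for the residual-codebook parameter,
# assuming 75 latent frames/s and 1024-entry (10-bit) codebooks.
frame_rate = 75             # latent frames per second (24000 / 320-sample hop)
bits_per_index = 10         # log2(1024) bits per codebook index

for n_codebooks in (2, 4, 8, 16, 32):
    kbps = frame_rate * bits_per_index * n_codebooks / 1000
    print(f"{n_codebooks:2d} codebooks -> {kbps:4.1f} kbps")
# 2 -> 1.5, 4 -> 3.0, 8 -> 6.0, 16 -> 12.0, 32 -> 24.0 kbps
```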

pith-pipeline@v0.9.0 · 5540 in / 1318 out tokens · 33298 ms · 2026-05-13T21:46:14.260988+00:00 · methodology


Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MusicLM: Generating Music From Text

    cs.SD 2023-01 conditional novelty 8.0

    MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.

  2. AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

    cs.SD 2026-05 unverdicted novelty 7.0

    AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

  3. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

    cs.CL 2026-05 unverdicted novelty 7.0

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

  4. PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

    cs.LG 2026-05 unverdicted novelty 7.0

    PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...

  5. Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

    eess.AS 2026-04 unverdicted novelty 7.0

    Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.

  6. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    cs.CV 2023-10 unverdicted novelty 7.0

    A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

  7. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

    cs.CL 2023-01 unverdicted novelty 7.0

    VALL-E is a neural codec language model trained on 60K hours of speech that performs zero-shot TTS, synthesizing natural speech that matches an unseen speaker's voice, emotion, and environment from a 3-second prompt.

  8. Seconds-Aligned PCA-DAC Latent Diffusion for Symbolic-to-Audio Drum Rendering

    cs.SD 2026-05 unverdicted novelty 6.0

    Sec2Drum-DAC renders drum audio from symbolic inputs via diffusion on PCA-reduced DAC latents, improving spectral and transient metrics over regression baselines on 1733 held-out windows.

  9. Compact Latent Manifold Translation: A Parameter-Efficient Foundation Model for Cross-Modal and Cross-Frequency Physiological Signal Synthesis

    eess.SP 2026-05 unverdicted novelty 6.0

    A compact 0.09B model using hierarchical discrete tokenization and prompted latent translation outperforms larger baselines in cross-modal PPG-to-ECG synthesis and cross-frequency super-resolution.

  10. MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

    cs.SD 2026-05 accept novelty 6.0

    MiniMind-O delivers a working 0.1B-scale open omni model with speech-native output, Thinker-Talker split, frozen encoders, and full release of code, checkpoints, and training data.

  11. Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

    eess.AS 2026-04 unverdicted novelty 6.0

    Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.

  12. LLM-Codec: Neural Audio Codec Meets Language Model Objectives

    cs.SD 2026-04 unverdicted novelty 6.0

    LLM-Codec augments audio codec training with multi-step token prediction and contrastive semantic alignment to improve both waveform reconstruction and autoregressive predictability for speech language models.

  13. HCFD: A Benchmark for Audio Deepfake Detection in Healthcare

    eess.AS 2026-04 unverdicted novelty 6.0

    HCFD is a new pathology-aware benchmark and dataset for codec-fake audio detection in healthcare, with PHOENIX-Mamba achieving up to 97% accuracy by modeling fakes as modes in hyperbolic space.

  14. Efficient Training for Cross-lingual Speech Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    CSLM achieves cross-modal and cross-lingual alignment in speech LLMs via continual pre-training on discrete tokens and speech-text interleaved instruction tuning, enabling scalability without massive speech datasets.

  15. Discrete Token Modeling for Multi-Stem Music Source Separation with Language Models

    eess.AS 2026-04 unverdicted novelty 6.0

    A Conformer-conditioned decoder-only language model generates discrete tokens via a neural audio codec to separate four music stems, reaching near state-of-the-art perceptual quality and top NISQA on vocals in MUSDB18...

  16. Diff-VS: Efficient Audio-Aware Diffusion U-Net for Vocals Separation

    eess.AS 2026-04 unverdicted novelty 6.0

    Diff-VS is an efficient audio-aware diffusion U-Net for vocal separation that matches discriminative baselines on objective metrics while achieving state-of-the-art perceptual quality via proxy measures.

  17. CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    cs.RO 2024-11 unverdicted novelty 6.0

    CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...

  18. Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    eess.AS 2023-11 unverdicted novelty 6.0

    Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.

  19. Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs

    cs.SD 2026-05 unverdicted novelty 5.0

    A Transformer predicts tokens from neural audio codecs (EnCodec, DAC, X-Codec) to convert expressive drum grids into audio, trained and evaluated on the E-GMD dataset using objective metrics.

  20. Diffusion Reconstruction towards Generalizable Audio Deepfake Detection

    cs.SD 2026-04 unverdicted novelty 5.0

    Diffusion reconstruction creates hard samples for audio deepfake detection training, and when paired with feature aggregation and RACL, it reduces average EER versus baselines.

  21. Adopting State-of-the-Art Pretrained Audio Representations for Music Recommender Systems

    cs.IR 2026-04 unverdicted novelty 5.0

    Pretrained audio models show large performance gaps between standard MIR tasks and music recommendation in both hot and cold-start settings.

  22. HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation

    cs.SD 2026-04 unverdicted novelty 5.0

    HAFM uses a hierarchical autoregressive model with dual-rate HuBERT and EnCodec tokens to generate coherent instrumental music from vocals, achieving FAD 2.08 on MUSDB18 while matching prior systems with fewer parameters.

  23. Woosh: A Sound Effects Foundation Model

    cs.SD 2026-04 accept novelty 5.0

    Woosh is a new publicly released foundation model optimized for high-quality sound effect generation from text or video, showing competitive or better results than open alternatives like Stable Audio Open.

  24. Movie Gen: A Cast of Media Foundation Models

    cs.CV 2024-10 unverdicted novelty 5.0

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 24 Pith papers · 4 internal anchors

  1. [1]

    Hifi++: a unified framework for neural vocoding, bandwidth extension and speech enhancement

    Pavel Andreev, Aibek Alanov, Oleg Ivanov, and Dmitry Vetrov. Hifi++: a unified framework for neural vocoding, bandwidth extension and speech enhancement. arXiv preprint arXiv:2203.13086,

  2. [2]

    Common voice: A massively-multilingual speech corpus,

    Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. Common voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670 ,

  3. [3]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450,

  4. [4]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation.arXiv preprint arXiv:1308.3432 ,

  5. [5]

    Single channel voice separation for unknown number of speakers under reverberant and noisy settings

    Shlomo E Chazan, Lior Wolf, Eliya Nachmani, and Yossi Adi. Single channel voice separation for unknown number of speakers under reverberant and noisy settings. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3730–3734. IEEE,

  6. [6]

    Global - 2021 forecast highlights - cisco

    Cisco. Global - 2021 forecast highlights - cisco. https://www.cisco.com/c/dam/m/en_us/solutions/service-provider/vni-forecast-highlights/pdf/Global_2021_Forecast_Highlights.pdf,

  7. [7]

    Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

    Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus).arXiv preprint arXiv:1511.07289 ,

  8. [8]

    Music source separation in the waveform domain

    Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach. Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254 ,

  9. [9]

    Real time speech enhancement in the waveform domain

    Alexandre Defossez, Gabriel Synnaeve, and Yossi Adi. Real time speech enhancement in the waveform domain. arXiv preprint arXiv:2006.12847 ,

  10. [10]

    Differentiable model compression via pseudo quantization noise

    Alexandre Défossez, Yossi Adi, and Gabriel Synnaeve. Differentiable model compression via pseudo quantization noise. arXiv preprint arXiv:2104.09987,

  11. [11]

    Jukebox: A generative model for music

    Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341,

  12. [12]

    ICASSP 2022 deep noise suppression challenge

    Harishchandra Dubey, Vishak Gopal, Ross Cutler, Sergiy Matusevych, Sebastian Braun, Emre Sefik Eskimez, Manthan Thakker, Takuya Yoshioka, Hannes Gamper, and Robert Aichner. ICASSP 2022 deep noise suppression challenge. In ICASSP,

  13. [13]

    Low bit-rate speech coding with vq-vae and a wavenet decoder

    Cristina Gârbacea, Aäron van den Oord, Yazhe Li, Felicia SC Lim, Alejandro Luebs, Oriol Vinyals, and Thomas C Walters. Low bit-rate speech coding with vq-vae and a wavenet decoder. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 735–739. IEEE,

  14. [14]

    Audio set: An ontology and human-labeled dataset for audio events

    Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 776–780. IEEE,

  15. [15]

    It’s raw! audio generation with state-space models

    Karan Goel, Albert Gu, Chris Donahue, and Christopher Ré. It’s raw! audio generation with state-space models. arXiv preprint arXiv:2202.09729 ,

  16. [16]

    Visqol: The virtual speech quality objective listener

    Andrew Hines, Jan Skoglund, Anil Kokaram, and Naomi Harte. Visqol: The virtual speech quality objective listener. In IWAENC 2012; International Workshop on Acoustic Signal Enhancement, pp. 1–4. VDE,

  17. [17]

    Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Technical Report 1502.03167, arXiv,

  18. [18]

    Architecture for variable bitrate neural speech codec with configurable computation complexity

    Tejas Jayashankar, Thilo Koehler, Kaustubh Kalgaonkar, Zhiping Xiu, Jilong Wu, Ju Lin, Prabhav Agrawal, and Qing He. Architecture for variable bitrate neural speech codec with configurable computation complexity. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 861–865. IEEE,

  19. [19]

    End-to-end neural speech coding for real-time communications

    Xue Jiang, Xiulian Peng, Chengyu Zheng, Huaying Xue, Yuan Zhang, and Yan Lu. End-to-end neural speech coding for real-time communications. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 866–870. IEEE,

  20. [20]

    Text-free prosody-aware generative spoken language modeling

    Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu-Anh Nguyen, Morgane Rivière, Abdelrahman Mohamed, Emmanuel Dupoux, et al. Text-free prosody-aware generative spoken language modeling. arXiv preprint arXiv:2109.03264,

  21. [21]

    Generative speech coding with predictive variance regularization

    W Bastiaan Kleijn, Andrew Storus, Michael Chinen, Tom Denton, Felicia SC Lim, Alejandro Luebs, Jan Skoglund, and Hengchin Yeh. Generative speech coding with predictive variance regularization. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6478–6482. IEEE,

  22. [22]

    Textless speech emotion conversion using decomposed and discrete representations

    Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu-Anh Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, and Yossi Adi. Textless speech emotion conversion using decomposed and discrete representations. arXiv preprint arXiv:2111.07402,

  23. [23]

    Direct speech-to-speech translation with discrete units

    Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Pino, et al. Direct speech-to-speech translation with discrete units. arXiv preprint arXiv:2107.05604, 2021a. Ann Lee, Hongyu Gong, Paul-Ambroise Duquenne, Holger Schwenk, Peng-Jen Chen, Changhan Wang, Sravya Popuri, Juan Pino, Jiatao Gu, and Wei-Ni...

  24. [24]

    Robust low rate speech coding based on cloned networks and wavenet

    Felicia SC Lim, W Bastiaan Kleijn, Michael Chinen, and Jan Skoglund. Robust low rate speech coding based on cloned networks and wavenet. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6769–6773. IEEE,

  25. [25]

    Speech enhancement for low bit rate speech codec

    Ju Lin, Kaustubh Kalgaonkar, Qing He, and Xin Lei. Speech enhancement for low bit rate speech codec. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7777–7781. IEEE,

  26. [26]

    Generative spoken dialogue language modeling

    Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, et al. Generative spoken dialogue language modeling. arXiv preprint arXiv:2203.16502 ,

  27. [27]

    Disentangling speech from surroundings in a neural audio codec

    Ahmed Omran, Neil Zeghidour, Zalán Borsos, Félix de Chaumont Quitry, Malcolm Slaney, and Marco Tagliasacchi. Disentangling speech from surroundings in a neural audio codec. arXiv preprint arXiv:2203.15578,

  28. [28]

    WaveNet: A Generative Model for Raw Audio

    Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499,

  29. [29]

    Speech resynthesis from discrete disentangled self-supervised representations

    Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, and Emmanuel Dupoux. Speech resynthesis from discrete disentangled self-supervised representations. arXiv preprint arXiv:2104.00355,

  30. [30]

    Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation

    Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, and Ann Lee. Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation. arXiv preprint arXiv:2204.02967 ,

  31. [31]

    Improving opus low bit rate quality with neural speech synthesis

    Jan Skoglund and Jean-Marc Valin. Improving opus low bit rate quality with neural speech synthesis. arXiv preprint arXiv:1905.04628,

  32. [32]

    Seanet: A multi-modal speech enhancement network

    Marco Tagliasacchi, Yunpeng Li, Karolis Misiunas, and Dominik Roblek. Seanet: A multi-modal speech enhancement network. arXiv preprint arXiv:2009.02095 ,

  33. [33]

    Lpcnet: Improving neural speech synthesis through linear prediction

    Jean-Marc Valin and Jan Skoglund. Lpcnet: Improving neural speech synthesis through linear prediction. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5891–5895. IEEE, 2019a. Jean-Marc Valin and Jan Skoglund. A real-time wideband neural vocoder at 1.6 kb/s using lpcnet. arXiv preprint arXiv:1903.1...

  34. [34]

    Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram

    Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199–6203. IEEE, 2020a. Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. Par...

  35. [35]

    Table A.1: Datasets description

    Table A.1: Datasets description. Licenses marked with an asterisk (*) vary across the dataset and are specific to each sample. Common Voice 7.0: speech, 48 kHz, 1 channel, 9,096 h, CC-0. DNS Challenge 4 (speech): speech, 48 kHz, 1 channel, 2,425 h, multiple*. AudioSet: general audio, 48 kHz, 2...

  36. [36]

    We used LeakyReLU as non-linear activation function

    was the only one that prevented the discriminator from diverging. We used LeakyReLU as non-linear activation function. Finally, training hyper-parameters are not shared either, so we use the same parameters as for our EnCodec model. A.2 Alternative quantizers. A.2.1 DiffQ Quantizer. Pseudo quantization noise. We perform scalar quantization of the latent represe...

  37. [37]

    We extend the DiffQ approach for latent space quantization, adding support for streamable rescaling, proper sparsity, and improved prior coding

    with a differentiable bandwidth estimate. We extend the DiffQ approach for latent space quantization, adding support for streamable rescaling, proper sparsity, and improved prior coding. Formally, we introduce a learnt parameter B ∈ R^D (with D the dimension of the latent space) such that B(i) represents the number of bits to use for the i-th dimension. In pract...

  38. [38]

    This gives us a differentiable approximately 1-hot vector over the codebooks, i.e., noting GS the gumbel-softmax, $z_{q,\mathrm{train}} = \sum_{i=1}^{N_C} \mathrm{GS}(\log(q_i(z)), \tau)^T C_i$

    with a temperature τ = 0.5. This gives us a differentiable approximately 1-hot vector over the codebooks, i.e., noting GS the gumbel-softmax, $z_{q,\mathrm{train}} = \sum_{i=1}^{N_C} \mathrm{GS}(\log(q_i(z)), \tau)^T C_i$ (8). At test time, we replace the gumbel-softmax with a sampling from the distribution q_i. We define for all i, p_i = softmax(l_i) the prior distribution over the codebook entries w...