WaveNet: A Generative Model for Raw Audio

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves · 2016 · cs.SD · arXiv 1609.03499

Canonical reference. 100% of citing Pith papers cite this work as background.

64 Pith papers citing it

Background 100% of classified citations

abstract

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
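The abstract's description of a "fully probabilistic and autoregressive" model, where each sample is conditioned on all previous ones, corresponds to factorizing the joint probability of a waveform as a product of per-sample conditionals. The sketch below illustrates only that factorization on a toy example. It is not the WaveNet architecture: `toy_conditional` is a hypothetical stand-in for a trained network, and the 4-level quantization is an assumption made to keep the example small.

```python
import math
import random


def toy_conditional(history, num_levels=4):
    """Hypothetical stand-in for a trained model (not WaveNet itself).

    Returns a categorical distribution over the next quantized sample.
    It receives the full history, although this toy rule only looks at
    the most recent sample and makes repeating it twice as likely.
    """
    if not history:
        return [1.0 / num_levels] * num_levels
    weights = [2.0 if k == history[-1] else 1.0 for k in range(num_levels)]
    total = sum(weights)
    return [w / total for w in weights]


def log_likelihood(samples, num_levels=4):
    """Autoregressive factorization: log p(x) = sum_t log p(x_t | x_<t)."""
    return sum(
        math.log(toy_conditional(samples[:t], num_levels)[x])
        for t, x in enumerate(samples)
    )


def generate(length, num_levels=4, seed=0):
    """Sample one value at a time, feeding each draw back in as history."""
    rng = random.Random(seed)
    out = []
    for _ in range(length):
        probs = toy_conditional(out, num_levels)
        out.append(rng.choices(range(num_levels), weights=probs)[0])
    return out


if __name__ == "__main__":
    wave = generate(8)
    print(wave, log_likelihood(wave))
```

Training can score every position of a known sequence at once, while generation has to proceed one sample at a time because each draw becomes part of the next step's history.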

citation-role summary

background 12

citation-polarity summary

background 12

representative citing papers

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

Efficiently Modeling Long Sequences with Structured State Spaces

cs.LG · 2021-10-31 · unverdicted · novelty 8.0

S4 is an efficient state space sequence model that captures long-range dependencies via structured parameterization of the SSM, achieving state-of-the-art results on the Long Range Arena and other benchmarks while being faster than Transformers for generation.

Denoising Diffusion Implicit Models

cs.LG · 2020-10-06 · unverdicted · novelty 8.0

DDIMs construct non-Markovian diffusion processes that share DDPM training objectives but allow much faster reverse sampling, demonstrated empirically at 10-50x wall-clock speedup.

DiffWave: A Versatile Diffusion Model for Audio Synthesis

eess.AS · 2020-09-21 · unverdicted · novelty 8.0

DiffWave is a non-autoregressive diffusion model that generates high-fidelity audio waveforms from noise in constant steps, matching WaveNet vocoder quality while being orders of magnitude faster and outperforming prior models in unconditional generation.

Denoising Diffusion Probabilistic Models

cs.LG · 2020-06-19 · accept · novelty 8.0

Denoising diffusion probabilistic models generate high-quality images by learning to reverse a fixed forward diffusion process, achieving FID 3.17 on CIFAR-10.

Contrast to Detect: Dynamic Graph Contrastive Regularization for Unsupervised Anomaly Detection in Multivariate Time Series

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

ContrastAD achieves highest mean F1 on all five MTS benchmarks and highest AUC on three by building DTW-based sparse graph snapshots and contrasting divergent pairs with a stable anchor instead of enforcing invariance.

Scale-Equivariant Generative Forecasting: Weight-Tied Dilated Convolutions, Wavelet Scattering Inputs, and Spectral-Consistency Training for Self-Similar Time Series

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

Presents SE-WaveNet with weight-tied dilated convolutions plus wavelet and spectral components that reproduces empirical scaling collapse on financial returns while using L times fewer convolutional parameters.

Neural network modeling of many-body super- and sub-radiant dynamics

quant-ph · 2026-05-06 · unverdicted · novelty 7.0

Neural quantum states simulate dissipative many-body emission dynamics for approximately 40 atoms in dense 1D and 2D arrays, revealing prominent subradiant behavior at late times.

MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech

cs.SD · 2026-05-02 · unverdicted · novelty 7.0

MelShield adds keyed low-energy spread-spectrum perturbations to Mel-spectrograms inside TTS pipelines before vocoding to enable robust extraction of user-specific attribution signals even after compression or noise.

DiffAnon: Diffusion-based Prosody Control for Voice Anonymization

eess.AS · 2026-04-29 · unverdicted · novelty 7.0

DiffAnon introduces the first diffusion model for voice anonymization that supplies structured, continuous, inference-time control over prosody preservation via classifier-free guidance on RVQ semantic embeddings.

Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

eess.AS · 2026-04-21 · unverdicted · novelty 7.0

Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharyya distance.

ReLU Networks for Exact Generation of Similar Graphs

cs.LG · 2026-04-07 · unverdicted · novelty 7.0

Constant-depth ReLU networks of size O(n²d) exist that deterministically generate graphs within edit distance d from any given n-vertex input graph.

Moshi: a speech-text foundation model for real-time dialogue

eess.AS · 2024-09-17 · accept · novelty 7.0

Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

Deep Time Series Models: A Comprehensive Survey and Benchmark

cs.LG · 2024-07-18 · unverdicted · novelty 7.0

This survey and benchmark of deep time series models using the released TSLib library finds that models with specific structures perform well only on distinct analysis tasks.

Chronos: Learning the Language of Time Series

cs.LG · 2024-03-12 · conditional · novelty 7.0

Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.

Massive Activations in Large Language Models

cs.CL · 2024-02-27 · unverdicted · novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

A decoder-only foundation model for time-series forecasting

cs.CL · 2023-10-14 · unverdicted · novelty 7.0

A pretrained decoder-only patched transformer achieves near state-of-the-art zero-shot forecasting performance across diverse time series datasets and settings.

High Fidelity Neural Audio Compression

eess.AS · 2022-10-24 · accept · novelty 7.0

EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same bitrates for 24 kHz mono and 48 kHz stereo audio.

A Generalist Agent

cs.AI · 2022-05-12 · accept · novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

Diffusion Models Beat GANs on Image Synthesis

cs.LG · 2021-05-11 · accept · novelty 7.0

Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.

Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

cs.CL · 2019-07-09 · unverdicted · novelty 7.0

A Tacotron model with phonemic inputs and adversarial disentanglement enables cross-lingual voice cloning without parallel data, producing intelligible speech in native and foreign accents.

Analysis by Adversarial Synthesis -- A Novel Approach for Speech Vocoding

eess.AS · 2019-07-01 · unverdicted · novelty 7.0

A conditional GAN is used to synthesize speech waveforms from compressed glottal excitation, refined by LPC parameters, yielding higher quality reconstructions than traditional methods on a 30-speaker dataset.

RUSLAN: Russian Spoken Language Corpus for Speech Synthesis

eess.AS · 2019-06-26 · unverdicted · novelty 7.0

RUSLAN is a 31-hour single-speaker Russian speech corpus for TTS containing 22200 annotated samples, with a baseline end-to-end model scoring 4.05 naturalness and 3.78 intelligibility on MOS tests.

Generating Long Sequences with Sparse Transformers

cs.LG · 2019-04-23 · unverdicted · novelty 7.0

Sparse Transformers factorize attention to handle sequences tens of thousands long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.

citing papers explorer

Showing 2 of 2 citing papers after filters.

A Generalist Agent

cs.AI · 2022-05-12 · accept · none · ref 42 · internal anchor

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

AIBuildAI: An AI Agent for Automatically Building AI Models

cs.AI · 2026-04-15 · unverdicted · none · ref 60 · internal anchor

AIBuildAI uses a manager agent and three LLM sub-agents to fully automate AI model development and achieves a 63.1% medal rate on MLE-Bench, matching experienced human engineers.
