Adversarial Audio Synthesis

Chris Donahue; Julian McAuley; Miller Puckette

arxiv: 1802.04208 · v3 · pith:3QS6H2FUnew · submitted 2018-02-12 · 💻 cs.SD · cs.LG

keywords: audio, wavegan, gans, generation, adversarial, seen, synthesis, synthesize

abstract

Audio signals are sampled at high temporal resolutions, and learning to synthesize audio requires capturing structure across a range of timescales. Generative adversarial networks (GANs) have seen wide success at generating images that are both locally and globally coherent, but they have seen little application to audio generation. In this paper we introduce WaveGAN, a first attempt at applying GANs to unsupervised synthesis of raw-waveform audio. WaveGAN is capable of synthesizing one-second slices of audio waveforms with global coherence, suitable for sound effect generation. Our experiments demonstrate that, without labels, WaveGAN learns to produce intelligible words when trained on a small-vocabulary speech dataset, and can also synthesize audio from other domains such as drums, bird vocalizations, and piano. We compare WaveGAN to a method which applies GANs designed for image generation on image-like audio feature representations, finding both approaches to be promising.
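
The two input representations contrasted in the abstract, one-second raw-waveform slices and image-like audio features, can be illustrated with a minimal NumPy sketch. This is not the authors' code: the 16 kHz sample rate, the frame and hop sizes, the sine-tone stand-in audio, and the helper names `one_second_slices` and `magnitude_spectrogram` are all assumptions made for illustration.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed; the abstract does not state a sample rate


def one_second_slices(waveform: np.ndarray, sample_rate: int = SAMPLE_RATE) -> np.ndarray:
    """Split a mono waveform into non-overlapping one-second slices.

    Trailing samples that do not fill a whole second are dropped.
    Returns an array of shape (num_slices, sample_rate).
    """
    num_slices = len(waveform) // sample_rate
    return waveform[: num_slices * sample_rate].reshape(num_slices, sample_rate)


def magnitude_spectrogram(audio_slice: np.ndarray, frame_len: int = 256, hop: int = 128) -> np.ndarray:
    """Image-like time-frequency representation via a short-time Fourier transform.

    frame_len and hop are illustrative values, not taken from the paper.
    Returns an array of shape (num_frames, frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)
    num_frames = 1 + (len(audio_slice) - frame_len) // hop
    frames = np.stack(
        [audio_slice[i * hop : i * hop + frame_len] * window for i in range(num_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


# Stand-in audio: 2.5 seconds of a 440 Hz tone (hypothetical data).
t = np.arange(int(2.5 * SAMPLE_RATE)) / SAMPLE_RATE
audio = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)

slices = one_second_slices(audio)        # raw-waveform view: one row per second
spec = magnitude_spectrogram(slices[0])  # image-like view of the first slice
print(slices.shape, spec.shape)
```

A waveform-domain model would consume rows of `slices` directly, while an image-style model would consume 2-D arrays such as `spec`. The sketch only shows the shape of each input, not how either model is trained.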

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CIS-BWE: Chaos-Informed Speech Bandwidth Extension
cs.SD · 2025-07 · unverdicted · novelty 6.0

NDSI-BWE deploys seven nonlinear-dynamics discriminators and a dual-stream ConformerNeXt generator to claim new state-of-the-art results in speech bandwidth extension.
Preserving Temporal Dynamics in Time Series Generation
cs.LG · 2026-04 · unverdicted · novelty 5.0

An MCMC framework enforces empirical transition laws on GAN outputs to reduce temporal drift in synthetic multivariate time series.
Autoencoding sensory substitution
q-bio.NC · 2019-07 · unverdicted · novelty 4.0

Deep recurrent autoencoders convert images to shortened audio signals that incorporate hearing models, enabling above-chance hand posture discrimination and object reaching after a few hours of training instead of months.
A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models
eess.AS · 2026-05 · unverdicted · novelty 2.0

A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.