Jukebox: A Generative Model for Music
We introduce Jukebox, a model that generates music with singing in the raw audio domain. We tackle the long context of raw audio using a multi-scale VQ-VAE to compress it to discrete codes, and model those using autoregressive Transformers. We show that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes. We can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable. We are releasing thousands of non-cherry-picked samples at https://jukebox.openai.com, along with model weights and code at https://github.com/openai/jukebox.
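The abstract's pipeline — compress raw audio into discrete codes with a multi-scale VQ-VAE, then model those codes with autoregressive Transformers — rests on the vector-quantization bottleneck. A minimal NumPy sketch of that nearest-neighbour codebook lookup (toy sizes, not the paper's actual implementation; in training, gradients would also flow through a straight-through estimator, which is omitted here):

```python
import numpy as np

def quantize(latents, codebook):
    """Map continuous encoder latents to their nearest codebook entries.

    latents:  (T, D) array of continuous encoder outputs, one per timestep
    codebook: (K, D) array of learned embedding vectors
    Returns the discrete codes (T,) and the quantized embeddings (T, D).
    """
    # Squared Euclidean distance from every latent to every codebook vector
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = d.argmin(axis=1)        # one discrete token per timestep
    return codes, codebook[codes]   # tokens + their embeddings

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # K=8 codes, D=4 dims (toy sizes)
latents = rng.normal(size=(16, 4))   # T=16 timesteps of encoder output
codes, quantized = quantize(latents, codebook)
```

The resulting integer sequence `codes` is what the autoregressive Transformer prior models; at Jukebox's scale this is done at several temporal resolutions, with coarser levels conditioning finer ones.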
This paper has not been read by Pith yet.
Forward citations
Cited by 25 Pith papers
-
ENSEMBITS: an alphabet of protein conformational ensembles
Ensembits is the first tokenizer of protein conformational ensembles that outperforms static tokenizers on RMSF prediction and matches them on function and mutation tasks while using less pretraining data.
-
ENSEMBITS: an alphabet of protein conformational ensembles
Ensembits creates a discrete vocabulary for protein conformational ensembles that outperforms static tokenizers on dynamics prediction tasks and enables ensemble token prediction from single structures via distillation.
-
MusicLM: Generating Music From Text
MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
-
HapticLDM: A Diffusion Model for Text-to-Vibrotactile Generation
HapticLDM is the first latent diffusion model that generates vibrotactile signals directly from text, using dynamic text curation and global denoising to improve realism and semantic alignment over autoregressive baselines.
-
PHALAR: Phasors for Learned Musical Audio Representations
PHALAR achieves up to 70% relative accuracy gain in stem retrieval with under half the parameters and 7x faster training by using phasor-based equivariant representations, setting new SOTA on multiple datasets.
-
PHALAR: Phasors for Learned Musical Audio Representations
PHALAR introduces a phasor-based contrastive framework with learned spectral pooling and complex heads that enforces pitch-equivariant and phase-equivariant biases, delivering up to 70% relative accuracy gains in stem retrieval.
-
PHALAR: Phasors for Learned Musical Audio Representations
PHALAR introduces a contrastive audio representation framework with spectral pooling and complex-valued processing that sets new state-of-the-art results in stem retrieval on MoisesDB, Slakh, and ChocoChorales while a...
-
ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics
ArtifactNet extracts codec residuals from spectrograms with a 4M-parameter network to detect AI music at F1=0.9829 and 1.49% FPR on unseen tracks from 22 generators, outperforming larger baselines.
-
Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization
A hierarchical spatiotemporal vector quantization framework segments skeleton-based actions without supervision, achieving new state-of-the-art results on HuGaDB, LARa, and BABEL while reducing segment length bias.
-
Diffusion Path Alignment for Long-Range Motion Generation and Domain Transitions
An inference-time optimization using a control-energy objective on pretrained diffusion models enables coherent long-range human motion generation with explicit domain transitions.
-
High Fidelity Neural Audio Compression
EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same bitrate.
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
Diffusion Models Beat GANs on Image Synthesis
Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
-
Scaling Laws for Autoregressive Generative Modeling
Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.
-
Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse
A warm-up phase training VQ-VAEs as autoencoders first avoids dimensional collapse and yields better reconstruction and perceptual quality.
-
Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse
An initial continuous autoencoder training phase prevents dimensional collapse in VQ-VAEs and yields lower reconstruction and perceptual losses.
-
UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions
UniSonate unifies text-to-speech, text-to-music, and text-to-audio in a flow-matching framework with dynamic token injection and curriculum learning, reporting SOTA TTS and TTM results plus positive cross-task transfer.
-
Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints
Rule-generated preference data aligned via sequential DPO and KTO reduces musical constraint violations and improves coherence in lyric-to-melody generation over baselines.
-
Make it Simple, Make it Dance: Dance Motion Simplification to Support Novices' Dance Learning
Rule-based and learning-based algorithms simplify dance motions to help novices learn more effectively while maintaining naturalness and style.
-
Towards Real-Time Human-AI Musical Co-Performance: Accompaniment Generation with Latent Diffusion Models and MAX/MSP
A latent diffusion model with consistency distillation generates real-time instrumental accompaniment from live context audio, integrated with MAX/MSP for feasible human-AI co-performance.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
No Language Left Behind: Scaling Human-Centered Machine Translation
A sparsely gated mixture-of-experts model trained on newly mined low-resource data achieves 44% relative BLEU improvement across 200 languages while adding human safety evaluation.
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
VideoGPT: Video Generation using VQ-VAE and Transformers
VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.
-
Adopting State-of-the-Art Pretrained Audio Representations for Music Recommender Systems
Pretrained audio models show large performance gaps between standard MIR tasks and music recommendation in both hot and cold-start settings.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.