Ensembits is the first tokenizer of protein conformational ensembles that outperforms static tokenizers on RMSF prediction and matches them on function and mutation tasks while using less pretraining data.
hub Mixed citations
Jukebox: A Generative Model for Music
Mixed citation behavior. Most common role is background (67%).
abstract
We introduce Jukebox, a model that generates music with singing in the raw audio domain. We tackle the long context of raw audio using a multi-scale VQ-VAE to compress it to discrete codes, and modeling those using autoregressive Transformers. We show that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes. We can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable. We are releasing thousands of non cherry-picked samples at https://jukebox.openai.com, along with model weights and code at https://github.com/openai/jukebox
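The abstract's step of compressing audio "to discrete codes" can be illustrated with a toy nearest-neighbour codebook lookup, the quantization idea at the core of a VQ-VAE. This is only a sketch under stated assumptions, not Jukebox's implementation: the function names `quantize` and `dequantize`, the two-dimensional vectors, and the hand-written codebook are all hypothetical. In a real VQ-VAE the codebook is learned and the vectors come from an encoder network, and Jukebox applies the idea at several temporal scales.

```python
from typing import List, Sequence


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame vector to the index of its nearest codebook vector."""
    codes = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every codebook entry.
        dists = [sum((f - c) ** 2 for f, c in zip(frame, entry))
                 for entry in codebook]
        codes.append(dists.index(min(dists)))
    return codes


def dequantize(codes: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Replace each discrete code with its codebook vector (lossy reconstruction)."""
    return [list(codebook[i]) for i in codes]


# Toy data (hypothetical): three codebook entries and three latent frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]

codes = quantize(frames, codebook)
reconstruction = dequantize(codes, codebook)
print(codes)
```

The resulting integer sequence is the kind of discrete representation that an autoregressive model can then be trained on in place of the raw signal.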
representative citing papers
MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
Proposes an attribution-aware compensation framework for generative music that derives closed-form payments from catalog-level attribution informativeness and quantifies welfare effects under competition.
STREAM decouples text and music conditioning in a diffusion transformer via AdaLN for structure and BEAM for beats, plus new Motorica++ dataset and editability metrics, claiming SOTA music alignment with preserved semantics.
Odoriko is the first single multimodal diffusion model for human motion that directly conditions generation on subject morphology across text, music and video inputs and can recover morphology when it is missing.
UniSinger unifies speaker-cloned song generation and accompaniment co-generation SVC in one multimodal diffusion transformer model trained with curriculum learning via task-specific modality masking.
SoulNote enables multi-session GenAI songwriting for DHH users, producing measurable gains in self-insight, emotion regulation, and self-care attitudes.
MIDI-SAG generates consistent long-form singing accompaniments by feeding symbolic MIDI timing, chords, and structure labels into a compositional pipeline built from pre-trained modules.
Finite scalar quantization simplifies VQ-VAE latents by independently rounding a few dimensions to fixed levels, producing an equivalent-sized implicit codebook with competitive performance and no collapse.
HapticLDM is the first latent diffusion model that generates vibrotactile signals directly from text, using dynamic text curation and global denoising to improve realism and semantic alignment over autoregressive baselines.
ArtifactNet extracts codec residuals from spectrograms with a 4M-parameter network to detect AI music at F1=0.9829 and 1.49% FPR on unseen tracks from 22 generators, outperforming larger baselines.
A hierarchical spatiotemporal vector quantization framework segments skeleton-based actions without supervision, achieving new state-of-the-art results on HuGaDB, LARa, and BABEL while reducing segment length bias.
An inference-time optimization using a control-energy objective on pretrained diffusion models enables coherent long-range human motion generation with explicit domain transitions.
EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same bitrates for 24 kHz mono and 48 kHz stereo audio.
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.
PupuJEPA applies a visual JEPA framework to 2D spectrograms with music-specific adaptations and outperforms 1D SSL models on the MARBLE benchmark for multiple MIR tasks.
Empirical study finds 93% of AI music on Spotify gets negligible plays, distributors have inconsistent unenforced AI policies, and detection methods are unreliable, suggesting slop may become self-sustaining.
sGPO uses an initial-policy success-rate profiling pass to adaptively set rollout group sizes, filter data, and build a curriculum, cutting total RLVR training compute by 3x while matching baseline performance.
Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.
SwitchCodec introduces Residual Experts Vector Quantization and a multi-tiered STFT discriminator to achieve PESQ 2.87 and ViSQOL 4.27 at 2.67 kbps while halving training time via post-training.
The work formalizes zero-shot symbolic drum editing as LLM reasoning over a drumroll grid notation, evaluates it on a new benchmark with automated symbolic unit tests, and reports up to 68% success across eight models.
GCDance is a text-and-music-conditioned diffusion framework that generates genre-consistent 3D dance sequences and reports better results than prior methods on FineDance and AIST++.
citing papers explorer
- Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation
UniSinger unifies speaker-cloned song generation and accompaniment co-generation SVC in one multimodal diffusion transformer model trained with curriculum learning via task-specific modality masking.
- MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline
MIDI-SAG generates consistent long-form singing accompaniments by feeding symbolic MIDI timing, chords, and structure labels into a compositional pipeline built from pre-trained modules.
- ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics
ArtifactNet extracts codec residuals from spectrograms with a 4M-parameter network to detect AI music at F1=0.9829 and 1.49% FPR on unseen tracks from 22 generators, outperforming larger baselines.
- Frequency-Aware Self-Supervised Music Representation Learning
PupuJEPA applies a visual JEPA framework to 2D spectrograms with music-specific adaptations and outperforms 1D SSL models on the MARBLE benchmark for multiple MIR tasks.
- Two-Dimensional Quantization for Geometry-Aware Audio Coding
Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.
- SwitchCodec: A High-Fidelity Neural Audio Codec With Sparse Quantization
SwitchCodec introduces Residual Experts Vector Quantization and a multi-tiered STFT discriminator to achieve PESQ 2.87 and ViSQOL 4.27 at 2.67 kbps while halving training time via post-training.
- Not that Groove: Zero-Shot Symbolic Music Editing
The work formalizes zero-shot symbolic drum editing as LLM reasoning over a drumroll grid notation, evaluates it on a new benchmark with automated symbolic unit tests, and reports up to 68% success across eight models.
- PHALAR: Phasors for Learned Musical Audio Representations
PHALAR achieves up to 70% relative accuracy gain in stem retrieval over prior art using under half the parameters and 7x faster training by enforcing musical equivariances via spectral pooling and complex heads.
- Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints
Rule-generated preference data aligned via sequential DPO and KTO reduces musical constraint violations and improves coherence in lyric-to-melody generation over baselines.
- Towards Real-Time Human-AI Musical Co-Performance: Accompaniment Generation with Latent Diffusion Models and MAX/MSP
A latent diffusion model with consistency distillation generates real-time instrumental accompaniment from live context audio, integrated with MAX/MSP for feasible human-AI co-performance.
- LeVo 2: Stable and Melodious Song Generation via Hierarchical Representation Modeling and Progressive Post-Training
LeVo 2 presents a hierarchical LLM-Diffusion model with progressive post-training stages to generate full-length songs that balance semantic planning, track-specific acoustics, and musicality.
- SketchSong: Hierarchical Song Generation with Sketch Planning and Fine-Grained Multi-Track Modeling
SketchSong uses temporal sketch planning with high-level tokens and explicit modeling of four tracks (vocals, bass, drums, other) to generate more coherent songs than baselines.
- SegTune: Structured and Fine-Grained Control for Song Generation
SegTune is a Diffusion Transformer framework for song generation using segment-aligned prompts and an LLM-based duration predictor to enable fine-grained control over musical structure and dynamics.
- Real-Time Language Model Jamming: A Case Study for Live Music Accompaniment Generation
StreamMUSE performs frame-synchronous streaming inference for language models by having a client send high-frequency requests and a server return outputs aligned to an external clock, shown on live music accompaniment with open-source code.