Repurposing Image Diffusion Models for Training-Free Music Style Transfer on Mel-spectrograms
Music style transfer blends source structure with reference style to enable personalized music creation. However, existing zero-shot methods often struggle to capture fine-grained audio nuances, relying on coarse text descriptions or requiring expensive task-specific training. We propose Stylus, a training-free framework that repurposes pretrained image diffusion models for music style transfer in the Mel-spectrogram domain. By treating audio as structured time-frequency images, Stylus manipulates self-attention by injecting style keys and values while preserving source structural queries. To ensure high fidelity, we introduce a phase-preserving reconstruction strategy to mitigate spectrogram inversion artifacts, alongside a classifier-free-guidance-inspired control for adjustable stylization. Extensive evaluations including 2,925 human ratings demonstrate that Stylus outperforms state-of-the-art baselines, achieving 34.1% higher content preservation and 25.7% better perceptual quality. Our work validates that generic image priors can be effectively leveraged for the training-free transformation of structured Mel-spectrograms. Code and materials are available at https://github.com/Sooyyoungg/Stylus.git.
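The core mechanism the abstract describes, keeping the source's self-attention queries while injecting the reference's keys and values, with a classifier-free-guidance-style scale for stylization strength, can be sketched as follows. This is a minimal illustration, not the authors' implementation; all function names and the blending scheme are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def stylized_attention(q_src, k_src, v_src, k_sty, v_sty, gamma=1.0):
    """Self-attention with style key/value injection (hypothetical sketch).

    q_src, k_src, v_src: queries/keys/values from the source spectrogram.
    k_sty, v_sty: keys/values from the style reference.
    gamma: CFG-inspired scale; 0 keeps the source path, larger values
    push the output toward (or beyond) the style-injected path.
    """
    src_out = attention(q_src, k_src, v_src)   # structure-preserving path
    sty_out = attention(q_src, k_sty, v_sty)   # style-injected path
    return src_out + gamma * (sty_out - src_out)
```

At `gamma = 0` this reduces to ordinary source self-attention; at `gamma = 1` the source queries attend purely to the style reference, which is how injecting keys and values while keeping queries preserves source structure.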
Forward citations
Cited by 1 Pith paper
Latent Fourier Transform
LatentFT uses latent-space Fourier transforms and frequency masking in diffusion autoencoders to enable timescale-specific manipulation of musical structure in generative models.
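The frequency-masking idea in LatentFT, taking a Fourier transform of a latent sequence along the time axis and zeroing bands to isolate slow or fast musical structure, could be sketched like this. This is an assumed illustration of the general technique, not LatentFT's actual code; the function name and `cutoff` parameter are hypothetical.

```python
import numpy as np

def frequency_mask(latents, keep_low=True, cutoff=4):
    """Mask temporal frequencies of a latent sequence (hypothetical sketch).

    latents: (T, D) array, one latent vector per timestep.
    cutoff: bin index separating slow (structural) from fast (local)
    temporal frequencies; keep_low selects which band survives.
    """
    spec = np.fft.rfft(latents, axis=0)        # frequency over the time axis
    mask = np.zeros(spec.shape[0], dtype=bool)
    mask[:cutoff] = True                       # low-frequency (slow) band
    if not keep_low:
        mask = ~mask
    spec[~mask] = 0
    return np.fft.irfft(spec, n=latents.shape[0], axis=0)
```

Keeping only low bins preserves coarse, long-timescale structure; keeping only high bins isolates fast local variation, which is the sense in which the masking is "timescale-specific."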