Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling
Pith reviewed 2026-05-13 04:02 UTC · model grok-4.3
The pith
Poly-SVC converts singing voices from accompanied recordings by preserving residual harmonies instead of requiring clean isolated vocals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Poly-SVC is a zero-shot, cross-lingual singing voice conversion system designed to handle residual harmonies: a CQT-based pitch extractor preserves both the lead melody and residual harmony information, a random sampler reduces interference, and a Conditional Flow Matching diffusion decoder fuses the resulting pitch, content, and timbre features into natural-sounding polyphonic outputs. The system is claimed to outperform baselines in naturalness, timbre similarity, and harmony reconstruction on both harmony-rich and single-melody recordings.
What carries the argument
The CQT-based pitch extractor paired with a random sampler and a Conditional Flow Matching diffusion decoder that together retain and reconstruct residual harmonies while converting timbre.
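To make the front-end concrete, here is a minimal sketch of extracting a polyphony-aware pitch representation with a Constant-Q Transform using librosa. The parameter choices (fmin, bins per octave, hop length) are illustrative assumptions, not the authors' configuration; the point is that a CQT frame keeps energy at every active pitch, so residual harmony lines survive alongside the lead melody, unlike a scalar F0 track.

```python
# Sketch: CQT-based pitch features that retain residual harmonies.
# Parameter choices here are assumptions for illustration, not Poly-SVC's.
import librosa
import numpy as np

def cqt_pitch_features(y: np.ndarray, sr: int = 22050,
                       hop_length: int = 512,
                       bins_per_octave: int = 12,
                       n_octaves: int = 7) -> np.ndarray:
    """Return a (n_bins, n_frames) log-magnitude CQT.

    Every frame keeps energy at all active pitches, so harmony lines
    survive alongside the lead melody, unlike a scalar F0 contour.
    """
    C = librosa.cqt(y, sr=sr, hop_length=hop_length,
                    fmin=librosa.note_to_hz("C2"),
                    n_bins=bins_per_octave * n_octaves,
                    bins_per_octave=bins_per_octave)
    return librosa.amplitude_to_db(np.abs(C), ref=np.max)
```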
If this is right
- Singing voice conversion becomes feasible directly on accompanied tracks without a separate vocal isolation stage.
- Harmony content in both polyphonic and monophonic inputs is reconstructed more faithfully than with conventional F0 extractors.
- Zero-shot cross-lingual conversion extends to inputs containing residual harmonies.
- Naturalness and timbre similarity improve when the decoder receives harmonic-rich pitch features.
Where Pith is reading between the lines
- The method could reduce dependence on upstream source-separation models in broader audio pipelines.
- Similar harmonic-retention strategies might transfer to other polyphonic audio tasks such as instrumental timbre transfer.
- If the decoder runs efficiently, real-time conversion of live accompanied singing becomes conceivable.
Load-bearing premise
The CQT-based pitch extractor combined with the random sampler can reliably isolate and preserve residual harmonies without introducing artifacts or losing melody information in real accompanied recordings.
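The premise leaves the sampler's mechanics unspecified. As a purely hypothetical illustration (not the paper's implementation), one plausible design randomly masks low-energy CQT bins so that diffuse interference is thinned out while prominent melody and harmony peaks survive; every name and threshold below is invented.

```python
# Hypothetical sketch of a "random sampler" over CQT features; the paper's
# actual design is not specified here. Names and thresholds are invented.
import numpy as np

def random_sample_cqt(cqt_db: np.ndarray, keep_prob: float = 0.5,
                      peak_margin_db: float = 20.0,
                      rng: np.random.Generator | None = None) -> np.ndarray:
    """Randomly drop bins far below each frame's peak energy.

    Prominent melody/harmony peaks are always kept; weak bins (candidate
    interference) are dropped with probability 1 - keep_prob.
    """
    rng = rng or np.random.default_rng()
    out = cqt_db.copy()
    frame_peaks = cqt_db.max(axis=0, keepdims=True)   # per-frame maxima (dB)
    weak = cqt_db < (frame_peaks - peak_margin_db)    # far-below-peak bins
    drop = weak & (rng.random(cqt_db.shape) < 1.0 - keep_prob)
    out[drop] = cqt_db.min()                          # floor dropped bins
    return out
```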
What would settle it
Objective or listening-test measurements on accompanied recordings that show no improvement in harmony reconstruction or the presence of new artifacts relative to standard F0-based baselines would falsify the central claim.
Original abstract
Singing Voice Conversion (SVC) aims to transform a source singing voice into a target singer while preserving lyrics and melody. Most existing SVC methods depend on F0 extractors to capture the lead melody from clean vocals. However, no existing method can reliably extract clean vocals from accompanied recordings without leaving residual harmonies behind. In this paper, we innovatively propose Poly-SVC, a zero-shot, cross-lingual singing voice conversion system designed to process residual harmonies. Poly-SVC is composed of three key components: a Constant-Q Transform (CQT)-based pitch extractor to preserve both the lead melody and residual harmony, a random sampler to reduce interference information from the CQT and a diffusion decoder based on Conditional Flow Matching (CFM) that fuses pitch, content, and timbre features into natural-sounding polyphonic outputs. Experiments demonstrate that Poly-SVC surpasses the baseline models in naturalness, timbre similarity and harmony reconstruction across both harmony-rich and single-melody recordings.
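For readers unfamiliar with the decoder's training objective, here is a minimal sketch of the standard conditional flow matching loss from Lipman et al. (reference [20]) in PyTorch. The decoder signature and the way the pitch, content, and timbre conditioning is passed are placeholder assumptions; the paper's exact network is not reproduced here.

```python
# Minimal sketch of the standard conditional flow matching objective
# (Lipman et al., ICLR 2023). `decoder` and `cond` are placeholders; the
# paper's exact architecture and conditioning are not reproduced here.
import torch

def cfm_loss(decoder, mel_target: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """One training step: regress the velocity along a straight path
    from Gaussian noise x0 to the target mel-spectrogram x1."""
    x1 = mel_target                           # (batch, n_mels, frames)
    x0 = torch.randn_like(x1)                 # noise endpoint of the flow
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)
    x_t = (1.0 - t) * x0 + t * x1             # point on the straight path
    v_target = x1 - x0                        # constant target velocity
    v_pred = decoder(x_t, t.flatten(), cond)  # cond fuses pitch/content/timbre
    return torch.mean((v_pred - v_target) ** 2)
```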
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Poly-SVC, a zero-shot cross-lingual singing voice conversion system designed to handle accompanied recordings containing residual harmonies. It consists of a CQT-based pitch extractor to capture both lead melody and residual harmonies, a random sampler to mitigate interference in the CQT representation, and a Conditional Flow Matching (CFM) diffusion decoder that integrates pitch, content, and timbre features. The central claim is that Poly-SVC outperforms baseline models in naturalness, timbre similarity, and harmony reconstruction on both harmony-rich and single-melody recordings.
Significance. If the experimental results hold, this would constitute a meaningful contribution to singing voice conversion by removing the reliance on clean vocal separation and directly modeling polyphonic inputs. The architectural focus on preserving residual harmonies addresses a practical gap in real-world SVC applications. The proposal of the CQT-plus-random-sampler pipeline combined with CFM decoding is a concrete technical direction worth exploring, though its robustness requires verification.
major comments (2)
- [Abstract] The statement that 'Experiments demonstrate that Poly-SVC surpasses the baseline models in naturalness, timbre similarity and harmony reconstruction across both harmony-rich and single-melody recordings' is presented without any quantitative metrics, baseline descriptions, dataset details, or statistical tests. This absence directly undermines evaluation of the central superiority claim.
- [Method] CQT-based pitch extractor and random sampler: The design assumes that the CQT pitch extractor combined with the random sampler can reliably isolate and preserve both lead melody and residual vocal harmonies from accompanied (non-clean) signals without introducing artifacts or temporal inconsistencies. No ablation studies, analysis on mixed audio, or validation against entanglement of vocal/instrumental harmonics are provided, yet this component is load-bearing for attributing any performance gains to the polyphony-aware design rather than the CFM decoder alone.
minor comments (1)
- [Method] The description of how the random sampler specifically reduces interference while retaining secondary pitch contours would benefit from additional implementation details or pseudocode for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and describe the changes we will make in the revised manuscript.
Point-by-point responses
- Referee: [Abstract] The statement that 'Experiments demonstrate that Poly-SVC surpasses the baseline models in naturalness, timbre similarity and harmony reconstruction across both harmony-rich and single-melody recordings' is presented without any quantitative metrics, baseline descriptions, dataset details, or statistical tests. This absence directly undermines evaluation of the central superiority claim.
Authors: We agree that the abstract would benefit from greater specificity. In the revision we will add concise quantitative results (e.g., MOS scores for naturalness and timbre similarity together with the main baseline names and dataset sizes) while remaining within the abstract length limit. This will allow readers to evaluate the reported improvements directly from the abstract. Revision: yes
- Referee: [Method] CQT-based pitch extractor and random sampler: The design assumes that the CQT pitch extractor combined with the random sampler can reliably isolate and preserve both lead melody and residual vocal harmonies from accompanied (non-clean) signals without introducing artifacts or temporal inconsistencies. No ablation studies, analysis on mixed audio, or validation against entanglement of vocal/instrumental harmonics are provided, yet this component is load-bearing for attributing any performance gains to the polyphony-aware design rather than the CFM decoder alone.
Authors: The referee is correct that the current manuscript does not contain dedicated ablations isolating the CQT extractor and random sampler. End-to-end results and qualitative examples are provided, but these do not fully separate the contribution of the polyphony-aware front-end from the CFM decoder. In the revision we will add ablation experiments that replace the CQT-plus-sampler pipeline with a conventional F0 extractor and that remove the random sampler, together with targeted analysis on mixed-audio examples to examine harmonic entanglement. These additions will strengthen attribution of the observed gains. Revision: yes
Circularity Check
No circularity: independent architectural proposal with experimental validation
Full rationale
The paper describes Poly-SVC as a new zero-shot SVC architecture combining a CQT-based pitch extractor, random sampler, and CFM diffusion decoder. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Performance claims rest on experimental comparisons to baselines rather than on any reduction to inputs by construction. The central assumption about the pitch pipeline is presented as a design choice, not a derived result, so the work's claims remain testable against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- neural network weights and training hyperparameters
axioms (2)
- Domain assumption: the CQT transform preserves both lead melody and residual harmonic information in accompanied singing audio.
- Domain assumption: conditional flow matching diffusion can fuse pitch, content, and timbre features into natural polyphonic singing.
invented entities (1)
- Poly-SVC architecture (no independent evidence)
Reference graph
Works this paper leans on
- [1] INTRODUCTION: Singing voice conversion (SVC) is an emerging research hotspot that converts one singer’s vocal identity and style to sound like another while keeping the original lyrics, melody, and various vocal techniques [1]–[4]. The task addressed in this work presents greater challenges than conventional SVC, as it deals with the mismatch between c...
- [2] METHODS: Fig. 1 illustrates an overview of our Poly-SVC framework. Following prior SVC methods [11] and [13], we first extract the mel-spectrogram as the acoustic representation and apply a Timbre Shifter based on OpenVoice [17] to align the distributions between training and inference, thereby reducing the timbre leak from the content representation. ...
- [3] EXPERIMENTS: 3.1. Dataset. We use a wide variety of datasets covering both speech and singing, encompassing multiple languages, audio durations, and speaker counts. For speech data, we adopt the Emilia dataset [19], a 101k-hour multilingual speech corpus rich in expressive speaking styles, which provides a robust foundation for modeling natural speech. A ...
- [4] CONCLUSION: This study highlights the significant challenges inherent in real-world singing voice conversion, particularly due to the challenge of obtaining clean singing vocals. To address these issues, we proposed Poly-SVC, a singing voice conversion framework designed for real-world scenarios where vocal-accompaniment separation often leaves residual ...
- [5] A. I. S. Ferreira, L. R. S. Gris, A. S. da Rosa, F. S. de Oliveira, E. Casanova, R. T. Sousa, A. C. Jr., A. da Silva Soares, and A. R. G. Filho, “FreeSVC: Towards zero-shot multilingual singing voice conversion,” in 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025, Hyderabad, India, April 6-11, 2025. IEEE, 2025, pp. 1–5.
- [6] B. Bai, F. Wang, Y. Gao, and Y. Li, “SPA-SVC: Self-supervised pitch augmentation for singing voice conversion,” in 25th Annual Conference of the International Speech Communication Association, Interspeech 2024, Kos, Greece, September 1-5, 2024.
- [7] S. Liu, “Zero-shot voice conversion with diffusion transformers,” arXiv preprint arXiv:2411.09943, 2024.
- [8] Y. Zhou, W. Wang, H. Ding, J. Xu, J. Zhu, X. Gao, and S. Li, “SYKI-SVC: Advancing singing voice conversion with post-processing innovations and an open-source professional testset,” in 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025, Hyderabad, India, April 6-11, 2025. IEEE, 2025, pp. 1–5.
- [9] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- [10] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
- [11] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.
- [12] H. Wei, X. Cao, T. Dan, and Y. Chen, “RMVPE: A robust model for vocal pitch estimation in polyphonic music,” in 24th Annual Conference of the International Speech Communication Association, Interspeech 2023, Dublin, Ireland, August 20-24, 2023, pp. 5421–5425.
- [13] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, “CREPE: A convolutional representation for pitch estimation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 161–165.
- [14] H. Wang, S. Zheng, Y. Chen, L. Cheng, and Q. Chen, “CAM++: A fast and efficient network for speaker verification using context-aware masking,” in 24th Annual Conference of the International Speech Communication Association, Interspeech 2023, Dublin, Ireland, August 20-24, 2023. ISCA, 2023, pp. 5301–5305.
- [15] Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. Zhao, K. Yu, and X. Chen, “F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,” arXiv preprint arXiv:2410.06885, 2024.
- [16] Y. Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, Q. Zhang, X. Zhang, S. Zhang, and Z. Wu, “MaskGCT: Zero-shot text-to-speech with masked generative codec transformer,” in The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025.
- [17] S. Liu, Y. Cao, N. Hu, D. Su, and H. Meng, “FastSVC: Fast cross-domain singing voice conversion with feature-wise linear modulation,” in 2021 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2021, pp. 1–6.
- [18] S. Rouard, F. Massa, and A. Défossez, “Hybrid transformers for music source separation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- [19] W. Chen, B. Sha, J. Yang, Z. Wang, F. Fan, and Z. Wu, “Singing voice conversion with accompaniment using self-supervised representation-based melody features,” in 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025, Hyderabad, India, April 6-11, 2025.
- [20] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” in The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
- [21] Z. Qin, W. Zhao, X. Yu, and X. Sun, “OpenVoice: Versatile instant voice cloning,” arXiv preprint arXiv:2312.01479, 2023.
- [22] S. Liao, Y. Wang, T. Li, Y. Cheng, R. Zhang, R. Zhou, and Y. Xing, “Fish-Speech: Leveraging large language models for advanced multilingual text-to-speech synthesis,” arXiv preprint arXiv:2411.01156, 2024.
- [23] H. He, Z. Shang, C. Wang, X. Li, Y. Gu, H. Hua, L. Liu, C. Yang, J. Li, P. Shi et al., “Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 885–890.
- [24] L. Zhang, R. Li, S. Wang, L. Deng, J. Liu, Y. Ren, J. He, R. Huang, J. Zhu, X. Chen, and Z. Zhao, “M4Singer: A multi-style, multi-singer and musical score provided mandarin singing corpus,” in Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, 2022.
- [25] R. Huang, F. Chen, Y. Ren, J. Liu, C. Cui, and Z. Zhao, “Multi-Singer: Fast multi-singer singing voice vocoder with a large-scale corpus,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 3945–3954.
- [26] Y. Wang, X. Wang, P. Zhu, J. Wu, H. Li, H. Xue, Y. Zhang, L. Xie, and M. Bi, “Opencpop: A high-quality open source Chinese popular song corpus for singing voice synthesis,” in 23rd Annual Conference of the International Speech Communication Association, Interspeech 2022, Incheon, Korea, September 18-22, 2022. ISCA, 2022, pp. 4242–4246.
- [27] J. Liu, C. Li, Y. Ren, Z. Zhu, and Z. Zhao, “Learning the beauty in songs: Neural singing voice beautifier,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022. Association for Computational Linguistics, 2022, pp. 7970–7983.
- [28] J. Wilkins, P. Seetharaman, A. Wahl, and B. Pardo, “VocalSet: A singing voice dataset,” in ISMIR, 2018, pp. 468–474.