Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling
Pith reviewed 2026-05-13 04:02 UTC · model grok-4.3
The pith
Poly-SVC converts singing voices from accompanied recordings by preserving residual harmonies instead of requiring clean isolated vocals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Poly-SVC is a zero-shot, cross-lingual singing voice conversion system designed to handle residual harmonies: a CQT-based pitch extractor preserves both the lead melody and residual harmony information, a random sampler reduces interference, and a Conditional Flow Matching diffusion decoder fuses the resulting pitch, content, and timbre features into natural-sounding polyphonic outputs. The system is claimed to outperform baselines in naturalness, timbre similarity, and harmony reconstruction on both harmony-rich and single-melody recordings.
What carries the argument
The CQT-based pitch extractor paired with a random sampler and a Conditional Flow Matching diffusion decoder that together retain and reconstruct residual harmonies while converting timbre.
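To make the front-end concrete, here is a minimal sketch of extracting a polyphony-aware pitch representation with a Constant-Q Transform using librosa. The parameter choices (fmin, bins per octave, hop length) are illustrative assumptions, not the authors' configuration; the point is that a CQT frame keeps energy at every active pitch, so residual harmony lines survive alongside the lead melody, unlike a scalar F0 track.

```python
# Sketch: CQT-based pitch features that retain residual harmonies.
# Parameter choices here are assumptions for illustration, not Poly-SVC's.
import librosa
import numpy as np

def cqt_pitch_features(y: np.ndarray, sr: int = 22050,
                       hop_length: int = 512,
                       bins_per_octave: int = 12,
                       n_octaves: int = 7) -> np.ndarray:
    """Return a (n_bins, n_frames) log-magnitude CQT.

    Every frame keeps energy at all active pitches, so harmony lines
    survive alongside the lead melody, unlike a scalar F0 contour.
    """
    C = librosa.cqt(y, sr=sr, hop_length=hop_length,
                    fmin=librosa.note_to_hz("C2"),
                    n_bins=bins_per_octave * n_octaves,
                    bins_per_octave=bins_per_octave)
    return librosa.amplitude_to_db(np.abs(C), ref=np.max)
```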
If this is right
- Singing voice conversion becomes feasible directly on accompanied tracks without a separate vocal isolation stage.
- Harmony content in both polyphonic and monophonic inputs is reconstructed more faithfully than with conventional F0 extractors.
- Zero-shot cross-lingual conversion extends to inputs containing residual harmonies.
- Naturalness and timbre similarity improve when the decoder receives harmonic-rich pitch features.
Where Pith is reading between the lines
- The method could reduce dependence on upstream source-separation models in broader audio pipelines.
- Similar harmonic-retention strategies might transfer to other polyphonic audio tasks such as instrumental timbre transfer.
- If the decoder runs efficiently, real-time conversion of live accompanied singing becomes conceivable.
Load-bearing premise
The CQT-based pitch extractor combined with the random sampler can reliably isolate and preserve residual harmonies without introducing artifacts or losing melody information in real accompanied recordings.
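The premise leaves the sampler's mechanics unspecified. As a purely hypothetical illustration (not the paper's implementation), one plausible design randomly masks low-energy CQT bins so that diffuse interference is thinned out while prominent melody and harmony peaks survive; every name and threshold below is invented.

```python
# Hypothetical sketch of a "random sampler" over CQT features; the paper's
# actual design is not specified here. Names and thresholds are invented.
import numpy as np

def random_sample_cqt(cqt_db: np.ndarray, keep_prob: float = 0.5,
                      peak_margin_db: float = 20.0,
                      rng: np.random.Generator | None = None) -> np.ndarray:
    """Randomly drop bins far below each frame's peak energy.

    Prominent melody/harmony peaks are always kept; weak bins (candidate
    interference) are dropped with probability 1 - keep_prob.
    """
    rng = rng or np.random.default_rng()
    out = cqt_db.copy()
    frame_peaks = cqt_db.max(axis=0, keepdims=True)   # per-frame maxima (dB)
    weak = cqt_db < (frame_peaks - peak_margin_db)    # far-below-peak bins
    drop = weak & (rng.random(cqt_db.shape) < 1.0 - keep_prob)
    out[drop] = cqt_db.min()                          # floor dropped bins
    return out
```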
What would settle it
Objective or listening-test measurements on accompanied recordings that show no improvement in harmony reconstruction or the presence of new artifacts relative to standard F0-based baselines would falsify the central claim.
Original abstract
Singing Voice Conversion (SVC) aims to transform a source singing voice into a target singer while preserving lyrics and melody. Most existing SVC methods depend on F0 extractors to capture the lead melody from clean vocals. However, no existing method can reliably extract clean vocals from accompanied recordings without leaving residual harmonies behind. In this paper, we innovatively propose Poly-SVC, a zero-shot, cross-lingual singing voice conversion system designed to process residual harmonies. Poly-SVC is composed of three key components: a Constant-Q Transform (CQT)-based pitch extractor to preserve both the lead melody and residual harmony, a random sampler to reduce interference information from the CQT and a diffusion decoder based on Conditional Flow Matching (CFM) that fuses pitch, content, and timbre features into natural-sounding polyphonic outputs. Experiments demonstrate that Poly-SVC surpasses the baseline models in naturalness, timbre similarity and harmony reconstruction across both harmony-rich and single-melody recordings.
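For readers unfamiliar with the decoder's training objective, here is a minimal sketch of the standard conditional flow matching loss from Lipman et al. (reference [20]) in PyTorch. The decoder signature and the way the pitch, content, and timbre conditioning is passed are placeholder assumptions; the paper's exact network is not reproduced here.

```python
# Minimal sketch of the standard conditional flow matching objective
# (Lipman et al., ICLR 2023). `decoder` and `cond` are placeholders; the
# paper's exact architecture and conditioning are not reproduced here.
import torch

def cfm_loss(decoder, mel_target: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """One training step: regress the velocity along a straight path
    from Gaussian noise x0 to the target mel-spectrogram x1."""
    x1 = mel_target                           # (batch, n_mels, frames)
    x0 = torch.randn_like(x1)                 # noise endpoint of the flow
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)
    x_t = (1.0 - t) * x0 + t * x1             # point on the straight path
    v_target = x1 - x0                        # constant target velocity
    v_pred = decoder(x_t, t.flatten(), cond)  # cond fuses pitch/content/timbre
    return torch.mean((v_pred - v_target) ** 2)
```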
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Poly-SVC, a zero-shot cross-lingual singing voice conversion system designed to handle accompanied recordings containing residual harmonies. It consists of a CQT-based pitch extractor to capture both lead melody and residual harmonies, a random sampler to mitigate interference in the CQT representation, and a Conditional Flow Matching (CFM) diffusion decoder that integrates pitch, content, and timbre features. The central claim is that Poly-SVC outperforms baseline models in naturalness, timbre similarity, and harmony reconstruction on both harmony-rich and single-melody recordings.
Significance. If the experimental results hold, this would constitute a meaningful contribution to singing voice conversion by removing the reliance on clean vocal separation and directly modeling polyphonic inputs. The architectural focus on preserving residual harmonies addresses a practical gap in real-world SVC applications. The proposal of the CQT-plus-random-sampler pipeline combined with CFM decoding is a concrete technical direction worth exploring, though its robustness requires verification.
major comments (2)
- [Abstract] The statement that 'Experiments demonstrate that Poly-SVC surpasses the baseline models in naturalness, timbre similarity and harmony reconstruction across both harmony-rich and single-melody recordings' is presented without any quantitative metrics, baseline descriptions, dataset details, or statistical tests. This absence directly undermines evaluation of the central superiority claim.
- [Method] CQT-based pitch extractor and random sampler: The design assumes that the CQT pitch extractor combined with the random sampler can reliably isolate and preserve both lead melody and residual vocal harmonies from accompanied (non-clean) signals without introducing artifacts or temporal inconsistencies. No ablation studies, analysis on mixed audio, or validation against entanglement of vocal/instrumental harmonics are provided, yet this component is load-bearing for attributing any performance gains to the polyphony-aware design rather than the CFM decoder alone.
minor comments (1)
- [Method] The description of how the random sampler specifically reduces interference while retaining secondary pitch contours would benefit from additional implementation details or pseudocode for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and describe the changes we will make in the revised manuscript.
Point-by-point responses
- Referee: [Abstract] The statement that 'Experiments demonstrate that Poly-SVC surpasses the baseline models in naturalness, timbre similarity and harmony reconstruction across both harmony-rich and single-melody recordings' is presented without any quantitative metrics, baseline descriptions, dataset details, or statistical tests. This absence directly undermines evaluation of the central superiority claim.
Authors: We agree that the abstract would benefit from greater specificity. In the revision we will add concise quantitative results (e.g., MOS scores for naturalness and timbre similarity together with the main baseline names and dataset sizes) while remaining within the abstract length limit. This will allow readers to evaluate the reported improvements directly from the abstract. Revision: yes
- Referee: [Method] CQT-based pitch extractor and random sampler: The design assumes that the CQT pitch extractor combined with the random sampler can reliably isolate and preserve both lead melody and residual vocal harmonies from accompanied (non-clean) signals without introducing artifacts or temporal inconsistencies. No ablation studies, analysis on mixed audio, or validation against entanglement of vocal/instrumental harmonics are provided, yet this component is load-bearing for attributing any performance gains to the polyphony-aware design rather than the CFM decoder alone.
Authors: The referee is correct that the current manuscript does not contain dedicated ablations isolating the CQT extractor and random sampler. End-to-end results and qualitative examples are provided, but these do not fully separate the contribution of the polyphony-aware front-end from the CFM decoder. In the revision we will add ablation experiments that replace the CQT-plus-sampler pipeline with a conventional F0 extractor and that remove the random sampler, together with targeted analysis on mixed-audio examples to examine harmonic entanglement. These additions will strengthen attribution of the observed gains. Revision: yes
Circularity Check
No circularity: independent architectural proposal with experimental validation
Full rationale
The paper describes Poly-SVC as a new zero-shot SVC architecture combining a CQT-based pitch extractor, random sampler, and CFM diffusion decoder. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Performance claims rest on experimental comparisons to baselines rather than on any reduction to inputs by construction. The central assumption about the pitch pipeline is presented as a design choice, not a derived result, so the work's claims remain testable against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- neural network weights and training hyperparameters
axioms (2)
- Domain assumption: the CQT transform preserves both lead melody and residual harmonic information in accompanied singing audio.
- Domain assumption: conditional flow matching diffusion can fuse pitch, content, and timbre features into natural polyphonic singing.
invented entities (1)
- Poly-SVC architecture (no independent evidence)
Reference graph
Works this paper leans on
- [1] INTRODUCTION: Singing voice conversion (SVC) is an emerging research hotspot that converts one singer’s vocal identity and style to sound like another while keeping the original lyrics, melody, and various vocal techniques [1]–[4]. The task addressed in this work presents greater challenges than conventional SVC, as it deals with the mismatch between c...
- [2] METHODS: Fig. 1 illustrates an overview of our Poly-SVC framework. Following prior SVC methods [11] and [13], we first extract the mel-spectrogram as the acoustic representation and apply a Timbre Shifter based on OpenVoice [17] to align the distributions between training and inference, thereby reducing the timbre leak from the content representation. ...
- [3] EXPERIMENTS: 3.1. Dataset. We use a wide variety of datasets covering both speech and singing, encompassing multiple languages, audio durations, and speaker counts. For speech data, we adopt the Emilia dataset [19], a 101k-hour multilingual speech corpus rich in expressive speaking styles, which provides a robust foundation for modeling natural speech. A ...
- [4] CONCLUSION: This study highlights the significant challenges inherent in real-world singing voice conversion, particularly due to the challenge of obtaining clean singing vocals. To address these issues, we proposed Poly-SVC, a singing voice conversion framework designed for real-world scenarios where vocal-accompaniment separation often leaves residual ...
- [5] A. I. S. Ferreira, L. R. S. Gris, A. S. da Rosa, F. S. de Oliveira, E. Casanova, R. T. Sousa, A. C. Jr., A. da Silva Soares, and A. R. G. Filho, “FreeSVC: Towards zero-shot multilingual singing voice conversion,” in 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025, Hyderabad, India, April 6-11, 2025. IEEE, 2025, pp. 1–5.
- [6] B. Bai, F. Wang, Y. Gao, and Y. Li, “SPA-SVC: Self-supervised pitch augmentation for singing voice conversion,” in 25th Annual Conference of the International Speech Communication Association, Interspeech 2024, Kos, Greece, September 1-5, 2024.
- [7] S. Liu, “Zero-shot voice conversion with diffusion transformers,” arXiv preprint arXiv:2411.09943, 2024.
- [8] Y. Zhou, W. Wang, H. Ding, J. Xu, J. Zhu, X. Gao, and S. Li, “SYKI-SVC: Advancing singing voice conversion with post-processing innovations and an open-source professional testset,” in 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025, Hyderabad, India, April 6-11, 2025. IEEE, 2025, pp. 1–5.
- [9] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- [10] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
- [11] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.
- [12] H. Wei, X. Cao, T. Dan, and Y. Chen, “RMVPE: A robust model for vocal pitch estimation in polyphonic music,” in 24th Annual Conference of the International Speech Communication Association, Interspeech 2023, Dublin, Ireland, August 20-24, 2023, pp. 5421–5425.
- [13] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, “CREPE: A convolutional representation for pitch estimation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 161–165.
- [14] H. Wang, S. Zheng, Y. Chen, L. Cheng, and Q. Chen, “CAM++: A fast and efficient network for speaker verification using context-aware masking,” in 24th Annual Conference of the International Speech Communication Association, Interspeech 2023, Dublin, Ireland, August 20-24, 2023. ISCA, 2023, pp. 5301–5305.
- [15] Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. Zhao, K. Yu, and X. Chen, “F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,” arXiv preprint arXiv:2410.06885, 2024.
- [16] Y. Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, Q. Zhang, X. Zhang, S. Zhang, and Z. Wu, “MaskGCT: Zero-shot text-to-speech with masked generative codec transformer,” in The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025.
- [17] S. Liu, Y. Cao, N. Hu, D. Su, and H. Meng, “FastSVC: Fast cross-domain singing voice conversion with feature-wise linear modulation,” in 2021 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2021, pp. 1–6.
- [18] S. Rouard, F. Massa, and A. Défossez, “Hybrid transformers for music source separation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- [19] W. Chen, B. Sha, J. Yang, Z. Wang, F. Fan, and Z. Wu, “Singing voice conversion with accompaniment using self-supervised representation-based melody features,” in 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025, Hyderabad, India, April 6-11, 2025.
- [20] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” in The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
- [21] Z. Qin, W. Zhao, X. Yu, and X. Sun, “OpenVoice: Versatile instant voice cloning,” arXiv preprint arXiv:2312.01479, 2023.
- [22] S. Liao, Y. Wang, T. Li, Y. Cheng, R. Zhang, R. Zhou, and Y. Xing, “Fish-Speech: Leveraging large language models for advanced multilingual text-to-speech synthesis,” arXiv preprint arXiv:2411.01156, 2024.
- [23] H. He, Z. Shang, C. Wang, X. Li, Y. Gu, H. Hua, L. Liu, C. Yang, J. Li, P. Shi et al., “Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 885–890.
- [24] L. Zhang, R. Li, S. Wang, L. Deng, J. Liu, Y. Ren, J. He, R. Huang, J. Zhu, X. Chen, and Z. Zhao, “M4Singer: A multi-style, multi-singer and musical score provided mandarin singing corpus,” in Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, 2022.
- [25] R. Huang, F. Chen, Y. Ren, J. Liu, C. Cui, and Z. Zhao, “Multi-Singer: Fast multi-singer singing voice vocoder with a large-scale corpus,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 3945–3954.
- [26] Y. Wang, X. Wang, P. Zhu, J. Wu, H. Li, H. Xue, Y. Zhang, L. Xie, and M. Bi, “Opencpop: A high-quality open source Chinese popular song corpus for singing voice synthesis,” in 23rd Annual Conference of the International Speech Communication Association, Interspeech 2022, Incheon, Korea, September 18-22, 2022. ISCA, 2022, pp. 4242–4246.
- [27] J. Liu, C. Li, Y. Ren, Z. Zhu, and Z. Zhao, “Learning the beauty in songs: Neural singing voice beautifier,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022. Association for Computational Linguistics, 2022, pp. 7970–7983.
- [28] J. Wilkins, P. Seetharaman, A. Wahl, and B. Pardo, “VocalSet: A singing voice dataset,” in ISMIR, 2018, pp. 468–474.