pith. machine review for the scientific record.

arxiv: 2604.25498 · v1 · submitted 2026-04-28 · 💻 cs.SD · cs.AI

Recognition: unknown

SymphonyGen: 3D Hierarchical Orchestral Generation with Controllable Harmony Skeleton

Ao Li, Feng Yu, Nan Nan, Xiaobing Li, Xiaohong Guan, Xuzheng He, Yu Pan, Zhilin Wang, Zhuoru Mo, Ziyue Kang

Pith reviewed 2026-05-07 14:14 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords symphonic music generation · 3D hierarchical model · harmony skeleton · orchestral generation · symbolic music · reinforcement learning · music generation · cascading decoder

The pith

SymphonyGen generates controllable symphonic music by decomposing generation into a 3D hierarchy of bars, tracks, and events conditioned on a harmony skeleton.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SymphonyGen as a framework that addresses the difficulty of simultaneously handling long-form structure and dense multi-track orchestration in symbolic music. It decomposes the generation process along three axes using a cascading decoder, which the authors claim improves efficiency over standard 1D or 2D autoregressive models. A beat-quantized multi-voice harmony skeleton provides high-level outline control while allowing textural variation within tracks. The model is then further aligned to acoustic preferences through reinforcement learning with an audio-perceptual reward and a specialized sampling step that avoids tonal clashes. Evaluations indicate the resulting outputs are rated higher in musicality and overall preference than prior orchestral generation systems.
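
To make the skeleton conditioning concrete, the sketch below shows one plausible data structure for a beat-quantized multi-voice harmony skeleton: each beat of each bar carries a small set of concurrent pitches, one per skeleton voice, while the generated tracks remain free to vary in texture around it. The field names, beat granularity, and MIDI encoding are illustrative assumptions; the paper's exact token format is not given in the material reviewed here.

```python
# Hypothetical sketch of a beat-quantized multi-voice harmony skeleton.
# Field names and granularity are assumptions, not the paper's actual format.
from dataclasses import dataclass
from typing import List

@dataclass
class SkeletonBeat:
    """One beat of the harmony outline: concurrent pitches, one per voice."""
    beat: int            # beat index within the bar
    pitches: List[int]   # MIDI note numbers for the skeleton voices

@dataclass
class SkeletonBar:
    bar: int
    beats: List[SkeletonBeat]

# A two-bar outline in 4/4: a C-major triad moving to G major.
skeleton = [
    SkeletonBar(0, [SkeletonBeat(b, [48, 60, 64, 67]) for b in range(4)]),
    SkeletonBar(1, [SkeletonBeat(b, [43, 59, 62, 67]) for b in range(4)]),
]
```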

Core claim

SymphonyGen is a 3D hierarchical framework for contemporary cinematic orchestration: a cascading decoder decomposes generation along the Bar, Track, and Event axes, and short-score conditioning on a beat-quantized multi-voice harmony skeleton provides outline control. The model is further refined with Group Relative Policy Optimization under a cross-modal audio-perceptual reward, combined with dissonance-averse sampling, producing outputs that improve harmonic cleanliness while preserving melodic expression.

What carries the argument

The cascading decoder that decomposes generation across Bar, Track, and Event axes together with short-score conditioning on a beat-quantized multi-voice harmony skeleton.
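
A minimal sketch of what such a cascading decomposition could look like, assuming each axis is handled by a small causal Transformer stage that cross-attends to the states of the coarser axis (and, at the bar level, to the embedded harmony skeleton). Module names, dimensions, and the exact conditioning pathway are assumptions for illustration, not the authors' architecture.

```python
# Sketch of a Bar -> Track -> Event cascading decoder. Illustrative only:
# stage sizes and conditioning paths are assumptions, not the paper's design.
import torch
import torch.nn as nn

class CascadeStage(nn.Module):
    """One axis of the hierarchy: a small causal Transformer decoder."""
    def __init__(self, d_model=128, nhead=4, nlayers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, nlayers)

    def forward(self, tokens, context):
        # tokens: (B, T, d) sequence for this axis; context: (B, S, d) states
        # of the coarser axis (or the harmony skeleton at the bar stage).
        T = tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        return self.decoder(tokens, context, tgt_mask=causal)

bar_stage, track_stage, event_stage = (CascadeStage() for _ in range(3))

B, d = 2, 128
skeleton_emb = torch.randn(B, 8, d)            # embedded harmony skeleton
bar_h = bar_stage(torch.randn(B, 4, d), skeleton_emb)    # bars see skeleton
track_h = track_stage(torch.randn(B, 6, d), bar_h)       # tracks see bars
event_h = event_stage(torch.randn(B, 32, d), track_h)    # events see tracks
print(event_h.shape)                           # torch.Size([2, 32, 128])
```

The efficiency argument rests on each stage attending over one short axis at a time instead of one flat sequence of bars × tracks × events.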

Load-bearing premise

The 3D cascading decoder and cross-modal reward together preserve full musical coherence and expression while delivering the claimed gains in efficiency, control, and harmonic cleanliness.

What would settle it

A head-to-head comparison in which SymphonyGen outputs receive lower scores than strong baselines on objective harmonic-cleanliness metrics or lose in blind human preference tests for orchestral musicality.

Original abstract

Generating symphonic music requires simultaneously managing high-level structural form and dense, multi-track orchestration. Existing symbolic models often struggle with a "complexity-control imbalance", in which scaling bottlenecks limit long-term granular steerability. We present SymphonyGen, a 3D hierarchical framework for contemporary cinematic orchestration. SymphonyGen employs a cascading decoder architecture that decomposes the Bar, Track, and Event axes, improving computational efficiency and scalability over conventional 1D or 2D models. We introduce "short-score" conditioning via a beat-quantized multi-voice harmony skeleton, enabling outline control while preserving textural diversity. The model is further refined using Group Relative Policy Optimization (GRPO) with a cross-modal audio-perceptual reward, aligning symbolic output with modern acoustic expectations. Additionally, we implement a dissonance-averse sampling algorithm to suppress unintended tonal clashes during inference. Objective evaluations show that both reinforcement learning and dissonance-averse sampling effectively enhance harmonic cleanliness while maintaining melodic expression. Subjective evaluations demonstrate that SymphonyGen outperforms baselines in musicality and preference for orchestral music generation. Demo page: https://symphonygen.github.io/
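
The abstract does not spell out the dissonance-averse sampling algorithm. One plausible reading, sketched below, down-weights next-pitch candidates that form harsh intervals (minor seconds, tritones, major sevenths) against the active harmony-skeleton pitches before sampling. The interval set and penalty strength are assumptions, loosely motivated by the Plomp-Levelt consonance work the paper cites [30], not a reproduction of the authors' procedure.

```python
# One plausible reading of dissonance-averse sampling: soften, but do not
# forbid, pitches that clash with the active skeleton chord. Illustrative.
import numpy as np

HARSH_INTERVALS = {1, 6, 11}  # pitch-class intervals treated as clashes

def dissonance_averse_sample(logits, chord_pitches, penalty=4.0, rng=None):
    """logits: (128,) scores over MIDI pitches; chord_pitches: skeleton notes."""
    rng = rng or np.random.default_rng()
    adjusted = logits.astype(np.float64)
    for p in range(128):
        intervals = {abs(p - c) % 12 for c in chord_pitches}
        if intervals & HARSH_INTERVALS:
            adjusted[p] -= penalty      # suppress unintended tonal clashes
    probs = np.exp(adjusted - adjusted.max())  # stable softmax
    probs /= probs.sum()
    return int(rng.choice(128, p=probs))

# Sample a pitch against a C-major skeleton chord.
pitch = dissonance_averse_sample(np.zeros(128), chord_pitches=[60, 64, 67])
```

Because the penalty is soft rather than a hard mask, intentional dissonances with strong model support can still survive, which is consistent with the claim that melodic expression is preserved.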

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SymphonyGen, a 3D hierarchical framework for contemporary cinematic orchestral music generation. It employs a cascading decoder architecture decomposing the Bar, Track, and Event axes for improved efficiency over 1D/2D models, short-score conditioning via a beat-quantized multi-voice harmony skeleton for outline control, Group Relative Policy Optimization (GRPO) with a cross-modal audio-perceptual reward to align outputs with acoustic expectations, and a dissonance-averse sampling algorithm at inference. The central claims are that RL and the sampling method enhance harmonic cleanliness while preserving melodic expression (objective) and that SymphonyGen outperforms baselines in musicality and preference (subjective).

Significance. If the performance claims are substantiated with detailed metrics and implementation specifics, the hierarchical cascading decoder and controllable harmony skeleton could provide a scalable path for steerable multi-track orchestral generation, addressing complexity-control issues in symbolic music models. The integration of cross-modal RL for perceptual alignment is a potentially valuable direction, though its impact cannot be assessed without evidence.

major comments (2)
  1. [Abstract] The claim that 'Objective evaluations show that both reinforcement learning and dissonance-averse sampling effectively enhance harmonic cleanliness while maintaining melodic expression' is unsupported by any quantitative metrics, baseline descriptions, dataset details, statistical tests, or result tables in the manuscript. This directly undermines the central claim regarding the effectiveness of GRPO and dissonance-averse sampling.
  2. [Abstract] GRPO description: The cross-modal audio-perceptual reward is described only at a high level as aligning 'symbolic output with modern acoustic expectations,' with no details on the perceptual model, symbolic-to-audio rendering pipeline, reward computation procedure, or correlation analysis with human judgments. This is load-bearing for the RL refinement stage, as unverified proxy rewards risk optimizing for artifacts rather than genuine harmonic quality.
minor comments (2)
  1. [Abstract] The phrase 'complexity-control imbalance' is used without definition or citation, reducing clarity for readers.
  2. [Abstract] The manuscript references a demo page but would benefit from embedding key quantitative results or example outputs directly in the text for self-containment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our work. We address each major comment below and will make the necessary revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'Objective evaluations show that both reinforcement learning and dissonance-averse sampling effectively enhance harmonic cleanliness while maintaining melodic expression' is unsupported by any quantitative metrics, baseline descriptions, dataset details, statistical tests, or result tables in the manuscript. This directly undermines the central claim regarding the effectiveness of GRPO and dissonance-averse sampling.

    Authors: We appreciate the referee highlighting this issue. The manuscript does include objective evaluations in the dedicated Experiments section, with metrics demonstrating improvements in harmonic cleanliness (such as reduced dissonance scores) and preservation of melodic expression under GRPO and the sampling method, along with comparisons to baselines. However, to ensure the abstract is self-contained and properly supported, we will revise it to briefly summarize these key quantitative results, including references to specific tables, dataset details, and any statistical tests performed. This will directly address the concern and substantiate the central claims. revision: yes

  2. Referee: [Abstract] GRPO description: The cross-modal audio-perceptual reward is described only at a high level as aligning 'symbolic output with modern acoustic expectations,' with no details on the perceptual model, symbolic-to-audio rendering pipeline, reward computation procedure, or correlation analysis with human judgments. This is load-bearing for the RL refinement stage, as unverified proxy rewards risk optimizing for artifacts rather than genuine harmonic quality.

    Authors: We agree that more details are needed for the cross-modal reward to allow proper assessment of its validity. In the revised manuscript, we will expand the description of the GRPO component, providing specifics on the perceptual model used for audio rendering, the symbolic-to-audio pipeline, the exact reward computation procedure, and any analysis correlating the reward with human judgments. This will be added to the methods section and referenced in the abstract where appropriate, ensuring transparency and mitigating concerns about proxy reward artifacts. revision: yes
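
For readers unfamiliar with GRPO, the group-relative advantage at issue in this exchange reduces to a small computation, sketched below following the DeepSeekMath formulation the paper cites [28]. The reward tensor stands in for the cross-modal audio-perceptual scorer whose details the referee requests; a full implementation would add the clipped importance ratio and a KL penalty against the reference model, so this is a sketch of the core idea only.

```python
# Sketch of GRPO's group-relative advantage and a simplified policy loss.
# The reward values are placeholders for the cross-modal perceptual scorer.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8):
    """rewards: (G,) scores for G pieces sampled from one skeleton prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logprobs: torch.Tensor, rewards: torch.Tensor):
    """logprobs: (G,) summed token log-probs of each sample under the policy.
    Omits the clipped ratio and KL term of full GRPO for brevity."""
    adv = group_relative_advantages(rewards).detach()
    return -(adv * logprobs).mean()

# Example: 4 pieces sampled from one harmony skeleton, scored in [0, 1].
logprobs = torch.randn(4, requires_grad=True)
rewards = torch.tensor([0.2, 0.7, 0.5, 0.9])
loss = grpo_loss(logprobs, rewards)
loss.backward()
```

The referee's worry maps directly onto the `rewards` vector: if the perceptual scorer rewards rendering artifacts, the advantage estimate faithfully amplifies them.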

Circularity Check

0 steps flagged

No circularity: architectural and training claims rest on independent design choices and external evaluations

Full rationale

The paper introduces a 3D hierarchical cascading decoder, short-score conditioning, GRPO with cross-modal reward, and dissonance-averse sampling. No equations, derivations, or predictions are presented that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Objective and subjective evaluations are described as external validation rather than tautological outputs. No load-bearing uniqueness theorems or ansatzes imported via self-citation appear in the provided text. This is the common case of a self-contained proposal whose central claims do not collapse into their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on domain assumptions about music decomposition and perceptual alignment rather than new free parameters or invented entities; no explicit fitting constants or novel physical entities are introduced in the abstract.

axioms (2)
  • domain assumption: Music generation can be decomposed along independent Bar, Track, and Event axes without loss of coherence.
    Invoked to justify the 3D cascading decoder architecture.
  • domain assumption: Cross-modal audio-perceptual rewards can guide symbolic music outputs toward acoustic expectations.
    Basis for the GRPO training step.

pith-pipeline@v0.9.0 · 5522 in / 1395 out tokens · 71456 ms · 2026-05-07T14:14:13.504948+00:00 · methodology

Reference graph

Works this paper leans on

31 extracted references · 1 canonical work page · 1 internal anchor

  1. [1]

    SymphonyGen: 3D Hierarchical Orchestral Generation with Controllable Harmony Skeleton

    INTRODUCTION For centuries, symphonic composition has served as the primary vehicle for expansive musical storytelling through the collective power of the orchestra. Replicating this complexity, however, remains a formidable challenge for symbolic AI, as it demands the simultaneous management of high-level form and multi-track textures. Specifically,...

  2. [2]

    Early efforts like MuseGAN [18] utilized GANs, but the field has largely shifted toward Transformers to capture long-range dependencies

    RELATED WORK 2.1 Symbolic Orchestral Generation Generative models specifically targeting orchestral music are less common than those for general multi-track tasks. Early efforts like MuseGAN [18] utilized GANs, but the field has largely shifted toward Transformers to capture long-range dependencies. Pop Music Transformer [3] and Compound Word Transformer ...

  3. [3]

    find itself

    METHODOLOGY 3.1 3D Hierarchical Architecture The backbone of our model is a series of cascading Transformer encoders and decoders to process one structural level at a time. This hierarchical approach of decomposing the sequence into Bar, Track, and Event levels is designed to manage the high density of symphonic scores. (We define events as the sequence...

  4. [4]

    The MIDI files are partitioned into a 90/10 train/val split

    EXPERIMENTAL SETUP 4.1 Dataset We utilize the SymphonyNet Dataset [5], comprising 728 classical and 45,632 contemporary MIDI files. The MIDI files are partitioned into a 90/10 train/val split. 4.2 Training and Inference Details SymphonyGen is implemented with 33 layers (512 hidden size, 124M parameters), consisting of 8 harmony event decoder layers, 9 music...

  5. [5]

    sweet spot

    RESULTS AND DISCUSSION 5.1 Objective Evaluation As shown in Table 2, the reinforcement learning stage significantly enhances the average CLaMP score and lowers dissonance scores, while preserving melodic movement and ornament. This suggests that the model is more aligned with modern acoustic expectations. Dissonance-averse sampling further suppresses th...

  6. [6]

    By decomposing the Bar, Track, and Event axes, we achieve improved computational efficiency and scalability while maintaining structural coherence

    CONCLUSION This paper introduces SymphonyGen, a 3D hierarchical framework designed to master the dual complexities of orchestral composition and arrangement. By decomposing the Bar, Track, and Event axes, we achieve improved computational efficiency and scalability while maintaining structural coherence. Our multi-voice harmony skeleton provides a mus...

  7. [7]

    dissonance-averse

    ETHICS STATEMENT Our training data consists of publicly available symbolic music datasets and audio references. While SymphonyGen automates aspects of orchestral composition, it is designed as a collaborative tool to assist composers rather than compete with them. We anticipate that human composers will benefit more from our controllable framework than ...

  8. [8]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017

  9. [9]

    Music Transformer: Generating music with long-term structure,

    C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. M. Shazeer, I. Simon, C. Hawthorne, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck, “Music Transformer: Generating music with long-term structure,” in International Conference on Learning Representations (ICLR), 2018

  10. [10]

    Pop Music Transformer: Beat-based modeling and generation of expressive pop piano compositions,

    Y.-S. Huang and Y.-H. Yang, “Pop Music Transformer: Beat-based modeling and generation of expressive pop piano compositions,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020

  11. [11]

    Compound Word Transformer: Learning to compose full-song music over dynamic directed hypergraphs,

    W.-Y. Hsiao, J.-Y. Liu, Y.-C. Yeh, and Y.-H. Yang, “Compound Word Transformer: Learning to compose full-song music over dynamic directed hypergraphs,” in Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI), 2021

  12. [12]

    Symphony generation with permutation invariant language model,

    J. Liu, Y. Dong, Z. Cheng, X. Zhang, X. Li, F. Yu, and M. Sun, “Symphony generation with permutation invariant language model,” in Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR), 2022

  13. [13]

    NotaGen: Advancing musicality in symbolic music generation with large language model training paradigms,

    Y. Wang, S. Wu, J. Hu, X. Du, Y. Peng, Y. Huang, S. Fan, X. Li, F. Yu, and M. Sun, “NotaGen: Advancing musicality in symbolic music generation with large language model training paradigms,” in Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI), 2025

  14. [14]

    AccoMontage: Accompaniment arrangement via phrase selection and style transfer,

    J. Zhao and G. Xia, “AccoMontage: Accompaniment arrangement via phrase selection and style transfer,” in Proceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR), 2021

  15. [15]

    MuseMorphose: Full-song and fine-grained piano music style transfer with one Transformer VAE,

    S.-L. Wu and Y.-H. Yang, “MuseMorphose: Full-song and fine-grained piano music style transfer with one Transformer VAE,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1953–1967, 2023

  16. [16]

    METEOR: Melody-aware texture-controllable symbolic music re-orchestration via Transformer VAE,

    D.-V.-T. Le and Y.-H. Yang, “METEOR: Melody-aware texture-controllable symbolic music re-orchestration via Transformer VAE,” in Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI), 2025

  17. [17]

    Structured multi-track accompaniment arrangement via style prior modelling,

    J. Zhao, G. Xia, Z. Wang, and Y. Wang, “Structured multi-track accompaniment arrangement via style prior modelling,” in Advances in Neural Information Processing Systems (NeurIPS), 2024

  18. [18]

    Unifying symbolic music arrangement: Track-aware reconstruction and structured tokenization,

    L. Ou, J. Zhao, Z. Wang, G. Xia, Q. Liang, T. Hopkins, and Y. Wang, “Unifying symbolic music arrangement: Track-aware reconstruction and structured tokenization,” in Advances in Neural Information Processing Systems (NeurIPS), 2025

  19. [19]

    Q&A: Query-based representation learning for multi-track symbolic music re-arrangement,

    J. Zhao, G. Xia, and Y. Wang, “Q&A: Query-based representation learning for multi-track symbolic music re-arrangement,” in Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI), 2023

  20. [20]

    FIGARO: Controllable music generation using expert and learned features,

    D. von Rütte, L. Biggio, Y. Kilcher, and T. Hofmann, “FIGARO: Controllable music generation using expert and learned features,” in International Conference on Learning Representations (ICLR), 2023

  21. [21]

    Theme Transformer: Symbolic music generation with theme-conditioned Transformer,

    Y.-J. Shih, S.-L. Wu, F. Zalkow, M. Müller, and Y.-H. Yang, “Theme Transformer: Symbolic music generation with theme-conditioned Transformer,” IEEE Transactions on Multimedia, vol. 25, pp. 3495–3508, 2023

  22. [22]

    Museformer: Transformer with fine- and coarse-grained attention for music generation,

    B. Yu, P. Lu, R. Wang, W. Hu, X. Tan, W. Ye, S. Zhang, T. Qin, and T.-Y. Liu, “Museformer: Transformer with fine- and coarse-grained attention for music generation,” in Advances in Neural Information Processing Systems (NeurIPS), 2022

  23. [23]

    Whole-song hierarchical generation of symbolic music using cascaded diffusion models,

    Z. Wang, L. Min, and G. Xia, “Whole-song hierarchical generation of symbolic music using cascaded diffusion models,” in International Conference on Learning Representations (ICLR), 2024

  24. [24]

    Text2midi: Generating symbolic music from captions,

    K. Bhandari, A. Roy, K. Wang, G. Puri, S. Colton, and D. Herremans, “Text2midi: Generating symbolic music from captions,” in Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI), 2025

  25. [25]

    MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment,

    H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, “MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018

  26. [26]

    AccoMontage2: A complete harmonization and accompaniment arrangement system,

    L. Yi, H. Hu, J. Zhao, and G. Xia, “AccoMontage2: A complete harmonization and accompaniment arrangement system,” in Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR), 2022

  27. [27]

    A corpus and a modular infrastructure for the empirical study of (an)notated music,

    J. Hentschel, Y. Rammos, M. Neuwirth, and M. Rohrmeier, “A corpus and a modular infrastructure for the empirical study of (an)notated music,” Scientific Data, vol. 12, no. 1, p. 685, 2025

  28. [28]

    DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo, “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,” 2024

  29. [29]

    CLaMP 3: Universal music information retrieval across unaligned modalities and unseen languages,

    S. Wu, G. Zhancheng, R. Yuan, J. Jiang, S. Doh, G. Xia, J. Nam, X. Li, F. Yu, and M. Sun, “CLaMP 3: Universal music information retrieval across unaligned modalities and unseen languages,” in Findings of the Association for Computational Linguistics, 2025

  30. [30]

    Tonal consonance and critical bandwidth,

    R. Plomp and W. J. M. Levelt, “Tonal consonance and critical bandwidth,” The Journal of the Acoustical Society of America, vol. 38, no. 4, pp. 548–560, 1965

  31. [31]

    Decoupled weight decay regularization,

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations (ICLR), 2019