arxiv: 2606.13626 · v1 · pith:XDBICEG5new · submitted 2026-06-11 · 💻 cs.SD · cs.LG

Generative Modeling of Bach-Style Symbolic Music: A Comparative Study of Autoregressive, Latent-Variable, and Adversarial Approaches

Pith reviewed 2026-06-27 05:27 UTC · model grok-4.3

classification 💻 cs.SD cs.LG
keywords symbolic music generation · Bach style · autoregressive LSTM · vector quantized VAE · music GAN · polyphonic MIDI · comparative evaluation · posterior collapse

The pith

Autoregressive LSTMs with attention generate the most musically coherent Bach-style piano music among the three model families tested on a shared MIDI corpus.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares autoregressive LSTMs with attention, recurrent and vector-quantized VAEs, and GANs for generating polyphonic symbolic music in Bach's style from the same MIDI data. It evaluates how well each family models note sequences, learns latent structure, and produces stylistically fitting outputs. The results position the attention-equipped autoregressive model as strongest on coherence, show vector quantization reducing collapse and improving structure in VAEs, and note that GANs capture local pitch patterns yet train unstably and generalize less well. A sympathetic reader would care because the direct head-to-head design isolates architectural trade-offs that matter for building practical music generators.

Core claim

Experiments on the shared MIDI corpus demonstrate that the autoregressive LSTM with attention produces the most musically coherent samples. Vector quantization mitigates posterior collapse and yields more structured outputs than conventional recurrent VAEs. The adversarial approach captures local pitch patterns but remains difficult to train and generalizes less reliably to Bach's style.

What carries the argument

Three model families (autoregressive LSTMs with attention, latent-variable models comprising recurrent VAEs and VQ-VAEs, and GANs) are applied to polyphonic note sequences from one MIDI corpus, which allows a direct comparison of coherence, structure, and stylistic fit.

If this is right

  • Attention-based autoregressive models are preferable when coherence is the primary goal for symbolic music generation.
  • Vector quantization offers a concrete way to improve structure and reduce collapse in recurrent latent-variable models for music.
  • Adversarial methods need additional stabilization techniques to achieve reliable stylistic generalization on polyphonic sequences.
  • Comparative evaluation on identical data reveals distinct failure modes: coherence gaps in latent models, training instability in GANs.
  • The relative performance ordering can guide selection of base architectures before adding domain-specific refinements.
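
The vector quantization step mentioned above can be illustrated with a small sketch. This page does not give the paper's codebook details, so the code below only shows the standard nearest-neighbour codebook lookup used in VQ-VAE-style models. The function names, the two-dimensional vectors, and the three-entry codebook are all hypothetical values chosen for illustration.

```python
# Illustrative sketch only: nearest-neighbour codebook lookup, the core
# quantization step in VQ-VAE-style models. Names and sizes are hypothetical
# and are not taken from the paper.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(vector, codebook):
    """Return (index, code) for the codebook entry closest to `vector`."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(vector, codebook[i]))
    return index, codebook[index]

# Hypothetical codebook and encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
encoder_outputs = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

# Each continuous encoder output is replaced by a discrete code index.
indices = [quantize(v, codebook)[0] for v in encoder_outputs]
print(indices)
```

Because the decoder only ever sees entries from a finite codebook, the latent representation is discrete, which is the property the summary credits with yielding more structured outputs.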

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could test whether combining attention mechanisms directly with vector quantization produces hybrids that exceed the best single-family results.
  • The corpus-specific findings may not transfer to other composers or genres without repeating the comparison.
  • Objective sequence-level metrics that align with human coherence judgments would strengthen future model rankings.
  • The training difficulties observed with GANs suggest exploring conditional or progressive variants for sequential music tasks.

Load-bearing premise

That judgments of musical coherence and stylistic generalization on the shared MIDI corpus provide a sufficient and unbiased basis for ranking the three model families.

What would settle it

A blinded listening study with multiple expert musicians rating large sets of generated samples for coherence and Bach-style fidelity, or an automatic metric that correlates with such ratings and produces a different model ranking.
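
One way to check whether an automatic metric tracks listener ratings is a rank correlation. The sketch below computes Spearman's coefficient from scratch, assuming no tied values. The function names and the score and rating values are hypothetical placeholders, not data from the paper.

```python
# Illustrative sketch only: Spearman rank correlation between an automatic
# metric and listener ratings. All data values are hypothetical. This simple
# formula assumes there are no tied values in either list.

def ranks(values):
    """Return the 1-based rank of each value (smallest value gets rank 1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman's rho via the sum of squared rank differences (no ties)."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical per-sample metric scores and mean listener ratings.
metric_scores = [0.2, 0.5, 0.9, 0.4]
listener_ratings = [1.5, 3.0, 4.5, 3.5]

rho = spearman(metric_scores, listener_ratings)
print(rho)
```

A coefficient near 1 would suggest the metric orders samples much as listeners do. A real study would also need tie handling and a significance test, which this sketch omits.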

Figures

Figures reproduced from arXiv: 2606.13626 by Dezhi YU, Kyuil Lee, Yongkang Huang.

Figure 1: LSTM architecture. We have a 2-layer LSTM with a hidden state size of 512 and a dropout rate of 0.5. The LSTM generates a sequence of length 32, which is then passed through a single FC layer to generate the sequence of logits. The logits are passed through a softmax function to calculate the probability distribution of each note in the sequence. Using the sequence of categorical distributions, we calculate…

Figure 2: Baseline VAE Architecture. The encoder is made recurrent with a 1-layer bidirectional LSTM with a hidden state size of 512. Making the LSTM bidirectional allows it to learn patterns in both directions of the sequence, leading to greater flexibility. The final hidden state of the LSTM is sent through two separate neural networks, which calculate µ and σ. µ and σ are then used to sample the latent variable z,…

Figure 3: Hierarchical Decoder in VAE.

Figure 4: VQVAE Architecture.

Figure 6: LSTM with Attention Architecture. The model’s architecture can be summarized as follows:
  • Input Layer: Represents multi-dimensional musical data.
  • LSTM Layers: A sequence of 2 LSTM layers.
  • Attention Layer: Processes LSTM outputs to create a context vector, focusing on crucial sequence parts.
  • Multiple parallel output heads, each predicting different voices of the musical piece. Each head comprises line…

Figure 7: LSTM with Attention Architecture. The input is the vectorized MIDI file that has the shape of [batch, sequence_length, 88], similar to before. We treat the batch size and sequence length as hyperparameters, as both can significantly impact GAN training. A larger batch size is preferable to reduce noise during training updates, enhancing the effectiveness of the GAN training process. Our generator is based o…

Figure 8: Reconstruction Errors.

Figure 10: MIDI generated by hierarchical VAE. For the VQVAE, as explained in an earlier section, the input was organized into the shape [batch, 64, 4, 88]. We experimented by varying the size of the embedding space (32, 64, 128), and measured the reconstruction error

Figure 12: MIDI generated by VQVAE. As you can see (if you look closely), the reconstruction loss drops further when you increase the size of the embedding space, which makes sense since this allows the model to learn more diverse 4-note patterns, allowing it to output a more flexible note sequence. The note patterns exhibit clearer patterns, such as going up and down, and having scale-like sequences. The output audi…

Figure 13: LSTM with Attention, Results. The baseline model, utilizing two LSTM layers with minimal hyperparameters (window size, hidden size, batch size, L2 = 16, 16, 16, 0.001), starts with a training loss of approximately 12.4. Over 100 epochs, it converges to a training loss of around 6.3, indicating a decent learning curve but suggesting there is room for improvement. Drawing on insights from the article lon, it…

Figure 14: LSTM with Attention, Results. The above image represents a score of the generated Bach-styled music. As you can see, the chords and melodic progressions are quite sensical and reminiscent of a Baroque style. From listening to the music, we found that the generated music has a strong semblance to Bach’s music, and this matched our intuition that solving the problem autoregressively was a much easier problem…

Figure 15: GAN loss during training. We can observe that, with the intervention of parameter clipping, the loss of both discriminator and generator drops nearly linearly. The GAN captures 2 soundtracks analogous to the left-hand side of the piano (bass) and the right-hand side (treble). However, the music style is more like modern jazz piano with more complex harmonies and idiosyncrasies of improvisation

Figure 16: Music generated by GAN. The outcomes of the Generative Adversarial Network (GAN) analysis indicate that the model effectively captures the pitch sequence information. However, there is significant potential for enhancement in the area of style generalization. This improvement could be achieved by incorporating additional information into the model. During training, it has been observed that the GAN does no…
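
The captions for Figures 7 and 10 describe vectorized MIDI input whose last dimension is 88. A minimal sketch of one way such a piano-roll array could be built for a single sequence follows. The function name `to_piano_roll`, the `(midi_pitch, start_step, end_step)` note format, the binary on/off encoding, and the mapping of index 0 to MIDI pitch 21 (A0, the lowest key of an 88-key piano) are illustrative assumptions, not details confirmed by the paper.

```python
# Illustrative sketch only: build a [sequence_length, 88] binary piano roll
# from a list of notes. The note format and pitch offset are assumptions.

LOWEST_MIDI_PITCH = 21  # assumed: index 0 corresponds to A0
NUM_KEYS = 88

def to_piano_roll(notes, sequence_length):
    """notes: iterable of (midi_pitch, start_step, end_step), end exclusive."""
    roll = [[0] * NUM_KEYS for _ in range(sequence_length)]
    for pitch, start, end in notes:
        key = pitch - LOWEST_MIDI_PITCH
        if not 0 <= key < NUM_KEYS:
            continue  # skip pitches outside the assumed 88-key range
        # Clamp the note to the window and mark every step it sounds in.
        for step in range(max(start, 0), min(end, sequence_length)):
            roll[step][key] = 1
    return roll

# Hypothetical fragment: C4 and E4 sounding together, followed by G4.
notes = [(60, 0, 2), (64, 0, 2), (67, 2, 4)]
roll = to_piano_roll(notes, sequence_length=4)
print(len(roll), len(roll[0]))
```

Stacking several such rolls along a leading axis would give the [batch, sequence_length, 88] shape quoted in the Figure 7 caption.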
Original abstract

We study generative modeling of Bach-style symbolic piano music using a shared MIDI corpus and three model families: autoregressive LSTMs with attention, latent-variable models including recurrent VAEs and vector-quantized VAEs, and generative adversarial networks. We compare their ability to model polyphonic note sequences, learn useful latent representations, and generate stylistically coherent compositions. Our experiments show that the autoregressive LSTM with attention produces the most musically coherent samples, while vector quantization helps mitigate posterior collapse and yields more structured outputs than conventional recurrent VAEs. The adversarial approach captures local pitch patterns but remains difficult to train and generalizes less reliably to Bach's style. These results highlight the relative strengths and failure modes of autoregressive, latent-variable, and adversarial approaches for symbolic music generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript compares three families of generative models—autoregressive LSTMs with attention, latent-variable models (recurrent VAEs and vector-quantized VAEs), and GANs—for producing Bach-style symbolic piano music on a shared MIDI corpus. It reports that the autoregressive LSTM with attention yields the most musically coherent samples, that vector quantization mitigates posterior collapse and produces more structured outputs than standard recurrent VAEs, and that the adversarial approach captures local pitch patterns but is difficult to train and generalizes less reliably.

Significance. A rigorously quantified comparison of these modeling paradigms on polyphonic symbolic music could clarify their relative strengths and typical failure modes, providing guidance for future work in music generation. The explicit discussion of posterior collapse and training stability issues is a positive feature if backed by reproducible evidence.

major comments (1)
  1. [Abstract] The central comparative claims (that the autoregressive LSTM with attention 'produces the most musically coherent samples,' that VQ-VAEs 'yield more structured outputs,' and that GANs 'generalize less reliably') are asserted without reference to any quantitative metrics (e.g., sequence perplexity, pitch-class histogram distances, or note-onset statistics), controlled listening-test protocols, inter-rater reliability statistics, or dataset statistics, rendering the model rankings unverifiable.
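
As an illustration of one metric the referee names, the sketch below computes a pitch-class histogram distance between two note sequences. The choice of L1 distance, the function names, and both example sequences are assumptions made for illustration. The page does not say which distance, if any, the paper uses.

```python
# Illustrative sketch only: L1 distance between normalized pitch-class
# histograms. The distance choice and example data are hypothetical.

def pitch_class_histogram(midi_pitches):
    """Normalized 12-bin histogram of pitch classes (C=0, C#=1, ..., B=11)."""
    counts = [0] * 12
    for pitch in midi_pitches:
        counts[pitch % 12] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

def l1_distance(h1, h2):
    """Sum of absolute bin differences; 0 means identical histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

# Hypothetical sequences: a C major scale and a C natural minor scale.
reference = [60, 62, 64, 65, 67, 69, 71, 72]
generated = [60, 62, 63, 65, 67, 68, 70, 72]

distance = l1_distance(pitch_class_histogram(reference),
                       pitch_class_histogram(generated))
print(distance)
```

A smaller distance to a held-out Bach reference set would indicate a closer match in pitch usage, although it says nothing about ordering or rhythm.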

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below.

  1. Referee: [Abstract] The central comparative claims (that the autoregressive LSTM with attention 'produces the most musically coherent samples,' that VQ-VAEs 'yield more structured outputs,' and that GANs 'generalize less reliably') are asserted without reference to any quantitative metrics (e.g., sequence perplexity, pitch-class histogram distances, or note-onset statistics), controlled listening-test protocols, inter-rater reliability statistics, or dataset statistics, rendering the model rankings unverifiable.

    Authors: We agree that the abstract would be strengthened by explicit references to the quantitative metrics and evaluation details that support the claims. The body of the manuscript reports sequence perplexity, pitch-class histogram distances, note-onset statistics, and dataset statistics, along with the evaluation protocol. We will revise the abstract to include brief citations to these supporting elements so that the model rankings are more directly verifiable from the abstract.

    Revision: yes
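
Sequence perplexity, one of the metrics named in the exchange above, is the exponential of the mean negative log-likelihood a model assigns to the observed tokens. A minimal sketch follows. The function name and the probability values are hypothetical, and no figures from the paper are reproduced here.

```python
# Illustrative sketch only: perplexity from the probabilities a model
# assigned to each observed token. Input values are hypothetical.
import math

def perplexity(token_probabilities):
    """exp(mean negative log-likelihood); lower means a better fit."""
    nll = -sum(math.log(p) for p in token_probabilities) / len(token_probabilities)
    return math.exp(nll)

# A model that spreads probability uniformly over 4 options at every step
# behaves as if it were choosing among 4 equally likely notes.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
print(uniform)
```

Perplexity is only defined for models that assign explicit token probabilities, so it applies directly to the autoregressive LSTM but not to a GAN without an additional likelihood estimate.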

Circularity Check

0 steps flagged

No circularity: empirical comparisons rest on direct experimental outputs


The paper is an empirical comparative study of three model families on a shared MIDI corpus. Its central claims concern relative performance in musical coherence, structure, and generalization, derived from training and sampling experiments rather than any mathematical derivation chain. No equations, uniqueness theorems, ansatzes, or self-citations are invoked to force conclusions; the reported rankings follow from the experimental protocol itself. This is the standard case of a self-contained empirical paper whose results can be externally reproduced or falsified on the same corpus, yielding no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the work relies on standard neural network training practices whose details are not stated.

pith-pipeline@v0.9.1-grok · 5671 in / 1004 out tokens · 31854 ms · 2026-06-27T05:27:00.149207+00:00 · methodology

