arXiv: 1906.09155 · v1 · pith:EEC5QJXOnew · submitted 2019-06-21 · 💻 cs.SD · cs.LG · eess.AS · stat.ML

Query-based Deep Improvisation

Pith reviewed 2026-05-25 18:17 UTC · model grok-4.3

classification 💻 cs.SD · cs.LG · eess.AS · stat.ML
keywords music generation · variational autoencoder · query-based generation · rate-distortion theory · latent space · style blending · musical improvisation · noisy channel

The pith

Querying a VAE trained on one musical style with input from another produces blended output that carries longer-term structure from the query.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

A variational autoencoder is trained on a corpus of music in one style. Rather than sampling its latent states at random, the network receives a query consisting of music written in a different style. The decoder then generates new output that incorporates longer-term structure from the query while adopting timbral and local stylistic traits from the training corpus. A noisy channel whose noise level is set by a bit-allocation rule taken from rate-distortion theory is placed between encoder and decoder; changing the allocated bits varies how much of the query structure survives in the final music. Experiments with this setup are used to examine how structural information is represented in the latent states.

Core claim

Instead of free improvisation obtained by random sampling of latent states, new music is generated by feeding the encoder a query signal whose style differs from the training corpus; a controllable noisy channel based on rate-distortion bit allocation then determines how much of the query's longer-term structure is preserved in the output while the decoder supplies the learned style.

What carries the argument

Noisy channel placed between the VAE encoder and decoder whose noise variance is set by a bit-allocation algorithm drawn from rate-distortion theory, thereby regulating how much structural information from the query reaches the decoder.
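
The allocation step can be sketched with the textbook reverse water-filling rule for independent Gaussian components. This is a minimal sketch, assuming the latent dimensions are independent Gaussians with known variances; the function name `reverse_water_filling`, the bisection search, and the example variances are illustrative and are not taken from the paper, whose exact bit-allocation procedure is not stated on this page.

```python
import math


def reverse_water_filling(variances, total_bits, iters=100):
    """Split `total_bits` across independent Gaussian components.

    Returns (rates, distortions) with D_i = min(theta, var_i) and
    R_i = 0.5 * log2(var_i / D_i), where theta is the "water level".
    Assumes every variance is non-negative and at least one is positive.
    """
    if total_bits <= 0:
        # No bits available: each component is distorted up to its variance.
        return [0.0] * len(variances), list(variances)

    def rates_for(theta):
        return [0.5 * math.log2(v / theta) if v > theta else 0.0
                for v in variances]

    # The total rate falls monotonically as theta rises, so bisect on theta.
    lo, hi = 0.0, max(variances)
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        if sum(rates_for(theta)) > total_bits:
            lo = theta  # spending too many bits: raise the water level
        else:
            hi = theta
    theta = 0.5 * (lo + hi)
    return rates_for(theta), [min(theta, v) for v in variances]


# Hypothetical per-dimension latent variances and a 2-bit budget.
rates, distortions = reverse_water_filling([4.0, 1.0, 0.25], total_bits=2.0)
print(rates, distortions)
```

With these illustrative variances the water level settles at 0.5, so the lowest-variance coordinate receives no bits and is left at its full variance as distortion, while the two higher-variance coordinates share the budget.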

If this is right

  • The generated pieces exhibit longer-term coherence traceable to the query input rather than arising only from the training corpus.
  • The amount of query influence can be varied continuously by changing the noise level set by the rate-distortion allocator.
  • Latent states are shown to carry both style-specific representational information and structural information supplied by the query.
  • The same mechanism supplies a practical handle for using the network in deliberate composition rather than pure improvisation.
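
The continuous control described in the second point can be illustrated with the standard Gaussian test channel from rate-distortion theory. The sketch below assumes a single latent coordinate with known mean and standard deviation; `noisy_channel`, `correlation`, the sample size, and the bit values are hypothetical choices for illustration, and the paper's own channel may differ in detail.

```python
import random


def noisy_channel(z_e, mu_e, sigma_e, bits, rng=random):
    """Pass one latent coordinate through a Gaussian test channel.

    z_e is the encoder sample; mu_e and sigma_e are the assumed mean and
    standard deviation of that coordinate; bits is the allocated rate.
    """
    shrink = 2.0 ** (-2.0 * bits)  # D / sigma_e^2 for a Gaussian source
    mean = mu_e + (1.0 - shrink) * (z_e - mu_e)
    std = sigma_e * (shrink * (1.0 - shrink)) ** 0.5
    return rng.gauss(mean, std)


def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


# Synthetic stand-in for query latents: unit-variance Gaussian samples.
rng = random.Random(0)
query_latents = [rng.gauss(0.0, 1.0) for _ in range(5000)]
for bits in (0.25, 1.0, 4.0):
    decoded_input = [noisy_channel(z, 0.0, 1.0, bits, rng) for z in query_latents]
    print(bits, round(correlation(query_latents, decoded_input), 3))
```

At zero bits the channel returns the mean and discards the query entirely; as the rate grows the output approaches the encoder sample. For a unit-variance Gaussian coordinate the expected input-output correlation under this channel is sqrt(1 - 2^(-2R)), which gives a rough sense of how much query information survives at rate R.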

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rate-distortion control could be replaced by other information-bottleneck techniques to achieve similar blending in non-VAE generators.
  • Live performance input could be used as the query to create real-time systems that let a human performer steer the AI output at the structural level.
  • The same encoder-decoder-plus-channel architecture might be applied to other sequence domains where one wishes to import external structure into a learned style.

Load-bearing premise

That a noisy channel whose noise level is chosen by rate-distortion bit allocation will produce a controllable blend of query structure into the trained style without destroying the longer-term coherence that the query is supposed to supply.

What would settle it

If human listeners cannot reliably detect longer-term structural differences between pieces generated with the query-plus-channel method and pieces generated by ordinary random latent sampling, or if varying the bit allocation produces no audible change in the degree of blending, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 1906.09155 by Shlomo Dubnov.

Figure 1: Noisy channel between encoder and decoder
Figure 2: Output of VAE that was trained on Pop Mu…
Figure 4: Generation by VAE with bit-rate controlled query: …
Figure 5: Information rate as function of similarity thresh…
Figure 6: Information rate of the latent states at full rate (top)…
Original abstract

In this paper we explore techniques for generating new music using a Variational Autoencoder (VAE) neural network that was trained on a corpus of specific style. Instead of randomly sampling the latent states of the network to produce free improvisation, we generate new music by querying the network with musical input in a style different from the training corpus. This allows us to produce new musical output with longer-term structure that blends aspects of the query to the style of the network. In order to control the level of this blending we add a noisy channel between the VAE encoder and decoder using bit-allocation algorithm from communication rate-distortion theory. Our experiments provide new insight into relations between the representational and structural information of latent states and the query signal, suggesting their possible use for composition purposes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a query-based music generation technique using a VAE trained on one musical style corpus. Rather than random latent sampling, the method feeds the encoder with query input from a different style and inserts a noisy channel (noise level set via rate-distortion bit allocation) between encoder and decoder; the resulting output is claimed to exhibit longer-term structure that blends query aspects into the trained style, with the noise level controlling the blend. Experiments are asserted to yield insight into relations between latent representational/structural information and the query signal.

Significance. If the central mechanism is shown to work, the approach would supply a concrete, information-theoretically grounded way to inject controllable long-range structure into style-specific generative models without retraining, which could be useful for interactive composition systems.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): the central claim that the method produces controllable longer-term structure is unsupported; the text supplies no quantitative metrics, error bars, listening-test protocol, ablation studies, or statistical analysis of generated outputs.
  2. [§3.2] §3.2 (Noisy channel construction): the bit-allocation rule is introduced to control blending, yet no derivation or empirical check demonstrates that the resulting noisy latent trajectory still carries the query's temporal dependencies when the encoder receives out-of-distribution input; without this, the blending claim cannot be evaluated.
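
The kind of empirical check the second comment asks for could look like the sketch below, which compares the autocorrelation of a latent trajectory before and after channel noise. Everything here is an assumption for illustration: the trajectory is a synthetic unit-variance sinusoid rather than real encoder output, and `autocorrelation`, `add_channel_noise`, the period, and the bit values are hypothetical. It produces no evidence about the paper's model.

```python
import math
import random


def autocorrelation(xs, lag):
    """Normalised autocorrelation of a sequence at a given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[t] - mean) * (xs[t + lag] - mean) for t in range(n - lag))
    return cov / var


def add_channel_noise(xs, bits, rng):
    """Gaussian test channel, assuming zero-mean, unit-variance latents."""
    s = 2.0 ** (-2.0 * bits)
    noise_std = math.sqrt(s * (1.0 - s))
    return [(1.0 - s) * x + rng.gauss(0.0, noise_std) for x in xs]


# Synthetic "query" trajectory with a 16-step repeating structure.
rng = random.Random(0)
period = 16
clean = [math.sqrt(2.0) * math.sin(2 * math.pi * t / period) for t in range(2048)]
for bits in (0.1, 0.5, 2.0):
    noisy = add_channel_noise(clean, bits, rng)
    print(bits,
          round(autocorrelation(clean, period), 3),
          round(autocorrelation(noisy, period), 3))
```

On real data the same comparison would be run per latent dimension on encoder outputs for in-style and out-of-style queries; a lag-wise autocorrelation that collapses at modest rates would suggest the query's temporal dependencies are not surviving the channel.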

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the referee's constructive comments. We address each major comment below, indicating planned revisions where appropriate. The feedback helps clarify the presentation of our experimental results and the justification for the noisy channel mechanism.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the central claim that the method produces controllable longer-term structure is unsupported; the text supplies no quantitative metrics, error bars, listening-test protocol, ablation studies, or statistical analysis of generated outputs.

    Authors: We agree that the experimental section relies primarily on qualitative demonstrations of generated outputs and latent analysis rather than formal quantitative evaluation. The manuscript's focus is on exploratory insight into query-based blending via examples. In revision we will expand §4 to include a description of the listening evaluation protocol, basic quantitative comparisons (e.g., pitch-class histogram divergence and note-density statistics across noise levels), and ablation results for different bit-allocation settings. Where repeated generations are feasible, error bars will be reported. These additions will be incorporated in the revised manuscript.

    Revision: yes

  2. Referee: [§3.2] §3.2 (Noisy channel construction): the bit-allocation rule is introduced to control blending, yet no derivation or empirical check demonstrates that the resulting noisy latent trajectory still carries the query's temporal dependencies when the encoder receives out-of-distribution input; without this, the blending claim cannot be evaluated.

    Authors: The bit-allocation procedure is taken directly from rate-distortion theory to modulate information passed from encoder to decoder. We acknowledge that the manuscript does not supply an explicit derivation or empirical verification that temporal structure from out-of-distribution queries is retained after noise injection. In the revision we will add a short theoretical paragraph in §3.2 explaining preservation of temporal dependencies under the VAE's Gaussian latent assumption, together with an empirical check that compares autocorrelation of latent trajectories before and after noise for query inputs from different styles. If the check proves inconclusive we will qualify the blending claim accordingly.

    Revision: partial
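
The quantitative comparisons named in the first response (pitch-class histogram divergence and note density) could be computed along the following lines. This is a rough sketch, assuming notes are available as MIDI pitch numbers and onset times; the function names, the choice of Jensen-Shannon divergence, and the toy note lists are illustrative and are not specified in the rebuttal.

```python
import math


def pitch_class_histogram(pitches):
    """Normalised 12-bin histogram of MIDI pitches (non-empty input assumed)."""
    counts = [0] * 12
    for p in pitches:
        counts[p % 12] += 1
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 if identical, 1 if disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def note_density(onsets, duration_seconds):
    """Notes per second over an excerpt of the given duration."""
    return len(onsets) / duration_seconds


# Toy pitch lists standing in for a query excerpt and a generated excerpt.
query_pitches = [60, 62, 64, 65, 67, 69, 71, 72]
generated_pitches = [60, 63, 65, 67, 70, 72, 60, 67]
divergence = js_divergence(pitch_class_histogram(query_pitches),
                           pitch_class_histogram(generated_pitches))
print(round(divergence, 3), note_density([0.0, 0.5, 1.0, 1.5], 2.0))
```

Tracking these values across bit-allocation settings would show whether pitch content drifts from the training style toward the query as the rate increases, although neither statistic captures longer-term structure on its own.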

Circularity Check

0 steps flagged

No circularity: method composes standard VAE + external rate-distortion channel without self-referential reduction

Full rationale

The paper's core procedure—training a VAE on one corpus, encoding a stylistically foreign query, injecting noise whose level is set by a bit-allocation rule taken from rate-distortion theory, and decoding—relies on externally established components (VAE training, rate-distortion bit allocation) rather than any quantity fitted inside the paper and then re-labeled as a prediction. No equations are presented that define the output structure in terms of the query input or that reduce the blending claim to a self-citation chain. The abstract and described approach therefore remain self-contained against external benchmarks; the reader's supplied circularity score of 2.0 is consistent with this assessment.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method relies on standard VAE training assumptions and off-the-shelf rate-distortion bit allocation whose details are not stated.


Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages · 1 internal anchor

  2. [2]

    Abdallah, S., and Plumbley, M. 2009. Information dynamics: Patterns of expectation and surprise in the perception of music. Connect. Sci 21(2-3):89--117

  3. [3]

    Fixing a Broken ELBO

    Alemi, A. A.; Poole, B.; Fischer, I.; Dillon, J. V.; Saurous, R. A.; and Murphy, K. 2017. An information-theoretic analysis of deep latent-variable models. CoRR abs/1711.00464

  4. [4]

    Berger, T. 1971. Rate distortion theory; a mathematical basis for data compression . Prentice-Hall Englewood Cliffs, N.J

  5. [5]

    Harte, C.; Sandler, M.; and Gasser, M. 2006. Detecting harmonic change in musical audio. Proceedings of Audio and Music Computing for Multimedia Workshop

  6. [6]

    Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; and Lerchner, A. 2017. β-VAE: Learning basic visual concepts with a constrained variational framework. ICLR

  7. [7]

    Wang, C., and Dubnov, S. 2015a. Pattern discovery from audio recordings by variable markov oracle: A music information dynamics approach. Proceedings of 40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  8. [8]

    Wang, C., and Dubnov, S. 2015b. The variable markov oracle: Algorithms for human gesture applications. IEEE MultiMedia 22(04):52--67

  9. [9]

    Wang, C.; Hsu, J.; and Dubnov, S. 2016. Machine improvisation with variable markov oracle: Toward guided and structured improvisation. Computers in Entertainment (CIE) 14(03)