Query-based Deep Improvisation
Pith reviewed 2026-05-25 18:17 UTC · model grok-4.3
The pith
Querying a VAE trained on one musical style with input from another produces blended output that carries longer-term structure from the query.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Instead of free improvisation obtained by random sampling of latent states, new music is generated by feeding the encoder a query signal whose style differs from the training corpus; a controllable noisy channel based on rate-distortion bit allocation then determines how much of the query's longer-term structure is preserved in the output while the decoder supplies the learned style.
What carries the argument
A noisy channel placed between the VAE encoder and decoder. Its noise variance is set by a bit-allocation algorithm drawn from rate-distortion theory, which regulates how much structural information from the query reaches the decoder.
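One way to picture such a channel is the textbook reverse water-filling allocation for independent Gaussian components, sketched below in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not say how latent variances are estimated or how the noise is injected, so the per-dimension `variances`, the bit budget `total_bits`, and the function names `reverse_water_filling` and `noisy_channel` are all hypothetical.

```python
import math
import random


def reverse_water_filling(variances, total_bits, iters=60):
    """Split a bit budget across independent Gaussian latent dimensions.

    Returns (rates, distortions). Dimensions whose variance is at or below
    the water level theta get zero bits and carry nothing from the query.
    """
    if total_bits <= 0:
        return [0.0] * len(variances), list(variances)

    def rate_at(theta):
        return sum(0.5 * math.log2(v / theta) for v in variances if v > theta)

    lo, hi = 0.0, max(variances)
    for _ in range(iters):  # bisection: rate_at decreases as theta grows
        theta = 0.5 * (lo + hi)
        if rate_at(theta) > total_bits:
            lo = theta  # spending too many bits, so raise the water level
        else:
            hi = theta
    theta = 0.5 * (lo + hi)
    distortions = [min(theta, v) for v in variances]
    rates = [0.5 * math.log2(v / d) if v > d else 0.0
             for v, d in zip(variances, distortions)]
    return rates, distortions


def noisy_channel(z, means, variances, distortions, rng):
    """Gaussian forward test channel applied to one latent vector z."""
    out = []
    for z_i, m, v, d in zip(z, means, variances, distortions):
        # gain 0 ignores the query (output is the mean), gain 1 passes it through
        gain = max(0.0, 1.0 - d / v) if v > 0 else 0.0
        noise_sd = math.sqrt(gain * d)
        out.append(m + gain * (z_i - m) + rng.gauss(0.0, noise_sd))
    return out


# Toy numbers only: assumed latent variances and a hypothetical query frame.
variances = [4.0, 1.0, 0.25]
rates, distortions = reverse_water_filling(variances, total_bits=2.0)
z_query = [2.0, -1.0, 0.5]
z_noisy = noisy_channel(z_query, [0.0, 0.0, 0.0], variances, distortions,
                        random.Random(0))
print(rates, distortions, z_noisy)
```

With these toy variances and a 2-bit budget the water level settles at about 0.5, so the first two dimensions receive about 1.5 and 0.5 bits and the third receives none. Lowering `total_bits` toward zero drives every gain to zero, at which point the channel output no longer depends on the query.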
If this is right
- The generated pieces exhibit longer-term coherence traceable to the query input rather than arising only from the training corpus.
- The amount of query influence can be varied continuously by changing the noise level set by the rate-distortion allocator.
- Latent states are shown to carry both style-specific representational information and structural information supplied by the query.
- The same mechanism supplies a practical handle for using the network in deliberate composition rather than pure improvisation.
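The continuous-control claim can be made concrete for a single latent dimension. The worked equations below are a sketch under assumptions not stated in the summary: the latent coordinate $z$ is taken to be Gaussian with mean $\mu$ and variance $\sigma^{2}$, $R$ is the number of bits allotted to it, and the channel is the standard Gaussian forward test channel. The symbols $D$, $a$, and $n$ are introduced here for illustration.

```latex
\begin{align}
D(R) &= \sigma^{2}\, 2^{-2R}, \\
a(R) &= 1 - \frac{D(R)}{\sigma^{2}} = 1 - 2^{-2R}, \\
\hat{z} &= \mu + a(R)\,(z - \mu) + n,
  \qquad n \sim \mathcal{N}\bigl(0,\; a(R)\, D(R)\bigr).
\end{align}
```

Under these assumptions the gain $a(R)$ rises smoothly from $0$ at $R = 0$ (the output is the mean $\mu$ and nothing of the query survives), through $0.75$ at $R = 1$ bit, toward $1$ as $R \to \infty$ (the query's latent value passes through unchanged). The expected squared error $\mathbb{E}[(z - \hat{z})^{2}]$ equals $D(R)$ at every rate.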
Where Pith is reading between the lines
- The rate-distortion control could be replaced by other information-bottleneck techniques to achieve similar blending in non-VAE generators.
- Live performance input could be used as the query to create real-time systems that let a human performer steer the AI output at the structural level.
- The same encoder-decoder-plus-channel architecture might be applied to other sequence domains where one wishes to import external structure into a learned style.
Load-bearing premise
That a noisy channel whose noise level is chosen by rate-distortion bit allocation will produce a controllable blend of query structure into the trained style without destroying the longer-term coherence that the query is supposed to supply.
What would settle it
If human listeners cannot reliably detect longer-term structural differences between pieces generated with the query-plus-channel method and pieces generated by ordinary random latent sampling, or if varying the bit allocation produces no audible change in the degree of blending, the central claim would be falsified.
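What "reliably detect" could mean in practice can be sketched with an exact binomial test. The summary describes no listening protocol, so everything here is an assumption: a two-alternative forced-choice design in which a listener hears one query-driven clip and one randomly sampled clip and picks the query-driven one, a chance level of 0.5, and the hypothetical function name `binomial_tail`.

```python
import math


def binomial_tail(successes, trials, chance=0.5):
    """Exact one-sided p-value: P(X >= successes) if listeners only guess."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return sum(
        math.comb(trials, k) * chance ** k * (1.0 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Made-up tallies for illustration, not results from the paper.
for correct, total in [(8, 10), (16, 20), (32, 40)]:
    print(correct, total, binomial_tail(correct, total))
```

A small p-value would indicate that listeners pick out the query-driven clip more often than guessing predicts. A p-value that stays large as the number of trials grows would count against the claim, in the sense this section describes.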
Original abstract
In this paper we explore techniques for generating new music using a Variational Autoencoder (VAE) neural network that was trained on a corpus of a specific style. Instead of randomly sampling the latent states of the network to produce free improvisation, we generate new music by querying the network with musical input in a style different from the training corpus. This allows us to produce new musical output with longer-term structure that blends aspects of the query with the style of the network. In order to control the level of this blending we add a noisy channel between the VAE encoder and decoder using a bit-allocation algorithm from communication rate-distortion theory. Our experiments provide new insight into the relations between the representational and structural information of latent states and the query signal, suggesting their possible use for composition purposes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a query-based music generation technique using a VAE trained on one musical style corpus. Rather than random latent sampling, the method feeds the encoder with query input from a different style and inserts a noisy channel (noise level set via rate-distortion bit allocation) between encoder and decoder; the resulting output is claimed to exhibit longer-term structure that blends query aspects into the trained style, with the noise level controlling the blend. Experiments are asserted to yield insight into relations between latent representational/structural information and the query signal.
Significance. If the central mechanism is shown to work, the approach would supply a concrete, information-theoretically grounded way to inject controllable long-range structure into style-specific generative models without retraining, which could be useful for interactive composition systems.
Major comments (2)
- [Abstract and §4 (Experiments)] The central claim that the method produces controllable longer-term structure is unsupported; the text supplies no quantitative metrics, error bars, listening-test protocol, ablation studies, or statistical analysis of generated outputs.
- [§3.2 (Noisy channel construction)] The bit-allocation rule is introduced to control blending, yet no derivation or empirical check demonstrates that the resulting noisy latent trajectory still carries the query's temporal dependencies when the encoder receives out-of-distribution input; without this, the blending claim cannot be evaluated.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below, indicating planned revisions where appropriate. The feedback helps clarify the presentation of our experimental results and the justification for the noisy channel mechanism.
Referee: [Abstract and §4 (Experiments)] The central claim that the method produces controllable longer-term structure is unsupported; the text supplies no quantitative metrics, error bars, listening-test protocol, ablation studies, or statistical analysis of generated outputs.
Authors: We agree that the experimental section relies primarily on qualitative demonstrations of generated outputs and latent analysis rather than formal quantitative evaluation. The manuscript's focus is on exploratory insight into query-based blending via examples. In revision we will expand §4 to include a description of the listening evaluation protocol, basic quantitative comparisons (e.g., pitch-class histogram divergence and note-density statistics across noise levels), and ablation results for different bit-allocation settings. Where repeated generations are feasible, error bars will be reported.
Revision: yes
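The two comparisons named in this response can be sketched in a few lines of plain Python. This is a hedged illustration, not the authors' evaluation code: the note representation (MIDI pitch numbers and onset times in seconds), the choice of Jensen-Shannon divergence as the histogram divergence, and the function names are all assumptions, and the note lists are toy data.

```python
import math


def pitch_class_histogram(midi_pitches):
    """Normalized 12-bin histogram of pitch classes (C=0, ..., B=11)."""
    if not midi_pitches:
        raise ValueError("need at least one note")
    counts = [0] * 12
    for pitch in midi_pitches:
        counts[pitch % 12] += 1
    total = len(midi_pitches)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 if identical, 1 if disjoint."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mix = [0.5 * (x + y) for x, y in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def note_density(onsets_sec, duration_sec):
    """Notes per second over a clip of the given duration."""
    if duration_sec <= 0:
        raise ValueError("duration must be positive")
    return len(onsets_sec) / duration_sec


# Toy data: a query phrase and two hypothetical outputs at different noise levels.
query = [60, 62, 64, 65, 67, 69, 71, 72]
low_noise_output = [60, 62, 64, 66, 67, 69, 71, 72]
high_noise_output = [61, 63, 66, 68, 70, 61, 63, 66]

h_query = pitch_class_histogram(query)
for name, notes in [("low", low_noise_output), ("high", high_noise_output)]:
    print(name, js_divergence(h_query, pitch_class_histogram(notes)))
print(note_density([0.0, 0.5, 1.0, 1.5], duration_sec=2.0))
```

Reporting these values across several noise levels, with repeated generations per level, would give the curves and error bars this response promises. A pitch-class histogram ignores note order, so it speaks to tonal blending and not to longer-term structure.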
Referee: [§3.2 (Noisy channel construction)] The bit-allocation rule is introduced to control blending, yet no derivation or empirical check demonstrates that the resulting noisy latent trajectory still carries the query's temporal dependencies when the encoder receives out-of-distribution input; without this, the blending claim cannot be evaluated.
Authors: The bit-allocation procedure is taken directly from rate-distortion theory to modulate information passed from encoder to decoder. We acknowledge that the manuscript does not supply an explicit derivation or empirical verification that temporal structure from out-of-distribution queries is retained after noise injection. In the revision we will add a short theoretical paragraph in §3.2 explaining preservation of temporal dependencies under the VAE's Gaussian latent assumption, together with an empirical check that compares autocorrelation of latent trajectories before and after noise for query inputs from different styles. If the check proves inconclusive we will qualify the blending claim accordingly.
Revision: partial
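The proposed autocorrelation check might look like the sketch below. It is an illustration under assumptions, not the authors' procedure: a synthetic sinusoid stands in for one latent coordinate of an encoded query, plain additive Gaussian noise stands in for the channel, and the function names `autocorrelation` and `add_channel_noise` are hypothetical.

```python
import math
import random


def autocorrelation(x, lag):
    """Sample autocorrelation of a sequence at the given lag (1.0 at lag 0)."""
    n = len(x)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(x)")
    mean = sum(x) / n
    centered = [v - mean for v in x]
    denom = sum(c * c for c in centered)
    if denom == 0:
        raise ValueError("constant sequence has no defined autocorrelation")
    num = sum(centered[t] * centered[t + lag] for t in range(n - lag))
    return num / denom


def add_channel_noise(x, noise_sd, rng):
    """Additive Gaussian noise, a stand-in for the encoder-decoder channel."""
    return [v + rng.gauss(0.0, noise_sd) for v in x]


# Synthetic trajectory with a 32-step period, standing in for one latent dimension.
rng = random.Random(0)
clean = [math.sin(2.0 * math.pi * t / 32.0) for t in range(256)]

for noise_sd in (0.0, 0.5, 2.0):
    noisy = add_channel_noise(clean, noise_sd, rng)
    profile = [autocorrelation(noisy, lag) for lag in (1, 16, 32)]
    print(noise_sd, profile)
```

If the lag profile of the noisy trajectory keeps its peaks and troughs at the same lags as the clean one, only with smaller amplitude, the periodic structure has survived the channel. A profile that flattens toward zero at every non-zero lag suggests the noise has removed it.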
Circularity Check
No circularity: method composes standard VAE + external rate-distortion channel without self-referential reduction
The paper's core procedure—training a VAE on one corpus, encoding a stylistically foreign query, injecting noise whose level is set by a bit-allocation rule taken from rate-distortion theory, and decoding—relies on externally established components (VAE training, rate-distortion bit allocation) rather than any quantity fitted inside the paper and then re-labeled as a prediction. No equations are presented that define the output structure in terms of the query input or that reduce the blending claim to a self-citation chain. The abstract and described approach therefore remain self-contained against external benchmarks; the reader's supplied circularity score of 2.0 is consistent with this assessment.