pith. machine review for the scientific record. sign in

arxiv: 2604.22925 · v1 · submitted 2026-04-24 · 📊 stat.AP · cs.SD

Recognition: unknown

Come Together: Analyzing Popular Songs Through Statistical Embeddings

Jason Brown, Mark Glickman, Matthew Esmaili Mallory

Pith reviewed 2026-05-08 09:08 UTC · model grok-4.3

classification 📊 stat.AP cs.SD
keywords logistic PCAsong embeddingsBeatles analysispopular music statisticsmultivariate analysischords and melodiesstylistic evolutionsong structure
0
0 comments X

The pith

Logistic principal component analysis turns song features into embeddings that support statistical study of stylistic changes in early Beatles music.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how logistic principal component analysis converts global features of songs such as chords, melodic notes, transitions, and contours into real-valued embeddings. These embeddings then permit ordinary multivariate statistical methods to examine patterns that conventional tools cannot handle directly. A sympathetic reader would care because the approach supplies a data-driven way to track how songwriting developed across specific years and between two key composers. The method is demonstrated on Lennon and McCartney songs from 1962 to 1966 to inspect album-based groupings, temporal evolution, and signs of convergence or divergence.

Core claim

The central claim is that embeddings obtained through logistic principal component analysis on global song features including chords, melodic notes, chord and pitch transitions, and melodic contours enable standard multivariate analysis of a corpus of Lennon and McCartney songs from 1962-1966. This framework is applied to explore how the embeddings cluster by Beatles album, how songwriting styles changed over time, and whether the two songwriters' compositions converged or diverged.

What carries the argument

logistic principal component analysis applied to global song features to produce vector embeddings suitable for multivariate statistical analysis

Load-bearing premise

The selected musical features and the logistic PCA embeddings derived from them meaningfully encode the stylistic and structural differences the authors wish to study.

What would settle it

Recomputing the embeddings and finding no album-aligned clusters or no consistent temporal trends across the 1962-1966 songs would show that the features and embeddings do not capture the intended differences.

Figures

Figures reproduced from arXiv: 2604.22925 by Jason Brown, Mark Glickman, Matthew Esmaili Mallory.

Figure 1
Figure 1. Figure 1: A small sample of the Beatles dataset, showing each song’s authorship, album, and a few features. “Dominant view at source ↗
Figure 2
Figure 2. Figure 2: Plot of the first two principal components, colored by album. Circles represent McCartney’s songs, triangles view at source ↗
Figure 3
Figure 3. Figure 3: Plot of the album centroids for the first two principal components, again colored by album. Circles represent view at source ↗
Figure 4
Figure 4. Figure 4: Plot of the first two principal components for songs by Lennon and McCartney. view at source ↗
Figure 5
Figure 5. Figure 5: Average Euclidean distance between Lennon’s and McCartney’s embeddings across all albums from 1962 to view at source ↗
Figure 6
Figure 6. Figure 6: Plot of the square root of total variance, which is computed as the average squared Euclidean distances view at source ↗
Figure 7
Figure 7. Figure 7: Plot showing the Euclidean distance between each of George’s songs and the corresponding album-specific view at source ↗
Figure 8
Figure 8. Figure 8: Two-means clustering of Lennon and McCartney songs using the 35 principal components, projected onto the view at source ↗
Figure 9
Figure 9. Figure 9: The five features that contributed most to songs being outliers. The frequency refers to how often the feature view at source ↗
Figure 10
Figure 10. Figure 10: Predicted authorship for a selection of disputed Beatles songs. In over half the cases, all four models produce view at source ↗
read the original abstract

Statistical modeling of popular music presents a unique challenge due to the complexity of song structures, which cannot be easily analyzed using conventional statistical tools. However, recent advances in data science have shown that converting non-standard data objects into real vector-valued embeddings enables meaningful statistical analysis. In this work, we demonstrate an approach based on logistic principal component analysis to construct embeddings from global song features, allowing for standard multivariate analysis. We apply this method to a corpus of Lennon and McCartney songs from 1962-1966, using embeddings derived from chords, melodic notes, chord and pitch transitions, and melodic contours. Our analysis explores how these song embeddings cluster by Beatles album, how songwriting styles evolved over time, and whether Lennon and McCartney's compositions exhibited convergence or divergence. This embedding-based approach offers a powerful framework for statistically examining musical structure and stylistic development in popular music.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes logistic principal component analysis to construct low-dimensional embeddings from global song features (chords, melodic notes, chord/pitch transitions, and melodic contours) extracted from a corpus of Lennon-McCartney Beatles songs (1962-1966). These embeddings are then used for standard multivariate statistical analyses to examine album-based clustering, temporal evolution of songwriting styles, and convergence or divergence between Lennon and McCartney.

Significance. If the embeddings faithfully represent the input musical features, the work supplies a practical framework for applying conventional multivariate tools to complex, non-vector musical data. The Beatles application serves as a coherent proof-of-concept, with internally consistent feature encoding, logistic PCA implementation, and exploratory visualizations that illustrate album clustering and stylistic trends without requiring additional distributional assumptions.

minor comments (3)
  1. The abstract describes the method and goals but supplies no equations, validation steps, error analysis, or results; adding a single sentence summarizing the key empirical observations (e.g., observed clustering patterns) would improve reader orientation without altering the manuscript's scope.
  2. The methods section would benefit from an explicit statement of the number of songs analyzed and the precise encoding scheme used for each feature type (chords, notes, transitions, contours) to allow direct replication.
  3. Figure captions should more clearly indicate which panels correspond to which downstream analysis (album clustering, temporal trends, Lennon-McCartney comparison) and note any preprocessing steps applied to the embeddings before visualization.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the recommendation for minor revision. The review accurately captures the core contribution of using logistic PCA embeddings to enable standard multivariate analyses on musical features.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper applies logistic principal component analysis—an established external technique—to song feature vectors (chords, notes, transitions, contours) to produce embeddings, then performs standard multivariate analysis on those embeddings. No derivation step reduces to a self-definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation chain. The workflow is a direct, non-circular dimensionality reduction followed by exploratory interpretation on a new corpus, remaining fully self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; no explicit free parameters, axioms, or invented entities are stated.

axioms (1)
  • domain assumption Logistic principal component analysis is appropriate for converting categorical musical features into continuous embeddings suitable for multivariate analysis
    Invoked by the choice of method in the abstract

pith-pipeline@v0.9.0 · 5445 in / 1136 out tokens · 42589 ms · 2026-05-08T09:08:12.637042+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

12 extracted references · 11 canonical work pages · 4 internal anchors

  1. [1]

    Leo Breiman

    doi: 10.1201/9780367816377. Leo Breiman. Random forests.Machine Learning, 45(1):5–32, Oct

  2. [2]

    doi: 10.1023/A: 1021805924152

    ISSN 1573-0565. doi: 10.1023/A: 1010933404324. URLhttps://doi.org/10.1023/A:1010933404324. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Zi...

  3. [3]

    Language Models are Few-Shot Learners

    URL https://arxiv.org/abs/2005.14165. John Ashley Burgoyne, Jonathan Wild, and Ichiro Fujinaga. Compositional data analysis of harmonic structures in popular music. InMathematics and Computation in Music, pages 52–63, Berlin, Heidelberg,

  4. [4]

    Jan de Leeuw

    URL https://proceedings.neurips.cc/paper_files/ paper/2001/file/f410588e48dc83f2822a880a68f78923-Paper.pdf. Jan de Leeuw. Principal component analysis of binary data by iterated singular value decomposition.Computational Statistics & Data Analysis, 50(1):21–39,

  5. [5]

    doi: https://doi.org/10.1016/j.csda.2004.07.010

    ISSN 0167-9473. doi: https://doi.org/10.1016/j.csda.2004.07.010. URL https://www.sciencedirect.com/science/article/pii/S0167947304002300. 2nd Special issue on Matrix Computations and Statistics. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding,

  6. [6]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    URLhttps://arxiv.org/abs/1810.04805. Mark Glickman, Jason Brown, and Ryan Song. (A) Data in the Life: Authorship Attribution in Lennon-McCartney Songs.Harvard Data Science Review, 1(1), jul 2

  7. [7]

    Herzog, T

    ISSN 0047-259X. doi: https://doi.org/10.1016/j. jmva.2020.104668. Seokho Lee, Jianhua Z. Huang, and Jianhua Hu. Sparse logistic principal components analysis for binary data. The Annals of Applied Statistics, 4(3), September

  8. [8]

    doi: 10.1214/10-aoas327

    ISSN 1932-6157. doi: 10.1214/10-aoas327. URL http://dx.doi.org/10.1214/10-AOAS327. J. Lennon, D. Sheff, and Y . Ono.All We Are Saying: The Last Major Interview with John Lennon and Yoko Ono. Pan Macmillan,

  9. [9]

    Efficient Estimation of Word Representations in Vector Space

    URLhttps://arxiv.org/abs/1301.3781. Barry Miles.Paul McCartney: Many Years From Now. Henry Holt & Co,

  10. [10]

    2002.1035728

    doi: 10.1109/ICME. 2002.1035728. David J. Pannell. Quantitative analysis of the evolution of the beatles’ releases for emi, 1962–1970.Jour- nal of Beatles Studies, 2023(Spring/Autumn):65–90,

  11. [11]

    URL https://www

    doi: 10.3828/jbs.2023.5. URL https://www. liverpooluniversitypress.co.uk/doi/abs/10.3828/jbs.2023.5. 12 APREPRINT- APRIL28, 2026 Andrew I. Schein, Lawrence K. Saul, and Lyle H. Ungar. A generalized linear model for principal component analysis of binary data. In Christopher M. Bishop and Brendan J. Frey, editors,Proceedings of the Ninth International Work...

  12. [12]

    URL https://arxiv.org/abs/2312. 11805. arXiv:2312.11805. John Thickstun, Zaid Harchaoui, and Sham Kakade. Learning features of music from scratch. InInternational Conference on Learning Representations,