arxiv: 2604.22925 · v1 · submitted 2026-04-24 · 📊 stat.AP · cs.SD

Recognition: unknown

Come Together: Analyzing Popular Songs Through Statistical Embeddings

Jason Brown, Mark Glickman, Matthew Esmaili Mallory

Pith reviewed 2026-05-08 09:08 UTC · model grok-4.3

classification 📊 stat.AP cs.SD

keywords logistic PCAsong embeddingsBeatles analysispopular music statisticsmultivariate analysischords and melodiesstylistic evolutionsong structure

0 comments

The pith

Logistic principal component analysis turns song features into embeddings that support statistical study of stylistic changes in early Beatles music.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how logistic principal component analysis converts global features of songs such as chords, melodic notes, transitions, and contours into real-valued embeddings. These embeddings then permit ordinary multivariate statistical methods to examine patterns that conventional tools cannot handle directly. A sympathetic reader would care because the approach supplies a data-driven way to track how songwriting developed across specific years and between two key composers. The method is demonstrated on Lennon and McCartney songs from 1962 to 1966 to inspect album-based groupings, temporal evolution, and signs of convergence or divergence.

Core claim

The central claim is that embeddings obtained through logistic principal component analysis on global song features including chords, melodic notes, chord and pitch transitions, and melodic contours enable standard multivariate analysis of a corpus of Lennon and McCartney songs from 1962-1966. This framework is applied to explore how the embeddings cluster by Beatles album, how songwriting styles changed over time, and whether the two songwriters' compositions converged or diverged.

What carries the argument

logistic principal component analysis applied to global song features to produce vector embeddings suitable for multivariate statistical analysis

Load-bearing premise

The selected musical features and the logistic PCA embeddings derived from them meaningfully encode the stylistic and structural differences the authors wish to study.

What would settle it

Recomputing the embeddings and finding no album-aligned clusters or no consistent temporal trends across the 1962-1966 songs would show that the features and embeddings do not capture the intended differences.

Figures

Figures reproduced from arXiv: 2604.22925 by Jason Brown, Mark Glickman, Matthew Esmaili Mallory.

**Figure 1.** Figure 1: A small sample of the Beatles dataset, showing each song’s authorship, album, and a few features. “Dominant view at source ↗

**Figure 2.** Figure 2: Plot of the first two principal components, colored by album. Circles represent McCartney’s songs, triangles view at source ↗

**Figure 3.** Figure 3: Plot of the album centroids for the first two principal components, again colored by album. Circles represent view at source ↗

**Figure 4.** Figure 4: Plot of the first two principal components for songs by Lennon and McCartney. view at source ↗

**Figure 5.** Figure 5: Average Euclidean distance between Lennon’s and McCartney’s embeddings across all albums from 1962 to view at source ↗

**Figure 6.** Figure 6: Plot of the square root of total variance, which is computed as the average squared Euclidean distances view at source ↗

**Figure 7.** Figure 7: Plot showing the Euclidean distance between each of George’s songs and the corresponding album-specific view at source ↗

**Figure 8.** Figure 8: Two-means clustering of Lennon and McCartney songs using the 35 principal components, projected onto the view at source ↗

**Figure 9.** Figure 9: The five features that contributed most to songs being outliers. The frequency refers to how often the feature view at source ↗

**Figure 10.** Figure 10: Predicted authorship for a selection of disputed Beatles songs. In over half the cases, all four models produce view at source ↗

read the original abstract

Statistical modeling of popular music presents a unique challenge due to the complexity of song structures, which cannot be easily analyzed using conventional statistical tools. However, recent advances in data science have shown that converting non-standard data objects into real vector-valued embeddings enables meaningful statistical analysis. In this work, we demonstrate an approach based on logistic principal component analysis to construct embeddings from global song features, allowing for standard multivariate analysis. We apply this method to a corpus of Lennon and McCartney songs from 1962-1966, using embeddings derived from chords, melodic notes, chord and pitch transitions, and melodic contours. Our analysis explores how these song embeddings cluster by Beatles album, how songwriting styles evolved over time, and whether Lennon and McCartney's compositions exhibited convergence or divergence. This embedding-based approach offers a powerful framework for statistically examining musical structure and stylistic development in popular music.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a solid proof-of-concept application of logistic PCA to early Beatles song features that yields interpretable clusters and trends without claiming new methodology.

read the letter

The paper takes global song features—chords, notes, transitions, and contours—from Lennon-McCartney tracks 1962-1966, embeds them via logistic PCA, and then runs ordinary multivariate checks on album grouping, temporal shifts, and writer convergence. That workflow is the core contribution, and it is executed cleanly enough to produce readable visualizations and straightforward interpretations. The authors stay within the limits of an exploratory study and do not overclaim causal or predictive power. The technical steps line up internally: feature encoding feeds the embedding, the embedding supports the downstream plots, and the plots match the stated questions about style evolution. No circularity or hidden fitting appears in the reported pipeline. The main limitation is scope. Everything is confined to one band and a narrow window, so the embeddings capture Beatles-specific patterns rather than general popular-music structure. There is also no external validation set, no head-to-head comparison against other embeddings, and no quantitative measure of how much information the low-dimensional representation retains. Those gaps are expected in a first application paper but keep the work from being more than a demonstration. Readers working in digital humanities or music information retrieval will find the concrete example useful for seeing how standard statistical tools can be applied once the data are turned into vectors. Statisticians interested in non-standard objects will see a transparent case study rather than a methodological advance. The paper is coherent on its own terms and deserves a serious referee who can check the feature definitions and embedding stability on the actual corpus.

Referee Report

0 major / 3 minor

Summary. The paper proposes logistic principal component analysis to construct low-dimensional embeddings from global song features (chords, melodic notes, chord/pitch transitions, and melodic contours) extracted from a corpus of Lennon-McCartney Beatles songs (1962-1966). These embeddings are then used for standard multivariate statistical analyses to examine album-based clustering, temporal evolution of songwriting styles, and convergence or divergence between Lennon and McCartney.

Significance. If the embeddings faithfully represent the input musical features, the work supplies a practical framework for applying conventional multivariate tools to complex, non-vector musical data. The Beatles application serves as a coherent proof-of-concept, with internally consistent feature encoding, logistic PCA implementation, and exploratory visualizations that illustrate album clustering and stylistic trends without requiring additional distributional assumptions.

minor comments (3)

The abstract describes the method and goals but supplies no equations, validation steps, error analysis, or results; adding a single sentence summarizing the key empirical observations (e.g., observed clustering patterns) would improve reader orientation without altering the manuscript's scope.
The methods section would benefit from an explicit statement of the number of songs analyzed and the precise encoding scheme used for each feature type (chords, notes, transitions, contours) to allow direct replication.
Figure captions should more clearly indicate which panels correspond to which downstream analysis (album clustering, temporal trends, Lennon-McCartney comparison) and note any preprocessing steps applied to the embeddings before visualization.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the recommendation for minor revision. The review accurately captures the core contribution of using logistic PCA embeddings to enable standard multivariate analyses on musical features.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper applies logistic principal component analysis—an established external technique—to song feature vectors (chords, notes, transitions, contours) to produce embeddings, then performs standard multivariate analysis on those embeddings. No derivation step reduces to a self-definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation chain. The workflow is a direct, non-circular dimensionality reduction followed by exploratory interpretation on a new corpus, remaining fully self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; no explicit free parameters, axioms, or invented entities are stated.

axioms (1)

domain assumption Logistic principal component analysis is appropriate for converting categorical musical features into continuous embeddings suitable for multivariate analysis
Invoked by the choice of method in the abstract

pith-pipeline@v0.9.0 · 5445 in / 1136 out tokens · 42589 ms · 2026-05-08T09:08:12.637042+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 11 canonical work pages · 4 internal anchors

[1]

Leo Breiman

doi: 10.1201/9780367816377. Leo Breiman. Random forests.Machine Learning, 45(1):5–32, Oct

work page doi:10.1201/9780367816377
[2]

doi: 10.1023/A: 1021805924152

ISSN 1573-0565. doi: 10.1023/A: 1010933404324. URLhttps://doi.org/10.1023/A:1010933404324. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Zi...

work page doi:10.1023/a:
[3]

Language Models are Few-Shot Learners

URL https://arxiv.org/abs/2005.14165. John Ashley Burgoyne, Jonathan Wild, and Ichiro Fujinaga. Compositional data analysis of harmonic structures in popular music. InMathematics and Computation in Music, pages 52–63, Berlin, Heidelberg,

work page internal anchor Pith review arXiv 2005
[4]

Jan de Leeuw

URL https://proceedings.neurips.cc/paper_files/ paper/2001/file/f410588e48dc83f2822a880a68f78923-Paper.pdf. Jan de Leeuw. Principal component analysis of binary data by iterated singular value decomposition.Computational Statistics & Data Analysis, 50(1):21–39,

2001
[5]

doi: https://doi.org/10.1016/j.csda.2004.07.010

ISSN 0167-9473. doi: https://doi.org/10.1016/j.csda.2004.07.010. URL https://www.sciencedirect.com/science/article/pii/S0167947304002300. 2nd Special issue on Matrix Computations and Statistics. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding,

work page doi:10.1016/j.csda.2004.07.010 2004
[6]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

URLhttps://arxiv.org/abs/1810.04805. Mark Glickman, Jason Brown, and Ryan Song. (A) Data in the Life: Authorship Attribution in Lennon-McCartney Songs.Harvard Data Science Review, 1(1), jul 2

work page internal anchor Pith review arXiv
[7]

Herzog, T

ISSN 0047-259X. doi: https://doi.org/10.1016/j. jmva.2020.104668. Seokho Lee, Jianhua Z. Huang, and Jianhua Hu. Sparse logistic principal components analysis for binary data. The Annals of Applied Statistics, 4(3), September

work page doi:10.1016/j 2020
[8]

doi: 10.1214/10-aoas327

ISSN 1932-6157. doi: 10.1214/10-aoas327. URL http://dx.doi.org/10.1214/10-AOAS327. J. Lennon, D. Sheff, and Y . Ono.All We Are Saying: The Last Major Interview with John Lennon and Yoko Ono. Pan Macmillan,

work page doi:10.1214/10-aoas327 1932
[9]

Efficient Estimation of Word Representations in Vector Space

URLhttps://arxiv.org/abs/1301.3781. Barry Miles.Paul McCartney: Many Years From Now. Henry Holt & Co,

work page internal anchor Pith review arXiv
[10]

2002.1035728

doi: 10.1109/ICME. 2002.1035728. David J. Pannell. Quantitative analysis of the evolution of the beatles’ releases for emi, 1962–1970.Jour- nal of Beatles Studies, 2023(Spring/Autumn):65–90,

work page doi:10.1109/icme 2002
[11]

URL https://www

doi: 10.3828/jbs.2023.5. URL https://www. liverpooluniversitypress.co.uk/doi/abs/10.3828/jbs.2023.5. 12 APREPRINT- APRIL28, 2026 Andrew I. Schein, Lawrence K. Saul, and Lyle H. Ungar. A generalized linear model for principal component analysis of binary data. In Christopher M. Bishop and Brendan J. Frey, editors,Proceedings of the Ninth International Work...

work page doi:10.3828/jbs.2023.5 2023
[12]

URL https://arxiv.org/abs/2312. 11805. arXiv:2312.11805. John Thickstun, Zaid Harchaoui, and Sham Kakade. Learning features of music from scratch. InInternational Conference on Learning Representations,

work page internal anchor Pith review arXiv