On the Spectral Structure and Objective Equivalence of Orthogonal Multilabel Fisher Discriminants

Brian Keith-Norambuena; Juan Bekios-Calfa

arxiv: 2605.03283 · v2 · pith:Z76NZ7S3new · submitted 2026-05-05 · 📊 stat.ML · cs.AI· cs.LG

On the Spectral Structure and Objective Equivalence of Orthogonal Multilabel Fisher Discriminants

Brian Keith-Norambuena , Juan Bekios-Calfa This is my paper

Pith reviewed 2026-07-01 00:39 UTC · model grok-4.3

classification 📊 stat.ML cs.AIcs.LG

keywords multilabel discriminant analysisFisher objectivessubspace estimationminimax boundsStiefel constraintspectral structurelabel distance preservation

0 comments

The pith

Multilabel Fisher discriminants attain near-minimax-optimal subspace rates and become equivalent under a total-scatter normalization constraint.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper supplies a unified algebraic and statistical treatment of linear discriminant analysis when labels are vectors rather than single categories and projections are required to be orthogonal. Algebraically it shows that the multilabel between-class scatter matrix can have rank strictly larger than the classical C-1 limit, that variance admits a multilabel partition, and that four common Fisher objective functions coincide exactly once the projection matrix satisfies W transpose S_t^ML W equals the identity. Statistically it derives a finite-sample upper bound on subspace recovery error of order k_max times square-root of d log d over n divided by gap_r together with a matching lower bound of order sigma squared d over n gap_r, establishing near-optimality up to logarithmic and k_max factors under a linear label-effect model with sub-Gaussian noise.

Core claim

All four Fisher objectives are equivalent when the projection satisfies the constraint W^T S_t^{ML} W = I_r; the multilabel between-class scatter matrix admits rank larger than C-1; and the estimation error of the discriminant subspace obeys an O(k_max sqrt(d log d / n) / gap_r) finite-sample bound that matches an Omega(sigma^2 d / (n gap_r)) minimax lower bound up to logarithmic and k_max factors.

What carries the argument

The multilabel total scatter matrix S_t^{ML} together with the normalization constraint W^T S_t^{ML} W = I_r that forces equivalence of the four objectives, and the gap_r quantity that governs the statistical convergence rate.

If this is right

The effective number of usable discriminant directions can exceed the classical single-label limit of C-1.
Projected Euclidean distances satisfy two-sided bounds relating them to Hamming distances in the original label space.
High-probability concentration of label distances and robustness to label interactions hold under the same modeling assumptions.
A simple regularization scheme preserves the spectral structure of the scatter matrices when dimension greatly exceeds sample size.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Practitioners could safely substitute any of the four equivalent objectives once the total-scatter normalization is enforced, potentially simplifying software implementations.
The near-optimality result suggests that multilabel LDA may serve as a computationally cheap baseline for subspace methods in high-dimensional label spaces, provided the linear label model is a reasonable approximation.
The rank characterization could be tested directly on label matrices from real multilabel corpora to see whether the effective discriminant dimension indeed exceeds C-1 in practice.

Load-bearing premise

Data are generated exactly from a linear label-effect model with sub-Gaussian noise.

What would settle it

Observe whether the subspace estimation error on samples drawn from a non-linear or non-sub-Gaussian label model exceeds the stated O(k_max sqrt(d log d / n) / gap_r) rate by more than logarithmic factors.

read the original abstract

We provide a unified theoretical analysis of Linear Discriminant Analysis with simultaneous multilabel scatter matrix formulations and Stiefel orthogonality constraints. Our contributions span both algebraic structure and statistical guarantees. On the algebraic side, we characterize the rank of the multilabel between-class scatter matrix, showing that the effective discriminant dimensionality can strictly exceed the classical single-label bound of $C-1$; we establish a multilabel partition of variance and prove that all four Fisher objectives are equivalent under the $W^\top S_t^{ML} W = I_r$ constraint while characterizing their divergence under the Stiefel constraint; and we prove a two-sided label-distance preservation bound relating projected distances to Hamming distances in label space. On the statistical side, we establish a finite-sample $O(k_{\max}\sqrt{d\log d/n}/gap_r)$ bound on the subspace estimation error under sub-Gaussian noise with a matching $\Omega(\sigma^2 d/(n\,gap_r))$ minimax lower bound, establishing a near-minimax-optimal rate (matching up to logarithmic and $k_{\max}$ factors) for multilabel discriminant subspace estimation. We further provide high-probability distance concentration, robustness guarantees under label interactions, and a regularization analysis preserving the spectral structure when $d \gg n$. All results are verified numerically on synthetic data generated from the linear label-effect model, covering both the algebraic identities and the multilabel-specific quantities ($k_{\max}$, $\kappa(S_t^{ML})$, $\|\Gamma/n\|_2$, $\Delta_r$) that govern the statistical bounds. The numerical experiments are designed as a sanity check for the theorems rather than as an empirical benchmark; evaluation on real multilabel datasets is left to future work targeting application-oriented venues.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The algebraic results on multilabel scatter ranks and objective equivalence look usable, but the near-minimax statistical claim collapses because the upper and lower bounds have incompatible n-rates.

read the letter

The main point for you is that the paper's algebraic contributions on multilabel Fisher discriminants hold up better than its statistical guarantees.

It shows the multilabel between-class scatter can have rank strictly larger than C-1, which is a clean departure from single-label LDA. It also proves the four standard Fisher objectives become equivalent under the specific constraint W^T S_t^{ML} W = I_r, while they diverge under plain Stiefel orthogonality, and it gives a two-sided bound connecting projected distances to Hamming distances in label space. These pieces are new relative to the usual LDA literature and are checked on synthetic data drawn from the linear label-effect model. That part of the work is straightforward and internally consistent.

The statistical section is where the problem sits. The finite-sample upper bound is O(k_max sqrt(d log d / n) / gap_r), but the claimed matching minimax lower bound is Ω(σ² d / (n gap_r)). One scales as n^{-1/2} and the other as n^{-1}; the powers do not match even after the extra k_max and log factors. The lower bound therefore cannot establish tightness, and the "near-minimax-optimal" phrasing does not follow. The algebraic equivalence results are independent of this issue and do not repair it. Everything stays inside the sub-Gaussian linear label-effect model, so the bounds apply only where that model is realistic.

The experiments are limited to synthetic sanity checks and do not include real multilabel data. The citation pattern looks standard for the claims.

This is for readers working on theoretical extensions of discriminant analysis or multilabel subspace methods. A serious referee could usefully check the algebraic derivations and ask for a corrected statistical section. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper provides a unified theoretical analysis of Linear Discriminant Analysis using simultaneous multilabel scatter matrix formulations and Stiefel orthogonality constraints. It characterizes the rank of the multilabel between-class scatter matrix (showing effective dimensionality can exceed C-1), establishes a multilabel partition of variance, proves equivalence of four Fisher objectives under the W^T S_t^{ML} W = I_r constraint (with divergence under Stiefel), and gives a two-sided label-distance preservation bound. Statistically, it claims a finite-sample O(k_max √(d log d/n)/gap_r) upper bound on subspace estimation error under sub-Gaussian noise, paired with a matching Ω(σ² d/(n gap_r)) minimax lower bound for near-minimax optimality (up to logs and k_max), plus high-probability concentration, robustness, and regularization results when d ≫ n. All claims are verified on synthetic data from the linear label-effect model.

Significance. If the algebraic results hold, the objective equivalence and variance partition provide a clean structural understanding of multilabel LDA variants that could inform algorithm design. The statistical bounds, if the rates were consistent, would give the first near-minimax guarantees for multilabel discriminant subspace estimation under the stated model. The numerical verification on synthetic data is a positive sanity check for the derived quantities (k_max, κ(S_t^{ML}), etc.). The rate mismatch identified below prevents the near-minimax claim from holding as stated.

major comments (2)

[Abstract] Abstract and statistical results section: the claimed near-minimax optimality does not follow from the stated bounds. The upper bound scales as O(k_max √(d log d/n)/gap_r) (i.e., Θ(n^{-1/2})), while the minimax lower bound is Ω(σ² d/(n gap_r)) (i.e., Θ(n^{-1})). These n-dependencies are incompatible even after the extra k_max and log d factors; the lower bound cannot certify tightness of the upper bound. This is load-bearing for the central statistical contribution.
[Abstract] Abstract and § on statistical guarantees: the lower-bound construction is described as 'matching' but the explicit rate Ω(σ² d/(n gap_r)) is not shown to be tight against the upper bound's dependence on √(d log d/n). Without the full derivation of the lower bound (including how the Ω term arises and whether a √n factor is omitted), it is impossible to assess whether the minimax claim can be salvaged by re-deriving one of the bounds.

minor comments (2)

The numerical experiments are described as 'sanity checks' rather than benchmarks; this is appropriate but should be stated more explicitly in the experimental section to avoid reader misinterpretation.
Notation for multilabel quantities (k_max, gap_r, Γ/n) is introduced in the abstract and used throughout; a consolidated notation table or early definition section would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and for identifying the rate mismatch between the upper and lower bounds. This is a substantive issue for the statistical contribution, and we agree that the near-minimax claim cannot be supported by the stated rates. We will revise the manuscript to correct the presentation of these results.

read point-by-point responses

Referee: [Abstract] Abstract and statistical results section: the claimed near-minimax optimality does not follow from the stated bounds. The upper bound scales as O(k_max √(d log d/n)/gap_r) (i.e., Θ(n^{-1/2})), while the minimax lower bound is Ω(σ² d/(n gap_r)) (i.e., Θ(n^{-1})). These n-dependencies are incompatible even after the extra k_max and log d factors; the lower bound cannot certify tightness of the upper bound. This is load-bearing for the central statistical contribution.

Authors: We agree with the referee. The upper bound is of order n^{-1/2} (up to the k_max and log d factors) while the lower bound is of order n^{-1}. These rates are incompatible, so the lower bound does not establish tightness of the upper bound and the near-minimax optimality claim does not hold. We will revise the abstract and the section on statistical guarantees to remove all references to near-minimax optimality or matching rates. The bounds will be presented separately, with an explicit statement that the lower bound does not certify optimality of the upper bound. revision: yes
Referee: [Abstract] Abstract and § on statistical guarantees: the lower-bound construction is described as 'matching' but the explicit rate Ω(σ² d/(n gap_r)) is not shown to be tight against the upper bound's dependence on √(d log d/n). Without the full derivation of the lower bound (including how the Ω term arises and whether a √n factor is omitted), it is impossible to assess whether the minimax claim can be salvaged by re-deriving one of the bounds.

Authors: The lower-bound derivation appears in the supplementary material. However, the rate mismatch is fundamental and cannot be resolved by re-derivation; the Ω(n^{-1}) lower bound is simply weaker than the O(n^{-1/2}) upper bound. We will revise the main text to include a brief summary of the lower-bound construction (including the origin of the Ω term) and to remove the 'matching' language. No salvage of the minimax claim is possible with the current bounds. revision: yes

Circularity Check

0 steps flagged

No circularity; derivations are self-contained proofs under stated model

full rationale

The paper derives algebraic equivalences (objectives under W^T S_t^{ML} W = I_r), rank characterizations, and statistical bounds (upper O(k_max sqrt(d log d/n)/gap_r) and lower Omega(sigma^2 d/(n gap_r))) directly from the linear label-effect model with sub-Gaussian noise and the given constraints. No step reduces a claimed result to a fitted parameter, self-citation chain, or input by construction; the numerical verification is explicitly a sanity check on synthetic data from the same model rather than a source of the theorems. The rate mismatch noted in external commentary affects claim strength but does not indicate circularity in the derivation chain itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard domain assumptions from statistical learning theory for extending LDA to multilabel cases; no free parameters or invented entities are described in the abstract.

axioms (2)

domain assumption Noise is sub-Gaussian
Invoked for the finite-sample O(k_max sqrt(d log d/n)/gap_r) subspace estimation error bound and the matching minimax lower bound.
domain assumption Data generated from linear label-effect model
Used to generate synthetic data for numerical verification of algebraic identities and multilabel-specific quantities.

pith-pipeline@v0.9.1-grok · 5859 in / 1522 out tokens · 56257 ms · 2026-07-01T00:39:49.688266+00:00 · methodology

Review history (3 revisions) →

discussion (0)

Reference graph

Works this paper leans on

3 extracted references

[1]

Absil, R

P.-A. Absil, R. Mahony, and R. Sepulchre.Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, 2008. 47 Keith-Norambuena and Bekios-Calfa Juan Bekios-Calfa and Brian Keith. A multilabel extension of LDA based on the Gram- Schmidt orthogonalization procedure. InProgress in Pattern Recognition, Image Anal- ysis, Computer Vi...

2008
[2]

Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Sch¨ olkopf

Diego, CA, 2nd edition, 1990. Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Sch¨ olkopf. Measuring statis- tical dependence with Hilbert-Schmidt norms. InProceedings of the 16th International Conference on Algorithmic Learning Theory (ALT), volume 3734 ofLecture Notes in Computer Science, pages 63–77. Springer, 2005. D. L. Hanson and F. T. Wr...

1990
[3]

Sanli Hou and Patrick Riley

Cambridge, UK, 2nd edition, 2013. Sanli Hou and Patrick Riley. Is uncorrelated linear discriminant analysis really a new method?Chemometrics and Intelligent Laboratory Systems, 143:139–142, 2015. 48 Orthogonal Multilabel Fisher Discriminants Shuiwang Ji and Jieping Ye. Linear dimensionality reduction for multi-label classification. InProceedings of the In...

2013

[1] [1]

Absil, R

P.-A. Absil, R. Mahony, and R. Sepulchre.Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, 2008. 47 Keith-Norambuena and Bekios-Calfa Juan Bekios-Calfa and Brian Keith. A multilabel extension of LDA based on the Gram- Schmidt orthogonalization procedure. InProgress in Pattern Recognition, Image Anal- ysis, Computer Vi...

2008

[2] [2]

Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Sch¨ olkopf

Diego, CA, 2nd edition, 1990. Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Sch¨ olkopf. Measuring statis- tical dependence with Hilbert-Schmidt norms. InProceedings of the 16th International Conference on Algorithmic Learning Theory (ALT), volume 3734 ofLecture Notes in Computer Science, pages 63–77. Springer, 2005. D. L. Hanson and F. T. Wr...

1990

[3] [3]

Sanli Hou and Patrick Riley

Cambridge, UK, 2nd edition, 2013. Sanli Hou and Patrick Riley. Is uncorrelated linear discriminant analysis really a new method?Chemometrics and Intelligent Laboratory Systems, 143:139–142, 2015. 48 Orthogonal Multilabel Fisher Discriminants Shuiwang Ji and Jieping Ye. Linear dimensionality reduction for multi-label classification. InProceedings of the In...

2013