pith. machine review for the scientific record.

arxiv: 2605.07786 · v2 · submitted 2026-05-08 · 💻 cs.CV · cs.AI

Recognition: 2 Lean theorem links

APEX: Assumption-free Projection-based Embedding eXamination Metric for Image Quality Assessment

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 03:13 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords image quality assessment · sliced wasserstein distance · embedding metrics · generative image evaluation · robustness to degradations · cross-dataset stability · assumption-free similarity

The pith

APEX applies sliced Wasserstein distance to embeddings to create an assumption-free metric for image quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome the closed-vocabulary limits of older feature sets and the distributional assumptions built into standard metrics such as FID. It does so by defining APEX around a projection-based use of sliced Wasserstein distance that requires no parametric form and works with any embedding source. A reader would care because reliable quality scores matter when generative models produce images that must be judged fairly across many visual conditions and data sources. The approach is presented as scalable in high dimensions and stable both inside and across datasets.

Core claim

APEX is a novel evaluation framework that leverages the Sliced Wasserstein Distance as a mathematically grounded, assumption-free similarity measure between embeddings. It inherits effective scalability to high-dimensional spaces, uses open-vocabulary foundation models as feature extractors, and thereby achieves superior robustness to visual degradations along with high intra- and cross-dataset stability.

What carries the argument

Sliced Wasserstein Distance applied to projections of embeddings, functioning as an assumption-free similarity measure that replaces rigid parametric formulations.
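The carrying computation is compact enough to state directly. Below is a minimal NumPy sketch of the standard Monte Carlo sliced Wasserstein estimator between two embedding sets; the function name, defaults, and equal-sample-size restriction are ours for illustration, not the paper's implementation.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=128, p=2, rng=None):
    """Monte Carlo SWD between two (n, d) embedding sets of equal size n.

    Each 1-D Wasserstein distance reduces to a comparison of sorted
    projections, so the per-projection cost is O(n log n).
    """
    rng = np.random.default_rng(rng)
    # Directions drawn uniformly on the unit sphere S^{d-1}.
    theta = rng.standard_normal((n_projections, X.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project both sets onto every direction and sort along the sample axis.
    Xp = np.sort(X @ theta.T, axis=0)
    Yp = np.sort(Y @ theta.T, axis=0)
    # For equal-size empirical measures, 1-D W_p^p is the mean of
    # |sorted difference|^p; averaging over directions estimates SW_p^p.
    return float(np.mean(np.abs(Xp - Yp) ** p) ** (1.0 / p))
```

In APEX's setting, X and Y would hold foundation-model features (e.g., CLIP or DINOv2) of reference and generated images; lower values mean more similar distributions, and the per-projection sort is the source of the O(N log N) scaling cited in Figure 2.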

If this is right

  • APEX scales to high-dimensional spaces with supporting theoretical and empirical evidence.
  • Benchmark comparisons show greater robustness to visual degradations than established baselines.
  • The resulting scores remain stable within single datasets and across different datasets, including out-of-domain cases.
  • Because the method is embedding-agnostic, the same distance computation can be paired with other feature extractors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the stability result holds, developers of generative models could rely on a single metric for consistent quality checks even when training data changes.
  • The projection step could be reused to compare distributions in other perceptual tasks such as video or 3-D asset evaluation.
  • A direct test would measure whether APEX rankings match human preference studies on newly generated image sets (see the sketch after this list).
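The third extension is directly testable with a few lines. A hypothetical harness is sketched below; the scores and ratings are made-up placeholders, not the paper's data.

```python
from scipy.stats import spearmanr

# Hypothetical per-image-set results: one APEX score (lower = better) and
# one mean human rating on a 1-5 scale (higher = better) per set.
apex_scores = [0.12, 0.31, 0.58, 0.77, 0.95]
human_ratings = [4.6, 3.9, 3.1, 2.2, 1.4]

# Since APEX is a distance, concordance with human preference appears as a
# strong *negative* rank correlation.
rho, pval = spearmanr(apex_scores, human_ratings)
print(f"Spearman rho = {rho:.3f} (p = {pval:.3g})")
```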

Load-bearing premise

The sliced Wasserstein distance remains free of hidden distributional assumptions once it is computed on the chosen embeddings.

What would settle it

If APEX scores fail to track human perceptual judgments more closely than FID on images that have undergone controlled degradations, or if they vary sharply on out-of-domain test sets, the central claim would not hold.
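The degradation half of this test is cheap to probe. The synthetic sketch below reuses the sliced_wasserstein function from the sketch above; the Gaussian embeddings are stand-ins, since in a real run clean and degraded would be CLIP/DINOv2 features of original and corrupted images.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.standard_normal((2000, 512))      # stand-in embeddings
severities = [0.25, 0.5, 1.0, 2.0]
scores = [
    sliced_wasserstein(clean, clean + s * rng.standard_normal(clean.shape),
                       rng=1)
    for s in severities
]
# A metric that tracks controlled degradation should rise with severity;
# a flat or erratic curve here would undercut the central claim.
print(dict(zip(severities, [round(v, 4) for v in scores])))
```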

Figures

Figures reproduced from arXiv: 2605.07786 by Barbara Toniella Corradini, Caterina Gallegati, Franco Scarselli, Monica Bianchini, Vittorio Murino.

Figure 1: Sample images from the five evaluation datasets, spanning natural scenes (COCO-30k), dermoscopy (HAM10000), faces (CelebA-HQ), medical radiographs (NIH Chest X-Ray), and satellite imagery (NWPU-RESISC45).

Figure 2: Sample complexity and runtime. (Top) APEX-DINO and APEX-CLIP stabilise reliably by N ≈ 500, avoiding FID's low-data overestimation and CMMD's cross-domain instability. (Bottom) Extraction (solid) vs. computation (dashed) times. APEX computation scales as O(N log N), bypassing the prohibitive O(N²) overhead of MMD-based baselines.

Figure 3: Number of SWD projections vs. execution time for APEX across image corruptions and datasets. The max-normalised APEX score is reported on the left axis; dashed curves report execution time on the right axis. Both trends are shown as the number of projections L used in the SWD computation increases.

Figure 4: Visualization of image degradations on COCO-30k. Qualitative examples of the six degradation categories across five levels of increasing severity.

Figure 5: Metric sensitivity across degradations and domains. Responses of all metrics to progressively stronger perturbations across six degradation types and five evaluation datasets. Rows correspond to degradation families, columns to datasets. Each curve is independently min-max normalized to [0, 1], and the x-axis reports the degradation parameter used for the corresponding transformation.

Figure 6: Layer-wise sensitivity of APEX-DINO. Normalized SWD responses computed separately on DINOv2 layers L6, L12, and L23 across datasets and degradations. Shallow features (L6) are generally most sensitive to low-level corruptions, such as colour shifts, noise, JPEG artifacts, and resolution loss, while deep features (L23) are more invariant to these perturbations. The intermediate layer L12 captures structural …

Figure 7: Generative models perform refinement in the last timesteps. Top: evolution of the cosine distance between CLIP embeddings (semantic score) and LPIPS (perceptual score) across generation steps for selected samples. Bottom: qualitative results across generative timesteps of Stable Diffusion. The semantic CLIP score saturates with increasing timesteps, while the perceptual LPIPS score decreases.

Figure 8: Consistency across coarse-to-fine generation. Metric responses along the Stable Diffusion denoising process. The metrics correctly exhibit a decreasing trend, reflecting the progressive improvement in sample quality.
read the original abstract

As generative models achieve unprecedented visual quality, the gold standard for image evaluation remains traditional feature-distribution metrics (e.g., FID). However, these metrics are provably hindered by the closed-vocabulary bottleneck of outdated features and the assumptive bias of rigid parametric formulations. Recent alternatives exploit modern backbones to solve the feature bottleneck, yet continue to suffer from parametric limitations. To close this gap, we introduce APEX (Assumption-free Projection-based Embedding eXamination), a novel evaluation framework leveraging the Sliced Wasserstein Distance as a mathematically grounded, assumption-free similarity measure. APEX inherits effective scalability to high-dimensional spaces, as we prove with theoretical and empirical evidence. Moreover, APEX is embedding-agnostic and uses two open-vocabulary foundation models, CLIP and DINOv2, as feature extractors. Benchmarking APEX against established baselines reveals superior robustness to visual degradations. Additionally, we show that APEX metrics exhibit intra- and cross-dataset stability, ensuring highly stable evaluations on out-of-domain datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces APEX, a novel image quality assessment metric that applies the Sliced Wasserstein Distance (SWD) to high-dimensional embeddings extracted from open-vocabulary foundation models (CLIP and DINOv2). It claims this framework is assumption-free and embedding-agnostic, provides theoretical and empirical proofs of scalability to high-dimensional spaces, and demonstrates superior robustness to visual degradations along with strong intra- and cross-dataset stability compared to parametric baselines such as FID.

Significance. If the central claims on assumption-freeness, scalability, and robustness hold under scrutiny, APEX could meaningfully improve evaluation practices for generative vision models by replacing rigid parametric assumptions and closed-vocabulary bottlenecks with a more general, stable distance measure. The explicit use of modern embeddings and the focus on out-of-domain stability are particularly relevant strengths.

major comments (3)
  1. [§4] §4 (Theoretical Analysis), the claimed proof of scalability and assumption-freeness for SWD: the derivation must explicitly bound the Monte Carlo projection error for finite samples in the high-dimensional regime of CLIP/DINOv2 embeddings (typically 512–1024 dims) and clarify whether any implicit regularity assumptions on the embedding distribution are required; without this, the contrast to 'parametric limitations' of FID-style metrics is not fully load-bearing.
  2. [§5.3] §5.3 (Robustness Experiments), the benchmarking tables: the reported superiority in robustness to degradations lacks statistical significance testing (e.g., paired t-tests or bootstrap confidence intervals across multiple runs) and does not include controls that isolate the contribution of SWD versus the specific inductive biases of the chosen CLIP and DINOv2 extractors; this leaves open the possibility that performance gains trace to feature-extractor properties rather than the projection-based distance (a minimal bootstrap sketch follows these comments).
  3. [§5.4] §5.4 (Stability Analysis), the intra- and cross-dataset stability results: the evaluation should include an ablation replacing CLIP/DINOv2 with at least one additional embedding backbone (e.g., a supervised ResNet or another self-supervised model) to test the embedding-agnostic claim; otherwise the stability may not generalize beyond the training-induced biases of the two selected models.
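The significance test requested in the second comment is inexpensive to add. Below is a minimal paired-bootstrap sketch; the per-run scores are illustrative placeholders, and the revision may prefer paired t-tests instead.

```python
import numpy as np

# Hypothetical per-run robustness scores for APEX and one baseline, paired
# by seed/condition (placeholder numbers, not results from the paper).
rng = np.random.default_rng(0)
apex = np.array([0.81, 0.79, 0.84, 0.80, 0.82])
baseline = np.array([0.74, 0.76, 0.73, 0.77, 0.75])

diffs = apex - baseline
boot = rng.choice(diffs, size=(10_000, diffs.size), replace=True).mean(axis=1)
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% bootstrap CI for the paired difference: [{lo:.3f}, {hi:.3f}]")
# An interval excluding 0 suggests the robustness gap is not run-to-run noise.
```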
minor comments (3)
  1. [§3] Notation for the sliced projections and the final APEX score should be unified across equations and text to avoid ambiguity in the definition of the expectation over random directions (the standard form is restated after this list).
  2. [Abstract / §1] The abstract and introduction cite 'provably hindered' properties of FID without referencing the specific theorems (e.g., on Gaussian assumptions or sample complexity); adding these citations would strengthen the motivation.
  3. [Figures in §5] Figure captions for the robustness and stability plots should explicitly state the number of random seeds, the exact degradation parameters, and the dataset splits used.
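For reference on the notation point, here are the population quantity and the Monte Carlo estimator that a unified notation would pin down, in standard symbols (ours, not necessarily the paper's); θ# denotes the pushforward of a measure along x ↦ θᵀx.

```latex
\mathrm{SW}_p(\mu,\nu)
  = \Big( \mathbb{E}_{\theta \sim \mathcal{U}(\mathbb{S}^{d-1})}
      \big[ W_p^p(\theta_\# \mu,\, \theta_\# \nu) \big] \Big)^{1/p},
\qquad
\widehat{\mathrm{SW}}_p(\mu,\nu)
  = \Big( \frac{1}{L} \sum_{l=1}^{L}
      W_p^p(\theta_{l\,\#}\mu,\, \theta_{l\,\#}\nu) \Big)^{1/p}
```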

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us identify areas to strengthen our manuscript. We address each major comment point-by-point below, agreeing to incorporate revisions where appropriate to enhance the clarity and rigor of our claims regarding APEX.

read point-by-point responses
  1. Referee: [§4] §4 (Theoretical Analysis), the claimed proof of scalability and assumption-freeness for SWD: the derivation must explicitly bound the Monte Carlo projection error for finite samples in the high-dimensional regime of CLIP/DINOv2 embeddings (typically 512–1024 dims) and clarify whether any implicit regularity assumptions on the embedding distribution are required; without this, the contrast to 'parametric limitations' of FID-style metrics is not fully load-bearing.

    Authors: We appreciate this suggestion to make our theoretical analysis more complete. In the original manuscript, §4 provides theoretical and empirical evidence for the scalability of SWD to high dimensions, building on established results for the sliced Wasserstein distance. However, we agree that an explicit bound on the Monte Carlo projection error for finite numbers of projections in high-dimensional settings (512–1024 dims) would strengthen the section. In the revised version, we will derive such a bound using standard concentration inequalities (e.g., Hoeffding's inequality applied to the projections), under the mild assumption of bounded second moments of the embedding distributions, which holds for normalized CLIP and DINOv2 features. We will also explicitly state that no stronger parametric assumptions (such as Gaussianity) are required, in contrast to FID. This addition will be included in the updated §4, along with numerical verification of the bound's tightness (a sketch of the bound's form appears after these responses). revision: yes

  2. Referee: [§5.3] §5.3 (Robustness Experiments), the benchmarking tables: the reported superiority in robustness to degradations lacks statistical significance testing (e.g., paired t-tests or bootstrap confidence intervals across multiple runs) and does not include controls that isolate the contribution of SWD versus the specific inductive biases of the chosen CLIP and DINOv2 extractors; this leaves open the possibility that performance gains trace to feature-extractor properties rather than the projection-based distance.

    Authors: We agree that adding statistical significance testing will improve the credibility of our empirical results. In the revision, we will augment the tables in §5.3 with bootstrap confidence intervals or paired t-tests computed over multiple independent runs of the experiments. To address the isolation of SWD's contribution, we will add control experiments where we replace the Sliced Wasserstein Distance with alternative metrics (such as mean Euclidean distance or cosine similarity) applied to the same CLIP and DINOv2 embeddings. This will allow us to demonstrate that the robustness advantages stem from the projection-based, assumption-free nature of SWD rather than solely from the choice of embeddings. These controls will be presented in the revised §5.3. revision: yes

  3. Referee: [§5.4] §5.4 (Stability Analysis), the intra- and cross-dataset stability results: the evaluation should include an ablation replacing CLIP/DINOv2 with at least one additional embedding backbone (e.g., a supervised ResNet or another self-supervised model) to test the embedding-agnostic claim; otherwise the stability may not generalize beyond the training-induced biases of the two selected models.

    Authors: We recognize the value of this ablation to more convincingly support our embedding-agnostic claim. Although APEX is formulated to work with any embedding extractor, our primary experiments focused on CLIP and DINOv2 due to their open-vocabulary nature and strong performance. In the revised manuscript, we will include an additional ablation in §5.4 using a supervised ResNet-50 backbone (pretrained on ImageNet) and report the intra- and cross-dataset stability metrics for comparison. This will provide direct evidence that the stability properties hold across different embedding types, including those with supervised training biases, thereby reinforcing the generality of the framework (a minimal backbone-swap sketch follows these responses). revision: yes
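On the first response: the promised bound has a standard shape. Under the stated bounded-moment assumption, and specializing to unit-normalized embeddings where every projection θᵀx lies in [-1, 1] so that each term W_p^p is at most B = 2^p, Hoeffding's inequality over the L i.i.d. projection directions yields a Monte Carlo error bound with no dependence on the ambient dimension d. This is a reconstruction of the form such a bound would take, not the authors' derivation.

```latex
% Hoeffding bound on the Monte Carlo projection error (B = 2^p for
% unit-normalized embeddings); note the absence of d on the right side:
\Pr\!\left( \left| \widehat{\mathrm{SW}}_p^{\,p}(\mu_n,\nu_n)
    - \mathrm{SW}_p^p(\mu_n,\nu_n) \right| \ge \varepsilon \right)
  \le 2 \exp\!\left( - \frac{2 L \varepsilon^2}{B^2} \right)
```

On the third response: the backbone swap is mechanically simple because the distance is computed on whatever (n, d) feature matrix is supplied. A hedged sketch using torchvision's pretrained ResNet-50 follows; the exact pooling, preprocessing, and checkpoint in the revision may differ.

```python
import torch
import torchvision.models as models
from torchvision.models import ResNet50_Weights

# Ablation sketch: any model mapping a batch of images to (n, d) features
# can feed the same sliced_wasserstein computation sketched earlier.
weights = ResNet50_Weights.IMAGENET1K_V2
backbone = models.resnet50(weights=weights)
backbone.fc = torch.nn.Identity()     # expose the 2048-d pooled features
backbone.eval()
preprocess = weights.transforms()     # the checkpoint's own preprocessing

@torch.no_grad()
def embed(images):                    # images: (n, 3, H, W) float tensor
    return backbone(preprocess(images)).cpu().numpy()

# apex_resnet = sliced_wasserstein(embed(real_batch), embed(generated_batch))
```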

Circularity Check

0 steps flagged

No circularity: derivation relies on independent SWD properties and external embeddings

full rationale

The paper defines APEX as a direct application of the established Sliced Wasserstein Distance to features from independent foundation models (CLIP, DINOv2). Scalability is claimed via separate theoretical and empirical arguments rather than by construction from the metric itself. Benchmarking and stability results are presented as external validations, not as inputs that are renamed or fitted into the core definition. No self-citations, ansatzes, or uniqueness theorems from the authors' prior work are invoked as load-bearing premises. The central claim therefore remains self-contained against external mathematical and empirical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on SWD being assumption-free and the chosen embeddings overcoming prior bottlenecks. No explicit free parameters or new physical entities are mentioned.

axioms (1)
  • domain assumption: The Sliced Wasserstein Distance serves as a mathematically grounded, assumption-free similarity measure for high-dimensional embeddings.
    Directly invoked in the abstract as the core justification for the framework.

pith-pipeline@v0.9.0 · 5493 in / 1311 out tokens · 77597 ms · 2026-05-12T03:13:53.733691+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019. URL: https://openreview.net/forum?id=Hk99zCeAb

  2. [2] Hierarchical Text-Conditional Image Generation with CLIP Latents.

  Further entries recovered from the same extraction:
  • M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1-2):81–93, 1938. doi: 10.1093/biomet/30.1-2.81. URL: https://doi.org/10.1093/biomet/30.1-2.81
  • Hyeok Kyu Kwon, Jaeseung Yang, and Minwoo Chae. Evaluating image generation models via sliced Wasserstein distance. Journal of the Korean Statistical Society, pages 1–21, 2026.
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and …