pith. machine review for the scientific record.

arxiv: 2603.10225 · v3 · submitted 2026-03-10 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Rethinking the Harmonic Loss via Non-Euclidean Distance Layers

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 12:55 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords harmonic loss · distance metrics · non-Euclidean geometry · neural network training · interpretability · carbon emissions · vision models · language models
0 comments

The pith

Swapping the distance metric inside the harmonic loss yields higher accuracy, greater stability, and lower emissions than the original Euclidean version or cross-entropy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the harmonic loss, previously limited to Euclidean distance, improves when other distance functions are substituted in its place. On image classification tasks, cosine distance raises accuracy while cutting carbon output; Bray-Curtis and Mahalanobis distances make the learned representations more interpretable, though at different computational costs. On language models, the cosine version of the same loss stabilizes gradients, tightens representation geometry, and again lowers emissions relative to both cross-entropy and the Euclidean harmonic baseline. These results matter because cross-entropy is known to produce unbounded weights and delayed generalization, while the harmonic alternative had previously been explored only in one narrow geometric setting.

Core claim

Replacing the Euclidean distance inside the harmonic loss with a range of other metrics produces distance-tailored harmonic losses that, on vision backbones, give higher accuracy and lower carbon emissions with cosine distance and greater interpretability with Bray-Curtis or Mahalanobis distance; on large language models, the cosine variant improves gradient stability and representation structure and reduces emissions compared with both cross-entropy and the Euclidean harmonic head.

What carries the argument

Distance-tailored harmonic loss layers that substitute a chosen non-Euclidean metric for the original Euclidean distance when computing the loss between model outputs and targets.
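
As a concrete illustration, the following is a minimal PyTorch sketch of such a head, written against the harmonic-loss formulation of Baek, Liu, Tyagi, and Tegmark (reference [3] in the graph below), in which class probabilities are proportional to an inverse power of the distance between an embedding and a learned per-class prototype. The HarmonicHead name, the exponent n, and the specific metric implementations are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a distance-tailored harmonic head (illustrative; not the authors' released code).
# Follows the harmonic-loss idea of Baek et al. [3]: class probabilities are proportional to an
# inverse power of the distance between an embedding and a learned per-class prototype.
import torch
import torch.nn as nn
import torch.nn.functional as F


def pairwise_distance(x, prototypes, metric="euclidean", eps=1e-8):
    """Distance from each embedding in x (B, D) to each prototype (K, D), returned as (B, K)."""
    if metric == "euclidean":
        return torch.cdist(x, prototypes, p=2)
    if metric == "cosine":
        # 1 - cosine similarity, so smaller values mean "closer", as with a distance.
        return 1.0 - F.normalize(x, dim=-1) @ F.normalize(prototypes, dim=-1).T
    if metric == "braycurtis":
        diff = (x.unsqueeze(1) - prototypes.unsqueeze(0)).abs().sum(-1)
        total = (x.unsqueeze(1) + prototypes.unsqueeze(0)).abs().sum(-1)
        return diff / (total + eps)
    raise ValueError(f"unknown metric: {metric}")


class HarmonicHead(nn.Module):
    """Drop-in replacement for a linear layer plus cross-entropy, parameterized by a distance metric."""

    def __init__(self, dim, num_classes, metric="cosine", n=2.0):
        super().__init__()
        self.prototypes = nn.Parameter(0.02 * torch.randn(num_classes, dim))
        self.metric, self.n = metric, n

    def forward(self, embeddings, targets, eps=1e-8):
        d = pairwise_distance(embeddings, self.prototypes, self.metric)
        # Harmonic probabilities p_k ∝ 1 / d_k^n, computed in log space for numerical stability.
        log_p = -self.n * torch.log(d.clamp_min(eps))
        log_p = log_p - torch.logsumexp(log_p, dim=-1, keepdim=True)
        return F.nll_loss(log_p, targets)
```

In a training step the head replaces the usual logits-plus-cross-entropy pair (loss = head(backbone(x), y)); switching among the variants the paper compares then amounts to changing the metric string.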

If this is right

  • Cosine harmonic loss becomes a drop-in candidate for training both vision and language models when emission reduction is a design goal.
  • Bray-Curtis or Mahalanobis harmonic losses can be selected when post-hoc interpretability of class boundaries is prioritized over raw speed.
  • Training dynamics on language models become more stable when the loss uses cosine rather than Euclidean or cross-entropy geometry.
  • Overall carbon cost of large-scale training can be lowered by metric choice inside an existing loss template without altering model size.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same substitution principle could be tested on other distance-based losses or on reinforcement-learning objectives.
  • If cosine distance proves robust across more domains, it may become the default geometry for any loss that compares embeddings to targets.
  • The interpretability gains from Bray-Curtis and Mahalanobis suggest that metric choice can be treated as a tunable hyper-parameter for downstream explanation methods.

Load-bearing premise

Changing only the distance function inside the loss is sufficient to obtain the reported gains in accuracy, stability, and emissions without further changes to architecture, optimizer, or training schedule.

What would settle it

A controlled re-training experiment on the same vision or language model and dataset that swaps only the distance metric in the harmonic loss and measures whether accuracy, gradient norms, or measured carbon output change by more than a few percent.
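
A minimal sketch of that experiment, reusing the HarmonicHead above: everything except the metric string is held fixed, and each run records accuracy, mean gradient norm, and measured emissions. Here build_model, get_loader, and evaluate are hypothetical placeholders for the fixed setup, and codecarbon stands in as one possible emissions tracker; the authors' exact tooling, models, and datasets are not specified on this page.

```python
# Sketch of the controlled swap described above: the backbone, data, optimizer, seeding, and
# schedule stay fixed while only the distance metric inside the harmonic head changes.
# build_model, get_loader, and evaluate are hypothetical placeholders for that fixed setup;
# codecarbon is used as one possible emissions tracker (the paper's exact tooling is not named here).
import torch
from codecarbon import EmissionsTracker


def train_one_config(metric, seed, epochs=10):
    torch.manual_seed(seed)
    model, head = build_model(), HarmonicHead(dim=512, num_classes=10, metric=metric)
    params = list(model.parameters()) + list(head.parameters())
    opt = torch.optim.AdamW(params, lr=3e-4)

    tracker = EmissionsTracker(output_file=f"emissions_{metric}_{seed}.csv")
    tracker.start()
    grad_norms = []
    for _ in range(epochs):
        for x, y in get_loader(train=True):
            loss = head(model(x), y)
            opt.zero_grad()
            loss.backward()
            # clip_grad_norm_ with an infinite threshold only measures the total gradient norm.
            grad_norms.append(torch.nn.utils.clip_grad_norm_(params, float("inf")).item())
            opt.step()
    kg_co2 = tracker.stop()  # estimated emissions for this run, in kg CO2

    return {"metric": metric, "seed": seed,
            "accuracy": evaluate(model, head),
            "mean_grad_norm": sum(grad_norms) / len(grad_norms),
            "kg_co2": kg_co2}


# Everything except `metric` is held constant, so differences across the variants
# isolate the effect of the distance function itself.
results = [train_one_config(m, seed=s)
           for m in ("euclidean", "cosine", "braycurtis")
           for s in range(3)]
```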

Figures

Figures reproduced from arXiv: 2603.10225 by Collin Coil, Kamil Faber, Marcin Pietron, Maxwell Miller-Golub, Panpan Zheng, Pasquale Minervini, Roberto Corizzo.

Figure 1. Vision: Radar plots: 1) Model Performance (F1, Accuracy); 2) Interpretability (PC2 EV, PCA 90%), and 3) Sustainability (Duration/Epoch/GFLOPs, Emissions). Plots feature the Baseline (Cross-Entropy), the Euclidean harmonic loss, and the four top-performing non-Euclidean harmonic losses.
Figure 2. Language: Radar plots.
Figure 3. Vision: Accuracy curves with confidence intervals; shaded regions show 95% confidence intervals.
Figure 4. Vision: Radar plots: 1) Model Performance (F1, Accuracy); 2) Interpretability (PC2 EV, PCA 90%), and 3) Sustainability (Duration/Epoch/GFLOPs, Emissions). Plots feature the Baseline (Cross-Entropy), the Euclidean harmonic loss, and the four top-performing non-Euclidean harmonic losses.
Figure 5. Vision: Radar plots – MNIST, CIFAR10, CIFAR100.
Figure 6. Vision: Radar plots – Marathi Sign Language, TinyImageNet.
Figure 7. Vision: Emissions Averaged Across Seeds and Aggregated Over all 12 Model Backbones.
Figure 8. Loss convergence behavior with PVT and ResNet50: training and validation loss.
Figure 9. Loss convergence behavior with language models (BERT-0.1B, GPT-0.1B, QWEN2-0.5B, …).
Figure 11. Geometric effect of distance-based harmonic losses on ResNet50 embeddings (MNIST).
Figure 12. Geometric effect of distance-based harmonic losses on ResNet50 embeddings (CIFAR10).
Figure 13. Carbon emission differences for MNIST across four model backbones (MLP, CNN, …).
Figure 14. Carbon emission differences for CIFAR10 across four model backbones (MLP, CNN, …).
Figure 15. Carbon emission differences for CIFAR100 across four model backbones (MLP, CNN, …).
Figure 16. Carbon emission differences for LLM pretraining on OpenWebText (BERT, GPT2, …).
read the original abstract

Cross-entropy loss has long been the standard choice for training deep neural networks, yet it suffers from interpretability limitations, unbounded weight growth, and inefficiencies that can contribute to costly training dynamics. The harmonic loss is a distance-based alternative grounded in Euclidean geometry that improves interpretability and mitigates phenomena such as grokking, or delayed generalization on the test set. However, the study of harmonic loss remains narrow: only Euclidean distance is explored, and no systematic evaluation of computational efficiency or sustainability was conducted. We extend harmonic loss by systematically investigating a broad spectrum of distance metrics as replacements for the Euclidean distance. We comprehensively evaluate distance-tailored harmonic losses on both vision backbones and large language models. Our analysis is framed around a three-way evaluation of model performance, interpretability, and sustainability. On vision tasks, cosine distances provide the most favorable trade-off, consistently improving accuracy while lowering carbon emissions, whereas Bray-Curtis and Mahalanobis further enhance interpretability at varying efficiency costs. On language models, cosine-based harmonic losses improve gradient and learning stability, strengthen representation structure, and reduce emissions relative to cross-entropy and Euclidean heads. Our code is available at: https://anonymous.4open.science/r/rethinking-harmonic-loss-5BAB/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper extends the harmonic loss framework—originally based on Euclidean distance—by systematically replacing it with alternative distance metrics including cosine, Bray-Curtis, and Mahalanobis. It evaluates the resulting distance-tailored losses on vision backbones and large language models using a three-way lens of performance, interpretability, and sustainability (carbon emissions). The central claims are that cosine distance yields the best accuracy-emissions trade-off on vision tasks, that Bray-Curtis and Mahalanobis improve interpretability at varying efficiency costs, and that cosine-based harmonic losses enhance gradient stability, representation structure, and emission reduction on language models relative to both cross-entropy and the original Euclidean harmonic loss.

Significance. If the empirical isolation of the distance metric holds and the reported gains are robust, the work would offer a practical, low-overhead route to more interpretable and potentially greener training objectives. The emphasis on sustainability metrics alongside accuracy and interpretability is timely, and the public code release supports reproducibility. However, the absence of tabulated quantitative results, error bars, or explicit hyperparameter controls in the provided abstract limits immediate assessment of whether the claimed advantages survive controlled re-implementation.

major comments (2)
  1. [Experiments and Results] The central empirical claims rest on the assumption that only the distance function inside the harmonic loss is changed while architecture, optimizer, learning-rate schedule, and regularization remain identical across variants. Different metrics possess distinct ranges and gradient magnitudes; without explicit verification that a single hyperparameter set was used (or that per-metric retuning was avoided), observed gains in accuracy and stability could arise from better effective optimization rather than the metric itself. This assumption is load-bearing for the three-way evaluation narrative.
  2. [Abstract] The abstract asserts consistent accuracy improvements and emission reductions for cosine-based losses, yet supplies no quantitative tables, error bars, statistical tests, or baseline implementation details. This makes it impossible to verify whether the claimed improvements are robust or affected by post-hoc metric selection, directly undermining the soundness of the performance and sustainability conclusions.
minor comments (1)
  1. [Abstract] The code link is given as an anonymous repository; the manuscript should clarify whether the released code includes the exact training scripts, hyperparameter files, and carbon-emission measurement routines used to generate the reported figures.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. Below we respond point-by-point to the major concerns raised, clarifying our methodology and outlining planned revisions.

read point-by-point responses
  1. Referee: [Experiments and Results] The central empirical claims rest on the assumption that only the distance function inside the harmonic loss is changed while architecture, optimizer, learning-rate schedule, and regularization remain identical across variants. Different metrics possess distinct ranges and gradient magnitudes; without explicit verification that a single hyperparameter set was used (or that per-metric retuning was avoided), observed gains in accuracy and stability could arise from better effective optimization rather than the metric itself. This assumption is load-bearing for the three-way evaluation narrative.

    Authors: The manuscript details in Section 4 that a single set of hyperparameters, including learning rate schedule and regularization, was used for all distance variants to isolate the effect of the metric. We avoided per-metric retuning. Distance normalization is applied within the loss to handle varying ranges and gradients. We will include a dedicated paragraph in the revision explicitly verifying this setup and discussing its implications for the evaluation. revision: yes

  2. Referee: [Abstract] The abstract asserts consistent accuracy improvements and emission reductions for cosine-based losses, yet supplies no quantitative tables, error bars, statistical tests, or baseline implementation details. This makes it impossible to verify whether the claimed improvements are robust or affected by post-hoc metric selection, directly undermining the soundness of the performance and sustainability conclusions.

    Authors: While the abstract prioritizes brevity, the full manuscript contains tables with quantitative results, error bars from repeated experiments, and baseline details in Sections 5 and 6, along with the public code. We will revise the abstract to incorporate key quantitative findings and references to these sections to better support the claims. revision: yes
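
The first response refers to distance normalization applied within the loss to handle the differing ranges and gradient magnitudes of the metrics. The manuscript's exact scheme is not reproduced on this page, so the snippet below is only an illustrative guess at what such normalization could look like: batch-wise rescaling of the distances before the harmonic probabilities are formed.

```python
# Purely illustrative: one way "distance normalization within the loss" could be realized, so that
# metrics with very different ranges (e.g. cosine distance in [0, 2] vs. unbounded Euclidean
# distance) yield comparably scaled probabilities and gradients. This is a guess at the idea,
# not the manuscript's actual scheme.
import torch
import torch.nn.functional as F


def normalized_harmonic_loss(distances, targets, n=2.0, eps=1e-8):
    # distances: (batch, num_classes) from any metric; targets: (batch,) class indices.
    d = distances / (distances.mean().detach() + eps)  # rescale each batch to unit mean distance
    log_p = -n * torch.log(d.clamp_min(eps))
    log_p = log_p - torch.logsumexp(log_p, dim=-1, keepdim=True)
    return F.nll_loss(log_p, targets)
```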

Circularity Check

0 steps flagged

No circularity: purely empirical comparisons with no derivation or self-referential reduction

full rationale

The paper offers no derivation chain or equations that reduce a claimed prediction to fitted inputs defined inside the work. All results rest on experimental evaluations of distance metrics (cosine, Bray-Curtis, Mahalanobis) substituted into an existing harmonic loss on vision backbones and language models, reporting accuracy, stability, interpretability, and emissions. No self-citation is load-bearing for a uniqueness theorem or ansatz; the harmonic loss itself is referenced as prior work without the present claims depending on an unverified self-referential step. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the work is presented as an empirical replacement of one distance function by others inside an existing loss.

pith-pipeline@v0.9.0 · 5540 in / 1139 out tokens · 42348 ms · 2026-05-15T12:55:08.338003+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 3 internal anchors

  1. [1]

    Charu C. Aggarwal, Alexander Hinneburg, and Daniel A. Keim. 2001. On the Surprising Behavior of Distance Metrics in High Dimensional Spaces. In ICDT (Lecture Notes in Computer Science, Vol. 1973). Springer, 420–434.

  2. [2]

    Nazhir Amaya-Tejera, Margarita Gamarra, Jorge I. Vélez, and Eduardo Zurek. 2024. A distance-based kernel for classification via Support Vector Machines. Frontiers in Artificial Intelligence 7 (2024), 1287875.

  3. [3]

    David D. Baek, Ziming Liu, Riya Tyagi, and Max Tegmark. 2025. Harmonic Loss Trains Interpretable AI Models. arXiv preprint arXiv:2502.01628 (2025). doi:10.48550/arXiv.2502.01628

  4. [4]

    Adrien Bardes, Jean Ponce, and Yann LeCun. 2022. VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net. https://openreview.net/forum?id=xm6YD62D1Ub

  5. [5]

    Magnus Bengtsson. 2025. Compressing Large Language Models with PCA Without Performance Loss. arXiv:2508.04307. https://arxiv.org/abs/2508.04307

  6. [6]

    Leonard Bereska and Stratis Gavves. 2024. Mechanistic Interpretability for AI Safety - A Review. Trans. Mach. Learn. Res. 2024 (2024).

  7. [7]

    Malik Boudiaf, Jérôme Rony, Imtiaz Masud Ziko, Eric Granger, Marco Pedersoli, Pablo Piantanida, and Ismail Ben Ayed. 2020. A unifying mutual information view of metric learning: cross-entropy vs. pairwise losses. In European Conference on Computer Vision. Springer, 548–564.

  8. [8]

    Anne Chao, Robin L. Chazdon, Robert K. Colwell, and Tsung-Jen Shen. 2010. An additive decomposition formula for the Bray–Curtis dissimilarity and their ecological meaning. Ecological Modelling 221, 9 (2010), 1275–1283.

  9. [9]

    Hanting Chen, Yunhe Wang, Chunjing Xu, Boxin Shi, Chao Xu, Qi Tian, and Chang Xu. 2020. AdderNet: Do We Really Need Multiplications in Deep Learning? In CVPR. Computer Vision Foundation / IEEE, 1465–1474.

  10. [10]

    Kwantae Cho, Jong-hyuk Roh, Youngsam Kim, and Sangrae Cho. 2019. A performance comparison of loss functions. In 2019 International Conference on Information and Communication Technology Convergence (ICTC). IEEE, 1146–1151.

  11. [11]

    Hongjun Choi, Anirudh Som, and Pavan K. Turaga. 2020. AMC-Loss: Angular Margin Contrastive Loss for Improved Explainability in Image Classification. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2020, Seattle, WA, USA, June 14-19, 2020. Computer Vision Foundation / IEEE, 3659–3666. doi:10.1109/CVPRW50498.2020.00427

  12. [12]

    Collin Coil, Kamil Faber, Bartlomiej Sniezynski, and Roberto Corizzo. 2025. Distance-based change point detection for novelty detection in concept-agnostic continual anomaly detection. Journal of Intelligent Information Systems (2025), 1–39.

  13. [13]

    Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. 2019. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4690–4699.

  14. [14]

    Yinpeng Dong, Hang Su, Jun Zhu, and Bo Zhang. 2017. Improving interpretability of deep neural networks with semantic information. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4306–4314.

  15. [15]

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. 2022. Toy Models of Superposition. Transformer Circuits Thread (2022). https://transformer-circuits.pub/...

  16. [16]

    Fartash Faghri, David Duvenaud, David J. Fleet, and Jimmy Ba. 2020. A Study of Gradient Variance in Deep Learning. CoRR abs/2007.04532 (2020). arXiv:2007.04532. https://arxiv.org/abs/2007.04532

  17. [17]

    FAR AI. 2023. Uncovering Latent Human Wellbeing in LLM Embeddings. https://far.ai/news/uncovering-latent-human-wellbeing-in-llm-embeddings. Shows first principal component of GPT-3 embeddings correlates with ethics/well-being labels.

  18. [18]

    Alessandro Fuschi, Alessandra Merlotti, and Daniel Remondini. 2025. Microbiome data: tell me which metrics and I will tell you which communities. ISME Communications 5, 1 (2025), ycaf125.

  19. [19]

    Quentin Garrido, Randall Balestriero, Laurent Najman, and Yann LeCun. 2023. RankMe: Assessing the Downstream Performance of Pretrained Self-Supervised Representations by Their Rank. In ICML (Proceedings of Machine Learning Research, Vol. 202). PMLR, 10929–10974.

  20. [20]

    Arie Giloni and Manfred Padberg. 2003. The finite sample breakdown point of ℓ1-regression. SIAM Journal on Optimization, Vol. 14. SIAM, 608–620.

  21. [21]

    María José Gómez-Silva, Arturo de la Escalera, and José María Armingol. 2021. Back-propagation of the Mahalanobis distance through a deep triplet learning model for person re-identification. Integrated Computer-Aided Engineering 28, 3 (2021), 277–288.

  22. [22]

    Santiago Gonzalez and Risto Miikkulainen. 2020. Improved training speed, accuracy, and data utilization through loss function optimization. In 2020 IEEE Congress on Evolutionary Computation (CEC). IEEE, 1–8.

  23. [23]

    Misgina Tsighe Hagos, Niamh Belton, Kathleen M. Curran, and Brian Mac Namee. 2023. Distance-aware explanation based learning. In 2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 279–286.

  24. [24]

    Yihan He, Yuan Cao, Hong-Yu Chen, Dennis Wu, Jianqing Fan, and Han Liu. 2024. Can Transformers Perform PCA? https://openreview.net/forum?id=mjDNVksC5G. ICLR 2025 Conference Withdrawn Submission.

  25. [25]

    Li-Yu Hu, Min-Wei Huang, Shih-Wen Ke, and Chih-Fong Tsai. 2016. The distance function effect on k-nearest neighbor classification for medical datasets. SpringerPlus 5, 1 (2016), 1304.

  26. [26]

    Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. 2024. Sparse Autoencoders Find Highly Interpretable Features in Language Models. In International Conference on Learning Representations (ICLR), Poster. https://openreview.net/forum?id=F76bwRSLeK

  27. [27]

    Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical Reparameterization with Gumbel-Softmax. In ICLR (Poster). OpenReview.net.

  28. [28]

    Katarzyna Janocha and Wojciech Marian Czarnecki. 2017. On loss functions for deep neural networks in classification. arXiv preprint arXiv:1702.05659 (2017).

  29. [29]

    Mingyu Jin, Qinkai Yu, Jingyuan Huang, Qingcheng Zeng, Zhenting Wang, Wenyue Hua, Haiyan Zhao, Kai Mei, Yanda Meng, Kaize Ding, Fan Yang, Mengnan Du, and Yongfeng Zhang. 2025. Exploring Concept Depth: How Large Language Models Acquire Knowledge and Concept at Different Layers? In Proceedings of the 31st International Conference on Computational Linguistic...

  30. [30]

    Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. 2022. Understanding Dimensional Collapse in Contrastive Self-supervised Learning. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net. https://openreview.net/forum?id=YevsQ05DEN7

  31. [31]

    Ole Jorgensen. 2023. Understanding and Controlling the Activations of Language Models. Ph.D. Dissertation. Imperial College London. https://ojorgensen.github.io/assets/pdfs/Imperial_Dissertation.pdf

  32. [32]

    Vandana Kalra, Indu Kashyap, and Harmeet Kaur. 2022. Effect of distance measures on K-nearest neighbour classifier. In 2022 Second International Conference on Computer Science, Engineering and Applications (ICCSEA). IEEE, 1–7.

  33. [33]

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 6769–6781.

  34. [34]

    S. L. Keeling and K. Kunisch. 2016. Robust ℓ1 approaches to computing the geometric median and principal and independent components. Journal of Mathematical Imaging and Vision 56, 2 (2016), 286–300.

  35. [35]

    Godfrey N. Lance and William T. Williams. 1967. A General Theory of Classificatory Sorting Strategies: 1. Hierarchical Systems. Comput. J. 9, 4 (1967), 373–380.

  36. [36]

    Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. 2018. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in Neural Information Processing Systems 31 (2018).

  37. [37]

    Yunwen Lei and Yiming Ying. 2020. Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event (Proceedings of Machine Learning Research, Vol. 119). PMLR, 5809–5819. http://proceedings.mlr.press/v119/lei20c.html

  38. [38]

    Junhong Liu, Yijie Lin, Liang Jiang, Jia Liu, Zujie Wen, and Xi Peng. 2022. Improve Interpretability of Neural Networks via Sparse Contrastive Coding. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Compu...

  39. [39]

    Chunjie Luo, Jianfeng Zhan, Xiaohe Xue, Lei Wang, Rui Ren, and Qiang Yang. 2018. Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks. In ICANN (1) (Lecture Notes in Computer Science, Vol. 11139). Springer, 382–391.

  40. [40]

    Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2016. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712 (2016).

  41. [41]

    Dimity Miller, Niko Sunderhauf, Michael Milford, and Feras Dayoub. 2021. Class anchor clustering: A loss for distance-based open set recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 3570–3578.

  42. [42]

    Ibrahim Omara, Ahmed Hagag, Guangzhi Ma, Fathi E. Abd El-Samie, and Enmin Song. 2021. A novel approach for ear recognition: learning Mahalanobis distance features from deep CNNs. Mach. Vis. Appl. 32, 1 (2021), 38.

  43. [43]

    Tianyu Pang, Chao Du, Yinpeng Dong, and Jun Zhu. 2018. Max-Mahalanobis linear discriminant analysis networks. In International Conference on Machine Learning. PMLR, 4016–4025.

  44. [44]

    Eileen Paula, Jayesh Soni, Himanshu Upadhyay, and Leonel Lagos. 2025. Comparative analysis of model compression techniques for achieving carbon efficient AI. Scientific Reports 15, 1 (2025), 23461.

  45. [45]

    Sun Pei-Xia, Lin Hui-Ting, and Luo Tao. 2016. Learning discriminative CNN features and similarity metrics for image retrieval. In 2016 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC). IEEE, 1–5.

  46. [46]

    Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. 2022. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. CoRR abs/2201.02177 (2022).

  47. [47]

    Kazi Rafat, Sadia Islam, Abdullah Al Mahfug, Md Ismail Hossain, Fuad Rahman, Sifat Momen, Shafin Rahman, and Nabeel Mohammed. 2023. Mitigating carbon footprint for knowledge distillation based deep learning model compression. PLOS ONE 18, 5 (2023), e0285668.

  48. [48]

    Kanchana Ranasinghe, Muzammal Naseer, Munawar Hayat, Salman H. Khan, and Fahad Shahbaz Khan. 2021. Orthogonal Projection Loss. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE, 12313–12323. doi:10.1109/ICCV48922.2021.01211

  49. [49]

    Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 3982–3992.

  50. [50]

    Olivier Roy and Martin Vetterli. 2007. The effective rank: A measure of effective dimensionality. In 2007 15th European Signal Processing Conference. IEEE, 606–610.

  51. [51]

    Cynthia Rudin. 2019. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Nature Machine Intelligence 1, 5 (2019), 206–215.

  52. [52]

    Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2020. Green AI. Commun. ACM 63, 12 (2020), 54–63.

  53. [53]

    Kuncheng Song, Fred A. Wright, and Yi-Hui Zhou. 2020. Systematic comparisons for composition profiles, taxonomic levels, and machine learning methods for microbiome-based disease prediction. Frontiers in Molecular Biosciences 7 (2020), 610845.

  54. [54]

    Arthur Templeton et al. 2023. Sparse Autoencoders Find Highly Interpretable Directions in Language Models. https://www.alignmentforum.org/posts/Qryk6FqjtZk9FHHJR/sparse-autoencoders-find-highly-interpretable-directions-in

  55. [55]

    Alex Turntrout. 2023. Steering GPT-2-XL by Adding an Activation Vector. https://turntrout.com/gpt2-steering-vectors

  56. [56]

    Anil Verma, Sumit Kumar Singh, Rupesh Kumar Sah, Rajiv Misra, and TN Singh. 2024. Performance Comparison of Deep Learning Models for CO2 Prediction: Analyzing Carbon Footprint with Advanced Trackers. In 2024 IEEE International Conference on Big Data (BigData). IEEE, 4429–4437.

  57. [57]

    Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. 2018. CosFace: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5265–5274.

  58. [58]

    Irene Wang, Newsha Ardalani, Mostafa Elhoushi, Daniel Jiang, Samuel Hsia, Ekin Sumbul, Divya Mahajan, Carole-Jean Wu, and Bilge Acun. 2025. CATransformers: Carbon Aware Transformers Through Joint Model-Hardware Optimization. arXiv preprint arXiv:2505.01386 (2025). doi:10.48550/arXiv.2505.01386. Journal reference: NeurIPS 2025.

  59. [59]

    Xiaorong Wang, Clara Na, Emma Strubell, Sorelle Friedler, and Sasha Luccioni. 2023. Energy and carbon considerations of fine-tuning BERT. arXiv preprint arXiv:2311.10267 (2023).

  60. [60]

    Yunshi Wen, Tengfei Ma, Ronny Luss, Debarun Bhattacharjya, Achille Fokoue, and Anak Agung Julius. 2025. Shedding Light on Time Series Classification using Interpretability Gated Networks. In ICLR. OpenReview.net.

  61. [61]

    Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. 2016. A Discriminative Feature Learning Approach for Deep Face Recognition. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, 499–515. doi:10.1007/978-3-319-46478-7_31

  62. [62]

    Jinfeng Ye, Tao Li, Tao Xiong, and Ravi Janardan. 2012. A pure L1-norm principal component analysis. Computational Statistics & Data Analysis 56, 12 (2012), 4474–4486.

  63. [63]

    Quanshi Zhang, Ying Nian Wu, and Song-Chun Zhu. 2018. Interpretable convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8827–8836.

  64. [64]

    Yu Zhang, Peter Tiňo, Aleš Leonardis, and Ke Tang. 2021. A survey on neural network interpretability. IEEE Transactions on Emerging Topics in Computational Intelligence 5, 5 (2021), 726–742.

  65. [65]

    Stop the tracker and record interval-level metrics: emissions (kg CO2), duration, estimated CPU/GPU/RAM power and energy

  66. [66]

    Log cumulative emissions and training metrics (loss, lr) to W&B (if enabled)

  67. [67]

    At the end of training, we stop the tracker one final time and persist all accumulated records to a CSV (emissions_*.csv) alongside model checkpoints

    Restart the tracker for the next interval to avoid long-running file locks and to attribute emissions to training phases cleanly. At the end of training, we stop the tracker one final time and persist all accumulated records to a CSV (emissions_*.csv) alongside model checkpoints. Key Configurations (Reproducibility): The following knobs are saved in run con...

  68. [68]

    Training and test accuracy rise together, indicating that the model discovers the algorithmic rule rather than memorizing individual cases

    Reduced grokking or complete elimination of delayed generalization. Training and test accuracy rise together, indicating that the model discovers the algorithmic rule rather than memorizing individual cases.

  69. [69]

    The emergence of a low-dimensional circular manifold with EV close to 1.0 serves as a quantitative and visual certificate of representation clarity

    Improved interpretability via stable geometric structure. The emergence of a low-dimensional circular manifold with EV close to 1.0 serves as a quantitative and visual certificate of representation clarity. These results reinforce the core claims of the paper: harmonic losses promote structured, prototype-aligned representations and smoother, more reliab...

  70. [70]

    using 2D PCA, with class prototypes overlaid as markers. For the Euclidean harmonic head, the class clusters are roughly spherical and separated by (approximately) straight boundaries in the projection: decision regions are controlled mainly by radial distance to each prototype, yielding isotropic attraction basins around each center. Under Cosine harmoni...

  71. [71]

    Concept probing and visualization. Projections onto top PCs often align with semantically meaningful contrasts; e.g., the first PC of GPT-style embeddings correlated with human well-being judgments in zero-shot tests [17], and per-layer PCA can reconstruct or predict response modes in GPT-2 [31].

  72. [72]

    Tracking subspace distance across checkpoints detects representational drift during fine-tuning or domain shift

    Diagnosing and localizing phenomena. Layer-wise or head-wise PCA reveals where variance concentrates, helping localize the depth at which concepts emerge or consolidate (complementary to linear probing) [29]. Tracking subspace distance across checkpoints detects representational drift during fine-tuning or domain shift.

  73. [73]

    PCA is most compelling under: a) approximately linear feature superposition and b) high signal-to-noise in dominant directions

    Sanity checks and baselines. With growing interest in sparse autoencoders (SAEs) for monosemantic features [26], PCA serves as a transparent baseline decomposition: if SAEs meaningfully improve sparsity/faithfulness over PCA while matching reconstruction, that strengthens the interpretability claim [54]. PCA is most compelling under: a) approximately li...
    Sanity checks and baselines.With growing interest in sparse autoencoders (SAEs) for monosemantic features [26], PCA serves as a transparent baseline decomposition: if SAEs 63 meaningfully improve sparsity/faithfulness over PCA while matching reconstruction, that strengthens the interpretability claim [54]. PCA is most compelling under: a) approximately li...