pith. machine review for the scientific record. sign in

arxiv: 2605.11608 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.AI· cs.LG

Recognition: 2 theorem links

· Lean Theorem

PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head

Authors on Pith no claims yet

Pith reviewed 2026-05-13 01:31 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG
keywords LLM post-trainingdrift decompositionquantizationLoRArisk boundgeometric analysiscatastrophic forgettingrepresentation similarity
0
0 comments X

The pith

PRISM derives a closed-form upper bound on LLM risk gaps by decomposing post-training drift into scale mismatch, shape mismatch, and head divergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to diagnose the specific ways post-training changes alter LLM behavior rather than simply detecting that performance has dropped. It does so by exploiting the linear output head together with the near-isometric character of the backbone to produce an explicit upper bound on the increase in cross-entropy risk. The bound factors the total drift into three separately measurable terms, each tied to a concrete failure mode such as shape distortion from aggressive quantization or head divergence from particular quantization formats. A reader would care because the dominant term then points toward a targeted remedy and because the shape term is differentiable enough to serve as a regularizer that reduces forgetting.

Core claim

PRISM exploits the linear output head of LLMs and the empirically near-isometric structure of their backbones to derive a closed-form upper bound on the cross-entropy risk gap between a target model and a post-training variant. The bound decomposes drift into scale mismatch, shape mismatch, and head divergence. Each axis corresponds to a distinct failure mode, including shape distortion under low-bit quantization, scale separability under LoRA forgetting, and head divergence under GGUF k-quantization. The same geometry yields variant rankings with mean Spearman correlations of 0.820 for quantization and 0.831 for LoRA forgetting and supplies a differentiable shape regularizer that mitigates忘

What carries the argument

PRISM, the geometric upper bound obtained by mapping representation drift through the linear head to isolate scale, shape, and head components.

If this is right

  • Variants can be ranked by their total drift score with mean Spearman correlations above 0.82 to measured risk on quantization and LoRA tasks.
  • The axis with the largest contribution indicates a concrete remediation direction such as adjusting bit width for shape issues.
  • The differentiable shape term can be inserted directly into training as a regularizer that outperforms experience replay at preserving downstream performance.
  • Each of the three components can be measured independently on new variants without retraining the full model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the near-isometric property extends to additional model families, the same decomposition could be applied to distillation or full fine-tuning pipelines.
  • The three-axis view could help practitioners choose among competing post-training methods by inspecting only the dominant mismatch rather than running complete evaluations.
  • Analogous bounds might be derived for losses other than cross-entropy by substituting the appropriate head mapping.

Load-bearing premise

The backbones of LLMs possess a near-isometric structure that allows the linear head to produce a closed-form risk bound.

What would settle it

A set of post-trained models on which the PRISM-computed bound fails to upper-bound the measured cross-entropy risk gap or on which the Spearman correlation with actual risk falls below 0.6.

Figures

Figures reproduced from arXiv: 2605.11608 by Chieh-Yen Lin, Shao-Hua Sun.

Figure 1
Figure 1. Figure 1: PRISM (Proxy Risk Inference via Structural Mapping) decomposition of the risk gap. For any orthogonal alignment W ∈ O(d), the cross-entropy risk gap |RT −RP | is bounded (Thm. 1) by a feature alignment error δ—decomposed exactly into scale mismatch (∆ρ) 2 and shape mismatch 2ρT ρP (1−ΩW ) (Prop. 1)—plus a head discrepancy γ=Kpred∥Σ 1/2 P ∆H∥F where ∆H=W HT −HP . The main text uses the identity alignment W=… view at source ↗
Figure 2
Figure 2. Figure 2: The PRISM bound B tracks the empirical risk gap across two model families and five benchmarks. Each subplot scatters the PRISM bound B (x-axis, log) against the empirical cross-entropy risk gap |∆R| (y-axis, log). Each point is one quantization variant; colors denote PTQ family (GGUF / GPTQ / BitsAndBytes). Rows: Llama-3.1-8B, Qwen3-8B. Columns: ARC, MMLU, SQuAD, TriviaQA, GSM8K. Per-subplot Spearman rs is… view at source ↗
Figure 3
Figure 3. Figure 3: Llama-3.1-8B: the PRISM bound tracks catastrophic forgetting across LoRA fine￾tuning steps. Each subplot scatters the bound B (x-axis, log) against the empirical forgetting |∆R| (y-axis, log) on a downstream benchmark, with one point per LoRA checkpoint colored by training step. Rows: fine-tuning task (TruthfulQA, BBQ). Columns: downstream benchmark (ARC, MMLU, SQuAD, TriviaQA, GSM8K). Under LoRA’s frozen … view at source ↗
Figure 4
Figure 4. Figure 4: Shape regularization vs. replay-CE on Llama-3.1-8B. LoRA fine-tuning on TruthfulQA (top) and BBQ (bottom) under three configurations: no reg (anchor), the replay baseline, and our trace (Eq. 8); the latter two share a 32-sample reference set and are each method’s sweep |∆R|-best. Our trace cuts downstream mean |∆R| further than the replay baseline; per-benchmark Ω and |∆R| in [PITH_FULL_IMAGE:figures/full… view at source ↗
Figure 5
Figure 5. Figure 5: replicates the main quantization scatter ( [PITH_FULL_IMAGE:figures/full_fig_p023_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Feature alignment error δ alone is already highly predictive of the risk gap. Identical layout to [PITH_FULL_IMAGE:figures/full_fig_p024_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qwen3-8B: the PRISM bound tracks catastrophic forgetting across LoRA fine-tuning steps. Each subplot scatters the PRISM bound B (x-axis, log) against the empirical forgetting |∆R| (y-axis, log) on a held-out benchmark, with one point per LoRA checkpoint colored by training step. Rows: fine-tuning task (TruthfulQA, BBQ). Columns: held-out evaluation benchmark (ARC, MMLU, SQuAD, TriviaQA, GSM8K). Because LoR… view at source ↗
Figure 8
Figure 8. Figure 8: Shape regularization vs. replay-CE on Qwen3-8B (replication of [PITH_FULL_IMAGE:figures/full_fig_p033_8.png] view at source ↗
read the original abstract

Comparing post-training LLM variants, such as quantized, LoRA-adapted, and distilled models, requires a diagnostic that identifies how a variant has drifted, not only whether it has degraded. Existing similarity scores such as CKA and SVCCA can flag degradation, but they do not directly link representation drift to risk or mechanism. We propose PRISM, Proxy Risk Inference via Structural Mapping, which exploits the linear output head of LLMs and the empirically near-isometric structure of their backbones to derive a closed-form upper bound on the cross-entropy risk gap between a target model and a post-training variant. The bound is calibrated for variant ranking and decomposes drift into three independently measurable axes: scale mismatch, shape mismatch, and head divergence. Each axis corresponds to a distinct failure mode, including shape distortion under low-bit quantization, scale separability under LoRA forgetting, and head divergence under GGUF k-quantization. As a result, the dominant axis suggests a remediation direction rather than merely raising a degradation flag. Because the shape term is differentiable, the same geometry can also serve as a training-time regularizer against catastrophic forgetting. Across two model families and five benchmarks, PRISM ranks variants with mean Spearman correlations of 0.820 for post-training quantization and 0.831 for LoRA forgetting, and its axis-guided shape regularizer outperforms experience replay in aggregate at mitigating downstream forgetting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PRISM, which exploits the linear output head of LLMs and their empirically near-isometric backbone representations to derive a closed-form upper bound on the cross-entropy risk gap between a target model and post-training variants (e.g., quantized or LoRA-adapted). The bound decomposes drift into scale mismatch, shape mismatch, and head divergence terms, is calibrated for ranking, and yields mean Spearman correlations of 0.820 for quantization and 0.831 for LoRA forgetting across two model families and five benchmarks. The differentiable shape term is additionally used as a training regularizer that outperforms experience replay at mitigating forgetting.

Significance. If the derivation holds and the near-isometry assumption is quantitatively verified, PRISM would offer a useful advance by linking geometric drift directly to risk in a decomposable, actionable way that goes beyond similarity metrics like CKA or SVCCA. The reported ranking correlations and the regularizer result are concrete strengths that demonstrate potential practical value for diagnosing and mitigating specific post-training failure modes.

major comments (2)
  1. [§3] §3 (derivation of the PRISM bound): The closed-form claim depends on the backbone being near-isometric so that inner-product or distance preservation allows elimination of explicit integration over the representation distribution. The manuscript states this only as an empirical observation without reporting a quantitative tolerance (e.g., max or average deviation from isometry) or measured values on the evaluated models. This is load-bearing; material deviation would invalidate the algebraic simplification and make the bound non-closed-form, undermining the interpretation of the reported Spearman correlations as evidence for the three-axis decomposition.
  2. [§5] §5 (experimental results and tables): The mean Spearman correlations of 0.820 and 0.831 are presented without error bars, per-axis ablations, or comparisons to baselines such as direct risk estimation or other geometric measures. It is also unclear whether the 'calibration for variant ranking' introduces any data-dependent fitting that would contradict the parameter-free character implied by the closed-form derivation.
minor comments (2)
  1. [Abstract] The abstract refers to 'five benchmarks' without naming them; this should be stated explicitly for reproducibility.
  2. [§2] Notation for the three axes (scale, shape, head) should be introduced with explicit equations early in the paper to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important points for strengthening the manuscript's rigor. We address each major comment below and will incorporate revisions to provide quantitative verification of the isometry assumption and enhanced experimental details.

read point-by-point responses
  1. Referee: [§3] §3 (derivation of the PRISM bound): The closed-form claim depends on the backbone being near-isometric so that inner-product or distance preservation allows elimination of explicit integration over the representation distribution. The manuscript states this only as an empirical observation without reporting a quantitative tolerance (e.g., max or average deviation from isometry) or measured values on the evaluated models. This is load-bearing; material deviation would invalidate the algebraic simplification and make the bound non-closed-form, undermining the interpretation of the reported Spearman correlations as evidence for the three-axis decomposition.

    Authors: We agree that the near-isometry assumption is central to the closed-form derivation and that quantitative evidence is required. The current manuscript presents this as an empirical observation supported by prior work on LLM representations, but does not include explicit metrics. In the revision, we will add a dedicated subsection with measured deviations from isometry (e.g., average and maximum relative error in inner-product preservation and Euclidean distance preservation) computed on the evaluated models and layers. These will be reported for both model families across the benchmarks. If deviations remain small, this will corroborate the approximation; we will also discuss sensitivity of the bound to larger deviations. revision: yes

  2. Referee: [§5] §5 (experimental results and tables): The mean Spearman correlations of 0.820 and 0.831 are presented without error bars, per-axis ablations, or comparisons to baselines such as direct risk estimation or other geometric measures. It is also unclear whether the 'calibration for variant ranking' introduces any data-dependent fitting that would contradict the parameter-free character implied by the closed-form derivation.

    Authors: We acknowledge these gaps in statistical reporting and comparative analysis. In the revised manuscript, we will add error bars to the Spearman correlations (via bootstrap resampling over variants or seeds where applicable) and include per-axis ablations showing the ranking contribution of scale, shape, and head terms individually. We will also add comparisons to baselines including CKA, SVCCA, and direct risk estimation. Regarding calibration: it consists of a fixed, monotonic post-hoc scaling (derived once from a small held-out set of variants) solely to improve interpretability for ranking; the core bound remains parameter-free and closed-form. We will clarify the exact procedure and confirm it does not involve per-experiment fitting on evaluation data. revision: yes

Circularity Check

0 steps flagged

PRISM derivation is self-contained from linear-head and isometry assumptions

full rationale

The paper states that it exploits the linear output head of LLMs together with the empirically near-isometric structure of their backbones to derive a closed-form upper bound on the cross-entropy risk gap, which is then calibrated for ranking and validated on post-training quantization and LoRA forgetting tasks. No equation or step is shown to reduce by construction to a fitted parameter, a self-citation chain, or a renaming of the input data; the bound is presented as an algebraic consequence of the stated geometric assumptions, with the reported Spearman correlations serving as external empirical checks rather than tautological outputs of the derivation itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review performed on abstract only; full derivation and any fitted quantities or additional assumptions are unavailable.

axioms (2)
  • domain assumption LLM output head is linear
    Exploited to obtain closed-form bound
  • domain assumption Backbone representations are empirically near-isometric
    Stated as empirical observation enabling the structural mapping

pith-pipeline@v0.9.0 · 5557 in / 1379 out tokens · 46256 ms · 2026-05-13T01:31:27.788849+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 7 internal anchors

  1. [1]

    GPTQ: Accurate post-training quantization for generative pre-trained transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. InInternational Conference on Learning Representations (ICLR), 2023

  2. [2]

    LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 30318–30332. Curran Associates, Inc., 2022

  3. [3]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations (ICLR), 2022

  4. [4]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  5. [5]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  6. [6]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  7. [7]

    SVCCA: Singu- lar vector canonical correlation analysis for deep learning dynamics and interpretability

    Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singu- lar vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems (NeurIPS), 2017

  8. [8]

    Similarity of neural network representations revisited

    Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. InInternational Conference on Machine Learning (ICML), 2019

  9. [9]

    Grounding representation similarity with statistical testing

    Frances Ding, Jean-Stanislas Denain, and Jacob Steinhardt. Grounding representation similarity with statistical testing. InAdvances in Neural Information Processing Systems (NeurIPS), 2021

  10. [10]

    Reliability of CKA as a similarity measure in deep learning

    MohammadReza Davari, Stefan Horoi, Amine Natik, Guillaume Lajoie, Guy Wolf, and Eugene Belilovsky. Reliability of CKA as a similarity measure in deep learning. InInternational Conference on Learning Representations (ICLR), 2023

  11. [11]

    Similarity of neural network models: A survey of functional and representational measures.ACM Computing Surveys, 57(9):1–52, 2025

    Max Klabunde, Tobias Schumacher, Markus Strohmaier, and Florian Lemmerich. Similarity of neural network models: A survey of functional and representational measures.ACM Computing Surveys, 57(9):1–52, 2025. doi: 10.1145/3728458

  12. [12]

    The linear representation hypothesis and the geometry of large language models

    Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. InInternational Conference on Machine Learning (ICML), 2024

  13. [13]

    Relative representations enable zero-shot latent space communication

    Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, Francesco Locatello, and Emanuele Rodolà. Relative representations enable zero-shot latent space communication. In International Conference on Learning Representations (ICLR), 2023

  14. [14]

    Generalized shape metrics on neural representations

    Alex H Williams, Erin Kunz, Simon Kornblith, and Scott W Linderman. Generalized shape metrics on neural representations. InAdvances in Neural Information Processing Systems (NeurIPS), 2021. 10

  15. [15]

    What representational similarity measures imply about decodable information

    Sarah E Harvey, David Lipshutz, and Alex H Williams. What representational similarity measures imply about decodable information. InProceedings of UniReps: the Second Edition of the Workshop on Unifying Representations in Neural Models, volume 285 ofProceedings of Machine Learning Research, pages 140–151. PMLR, 2024

  16. [16]

    Position: The platonic representation hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. Position: The platonic representation hypothesis. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 20617–20642. PMLR, 2024

  17. [17]

    SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. InInternational Conference on Machine Learning (ICML), pages 38087–38099. PMLR, 2023

  18. [18]

    Overcoming catastrophic forgetting in neural networks.Proceedings of the National Academy of Sciences (PNAS), 114(13):3521–3526, 2017

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the National Academy of Sciences (PNAS), 114(13):3521–3526, 2017

  19. [19]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  20. [20]

    Training compute-optimal large language models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

  21. [21]

    tinyBenchmarks: Evaluating LLMs with fewer examples

    Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinyBenchmarks: Evaluating LLMs with fewer examples. InInternational Confer- ence on Machine Learning (ICML), 2024

  22. [22]

    Efficient benchmarking (of language models)

    Yotam Perlitz, Elron Bandel, Ariel Gera, Ofir Arviv, Liat Ein-Dor, Eyal Shnarch, Noam Slonim, Michal Shmueli-Scheuer, and Leshem Choshen. Efficient benchmarking (of language models). InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), page...

  23. [23]

    Weak-to-strong generalization: Eliciting strong capabilities with weak supervision

    Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschen- brenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeffrey Wu. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. In International Conference on Machine Learning (ICML), 2024

  24. [24]

    Ministral 3

    Alexander H Liu et al. Ministral 3.arXiv preprint arXiv:2601.08584, 2026

  25. [25]

    Available: http://dx.doi.org/10.1038/s41586-025-09422-z

    DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.Nature, 645:633–638, 2025. doi: 10.1038/s41586-025-09422-z

  26. [26]

    Bradley Efron and Robert J Tibshirani.An introduction to the bootstrap, volume

    Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. InProceedings of the 60th Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland, 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.229

  27. [27]

    Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thomp- son, Phu Mon Htut, and Samuel R. Bowman. BBQ: A hand-built bias benchmark for ques- tion answering. InFindings of the Association for Computational Linguistics: ACL 2022, pages 2086–2105, Dublin, Ireland, 2022. Association for Computational Linguistics. doi: 10.1865...

  28. [28]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations (ICLR), 2021. 11

  29. [29]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

  30. [30]

    T rivia QA : A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, 2017. Association for Computational Linguistics. do...

  31. [31]

    SQuAD : 100,000+ questions for machine comprehension of text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264

  32. [32]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mo Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  33. [33]

    A generalized solution of the orthogonal Procrustes problem.Psychome- trika, 31(1):1–10, 1966

    Peter H Schönemann. A generalized solution of the orthogonal Procrustes problem.Psychome- trika, 31(1):1–10, 1966

  34. [34]

    Oxford University Press, 2004

    John C Gower and Garmt B Dijksterhuis.Procrustes Problems. Oxford University Press, 2004

  35. [35]

    effective heads

    Kawin Ethayarajh. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019. 12 Appendix Table of Contents A Proof of the unified risk bound (Theorem 1) 13 A.1 Relation to classical Procrustes sha...

  36. [36]

    Larger1− ¯Ω= larger headroom under condition (i). • Relative trace effect ∆|∆R|/|∆R|0 := (|∆R|trace − |∆R|λ=0)/|∆R|λ=0: across-benchmark mean of the trace-vs-no-reg relative change in empirical forgetting at λ=1.0 (negative = trace helps; positive=trace hurts). Three of four settings match the gating prediction directly: trace produces the largest benefit...