pith. machine review for the scientific record.

arxiv: 2605.08048 · v1 · submitted 2026-05-08 · 💻 cs.CL

Recognition: no theorem link

Accurate and Efficient Statistical Testing for Word Semantic Breadth

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:29 UTC · model grok-4.3

classification 💻 cs.CL
keywords: word embeddings · semantic breadth · permutation test · Householder reflection · dispersion · contextual diversity · hypothesis testing

The pith

Aligning mean directions via Householder reflection lets permutation tests isolate true differences in word semantic breadth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When comparing the contextual spread of two words through their token embeddings, differences in average direction can falsely appear as differences in dispersion and produce misleading significance results. The paper introduces a Householder-aligned permutation test that first applies one reflection to match the mean directions of the two clouds and then permutes labels on the aligned data to obtain calibrated p-values for dispersion. This correction lowers Type-I error while keeping the test sensitive to real breadth variation, and a batched GPU version makes the procedure fast enough for practical vocabulary-scale use.

Core claim

Applying a single Householder reflection to align the mean directions of two word-type token clouds, followed by a permutation test on the aligned vectors, produces non-parametric p-values that correctly reflect dispersion differences rather than directional mismatches.
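Read as a hypothesis-testing problem (notation mine, not the paper's): let $X_A, X_B$ be the two token clouds and $D(\cdot)$ a dispersion statistic. The test targets

```latex
H_0 : D(X_A) = D(X_B)
\qquad \text{vs.} \qquad
H_1 : D(X_A) \neq D(X_B),
```

where the mean directions $\mu_A/\|\mu_A\|$ and $\mu_B/\|\mu_B\|$ may differ under both hypotheses; the Householder reflection exists to remove that directional nuisance before labels are permuted.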

What carries the argument

Householder-aligned permutation test: one Householder reflection aligns the two mean vectors so that subsequent label permutations test only for dispersion equality.
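A minimal sketch of this mechanism in NumPy. Everything here is illustrative rather than the paper's exact construction: the dispersion proxy (mean distance to centroid) and all function names are assumptions, and the paper's statistic may differ.

```python
import numpy as np

def householder_align(A, B):
    """Reflect cloud B so its mean direction matches cloud A's.

    A Householder reflection H = I - 2 v v^T (unit v) maps the unit
    mean direction of B onto that of A when v is the normalized
    difference of the two unit means. Illustrative, not the paper's
    exact construction.
    """
    u = B.mean(axis=0); u = u / np.linalg.norm(u)   # B's mean direction
    w = A.mean(axis=0); w = w / np.linalg.norm(w)   # A's mean direction
    v = u - w
    if np.linalg.norm(v) < 1e-12:                   # already aligned
        return B
    v = v / np.linalg.norm(v)
    return B - 2.0 * np.outer(B @ v, v)             # rows mapped by H

def dispersion(X):
    """One common breadth proxy: mean distance of tokens to centroid."""
    return np.linalg.norm(X - X.mean(axis=0), axis=1).mean()

def permutation_pvalue(A, B, n_perm=2000, rng=None):
    """Label-permutation p-value for |dispersion(A) - dispersion(B)|."""
    rng = rng or np.random.default_rng(0)
    X, n_a = np.vstack([A, B]), len(A)
    obs = abs(dispersion(A) - dispersion(B))
    hits = sum(
        abs(dispersion(X[p[:n_a]]) - dispersion(X[p[n_a:]])) >= obs
        for p in (rng.permutation(len(X)) for _ in range(n_perm))
    )
    return (hits + 1) / (n_perm + 1)                # add-one correction
```

Because H maps B's unit mean onto A's, the reflected cloud's mean direction coincides with A's while all within-cloud distances are untouched.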

If this is right

  • Breadth comparisons between words become less likely to report significance when only their average contexts differ in direction.
  • The test remains sensitive to genuine increases in contextual diversity after alignment.
  • A GPU-batched implementation reduces runtime by a factor of 23 compared with a CPU baseline, enabling larger-scale applications.
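A hedged sketch of the kind of GPU batching the last bullet refers to, in PyTorch. This is the generic batched-permutation idea, not the authors' released implementation, and it reuses the illustrative centroid-distance dispersion from the sketch above.

```python
import torch

def batched_permutation_pvalue(A, B, n_perm=10_000, device="cuda"):
    """Evaluate all label permutations of a two-sample dispersion test
    in one batched pass on the GPU (illustrative, not the paper's code)."""
    X = torch.cat([A, B]).to(device)               # (n, d) pooled tokens
    n_a, n = A.shape[0], A.shape[0] + B.shape[0]

    def disp(Y):                                   # Y: (P, m, d) -> (P,)
        return (Y - Y.mean(dim=1, keepdim=True)).norm(dim=2).mean(dim=1)

    obs = (disp(X[None, :n_a]) - disp(X[None, n_a:])).abs()

    # (P, n) random permutations obtained by argsorting uniform noise
    idx = torch.argsort(torch.rand(n_perm, n, device=device), dim=1)
    Xp = X[idx]                                    # (P, n, d) permuted pools
    stat = (disp(Xp[:, :n_a]) - disp(Xp[:, n_a:])).abs()
    return (stat >= obs).float().mean().item()
```

Generating permutations by argsorting noise keeps everything on-device; memory scales as O(n_perm · n · d), so a real implementation would chunk the permutation axis.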

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same single-reflection alignment could be applied to other vector-cloud comparisons where direction must be factored out of a magnitude test.
  • The method offers a route to more reliable automatic sense distinction in dictionaries by grounding decisions on calibrated breadth statistics.
  • Further work could examine whether the alignment step extends without modification to multi-class or time-varying embedding clouds.

Load-bearing premise

Aligning the means with a Householder reflection preserves the dispersion geometry and does not alter the null distribution of the permutation statistic.
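Half of this premise is pure linear algebra: any Householder reflection is orthogonal, so distances (and hence distance-based dispersion statistics) are preserved when a cloud is reflected as a whole. A short check:

```latex
H = I - 2vv^{\top},\ \|v\|=1
\;\Rightarrow\;
H^{\top}H = I - 4vv^{\top} + 4v(v^{\top}v)v^{\top} = I
\;\Rightarrow\;
\|Hx - Hy\| = \|x - y\|\ \text{for all } x, y.
```

The contested half is the null-distribution part: preserving geometry within each cloud does not by itself show that a reflection fitted to the observed means leaves permuted relabelings exchangeable, which is exactly the referee's first major comment below.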

What would settle it

Generate two synthetic clouds with identical dispersion but offset means, run the aligned permutation test repeatedly under the null, and check whether the resulting p-values are uniformly distributed.
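A hedged sketch of that experiment, reusing the illustrative householder_align and permutation_pvalue helpers from the first code block (note it exercises the fixed, align-once variant, which is precisely what the referee report below contests):

```python
import numpy as np
from scipy import stats

def null_calibration_check(n_runs=200, n=100, d=16, seed=0):
    """Repeatedly test two clouds with identical spread but offset
    mean directions; returns a KS test of the p-values against U(0,1).
    Assumes householder_align / permutation_pvalue from the sketch above."""
    rng = np.random.default_rng(seed)
    pvals = []
    for _ in range(n_runs):
        mu_a, mu_b = rng.normal(size=d), rng.normal(size=d)  # offset means
        A = mu_a + rng.normal(size=(n, d))                   # same unit spread
        B = mu_b + rng.normal(size=(n, d))
        pvals.append(permutation_pvalue(A, householder_align(A, B)))
    return stats.kstest(pvals, "uniform")
```

Near-uniform p-values (a small KS statistic) would support calibration; a skew toward small p-values would indicate the fixed-reflection variant is miscalibrated.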

Figures

Figures reproduced from arXiv: 2605.08048 by Yo Ehara.

Figure 1: Illustration of Type-I error inflation in naive …
Figure 2: Comparison of (a) Type-I Error rate and (b) Precision across dispersion ranking gaps from 1 to 10. …
Figure 2 (second extraction; no caption recovered)
Figure 3: Baseline, i.e., t-SNE visualization before ap…
Figure 4: Proposed, i.e., t-SNE visualization after ap…
Original abstract

Measuring the breadth of a word's meaning, or its spread across contexts, has become feasible with contextualized token embeddings. A word type can be represented as a cloud of token vectors, with dispersion-based statistics serving as proxies for contextual diversity (Nagata and Tanaka-Ishii, ACL2025). These measurements are useful for deciding appropriate sense distinctions when constructing thesauri and domain-specific dictionaries. However, when comparing the breadth of two word types, naive hypothesis testing on dispersion can be misleading: differences in semantic direction can masquerade as dispersion differences, inflating Type-I error and yielding "statistically significant" outcomes even when there is no true breadth difference. This is problematic because significance testing should distinguish genuine effects from incidental fluctuations in small-difference regimes. We propose a Householder-aligned permutation test to isolate dispersion differences from directional differences. Our method applies a single Householder reflection to align the mean directions of the two word types and then performs a permutation test on the aligned token clouds, yielding calibrated, non-parametric p-values. For practicality, we introduce a GPU-oriented implementation that batches permutations and linear algebra operations. Empirically, our alignment reduced Type-I error by 32.5% while preserving sensitivity to genuine breadth differences, and achieved a 23x speedup over the CPU baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a Householder-aligned permutation test for comparing semantic breadth (dispersion) between two word types in contextualized embedding spaces. A single Householder reflection is computed from the observed mean vectors to align directional differences, after which a permutation test is run on the fixed aligned token clouds to yield non-parametric p-values for dispersion differences. A GPU-batched implementation is introduced for efficiency. Empirical claims include a 32.5% reduction in Type-I error relative to naive tests while retaining power, plus a 23x speedup over CPU baselines.

Significance. If the alignment procedure preserves exact calibration, the method would address a practical problem in NLP by enabling reliable statistical comparisons of contextual diversity for tasks such as sense inventory construction and domain dictionary building. The work supplies a novel combination of Householder reflections with permutation testing plus a practical GPU implementation that could scale to large embedding corpora.

major comments (2)
  1. [Proposed method (Householder alignment and permutation procedure)] The method description states that a single Householder reflection is computed once from the two observed mean vectors and then applied to produce fixed aligned clouds on which the permutation test is performed. Under the null of equal dispersion (after directional alignment), this fixed transformation violates exchangeability: each permuted pair of clouds has new means, so the original reflection no longer aligns them, and the test statistic is evaluated under a transformation that does not match the permuted data. This directly undermines the claim of 'calibrated, non-parametric p-values'.
  2. [Experiments and empirical validation] The reported 32.5% Type-I error reduction and preservation of sensitivity are presented without simulation details confirming that the alignment step was either (a) recomputed inside the permutation loop or (b) shown to leave the null distribution of the dispersion statistic unchanged. Without such verification, the empirical calibration claim cannot be assessed.
minor comments (2)
  1. The citation 'Nagata and Tanaka-Ishii, ACL2025' appears in the abstract but should be expanded to a full reference entry with title and venue details for completeness.
  2. Consider adding a short pseudocode listing that explicitly shows whether the Householder reflection is recomputed per permutation or held fixed; this would clarify the exact algorithm for readers implementing the test.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on the manuscript. We address each major comment below and describe the revisions we will implement.

Point-by-point responses
  1. Referee: The method description states that a single Householder reflection is computed once from the two observed mean vectors and then applied to produce fixed aligned clouds on which the permutation test is performed. Under the null of equal dispersion (after directional alignment), this fixed transformation violates exchangeability: each permuted pair of clouds has new means, so the original reflection no longer aligns them, and the test statistic is evaluated under a transformation that does not match the permuted data. This directly undermines the claim of 'calibrated, non-parametric p-values'.

    Authors: We appreciate the referee highlighting this critical aspect of permutation test validity. The concern about exchangeability is well-founded: a fixed Householder reflection derived solely from the observed means does not guarantee that the null distribution remains correctly calibrated when applied to permuted samples whose means differ. To resolve this, we will revise the method to recompute the Householder reflection for every permutation using the means of the current permuted clouds. This ensures the alignment procedure is identically applied to both observed and permuted data, restoring exact non-parametric calibration. We will update the method description, Algorithm 1, and the GPU batching implementation accordingly while retaining the overall efficiency gains. revision: yes

  2. Referee: The reported 32.5% Type-I error reduction and preservation of sensitivity are presented without simulation details confirming that the alignment step was either (a) recomputed inside the permutation loop or (b) shown to leave the null distribution of the dispersion statistic unchanged. Without such verification, the empirical calibration claim cannot be assessed.

    Authors: We agree that the current manuscript lacks sufficient detail on the simulation protocol and whether alignment was handled inside the permutation loop. In the revised version we will add a dedicated subsection (and supplementary code) that fully specifies the null simulation design, confirms that the Householder reflection is recomputed per permutation under the updated procedure, and reports the resulting empirical Type-I error rates together with power curves. This will allow direct assessment of calibration and sensitivity. revision: yes
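A minimal sketch of the revised loop described in response 1 (recomputing the reflection for every permuted split), again built on the illustrative helpers from the first code block rather than the paper's Algorithm 1:

```python
import numpy as np

def aligned_permutation_pvalue(A, B, n_perm=2000, rng=None):
    """Permutation test that recomputes the Householder alignment
    for every permuted split, as proposed in the rebuttal.
    Assumes householder_align / dispersion from the sketch above."""
    rng = rng or np.random.default_rng(0)
    obs = abs(dispersion(A) - dispersion(householder_align(A, B)))
    X, n_a = np.vstack([A, B]), len(A)    # pool the *unaligned* tokens
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(len(X))
        Pa, Pb = X[p[:n_a]], X[p[n_a:]]
        Pb = householder_align(Pa, Pb)    # realign this split's means
        hits += abs(dispersion(Pa) - dispersion(Pb)) >= obs
    return (hits + 1) / (n_perm + 1)
```

The only change from the fixed variant is the householder_align call inside the loop, which restores the symmetry between how the observed and permuted splits are processed.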

Circularity Check

0 steps flagged

No circularity: method is an explicit algorithmic construction on standard primitives

full rationale

The paper defines a Householder-aligned permutation test by first computing a single reflection from the two observed mean vectors to align directions, then running a standard permutation test on the transformed clouds. This is presented as a direct procedural description without any equation or result that reduces to a fitted parameter, self-referential definition, or load-bearing self-citation. No ansatz is smuggled, no uniqueness theorem is invoked from prior author work, and no renaming of known results occurs. The empirical claims (Type-I error reduction, speedup) are separate experimental outcomes, not tautological consequences of the method definition itself. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Relies on standard assumptions in embedding spaces and non-parametric statistics; no new entities introduced.

axioms (2)
  • domain assumption Dispersion of contextual token embeddings serves as a proxy for semantic breadth
    Basis for the measurements as stated in the abstract.
  • domain assumption Permutation tests on aligned clouds provide calibrated p-values for dispersion differences
    Core assumption of the statistical method.

pith-pipeline@v0.9.0 · 5516 in / 1296 out tokens · 54861 ms · 2026-05-11T02:29:31.174894+00:00 · methodology


Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages

  1. [1]

    The Theory of Parsing, Translation, and Compiling

    Alfred V. Aho and Jeffrey D. Ullman. The Theory of Parsing, Translation, and Compiling. 1972

  2. [2]

    Publications Manual

    American Psychological Association. Publications Manual. 1983

  3. [3]

    Alternation

    Ashok K. Chandra, Dexter C. Kozen, and Larry J. Stockmeyer. Alternation. Journal of the ACM. 1981. doi:10.1145/322234.322243

  4. [4]

    Scalable training of L1-regularized log-linear models

    Galen Andrew and Jianfeng Gao. Scalable training of L1-regularized log-linear models. Proceedings of the 24th International Conference on Machine Learning. 2007

  5. [5]

    Algorithms on Strings, Trees and Sequences

    Dan Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press. 1997

  6. [6]

    Yara Parser: A Fast and Accurate Dependency Parser

    Mohammad Sadegh Rasooli and Joel R. Tetreault. Yara Parser: A Fast and Accurate Dependency Parser. Computing Research Repository. 2015

  7. [7]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data

    Rie Kubota Ando and Tong Zhang. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research, 6:1817--1853. 2005

  8. [12]

    A Cluster-based Approach for Improving Isotropy in Contextual Embedding Space

    Rajaee, Sara and Pilehvar, Mohammad Taher. A Cluster-based Approach for Improving Isotropy in Contextual Embedding Space. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 2021. doi:10.18653/v1/2021.acl-short.73

  9. [14]

    Statistical Significance Tests for Machine Translation Evaluation

    Koehn, Philipp. Statistical Significance Tests for Machine Translation Evaluation. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 2004

  10. [16]

    On Some Pitfalls in Automatic Evaluation and Significance Testing for MT

    Riezler, Stefan and Maxwell, John T. On Some Pitfalls in Automatic Evaluation and Significance Testing for MT. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 2005

  11. [17]

    Breaking Through the 80% Glass Ceiling: Raising the State of the Art in Word Sense Disambiguation by Incorporating Knowledge Graph Information

    Bevilacqua, Michele and Navigli, Roberto. Breaking Through the 80% Glass Ceiling: Raising the State of the Art in Word Sense Disambiguation by Incorporating Knowledge Graph Information. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.255

  12. [18]

    The British National Corpus, XML edition

    BNC Consortium. The British National Corpus, XML edition. 2007

  13. [19]

    Balanced Corpus of Contemporary Written Japanese

    Maekawa, Kikuo and Yamazaki, Makoto and Ogiso, Toshinobu and Maruyama, Takehiko and Ogura, Hideki and Kashino, Wakako and Koiso, Hanae and Yamaguchi, Masaya and Tanaka, Makiro and Den, Yasuharu. Balanced Corpus of Contemporary Written Japanese. Language Resources and Evaluation, 48:345--371. 2014

  14. [20]

    An Empirical Investigation of Statistical Significance in NLP

    Berg-Kirkpatrick, Taylor and Burkett, David and Klein, Dan. An Empirical Investigation of Statistical Significance in NLP. Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. 2012

  15. [21]

    WordNet: A Lexical Database for English

    Miller, George A. WordNet: A Lexical Database for English. Communications of the ACM, 38(11):39--41. 1995

  16. [22]

    Educational Cone Model in Embedding Vector Spaces

    Yo Ehara. Educational Cone Model in Embedding Vector Spaces. Proceedings of ICCE 2025: The 33rd International Conference on Computers in Education (short paper). 2025

  17. [23]

    Mining Words in the Minds of Second Language Learners: Learner-Specific Word Difficulty

    Ehara, Yo and Sato, Issei and Oiwa, Hidekazu and Nakagawa, Hiroshi. Mining Words in the Minds of Second Language Learners: Learner-Specific Word Difficulty. Proceedings of COLING 2012. 2012

  18. [25]

    A Generalized Solution of the Orthogonal Procrustes Problem

    Schönemann, Peter H. A Generalized Solution of the Orthogonal Procrustes Problem. Psychometrika, 31(1):1--10. 1966. doi:10.1007/BF02289451

  19. [26]

    Visualizing Data using t-SNE

    van der Maaten, Laurens and Hinton, Geoffrey. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9:2579--2605. 2008

  20. [27]

    Taylor Berg-Kirkpatrick, David Burkett, and Dan Klein. 2012. https://aclanthology.org/D12-1091/ An empirical investigation of statistical significance in NLP . In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 995--1005, Jeju Island, Korea. Association for Com...

  21. [28]

    BNC Consortium . 2007. http://hdl.handle.net/20.500.14106/2554 The British National Corpus , XML edition . License: http://www.natcorp.ox.ac.uk/docs/licence.html

  22. [29]

    Francis Bond, Arkadiusz Janz, Marek Maziarz, and Ewa Rudnicka. 2019. https://doi.org/10.18653/v1/2019.gwc-1.44 Testing Zipf's meaning-frequency law with wordnets as sense inventories. In Proceedings of the 10th Global Wordnet Conference, pages 342--352, Wroclaw, Poland. Global Wordnet Association

  23. [30]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1423 BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long a...

  24. [31]

    Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. https://doi.org/10.18653/v1/P18-1128 The hitchhiker's guide to testing statistical significance in natural language processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383--1392, Melbourne, Australia. Associ...

  25. [32]

    Yo Ehara. 2022. https://doi.org/10.1007/978-3-031-11644-5_37 An intelligent interactive support system for word usage learning in second languages. In Artificial Intelligence in Education - 23rd International Conference, AIED 2022, Durham, UK, July 27-31, 2022, Proceedings, Part I, Lecture Notes in Computer Science, pages 453--464. Springer

  26. [33]

    Yo Ehara. 2025. https://library.apsce.net/index.php/ICCE/article/view/5944 Educational cone model in embedding vector spaces . In Proceedings of ICCE 2025: The 33rd International Conference on Computers in Education (short paper)

  27. [34]

    Yo Ehara, Issei Sato, Hidekazu Oiwa, and Hiroshi Nakagawa. 2012. https://aclanthology.org/C12-1049/ Mining words in the minds of second language learners: Learner-specific word difficulty . In Proceedings of COLING 2012 , pages 799--814, Mumbai, India. The COLING 2012 Organizing Committee

  28. [35]

    Kawin Ethayarajh. 2019. https://doi.org/10.18653/v1/D19-1006 How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCN...

  29. [37]

    Yvette Graham, Nitika Mathur, and Timothy Baldwin. 2014. https://doi.org/10.3115/v1/W14-3333 Randomized significance tests in machine translation . In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 266--274, Baltimore, Maryland, USA. Association for Computational Linguistics

  30. [38]

    Philipp Koehn. 2004. https://aclanthology.org/W04-3250/ Statistical significance tests for machine translation evaluation . In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388--395, Barcelona, Spain. Association for Computational Linguistics

  31. [40]

    Kikuo Maekawa, Makoto Yamazaki, Toshinobu Ogiso, Takehiko Maruyama, Hideki Ogura, Wakako Kashino, Hanae Koiso, Masaya Yamaguchi, Makiro Tanaka, and Yasuharu Den. 2014. Balanced corpus of contemporary written Japanese . Language Resources and Evaluation, 48:345--371

  32. [41]

    George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39--41

  33. [44]

    Stefan Riezler and John T. Maxwell. 2005. https://aclanthology.org/W05-0908/ On some pitfalls in automatic evaluation and significance testing for MT . In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization , pages 57--64, Ann Arbor, Michigan. Association for Computational Linguistics

  34. [45]

    Schönemann

    Peter H. Schönemann. 1966. https://doi.org/10.1007/BF02289451 A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1--10

  35. [47]

    Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579--2605

  36. [50]

    Christos Xypolopoulos, Antoine Tixier, and Michalis Vazirgiannis. 2021. https://doi.org/10.18653/v1/2021.eacl-main.297 Unsupervised word polysemy quantification with multiresolution grids of contextual embeddings . In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3391--3401,...

  37. [51]

    Hiroaki Yamagiwa and Hidetoshi Shimodaira. 2025. 2025.coling-main.521/ Norm of mean contextualized embeddings determines their variance . In Proceedings of the 31st International Conference on Computational Linguistics, pages 7778--7808, Abu Dhabi, UAE

  38. [53]

    A New Formulation of Zipf's Meaning-Frequency Law through Contextual Diversity

    Nagata, Ryo and Tanaka-Ishii, Kumiko. A New Formulation of Zipf's Meaning-Frequency Law through Contextual Diversity. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.744

  39. [54]

    A Systematic Comparison of Contextualized Word Embeddings for Lexical Semantic Change

    Periti, Francesco and Tahmasebi, Nina. A Systematic Comparison of Contextualized Word Embeddings for Lexical Semantic Change. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.240

  40. [55]

    Analysing Lexical Semantic Change with Contextualised Word Representations

    Giulianelli, Mario and Del Tredici, Marco and Fernández, Raquel. Analysing Lexical Semantic Change with Contextualised Word Representations. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.365

  41. [56]

    Exact Paired-Permutation Testing for Structured Test Statistics

    Zmigrod, Ran and Vieira, Tim and Cotterell, Ryan. Exact Paired-Permutation Testing for Structured Test Statistics. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022. doi:10.18653/v1/2022.naacl-main.360

  42. [57]

    HyperLex: A Large-Scale Evaluation of Graded Lexical Entailment

    Vulić, Ivan and Gerz, Daniela and Kiela, Douwe and Hill, Felix and Korhonen, Anna. HyperLex: A Large-Scale Evaluation of Graded Lexical Entailment. Computational Linguistics. 2017. doi:10.1162/COLI_a_00301

  43. [58]

    Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

    Warner, Benjamin and Chaffin, Antoine and Clavié, Benjamin, et al. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.127

  44. [59]

    Norm of Mean Contextualized Embeddings Determines their Variance

    Yamagiwa, Hiroaki and Shimodaira, Hidetoshi. Norm of Mean Contextualized Embeddings Determines their Variance. Proceedings of the 31st International Conference on Computational Linguistics. 2025

  45. [60]

    Statistical Uncertainty in Word Embeddings: GloVe-V

    Vallebueno, Andrea and Handan-Nader, Cassandra and Manning, Christopher D and Ho, Daniel E. Statistical Uncertainty in Word Embeddings: GloVe-V. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.510

  46. [61]

    UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection

    Kutuzov, Andrey and Giulianelli, Mario. UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection. Proceedings of the Fourteenth Workshop on Semantic Evaluation. 2020. doi:10.18653/v1/2020.semeval-1.14