The Metanym Game: A Self-Contained, Self-Consistent LLM Peer-Community Benchmark for Structural Intelligence
Pith reviewed 2026-06-26 14:55 UTC · model grok-4.3
The pith
One singular value decomposition of peer ratings in an LLM word game extracts competence for both generating and judging true statements.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that in the metanym game, where LLMs create falsifiable analogy-based sentences and rate each other as peers, one singular value decomposition of the ratings matrix yields the competence of each participant as both generator and judge of true statements at once. The factual rating obtained this way correlates with GPQA Diamond at Pearson r = 0.92. When scored separately, making and judging dissociate, with the strongest generators being only middling judges and the sharpest judge ranking mid-pack as a generator. The benchmark scales by having the strongest players form a contestable council for official evaluations.
What carries the argument
Singular value decomposition applied to the matrix of peer ratings from the metanym game, which extracts consistent competence measures for both creation and evaluation of statements.
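As a rough illustration of the spectral step, the sketch below extracts the leading singular triplet of a small ratings matrix by power iteration. The abstract gives no equation or matrix construction, so everything here is an assumption: the layout (rows are judges, columns are generators), the toy ratings, and the function name `leading_singular_triplet` are hypothetical. Reading `u` as judge loadings and `v` as generator loadings is one possible interpretation, not the authors' estimator.

```python
import math


def leading_singular_triplet(matrix, iterations=200):
    """Top singular triplet (sigma, u, v) of a dense matrix via power iteration.

    Assumed layout (not from the paper): matrix[j][g] is the rating that
    judge j gave to generator g.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # uniform unit-norm start vector
    u = [0.0] * n_rows
    sigma = 0.0
    for _ in range(iterations):
        # u <- R v, then normalise to unit length
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        norm_u = math.sqrt(sum(x * x for x in u))
        if norm_u == 0.0:
            break
        u = [x / norm_u for x in u]
        # v <- R^T u; its norm is the current singular-value estimate
        v = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        if sigma == 0.0:
            break
        v = [x / sigma for x in v]
    return sigma, u, v


# Hypothetical toy ratings: 3 judges (rows) scoring 4 generators (columns).
toy_ratings = [
    [8.0, 6.0, 7.0, 3.0],
    [9.0, 5.0, 7.0, 2.0],
    [7.0, 6.0, 6.0, 4.0],
]
sigma, judge_loadings, generator_loadings = leading_singular_triplet(toy_ratings)
print(round(sigma, 3), judge_loadings, generator_loadings)
```

No centering is applied here. For an all-positive ratings matrix the leading component largely tracks overall rating level, so a real pipeline would probably need some normalisation before the loadings could be read as competence.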
If this is right
- The benchmark is entirely self-contained and self-consistent with no fixed test set.
- Stronger models can contest and earn seats on the council that performs official benchmarking.
- Generation and judgment emerge as distinct skills, with judging the rarer one.
- The system provides a stable gauge over time without external oracles or golden keys.
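The contestable-council idea in the list above can be sketched as a simple seating rule. The abstract does not say how seats are allocated, so the rule below (top-k by rating, and a challenger displaces the weakest seat only with a strictly higher rating) is an assumption, as are the function names `seat_council` and `contest_seat` and the model names and ratings.

```python
def seat_council(ratings, seats):
    """Return the `seats` highest-rated model names, best first.

    Ties are broken alphabetically by name (an arbitrary choice for this sketch).
    """
    ranked = sorted(ratings.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:seats]]


def contest_seat(council_ratings, challenger, challenger_rating):
    """Give the challenger the weakest seat if its rating is strictly higher.

    Returns (updated_council, displaced_name); displaced_name is None when
    the challenge fails. The input dict is not modified.
    """
    weakest = min(council_ratings, key=lambda name: (council_ratings[name], name))
    updated = dict(council_ratings)
    if challenger_rating > council_ratings[weakest]:
        del updated[weakest]
        updated[challenger] = challenger_rating
        return updated, weakest
    return updated, None


# Hypothetical ratings on the benchmark's own scale.
ratings = {"model_a": 0.9, "model_b": 0.7, "model_c": 0.8, "model_d": 0.4}
council = {name: ratings[name] for name in seat_council(ratings, seats=2)}
council, displaced = contest_seat(council, "model_e", 0.85)
print(council, displaced)
```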
Where Pith is reading between the lines
- The method could allow benchmarks to update dynamically as stronger models enter the council.
- The dissociation of skills suggests training focused on judgment might improve community-wide evaluation accuracy.
- Similar spectral approaches might apply to other peer-evaluation settings where objective ground truth is hard to obtain.
- If the ratings consistently track objective accuracy, the approach could reduce dependence on fixed human-annotated test sets.
Load-bearing premise
That peer ratings produced inside the metanym game accurately reflect objective factual accuracy and structural intelligence, allowing SVD to extract meaningful competence scores rather than merely re-expressing subjective ratings.
What would settle it
If the SVD-derived competence scores show no correlation with performance on an independent factual benchmark or if the resulting council ratings prove unstable across repeated rounds, the central claim would be falsified.
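One way to make the stability half of this test concrete is to compare the rankings produced by two rounds with a rank correlation. The sketch below computes Spearman's rho in plain Python. The choice of Spearman, the function names, and the two rounds of scores are illustrative assumptions; the source does not specify a stability metric.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical competence scores for the same five models in two rounds.
round_one = [0.91, 0.74, 0.66, 0.52, 0.40]
round_two = [0.88, 0.70, 0.72, 0.49, 0.43]
print(spearman(round_one, round_two))
```

A value near 1 across repeated rounds would count as stable under this assumed metric; what threshold counts as "unstable" is a judgment the source leaves open.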
Original abstract
The metanym game is a competitive word game for LLMs that measures structural intelligence against established cognitive-science constructs. No content is given in advance; the contestants create all of it -- a new kind of analogy test, analogical production falsifiable sentence by sentence, with no fixed test set to leak into training (contamination-resistant by construction). In the council-of-peers benchmark, the contestants also rate each other's creations. We introduce the first spectral solution, to our knowledge, to the wicked problem of benchmarking LLMs' factual accuracy without golden keys or oracle models: one singular value decomposition of the evaluators' ratings matrix yields their competence as both generators and judges of true statements at once. Competence on the subjective criteria comes from each judge's rating consistency as the yardstick shifts. The factual rating correlates with GPQA Diamond at Pearson r = 0.92. Scored separately, making and judging dissociate -- judging is the scarcer skill: the strongest generators are middling judges, the sharpest judge a mid-pack generator. To scale, the strongest players form a council that does the official benchmarking; its seats are contestable -- a stronger model earns one on the benchmark's own rating. The benchmark is entirely self-contained and self-consistent, a stable gauge over time.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Metanym Game, a self-contained benchmark in which LLMs generate novel analogical content sentence-by-sentence without any pre-supplied test items and then rate one another's outputs in a peer council. A single singular value decomposition applied to the resulting ratings matrix is claimed to extract separate competence scores for generation and judgment of true statements; the factual component of these scores is reported to correlate with GPQA Diamond at Pearson r = 0.92. The design is presented as contamination-resistant by construction, with judging shown to be the scarcer skill, and the benchmark is governed by a contestable council of the strongest models.
Significance. If the SVD extraction can be shown to recover objective competence rather than inter-model agreement patterns, the approach would constitute a notable methodological contribution to LLM evaluation by removing dependence on fixed test sets or external oracles while remaining self-consistent and scalable. The explicit dissociation between generation and judgment, the contestable-council governance mechanism, and the reported GPQA correlation are all potentially valuable if substantiated. The absence of implementation details currently prevents assessment of whether these strengths are realized.
major comments (2)
- [Abstract] The central claim that SVD of the ratings matrix 'yields their competence as both generators and judges of true statements at once' is load-bearing for the entire contribution, yet the manuscript supplies no equation, matrix construction details, or validation that the leading singular vectors isolate objective factual accuracy rather than shared stylistic or leniency biases among the evaluated models.
- [Abstract] The reported Pearson r = 0.92 with GPQA Diamond is presented without the number of models, rating scale, number of ratings per item, statistical significance, or any ablation that would distinguish truth-tracking from inter-rater agreement; this information is required to evaluate whether the correlation supports the objective-competence interpretation.
minor comments (1)
- [Abstract] The abstract contains several run-on sentences and undefined terms (e.g., 'metanym,' 'structural intelligence') that should be clarified on first use for readers outside the immediate subfield.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive feedback. We address each major comment below and agree that the manuscript requires additional detail and clarification on the points raised.
Point-by-point responses
- Referee: [Abstract] The central claim that SVD of the ratings matrix 'yields their competence as both generators and judges of true statements at once' is load-bearing for the entire contribution, yet the manuscript supplies no equation, matrix construction details, or validation that the leading singular vectors isolate objective factual accuracy rather than shared stylistic or leniency biases among the evaluated models.
  Authors: We agree that the current manuscript does not supply the requested equation, matrix construction details, or explicit validation against stylistic or leniency biases. We will revise the manuscript to include the SVD equation, a precise description of how the ratings matrix is constructed from the peer ratings, and additional analysis or discussion addressing whether the leading singular vectors capture objective factual accuracy rather than agreement patterns or biases.
  Revision: yes
- Referee: [Abstract] The reported Pearson r = 0.92 with GPQA Diamond is presented without the number of models, rating scale, number of ratings per item, statistical significance, or any ablation that would distinguish truth-tracking from inter-rater agreement; this information is required to evaluate whether the correlation supports the objective-competence interpretation.
  Authors: We agree that these details are necessary and were omitted. We will revise the manuscript to report the number of models evaluated, the rating scale used, the number of ratings per item, the statistical significance of the correlation, and an ablation study comparing the observed correlation against randomized or permuted ratings to help distinguish truth-tracking from inter-rater agreement.
  Revision: yes
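The significance question raised here can be illustrated with a label-permutation test on the correlation itself. This is a minimal sketch under stated assumptions: the function names, the score vectors, and the number of permutations are all hypothetical, and the reported r = 0.92 is not reproduced. It shuffles one score vector relative to the other, which tests whether an observed r could arise by chance. That is a simpler check than permuting the ratings matrix before the decomposition, which is what an ablation separating truth-tracking from inter-rater agreement would likely require.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation.

    Shuffles ys relative to xs and counts how often |r| is at least as
    large as the observed |r|; the +1 terms keep the estimate above zero.
    """
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical scores for eight models: benchmark rating vs. external accuracy.
benchmark_rating = [0.31, 0.42, 0.45, 0.58, 0.61, 0.70, 0.77, 0.85]
external_accuracy = [0.28, 0.35, 0.47, 0.50, 0.66, 0.64, 0.80, 0.83]
print(pearson(benchmark_rating, external_accuracy))
print(permutation_p_value(benchmark_rating, external_accuracy))
```

With only a handful of models, even a high r carries wide uncertainty, which is why the number of models matters for interpreting the reported figure.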
Circularity Check
No significant circularity in the derivation chain
The paper proposes SVD on the peer ratings matrix as a spectral method to extract generator and judge competence scores, explicitly validated by an external Pearson r=0.92 correlation with GPQA Diamond. No equations or self-citations are shown that reduce the competence claim to the input ratings by construction; the method is presented as a new approach to the benchmarking problem, with the external benchmark providing independent grounding. The self-contained design of the game itself does not create circularity in the reported derivation.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Peer ratings inside the metanym game reflect true factual accuracy and structural intelligence
- ad hoc to paper: Singular value decomposition of the ratings matrix separates generator and judge competence
invented entities (1)
- Metanym game: no independent evidence