The Metanym Game: A Self-Contained, Self-Consistent LLM Peer-Community Benchmark for Structural Intelligence

David Nordfors

arXiv: 2606.21008 · v1 · pith:GNZ535F6new · submitted 2026-06-19 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-06-26 14:55 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords metanym game · structural intelligence · peer ratings · singular value decomposition · LLM benchmarking · analogy generation · self-consistent evaluation

The pith

One singular value decomposition of peer ratings in an LLM word game extracts competence for both generating and judging true statements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The metanym game is a competitive word game in which LLMs create their own analogy-based content and rate each other's outputs with no pre-given material. The paper shows that a single singular value decomposition on the matrix of these peer ratings derives each model's competence as both a generator and a judge of true statements at once. This produces a self-contained benchmark for factual accuracy and structural intelligence that resists training-data contamination by construction. The factual component of the scores correlates with GPQA Diamond at Pearson r = 0.92. When measured separately, generation and judgment dissociate, with judging the scarcer skill.

Core claim

The paper establishes that in the metanym game, where LLMs create falsifiable analogy-based sentences and rate each other as peers, one singular value decomposition of the ratings matrix yields the competence of each participant as both generator and judge of true statements at once. The factual rating obtained this way correlates with GPQA Diamond at Pearson r = 0.92. When scored separately, making and judging dissociate, with the strongest generators being only middling judges and the sharpest judge ranking mid-pack as a generator. The benchmark scales by having the strongest players form a contestable council for official evaluations.

What carries the argument

Singular value decomposition applied to the matrix of peer ratings from the metanym game, which extracts consistent competence measures for both creation and evaluation of statements.
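
To make the mechanism concrete, here is a minimal sketch of a rank-one spectral read-out on synthetic data. It is not the paper's estimator, which this page does not reproduce: the judges-by-submissions layout, the per-judge centering, the generative model, and the names `leading_factors`, `quality`, `sensitivity`, and `leniency` are all assumptions made for illustration.

```python
import numpy as np

def leading_factors(ratings):
    """Top singular triplet of a judges x submissions ratings matrix (illustrative)."""
    # Subtract each judge's mean so a uniformly generous judge does not dominate.
    centered = ratings - ratings.mean(axis=1, keepdims=True)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    judge_vec, item_vec = u[:, 0], vt[0]
    # Singular vectors are defined up to a joint sign flip; orient judges positively.
    if judge_vec.sum() < 0:
        judge_vec, item_vec = -judge_vec, -item_vec
    return judge_vec, item_vec, s

# Synthetic data under an assumed model: rating = leniency + sensitivity * quality + noise.
rng = np.random.default_rng(0)
n_judges, n_items = 12, 40
quality = rng.normal(size=n_items)                  # hypothetical latent truthfulness
sensitivity = rng.uniform(0.2, 1.0, size=n_judges)  # hypothetical judging skill
leniency = rng.normal(scale=0.5, size=n_judges)     # hypothetical per-judge offset
noise = rng.normal(scale=0.1, size=(n_judges, n_items))
ratings = leniency[:, None] + np.outer(sensitivity, quality) + noise

judge_vec, item_vec, singular_values = leading_factors(ratings)
item_corr = np.corrcoef(item_vec, quality)[0, 1]
judge_corr = np.corrcoef(judge_vec, sensitivity)[0, 1]
print(f"item corr={item_corr:.3f}, judge corr={judge_corr:.3f}")
```

Under this assumed model, a single decomposition returns a per-judge vector and a per-submission vector together, which is the "at once" property the paper describes. Whether real peer ratings follow such a model is a separate question that the sketch cannot answer.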

If this is right

The benchmark is entirely self-contained and self-consistent with no fixed test set.
Stronger models can contest and earn seats on the council that performs official benchmarking.
Generation and judgment emerge as distinct skills, with judging the rarer one.
The system provides a stable gauge over time without external oracles or golden keys.
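
The contestable-council idea in the list above can be sketched as a simple re-ranking. The paper's actual seating rule is not given on this page, so the top-k rule, the function name `seat_council`, and all model names and ratings below are hypothetical.

```python
def seat_council(ratings, n_seats):
    """Return the n_seats highest-rated model names, best first (ties broken by name)."""
    ranked = sorted(ratings.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n_seats]]

# Hypothetical ratings on the benchmark's own 1-10 scale.
ratings = {"model_a": 7.9, "model_b": 6.4, "model_c": 5.1}
council = seat_council(ratings, n_seats=2)

# A newcomer rated above the weakest seat displaces it on the next seating.
ratings["model_d"] = 7.0
new_council = seat_council(ratings, n_seats=2)
print(council, new_council)
```

The point of the sketch is only that seats depend on nothing outside the benchmark's own ratings; any real rule would also need to say who rates the challenger and how often seats are re-evaluated.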

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could allow benchmarks to update dynamically as stronger models enter the council.
The dissociation of skills suggests training focused on judgment might improve community-wide evaluation accuracy.
Similar spectral approaches might apply to other peer-evaluation settings where objective ground truth is hard to obtain.
If the ratings consistently track objective accuracy, the approach could reduce dependence on fixed human-annotated test sets.

Load-bearing premise

That peer ratings produced inside the metanym game accurately reflect objective factual accuracy and structural intelligence, allowing SVD to extract meaningful competence scores rather than merely re-expressing subjective ratings.

What would settle it

If the SVD-derived competence scores show no correlation with performance on an independent factual benchmark or if the resulting council ratings prove unstable across repeated rounds, the central claim would be falsified.
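
Both halves of this test reduce to correlations, and a minimal sketch of the two checks follows. The score vectors are hypothetical placeholders, not data from the paper, and the helper names `pearson` and `spearman` are illustrative.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Rank correlation; this simple ranking assumes no tied values."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(x), rank(y))

# Hypothetical scores for five models; not data from the paper.
round_1 = np.array([7.1, 5.2, 6.3, 4.0, 8.2])        # competence scores, first run
round_2 = np.array([6.9, 5.5, 6.0, 4.3, 8.0])        # competence scores, repeated run
external = np.array([0.62, 0.41, 0.55, 0.30, 0.70])  # accuracy on an independent benchmark

external_r = pearson(round_1, external)  # near zero would count against the claim
stability = spearman(round_1, round_2)   # low values would indicate unstable rankings
print(f"external r={external_r:.2f}, rank stability={stability:.2f}")
```

With only a dozen models, either statistic carries wide uncertainty, so a real check would report intervals alongside the point estimates.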

Figures

Figures reproduced from arXiv: 2606.21008 by David Nordfors.

**Figure 1.** Combined factual rating ½(EF + GF) (key-free, anchored 1–10) against self-administered GPQA Diamond accuracy, across the twelve models. Pearson r = 0.92 (95% CI [0.85, 0.97]), Spearman ρ = 0.90, n = 12. Filled markers are council seats, open markers non-council; horizontal bars are the combined 95% CI (the mean of the EF and GF intervals), vertical bars the GPQA binomial 95% CI. The blue star is the an…
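
The caption does not say how its 95% interval for r was computed. As a point of comparison only, the textbook Fisher z interval for r = 0.92 with n = 12 can be worked out in a few lines; the function name `fisher_z_interval` is illustrative, and nothing here is claimed to be the paper's method.

```python
import math

def fisher_z_interval(r, n, z_crit=1.959964):
    """Two-sided 95% Fisher z interval for a Pearson correlation r from n pairs."""
    z = math.atanh(r)                        # Fisher transform of r
    half_width = z_crit / math.sqrt(n - 3)   # standard error of z is 1/sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

low, high = fisher_z_interval(r=0.92, n=12)
print(f"[{low:.2f}, {high:.2f}]")
```

This gives roughly 0.73 to 0.98, a wider lower bound than the caption's [0.85, 0.97], which suggests the caption's interval comes from a different procedure, for example a resampling scheme over ratings rather than over models.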

read the original abstract

The metanym game is a competitive word game for LLMs that measures structural intelligence against established cognitive-science constructs. No content is given in advance; the contestants create all of it -- a new kind of analogy test, analogical production falsifiable sentence by sentence, with no fixed test set to leak into training (contamination-resistant by construction). In the council-of-peers benchmark, the contestants also rate each other's creations. We introduce the first spectral solution, to our knowledge, to the wicked problem of benchmarking LLMs' factual accuracy without golden keys or oracle models: one singular value decomposition of the evaluators' ratings matrix yields their competence as both generators and judges of true statements at once. Competence on the subjective criteria comes from each judge's rating consistency as the yardstick shifts. The factual rating correlates with GPQA Diamond at Pearson r = 0.92. Scored separately, making and judging dissociate -- judging is the scarcer skill: the strongest generators are middling judges, the sharpest judge a mid-pack generator. To scale, the strongest players form a council that does the official benchmarking; its seats are contestable -- a stronger model earns one on the benchmark's own rating. The benchmark is entirely self-contained and self-consistent, a stable gauge over time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's main move is a generative analogy game where LLMs rate each other and SVD on that matrix is said to extract both generator and judge competence, with a 0.92 correlation to GPQA.

The punchline is that this work tries to solve LLM benchmarking without fixed tests or oracles by having models generate analogies in the metanym game, rate one another's output, and then apply SVD to the rating matrix to pull out competence scores for both making and judging statements. The reported dissociation—that strong generators are only average judges—comes directly from that decomposition.

What is actually new is the specific combination of a fully generative, contamination-resistant game with the spectral approach to scoring. The self-contained council idea, where top models earn seats by the same ratings, is a clean way to make the benchmark update without external intervention. The paper does well at laying out why judging appears scarcer than generating and at keeping everything internal to the peer ratings.

The soft spots are in the grounding. The central claim that the SVD isolates objective factual accuracy rather than patterns of model agreement or stylistic similarity rests on the ratings themselves, and the 0.92 GPQA correlation is presented without details on model count, rating scale, statistical tests, or controls for shared training artifacts. If the ratings largely track inter-model leniency or common biases, the decomposition would recover those patterns instead of truth-tracking, and nothing in the abstract rules that out. The circularity burden is therefore high.

This is for people working on alternative evaluation methods who want ideas for self-updating benchmarks. A reader already thinking about peer-based or spectral scoring might pick up a useful angle, but the current evidence is too thin for strong conclusions.

I would send it to peer review so the authors can supply the missing experimental details and address the rating-validity concern directly.
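
The rating-validity concern in the letter can be illustrated with a toy case in which every judge follows a shared stylistic preference and factual quality never enters the ratings. The generative model and the names `truth`, `style`, and `gain` are assumptions for illustration; this is not a simulation of the metanym game.

```python
import numpy as np

rng = np.random.default_rng(1)
n_judges, n_items = 12, 40
truth = np.tile([1.0, -1.0], n_items // 2)   # hypothetical factual quality
style = np.linspace(-1.0, 1.0, n_items)      # hypothetical shared stylistic preference
gain = rng.uniform(0.5, 1.0, size=n_judges)  # how strongly each judge follows the style
noise = rng.normal(scale=0.1, size=(n_judges, n_items))

# Ratings here depend on style only; truth is deliberately left out.
ratings = 5.0 + np.outer(gain, style) + noise

centered = ratings - ratings.mean(axis=1, keepdims=True)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
item_vec = vt[0]  # leading per-submission factor

style_corr = abs(np.corrcoef(item_vec, style)[0, 1])
truth_corr = abs(np.corrcoef(item_vec, truth)[0, 1])
print(f"|corr with style|={style_corr:.2f}, |corr with truth|={truth_corr:.2f}")
```

In this construction the leading factor tracks the shared preference and is nearly uncorrelated with truth, so a clean spectral structure alone does not show that ratings are truth-tracking. That is why the external correlation and the requested controls matter.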

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Metanym Game, a self-contained benchmark in which LLMs generate novel analogical content sentence-by-sentence without any pre-supplied test items and then rate one another's outputs in a peer council. A single singular value decomposition applied to the resulting ratings matrix is claimed to extract separate competence scores for generation and judgment of true statements; the factual component of these scores is reported to correlate with GPQA Diamond at Pearson r = 0.92. The design is presented as contamination-resistant by construction, with judging shown to be the scarcer skill, and the benchmark is governed by a contestable council of the strongest models.

Significance. If the SVD extraction can be shown to recover objective competence rather than inter-model agreement patterns, the approach would constitute a notable methodological contribution to LLM evaluation by removing dependence on fixed test sets or external oracles while remaining self-consistent and scalable. The explicit dissociation between generation and judgment, the contestable-council governance mechanism, and the reported GPQA correlation are all potentially valuable if substantiated. The absence of implementation details currently prevents assessment of whether these strengths are realized.

major comments (2)

[Abstract] The central claim that SVD of the ratings matrix 'yields their competence as both generators and judges of true statements at once' is load-bearing for the entire contribution, yet the manuscript supplies no equation, matrix construction details, or validation that the leading singular vectors isolate objective factual accuracy rather than shared stylistic or leniency biases among the evaluated models.
[Abstract] The reported Pearson r = 0.92 with GPQA Diamond is presented without the number of models, rating scale, number of ratings per item, statistical significance, or any ablation that would distinguish truth-tracking from inter-rater agreement; this information is required to evaluate whether the correlation supports the objective-competence interpretation.

minor comments (1)

[Abstract] The abstract contains several run-on sentences and undefined terms (e.g., 'metanym,' 'structural intelligence') that should be clarified on first use for readers outside the immediate subfield.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback. We address each major comment below and agree that the manuscript requires additional detail and clarification on the points raised.

Referee: [Abstract] The central claim that SVD of the ratings matrix 'yields their competence as both generators and judges of true statements at once' is load-bearing for the entire contribution, yet the manuscript supplies no equation, matrix construction details, or validation that the leading singular vectors isolate objective factual accuracy rather than shared stylistic or leniency biases among the evaluated models.

Authors: We agree that the current manuscript does not supply the requested equation, matrix construction details, or explicit validation against stylistic or leniency biases. We will revise the manuscript to include the SVD equation, a precise description of how the ratings matrix is constructed from the peer ratings, and additional analysis or discussion addressing whether the leading singular vectors capture objective factual accuracy rather than agreement patterns or biases. revision: yes

Referee: [Abstract] The reported Pearson r = 0.92 with GPQA Diamond is presented without the number of models, rating scale, number of ratings per item, statistical significance, or any ablation that would distinguish truth-tracking from inter-rater agreement; this information is required to evaluate whether the correlation supports the objective-competence interpretation.

Authors: We agree that these details are necessary and were omitted. We will revise the manuscript to report the number of models evaluated, the rating scale used, the number of ratings per item, the statistical significance of the correlation, and an ablation study comparing the observed correlation against randomized or permuted ratings to help distinguish truth-tracking from inter-rater agreement. revision: yes
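
A simplified version of the permutation comparison promised in the rebuttal can be sketched as follows. It shuffles final scores against the external benchmark rather than re-running the decomposition on permuted ratings, which the full ablation would require; the data are hypothetical and the name `permutation_p_value` is illustrative.

```python
import numpy as np

def permutation_p_value(scores, reference, n_perm=2000, seed=0):
    """One-sided permutation p-value for Pearson r between scores and reference."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    reference = np.asarray(reference, dtype=float)
    observed = np.corrcoef(scores, reference)[0, 1]
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(scores)  # break the model-to-score pairing
        if np.corrcoef(shuffled, reference)[0, 1] >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical numbers for twelve models; not data from the paper.
reference = np.linspace(0.30, 0.85, 12)
jitter = np.array([0.3, -0.2, 0.1, -0.3, 0.2, 0.0, -0.1, 0.3, -0.2, 0.1, -0.3, 0.2])
scores = 10 * reference + jitter
observed_r, p_value = permutation_p_value(scores, reference)
print(f"r={observed_r:.3f}, p={p_value:.4f}")
```

A small p-value here only rules out a chance pairing of scores and benchmark results. It does not by itself separate truth-tracking from a shared bias that happens to correlate with benchmark strength.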

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper proposes SVD on the peer ratings matrix as a spectral method to extract generator and judge competence scores, explicitly validated by an external Pearson r=0.92 correlation with GPQA Diamond. No equations or self-citations are shown that reduce the competence claim to the input ratings by construction; the method is presented as a new approach to the benchmarking problem, with the external benchmark providing independent grounding. The self-contained design of the game itself does not create circularity in the reported derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Review is limited to the abstract; no explicit free parameters, axioms, or invented entities beyond the game itself are detailed.

axioms (2)

domain assumption: Peer ratings inside the metanym game reflect true factual accuracy and structural intelligence.
Required for the SVD to yield meaningful competence scores rather than circular re-expression of ratings.

ad hoc to paper: Singular value decomposition of the ratings matrix separates generator and judge competence.
This is the central technical claim of the spectral solution.

invented entities (1)

Metanym game · no independent evidence
purpose: Provide a generative, contamination-resistant benchmark for structural intelligence
New game introduced by the paper; no independent evidence supplied in abstract.

pith-pipeline@v0.9.1-grok · 5761 in / 1354 out tokens · 44837 ms · 2026-06-26T14:55:17.039596+00:00 · methodology

[1] [1]

Hesse, M. (1963). Models and Analogies in Science. London: Sheed & Ward

1963

[2] [2]

Minsky , M. (1975). A framework for representing knowledge. In P . H. Winston (Ed.), The Psychology of Computer Vision (pp. 211–277). McGraw-Hill

1975

[3] [3]

Fillmore, C. J. (1982). Frame semantics. In Linguistic Society of Korea (Ed.), Linguistics in the Morning Calm (pp. 111–137). Hanshin

1982

[4] [4]

metaphor

Boyd, R. (1979). Metaphor and theory change: What is “metaphor” a metaphor for? In A. Ortony (Ed.), Metaphor and Thought (pp. 356–408). Cambridge Uni- versity Press

1979

[5] [5]

Gentner , D. (1983). Structure-mapping: A theoretical framework for analogy . Cognitive Science, 7 (2), 155–170

1983

[6] [6]

L., & Holyoak, K

Gick, M. L., & Holyoak, K. J. (1983). Schema induction and analogical transfer . Cognitive Psychology , 15(1), 1–38

1983

[7] [7]

Gentner , D. (1989). The mechanisms of analogical learning. In S. Vosniadou & A. Ortony (Eds.), Similarity and Analogical Reasoning. Cambridge University Press

1989

[8] [8]

D., & Gentner , D

Falkenhainer , B., Forbus, K. D., & Gentner , D. (1989). The structure-mapping engine: Algorithm and examples. Artificial Intelligence, 41 (1), 1–63

1989

[9] [9]

Lakoff, G., & Johnson, M. (1980). Metaphors We Live By . University of Chicago Press

1980

[10] [10]

J., & Thagard, P

Holyoak, K. J., & Thagard, P . (1995). Mental Leaps: Analogy in Creative Thought. MIT Press

1995

[11] [11]

Goldberg, A. E. (1995). Constructions: A Construction Grammar Approach to Argument Structure. University of Chicago Press

1995

[12] [12]

C., Holyoak, K

Penn, D. C., Holyoak, K. J., & Povinelli, D. J. (2008). Darwin’s mistake: Explaining the discontinuity between human and nonhuman minds. Behavioral and Brain Sciences, 31 (2), 109–130

2008

[13] [13]

Hofstadter , D., & Sander , E. (2013). Surfaces and Essences. Basic Books

2013

[14] [14]

von Bertalanffy , L. (1968). General System Theory . George Braziller

1968

[15] [15]

Salthe, S. N. (1985). Evolving Hierarchical Systems: Their Structure and Rep- resentation. Columbia University Press. 34 Archetypes and pattern-instantiation

1985

[16] [16]

Pauli, W . (1955). The influence of archetypal ideas on the scientific theories of Kepler (P . Silz, Trans.). In C. G. Jung & W . Pauli, The Interpretation of Na- ture and the Psyche (pp. 147–240). Pantheon Books. (Original work published

1955

[17] [17]

(Cited for the Jung–Pauli proposal that archetypes act as ordering princi- ples across psyche and physical world; we adopt the structural framing, not the wider metaphysics.) Psychometric intelligence taxonomies

[18] [18]

Cattell, R. B. (1963). Theory of fluid and crystallized intelligence: A critical experiment. Journal of Educational Psychology , 54 (1), 1–22

1963

[19] [19]

L., & Cattell, R

Horn, J. L., & Cattell, R. B. (1966). Refinement and test of the theory of fluid and crystallized general intelligences. Journal of Educational Psychology , 57 (5), 253–270

1966

[20] [20]

Guilford, J. P . (1967). The Nature of Human Intelligence. McGraw-Hill

1967

[21] [21]

Carroll, J. B. (1993). Human Cognitive Abilities: A Survey of Factor-Analytic Studies. Cambridge University Press

1993

[22] [22]

McGrew , K. S. (2009). CHC theory and the human cognitive abilities project. Intelligence, 37 (1), 1–10. LLM-as-judge methodology

2009

[23] [23]

Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT -Bench and Chatbot Arena. Advances in Neural Information Processing Systems (NeurIPS), 36. arXiv:2306.05685

Pith/arXiv arXiv 2023

[24] [24]

Liu, Y ., et al. (2023). G-Eval: NLG evaluation using GPT -4 with better human alignment. Proceedings of EMNLP 2023. arXiv:2303.16634

Pith/arXiv arXiv 2023

[25] [25]

Verga, P ., et al. (2024). Replacing judges with juries: Evaluating LLM genera- tions with a panel of diverse models. arXiv:2404.18796

Pith/arXiv arXiv 2024

[26] [26]

Bai, Y ., et al. (2023). Benchmarking foundation models with Language-Model-as- an-Examiner . NeurIPS 36. arXiv:2306.04181

arXiv 2023

[27] [27]

Ning, K.-P ., Yang, S., Liu, Y .- Y ., Yao, J.- Y ., Liu, Z.-H., Wang, Y ., Pang, M., & Yuan, L. (2025). PiCO: Peer review in LLMs based on consistency optimization. Pro- ceedings of ICLR 2025. arXiv:2402.01830

arXiv 2025

[28] [28]

Zhang, Q., Ning, M., Liu, Z., Huang, Y ., Yang, S., Wang, Y ., Ye, J., Chen, X., Song, Y ., & Yuan, L. (2025). UPME: An unsupervised peer review framework for multimodal large language model evaluation. Proceedings of CVPR 2025. arXiv:2503.14941

arXiv 2025

[29] [29]

Don- Yehiya, S., Yehudai, A., Choshen, L., & Abend, O. (2026). Mediocrity is the key for LLM as a judge anchor selection. arXiv:2603.16848

arXiv 2026

[30] [30]

Weng, S., Feng, Y ., & Xie, X. (2026). Beyond accuracy: Policy invariance as a reliability test for LLM safety judges. arXiv:2605.06161

Pith/arXiv arXiv 2026

[31] [31]

R., Raff, E., & Zhang, W

Bellibatlu, R. R., Raff, E., & Zhang, W . (2026). JudgeSense: A benchmark for prompt sensitivity in LLM-as-a-judge systems. arXiv:2604.23478. 35 Analogical reasoning in LLMs

Pith/arXiv arXiv 2026

[32] [32]

J., & Lu, H

Webb, T ., Holyoak, K. J., & Lu, H. (2023). Emergent analogical reasoning in large language models. Nature Human Behaviour , 7(9), 1526–1541

2023

[33]

Lewis, M., & Mitchell, M. (2024). Using counterfactual tasks to evaluate the generality of analogical reasoning in large language models. arXiv:2402.08955

Related benchmarks

[34]

Chollet, F. (2019). On the measure of intelligence. arXiv:1911.01547

[35]

Mitchell, M. (2021). Abstraction and analogy-making in artificial intelligence. Annals of the New York Academy of Sciences, 1505(1), 79–101

[36]

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40, e253

[37]

Srivastava, A., et al. (2022). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv:2206.04615

[38]

Cobbe, K., et al. (2021). Training verifiers to solve math word problems. arXiv:2110.14168

[39]

Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., & Bowman, S. R. (2023). GPQA: A graduate-level Google-proof Q&A benchmark. arXiv:2311.12022

Statistical methods

[40]

Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall

[41]

Parisi, F., Strino, F., Nadler, B., & Kluger, Y. (2014). Ranking and combining multiple predictors without labeled data. Proceedings of the National Academy of Sciences, 111(4), 1253–1258

[42]

Dawid, A. P., & Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1), 20–28

Generation vs evaluation

[43]

West, P., Lu, X., Dziri, N., Brahman, F., Li, L., Hwang, J. D., Jiang, L., Fisher, J., Ravichander, A., Chandu, K., Newman, B., Koh, P. W., Ettinger, A., & Choi, Y. (2024). The Generative AI Paradox: “What it can create, it may not understand.” Proceedings of ICLR 2024. arXiv:2311.00059

[44]

Oh, J., Kim, E., Cha, I., & Oh, A. (2024). The Generative AI Paradox on evaluation: What it can solve, it may not evaluate. EACL 2024 Student Research Workshop. arXiv:2402.06204

[45]

Li, X. L., Shrivastava, V., Li, S., Hashimoto, T., & Liang, P. (2024). Benchmarking and improving generator-validator consistency. Proceedings of ICLR 2024. arXiv:2310.01846

Appendices

Appendix A. Rating estimators

The exact estimators for every benchmark rating, all computed from one anchored, anchor-swept evaluation matrix with no external key...

[46]

**Form (b)** — idiomatic rewrite (same propositions, written as a domain expert would):

= 6 anchor pairs for which both score vectors are non-constant (a constant vector makes the correlation undefined, so that pair is dropped). This per-axis breakdown is the diagnostic per-criterion competence $EC_a = \rho_{s,x}$ ($a$ ranging over the non-factual axes: beauty, intelligence, instantiation-distinctness, impressive-length, structural-diversity; §...
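The rule for dropping undefined pairs can be illustrated with a minimal sketch. It assumes, hypothetically, that each anchor yields one score vector over the same items, that the correlation is Pearson, and that the kept pairwise values are averaged; the excerpt does not confirm any of these choices, and the function names are illustrative only.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Plain Pearson correlation; callers must exclude constant vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def mean_pairwise_correlation(scores_by_anchor):
    """Average correlation over anchor pairs, skipping undefined pairs.

    scores_by_anchor maps a (hypothetical) anchor id to its score vector.
    A pair is dropped when either vector is constant, because the
    correlation is undefined in that case. Returns (mean, pairs_kept).
    """
    kept = []
    for a, b in combinations(sorted(scores_by_anchor), 2):
        x, y = scores_by_anchor[a], scores_by_anchor[b]
        if len(set(x)) == 1 or len(set(y)) == 1:
            continue  # constant vector: drop this pair
        kept.append(pearson(x, y))
    if not kept:
        return None, 0
    return sum(kept) / len(kept), len(kept)
```

With four anchors there are six candidate pairs; any pair involving a constant vector reduces the count of pairs that enter the average, which is why the number of kept pairs is returned alongside the mean.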

[47]

**Context-template** — a worded paragraph with `[SLOT]` placeholders. Slots use one canonical noun (e.g. `[ELEMENT]`, never `[ELEMENTS]`).

[48]

**Metanym table** — rows = slots, columns = 5 domains, each cell a metanym in **base form** (singular noun, infinitive verb, etc.).

[49]

**Five parallel contexts**, one per domain:
- **Form (a)** — the context-template with that domain's metanym set substituted in. Inflect metanyms as English requires; Form (a) must be grammatically correct.
- **Form (b)** — idiomatic rewrite of Form (a). Same propositions, written as a domain expert would naturally write them.
- **Optional 1-sentence j...
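The substitution step behind Form (a) can be sketched as follows. This is a toy illustration, not the author's tooling: the template, slot names, domains, and metanyms are all hypothetical, and the code performs only base-form substitution, so the inflection that Form (a) requires would still have to be applied afterwards.

```python
import re


def fill_template(template, metanyms):
    """Replace each [SLOT] placeholder with that slot's metanym."""
    def replace(match):
        slot = match.group(1)
        if slot not in metanyms:
            raise KeyError(f"no metanym for slot [{slot}]")
        return metanyms[slot]
    return re.sub(r"\[([A-Z_]+)\]", replace, template)


def form_a_drafts(template, table):
    """Build one uninflected draft per domain.

    table follows the layout described above: rows are slots,
    columns are domains, each cell a base-form metanym.
    """
    domains = next(iter(table.values())).keys()
    return {
        domain: fill_template(
            template, {slot: row[domain] for slot, row in table.items()}
        )
        for domain in domains
    }


# Hypothetical two-domain example (the prompt asks for five domains).
template = "The [ELEMENT] orbits the [CENTER]."
table = {
    "ELEMENT": {"astronomy": "planet", "physics": "electron"},
    "CENTER": {"astronomy": "star", "physics": "nucleus"},
}
drafts = form_a_drafts(template, table)
```

Raising on a missing slot, rather than leaving the placeholder in place, makes an incomplete metanym table fail loudly before any context is scored.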

[50]

**(Each parallel context)** Each sentence is factually correct

[51]

**(Each archetypal context)** Beauty

[52]

**(Each archetypal context)** Intelligence

[53]

**(Each archetypal context)** The parallel contexts from the template span very different domains. Metanyms are far from synonymous

[54]

**(Each archetypal context)** The archetypal template has impressive length

[55]

**(Each submitted set of archetypal contexts)** The archetypal contexts have very different system structures

B.2 — Evaluation prompt (calibrated/anchored)

### Score this submission against a calibration reference.

You are evaluating one contest submission ("Target Submission") against a fixed reference ("Reference Submission") that has been pre-scored a...

[56]

Title line. “Score this submission against a calibration reference.” becomes “Score these. You are evaluating contest submissions.”

[57]

The entire calibration preamble is removed, i.e. everything from “You are evaluating one contest submission (“Target Submission”) against a fixed reference…” down to and including “…Do not score the Reference Submission itself — its scores are fixed at {ANCHOR_SCORE}.” (the opening paragraph, the three “Equal / Clearly better / Clearly worse” bullets, ...

[58]

The scoring-instruction sentence drops its reference clause. “Score the Target Submission on six criteria, each rated 1–10 relative to the Reference (which is fixed at {ANCHOR_SCORE} on every criterion)… justifying the rating relative to the Reference” becomes “Score each submission on six criteria, each rated 1–10… justifying the rating” (all “relative...

[59]

The Reference Submission block is removed. The “## The submissions → ### Reference Submission (fixed at {ANCHOR_SCORE}/10…) {REFERENCE_SUBMISSION} → ### Target Submission … {TARGET_SUBMISSION}” section is replaced by a single batch: “## The proposals to evaluate” followed by {SUBMISSIONS}

[60]

The output is per-submission, not per-target. “## Target Submission” becomes “## Submission ” repeated for each submission; all “<…relative to Reference>” annotations in the output template are dropped; and the JSON top-level key changes from the single "Target" to one entry per "<submission_id>"

[61]

The closing line drops its anchor clause. “All ratings are integers 1–10 inclusive. Equal to the Reference = {ANCHOR_SCORE}.” becomes “All ratings are integers 1–10 inclusive.” Everything else (the six criteria and their scope tags, the terminology block, the recursion note, and the per-archetype/per-PC/per-portfolio output structure) is identic...
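A reply in the batch format could be checked against the "integers 1–10 inclusive" constraint with a sketch like the one below. It assumes, hypothetically, that every numeric leaf in the parsed JSON is a rating; the criterion keys and submission ids in the example are placeholders, since the actual output schema is not reproduced in this excerpt.

```python
import json


def invalid_ratings(node, path=""):
    """Return paths of numeric leaves that are not integers in 1..10."""
    bad = []
    if isinstance(node, dict):
        for key, value in node.items():
            bad += invalid_ratings(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            bad += invalid_ratings(value, f"{path}/{index}")
    elif isinstance(node, bool):
        pass  # bool is a subclass of int in Python; not treated as a rating
    elif isinstance(node, (int, float)):
        if not (isinstance(node, int) and 1 <= node <= 10):
            bad.append(path)
    return bad


# Hypothetical reply with one top-level entry per submission id.
reply = json.loads(
    '{"sub_A": {"beauty": 7, "intelligence": 11},'
    ' "sub_B": {"beauty": 6.5, "justification": "text"}}'
)
problems = invalid_ratings(reply)
```

Returning paths instead of a single boolean shows which submission and criterion violated the constraint, which is useful when one reply covers a whole batch.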