pith. machine review for the scientific record.

arxiv: 2604.24827 · v1 · submitted 2026-04-27 · 💻 cs.LG · cs.AI

Recognition: unknown

Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity

Bojie Li

Pith reviewed 2026-05-08 04:05 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords LLM parameter estimation · knowledge probes · factual capacity · scaling laws · black-box models · mixture-of-experts · model size inference · incompressible facts

The pith

Factual knowledge in large language models scales log-linearly with total parameter count, enabling estimates of closed-source model sizes from probe accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops Incompressible Knowledge Probes, a benchmark of 1,400 factual questions spanning seven obscurity tiers, to measure knowledge that models must store directly rather than derive or compress. Calibration on 89 open-weight models from 19 vendors produces a log-linear mapping from probe accuracy to parameter count with an R² of 0.917. This mapping generalizes well enough to estimate sizes for proprietary models and shows that factual capacity tracks total parameters far better than active parameters in mixture-of-experts designs. Across dated open models the mapping remains stable over time, indicating no measurable compression of factual storage.
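A minimal sketch of that calibration step, assuming a small table of open-weight models with known sizes and measured IKP accuracy; the values are placeholders, and the paper's exact regression details (direction, weighting, robustness tweaks) may differ.

```python
# Sketch: fit a log-linear map from IKP probe accuracy to log10(total parameters)
# over open-weight models with known sizes. All numbers are illustrative placeholders.
import numpy as np

ikp_accuracy   = np.array([0.05, 0.22, 0.41, 0.60, 0.78])     # fraction of 1,400 probes correct
total_params_b = np.array([0.135, 7.0, 70.0, 405.0, 1600.0])  # billions of parameters

y = np.log10(total_params_b)
slope, intercept = np.polyfit(ikp_accuracy, y, deg=1)

pred = slope * ikp_accuracy + intercept
r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"log10(params_B) ≈ {slope:.2f}·accuracy + {intercept:.2f}   (R² = {r2:.3f})")

# Applying the fitted map to a black-box model's measured accuracy gives a size estimate.
print(f"accuracy 0.70 → ≈ {10 ** (slope * 0.70 + intercept):.0f}B parameters")
```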

Core claim

Incompressible Knowledge Probes isolate facts that cannot be obtained by reasoning from other knowledge and therefore require dedicated parameter storage. Accuracy on these probes maps log-linearly to total parameter count across open models, with the fit holding across vendors and generations. For mixture-of-experts models total parameters predict knowledge capacity much more strongly than active parameters. The same mapping applied to closed models yields size estimates, which are lower bounds when safety tuning causes refusals on known facts.
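To illustrate the total-versus-active comparison for mixture-of-experts models, a sketch that fits IKP accuracy against each parameter count and compares R²; all numbers below are placeholders rather than the paper's data.

```python
# Sketch: compare how well log10(total params) and log10(active params) explain
# IKP accuracy for MoE models. Values are illustrative placeholders.
import numpy as np

def r_squared(x, y):
    """R² of a simple least-squares line of y on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    resid = y - (slope * x + intercept)
    return 1.0 - np.sum(resid ** 2) / np.sum((y - np.mean(y)) ** 2)

accuracy      = np.array([0.30, 0.45, 0.52, 0.61, 0.70])       # IKP accuracy per MoE model
total_params  = np.array([47.0, 141.0, 236.0, 671.0, 1000.0])  # billions, all experts
active_params = np.array([13.0, 39.0, 21.0, 37.0, 32.0])       # billions, routed per token

print("R² vs log10(total) :", round(r_squared(np.log10(total_params), accuracy), 3))
print("R² vs log10(active):", round(r_squared(np.log10(active_params), accuracy), 3))
```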

What carries the argument

Incompressible Knowledge Probes, a fixed set of 1,400 factual questions spanning seven obscurity tiers that target non-derivable facts, so that accuracy on them lower-bounds the weights required to store those facts.

Load-bearing premise

The chosen questions measure knowledge that cannot be derived through reasoning and that architectural changes cannot pack into fewer parameters than the observed relationship assumes.

What would settle it

Measure IKP accuracy on a newly disclosed model whose parameter count was previously unknown and check whether the observed accuracy falls inside the calibrated log-linear prediction band.
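A sketch of what that check could look like, assuming the calibrated slope, intercept, and residual spread are available; the plain normal-approximation band below stands in for whatever interval construction the paper actually uses, and all values are placeholders.

```python
# Sketch: form a rough prediction band for a newly disclosed model from the calibrated
# fit, then check whether its true size falls inside. Placeholder coefficients.
import numpy as np

slope, intercept, resid_std = 3.0, 0.4, 0.24    # fit coefficients; resid_std in log10 units

new_model_accuracy = 0.55
new_model_true_params_b = 120.0                 # disclosed after the prediction was made

center = slope * new_model_accuracy + intercept
lo, hi = center - 1.64 * resid_std, center + 1.64 * resid_std   # ~90% band
inside = lo <= np.log10(new_model_true_params_b) <= hi
print(f"predicted ≈ {10**center:.0f}B, 90% band [{10**lo:.0f}B, {10**hi:.0f}B], inside = {inside}")
```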

Figures

Figures reproduced from arXiv: 2604.24827 by Bojie Li.

Figure 1: IKP calibration curve. Each point is an open-weight model with known parameter count.
Figure 2: MoE models: accuracy vs total parameters (left, …
Figure 3: Leave-one-out cross-validation: predicted vs actual parameter count for 89 open models.
Figure 4: Per-tier accuracy for the top 25 models. Each row is a model (sorted by overall accuracy); …
Figure 5: Thinking mode effect across 27 base/think model pairs. Light bars: base model accuracy; …
Figure 6: Densing Law falsification on IKP across 96 open-weight models (2023-09 to 2026-04).
Figure 7: Vendor-reported benchmark scores vs log10(total params, B). Top-left: IKP across all 89 calibration models (R² = 0.917). Other panels: each standard benchmark (red) plotted against the same models with IKP (blue ×, dashed line) overlaid for direct comparison. On every matched subset, IKP yields a tighter fit; the gap is largest for GPQA Diamond (reasoning-heavy) and smallest for SimpleQA (factual). Three f…
Figure 8: Jaccard similarity on T5–T6 correct-answer sets for 15 frontier models, clustered by …
Figure 9: (a) Hallucination-similarity (HSS) for consecutive-generation pairs within each family. Annotation is same_wrong/both_wrong. Dashed line: shared-base threshold (0.30); dotted line: lineage threshold (0.10). GPT-5 through 5.4 transitions all fall in the retrained regime. (b) All pairs with ≥ 5 joint-wrong probes in the (J, HSS) plane. Same-vendor (blue) and cross-vendor (red) pairs largely overlap on J but …
Figure 10: Researcher recognition rate vs log(citations), colored by tier. The moderate correlation …
Figure 11: Accuracy distribution by tier across 188 models. T1–T2 are compressed at ceiling …
Figure 12: T5–T7 hallucination rate by vendor. Individual model points are overlaid on vendor …
Figure 13: Knowledge evolution across model generations. Red boxes highlight regressions: GPT …
Figure 14: GPT-5 family size stratification revealed by IKP. T5 accuracy (annotated) shows clear …
Original abstract

Closed-source frontier labs do not disclose parameter counts, and the standard alternative, inference economics, carries 2×+ uncertainty from hardware, batching, and serving-stack assumptions external to the model. We exploit a tighter intrinsic bound: storing F facts requires at least F/(bits per parameter) weights, so measuring how much a model knows lower-bounds how many parameters it has. We introduce Incompressible Knowledge Probes (IKPs), a benchmark of 1,400 factual questions spanning 7 tiers of obscurity, designed to isolate knowledge that cannot be derived by reasoning or compressed by architectural improvements. We calibrate a log-linear mapping from IKP accuracy to parameter count on 89 open-weight models (135M–1,600B) spanning 19 vendors, achieving R² = 0.917; leave-one-out cross-validation confirms generalization (median fold error 1.59×, 68.5% within 2× and 87.6% within 3×). For Mixture-of-Experts models, total parameters predict knowledge (R² = 0.79) far better than active parameters (R² = 0.51). We evaluate 188 models from 27 vendors and estimate effective knowledge capacity for all major proprietary frontier models; for heavily safety-tuned models the estimates are lower bounds, since refusal policy can hide tens of percentage points of "refused but known" capacity. The widely reported saturation of reasoning benchmarks does not imply the end of scaling. Procedural capability compresses under the "Densing Law," but across 96 dated open-weight models the IKP time coefficient is −0.0010/month (95% CI [−0.0031, +0.0008]), indistinguishable from zero, and rejecting the Densing prediction of +0.0117/month at p < 10⁻¹⁵. Factual capacity continues to scale log-linearly with parameters across generations and across vendors.
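The intrinsic bound quoted in the abstract can be made concrete with a small arithmetic sketch. Every constant below (facts stored, bits per fact, bits per parameter) is an illustrative assumption, not a value from the paper, which side-steps these constants by calibrating the accuracy-to-parameters map empirically on open-weight models.

```python
# Back-of-the-envelope version of the capacity bound: storing known facts requires
# a minimum number of weights. All constants are assumptions for illustration only.
facts_stored       = 2.0e8   # hypothetical count of distinct incompressible facts a model answers
bits_per_fact      = 24.0    # assumed information content of one such fact, in bits
bits_per_parameter = 2.0     # assumed storable information per weight, in bits

min_parameters = facts_stored * bits_per_fact / bits_per_parameter
print(f"lower bound ≈ {min_parameters / 1e9:.1f}B parameters")   # 2.4B under these assumptions
```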

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Incompressible Knowledge Probes (IKPs), a benchmark of 1,400 factual questions across 7 tiers of obscurity, to lower-bound the parameter counts of black-box LLMs via their factual knowledge capacity. A log-linear mapping from IKP accuracy to parameter count is calibrated on 89 open-weight models (135M–1.6T parameters, 19 vendors) yielding R²=0.917; leave-one-out cross-validation reports median error 1.59×. The mapping is applied to estimate effective capacity for 188 models from 27 vendors (including proprietary frontier models, with safety-tuned estimates noted as lower bounds). For MoE models, total parameters predict IKP accuracy better than active parameters. Across 96 dated open models, the IKP time trend after controlling for size is statistically indistinguishable from zero, rejecting a positive “Densing Law” prediction at p<10^{-15}.

Significance. If the central mapping holds, the work supplies a reproducible, inference-cost-independent method for bounding black-box model sizes and demonstrates that factual capacity continues to scale log-linearly with parameters even as reasoning benchmarks saturate. The high R², LOOCV results, MoE total-vs-active comparison, and falsification of the Densing prediction on open models are concrete strengths. The approach could inform scaling-law research and transparency discussions, though its utility for proprietary models depends on the untested transportability assumption.
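For readers who want to reproduce the flavor of the leave-one-out validation cited above, a minimal sketch follows; the data are placeholders and the paper's exact LOOCV protocol may differ.

```python
# Sketch of leave-one-out cross-validation for the calibration: hold out each model,
# refit the log-linear map, predict the held-out size, and report the multiplicative
# error (the quantity summarized as "median fold error 1.59×"). Placeholder data.
import numpy as np

accuracy = np.array([0.05, 0.18, 0.27, 0.40, 0.55, 0.66, 0.78])
params_b = np.array([0.135, 3.0, 8.0, 70.0, 235.0, 405.0, 1600.0])
log_params = np.log10(params_b)

fold_errors = []
for i in range(len(accuracy)):
    mask = np.arange(len(accuracy)) != i
    slope, intercept = np.polyfit(accuracy[mask], log_params[mask], deg=1)
    pred_log = slope * accuracy[i] + intercept
    fold_errors.append(10 ** abs(pred_log - log_params[i]))   # 1.59 means off by 1.59×

fold_errors = np.array(fold_errors)
print(f"median fold error ≈ {np.median(fold_errors):.2f}×, "
      f"{np.mean(fold_errors <= 2.0):.0%} within 2×")
```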

major comments (3)
  1. [Abstract / calibration paragraph] Abstract and calibration section: The log-linear mapping is fitted exclusively on open-weight models and then applied to closed proprietary models; no sensitivity analysis or bounds are provided on how much the fitted coefficients could shift due to differences in pre-training data scale/quality, post-training (RLHF/safety), or architectural tweaks. This assumption is load-bearing for all proprietary estimates.
  2. [IKP benchmark description] IKP construction and validation: The claim that the 1,400 questions isolate knowledge that cannot be derived by reasoning or compressed by architecture is central to the “incompressible” premise, yet the manuscript provides insufficient detail on question sourcing, exclusion rules, tier construction, and error analysis (e.g., breakdown of accuracy by obscurity tier or model size). These gaps directly affect the soundness of the R²=0.917 fit and LOOCV results.
  3. [Scaling and time-trend results] Time-trend analysis: The reported IKP time coefficient of −0.0010/month (95% CI [−0.0031, +0.0008]) on 96 dated open-weight models rejects the Densing prediction, but the selection criteria for these 96 models, the exact parameterization of the size control, and any vendor or architecture clustering are not specified, making the p<10^{-15} claim difficult to evaluate.
minor comments (2)
  1. [Abstract] The abstract states “Factual capacity continues to scale log-linearly with parameters across generations and across vendors” but does not clarify whether this is a within-model-size or across-size statement; a brief clarification would improve readability.
  2. [Results tables/figures] Table or figure reporting the 89-model calibration should include per-vendor or per-size-bin residuals to allow readers to assess whether the fit is uniform or driven by the largest models.
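The second minor comment asks for per-vendor residuals of the calibration fit; a minimal sketch of that breakdown, with hypothetical vendors, accuracies, and sizes rather than the paper's data.

```python
# Sketch: residuals of the log-linear calibration fit grouped by vendor.
# Systematic per-vendor offsets would indicate the fit is driven by particular
# vendors or size bins. Placeholder data throughout.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "vendor":   ["a", "a", "b", "b", "c", "c"],
    "accuracy": [0.10, 0.35, 0.28, 0.52, 0.47, 0.70],
    "params_b": [1.0, 30.0, 9.0, 140.0, 70.0, 700.0],
})
df["log_params"] = np.log10(df["params_b"])

slope, intercept = np.polyfit(df["accuracy"], df["log_params"], deg=1)
df["residual"] = df["log_params"] - (slope * df["accuracy"] + intercept)

print(df.groupby("vendor")["residual"].agg(["mean", "std", "count"]))
```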

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each of the major points raised below and indicate the revisions we will make to improve clarity and transparency.

Point-by-point responses
  1. Referee: [Abstract / calibration paragraph] Abstract and calibration section: The log-linear mapping is fitted exclusively on open-weight models and then applied to closed proprietary models; no sensitivity analysis or bounds are provided on how much the fitted coefficients could shift due to differences in pre-training data scale/quality, post-training (RLHF/safety), or architectural tweaks. This assumption is load-bearing for all proprietary estimates.

    Authors: We agree that the assumption of transportability is central and load-bearing. The manuscript already qualifies estimates for safety-tuned models as lower bounds. In revision, we will add a new subsection in the calibration section providing sensitivity checks using subsets of the open models (e.g., comparing base vs. fine-tuned variants and different vendors) to bound potential shifts. We will also discuss qualitative factors that could affect the mapping for proprietary models. Full quantitative bounds require internal training data unavailable to us. revision: partial

  2. Referee: [IKP benchmark description] IKP construction and validation: The claim that the 1,400 questions isolate knowledge that cannot be derived by reasoning or compressed by architecture is central to the “incompressible” premise, yet the manuscript provides insufficient detail on question sourcing, exclusion rules, tier construction, and error analysis (e.g., breakdown of accuracy by obscurity tier or model size). These gaps directly affect the soundness of the R²=0.917 fit and LOOCV results.

    Authors: We appreciate this feedback on the need for greater detail. The full paper describes sourcing from public knowledge bases and web archives, with manual review to ensure questions require specific factual recall rather than reasoning. Exclusion rules remove questions solvable via common knowledge or logic. Tiers are constructed by obscurity measured via search hit counts and expert ratings. We will expand the IKP section with these details, include per-tier accuracy tables stratified by model size, and report error analysis to better support the reported fit statistics. revision: yes

  3. Referee: [Scaling and time-trend results] Time-trend analysis: The reported IKP time coefficient of −0.0010/month (95% CI [−0.0031, +0.0008]) on 96 dated open-weight models rejects the Densing prediction, but the selection criteria for these 96 models, the exact parameterization of the size control, and any vendor or architecture clustering are not specified, making the p<10^{-15} claim difficult to evaluate.

    Authors: The 96 models comprise all open-weight models in our calibration set with publicly documented release dates. The regression controls for log(parameter count) as the size variable and uses clustered standard errors by vendor to account for potential non-independence. We will add an appendix detailing the exact model list, the regression equation (IKP accuracy ~ log(params) + time), and robustness checks including architecture fixed effects. This will allow direct evaluation of the statistical claim. revision: yes
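A sketch of the regression the authors describe, with vendor-clustered standard errors via statsmodels; the rows and vendor labels are placeholders, and only the model specification (accuracy on log parameters plus time, clustered by vendor) is taken from the response above.

```python
# Sketch: IKP accuracy ~ log10(total params) + months since a reference date,
# with standard errors clustered by vendor. Placeholder data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "accuracy": [0.21, 0.34, 0.52, 0.48, 0.63, 0.70, 0.44, 0.58],
    "params_b": [7.0, 70.0, 236.0, 132.0, 405.0, 671.0, 104.0, 480.0],
    "months":   [0, 5, 12, 14, 20, 27, 18, 30],     # months since 2023-09
    "vendor":   ["a", "a", "b", "c", "c", "d", "d", "b"],
})
df["log_params"] = np.log10(df["params_b"])

fit = smf.ols("accuracy ~ log_params + months", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["vendor"]}
)
# The coefficient on `months` is the per-month drift after controlling for size;
# a value indistinguishable from zero is what the paper reports.
print(fit.params["months"], fit.conf_int().loc["months"].tolist())
```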

Circularity Check

0 steps flagged

No significant circularity; empirical calibration and external validation are self-contained

Full rationale

The paper calibrates a log-linear regression from measured IKP accuracy to known parameter counts on 89 open-weight models (reporting R²=0.917 and LOOCV error), then applies the fitted mapping to estimate closed models under an explicit transportability assumption. This is standard supervised regression with held-out validation rather than any self-definitional loop, fitted-input-renamed-as-prediction, or self-citation that bears the central load. The time-trend test fits a coefficient on dated open models and rejects an external Densing prediction; no equations reduce to their own inputs by construction, no uniqueness theorems are imported from the authors' prior work, and no ansatz is smuggled via citation. The derivation rests on observable data and falsifiable assumptions external to the paper itself.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on an empirical fit plus two key assumptions about knowledge storage and incompressibility; the benchmark itself is a new constructed entity.

free parameters (1)
  • log-linear mapping coefficients
    Slope and intercept fitted to 89 open models to achieve R²=0.917.
axioms (2)
  • domain assumption: Storing F facts requires at least F/(bits per parameter) weights.
    Core premise linking measured knowledge to minimum parameter count.
  • ad hoc to paper: IKP questions isolate knowledge that cannot be derived by reasoning or compressed by architecture.
    Design choice stated in the abstract to justify the lower-bound interpretation.
invented entities (1)
  • Incompressible Knowledge Probes (no independent evidence)
    purpose: benchmark of 1,400 factual questions to measure stored knowledge capacity
    Newly introduced test set spanning 7 obscurity tiers.

pith-pipeline@v0.9.0 · 5684 in / 1427 out tokens · 86013 ms · 2026-05-08T04:05:18.825963+00:00 · methodology

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

    cs.LG · 2026-05 · unverdicted · novelty 7.0

    AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.

  2. The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    LLMs show a grounding gap with humans on abstract concepts, with property-generation correlations at most r=0.37 versus human-to-human r>0.9, though larger models align better on explicit rating tasks and internal SAE...

  3. Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs

    cs.AI · 2026-04 · unverdicted · novelty 6.0

    BLF achieves state-of-the-art binary forecasting on ForecastBench by using linguistic belief states updated in tool-use loops, hierarchical multi-trial logit averaging, and hierarchical Platt scaling calibration.

Reference graph

Works this paper leans on

28 extracted references · 14 canonical work pages · cited by 3 Pith papers · 9 internal anchors

  1. [1] Sanket Badhe, Deep Shah, and Nehal Kathrotia. Long-tail knowledge in large language models: Taxonomy, mechanisms, interventions and implications. arXiv preprint arXiv:2602.16201.

  2. [2] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.

  3. [3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901.

  4. [4] DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434.

  5. [5] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

  6. [6] Epoch AI. Estimating training compute of frontier AI models. 2024. https://epoch.ai/.
     William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39.

  7. [7] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295.

  8. [8] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495.

  9. [9] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088.

  10. [10] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

  11. [11] Pingzhi Li, Morris Yu-Chao Huang, Zhen Tan, Qingquan Song, Jie Peng, Kai Zou, Yu Cheng, Kaidi Xu, and Tianlong Chen. Leave it to the experts: Detecting knowledge distillation via MoE expert signatures. arXiv preprint arXiv:2510.16968.

  12. [12] Xingyu Lu, Xiaonan Li, Qinyuan Cheng, Kai Ding, Xuanjing Huang, and Xipeng Qiu. Scaling laws for fact memorization of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024.

  13. [13] John X. Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G. Edward Suh, Alexander M. Rush, Kamalika Chaudhuri, and Saeed Mahloujifar. How much do language models memorize? arXiv preprint arXiv:2505.24832.

  14. [14] Anshul Nasery, Jonathan Hayase, Creston Brooks, Peiyao Sheng, Himanshu Tyagi, Pramod Viswanath, and Sewoong Oh. Scalable fingerprinting of large language models. arXiv preprint arXiv:2502.07760.

  15. [15] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

  16. [16] Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander H. Miller. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 2463–2473.

  17. [17] Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 5418–5426.

  18. [18] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
      Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, et al. …

  19. [19] Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. OpenAI, 2024. https://openai.com/index/introducing-simpleqa/.
      An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, et al. Qwen2 technical report.

  20. [20] Guoliang Zhao, Yuhan Fu, Shuaipeng Li, Xingwu Sun, Ruobing Xie, An Wang, Weidong Han, Zhen Yang, Weixuan Sun, Yudong Zhang, Cheng-zhong Xu, Di Wang, and Jie Jiang. Towards a comprehensive scaling law of mixture-of-experts. arXiv preprint arXiv:2509.23678.

  21. [21] Internal anchor: Appendix A, "Probe Generation Methodology" (tier definitions, per-corpus sampling procedures, quality filters with drop rates, final per-tier composition, verification procedure, and sample probes such as "What is the capital of Norway?").

  22. [22] Internal anchor: released probe template ("In computer science, what is the research subfield of Name, and name one paper, system, institution, or co-author associated with their work? If you don't know who this person is, say so."); the "In computer science" scoping reduces cross-field name collisions for short or common names.

  23. [23] Internal anchor: hallucination-penalty ablation (removing the penalty loses ~0.013 in R² and widens the 90% PI factor from 2.94× to 3.21×; beyond λ = −1.0 fits degrade as aggressive bluffers' negative scores at hard tiers drive the slope; the 4-way researcher judge requires the strict ordering STRONG > WEAK > REFUSAL > WRONG).

  24. [24] Internal anchor: per-model, per-tier accuracy table (vendor, model, overall and raw accuracy, T1–T7 breakdown; Gemini 3.x and 2.5 rows shown in the fragment).

  25. [25] Internal anchor: Densing Law robustness checks (alternative samples leave β̂₂ between −0.00051 and −0.00018/month, both rejecting the Densing prediction at p < 10⁻¹⁵; substituting log10(active_B) for total parameters yields β̂₂ = +0.0063/month, still only about half the Densing target and also rejected).

  26. [26] Internal anchor: Hackergame writeup case study (the first model to list real per-year challenges has ~19 verified 2023 titles; the training-pipeline change that absorbed the writeups happened within a ~9-month window, and later models inherited and extended this knowledge; knowing the meta-fact does not imply knowing the details).

  27. [27] Internal anchor: T6 tier-discrimination example (Claude Opus 4.6 correctly recalls that Heike Kamerlingh Onnes first produced liquid helium on July 10, 1908, in Leiden; T6 probes are answered correctly only by the largest models).

  28. [28] Internal anchor: Appendix I, Ethical Considerations (proprietary sizes are estimated from public API access; parameter count is only one dimension of capability; estimates are effective capacity, not literal architecture; the method provides ~3.0× precision).