The Ideation Bottleneck: Decomposing the Quality Gap Between AI-Generated and Human Economics Research
Pith reviewed 2026-05-13 18:58 UTC · model grok-4.3
The pith
Idea quality explains 71 percent of the performance gap between AI-generated and human economics papers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Autonomous AI systems generate complete economics papers that underperform human-authored publications in head-to-head comparisons. The quality gap decomposes into a large idea-quality difference (Cohen's d = 2.23), where humans reach 47.1 percent mean ensemble exceptional probability versus 16.5 percent for AI, and a smaller execution-quality difference (d = 0.90), where humans score 4.38 out of 5 versus 3.84. Idea quality accounts for approximately 71 percent of the overall difference, with execution contributing 29 percent. The largest execution weakness appears in mechanism analysis depth, while robustness shows no significant difference. Only 0.8 percent of AI papers surpass the median human paper on both idea and execution quality simultaneously.
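The excerpt does not state the exact decomposition formula, but the reported 71/29 split is consistent with weighting the two gaps by their standardized effect sizes. A minimal arithmetic check under that assumption:

```python
# Worked check of the 71/29 split (assumption: the decomposition weights
# each gap by its Cohen's d; the excerpt does not state the exact formula).
d_idea, d_exec = 2.23, 0.90   # effect sizes reported in the abstract

idea_share = d_idea / (d_idea + d_exec)   # ~0.712
exec_share = d_exec / (d_idea + d_exec)   # ~0.288

print(f"idea share: {idea_share:.0%}, execution share: {exec_share:.0%}")
```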
What carries the argument
A two-model ensemble trained on publication decisions to score idea quality, paired with a six-dimension rubric evaluated by Gemini to score execution quality, applied to 912 AI papers and 41 human papers.
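The appendix fragment preserved under reference [4] below describes the idea-scoring mechanics: each fine-tuned model emits a single tier token with log-probabilities, and a softmax over the four tier tokens yields the exceptional probability, averaged across the two ensemble members. A minimal sketch of that extraction; the tier labels and log-probability values here are hypothetical stand-ins:

```python
import numpy as np

# Tier labels are assumed; the excerpt describes four tiers but does not name them.
TIER_TOKENS = ["exceptional", "strong", "average", "weak"]

def exceptional_prob(tier_logprobs):
    """Softmax-normalize the four tier-token log-probabilities and return
    the probability mass on the 'exceptional' tier."""
    logits = np.array([tier_logprobs[t] for t in TIER_TOKENS])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(probs[TIER_TOKENS.index("exceptional")])

def ensemble_score(model_outputs):
    """Mean exceptional probability across the two fine-tuned models."""
    return float(np.mean([exceptional_prob(lp) for lp in model_outputs]))

# Hypothetical log-probabilities from the two ensemble members:
score = ensemble_score([
    {"exceptional": -2.1, "strong": -0.4, "average": -1.9, "weak": -3.0},
    {"exceptional": -1.8, "strong": -0.6, "average": -1.7, "weak": -2.8},
])
print(f"mean ensemble exceptional probability: {score:.1%}")
```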
If this is right
- Human papers achieve markedly higher idea exceptional probability than AI papers.
- Mechanism analysis depth shows the largest execution gap among the six dimensions (d = 1.43).
- Robustness checks exhibit no significant quality difference between AI and human papers.
- 74 percent of AI papers rely on difference-in-differences designs.
- Only 0.8 percent of AI papers exceed the median human paper on both idea and execution quality at once.
Where Pith is reading between the lines
- Future AI training focused on idea generation could close most of the observed gap if the current decomposition holds.
- The findings suggest testing whether scaling ideation-specific capabilities in models reduces the 71 percent idea contribution over time.
- Neighboring fields such as political science or sociology may show similar ideation bottlenecks if the same decomposition method is applied.
- Improving execution alone would address only about 29 percent of the gap, implying limited returns from refinements to analysis pipelines without better ideas.
Load-bearing premise
The two-model ensemble and Gemini rubric accurately and unbiasedly measure true idea and execution quality.
What would settle it
A new AI system that generates papers scoring at or above the human median on both the ensemble idea-quality probability and the execution rubric simultaneously.
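A sketch of that acceptance test, assuming paper-level score arrays; the data below are synthetic stand-ins loosely calibrated to the reported means, not the paper's distributions:

```python
import numpy as np

def joint_exceedance(ai_idea, ai_exec, human_idea, human_exec):
    """Fraction of AI papers above the human median on BOTH dimensions;
    the paper reports 7/912 (0.8%) for this statistic."""
    both = ((np.asarray(ai_idea) > np.median(human_idea)) &
            (np.asarray(ai_exec) > np.median(human_exec)))
    return both.mean()

# Synthetic stand-ins (not the paper's data):
rng = np.random.default_rng(0)
ai_idea = rng.beta(2, 8, 912)              # mean ~0.20 vs reported 16.5%
ai_exec = rng.normal(3.84, 0.40, 912)
human_idea = rng.beta(5, 5, 41)            # mean ~0.50 vs reported 47.1%
human_exec = rng.normal(4.38, 0.30, 41)
print(f"joint exceedance: {joint_exceedance(ai_idea, ai_exec, human_idea, human_exec):.1%}")
```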
Original abstract
Autonomous AI systems can now generate complete economics research papers, but they substantially underperform human-authored publications in head-to-head comparisons. This paper decomposes the quality gap into two independent components: research idea quality and execution quality. Using a two-model ensemble of fine-tuned language models trained on publication decisions (Gong, Li, and Zhou, 2026) to evaluate idea quality and a comprehensive six-dimension rubric assessed by Gemini 3.1 Flash Lite -- the same model family used as the APE tournament judge, ensuring methodological consistency -- to evaluate execution quality, we analyze 953 economics papers -- 912 AI-generated papers from the APE project and 41 human papers published in the American Economic Review and AEJ: Economic Policy. The idea quality gap is large (Cohen's d = 2.23, p < 0.001), with human papers achieving 47.1% mean ensemble exceptional probability versus 16.5% for AI. The execution quality gap is also significant but smaller (d = 0.90, p < 0.001), with human papers scoring 4.38/5.0 versus 3.84. Idea quality accounts for approximately 71% of the overall quality difference, with execution contributing 29%. The largest execution weakness is mechanism analysis depth (d = 1.43); no significant difference is found on robustness. We document that 74% of AI papers employ difference-in-differences, and only 7 AI papers (0.8%) surpass the median human paper on both idea and execution quality simultaneously. The primary bottleneck to competitive AI-generated economics research remains ideation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that the quality gap between AI-generated economics papers (from the APE project) and human publications in top journals is driven primarily by differences in research idea quality (71% of the gap) rather than execution quality (29%). It reaches this conclusion by scoring idea quality via a two-model ensemble fine-tuned on publication decisions and execution quality via a six-dimension Gemini rubric, reporting large effect sizes (Cohen's d=2.23 for ideas, d=0.90 for execution) across 912 AI papers and 41 human papers, with mechanism analysis depth as the largest execution weakness and only 0.8% of AI papers exceeding the median human paper on both dimensions simultaneously.
Significance. If the independence of the idea and execution measures can be established, the work would be significant for identifying ideation as the key barrier to competitive AI economics research and for offering a scalable, statistically grounded LLM-based evaluation framework. The large AI sample, use of effect sizes and p-values, and documentation of specific weaknesses (e.g., mechanism depth) and low overlap rates provide concrete, falsifiable benchmarks that could guide future model development.
major comments (1)
- [Abstract] The decomposition attributing 71% of the overall quality gap to idea quality (versus 29% to execution) rests on the assumption that the two-model ensemble isolates idea quality independently of execution quality. Because the ensemble is trained on publication decisions (which jointly reflect idea novelty, execution rigor, and other factors), the reported idea scores may embed execution signals. No cross-validation against human idea-only ratings or ablation demonstrating that ensemble scores remain predictive after controlling for execution metrics is described; this independence is load-bearing for the 71%/29% split and the primary-bottleneck conclusion.
minor comments (2)
- [Abstract] The human baseline rests on only 41 papers from AER and AEJ: Economic Policy; the manuscript should discuss statistical power, robustness checks, or sensitivity to this limited sample size relative to the 912 AI papers (see the bootstrap sketch after this list).
- [Abstract] Execution quality is scored by Gemini 3.1 Flash Lite, the same model family used as the APE tournament judge. Clarify whether this choice introduces any circularity or is solely for methodological consistency.
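On the first minor comment, one way to gauge sensitivity to the 41-paper human baseline is to bootstrap the Cohen's d estimate and report its confidence interval. A minimal sketch with synthetic stand-in scores (the paper's underlying data are not available here):

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference with pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * np.var(a, ddof=1) +
                      (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled

rng = np.random.default_rng(1)
human = rng.normal(0.471, 0.15, 41)   # synthetic stand-ins, not the paper's data
ai = rng.normal(0.165, 0.12, 912)

boot = [cohens_d(rng.choice(human, 41, replace=True),
                 rng.choice(ai, 912, replace=True))
        for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"d = {cohens_d(human, ai):.2f}, 95% bootstrap CI [{lo:.2f}, {hi:.2f}]")
```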
Simulated Author's Rebuttal
We thank the referee for highlighting the critical assumption underlying our decomposition. We agree that independence between the idea and execution measures requires explicit validation and address this below.
Point-by-point responses
- Referee: [Abstract] The decomposition attributing 71% of the overall quality gap to idea quality (versus 29% to execution) rests on the assumption that the two-model ensemble isolates idea quality independently of execution quality. Because the ensemble is trained on publication decisions (which jointly reflect idea novelty, execution rigor, and other factors), the reported idea scores may embed execution signals. No cross-validation against human idea-only ratings or ablation demonstrating that ensemble scores remain predictive after controlling for execution metrics is described; this independence is load-bearing for the 71%/29% split and the primary-bottleneck conclusion.
- Authors: We acknowledge that the manuscript does not report explicit tests of independence and that publication decisions incorporate both idea and execution elements. The ensemble follows the Gong, Li, and Zhou (2026) protocol, which trains on initial research proposals where execution details are minimal; we interpret this as primarily isolating idea quality. Nevertheless, the referee is correct that this requires direct validation. In the revision we will add: (i) an ablation regressing the AI-human distinction on idea scores while controlling for the six execution dimensions (a sketch follows below), and (ii) a small-scale expert rating exercise in which economists score a random subset of ideas on novelty and feasibility alone. These additions will either corroborate the 71/29 split or qualify it; we will report the results transparently. Revision: yes.
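A sketch of the promised ablation (i), assuming a paper-level dataset with a human/AI indicator, the ensemble idea probability, and six execution-dimension scores; the column and dimension names are hypothetical stand-ins (only mechanism depth and robustness are named in the excerpt):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical names for the six execution dimensions.
EXEC_DIMS = ["identification", "data_quality", "mechanism_depth",
             "robustness", "writing", "contribution"]

def idea_score_ablation(df: pd.DataFrame):
    """Logit of the human/AI indicator on idea scores, controlling for the
    six execution dimensions. A coefficient on idea_prob that survives the
    controls would support the independence of the two measures."""
    X = sm.add_constant(df[["idea_prob"] + EXEC_DIMS])
    fit = sm.Logit(df["is_human"], X).fit(disp=0)
    return fit.params["idea_prob"], fit.pvalues["idea_prob"]

# Synthetic illustration only (not the paper's data):
rng = np.random.default_rng(2)
n = 953
df = pd.DataFrame({dim: rng.normal(4.0, 0.5, n) for dim in EXEC_DIMS})
df["is_human"] = (rng.random(n) < 41 / 953).astype(int)
df["idea_prob"] = 0.165 + 0.20 * df["is_human"] + rng.normal(0, 0.10, n)

coef, pval = idea_score_ablation(df)
print(f"idea_prob coefficient: {coef:.2f} (p = {pval:.3f})")
```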
Circularity Check
Self-citation, load-bearing: the idea-quality ensemble is trained on joint publication decisions.
specific steps
- self-citation, load-bearing [Abstract]:
"Using a two-model ensemble of fine-tuned language models trained on publication decisions (Gong, Li, and Zhou, 2026) to evaluate idea quality and a comprehensive six-dimension rubric assessed by Gemini 3.1 Flash Lite -- the same model family used as the APE tournament judge, ensuring methodological consistency -- to evaluate execution quality"
The idea-quality scores derive from a model trained on publication decisions that jointly reflect idea and execution factors in the training papers. The 71% ideation attribution therefore depends on the ensemble isolating idea quality, but the training objective does not separate the dimensions; the resulting gap decomposition embeds execution signals into the 'idea' scores by construction of the training data.
full rationale
The paper's central decomposition (idea quality accounts for ~71% of the gap) rests on a two-model ensemble that produces independent idea-quality scores. This ensemble is fine-tuned on publication decisions from Gong, Li, and Zhou (2026), which by definition embed both idea novelty and execution quality jointly. The execution scores are obtained from the same model family used as the APE tournament judge. Because the training data and judge family conflate the two dimensions, the 71%/29% attribution and the claim that ideation is the primary bottleneck reduce to the composite signals already present in the scoring inputs rather than an externally validated separation. No cross-validation against human idea-only ratings is shown in the provided text.
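One concrete version of this check: residualize the idea scores on the six execution dimensions via OLS and recompute the human-AI gap on the residuals. If the 71 percent attribution were mostly embedded execution signal, the residualized gap would shrink sharply. A minimal sketch with hypothetical inputs (idea scores as a length-n array, execution dimensions as an n-by-6 matrix, and a 0/1 human indicator):

```python
import numpy as np

def residualized_idea_gap(idea, exec_dims, is_human):
    """Partial the six execution dimensions out of the idea scores via OLS,
    then compare residual means across human and AI papers."""
    X = np.column_stack([np.ones(len(idea)), exec_dims])   # intercept + 6 dims
    beta, *_ = np.linalg.lstsq(X, idea, rcond=None)
    resid = idea - X @ beta
    return resid[is_human == 1].mean() - resid[is_human == 0].mean()
```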
Axiom & Free-Parameter Ledger
free parameters (1)
- ensemble thresholds for exceptional probability
axioms (2)
- domain assumption: the LLM ensemble trained on publication decisions measures idea quality without systematic bias
- domain assumption: the Gemini 3.1 Flash Lite rubric scores execution quality consistently across AI and human papers
Reference graph
Works this paper leans on
-
[1]
LLMs learn scientific taste from institutional traces across the social sciences
Goodman-Bacon, Andrew. "Machines Acquire Scientific Taste from Institutional Traces." arXiv:2603.16659.
-
[2]
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
Si, C., Hashimoto, T., and Yang, D. "Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers." arXiv:2409.04109.
-
[3]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), Datasets and Benchmarks Track.
-
[4]
Gong, Li, and Zhou (2026)
Source of the fine-tuned GPT-4.1-nano-econ model used to evaluate research idea quality. Per Appendix A of the reviewed paper, the model receives the evaluation prompt as a system message and the standardized idea description as the user message, outputs a single tier token with log-probability information enabled, and the probability distribution over the four tier tokens is extracted via softmax normalization (see Section 2.2 of the paper).