The Ideation Bottleneck: Decomposing the Quality Gap Between AI-Generated and Human Economics Research
Pith reviewed 2026-05-13 18:58 UTC · model grok-4.3
The pith
Idea quality explains 71 percent of the performance gap between AI-generated and human economics papers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Autonomous AI systems generate complete economics papers that underperform human-authored publications in head-to-head comparisons. The quality gap decomposes into a large idea-quality difference (Cohen's d = 2.23), where humans reach 47.1 percent mean ensemble exceptional probability versus 16.5 percent for AI, and a smaller execution-quality difference (d = 0.90), where humans score 4.38 out of 5 versus 3.84. Idea quality accounts for approximately 71 percent of the overall difference, with execution contributing 29 percent. The largest execution weakness appears in mechanism analysis depth, while robustness shows no significant difference. Only 0.8 percent of AI papers surpass the median human paper on both idea and execution quality simultaneously.
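The excerpt does not state the exact decomposition formula, but the reported 71/29 split is consistent with weighting the two gaps by their standardized effect sizes. A minimal arithmetic check under that assumption:

```python
# Worked check of the 71/29 split (assumption: the decomposition weights
# each gap by its Cohen's d; the excerpt does not state the exact formula).
d_idea, d_exec = 2.23, 0.90   # effect sizes reported in the abstract

idea_share = d_idea / (d_idea + d_exec)   # ~0.712
exec_share = d_exec / (d_idea + d_exec)   # ~0.288

print(f"idea share: {idea_share:.0%}, execution share: {exec_share:.0%}")
```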
What carries the argument
A two-model ensemble trained on publication decisions to score idea quality, paired with a six-dimension rubric evaluated by Gemini to score execution quality, applied to 912 AI papers and 41 human papers.
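The appendix fragment preserved under reference [4] below describes the idea-scoring mechanics: each fine-tuned model emits a single tier token with log-probabilities, and a softmax over the four tier tokens yields the exceptional probability, averaged across the two ensemble members. A minimal sketch of that extraction; the tier labels and log-probability values here are hypothetical stand-ins:

```python
import numpy as np

# Tier labels are assumed; the excerpt describes four tiers but does not name them.
TIER_TOKENS = ["exceptional", "strong", "average", "weak"]

def exceptional_prob(tier_logprobs):
    """Softmax-normalize the four tier-token log-probabilities and return
    the probability mass on the 'exceptional' tier."""
    logits = np.array([tier_logprobs[t] for t in TIER_TOKENS])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(probs[TIER_TOKENS.index("exceptional")])

def ensemble_score(model_outputs):
    """Mean exceptional probability across the two fine-tuned models."""
    return float(np.mean([exceptional_prob(lp) for lp in model_outputs]))

# Hypothetical log-probabilities from the two ensemble members:
score = ensemble_score([
    {"exceptional": -2.1, "strong": -0.4, "average": -1.9, "weak": -3.0},
    {"exceptional": -1.8, "strong": -0.6, "average": -1.7, "weak": -2.8},
])
print(f"mean ensemble exceptional probability: {score:.1%}")
```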
If this is right
- Human papers achieve markedly higher idea exceptional probability than AI papers.
- Mechanism analysis depth shows the largest execution gap among the six dimensions (d = 1.43).
- Robustness checks exhibit no significant quality difference between AI and human papers.
- 74 percent of AI papers rely on difference-in-differences designs.
- Only 0.8 percent of AI papers exceed the median human paper on both idea and execution quality at once.
Where Pith is reading between the lines
- Future AI training focused on idea generation could close most of the observed gap if the current decomposition holds.
- The findings suggest testing whether scaling ideation-specific capabilities in models reduces the 71 percent idea contribution over time.
- Neighboring fields such as political science or sociology may show similar ideation bottlenecks if the same decomposition method is applied.
- Improving execution alone would address only about 29 percent of the gap, implying limited returns from refinements to analysis pipelines without better ideas.
Load-bearing premise
The two-model ensemble and Gemini rubric accurately and unbiasedly measure true idea and execution quality.
What would settle it
A new AI system that generates papers scoring at or above the human median on both the ensemble idea-quality probability and the execution rubric simultaneously.
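A sketch of that acceptance test, assuming paper-level score arrays; the data below are synthetic stand-ins loosely calibrated to the reported means, not the paper's distributions:

```python
import numpy as np

def joint_exceedance(ai_idea, ai_exec, human_idea, human_exec):
    """Fraction of AI papers above the human median on BOTH dimensions;
    the paper reports 7/912 (0.8%) for this statistic."""
    both = ((np.asarray(ai_idea) > np.median(human_idea)) &
            (np.asarray(ai_exec) > np.median(human_exec)))
    return both.mean()

# Synthetic stand-ins (not the paper's data):
rng = np.random.default_rng(0)
ai_idea = rng.beta(2, 8, 912)              # mean ~0.20 vs reported 16.5%
ai_exec = rng.normal(3.84, 0.40, 912)
human_idea = rng.beta(5, 5, 41)            # mean ~0.50 vs reported 47.1%
human_exec = rng.normal(4.38, 0.30, 41)
print(f"joint exceedance: {joint_exceedance(ai_idea, ai_exec, human_idea, human_exec):.1%}")
```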
Original abstract
Autonomous AI systems can now generate complete economics research papers, but they substantially underperform human-authored publications in head-to-head comparisons. This paper decomposes the quality gap into two independent components: research idea quality and execution quality. Using a two-model ensemble of fine-tuned language models trained on publication decisions (Gong, Li, and Zhou, 2026) to evaluate idea quality and a comprehensive six-dimension rubric assessed by Gemini 3.1 Flash Lite -- the same model family used as the APE tournament judge, ensuring methodological consistency -- to evaluate execution quality, we analyze 953 economics papers -- 912 AI-generated papers from the APE project and 41 human papers published in the American Economic Review and AEJ: Economic Policy. The idea quality gap is large (Cohen's d = 2.23, p < 0.001), with human papers achieving 47.1% mean ensemble exceptional probability versus 16.5% for AI. The execution quality gap is also significant but smaller (d = 0.90, p < 0.001), with human papers scoring 4.38/5.0 versus 3.84. Idea quality accounts for approximately 71% of the overall quality difference, with execution contributing 29%. The largest execution weakness is mechanism analysis depth (d = 1.43); no significant difference is found on robustness. We document that 74% of AI papers employ difference-in-differences, and only 7 AI papers (0.8%) surpass the median human paper on both idea and execution quality simultaneously. The primary bottleneck to competitive AI-generated economics research remains ideation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that the quality gap between AI-generated economics papers (from the APE project) and human publications in top journals is driven primarily by differences in research idea quality (71% of the gap) rather than execution quality (29%). It reaches this conclusion by scoring idea quality via a two-model ensemble fine-tuned on publication decisions and execution quality via a six-dimension Gemini rubric, reporting large effect sizes (Cohen's d=2.23 for ideas, d=0.90 for execution) across 912 AI papers and 41 human papers, with mechanism analysis depth as the largest execution weakness and only 0.8% of AI papers exceeding the median human paper on both dimensions simultaneously.
Significance. If the independence of the idea and execution measures can be established, the work would be significant for identifying ideation as the key barrier to competitive AI economics research and for offering a scalable, statistically grounded LLM-based evaluation framework. The large AI sample, use of effect sizes and p-values, and documentation of specific weaknesses (e.g., mechanism depth) and low overlap rates provide concrete, falsifiable benchmarks that could guide future model development.
major comments (1)
- [Abstract] The decomposition attributing 71% of the overall quality gap to idea quality (versus 29% to execution) rests on the assumption that the two-model ensemble isolates idea quality independently of execution quality. Because the ensemble is trained on publication decisions (which jointly reflect idea novelty, execution rigor, and other factors), the reported idea scores may embed execution signals. No cross-validation against human idea-only ratings or ablation demonstrating that ensemble scores remain predictive after controlling for execution metrics is described; this independence is load-bearing for the 71%/29% split and the primary-bottleneck conclusion.
minor comments (2)
- [Abstract] The human baseline rests on only 41 papers from AER and AEJ: Economic Policy; the manuscript should discuss statistical power, robustness checks, or sensitivity to this limited sample size relative to the 912 AI papers (see the bootstrap sketch after this list).
- [Abstract] Execution quality is scored by Gemini 3.1 Flash Lite, the same model family used as the APE tournament judge. Clarify whether this choice introduces any circularity or is solely for methodological consistency.
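On the first minor comment, one way to gauge sensitivity to the 41-paper human baseline is to bootstrap the Cohen's d estimate and report its confidence interval. A minimal sketch with synthetic stand-in scores (the paper's underlying data are not available here):

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference with pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * np.var(a, ddof=1) +
                      (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled

rng = np.random.default_rng(1)
human = rng.normal(0.471, 0.15, 41)   # synthetic stand-ins, not the paper's data
ai = rng.normal(0.165, 0.12, 912)

boot = [cohens_d(rng.choice(human, 41, replace=True),
                 rng.choice(ai, 912, replace=True))
        for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"d = {cohens_d(human, ai):.2f}, 95% bootstrap CI [{lo:.2f}, {hi:.2f}]")
```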
Simulated Author's Rebuttal
We thank the referee for highlighting the critical assumption underlying our decomposition. We agree that independence between the idea and execution measures requires explicit validation and address this below.
Point-by-point responses
- Referee: [Abstract] The decomposition attributing 71% of the overall quality gap to idea quality (versus 29% to execution) rests on the assumption that the two-model ensemble isolates idea quality independently of execution quality. Because the ensemble is trained on publication decisions (which jointly reflect idea novelty, execution rigor, and other factors), the reported idea scores may embed execution signals. No cross-validation against human idea-only ratings or ablation demonstrating that ensemble scores remain predictive after controlling for execution metrics is described; this independence is load-bearing for the 71%/29% split and the primary-bottleneck conclusion.
- Authors: We acknowledge that the manuscript does not report explicit tests of independence and that publication decisions incorporate both idea and execution elements. The ensemble follows the Gong, Li, and Zhou (2026) protocol, which trains on initial research proposals where execution details are minimal; we interpret this as primarily isolating idea quality. Nevertheless, the referee is correct that this requires direct validation. In the revision we will add: (i) an ablation regressing the AI-human distinction on idea scores while controlling for the six execution dimensions (a sketch follows below), and (ii) a small-scale expert rating exercise in which economists score a random subset of ideas on novelty and feasibility alone. These additions will either corroborate the 71/29 split or qualify it; we will report the results transparently. Revision: yes.
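A sketch of the promised ablation (i), assuming a paper-level dataset with a human/AI indicator, the ensemble idea probability, and six execution-dimension scores; the column and dimension names are hypothetical stand-ins (only mechanism depth and robustness are named in the excerpt):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical names for the six execution dimensions.
EXEC_DIMS = ["identification", "data_quality", "mechanism_depth",
             "robustness", "writing", "contribution"]

def idea_score_ablation(df: pd.DataFrame):
    """Logit of the human/AI indicator on idea scores, controlling for the
    six execution dimensions. A coefficient on idea_prob that survives the
    controls would support the independence of the two measures."""
    X = sm.add_constant(df[["idea_prob"] + EXEC_DIMS])
    fit = sm.Logit(df["is_human"], X).fit(disp=0)
    return fit.params["idea_prob"], fit.pvalues["idea_prob"]

# Synthetic illustration only (not the paper's data):
rng = np.random.default_rng(2)
n = 953
df = pd.DataFrame({dim: rng.normal(4.0, 0.5, n) for dim in EXEC_DIMS})
df["is_human"] = (rng.random(n) < 41 / 953).astype(int)
df["idea_prob"] = 0.165 + 0.20 * df["is_human"] + rng.normal(0, 0.10, n)

coef, pval = idea_score_ablation(df)
print(f"idea_prob coefficient: {coef:.2f} (p = {pval:.3f})")
```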
Circularity Check
Self-citation, load-bearing: the idea-quality ensemble is trained on joint publication decisions.
specific steps
- self-citation, load-bearing [Abstract]:
"Using a two-model ensemble of fine-tuned language models trained on publication decisions (Gong, Li, and Zhou, 2026) to evaluate idea quality and a comprehensive six-dimension rubric assessed by Gemini 3.1 Flash Lite -- the same model family used as the APE tournament judge, ensuring methodological consistency -- to evaluate execution quality"
The idea-quality scores derive from a model trained on publication decisions that jointly reflect idea and execution factors in the training papers. The 71% ideation attribution therefore depends on the ensemble isolating idea quality, but the training objective does not separate the dimensions; the resulting gap decomposition embeds execution signals into the 'idea' scores by construction of the training data.
full rationale
The paper's central decomposition (idea quality accounts for ~71% of the gap) rests on a two-model ensemble that produces independent idea-quality scores. This ensemble is fine-tuned on publication decisions from Gong, Li, and Zhou (2026), which by definition embed both idea novelty and execution quality jointly. The execution scores are obtained from the same model family used as the APE tournament judge. Because the training data and judge family conflate the two dimensions, the 71%/29% attribution and the claim that ideation is the primary bottleneck reduce to the composite signals already present in the scoring inputs rather than an externally validated separation. No cross-validation against human idea-only ratings is shown in the provided text.
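One concrete version of this check: residualize the idea scores on the six execution dimensions via OLS and recompute the human-AI gap on the residuals. If the 71 percent attribution were mostly embedded execution signal, the residualized gap would shrink sharply. A minimal sketch with hypothetical inputs (idea scores as a length-n array, execution dimensions as an n-by-6 matrix, and a 0/1 human indicator):

```python
import numpy as np

def residualized_idea_gap(idea, exec_dims, is_human):
    """Partial the six execution dimensions out of the idea scores via OLS,
    then compare residual means across human and AI papers."""
    X = np.column_stack([np.ones(len(idea)), exec_dims])   # intercept + 6 dims
    beta, *_ = np.linalg.lstsq(X, idea, rcond=None)
    resid = idea - X @ beta
    return resid[is_human == 1].mean() - resid[is_human == 0].mean()
```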
Axiom & Free-Parameter Ledger
free parameters (1)
- ensemble thresholds for exceptional probability
axioms (2)
- domain assumption: the LLM ensemble trained on publication decisions measures idea quality without systematic bias
- domain assumption: the Gemini 3.1 Flash Lite rubric scores execution quality consistently across AI and human papers
Reference graph
Works this paper leans on
-
[1]
LLMs learn scientific taste from institutional traces across the social sciences
Goodman-Bacon, Andrew. "Machines Acquire Scientific Taste from Institutional Traces." arXiv:2603.16659.
-
[2]
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
Si, C., Hashimoto, T., and Yang, D. "Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers." arXiv:2409.04109.
-
[3]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), Datasets and Benchmarks Track.
-
[4]
Gong, Li, and Zhou (2026)
Source of the fine-tuned GPT-4.1-nano-econ model used to evaluate research idea quality. Per Appendix A of the reviewed paper, the model receives the evaluation prompt as a system message and the standardized idea description as the user message, outputs a single tier token with log-probability information enabled, and the probability distribution over the four tier tokens is extracted via softmax normalization (see Section 2.2 of the paper).