Theorist Toolbox: Tools for Agent Based LLM-assisted economic theory Research

Moran Koren

arxiv: 2606.22337 · v2 · pith:S3QU27JWnew · submitted 2026-06-21 · econ.TH · cs.GT · econ.GN · q-fin.EC

Pith reviewed 2026-06-26 09:58 UTC · model grok-4.3

classification econ.TH · cs.GT · econ.GN · q-fin.EC

keywords LLM-assisted theory · verification protocols · economic mechanism design · adversarial verification · multi-agent systems · grade inflation · Groves mechanisms · Pigouvian taxes

The pith

External verification, not model capability, determines the reliability of LLM-assisted economic theory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that large language models can generate content for economic theory but need robust verification to avoid false claims. It introduces three protocols that differ in how the work is verified: a single pass, an adversarial pair of models in which one proposes and the other refutes, and a multi-agent system with a reviewer gate. These are demonstrated on the task of designing a Groves or Pigouvian mechanism for the Gans-Kominers model of grade inflation. In the demonstration, verification caught errors, two runs converged on the same finding, and the most polished output was the least verified. This makes verification the central design choice for such research.

Core claim

The central claim is that the bottleneck in LLM-assisted economic theory is trust rather than production, and that protocols differing in how they verify the work can address this. In the worked example, none of the three runs produced a strict direct-revelation VCG mechanism. Two runs nevertheless converged on the same effective-resistance externality kernel, the adversarial pair caught three of its own false claims, and the reviewer gate rejected a sub-goal. The most polished output was also the least verified.
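For orientation, the textbook Groves family that the requested mechanism would belong to can be written as follows. This is the standard form found in the mechanism-design literature, not a formula taken from the paper; the symbols (types \theta_i, valuations v_i, outcome x, transfers t_i) are notation assumed here for illustration.

```latex
% Efficient outcome rule given reported types \theta = (\theta_1, \dots, \theta_n):
x^*(\theta) \in \arg\max_{x} \sum_{j} v_j(x, \theta_j)

% Groves transfer paid to agent i, with h_i any function that does not depend on i's own report:
t_i(\theta) = \sum_{j \neq i} v_j\bigl(x^*(\theta), \theta_j\bigr) - h_i(\theta_{-i})

% Clarke (pivot) choice of h_i:
h_i(\theta_{-i}) = \max_{x} \sum_{j \neq i} v_j(x, \theta_j)
```

With quasilinear utility v_i + t_i, truthful reporting is a dominant strategy under any such transfer rule. Whether a rule of this form exists for the eigengrade model is exactly what the paper leaves open.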

What carries the argument

Three verification protocols: a single disciplined pass, an adversarial prover-verifier pair, and a structured multi-agent project with a reviewer gate.
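The structural difference between the three protocols can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Prover` and `Verifier` interfaces, the `Result` record, the round limit, and the toy stubs are all hypothetical names introduced here, and real runs would call language models where the stubs stand.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical interfaces (assumptions, not from the paper):
# a prover maps (task, objections so far) to a candidate claim;
# a verifier maps a claim to an objection, or None if it finds no flaw.
Prover = Callable[[str, List[str]], str]
Verifier = Callable[[str], Optional[str]]


@dataclass
class Result:
    claim: str
    accepted: bool
    objections: List[str] = field(default_factory=list)


def single_pass(task: str, prover: Prover) -> Result:
    """Protocol 1: one pass; nothing external checks the claim."""
    return Result(claim=prover(task, []), accepted=True)


def adversarial_pair(task: str, prover: Prover, verifier: Verifier,
                     max_rounds: int = 3) -> Result:
    """Protocol 2: the prover revises until the verifier stops objecting."""
    objections: List[str] = []
    claim = ""
    for _ in range(max_rounds):
        claim = prover(task, objections)
        objection = verifier(claim)
        if objection is None:
            return Result(claim, True, objections)
        objections.append(objection)
    return Result(claim, False, objections)


def gated_project(subgoals: List[str], prover: Prover,
                  reviewer: Verifier) -> List[Result]:
    """Protocol 3: every sub-goal must pass a reviewer gate."""
    results = []
    for goal in subgoals:
        claim = prover(goal, [])
        objection = reviewer(claim)
        results.append(Result(claim, objection is None,
                              [objection] if objection else []))
    return results


# Toy stubs standing in for model calls (purely illustrative).
def toy_prover(task: str, feedback: List[str]) -> str:
    # Overclaims first, then weakens the claim once challenged.
    return f"{task}: weak version" if feedback else f"{task}: strong version"


def toy_verifier(claim: str) -> Optional[str]:
    return "counterexample to strong version" if "strong" in claim else None


unchecked = single_pass("T", toy_prover)
checked = adversarial_pair("T", toy_prover, toy_verifier)
gated = gated_project(["g1"], toy_prover, toy_verifier)
```

In this toy setting the single pass returns the overclaim unchallenged, the pair settles on the weaker claim after one objection, and the gate rejects the sub-goal. That mirrors, in miniature, the pattern the paper reports.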

If this is right

Adversarial verification caught three of its own false claims in the demonstration.
The multi-agent reviewer gate rejected a flawed sub-goal.
Convergent discovery of the same effective-resistance externality kernel occurred in two runs.
Polish in the output did not correspond to higher rigor or verification.
The specific mechanism requested was not produced, possibly due to non-existence.
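For readers unfamiliar with the term used in the list above, effective resistance has a standard graph-theoretic definition via the pseudoinverse of the graph Laplacian. The sketch below computes that standard quantity with NumPy. How the paper's externality kernel is built from it is not stated in the abstract, so this should be read as background only, and the function name and example graph are assumptions made here.

```python
import numpy as np


def effective_resistance(adjacency: np.ndarray) -> np.ndarray:
    """Pairwise effective resistance of a connected, undirected, weighted graph."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A      # graph Laplacian
    Lp = np.linalg.pinv(L)              # Moore-Penrose pseudoinverse
    d = np.diag(Lp)
    # R[i, j] = Lp[i, i] + Lp[j, j] - 2 * Lp[i, j]
    return d[:, None] + d[None, :] - 2.0 * Lp


# Path graph 0 - 1 - 2 with unit edge weights (illustrative input).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
R = effective_resistance(A)
```

On the path graph the two unit edges act like resistors in series, so the resistance between the end nodes is 2 and between neighbours is 1.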

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applying these protocols to other economic theory problems could test their generality beyond the grade inflation model.
Different model combinations in the adversarial pair might yield varying error-catching rates.
Integrating these verification steps could shorten the time from idea to initial draft in theoretical economics.

Load-bearing premise

The assumption that the specific example of designing a Groves/Pigouvian mechanism for the Gans-Kominers eigengrade model serves as a representative test case for evaluating the three verification protocols in nontrivial economic theory.

What would settle it

A test on a different nontrivial economic theory problem where the ground truth is independently known, checking if the adversarial pair catches a similar number of false claims or if the single pass performs equivalently.
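One way to score such a test, assuming each claim in a run can be labelled true or false against the known ground truth, is to compare the verifier's flags with those labels. The function name, the record layout, and the demo labels below are hypothetical and are not data from the paper.

```python
from typing import Iterable, Tuple


def verification_rates(records: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Each record is (claim_is_true, verifier_flagged).

    Returns (catch_rate, false_alarm_rate):
    the share of false claims that were flagged, and
    the share of true claims that were wrongly flagged.
    """
    records = list(records)
    flags_on_false = [flagged for is_true, flagged in records if not is_true]
    flags_on_true = [flagged for is_true, flagged in records if is_true]
    catch = sum(flags_on_false) / len(flags_on_false) if flags_on_false else float("nan")
    alarm = sum(flags_on_true) / len(flags_on_true) if flags_on_true else float("nan")
    return catch, alarm


# Made-up labels purely for illustration; not results from the paper.
demo = [(False, True), (False, True), (False, False), (True, False), (True, True)]
catch_rate, false_alarm_rate = verification_rates(demo)
```

Running the same scoring on the single-pass and adversarial-pair outputs for the same problem would give the comparison the paragraph above asks for.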

Original abstract

Empirical economists often start their projects with a toolbox. Shared packages, replication archives, and circulated guides shorten the time between an idea and a rough initial draft. Theorists, on the other hand, largely start from a blank page. By 2026, large language models can a produce and check nontrivial mathematics. The can also hallucinate and write wrong claims very convincingly. The current bottleneck on machine-assisted theory is no longer production but trust: a model will claim to prove a false theorem as readily as a true one. Building on recent attempts in mathematics, I present 3 methods for doing economic theory with a language model. These methods differ on how the work is verified: a single disciplined pass, an adversarial prover-verifier pair (Claude Opus 4.8 proposing, OpenAI Codex refuting), and a structured multi-agent project with a reviewer gate (inspired by the Google co-mathematician architecture). I demonstrate these protocols on one open worked example: designing a Groves/Pigouvian incentive mechanism for the Gans-Kominers eigengrade model of grade inflation. None of the three runs produced a strict direct-revelation VCG/Clarke mechanism (as requested, perhaps due to the non-existence of such a mechanism). Three phenomena recur. First, convergent discovery: two runs derive the same effective-resistance externality kernel on opposite margins. Second, adversarial verification is load-bearing: the pair caught three of its own false claims and the gate rejected a sub-goal. Third, polish is not rigor: the most finished-looking output was the least verified. The methodological takeaway is that external verification, not model capability, is the design variable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

One worked example of three LLM verification protocols on a mechanism design task, with the takeaway that verification protocols matter more than raw model output.

The paper runs three verification setups (single pass, adversarial prover-verifier pair, and gated multi-agent) on the task of finding a Groves/Pigouvian mechanism for the Gans-Kominers eigengrade model. None produced the requested strict VCG mechanism, which the author notes may reflect non-existence. Across runs it records convergent discovery of an effective-resistance externality kernel, the adversarial pair catching false claims, and the gate rejecting a sub-goal. The main observation is that external verification, not model capability, controls reliability.

The concrete demonstration is the useful part. It shows actual protocol outputs, flags that polish does not equal rigor, and gives a clear description of how the three setups differ in structure. That supplies a starting point for anyone who wants to try LLM assistance in theory work rather than just read about it.

The limitation is the single open problem. Because the target mechanism may not exist, the runs mainly show avoidance of overclaiming rather than recovery of a known correct answer. No additional cases, no ablation on model size or prompting, and no non-LLM baseline appear, so the claim that verification is the design variable rests on one instance. The phenomena observed could be specific to this task.

This is for theorists who are already experimenting with LLMs and want structured protocols instead of ad-hoc prompting. It is a methods note rather than a new theorem. The protocols are described plainly enough that a referee could evaluate them and ask for the extra cases that would make the general claim testable.

I would send it to peer review as a short methods piece, with the expectation that reviewers will request at least one more worked example before acceptance.

Referee Report

2 major / 1 minor

Summary. The paper presents three verification protocols for LLM-assisted economic theory (single disciplined pass, adversarial prover-verifier pair, and multi-agent project with reviewer gate) and demonstrates them on one open problem: designing a Groves/Pigouvian mechanism for the Gans-Kominers eigengrade model. It reports three recurring phenomena (convergent discovery of an effective-resistance externality kernel, adversarial catching of false claims, and polish not equaling rigor) and concludes that external verification, not model capability, is the controlling design variable.

Significance. If the observations hold and generalize, the work could usefully redirect attention in LLM-assisted theory toward verification architectures rather than prompt engineering or model scale. The concrete protocols (including the adversarial pair and gate) and the choice of an open mechanism-design task are strengths that could be built upon by other researchers.

major comments (2)

[Abstract / demonstration] Abstract and demonstration section: the central claim that 'external verification, not model capability, is the design variable' rests on three protocol runs on a single open mechanism-design task where the requested strict VCG/Clarke mechanism may not exist; without additional cases, model ablations, or non-LLM baselines, the reported phenomena (convergent discovery, error catching) cannot be shown to be general rather than idiosyncratic to this instance.
[Abstract] Abstract: the paper states that the adversarial pair 'caught three of its own false claims' and the gate 'rejected a sub-goal,' yet supplies no details on the mathematical steps, the specific erroneous claims, or the reasoning establishing non-existence of the requested mechanism; this prevents assessment of whether the verification protocols performed as described.

minor comments (1)

[Abstract] Abstract contains typographical errors: 'can a produce' should read 'can produce' and 'The can also hallucinate' should read 'They can also hallucinate'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive report. The comments correctly identify limitations in scope and detail; we address them point-by-point below and will revise accordingly.

Referee: [Abstract / demonstration] Abstract and demonstration section: the central claim that 'external verification, not model capability, is the design variable' rests on three protocol runs on a single open mechanism-design task where the requested strict VCG/Clarke mechanism may not exist; without additional cases, model ablations, or non-LLM baselines, the reported phenomena (convergent discovery, error catching) cannot be shown to be general rather than idiosyncratic to this instance.

Authors: We agree the paper is a methods demonstration on one open problem rather than a multi-case empirical study. The central claim will be revised in the abstract and conclusion to present the phenomena as observations from this specific demonstration, with explicit language noting that generality requires further work. The possible non-existence of the requested VCG mechanism is retained as part of the test case, as it illustrates protocol behavior on an unsolved task. revision: yes
Referee: [Abstract] Abstract: the paper states that the adversarial pair 'caught three of its own false claims' and the gate 'rejected a sub-goal,' yet supplies no details on the mathematical steps, the specific erroneous claims, or the reasoning establishing non-existence of the requested mechanism; this prevents assessment of whether the verification protocols performed as described.

Authors: The demonstration section contains the full traces of the three false claims, the adversarial refutations, the rejected sub-goal, and the reasoning on mechanism non-existence. We will revise the abstract to include one concrete example of a caught claim and will add a concise summary table of verification events to the demonstration section for easier assessment. revision: yes

Circularity Check

0 steps flagged

No circularity; methodological observations drawn directly from described runs

The paper describes three verification protocols for LLM-assisted theory work and applies them to one explicit worked example (Groves/Pigouvian mechanism for the Gans-Kominers model). The takeaway that external verification is the controlling design variable is presented as a direct observation from the three protocol runs (convergent discovery, adversarial catching of errors, gate rejection). No equations, fitted parameters, or quantitative predictions exist that could reduce to the paper's own inputs by construction. No self-citations are invoked as load-bearing support for the central claim. The analysis is therefore self-contained as a report on the stated experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review provides insufficient detail to enumerate free parameters or invented entities; the visible premise is that LLMs can generate nontrivial mathematics that requires external verification.

axioms (1)

domain assumption: Large language models can produce and check nontrivial mathematics but will also produce false claims convincingly.
Stated directly in the abstract as the premise motivating the need for verification protocols.

Reference graph

Works this paper leans on

17 extracted references · 1 canonical work page

[1] Clarke, E. H. (1971). Multipart pricing of public goods. Public Choice 11, 17–33.

[2] Cunningham, S. (2021). Causal Inference: The Mixtape. Yale University Press. https://mixtape.scunning.com.
Google DeepMind (2024). AI achieves silver-medal standard solving International Mathematical Olympiad problems. https://deepmind.google/blog/ai-solves-imo-problems-at-silver-medal-level/

[3] Hubert, T., Mehta, H., et al. (AlphaProof team) (2025). Olympiad-level formal mathematical reasoning with reinforcement learning. Nature. DOI 10.1038/s41586-025-09833-y.
Epoch AI (2026). FrontierMath benchmark program (Tier 4; v2 error-corrected release). https://epoch.ai/frontiermath

[4] Gans, J. S. and Kominers, S. D. (2026). What does a grade mean? Informativeness and strategic manipulation of grading systems. NBER Working Paper No. 35183. https://www.nber.org/papers/w35183

[5] Glazer, E., et al. (2024). FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI. arXiv:2411.04872. https://arxiv.org/abs/2411.04872

[6] Goldsmith-Pinkham, P., Sorkin, I., and Swift, H. (2020). Bartik instruments: What, when, why, and how. American Economic Review 110(8), 2586–2624. https://paulgp.com

[7] Green, J. and Laffont, J.-J. (1979). Incentives in Public Decision-Making. North-Holland.

[8] Groves, T. (1973). Incentives in teams. Econometrica 41(4), 617–631.

[9] Miller, N., Resnick, P., and Zeckhauser, R. (2005). Eliciting informative feedback: The peer-prediction method. Management Science 51(9), 1359–1373.

[10] Jana, Kale, Tanriverdi, Song, Vishwanath, and Ganesh (2025). ProofBridge: Auto-formalization of natural language proofs in Lean via joint embeddings. arXiv:2510.15681. https://arxiv.org/abs/2510.15681.
QEDBench: Quantifying the alignment gap in automated evaluation of university-level mathematical proofs (2026). arXiv:2602.20629. https://arxiv.org/abs/2...

[11] Weng, Du, Li, et al. (2025). Autoformalization in the era of large language models: A survey. arXiv:2505.23486. https://arxiv.org/abs/2505.23486

[12] Petrov, I., Dekoninck, J., and Vechev, M. (2025). BrokenMath: A benchmark for sycophancy in theorem proving with LLMs. arXiv:2510.04721. https://arxiv.org/abs/2510.04721

[13] Wang, Yang, Wang, Wei, and Feng (2025). Examining false positives under inference scaling for mathematical reasoning. arXiv:2502.06217. https://arxiv.org/abs/2502.06217

[14] Bryant, Huerta y Munive, Kaliszyk, and Urban (2026). Munkres' general topology autoformalized in Isabelle/HOL. arXiv:2604.07455. https://arxiv.org/abs/2604.07455

[15] Vickrey, W. (1961). Counterspeculation, auctions, and competitive sealed tenders. Journal of Finance 16(1), 8–37.

[16] Witkowski, J. and Parkes, D. C. (2012). A robust Bayesian truth serum for small populations. In Proceedings of the 26th AAAI Conference on Artificial Intelligence, 1492–1498.

[17] Zheng, D., von Glehn, I., Zwols, Y., et al. (2026). AI co-mathematician: Accelerating mathematicians with agentic AI. Google DeepMind. arXiv:2605.06651. https://arxiv.org/abs/2605.06651.
