ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks
Pith reviewed 2026-05-10 00:32 UTC · model grok-4.3
The pith
A multi-agent LLM pipeline generates actuarial test items aligned with professional standards and shows which models handle them best.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The multi-agent pipeline separates drafting, distractor building, independent verification, and one-shot repair into four LLM roles. This produces items that support reliable model evaluation: the verifier flags a majority of drafts and the repair loop fixes most of them; locally hosted open models sit on the cost-versus-performance Pareto front; and MCQ scores inflate the apparent ceiling relative to LLM-judge open-ended scoring.
What carries the argument
The four-role multi-agent LLM pipeline with an independent verifier that drives bounded one-shot repair loops.
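The claimed machinery is simple enough to sketch in code. The control flow below is a hedged reconstruction from the review's description, not the paper's actual interface: the role names, the stubbed `call_llm`, and the FLAG/PASS convention are illustrative assumptions.

```python
# Control-flow sketch of the four-role pipeline: draft -> distractors ->
# independent verification -> bounded one-shot repair. All names here are
# stand-ins invented for illustration.

# The stubbed verifier flags the first pass and accepts the repair, mimicking
# the paper's reported pattern (majority flagged, most fixed by one retry).
_VERDICTS = iter(["FLAG: ambiguous stem", "PASS"])

def call_llm(role, payload):
    """Stub standing in for a per-role (adapter-separated) LLM call."""
    if role == "verifier":
        return next(_VERDICTS, "PASS")
    if role == "drafter":
        return f"item({payload})"
    if role == "distractor":
        return f"{payload} + distractors"
    raise ValueError(role)

def generate_item(topic):
    draft = call_llm("drafter", topic)       # role 1: draft the item
    full = call_llm("distractor", draft)     # role 2: construct distractors
    verdict = call_llm("verifier", full)     # role 3: independent check
    if verdict.startswith("FLAG"):
        # Bounded one-shot repair: exactly one retry, then accept or discard.
        draft = call_llm("drafter", f"{draft} | {verdict}")
        full = call_llm("distractor", draft)
        verdict = call_llm("verifier", full)
        status = "repaired" if verdict == "PASS" else "discarded"
    else:
        status = "accepted"
    return {"item": full if status != "discarded" else None, "status": status}

result = generate_item("stochastic mortality")
print(result["status"])  # prints "repaired": first verify flags, one-shot repair passes
```

The key design point the review highlights is that verification is a separate role from drafting, so the drafter never grades its own work, and the repair budget is capped at one attempt rather than looping until the verifier capitulates.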
If this is right
- The independent verifier step makes automated generation of domain-specific items feasible without constant human oversight.
- Open-weights models running locally or on low-cost hosts achieve near-top leaderboard results at minimal expense.
- MCQ formats alone give an incomplete picture of model capability, requiring open-ended evaluation to separate top performers.
- LLM judges can discriminate model performance on actuarial reasoning where multiple choice cannot.
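The last two bullets are claims about rankings, and ranking divergence can be quantified with a rank statistic such as Kendall's tau over the two leaderboards. The model names and rank positions below are invented for illustration; the paper does not report this particular computation.

```python
# Kendall's tau between an MCQ leaderboard and an LLM-judge leaderboard.
# A tau well below 1 would substantiate "rankings differ meaningfully".
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """rank_a, rank_b: {model: rank position}; assumes the same models, no ties."""
    concordant = discordant = 0
    for m, n in combinations(list(rank_a), 2):
        s = (rank_a[m] - rank_a[n]) * (rank_b[m] - rank_b[n])
        if s > 0:
            concordant += 1  # pair ordered the same way in both leaderboards
        else:
            discordant += 1  # pair flipped between the two leaderboards
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical four-model leaderboards; the top two swap places under the judge.
mcq   = {"model-a": 1, "model-b": 2, "model-c": 3, "model-d": 4}
judge = {"model-a": 2, "model-b": 1, "model-c": 3, "model-d": 4}
print(round(kendall_tau(mcq, judge), 2))  # prints 0.67
```

With 50 models, even a tau near 0.9 implies many flipped pairs, which is why a summary statistic alone would not replace inspecting where MCQ and judge modes disagree.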
Where Pith is reading between the lines
- This method of role-separated agents could be adapted to generate training materials in other expert domains like law or engineering.
- The finding that MCQ and judge rankings differ suggests many existing LLM benchmarks may overestimate practical skills.
- Iterating the pipeline with human feedback loops might further improve item quality over time without full manual creation.
Load-bearing premise
That an LLM judge provides reliable, unbiased scoring of open-ended actuarial responses and that the generated items accurately reflect the IAA syllabus without requiring human expert validation.
What would settle it
A study in which professional actuaries rate a sample of the generated items for accuracy and relevance, or compare LLM judge scores against human expert scores on the open-ended responses.
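The second proposed check can be sketched directly: Spearman rank correlation between LLM-judge scores and human actuary scores on the same open-ended responses. The scores below are invented; a correlation near 1 on a real sample would support the judge's reliability.

```python
# Spearman correlation = Pearson correlation of the two rank vectors.
# Implemented from scratch (with average ranks for ties) to stay dependency-free.

def ranks(xs):
    """Average ranks, 1-based, ties share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

judge = [4.5, 3.0, 5.0, 2.0, 4.0]  # hypothetical LLM-judge scores
human = [4.0, 3.5, 5.0, 2.5, 4.5]  # hypothetical expert scores
print(round(spearman(judge, human), 3))  # prints 0.9
```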
Original abstract
We present ActuBench, a multi-agent LLM pipeline for the automated generation and evaluation of advanced actuarial assessment items aligned with the International Actuarial Association (IAA) Education Syllabus. The pipeline separates four LLM roles by adapter: one agent drafts items, one constructs distractors, a third independently verifies both stages and drives bounded one-shot repair loops, and a cost-optimized auxiliary agent handles Wikipedia-note summarization and topic labelling. The items, per-model responses and complete leaderboard are published as a browsable web interface at https://actubench.de/en/, allowing readers and practitioners to inspect individual items without a repository checkout. We evaluate 50 language models from eight providers on two complementary benchmarks -- 100 empirically hardest multiple-choice items and 100 open-ended items scored by an LLM judge -- and report three headline findings. First, multi-agent verification is load-bearing: the independent verifier flags a majority of drafted items on first pass, most of which the one-shot repair loop resolves. Second, locally-hosted open-weights inference sits on the cost-performance Pareto front: a Gemma 4 model running on consumer hardware and a Cerebras-hosted 120B open-weights model dominate the near-zero-cost region, with the latter within one item of the top of the leaderboard. Third, MCQ and LLM-as-Judge rankings differ meaningfully: the MCQ scaffold inflates the performance ceiling, and Judge-mode evaluation is needed to discriminate at the frontier.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ActuBench, a multi-agent LLM pipeline that separates roles for drafting actuarial assessment items aligned with the IAA syllabus, constructing distractors, independent verification with bounded one-shot repair, and auxiliary summarization/labeling. It evaluates 50 models across eight providers on two benchmarks—100 empirically hardest MCQ items and 100 open-ended items scored by an LLM judge—reporting three headline findings: multi-agent verification flags and repairs most drafted items, locally-hosted open-weights models occupy the cost-performance Pareto front, and MCQ versus LLM-as-Judge rankings differ meaningfully with the latter needed to discriminate at the frontier. Items, responses, and the leaderboard are released via a public web interface.
Significance. If the central claims hold after validation, the work supplies a scalable, reproducible framework for automated generation of domain-specific actuarial benchmarks, with the public interface enabling direct item inspection and supporting transparency. It usefully demonstrates the limitations of MCQ scaffolding for frontier models and positions consumer-hardware open-weights inference as competitive, while publishing the full set of items and scores aids follow-on research in actuarial AI evaluation.
major comments (3)
- [Abstract and Evaluation Results] The third headline finding (MCQ vs. Judge rankings differ meaningfully) and the overall leaderboard rest on the LLM-as-Judge producing reliable, unbiased scores for open-ended actuarial responses, yet no correlation to human actuarial expert judgments or inter-rater reliability metrics is reported. This is load-bearing for the claim that Judge-mode evaluation is required to discriminate at the frontier and for the Pareto-front conclusion.
- [Pipeline Description and Headline Findings] The assertion that generated items accurately reflect the IAA syllabus and that multi-agent verification is load-bearing relies entirely on internal LLM processes (verifier flagging and one-shot repair), with no human domain-expert validation of item correctness, syllabus alignment, or distractor quality described. This affects the first headline finding and the benchmark's claimed utility.
- [Model Evaluation and Results] The selection of the '100 empirically hardest' MCQ items lacks reported details on the initial generation pool size, the precise statistical or empirical criteria used to identify hardness, and any error-rate or inter-item consistency measures across the 50-model evaluations. Without these, the robustness of the cost-performance Pareto front and cross-mode ranking differences cannot be fully assessed.
minor comments (2)
- [Abstract] The public web interface is a strength for usability; the manuscript could add a brief description of how the complete dataset and generation code can be obtained for full reproducibility.
- [Pipeline Description] Clarify the exact adapter or prompting distinctions among the four LLM roles to avoid any ambiguity in the pipeline architecture.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects of validation and methodological transparency. We address each major comment below and will incorporate revisions to strengthen the manuscript, including expanded limitations discussions and additional details on item selection.
Point-by-point responses
-
Referee: [Abstract and Evaluation Results] The third headline finding (MCQ vs. Judge rankings differ meaningfully) and the overall leaderboard rest on the LLM-as-Judge producing reliable, unbiased scores for open-ended actuarial responses, yet no correlation to human actuarial expert judgments or inter-rater reliability metrics is reported. This is load-bearing for the claim that Judge-mode evaluation is required to discriminate at the frontier and for the Pareto-front conclusion.
Authors: We agree that the lack of reported correlation with human actuarial expert judgments is a substantive limitation for the third headline finding. The LLM judge was prompted with detailed actuarial criteria drawn from the IAA syllabus and produced empirically distinct rankings from MCQ, but without human inter-rater data the absolute reliability remains unquantified. In the revised manuscript we will add a dedicated limitations subsection that (a) explicitly states the absence of human correlation metrics, (b) reports any available prompt-level consistency checks we performed, and (c) qualifies the claim that Judge-mode evaluation is needed to discriminate at the frontier as an empirical observation pending external human validation. This revision will not alter the reported ranking differences but will improve interpretability. revision: yes
-
Referee: [Pipeline Description and Headline Findings] The assertion that generated items accurately reflect the IAA syllabus and that multi-agent verification is load-bearing relies entirely on internal LLM processes (verifier flagging and one-shot repair), with no human domain-expert validation of item correctness, syllabus alignment, or distractor quality described. This affects the first headline finding and the benchmark's claimed utility.
Authors: The referee is correct that all validation steps described are internal to the LLM pipeline. The independent verifier and bounded repair loop demonstrably reduce flagged errors, and the public web interface releases every item for external inspection. Nevertheless, we did not conduct human actuarial-expert review of syllabus alignment or distractor quality. We will revise the pipeline description and first headline finding to (i) state this limitation explicitly, (ii) emphasize that the released items enable community expert validation, and (iii) add any quantitative repair-success statistics already computed during generation. These changes will temper the claim of syllabus fidelity while preserving the utility of the automated pipeline as a scalable starting point. revision: yes
-
Referee: [Model Evaluation and Results] The selection of the '100 empirically hardest' MCQ items lacks reported details on the initial generation pool size, the precise statistical or empirical criteria used to identify hardness, and any error-rate or inter-item consistency measures across the 50-model evaluations. Without these, the robustness of the cost-performance Pareto front and cross-mode ranking differences cannot be fully assessed.
Authors: We will expand the Model Evaluation section with the requested details. The initial generation pool comprised 312 items; hardness was defined as the lowest mean accuracy across all 50 models, with ties broken by variance. We will report the exact pool size, the hardness formula, per-item standard deviations across models, and any observed inter-item consistency metrics. These additions will allow readers to evaluate the stability of the Pareto front and the MCQ-versus-Judge ranking divergence. revision: yes
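The stated selection criterion (lowest mean accuracy across all evaluated models, ties broken by variance) is concrete enough to sketch. The toy accuracy matrix below is invented, standing in for the real 312-item by 50-model pool, and the tie-break direction (higher variance ranks harder) is an assumption, since the rebuttal does not specify it.

```python
# Select the k empirically hardest items from a pool of per-model 0/1 scores.
from statistics import mean, pvariance

def hardest_items(accuracy, k):
    """accuracy: {item_id: [per-model 0/1 scores]}; returns the k hardest ids."""
    def key(item_id):
        scores = accuracy[item_id]
        # Primary: lowest mean accuracy. Tie-break: higher variance first
        # (an assumption; the rebuttal only says "ties broken by variance").
        return (mean(scores), -pvariance(scores))
    return sorted(accuracy, key=key)[:k]

# Toy pool of 4 items scored by 5 models; the real run used 312 items x 50 models.
pool = {
    "q1": [1, 1, 1, 1, 1],  # easy: every model solves it
    "q2": [0, 0, 1, 0, 0],  # hard
    "q3": [0, 1, 1, 0, 1],  # medium
    "q4": [0, 0, 0, 1, 0],  # hard, same mean accuracy as q2
}
print(hardest_items(pool, 2))  # prints ['q2', 'q4']
```

Reporting exactly this kind of per-item statistic (mean, variance, pool size) is what the referee asks for, since "hardest 100" is otherwise unverifiable.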
Circularity Check
No significant circularity: purely empirical benchmark construction and evaluation
Full rationale
The paper presents an empirical multi-agent LLM pipeline for generating actuarial items aligned to the IAA syllabus and evaluates 50 models on MCQ and open-ended benchmarks. No derivation chain, equations, fitted parameters, or predictions exist that could reduce to self-definition or self-citation. Headline findings (multi-agent verification load-bearing, local models on Pareto front, MCQ vs. Judge ranking differences) are direct experimental observations, not tautological outputs of the pipeline inputs. The work is self-contained against external model evaluations and requires no external uniqueness theorems or ansatzes.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLMs can reliably perform distinct roles (drafting, distractor creation, independent verification) when prompted via adapters
- domain assumption: An LLM judge can produce accurate scores for open-ended actuarial reasoning responses
Reference graph
Works this paper leans on
- [1] Caesar Balona. "ActuaryGPT: Applications of Large Language Models to Insurance and Actuarial Work". British Actuarial Journal, 2024. SSRN 4543652, first posted 17 Aug 2023. url: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4543652
- [2] Weize Chen et al. "AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors". 2023. arXiv:2308.10848. url: https://arxiv.org/abs/2308.10848
- [3] Zhiyu Chen et al. "FinQA: A Dataset of Numerical Reasoning over Financial Data". 2021. arXiv:2109.00122. url: https://arxiv.org/abs/2109.00122
- [4] Chunyuan Deng et al. "Investigating Data Contamination in Modern Benchmarks for Large Language Models". 2023. arXiv:2311.09783. url: https://arxiv.org/abs/2311.09783
- [5] Xinya Du, Junru Shao, and Claire Cardie. "Learning to Ask: Neural Question Generation for Reading Comprehension". 2017. arXiv:1705.00106. url: https://arxiv.org/abs/1705.00106
- [6] Yilun Du et al. "Improving Factuality and Reasoning in Language Models through Multiagent Debate". 2023. arXiv:2305.14325. url: https://arxiv.org/abs/2305.14325
- [7] Dheeru Dua et al. "DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs". 2019. arXiv:1903.00161. url: https://arxiv.org/abs/1903.00161
- [8] Sabina Elkins et al. "How Useful are Educational Questions Generated by Large Language Models?". 2023. arXiv:2304.06638. url: https://arxiv.org/abs/2304.06638
- [9] Yilin Hao et al. "Utilizing Large Language Models (LLMs) for Quantitative Reasoning-Intensive Tasks within the (Re)Insurance Sector". Annals of Actuarial Science, 2025. doi:10.1017/S1748499525100079. url: https://doi.org/10.1017/S1748499525100079
- [10] Michael Heilman and Noah A. Smith. "Good Question! Statistical Ranking for Question Generation". NAACL-HLT 2010, Association for Computational Linguistics, pp. 609–617. url: https://aclanthology.org/N10-1086/
- [11] Dan Hendrycks et al. "Measuring Mathematical Problem Solving With the MATH Dataset". 2021. arXiv:2103.03874. url: https://arxiv.org/abs/2103.03874
- [13] Sirui Hong et al. "MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework". 2023. arXiv:2308.00352. url: https://arxiv.org/abs/2308.00352
- [14] Seungone Kim et al. "Prometheus: Inducing Fine-grained Evaluation Capability in Language Models". 2023. arXiv:2310.08491. url: https://arxiv.org/abs/2310.08491
- [15] Ghader Kurdi et al. "A Systematic Review of Automatic Question Generation for Educational Purposes". International Journal of Artificial Intelligence in Education 30.1 (2020), pp. 121–204. doi:10.1007/s40593-019-00186-y. url: https://doi.org/10.1007/s40593-019-00186-y
- [16] Humza Naveed et al. "A Comprehensive Overview of Large Language Models". 2024. arXiv:2307.06435. url: https://arxiv.org/abs/2307.06435
- [17] Long Phan, Alice Gatti, Ziwen Han, et al. "Humanity's Last Exam". 2025. arXiv:2501.14249. url: https://arxiv.org/abs/2501.14249
- [18] Ronald Richman. "AI in Actuarial Science". SSRN Electronic Journal, 2018. doi:10.2139/ssrn.3218082. url: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3218082
- [19] Ronald Richman. "An AI Vision for the Actuarial Profession". Casualty Actuarial Society E-Forum (Summer 2024), 2024. SSRN 4758296, prize-winning essay. url: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4758296
- [20] Oscar Sainz et al. "NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination for Each Benchmark". 2023. arXiv:2310.18018. url: https://arxiv.org/abs/2310.18018
- [21] Andreas Troxler and Jürg Schelldorfer. "Actuarial Applications of Natural Language Processing Using Transformers: Case Studies for Using Text Features in an Actuarial Context". 2022. arXiv:2206.02014. url: https://arxiv.org/abs/2206.02014
- [22] Peiyi Wang et al. "Large Language Models are not Fair Evaluators". 2023. arXiv:2305.17926. url: https://arxiv.org/abs/2305.17926
- [23] Qingyun Wu et al. "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework". 2023. arXiv:2308.08155. url: https://arxiv.org/abs/2308.08155
- [24] Mario V. Wüthrich and Michael Merz. Statistical Foundations of Actuarial Learning and its Applications. 1st ed. Springer Actuarial. Springer Cham, 2023. doi:10.1007/978-3-031-12409-9. url: https://link.springer.com/book/10.1007/978-3-031-12409-9
- [25] Cheng Xu et al. "Benchmark Data Contamination of Large Language Models: A Survey"
- [27] Lianmin Zheng et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena". 2023. arXiv:2306.05685. url: https://arxiv.org/abs/2306.05685
- [28] Hua Zhou et al. "Design, Results and Industry Implications of the World's First Insurance Large Language Model Evaluation Benchmark (CUFEInse)". 2025. arXiv:2511.07794. url: https://arxiv.org/abs/2511.07794
- [29] Kun Zhou et al. "Don't Make Your LLM an Evaluation Benchmark Cheater". 2023. arXiv:2311.01964. url: https://arxiv.org/abs/2311.01964