BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

Lequan Ma; Qingtai Wu; Weijia Jia; Wenmian Yang; Yiquan Zhang; Zhensheng Wang

arxiv: 2605.17937 · v2 · pith:RS3RKRTAnew · submitted 2026-05-18 · 💻 cs.CL · cs.AI

Zhensheng Wang, Wenmian Yang, Qingtai Wu, Lequan Ma, Yiquan Zhang, Weijia Jia

Pith reviewed 2026-05-20 11:43 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords BacktestBench · LLM benchmark · quantitative backtesting · automated trading · multi-agent systems · financial strategy evaluation · code generation · market data QA

The pith

BacktestBench benchmarks large language models on automating quantitative trading strategy backtesting using 18,246 real-data QA pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BacktestBench as the first large-scale benchmark designed to test how effectively large language models can automate the backtesting of quantitative trading strategies. Backtesting is crucial for validating trading ideas against historical market data but is often too technical and time-consuming for broad use. LLMs have the potential to lower these barriers through code generation and agent-based planning, yet the absence of a dedicated benchmark has slowed targeted improvements. By constructing 18,246 annotated QA pairs from over six million market records in categories like metrics calculation and strategy selection, the work enables systematic comparison of models. Readers should care because success here would make reliable strategy testing more accessible without deep programming skills.

Core claim

BacktestBench is introduced as the first large-scale benchmark for automated quantitative backtesting, constructed from over 6 million real market records and containing 18,246 annotated question-answering pairs divided into four task categories: metrics calculation, ticker selection, strategy selection, and parameter confirmation. The paper also presents AutoBacktest as a multi-agent system baseline that employs a Summarizer to extract semantic factors, a Retriever for SQL generation, and a Coder for Python implementation to convert natural language strategies into reproducible backtests. Evaluations across 23 LLMs with ablations highlight the role of grounded verification and standardized indicator representations.
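The three-stage hand-off described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the `daily_bars` table, the phrase table, and the keyword matching that stands in for LLM calls are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical phrase-to-indicator table; a real Summarizer would be an LLM call.
INDICATOR_PHRASES = {"moving average": "MA", "rsi": "RSI", "volume": "VOLUME"}


@dataclass
class BacktestPlan:
    factors: list[str]  # indicator codes extracted from the strategy text
    sql: str            # query that would fetch the required market data
    code: str           # placeholder for the generated backtest script


def summarize(strategy_text: str) -> list[str]:
    """Stage 1 (Summarizer stand-in): extract indicator codes by keyword match."""
    text = strategy_text.lower()
    return [code for phrase, code in INDICATOR_PHRASES.items() if phrase in text]


def retrieve(factors: list[str]) -> str:
    """Stage 2 (Retriever stand-in): build a SQL query for an assumed table."""
    columns = ["trade_date", "close"] + [f.lower() for f in factors]
    return f"SELECT {', '.join(columns)} FROM daily_bars ORDER BY trade_date"


def write_code(sql: str, factors: list[str]) -> str:
    """Stage 3 (Coder stand-in): emit a stub describing the backtest script."""
    return f"# load rows with: {sql}\n# compute signals from: {', '.join(factors)}\n"


def run_pipeline(strategy_text: str) -> BacktestPlan:
    """Chain the three stages so each consumes the previous stage's output."""
    factors = summarize(strategy_text)
    sql = retrieve(factors)
    return BacktestPlan(factors=factors, sql=sql, code=write_code(sql, factors))


plan = run_pipeline("Buy when RSI is below 30 and price is above the 20-day moving average")
print(plan.factors)
print(plan.sql)
```

The point of the structure is that each stage has a narrow, checkable output (a factor list, a query string, a script), which is what makes stage-level ablations possible.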

What carries the argument

BacktestBench, the benchmark of 18,246 QA pairs from real market data, which evaluates LLMs on end-to-end automation of backtesting tasks.

If this is right

Models that score high on the benchmark demonstrate better capability in generating accurate backtest code from natural language inputs.
The AutoBacktest multi-agent approach outperforms single LLM methods by dividing tasks among specialized agents.
Grounded verification steps and standardized indicator formats are critical for reliable end-to-end backtesting automation.
Performance varies significantly across the four task categories, with some proving harder for current models.
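One way to picture a grounded verification step is to check generated SQL against the actual schema before any backtest code consumes it. The sketch below assumes an SQLite database and a hypothetical `daily_bars` table; the paper's summary here does not say which database engine or validation method AutoBacktest uses.

```python
import sqlite3


def validate_sql(conn: sqlite3.Connection, sql: str) -> tuple[bool, str]:
    """Ask SQLite to plan the query; schema or syntax errors surface here."""
    try:
        conn.execute(f"EXPLAIN QUERY PLAN {sql}")
        return True, "ok"
    except sqlite3.Error as exc:
        return False, str(exc)


# Hypothetical schema used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_bars (ticker TEXT, trade_date TEXT, close REAL)")

good = validate_sql(conn, "SELECT ticker, close FROM daily_bars WHERE close > 10")
bad = validate_sql(conn, "SELECT ticker, closing_price FROM daily_bars")

print(good)  # the query references only existing columns
print(bad)   # the query references a column that is not in the schema
```

A failed check can be fed back to the model as an error message for a retry, which is the usual reason to validate before execution rather than after.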

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If LLMs master these tasks, they could enable rapid prototyping of trading strategies for retail investors without requiring coding expertise.
The benchmark opens the door to creating domain-specific training datasets for financial AI applications.
Future work might expand the tasks to include risk management and portfolio-level backtesting to better simulate real trading environments.

Load-bearing premise

The annotated QA pairs from the 6 million market records accurately reflect the full range of technical and semantic difficulties in real quantitative backtesting.

What would settle it

Comparing the backtest outputs produced by high-scoring LLMs against results from professional quantitative analysts on a held-out set of strategies would test whether benchmark success translates to practical accuracy.
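Such a comparison could be scored as the share of held-out strategies on which the model's metric falls within a tolerance of the analyst's figure. The sketch below is an assumption about how one might set this up; the 1% relative tolerance, the strategy identifiers, and the numbers are illustrative and do not come from the paper.

```python
import math


def agreement_rate(model_outputs: dict[str, float],
                   reference_outputs: dict[str, float],
                   rel_tol: float = 0.01) -> float:
    """Fraction of reference strategies the model reproduces within rel_tol."""
    matches = 0
    for strategy_id, reference in reference_outputs.items():
        predicted = model_outputs.get(strategy_id)  # None if the model produced no result
        if predicted is not None and math.isclose(
            predicted, reference, rel_tol=rel_tol, abs_tol=1e-9
        ):
            matches += 1
    return matches / len(reference_outputs)


# Illustrative values only: annualized returns per held-out strategy.
analyst = {"s1": 0.125, "s2": -0.04, "s3": 1.8}
model = {"s1": 0.1251, "s2": -0.05}  # s3 is missing, e.g. the generated code failed

rate = agreement_rate(model, analyst)
print(f"{rate:.3f}")
```

Missing outputs count as disagreements here, so execution failures lower the score rather than being silently dropped.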

Figures

Figures reproduced from arXiv: 2605.17937 by Lequan Ma, Qingtai Wu, Weijia Jia, Wenmian Yang, Yiquan Zhang, Zhensheng Wang.

**Figure 2.** Pipeline of natural language strategy generation.

**Figure 3.** Overall framework of the AutoBacktest. Accompanying text (Section 3, AutoBacktest): "To address the challenges of automated backtesting posed by this dataset, we design a multi-agent framework that mimics the workflow of a quantitative researcher. Three specialized agents cooperate in a pipeline: the Summarizer parses natural language strategies into structured indicator representations, the Retriever generates and validates executable…"

**Figure 4.** Detailed model performance on the Metrics Cal

**Figure 5.** Short Code Ablation. [Bar-chart residue: panels MiniMax M2.1, GLM 4.7, GPT OSS 120B, GPT OSS 20B; y-axis Overall Accuracy (%), 0 to 60; x-axis ticks NI+GS, Ours, GI+PS, GI+GS; legend Vanilla, NI+GS, Ours, GI+PS, GI+GS. Values per panel, in extraction order: MiniMax M2.1: 44.3, 45.9, 46.9, 48.1; GLM 4.7: 50.6, 56.8, 56.8, 57.6; GPT OSS 120B: 46.8, 49.6, 49.4, 51.7; GPT OSS 20B: 30.7, 33.1, 31.4, 38.0.]

**Figure 6.** Ground Truth Ablation. Accompanying text: "…while robust performers like GPT OSS 120B and GLM 4.7 also achieve clear gains. These results confirm that Short Codes act as critical semantic anchors, effectively guiding the models to map natural language intents to precise database schema elements." Section 4.3.2, Independent Impact of Short Code and SQL Quality: "To deeply investigate the independent effects of Short Code and SQL generatio…"

**Figure 7.** Atomic Strategy Function Example.

**Figure 8.** QA example.

**Figure 9.** Prompt for Converting Python Code to Natural Language.

**Figure 10.** Prompt for Evaluating Natural Language Strategies.

**Figure 11.** Factor Retrieval Prompt.

**Figure 12.** Prompt For Retriever.

**Figure 13.** Prompt For Coder.

Original abstract

Quantitative backtesting is essential for evaluating trading strategies but remains hampered by high technical barriers and limited scalability. While Large Language Models (LLMs) offer a transformative path to automate this complex, interdisciplinary workflow through advanced code generation, tool usage, and agentic planning, the practical realization is significantly challenged by the current lack of a large-scale benchmark dedicated to automated quantitative backtesting, which hinders progress in this field. To bridge this critical gap, we introduce BacktestBench, the first large-scale benchmark for automated quantitative backtesting. Built from over 6 million real market records, it comprises 18,246 meticulously annotated question-answering pairs across four task categories: metrics calculation, ticker selection, strategy selection, and parameter confirmation. We also propose AutoBacktest, a robust multi-agent baseline that translates natural language strategies into reproducible backtests by coordinating a Summarizer for semantic factor extraction, a Retriever for validated SQL generation, and a Coder for Python backtesting implementation. Our evaluation on 23 mainstream LLMs, complemented by targeted ablations, identifies key factors that influence end-to-end performance and highlights the importance of grounded verification and standardized indicator representations.
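To make the "metrics calculation" category concrete, the sketch below computes three common backtest metrics from an equity curve. Which metrics the benchmark actually asks for is not stated here, so the choice of total return, maximum drawdown, and an annualized Sharpe ratio (zero risk-free rate, 252 periods per year) is an assumption, as is the toy equity series.

```python
import math


def total_return(equity: list[float]) -> float:
    """Final value relative to the starting value, minus one."""
    return equity[-1] / equity[0] - 1


def max_drawdown(equity: list[float]) -> float:
    """Largest peak-to-trough decline, returned as a non-positive fraction."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = min(worst, value / peak - 1)
    return worst


def sharpe_ratio(equity: list[float], periods_per_year: int = 252) -> float:
    """Annualized mean/std of per-period returns, assuming a zero risk-free rate."""
    returns = [b / a - 1 for a, b in zip(equity, equity[1:])]
    mean = sum(returns) / len(returns)
    variance = sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)
    if variance == 0:
        return 0.0  # undefined for a constant return series; report 0 by convention
    return mean / math.sqrt(variance) * math.sqrt(periods_per_year)


equity_curve = [100.0, 110.0, 99.0, 120.0]  # toy portfolio values per period
print(round(total_return(equity_curve), 4))
print(round(max_drawdown(equity_curve), 4))
print(round(sharpe_ratio(equity_curve), 4))
```

Even in this small form, conventions matter (sample versus population variance, annualization factor, sign of drawdown), which is one reason a benchmark answer needs an explicit tolerance or a fixed metric definition.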

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BacktestBench gives a new large-scale dataset for LLM backtesting with a practical multi-agent baseline, but the annotation quality lacks the validation details needed to trust the results fully.

The main thing here is a new benchmark called BacktestBench that turns over 6 million real market records into 18,246 QA pairs across four concrete tasks: metrics calculation, ticker selection, strategy selection, and parameter confirmation. They also give a multi-agent baseline, AutoBacktest, that breaks the workflow into summarizer, retriever, and coder agents, then test it on 23 LLMs with some ablations on what drives performance.

Referee Report

2 major / 2 minor

Summary. The paper introduces BacktestBench, the first large-scale benchmark for automated quantitative backtesting, built from over 6 million real market records and comprising 18,246 annotated QA pairs across four task categories (metrics calculation, ticker selection, strategy selection, and parameter confirmation). It also proposes AutoBacktest, a multi-agent baseline using Summarizer, Retriever, and Coder agents to translate natural language strategies into reproducible backtests, and evaluates performance across 23 LLMs with targeted ablations to identify factors influencing end-to-end results.

Significance. If the benchmark's QA pairs faithfully represent real-world quantitative backtesting challenges and the evaluation is free of annotation artifacts or data leakage, the work would provide a valuable standardized testbed for LLM-based automation in finance. The multi-agent baseline and ablation results on grounded verification could usefully guide future agentic systems, particularly if the dataset construction proves robust.

major comments (2)

§3 (Dataset Construction): The central claim that the 18,246 QA pairs 'meticulously annotated' from >6M market records accurately encode the technical and semantic difficulties of backtesting workflows is load-bearing for all downstream LLM scores, yet the manuscript provides no inter-annotator agreement statistics, expert review protocol, or error-rate measurements on a held-out sample. Without these, systematic simplifications (e.g., in SQL templates or omission of slippage/edge cases) cannot be ruled out.
§4.2 (Evaluation Setup): The paper does not report controls for data leakage between the market records used to build the benchmark and the training data of the 23 evaluated LLMs, nor does it describe how the four task categories were balanced or validated for coverage of real workflows. This directly affects the reliability of the reported performance rankings and ablation conclusions.
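For reference, one standard inter-annotator agreement statistic of the kind requested in the first comment is Cohen's kappa, which corrects raw agreement between two annotators for agreement expected by chance. The sketch below is a generic implementation; the "ok"/"bad" labels and the two annotators' judgments are invented for illustration and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example: two reviewers judging whether six QA pairs are correct.
annotator_1 = ["ok", "ok", "bad", "ok", "bad", "ok"]
annotator_2 = ["ok", "bad", "bad", "ok", "bad", "ok"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here raw agreement is 5 of 6 items, but chance agreement is 0.5, so kappa is noticeably lower than the raw figure.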

minor comments (2)

Table 1: The caption and column headers for task-category statistics could more explicitly define the 'parameter confirmation' category to avoid ambiguity with 'strategy selection'.
§5 (Ablations): The description of the 'standardized indicator representations' ablation would benefit from a concrete example of the representation change and its effect size on a specific LLM.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their thorough review and valuable suggestions. We have carefully considered the major comments and provide point-by-point responses below. Where appropriate, we will revise the manuscript to incorporate additional details and clarifications to address the concerns raised.

Referee: §3 (Dataset Construction): The central claim that the 18,246 QA pairs 'meticulously annotated' from >6M market records accurately encode the technical and semantic difficulties of backtesting workflows is load-bearing for all downstream LLM scores, yet the manuscript provides no inter-annotator agreement statistics, expert review protocol, or error-rate measurements on a held-out sample. Without these, systematic simplifications (e.g., in SQL templates or omission of slippage/edge cases) cannot be ruled out.

Authors: We thank the referee for this observation. The QA pairs were annotated by experts in quantitative finance following a structured protocol designed to reflect real-world backtesting difficulties, including the use of actual market data for generating questions and answers. However, the original manuscript did not report inter-annotator agreement statistics, a detailed expert review protocol, or error-rate measurements. We agree that these would strengthen the paper. In the revision, we will elaborate on the annotation methodology in §3, describe the protocol used, and include any quality assurance steps performed. We will also add a discussion of potential limitations such as possible simplifications in the templates.
revision: yes

Referee: §4.2 (Evaluation Setup): The paper does not report controls for data leakage between the market records used to build the benchmark and the training data of the 23 evaluated LLMs, nor does it describe how the four task categories were balanced or validated for coverage of real workflows. This directly affects the reliability of the reported performance rankings and ablation conclusions.

Authors: We appreciate the referee pointing out these gaps in the evaluation setup. For data leakage, given that training data details for the 23 LLMs (many of which are proprietary) are not publicly accessible, we are unable to perform exhaustive controls. We will add a section in the revised manuscript discussing this challenge and the steps taken to minimize risk, such as using recent market data. Regarding the balancing and validation of task categories, the four categories were chosen to cover essential elements of quantitative strategy development and backtesting as per standard practices in the field. We will revise §4.2 to provide more details on the rationale for category selection, their distribution in the benchmark, and how they map to real workflows based on our design process.
revision: partial
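The "recent market data" mitigation mentioned in the response can be pictured as a date-based split: benchmark items built only from records after a model's stated training cutoff are less likely to have been seen in training. The sketch below is an assumption about how such a filter might look; the record layout, tickers, and cutoff date are hypothetical, and a date filter reduces leakage risk without ruling it out.

```python
from datetime import date


def split_by_cutoff(records: list[dict], cutoff: date) -> tuple[list[dict], list[dict]]:
    """Partition records into those dated on or before the cutoff and those after."""
    seen_risk = [r for r in records if r["trade_date"] <= cutoff]
    post_cutoff = [r for r in records if r["trade_date"] > cutoff]
    return seen_risk, post_cutoff


# Hypothetical market records and a hypothetical model training cutoff.
records = [
    {"ticker": "AAA", "trade_date": date(2023, 6, 30), "close": 10.2},
    {"ticker": "AAA", "trade_date": date(2024, 12, 31), "close": 11.7},
    {"ticker": "BBB", "trade_date": date(2025, 3, 14), "close": 54.0},
]
assumed_cutoff = date(2024, 12, 31)

seen_risk, post_cutoff = split_by_cutoff(records, assumed_cutoff)
print(len(seen_risk), len(post_cutoff))
```

Reporting scores separately on the two partitions would show whether performance drops on data the model could not have memorized.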

standing simulated objections not resolved

Full controls for data leakage with the training data of proprietary LLMs

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper introduces an external benchmark (BacktestBench with 18,246 QA pairs from >6M real market records) and a multi-agent baseline (AutoBacktest) without any self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations. Central claims rest on new artifacts and independent LLM evaluations rather than reducing to inputs by construction. This is the most common honest finding for benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The contribution rests on the domain assumption that the chosen four task categories and real-market-derived QA pairs form a representative testbed, plus the introduction of two new entities (BacktestBench and AutoBacktest) without external validation evidence mentioned in the abstract.

axioms (1)

domain assumption: The four task categories (metrics calculation, ticker selection, strategy selection, parameter confirmation) comprehensively cover the quantitative backtesting workflow.
The benchmark is explicitly structured around these four categories as stated in the abstract.

invented entities (2)

BacktestBench (no independent evidence)
purpose: Large-scale benchmark dataset and tasks for LLM automated backtesting
Newly constructed collection of 18,246 QA pairs from 6 million market records.

AutoBacktest (no independent evidence)
purpose: Multi-agent baseline coordinating summarizer, retriever, and coder for strategy-to-backtest translation
Proposed three-agent architecture described in the abstract.

pith-pipeline@v0.9.0 · 5752 in / 1513 out tokens · 64063 ms · 2026-05-20T11:43:23.332152+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce BacktestBench, the first large-scale benchmark for automated quantitative backtesting. Built from over 6 million real market records, it comprises 18,246 meticulously annotated question-answering pairs across four task categories: metrics calculation, ticker selection, strategy selection, and parameter confirmation.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

AutoBacktest coordinates three functionally specialized agents: (1) the Summarizer, responsible for semantic-level extraction of financial indicators; (2) the Retriever, which handles data-level precise querying and quality verification; and (3) the Coder, focusing on logic-level code implementation and backtest execution.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.