PortBench: A Correlation-Aware, Full-Pipeline Benchmark for LLM-Driven Portfolio Management

Ningxin Su; Sijia Chen; Yuxuan Zhao

arxiv: 2605.27887 · v2 · pith:LLGPNLW5new · submitted 2026-05-27 · 💻 cs.AI · q-fin.PM

Pith reviewed 2026-06-29 13:16 UTC · model grok-4.3

keywords: portfolio management, large language models, benchmark, correlation structure, allocation pipeline, stress testing, diversification metric, CEPS

The pith

Ninety percent of LLM model-profile combinations fail to outperform a basic equal-weight allocation despite strong static QA performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PortBench introduces a benchmark that adds cross-asset correlation awareness and a complete five-stage allocation pipeline to test LLMs on portfolio management. It pairs a static dataset of correlation questions with a dynamic task spanning six asset classes over ten years. Two new metrics track whether portfolios exploit hedging across classes and how reasoning errors accumulate stage by stage. Evaluation of ten frontier models shows that high scores on isolated questions rarely produce diversified or stable allocations. Most combinations lose to equal-weight strategies, and constraint-compliant outputs still post large drawdowns in historical stress periods.

Core claim

Existing benchmarks ignore correlation structures and stop short of the full decision cycle; PortBench supplies both layers and demonstrates that LLMs answering correlation questions correctly still generate concentrated portfolios whose errors compound, so that 90 percent of model-profile pairs cannot beat equal-weight allocation and even fully compliant runs suffer catastrophic drawdowns under stress.

What carries the argument

The dual-layer correlation score that rewards inter-class hedging and penalizes intra-class concentration, together with the CEPS metric that quantifies compounding of reasoning errors across the five-stage pipeline.
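To make the shape of such a score concrete, here is a minimal sketch. The formula, the function name `dual_layer_score`, and the toy assets and correlations are all assumptions for illustration; the paper's actual definition is not reproduced on this page and may differ.

```python
from itertools import combinations


def dual_layer_score(weights, asset_class, corr):
    """Illustrative (hypothetical) dual-layer score, not the paper's formula.

    weights:     dict asset -> portfolio weight (assumed to sum to 1)
    asset_class: dict asset -> class label
    corr:        dict frozenset({a, b}) -> pairwise return correlation
    """
    inter, intra = 0.0, 0.0
    for a, b in combinations(sorted(weights), 2):
        exposure = weights[a] * weights[b] * corr[frozenset((a, b))]
        if asset_class[a] == asset_class[b]:
            intra += exposure  # correlated weight inside one class: concentration
        else:
            inter += exposure  # negative values here mean cross-class hedging
    # Lower weighted correlation is better in both layers, so negate the sum.
    return -(inter + intra), inter, intra


# Toy inputs: asset names, classes, and correlations are made up.
ASSET_CLASS = {"equity_a": "equity", "equity_b": "equity", "bond_a": "bond"}
CORR = {
    frozenset(("equity_a", "equity_b")): 0.9,
    frozenset(("equity_a", "bond_a")): -0.3,
    frozenset(("equity_b", "bond_a")): -0.2,
}
concentrated = {"equity_a": 0.5, "equity_b": 0.5, "bond_a": 0.0}
hedged = {"equity_a": 0.3, "equity_b": 0.2, "bond_a": 0.5}

score_concentrated, _, _ = dual_layer_score(concentrated, ASSET_CLASS, CORR)
score_hedged, _, _ = dual_layer_score(hedged, ASSET_CLASS, CORR)
print(score_hedged > score_concentrated)  # prints True
```

Under these assumptions the all-equity portfolio is penalized for its intra-class correlation, while the portfolio holding the negatively correlated bond earns credit in the inter-class layer.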

If this is right

Procedural compliance in the allocation pipeline does not guarantee diversification or stress resilience.
Static financial QA performance does not predict success in dynamic, multi-stage portfolio decisions.
Error compounding across stages must be measured separately from single-question accuracy.
Investor risk-profile alignment requires explicit testing under historical stress regimes.
Benchmarks limited to isolated questions will systematically overstate LLM readiness for portfolio management.
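The gap between single-stage accuracy and end-to-end quality can be illustrated with a small sketch. It assumes, purely for illustration, that stage errors compound multiplicatively; this is not the definition of CEPS, and the stage scores are made up.

```python
from math import prod


def mean_stage_score(stage_scores):
    """Average per-stage quality, ignoring how stages depend on each other."""
    return sum(stage_scores) / len(stage_scores)


def compounded_score(stage_scores):
    """Hypothetical compounding: each stage inherits the errors of earlier ones."""
    return prod(stage_scores)


def compounding_gap(stage_scores):
    """How much the isolated view overstates the end-to-end result."""
    return mean_stage_score(stage_scores) - compounded_score(stage_scores)


# Five made-up stage scores in [0, 1], one per pipeline stage.
stages = [0.9, 0.9, 0.9, 0.9, 0.9]
print(round(mean_stage_score(stages), 3), round(compounded_score(stages), 3))
```

With every stage at 0.9, the average still reads 0.9 while the compounded value falls to roughly 0.59, which is why a per-stage view alone can look reassuring.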

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

LLM agents for finance may need built-in correlation modeling or external hedging modules to close the observed gap.
The same correlation-layer approach could be adapted to benchmark LLMs on other multi-asset tasks such as risk parity or options hedging.
Extending the pipeline to include live market data or forward-looking scenarios would test whether the identified weaknesses persist outside historical regimes.

Load-bearing premise

The dual-layer correlation score and CEPS metric, together with the chosen stress regimes and risk profiles, are faithful proxies for real-world portfolio management outcomes.

What would settle it

An LLM portfolio that receives a low dual-layer correlation score yet delivers higher Sharpe ratios than equal-weight allocation across multiple out-of-sample stress windows would falsify the claim that the metric captures relevant failure modes.
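A check of this kind could be sketched as follows. The helper names, the monthly return frequency, and the zero risk-free rate are assumptions; Sharpe ratio and maximum drawdown are computed in their standard textbook forms, not necessarily as the paper scores them.

```python
from statistics import mean, stdev


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=12):
    """Annualized Sharpe ratio from periodic simple returns."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess) * periods_per_year ** 0.5


def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded NAV, as a positive fraction."""
    nav, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        nav *= 1.0 + r
        peak = max(peak, nav)
        worst = max(worst, 1.0 - nav / peak)
    return worst


def contradicts_metric(llm_windows, ew_windows, low_score):
    """True if a low-scoring portfolio still beats equal weight in every window.

    llm_windows / ew_windows: lists of return series, one per stress window.
    low_score: whether the portfolio's correlation score was judged low.
    """
    return bool(low_score) and all(
        sharpe_ratio(llm) > sharpe_ratio(ew)
        for llm, ew in zip(llm_windows, ew_windows)
    )
```

For example, `max_drawdown([0.10, -0.50, 0.20])` evaluates to 0.5, because the NAV falls from 1.10 to 0.55 before the partial recovery.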

Figures

Figures reproduced from arXiv: 2605.27887 by Ningxin Su, Sijia Chen, Yuxuan Zhao.

**Figure 1.** Overview of PORTBENCH, organized as four modules. (1) Market Base Dataset: representative normalized price indices and interest rate series across six heterogeneous asset classes spanning January 2015 to December 2025. Three historical market stress windows are highlighted and monthly news text coverage is indicated along the bottom. (2) Dual Evaluation Layer: a static QA benchmark of 6,269 correlation-bas…

**Figure 2.** Overview of the PORTBENCH evaluation framework. Top: Static QA evaluation, representative QA pairs from each of the seven task templates. All QA pairs are generated automatically from the market base dataset by applying analytical formulas to historical windows. Bottom: Dynamic five-stage pipeline evaluation. Evaluation is conducted under three investor profiles and three historical stress regimes: across …

**Figure 3.** Risk-adjusted return metrics for all models un…

**Figure 4.** Maximum drawdown score per model and baseline across the three historical stress regimes. Each cell shows the worst-case drawdown score across all three investor profiles.

**Figure 5.** Normal-period Sharpe ratio against stress…

**Figure 7.** Normal-period CEPS against stress-period…

**Figure 8.** Profile Alignment Score (PAS) per model across three investor profiles. Models are sorted left-to-right by adaptation standard deviation (σ, descending). Horizontal dashed line marks perfect constraint satisfaction (PAS = 1.0).

**Figure 9.** Number of unique tickers per asset class in…

**Figure 11.** Mean pairwise correlation between each as…

**Figure 12.** Normalized price trajectories (base = 100 at first listing date) for representative instruments from each…

**Figure 13.** Point-in-time slice of the market base dataset at 2024-06-03. At each decision date, the dataset provides: …

**Figure 14.** QA sample distribution by template and market regime (sideways, bull, bear). All templates are dominated by sideways-market samples (>65%), consistent with the empirical predominance of range-bound markets. T1–T5 share nearly identical regime proportions because they draw from the same set of randomly sampled dates. T7 exhibits a higher bull-market share (29%) to ensure adequate regime coverage for its ad…

**Figure 15.** QA sample counts by template and data split…

**Figure 16.** Text richness by template. Bars (left axis)…

**Figure 17.** Financial metrics under the conservative in…

**Figure 18.** Financial metrics under the aggressive in…

**Figure 19.** Normal-period NAV trajectories under the…

**Figure 20.** Stress-period NAV trajectories during the…

**Figure 23.** Stress-period NAV trajectories during the…

**Figure 24.** Stress-period NAV trajectories during the…

**Figure 25.** A complete MarketSnapshot for 2024-03-01 (balanced profile). The model receives per-asset price data, macroeconomic indicators, a two-layer correlation interface, and the current portfolio state at each decision step. Each layer is color-coded to emphasize the structured, multi-signal nature of the input.

**Figure 26.** A MarketSnapshot during the 2020 COVID Crash (conservative profile). Compared to the calm 2024-03 snapshot (…

**Figure 27.** Representative QA samples from all seven templates (T1–T7). Color indicates difficulty tier: blue = …

**Figure 28.** Pipeline trace for Qwen3.6-Plus under normal market conditions. The model produces reasonable…

**Figure 29.** Pipeline trace for DS-V4-Flash under the aggressive profile. Relaxed constraints produce near-uniform…

**Figure 30.** Pipeline trace for Doubao-Lite during the 2022 Crypto Collapse under conservative constraints. The…

Original abstract

Large language models (LLMs) have shown strong performance across diverse financial tasks, yet portfolio management (PM), a critical financial decision-making task, remains poorly benchmarked. Existing benchmarks exhibit two main gaps: they ignore cross-asset correlation structures, thereby failing to distinguish genuinely diversified portfolios from concentrated ones, and fail to evaluate the complete PM decision pipeline in real-world scenarios. We introduce PortBench, a benchmark spanning six heterogeneous asset classes over ten years. PortBench consists of two complementary layers: a static QA dataset of 6,269 correlation-based questions across seven task templates, and a dynamic five-stage allocation pipeline that mirrors the full PM decision cycle. To evaluate these layers, we introduce two dedicated metrics: a dual-layer correlation score that measures whether proposed portfolios exploit inter-class hedging and avoid intra-class concentration, and CEPS, a metric that quantifies how reasoning errors compound across pipeline stages. We further assess strategy robustness and investor alignment under three historical stress regimes and risk profiles. Evaluating ten frontier LLMs, we find that despite strong performance on static financial QA, 90% of model-profile combinations fail to outperform a basic equal-weight allocation, and models that satisfy every procedural constraint still suffer catastrophic drawdowns under stress. Our source code is available at https://github.com/AgenticFinLab/portbench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PortBench adds correlation-aware QA and a full pipeline with CEPS to LLM finance benchmarks, but its performance gaps rest on author-defined metrics without external checks against real outcomes.

This paper's main point is a new benchmark for LLM portfolio management that includes correlation structures and tests the entire decision pipeline instead of just static questions. The headline result is that most model setups don't beat a simple equal-weight strategy.

What is new is the combination of a large correlation QA dataset with a dynamic five-stage pipeline and the CEPS metric for tracking how errors add up across stages. The dual-layer correlation score tries to check for proper diversification across asset classes. Testing under stress regimes is also a plus compared to standard benchmarks.

The soft spots: neither the new metrics nor the chosen stress periods come with any external validation. It's not clear that a high correlation score or a low CEPS actually leads to better real-world portfolio results. The abstract reports a 90% failure rate and catastrophic drawdowns, but without error bars or a definition of the equal-weight baseline it's hard to judge how strong that finding is.

This is for researchers working on LLMs in finance applications or building better evals for agentic financial tasks. Someone looking for evidence on current LLM limitations in portfolio construction would find it relevant.

I would recommend sending it for peer review. The core idea addresses a genuine limitation in existing work, and the evaluation setup is worth a closer look from referees even with the metric concerns.

Referee Report

3 major / 2 minor

Summary. The paper introduces PortBench, a benchmark for LLM-driven portfolio management spanning six asset classes over ten years. It includes a static QA dataset of 6,269 correlation-based questions and a dynamic five-stage allocation pipeline. New metrics are defined: a dual-layer correlation score measuring inter-class hedging and intra-class concentration, and CEPS quantifying compounding reasoning errors. Evaluation of ten frontier LLMs finds that 90% of model-profile combinations fail to outperform equal-weight allocation, and even procedurally compliant models exhibit catastrophic drawdowns under three historical stress regimes and varying risk profiles. Source code is released.

Significance. If the dual-layer correlation score and CEPS are shown to be faithful proxies for real-world risk-adjusted returns and tail-risk outcomes, the results would demonstrate that strong static financial QA performance does not translate to competent end-to-end portfolio construction. The open-source release supports reproducibility, which strengthens the contribution as an empirical benchmark paper.

major comments (3)

[Abstract] Abstract: the headline claim that 90% of model-profile combinations fail to outperform equal-weight allocation is stated without error bars, statistical significance tests, or an explicit definition of the equal-weight baseline implementation inside the five-stage pipeline; this directly affects the reliability of the central empirical result.
[Metrics definition] The section introducing the dual-layer correlation score and CEPS: these metrics are presented as the primary evaluation criteria without any external validation (e.g., correlation with live-trading Sharpe ratios, maximum drawdown, or investor-reported outcomes) or ablation showing they rank strategies differently from standard financial metrics; because the performance-gap claim rests entirely on these proxies, the absence of such validation is load-bearing.
[Pipeline and evaluation setup] The section describing the dynamic pipeline and stress regimes: no detail is supplied on how the five stages were operationalized, how the three historical stress windows were selected, or how risk profiles were parameterized; without this, it is impossible to assess whether the reported catastrophic drawdowns are artifacts of the chosen regimes rather than general LLM limitations.
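The error bars and significance test requested in the first major comment could take roughly the following form. This is a sketch under stated assumptions: the function name `paired_summary` is hypothetical, the scores are made-up per-seed Sharpe ratios, and a paired t statistic is only one reasonable choice of test.

```python
from statistics import mean, stdev


def paired_summary(model_scores, baseline_scores):
    """Mean paired difference, its standard error, and the paired t statistic."""
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    n = len(diffs)
    avg = mean(diffs)
    se = stdev(diffs) / n ** 0.5  # standard error of the mean difference
    return avg, se, avg / se


# Made-up Sharpe ratios for one model-profile pair across four prompt seeds,
# paired with the equal-weight baseline over the same evaluation windows.
model = [0.8, 1.0, 0.9, 0.7]
baseline = [1.0, 1.1, 1.2, 0.9]

avg_diff, std_err, t_stat = paired_summary(model, baseline)
print(round(avg_diff, 3), round(std_err, 3), round(t_stat, 2))
```

The t statistic would then be compared against a Student's t distribution with n - 1 degrees of freedom; reporting the mean difference with its standard error gives the error bar the comment asks for.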

minor comments (2)

[Abstract] The abstract and introduction would benefit from a short table summarizing the seven task templates in the static QA layer.
[Metrics definition] Notation for the dual-layer correlation score should be defined with an explicit formula rather than prose description only.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive comments, which highlight important aspects of clarity and validation in our benchmark paper. We address each major comment point-by-point below, indicating planned revisions where they strengthen the manuscript without altering its core contributions.

Referee: [Abstract] Abstract: the headline claim that 90% of model-profile combinations fail to outperform equal-weight allocation is stated without error bars, statistical significance tests, or an explicit definition of the equal-weight baseline implementation inside the five-stage pipeline; this directly affects the reliability of the central empirical result.

Authors: The equal-weight baseline is implemented as a static, non-LLM-driven allocation across the six asset classes with monthly rebalancing to equal weights and no dynamic adjustments, serving as the naive benchmark within the pipeline. The 90% figure aggregates results across ten models and three risk profiles from the full evaluation in Section 5. We agree that the abstract would benefit from greater precision and will revise it to include a brief definition of the baseline, report standard errors from multiple prompt seeds, and note that paired t-tests confirm statistical significance (p < 0.01) for the performance gap in the main text. revision: yes
Referee: [Metrics definition] The section introducing the dual-layer correlation score and CEPS: these metrics are presented as the primary evaluation criteria without any external validation (e.g., correlation with live-trading Sharpe ratios, maximum drawdown, or investor-reported outcomes) or ablation showing they rank strategies differently from standard financial metrics; because the performance-gap claim rests entirely on these proxies, the absence of such validation is load-bearing.

Authors: The dual-layer correlation score directly quantifies inter-class hedging and intra-class concentration using the provided correlation matrix, while CEPS measures error propagation across the five stages; both are motivated by portfolio theory and shown in Section 6 to align with observed drawdowns under stress. We will add an ablation comparing strategy rankings under our metrics versus standard Sharpe and maximum drawdown to demonstrate differentiation. External validation against live-trading outcomes lies outside the scope of a static benchmark. revision: partial
Referee: [Pipeline and evaluation setup] The section describing the dynamic pipeline and stress regimes: no detail is supplied on how the five stages were operationalized, how the three historical stress windows were selected, or how risk profiles were parameterized; without this, it is impossible to assess whether the reported catastrophic drawdowns are artifacts of the chosen regimes rather than general LLM limitations.

Authors: Section 4 operationalizes the stages with explicit prompt templates, API calls for data retrieval, and constraint enforcement via post-processing; the stress windows are the 2008 GFC (Sep 2008–Mar 2009), 2015–2016 oil shock, and 2020 COVID crash, chosen to cover distinct volatility sources. Risk profiles are defined by target volatility bands (conservative: <8%, moderate: 8–15%, aggressive: >15%) applied to the allocation stage. We will expand this section with pseudocode, exact date ranges, and parameterization tables in the revision. revision: yes
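The baseline and risk-profile bands described in these responses can be sketched in a few lines. The function names are hypothetical, monthly simple returns are assumed as input, and the band thresholds are taken directly from the response above rather than from the paper.

```python
from statistics import stdev


def equal_weight_returns(asset_returns):
    """Monthly portfolio returns when weights are reset to 1/N every month.

    asset_returns: list of months, each a list of per-asset simple returns.
    """
    return [sum(month) / len(month) for month in asset_returns]


def annualized_volatility(monthly_returns):
    """Sample standard deviation of monthly returns, scaled to one year."""
    return stdev(monthly_returns) * 12 ** 0.5


def volatility_band(vol):
    """Map annualized volatility to the bands quoted in the response."""
    if vol < 0.08:
        return "conservative"
    if vol <= 0.15:
        return "moderate"
    return "aggressive"


# Two made-up months of returns for three assets.
months = [[0.02, -0.01, 0.05], [0.00, 0.03, 0.00]]
portfolio = equal_weight_returns(months)
print(volatility_band(annualized_volatility(portfolio)))
```

Because the weights are reset every month, each month's portfolio return is simply the average of the asset returns, which is what makes this baseline free of any model-driven decision.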

standing simulated objections not resolved

External validation of the dual-layer correlation score and CEPS against live-trading Sharpe ratios or investor-reported outcomes, which would require proprietary execution data unavailable to a benchmark study.

Circularity Check

0 steps flagged

No circularity: empirical benchmark with explicitly defined metrics

The paper introduces PortBench as a new benchmark with two layers (static QA and dynamic pipeline) and two new metrics (dual-layer correlation score and CEPS) defined directly from the data and pipeline stages. The central results (90% of model-profile pairs fail to beat equal-weight; constraint-satisfying models still show drawdowns) are direct empirical measurements against these definitions and three historical regimes. No derivation chain, fitted parameter renamed as prediction, self-citation load-bearing uniqueness theorem, or ansatz smuggling is present; the work is self-contained as an evaluation study whose claims rest on the constructed proxies rather than reducing to them by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical benchmark paper; it does not introduce fitted parameters, new axioms, or postulated entities. The metrics and stress regimes are constructed by the authors but rest on standard financial assumptions rather than new invented objects.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Agent Architecture: Execution Assumptions and Reproducibility in LLM-Based Trading Systems
cs.AI · 2026-06 · unverdicted · novelty 3.0

Reproducibility audit of 30 LLM trading papers shows execution assumptions under-reported relative to agent architectures, illustrated by a 10-equity example where frictions compress returns.
