Position: Standard benchmarks fail -- auditing LLM agents in finance must prioritize risk, 2025

Chen, Z · 2025 · arXiv 2502.15865

7 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Flaws in the LLM Automation Narrative

stat.OT · 2026-06-09 · unverdicted · novelty 7.0

A new code-writing data analysis benchmark shows human experts outperforming a frontier LLM on average with lower performance variance.

FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios

cs.CL · 2026-05-01 · unverdicted · novelty 7.0

FinSafetyBench shows that LLMs remain vulnerable to adversarial prompts that bypass financial compliance safeguards, with notably higher failure rates in Chinese-language scenarios.

OmniCompliance-100K: A Multi-Domain, Rule-Grounded, Real-World Safety Compliance Dataset

cs.CL · 2026-03-14 · unverdicted · novelty 7.0

OmniCompliance-100K supplies 12,985 distinct rules and 106,009 associated real-world cases from 74 multi-domain regulations to benchmark LLM safety and compliance.

Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

ASR, a new trajectory-fidelity metric, detects that 10 of 18 LLMs skip confirmation steps in payment agents despite perfect scores on prior metrics, and ASR-guided refinements improve task success by up to 93.8 percentage points.

QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance

cs.MA · 2026-04-20 · unverdicted · novelty 6.0

QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.

Conversations Risk Detection LLMs in Financial Agents via Multi-Stage Generative Rollout

cs.CR · 2026-04-10 · unverdicted · novelty 4.0

FinSec is a multi-stage detection system for financial LLM dialogues that reaches 90.13% F1 score, cuts attack success rate to 9.09%, and raises AUPRC to 0.9189.

Leakage-Aware Benchmarking of LLM Forecasting: Real-Time Nowcasts as the Decision-Time Input for Macro Factor Ranking

q-fin.ST · 2026-06-21 · unverdicted · novelty 3.0

Leakage-controlled LLM factor ranking yields median Spearman IC of +0.154 that is largely matched by a kNN baseline on the same real-time macro inputs.

citing papers explorer

Flaws in the LLM Automation Narrative
stat.OT · 2026-06-09 · unverdicted · none · ref 39

FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios
cs.CL · 2026-05-01 · unverdicted · none · ref 1

OmniCompliance-100K: A Multi-Domain, Rule-Grounded, Real-World Safety Compliance Dataset
cs.CL · 2026-03-14 · unverdicted · none · ref 2

Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems
cs.AI · 2026-05-07 · unverdicted · none · ref 4

QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance
cs.MA · 2026-04-20 · unverdicted · none · ref 69

Conversations Risk Detection LLMs in Financial Agents via Multi-Stage Generative Rollout
cs.CR · 2026-04-10 · unverdicted · none · ref 22

Leakage-Aware Benchmarking of LLM Forecasting: Real-Time Nowcasts as the Decision-Time Input for Macro Factor Ranking
q-fin.ST · 2026-06-21 · unverdicted · none · ref 79
