Herculean: An Agentic Benchmark for Financial Intelligence

Xueqing Peng, Zhuohan Xie, Yupeng Cao, Haohang Li, Lingfei Qian, Yan Wang, Vincent Jim Zhang, Huan He, Xuguang Ai, Linhai Ma, Ruoyu Xiang, Yueru He, Yi Han, Shuyao Wang, Yuqing Guo, Mingyang Jiang, Yilun Zhao, Youzhong Dong, Xiaoyu Wang, Yankai Chen, Ye Yuan, Qiyuan Zhang, Fuyuan Lyu, Haolun Wu, Yonghan Yang, Zichen Zhao, Yuyang Dai, Fan Zhang, Rania Elbadry, Ayesha Gull, Muhammad Usman Safder, Nuo Chen, Fengbin Zhu, Tianshi Cai, Zimu Wang, Polydoros Giannouris, Yuechen Jiang, Zhiwei Liu, Mohsinul Kabir, Yuyan Wang, Yixiang Zheng, Yangyang Yu, Weijin Liu, Wenbo Cao, Anke Xu, Peng Lu, Jerry Huang, Fengran Mo, Mingquan Lin, Prayag Tiwari, Yijia Zhao, Victor Gutierrez Basulto, Xiao-Yang Liu, Kaleb E Smith, Jiahuan Pei, Arman Cohan, Jimin Huang, Yuehua Tang, Alejandro Lopez-Lira, Xi Chen, Xue Liu, Junichi Tsujii, Jian-Yun Nie, Sophia Ananiadou

classification: cs.AI, cs.CL

keywords: financial, agents, they, agentic, auditing, benchmark, hedging, herculean

As AI agents improve, the central question is no longer whether they can solve isolated, well-defined financial tasks, but whether they can reliably carry out professional financial work. Existing financial benchmarks offer only a partial view of this ability, because they primarily evaluate static competencies such as question answering, retrieval, summarization, and classification. We introduce Herculean, the first skill-based benchmark for agentic financial intelligence, spanning four representative workflows: Trading, Hedging, Market Insights, and Auditing. Each workflow is instantiated as a standardized MCP-based skill environment with its own tools, interaction dynamics, constraints, and success criteria, enabling consistent end-to-end assessment of heterogeneous agent systems. Across frontier agents, we find that performance is relatively strong on Trading and Market Insights but substantially weaker on Hedging and Auditing, where long-horizon coordination, state consistency, and structured verification are critical. Overall, our results point to a key gap in current agents: turning financial reasoning into dependable execution of high-stakes financial workflows.
