pith. machine review for the scientific record.

arxiv: 2604.08948 · v2 · submitted 2026-04-10 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice

Gang Hu, Haiyan Ding, Jiajia Huang, Kun Yue, Min Peng, Qianqian Xie, Wang Gao, Yating Chen

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:55 UTC · model grok-4.3

classification 💻 cs.CL
keywords TaxPraBen · LLM benchmark · Chinese tax practice · structured evaluation · real-world scenarios · performance disparities · closed-source models · domain-specific tasks

The pith

TaxPraBen shows closed-source large-parameter LLMs outperforming other models on real-world Chinese tax tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TaxPraBen as a new benchmark that combines ten standard tasks with three real-world tax scenarios to test LLMs on practical Chinese taxation work. It draws from fourteen datasets with 7.3K instances and applies a structured process of parsing, field alignment, extraction, and numerical-textual matching to score models end to end. Evaluations of nineteen models using Bloom's taxonomy levels reveal clear performance differences. A sympathetic reader would care because tax practice involves regulated knowledge where general models often fail, and such a benchmark can show what capabilities actually transfer to professional use.

Core claim

TaxPraBen introduces a scalable structured evaluation paradigm for end-to-end assessment of LLMs in Chinese tax practice. It covers ten traditional tasks plus three new scenarios in risk prevention, inspection analysis, and strategy planning. Testing nineteen models finds that all closed-source large-parameter LLMs perform strongly, Chinese LLMs such as Qwen2.5 generally surpass multilingual ones, and the YaYi2 model fine-tuned on some tax data shows only limited gains.

What carries the argument

The structured evaluation paradigm: structured parsing, followed by field-alignment extraction and numerical and textual matching, which turns raw model outputs into comparable scores across tax scenarios.
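
To make that pipeline concrete, here is a minimal Python sketch of a parse-align-match scorer. The field names, the JSON-first parsing with a loose key-value fallback, the 1% numerical tolerance, and the equal weighting of fields are illustrative assumptions, not the paper's actual implementation.

    import json
    import re

    # Hypothetical gold record for one tax-planning instance.
    GOLD = {"taxable_income": 128000.0,
            "tax_due": 11700.0,
            "basis": "Article 3, Individual Income Tax Law"}

    def structured_parse(raw_output):
        """Step 1, structured parsing: try strict JSON, then fall back
        to a loose key-value scan over the raw model output."""
        try:
            return json.loads(raw_output)
        except json.JSONDecodeError:
            pairs = re.findall(r'"?([\w]+)"?\s*[:：]\s*"?([^,\n"}]+)', raw_output)
            return {k.strip(): v.strip() for k, v in pairs}

    def align_fields(parsed, gold):
        """Step 2, field alignment and extraction: keep only the fields
        the gold record defines."""
        return {k: parsed.get(k) for k in gold}

    def match_score(extracted, gold, rel_tol=0.01):
        """Step 3, numerical and textual matching: numbers match within
        a relative tolerance, text must match exactly; the score is the
        fraction of gold fields matched."""
        hits = 0
        for key, gold_val in gold.items():
            pred = extracted.get(key)
            if pred is None:
                continue
            if isinstance(gold_val, float):
                try:
                    if abs(float(pred) - gold_val) <= rel_tol * abs(gold_val):
                        hits += 1
                except (TypeError, ValueError):
                    pass
            elif str(pred).strip() == str(gold_val).strip():
                hits += 1
        return hits / len(gold)

    raw = '{"taxable_income": 128000, "tax_due": 11700.0, "basis": "Article 3, Individual Income Tax Law"}'
    print(match_score(align_fields(structured_parse(raw), GOLD), GOLD))  # 1.0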

If this is right

  • Closed-source large models are currently better positioned for deployment in tax-related applications than smaller or open models.
  • Language-specific training gives Chinese LLMs an edge over multilingual models when handling tax regulations and terminology.
  • Fine-tuning on limited tax data produces only modest gains, suggesting broader or higher-quality data is needed for strong results.
  • The same structured paradigm can be reused to build benchmarks in other regulated professional domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers of tax tools may gain more from scaling up general models than from narrow fine-tuning on existing tax datasets.
  • The benchmark could guide creation of new training data that targets the specific gaps observed in inspection analysis and strategy planning.
  • Similar structured matching approaches might help evaluate LLMs in other high-stakes legal or financial fields where outputs must match precise rules.

Load-bearing premise

The chosen real-world scenarios and the parsing-alignment-matching process together measure practical tax capabilities without missing important regulated aspects of the work.

What would settle it

A model that scores low on TaxPraBen yet handles actual tax filings, audits, or planning tasks accurately in professional settings would indicate the benchmark misses key capabilities.

Figures

Figures reproduced from arXiv: 2604.08948 by Gang Hu, Haiyan Ding, Jiajia Huang, Kun Yue, Min Peng, Qianqian Xie, Wang Gao, Yating Chen.

Figure 1: TaxPraBen’s data construction workflow uses 3 methods: (A) Book Data Collection, (B) Official Document …
Figure 2: A unified output format protocol for 3 typical cases of the tax practice scenarios.
Figure 3: Structured evaluation pipeline for TaxPlan.
Figure 4: The zero-shot and one-shot overall performance of the 19 popular LLMs evaluated on TaxPraBen.
original abstract

While Large Language Models (LLMs) excel in various general domains, they exhibit notable gaps in the highly specialized, knowledge-intensive, and legally regulated Chinese tax domain. Consequently, while tax-related benchmarks are gaining attention, many focus on isolated NLP tasks, neglecting real-world practical capabilities. To address this issue, we introduce TaxPraBen, the first dedicated benchmark for Chinese taxation practice. It combines 10 traditional application tasks, along with 3 pioneering real-world scenarios: tax risk prevention, tax inspection analysis, and tax strategy planning, sourced from 14 datasets totaling 7.3K instances. TaxPraBen features a scalable structured evaluation paradigm designed through process of "structured parsing-field alignment extraction-numerical and textual matching", enabling end-to-end tax practice assessment while being extensible to other domains. We evaluate 19 LLMs based on Bloom's taxonomy. The results indicate significant performance disparities: all closed-source large-parameter LLMs excel, and Chinese LLMs like Qwen2.5 generally exceed multilingual LLMs, while the YaYi2 LLM, fine-tuned with some tax data, shows only limited improvement. TaxPraBen serves as a vital resource for advancing evaluations of LLMs in practical applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces TaxPraBen, the first benchmark dedicated to Chinese taxation practice. It integrates 10 traditional NLP application tasks with 3 novel real-world scenarios (tax risk prevention, tax inspection analysis, and tax strategy planning) sourced from 14 datasets totaling 7.3K instances. The work proposes a scalable structured evaluation paradigm based on structured parsing, field alignment, extraction, and numerical/textual matching to enable end-to-end assessment of practical tax capabilities. The authors evaluate 19 LLMs using Bloom's taxonomy and report performance disparities: closed-source large-parameter models excel, Chinese LLMs (e.g., Qwen2.5) generally outperform multilingual models, and the tax-fine-tuned YaYi2 shows only limited improvement.

Significance. If the evaluation paradigm proves reliable, TaxPraBen fills an important gap by moving beyond isolated NLP tasks to assess LLMs in a knowledge-intensive, legally regulated domain. The combination of traditional tasks with pioneering real-world scenarios and the extensible structured matching approach are strengths that could support broader domain adaptation. The reported model-type disparities offer useful initial signals for LLM development in practical applications, and the benchmark's scale (7.3K instances) and public intent add value for the community.

major comments (2)
  1. [Abstract] The reported 'significant performance disparities' (closed-source vs. open, Chinese vs. multilingual, limited gain from YaYi2 fine-tuning) are presented without any reference to statistical significance testing, error bars, or controls for dataset quality and annotation bias; this is load-bearing for the central empirical claims about LLM capabilities.
  2. [Evaluation Method] Evaluation paradigm description: no details are supplied on validation of the 'structured parsing-field alignment extraction-numerical and textual matching' method against expert human judgments, inter-annotator agreement, or coverage of regulated tax work aspects; without such evidence the claim that the paradigm accurately measures end-to-end practical capabilities cannot be assessed.
minor comments (1)
  1. The manuscript would benefit from explicit dataset citations and provenance details for the 14 source datasets to allow reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and will incorporate revisions to strengthen the statistical rigor and validation of our evaluation paradigm.

point-by-point responses
  1. Referee: [Abstract] The reported 'significant performance disparities' (closed-source vs. open, Chinese vs. multilingual, limited gain from YaYi2 fine-tuning) are presented without any reference to statistical significance testing, error bars, or controls for dataset quality and annotation bias; this is load-bearing for the central empirical claims about LLM capabilities.

    Authors: We agree that the abstract and main results would benefit from greater statistical rigor. The current manuscript presents performance tables for 19 LLMs but omits significance testing and error bars. We will revise to include bootstrap confidence intervals or paired statistical tests comparing model categories (e.g., closed- vs. open-source, Chinese vs. multilingual); a paired-bootstrap sketch of this kind of test appears after these responses. We will also expand the dataset section to detail curation procedures, expert annotation protocols, and steps taken to address potential quality or bias issues. revision: yes

  2. Referee: [Evaluation Method] Evaluation paradigm description: no details are supplied on validation of the 'structured parsing-field alignment extraction-numerical and textual matching' method against expert human judgments, inter-annotator agreement, or coverage of regulated tax work aspects; without such evidence the claim that the paradigm accurately measures end-to-end practical capabilities cannot be assessed.

    Authors: The structured evaluation paradigm is presented as a scalable, extensible approach in the methods section. We acknowledge the absence of explicit human validation or IAA metrics. In the revision we will add a new subsection reporting a human judgment study, inter-annotator agreement scores, and an analysis of coverage for core regulated tax practices, thereby providing direct evidence for the paradigm's reliability; a Cohen's kappa sketch for the agreement computation appears below. revision: yes
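
On response 1: a paired bootstrap over benchmark instances is one standard way to attach uncertainty to such category comparisons. The sketch below is illustrative only; the per-instance scores are synthetic placeholders, and the grouping into one closed-source and one open model is assumed for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    def paired_bootstrap(scores_a, scores_b, n_resamples=10_000):
        """Resample instance indices with replacement (jointly for both
        models, hence 'paired') and record the mean score difference on
        each resample; return the observed difference and a 95%
        percentile confidence interval."""
        a, b = np.asarray(scores_a), np.asarray(scores_b)
        idx = rng.integers(0, len(a), size=(n_resamples, len(a)))
        diffs = (a[idx] - b[idx]).mean(axis=1)
        return a.mean() - b.mean(), np.percentile(diffs, [2.5, 97.5])

    # Synthetic per-instance accuracies standing in for real TaxPraBen scores.
    closed_model = rng.beta(8, 3, size=500)
    open_model = rng.beta(6, 4, size=500)
    diff, (lo, hi) = paired_bootstrap(closed_model, open_model)
    print(f"mean difference {diff:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")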
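
On response 2: for the proposed inter-annotator agreement numbers, Cohen's kappa over paired expert labels is the usual starting point. A minimal sketch, with hypothetical 'ok'/'bad' acceptability labels standing in for real expert judgments of model answers:

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Agreement between two annotators beyond what their label
        frequencies alone would predict."""
        n = len(labels_a)
        observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
        return (observed - expected) / (1 - expected)

    expert_1 = ["ok", "ok", "bad", "ok", "bad", "ok"]
    expert_2 = ["ok", "bad", "bad", "ok", "bad", "ok"]
    print(f"kappa = {cohens_kappa(expert_1, expert_2):.2f}")  # kappa = 0.67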

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark construction

full rationale

The paper constructs TaxPraBen by aggregating 14 external datasets into 10 tasks plus 3 new scenarios and applies a standard structured parsing/matching evaluation paradigm. No mathematical derivations, fitted parameters, or predictions are present; performance disparities are reported directly from model evaluations on held-out instances. Bloom's taxonomy and the matching procedure are imported as external standards rather than defined in terms of the benchmark outputs. No self-citation chains, ansatzes, or renamings reduce the central claims to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that Bloom's taxonomy appropriately categorizes LLM capabilities for tax practice and that the structured evaluation paradigm faithfully captures real-world performance without introducing unmeasured biases.

axioms (1)
  • domain assumption Bloom's taxonomy is a suitable framework for evaluating LLM performance in specialized tax practice tasks
    The paper states it evaluates 19 LLMs based on Bloom's taxonomy without further justification in the abstract.

pith-pipeline@v0.9.0 · 5534 in / 1283 out tokens · 39017 ms · 2026-05-10T17:55:38.148843+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 28 canonical work pages · 11 internal anchors


  3. [3]

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, and others. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609

  4. [4]

    Zhijie Bao, Wei Chen, Shengze Xiao, Kuang Ren, Jiaao Wu, Cheng Zhong, Jiajie Peng, Xuanjing Huang, and Zhongyu Wei. 2023. Disc-medllm: Bridging general large language models and real-world medical consultation. arXiv preprint arXiv:2308.14346

  5. [5]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and others. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901

  6. [6]

    Yan Cai, Linlin Wang, Ye Wang, Gerard de Melo, Ya Zhang, Yanfeng Wang, and Liang He. 2024a. Medbench: A large-scale chinese benchmark for evaluating medical large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17709--17717

  7. [7]

    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, and others. 2024b. Internlm2 technical report. arXiv preprint arXiv:2403.17297

  8. [8]

    Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Katz, and Nikolaos Aletras. 2022. Lexglue: A benchmark dataset for legal language understanding in english. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 4310--4330

  9. [9]

    Yating Chen, Siqi Lv, Peiyuan Xia, Zhenxu Wang, Yiming Qin, Qingqing Wang, and Gang Hu. 2025. Taxben: Benchmarking the chinese tax knowledge of large language models. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 307--321. Springer

  10. [10]

    Yirong Chen, Zhenyu Wang, Xiaofen Xing, Zhipei Xu, Kai Fang, Junhong Wang, Sihang Li, Jieling Wu, Qi Liu, Xiangmin Xu, and others. 2023a. Bianque: Balancing the questioning and suggestion ability of health llms with multi-turn health conversations polished by chatgpt. arXiv preprint arXiv:2310.15896

  11. [11]

    Yirong Chen, Xiaofen Xing, Jingkai Lin, Huimin Zheng, Zhenyu Wang, Qi Liu, and Xiangmin Xu. 2023b. Soulchat: Improving llms' empathy, listening, and comfort abilities through fine-tuning with multi-turn empathy conversations. arXiv preprint arXiv:2311.00273

  12. [12]

    Eunkyung Choi, Young Jin Suh, Hun Park, and Wonseok Hwang. 2025. Taxation perspectives from large language models: A case study on additional tax penalties. arXiv preprint arXiv:2503.03444

  13. [13]

    Yiming Cui, Ziqing Yang, and Xin Yao. 2023. Efficient and effective text encoding for chinese llama and alpaca. arXiv preprint arXiv:2304.08177

  14. [14]

    Yongfu Dai, Duanyu Feng, Jimin Huang, Haochen Jia, Qianqian Xie, Yifang Zhang, Weiguang Han, Wei Tian, and Hao Wang. 2025. Laiw: A chinese legal large language models benchmark. In Proceedings of the 31st International conference on computational linguistics, pages 10738--10766

  15. [15]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, and others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783

  16. [16]

    Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. 2024. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625--630

  17. [17]

    Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Alan Huang, Songyang Zhang, Kai Chen, Zhixin Yin, Zongwen Shen, and others. 2024. Lawbench: Benchmarking legal knowledge of large language models. In Proceedings of the 2024 conference on empirical methods in natural language processing, pages 7933--7962

  18. [18]

    Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, and others. 2021. A framework for few-shot language model evaluation. Zenodo

  19. [19]

    Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, and others. 2024. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793

  20. [20]

    Cyril Goutte and Eric Gaussier. 2005. A probabilistic interpretation of precision, recall and f-score, with implication for evaluation. In European conference on information retrieval, pages 345--359. Springer

  21. [21]

    Zhouhong Gu, Xiaoxuan Zhu, Haoning Ye, Lin Zhang, Jianchen Wang, Yixin Zhu, Sihang Jiang, Zhuozhi Xiong, Zihan Li, Weijie Wu, and others. 2024. Xiezhi: An ever-updating benchmark for holistic domain knowledge evaluation. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 18099--18107

  22. [22]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and others. 2025a. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948

  23. [23]

    Xin Guo, Haotian Xia, Zhaowei Liu, Hanyang Cao, Zhi Yang, Zhiqiang Liu, Sizhe Wang, Jinyi Niu, Chuqi Wang, Yanhui Wang, and others. 2025b. Fineval: A chinese financial domain knowledge evaluation benchmark for large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics

  24. [24]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300

  25. [25]

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874

  26. [26]

    Gang Hu, Ke Qin, Chenhan Yuan, Min Peng, Alejandro Lopez-Lira, Benyou Wang, Sophia Ananiadou, Jimin Huang, and Qianqian Xie. 2024. No language is an island: Unifying chinese and english in financial large language models, instruction data, and benchmarks. arXiv preprint arXiv:2403.06249

  27. [27]

    Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Yao Fu, and others. 2023. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Advances in Neural Information Processing Systems, 36:62991--63010

  28. [28]

    Huang Jiajia, Zhu Haoran, Xu Chao, Zhan Tianming, Xie Qianqian, and Huang Jimin. 2024. Auditwen: An open-source large language model for audit. In Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference), pages 1351--1365

  29. [29]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825

  30. [30]

    Michael Krumdick, Rik Koncel-Kedziorski, Viet Dac Lai, Varshini Reddy, Charles Lovering, and Chris Tanner. 2024. Bizbench: A quantitative reasoning benchmark for business and finance. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 8309--8332

  31. [31]

    Yanyang Li, Jianqiao Zhao, Duo Zheng, Zi-Yuan Hu, Zhi Chen, Xiaohui Su, Yongfeng Huang, Shijia Huang, Dahua Lin, Michael Lyu, and others. 2023. Cleva: Chinese language models evaluation platform. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 186--217

  32. [32]

    Mianxin Liu, Weiguo Hu, Jinru Ding, Jie Xu, Xiaoyang Li, Lifeng Zhu, Zhian Bai, Xiaoming Shi, Benyou Wang, Haitao Song, and others. 2024. Medbench: A comprehensive, standardized, and reliable benchmarking system for evaluating chinese medical large language models. Big Data Mining and Analytics, 7(4):1116--1128

  33. [33]

    Yin Luo, Qingchao Kong, Nan Xu, Jia Cao, Bao Hao, Baoyu Qu, Bo Chen, Chao Zhu, Chenyang Zhao, Donglei Zhang, and others. 2023. Yayi 2: Multilingual open-source large language models. arXiv preprint arXiv:2312.14862

  34. [34]

    Brian W Matthews. 1975. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, 405(2):442--451

  35. [35]

    Ziyang Miao, Qiyu Sun, Jingyuan Wang, Yuchen Gong, Yaowei Zheng, Shiqi Li, and Richong Zhang. 2025. Easy dataset: A unified and extensible framework for synthesizing llm fine-tuning data from unstructured documents. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 960--968

  36. [36]

    Louis Brulé Naudet. 2023. Livre des procédures fiscales, non-instruct (11-12-2023). https://hf-mirror.com/datasets/louisbrulenaudet/lpf

  37. [37]

    John J Nay, David Karamardian, Sarah B Lawsky, Wenting Tao, Meghana Bhat, Raghav Jain, Aaron Travis Lee, Jonathan H Choi, and Jungo Kasai. 2024. Large language models as tax attorneys: a case study in legal capabilities emergence. Philosophical Transactions of the Royal Society A, 382(2270):20230159

  38. [38]

    Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. 2022. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning, pages 248--260. PMLR

  39. [39]

    Xueqing Peng, Triantafillos Papadopoulos, Efstathia Soufleri, Polydoros Giannouris, Ruoyu Xiang, Yan Wang, Lingfei Qian, Jimin Huang, Qianqian Xie, and Sophia Ananiadou. 2025. Plutus: Benchmarking large language models in low-resource greek finance. arXiv preprint arXiv:2502.18772

  40. [40]

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383--2392

  41. [41]

    Thomas Rixen and Brigitte Unger. 2022. Taxation: A regulatory multilevel governance perspective. Regulation & Governance, 16(3):621--633

  42. [42]

    Daniel Steinigen, Marcin Namysl, Markus Hepperle, Jan Krekeler, and Susanne Landgraf. 2023. Semantic extraction of key figures and their properties from tax legal texts using neural models. In ASAIL@ICAIL, pages 60--71

  43. [43]

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, and others. 2024. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295

  44. [44]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and others. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

  45. [45]

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP, pages 353--355

  46. [46]

    Haochun Wang, Chi Liu, Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin, and Ting Liu. 2023. Huatuo: Tuning llama model with chinese medical knowledge. arXiv preprint arXiv:2304.06975

  47. [47]

    Xidong Wang, Guiming Chen, Song Dingjie, Zhang Zhiyi, Zhihong Chen, Qingying Xiao, Junying Chen, Feng Jiang, Jianquan Li, Xiang Wan, and others. 2024. Cmb: A comprehensive medical benchmark in chinese. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 6...

  48. [48]

    Qianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, and others. 2024. Finben: A holistic financial benchmark for large language models. Advances in Neural Information Processing Systems, 37:95716--95743

  49. [49]

    Qianqian Xie, Weiguang Han, Xiao Zhang, Yanzhao Lai, Min Peng, Alejandro Lopez-Lira, and Jimin Huang. 2023. Pixiu: A comprehensive benchmark, instruction dataset and large language model for finance. Advances in Neural Information Processing Systems, 36:33469--33484

  50. [50]

    Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, and others. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305

  51. [51]

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, and others. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671

  52. [52]

    Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Guoyin Wang, Heng Li, Jiangcheng Zhu, Jianqun Chen, and others. 2024. Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652

  53. [53]

    Jingsi Yu, Junhui Zhu, Yujie Wang, Yang Liu, Hongxiang Chang, Jinran Nie, Cunliang Kong, R Chong, Xin Liu, Jiyuan An, and others. 2023. Taoli llama.

  54. [54]

    Wanlong Yu, Wei Wan, Zhenxu Wang, Feng Li, Kang Wang, and Gang Hu. 2025. Open bilingual benchmark and leaderboard for large language models in cybersecurity. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 456--470. Springer

  55. [55]

    Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. Bartscore: Evaluating generated text as text generation. Advances in neural information processing systems, 34:27263--27277

  56. [56]

    Shengbin Yue, Wei Chen, Siyuan Wang, Bingxuan Li, Chenchen Shen, and others. 2023. Disc-lawllm: Fine-tuning large language models for intelligent legal services. arXiv preprint arXiv:2309.11325

  57. [57]

    Hui Zeng. 2023. Measuring massive multitask chinese understanding. arXiv preprint arXiv:2304.12986

  58. [58]

    Hui Zeng, Jingyuan Xue, Meng Hao, Chen Sun, Bin Ning, and Na Zhang. 2024. Withdrawn: Evaluating the generation capabilities of large chinese language models

  59. [59]

    Shaolei Zhang, Kehao Zhang, Qingkai Fang, Shoutao Guo, Yan Zhou, Xiaodong Liu, and Yang Feng. 2024. Bayling 2: A multilingual large language model with efficient language alignment. arXiv preprint arXiv:2411.16300

  60. [60]

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675

  61. [61]

    Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. 2023a. Evaluating the performance of large language models on gaokao benchmark. arXiv preprint arXiv:2305.12474

  62. [62]

    Xuanyu Zhang, Bingbing Li, and Qing Yang. 2023b. Cgce: A chinese generative chat evaluation benchmark for general and financial domains. arXiv preprint arXiv:2305.14471

  63. [63]

    Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. 2024. Atom: Low-bit quantization for efficient and accurate llm serving. Proceedings of Machine Learning and Systems, 6:196--209

  64. [64]

    Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. 2024. Agieval: A human-centric benchmark for evaluating foundation models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2299--2314

  65. [65]

    Zhi Zhou, Jiang-Xin Shi, Peng-Xiao Song, Xiao-Wen Yang, Yi-Xuan Jin, Lan-Zhe Guo, and Yu-Feng Li. 2024. Lawgpt: A chinese legal knowledge-enhanced large language model. arXiv preprint arXiv:2406.04614