arxiv: 2605.28714 · v1 · pith:C6SY63RM · submitted 2026-05-27 · 💻 cs.CL · cs.AI

IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents

Pith reviewed 2026-06-29 12:26 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords IPO filings · multimodal dataset · document segmentation · financial charts · regulatory documents · model alignment · section-structured analysis

The pith

A toolkit segments over 109,000 IPO filings into sections with images, creating a dataset that shows multimodal models diverge from human judgments on chart quality and misleadingness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces the IPO-Toolkit, an open-source system that downloads IPO filings, segments them into consistent sections, extracts embedded images, and outputs structured text and visuals. It uses the toolkit to build the IPO-Dataset covering more than 109,000 filings and amendments from 1994 to 2026 along with over 76,000 images. The work defines evaluation tasks for financial charts drawn from these documents and reports that state-of-the-art multimodal models frequently disagree with expert human assessments on chart quality and misleadingness. The resources also support large-scale studies of textual variation across sections and differences in disclosure practices by industry.

Core claim

The paper presents the IPO-Toolkit for parsing long, multimodal IPO documents into standardized section-structured text and extracted images, along with the resulting IPO-Dataset of over 109,000 filings. It establishes tasks for assessing the quality and misleadingness of extracted financial charts and demonstrates that current multimodal models often diverge from expert human judgments on these tasks over real-world regulatory documents.
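As a rough illustration of what "diverging from expert human judgments" can mean on a Likert-style chart task, the sketch below computes two simple disagreement measures. The metric choice, the function name `agreement_stats`, and the toy ratings are assumptions made for this example; they are not the paper's evaluation protocol or its data.

```python
def agreement_stats(human, model):
    """Return (exact-match rate, mean absolute difference) for paired ratings."""
    if not human or len(human) != len(model):
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(human)
    exact = sum(h == m for h, m in zip(human, model)) / n
    mean_abs_diff = sum(abs(h - m) for h, m in zip(human, model)) / n
    return exact, mean_abs_diff

# Made-up 1-5 ratings for five charts, for illustration only.
human_ratings = [5, 4, 2, 1, 3]
model_ratings = [5, 5, 4, 1, 3]
exact, mean_abs_diff = agreement_stats(human_ratings, model_ratings)
```

A low exact-match rate with a small mean absolute difference would suggest near-misses, while a large mean absolute difference would point to substantive disagreement.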

What carries the argument

The IPO-Toolkit, a framework that segments filings into sections, extracts embedded images, and produces structured outputs enabling reproducible analysis of documents exceeding 500,000 tokens.
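To make "section-structured output" concrete, here is a minimal sketch of heading-based segmentation. It assumes plain-text filings whose section headings sit alone on a line in upper case; the heading list, the function `segment_filing`, and the regex approach are hypothetical and should not be read as the IPO-Toolkit's actual method or API.

```python
import re

# Hypothetical section names; the toolkit's real section inventory is not given here.
SECTION_HEADINGS = ["PROSPECTUS SUMMARY", "RISK FACTORS", "USE OF PROCEEDS", "BUSINESS"]

def segment_filing(text, headings=SECTION_HEADINGS):
    """Split filing text into {heading: body} using headings that sit alone on a line."""
    pattern = re.compile(
        r"^[ \t]*(" + "|".join(re.escape(h) for h in headings) + r")[ \t]*$",
        re.MULTILINE,
    )
    matches = list(pattern.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # Simplification: keep only the first occurrence of each heading.
        sections.setdefault(match.group(1), text[match.end():end].strip())
    return sections

sample = """PROSPECTUS SUMMARY
We make widgets.
RISK FACTORS
Our widgets may fail.
USE OF PROCEEDS
General corporate purposes.
"""
sections = segment_filing(sample)
```

Real filings spanning 1994 to 2026 vary far more than this toy input, which is why the accuracy of any such segmentation step needs separate validation.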

If this is right

  • Large-scale analysis of section-level textual variation across filings becomes possible.
  • Systematic study of cross-industry differences in visual and textual disclosure practices is enabled.
  • Benchmarks exist for testing multimodal models on reasoning over long regulatory documents.
  • Reproducible workflows for handling multimodal documents longer than 500,000 tokens are available to researchers.
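Section-level textual variation can be measured in several ways; the paper's Figure 6 reports mean type-token ratio (TTR) over time by disclosure section. The sketch below shows one way such a quantity could be computed. The lowercase word tokenizer and the `(year, section, text)` record layout are assumptions for illustration, not the paper's stated procedure.

```python
import re
from collections import defaultdict

def type_token_ratio(text):
    """Distinct word types divided by total word tokens (0.0 for empty text)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_ttr_by_year_and_section(records):
    """records: iterable of (year, section, text) -> {(year, section): mean TTR}."""
    buckets = defaultdict(list)
    for year, section, text in records:
        buckets[(year, section)].append(type_token_ratio(text))
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

# Toy records for illustration only.
records = [
    (2001, "RISK FACTORS", "risk risk risk factors"),
    (2001, "RISK FACTORS", "we face many risks"),
    (2020, "BUSINESS", "we build and we sell"),
]
mean_ttr = mean_ttr_by_year_and_section(records)
```

TTR tends to fall as text length grows, so comparisons across sections of very different lengths would need some form of length control.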

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Structured outputs could let investors compare risk disclosures across companies more directly than reading full filings.
  • Alignment work on multimodal models could prioritize regulatory financial content as a distinct domain.
  • The same segmentation approach might apply to other lengthy legal or financial regulatory filings.

Load-bearing premise

The toolkit's segmentation and image extraction produce accurate, consistent, and reproducible section-structured outputs across the full range of filings from 1994 to 2026.

What would settle it

A manual review of randomly sampled parsed filings that finds inconsistent section boundaries or systematically missing images would show the dataset outputs are not reliable.
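One way to draw such a review sample is to stratify by filing year so that early and recent filings are both represented. The sketch below is a hypothetical illustration: the `(filing_id, year)` layout, the identifiers, and the function `stratified_sample` are assumptions, not part of the released toolkit.

```python
import random
from collections import defaultdict

def stratified_sample(filings, per_year, seed=0):
    """filings: list of (filing_id, year). Draw up to `per_year` ids from each year."""
    rng = random.Random(seed)  # fixed seed so the audit sample can be reproduced
    by_year = defaultdict(list)
    for filing_id, year in filings:
        by_year[year].append(filing_id)
    sample = {}
    for year in sorted(by_year):
        ids = sorted(by_year[year])  # sort first so the draw does not depend on input order
        sample[year] = rng.sample(ids, min(per_year, len(ids)))
    return sample

# Placeholder identifiers, for illustration only.
filings = [(f"filing-{year}-{i}", year) for year in (1994, 2010, 2026) for i in range(5)]
audit = stratified_sample(filings, per_year=2)
```

Reviewers would then check section boundaries and extracted images for each sampled filing and record any errors per year.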

Figures

Figures reproduced from arXiv: 2605.28714 by Aman Patel, Arnav Hiray, Liqin Ye, Michael Galarnyk, Prasun Banerjee, Rutwik Routu, Sagnik Nandi, Siddhartha Somani, Siddharth Lohani, Sudheer Chava, Vidhyakshaya Kannan.

Figure 1: Overview of the IPO-Toolkit pipeline for constructing ...
Figure 3: Likert validator prompt for IPO section completeness.
Figure 4: Image processing pipeline for IPO filings. Human ...
Figure 6: Mean TTR over time by disclosure section. Lexical ...
Figure 8: Longitudinal trends in image-type diversity (3-year ...
Figure 9: The web interface supports queries over companies, ...

Original abstract

An Initial Public Offering (IPO) filing is a document released when a private firm goes public, allowing individual (retail) investors to purchase its shares. These filings describe a firm's business, financials, and risks and are long, multimodal documents with narrative text and images. Despite their importance to financial markets, there is no large-scale, standardized dataset or benchmark for studying IPO filings with modern language and multimodal models. These documents pose significant challenges: filings frequently exceed 500,000 tokens and lack consistent structural organization. We introduce the IPO-Toolkit, an open-source framework for downloading and parsing IPO filings into standardized section-structured text and extracted images. The toolkit segments filings, extracts embedded images, and produces structured outputs that enable large-scale, reproducible analysis workflows over long, multimodal documents. Using this infrastructure, we construct the IPO-Dataset, a large, section-structured, multimodal dataset covering more than 109,000 IPO filings and amendments from 1994 to 2026 and containing over 76,000 images. We establish structured evaluation tasks over extracted financial charts, including chart quality and misleadingness assessment. Our experiments show that state-of-the-art multimodal models often diverge from expert human judgments on these tasks, exposing alignment challenges in multimodal reasoning over long, real-world regulatory documents. Beyond benchmarking, the IPO-Dataset enables large-scale analysis of section-level textual variation and cross-industry differences in visual and textual disclosure practices. Our code, dataset, and website are publicly available under CC-BY-4.0.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the IPO-Toolkit, an open-source framework for downloading and parsing long IPO filings into standardized section-structured text and extracted images. It constructs the IPO-Dataset covering more than 109,000 filings and amendments (1994–2026) containing over 76,000 images. Structured evaluation tasks are defined over extracted financial charts (chart quality and misleadingness assessment), with experiments showing that state-of-the-art multimodal models diverge from expert human judgments, highlighting alignment challenges in multimodal reasoning over regulatory documents.

Significance. If the toolkit's parsing and extraction steps are shown to be accurate, the work supplies a large-scale, reproducible resource for section-level analysis of multimodal financial disclosures and provides concrete evidence of model-human divergence on chart-based tasks; the public release of code, data, and website under CC-BY-4.0 strengthens its utility for the community.

major comments (2)
  1. [IPO-Toolkit description] The IPO-Toolkit description (abstract and associated methods): no quantitative validation of segmentation accuracy, image extraction fidelity, or error rates on filings from 1994–2026 is reported (e.g., no precision/recall against manual annotations or inter-annotator agreement). This is load-bearing for the central claim, because the reported model divergences on chart quality and misleadingness are defined over outputs produced by these steps; without validation, divergence could be an artifact of pipeline errors rather than a genuine alignment issue.
  2. [Dataset construction] Dataset construction section: the abstract states the dataset enables 'large-scale analysis of section-level textual variation,' yet supplies no statistics on section consistency, token-length distributions per section, or handling of filings exceeding 500,000 tokens; this directly affects reproducibility of the evaluation tasks.
minor comments (2)
  1. [Abstract] Abstract: the exact total number of filings (rather than 'more than 109,000') and a breakdown by year or amendment status would improve precision.
  2. [Methods] The paper should clarify the definition of 'section' used by the toolkit and how embedded images are associated with specific sections.
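The precision/recall validation requested in major comment 1 could be scored roughly as follows. This is a sketch under stated assumptions: section boundaries are represented as character offsets, a prediction counts as correct when it falls within a tolerance of an unmatched gold boundary, and matching is greedy. None of this is taken from the paper; the function `boundary_prf` and the offsets are illustrative.

```python
def boundary_prf(predicted, gold, tolerance=0):
    """Precision, recall, F1 of predicted boundary offsets against gold offsets.

    A predicted boundary is a true positive if it lies within `tolerance`
    characters of a gold boundary that has not already been matched.
    """
    unmatched = sorted(gold)
    true_pos = 0
    for p in sorted(predicted):
        hit = next((g for g in unmatched if abs(g - p) <= tolerance), None)
        if hit is not None:
            unmatched.remove(hit)  # each gold boundary can be matched only once
            true_pos += 1
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy offsets for one filing, for illustration only.
precision, recall, f1 = boundary_prf(
    predicted=[0, 105, 400, 700],
    gold=[0, 100, 410, 650, 905],
    tolerance=10,
)
```

Reporting these scores separately by filing year would show whether segmentation quality degrades on older filings.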

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of validation and reproducibility. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

point-by-point responses
  1. Referee: [IPO-Toolkit description] The IPO-Toolkit description (abstract and associated methods): no quantitative validation of segmentation accuracy, image extraction fidelity, or error rates on filings from 1994–2026 is reported (e.g., no precision/recall against manual annotations or inter-annotator agreement). This is load-bearing for the central claim, because the reported model divergences on chart quality and misleadingness are defined over outputs produced by these steps; without validation, divergence could be an artifact of pipeline errors rather than a genuine alignment issue.

    Authors: We agree that quantitative validation of the toolkit's parsing steps is essential to support the downstream evaluation claims. The current manuscript focuses on the overall framework and dataset release rather than exhaustive error analysis. In the revised version, we will add a dedicated validation subsection reporting precision/recall for section segmentation and image extraction on a manually annotated sample of filings spanning the 1994–2026 period, along with inter-annotator agreement statistics. This will directly address whether observed model-human divergences could stem from pipeline artifacts. revision: yes

  2. Referee: [Dataset construction] Dataset construction section: the abstract states the dataset enables 'large-scale analysis of section-level textual variation,' yet supplies no statistics on section consistency, token-length distributions per section, or handling of filings exceeding 500,000 tokens; this directly affects reproducibility of the evaluation tasks.

    Authors: We concur that explicit statistics on section properties and long-document handling would improve reproducibility. The manuscript currently emphasizes the scale and structure of the IPO-Dataset but omits these details. In revision, we will expand the dataset construction section to include per-section token-length distributions, measures of section consistency across filings, and a description of our approach to documents exceeding 500,000 tokens (including any chunking or truncation methods used in the evaluation tasks). revision: yes
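The per-section length statistics and long-document handling promised in response 2 might look something like the sketch below. Overlapping fixed-size windows are only one possible chunking strategy, and whitespace tokens stand in for a real tokenizer; the functions `chunk_tokens` and `section_length_summary` are hypothetical and do not describe the authors' approach.

```python
import statistics

def chunk_tokens(tokens, max_tokens, overlap=0):
    """Split a token list into windows of at most `max_tokens` sharing `overlap` tokens."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

def section_length_summary(lengths_by_section):
    """{section: [token counts]} -> {section: (min, median, max)}."""
    return {
        name: (min(vals), statistics.median(vals), max(vals))
        for name, vals in lengths_by_section.items()
    }

tokens = "a b c d e f g h i j".split()  # whitespace tokens, for illustration only
chunks = chunk_tokens(tokens, max_tokens=4, overlap=1)
summary = section_length_summary({"RISK FACTORS": [1200, 5400, 3100]})
```

For filings above 500,000 tokens, chunking within section boundaries rather than across them would keep each window tied to a single disclosure section.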

Circularity Check

0 steps flagged

Resource-release paper with no derivation chain or fitted predictions

full rationale

The paper introduces the IPO-Toolkit and IPO-Dataset as infrastructure for analysis, followed by evaluation tasks on extracted charts. No equations, parameters, or predictions are defined or derived within the paper. The central experiments compare multimodal models to human judgments on tasks built from the released dataset; these do not reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. The contribution rests on the public release of code and data rather than any internal mathematical reduction. This matches the default expectation for non-circular resource papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an engineering and data-release paper; it introduces no free parameters, mathematical axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5855 in / 1141 out tokens · 39694 ms · 2026-06-29T12:26:29.712671+00:00 · methodology
