pith. machine review for the scientific record.

arxiv: 2605.03147 · v1 · submitted 2026-05-04 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:57 UTC · model grok-4.3

classification 💻 cs.CL
keywords KPI extraction · earnings calls · large language models · financial information extraction · domain shift · benchmark datasets · emergent metrics · open-ended extraction

The pith

LLMs enable open-ended extraction of emergent KPIs from unstructured earnings call transcripts at 79.7 percent human-verified precision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Earnings calls deliver timely financial insights through conversational language but lack the labels and structure of SEC filings, making automated extraction difficult. Models trained on filings show poor generalization when applied to calls. The work introduces the benchmarks SECB and ECB, plus the expert-annotated subset ECB-A with 2,460 annotation groups, then shows that LLMs can perform open-ended KPI extraction. Human evaluation confirms 79.7 percent precision, establishing a usable baseline for tracking performance metrics that emerge in discussion rather than in standard reports. This matters because reliable extraction would let analysts and investors surface non-standard company metrics faster and more consistently.

Core claim

Encoder-based models trained on SEC filings fail to generalize to earnings calls because of the domain shift from templatic to conversational text. The authors therefore build new benchmarks SECB and ECB plus the expert-annotated ECB-A subset. They demonstrate that an LLM system can extract emergent KPIs directly from call transcripts, with human raters confirming 79.7 percent precision, thereby supplying the first baseline for consistent KPI tracking in this domain.

What carries the argument

LLM open-ended extraction pipeline applied to conversational earnings-call transcripts, evaluated against the ECB-A expert-annotated benchmark for emergent KPIs.
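The paper's exact prompt and output schema are not given here, so the following is a minimal sketch of what one open-ended extraction step could look like: a single LLM call per transcript chunk returning (kpi, value, period) triples as JSON. The call_llm helper and the triple schema are illustrative assumptions, not the authors' implementation.

```python
import json

# Hypothetical instruction for open-ended extraction: the model names the
# KPI itself rather than choosing from a fixed label set.
PROMPT = """From the earnings-call excerpt below, list every key performance
indicator (KPI) that management reports, including non-standard,
company-specific metrics. Return a JSON list of objects with keys
"kpi", "value", and "period".

Excerpt:
{excerpt}
"""

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion client; not the paper's system."""
    raise NotImplementedError

def extract_kpis(excerpt: str) -> list[dict]:
    raw = call_llm(PROMPT.format(excerpt=excerpt))
    try:
        candidates = json.loads(raw)
    except json.JSONDecodeError:
        return []  # a malformed generation simply yields no triples
    # Keep only well-formed triples; anything spurious that survives this
    # filter is what the human raters' precision judgment would catch.
    return [c for c in candidates
            if isinstance(c, dict) and {"kpi", "value", "period"} <= c.keys()]
```

Human raters then judge each returned triple for relevance and accuracy; a precision figure like the reported 79.7 percent is the accepted fraction.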

If this is right

  • Encoder-based models trained on SEC filings do not transfer effectively to the conversational domain of earnings calls.
  • LLMs support extraction of non-standard, company-specific KPIs that appear in calls but not in templated reports.
  • Human-verified precision of 79.7 percent provides a concrete baseline for the task of tracking emergent performance indicators.
  • Consistent, automated monitoring of these KPIs across successive earnings calls becomes feasible (a sketch of such tracking follows this list).
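A minimal sketch of what that monitoring could look like, assuming extraction already yields (kpi, value, period) triples per call; the lowercase normalization is a deliberately naive stand-in for real entity matching across KPI aliases.

```python
from collections import defaultdict

def track_kpis(calls: dict[str, list[dict]]) -> dict[str, dict[str, str]]:
    """Turn {quarter label: extracted triples} into {kpi: {quarter: value}},
    i.e. one longitudinal series per metric. Quarter labels like "2024Q3"
    sort chronologically as strings."""
    series: dict[str, dict[str, str]] = defaultdict(dict)
    for quarter, triples in sorted(calls.items()):
        for t in triples:
            # Naive alias merging: "Active Riders" and "active riders"
            # collapse to one series; real systems need entity matching.
            name = t["kpi"].strip().lower()
            series[name][quarter] = t["value"]
    return dict(series)
```

For example (illustrative values only), track_kpis({"2024Q2": [{"kpi": "Active Riders", "value": "23.7M", "period": "Q2"}], "2024Q3": [{"kpi": "active riders", "value": "24.4M", "period": "Q3"}]}) returns {"active riders": {"2024Q2": "23.7M", "2024Q3": "24.4M"}}.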

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extracted KPIs could be tracked over time for a single company to detect shifts in what management chooses to emphasize in calls (see the sketch after this list).
  • The method might be extended to compare KPI language across peer firms within the same industry.
  • Downstream systems could test whether the extracted KPIs predict subsequent earnings surprises or stock reactions.
  • The annotation scheme itself could be reused or adapted to create training data for finer-grained financial event detection.
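For the first two extensions, a minimal sketch under the assumption that each call (or each peer firm's call in a given quarter) has already been reduced to a set of normalized KPI names; the Jaccard-style overlap score is this note's choice, not anything from the paper.

```python
def emphasis_shift(prev: set[str], curr: set[str]) -> dict:
    """Compare the KPI vocabularies of two calls: consecutive quarters of
    one company, or two peer firms in the same quarter."""
    dropped = prev - curr        # metrics management stopped citing
    introduced = curr - prev     # newly emphasized metrics
    union = prev | curr
    overlap = len(prev & curr) / len(union) if union else 1.0  # Jaccard
    return {"dropped": dropped, "introduced": introduced, "overlap": overlap}
```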

Load-bearing premise

The 2,460 expert annotation groups and the definition of emergent KPIs supply a reliable, generalizable ground truth for extraction quality across different companies and sectors.

What would settle it

If a new collection of earnings calls is annotated by independent experts and the LLM-extracted KPIs receive human approval below 60 percent relevance and accuracy, the claim of a reliable baseline would be falsified.
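Stated as a check, assuming each extracted KPI receives a binary accept/reject judgment from the independent annotators:

```python
def human_precision(judgments: list[bool]) -> float:
    """Fraction of extracted KPIs the human raters accepted."""
    return sum(judgments) / len(judgments) if judgments else 0.0

def baseline_survives(judgments: list[bool], threshold: float = 0.60) -> bool:
    """The falsification test above: approval below the threshold on a
    fresh, independently annotated collection would refute the baseline."""
    return human_precision(judgments) >= threshold
```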

Figures

Figures reproduced from arXiv: 2605.03147 by Alexandre Iolov, Giovanni Rizzi, Johannes Bjerva, Mike Zhang, Rasmus T. Aavang, Rasmus Tjalk-Bøggild.

Figure 1
Figure 1. Analysis and Pipeline. We ground our analysis in the established SEC filings domain. To capture the open-ended set of KPIs in earnings calls, we adopt a relation extraction strategy to benchmark encoders and in-context learning against expert annotations. Finally, we aggregate structured outputs to generate consistent, longitudinal KPI tracking suitable for financial analysis. view at source ↗
Figure 2
Figure 2. Lyft's share price from the release of its earnings report to the end of the earnings call. When the incorrect value is presented in the earnings release, the price rises quickly. However, once the error is corrected during the earnings call, the price rapidly drops. view at source ↗
[Dataset table spilled into this caption; recoverable rows (entries, period, entities): FiNER-139: 1.1M, 2016-2020, 387K · HiFi-KPI (Lite): 1.9M (8.0K), 2017-06/2024, 5,300K · SECB: 41K, 2023-2024, 78K · ECB (ECB-A): 10.5…]
Figure 3
Figure 3. Confusion matrices for SEC-BERT-BASE and … view at source ↗
Figure 5
Figure 5. Tagging Interface During Annotation. Example: 2024 Q3 JNJ earnings transcript. view at source ↗
Figure 6
Figure 6. Relation Extraction Annotation Interface. view at source ↗
Figure 7
Figure 7. Inter-Annotator Agreement (Cohen's Kappa). Pairwise kappa: 0.53 (Annotators 1 and 2), 0.27 (Annotators 1 and 3), 0.36 (Annotators 2 and 3). Annotators 1 and 2 exhibit strong alignment, whereas Annotator 3 demonstrates notably lower agreement with the other evaluators. view at source ↗
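For reference, Figure 7's statistic: Cohen's kappa corrects the raw agreement rate between two annotators for the agreement expected by chance from their individual label distributions. A minimal sketch for nominal labels:

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement for two annotators labeling the same items."""
    assert a and len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n                       # observed
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca.keys() & cb.keys()) / n ** 2  # chance
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
```

For instance, cohens_kappa(["KPI", "O", "KPI"], ["KPI", "O", "O"]) evaluates to 0.4: two of three items agree, but much of that agreement is expected by chance.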
read the original abstract

Earnings calls are a key source of financial information about public companies. However, extracting information from these calls is difficult. Unlike the templatic filings required by the U.S. Securities and Exchange Commission (SEC) to report a company's financial situation, earnings conference calls have no built-in labels, are unstructured, and feature conversational language. We explore this challenging domain by assessing the information captured by models trained on SEC filings and in-context learning methods. To establish a baseline, we first evaluate the generalization capabilities of SEC-trained models across established SEC datasets. To support our investigation, we introduce three novel benchmarks: (1) SEC Filings Benchmark (SECB), (2) Earnings Calls Benchmark (ECB), and ECB-A, a subset with 2,460 expert annotation groups to support our qualitative analysis. We find that encoder-based models struggle with the domain shift. Finally, we propose a system utilizing LLMs to perform open-ended extraction from unstructured call transcripts, verified by human evaluation (79.7% precision), providing a baseline for this valuable domain through the consistent tracking of emergent KPIs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper addresses the challenge of extracting Key Performance Indicators (KPIs) from unstructured earnings conference call transcripts. It evaluates the cross-domain generalization of models trained on SEC filings, introduces three new benchmarks (SECB, ECB, and ECB-A with 2,460 expert annotation groups), finds that encoder-based models struggle with domain shift from filings to calls, and proposes an LLM-based open-ended extraction system that achieves 79.7% precision under human evaluation, positioning this as a baseline for tracking emergent KPIs.

Significance. If the human evaluation proves reliable, the work fills a notable gap in financial NLP by moving beyond templatic SEC filings to conversational transcripts. The new benchmarks and LLM baseline could enable consistent tracking of emergent KPIs, with credit due for the focus on open-ended extraction and the scale of expert annotations attempted. The result would be a useful starting point for the domain, though its impact depends on verifiable annotation quality.

major comments (1)
  1. [Abstract] The headline result of 79.7% precision from human evaluation on the ECB-A subset is central to the claim that the LLM system supplies a usable baseline. However, the abstract supplies no inter-annotator agreement statistic, no operational definition distinguishing emergent from standard KPIs, no annotation guidelines, and no description of how annotators handled conversational ambiguity or domain terminology. Without these, the precision figure cannot be interpreted as evidence of consistent extraction quality rather than idiosyncratic annotator alignment.
minor comments (1)
  1. [Benchmark Introduction] The benchmark description lists three items as (1) SECB, (2) ECB, and ECB-A, yet ECB-A is explicitly a subset of ECB; clarify the exact relationships, sizes, and construction details in the benchmark section to avoid confusion.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback, which highlights an important opportunity to strengthen the interpretability of our human evaluation results. We agree that the abstract should be more self-contained regarding the annotation process and will revise it accordingly.

read point-by-point responses
  1. Referee: [Abstract] The headline result of 79.7% precision from human evaluation on the ECB-A subset is central to the claim that the LLM system supplies a usable baseline. However, the abstract supplies no inter-annotator agreement statistic, no operational definition distinguishing emergent from standard KPIs, no annotation guidelines, and no description of how annotators handled conversational ambiguity or domain terminology. Without these, the precision figure cannot be interpreted as evidence of consistent extraction quality rather than idiosyncratic annotator alignment.

    Authors: We acknowledge that the current abstract is too concise and omits key methodological details needed to contextualize the 79.7% precision. The full manuscript details the creation of the ECB-A benchmark (2,460 expert annotation groups), the distinction between emergent and standard KPIs, and the annotation process for handling conversational transcripts and financial terminology. We will revise the abstract to include: a brief operational definition of emergent KPIs, a summary of the annotation guidelines, a description of how annotators managed ambiguity and domain terms, and the inter-annotator agreement statistic (or a note on annotation reliability if not previously computed). This change will make the headline result more robust and interpretable while preserving all original claims and results. revision: yes

Circularity Check

0 steps flagged

No circularity: new benchmarks and human-verified extraction form an independent chain.

full rationale

The paper introduces three new benchmarks (SECB, ECB, ECB-A) built from expert annotations and evaluates both SEC-trained models and an LLM open-ended extraction system against them. The 79.7% precision figure is obtained via direct human evaluation on the novel ECB-A annotations rather than by fitting any parameter to a subset of the target data and then claiming a prediction of a related quantity. No self-citations are invoked to justify uniqueness or to smuggle in an ansatz; the derivation from data creation through model assessment is self-contained and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard NLP evaluation assumptions and human annotation quality rather than new free parameters or invented entities.

axioms (1)
  • domain assumption: Expert annotations of emergent KPIs in earnings calls constitute reliable ground truth for measuring extraction performance.
    The 79.7% precision figure and the benchmark construction both depend on this assumption, which is stated without reported validation metrics such as agreement rates.

pith-pipeline@v0.9.0 · 5509 in / 1233 out tokens · 81163 ms · 2026-05-08T17:57:50.733332+00:00 · methodology

discussion (0)

