pith. machine review for the scientific record

arxiv: 2604.13260 · v1 · submitted 2026-04-14 · 💱 q-fin.TR

Recognition: unknown

Which Voices Move Markets? Speaker Identity and the Cross-Section of Post-Earnings Returns

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 13:26 UTC · model grok-4.3

classification 💱 q-fin.TR
keywords earnings calls · sentiment analysis · post-earnings returns · speaker identity · textual analysis · abnormal returns · market efficiency · information processing

The pith

Not all voices in earnings calls move markets equally: weighting sentiment by speaker identity substantially improves predictions of post-earnings stock returns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that speaker identity in quarterly earnings conference calls matters for how markets react to the spoken information. Analysts' statements carry the largest influence on subsequent stock performance, followed by CFO remarks, with executive and other comments mattering less. The authors build a combined sentiment score by applying data-derived weights to language-model outputs from different sections of the transcripts. This weighted measure forecasts future returns more accurately than treating all speech the same and delivers trading profits that common risk factors and earnings surprises cannot explain. The results point to gradual market incorporation of the soft information contained in these calls.

Core claim

Post-earnings stock returns are not equally affected by all speakers in a conference call. Applying FinBERT, a domain-specific transformer model, to 6.5 million sentences from 16,428 S&P 500 transcripts (2015-2025), the authors derive speaker weights of 49% for analysts, 30% for CFOs, 16% for other executives, and 5% for others. The resulting section-weighted sentiment achieves an out-of-sample Spearman IC of 0.142, generates monthly long-short alpha of 2.03% unexplained by the Fama-French five-factor model, and remains significant after controlling for standardized unexpected earnings. The weighted approach fully subsumes traditional dictionary-based sentiment in joint tests, while cumulative return patterns indicate gradual price adjustment consistent with slow assimilation of the calls' soft information.
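
The headline IC can be made concrete. Below is a minimal sketch of a monthly cross-sectional Spearman information coefficient, assuming per-stock sentiment scores paired with subsequent returns and grouped by month; the function names and the tie-free rank computation are illustrative simplifications, not the paper's implementation.

```python
import numpy as np

def spearman_ic(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Ties are ignored for brevity (argsort-of-argsort ranking)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

def mean_monthly_ic(sentiment, returns, month_ids):
    """Average the cross-sectional IC, computed one month at a time."""
    return float(np.mean([spearman_ic(sentiment[month_ids == m],
                                      returns[month_ids == m])
                          for m in np.unique(month_ids)]))
```

An IC of 0.142 would mean the monthly rank correlation between the signal and next-period returns averages 0.142 across the out-of-sample months.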

What carries the argument

Section-weighted sentiment formed by multiplying speaker-specific weights, derived empirically from return data, by sentiment scores extracted from distinct sections of earnings call transcripts.
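
Mechanically, the measure is a convex combination of section-level scores. A sketch using the weights reported in the abstract (the dictionary keys and function name are illustrative; only the four percentages come from the paper):

```python
# Speaker weights reported in the paper's abstract.
WEIGHTS = {"analyst": 0.49, "cfo": 0.30, "executive": 0.16, "other": 0.05}

def section_weighted_sentiment(section_scores):
    """Combine per-speaker-section sentiment into one call-level score.

    `section_scores` maps speaker category -> average sentiment for that
    section of the transcript; missing sections contribute zero.
    """
    return sum(w * section_scores.get(cat, 0.0) for cat, w in WEIGHTS.items())
```

Because the weights sum to one, a call whose every section scores +1 maps to +1, while a call where only analysts are positive is dominated by the 0.49 analyst weight.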

If this is right

  • Uniform aggregation of sentiment across an entire call understates the predictive content because speakers differ in influence.
  • The weighted sentiment adds explanatory power beyond standardized unexpected earnings alone.
  • Transformer-based sentiment fully accounts for the information in older dictionary methods when both are tested together.
  • Prices incorporate the weighted information gradually, producing the observed pattern of cumulative abnormal returns.
  • The temporal out-of-sample split yields stronger rather than weaker results, consistent with stable relationships.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-time monitoring tools for investors could prioritize analyst comments during calls to anticipate return patterns.
  • Similar speaker-weighting methods applied to other corporate events might identify additional channels of information flow.
  • The dominance of analyst speech suggests markets treat external perspectives as more credible signals than internal ones in some contexts.
  • Industry or size differences in the optimal weights could be tested to refine application across firm types.

Load-bearing premise

The empirically derived speaker weights remain stable and continue to predict returns in new time periods and market conditions.

What would settle it

Observing that a long-short portfolio sorted on the section-weighted sentiment produces no statistically significant alpha after risk adjustment when tested on data after 2025.
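
That settling test can be operationalized as a portfolio sort plus a factor regression. A minimal sketch, assuming decile (10%) cutoffs and plain OLS; the paper's exact portfolio construction and its Newey-West adjustment are not reproduced here, so these names and parameters are illustrative.

```python
import numpy as np

def long_short_spread(sentiment, returns, frac=0.1):
    """Equal-weight return of a portfolio long the top `frac` of stocks
    by sentiment and short the bottom `frac`, for one period."""
    order = np.argsort(sentiment)
    n = max(1, int(len(sentiment) * frac))
    return float(returns[order[-n:]].mean() - returns[order[:n]].mean())

def ols_alpha(spread, factors):
    """Intercept of an OLS regression of the monthly spread on factor
    returns (e.g. the Fama-French five factors)."""
    X = np.column_stack([np.ones(len(spread)), factors])
    beta, *_ = np.linalg.lstsq(X, spread, rcond=None)
    return float(beta[0])
```

A post-2025 test would compute the monthly spread series from new data and check whether the fitted intercept remains statistically distinguishable from zero.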

Figures

Figures reproduced from arXiv: 2604.13260 by Junyi Fan, Karmanpartap Singh Sidhu, Maryam Pishgar.

Figure 1. Monthly Information Coefficients Over Time (blue bars: in-sample).
Figure 2. Return Surface: Section-Weighted Sentiment.
Original abstract

We utilize FinBERT, a domain-specific transformer model, to parse 6.5 million sentences from 16,428 S&P 500 quarterly earnings call transcripts (2015-2025) and demonstrate that post-earnings stock returns are not equally affected by all speakers in a conference call. Our section-weighted sentiment, with empirically derived speaker weights (Analyst 49%, CFO 30%, Executive 16%, Other 5%), achieves an out-of-sample Spearman IC of 0.142 versus 0.115 in-sample, generates monthly long-short alpha of 2.03% unexplained by the Fama-French five-factor model (t = 6.49), and remains significant after controlling for standardized unexpected earnings (SUE). FinBERT section-weighted sentiment entirely subsumes the Loughran-McDonald dictionary approach (FinBERT t = 5.90; LM t = 0.86 in the combined specification). Signal decay analysis and cumulative abnormal return charts confirm gradual price adjustment consistent with sluggish assimilation of soft information. All results undergo rigorous out-of-sample validation with an explicit temporal split, yielding improved rather than deteriorated predictive power.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript analyzes 6.5 million sentences from 16,428 S&P 500 earnings call transcripts (2015-2025) using FinBERT to extract sentiment. It argues that speaker identity matters for post-earnings returns and proposes a section-weighted sentiment measure with empirically derived weights (Analyst 49%, CFO 30%, Executive 16%, Other 5%). This measure achieves an out-of-sample Spearman IC of 0.142 (vs. 0.115 in-sample), produces a monthly long-short alpha of 2.03% (t=6.49) unexplained by Fama-French five factors, remains significant after controlling for SUE, and subsumes the Loughran-McDonald dictionary approach. The results are supported by signal decay analysis and cumulative abnormal return charts, with all claims validated via temporal out-of-sample split.

Significance. If the speaker weights were derived without look-ahead bias, the paper would make a meaningful contribution by showing that soft information in earnings calls is not uniformly informative across speakers and that weighting by speaker identity enhances predictive power for returns. The large-scale application of a domain-specific LLM, combined with rigorous out-of-sample testing, SUE controls, and evidence of alpha generation, positions this as a useful refinement to sentiment-based trading signals in quantitative finance. The finding that FinBERT subsumes LM is particularly notable for advancing beyond dictionary methods.

major comments (1)
  1. [Abstract] The abstract states that the speaker weights are 'empirically derived' but does not indicate whether this derivation was performed exclusively on the in-sample portion of the temporal split. Since the weights directly determine the section-weighted sentiment used for the reported out-of-sample IC of 0.142 and the alpha of 2.03% (t = 6.49), any use of post-split data in weight optimization would invalidate the out-of-sample claims due to look-ahead bias. The manuscript must specify the optimization procedure (e.g., return regression or IC maximization) and confirm the in-sample restriction.
minor comments (2)
  1. [Abstract] The observation that out-of-sample IC (0.142) exceeds in-sample IC (0.115) is counter to typical expectations for an empirically fitted model and would benefit from explicit discussion or robustness checks in the results section to rule out data peculiarities or overfitting artifacts.
  2. [Data and Methods] The manuscript should provide the exact temporal split dates (e.g., training end year) and the precise method for assigning sentences to speaker categories to support reproducibility of the 6.5 million sentence corpus.
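
The reproducibility point on speaker assignment is essentially a mapping question. A sketch of how raw transcript speaker titles might be binned into the paper's four categories; the regex patterns are assumptions, since the manuscript does not disclose its assignment rules.

```python
import re

# Ordered patterns: the CFO rule must fire before the generic executive bucket.
CATEGORY_PATTERNS = [
    ("cfo", re.compile(r"\bchief financial officer\b|\bcfo\b", re.I)),
    ("analyst", re.compile(r"\banalyst\b", re.I)),
    ("executive", re.compile(r"\bceo\b|\bchief\b|\bpresident\b|\bofficer\b", re.I)),
]

def speaker_category(title):
    """Map a transcript speaker title to analyst/cfo/executive/other."""
    for cat, pattern in CATEGORY_PATTERNS:
        if pattern.search(title):
            return cat
    return "other"
```

Publishing a table of this kind, together with the exact split date, would make the 6.5-million-sentence corpus reconstructible.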

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and constructive feedback. We address the concern about potential look-ahead bias in the derivation of speaker weights below and commit to revising the manuscript for greater clarity.

Point-by-point responses
  1. Referee: [Abstract] The abstract states that the speaker weights are 'empirically derived' but does not indicate whether this derivation was performed exclusively on the in-sample portion of the temporal split. Since the weights directly determine the section-weighted sentiment used for the reported out-of-sample IC of 0.142 and the alpha of 2.03% (t = 6.49), any use of post-split data in weight optimization would invalidate the out-of-sample claims due to look-ahead bias. The manuscript must specify the optimization procedure (e.g., return regression or IC maximization) and confirm the in-sample restriction.

    Authors: We agree that the abstract is currently ambiguous on this point and that explicit confirmation is necessary to support the out-of-sample claims. We will revise the abstract to state that the speaker weights are 'empirically derived from in-sample data' and expand the methodology section to describe the optimization procedure (IC maximization on in-sample observations only) while confirming that no post-split data entered the weight derivation. This revision will be incorporated in the next version of the manuscript. revision: yes

Circularity Check

1 step flagged

Empirically derived speaker weights risk look-ahead bias in OOS IC and alpha claims

specific steps
  1. fitted input called prediction [Abstract]
    "Our section-weighted sentiment, with empirically derived speaker weights (Analyst 49%, CFO 30%, Executive 16%, Other 5%), achieves an out-of-sample Spearman IC of 0.142 versus 0.115 in-sample, generates monthly long-short alpha of 2.03% unexplained by the Fama-French five-factor model (t = 6.49), and remains significant after controlling for standardized unexpected earnings (SUE)."

    Speaker weights are obtained by empirical fitting on the transcript/return data. The paper then reports OOS performance metrics and alpha as validation of the weighted signal. Without an explicit restriction that weight derivation used only the pre-split in-sample period, the OOS IC, alpha, and LM-subsumption statistics are not independent predictions but are statistically influenced by the fitting step itself.

full rationale

The paper's central result is the section-weighted sentiment signal using fixed speaker weights (Analyst 49%, CFO 30%, Executive 16%, Other 5%). These weights are described as 'empirically derived' and then applied to produce the headline OOS Spearman IC of 0.142, monthly alpha of 2.03% (t=6.49), and subsumption of LM. The abstract and claims invoke an explicit temporal split for 'rigorous out-of-sample validation,' but provide no statement that weight optimization (by regression, IC maximization, or similar) was confined to the in-sample window. This matches the fitted-input-called-prediction pattern: the load-bearing parameter is tuned on the data, after which the tuned output is presented as independent predictive power. No other circular steps (self-definitional, self-citation load-bearing, ansatz smuggling, or renaming) appear in the provided text. The FinBERT vs. LM comparison and factor-adjusted alpha are downstream of the weights and inherit the same dependence.
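
The circularity concern reduces to one verifiable constraint: the weight fit must see only pre-split rows. A sketch of an in-sample-only least-squares fit; the function name, the non-negativity clip, and the normalization are illustrative, and the paper's actual optimizer may be IC maximization rather than regression.

```python
import numpy as np

def fit_speaker_weights(section_scores, returns, split):
    """Fit speaker weights by least squares on rows strictly before `split`.

    `section_scores` is (n_calls, 4): per-call sentiment for analyst, CFO,
    executive, other. Rows at or after `split` never enter the fit, which
    is exactly the restriction that rules out look-ahead bias.
    """
    X, y = section_scores[:split], returns[:split]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    w = np.clip(w, 0.0, None)   # keep weights non-negative
    return w / w.sum()          # normalize to sum to one
```

Evaluating the IC only on rows `[split:]` with these frozen weights would then yield a genuinely out-of-sample number, which is the confirmation the referee asks for.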

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The claim depends on the accuracy of the NLP model and the validity of the fitted weights for generalization.

free parameters (1)
  • Speaker weights = Analyst 49%, CFO 30%, Executive 16%, Other 5%
    These percentages are empirically derived to optimize the sentiment measure.
axioms (2)
  • domain assumption FinBERT accurately captures financial sentiment in earnings call transcripts
    Relies on the pre-trained model's performance in this domain.
  • domain assumption Temporal data split prevents information leakage
    Assumes the split is strictly chronological without future data influencing past.

pith-pipeline@v0.9.0 · 5510 in / 1599 out tokens · 70248 ms · 2026-05-10T13:26:03.571577+00:00 · methodology


Reference graph

Works this paper leans on

16 extracted references

  1. [1]

    FinBERT: A Large Language Model for Extracting Information from Financial Text

    Huang, A. H., Wang, H., and Yang, Y. (2023). “FinBERT: A Large Language Model for Extracting Information from Financial Text.” Contemporary Accounting Research, 40(2), 806–841.

  2. [2]

    When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks

    Loughran, T. and McDonald, B. (2011). “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance, 66(1), 35–65.

  3. [3]

    Giving Content to Investor Sentiment: The Role of Media in the Stock Market

    Tetlock, P. C. (2007). “Giving Content to Investor Sentiment: The Role of Media in the Stock Market.” The Journal of Finance, 62(3), 1139–1168.

  4. [4]

    More Than Words: Quantifying Language to Measure Firms’ Fundamentals

    Tetlock, P. C., Saar-Tsechansky, M., and Macskassy, S. (2008). “More Than Words: Quantifying Language to Measure Firms’ Fundamentals.” The Journal of Finance, 63(3), 1437–1467.

  5. [5]

    Word Power: A New Approach for Content Analysis

    Jegadeesh, N. and Wu, D. (2013). “Word Power: A New Approach for Content Analysis.” Journal of Financial Economics, 110(3), 712–729.

  6. [6]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” Proceedings of NAACL-HLT, 4171–4186.

  7. [7]

    Do Conference Calls Affect Analysts’ Forecasts?

    Bowen, R. M., Davis, A. K., and Matsumoto, D. A. (2002). “Do Conference Calls Affect Analysts’ Forecasts?” The Accounting Review, 77(2), 285–316.

  8. [8]

    What Makes Conference Calls Useful? The Information Content of Managers’ Presentations and Analysts’ Discussion Sessions

    Matsumoto, D., Pronk, M., and Roelofsen, E. (2011). “What Makes Conference Calls Useful? The Information Content of Managers’ Presentations and Analysts’ Discussion Sessions.” The Accounting Review, 86(4), 1383–1414.

  9. [9]

    An Empirical Evaluation of Accounting Income Numbers

    Ball, R. and Brown, P. (1968). “An Empirical Evaluation of Accounting Income Numbers.” Journal of Accounting Research, 6(2), 159–178.

  10. [10]

    Post-Earnings-Announcement Drift: Delayed Price Response or Risk Premium?

    Bernard, V. L. and Thomas, J. K. (1989). “Post-Earnings-Announcement Drift: Delayed Price Response or Risk Premium?” Journal of Accounting Research, 27, 1–36.

  11. [11]

    Manager Sentiment and Stock Returns

    Jiang, F., Lee, J., Martin, X., and Zhou, G. (2019). “Manager Sentiment and Stock Returns.” Journal of Financial Economics, 132(1), 126–149.

  12. [12]

    A Five-Factor Asset Pricing Model

    Fama, E. F. and French, K. R. (2015). “A Five-Factor Asset Pricing Model.” Journal of Financial Economics, 116(1), 1–22.

  13. [13]

    A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix

    Newey, W. K. and West, K. D. (1987). “A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix.” Econometrica, 55(3), 703–708.

  14. [14]

    Risk, Return, and Equilibrium: Empirical Tests

    Fama, E. F. and MacBeth, J. D. (1973). “Risk, Return, and Equilibrium: Empirical Tests.” Journal of Political Economy, 81(3), 607–636.

  15. [15]

    XGBoost: A Scalable Tree Boosting System

    Chen, T. and Guestrin, C. (2016). “XGBoost: A Scalable Tree Boosting System.” Proceedings of the 22nd ACM SIGKDD, 785–794.

  16. [16]

    A Unified Approach to Interpreting Model Predictions

    Lundberg, S. M. and Lee, S.-I. (2017). “A Unified Approach to Interpreting Model Predictions.” Advances in Neural Information Processing Systems, 30, 4766–4777.