pith. machine review for the scientific record

arxiv: 2604.13260 · v1 · submitted 2026-04-14 · 💱 q-fin.TR

Recognition: unknown

Which Voices Move Markets? Speaker Identity and the Cross-Section of Post-Earnings Returns

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 13:26 UTC · model grok-4.3

classification 💱 q-fin.TR
keywords earnings calls · sentiment analysis · post-earnings returns · speaker identity · textual analysis · abnormal returns · market efficiency · information processing

The pith

Not all voices in earnings calls move markets equally: weighting sentiment by speaker identity substantially improves predictions of post-earnings stock returns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that speaker identity in quarterly earnings conference calls matters for how markets react to the spoken information. Analysts' statements carry the largest influence on subsequent stock performance, followed by CFO remarks, with executive and other comments mattering less. The authors build a combined sentiment score by applying data-derived weights to language-model outputs from different sections of the transcripts. This weighted measure forecasts future returns more accurately than treating all speech the same and delivers trading profits that common risk factors and earnings surprises cannot explain. The results point to gradual market incorporation of the soft information contained in these calls.

Core claim

Post-earnings stock returns are not equally affected by all speakers in a conference call. Applying FinBERT, a domain-specific transformer model, to 6.5 million sentences from 16,428 S&P 500 transcripts (2015-2025), the authors derive speaker weights of 49% for analysts, 30% for CFOs, 16% for other executives, and 5% for others. The resulting section-weighted sentiment achieves an out-of-sample Spearman IC of 0.142, generates monthly long-short alpha of 2.03% unexplained by the Fama-French five-factor model, and remains significant after controlling for standardized unexpected earnings. The weighted approach fully subsumes traditional dictionary-based sentiment in joint tests, while cumulative return patterns indicate gradual price adjustment consistent with slow assimilation of the calls' soft information.
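
The headline IC can be made concrete. Below is a minimal sketch of a monthly cross-sectional Spearman information coefficient, assuming per-stock sentiment scores paired with subsequent returns and grouped by month; the function names and the tie-free rank computation are illustrative simplifications, not the paper's implementation.

```python
import numpy as np

def spearman_ic(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Ties are ignored for brevity (argsort-of-argsort ranking)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

def mean_monthly_ic(sentiment, returns, month_ids):
    """Average the cross-sectional IC, computed one month at a time."""
    return float(np.mean([spearman_ic(sentiment[month_ids == m],
                                      returns[month_ids == m])
                          for m in np.unique(month_ids)]))
```

An IC of 0.142 would mean the monthly rank correlation between the signal and next-period returns averages 0.142 across the out-of-sample months.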

What carries the argument

Section-weighted sentiment formed by multiplying speaker-specific weights, derived empirically from return data, by sentiment scores extracted from distinct sections of earnings call transcripts.
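
Mechanically, the measure is a convex combination of section-level scores. A sketch using the weights reported in the abstract (the dictionary keys and function name are illustrative; only the four percentages come from the paper):

```python
# Speaker weights reported in the paper's abstract.
WEIGHTS = {"analyst": 0.49, "cfo": 0.30, "executive": 0.16, "other": 0.05}

def section_weighted_sentiment(section_scores):
    """Combine per-speaker-section sentiment into one call-level score.

    `section_scores` maps speaker category -> average sentiment for that
    section of the transcript; missing sections contribute zero.
    """
    return sum(w * section_scores.get(cat, 0.0) for cat, w in WEIGHTS.items())
```

Because the weights sum to one, a call whose every section scores +1 maps to +1, while a call where only analysts are positive is dominated by the 0.49 analyst weight.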

If this is right

  • Uniform aggregation of sentiment across an entire call understates the predictive content because speakers differ in influence.
  • The weighted sentiment adds explanatory power beyond standardized unexpected earnings alone.
  • Transformer-based sentiment fully accounts for the information in older dictionary methods when both are tested together.
  • Prices incorporate the weighted information gradually, producing the observed pattern of cumulative abnormal returns.
  • The temporal out-of-sample split yields stronger rather than weaker results, consistent with stable relationships.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-time monitoring tools for investors could prioritize analyst comments during calls to anticipate return patterns.
  • Similar speaker-weighting methods applied to other corporate events might identify additional channels of information flow.
  • The dominance of analyst speech suggests markets treat external perspectives as more credible signals than internal ones in some contexts.
  • Industry or size differences in the optimal weights could be tested to refine application across firm types.

Load-bearing premise

The empirically derived speaker weights remain stable and continue to predict returns in new time periods and market conditions.

What would settle it

Observing that a long-short portfolio sorted on the section-weighted sentiment produces no statistically significant alpha after risk adjustment when tested on data after 2025.
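
That settling test can be operationalized as a portfolio sort plus a factor regression. A minimal sketch, assuming decile (10%) cutoffs and plain OLS; the paper's exact portfolio construction and its Newey-West adjustment are not reproduced here, so these names and parameters are illustrative.

```python
import numpy as np

def long_short_spread(sentiment, returns, frac=0.1):
    """Equal-weight return of a portfolio long the top `frac` of stocks
    by sentiment and short the bottom `frac`, for one period."""
    order = np.argsort(sentiment)
    n = max(1, int(len(sentiment) * frac))
    return float(returns[order[-n:]].mean() - returns[order[:n]].mean())

def ols_alpha(spread, factors):
    """Intercept of an OLS regression of the monthly spread on factor
    returns (e.g. the Fama-French five factors)."""
    X = np.column_stack([np.ones(len(spread)), factors])
    beta, *_ = np.linalg.lstsq(X, spread, rcond=None)
    return float(beta[0])
```

A post-2025 test would compute the monthly spread series from new data and check whether the fitted intercept remains statistically distinguishable from zero.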

Figures

Figures reproduced from arXiv: 2604.13260 by Junyi Fan, Karmanpartap Singh Sidhu, Maryam Pishgar.

Figure 1. Monthly Information Coefficients Over Time (blue bars: in-sample).
Figure 2. Return Surface: Section-Weighted Sentiment.
Original abstract

We utilize FinBERT, a domain-specific transformer model, to parse 6.5 million sentences from 16,428 S&P 500 quarterly earnings call transcripts (2015-2025) and demonstrate that post-earnings stock returns are not equally affected by all speakers in a conference call. Our section-weighted sentiment, with empirically derived speaker weights (Analyst 49%, CFO 30%, Executive 16%, Other 5%), achieves an out-of-sample Spearman IC of 0.142 versus 0.115 in-sample, generates monthly long-short alpha of 2.03% unexplained by the Fama-French five-factor model (t = 6.49), and remains significant after controlling for standardized unexpected earnings (SUE). FinBERT section-weighted sentiment entirely subsumes the Loughran-McDonald dictionary approach (FinBERT t = 5.90; LM t = 0.86 in the combined specification). Signal decay analysis and cumulative abnormal return charts confirm gradual price adjustment consistent with sluggish assimilation of soft information. All results undergo rigorous out-of-sample validation with an explicit temporal split, yielding improved rather than deteriorated predictive power.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript analyzes 6.5 million sentences from 16,428 S&P 500 earnings call transcripts (2015-2025) using FinBERT to extract sentiment. It argues that speaker identity matters for post-earnings returns and proposes a section-weighted sentiment measure with empirically derived weights (Analyst 49%, CFO 30%, Executive 16%, Other 5%). This measure achieves an out-of-sample Spearman IC of 0.142 (vs. 0.115 in-sample), produces a monthly long-short alpha of 2.03% (t=6.49) unexplained by Fama-French five factors, remains significant after controlling for SUE, and subsumes the Loughran-McDonald dictionary approach. The results are supported by signal decay analysis and cumulative abnormal return charts, with all claims validated via temporal out-of-sample split.

Significance. If the speaker weights were derived without look-ahead bias, the paper would make a meaningful contribution by showing that soft information in earnings calls is not uniformly informative across speakers and that weighting by speaker identity enhances predictive power for returns. The large-scale application of a domain-specific LLM, combined with rigorous out-of-sample testing, SUE controls, and evidence of alpha generation, positions this as a useful refinement to sentiment-based trading signals in quantitative finance. The finding that FinBERT subsumes LM is particularly notable for advancing beyond dictionary methods.

major comments (1)
  1. [Abstract] The abstract states that the speaker weights are 'empirically derived' but does not indicate whether this derivation was performed exclusively on the in-sample portion of the temporal split. Since the weights directly determine the section-weighted sentiment used for the reported out-of-sample IC of 0.142 and the alpha of 2.03% (t = 6.49), any use of post-split data in weight optimization would invalidate the out-of-sample claims due to look-ahead bias. The manuscript must specify the optimization procedure (e.g., return regression or IC maximization) and confirm the in-sample restriction.
minor comments (2)
  1. [Abstract] The observation that out-of-sample IC (0.142) exceeds in-sample IC (0.115) is counter to typical expectations for an empirically fitted model and would benefit from explicit discussion or robustness checks in the results section to rule out data peculiarities or overfitting artifacts.
  2. [Data and Methods] The manuscript should provide the exact temporal split dates (e.g., training end year) and the precise method for assigning sentences to speaker categories to support reproducibility of the 6.5 million sentence corpus.
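
The reproducibility point on speaker assignment is essentially a mapping question. A sketch of how raw transcript speaker titles might be binned into the paper's four categories; the regex patterns are assumptions, since the manuscript does not disclose its assignment rules.

```python
import re

# Ordered patterns: the CFO rule must fire before the generic executive bucket.
CATEGORY_PATTERNS = [
    ("cfo", re.compile(r"\bchief financial officer\b|\bcfo\b", re.I)),
    ("analyst", re.compile(r"\banalyst\b", re.I)),
    ("executive", re.compile(r"\bceo\b|\bchief\b|\bpresident\b|\bofficer\b", re.I)),
]

def speaker_category(title):
    """Map a transcript speaker title to analyst/cfo/executive/other."""
    for cat, pattern in CATEGORY_PATTERNS:
        if pattern.search(title):
            return cat
    return "other"
```

Publishing a table of this kind, together with the exact split date, would make the 6.5-million-sentence corpus reconstructible.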

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and constructive feedback. We address the concern about potential look-ahead bias in the derivation of speaker weights below and commit to revising the manuscript for greater clarity.

Point-by-point responses
  1. Referee: [Abstract] The abstract states that the speaker weights are 'empirically derived' but does not indicate whether this derivation was performed exclusively on the in-sample portion of the temporal split. Since the weights directly determine the section-weighted sentiment used for the reported out-of-sample IC of 0.142 and the alpha of 2.03% (t = 6.49), any use of post-split data in weight optimization would invalidate the out-of-sample claims due to look-ahead bias. The manuscript must specify the optimization procedure (e.g., return regression or IC maximization) and confirm the in-sample restriction.

    Authors: We agree that the abstract is currently ambiguous on this point and that explicit confirmation is necessary to support the out-of-sample claims. We will revise the abstract to state that the speaker weights are 'empirically derived from in-sample data' and expand the methodology section to describe the optimization procedure (IC maximization on in-sample observations only) while confirming that no post-split data entered the weight derivation. This revision will be incorporated in the next version of the manuscript. revision: yes

Circularity Check

1 step flagged

Empirically derived speaker weights risk look-ahead bias in OOS IC and alpha claims

specific steps
  1. fitted input called prediction [Abstract]
    "Our section-weighted sentiment, with empirically derived speaker weights (Analyst 49%, CFO 30%, Executive 16%, Other 5%), achieves an out-of-sample Spearman IC of 0.142 versus 0.115 in-sample, generates monthly long-short alpha of 2.03% unexplained by the Fama-French five-factor model (t = 6.49), and remains significant after controlling for standardized unexpected earnings (SUE)."

    Speaker weights are obtained by empirical fitting on the transcript/return data. The paper then reports OOS performance metrics and alpha as validation of the weighted signal. Without an explicit restriction that weight derivation used only the pre-split in-sample period, the OOS IC, alpha, and LM-subsumption statistics are not independent predictions but are statistically influenced by the fitting step itself.

full rationale

The paper's central result is the section-weighted sentiment signal using fixed speaker weights (Analyst 49%, CFO 30%, Executive 16%, Other 5%). These weights are described as 'empirically derived' and then applied to produce the headline OOS Spearman IC of 0.142, monthly alpha of 2.03% (t=6.49), and subsumption of LM. The abstract and claims invoke an explicit temporal split for 'rigorous out-of-sample validation,' but provide no statement that weight optimization (by regression, IC maximization, or similar) was confined to the in-sample window. This matches the fitted-input-called-prediction pattern: the load-bearing parameter is tuned on the data, after which the tuned output is presented as independent predictive power. No other circular steps (self-definitional, self-citation load-bearing, ansatz smuggling, or renaming) appear in the provided text. The FinBERT vs. LM comparison and factor-adjusted alpha are downstream of the weights and inherit the same dependence.
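
The circularity concern reduces to one verifiable constraint: the weight fit must see only pre-split rows. A sketch of an in-sample-only least-squares fit; the function name, the non-negativity clip, and the normalization are illustrative, and the paper's actual optimizer may be IC maximization rather than regression.

```python
import numpy as np

def fit_speaker_weights(section_scores, returns, split):
    """Fit speaker weights by least squares on rows strictly before `split`.

    `section_scores` is (n_calls, 4): per-call sentiment for analyst, CFO,
    executive, other. Rows at or after `split` never enter the fit, which
    is exactly the restriction that rules out look-ahead bias.
    """
    X, y = section_scores[:split], returns[:split]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    w = np.clip(w, 0.0, None)   # keep weights non-negative
    return w / w.sum()          # normalize to sum to one
```

Evaluating the IC only on rows `[split:]` with these frozen weights would then yield a genuinely out-of-sample number, which is the confirmation the referee asks for.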

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The claim depends on the accuracy of the NLP model and the validity of the fitted weights for generalization.

free parameters (1)
  • Speaker weights = Analyst 49%, CFO 30%, Executive 16%, Other 5%
    These percentages are empirically derived to optimize the sentiment measure.
axioms (2)
  • domain assumption FinBERT accurately captures financial sentiment in earnings call transcripts
    Relies on the pre-trained model's performance in this domain.
  • domain assumption Temporal data split prevents information leakage
    Assumes the split is strictly chronological without future data influencing past.

pith-pipeline@v0.9.0 · 5510 in / 1599 out tokens · 70248 ms · 2026-05-10T13:26:03.571577+00:00 · methodology


Reference graph

Works this paper leans on

16 extracted references

  1. [1]

    FinBERT: A Large Language Model for Extracting Information from Financial Text

    Huang, A. H., Wang, H., and Yang, Y. (2023). “FinBERT: A Large Language Model for Extracting Information from Financial Text.” Contemporary Accounting Research, 40(2), 806–841.

  2. [2]

    When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks

    Loughran, T. and McDonald, B. (2011). “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance, 66(1), 35–65.

  3. [3]

    Giving Content to Investor Sentiment: The Role of Media in the Stock Market

    Tetlock, P. C. (2007). “Giving Content to Investor Sentiment: The Role of Media in the Stock Market.” The Journal of Finance, 62(3), 1139–1168.

  4. [4]

    More Than Words: Quantifying Language to Measure Firms’ Fundamentals

    Tetlock, P. C., Saar-Tsechansky, M., and Macskassy, S. (2008). “More Than Words: Quantifying Language to Measure Firms’ Fundamentals.” The Journal of Finance, 63(3), 1437–1467.

  5. [5]

    Word Power: A New Approach for Content Analysis

    Jegadeesh, N. and Wu, D. (2013). “Word Power: A New Approach for Content Analysis.” Journal of Financial Economics, 110(3), 712–729.

  6. [6]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” Proceedings of NAACL-HLT, 4171–4186.

  7. [7]

    Do Conference Calls Affect Analysts’ Forecasts?

    Bowen, R. M., Davis, A. K., and Matsumoto, D. A. (2002). “Do Conference Calls Affect Analysts’ Forecasts?” The Accounting Review, 77(2), 285–316.

  8. [8]

    What Makes Conference Calls Useful? The Information Content of Managers’ Presentations and Analysts’ Discussion Sessions

    Matsumoto, D., Pronk, M., and Roelofsen, E. (2011). “What Makes Conference Calls Useful? The Information Content of Managers’ Presentations and Analysts’ Discussion Sessions.” The Accounting Review, 86(4), 1383–1414.

  9. [9]

    An Empirical Evaluation of Accounting Income Numbers

    Ball, R. and Brown, P. (1968). “An Empirical Evaluation of Accounting Income Numbers.” Journal of Accounting Research, 6(2), 159–178.

  10. [10]

    Post-Earnings-Announcement Drift: Delayed Price Response or Risk Premium?

    Bernard, V. L. and Thomas, J. K. (1989). “Post-Earnings-Announcement Drift: Delayed Price Response or Risk Premium?” Journal of Accounting Research, 27, 1–36.

  11. [11]

    Manager Sentiment and Stock Returns

    Jiang, F., Lee, J., Martin, X., and Zhou, G. (2019). “Manager Sentiment and Stock Returns.” Journal of Financial Economics, 132(1), 126–149.

  12. [12]

    A Five-Factor Asset Pricing Model

    Fama, E. F. and French, K. R. (2015). “A Five-Factor Asset Pricing Model.” Journal of Financial Economics, 116(1), 1–22.

  13. [13]

    A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix

    Newey, W. K. and West, K. D. (1987). “A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix.” Econometrica, 55(3), 703–708.

  14. [14]

    Risk, Return, and Equilibrium: Empirical Tests

    Fama, E. F. and MacBeth, J. D. (1973). “Risk, Return, and Equilibrium: Empirical Tests.” Journal of Political Economy, 81(3), 607–636.

  15. [15]

    XGBoost: A Scalable Tree Boosting System

    Chen, T. and Guestrin, C. (2016). “XGBoost: A Scalable Tree Boosting System.” Proceedings of the 22nd ACM SIGKDD, 785–794.

  16. [16]

    A Unified Approach to Interpreting Model Predictions

    Lundberg, S. M. and Lee, S.-I. (2017). “A Unified Approach to Interpreting Model Predictions.” Advances in Neural Information Processing Systems, 30, 4766–4777.