LLM Predictive Scoring and Validation: Inferring Experience Ratings from Unstructured Text

Andrew Hong; Ito Zapata; Jason Potteiger

arxiv: 2604.14321 · v1 · submitted 2026-04-15 · 💻 cs.CL

Pith reviewed 2026-05-10 13:51 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM scoring · experience ratings · unstructured text · fan feedback · predictive validation · baseball · survey data · GPT-4

The pith

GPT-4.1 can predict baseball fans' 0-10 experience ratings from their written comments alone, with predictions falling within one point of the actual rating about two-thirds of the time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper demonstrates that an unoptimized prompt to GPT-4.1 can extract directional predictions of overall experience ratings from unstructured fan text about MLB games. On a dataset of about 10,000 responses, the model achieves 67% accuracy within one point and shows high consistency across runs. The predictions correlate strongly with the overall rating but sit about one point lower, which the authors attribute to the text reflecting salient memorable moments while the survey rating integrates the entire experience into a verdict. Preserving this gap allows each measure to capture unique information. Readers might care because it opens a way to infer ratings from existing text data without additional surveys.

Core claim

What carries the argument

The systematic gap between LLM-predicted ratings from text (salient moments) and self-reported overall ratings (integrated verdict), treated as complementary constructs rather than error.

If this is right

Text alone can yield reliable directional estimates of experience ratings.
The one-point offset preserves distinct information from memorable events versus holistic judgments.
Simple prompts produce consistent results across repeated applications.
Such scoring applies to any domain with open-ended responses and rating scales.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Companies could mine customer reviews or social media for predicted ratings to supplement or replace some surveys.
Identifying the salient moments in text might help pinpoint what drives positive or negative experiences specifically.
The distinction suggests keeping both text-derived and direct rating measures in parallel for richer insights.

Load-bearing premise

The systematic one-point gap reflects a real difference between what the text emphasizes (salient moments) and what the rating captures (overall verdict), not an artifact of the prompt or data collection.

What would settle it

If changing the prompt or model version removes the one-point systematic difference while maintaining the correlation, that would show the gap is not a stable construct difference.

Original abstract

We tasked GPT-4.1 with reading what baseball fans wrote about their game-day experience and predicting the overall experience rating each fan gave on a 0-10 survey scale. The model received only the text of a single open-ended response. These AI predictions were compared with the actual experience ratings captured by the survey instrument across approximately 10,000 fan responses from five Major League Baseball teams. In total, two-thirds of predicted ratings fell within one point of self-reported fan ratings (67% within ±1, 36% exact match), and the predicted measurement was near-deterministic across three independent scoring runs (87% exact agreement, 99.9% within ±1). Predicted ratings aligned most strongly with the overall experience rating (r = 0.82) rather than with any specific aspect of the game-day experience such as parking, concessions, or staff. However, predictions were systematically lower than self-reported ratings by approximately one point, and this gap was not driven by any single aspect. Rather, our analysis shows that self-reported ratings capture the fan's verdict, an overall evaluative judgment that integrates the entire experience, while predicted ratings quantify the impact of salient moments characterized as memorable, emotionally intense, unusual, or actionable. Each measure contains information the other misses. These baseline results establish that a simple, unoptimized prompt can directionally predict how fans rate their experience from the text a fan wrote, and that the gap between the two numbers can be interpreted as a construct difference worth preserving rather than an error to eliminate.
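The headline figures in the abstract (exact match, within ±1, Pearson r, and the mean gap) can be illustrated with a minimal Python sketch. The function name `agreement_metrics` and the six toy ratings are hypothetical illustrations, not the paper's code or data.

```python
from math import sqrt


def agreement_metrics(predicted, reported):
    """Compare predicted ratings with self-reported ratings on the same scale."""
    if len(predicted) != len(reported) or len(predicted) < 2:
        raise ValueError("need two equal-length lists with at least two ratings")
    n = len(predicted)
    diffs = [p - r for p, r in zip(predicted, reported)]

    exact = sum(d == 0 for d in diffs) / n            # share of exact matches
    within_one = sum(abs(d) <= 1 for d in diffs) / n  # share within +/-1 point
    mean_gap = sum(diffs) / n                         # negative: model under-predicts

    # Pearson correlation computed from deviations around each mean.
    mean_p = sum(predicted) / n
    mean_r = sum(reported) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reported))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reported)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined when one series is constant")
    pearson_r = cov / sqrt(var_p * var_r)

    return {
        "exact": exact,
        "within_one": within_one,
        "mean_gap": mean_gap,
        "pearson_r": pearson_r,
    }


# Hypothetical toy ratings, chosen only to exercise the function.
toy_predicted = [7, 8, 5, 9, 3, 6]
toy_reported = [8, 9, 5, 10, 6, 7]
print(agreement_metrics(toy_predicted, toy_reported))
```

A high correlation and a nonzero mean gap can coexist, which is the pattern the abstract describes: the predictions track the ratings closely in rank and spread while sitting lower on average.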

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a baseline empirical study in which GPT-4.1 is prompted to predict baseball fans' self-reported overall experience ratings (0-10 scale) from single open-ended text responses. On a sample of approximately 10,000 responses from five MLB teams, the model achieves 67% of predictions within ±1 point of the ground-truth rating (36% exact match), a Pearson correlation of r=0.82 with the overall rating, and high run-to-run stability (87% exact agreement across three independent scorings). Predictions are systematically lower than self-reports by roughly one point; the authors interpret this gap as evidence that LLM outputs capture the impact of salient, memorable moments while self-reports reflect an integrated overall evaluative verdict, and argue that the gap should be preserved rather than treated as error.
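As a rough illustration of the run-to-run stability figures, the sketch below scores agreement across repeated scoring runs. It assumes a response counts as an exact agreement only when every run returns the same score, and as within ±1 when the highest and lowest scores differ by at most one; the text here does not state how the paper defined these quantities, and `run_agreement` and the toy scores are hypothetical.

```python
def run_agreement(runs):
    """Agreement across repeated scoring runs.

    `runs` is a list of equal-length score lists, one list per run.
    Assumed definitions (not confirmed by the paper):
      exact      -> all runs give the same score for a response
      within_one -> max score minus min score is at most 1
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to measure agreement")
    n = len(runs[0])
    if n == 0 or any(len(run) != n for run in runs):
        raise ValueError("runs must be non-empty and of equal length")

    # zip(*runs) groups the scores each response received across the runs.
    spreads = [max(scores) - min(scores) for scores in zip(*runs)]
    return {
        "exact": sum(s == 0 for s in spreads) / n,
        "within_one": sum(s <= 1 for s in spreads) / n,
    }


# Hypothetical scores for five responses over three runs.
run_a = [7, 8, 5, 9, 3]
run_b = [7, 8, 6, 9, 3]
run_c = [7, 8, 5, 9, 5]
print(run_agreement([run_a, run_b, run_c]))
```

A pairwise definition (averaging agreement over each pair of runs) would give somewhat different numbers, so the definition used matters when comparing against the reported 87% and 99.9%.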

Significance. If the directional prediction and reliability results hold under fuller methodological disclosure, the work supplies a reproducible baseline for using off-the-shelf LLMs to extract quantitative ratings from unstructured customer or fan feedback. The suggestion that systematic discrepancies between model and human scores can index distinct constructs (salient moments versus holistic verdict) rather than measurement failure offers a constructive framing for future text-to-rating research in applied NLP and survey methodology.

major comments (2)

[Methods] Methods: The exact prompt text supplied to GPT-4.1 is not reproduced. Because the central claim rests on the performance of a 'simple, unoptimized prompt,' the full prompt (including any system instructions, few-shot examples, or output format constraints) must be provided to support reproducibility and to permit evaluation of whether the observed gap is prompt-dependent.
[Results] Results: The one-point systematic under-prediction is presented as interpretable as a construct difference, yet no statistical test (paired t-test, bootstrap CI, or mixed-effects model accounting for team or response length) is reported to establish that the gap is reliable and not an artifact of the specific text collection or model version. Without these, the claim that the gap 'can be interpreted' as salient moments versus overall verdict remains under-supported.

minor comments (2)

[Abstract] Abstract and Results: Specify the five MLB teams, the collection period, and any inclusion/exclusion criteria for the ~10,000 responses so readers can assess selection bias and generalizability.
[Discussion] Discussion: Clarify whether the salient-moments versus overall-verdict distinction is offered as a post-hoc hypothesis or as a claim directly tested by the aspect-level correlations; the current wording leaves the evidential status ambiguous.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and recommendation for minor revision. We address the two major comments point by point below and will revise the manuscript accordingly to improve reproducibility and statistical support.

Referee: [Methods] Methods: The exact prompt text supplied to GPT-4.1 is not reproduced. Because the central claim rests on the performance of a 'simple, unoptimized prompt,' the full prompt (including any system instructions, few-shot examples, or output format constraints) must be provided to support reproducibility and to permit evaluation of whether the observed gap is prompt-dependent.

Authors: We agree that the full prompt must be disclosed for reproducibility. The revised manuscript will include the exact prompt text used with GPT-4.1, encompassing the complete system instructions, output format constraints, and confirmation that no few-shot examples were employed. This will allow readers to evaluate the prompt's simplicity and assess whether the observed gap depends on specific prompt choices.

revision: yes

Referee: [Results] Results: The one-point systematic under-prediction is presented as interpretable as a construct difference, yet no statistical test (paired t-test, bootstrap CI, or mixed-effects model accounting for team or response length) is reported to establish that the gap is reliable and not an artifact of the specific text collection or model version. Without these, the claim that the gap 'can be interpreted' as salient moments versus overall verdict remains under-supported.

Authors: The referee is correct that formal statistical tests for the systematic one-point gap were not reported. Although the manuscript describes the gap as consistent and not driven by any single aspect, we will add a paired t-test on the differences, bootstrap confidence intervals for the mean bias, and a mixed-effects model with team and response length as covariates. These analyses will be included in the revision to rigorously demonstrate that the under-prediction is reliable and not an artifact, thereby strengthening the construct-difference interpretation.

revision: yes
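Two of the analyses named in this exchange, the paired t-test on the differences and the bootstrap confidence interval for the mean bias, can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not the authors' analysis: the function names, the percentile bootstrap, the resample count, and the toy ratings are all hypothetical choices, and the mixed-effects model is not covered.

```python
import random
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(predicted, reported):
    """t statistic for the mean of paired differences (predicted - reported)."""
    diffs = [p - r for p, r in zip(predicted, reported)]
    if len(diffs) < 2:
        raise ValueError("need at least two paired ratings")
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("t statistic is undefined when all differences are equal")
    return mean(diffs) / (sd / sqrt(len(diffs)))


def bootstrap_mean_gap_ci(predicted, reported, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean paired difference."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = [p - r for p, r in zip(predicted, reported)]
    n = len(diffs)
    # Resample the differences with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical toy ratings, not data from the paper.
toy_predicted = [7, 8, 5, 9, 3, 6]
toy_reported = [8, 9, 5, 10, 6, 7]
print(paired_t_statistic(toy_predicted, toy_reported))
print(bootstrap_mean_gap_ci(toy_predicted, toy_reported))
```

Turning the t statistic into a p-value requires the t distribution with n - 1 degrees of freedom, which the standard library does not provide; `scipy.stats.ttest_rel` is one option if SciPy is available. With roughly 10,000 responses, a one-point mean gap would almost certainly be statistically distinguishable from zero, so the interval's width and its stability across teams are likely the more informative outputs.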

Circularity Check

0 steps flagged

No significant circularity identified

The paper reports an empirical comparison of GPT-4.1 predictions (generated from open-ended fan text via a fixed prompt) against independent survey ground-truth ratings collected from ~10,000 responses. No equations, parameter fitting, or derivations appear in the provided text; the central numbers (67% within ±1, r=0.82, systematic one-point gap) are direct observational outputs rather than quantities that reduce to the inputs by construction. The interpretive claim that the gap reflects a construct difference is presented as one supported reading, not as a theorem or fitted result required for the headline metrics. The work is therefore self-contained against external benchmarks with no load-bearing self-citation chains or self-definitional steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work is empirical and relies on standard assumptions about LLM text understanding and survey validity rather than new theoretical constructs or fitted parameters.

axioms (2)

Domain assumption: GPT-4.1 can extract evaluative information from unstructured fan text sufficient for directional rating prediction. Invoked by using the model for the core prediction task without additional training or fine-tuning.
Domain assumption: Self-reported survey ratings constitute valid ground truth for overall experience. Used as the benchmark for all accuracy and correlation calculations.

pith-pipeline@v0.9.0 · 5578 in / 1531 out tokens · 60255 ms · 2026-05-10T13:51:36.569454+00:00 · methodology

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1] Shleifer, A. (2026). GPT as a measurement tool. NBER Working Paper No. 34834.

[2] Baumeister, R. F., Bratslavsky, E., Finkenauer, C., and Vohs, K. D. (2001). Bad is stronger than good. Review of General Psychology, 5(4), 323–370.

[3] Fredrickson, B. L., and Kahneman, D. (1993). Duration neglect in retrospective evaluations of affective episodes. Journal of Personality and Social Psychology, 65(1), 45–55.

[4] Glazier, R. A., Boydstun, A. E., and Feezell, J. T. (2021). Self-coding: A method to assess semantic validity and bias when coding open-ended responses. Research & Politics, 8(3), 1–9.

[5] Halterman, A., and Keith, K. A. (2024). Codebook LLMs: Evaluating LLMs as measurement tools for political science concepts. arXiv preprint, arXiv:2407.10747.

[6] Kahneman, D., and Riis, J. (2005). Living, and thinking about it: Two perspectives on life. In F. Huppert, N. Baylis, and B. Keverne (Eds.), The science of well-being (pp. 285–304). Oxford University Press.

[7] Krippendorff, K. (2018). Content Analysis: An Introduction to Its Methodology (4th ed.). Sage Publications.

[8] Licht, H., Sarkar, R., Wu, P. Y., Goel, P., Stoehr, N., Ash, E., and Hoyle, A. M. (2025). Measuring scalar constructs in social science with LLMs. arXiv preprint, arXiv:2509.03116.

[9] Pangakis, N., Wolken, S., and Fasching, N. (2023). Automated annotation with generative AI requires validation. arXiv preprint, arXiv:2306.00176.

[10] Ludwig, J., Mullainathan, S., and Rambachan, A. (2025). Large language models: An applied econometric framework. NBER Working Paper No. 33344.

[11] Schwarz, N. (1999). Self-reports: How the questions shape the answers. American Psychologist, 54(2), 93–105.

[12] Schwarz, N., and Clore, G. L. (1983). Mood, misattribution, and judgments of well-being: Informative and directive functions of affective states. Journal of Personality and Social Psychology, 45(3), 513–523.

[13] Schwarz, N., and Strack, F. (1999). Reports of subjective well-being: Judgmental processes and their methodological implications. In D. Kahneman, E. Diener, and N. Schwarz (Eds.), Well-being: The foundations of hedonic psychology (pp. 61–84). Russell Sage Foundation.

[14] Törnberg, P. (2023). How to use large language models for text analysis. arXiv preprint, arXiv:2307.13106.
