LLM Predictive Scoring and Validation: Inferring Experience Ratings from Unstructured Text
Pith reviewed 2026-05-10 13:51 UTC · model grok-4.3
The pith
GPT-4.1 can predict baseball fans' 0-10 experience ratings from their written comments alone, with predictions falling within one point of actual ratings two-thirds of the time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We tasked GPT-4.1 to read what baseball fans wrote about their game-day experience and predict the overall experience rating each fan gave on a 0-10 survey scale. The model received only the text of a single open-ended response. These AI predictions were compared with the actual experience ratings captured by the survey instrument across approximately 10,000 fan responses from five Major League Baseball teams. In total, two-thirds of predicted ratings fell within one point of self-reported fan ratings (67% within +/-1, 36% exact match), and the predicted measurement was near-deterministic across three independent scoring runs (87% exact agreement, 99.9% within +/-1). Predicted ratings aligned most strongly with the overall experience rating (r = 0.82) rather than with any specific aspect of the game-day experience such as parking, concessions, or staff.
What carries the argument
The systematic gap between LLM-predicted ratings from text (salient moments) and self-reported overall ratings (integrated verdict), treated as complementary constructs rather than error.
If this is right
- Text alone can yield reliable directional estimates of experience ratings.
- The one-point offset preserves distinct information from memorable events versus holistic judgments.
- Simple prompts produce consistent results across repeated applications.
- Such scoring applies to any domain with open-ended responses and rating scales.
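The consistency claim in the list above can be checked with a small agreement calculation. The following is a minimal sketch, not the paper's procedure: the paper does not say how agreement across its three scoring runs was defined, so this sketch assumes "exact agreement" means all runs gave the same score for a response and "within +/-1" means the highest and lowest scores for a response differ by at most one point. The function name `run_agreement` and the toy scores are invented for illustration.

```python
def run_agreement(runs):
    """Share of responses on which repeated scoring runs agree.

    runs: list of equal-length score lists, one list per scoring run.
    Returns (exact, within_one) as fractions between 0 and 1.
    Assumed definitions (not taken from the paper):
      exact      -> every run produced the same score for the response
      within_one -> max score minus min score for the response is <= 1
    """
    if len(runs) < 2 or len({len(r) for r in runs}) != 1 or not runs[0]:
        raise ValueError("need at least two equal-length, non-empty runs")
    per_response = list(zip(*runs))  # one tuple of scores per response
    n = len(per_response)
    exact = sum(len(set(scores)) == 1 for scores in per_response) / n
    within_one = sum(max(scores) - min(scores) <= 1 for scores in per_response) / n
    return exact, within_one


# Toy scores for five responses across three runs (illustrative only).
runs = [
    [7, 8, 5, 9, 3],
    [7, 8, 6, 9, 3],
    [7, 8, 5, 9, 5],
]
exact, within_one = run_agreement(runs)
print(f"exact agreement: {exact:.0%}, within one point: {within_one:.0%}")
```

Pairwise agreement between runs would be an equally defensible definition and would generally give higher figures than the all-runs-identical rule used here.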
Where Pith is reading between the lines
- Companies could mine customer reviews or social media for predicted ratings to supplement or replace some surveys.
- Identifying the salient moments in text might help pinpoint the specific drivers of positive or negative experiences.
- The distinction suggests keeping both text-derived and direct rating measures in parallel for richer insights.
Load-bearing premise
The systematic one-point gap reflects a real difference between what the text emphasizes (salient moments) and what the rating captures (overall verdict), not an artifact of the prompt or data collection.
What would settle it
If changing the prompt or model version removes the one-point systematic difference while maintaining the correlation, that would show the gap is not a stable construct difference.
Original abstract
We tasked GPT-4.1 to read what baseball fans wrote about their game-day experience and predict the overall experience rating each fan gave on a 0-10 survey scale. The model received only the text of a single open-ended response. These AI predictions were compared with the actual experience ratings captured by the survey instrument across approximately 10,000 fan responses from five Major League Baseball teams. In total, two-thirds of predicted ratings fell within one point of self-reported fan ratings (67% within +/-1, 36% exact match), and the predicted measurement was near-deterministic across three independent scoring runs (87% exact agreement, 99.9% within +/-1). Predicted ratings aligned most strongly with the overall experience rating (r = 0.82) rather than with any specific aspect of the game-day experience such as parking, concessions, or staff. However, predictions were systematically lower than self-reported ratings by approximately one point, and this gap was not driven by any single aspect. Rather, our analysis shows that self-reported ratings capture the fan's verdict, an overall evaluative judgment that integrates the entire experience, while predicted ratings quantify the impact of salient moments characterized as memorable, emotionally intense, unusual, or actionable. Each measure contains information the other misses. These baseline results establish that a simple, unoptimized prompt can directionally predict how fans rate their experience from the text a fan wrote, and that the gap between the two numbers can be interpreted as a construct difference worth preserving rather than an error to eliminate.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a baseline empirical study in which GPT-4.1 is prompted to predict baseball fans' self-reported overall experience ratings (0-10 scale) from single open-ended text responses. On a sample of approximately 10,000 responses from five MLB teams, the model achieves 67% of predictions within ±1 point of the ground-truth rating (36% exact match), a Pearson correlation of r=0.82 with the overall rating, and high run-to-run stability (87% exact agreement across three independent scorings). Predictions are systematically lower than self-reports by roughly one point; the authors interpret this gap as evidence that LLM outputs capture the impact of salient, memorable moments while self-reports reflect an integrated overall evaluative verdict, and argue that the gap should be preserved rather than treated as error.
Significance. If the directional prediction and reliability results hold under fuller methodological disclosure, the work supplies a reproducible baseline for using off-the-shelf LLMs to extract quantitative ratings from unstructured customer or fan feedback. The suggestion that systematic discrepancies between model and human scores can index distinct constructs (salient moments versus holistic verdict) rather than measurement failure offers a constructive framing for future text-to-rating research in applied NLP and survey methodology.
major comments (2)
- [Methods] Methods: The exact prompt text supplied to GPT-4.1 is not reproduced. Because the central claim rests on the performance of a 'simple, unoptimized prompt,' the full prompt (including any system instructions, few-shot examples, or output format constraints) must be provided to support reproducibility and to permit evaluation of whether the observed gap is prompt-dependent.
- [Results] Results: The one-point systematic under-prediction is presented as interpretable as a construct difference, yet no statistical test (paired t-test, bootstrap CI, or mixed-effects model accounting for team or response length) is reported to establish that the gap is reliable and not an artifact of the specific text collection or model version. Without these, the claim that the gap 'can be interpreted' as salient moments versus overall verdict remains under-supported.
minor comments (2)
- [Abstract] Abstract and Results: Specify the five MLB teams, the collection period, and any inclusion/exclusion criteria for the ~10,000 responses so readers can assess selection bias and generalizability.
- [Discussion] Discussion: Clarify whether the salient-moments versus overall-verdict distinction is offered as a post-hoc hypothesis or as a claim directly tested by the aspect-level correlations; the current wording leaves the evidential status ambiguous.
Simulated Author's Rebuttal
We thank the referee for the constructive review and recommendation for minor revision. We address the two major comments point by point below and will revise the manuscript accordingly to improve reproducibility and statistical support.
- Referee: [Methods] Methods: The exact prompt text supplied to GPT-4.1 is not reproduced. Because the central claim rests on the performance of a 'simple, unoptimized prompt,' the full prompt (including any system instructions, few-shot examples, or output format constraints) must be provided to support reproducibility and to permit evaluation of whether the observed gap is prompt-dependent.
  Authors: We agree that the full prompt must be disclosed for reproducibility. The revised manuscript will include the exact prompt text used with GPT-4.1, encompassing the complete system instructions, output format constraints, and confirmation that no few-shot examples were employed. This will allow readers to evaluate the prompt's simplicity and assess whether the observed gap depends on specific prompt choices.
  Revision: yes
- Referee: [Results] Results: The one-point systematic under-prediction is presented as interpretable as a construct difference, yet no statistical test (paired t-test, bootstrap CI, or mixed-effects model accounting for team or response length) is reported to establish that the gap is reliable and not an artifact of the specific text collection or model version. Without these, the claim that the gap 'can be interpreted' as salient moments versus overall verdict remains under-supported.
  Authors: The referee is correct that formal statistical tests for the systematic one-point gap were not reported. Although the manuscript describes the gap as consistent and not driven by any single aspect, we will add a paired t-test on the differences, bootstrap confidence intervals for the mean bias, and a mixed-effects model with team and response length as covariates. These analyses will be included in the revision to rigorously demonstrate that the under-prediction is reliable and not an artifact, thereby strengthening the construct-difference interpretation.
  Revision: yes
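Two of the analyses promised in the response above, the paired t-test and the bootstrap confidence interval for the mean bias, can be sketched with the Python standard library alone. This is a minimal illustration under assumptions, not the authors' analysis: the function names, the percentile-bootstrap method, the resample count, and the toy ratings are all choices made here for the example. The mixed-effects model is omitted because it would need a dedicated statistics package.

```python
import random
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(predicted, reported):
    """t statistic for the mean of paired differences (predicted - reported)."""
    diffs = [p - r for p, r in zip(predicted, reported)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def bootstrap_mean_gap_ci(predicted, reported, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean signed gap."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = [p - r for p, r in zip(predicted, reported)]
    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy ratings invented for illustration; these are not the paper's data.
predicted = [7, 8, 5, 9, 3, 6, 4, 8]
reported = [8, 8, 7, 10, 3, 7, 6, 9]

t_stat = paired_t_statistic(predicted, reported)
ci_low, ci_high = bootstrap_mean_gap_ci(predicted, reported)
print(f"paired t statistic: {t_stat:.3f}")
print(f"95% bootstrap CI for mean gap: [{ci_low:.2f}, {ci_high:.2f}]")
```

A confidence interval for the mean gap that excludes zero would support the claim that the under-prediction is reliable, though it would not by itself show that the gap reflects a construct difference rather than a prompt or sampling artifact.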
Circularity Check
No significant circularity identified
The paper reports an empirical comparison of GPT-4.1 predictions (generated from open-ended fan text via a fixed prompt) against independent survey ground-truth ratings collected from ~10,000 responses. No equations, parameter fitting, or derivations appear in the provided text; the central numbers (67% within ±1, r=0.82, systematic one-point gap) are direct observational outputs rather than quantities that reduce to the inputs by construction. The interpretive claim that the gap reflects a construct difference is presented as one supported reading, not as a theorem or fitted result required for the headline metrics. The work is therefore self-contained against external benchmarks with no load-bearing self-citation chains or self-definitional steps.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: GPT-4.1 can extract evaluative information from unstructured fan text sufficient for directional rating prediction.
- Domain assumption: Self-reported survey ratings constitute valid ground truth for overall experience.
Reference graph
Works this paper leans on
- [1] Shleifer, A. (2026). GPT as a measurement tool. NBER Working Paper No. 34834.
- [2] Baumeister, R. F., Bratslavsky, E., Finkenauer, C., and Vohs, K. D. (2001). Bad is stronger than good. Review of General Psychology, 5(4), 323–370.
- [3] Fredrickson, B. L., and Kahneman, D. (1993). Duration neglect in retrospective evaluations of affective episodes. Journal of Personality and Social Psychology, 65(1), 45–55.
- [4] Glazier, R. A., Boydstun, A. E., and Feezell, J. T. (2021). Self-coding: A method to assess semantic validity and bias when coding open-ended responses. Research & Politics, 8(3), 1–9.
- [6] Kahneman, D., and Riis, J. (2005). Living, and thinking about it: Two perspectives on life. In F. Huppert, N. Baylis, and B. Keverne (Eds.), The science of well-being (pp. 285–304). Oxford University Press.
- [7] Krippendorff, K. (2018). Content Analysis: An Introduction to Its Methodology (4th ed.). Sage Publications.
- [8] Licht, H., Sarkar, R., Wu, P. Y., Goel, P., Stoehr, N., Ash, E., and Hoyle, A. M. (2025). Measuring scalar constructs in social science with LLMs. arXiv preprint, arXiv:2509.03116.
- [10] Ludwig, J., Mullainathan, S., and Rambachan, A. (2025). Large language models: An applied econometric framework. NBER Working Paper No. 33344.
- [11] Schwarz, N. (1999). Self-reports: How the questions shape the answers. American Psychologist, 54(2), 93–105.
- [12] Schwarz, N., and Clore, G. L. (1983). Mood, misattribution, and judgments of well-being: Informative and directive functions of affective states. Journal of Personality and Social Psychology, 45(3), 513–523.
- [13] Schwarz, N., and Strack, F. (1999). Reports of subjective well-being: Judgmental processes and their methodological implications. In D. Kahneman, E. Diener, and N. Schwarz (Eds.), Well-being: The foundations of hedonic psychology (pp. 61–84). Russell Sage Foundation.
- [14] Törnberg, P. (2023). How to use large language models for text analysis. arXiv preprint, arXiv:2307.13106.
Appendix A: Prompt function schema. The analysis in this study was conducted using the Dimension Labs language data ...