arxiv: 2606.25518 · v1 · pith:6DLIVEJSnew · submitted 2026-06-24 · 💻 cs.CL

Fault of Our Stars: Behavioral Drivers of Rating-Sentiment Incongruence

Pith reviewed 2026-06-25 20:50 UTC · model grok-4.3

classification 💻 cs.CL
keywords sentiment incongruence · star ratings · review text · tourism reviews · transformer sentiment analysis · weak labels · behavioral drivers · NLP datasets

The pith

Star ratings frequently fail to match the sentiment in the accompanying review text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates cases where the sentiment expressed in a written review does not align with the star rating assigned by the same reviewer. Analyzing 16,156 Sri Lankan tourism reviews from 2010 to 2023, it applies a transformer model to assess text sentiment independently and finds mismatches in 18.6 percent of cases. These mismatches fall into six directional patterns, with conservative rating and obligatory five-star behaviors accounting for most. Factors such as venue type, reviewer expertise, review length, and timing contribute to the divergence. This matters because many NLP systems treat ratings as automatic ground-truth labels for sentiment without verification, which can introduce errors if the two sources conflict.

Core claim

Sentiment-rating incongruence occurs in 18.6% of reviews and falls into six directional patterns, with Conservative Rater and Obligatory 5-Star behaviors accounting for the majority of mismatches. Prevalence varies across venue types, with museums showing the highest rates. Statistical tests, logistic regression, Random Forest, and SHAP analysis identify venue type, reviewer expertise, review length, and temporal factors as contributors to rating-text divergence. The study concludes that star ratings are not interchangeable with textual sentiment and should be validated before being treated as ground-truth labels in NLP.

What carries the argument

A transformer-based sentiment pipeline that derives textual sentiment independently of assigned ratings, used to flag incongruence and its drivers.
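
The flagging step can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the abstract names neither the model nor how star ratings are converted to a polarity. The three-class scheme and the `rating_to_polarity` thresholds below are assumptions, chosen because three polarity classes yield exactly six ordered mismatch pairs, which happens to match the paper's count of six directional patterns. The text polarity is taken as given, standing in for the output of whatever sentiment model is used.

```python
from itertools import permutations

POLARITIES = ("negative", "neutral", "positive")


def rating_to_polarity(stars: int) -> str:
    """Hypothetical star-to-polarity mapping (an assumption, not from the paper)."""
    if not 1 <= stars <= 5:
        raise ValueError("stars must be between 1 and 5")
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


def incongruence_pattern(text_polarity: str, stars: int):
    """Return None if text and rating agree, else (text_polarity, rating_polarity)."""
    if text_polarity not in POLARITIES:
        raise ValueError(f"unknown polarity: {text_polarity}")
    rating_polarity = rating_to_polarity(stars)
    if text_polarity == rating_polarity:
        return None
    return (text_polarity, rating_polarity)


# Every ordered pair of distinct polarities is one directional pattern.
ALL_PATTERNS = list(permutations(POLARITIES, 2))

# Toy input: (text polarity from a sentiment model, star rating).
reviews = [("positive", 5), ("negative", 5), ("positive", 3), ("neutral", 4)]
flags = [incongruence_pattern(text, stars) for text, stars in reviews]
rate = sum(flag is not None for flag in flags) / len(flags)
```

The sketch deliberately does not attach the paper's pattern names (such as Conservative Rater or Obligatory 5-Star) to specific pairs, since the abstract does not define them.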

If this is right

  • Datasets that use star ratings as weak labels for sentiment analysis contain systematic noise from incongruent cases.
  • Models trained on rating labels may learn rating-specific biases instead of text sentiment patterns.
  • Incongruence rates are higher for certain venue types such as museums.
  • Longer reviews and those from more expert reviewers are more likely to show divergence.
  • Machine learning models combined with SHAP values can surface the main behavioral and contextual drivers.
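
To put the weak-label noise in concrete terms, the sketch below works through the arithmetic implied by the reported figures. The dataset size and incongruence rate come from the paper; the helper names are illustrative. The ceiling calculation rests on a strong simplifying assumption: that the text-derived labels are correct, so a model predicting text sentiment perfectly would disagree with the rating label on every incongruent review.

```python
N_REVIEWS = 16_156         # dataset size reported in the paper
INCONGRUENCE_RATE = 0.186  # incongruence rate reported in the paper


def expected_incongruent(n_reviews: int, rate: float) -> int:
    """Approximate number of reviews whose rating and text sentiment disagree."""
    return round(n_reviews * rate)


def agreement_ceiling(rate: float) -> float:
    """Best possible agreement with rating labels for a model that predicts
    text sentiment perfectly (assumes the text labels are the truth)."""
    return 1.0 - rate


n_noisy = expected_incongruent(N_REVIEWS, INCONGRUENCE_RATE)
ceiling = agreement_ceiling(INCONGRUENCE_RATE)
print(f"about {n_noisy} noisy labels; agreement ceiling {ceiling:.1%}")
```

Under these assumptions roughly 3,005 of the 16,156 reviews carry a rating label that conflicts with the text, and agreement with rating labels tops out near 81.4%.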

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the mismatch rate holds in other domains, then sentiment benchmarks that rely on rating labels may report inflated accuracy.
  • Tools that automatically flag rating-text mismatches could be added to dataset cleaning pipelines.
  • The behavioral patterns identified here may also appear in product or service reviews outside tourism.
  • Testing whether the same drivers appear when reviewers are asked to assign both ratings and text in controlled settings would extend the finding.

Load-bearing premise

The transformer-based sentiment pipeline produces an accurate measure of textual sentiment that is independent of the assigned star rating.

What would settle it

Collecting fresh human judgments of sentiment on a random sample of the reviews and finding that they align with the star ratings in more than 90 percent of cases would undermine the reported rate of incongruence.
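
One hedged way to make this check operational is to put a confidence interval around the human-rating agreement rate and ask whether its lower bound clears 90 percent. The sketch below uses the Wilson score interval; the choice of interval, the 95% level, and the sample counts are all assumptions for illustration and do not come from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


def undermines_reported_rate(agreeing: int, n: int, threshold: float = 0.90) -> bool:
    """True when even the lower bound of the agreement rate exceeds the threshold."""
    lower, _ = wilson_interval(agreeing, n)
    return lower > threshold


# Hypothetical annotation outcome: 390 of 400 sampled reviews where the
# human sentiment judgment matches the star rating.
low, high = wilson_interval(390, 400)
verdict = undermines_reported_rate(390, 400)
```

Using the lower bound rather than the point estimate guards against declaring the finding undermined on the strength of a small, lucky sample.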

Figures

Figures reproduced from arXiv: 2606.25518 by Anusan Krishnathas, Asma Rauff, Kovindarajah Sriyathurshan, Kusal Amantha, Nirasha Munasinghe, Nisansa de Silva, Patalee Narasinghe, Ramanaish Abaiyan, Ruththiragayan Sutharsan, Sandareka Wickramanayake.

Figure 2: Incongruence rate by venue type.
Adjacent paper text (C. Predictors of Incongruence): Bivariate screening with Benjamini–Hochberg correction was used to identify predictors associated with incongruence. As shown in Table V, reviewer tier, province, travel year, and review length remained significant after correction, while review delay was not significant (q = 0.7503). Expert reviewers were 1.97 times more likely than novices to…
Figure 1: Distribution of the six directional incongruence patterns.
Figure 3: Selected incongruence patterns by reviewer expertise, showing higher…
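
The Benjamini–Hochberg correction mentioned in the text accompanying Figure 2 converts raw p-values into q-values that control the false discovery rate. The sketch below is a generic implementation of the standard step-up procedure, not the authors' code, and the p-values in the example are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Illustrative p-values for four hypothetical predictors.
raw_p = [0.01, 0.04, 0.03, 0.005]
q = benjamini_hochberg(raw_p)
significant = [value <= 0.05 for value in q]
```

A predictor is retained when its q-value falls at or below the chosen false discovery rate; by that rule the reported q = 0.7503 for review delay is far from significant at any conventional level.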
Original abstract

When people share experiences online, they often express thoughts in two ways: a star rating and a written review. In sentiment analysis, ratings are widely used as convenient weak labels for textual sentiment, yet whether the two actually agree is rarely questioned. This study investigates sentiment-rating incongruence, where the sentiment expressed in review text differs from the sentiment implied by the assigned star rating, in Sri Lankan tourism attraction reviews. A dataset of 16,156 reviews from 2010 to 2023 is analyzed using a transformer-based sentiment pipeline that derives textual sentiment independently of assigned ratings. Incongruence occurs in 18.6% of reviews and falls into six directional patterns, with Conservative Rater and Obligatory 5-Star behaviors accounting for the majority of mismatches. Prevalence also varies across venue types, with museums showing the highest rates. Statistical tests, logistic regression, Random Forest, and SHAP analysis identify venue type, reviewer expertise, review length, and temporal factors as contributors to rating-text divergence. Overall, this study demonstrates that star ratings are not interchangeable with textual sentiment and should be validated before being treated as ground-truth labels in NLP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper examines 16,156 Sri Lankan tourism attraction reviews (2010–2023) and reports that a transformer-based sentiment pipeline finds textual sentiment incongruent with the assigned star rating in 18.6% of cases. It identifies six directional mismatch patterns (Conservative Rater and Obligatory 5-Star being most common), shows variation by venue type (highest in museums), and uses logistic regression, Random Forest, and SHAP analysis to attribute divergence to venue type, reviewer expertise, review length, and temporal factors. The central claim is that star ratings are not interchangeable with textual sentiment and require validation before use as ground-truth labels in NLP.

Significance. If the textual sentiment labels prove reliable, the result would demonstrate a non-negligible failure rate when ratings are used as weak supervision, with direct implications for dataset construction and model evaluation in sentiment analysis. The behavioral and venue-specific patterns add empirical granularity that could inform review-platform design and data-filtering practices.

major comments (1)
  1. [Abstract/Methods] The manuscript states that the transformer pipeline 'derives textual sentiment independently' of the star rating, yet it reports no accuracy, F1, or confusion-matrix results for the Sri Lankan tourism domain. This assumption is load-bearing for the reported 18.6% incongruence rate and for all downstream logistic-regression and SHAP findings; a systematic domain mismatch (local phrasing, code-switching) would directly inflate the mismatch statistics.
minor comments (1)
  1. [Abstract] The abstract does not name the six directional patterns, the specific transformer model, or the training corpus used.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address the single major comment below.

Point-by-point responses
  1. Referee: [Abstract/Methods] The manuscript states that the transformer pipeline 'derives textual sentiment independently' of the star rating, yet it reports no accuracy, F1, or confusion-matrix results for the Sri Lankan tourism domain. This assumption is load-bearing for the reported 18.6% incongruence rate and for all downstream logistic-regression and SHAP findings; a systematic domain mismatch (local phrasing, code-switching) would directly inflate the mismatch statistics.

    Authors: We agree this is a substantive gap. The manuscript does not report domain-specific accuracy, F1, or confusion-matrix results for the transformer pipeline on Sri Lankan tourism text, and it does not quantify possible domain shift from local phrasing or code-switching. In the revised version we will add a dedicated evaluation subsection that reports performance on a manually annotated held-out sample drawn from the same corpus (or, if annotation resources are limited, on a comparable tourism-domain benchmark), including accuracy, macro-F1, and the confusion matrix. We will also discuss the implications of any observed error rate for the 18.6% incongruence statistic and the downstream regression and SHAP results.

    Revision: yes
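
The evaluation proposed in this response could look roughly like the sketch below, which computes accuracy, macro-F1, and a confusion matrix from human-annotated labels and pipeline predictions. It is a plain-Python illustration under assumed inputs: the three-class label set and the six toy examples are hypothetical, and a real evaluation would use the annotated held-out sample.

```python
from collections import Counter


def confusion_matrix(gold, pred, labels):
    """Rows are gold labels, columns are predicted labels, both in `labels` order."""
    counts = Counter(zip(gold, pred))
    return [[counts[(g, p)] for p in labels] for g in labels]


def accuracy(gold, pred):
    """Fraction of items where the prediction equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1; a class with no support or predictions scores 0."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical annotated sample: human labels versus pipeline output.
LABELS = ["negative", "neutral", "positive"]
gold = ["positive", "positive", "negative", "neutral", "negative", "positive"]
pred = ["positive", "neutral", "negative", "neutral", "positive", "positive"]

matrix = confusion_matrix(gold, pred, LABELS)
acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred, LABELS)
```

Macro-F1 weights each class equally, which matters here because review corpora are typically dominated by positive reviews and plain accuracy can hide poor performance on the rarer negative and neutral classes.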

Circularity Check

0 steps flagged

No circularity: purely empirical observational study

Full rationale

The paper performs an empirical analysis of review data by applying a pre-existing transformer pipeline to generate independent sentiment labels, then compares them to star ratings using standard statistical tools (logistic regression, Random Forest, SHAP). No equations, derivations, fitted parameters renamed as predictions, or self-referential definitions appear. The pipeline is asserted to operate independently; any concern about its domain accuracy is a validity issue, not a circular reduction of the claimed result to its inputs. No self-citation chains or uniqueness theorems are invoked as load-bearing. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no mathematical model, free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5791 in / 1168 out tokens · 24019 ms · 2026-06-25T20:50:07.667347+00:00 · methodology
