
arxiv: 2605.29199 · v1 · pith:4IF4VS3Nnew · submitted 2026-05-28 · 💻 cs.SI

Scalable AI-Driven Analytics for User Engagement and Stance Detection on Social Media

Pith reviewed 2026-06-29 00:26 UTC · model grok-4.3

classification 💻 cs.SI
keywords: conspiracy content, user engagement, stance detection, YouTube, social media analytics, misinformation, amplification dynamics, scalable framework

The pith

Conspiracy videos draw up to 70 percent of total user engagement in their first week after upload.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a modular pipeline that ingests, filters, models topics, and applies sentiment and stance detection to millions of YouTube comments on conspiracy videos. It uses this system to measure how engagement concentrates early and how users respond. The central finding is that most interaction happens quickly and most expressed positions support the narratives. The work argues this shows the value of continuous, service-oriented monitoring at platform scale.

Core claim

A scalable service framework combining data ingestion, topic modelling, sentiment analysis, and stance detection processes over 7 million comments from nearly 50,000 conspiracy-related YouTube videos. The analysis shows conspiracy content attracts up to 70 percent of total user engagement within the first week and that a majority of users express favourable positions toward the narratives, with a small set of highly active users driving disproportionate engagement across channels.

What carries the argument

The modular pipeline that chains data ingestion, filtering, topic modelling, sentiment analysis, and stance detection to operate on large real-world comment sets.
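A chained pipeline of this kind can be sketched as a sequence of stages, each of which takes a stream of comments and yields an annotated stream. The sketch below is only an illustration under stated assumptions: the `Comment` fields, the stage names, and the keyword rules are hypothetical placeholders standing in for the paper's actual filtering, topic, sentiment, and stance models, which the source does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Comment:
    video_id: str
    text: str
    # Annotations filled in by later stages (hypothetical field names).
    topic: Optional[str] = None
    sentiment: Optional[str] = None
    stance: Optional[str] = None


# A stage maps a stream of comments to a stream of comments.
Stage = Callable[[Iterable[Comment]], Iterator[Comment]]


def _tokens(text: str) -> set:
    """Lower-cased whitespace tokens; a stand-in for real preprocessing."""
    return set(text.lower().split())


def filter_noise(comments: Iterable[Comment]) -> Iterator[Comment]:
    """Drop comments with fewer than three words (placeholder spam filter)."""
    for c in comments:
        if len(c.text.split()) >= 3:
            yield c


def tag_topic(comments: Iterable[Comment]) -> Iterator[Comment]:
    """Placeholder topic tagger: keyword match, not a real topic model."""
    for c in comments:
        c.topic = "vaccine" if "vaccine" in _tokens(c.text) else "other"
        yield c


def tag_sentiment(comments: Iterable[Comment]) -> Iterator[Comment]:
    """Placeholder sentiment tagger based on two tiny word lists."""
    for c in comments:
        words = _tokens(c.text)
        if words & {"great", "true"}:
            c.sentiment = "positive"
        elif words & {"false", "nonsense"}:
            c.sentiment = "negative"
        else:
            c.sentiment = "neutral"
        yield c


def tag_stance(comments: Iterable[Comment]) -> Iterator[Comment]:
    """Placeholder stance tagger: favour, against, or none."""
    for c in comments:
        words = _tokens(c.text)
        if words & {"true", "agree"}:
            c.stance = "favour"
        elif words & {"false", "fake"}:
            c.stance = "against"
        else:
            c.stance = "none"
        yield c


def run_pipeline(comments: Iterable[Comment], stages: List[Stage]) -> List[Comment]:
    """Chain the stages lazily, then materialise the annotated comments."""
    stream: Iterable[Comment] = comments
    for stage in stages:
        stream = stage(stream)
    return list(stream)
```

Because every stage shares one signature, a real model can replace a placeholder without touching the others, which is the practical meaning of "modular" here.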

Load-bearing premise

The stance detection and sentiment models correctly classify user positions on conspiracy comments even though no accuracy metrics or validation results are supplied.
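To make the missing validation concrete, the sketch below shows one common way to score a classifier against hand labels: a confusion count and macro-averaged F1, which weights each class equally and so is not dominated by a majority class. The label names and the function names are illustrative assumptions; the paper reports no such evaluation.

```python
from collections import Counter
from typing import Dict, List, Tuple


def confusion_counts(gold: List[str], pred: List[str]) -> Dict[Tuple[str, str], int]:
    """Count (gold label, predicted label) pairs."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return Counter(zip(gold, pred))


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    counts = confusion_counts(gold, pred)
    labels = sorted(set(gold) | set(pred))
    if not labels:
        raise ValueError("need at least one labelled example")
    f1_scores = []
    for label in labels:
        tp = counts[(label, label)]
        fp = sum(counts[(g, label)] for g in labels if g != label)
        fn = sum(counts[(label, p)] for p in labels if p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never correct scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

If most comments are labelled favourable, plain accuracy can look high even when the minority classes are classified poorly, which is why macro-F1 is a useful companion metric.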

What would settle it

Label a random sample of the 7 million comments for stance and sentiment by hand, then measure how often the pipeline's classifications match those labels.
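The check described above can be sketched in a few lines: draw a reproducible random sample, compare hand labels with pipeline labels, and attach an interval to the agreement rate. This is a minimal sketch, not the authors' procedure; the function names are hypothetical, and the Wilson score interval is one reasonable choice of interval among several.

```python
import math
import random
from typing import Iterable, List, Sequence, Tuple


def draw_sample(comment_ids: Iterable, k: int, seed: int = 0) -> List:
    """Draw k distinct comment IDs uniformly at random, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(list(comment_ids), k)


def agreement_rate(hand: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of sampled comments where the pipeline matches the hand label."""
    if len(hand) != len(predicted) or not hand:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(h == p for h, p in zip(hand, predicted))
    return matches / len(hand)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (about 95% when z = 1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```

For instance, `wilson_interval(matches, sample_size)` gives a range for the true agreement rate, so a reader can see how much a small labelled sample actually constrains the headline stance result.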

Figures

Figures reproduced from arXiv: 2605.29199 by Dinusha Vatsalan, Hassan Asghar, Mohamed Ali Kaafar, Muhammad Ikram, Thammitage Piyumi Wathsala Seneviratne.

Figure 1: Overview of the proposed scalable AI-driven service architecture. The system consists of five layers: (1) Data Sources Curation, (2) Data Ingestion (YouTube API [21]), (3) Processing Pipeline (filtering, topic modelling, sentiment and stance analysis), (4) Analytics Layer (engagement metrics and behavioural signals), and (5) Service Interface for real-time querying and monitoring. enables us to characteris…
Figure 2: Differences between reported and publicly available comment counts across datasets (log scale)
Figure 3: Clustered document embeddings showing coherent topic clusters. Corpus package [24]. We manually verify that removing such phrases does not alter the semantic meaning of transcripts. Unlike traditional topic modelling approaches based on word-frequency matrices, our objective is to extract human-interpretable topics that preserve semantic relationships within the data. Therefore, we adopt a transformer-base…
Figure 4: Our data preprocessing pipeline for stance detection. a user is responding in order to accurately infer their opinion. In large-scale social media environments, this challenge is further compounded by the presence of noisy, ambiguous, and low-information content. In particular, spam-like, irrelevant, or meaningless comments are prevalent and can negatively impact stance inference. To address this, we first…
Figure 5: Comments distribution. [Plot: x-axis "Number of Unique Videos Per User" (10^0 to 10^3); y-axis "CDF" (0.0 to 1.0); legend: Other Conspiracies, QA-non Conspiracies, Baseline Videos]
Figure 8: Pairwise Pearson correlation coefficient between users’ number of comments, likes, and views in each dataset. Takeaway (RQ1): Our findings suggest that conspiracy content exhibits stronger, more skewed, and more distributed engagement compared to mainstream content. A small subset of highly active users contributes disproportionately, and engagement is driven by content across multiple channels rather th…
Figure 9: Proportion of the comments in each dataset by sentiments of most actively engaging users. (i) Sentiment Analysis
Figure 10
Figure 12: Time series analysis of comments received over time. Normalised comment count is obtained by dividing total comments by the number of videos, enabling comparison across datasets of different sizes. (i) Early Engagement Dynamics
Abstract

Social media platforms have become a major vector for the large-scale dissemination of misinformation and conspiracy content, posing significant risks to public trust, health, and societal stability. While prior work has primarily focused on analysing such content from a behavioural or content-centric perspective, there is a lack of scalable, service-oriented solutions that enable continuous monitoring and analysis of user engagement at platform scale. In this paper, we present a scalable AI-driven service framework for analysing user engagement and stance on social media content. Our system integrates data ingestion, filtering, topic modelling, sentiment analysis, and stance detection into a modular pipeline that can operate on large-scale, real-world datasets. We implement and evaluate our framework on a dataset comprising over 7 million user comments collected from nearly 50,000 YouTube videos associated with conspiracy narratives. Our analysis reveals that conspiracy content attracts up to 70% of total user engagement within the first week of publication, indicating strong early amplification dynamics. Furthermore, we identify a subset of highly active users who exhibit disproportionately high engagement across multiple videos and channels. Stance analysis shows that a majority of users express favourable positions toward conspiracy narratives, highlighting the role of user communities in reinforcing such content. The proposed framework demonstrates the feasibility of deploying scalable, service-oriented analytics for real-time monitoring of user engagement and behavioural patterns. These findings demonstrate the effectiveness of our framework in capturing large-scale engagement dynamics and highlight the importance of early-stage detection and service-based monitoring for mitigating the spread of harmful content.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents a modular, service-oriented AI pipeline that combines data ingestion, filtering, topic modelling, sentiment analysis, and stance detection to monitor user engagement with conspiracy-related YouTube content at scale. Evaluated on a corpus of >7 million comments from ~50k videos, the work reports that conspiracy content captures up to 70% of total engagement within the first week and that a majority of users adopt favourable stances toward such narratives, while also identifying a small set of highly active users.

Significance. A validated, production-ready framework for continuous, large-scale stance and engagement monitoring would be a useful contribution to computational social science and platform-governance research. The dataset size and the emphasis on early amplification are strengths; however, the absence of any reported model validation, baselines, or uncertainty estimates for the core AI components substantially reduces the reliability of the headline quantitative claims.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Stance Detection): The central claim that 'a majority of users express favourable positions toward conspiracy narratives' is produced by an unvalidated stance-detection module. No accuracy, F1, confusion matrix, inter-annotator agreement, or held-out test-set results are supplied for either the stance classifier or the upstream sentiment analysis. Without these metrics the majority conclusion, which is load-bearing for the paper's main empirical contribution, cannot be assessed.
  2. [§3 and §5] §3 and §5 (Engagement Analysis): The reported 'up to 70% of total user engagement within the first week' is presented without baseline comparisons, temporal controls, or error bars. It is therefore impossible to determine whether this figure exceeds what would be expected under a null model of random or popularity-driven engagement.
  3. [§4] §4 (Pipeline Evaluation): The manuscript asserts that the framework 'can operate on large-scale, real-world datasets' and demonstrates 'feasibility of deploying scalable, service-oriented analytics,' yet provides no throughput, latency, or resource-utilization measurements for the end-to-end pipeline on the 7 M comment corpus.
minor comments (2)
  1. [Abstract] The abstract states quantitative findings without any accompanying validation statistics; this should be flagged as a limitation even in the abstract.
  2. [§3] Notation for engagement ratios and stance polarity scores is introduced without explicit definitions or references to the precise formulas used.
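The error bars requested in major comment 2 could, for example, come from a percentile bootstrap over videos. The sketch below assumes each video is represented as a list of comment ages in days since upload; that data layout, the 7-day cutoff as "first week", and the choice to resample whole videos are assumptions made for illustration, not details taken from the paper.

```python
import random
from typing import List, Sequence, Tuple


def first_week_share(videos: Sequence[Sequence[float]]) -> float:
    """Pooled fraction of comments posted less than 7 days after upload."""
    total = sum(len(ages) for ages in videos)
    if total == 0:
        raise ValueError("no comments to analyse")
    early = sum(1 for ages in videos for age in ages if age < 7)
    return early / total


def bootstrap_ci(
    videos: Sequence[Sequence[float]],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval, resampling videos with replacement."""
    rng = random.Random(seed)
    stats: List[float] = []
    for _ in range(n_boot):
        resample = [rng.choice(videos) for _ in videos]
        # Skip the rare resample that contains no comments at all.
        if sum(len(ages) for ages in resample) == 0:
            continue
        stats.append(first_week_share(resample))
    if not stats:
        raise ValueError("no usable bootstrap resamples")
    stats.sort()
    lo_index = int((alpha / 2) * len(stats))
    hi_index = max(lo_index, int((1 - alpha / 2) * len(stats)) - 1)
    return stats[lo_index], stats[hi_index]
```

Resampling videos rather than individual comments keeps comments from the same video together, which respects the fact that they are not independent observations.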

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below and will incorporate revisions to improve the manuscript's rigor and transparency.

  1. Referee: [Abstract and §4] The central claim that 'a majority of users express favourable positions toward conspiracy narratives' is produced by an unvalidated stance-detection module. No accuracy, F1, confusion matrix, inter-annotator agreement, or held-out test-set results are supplied for either the stance classifier or the upstream sentiment analysis. Without these metrics the majority conclusion, which is load-bearing for the paper's main empirical contribution, cannot be assessed.

    Authors: We agree that the absence of validation metrics for the stance detection and sentiment components limits the assessability of the majority stance claim. The manuscript applies standard off-the-shelf NLP models without reporting dataset-specific performance. In revision we will add a new evaluation subsection in §4 that reports accuracy, macro-F1, a confusion matrix, and inter-annotator agreement on a manually labelled held-out test set of 500 comments. This will directly support the empirical contribution. revision: yes

  2. Referee: [§3 and §5] The reported 'up to 70% of total user engagement within the first week' is presented without baseline comparisons, temporal controls, or error bars. It is therefore impossible to determine whether this figure exceeds what would be expected under a null model of random or popularity-driven engagement.

    Authors: The 70% figure is an observational statistic computed from the temporal distribution of comment volumes in the collected corpus. We acknowledge the lack of statistical controls. The revised manuscript will add, in §§3 and 5, a simple null-model baseline (random reassignment of engagement volumes) together with bootstrap-derived 95% confidence intervals around the weekly engagement percentages to allow readers to judge whether the observed early amplification is distinguishable from chance. revision: yes

  3. Referee: [§4] The manuscript asserts that the framework 'can operate on large-scale, real-world datasets' and demonstrates 'feasibility of deploying scalable, service-oriented analytics,' yet provides no throughput, latency, or resource-utilization measurements for the end-to-end pipeline on the 7 M comment corpus.

    Authors: We concur that quantitative pipeline benchmarks are required to substantiate the scalability claims. The revised version will include, in §4, end-to-end measurements (comments processed per second, average latency per comment, peak CPU and memory usage) obtained while ingesting and analysing the full 7-million-comment corpus on a standard cloud VM configuration. These metrics will be presented alongside the existing qualitative feasibility discussion. revision: yes
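The kind of benchmark promised in the third response can be sketched with the Python standard library alone. In this hedged sketch, `process_comment` is a hypothetical stand-in for the end-to-end per-comment pipeline, and `tracemalloc` reports only memory allocated by Python itself, so it understates the footprint of native model libraries; CPU utilisation is not measured here.

```python
import math
import statistics
import time
import tracemalloc
from typing import Callable, Dict, Iterable


def benchmark(process_comment: Callable[[str], object],
              comments: Iterable[str]) -> Dict[str, float]:
    """Time a per-comment function and report throughput, latency, peak memory."""
    items = list(comments)
    if not items:
        raise ValueError("need at least one comment to benchmark")

    latencies = []
    tracemalloc.start()
    start = time.perf_counter()
    for comment in items:
        t0 = time.perf_counter()
        process_comment(comment)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    latencies.sort()
    # Nearest-rank 95th percentile of per-comment latency.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "comments": float(len(items)),
        "throughput_per_s": len(items) / elapsed if elapsed > 0 else float("inf"),
        "mean_latency_s": statistics.fmean(latencies),
        "p95_latency_s": latencies[p95_index],
        "peak_python_bytes": float(peak_bytes),
    }
```

Reporting a tail percentile alongside the mean matters for a monitoring service, since a few slow comments can dominate perceived responsiveness while barely moving the average.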

Circularity Check

0 steps flagged

No circularity: empirical pipeline outputs are direct data counts and model applications, not self-defined quantities.


The paper describes a modular data-processing pipeline (ingestion, filtering, topic modelling, sentiment, stance detection) applied to a collected dataset of 7M comments. Reported figures such as the 70% early engagement and majority favourable stance are presented as analysis results from this pipeline. No equations, parameter-fitting steps, or derivations appear in the abstract or described framework. No self-citations, uniqueness theorems, or ansatzes are invoked to justify core claims. The stance-detection component is unvalidated in the provided text, but this is a validation gap rather than a circular reduction of the output to its own definition. The derivation chain is therefore self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract. The work relies on standard, unspecified NLP tools for topic modelling, sentiment, and stance detection.

pith-pipeline@v0.9.1-grok · 5826 in / 1158 out tokens · 33338 ms · 2026-06-29T00:26:23.762806+00:00 · methodology


Reference graph

Works this paper leans on

33 extracted references · 3 linked inside Pith

  1. [1]

    Top websites ranking 2023,

    “Top websites ranking 2023,” Sep 2023, [Accessed 12-10-2023]. [Online]. Available: https://www.similarweb.com/top-websites/

  2. [2]

    32 youtube statistics 2024: Key insights & trends you need to know,

    N. Dunn, “32 youtube statistics 2024: Key insights & trends you need to know,” 2024, [Accessed 14-10-2024]. [Online]. Available: https://www.charleagency.com/articles/youtube-statistics/

  3. [3]

    Conspiracy theories as barriers to controlling the spread of covid-19 in the u.s

    D. Romer and K. H. Jamieson, “Conspiracy theories as barriers to controlling the spread of covid-19 in the u.s.” Social Science & Medicine, vol. 263, p. 113356, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S027795362030575X

  4. [4]

    Qanon: The networks of misinformation and conspiracy theories on social media,

    S. Dastgeer and R. Thapaliya, “Qanon: The networks of misinformation and conspiracy theories on social media,” in The Emerald Handbook of Computer-Mediated Communication and Social Media. Emerald Publishing Limited, 2022, pp. 251–268

  5. [5]

    Managing harmful conspiracy theories on YouTube - blog.youtube,

    Google, “Managing harmful conspiracy theories on YouTube - blog.youtube,” [15-OCT-2020], [Accessed 14-10-2024]. [Online]. Available: https://blog.youtube/news-and-events/harmful-conspiracy-theories-youtube/

  6. [6]

    Continuing our work to improve recommendations on youtube — blog.youtube,

    YouTube, “Continuing our work to improve recommendations on youtube — blog.youtube,” [25-01-2019], [Accessed 15-10-2024]. [Online]. Available: https://blog.youtube/news-and-events/continuing-our-work-to-improve/

  7. [7]

    Trends in the diffusion of misinformation on social media,

    H. Allcott, M. Gentzkow, and C. Yu, “Trends in the diffusion of misinformation on social media,” Research & Politics, vol. 6, no. 2, p. 2053168019848554, 2019

  8. [8]

    A longitudinal analysis of youtube’s promotion of conspiracy videos,

    M. Faddoul, G. Chaslot, and H. Farid, “A longitudinal analysis of youtube’s promotion of conspiracy videos,” 3 2020. [Online]. Available: http://arxiv.org/abs/2003.03318

  9. [9]

    Conspiracy beliefs, misinformation, social media platforms, and protest participation,

    S. Boulianne and S. Lee, “Conspiracy beliefs, misinformation, social media platforms, and protest participation,” Media and Communication, vol. 10, pp. 30–41, 2022

  10. [10]

    Conspiracy brokers: Understanding the monetization of youtube conspiracy theories

    C. Ballard, I. Goldstein, P. Mehta, G. Smothers, K. Take, V. Zhong, R. Greenstadt, T. Lauinger, and D. McCoy, “Conspiracy brokers: Understanding the monetization of youtube conspiracy theories.” Association for Computing Machinery, Inc, 4 2022, pp. 2707–2718

  11. [11]

    Science vs conspiracy: Collective narratives in the age of misinformation,

    A. Bessi, M. Coletto, G. A. Davidescu, A. Scala, G. Caldarelli, and W. Quattrociocchi, “Science vs conspiracy: Collective narratives in the age of misinformation,” PloS one, vol. 10, no. 2, p. e0118093, 2015

  12. [12]

    The spreading of misinformation online,

    M. Del Vicario, A. Bessi, F. Zollo, F. Petroni, A. Scala, G. Caldarelli, H. E. Stanley, and W. Quattrociocchi, “The spreading of misinformation online,” Proceedings of the national academy of Sciences, vol. 113, no. 3, pp. 554–559, 2016

  13. [13]

    Users polarization on facebook and youtube,

    A. Bessi, F. Zollo, M. D. Vicario, M. Puliga, A. Scala, G. Caldarelli, B. Uzzi, and W. Quattrociocchi, “Users polarization on facebook and youtube,” PLoS ONE, vol. 11, 8 2016

  14. [14]

    Conspiracy vs science: a large-scale analysis of online discussion cascades,

    Y. Zhang, L. Wang, J. J. Zhu, and X. Wang, “Conspiracy vs science: a large-scale analysis of online discussion cascades,” World wide web, vol. 24, pp. 585–606, 2021

  15. [15]

    Conspiracy in the time of corona: automatic detection of emerging covid-19 conspiracy theories in social media and the news,

    S. Shahsavari, P. Holur, T. Wang, T. R. Tangherlini, and V. Roychowdhury, “Conspiracy in the time of corona: automatic detection of emerging covid-19 conspiracy theories in social media and the news,” Journal of computational social science, vol. 3, no. 2, pp. 279–317, 2020

  16. [16]

    Conspiracy theories and social media platforms,

    M. Cinelli, G. Etta, M. Avalle, A. Quattrociocchi, N. Di Marco, C. Valensise, A. Galeazzi, and W. Quattrociocchi, “Conspiracy theories and social media platforms,” Current Opinion in Psychology, p. 101407, 2022

  17. [17]

    Analyzing disinformation and crowd manipulation tactics on youtube,

    M. N. Hussain, S. Tokdemir, N. Agarwal, and S. Al-Khateeb, “Analyzing disinformation and crowd manipulation tactics on youtube,” in 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 2018, pp. 1092–1095

  18. [18]

    Caught in a networked collusion? homogeneity in conspiracy-related discussion networks on youtube,

    D. Röchert, G. Neubaum, B. Ross, and S. Stieglitz, “Caught in a networked collusion? homogeneity in conspiracy-related discussion networks on youtube,” Information Systems, vol. 103, p. 101866, 2022

  19. [19]

    Antisemitic conspiracy fantasy in the age of digital media: Three ‘conspiracy theorists’ and their youtube audiences,

    D. Allington, B. L. Buarque, and D. B. Flores, “Antisemitic conspiracy fantasy in the age of digital media: Three ‘conspiracy theorists’ and their youtube audiences,” Language and Literature, vol. 30, pp. 78–102, 2 2021

  20. [20]

    Where conspiracy theories flourish: A study of youtube comments and bill gates conspiracy theories,

    L. Ha, T. Graham, and J. Gray, “Where conspiracy theories flourish: A study of youtube comments and bill gates conspiracy theories,” Harvard Kennedy School Misinformation Review, 10 2022

  21. [21]

    Google for developers — add youtube functionality to your app,

    “Google for developers — add youtube functionality to your app,” Oct 2023, [Accessed: 03-03-2024]. [Online]. Available: https://developers.google.com/youtube/v3

  22. [22]

    youtube-transcript-api — pypi,

    “youtube-transcript-api — pypi,” Oct 2024. [Online]. Available: https://pypi.org/project/youtube-transcript-api/

  23. [23]

    Snorkel: Rapid training data creation with weak supervision,

    A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré, “Snorkel: Rapid training data creation with weak supervision,” in Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases, vol. 11, no. 3. NIH Public Access, 2017, p. 269

  24. [24]

    P. O. Perry, corpus: Text Corpus Analysis, 2017, R package version 0.10.0. [Online]. Available: http://corpustext.com

  25. [25]

    Bertopic: Neural topic modeling with a class-based tf-idf procedure,

    M. Grootendorst, “Bertopic: Neural topic modeling with a class-based tf-idf procedure,” arXiv preprint arXiv:2203.05794, 2022

  26. [26]

    Umap: Uniform manifold approximation and projection for dimension reduction,

    L. McInnes, J. Healy, and J. Melville, “Umap: Uniform manifold approximation and projection for dimension reduction,” arXiv preprint arXiv:1802.03426, 2018

  27. [27]

    hdbscan: Hierarchical density based clustering

    L. McInnes, J. Healy, and S. Astels, “hdbscan: Hierarchical density based clustering.” J. Open Source Softw., vol. 2, no. 11, p. 205, 2017

  28. [28]

    Scikit-learn: Machine learning in Python,

    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011

  29. [29]

    Nltk: The natural language toolkit,

    E. Loper and S. Bird, “Nltk: The natural language toolkit,” arXiv preprint cs/0205028, 2002

  30. [30]

    Software Framework for Topic Modelling with Large Corpora,

    R. Řehůřek and P. Sojka, “Software Framework for Topic Modelling with Large Corpora,” in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Valletta, Malta: ELRA, May 2010, pp. 45–50, http://is.muni.cz/publication/884893/en

  31. [31]

    Semeval-2018 Task 1: Affect in tweets,

    S. M. Mohammad, F. Bravo-Marquez, M. Salameh, and S. Kiritchenko, “Semeval-2018 Task 1: Affect in tweets,” in Proceedings of International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA, 2018

  32. [32]

    Replicable semi-supervised approaches to state-of-the-art stance detection of tweets,

    M. Reveilhac and G. Schneider, “Replicable semi-supervised approaches to state-of-the-art stance detection of tweets,” Information Processing and Management, vol. 60, no. 2, p. 103199, 2023

  33. [33]

    On a test of whether one of two random variables is stochastically larger than the other,

    H. B. Mann and D. R. Whitney, “On a test of whether one of two random variables is stochastically larger than the other,” The Annals of Mathematical Statistics, vol. 18, no. 1, pp. 50–60, 1947