Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest
Pith reviewed 2026-05-10 03:16 UTC · model grok-4.3
The pith
Large language models are systematically tested on verifying post authors, generating user-like content, and inferring user attributes from Twitter data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The study evaluates GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT on three tasks: social media authorship verification via a systematic sampling framework over users and posts, post generation assessed by multiple metrics plus a human perception study, and user attribute inference annotated with the IAB Tech Lab 2023 and 2018 U.S. SOC taxonomies. The authors argue that this unified evaluation provides new insights and establishes reproducible benchmarks for LLM-driven social media analytics.
What carries the argument
A multi-task evaluation framework built around systematic sampling of users and posts that tests generalization on post-2023 tweets while linking verification, generation, and attribute inference through shared data.
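The paper's sampling framework is not specified in detail here. As a minimal sketch of what a reproducible user-then-post sampler could look like, under the assumption of a fixed seed and per-user post lists (all names and parameters below are hypothetical, not the paper's actual API):

```python
import random

def sample_user_posts(users, posts_by_user, n_users=100, posts_per_user=10, seed=0):
    """Hypothetical sketch of a systematic sampling step: pick a
    reproducible subset of users, then a fixed number of posts per user.
    Sorting before sampling makes the draw deterministic for a given seed."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(users), min(n_users, len(users)))
    sample = {}
    for user in chosen:
        posts = posts_by_user.get(user, [])
        sample[user] = rng.sample(posts, min(posts_per_user, len(posts)))
    return sample
```

Fixing the seed is what makes such a benchmark rerunnable by other groups; varying the seed gives the alternative draws needed for stability checks.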
If this is right
- Models display distinct patterns of success and failure when asked to generate posts that humans judge as authentic.
- Standard taxonomies for occupations and interests allow consistent measurement of how well models infer user attributes from posts.
- Testing on tweets collected from January 2024 onward separates capabilities learned during training from memorization of earlier data.
- Public release of the dataset and evaluation code allows other researchers to run the same tests and track progress over time.
- The connection between authorship verification and post generation highlights shared challenges in style detection and style imitation.
Where Pith is reading between the lines
- Strong results on attribute inference could support more precise automated user profiling in research or moderation settings.
- The sampling approach could be reused to evaluate models on other user-generated content platforms beyond Twitter.
- Findings on human detection of generated posts may help design better tools for spotting synthetic social media content.
- Extending the same multi-task setup to additional analytics problems such as trend detection would create a fuller picture of model strengths.
Load-bearing premise
The chosen Twitter posts, sampling strategies, and mix of automatic and human evaluation metrics accurately capture real-world LLM performance in social media analytics without hidden selection biases or overfitting to the collection period.
What would settle it
If an independent collection of tweets from a later period produces substantially different performance orderings among the same models on any of the three tasks, or if a larger human study reaches opposite conclusions about the realism of generated posts, the reported benchmark insights would require revision.
Original abstract
In this study, we present the first comprehensive evaluation of modern LLMs - including GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT - across three core social media analytics tasks on a Twitter (X) dataset: (I) Social Media Authorship Verification, (II) Social Media Post Generation, and (III) User Attribute Inference. For the authorship verification, we introduce a systematic sampling framework over diverse user and post selection strategies and evaluate generalization on newly collected tweets from January 2024 onward to mitigate "seen-data" bias. For post generation, we assess the ability of LLMs to produce authentic, user-like content using comprehensive evaluation metrics. Bridging Tasks I and II, we conduct a user study to measure real users' perceptions of LLM-generated posts conditioned on their own writing. For attribute inference, we annotate occupations and interests using two standardized taxonomies (IAB Tech Lab 2023 and 2018 U.S. SOC) and benchmark LLMs against existing baselines. Overall, our unified evaluation provides new insights and establishes reproducible benchmarks for LLM-driven social media analytics. The code and data are provided in the supplementary material and will also be made publicly available upon publication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the first comprehensive evaluation of modern LLMs (GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT) on three Twitter-based social media analytics tasks: (I) authorship verification via a systematic sampling framework over diverse user/post strategies with generalization tested on newly collected January 2024+ tweets to mitigate seen-data bias, (II) post generation assessed through comprehensive metrics plus a user study measuring real users' perceptions of LLM-generated content conditioned on their own writing, and (III) user attribute inference with occupations/interests annotated via IAB Tech Lab 2023 and 2018 U.S. SOC taxonomies and benchmarked against baselines. The central claim is that this unified multi-task evaluation yields new insights and establishes reproducible benchmarks for LLM-driven social media analytics, with code and data provided in supplementary material for public release.
Significance. If the reported results, error analyses, and robustness checks hold, the work would be significant as one of the first multi-task benchmarks spanning verification, generation, and inference on social media data, with the user study and standardized taxonomies adding practical value. The public release of code/data further strengthens potential impact for the field, though significance is moderated by the need to confirm that findings reflect genuine capabilities rather than dataset-specific artifacts.
Major comments (1)
- [Abstract] The headline claim that the unified evaluation 'establishes reproducible benchmarks' is load-bearing for the paper's contribution, yet the described systematic sampling and January 2024 hold-out set lack reported ablations testing whether LLM performance rankings remain stable under alternative user/post stratifications (e.g., by account age, follower count, or topic distribution) or a second independent temporal split. Without these, the reproducibility assertion risks being sensitive to the specific 2023-2024 collection window and X API effects.
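One concrete way to run the ranking-stability check the referee asks for is to compare model orderings across two splits with Kendall's tau: a value near 1 means the leaderboards agree, near -1 means they invert. A self-contained sketch (pure Python, illustrative only; the rankings here are placeholders, not the paper's results):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings given as dicts model -> rank
    (1 = best). Concordance-based, no tie correction; an illustrative
    stability check, not the paper's reported analysis."""
    models = sorted(rank_a)
    assert sorted(rank_b) == models, "rankings must cover the same models"
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        sign = (rank_a[m1] - rank_a[m2]) * (rank_b[m1] - rank_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Reporting this coefficient between the original split and each alternative stratification (account age, follower count, topic) would directly quantify how robust the benchmark orderings are.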
Minor comments (2)
- [Abstract] 'Comprehensive evaluation metrics' for post generation are referenced but not enumerated; the main text should explicitly list and justify each metric (e.g., perplexity, human-likeness scores) so readers can assess their appropriateness.
- The manuscript should clarify the exact size and composition of the Twitter dataset (number of users/posts per task) and the precise annotation protocol for the IAB/SOC taxonomies to support the reproducibility claim.
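To make the metric-enumeration comment concrete, here is one commonly used candidate: unigram-overlap ROUGE-1 F1 between a generated post and a real one. This is an illustrative option only; the abstract does not say which metrics the paper actually uses.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall
    between a candidate (generated) text and a reference (real) text.
    Whitespace tokenization and lowercasing keep the sketch minimal."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For short, stylistically loose texts like tweets, surface-overlap scores of this kind are known to correlate weakly with human judgments, which is one reason the paper's accompanying user study matters.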
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's comments on our manuscript. We appreciate the careful reading and the focus on strengthening the reproducibility aspects of our work. Below we address the major comment point by point with a commitment to revisions where appropriate.
Point-by-point responses
- Referee: [Abstract] The headline claim that the unified evaluation 'establishes reproducible benchmarks' is load-bearing for the paper's contribution, yet the described systematic sampling and January 2024 hold-out set lack reported ablations testing whether LLM performance rankings remain stable under alternative user/post stratifications (e.g., by account age, follower count, or topic distribution) or a second independent temporal split. Without these, the reproducibility assertion risks being sensitive to the specific 2023-2024 collection window and X API effects.
Authors: We thank the referee for this constructive observation. Our systematic sampling framework was explicitly constructed to incorporate multiple diverse user and post selection strategies, and the January 2024 hold-out was collected independently to evaluate generalization beyond the original data window while mitigating seen-data bias. We agree, however, that we did not report explicit ablations confirming that LLM performance rankings remain invariant under further stratifications (e.g., account age, follower count, topic distribution) or an additional temporal split, nor did we isolate potential X API collection artifacts. In the revised manuscript we will add a dedicated robustness subsection that performs and reports such stability checks on available metadata attributes where computationally feasible, and we will revise the abstract language from 'establishes reproducible benchmarks' to 'contributes to establishing reproducible benchmarks' to more precisely reflect the scope of the presented evidence. These changes will be made while preserving the core multi-task evaluation and public data release.
Revision: partial
Circularity Check
No circularity: empirical evaluation on external data and new collections
Full rationale
The paper performs direct empirical benchmarking of LLMs across three tasks using newly collected January 2024 Twitter data, systematic sampling, human user studies, and standardized external taxonomies (IAB and SOC). No equations, parameter fitting presented as prediction, self-definitional constructs, or load-bearing self-citations appear in the derivation chain. All claims rest on observable experimental outcomes rather than reducing to inputs by construction, satisfying the criteria for a self-contained non-circular analysis.