pith. machine review for the scientific record.

arxiv: 2604.27710 · v1 · submitted 2026-04-30 · 💻 cs.SI · cs.CY

Recognition: unknown

Social Media Data Toolkit: Standardization and Anonymization of Social Network Datasets


Pith reviewed 2026-05-07 06:26 UTC · model grok-4.3

classification 💻 cs.SI cs.CY
keywords social media · data toolkit · standardization · anonymization · cross-platform · Python · LLM enrichment

The pith

The Social Media Data Toolkit unifies heterogeneous social media datasets into a single generic schema for standardization, anonymization, and enrichment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the challenge of analyzing social media data across platforms when datasets come in inconsistent formats due to varying collection methods and API limits. It does this by creating the Social Media Data Toolkit that converts any dataset into a standard structure made up of communities, accounts, posts, actions, and entities. This matters because it lets researchers run the same analysis code on data from multiple platforms, add privacy protection, and include extra tools like language model-based scoring without rewriting everything for each new dataset. The authors test it on several cases involving text and network analysis and make the code freely available with documentation.

Core claim

We introduce the Social Media Data Toolkit, a Python framework that standardizes diverse social network datasets into a generic schema comprising Communities, Accounts, Posts, Actions, and Entities. It features configurable anonymization to protect personal information and an extendable layer for enrichment using large language models and network tools for tasks like stance detection. Demonstrated in case studies and released open-source, it supports consistent multi-platform research.

What carries the argument

The generic schema comprising Communities, Accounts, Posts, Actions, and Entities, which unifies all datasets and serves as the base for the anonymization and enrichment features.
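
A minimal sketch of what such a five-entity schema could look like in Python. The paper names the entity types only; every field name, both `standardize_*` functions, and the sample records below are hypothetical illustrations, not SMDT's actual API.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# The five entity types come from the paper; all attributes are assumed.

@dataclass
class Community:
    community_id: str
    platform: str
    name: str

@dataclass
class Account:
    account_id: str
    platform: str
    handle: str

@dataclass
class Post:
    post_id: str
    platform: str
    account_id: str
    text: str
    community_id: Optional[str] = None

@dataclass
class Action:
    action_type: str        # e.g. "repost", "reply", "like"
    source_account_id: str
    target_post_id: str

@dataclass
class Entity:
    post_id: str
    entity_type: str        # e.g. "hashtag", "url", "mention"
    value: str


def standardize_microblog(raw: dict) -> Post:
    """Map a hypothetical microblog-style record onto Post."""
    return Post(post_id=str(raw["id"]), platform="microblog",
                account_id=str(raw["user"]["id"]), text=raw["text"])


def standardize_forum(raw: dict) -> Post:
    """Map a hypothetical forum-style record onto Post."""
    return Post(post_id=raw["name"], platform="forum",
                account_id=raw["author"], text=raw["body"],
                community_id=raw["board"])


def extract_hashtags(post: Post) -> List[Entity]:
    """Platform-agnostic analysis step: runs on any standardized Post."""
    tags = re.findall(r"#(\w+)", post.text)
    return [Entity(post.post_id, "hashtag", tag.lower()) for tag in tags]


microblog_raw = {"id": 101, "user": {"id": 7}, "text": "Polls open today #Election"}
forum_raw = {"name": "t3_abc", "author": "u_42",
             "body": "Megathread for #election night", "board": "politics"}

posts = [standardize_microblog(microblog_raw), standardize_forum(forum_raw)]
hashtags = [e.value for p in posts for e in extract_hashtags(p)]
print(hashtags)  # ['election', 'election']
```

Once both records are `Post` objects, `extract_hashtags` never needs to know which platform they came from. That is the property the schema is meant to provide.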

If this is right

  • Enables application of the same analysis code across datasets from different platforms.
  • Standardizes protection of personally identifiable information.
  • Allows integration of LLM-based features such as stance detection without per-dataset development.
  • Promotes reproducible research by providing open-source code with documentation.
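
The second point, standardized protection of personally identifiable information, could be sketched as below. This is an assumption about one plausible approach (keyed hashing of identifiers plus regex masking of text), not a description of SMDT's anonymization module. The function names, the `mask_mentions` option, and the patterns are hypothetical.

```python
import hashlib
import hmac
import re

# Deliberately simple patterns for illustration; real PII detection needs more care.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
MENTION_RE = re.compile(r"@\w+")


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed, deterministic pseudonym."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def scrub_text(text: str, mask_mentions: bool = True) -> str:
    """Mask e-mail addresses first, then (optionally) @-mentions."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    if mask_mentions:
        text = MENTION_RE.sub("[USER]", text)
    return text


def anonymize_post(post: dict, secret_key: bytes, mask_mentions: bool = True) -> dict:
    """Return a copy of a standardized post with identifiers and text protected."""
    return {
        "post_id": post["post_id"],
        "account_id": pseudonymize(post["account_id"], secret_key),
        "text": scrub_text(post["text"], mask_mentions),
    }


key = b"demo-key-change-me"  # placeholder; a real key must be kept secret
original = {"post_id": "p1", "account_id": "alice",
            "text": "Mail me at alice@example.org or ping @bob"}
anonymized = anonymize_post(original, key)
print(anonymized["text"])  # Mail me at [EMAIL] or ping [USER]
```

Because the pseudonym is deterministic for a given key, the same account maps to the same token across posts, so network structure survives while the raw handle does not.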

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This could simplify combining data from emerging platforms that lack official data access.
  • It may lower the entry barrier for conducting comparative studies across many sites.
  • The schema might be extended to handle new data types as platforms evolve.

Load-bearing premise

That the essential information from any social media platform fits into one fixed set of data categories without losing what is needed for analysis.
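
One way to make this premise checkable is to record which source fields have no slot in the fixed schema. The sketch below assumes a hypothetical forum-style record and a hypothetical `FIELD_MAP`; neither is taken from the paper.

```python
# Hypothetical mapping from source field names to schema slots.
FIELD_MAP = {
    "id": "post_id",
    "author": "account_id",
    "body": "text",
    "board": "community_id",
}


def map_record(raw: dict, field_map: dict):
    """Return the mapped record and the source fields that had no target slot."""
    mapped = {target: raw[source]
              for source, target in field_map.items() if source in raw}
    unmapped = sorted(set(raw) - set(field_map))
    return mapped, unmapped


raw_post = {"id": "t3_abc", "author": "u1", "body": "hello",
            "board": "news", "flair": "Discussion", "gilded": 2}

mapped, unmapped = map_record(raw_post, FIELD_MAP)
coverage = len(mapped) / len(raw_post)

print(unmapped)            # ['flair', 'gilded']
print(round(coverage, 3))  # 0.667
```

Fields that land in `unmapped` are exactly where a fixed schema could lose information a downstream analysis needs.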

What would settle it

Comparing the results of network or text analysis performed on original platform data versus the version processed by the toolkit to check whether key patterns or information are lost.
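
As a rough illustration of such a comparison, the sketch below computes degree sequences from an original and a processed edge list and measures their distance with a two-sample Kolmogorov-Smirnov statistic. The edge lists are invented toy data. Only the statistic is computed here; `scipy.stats.ks_2samp` would give a p-value as well.

```python
from collections import Counter


def degree_sequence(edges):
    """Undirected degree of every node that appears in the edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(degree.values())


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    points = sorted(set(sample_a) | set(sample_b))
    n_a, n_b = len(sample_a), len(sample_b)
    max_gap = 0.0
    for x in points:
        cdf_a = sum(1 for value in sample_a if value <= x) / n_a
        cdf_b = sum(1 for value in sample_b if value <= x) / n_b
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Toy data: the processed version has lost the edge ("a", "d").
original_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
processed_edges = [("a", "b"), ("a", "c"), ("b", "c")]

gap = ks_statistic(degree_sequence(original_edges),
                   degree_sequence(processed_edges))
print(gap)  # 0.25
```

A gap of 0 means the degree distribution is unchanged by processing. Larger values flag structural loss worth inspecting.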

Figures

Figures reproduced from arXiv: 2604.27710 by Ali Najafi, Letizia Iannucci, Mikko Kivelä, Onur Varol.

Figure 1. Schematic of the SMDT platform. The platform consists of distinct modules offering different capabilities. The Standardizer unifies data coming from diverse social media platforms into one shared taxonomy. To distribute data for additional analysis, we anonymize the sensitive entries and store them for further analysis. We offer different extensions to enrich datasets by inferring account- and content-level measures such a…

Figure 2. Daily sentiment and post volume surrounding key election events, with corresponding hourly sentiment trends.

Figure 3. Hashtag and domain analyses between Twitter and TruthSocial. A) Hashtag toxicity comparison. Each point represents a hashtag appearing at least twice on both platforms. The x-axis shows the mean toxicity score on Twitter, and the y-axis shows the mean score on TruthSocial. Color indicates the absolute difference in toxicity between the two platforms. B) Domain popularity and bias. Each point represents a domai…

Figure 4. Relationship between local clustering coefficient and node degree. The analysis covers four different datasets. High-level hashtags for each context are identified and highlighted in the figures.
Original abstract

The rapid diversification of social media platforms and the increasing restrictions on official APIs have significantly complicated cross-platform analysis. Researchers are often forced to rely on heterogeneous datasets obtained through web scraping and historical archives; however, these datasets often lack structural consistency. Prior to conducting cross-platform social media analyses, one needs to answer three critical questions: (1) What makes platforms different and similar? (2) How were the datasets collected? (3) How can we align the datasets of different platforms to conduct fair analyses? To address these questions, we introduce the Social Media Data Toolkit (SMDT), a comprehensive Python framework designed for the standardization, anonymization, and enrichment of social network datasets. SMDT unifies diverse data structures into a generic schema comprising Communities, Accounts, Posts, Actions, and Entities to facilitate multi-platform research. The framework features a configurable anonymization module to secure Personally Identifiable Information (PII) and an extendable enrichment layer that integrates Large Language Models (LLMs) and network analysis tools for downstream tasks such as stance detection and toxicity scoring without creating a separate codebase for each dataset. We demonstrate the versatility of SMDT through four case studies spanning from textual analysis of content to network analysis across platforms. To support reproducible social media research, SMDT is released as an open-source tool featuring detailed documentation and practical guides for researchers at any skill level. It can be accessed at github.com/ViralLab/SMDT and varollab.com/SMDT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces the Social Media Data Toolkit (SMDT), a comprehensive Python framework for the standardization, anonymization, and enrichment of social network datasets. It unifies heterogeneous data structures from various social media platforms into a generic schema comprising five entity types: Communities, Accounts, Posts, Actions, and Entities. The framework includes a configurable module for anonymizing personally identifiable information and an extendable enrichment layer that incorporates large language models and network analysis tools to support downstream tasks such as stance detection and toxicity scoring. The authors demonstrate the toolkit's versatility through four case studies involving textual and network analyses across platforms and release it as an open-source tool with documentation to promote reproducible research.

Significance. If the proposed generic schema effectively preserves the essential features of diverse social media datasets without significant information loss or distortion, the SMDT could provide a valuable standardized approach for cross-platform social media analysis, particularly in light of increasing API restrictions. The open-source release, detailed documentation, and practical guides represent a strength that enhances accessibility for researchers. However, the current presentation relies on descriptive case studies without quantitative benchmarks, which limits the ability to fully assess its impact on maintaining data fidelity for complex analyses.

major comments (1)
  1. [Case Studies section] The four case studies illustrate application of the standardization, anonymization, and enrichment processes across platforms but contain no quantitative validation, such as pre/post-standardization comparisons of information retention (e.g., unique fields like retweet graphs or subreddit hierarchies), changes in derived metrics (e.g., degree distributions or content embeddings), or error analysis. This is load-bearing for the central claim that the five-entity schema unifies heterogeneous datasets without material loss for downstream tasks.
minor comments (1)
  1. [Abstract] Consider briefly naming the specific platforms and analysis types used in the four case studies to more concretely convey the toolkit's scope and versatility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that the Case Studies section would benefit from quantitative validation to more rigorously support the claim that the five-entity schema unifies datasets with minimal material loss. Below we outline our planned revisions to address this point directly.

Point-by-point responses
  1. Referee: The four case studies illustrate application of the standardization, anonymization, and enrichment processes across platforms but contain no quantitative validation, such as pre/post-standardization comparisons of information retention (e.g., unique fields like retweet graphs or subreddit hierarchies), changes in derived metrics (e.g., degree distributions or content embeddings), or error analysis. This is load-bearing for the central claim that the five-entity schema unifies heterogeneous datasets without material loss for downstream tasks.

    Authors: We appreciate the referee identifying this as a load-bearing element. The case studies were designed to showcase practical versatility across textual and network analyses on multiple platforms, but we concur that illustrative examples alone are insufficient to quantify fidelity. In the revised manuscript we will augment the Case Studies section with quantitative metrics. For each of the four studies we will add: (1) pre/post-standardization retention statistics, including counts of preserved unique fields (e.g., retweet edges, subreddit hierarchies, post metadata) and overall entity coverage rates; (2) comparisons of derived network metrics such as degree distributions and clustering coefficients before and after schema mapping, reported via tables and Kolmogorov-Smirnov tests where distributions differ; (3) embedding similarity scores (cosine) for content representations pre- and post-enrichment, plus error rates for any unmapped fields or anonymization-induced utility loss. These additions will be accompanied by new tables, figures, and a brief error-analysis subsection. We believe the expanded evidence will substantiate the schema’s utility for downstream tasks while preserving the original demonstration focus.

    revision: yes
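
Two of the metrics named in this response, edge retention and cosine similarity of content embeddings, are simple to state precisely. The sketch below uses invented toy edges and vectors and hypothetical function names. It illustrates the metrics only and does not reproduce any analysis from the paper or its planned revision.

```python
import math


def retention_rate(original_items, processed_items):
    """Fraction of original items still present after processing."""
    original = set(original_items)
    if not original:
        return 1.0  # nothing to lose
    return len(original & set(processed_items)) / len(original)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors
    return dot / (norm_u * norm_v)


# Toy retweet edges before and after schema mapping (one edge is lost).
edges_before = {("u1", "u2"), ("u2", "u3"), ("u3", "u1"), ("u4", "u1")}
edges_after = {("u1", "u2"), ("u2", "u3"), ("u3", "u1")}
edge_retention = retention_rate(edges_before, edges_after)

# Toy content embeddings for the same post before and after processing.
embedding_before = [1.0, 0.0, 1.0]
embedding_after = [1.0, 1.0, 1.0]
similarity = cosine_similarity(embedding_before, embedding_after)

print(edge_retention)        # 0.75
print(round(similarity, 4))  # 0.8165
```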

Circularity Check

0 steps flagged

No circularity: software framework with design choices, not derivations or fitted predictions

full rationale

The paper presents a Python toolkit (SMDT) for standardizing heterogeneous social media datasets into a five-entity generic schema (Communities, Accounts, Posts, Actions, Entities), plus anonymization and LLM enrichment modules. No equations, predictions, or first-principles derivations appear anywhere in the manuscript. The unification claim is an explicit design decision justified by the need for cross-platform consistency, not a result that reduces to its own inputs by construction. Case studies demonstrate usage but contain no pre/post quantitative validation that would require fitted parameters. No self-citations are load-bearing for any central claim, and the work is self-contained as an open-source artifact rather than a mathematical result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

As a software-tool paper there are no free parameters, mathematical axioms, or externally validated invented entities. The five-component schema is an author-designed abstraction whose adequacy rests on the untested assumption that it captures all necessary platform features.

invented entities (1)
  • Generic schema (Communities, Accounts, Posts, Actions, Entities) · no independent evidence
    purpose: To unify heterogeneous social-media data structures into one common representation
    The schema is introduced by the authors as the core of the toolkit; no independent evidence of its completeness or superiority is supplied beyond the claim that it facilitates multi-platform research.

pith-pipeline@v0.9.0 · 5580 in / 1305 out tokens · 83970 ms · 2026-05-07T06:26:15.458264+00:00 · methodology

