
arxiv: 2605.20435 · v1 · pith:6AZVT4YVnew · submitted 2026-05-19 · 💻 cs.SI · cs.CL

Hiding in Plain Sight: Finding MAHA on Reddit

Pith reviewed 2026-05-21 06:30 UTC · model grok-4.3

classification 💻 cs.SI cs.CL
keywords MAHA · Make America Healthy Again · Reddit dataset · health beliefs · social media · online movements · belief dynamics · thematic collection

The pith

A six-year Reddit dataset of 19.4 million posts supplies the raw discussions around 12 Make America Healthy Again beliefs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a dataset drawn from Reddit covering 2020 through 2025 that includes 19.4 million posts by 4 million users. The posts are assembled to preserve the everyday language and surrounding topics for twelve specific health-related beliefs associated with the MAHA movement. This structured collection turns scattered social media text into a usable resource for examining how those beliefs spread, how the movement is organized, and how its supporters express themselves. A sympathetic reader would see the value in having ready-made digital traces that already embed the natural context instead of having to build such a collection from scratch.

Core claim

We introduce a Reddit dataset that spans six years (2020-2025), comprising 19.4M posts from 4M users. Containing the natural and thematic context of 12 MAHA-aligned beliefs, this dataset offers researchers from various domains the opportunity to study the dynamics of the MAHA movement, its structural and functional components, and the linguistic and behavioral patterns of its proponents.

What carries the argument

The dataset construction process that selects and organizes Reddit posts to retain the natural and thematic context of twelve MAHA-aligned beliefs.

If this is right

  • Researchers gain the ability to track the growth and spread of MAHA beliefs over a six-year period using fine-grained post data.
  • Analysis of the movement's structural and functional components becomes possible through the thematic grouping of posts.
  • Linguistic and behavioral patterns among MAHA proponents can be measured directly from the collected text and user activity.
  • Cross-domain studies of health-related online movements are supported by the ready availability of this large, themed collection.
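As a rough illustration of the first point, tracking growth over time reduces to counting posts per theme per month. The sketch below is a minimal example under stated assumptions: the record keys `theme` and `created_utc` and the theme names are hypothetical placeholders, not the released dataset's documented schema.

```python
from collections import Counter
from datetime import datetime, timezone


def monthly_theme_counts(posts):
    """Count posts per (theme, 'YYYY-MM') bucket.

    `posts` is an iterable of dicts with the hypothetical keys
    'theme' (str) and 'created_utc' (Unix seconds, UTC).
    """
    counts = Counter()
    for post in posts:
        created = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)
        counts[(post["theme"], created.strftime("%Y-%m"))] += 1
    return counts


# Toy records for illustration only; they are not drawn from the dataset.
toy_posts = [
    {"theme": "vaccination", "created_utc": 1577836800},   # 2020-01-01
    {"theme": "vaccination", "created_utc": 1578000000},   # 2020-01-02
    {"theme": "vaccination", "created_utc": 1580515200},   # 2020-02-01
    {"theme": "organic_food", "created_utc": 1577923200},  # 2020-01-02
]
toy_counts = monthly_theme_counts(toy_posts)
```

Sorting the resulting keys by month gives a per-theme time series that can be plotted or compared across themes.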

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same collection approach could be applied to other health or political belief sets to compare how different movements appear in the same platform data.
  • Long-term patterns in the dataset might show whether mainstream diet advice and more contested topics reinforce each other within the same user groups.
  • The resource could serve as a baseline for testing whether belief adoption on Reddit follows similar timing or network structures as on other platforms.
  • Future extensions might add timestamps or user-interaction graphs to enable direct studies of information flow within the movement.

Load-bearing premise

The collection and filtering steps correctly identify posts that genuinely reflect MAHA-aligned beliefs without large amounts of selection bias or incorrect labeling.

What would settle it

An independent review of a random sample of several hundred posts from the released dataset, showing that most either do not discuss the claimed MAHA themes or are clearly mislabeled.
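One way such an audit could be run is sketched below: draw a reproducible random sample of post IDs, have readers mark each post as on-theme or not, and put a confidence interval around the on-theme share. The function names, the sample size of 300, and the count of 120 on-theme posts are illustrative assumptions, not figures from the paper.

```python
import math
import random


def draw_audit_sample(post_ids, k, seed=0):
    """Draw k post IDs uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    return rng.sample(list(post_ids), k)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical audit: 300 sampled posts, 120 judged on-theme by reviewers.
audit_ids = draw_audit_sample(range(10_000), k=300)
low, high = wilson_interval(successes=120, n=300)
```

If the upper end of the interval stays below 0.5, as it does for these made-up counts, the audit would support the claim that most sampled posts are off-theme.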

Figures

Figures reproduced from arXiv: 2605.20435 by Henry Kautz, Sabit Ahmed, Subigya Nepal.

Figure 1: Data collection and stance-labeling pipeline with worked examples for the …

Figure 2: User stance distribution on 12 themes.

From Section 3 (Applications): The released dataset enables a broad range of analyses on multi-theme belief communities. Per-theme stance labels and per-user aggregate scores support sentiment analysis of how MAHA-aligned and mainstream users frame health topics, both within and across themes. The coexistence of mainstream and contested themes in a single dataset enables study of po…
Original abstract

Make America Healthy Again (MAHA) is a national health movement that encompasses a striking mix of beliefs, from broadly accepted concerns about good diet and exercise to controversial takes on organic and genetically modified food, childhood vaccination, science, and institutions. Various influencers and promoters of the MAHA movement on social media are scattered throughout the online space. Investigating the structure, discourse, and contagion of MAHA beliefs requires large-scale fine-grained digital footprints. Constructing structured data covering different MAHA themes from vast unstructured social media data is challenging. We introduce a Reddit dataset that spans six years (2020-2025), comprising 19.4M posts from 4M users. Containing the natural and thematic context of 12 MAHA-aligned beliefs, this dataset offers researchers from various domains the opportunity to study the dynamics of the MAHA movement, its structural and functional components, and the linguistic and behavioral patterns of its proponents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a Reddit dataset spanning 2020-2025 with 19.4M posts from 4M users that is claimed to embed the natural thematic context of 12 MAHA-aligned beliefs, enabling research on the movement's dynamics, structure, linguistic patterns, and contagion.

Significance. If the collection and labeling pipeline is shown to be reliable, the dataset's scale and multi-year coverage would offer a useful resource for studying online health-related belief systems and social media dynamics in social informatics and public health research.

major comments (2)
  1. [Data collection section] The manuscript provides no validation metrics (e.g., precision, recall, or inter-rater agreement) for the subreddit selection, keyword filtering, or labeling process used to associate posts with the 12 MAHA-aligned beliefs; this directly affects whether the dataset genuinely captures thematic context rather than incidental matches.
  2. [Dataset statistics section] Dataset statistics and coverage: Aggregate figures (19.4M posts, 4M users) are given but no per-belief or per-subreddit breakdowns or contamination estimates are reported, leaving the claim of balanced thematic coverage unverified.
minor comments (1)
  1. [Abstract and Introduction] The abstract and introduction use 'MAHA' before fully defining the movement's scope; a brief parenthetical expansion on first use would improve readability.
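The breakdowns requested in major comment 2 could be tabulated along the following lines. This is a minimal sketch, assuming hypothetical record keys (`belief`, `subreddit`, `author`) and assuming that "contamination" means the share of manually audited posts judged off-theme; neither the schema nor that definition is confirmed by the manuscript.

```python
from collections import Counter, defaultdict


def per_belief_summary(posts, audited):
    """Summarise post counts, distinct users, subreddit mix and contamination.

    posts:   iterable of dicts with hypothetical keys 'belief', 'subreddit', 'author'
    audited: iterable of (belief, is_on_theme) pairs from a manual review
    """
    post_counts = Counter()
    users = defaultdict(set)
    subreddits = defaultdict(Counter)
    for post in posts:
        belief = post["belief"]
        post_counts[belief] += 1
        users[belief].add(post["author"])
        subreddits[belief][post["subreddit"]] += 1

    reviewed = Counter()
    off_theme = Counter()
    for belief, is_on_theme in audited:
        reviewed[belief] += 1
        if not is_on_theme:
            off_theme[belief] += 1

    summary = {}
    for belief, n_posts in post_counts.items():
        n_reviewed = reviewed[belief]
        summary[belief] = {
            "posts": n_posts,
            "users": len(users[belief]),
            "subreddits": dict(subreddits[belief]),
            # None when no audited posts exist for this belief.
            "contamination": off_theme[belief] / n_reviewed if n_reviewed else None,
        }
    return summary


# Toy inputs for illustration only; names and values are made up.
toy_posts = [
    {"belief": "vaccination", "subreddit": "r/example_a", "author": "u1"},
    {"belief": "vaccination", "subreddit": "r/example_a", "author": "u2"},
    {"belief": "vaccination", "subreddit": "r/example_b", "author": "u1"},
    {"belief": "organic_food", "subreddit": "r/example_b", "author": "u3"},
]
toy_audit = [
    ("vaccination", True), ("vaccination", False),
    ("vaccination", True), ("vaccination", True),
    ("organic_food", True),
]
toy_summary = per_belief_summary(toy_posts, toy_audit)
```

Reporting the audited sample size next to each contamination value would let readers judge how much weight each estimate can bear.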

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The feedback correctly identifies gaps in validation and detailed statistics for the dataset construction. We address each point below and will incorporate revisions to strengthen the paper.

  1. Referee: [Data collection section] The manuscript provides no validation metrics (e.g., precision, recall, or inter-rater agreement) for the subreddit selection, keyword filtering, or labeling process used to associate posts with the 12 MAHA-aligned beliefs; this directly affects whether the dataset genuinely captures thematic context rather than incidental matches.

    Authors: We acknowledge this limitation in the current version. The Data Collection section describes the subreddit list and keyword sets derived from domain knowledge of MAHA beliefs, but does not report quantitative validation. In the revised manuscript we will add a dedicated validation subsection reporting precision and recall from manual review of a stratified sample of 1,000 posts per belief, along with Cohen's kappa for inter-rater agreement among three annotators. This will directly address whether the pipeline captures thematic context beyond incidental keyword matches.

    Revision: yes

  2. Referee: [Dataset statistics section] Aggregate figures (19.4M posts, 4M users) are given but no per-belief or per-subreddit breakdowns or contamination estimates are reported, leaving the claim of balanced thematic coverage unverified.

    Authors: We agree that aggregate numbers alone leave the coverage claim unverified. The revised manuscript will include new tables and figures with exact post and user counts broken down by each of the 12 beliefs and by the primary subreddits. We will also report contamination estimates (false-positive rates) obtained from the same manual annotation study described in response to the first comment, allowing readers to assess the balance and purity of the thematic coverage.

    Revision: yes
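The agreement statistic promised in the first response can be sketched as follows. Cohen's kappa is defined for two raters, so with three annotators one common option is to average it over the three annotator pairs; Fleiss' kappa is an alternative. The averaging choice, the function names, and the toy labels below are assumptions made for illustration, not the authors' stated procedure.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(ratings):
    """Average Cohen's kappa over all annotator pairs.

    ratings: dict mapping annotator name -> list of labels (same item order)
    """
    pairs = list(combinations(sorted(ratings), 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    return sum(cohen_kappa(ratings[a], ratings[b]) for a, b in pairs) / len(pairs)


# Toy labels for four posts from three hypothetical annotators.
toy_ratings = {
    "ann1": ["on", "on", "off", "off"],
    "ann2": ["on", "off", "off", "off"],
    "ann3": ["on", "on", "off", "off"],
}
toy_kappa = mean_pairwise_kappa(toy_ratings)
```

Reporting the three pairwise values alongside their mean would show whether disagreement is concentrated in one annotator.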

Circularity Check

0 steps flagged

No circularity: data resource paper with no derivations, predictions or fitted parameters


This is a dataset introduction paper whose central claim is the release of a 19.4M-post Reddit corpus spanning 2020-2025 that captures the thematic context of 12 MAHA-aligned beliefs. No equations, parameter fitting, predictions, or self-citation chains appear in the provided abstract or described structure. The collection and filtering pipeline is presented as a methodological contribution rather than a derived quantity that reduces to its own inputs. Because the work contains no load-bearing derivations or self-referential predictions, none of the enumerated circularity patterns apply.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that social media posts can be systematically collected and thematically organized to represent a specific belief system like MAHA without major distortions from platform algorithms or user self-selection.

axioms (1)
  • Domain assumption: Reddit posts can be filtered and labeled to accurately capture the natural context of 12 specific MAHA-aligned beliefs.
    This underpins the claim that the dataset contains the thematic context of those beliefs.



