Hiding in Plain Sight: Finding MAHA on Reddit

Henry Kautz; Sabit Ahmed; Subigya Nepal

arxiv: 2605.20435 · v1 · pith:6AZVT4YVnew · submitted 2026-05-19 · 💻 cs.SI · cs.CL

Pith reviewed 2026-05-21 06:30 UTC · model grok-4.3

keywords: MAHA, Make America Healthy Again, Reddit dataset, health beliefs, social media, online movements, belief dynamics, thematic collection

The pith

A six-year Reddit dataset of 19.4 million posts supplies the raw discussions around 12 Make America Healthy Again beliefs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a dataset drawn from Reddit covering 2020 through 2025 that includes 19.4 million posts by 4 million users. The posts are assembled to preserve the everyday language and surrounding topics for twelve specific health-related beliefs associated with the MAHA movement. This structured collection turns scattered social media text into a usable resource for examining how those beliefs spread, how the movement is organized, and how its supporters express themselves. A sympathetic reader would see the value in having ready-made digital traces that already embed the natural context instead of having to build such a collection from scratch.

Core claim

We introduce a Reddit dataset that spans six years (2020-2025), comprising 19.4M posts from 4M users. Containing the natural and thematic context of 12 MAHA-aligned beliefs, this dataset offers researchers from various domains the opportunity to study the dynamics of the MAHA movement, its structural and functional components, and the linguistic and behavioral patterns of its proponents.

What carries the argument

The dataset construction process that selects and organizes Reddit posts to retain the natural and thematic context of twelve MAHA-aligned beliefs.

If this is right

Researchers gain the ability to track the growth and spread of MAHA beliefs over a six-year period using fine-grained post data.
Analysis of the movement's structural and functional components becomes possible through the thematic grouping of posts.
Linguistic and behavioral patterns among MAHA proponents can be measured directly from the collected text and user activity.
Cross-domain studies of health-related online movements are supported by the ready availability of this large, themed collection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same collection approach could be applied to other health or political belief sets to compare how different movements appear in the same platform data.
Long-term patterns in the dataset might show whether mainstream diet advice and more contested topics reinforce each other within the same user groups.
The resource could serve as a baseline for testing whether belief adoption on Reddit follows similar timing or network structures as on other platforms.
Future extensions might add timestamps or user-interaction graphs to enable direct studies of information flow within the movement.

Load-bearing premise

The collection and filtering steps correctly identify posts that genuinely reflect MAHA-aligned beliefs without large amounts of selection bias or incorrect labeling.

What would settle it

A random sample of several hundred posts from the released dataset reviewed by independent readers showing that most do not discuss the claimed MAHA themes or contain clear mislabeling.
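A minimal sketch of how such an audit might be scored, assuming reviewers return a simple on-theme or off-theme judgment for each sampled post. The function names, the sample size of 400, and the count of 312 on-theme posts are invented for illustration and are not results from the paper. The Wilson score interval is one common choice for a proportion estimate; the paper does not specify it.

```python
import math
import random


def draw_audit_sample(post_ids, k, seed=0):
    """Draw k distinct post IDs uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(post_ids, k)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical audit: 400 posts sampled from a stand-in ID list,
# 312 of them judged on-theme by the reviewers (made-up number).
sample = draw_audit_sample(list(range(100_000)), 400)
on_theme = 312
low, high = wilson_interval(on_theme, len(sample))

# The premise would be undermined if even the upper bound sits below one half,
# i.e. "most" sampled posts do not discuss the claimed themes.
claim_undermined = high < 0.5
print(f"on-theme share: {on_theme / len(sample):.3f}, interval: ({low:.3f}, {high:.3f})")
print("premise undermined:", claim_undermined)
```

Under this reading, the thematic claim would be in trouble only if the whole interval fell below one half; an interval that straddles one half would call for a larger sample rather than a verdict.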

Figures

Figures reproduced from arXiv: 2605.20435 by Henry Kautz, Sabit Ahmed, Subigya Nepal.

**Figure 2.** User stance distribution on 12 themes.

Excerpt from Section 3 (Applications): The released dataset enables a broad range of analyses on multi-theme belief communities. Per-theme stance labels and per-user aggregate scores support sentiment analysis of how MAHA-aligned and mainstream users frame health topics, both within and across themes. The coexistence of mainstream and contested themes in a single dataset enables study of po…

Original abstract

Make America Healthy Again (MAHA) is a national health movement that encompasses a striking mix of beliefs, from broadly accepted concerns about good diet and exercise to controversial takes on organic and genetically modified food, childhood vaccination, science, and institutions. Various influencers and promoters of the MAHA movement on social media are scattered throughout the online space. Investigating the structure, discourse, and contagion of MAHA beliefs requires large-scale fine-grained digital footprints. Constructing structured data covering different MAHA themes from vast unstructured social media data is challenging. We introduce a Reddit dataset that spans six years (2020-2025), comprising 19.4M posts from 4M users. Containing the natural and thematic context of 12 MAHA-aligned beliefs, this dataset offers researchers from various domains the opportunity to study the dynamics of the MAHA movement, its structural and functional components, and the linguistic and behavioral patterns of its proponents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A sizable new Reddit dataset on MAHA beliefs, but the abstract gives no details on how the posts were collected or checked.

Colleague, this paper's main offering is a dataset of 19.4 million Reddit posts from 2020 to 2025, drawn from 4 million users and organized around 12 MAHA-aligned beliefs. The authors position it as a resource for studying the movement's dynamics and patterns.

What stands out is the scale and the thematic focus. Collecting that volume of data over six years on a specific set of health and institutional beliefs is not trivial, and it could open doors for analyses that smaller or less targeted collections can't support. Public health researchers tracking vaccine skepticism or diet trends might find raw material here.

The soft spot is the complete absence of information on construction. The abstract claims the posts contain the natural context of those beliefs, but there's no description of subreddit selection, keyword lists, machine learning filters, or any validation against manual labels. Without that, it's difficult to gauge selection bias or how much off-topic content slipped in. If the process relies on simple heuristics, the utility drops quickly. The citation pattern looks light so far, which is normal for a new dataset paper, but it would help to see comparisons to existing Reddit health datasets to clarify what's distinct.

This is for computational social science groups or public health teams that need large social media corpora for belief contagion studies. A reader willing to invest time in their own validation could get value from it.

I would send this to peer review. The core idea is straightforward and the data volume is real, so referees could push for better documentation of the pipeline and perhaps some basic stats on the distribution across the 12 beliefs.

Referee Report

2 major / 1 minor

Summary. The paper introduces a Reddit dataset spanning 2020-2025 with 19.4M posts from 4M users that is claimed to embed the natural thematic context of 12 MAHA-aligned beliefs, enabling research on the movement's dynamics, structure, linguistic patterns, and contagion.

Significance. If the collection and labeling pipeline is shown to be reliable, the dataset's scale and multi-year coverage would offer a useful resource for studying online health-related belief systems and social media dynamics in social informatics and public health research.

major comments (2)

[Data collection section] The manuscript provides no validation metrics (e.g., precision, recall, or inter-rater agreement) for the subreddit selection, keyword filtering, or labeling process used to associate posts with the 12 MAHA-aligned beliefs; this directly affects whether the dataset genuinely captures thematic context rather than incidental matches.
[Dataset statistics section] Dataset statistics and coverage: Aggregate figures (19.4M posts, 4M users) are given but no per-belief or per-subreddit breakdowns or contamination estimates are reported, leaving the claim of balanced thematic coverage unverified.
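The breakdown requested in the second comment could be produced along these lines. This is only a sketch: the record fields (`theme`, `subreddit`, `author`), the theme labels, and the subreddit names are hypothetical placeholders, since the released dataset's schema is not described on this page.

```python
from collections import defaultdict


def breakdown(posts, key):
    """Count posts and distinct authors for each value of `key`."""
    post_counts = defaultdict(int)
    authors = defaultdict(set)
    for post in posts:
        group = post[key]
        post_counts[group] += 1
        authors[group].add(post["author"])
    return {
        group: {"posts": post_counts[group], "users": len(authors[group])}
        for group in post_counts
    }


# Toy records standing in for dataset rows (all values are made up).
toy_posts = [
    {"theme": "childhood vaccination", "subreddit": "r/example_a", "author": "u1"},
    {"theme": "childhood vaccination", "subreddit": "r/example_b", "author": "u2"},
    {"theme": "diet", "subreddit": "r/example_a", "author": "u1"},
    {"theme": "childhood vaccination", "subreddit": "r/example_a", "author": "u1"},
]

by_theme = breakdown(toy_posts, "theme")
by_subreddit = breakdown(toy_posts, "subreddit")

for theme, stats in sorted(by_theme.items()):
    print(theme, stats)
```

Reporting distinct users alongside post counts matters here because a theme dominated by a handful of prolific accounts would look well covered on post volume alone.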

minor comments (1)

[Abstract and Introduction] The abstract and introduction use 'MAHA' before fully defining the movement's scope; a brief parenthetical expansion on first use would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The feedback correctly identifies gaps in validation and detailed statistics for the dataset construction. We address each point below and will incorporate revisions to strengthen the paper.

Referee: [Data collection section] The manuscript provides no validation metrics (e.g., precision, recall, or inter-rater agreement) for the subreddit selection, keyword filtering, or labeling process used to associate posts with the 12 MAHA-aligned beliefs; this directly affects whether the dataset genuinely captures thematic context rather than incidental matches.

Authors: We acknowledge this limitation in the current version. The Data Collection section describes the subreddit list and keyword sets derived from domain knowledge of MAHA beliefs, but does not report quantitative validation. In the revised manuscript we will add a dedicated validation subsection reporting precision and recall from manual review of a stratified sample of 1,000 posts per belief, along with Cohen's kappa for inter-rater agreement among three annotators. This will directly address whether the pipeline captures thematic context beyond incidental keyword matches.

Revision: yes

Referee: [Dataset statistics section] Aggregate figures (19.4M posts, 4M users) are given but no per-belief or per-subreddit breakdowns or contamination estimates are reported, leaving the claim of balanced thematic coverage unverified.

Authors: We agree that aggregate numbers alone leave the coverage claim unverified. The revised manuscript will include new tables and figures with exact post and user counts broken down by each of the 12 beliefs and by the primary subreddits. We will also report contamination estimates (false-positive rates) obtained from the same manual annotation study described in response to the first comment, allowing readers to assess the balance and purity of the thematic coverage.

Revision: yes
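The validation measures named in these responses can be sketched as follows. The labels are toy values, not annotation results, and the function names are illustrative. Two caveats apply: Cohen's kappa is defined for a pair of raters, so with three annotators one would report it pairwise (or switch to a multi-rater statistic such as Fleiss' kappa), and recall can only be estimated if the manual sample also includes posts the pipeline rejected.

```python
from collections import Counter


def precision_recall(gold, predicted):
    """Precision and recall of binary predictions against binary gold labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n**2
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels for eight posts: 1 = on-theme, 0 = off-theme (made up).
annotator_a = [1, 1, 1, 0, 0, 1, 0, 1]
annotator_b = [1, 1, 0, 0, 0, 1, 0, 1]
pipeline = [1, 1, 1, 1, 0, 1, 0, 0]

kappa = cohen_kappa(annotator_a, annotator_b)
precision, recall = precision_recall(annotator_a, pipeline)
# Share of pipeline-accepted posts judged off-theme; one way to read
# the "contamination" figure mentioned in the second response.
contamination = 1 - precision
print(f"kappa={kappa:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Treating one annotator as gold, as done here, is a simplification; a majority vote across annotators would be a natural alternative once more than two are involved.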

Circularity Check

0 steps flagged

No circularity: data resource paper with no derivations, predictions or fitted parameters

full rationale

This is a dataset introduction paper whose central claim is the release of a 19.4M-post Reddit corpus spanning 2020-2025 that captures the thematic context of 12 MAHA-aligned beliefs. No equations, parameter fitting, predictions, or self-citation chains appear in the provided abstract or described structure. The collection and filtering pipeline is presented as a methodological contribution rather than a derived quantity that reduces to its own inputs. Because the work contains no load-bearing derivations or self-referential predictions, none of the enumerated circularity patterns apply.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on the domain assumption that social media posts can be systematically collected and thematically organized to represent a specific belief system like MAHA without major distortions from platform algorithms or user self-selection.

axioms (1)

Domain assumption: Reddit posts can be filtered and labeled to accurately capture the natural context of 12 specific MAHA-aligned beliefs. This underpins the claim that the dataset contains the thematic context of those beliefs.

pith-pipeline@v0.9.0 · 5686 in / 1353 out tokens · 45059 ms · 2026-05-21T06:30:48.065942+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

We introduce a Reddit dataset ... 2-stage keyword search ... logistic regression classifier ... tree-based few-shot LLM stance classifier

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
