Hiding in Plain Sight: Finding MAHA on Reddit
Pith reviewed 2026-05-21 06:30 UTC · model grok-4.3
The pith
A six-year Reddit dataset of 19.4 million posts supplies the raw discussions around 12 Make America Healthy Again beliefs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a Reddit dataset that spans six years (2020-2025), comprising 19.4M posts from 4M users. Containing the natural and thematic context of 12 MAHA-aligned beliefs, this dataset offers researchers from various domains the opportunity to study the dynamics of the MAHA movement, its structural and functional components, and the linguistic and behavioral patterns of its proponents.
What carries the argument
The dataset construction process that selects and organizes Reddit posts to retain the natural and thematic context of twelve MAHA-aligned beliefs.
If this is right
- Researchers gain the ability to track the growth and spread of MAHA beliefs over a six-year period using fine-grained post data.
- Analysis of the movement's structural and functional components becomes possible through the thematic grouping of posts.
- Linguistic and behavioral patterns among MAHA proponents can be measured directly from the collected text and user activity.
- Cross-domain studies of health-related online movements are supported by the ready availability of this large, themed collection.
Where Pith is reading between the lines
- The same collection approach could be applied to other health or political belief sets to compare how different movements appear in the same platform data.
- Long-term patterns in the dataset might show whether mainstream diet advice and more contested topics reinforce each other within the same user groups.
- The resource could serve as a baseline for testing whether belief adoption on Reddit follows similar timing or network structures as on other platforms.
- Future extensions might add timestamps or user-interaction graphs to enable direct studies of information flow within the movement.
Load-bearing premise
The collection and filtering steps correctly identify posts that genuinely reflect MAHA-aligned beliefs without large amounts of selection bias or incorrect labeling.
What would settle it
Independent readers reviewing a random sample of several hundred posts from the released dataset and finding that most either do not discuss the claimed MAHA themes or are clearly mislabeled.
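A minimal sketch of how such an audit could be scored, assuming reviewers mark each sampled post as on-topic or not. The post IDs, the sample size of 400, the on-topic count, and the 50% threshold are all hypothetical values chosen for illustration; none of them come from the paper. The Wilson score interval is one standard way to put an uncertainty band around the observed share.

```python
import math
import random


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def draw_audit_sample(post_ids: list[str], k: int, seed: int = 0) -> list[str]:
    """Draw k distinct post IDs uniformly at random, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(post_ids, k)


# Hypothetical stand-in for the released dataset's post IDs.
post_ids = [f"post_{i}" for i in range(10_000)]
sample = draw_audit_sample(post_ids, k=400)

# Hypothetical audit outcome: reviewers judged 120 of the 400 posts on-topic.
on_topic = 120
low, high = wilson_interval(on_topic, len(sample))

# The premise would look shaky if even the upper bound sits below one half.
premise_fails = high < 0.5
```

With these made-up numbers the interval lies well below 0.5, which is the pattern that would undercut the load-bearing premise; a real audit would substitute the reviewers' actual counts.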
Original abstract
Make America Healthy Again (MAHA) is a national health movement that encompasses a striking mix of beliefs, from broadly accepted concerns about good diet and exercise to controversial takes on organic and genetically modified food, childhood vaccination, science, and institutions. Various influencers and promoters of the MAHA movement on social media are scattered throughout the online space. Investigating the structure, discourse, and contagion of MAHA beliefs requires large-scale fine-grained digital footprints. Constructing structured data covering different MAHA themes from vast unstructured social media data is challenging. We introduce a Reddit dataset that spans six years (2020-2025), comprising 19.4M posts from 4M users. Containing the natural and thematic context of 12 MAHA-aligned beliefs, this dataset offers researchers from various domains the opportunity to study the dynamics of the MAHA movement, its structural and functional components, and the linguistic and behavioral patterns of its proponents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a Reddit dataset spanning 2020-2025 with 19.4M posts from 4M users that is claimed to embed the natural thematic context of 12 MAHA-aligned beliefs, enabling research on the movement's dynamics, structure, linguistic patterns, and contagion.
Significance. If the collection and labeling pipeline is shown to be reliable, the dataset's scale and multi-year coverage would offer a useful resource for studying online health-related belief systems and social media dynamics in social informatics and public health research.
major comments (2)
- [Data collection section] The manuscript provides no validation metrics (e.g., precision, recall, or inter-rater agreement) for the subreddit selection, keyword filtering, or labeling process used to associate posts with the 12 MAHA-aligned beliefs; this directly affects whether the dataset genuinely captures thematic context rather than incidental matches.
- [Dataset statistics section] Aggregate figures (19.4M posts, 4M users) are given but no per-belief or per-subreddit breakdowns or contamination estimates are reported, leaving the claim of balanced thematic coverage unverified.
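The breakdowns and contamination estimate asked for in the second major comment could be tabulated along these lines. This is only a sketch: the record fields (`subreddit`, `belief`, `author`), the placeholder subreddit and belief names, and the audit labels are hypothetical, since the dataset's actual schema is not described here.

```python
from collections import defaultdict

# Hypothetical post records; the real dataset schema may differ.
posts = [
    {"subreddit": "r/example_a", "belief": "belief_01", "author": "u1"},
    {"subreddit": "r/example_a", "belief": "belief_01", "author": "u2"},
    {"subreddit": "r/example_a", "belief": "belief_02", "author": "u1"},
    {"subreddit": "r/example_b", "belief": "belief_02", "author": "u3"},
    {"subreddit": "r/example_b", "belief": "belief_02", "author": "u3"},
]


def breakdown(records: list[dict], key: str) -> dict[str, dict[str, int]]:
    """Count posts and distinct authors for each value of `key`."""
    post_counts: dict[str, int] = defaultdict(int)
    authors: dict[str, set] = defaultdict(set)
    for rec in records:
        post_counts[rec[key]] += 1
        authors[rec[key]].add(rec["author"])
    return {g: {"posts": post_counts[g], "users": len(authors[g])} for g in post_counts}


def false_positive_rate(audited_on_topic: list[bool]) -> float:
    """Share of manually audited posts that reviewers judged off-topic."""
    if not audited_on_topic:
        raise ValueError("need at least one audited post")
    off_topic = sum(1 for ok in audited_on_topic if not ok)
    return off_topic / len(audited_on_topic)


by_belief = breakdown(posts, "belief")
by_subreddit = breakdown(posts, "subreddit")

# Hypothetical audit of four posts, one of which was judged off-topic.
contamination = false_positive_rate([True, True, False, True])
```

Reporting distinct users alongside post counts matters because a single prolific author can inflate a belief's apparent footprint.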
minor comments (1)
- [Abstract and Introduction] The abstract and introduction use 'MAHA' before fully defining the movement's scope; a brief parenthetical expansion on first use would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. The feedback correctly identifies gaps in validation and detailed statistics for the dataset construction. We address each point below and will incorporate revisions to strengthen the paper.
Point-by-point responses
- Referee: [Data collection section] The manuscript provides no validation metrics (e.g., precision, recall, or inter-rater agreement) for the subreddit selection, keyword filtering, or labeling process used to associate posts with the 12 MAHA-aligned beliefs; this directly affects whether the dataset genuinely captures thematic context rather than incidental matches.
  Authors: We acknowledge this limitation in the current version. The Data Collection section describes the subreddit list and keyword sets derived from domain knowledge of MAHA beliefs, but does not report quantitative validation. In the revised manuscript we will add a dedicated validation subsection reporting precision and recall from manual review of a stratified sample of 1,000 posts per belief, along with Cohen's kappa for inter-rater agreement among three annotators. This will directly address whether the pipeline captures thematic context beyond incidental keyword matches.
  Revision: yes
- Referee: [Dataset statistics section] Aggregate figures (19.4M posts, 4M users) are given but no per-belief or per-subreddit breakdowns or contamination estimates are reported, leaving the claim of balanced thematic coverage unverified.
  Authors: We agree that aggregate numbers alone leave the coverage claim unverified. The revised manuscript will include new tables and figures with exact post and user counts broken down by each of the 12 beliefs and by the primary subreddits. We will also report contamination estimates (false-positive rates) obtained from the same manual annotation study described in response to the first comment, allowing readers to assess the balance and purity of the thematic coverage.
  Revision: yes
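The validation metrics named in the rebuttal could be computed roughly as follows. This is an illustrative sketch, not the authors' procedure: the annotator labels, the pipeline predictions, and the majority-vote gold standard are all hypothetical. Cohen's kappa is defined for two raters, so with three annotators this sketch averages the three pairwise values; Fleiss' kappa is a common alternative for more than two raters.

```python
from itertools import combinations


def precision_recall(predicted: list[bool], gold: list[bool]) -> tuple[float, float]:
    """Precision and recall of binary predictions against gold labels."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    labels = set(a) | set(b)
    expected = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations: True means "post genuinely reflects the belief".
ann1 = [True, True, True, False, False, True, False, True]
ann2 = [True, True, False, False, False, True, False, True]
ann3 = [True, True, True, False, True, True, False, True]

# Assumed gold standard: majority vote of the three annotators.
gold = [sum(votes) >= 2 for votes in zip(ann1, ann2, ann3)]

# Hypothetical pipeline output for the same eight posts.
pipeline = [True, True, True, True, False, True, False, False]

precision, recall = precision_recall(pipeline, gold)
pairwise = [cohens_kappa(x, y) for x, y in combinations([ann1, ann2, ann3], 2)]
mean_kappa = sum(pairwise) / len(pairwise)
```

In a real study each belief would get its own precision, recall, and agreement figures, since a pipeline can perform well on one theme and poorly on another.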
Circularity Check
No circularity: data resource paper with no derivations, predictions or fitted parameters
This is a dataset introduction paper whose central claim is the release of a 19.4M-post Reddit corpus spanning 2020-2025 that captures the thematic context of 12 MAHA-aligned beliefs. No equations, parameter fitting, predictions, or self-citation chains appear in the provided abstract or described structure. The collection and filtering pipeline is presented as a methodological contribution rather than a derived quantity that reduces to its own inputs. Because the work contains no load-bearing derivations or self-referential predictions, none of the enumerated circularity patterns apply.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Reddit posts can be filtered and labeled to accurately capture the natural context of 12 specific MAHA-aligned beliefs.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We introduce a Reddit dataset ... 2-stage keyword search ... logistic regression classifier ... tree-based few-shot LLM stance classifier"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.