Long Live Fine-Tuning: Task-Specific Transformers Outperform Zero-Shot LLMs for Misinformation Response Classification on Reddit

Adriana-Simona Mihăiță; Angela Brillantes; Jooyoung Lee; Lin Tian; Marian-Andrei Rizoiu

arxiv: 2606.04274 · v1 · pith:JTM3IVADnew · submitted 2026-06-02 · 💻 cs.CL · cs.CY

Pith reviewed 2026-06-28 09:42 UTC · model grok-4.3

classification 💻 cs.CL cs.CY

keywords misinformation classification · fine-tuning · zero-shot LLMs · Reddit comments · belief detection · RoBERTa · fact-checking · social media discourse

The pith

Fine-tuned RoBERTa reaches 0.62 macro-F1 on Reddit misinformation labels while the best zero-shot LLM hits only 0.50.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the general capabilities of large language models are enough for nuanced classification of online misinformation discourse. It evaluates nine models on 900 Reddit comments labeled as belief (propagates a claim), fact-check (corrects it), or other across three verified claims. Fine-tuned RoBERTa outperforms all zero-shot approaches, with the largest gains on the implicit belief category that every LLM under-detects. Scaling up models does not close the gap, and safety alignment in larger LLMs sometimes causes outright refusals. The results indicate that task-specific supervised training remains more reliable than relying on scale alone for this verification task.

Core claim

Fine-tuned RoBERTa reaches 0.62 macro-F1 against a best zero-shot result of 0.50 from Claude Haiku 4.5 on the three-class task. The supervised advantage concentrates on the belief class. Larger models do not outperform smaller ones: Llama-3-8B matches Llama-3-70B, and under generic labels Claude Sonnet 4.6 trails the smaller Haiku, with belief detection collapsing to 0.17 and outright refusals on a subset of sensitive comments, an effect of safety alignment rather than a capacity limit. Label schema and topic jointly shape zero-shot performance, with the same model varying by more than 0.13 macro-F1 across topics under matched labels.
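
To make the metric behind these numbers concrete, here is a minimal sketch of macro-F1 for the three-class task. It is plain Python written for illustration, not the authors' evaluation code, and the toy labels and predictions are invented. Macro-F1 averages the per-class F1 scores with equal weight, so a weak belief score pulls the average down no matter how few belief comments there are.

```python
LABELS = ("belief", "fact-check", "other")


def per_class_f1(y_true, y_pred, label):
    """F1 for one class: 2*TP / (2*TP + FP + FN), or 0.0 if the class never appears."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1(y_true, y_pred, lab) for lab in labels) / len(labels)


# Invented toy data: one belief comment is missed, one "other" is mislabeled as belief.
y_true = ["belief", "belief", "fact-check", "other", "other", "other"]
y_pred = ["other", "belief", "fact-check", "other", "other", "belief"]

for lab in LABELS:
    print(lab, round(per_class_f1(y_true, y_pred, lab), 3))
print("macro-F1", round(macro_f1(y_true, y_pred), 3))
```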

What carries the argument

Direct head-to-head comparison of fine-tuned DistilBERT and RoBERTa against zero-shot BART-MNLI, Llama variants, and commercial LLMs on a fixed set of 900 human-labeled Reddit comments using both universal and topic-specific label schemas.

If this is right

In settings where failing to detect belief comments carries high cost, task-specific fine-tuning delivers more consistent performance than scaling zero-shot models.
Choosing label schemas and topics can swing zero-shot macro-F1 by over 0.13 for the same model.
Safety alignment in frontier LLMs can produce refusals on sensitive content, limiting applicability even when capacity exists.
Llama-3-8B matches Llama-3-70B, so parameter count alone does not close the gap to the fine-tuned models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

For high-stakes moderation pipelines, allocating resources to labeled data and fine-tuning may yield better returns than repeated calls to larger generative models.
The consistent under-detection of belief suggests current LLMs have difficulty with affective or implicit endorsement language in misinformation contexts.
Hybrid systems that route ambiguous comments to a fine-tuned classifier after an initial zero-shot pass could balance coverage and cost.
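
The hybrid routing idea in the last point could be sketched as follows. Everything here is hypothetical: `zero_shot` and `fine_tuned` stand in for real model calls, the toy stubs and the 0.8 threshold are arbitrary, and the paper neither proposes nor evaluates such a pipeline.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces: the zero-shot model returns (label or None on refusal, confidence).
ZeroShot = Callable[[str], Tuple[Optional[str], float]]
FineTuned = Callable[[str], str]


def route(comment: str, zero_shot: ZeroShot, fine_tuned: FineTuned,
          threshold: float = 0.8) -> Tuple[str, str]:
    """Return (label, source); fall back to the fine-tuned model on refusal or low confidence."""
    label, confidence = zero_shot(comment)
    if label is None or confidence < threshold:
        return fine_tuned(comment), "fine-tuned"
    return label, "zero-shot"


def toy_zero_shot(comment: str) -> Tuple[Optional[str], float]:
    """Stub: refuses comments mentioning 'bleach', is unsure about very short comments."""
    if "bleach" in comment.lower():
        return None, 0.0
    if len(comment.split()) < 4:
        return "other", 0.55
    return "fact-check", 0.93


def toy_fine_tuned(comment: str) -> str:
    """Stub standing in for a fine-tuned classifier."""
    return "belief"


for text in ["He literally said drink bleach",
             "lol sure",
             "PolitiFact rated this claim false last week"]:
    print(text, "->", route(text, toy_zero_shot, toy_fine_tuned))
```

The design choice worth noting is that a refusal is treated the same way as low confidence, so sensitive comments still receive a label instead of dropping out of the pipeline.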

Load-bearing premise

The human labels on the 900 comments correctly capture commenter intent as belief, fact-check, or other without substantial annotator disagreement or systematic bias.

What would settle it

Collect a fresh set of Reddit comments on the same or similar PolitiFact claims, obtain independent human labels, and re-run the same nine models to check whether the 0.12 macro-F1 gap between fine-tuned RoBERTa and the best zero-shot model persists.

Original abstract

As large language models (LLMs) become default tools for online information verification, an implicit assumption follows them: that scale and general capability are sufficient for nuanced classification of misinformation discourse. We test this assumption directly on 900 Reddit comments spanning three PolitiFact-verified misinformation claims (environment, health, immigration), labelled as belief (propagates the claim), fact-check (corrects it), or other. We compare nine models across three paradigms -- BART-MNLI, three Llama variants, three commercial frontier LLMs (Claude Haiku 4.5, Gemini Flash Lite 2.5, Claude Sonnet 4.6), and fine-tuned DistilBERT and RoBERTa -- under universal and topic-specific label schemas. The assumption does not hold. Fine-tuned RoBERTa reaches 0.62 macro-$F_1$ against a best zero-shot result of 0.50 (Claude Haiku 4.5), at a fraction of the per-query cost; the supervised advantage is concentrated on the belief class, the implicit, affective category every zero-shot model under-detects. Scaling does not help: Llama-3-8B matches Llama-3-70B, and Claude Sonnet 4.6 underperforms the smaller Haiku under generic labels, collapsing belief detection to 0.17 and refusing outright on a subset of comments flagged as sensitive. This is a safety-alignment artefact, not a capacity limit. Label schema and topic jointly shape zero-shot performance, with the same model varying by more than 0.13 macro-$F_1$ across topics under matched labels. In a verification context, where missing belief is the costlier error, task-specific fine-tuning remains the more reliable choice despite the proliferation of large generative models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Fine-tuned RoBERTa beats zero-shot LLMs here mainly on the belief class, but the result rests on unvalidated human labels.

The central finding is that fine-tuned RoBERTa reaches 0.62 macro-F1 on the 900 Reddit comments while the strongest zero-shot model (Claude Haiku 4.5) reaches 0.50, with the gap driven by better detection of the belief class. Scaling the LLMs does not close it, and Claude Sonnet 4.6 refuses outright on a subset of sensitive comments because of safety alignment.

The paper runs a direct head-to-head on three topics and two label schemas, which is useful. It documents that the same model's macro-F1 varies by more than 0.13 across topics and that zero-shot models consistently under-detect the implicit belief category. The cost comparison and the safety-alignment observation are practical points.

The soft spot is the labels. Belief is described as an affective, implicit category, yet the work provides no inter-annotator agreement, annotator count, or external check. If agreement is low on that class, the fine-tuned models are partly fitting label noise that zero-shot models cannot match. That assumption is load-bearing for the headline claim.
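
One way the missing agreement check could be reported is Cohen's kappa between two annotators, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The sketch below is illustrative only: the two label lists are invented, and the paper does not state how many annotators labeled each comment.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[lab] * counts_b[lab]
                   for lab in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented annotations for six comments; the disagreements involve the belief class.
annotator_1 = ["belief", "belief", "other", "other", "fact-check", "other"]
annotator_2 = ["belief", "other", "other", "other", "fact-check", "belief"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Reporting kappa separately for belief versus the other two classes would speak most directly to the concern raised here, since the headline gap concentrates on that class.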

This is for people who build or evaluate misinformation classifiers on social media and need evidence on when fine-tuning still pays off. A reader already working on similar Reddit or belief-detection tasks would get concrete numbers and topic effects to compare against.

The comparison is straightforward enough to warrant peer review, though the paper needs to report label reliability before publication.

Referee Report

1 major / 2 minor

Summary. The manuscript evaluates nine models across zero-shot (BART-MNLI, Llama variants, Claude Haiku 4.5, Claude Sonnet 4.6, Gemini Flash Lite 2.5) and supervised (fine-tuned DistilBERT, RoBERTa) paradigms on a 900-comment Reddit dataset spanning three PolitiFact misinformation topics. Comments are labeled belief (propagates the claim), fact-check (corrects it), or other. It reports fine-tuned RoBERTa at 0.62 macro-F1 versus the best zero-shot model, Claude Haiku 4.5, at 0.50, with the gap largest on the belief class. It further notes that scaling does not help, that label schema and topic shift zero-shot results by more than 0.13 macro-F1, and that safety alignment causes refusals in larger models.

Significance. If the human labels are reliable, the results supply concrete, falsifiable evidence that general-purpose LLMs underperform task-specific fine-tuning on implicit affective categories in misinformation discourse, while also showing cost and reliability advantages for supervised models in verification settings.

major comments (1)

[Dataset construction] Dataset construction (abstract and methods): No inter-annotator agreement, number of annotators per item, adjudication procedure, or external validation is reported for the belief class labels. Because supervised models are trained and evaluated against these same labels while zero-shot models are scored against them, and because the headline gap is concentrated on belief (the implicit category), any substantial annotator disagreement or systematic bias on this class would make the reported 0.12 macro-F1 advantage partly an artifact of fitting to label noise.

minor comments (2)

[Results] Results tables: clarify whether the reported macro-F1 values are averaged over multiple random seeds or single runs, and whether topic-specific versus universal schemas are evaluated on the same test splits.
[Experimental setup] Prompting details: the exact zero-shot templates and any refusal-handling rules for the commercial LLMs should be provided in an appendix so that the 0.50 ceiling can be reproduced.
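
A refusal-handling rule of the kind requested in the last comment might look like the sketch below. The marker phrases, the `refusal` and `unparsed` outcomes, and the function name are assumptions made for illustration; they are not the authors' protocol, which the manuscript does not describe.

```python
import re

LABELS = ("belief", "fact-check", "other")
# Hypothetical refusal phrases; a real study would document its own list.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i am unable", "i won't")


def parse_response(raw: str) -> str:
    """Map a raw LLM reply to one label, 'refusal', or 'unparsed'."""
    # Normalize case and curly apostrophes before matching.
    text = raw.strip().lower().replace("\u2019", "'")
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "refusal"
    # Accept the reply only if exactly one label appears as a whole word.
    found = [label for label in LABELS
             if re.search(r"\b" + re.escape(label) + r"\b", text)]
    return found[0] if len(found) == 1 else "unparsed"


for reply in ["Belief", "Label: fact-check.", "I can\u2019t classify this comment.",
              "belief or other"]:
    print(repr(reply), "->", parse_response(reply))
```

Whether `refusal` and `unparsed` replies are scored as errors or excluded from the test set changes the resulting macro-F1, which is why the rule needs to be stated alongside the prompts.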

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review. We address the single major comment below and will incorporate the requested clarifications in a revised manuscript.

Referee: [Dataset construction] Dataset construction (abstract and methods): No inter-annotator agreement, number of annotators per item, adjudication procedure, or external validation is reported for the belief class labels. Because supervised models are trained and evaluated against these same labels while zero-shot models are scored against them, and because the headline gap is concentrated on belief (the implicit category), any substantial annotator disagreement or systematic bias on this class would make the reported 0.12 macro-F1 advantage partly an artifact of fitting to label noise.

Authors: We agree this information is missing and that its absence is a material limitation, especially given the concentration of the performance gap on the belief class. We will revise the Methods section to fully describe the annotation process (number of annotators per item, adjudication procedure if any, and any external validation steps). We will also add an explicit Limitations paragraph discussing how label noise or annotator bias on the implicit belief category could inflate the apparent advantage of supervised models over zero-shot ones. This directly addresses the referee's concern that the 0.12 macro-F1 gap may partly reflect fitting to label artifacts rather than true model capability.
Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with independent measurements

The paper performs direct empirical comparisons of model performance (fine-tuned RoBERTa/DistilBERT vs. zero-shot LLMs) on a fixed set of 900 human-labeled Reddit comments. All reported metrics (macro-F1, per-class F1) are computed from held-out evaluation on the same labels without any derivation, fitted parameters renamed as predictions, self-citation chains, or ansatzes. No equations or uniqueness theorems appear; results are external measurements against the dataset. This is the standard non-circular case for benchmarking studies.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work applies standard supervised fine-tuning and zero-shot prompting techniques from prior NLP literature to a new dataset without introducing new free parameters, axioms beyond domain-standard evaluation assumptions, or invented entities.

axioms (1)

Domain assumption: Human-provided labels serve as reliable ground truth for the belief, fact-check, and other categories. Performance metrics are computed directly against these labels.

pith-pipeline@v0.9.1-grok · 5901 in / 1205 out tokens · 29176 ms · 2026-06-28T09:42:53.696394+00:00 · methodology

[1] [1]

arXiv preprint arXiv:230308774

Achiam J, Adler S, Agarwal S, et al (2023) Gpt-4 technical report. arXiv preprint arXiv:230308774

2023

[2] [2]

In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing ( EMNLP )

Augenstein I, Rockt \"a schel T, Vlachos A, et al (2016) Stance detection with bidirectional conditional encoding. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing ( EMNLP ). Association for Computational Linguistics, pp 876--885, doi:10.18653/v1/D16-1084

work page doi:10.18653/v1/d16-1084 2016

[3] [3]

most of california's water

Bacher D (2025) Fact check: The resnicks do not own “most of california's water”. ://www.c-win.org/blog/2025/1/27/fact-check-the-resnicks-do-not-own-most-of-californias-water

2025

[4] [4]

here's what to know

Bladt C (2025) Claims about who owns california's water are spreading online. here's what to know. ://www.cbsnews.com/news/los-angeles-wildfires-stewart-resnick-lynda-resnick-water-rights/

2025

[5] [5]

https://praw.readthedocs.io/en/stable/, accessed: 2025-03-04

Boe B (2023) Praw 7.7.1 documentation. https://praw.readthedocs.io/en/stable/, accessed: 2025-03-04

2023

[6] [6]

In: Advances in Neural Information Processing Systems 33 ( NeurIPS 2020)

Brown TB, Mann B, Ryder N, et al (2020) Language models are few-shot learners. In: Advances in Neural Information Processing Systems 33 ( NeurIPS 2020). Curran Associates, Inc., pp 1877--1901

2020

[7] [7]

maybe if you drank bleach you may be okay

Calefati J (2020) On covid-19, donald trump said that “maybe if you drank bleach you may be okay.”. ://www.politifact.com/factchecks/2020/jul/11/joe-biden/no-trump-didnt-tell-americans-infected-coronavirus/

2020

[8] [8]

Proceedings of the National Academy of Sciences 118(9):e2023301118

Cinelli M, De Francisci Morales G, Galeazzi A, et al (2021) The echo chamber effect on social media. Proceedings of the National Academy of Sciences 118(9):e2023301118. doi:10.1073/pnas.2023301118

work page doi:10.1073/pnas.2023301118 2021

[9] [9]

In: Proceedings of the 11th International Workshop on Semantic Evaluation ( SemEval -2017)

Derczynski L, Bontcheva K, Liakata M, et al (2017) SemEval -2017 task 8: RumourEval : Determining rumour veracity and support for rumours. In: Proceedings of the 11th International Workshop on Semantic Evaluation ( SemEval -2017). Association for Computational Linguistics, pp 69--76, doi:10.18653/v1/S17-2006

work page doi:10.18653/v1/s17-2006 2017

[10] [10]

BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin J, Chang MW, Lee K, et al (2019) BERT : Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies ( NAACL-HLT ). Association for Computational Linguistics, pp 4171--4186, doi:10.18653/v...

work page doi:10.18653/v1/n19-1423 2019

[11] [11]

In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies ( NAACL-HLT )

Ferreira W, Vlachos A (2016) Emergent: A novel data-set for stance classification. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies ( NAACL-HLT ). Association for Computational Linguistics, pp 1163--1168, doi:10.18653/v1/N16-1138

work page doi:10.18653/v1/n16-1138 2016

[12] [12]

In: Proceedings of the 13th International Workshop on Semantic Evaluation ( SemEval -2019)

Gorrell G, Kochkina E, Liakata M, et al (2019) SemEval -2019 task 7: RumourEval , determining rumour veracity and support for rumours. In: Proceedings of the 13th International Workshop on Semantic Evaluation ( SemEval -2019). Association for Computational Linguistics, pp 845--854, doi:10.18653/v1/S19-2147

work page doi:10.18653/v1/s19-2147 2019

[13] [13]

A Survey on Automated Fact-Checking

Guo Z, Schlichtkrull M, Vlachos A (2022) A survey on automated fact-checking. Transactions of the Association for Computational Linguistics 10:178--206. doi:10.1162/tacl_a_00454

work page doi:10.1162/tacl_a_00454 2022

[14] Hardalov M, Arora A, Nakov P, et al (2021) A survey on stance detection for mis- and disinformation identification. In: Findings of the Association for Computational Linguistics: NAACL 2021. Association for Computational Linguistics, pp 1259--1277, doi:10.18653/v1/2021.findings-naacl.324, venue corrected: the review body states EMNLP 2021 but the DOI and ...

[15] Hazarika D, Poria S, Gorantla S, et al (2018) CASCADE: Contextual sarcasm detection in online discussion forums. In: Proceedings of the 27th International Conference on Computational Linguistics (COLING). Association for Computational Linguistics, pp 1837--1848

[16] Hedderich MA, Lange L, Adel H, et al (2021) A survey on recent approaches for natural language processing in low-resource scenarios. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, pp 2545--2568, doi:...

work page doi:10.18653/v1/2021.naacl-main.201 2021

[17] Jacobson L (2025) Kilmar Armando Abrego Garcia “had ‘MS-13’ on his knuckles tattooed. … He had ‘MS’ as clear as you can be. Not ‘interpreted.’”. https://www.politifact.com/factchecks/2025/apr/30/donald-trump/trump-abrego-garcia-hand-tattoos-abc-news/

[18] Joshi A, Bhattacharyya P, Carman MJ (2017) Automatic sarcasm detection: A survey. ACM Computing Surveys 50(5):1--22. doi:10.1145/3124420

[19] Karande H, Walambe R, Benjamin V, et al (2021) Stance detection with BERT embeddings for credibility analysis of information on social media. PeerJ Computer Science 7:e467. doi:10.7717/peerj-cs.467

[20] Kawintiranon K, Singh L (2021) Knowledge enhanced masked language model for stance detection. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, pp 4725--4735, doi:10.18653/v1/2021.naacl-main.376

[21] Kochkina E, Liakata M, Augenstein I (2017) Turing at SemEval-2017 task 8: Sequential approach to rumour stance classification with Branch-LSTM. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for Computational Linguistics, pp 475--480, doi:10.18653/v1/S17-2083

[22] Lewis M, Liu Y, Goyal N, et al (2020) BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, pp 7871--7880, doi:10.18653/v1/2020.acl-main.703

[23] Liu Y, Ott M, Goyal N, et al (2019) RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. https://arxiv.org/abs/1907.11692

[24] McCullough C (2025) No, one couple doesn't own almost all California's water. https://www.politifact.com/factchecks/2025/jan/14/more-perfect-union/does-a-billionaire-couple-own-almost-all-the-water/

[25] Mohammad S, Kiritchenko S, Sobhani P, et al (2016) SemEval-2016 task 6: Detecting stance in tweets. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). Association for Computational Linguistics, pp 31--41, doi:10.18653/v1/S16-1003

[26] Mohammad S, Sobhani P, Kiritchenko S (2017) Stance and sentiment in tweets. ACM Transactions on Internet Technology 17(3):1--23. doi:10.1145/3003433

[27] Morrow S (2022) How this billionaire couple stole California's water supply. https://perfectunion.us/how-this-billionaire-couple-stole-californias-water-supply/

[28] PolitiFact (2025) PolitiFact. https://www.politifact.com/

[29] Rascouët-Paz A (2025) No, billionaire couple does not “own most of California's water”. https://www.snopes.com/news/2025/01/16/billonaire-couple-own-californias-water/

[30] Sanh V, Debut L, Chaumond J, et al (2019) DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. https://arxiv.org/abs/1910.01108, NeurIPS 2019 Workshop on Energy Efficient Machine Learning and Cognitive Computing

[31] Schick T, Schütze H (2021) Exploiting cloze-questions for few-shot text classification and natural language inference. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Association for Computational Linguistics, pp 255--269, doi:10.18653/v1/2021.eacl-main.20

[32] Shu K, Sliva A, Wang S, et al (2017) Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter 19(1):22--36. doi:10.1145/3137597.3137600

[33] Specht P (2024) Says Donald Trump “told Americans all they had to do was inject bleach in themselves. Just take a shot of UV light.”. https://www.politifact.com/factchecks/2024/mar/28/joe-biden/biden-exaggerates-trumps-pandemic-comments-about-d/

[34] Gemini Team, Anil R, Borgeaud S, et al (2023) Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805

[35] Thorne J, Vlachos A, Christodoulopoulos C, et al (2018) FEVER: A large-scale dataset for fact extraction and VERification. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, pp 809--819, doi:10.18653/v...

work page internal anchor Pith review doi:10.18653/v1/n18-1074 2018

[36] Touvron H, et al (2023) LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

[37] Vosoughi S, Roy D, Aral S (2018) The spread of true and false news online. Science 359(6380):1146--1151. doi:10.1126/science.aap9559

[38] Wei J, Zou K (2019) EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, pp 6382--6388, d...

work page doi:10.18653/v1/d19-1670 2019

[39] Weld G, Glenski M, Althoff T (2021) Political bias and factualness in news sharing across more than 100,000 online communities. In: Proceedings of the 15th International AAAI Conference on Web and Social Media (ICWSM), vol 15. AAAI Press, pp 796--807, doi:10.1609/icwsm.v15i1.18104

[40] Williams A, Nangia N, Bowman S (2018) A broad-coverage challenge corpus for sentence understanding through inference. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, pp 1112--1122, ://aclweb...

[41] Yin W, Hay J, Roth D (2019) Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, pp 3914--3923...

work page doi:10.18653/v1/d19-1404 2019

[42] Yu J, Jiang J, Khoo LMS, et al (2020) Coupled hierarchical transformer for stance-aware rumor verification in social media conversations. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pp 1392--1401, doi:10.18653/v1/2020.emnlp-main.108

[43] Zubiaga A, Liakata M, Procter R, et al (2016) Analysing how people orient to and spread rumours in social media by looking at conversational threads. PLoS ONE 11(3):e0150989. doi:10.1371/journal.pone.0150989