REFLEX: Self-Refining Explainable Fact-Checking via Verdict-Anchored Style Control
Pith reviewed 2026-05-17 05:34 UTC · model grok-4.3
The pith
REFLEX disentangles fact from style in LLM fact-checking explanations by building verdict-anchored steering vectors from self-disagreement signals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
REFLEX is a self-refining paradigm that explicitly controls reasoning style anchored on the verdict. It uses self-disagreement veracity signals between the backbone model and its fine-tuned variant to construct steering vectors that naturally disentangle fact from style. Experiments on real-world datasets show it reaches state-of-the-art performance with LLaMA-series models using only 465 self-refined samples, yields up to a 7.54% gain on in-the-wild data thanks to its transferability, and mitigates faithful hallucination, producing more accurate verdicts than prior explainable fact-checking methods.
What carries the argument
Verdict-anchored style control via steering vectors constructed from self-disagreement veracity signals between a backbone LLM and its fine-tuned variant.
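The mechanism can be sketched in a few lines. The sketch below is illustrative only: synthetic arrays stand in for real LLaMA hidden states, and the function names, layer choice, and scaling factor `alpha` are our assumptions, not the paper's implementation. The idea is to collect hidden states from the backbone and its fine-tuned variant, restrict to samples where their verdicts disagree, average the activation difference into a steering vector, and add a scaled copy of it at inference.

```python
import numpy as np

def build_steering_vector(backbone_acts, finetuned_acts, disagree_mask):
    """Mean activation difference on self-disagreement samples.

    backbone_acts, finetuned_acts: (n_samples, hidden_dim) arrays of
    hidden states at one chosen layer; disagree_mask: boolean array
    marking samples where the two models' verdicts differ.
    """
    diff = finetuned_acts[disagree_mask] - backbone_acts[disagree_mask]
    return diff.mean(axis=0)

def steer(hidden_state, steering_vector, alpha=1.0):
    """Add the scaled steering vector to a hidden state at inference."""
    return hidden_state + alpha * steering_vector

# Toy stand-ins for real model activations.
rng = np.random.default_rng(0)
n, d = 465, 16                      # 465 samples, toy hidden size
backbone = rng.normal(size=(n, d))
finetuned = backbone + 0.5          # toy offset standing in for fine-tuning
mask = rng.random(n) < 0.3          # toy disagreement indicator

v = build_steering_vector(backbone, finetuned, mask)
h = steer(np.zeros(d), v, alpha=2.0)
```

In the paper's setting the contrast set would be the 465 self-refined samples, and the vector would be injected into a chosen transformer layer during generation, in the spirit of contrastive activation addition.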
If this is right
- REFLEX reaches state-of-the-art performance on LLaMA-series models using only 465 self-refined samples.
- The approach delivers up to a 7.54% performance gain on in-the-wild data through its transferability.
- REFLEX reduces faithful hallucination in explanations and supports more accurate verdicts than earlier explainable fact-checking systems.
Where Pith is reading between the lines
- The steering-vector technique could apply to other LLM tasks that require separating content accuracy from output style without large external datasets.
- Lower sample requirements might allow fact-checking tools to adapt quickly to emerging misinformation topics with minimal retraining.
- Testing the method across additional social-media platforms could show whether the fact-style separation holds for varied misinformation formats.
Load-bearing premise
That self-disagreement veracity signals between the backbone model and its fine-tuned variant can naturally disentangle fact from style without introducing new biases or requiring external validation.
What would settle it
A controlled test on a held-out fact-checking dataset in which REFLEX-generated explanations still exhibit style-induced misleading content or produce lower verdict accuracy than standard fine-tuning baselines.
Figures
Original abstract
The prevalence of fake news on social media demands automated fact-checking systems to provide accurate verdicts with faithful explanations. However, existing large language model (LLM)-based approaches ignore deceptive misinformation styles in LLM-generated explanations, resulting in unfaithful rationales that can mislead human judgments. They rely heavily on external knowledge sources, introducing hallucinations and even high latency that undermine reliability and responsiveness, which is crucial for real-time use. To address these challenges, we propose REason-guided Fact-checking with Latent EXplanations (REFLEX), a self-refining paradigm that explicitly controls reasoning style anchored on verdict. REFLEX utilizes self-disagreement veracity signals between the backbone model and its fine-tuned variant to construct steering vectors, naturally disentangling fact from style. Experiments on the real-world dataset show REFLEX achieves state-of-the-art performance under LLaMA-series models with only 465 self-refined samples. Moreover, owing to its transferability, REFLEX yields up to a 7.54% gain on in-the-wild data. Our results further demonstrate that our method effectively mitigates faithful hallucination, thereby guiding the model toward more accurate verdicts than previous works in explainable fact-checking.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes REFLEX, a self-refining paradigm for explainable fact-checking. It constructs steering vectors from self-disagreement veracity signals between a backbone LLaMA model and its fine-tuned variant on 465 samples to enable verdict-anchored style control. This is intended to disentangle fact from deceptive style in explanations, reducing faithful hallucinations without external knowledge. The authors claim SOTA performance under LLaMA-series models on real-world data, up to 7.54% gains on in-the-wild data, and improved mitigation of faithful hallucination.
Significance. If the results and the disentanglement hold under rigorous controls, REFLEX could offer a practical advance in LLM-based fact-checking by achieving strong performance and style control with minimal self-refined data and without external retrieval. The reported transferability to in-the-wild settings and the focus on mitigating unfaithful rationales are potentially valuable for real-time applications. The small sample count (465) would be a notable efficiency strength if the evaluation demonstrates clear separation of style from factuality.
major comments (2)
- [Abstract] The central performance claims (SOTA results, the 7.54% in-the-wild gain, and mitigation of faithful hallucination) are presented without any mention of baselines, experimental controls, error bars, statistical significance, or criteria for selecting the 465 samples. This absence is load-bearing because the soundness of the reported gains cannot be assessed from the information provided.
- [Method] The steering vectors are derived from disagreement between the backbone and its fine-tuned variant on the same 465 samples. Without a controlled measurement or ablation demonstrating that these vectors modulate explanation style independently of factuality (rather than capturing calibration artifacts or uncertainty), the claim that self-disagreement naturally disentangles fact from style remains unverified and risks circular reinforcement of the fine-tuned model's outputs.
minor comments (2)
- Clarify whether the 465 self-refined samples are drawn from the evaluation distribution or held out, and provide the exact fine-tuning procedure for the variant model to allow reproducibility.
- [Introduction] The abstract uses the term 'faithful hallucination' without a precise definition or reference to prior usage; a brief operational definition in the introduction would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our manuscript. We address each of the major comments below and outline the revisions we will make to improve the paper.
Point-by-point responses
Referee: [Abstract] The central performance claims (SOTA results, the 7.54% in-the-wild gain, and mitigation of faithful hallucination) are presented without any mention of baselines, experimental controls, error bars, statistical significance, or criteria for selecting the 465 samples. This absence is load-bearing because the soundness of the reported gains cannot be assessed from the information provided.
Authors: We concur that the abstract lacks sufficient detail to fully contextualize our claims. To address this, we will revise the abstract to include references to the baselines (such as standard fine-tuned LLMs and prior explainable fact-checking approaches), experimental controls including multiple evaluation runs, error bars representing standard deviations, statistical significance via appropriate tests, and the selection criteria for the 465 samples as a randomly sampled balanced subset from the available training data. These additions will make the performance claims more transparent and assessable.
Revision: yes
Referee: [Method] The steering vectors are derived from disagreement between the backbone and its fine-tuned variant on the same 465 samples. Without a controlled measurement or ablation demonstrating that these vectors modulate explanation style independently of factuality (rather than capturing calibration artifacts or uncertainty), the claim that self-disagreement naturally disentangles fact from style remains unverified and risks circular reinforcement of the fine-tuned model's outputs.
Authors: This is a valid concern about the verification of the disentanglement mechanism. Our current experiments demonstrate that applying the steering vectors improves both verdict accuracy and explanation quality over the fine-tuned model, suggesting the signals capture useful style information. However, to more rigorously demonstrate independence from factuality and rule out calibration artifacts, we will add controlled ablations in the revised manuscript. These will include comparisons with steering vectors from non-veracity disagreements and quantitative measures of style (e.g., via perplexity on style-specific prompts) versus factuality metrics. We believe this will substantiate the claim without circularity.
Revision: yes
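A minimal numerical version of such an independence check can be sketched as follows. This is our own toy construction, not the authors' protocol: estimate a "factuality direction" from activations of truth-contrasted examples, then measure what fraction of a candidate steering vector's norm lies along that direction. A value near zero is consistent with the fact/style disentanglement claim; a large value suggests the vector also carries veracity information.

```python
import numpy as np

def direction(pos_acts, neg_acts):
    """Unit vector separating mean activations of two contrast sets."""
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def projection_fraction(steering_vec, fact_dir):
    """Fraction of the steering vector's norm lying along fact_dir."""
    return abs(steering_vec @ fact_dir) / np.linalg.norm(steering_vec)

# Toy setup: synthetic activations for true vs. false claims.
rng = np.random.default_rng(1)
d = 32
fact_dir = direction(rng.normal(1.0, 1.0, (50, d)),
                     rng.normal(-1.0, 1.0, (50, d)))

# A toy "style" steering vector built to be orthogonal to fact_dir,
# i.e. the ideal disentangled case.
raw = rng.normal(size=d)
style_vec = raw - (raw @ fact_dir) * fact_dir

frac = projection_fraction(style_vec, fact_dir)
```

On real models the contrast sets would come from labeled claims, and the same fraction computed for REFLEX's steering vectors would quantify how much veracity signal they carry.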
Circularity Check
No significant circularity; method claims rest on independent experimental validation
Full rationale
The paper proposes REFLEX as a self-refining paradigm that constructs steering vectors from self-disagreement signals between a backbone model and its fine-tuned variant on 465 samples, claiming this naturally disentangles fact from style. Performance is reported via SOTA results on real-world datasets and up to 7.54% gains on separate in-the-wild data, with explicit mitigation of faithful hallucination. No quoted derivation step reduces by construction to its inputs, no fitted parameter is relabeled as a prediction, and no load-bearing premise relies on a self-citation chain. The approach is self-contained against external benchmarks and does not exhibit any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of self-refined samples
axioms (1)
- Domain assumption: Self-disagreement between the backbone and fine-tuned model yields disentangled fact-style signals.
Reference graph
Works this paper leans on
- [1] Massih-Reza Amini, Vasilii Feofanov, Loic Pauletto, Lies Hadjadj, Emilie Devijver, and Yury Maximov. 2025. Self-training: A survey. Neurocomputing 616 (2025), 128904.
- [2] Pepa Atanasova. 2024. Generating fact checking explanations. In Accountable and Explainable Methods for Complex Reasoning over Text. Springer, 83–103.
- [3]
- [4] Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. [n. d.]. Discovering Latent Knowledge in Language Models Without Supervision. In The Eleventh International Conference on Learning Representations.
- [5] Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. 2025. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567 (2025).
- [6] Tsun-Hin Cheung and Kin-Man Lam. 2023. FactLLaMA: Optimizing instruction-following language models with external knowledge for automated fact-checking. In 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 846–853.
- [7] Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James R Glass, and Pengcheng He. 2023. DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models. In The Twelfth International Conference on Learning Representations.
- [8] Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. [n. d.]. Plug and Play Language Models: A Simple Approach to Controlled Text Generation. In International Conference on Learning Representations.
- [9] Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. 2025. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective STaRs. arXiv preprint arXiv:2503.01307 (2025).
- [10]
- [11] Gaurav Rohit Ghosal, Tatsunori Hashimoto, and Aditi Raghunathan. 2024. Understanding Finetuning for Factual Knowledge Extraction. In International Conference on Machine Learning. PMLR, 15540–15558.
- [12] Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. 2025. A Survey on LLM-as-a-Judge. arXiv:2411.15594 [cs.CL]. https://arxiv.org/abs/2411.15594
- [13]
- [14]
- [15] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
- [16] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. 2025. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems 43, 2 (2025), 1–55.
- [17] Shailza Jolly, Pepa Atanasova, and Isabelle Augenstein. 2022. Generating fluent fact checking explanations with unsupervised post-editing. Information 13, 10 (2022), 500.
- [18]
- [19]
- [20]
- [21] Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. 2021. GeDi: Generative Discriminator Guided Sequence Generation. In Findings of the Association for Computational Linguistics: EMNLP 2021. 4929–4952.
- [22] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474.
- [23] Kenneth Li, Aspen K Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. [n. d.]. Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. In The Eleventh International Conference on Learning Representations.
- [24] Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems 36 (2023), 41451–41530.
- [25] Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. 2022. Diffusion-LM improves controllable text generation. Advances in Neural Information Processing Systems 35 (2022), 4328–4343.
- [26] Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 3214–3252.
- [27]
- [28] Jiacheng Liu, Alisa Liu, Ximing Lu, Sean Welleck, Peter West, Ronan Le Bras, Yejin Choi, and Hannaneh Hajishirzi. 2022. Generated Knowledge Prompting for Commonsense Reasoning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 3154–3169.
- [29] Jiachang Liu, Dinghan Shen, Yizhe Zhang, William B Dolan, Lawrence Carin, and Weizhu Chen. 2022. What Makes Good In-Context Examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures. 100–114.
- [30] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
- [31] Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp.
- [32] Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 8086–8098.
- [33]
- [34] Jing Ma, Wei Gao, Shafiq Joty, and Kam-Fai Wong. 2019. Sentence-level evidence embedding for claim verification with hierarchical attention networks. Association for Computational Linguistics.
- [35] Melkamu Mersha, Khang Lam, Joseph Wood, Ali K Alshami, and Jugal Kalita. 2024. Explainable artificial intelligence: A survey of needs, techniques, applications, and future direction. Neurocomputing 599 (2024), 128111.
- [36] Sewon Min, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Noisy Channel Language Model Prompting for Few-Shot Text Classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 5316–5330.
- [37] Tai Nguyen and Eric Wong. 2023. In-context Example Selection with Influences. arXiv e-prints (2023), arXiv–2302.
- [38] Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. Combining fact extraction and verification with neural semantic matching networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 6859–6866.
- [39] OpenAI. 2023. Introducing ChatGPT. https://openai.com/blog/chatgpt
- [40]
- [41] Verónica Pérez-Rosas, Bennett Kleinberg, Alexandra Lefevre, and Rada Mihalcea.
- [42] Automatic detection of fake news. arXiv preprint arXiv:1708.07104 (2017).
- [43] Kashyap Popat, Subhabrata Mukherjee, Andrew Yates, and Gerhard Weikum.
- [44] DeClarE: Debunking fake news and false claims using evidence-aware deep learning. arXiv preprint arXiv:1809.06416 (2018).
- [45] Hannah Rashkin, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, and Yejin Choi.
- [46] Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2931–2937.
- [47] John W. Ratcliff and David E. Metzener. 1988. Pattern Matching: The Gestalt Approach. Dr. Dobb's Journal 13, 7 (Jul 1988), 46.
- [48] Xuan Ren, Biao Wu, and Lingqiao Liu. 2024. I learn better if you speak my language: Enhancing large language model fine-tuning with style-aligned response adjustments. CoRR (2024).
- [49] Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. 2024. Steering Llama 2 via Contrastive Activation Addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 15504–15522.
- [50] Daniel Russo, Serra Sinem Tekiroğlu, and Marco Guerini. 2023. Benchmarking the generation of fact checking explanations. Transactions of the Association for Computational Linguistics 11 (2023), 1250–1264.
- [51] Michael Schlichtkrull, Zhijiang Guo, and Andreas Vlachos. 2023. AVeriTeC: A dataset for real-world claim verification with evidence from the web. Advances in Neural Information Processing Systems 36 (2023), 65128–65167.
- [52] Tal Schuster, Roei Schuster, Darsh J Shah, and Regina Barzilay. 2020. The limitations of stylometry for detecting machine-generated fake news. Computational Linguistics 46, 2 (2020), 499–510.
- [53] Jiaming Shen, Jialu Liu, Dan Finnie, Negar Rahmati, Mike Bendersky, and Marc Najork. 2023. "Why is this misleading?": Detecting News Headline Hallucinations with Explanations. In Proceedings of the ACM Web Conference 2023. 1662–1672.
- [54] Kai Shu, Limeng Cui, Suhang Wang, Dongwon Lee, and Huan Liu. 2019. dEFEND: Explainable fake news detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 395–405.
- [55]
- [57] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
- [58] Bo Wang, Jing Ma, Hongzhan Lin, Zhiwei Yang, Ruichao Yang, Yuan Tian, and Yi Chang. 2024. Explainable fake news detection with large language model via defense among competing wisdom. In Proceedings of the ACM Web Conference.
- [59] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
- [60] Jiaying Wu, Jiafeng Guo, and Bryan Hooi. 2024. Fake news in sheep's clothing: Robust fake news detection against LLM-empowered style attacks. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3367–3378.
- [61] Lianwei Wu, Yuan Rao, Ling Sun, and Wangbo He. 2021. Evidence inference networks for interpretable claim verification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 14058–14066.
- [62] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025).
- [63] Zhiwei Yang, Jing Ma, Hechang Chen, Hongzhan Lin, Ziyang Luo, and Yi Chang.
- [64]
- [65] Barry Menglong Yao, Aditya Shah, Lichao Sun, Jin-Hee Cho, and Lifu Huang. 2023. End-to-end multimodal fact-checking and explanation generation: A challenging dataset and models. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2733–2743.
- [66]
- [67] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D Goodman. 2024. STaR: Self-taught reasoner bootstrapping reasoning with reasoning. In Proc. the 36th International Conference on Neural Information Processing Systems, Vol. 1126.
- [68] Tianjun Zhang, Xuezhi Wang, Denny Zhou, Dale Schuurmans, and Joseph E Gonzalez. [n. d.]. TEMPERA: Test-Time Prompt Editing via Reinforcement Learning. In The Eleventh International Conference on Learning Representations.
- [69]
- [70] Eric Zhao, Pranjal Awasthi, and Nika Haghtalab. 2025. From Style to Facts: Mapping the Boundaries of Knowledge Injection with Finetuning. arXiv preprint arXiv:2503.05919 (2025).

Appendix excerpt (A Prompt Template): Following [5, 28], the prompt template we use to conduct training and inference for claims is as follows: A chat between a curious human and an artificial intellig...