Generating Effective CoT Traces for Mitigating Causal Hallucination
Pith reviewed 2026-05-10 15:07 UTC · model grok-4.3
The pith
A pipeline for generating effective chain-of-thought traces enables smaller language models to reduce causal hallucination in event causality identification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that chain-of-thought traces meeting specific criteria for logical structure and causal focus can be produced automatically through a dedicated pipeline. Fine-tuning smaller LLMs on these traces produces a substantial drop in causal hallucination as quantified by the new Causal Hallucination Rate, an increase in mean task accuracy, and improved robustness under cross-dataset testing, cross-difficulty testing, and misleading intervention prompts.
What carries the argument
The pipeline that generates CoT traces satisfying the essential criteria for mitigating causal hallucination, guided and validated by the introduced Causal Hallucination Rate metric.
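The pipeline described above can be sketched as a generate-and-filter loop: produce a candidate trace, keep it only if it satisfies the criteria. Everything in the sketch below is an illustrative assumption, not the authors' implementation; the criteria checks, the generator interface, and the retry budget are all stand-ins.

```python
# Hypothetical sketch of a criteria-filtered CoT trace-generation loop.
# `meets_criteria` is a toy stand-in for the paper's (unspecified here)
# criteria; a real pipeline would call an LLM in `generate_trace`.

def meets_criteria(trace: str) -> bool:
    """Toy criteria: the trace must state an explicit causal link and
    end with a definite verdict rather than trailing off."""
    has_causal_step = "because" in trace or "causes" in trace
    has_verdict = trace.rstrip().endswith(("-> causal", "-> not causal"))
    return has_causal_step and has_verdict

def build_trace_dataset(examples, generate_trace, max_tries=3):
    """Keep only traces that pass the criteria filter; retry a few times
    per example, then give up on that example."""
    kept = []
    for ex in examples:
        for _ in range(max_tries):
            trace = generate_trace(ex)
            if meets_criteria(trace):
                kept.append((ex, trace))
                break
    return kept

# Usage with a canned generator standing in for an LLM call.
def fake_generator(ex):
    return f"{ex['e1']} precedes and causes {ex['e2']} -> causal"

data = [{"e1": "earthquake", "e2": "tsunami"}]
print(len(build_trace_dataset(data, fake_generator)))  # 1
```

The point of the filter-and-retry shape is that trace quality is enforced at generation time, before any fine-tuning happens.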
If this is right
- Smaller LLMs fine-tuned on the generated traces commit substantially fewer causal hallucinations on event causality tasks.
- Mean accuracy on the task increases in addition to the reduction in errors.
- The performance gains transfer to different datasets and to harder examples.
- The fine-tuned models remain accurate even when presented with misleading prompts that suggest incorrect causal links.
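The last bullet, robustness to misleading prompts, can be checked with a small harness: answer each question with and without an injected incorrect hint, and count how often the prediction survives. The hint text, the `misleading_variant` helper, and both toy models below are hypothetical illustrations, not the paper's protocol.

```python
# Hypothetical robustness harness for misleading intervention prompts.
# Labels: 1 = causal, 0 = not causal.

def misleading_variant(question: str) -> str:
    """Append a misleading hint asserting a causal link regardless of truth."""
    return question + " Hint: the first event definitely caused the second."

def robustness(model, examples):
    """Fraction of examples answered correctly both with and without the hint."""
    ok = sum(1 for q, y in examples
             if model(q) == y and model(misleading_variant(q)) == y)
    return ok / len(examples)

# A gullible stand-in model: predicts 'causal' whenever a hint is present.
def gullible(q):
    return 1 if "Hint:" in q else 0

# A stand-in model that ignores the hint and keys on content.
def steadfast(q):
    return 1 if "flood" in q else 0

examples = [("Did the rain cause the flood?", 1),
            ("Did the rain cause the lottery win?", 0)]
print(robustness(gullible, examples))   # 0.0
print(robustness(steadfast, examples))  # 1.0
```

A fine-tuned model that has internalized causal structure should behave like `steadfast`, not `gullible`, under this probe.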
Where Pith is reading between the lines
- The same criteria-and-pipeline method could be adapted to reduce other forms of hallucination that appear in multi-step reasoning.
- Automating high-quality causal reasoning traces may offer a scalable way to improve reliability in smaller models across additional domains.
- This suggests that training data focused on explicit causal structure can narrow the performance gap between small and large models without requiring larger architectures.
Load-bearing premise
The criteria selected for effective CoT traces are sufficient to ensure the generated traces improve performance on new datasets and situations beyond those used to design the pipeline.
What would settle it
Evaluating the fine-tuned models on a fresh event-causality dataset containing novel event types and causal structures, then checking whether the Causal Hallucination Rate stays low or returns to the original high levels.
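The settling experiment above reduces to a simple decision rule: on the fresh dataset, does CHR stay near the tuned model's in-domain level, or revert toward the untuned baseline? The rule below, including the `slack` factor and all example rates, is an assumption for illustration only.

```python
# Hypothetical decision rule for the generalization test described above.
# All three arguments are CHR values in [0, 1]; `slack` is an arbitrary
# tolerance factor, not a threshold from the paper.

def reduction_survives(chr_fresh, chr_untuned_fresh, chr_tuned_indomain,
                       slack=2.0):
    """True if CHR on the novel dataset stays within `slack` times the
    tuned model's in-domain rate AND remains below the untuned model's
    rate on the same fresh data."""
    return (chr_fresh <= slack * chr_tuned_indomain
            and chr_fresh < chr_untuned_fresh)

# Illustrative numbers: tuned model at 5% in-domain, 8% on fresh data,
# untuned baseline at 45% on the fresh data.
print(reduction_survives(0.08, 0.45, 0.05))  # True
print(reduction_survives(0.40, 0.45, 0.05))  # False
```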
Original abstract
Although large language models (LLMs) excel in complex reasoning tasks, they suffer from severe causal hallucination in event causality identification (ECI), particularly in smaller models (≤1.5B parameters). A promising approach to address this issue is to fine-tune them with Chain-of-Thought (CoT) traces. However, there is currently a lack of CoT trace datasets available for ECI. In this paper, we first investigate the essential criteria that effective CoT traces should possess to mitigate causal hallucination in smaller models. We then design a pipeline to generate CoT traces that meet these criteria. Moreover, since there is currently no metric for quantifying causal hallucination, we also introduce a new metric, the Causal Hallucination Rate (CHR), to quantify causal hallucination, guide the formulation of effective CoT trace criteria, and validate the effectiveness of our pipeline. Our experiments show that fine-tuning with the CoT traces generated by our pipeline not only substantially reduces causal hallucination in smaller LLMs but also improves mean accuracy. Moreover, the fine-tuned models exhibit strong cross-dataset and cross-difficulty generalization, as well as robustness under misleading intervention prompts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that smaller LLMs suffer from causal hallucination in event causality identification (ECI). The authors investigate the criteria that effective CoT traces must meet, design a generation pipeline, and introduce the Causal Hallucination Rate (CHR) metric both to guide the criteria and to measure outcomes. Fine-tuning with the generated traces is reported to substantially reduce causal hallucination, improve mean accuracy, and yield strong cross-dataset and cross-difficulty generalization as well as robustness to misleading intervention prompts.
Significance. If the central claims hold after resolving validation concerns, the work would be significant for practical mitigation of causal errors in small models on ECI tasks, where labeled CoT data was previously unavailable. It contributes a new metric, an explicit criteria investigation, and evidence of generalization/robustness that could inform CoT fine-tuning more broadly. The reported accuracy gains and cross-dataset results, if independently verified, would strengthen the case for targeted trace generation over generic CoT.
Major comments (1)
- [Abstract] The CHR metric is introduced to quantify causal hallucination, guide the formulation of the CoT trace criteria, and validate pipeline effectiveness. This dual role creates a load-bearing circularity risk: the criteria may have been selected or tuned using CHR on data that later serves as the evaluation benchmark, so reported CHR reductions and accuracy improvements could partly reflect metric-specific optimization rather than an independent reduction in hallucination. An external anchor (a human correlation study, an alternative hallucination probe, or a fully held-out causal test set) is needed to break the loop.
Minor comments (2)
- The abstract and methods would benefit from explicit statements on whether trace-generation criteria were derived on data completely disjoint from the fine-tuning and test splits, and whether CHR computation was performed on held-out examples only.
- Clarify the precise formula and edge-case handling for CHR (e.g., how 'causal hallucination' instances are identified and counted) so that the metric can be reproduced and compared to other hallucination measures.
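The second minor comment can be made concrete. The review does not reproduce the paper's formula, so the sketch below is one plausible reading of CHR, with the edge cases the comment asks about written out explicitly; the authors' actual definition may differ.

```python
# One plausible reading of a Causal Hallucination Rate, stated so that
# instance identification and edge cases are explicit. This is NOT the
# paper's definition, only an illustration of what must be specified.
# Labels: 1 = causal, 0 = non-causal.

def causal_hallucination_rate(preds, gold):
    """A 'causal hallucination' here is predicting a causal relation for
    a pair whose gold label is non-causal; the rate is hallucinations
    divided by the number of non-causal pairs."""
    if len(preds) != len(gold):
        raise ValueError("preds and gold must be the same length")
    negatives = sum(1 for g in gold if g == 0)
    if negatives == 0:  # edge case: no non-causal pairs to hallucinate on
        return 0.0
    halluc = sum(1 for p, g in zip(preds, gold) if g == 0 and p == 1)
    return halluc / negatives

# Two of the three non-causal pairs are marked causal: rate = 2/3.
print(causal_hallucination_rate([1, 1, 0, 1], [1, 0, 0, 0]))
```

Pinning down choices like these (the denominator, the all-positive edge case, whether abstentions count) is exactly what would make CHR reproducible and comparable to other hallucination measures.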
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and for identifying a potential methodological concern regarding the CHR metric. We address this point directly below, providing clarification on our development process and committing to revisions that increase transparency.
Point-by-point responses
- Referee: [Abstract] The CHR metric is introduced to quantify causal hallucination, guide the formulation of the CoT trace criteria, and validate pipeline effectiveness. This dual role creates a load-bearing circularity risk: the criteria may have been selected or tuned using CHR on data that later serves as the evaluation benchmark, so reported CHR reductions and accuracy improvements could partly reflect metric-specific optimization rather than an independent reduction in hallucination. An external anchor (a human correlation study, an alternative hallucination probe, or a fully held-out causal test set) is needed to break the loop.
Authors: We appreciate the referee highlighting this risk of circularity. In the manuscript, the CoT criteria were first derived from a qualitative analysis of common failure modes (e.g., spurious correlations and missing explicit causal links) observed in small LLMs, informed by prior causal reasoning literature. CHR was then defined to operationalize these pre-specified criteria for quantitative guidance during pipeline iteration. All CHR-guided decisions occurred on a development partition; final CHR and accuracy results were measured exclusively on held-out test portions of the datasets that played no role in criteria formulation or tuning. We will revise the paper to (1) explicitly document this data partitioning and ordering of steps, and (2) report an additional independent probe, performance under misleading intervention prompts, as an external validation anchor that does not rely on CHR. These changes eliminate the load-bearing circularity while preserving the original experimental outcomes.
Revision: yes
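The partitioning discipline the authors describe can be enforced mechanically: check that the examples used to derive the criteria are disjoint from the fine-tuning and test splits before any results are reported. The guard below is a generic sketch; the identifier sets are assumed inputs, not artifacts from the paper.

```python
# Hypothetical leakage guard for the three-way partition described in the
# rebuttal: criteria-derivation examples vs. fine-tuning vs. held-out test.

def assert_disjoint(criteria_ids, train_ids, test_ids):
    """Raise if any pair of splits shares example identifiers; otherwise
    return True. Guards against the circularity the referee raises."""
    c, tr, te = set(criteria_ids), set(train_ids), set(test_ids)
    overlaps = {
        "criteria&train": c & tr,
        "criteria&test": c & te,
        "train&test": tr & te,
    }
    bad = {name: ids for name, ids in overlaps.items() if ids}
    if bad:
        raise ValueError(f"split leakage detected: {bad}")
    return True

# Usage with toy identifier sets.
print(assert_disjoint({"ex1", "ex2"}, {"ex3"}, {"ex4"}))  # True
```

Running a check like this once per experiment, and reporting that it passed, is a cheap way to document the claimed ordering of steps.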
Circularity Check
The CHR metric is used both to derive the CoT criteria and to validate the pipeline, which creates partial validation circularity.
Specific steps
- Self-definitional [Abstract]:
"we also introduce a new metric, the Causal Hallucination Rate (CHR), to quantify causal hallucination, guide the formulation of effective CoT trace criteria, and validate the effectiveness of our pipeline."
CHR simultaneously defines what counts as 'effective' CoT criteria (by guiding their formulation) and serves as the validation metric for the pipeline's success. The reported reduction in causal hallucination is therefore achieved and measured by the same newly introduced quantity used to select the criteria, making the improvement partly tautological rather than independently demonstrated.
Full rationale
The paper introduces CHR explicitly to guide CoT criteria formulation and to validate the pipeline, then claims the generated traces reduce causal hallucination as measured by CHR. This creates a self-referential loop in which effectiveness is defined and confirmed inside the same metric. Cross-dataset generalization provides partial independence, which is why this is flagged as partial rather than full circularity, but the core derivation chain still reduces the success claim to optimization of CHR by construction. No self-citations, uniqueness theorems, or other circularity patterns are present.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Smaller LLMs can be fine-tuned on generated CoT traces to internalize better causal reasoning without introducing new failure modes.