BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali

Ahmed Alfey Sani; Ajwad Abrar; Ekramul Alam Esham; Ishmam Tashdeed; Md Taukir Azam Chowdhury; Shefayat E Shams Adib

arXiv: 2605.31483 · v3 · pith:N6VT3RH2new · submitted 2026-05-29 · cs.CL · cs.AI

Pith reviewed 2026-06-30 10:49 UTC · model grok-4.3

keywords: hallucination evaluation · Bengali language models · multi-task benchmark · dual-track protocol · low-resource languages · BenHalluScore · question answering · summarization

The pith

BenHalluEval introduces the first dedicated hallucination benchmark for Bengali across four tasks using a dual-track protocol.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates BenHalluEval as the first systematic framework to measure hallucinations in large language models when they process Bengali. It generates 12,000 hallucinated candidates across twelve types drawn from existing datasets and tests seven models with separate tracks for false positives on correct answers and detection rates on fabricated ones. These tracks feed into the BenHalluScore metric, which ranges from 7.72 percent to 55.42 percent and shows that single-track checks produce misleading results. Chain-of-thought prompting changes answer patterns but does not reliably improve detection. The approach matters because Bengali lacks prior hallucination resources and standard English-tuned methods do not transfer cleanly.

Core claim

BenHalluEval constructs 12,000 hallucinated candidates using GPT-5.4 across twelve task-specific hallucination types for generative question answering, code-mixed QA, summarization, and reasoning; it evaluates seven LLMs under a dual-track protocol that measures false-positive rate on ground-truth instances in Track A and hallucination detection rate on hallucinated candidates in Track B; the resulting BenHalluScore ranges from 7.72 percent to 55.42 percent across models and tasks, while chain-of-thought prompting shifts response distributions without consistently improving discrimination.

What carries the argument

BenHalluEval dual-track protocol (Track A for false-positive rate on ground-truth, Track B for detection rate on hallucinated candidates) combined with the BenHalluScore calibration metric that jointly penalizes both failure modes.
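
The two tracks can be illustrated with a minimal sketch. It assumes each model verdict has been reduced to a boolean "flagged as hallucinated"; the `Judgment` record and both function names are hypothetical and are not taken from the paper's released code. The sketch computes only the two per-track rates, because the abstract does not give the formula that combines them into BenHalluScore.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One model verdict (hypothetical record layout, not the paper's schema)."""
    is_hallucinated: bool  # gold label: False for Track A items, True for Track B items
    flagged: bool          # True if the model said the candidate is hallucinated


def track_a_false_positive_rate(judgments):
    """Share of ground-truth (clean) items that the model wrongly flags."""
    clean = [j for j in judgments if not j.is_hallucinated]
    return sum(j.flagged for j in clean) / len(clean)


def track_b_detection_rate(judgments):
    """Share of hallucinated candidates that the model correctly flags."""
    hallucinated = [j for j in judgments if j.is_hallucinated]
    return sum(j.flagged for j in hallucinated) / len(hallucinated)


# Toy data: four clean items (one wrongly flagged), four hallucinated items (one missed).
judgments = [
    Judgment(False, False), Judgment(False, True),
    Judgment(False, False), Judgment(False, False),
    Judgment(True, True), Judgment(True, True),
    Judgment(True, False), Judgment(True, True),
]

fpr = track_a_false_positive_rate(judgments)
hdr = track_b_detection_rate(judgments)
print(f"Track A false-positive rate: {fpr:.2f}")
print(f"Track B detection rate:      {hdr:.2f}")
```

Keeping the two rates separate before any combination is what lets a reader see which failure mode drives a given score.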

If this is right

Models exhibit substantial variation in hallucination calibration across the four tasks.
Single-track evaluation approaches produce inflated scores from uniform response bias.
Chain-of-thought prompting shifts response distributions without consistently improving hallucination discrimination.
Prompting-only strategies are inadequate for low-resource language settings.
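
The point about uniform response bias can be made concrete with a small sketch. It compares a model that always answers "hallucinated", one that never does, and one that discriminates. The function `illustrative_joint_error` is a hypothetical stand-in chosen only to show joint penalization; it is not the paper's BenHalluScore, whose formula the abstract does not state.

```python
def rates(flags_on_clean, flags_on_hallucinated):
    """Return (Track A false-positive rate, Track B detection rate)."""
    fpr = sum(flags_on_clean) / len(flags_on_clean)
    hdr = sum(flags_on_hallucinated) / len(flags_on_hallucinated)
    return fpr, hdr


def illustrative_joint_error(fpr, hdr):
    """Hypothetical joint penalty (NOT BenHalluScore): mean of false-positive and miss rates."""
    return (fpr + (1.0 - hdr)) / 2.0


# A model that flags everything looks perfect on Track B alone.
always_yes = rates([True] * 100, [True] * 100)
# A model that flags nothing looks perfect on Track A alone.
always_no = rates([False] * 100, [False] * 100)
# A model that actually discriminates: 10 false alarms, 90 correct detections.
calibrated = rates([True] * 10 + [False] * 90, [True] * 90 + [False] * 10)

for name, (fpr, hdr) in [("always_yes", always_yes),
                         ("always_no", always_no),
                         ("calibrated", calibrated)]:
    joint = illustrative_joint_error(fpr, hdr)
    print(f"{name:11s} FPR={fpr:.2f} detection={hdr:.2f} joint_error={joint:.2f}")
```

Under this stand-in, both degenerate models receive the same joint error even though each scores perfectly on one track, which is the kind of inflation a single-track evaluation cannot expose.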

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dual-track structure could be applied to other low-resource languages to check whether single-track methods fail there too.
The generated candidate set could be validated against real model outputs to serve as training data for hallucination detectors.
Task-specific patterns in the twelve hallucination types might guide targeted mitigation techniques beyond prompting.

Load-bearing premise

The 12,000 hallucinated candidates generated by GPT-5.4 across twelve types are representative of the hallucinations that real LLMs produce when processing Bengali.

What would settle it

Direct comparison of hallucination types and frequencies between the GPT-5.4 generated set and the actual outputs of the seven evaluated LLMs on fresh Bengali inputs from the same task domains.
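
One way such a comparison could be set up is sketched below, assuming both the synthetic candidates and the observed model outputs have already been labeled with a hallucination type. The type names `type_a` and `type_b` are placeholders, since the twelve types are not named here, and total variation distance is one possible choice of divergence rather than anything the paper prescribes.

```python
from collections import Counter


def type_distribution(labels):
    """Map each hallucination type to its relative frequency."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    types = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in types)


# Placeholder labels: an evenly split synthetic set versus a skewed observed set.
synthetic_labels = ["type_a"] * 50 + ["type_b"] * 50
observed_labels = ["type_a"] * 80 + ["type_b"] * 20

tvd = total_variation_distance(
    type_distribution(synthetic_labels),
    type_distribution(observed_labels),
)
print(f"Total variation distance: {tvd:.2f}")
```

A distance near zero would support the representativeness premise for type frequencies; it would say nothing about whether the linguistic form of the errors matches.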

Figures

Figures reproduced from arXiv: 2605.31483 by Ahmed Alfey Sani, Ajwad Abrar, Ekramul Alam Esham, Ishmam Tashdeed, Md Taukir Azam Chowdhury, Shefayat E Shams Adib.

**Figure 2.** BenHalluEval system pipeline. Four tasks (GQA, Code-Mixed QA, Summarization, Reasoning)

**Figure 4.** BenHalluScore before and after CoT prompting for Mistral-nemo-12B (M) and LLaMA3.1-8B (L) across GQA, Summarization, and Reasoning. Lower is better. ↓ = improvement; ↑ = worsening.

Original abstract

Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LLMs) for Bengali. We introduce BenHalluEval, a fine-grained hallucination evaluation framework for Bengali covering four tasks: Generative Question Answering (GQA), Bangla-English Code-Mixed QA, Summarization, and Reasoning. We construct 12,000 hallucinated candidates using GPT-5.4 across twelve task-specific hallucination types, drawn from three existing Bengali datasets, and evaluate seven LLMs spanning reasoning-oriented, multilingual, and Bengali-centric categories under a dual-track protocol that independently measures false-positive rate on ground-truth instances (Track A) and hallucination detection rate on hallucinated candidates (Track B). To jointly penalise both failure modes and prevent inflated scores from uniform response bias, we propose BenHalluScore, a dual-track calibration metric that ranges from 7.72% to 55.42% across models and tasks, revealing substantial variation in hallucination calibration. Chain-of-thought prompting, applied as a mitigation strategy, shifts response distributions without consistently improving hallucination discrimination. BenHalluEval establishes the first dedicated hallucination benchmark for Bengali and highlights the inadequacy of single-track and prompting-only evaluation approaches for low-resource language settings. The dataset and code are available at https://anonymous.4open.science/r/BanglaHalluEval-EB77.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BenHalluEval is the first hallucination benchmark for Bengali but its GPT-5.4-generated candidates need validation against real model outputs.

The main takeaway is that this is the first systematic hallucination evaluation framework for Bengali LLMs. It covers four tasks drawn from existing datasets, generates 12,000 candidates across twelve hallucination types, tests seven models in reasoning, multilingual, and Bengali-centric categories, and proposes BenHalluScore as a dual-track metric that balances false-positive rate on clean data with detection rate on hallucinated examples.

The work does a few things right. The dual-track protocol and the calibration metric are a direct response to the problem of uniform response bias, and the results show BenHalluScore ranging from roughly 8% to 55% across models and tasks. Testing chain-of-thought prompting and finding it does not consistently help is also useful. Releasing the dataset and code supports reproducibility.

The soft spot is the data construction. The hallucinated candidates come from GPT-5.4, and the paper needs to show that these errors are representative of what the seven evaluated models actually produce on Bengali inputs. If the synthetic distribution differs in type or frequency from real model behavior, especially for the Bengali-centric models, then both the score calibration and the conclusion that single-track methods are inadequate rest on an unverified assumption. The abstract does not indicate any direct comparison or validation step for this.

This is for researchers working on multilingual evaluation or low-resource language reliability. It deserves peer review because it fills a documented gap with a concrete benchmark and metric, even though the synthetic data step will likely need more justification or additional experiments.

Referee Report

2 major / 2 minor

Summary. The paper introduces BenHalluEval, the first dedicated hallucination evaluation framework for Bengali LLMs across four tasks (Generative QA, Bangla-English code-mixed QA, summarization, reasoning). It constructs 12,000 hallucinated candidates via GPT-5.4 across twelve task-specific types drawn from three existing Bengali datasets, then evaluates seven LLMs (reasoning-oriented, multilingual, Bengali-centric) under a dual-track protocol: Track A measures false-positive rate on ground-truth instances and Track B measures hallucination detection rate on the synthetic candidates. A new dual-track metric, BenHalluScore, is proposed to jointly penalize both failure modes and avoid bias from uniform responses; scores range 7.72–55.42 %. Chain-of-thought prompting is tested as mitigation but shifts distributions without consistent gains in discrimination. The work claims this demonstrates the inadequacy of single-track and prompting-only approaches for low-resource settings and releases the dataset and code.

Significance. If the synthetic candidates prove representative, the work provides the first systematic hallucination benchmark for Bengali and introduces a dual-track calibration metric that addresses a known weakness in prior single-metric evaluations. The explicit release of data and code supports reproducibility. The empirical finding that CoT prompting fails to improve discrimination consistently is a useful negative result for low-resource prompting research.

major comments (2)

[Data construction] Data construction section: The 12,000 hallucinated candidates are generated exclusively by GPT-5.4, yet no validation (human annotation, overlap statistics, or error-type comparison) is reported showing that these synthetic errors match the distribution, frequency, or linguistic form of hallucinations produced by the seven evaluated models on Bengali inputs. Because Track B and the BenHalluScore calibration rest directly on detection rates against these candidates, this unverified proxy assumption is load-bearing for the central claim that single-track methods are inadequate.
[Evaluation protocol] Evaluation protocol (§4 or equivalent): The dual-track design and BenHalluScore are well-motivated, but the manuscript does not report per-model, per-task confusion matrices or calibration curves that would allow readers to verify that the reported 7.72–55.42 % range reflects genuine variation rather than artifacts of the GPT-5.4 candidate distribution.

minor comments (2)

[Metric definition] The abstract states the BenHalluScore range but does not define its exact formula; the main text should include the closed-form expression (including how false-positive and detection rates are combined) in an early section for reproducibility.
[Task definitions] Task descriptions would benefit from one additional sentence each clarifying how the twelve hallucination types map onto the four tasks, to avoid ambiguity when readers attempt to replicate the candidate generation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below, indicating where revisions to the manuscript will be made to strengthen the work.

Referee: [Data construction] Data construction section: The 12,000 hallucinated candidates are generated exclusively by GPT-5.4, yet no validation (human annotation, overlap statistics, or error-type comparison) is reported showing that these synthetic errors match the distribution, frequency, or linguistic form of hallucinations produced by the seven evaluated models on Bengali inputs. Because Track B and the BenHalluScore calibration rest directly on detection rates against these candidates, this unverified proxy assumption is load-bearing for the central claim that single-track methods are inadequate.

Authors: We acknowledge that the manuscript does not report explicit validation (such as human annotation or direct comparison) demonstrating that the GPT-5.4-generated candidates match the hallucination distributions of the seven evaluated models. The generation follows established hallucination taxonomies drawn from prior literature and is applied at scale to create controlled, task-specific examples, which is a common proxy approach in hallucination benchmarking. Nevertheless, we agree that this assumption merits explicit discussion given its role in the evaluation. In the revision we will add a limitations subsection that (a) states the proxy nature of the synthetic data, (b) reports the distribution of the twelve hallucination types across the 12,000 candidates, and (c) discusses potential implications for generalizability to model-specific error patterns. revision: yes
Referee: [Evaluation protocol] Evaluation protocol (§4 or equivalent): The dual-track design and BenHalluScore are well-motivated, but the manuscript does not report per-model, per-task confusion matrices or calibration curves that would allow readers to verify that the reported 7.72–55.42 % range reflects genuine variation rather than artifacts of the GPT-5.4 candidate distribution.

Authors: We agree that per-model, per-task confusion matrices and calibration information would improve transparency and allow readers to inspect the underlying Track A and Track B rates. In the revised manuscript we will add these diagnostics as an appendix, including confusion matrices for each of the seven models across the four tasks and, where space permits, calibration curves that plot the relationship between reported BenHalluScore and the raw detection rates. revision: yes
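
The per-model, per-task diagnostics discussed above could be tabulated along the following lines. This is a minimal sketch that assumes each evaluation record carries a model name, a task name, the gold label, and the model's verdict; the record layout and the model name `model_x` are hypothetical.

```python
from collections import defaultdict


def confusion_matrices(records):
    """Build one 2x2 confusion matrix per (model, task) pair.

    records: iterable of (model, task, is_hallucinated, flagged) tuples.
    Track B items (is_hallucinated=True) contribute TP/FN;
    Track A items (is_hallucinated=False) contribute FP/TN.
    """
    matrices = defaultdict(lambda: {"TP": 0, "FN": 0, "FP": 0, "TN": 0})
    for model, task, is_hallucinated, flagged in records:
        if is_hallucinated:
            cell = "TP" if flagged else "FN"
        else:
            cell = "FP" if flagged else "TN"
        matrices[(model, task)][cell] += 1
    return dict(matrices)


# Toy records for one hypothetical model on two of the four tasks.
records = [
    ("model_x", "GQA", True, True),
    ("model_x", "GQA", True, False),
    ("model_x", "GQA", False, True),
    ("model_x", "GQA", False, False),
    ("model_x", "Summarization", True, True),
    ("model_x", "Summarization", False, False),
]

matrices = confusion_matrices(records)
for (model, task), cells in sorted(matrices.items()):
    print(model, task, cells)
```

From each matrix, the Track A false-positive rate is FP / (FP + TN) and the Track B detection rate is TP / (TP + FN).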

Circularity Check

0 steps flagged

No circularity in benchmark construction or metric definition

The paper constructs 12,000 hallucinated candidates from three existing Bengali datasets using GPT-5.4 across twelve task-specific types, then evaluates seven LLMs under an independent dual-track protocol (Track A on ground-truth, Track B on candidates) and defines BenHalluScore as a calibration metric combining false-positive and detection rates. No quoted step reduces any result to its own inputs by construction, no parameter is fitted on a subset and renamed as a prediction, and no load-bearing claim rests on self-citation or imported uniqueness theorems. The derivation remains self-contained against external datasets and the proposed metric is defined directly from the dual-track measurements without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is minimal; the main unverified premise is the representativeness of the GPT-generated hallucination examples.

axioms (1)

domain assumption: The twelve task-specific hallucination types cover the relevant failure modes for Bengali LLMs.
Basis for constructing the 12,000 candidates from three existing datasets.
