
arxiv: 2605.17481 · v1 · pith:D6QUKWHWnew · submitted 2026-05-17 · 💻 cs.CL

Hybrid Feature Combinations with CNN for Bangla Fake News Classification

Pith reviewed 2026-05-20 12:48 UTC · model grok-4.3

classification 💻 cs.CL
keywords Bangla fake news · CNN classifier · hybrid features · semantic features · statistical features · character-level features · BanFakeNews-2.0

The pith

Combining semantic, statistical, and character-level features with a CNN improves recall and F1 scores for Bangla fake news detection over single-feature baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether different groups of features help a convolutional neural network spot fake Bangla news more reliably. It compares semantic features that capture meaning, statistical features that count patterns, and character-level features that look at raw text strings, both alone and in combinations. On the BanFakeNews-2.0 dataset the hybrid versions raise recall and F1 scores compared with any one group used by itself. This matters because many people in Bangladesh get news from social media, where false stories spread quickly and harm trust in real reporting. The work shows a practical way to pick and mix features so the model catches more fake items without needing an entirely new architecture.

Core claim

On the BanFakeNews-2.0 dataset a CNN classifier reaches its highest recall and F1 scores when semantic, statistical, and character-level features are supplied together rather than when any single feature group is used alone.

What carries the argument

Hybrid feature combinations (semantic plus statistical plus character-level) fed into a convolutional neural network for binary classification of Bangla news articles.
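A minimal sketch of what "supplying the feature families together" can mean in practice: each document gets a word-level TF-IDF block, a character n-gram TF-IDF block, and a small block of statistical counts, concatenated into one vector. This is an illustration only, not the authors' implementation. The trigram size, the four statistical counts, and all function names are assumptions, and the embedding-based semantic features named in the Figure 2 caption (Word2Vec, FastText) are left out because they require trained models.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a string (n=3 is an assumption)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs_tokens):
    """Compute smoothed TF-IDF vectors over a shared, sorted vocabulary."""
    vocab = sorted({tok for toks in docs_tokens for tok in toks})
    n_docs = len(docs_tokens)
    df = Counter(tok for toks in docs_tokens for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in docs_tokens:
        counts = Counter(toks)
        total = max(len(toks), 1)
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def statistical_features(text):
    """Hypothetical counts: characters, words, mean word length, punctuation marks."""
    words = text.split()
    n_words = len(words)
    mean_len = sum(len(w) for w in words) / n_words if n_words else 0.0
    n_punct = sum(text.count(ch) for ch in "!?।")
    return [float(len(text)), float(n_words), mean_len, float(n_punct)]


def hybrid_features(docs):
    """Concatenate word TF-IDF, character TF-IDF, and statistical blocks per document."""
    _, word_vecs = tfidf_vectors([d.split() for d in docs])
    _, char_vecs = tfidf_vectors([char_ngrams(d) for d in docs])
    return [w + c + statistical_features(d)
            for w, c, d in zip(word_vecs, char_vecs, docs)]


docs = ["এটি একটি খবর", "এটি ভুয়া খবর!"]
features = hybrid_features(docs)
print(len(features), len(features[0]))
```

Every document ends up with a vector of the same length, so the concatenated matrix can be passed to any downstream classifier. How the paper reshapes such vectors for the CNN input is not stated in the material summarized here.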

If this is right

  • The best-performing model uses all three feature families together rather than any subset.
  • Recall improves more than precision when the hybrid set is used, so the detector misses fewer fake articles.
  • The same feature-selection step can be repeated on new Bangla news collections without changing the CNN architecture.
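To make the recall-versus-precision point concrete, the sketch below computes precision, recall, and F1 from raw predictions. It assumes the label convention quoted in the Figure 1 caption (real = 1, fake = 0) and treats the fake class as the positive class. Whether the paper reports its metrics for the fake class, or as an average over both classes, is not stated here, so that choice is an assumption. The toy labels are invented for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=0):
    """Return (precision, recall, F1) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy data: 4 fake articles (0) and 6 real articles (1).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

A missed fake article lowers recall (it counts as a false negative), while a real article flagged as fake lowers precision (a false positive), which is why a recall gain translates directly into fewer undetected fake items.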

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reported gains suggest that low-resource language detection pipelines can often be strengthened by adding cheap statistical and character counts instead of switching to larger models.
  • If the hybrid advantage holds on other South Asian languages, the same feature recipe could serve as a quick baseline before language-specific tuning begins.

Load-bearing premise

The labels in the BanFakeNews-2.0 dataset correctly mark real and fake articles, and the feature extraction process does not create artificial performance gains.

What would settle it

Run the same CNN pipeline on a version of the dataset whose labels have been randomly permuted; if the hybrid-feature advantage disappears or reverses, the original gains are likely tied to label quality rather than the feature combinations.
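The mechanics of such a label-permutation check can be sketched as follows. The `score_fn` argument is a hypothetical stand-in for "retrain the pipeline on these labels and return a test metric"; here it is replaced by a toy agreement score so the example runs on its own. The number of rounds and the seed are arbitrary assumptions.

```python
import random


def permutation_gap(labels, score_fn, n_rounds=100, seed=0):
    """Return the score on the true labels and the mean score on shuffled labels."""
    rng = random.Random(seed)
    true_score = score_fn(list(labels))
    null_scores = []
    for _ in range(n_rounds):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # breaks the link between articles and labels
        null_scores.append(score_fn(shuffled))
    return true_score, sum(null_scores) / n_rounds


# Toy stand-in: 5 fake (0) and 15 real (1) labels, scored against fixed predictions.
labels = [0] * 5 + [1] * 15
fixed_predictions = list(labels)


def toy_score(candidate_labels):
    """Fraction of positions where the candidate labels match the fixed predictions."""
    matches = sum(1 for y, p in zip(candidate_labels, fixed_predictions) if y == p)
    return matches / len(candidate_labels)


true_score, null_mean = permutation_gap(labels, toy_score)
print(true_score, null_mean)
```

With shuffled labels, any gap between feature sets that survives cannot come from the label signal, which is what makes the comparison informative.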

Figures

Figures reproduced from arXiv: 2605.17481 by Babe Sultana, Md Gulzar Hussain, Md Rinku Ali.

Figure 1. Proposed Research Flow Diagram.
A. Dataset: The dataset used in this research is the BanFakeNews-2.0 corpus [15], a Bangla fake news dataset comprising 48,678 real news articles and 12,903 manually annotated fake news articles. The original dataset included separate training, validation, and test sets. For our experiments, we combined them into a single file and labeled real news as 1 and fake news as 0. A…
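The class counts quoted in the Figure 1 caption imply a noticeable imbalance, which is worth keeping in mind when reading accuracy figures. The short calculation below uses only those two counts; the variable names are illustrative.

```python
# Class counts as quoted in the Figure 1 caption (real = 1, fake = 0).
N_REAL = 48_678
N_FAKE = 12_903

total = N_REAL + N_FAKE
fake_share = N_FAKE / total

# A classifier that always predicts the majority class ("real") would reach
# this accuracy while detecting no fake article at all.
majority_baseline_accuracy = max(N_REAL, N_FAKE) / total

print(f"total={total}")
print(f"fake share={fake_share:.3f}")
print(f"majority baseline accuracy={majority_baseline_accuracy:.3f}")
```

Because roughly one article in five is fake, recall and F1 on the fake class are more informative than raw accuracy for this dataset.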
Figure 2. Training and Validation History for TF-IDF, Word2Vec, FastText, Character-level TF-IDF, and statistical characteristics Combination.
Original abstract

Nowadays, people in Bangladesh frequently rely on the internet and social media for daily news instead of traditional newspapers. However, the spread of false Bangla news through these platforms poses risks and challenges to the credibility of authentic media. Although several studies have been conducted on detecting Bangla fake news, there is still significant room for improvement in this area. To assist people, this research explores the effectiveness of feature selection approaches in identifying appropriate features, such as semantic, statistical, and character-level features, or their combinations, on the BanFakeNews-2.0 dataset for detecting Bangla fake news using a CNN model. In this paper, key findings reveal that combining multiple features significantly improves recall and F1-scores compared to using individual features alone. The code for this research is available at https://github.com/gulzar09/Bn_FNews_H.Feature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper explores feature selection and hybrid combinations of semantic, statistical, and character-level features fed into a CNN classifier for Bangla fake news detection on the BanFakeNews-2.0 dataset. It reports that multi-feature combinations yield notable gains in recall and F1-score relative to single-feature baselines, with code released on GitHub.

Significance. If the performance gains prove robust under rigorous validation, the work would usefully demonstrate the value of feature complementarity for low-resource-language fake-news tasks and could guide practitioners toward hybrid representations in CNN pipelines. The public code release is a positive step toward reproducibility.

major comments (3)
  1. [Methods] Methods: The manuscript supplies no information on train-test split ratios, the hyperparameter search procedure (learning rate, filter sizes, dropout), or whether nested cross-validation was employed when selecting and evaluating feature combinations. Without these details the headline claim that hybrids improve recall/F1 cannot be verified as free of selection bias or multiple-comparison artifacts.
  2. [Results] Results: No statistical significance tests, confidence intervals, or error bars accompany the reported metrics. Consequently it is impossible to determine whether the observed gains over individual features are reliable or could arise from random variation.
  3. [Experimental Setup] Experimental design: The description of post-hoc feature selection and combination exploration does not clarify whether performance on the same data used for final reporting was used to choose which hybrids to highlight. If so, the central claim of genuine complementarity is at risk of inflation.
minor comments (2)
  1. [Abstract] The abstract states that 'key findings reveal that combining multiple features significantly improves recall and F1-scores' but does not quantify the absolute or relative gains; adding concrete numbers would strengthen the summary.
  2. [Feature Extraction] Notation for the three feature families (semantic, statistical, character-level) is introduced without explicit definitions or formulas; a short table or equations would improve clarity.
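One way to attach an interval to a reported gain, as the second major comment requests, is a paired percentile bootstrap over the test set. The sketch below is not the authors' procedure. The function names, the 1,000 resampling rounds, the 95% level, and the toy predictions are all assumptions for illustration, with fake = 0 treated as the positive class.

```python
import random


def f1_score(y_true, y_pred, positive=0):
    """F1 for the class given by `positive`; 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_gap(y_true, pred_a, pred_b, n_rounds=1000, alpha=0.05, seed=0):
    """Percentile interval for F1(pred_b) - F1(pred_a) on a shared test set."""
    rng = random.Random(seed)
    n = len(y_true)
    gaps = []
    for _ in range(n_rounds):
        # Resample test items with replacement; both models see the same items.
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        f1_a = f1_score(yt, [pred_a[i] for i in idx])
        f1_b = f1_score(yt, [pred_b[i] for i in idx])
        gaps.append(f1_b - f1_a)
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_rounds)]
    upper = gaps[int((1 - alpha / 2) * n_rounds) - 1]
    return lower, upper


# Toy data: model B (stand-in for a hybrid model) is perfect, model A misses two fakes.
y_true = [0] * 10 + [1] * 30
pred_b = list(y_true)
pred_a = [1, 1] + [0] * 8 + [1] * 30
lower, upper = bootstrap_f1_gap(y_true, pred_a, pred_b)
print(lower, upper)
```

If the resulting interval excludes zero, the gain is unlikely to be explained by test-set sampling noise alone. Variation across random seeds or training runs would need a separate analysis.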

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and suggestions. We address each of the major comments below and will incorporate the necessary revisions to improve the manuscript's clarity and rigor.

Point-by-point responses
  1. Referee: The manuscript supplies no information on train-test split ratios, the hyperparameter search procedure (learning rate, filter sizes, dropout), or whether nested cross-validation was employed when selecting and evaluating feature combinations. Without these details the headline claim that hybrids improve recall/F1 cannot be verified as free of selection bias or multiple-comparison artifacts.

    Authors: We agree with the referee that these methodological details are crucial. Our experiments utilized an 80:20 train-test split. Hyperparameters were selected using grid search on the training portion, with specific ranges for learning rate, filter sizes, and dropout rates. Nested cross-validation was not used. In the revised version, we will provide a comprehensive description of the experimental protocol, including these details and a discussion of potential limitations regarding selection bias. revision: yes

  2. Referee: No statistical significance tests, confidence intervals, or error bars accompany the reported metrics. Consequently it is impossible to determine whether the observed gains over individual features are reliable or could arise from random variation.

    Authors: We acknowledge this limitation in the current manuscript. To address it, we will conduct statistical significance testing (e.g., using paired t-tests) and include confidence intervals and error bars in the results section of the revised manuscript. This will help demonstrate the reliability of the performance gains. revision: yes

  3. Referee: The description of post-hoc feature selection and combination exploration does not clarify whether performance on the same data used for final reporting was used to choose which hybrids to highlight. If so, the central claim of genuine complementarity is at risk of inflation.

    Authors: We appreciate the concern regarding potential data leakage in feature selection. In our work, feature combinations were explored and selected using cross-validation on the training data, with final evaluation performed on an independent test set. We will revise the experimental setup section to explicitly detail this process and the measures taken to avoid overfitting to the test data. revision: yes
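The 80:20 hold-out protocol mentioned in the first response can be sketched as a stratified split, so that both portions keep the same real-to-fake ratio. Whether the authors stratified their split is not stated, so stratification, the seed, and the function name are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 80 real (1) and 20 fake (0) articles.
labels = [1] * 80 + [0] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

To match the third response, any comparison of feature combinations would use only `train_idx` (for example through cross-validation inside it), and `test_idx` would be touched once for the final report.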

Circularity Check

0 steps flagged

No circularity: empirical ML comparison with no derivation chain

full rationale

The paper is a standard empirical study that evaluates combinations of semantic, statistical, and character-level features fed to a CNN on the BanFakeNews-2.0 dataset. No mathematical derivation, first-principles prediction, or claimed uniqueness theorem is present. The central finding (hybrid features improve recall/F1) is an experimental outcome, not a quantity that reduces to its inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear. The work is self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is populated from the stated experimental setup. No new entities are postulated. Standard supervised-learning assumptions apply.

free parameters (1)
  • CNN hyperparameters (learning rate, filter sizes, dropout)
    Typical CNN training requires choosing these values; the abstract does not state whether they were tuned on the test set or held-out validation data.
axioms (1)
  • domain assumption: BanFakeNews-2.0 labels are ground truth with negligible noise
    The performance numbers rest on the assumption that the dataset annotations are reliable.
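The free parameter listed in the ledger can be handled with a grid search that scores each configuration on held-out validation data rather than on the test set. The sketch below only shows the enumeration logic. The grid values and `toy_validation_score` are hypothetical placeholders. In a real run, the scoring function would train the CNN with the given configuration and return a validation metric.

```python
from itertools import product

# Hypothetical grid; the paper's actual ranges are not given here.
GRID = {
    "learning_rate": [1e-3, 1e-4],
    "filter_size": [3, 5],
    "dropout": [0.3, 0.5],
}


def grid_search(grid, validation_score):
    """Return the configuration with the highest validation score, and that score."""
    names = sorted(grid)
    best_config, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = validation_score(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_validation_score(config):
    """Placeholder score that peaks at dropout=0.5, filter_size=3, learning_rate=1e-3."""
    return (-abs(config["dropout"] - 0.5)
            - abs(config["filter_size"] - 3)
            - abs(config["learning_rate"] - 1e-3))


best_config, best_score = grid_search(GRID, toy_validation_score)
print(best_config, best_score)
```

Keeping the test set out of `validation_score` is the point of the design: the reported metrics then estimate performance on unseen data instead of rewarding the configuration that happened to fit the test articles.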




Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

  [1] E. Hossain, M. Nadim Kaysar, A. Z. M. Jalal Uddin Joy, M. Mizanur Rahman, and W. Rahman, “A study towards bangla fake news detection using machine learning and deep learning,” in Sentimental analysis and deep learning: proceedings of ICSADL 2021. Springer, 2021, pp. 79–95.

  [2] M. Fayaz, A. Khan, M. Bilal, and S. U. Khan, “Machine learning for fake news classification with optimal feature selection,” Soft Computing, vol. 26, no. 16, pp. 7763–7771, 2022.

  [3] R. I. Rasel, A. H. Zihad, N. Sultana, and M. M. Hoque, “Bangla fake news detection using machine learning, deep learning and transformer models,” in 2022 25th International Conference on Computer and Information Technology (ICCIT). IEEE, 2022, pp. 959–964.

  [4] M. S. Khatun and I. Khan, “Bangla counterfeit news identification: Using the power of bert,” in 2024 IEEE International Conference on Power, Electrical, Electronics and Industrial Applications (PEEIACON). IEEE, 2024, pp. 518–522.

  [5] M. G. Hussain, M. R. Hasan, M. Rahman, J. Protim, and S. Al Hasan, “Detection of bangla fake news using mnb and svm classifier,” in 2020 International conference on computing, electronics & communications engineering (iCCECE). IEEE, 2020, pp. 81–85.

  [6] Z. Wahid, A. A. Imran, and M. R. I. Rifat, “Bnnetxtreme: An enhanced methodology for bangla fake news detection online,” in International Conference on Computational Data and Social Networks. Springer, 2022, pp. 157–166.

  [7] R. Barua, M. Rahman, and U. G. Joy, “Comparative analysis of bangla news classification: a study of fake news detection and multiclass classification using bert and fasttext,” International Journal of Computers and Applications, vol. 47, no. 5, pp. 475–485, 2025.

  [8] S. Rohman, J. Ferdous, S. M. R. Ullah, and M. A. Rahman, “Ibfnd: An improved dataset for bangla fake news detection and comparative analysis of performance of baseline models,” in 2023 International Conference on Next-Generation Computing, IoT and Machine Learning (NCIM). IEEE, 2023, pp. 1–6.

  [9] M. Ahammad, A. Sani, K. Rahman, M. T. Islam, M. M. R. Masud, M. M. Hassan, M. A. T. Rony, S. M. N. Alam, and M. S. H. Mukta, “Roberta-gcn: A novel approach for combating fake news in bangla using advanced language processing and graph convolutional networks,” IEEE Access, 2024.

  [10] P. K. Mondal, S. S. Khan, M. M. Rana, S. S. Ramit, A. Sattar, and M. S. Rahman, “Breaking the fake news barrier: Deep learning approaches in bangla language,” in 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT). IEEE, 2024, pp. 1–6.

  [11] U. Roy, M. S. Tahosin, M. M. Hasan, T. Islam, F. Imtiaz, M. R. Sadik, Y. Maleh, R. B. Sulaiman, and M. S. Hassan Talukder, “Enhancing bangla fake news detection using bidirectional gated recurrent units and deep learning techniques,” in Proceedings of the 7th International Conference on Networking, Intelligent Systems and Security, 2024, pp. 1–10.

  [12] N. Absar, T. Mahmud, A. Hanip, and M. S. Hossain, “Semi-supervised based bangla fake review detection: A comparative analysis,” in 2025 International Conference on Inventive Computation Technologies (ICICT). IEEE, 2025, pp. 1428–1433.

  [13] A. Akther, K. M. Alam, and R. Debnath, “Automatic detection of manipulated bangla news: A new knowledge-driven approach,” Natural Language Processing Journal, p. 100155, 2025.

  [14] F. T. J. Faria, M. B. Moin, Z. Hasan, M. A. A. Khandaker, N. Islam, K. M. Hasib, and M. Mridha, “Multibanfakedetect: Integrating advanced fusion techniques for multimodal detection of bangla fake news in under-resourced contexts,” International Journal of Information Management Data Insights, vol. 5, no. 2, p. 100347, 2025.

  [15] H. M. Shibu, S. Datta, M. S. Miah, N. Sami, M. S. Chowdhury, and M. S. Islam, “From scarcity to capability: Empowering fake news detection in low-resource languages with LLMs,” in Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages, R. Weerasinghe, I. Anuradha, and D. Sumanathilaka, Eds. Abu Dhabi: Asso...

  [16] H. Xia, “Continuous-bag-of-words and skip-gram for word vector training and text classification,” in Journal of Physics: Conference Series, vol. 2634, no. 1. IOP Publishing, 2023, p. 012052.

  [17] J. C. Young and A. Rusli, “Review and visualization of facebook’s fasttext pretrained word vector model,” in 2019 international conference on engineering, science, and industrial applications (ICESI). IEEE, 2019, pp. 1–6.

  [18] S. Sverdrup-Thygeson and P. C. Haddow, “Feature selection for fake news classification,” in 2021 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, 2021, pp. 1–8.