pith. machine review for the scientific record.

arxiv: 2604.14218 · v1 · submitted 2026-04-13 · 💻 cs.CL · cs.AI

Recognition: unknown

MEME-Fusion@CHiPSAL 2026: Multimodal Ablation Study of Hate Detection and Sentiment Analysis on Nepali Memes

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:47 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: hate speech detection · multimodal fusion · Nepali memes · Devanagari script · sentiment analysis · low-resource languages · cross-modal attention · gating network

The pith

A cross-modal attention fusion model with dynamic gating improves binary hate detection on Nepali memes by 5.9 percent F1-macro over text-only baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates a hybrid architecture that pairs CLIP visual features with BGE-M3 multilingual text embeddings, then routes them through four-head self-attention and a learnable per-sample gate. Systematic tests across eight configurations show the added cross-modal path lifts performance on the binary hate task while exposing two failure modes: English-centric vision encoders produce near-random results on Devanagari text, and conventional ensembles overfit catastrophically when training folds contain roughly 850 examples. A sympathetic reader would care because the setting combines extreme data scarcity, multimodal meme structure, and a non-Latin script that most public vision models were never trained to handle.

Core claim

Explicit cross-modal reasoning via a 4-head self-attention layer and learnable gating network that dynamically weights visual and textual contributions on each sample produces a 5.9 percent F1-macro gain on Subtask A relative to the text-only baseline; the same evaluation reveals that standard CLIP vision encoders perform near chance on Devanagari-scripted images and that ensemble averaging collapses under the small per-fold sample size.

What carries the argument

The hybrid cross-modal attention fusion architecture that connects CLIP ViT-B/32 visual encoding to BGE-M3 text encoding through 4-head self-attention and a learnable gating network.
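As a concrete reading of that architecture, here is a minimal PyTorch sketch, assuming precomputed pooled CLIP ViT-B/32 image features (512-d) and BGE-M3 sentence embeddings (1024-d); the shared projection width, the scalar form of the gate, and all names are illustrative assumptions, not details confirmed by the paper.

```python
# Minimal sketch of the fusion head described above (not the authors' code).
# Assumed inputs: precomputed CLIP ViT-B/32 pooled image features (512-d) and
# BGE-M3 sentence embeddings (1024-d). Projection width, gate shape, and the
# two-token attention layout are illustrative choices.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, img_dim=512, txt_dim=1024, hidden=256, num_classes=2):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        # 4-head self-attention over the two modality tokens
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        # learnable gate: one scalar per sample balancing vision vs. text
        self.gate = nn.Sequential(nn.Linear(2 * hidden, 1), nn.Sigmoid())
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, img_feat, txt_feat):
        v = self.img_proj(img_feat)            # (B, hidden)
        t = self.txt_proj(txt_feat)            # (B, hidden)
        tokens = torch.stack([v, t], dim=1)    # (B, 2, hidden)
        attended, _ = self.attn(tokens, tokens, tokens)
        v_att, t_att = attended[:, 0], attended[:, 1]
        g = self.gate(torch.cat([v_att, t_att], dim=-1))  # (B, 1), in [0, 1]
        fused = g * v_att + (1 - g) * t_att    # per-sample modality weighting
        return self.classifier(fused)
```

A call would look like `logits = CrossModalFusion()(clip_feats, bge_feats)` on feature tensors of shape (batch, 512) and (batch, 1024); whether the authors attend over full token sequences rather than pooled vectors, or use a vector-valued gate, is not specified in the material above.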

If this is right

  • Cross-modal attention plus per-sample gating can be applied to other low-resource multimodal classification tasks that mix text and image.
  • Vision encoders pre-trained only on Latin-script images will systematically underperform on Devanagari or similar scripts unless additional script-specific adaptation is performed.
  • Ensemble methods become unreliable once training-set size per fold drops below roughly one thousand examples.
  • Gating that learns modality weights sample-by-sample is more robust than fixed fusion weights under data scarcity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The results imply that future shared tasks in low-resource settings should supply script-aware vision backbones rather than relying on off-the-shelf CLIP variants.
  • The gating network could be extended to decide whether to ignore one modality entirely on a per-sample basis, which might further reduce overfitting; a sketch of one such hard gate follows this list.
  • The observed ensemble collapse suggests that model diversity must be enforced explicitly when data are this limited.
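A minimal sketch of that hypothetical hard gate (an editorial extension, not something the paper implements): a straight-through Gumbel-softmax selector that commits to exactly one modality per sample in the forward pass while keeping gradients for both paths. The module name, hidden width, and temperature are assumptions.

```python
# Hypothetical hard per-sample modality gate (not in the paper): pick exactly
# one of the two attended modality vectors per sample via a straight-through
# Gumbel-softmax, so an unhelpful modality can be dropped entirely.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HardModalityGate(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.scorer = nn.Linear(2 * hidden, 2)  # logits for [vision, text]

    def forward(self, v_att, t_att, tau=1.0):
        logits = self.scorer(torch.cat([v_att, t_att], dim=-1))  # (B, 2)
        # hard=True yields a one-hot choice forward, soft gradients backward
        mask = F.gumbel_softmax(logits, tau=tau, hard=True)      # (B, 2)
        return mask[:, :1] * v_att + mask[:, 1:] * t_att
```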

Load-bearing premise

That the observed performance lift and the two failure modes result from the cross-modal architecture and gating rather than from unstated preprocessing choices, hyperparameter tuning, or the particular data splits used.

What would settle it

Re-running the eight configurations on the same data splits but with a different random seed for the gating network and attention layers, then observing whether the 5.9 percent margin and the two failure modes remain statistically significant.
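One way to run that check, sketched below: collect per-fold F1-macro differences (fusion minus text-only) across a few seeds and apply a paired sign-flip permutation test to the pooled deltas. The delta values here are placeholders for illustration, not results from the paper.

```python
# Sketch of a seed-robustness check: paired sign-flip permutation test on
# per-fold F1-macro deltas (fusion minus text-only). All numbers are
# placeholders, not values reported by the paper.
import numpy as np

def sign_flip_pvalue(deltas, n_perm=10000, seed=0):
    """Two-sided p-value for the mean of paired differences under sign flips."""
    rng = np.random.default_rng(seed)
    deltas = np.asarray(deltas, dtype=float)
    observed = deltas.mean()
    flips = rng.choice([-1.0, 1.0], size=(n_perm, deltas.size))
    null = (flips * deltas).mean(axis=1)
    return float((np.abs(null) >= abs(observed)).mean())

# hypothetical per-fold deltas from two different seeds, pooled for the test
deltas_by_seed = [
    [0.07, 0.05, 0.06, 0.04, 0.08],
    [0.06, 0.03, 0.07, 0.05, 0.06],
]
pooled = [d for run in deltas_by_seed for d in run]
print(f"mean delta {np.mean(pooled):+.3f}, p = {sign_flip_pvalue(pooled):.4f}")
```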

read the original abstract

Hate speech detection in Devanagari-scripted social media memes presents compounded challenges: multimodal content structure, script-specific linguistic complexity, and extreme data scarcity in low-resource settings. This paper presents our system for the CHiPSAL 2026 shared task, addressing both Subtask A (binary hate speech detection) and Subtask B (three-class sentiment classification: positive, neutral, negative). We propose a hybrid cross-modal attention fusion architecture that combines CLIP (ViT-B/32) for visual encoding with BGE-M3 for multilingual text representation, connected through 4-head self-attention and a learnable gating network that dynamically weights modality contributions on a per-sample basis. Systematic evaluation across eight model configurations demonstrates that explicit cross-modal reasoning achieves a 5.9% F1-macro improvement over text-only baselines on Subtask A, while uncovering two unexpected but critical findings: English-centric vision models exhibit near-random performance on Devanagari script, and standard ensemble methods catastrophically degrade under data scarcity (N ≈ 850 per fold) due to correlated overfitting. The code can be accessed at https://github.com/Tri-Yantra-Technologies/MEME-Fusion/

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a hybrid cross-modal attention fusion architecture for the CHiPSAL 2026 shared task on Nepali memes, combining CLIP (ViT-B/32) visual encoding with BGE-M3 multilingual text representations through 4-head self-attention and a learnable gating network. It reports results from systematic evaluation across eight model configurations on Subtask A (binary hate speech detection) and Subtask B (three-class sentiment analysis), claiming a 5.9% F1-macro improvement over text-only baselines on Subtask A, along with two findings: English-centric vision models show near-random performance on Devanagari script, and standard ensembles degrade under data scarcity (N ≈ 850 per fold). Code is released at the provided GitHub link.

Significance. If the reported gains prove robustly attributable to the cross-modal components, the work would offer a useful contribution to multimodal NLP in low-resource, non-Latin script settings by providing an ablation study and highlighting practical limitations of off-the-shelf vision models and ensembles. The public code release supports reproducibility and enables independent verification of the eight configurations.

major comments (2)
  1. [Abstract and §4 (Results and Ablation)] The headline 5.9% F1-macro lift on Subtask A is presented as evidence for the value of explicit cross-modal reasoning via 4-head attention and gating. However, the manuscript does not state whether the text-only baselines (BGE-M3 alone) received identical hyperparameter budgets, learning-rate schedules, early-stopping rules, or tokenization choices as the full model. With only ~850 samples per fold, modest differences in these factors can produce gains of this magnitude; without reported per-fold standard deviations, multiple random seeds, or a statement that all eight configurations were tuned under the same protocol, the attribution to the gating network and cross-modal attention cannot be confirmed as load-bearing.
  2. [§3 (Model Architecture) and §5 (Findings)] The two reported failure modes (CLIP near-random on Devanagari; ensembles overfitting) are consistent with known issues, yet the text does not clarify whether these behaviors were observed uniformly across all eight configurations or whether they depend on specific choices such as the exact ViT-B/32 checkpoint, ensemble size, or data-split seed. This detail is needed to establish that the failure modes are general rather than configuration-specific.
minor comments (2)
  1. [Abstract and §1] The abstract and introduction could more explicitly define the eight model configurations (e.g., which variants include gating, which are text-only, which are vision-only) to allow readers to map the ablation results directly to the architectural claims.
  2. [§4 (Results)] Tables reporting F1 scores should include the number of runs or folds used and any statistical significance tests performed; this is a standard expectation for shared-task submissions on small datasets.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important aspects of experimental rigor that we will address to strengthen the attribution of results and the generality of the reported failure modes. Below we respond point by point and indicate the planned revisions.

read point-by-point responses
  1. Referee: [Abstract and §4 (Results and Ablation)] The headline 5.9% F1-macro lift on Subtask A is presented as evidence for the value of explicit cross-modal reasoning via 4-head attention and gating. However, the manuscript does not state whether the text-only baselines (BGE-M3 alone) received identical hyperparameter budgets, learning-rate schedules, early-stopping rules, or tokenization choices as the full model. With only ~850 samples per fold, modest differences in these factors can produce gains of this magnitude; without reported per-fold standard deviations, multiple random seeds, or a statement that all eight configurations were tuned under the same protocol, the attribution to the gating network and cross-modal attention cannot be confirmed as load-bearing.

    Authors: We agree that the manuscript should have explicitly confirmed protocol equivalence. All eight configurations were implemented in a single codebase and followed the identical training protocol prescribed by the CHiPSAL 2026 shared task, including the same hyperparameter search ranges, learning-rate schedule, early-stopping patience, batch size, and BGE-M3 tokenization. To make this transparent, we will revise §4 to add a dedicated paragraph and a supplementary table listing the shared settings applied uniformly to every model, including the text-only BGE-M3 baseline. We will also report the per-fold F1-macro scores for the top-performing configuration and the text-only baseline so that variance across the five fixed folds can be inspected directly. While additional independent random-seed runs were not performed owing to the shared-task submission deadline, the improvement remained consistent across all folds, supporting that the cross-modal components are the primary source of the gain. revision: partial

  2. Referee: [§3 (Model Architecture) and §5 (Findings)] The two reported failure modes (CLIP near-random on Devanagari; ensembles overfitting) are consistent with known issues, yet the text does not clarify whether these behaviors were observed uniformly across all eight configurations or whether they depend on specific choices such as the exact ViT-B/32 checkpoint, ensemble size, or data-split seed. This detail is needed to establish that the failure modes are general rather than configuration-specific.

    Authors: The near-random performance of CLIP was observed in every configuration that incorporated the English-centric ViT-B/32 visual encoder, and the catastrophic ensemble degradation appeared in both ensemble variants we evaluated. In the revised §5 we will explicitly state that these behaviors held across the relevant subsets of the eight configurations, specify the exact checkpoint (openai/clip-vit-base-patch32), the ensemble size (three models), and note that the data splits were the fixed partitions released by the shared-task organizers and therefore independent of any random seed chosen by us. revision: yes
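As an illustration of the per-fold reporting promised in these responses, the sketch below computes F1-macro per fold for two configurations and prints mean ± standard deviation plus per-fold deltas; labels, predictions, fold size, and accuracy levels are synthetic placeholders, not shared-task values.

```python
# Per-fold F1-macro reporting sketch (synthetic placeholders, not real data).
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def per_fold_f1(y_true_by_fold, y_pred_by_fold):
    return np.array([f1_score(yt, yp, average="macro")
                     for yt, yp in zip(y_true_by_fold, y_pred_by_fold)])

def flip_with_accuracy(y, acc):
    """Placeholder predictions that match y with probability acc."""
    return np.where(rng.random(y.size) < acc, y, 1 - y)

folds = [rng.integers(0, 2, 200) for _ in range(5)]   # placeholder binary labels
text_preds   = [flip_with_accuracy(y, 0.80) for y in folds]
fusion_preds = [flip_with_accuracy(y, 0.86) for y in folds]

f_text   = per_fold_f1(folds, text_preds)
f_fusion = per_fold_f1(folds, fusion_preds)
print(f"text-only : {f_text.mean():.3f} ± {f_text.std():.3f}")
print(f"fusion    : {f_fusion.mean():.3f} ± {f_fusion.std():.3f}")
print("per-fold delta:", np.round(f_fusion - f_text, 3))
```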

Circularity Check

0 steps flagged

No circularity: empirical ablation on held-out folds with released code

full rationale

The paper reports experimental F1-macro results from training and evaluating eight model configurations (including the proposed cross-modal attention + gating architecture) against text-only baselines on the CHiPSAL 2026 dataset splits. The 5.9% gain is a direct measured difference on held-out data, not a quantity derived from any equation or fitted parameter that is then re-labeled as a prediction. No self-citations, uniqueness theorems, or ansatzes appear in the provided text; the architecture description is presented as a design choice whose performance is then measured, not as a tautological re-expression of the inputs. This is a standard empirical shared-task submission whose central claims rest on observable performance deltas rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the empirical performance of a learnable gating network and cross-modal attention applied to pre-trained encoders; no new physical entities are introduced, but several modeling assumptions are required.

free parameters (2)
  • gating network parameters
    Learnable weights that dynamically balance visual and textual contributions per sample; fitted during training on the shared-task data.
  • attention head count
    Fixed at 4 heads in the self-attention layer; chosen as a hyperparameter.
axioms (2)
  • domain assumption: CLIP ViT-B/32 and BGE-M3 provide suitable initial representations for Devanagari meme images and text
    Invoked directly in the architecture description without additional justification or fine-tuning details in the abstract.
  • domain assumption: The CHiPSAL 2026 data splits with N ≈ 850 per fold are sufficient and representative for reliable ablation comparisons
    Used to support claims about ensemble degradation under scarcity.

pith-pipeline@v0.9.0 · 5534 in / 1569 out tokens · 43498 ms · 2026-05-10T15:47:33.290401+00:00 · methodology

discussion (0)

