Breaking the Silence: A Dataset and Benchmark for Bangla Text-to-Gloss Translation

Abhijit Paul; Ahmedul Kabir; Sharif Mohammad Abdullah; Shebuti Rayana; Shubhashis Roy Dipta; Zarif Masud

arxiv: 2504.02293 · v3 · submitted 2025-04-03 · 💻 cs.CL · cs.AI

Sharif Mohammad Abdullah, Abhijit Paul, Shubhashis Roy Dipta, Zarif Masud, Shebuti Rayana, Ahmedul Kabir

Pith reviewed 2026-05-22 21:35 UTC · model grok-4.3

keywords: Bangla Sign Language · text-to-gloss translation · low-resource machine translation · synthetic data · sign language dataset · Bangla NLP

The pith

The first dataset of 5,000 Bangla sentence-gloss pairs enables text-to-gloss translation models for Bangla Sign Language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper builds the first dataset for translating Bangla sentences into glosses for Bangla Sign Language, addressing a complete absence of prior resources for a population of at least three million. It supplies 1,000 manually annotated pairs, 4,000 synthetically generated pairs, and a 159-pair expert-annotated test set. Experiments benchmark several fine-tuned open models against a closed-source LLM, showing that a fine-tuned mBART model competes with much larger systems and that synthetic data improves results in this low-resource setting. A sympathetic reader would care because the work supplies the basic training material needed for any future machine translation tools that could support communication access for deaf Bangla speakers.

Core claim

We introduce the first Bangla text-to-gloss dataset consisting of 1,000 manually annotated and 4,000 synthetically generated sentence-gloss pairs together with a 159 expert human-annotated test set, and we show through comparative experiments that GPT-5.4 achieves the best overall scores while a fine-tuned mBART model remains competitive despite being roughly 100 times smaller and Qwen-3 leads in human evaluation, confirming that systematic synthetic data generation can mitigate data scarcity for low-resource sign language translation.

What carries the argument

The Bangla sentence-gloss dataset created by manual annotation plus systematic synthetic generation, used to train and evaluate text-to-gloss translation models.

If this is right

Fine-tuned open-source models can deliver usable BdSL text-to-gloss performance without requiring the largest available LLMs.
Systematic synthetic data generation scales training resources when native annotated data remains scarce.
Expert-annotated test sets of a few hundred pairs already allow reliable comparison of translation systems for this language pair.
Both closed-source LLMs and smaller fine-tuned models constitute practical starting points for initial BdSL translation tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same synthetic-generation recipe could be reused to bootstrap datasets for other under-documented sign languages.
Once text-to-gloss models exist, they could be paired with gloss-to-video rendering to produce full sentence-level sign output.
Public release of the dataset lowers the barrier for researchers outside Bangladesh to contribute to BdSL technology.

Load-bearing premise

The manually annotated and synthetically generated sentence-gloss pairs accurately represent valid Bangla Sign Language glosses.

What would settle it

A fresh evaluation by multiple independent BdSL experts on held-out sentences, in which model gloss outputs match human references at rates no higher than random guessing, would undermine the claim.

Original abstract

Gloss is a written approximation that bridges Sign Language (SL) and its corresponding spoken language. Despite a deaf and hard-of-hearing population of at least 3 million in Bangladesh, Bangla Sign Language (BdSL) remains largely understudied, with no prior work on Bangla text-to-gloss translation and no publicly available datasets. To address this gap, we construct the first Bangla text-to-gloss dataset, consisting of 1,000 manually annotated and 4,000 synthetically generated Bangla sentence-gloss pairs, along with 159 expert human-annotated pairs used as a test set. Our experimental framework performs a comparative analysis between several fine-tuned open-source models and a leading closed-source LLM to evaluate their performance in low-resource BdSL translation. GPT-5.4 achieves the best overall performance, while a fine-tuned mBART model performs competitively despite being approximately 100× smaller. Qwen-3 outperforms all other models in human evaluation. This work introduces the first dataset and trained model for Bangla text-to-gloss translation. It also demonstrates the effectiveness of systematically generated synthetic data for addressing challenges in low-resource sign language translation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper releases the first Bangla text-to-gloss dataset, which fills a real gap, but the 159-pair test set and lack of validation details make the model comparisons unreliable.

The main thing to know is that this is the first public dataset for Bangla text-to-gloss translation. No prior work existed, so the 1,000 manual pairs plus 4,000 synthetic ones, with a 159-pair expert test set, open the area for BdSL. Releasing the data is the useful part here. The authors also run a basic comparison across fine-tuned open models and a closed LLM, noting that mBART stays competitive despite its size and that Qwen-3 does well on human evaluation. That gives a starting point for low-resource sign language work.

The soft spots sit in the evaluation. A 159-pair test set is small for translation benchmarks, and the abstract reports no inter-annotator agreement or expert check on the synthetic glosses. Without those, any ranking of GPT-5.4 over the others could easily reflect annotation noise or domain mismatch instead of real capability. The synthetic generation process also stays opaque.

This paper is for researchers in low-resource machine translation or sign language accessibility who need a Bangla starting point. A reader focused on South Asian NLP or deaf community tools would find the dataset release worth looking at. The central claim holds up as far as creating the resource goes, but the benchmark numbers need more grounding before they can be taken as firm. I would send it to peer review because the dataset itself is new and the topic matters, even if the experiments require tightening on validation and scale.

Referee Report

3 major / 2 minor

Summary. The paper introduces the first dataset and benchmark for Bangla text-to-gloss translation, consisting of 1,000 manually annotated sentence-gloss pairs, 4,000 synthetically generated pairs, and a held-out test set of 159 expert-annotated pairs. It reports a comparative evaluation of fine-tuned open-source models (including mBART) against a closed-source LLM (GPT-5.4), with GPT-5.4 achieving the highest automatic scores, mBART performing competitively despite its smaller size, and Qwen-3 ranking highest in human evaluation. The central claim is that this resource and the use of systematic synthetic data enable effective translation in this low-resource sign language setting.

Significance. If the dataset construction and evaluation prove reliable, the work would fill a notable gap by providing the first public resource for BdSL text-to-gloss translation and by showing that synthetic data can mitigate data scarcity in sign language translation. This could support downstream accessibility applications and serve as a template for other under-resourced sign languages.

major comments (3)

[Dataset construction] Dataset construction (abstract and §3): No inter-annotator agreement statistics are reported for the 159 expert-annotated test pairs. Without IAA, it is impossible to assess the stability or reliability of the benchmark used to rank all models.
[Synthetic data generation] Synthetic data generation (§3.2): The paper provides no details on the synthetic generation process, no expert validation pass on the 4,000 synthetic pairs, and no comparison of synthetic vs. manual gloss quality. This directly affects the claim that synthetic data effectively addresses low-resource challenges.
[Evaluation setup] Evaluation setup (§4): The test set contains only 159 pairs with no reported variance, confidence intervals, or statistical significance tests on the model rankings. Given the small size relative to standard MT benchmarks, performance differences (e.g., GPT-5.4 vs. mBART) may be dominated by annotation noise or domain effects rather than true capability.
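
The inter-annotator agreement requested in the first major comment could be reported with a chance-corrected statistic such as Cohen's kappa. The sketch below is a minimal illustration, not the authors' protocol: it assumes two experts each give one categorical judgment per test item (for example, whether a candidate gloss is acceptable), and the function name and label scheme are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    Assumption: labels are categorical (e.g. 1 = acceptable gloss,
    0 = not acceptable). This label scheme is illustrative only.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")

    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement under independence, from each annotator's label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Kappa applies to categorical judgments only. Agreement on free-form gloss sequences would need a different measure, which the source does not specify.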

minor comments (2)

[Abstract] The abstract and introduction should explicitly state the evaluation metrics used (BLEU, chrF, etc.) and any human evaluation protocol details.
[Dataset description] Clarify whether the 1,000 manual pairs overlap with or are disjoint from the 159 expert test pairs.
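
The disjointness question in the last minor comment can be checked mechanically. The following is a minimal sketch under stated assumptions: sentences are available as plain Python strings, exact match after Unicode NFC normalization and whitespace collapsing is an adequate notion of "overlap", and the helper names are hypothetical.

```python
import unicodedata


def normalize(sentence):
    """NFC-normalize and collapse whitespace so equivalent strings compare equal."""
    return " ".join(unicodedata.normalize("NFC", sentence).split())


def find_overlap(train_sentences, test_sentences):
    """Return the sorted normalized sentences present in both splits."""
    train_keys = {normalize(s) for s in train_sentences}
    test_keys = {normalize(s) for s in test_sentences}
    return sorted(train_keys & test_keys)
```

An empty result would indicate that no test sentence appears verbatim in the training pairs. Near-duplicates and paraphrases would need a fuzzier comparison than this.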

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our work introducing the first Bangla text-to-gloss dataset and benchmark. We address each major comment below and commit to revisions that improve the manuscript's clarity and rigor without altering our core claims.

Referee: [Dataset construction] Dataset construction (abstract and §3): No inter-annotator agreement statistics are reported for the 159 expert-annotated test pairs. Without IAA, it is impossible to assess the stability or reliability of the benchmark used to rank all models.

Authors: We agree that inter-annotator agreement statistics would strengthen confidence in the test set. The 159 pairs were produced through expert annotation following a detailed guideline; however, multiple independent annotations for IAA calculation were not performed in the original process. We will add a description of the annotation protocol and, where feasible, report agreement on any overlapping annotations or note this as a limitation in the revised manuscript. revision: yes
Referee: [Synthetic data generation] Synthetic data generation (§3.2): The paper provides no details on the synthetic generation process, no expert validation pass on the 4,000 synthetic pairs, and no comparison of synthetic vs. manual gloss quality. This directly affects the claim that synthetic data effectively addresses low-resource challenges.

Authors: The referee correctly identifies that details on the synthetic generation pipeline were insufficient. We will expand §3.2 in the revision to describe the generation method, any post-generation filtering or validation steps applied to the 4,000 pairs, and include a side-by-side quality comparison (e.g., via automatic metrics or expert review samples) between synthetic and manual glosses to better support the utility of synthetic data. revision: yes
Referee: [Evaluation setup] Evaluation setup (§4): The test set contains only 159 pairs with no reported variance, confidence intervals, or statistical significance tests on the model rankings. Given the small size relative to standard MT benchmarks, performance differences (e.g., GPT-5.4 vs. mBART) may be dominated by annotation noise or domain effects rather than true capability.

Authors: We acknowledge the small test-set size and the absence of statistical measures. In the revised §4 we will report bootstrap confidence intervals, standard deviations across runs where applicable, and statistical significance tests (such as paired bootstrap or approximate randomization tests) for key model comparisons. We will also explicitly discuss the implications of the test-set size for low-resource sign-language settings. revision: yes
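
The paired bootstrap mentioned in the last response can be sketched as follows. This is an illustration rather than the authors' evaluation code: it assumes each system has one score per test sentence (for example, a sentence-level metric), that the system-level score is the mean of those values, and that both lists are aligned to the same 159 test items. The function name and defaults are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Paired bootstrap over per-sentence scores of two systems.

    Returns (win_rate, (low, high)):
      win_rate  - share of resamples where system A's mean beats system B's
      low, high - 95% percentile interval of the mean difference A - B
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and aligned item by item")

    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample test items with replacement, using the same items for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)

    diffs.sort()
    win_rate = sum(d > 0 for d in diffs) / n_resamples
    low = diffs[int(0.025 * n_resamples)]
    high = diffs[min(int(0.975 * n_resamples), n_resamples - 1)]
    return win_rate, (low, high)
```

An interval for the difference that includes zero would suggest the ranking of the two systems is not stable on a test set of this size. Corpus-level metrics that are not simple means of sentence scores would need to be recomputed on each resample instead.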

Circularity Check

0 steps flagged

No circularity: empirical dataset creation and model evaluation only

The paper introduces a new dataset (1,000 manual + 4,000 synthetic Bangla sentence-gloss pairs) and evaluates fine-tuned models (mBART, Qwen-3) plus GPT-5.4 on a held-out 159-pair expert test set. No mathematical derivations, equations, fitted parameters, or first-principles predictions exist that could reduce to inputs by construction. No self-citations are load-bearing; the work explicitly states it is the first on Bangla text-to-gloss translation. All claims rest on external data collection and standard MT metrics rather than any self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the new dataset and the assumption that standard machine translation fine-tuning applies directly to text-to-gloss without domain-specific adjustments beyond those tested.

axioms (1)

domain assumption: Standard sequence-to-sequence fine-tuning techniques transfer to text-to-gloss translation tasks.
The experimental framework applies fine-tuned mBART and other MT models to the new task without additional justification.

pith-pipeline@v0.9.0 · 5760 in / 1247 out tokens · 25451 ms · 2026-05-22T21:35:03.541105+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We construct the first Bangla text-to-gloss dataset, consisting of 1,000 manually annotated and 4,000 synthetically generated Bangla sentence-gloss pairs, along with 159 expert human-annotated pairs used as a test set.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

GPT-5.4 achieves the best overall performance, while a fine-tuned mBART model performs competitively despite being approximately 100× smaller.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.