arxiv: 2504.02293 · v3 · submitted 2025-04-03 · 💻 cs.CL · cs.AI

Breaking the Silence: A Dataset and Benchmark for Bangla Text-to-Gloss Translation

Pith reviewed 2026-05-22 21:35 UTC · model grok-4.3

keywords Bangla Sign Language · text-to-gloss translation · low-resource machine translation · synthetic data · sign language dataset · Bangla NLP

The pith

The first dataset of 5,000 Bangla sentence-gloss pairs enables text-to-gloss translation models for Bangla Sign Language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper builds the first dataset for translating Bangla sentences into glosses for Bangla Sign Language, addressing a complete absence of prior resources for a population of at least three million. It supplies 1,000 manually annotated pairs, 4,000 synthetically generated pairs, and a 159-pair expert-annotated test set. Experiments benchmark several fine-tuned open models against a closed-source LLM, showing that a fine-tuned mBART model competes with much larger systems and that synthetic data improves results in this low-resource setting. A sympathetic reader would care because the work supplies the basic training material needed for any future machine translation tools that could support communication access for deaf Bangla speakers.

Core claim

We introduce the first Bangla text-to-gloss dataset consisting of 1,000 manually annotated and 4,000 synthetically generated sentence-gloss pairs together with a 159 expert human-annotated test set, and we show through comparative experiments that GPT-5.4 achieves the best overall scores while a fine-tuned mBART model remains competitive despite being roughly 100 times smaller and Qwen-3 leads in human evaluation, confirming that systematic synthetic data generation can mitigate data scarcity for low-resource sign language translation.

What carries the argument

The Bangla sentence-gloss dataset created by manual annotation plus systematic synthetic generation, used to train and evaluate text-to-gloss translation models.

If this is right

  • Fine-tuned open-source models can deliver usable BdSL text-to-gloss performance without requiring the largest available LLMs.
  • Systematic synthetic data generation scales training resources when native annotated data remains scarce.
  • Expert-annotated test sets of a few hundred pairs already allow reliable comparison of translation systems for this language pair.
  • Both closed-source LLMs and smaller fine-tuned models constitute practical starting points for initial BdSL translation tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same synthetic-generation recipe could be reused to bootstrap datasets for other under-documented sign languages.
  • Once text-to-gloss models exist, they could be paired with gloss-to-video rendering to produce full sentence-level sign output.
  • Public release of the dataset lowers the barrier for researchers outside Bangladesh to contribute to BdSL technology.

Load-bearing premise

The manually annotated and synthetically generated sentence-gloss pairs accurately represent valid Bangla Sign Language glosses.

What would settle it

A fresh evaluation by multiple independent BdSL experts on held-out sentences where model gloss outputs match human references at rates no higher than random guessing.

Original abstract

Gloss is a written approximation that bridges Sign Language (SL) and its corresponding spoken language. Despite a deaf and hard-of-hearing population of at least 3 million in Bangladesh, Bangla Sign Language (BdSL) remains largely understudied, with no prior work on Bangla text-to-gloss translation and no publicly available datasets. To address this gap, we construct the first Bangla text-to-gloss dataset, consisting of 1,000 manually annotated and 4,000 synthetically generated Bangla sentence-gloss pairs, along with 159 expert human-annotated pairs used as a test set. Our experimental framework performs a comparative analysis between several fine-tuned open-source models and a leading closed-source LLM to evaluate their performance in low-resource BdSL translation. GPT-5.4 achieves the best overall performance, while a fine-tuned mBART model performs competitively despite being approximately 100 times smaller. Qwen-3 outperforms all other models in human evaluation. This work introduces the first dataset and trained model for Bangla text-to-gloss translation. It also demonstrates the effectiveness of systematically generated synthetic data for addressing challenges in low-resource sign language translation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the first dataset and benchmark for Bangla text-to-gloss translation, consisting of 1,000 manually annotated sentence-gloss pairs, 4,000 synthetically generated pairs, and a held-out test set of 159 expert-annotated pairs. It reports a comparative evaluation of fine-tuned open-source models (including mBART) against a closed-source LLM (GPT-5.4), with GPT-5.4 achieving the highest automatic scores, mBART performing competitively despite its smaller size, and Qwen-3 ranking highest in human evaluation. The central claim is that this resource and the use of systematic synthetic data enable effective translation in this low-resource sign language setting.

Significance. If the dataset construction and evaluation prove reliable, the work would fill a notable gap by providing the first public resource for BdSL text-to-gloss translation and by showing that synthetic data can mitigate data scarcity in sign language translation. This could support downstream accessibility applications and serve as a template for other under-resourced sign languages.

major comments (3)
  1. [Dataset construction] Dataset construction (abstract and §3): No inter-annotator agreement statistics are reported for the 159 expert-annotated test pairs. Without IAA, it is impossible to assess the stability or reliability of the benchmark used to rank all models.
  2. [Synthetic data generation] Synthetic data generation (§3.2): The paper provides no details on the synthetic generation process, no expert validation pass on the 4,000 synthetic pairs, and no comparison of synthetic vs. manual gloss quality. This directly affects the claim that synthetic data effectively addresses low-resource challenges.
  3. [Evaluation setup] Evaluation setup (§4): The test set contains only 159 pairs with no reported variance, confidence intervals, or statistical significance tests on the model rankings. Given the small size relative to standard MT benchmarks, performance differences (e.g., GPT-5.4 vs. mBART) may be dominated by annotation noise or domain effects rather than true capability.
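
The agreement check requested in major comment 1 could be approximated along the following lines. This is a minimal sketch, not the paper's protocol: it assumes each test sentence was glossed independently by two or more annotators (which the paper does not report), and it uses exact match and token-level Jaccard overlap as simple stand-ins for a formal agreement statistic. The function names and the English placeholder glosses are hypothetical.

```python
from itertools import combinations


def token_jaccard(gloss_a, gloss_b):
    """Jaccard overlap between the token sets of two gloss strings."""
    a, b = set(gloss_a.split()), set(gloss_b.split())
    if not a and not b:
        return 1.0  # two empty glosses agree trivially
    return len(a & b) / len(a | b)


def pairwise_agreement(annotations):
    """annotations[k][i] is annotator k's gloss for sentence i.

    Returns the mean exact-match rate and the mean token Jaccard
    over all annotator pairs and all sentences.
    """
    exact, jaccard, n = 0, 0.0, 0
    for ann_a, ann_b in combinations(annotations, 2):
        if len(ann_a) != len(ann_b):
            raise ValueError("annotators must gloss the same sentences")
        for g_a, g_b in zip(ann_a, ann_b):
            exact += int(g_a.split() == g_b.split())
            jaccard += token_jaccard(g_a, g_b)
            n += 1
    if n == 0:
        raise ValueError("need at least two annotators and one sentence")
    return {"exact_match": exact / n, "token_jaccard": jaccard / n}


# Hypothetical placeholder glosses from two annotators for two sentences.
annotator_1 = ["I GO SCHOOL", "YOU NAME WHAT"]
annotator_2 = ["I GO SCHOOL", "YOUR NAME WHAT"]
print(pairwise_agreement([annotator_1, annotator_2]))
# {'exact_match': 0.5, 'token_jaccard': 0.75}
```

Set-based overlap ignores gloss order, so a sequence-aware measure would be a stricter choice if sign order matters for the annotation guideline.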
minor comments (2)
  1. [Abstract] The abstract and introduction should explicitly state the evaluation metrics used (BLEU, chrF, etc.) and any human evaluation protocol details.
  2. [Dataset description] Clarify whether the 1,000 manual pairs overlap with or are disjoint from the 159 expert test pairs.
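
The disjointness question in minor comment 2 can be checked mechanically. The following is a minimal sketch, assuming the source sentences of both splits are available as plain strings. `normalize` and `find_overlap` are hypothetical helpers, and the example sentences are illustrative rather than taken from the dataset.

```python
import unicodedata


def normalize(sentence):
    """NFC-normalize and collapse whitespace so trivially different copies match."""
    return " ".join(unicodedata.normalize("NFC", sentence).split())


def find_overlap(train_sentences, test_sentences):
    """Return the sorted normalized test sentences that also occur in train."""
    train_set = {normalize(s) for s in train_sentences}
    test_set = {normalize(s) for s in test_sentences}
    return sorted(train_set & test_set)


# Illustrative sentences only; the first training sentence has a doubled space.
train = ["আমি  স্কুলে যাই", "তোমার নাম কী"]
test = ["আমি স্কুলে যাই", "সে বই পড়ে"]
leaked = find_overlap(train, test)
print(len(leaked), "test sentence(s) also appear in the training split")
```

An empty result only rules out exact duplicates after normalization. Near-duplicates and paraphrases would need a fuzzier comparison.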

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our work introducing the first Bangla text-to-gloss dataset and benchmark. We address each major comment below and commit to revisions that improve the manuscript's clarity and rigor without altering our core claims.

Point-by-point responses
  1. Referee: [Dataset construction] Dataset construction (abstract and §3): No inter-annotator agreement statistics are reported for the 159 expert-annotated test pairs. Without IAA, it is impossible to assess the stability or reliability of the benchmark used to rank all models.

    Authors: We agree that inter-annotator agreement statistics would strengthen confidence in the test set. The 159 pairs were produced through expert annotation following a detailed guideline; however, multiple independent annotations for IAA calculation were not performed in the original process. We will add a description of the annotation protocol and, where feasible, report agreement on any overlapping annotations or note this as a limitation in the revised manuscript. revision: yes

  2. Referee: [Synthetic data generation] Synthetic data generation (§3.2): The paper provides no details on the synthetic generation process, no expert validation pass on the 4,000 synthetic pairs, and no comparison of synthetic vs. manual gloss quality. This directly affects the claim that synthetic data effectively addresses low-resource challenges.

    Authors: The referee correctly identifies that details on the synthetic generation pipeline were insufficient. We will expand §3.2 in the revision to describe the generation method, any post-generation filtering or validation steps applied to the 4,000 pairs, and include a side-by-side quality comparison (e.g., via automatic metrics or expert review samples) between synthetic and manual glosses to better support the utility of synthetic data. revision: yes

  3. Referee: [Evaluation setup] Evaluation setup (§4): The test set contains only 159 pairs with no reported variance, confidence intervals, or statistical significance tests on the model rankings. Given the small size relative to standard MT benchmarks, performance differences (e.g., GPT-5.4 vs. mBART) may be dominated by annotation noise or domain effects rather than true capability.

    Authors: We acknowledge the small test-set size and the absence of statistical measures. In the revised §4 we will report bootstrap confidence intervals, standard deviations across runs where applicable, and statistical significance tests (such as paired bootstrap or approximate randomization tests) for key model comparisons. We will also explicitly discuss the implications of the test-set size for low-resource sign-language settings. revision: yes
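
The paired bootstrap test mentioned in the third response could look roughly like the sketch below. It assumes a per-sentence score for each system on the same 159 test items (for example a sentence-level metric). For corpus-level metrics the metric would instead be recomputed on each resample. The function name, the resample count, and the toy scores are assumptions for illustration, not values from the paper.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Paired bootstrap resampling over per-sentence scores.

    Returns the fraction of resamples in which system A's mean score
    exceeds system B's, plus an approximate 95% percentile interval
    for the mean difference (A minus B).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping A/B scores paired.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    win_rate = sum(d > 0 for d in diffs) / n_resamples
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return win_rate, (lower, upper)


# Toy scores for 159 items: system A is uniformly better by 0.4.
toy_a = [0.9] * 159
toy_b = [0.5] * 159
win_rate, (lower, upper) = paired_bootstrap(toy_a, toy_b)
print(win_rate, lower, upper)
```

With real scores, an interval that contains zero would indicate that the ranking of the two systems is not stable on a test set of this size.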

Circularity Check

0 steps flagged

No circularity: empirical dataset creation and model evaluation only

Full rationale

The paper introduces a new dataset (1,000 manual + 4,000 synthetic Bangla sentence-gloss pairs) and evaluates fine-tuned models (mBART, Qwen-3) plus GPT-5.4 on a held-out 159-pair expert test set. No mathematical derivations, equations, fitted parameters, or first-principles predictions exist that could reduce to inputs by construction. No self-citations are load-bearing; the work explicitly states it is the first on Bangla text-to-gloss translation. All claims rest on external data collection and standard MT metrics rather than any self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the new dataset and the assumption that standard machine translation fine-tuning applies directly to text-to-gloss without domain-specific adjustments beyond those tested.

axioms (1)
  • domain assumption: Standard sequence-to-sequence fine-tuning techniques transfer to text-to-gloss translation tasks.
    The experimental framework applies fine-tuned mBART and other MT models to the new task without additional justification.

pith-pipeline@v0.9.0 · 5760 in / 1247 out tokens · 25451 ms · 2026-05-22T21:35:03.541105+00:00 · methodology

