pith. machine review for the scientific record.

arxiv: 2604.22367 · v1 · submitted 2026-04-24 · 💻 cs.CL · cs.AI

Recognition: unknown

CNSL-bench: Benchmarking the Sign Language Understanding Capabilities of MLLMs on Chinese National Sign Language

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 11:43 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI
keywords sign language understanding · multimodal large language models · Chinese National Sign Language · benchmark evaluation · finger-spelling · air-writing · manual alphabet · MLLM performance gaps

The pith

Current MLLMs perform substantially worse than humans at understanding Chinese National Sign Language, with clear gaps across input types and signing styles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CNSL-bench, a new evaluation set anchored to the official National Common Sign Language Dictionary. It supplies matched text descriptions, images, and videos for sign language tasks while distinguishing forms such as air-writing, finger-spelling, and the Chinese manual alphabet. Tests on 21 recent open-source and proprietary MLLMs show they trail human accuracy by large margins and display consistent differences depending on whether the input is video, image, or text. The work also finds that these shortfalls remain even after accounting for reasoning ability and that models differ sharply in how reliably they follow instructions on signed content.
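
To make the alignment concrete, here is a minimal sketch of what one benchmark entry could look like as a record. The field names and the exact form labels are illustrative assumptions; the paper specifies only that each entry pairs a textual description, an image, and a video, tagged by articulatory form.

    from dataclasses import dataclass
    from typing import Literal

    # Form labels follow the three articulatory categories the paper names;
    # "lexical" is an assumed catch-all for ordinary signs outside them.
    ArticulatoryForm = Literal["air_writing", "finger_spelling", "manual_alphabet", "lexical"]

    @dataclass
    class CNSLEntry:
        gloss: str              # dictionary headword, e.g. "电脑 (laptop)"
        description: str        # textual description of how the sign is produced
        image_path: str         # illustrative still image from the dictionary
        video_path: str         # aligned sign language video clip
        form: ArticulatoryForm  # manual articulatory category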

Core claim

CNSL-bench supplies the first large-scale, dictionary-grounded collection of Chinese National Sign Language examples that align textual glosses, still images, and video clips while tagging each item by articulatory form. When 21 current MLLMs are run on the benchmark, their accuracy stays well below human levels and varies systematically by modality and by whether the sign uses air-writing, finger-spelling, or the manual alphabet; diagnostic checks indicate that these shortfalls are not removed by stronger reasoning alone and that instruction-following stability differs markedly across models.
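
Operationally, the modality and form breakdown reduces to a simple aggregation once per-item correctness is recorded. A minimal sketch, assuming per-item records that carry a modality tag, an articulatory-form tag, and a correctness flag; the human-baseline figures below are placeholders, not numbers from the paper.

    from collections import defaultdict

    def accuracy_by(records, key):
        """Accuracy per group, e.g. per input modality or articulatory form."""
        hits, totals = defaultdict(int), defaultdict(int)
        for r in records:
            totals[r[key]] += 1
            hits[r[key]] += int(r["correct"])
        return {g: hits[g] / totals[g] for g in totals}

    # Illustrative per-item judgments for one model, not real outputs.
    results = [
        {"modality": "text",  "form": "air_writing",     "correct": True},
        {"modality": "image", "form": "manual_alphabet", "correct": True},
        {"modality": "video", "form": "finger_spelling", "correct": False},
    ]
    model_acc = accuracy_by(results, "modality")
    human_acc = {"text": 0.90, "image": 0.93, "video": 0.95}   # placeholder values
    gaps = {m: human_acc[m] - model_acc.get(m, 0.0) for m in human_acc}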

What carries the argument

CNSL-bench, the benchmark that supplies aligned text, image, and video data anchored to the National Common Sign Language Dictionary and broken down by manual articulatory form.

If this is right

  • MLLMs require targeted gains in processing sign-language video and image streams beyond current multimodal training.
  • Instruction-following reliability must be improved separately from raw reasoning capacity.
  • Performance differences across air-writing, finger-spelling, and manual-alphabet items point to uneven coverage of articulatory detail in existing training data.
  • Human baselines set a concrete target that future models can be measured against on the same fixed items.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Wider adoption of such benchmarks could accelerate development of sign-language interfaces for Chinese deaf users.
  • The same evaluation design could be replicated for other national sign languages to test whether similar modality gaps appear.
  • If instruction-following varies so much, prompt engineering or fine-tuning for signed input may yield quick gains without full model retraining.
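
On that last point, the cheapest lever is the prompt itself. Below is a hedged sketch of the kind of fixed-format multiple-choice prompt such tuning might iterate on; the paper does not publish its prompt, so the wording, the option format, and the single-letter constraint here are all assumptions.

    def build_mc_prompt(options, modality_note):
        """Assemble a fixed-format multiple-choice question for one sign stimulus."""
        header = ("You are shown a Chinese National Sign Language sign "
                  f"({modality_note}). Which dictionary entry does it express?")
        choices = [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)]
        footer = "Answer with a single letter only."
        return "\n".join([header, *choices, footer])

    print(build_mc_prompt(["因为 (because)", "电脑 (laptop)", "书 (book)"],
                          "as a video clip"))

Pinning the answer to a single letter is the kind of constraint that separates instruction-following failures from comprehension failures at scoring time.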

Load-bearing premise

The tasks in the benchmark truly isolate sign-language comprehension rather than being solved mainly by general visual recognition or ordinary language skills.
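
One way to probe this premise is a blind baseline: withhold the sign input entirely and check how often models still select the correct option. A minimal sketch under that design; the chance level and margin are arbitrary illustrative thresholds, not anything the paper reports.

    def flag_suspect_items(blind_acc, chance=0.25, margin=0.25):
        """blind_acc maps item_id -> accuracy when models see only the answer
        options, with the sign input withheld. Items solved well above chance
        without the sign are likely answerable from priors, not comprehension."""
        return [item for item, acc in blind_acc.items() if acc >= chance + margin]

    # flag_suspect_items({"laptop": 0.30, "because": 0.85}) -> ["because"]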

What would settle it

A new MLLM that reaches human accuracy on the full CNSL-bench suite, erasing rather than reproducing the reported modality and form differences, would indicate the gaps are not fundamental.

Figures

Figures reproduced from arXiv: 2604.22367 by Jinsong Su, Rui Zhao, Xiaoyun Zheng, Xuewen Zhong, Yidong Chen.

Figure 1. Examples from CNSL-bench showing the aligned textual description, illustrative image, and corresponding sign language video for each sign entry.
Figure 2. Examples illustrating the three categories of sign articulation: air-writing (top), finger-spelling (middle), and the Chinese manual alphabet (bottom).
Figure 3. A concise summary of the MLLMs’ performance on CNSL-bench; (a) provides a comparison detailing the …
Figure 4. CoT gains across models and multimodal subsets.
Figure 5. Top-tier reasoning models, i.e., GPT-5 and Gemini-2.5-Pro, exhibit a boundary effect in test-time scaling …
Figure 6. KDE plot of reasoning tokens. …
Figure 7. Video-based sign language understanding failure for "laptop". Despite correctly recognizing the typing …
Figure 8. All manual alphabets in the Chinese Manual Alphabet, including 26 single-letter alphabets, 4 double-letter alphabets, and 2 alphabets with symbols.
Figure 9. An example of open-ended sign language understanding from the CSL-Daily dataset.
Figure 10. Text-based sign language understanding for the lexical concept "laptop". Under textual descriptions, …
Figure 11. Image-based sign language understanding for "laptop". With illustrative images, some models correctly …
Figure 12. Video-based sign language understanding failure for "laptop". Despite correctly recognizing the typing …
Figure 13. Partial semantic recognition without correct contextual integration. Models correctly identify individual …
Figure 14. Emerging sign-specific symbolic understanding in advanced MLLMs. A stronger model successfully …
read the original abstract

Sign language research has achieved significant progress due to the advances in large language models (LLMs). However, the intrinsic ability of LLMs to understand sign language, especially in multimodal contexts, remains underexplored. To address this limitation, we introduce CNSL-bench, the first comprehensive Chinese National Sign Language benchmark designed for evaluating multimodal large language models (MLLMs) in sign language understanding. The proposed CNSL-bench is characterized by: 1) Authoritative grounding, as it is anchored to the officially standardized National Common Sign Language Dictionary, mitigating ambiguity from regional or non-canonical variants and ensuring consistent semantic definitions; 2) Multimodal coverage, providing aligned textual descriptions, illustrative images, and sign language videos; and 3) Articulatory diversity, supporting fine-grained analysis across key manual articulatory forms, including air-writing, finger-spelling, and the Chinese manual alphabet. Using CNSL-bench, we extensively evaluate 21 up-to-date open-source and proprietary MLLMs. Our results reveal that, despite recent advances in multimodal modeling, current MLLMs remain substantially inferior to human performance, exhibiting systematic disparities across input modalities and manual articulatory forms. Additional diagnostic analyses suggest that several performance limitations persist beyond improvements in reasoning and that instruction-following robustness varies substantially across models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces CNSL-bench, the first benchmark for Chinese National Sign Language understanding in multimodal large language models (MLLMs). It is anchored to the officially standardized National Common Sign Language Dictionary, supplies aligned textual descriptions, images, and videos, and supports analysis across articulatory forms including air-writing, finger-spelling, and the Chinese manual alphabet. The authors evaluate 21 open-source and proprietary MLLMs and report that current models remain substantially inferior to human performance while exhibiting systematic disparities across input modalities and manual articulatory forms.

Significance. If the benchmark labels prove reliable, the work supplies the first large-scale, multimodal resource for CNSL and documents clear performance gaps in existing MLLMs. The combination of authoritative dictionary grounding, multimodal alignment, and articulatory stratification offers a reusable testbed that could drive targeted improvements in sign-language multimodal modeling.

major comments (2)
  1. [Benchmark construction] Benchmark construction section: the claim that anchoring to the National Common Sign Language Dictionary 'mitigates ambiguity from regional or non-canonical variants' is presented without any reported verification (e.g., inter-signer agreement scores, independent expert review of video fidelity, or checks for residual idiolectal variation). This verification is load-bearing for the central claim that observed model errors and modality/articulatory disparities can be attributed to model limitations rather than label noise.
  2. [Evaluation and results] Evaluation section: the manuscript reports human baselines and systematic disparities yet provides no dataset cardinality, exact task formulations (e.g., recognition vs. translation vs. multiple-choice), statistical testing procedures, or human-baseline collection protocol. These omissions prevent independent assessment of whether the reported performance gaps are statistically robust or reproducible.
minor comments (1)
  1. [Abstract] Abstract: the statement that 21 models were evaluated would be strengthened by naming the models or at least indicating the split between open-source and proprietary systems.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions made to the manuscript.

read point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction section: the claim that anchoring to the National Common Sign Language Dictionary 'mitigates ambiguity from regional or non-canonical variants' is presented without any reported verification (e.g., inter-signer agreement scores, independent expert review of video fidelity, or checks for residual idiolectal variation). This verification is load-bearing for the central claim that observed model errors and modality/articulatory disparities can be attributed to model limitations rather than label noise.

    Authors: We agree that additional verification details would strengthen the presentation. The CNSL-bench is directly anchored to the officially published National Common Sign Language Dictionary, which is the government-standardized reference intended to define canonical forms and reduce regional or idiolectal variation by design. The videos and images are sourced from the dictionary's official materials without modification. We have revised the Benchmark construction section to include an expanded description of the data sourcing and curation process, along with references to the dictionary's standardization authority. While we did not conduct new inter-signer agreement studies (as the source material is already standardized), this revision clarifies the basis for attributing errors primarily to model limitations rather than label noise. revision: partial

  2. Referee: [Evaluation and results] Evaluation section: the manuscript reports human baselines and systematic disparities yet provides no dataset cardinality, exact task formulations (e.g., recognition vs. translation vs. multiple-choice), statistical testing procedures, or human-baseline collection protocol. These omissions prevent independent assessment of whether the reported performance gaps are statistically robust or reproducible.

    Authors: We acknowledge that these methodological details were not presented with sufficient explicitness in the initial submission. We have revised the Evaluation section to include the dataset cardinality, precise task formulations (covering recognition, translation, and multiple-choice variants), the statistical testing procedures used to assess significance of gaps and disparities, and the full human-baseline collection protocol (including participant criteria and agreement measures). These additions enable independent verification of the robustness and reproducibility of the reported results. revision: yes
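
Neither the report nor the rebuttal pins down the statistical procedure, but a paired item-level bootstrap is one standard choice for accuracy gaps on a fixed item set. A sketch of that kind of test, not the authors' method: model_correct and human_correct are parallel per-item 0/1 lists over the same benchmark items.

    import random

    def bootstrap_gap_ci(model_correct, human_correct, n_boot=10_000, seed=0):
        """Percentile bootstrap CI for the human-minus-model accuracy gap,
        resampling items jointly so the comparison stays paired."""
        rng = random.Random(seed)
        n = len(model_correct)
        gaps = []
        for _ in range(n_boot):
            idx = [rng.randrange(n) for _ in range(n)]
            m = sum(model_correct[i] for i in idx) / n
            h = sum(human_correct[i] for i in idx) / n
            gaps.append(h - m)
        gaps.sort()
        return gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot)]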

Circularity Check

0 steps flagged

No circularity: benchmark introduction and external evaluations are self-contained.

full rationale

The paper introduces CNSL-bench as a new dataset anchored to an external official dictionary and reports empirical evaluations of 21 third-party MLLMs. No equations, fitted parameters, predictions, or derivations appear in the provided text. Central claims rest on direct model performance measurements rather than any self-referential construction, self-citation chain, or renamed prior result. The anchoring assumption is an external grounding claim, not a definitional loop. This matches the default expectation of no significant circularity for benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the creation of a new benchmark dataset whose validity depends on the authority of the official dictionary and the representativeness of selected multimodal examples; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The National Common Sign Language Dictionary provides unambiguous and authoritative semantic definitions for signs.
    Invoked to ground the benchmark and eliminate regional variant ambiguity.

pith-pipeline@v0.9.0 · 5556 in / 1292 out tokens · 60266 ms · 2026-05-08T11:43:24.539864+00:00 · methodology


Reference graph

Works this paper leans on

2 extracted references · 1 canonical work pages · 1 internal anchor

  1. [1]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Studying and mitigating biases in sign lan- guage understanding models. InProceedings of the 2024 Conference on Empirical Methods in Natu- ral Language Processing, pages 268–283, Miami, Florida, USA. Association for Computational Lin- guistics. Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zho...

  2. [2]

    因为 (because/due to)

    Part (二) shows the dominant hand signifying the reason placing into the palm of the other hand representing the result. Considering this, option C, "因为 (because/due to)", looks like the clear winner. ...... First, I see two distinct movements. Movement (一) involves a typing motion, palms down, fingers bent, hands moving up and down slightly. That's clearl...