pith. machine review for the scientific record.

arxiv: 2604.15301 · v2 · submitted 2026-04-16 · 💻 cs.CV

Recognition: unknown

Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 11:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords sign language translation · gloss-free SLT · latent thoughts · cross-modal reasoning · plan-then-ground decoding · new dataset · video-to-text

The pith

Sign language translation improves when models reason through an ordered sequence of latent thoughts before generating text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that gloss-free sign language translation fails when it assumes brief video chunks map straight to words, because signers build meaning dynamically through context, space, and movement. It reframes the task as cross-modal reasoning and inserts an explicit middle layer of ordered latent thoughts that gradually extract and organize meaning over time. A plan-then-ground decoding step lets the model decide its output first and then verify it against the video evidence. The authors also release a new large-scale dataset designed with stronger context dependencies. Sympathetic readers would care because this changes how models handle real signing rather than forcing direct mappings.

Core claim

Many SLT systems quietly assume that brief chunks of signing map directly to spoken-language words, but this breaks down because signers often create meaning on the fly using context, space, and movement. The paper therefore treats SLT as mainly a cross-modal reasoning task. It introduces a reasoning-driven framework that places an ordered sequence of latent thoughts as an explicit middle layer between the video and the generated text so that meaning is extracted and organized gradually over time. On top of this, plan-then-ground decoding lets the model first decide what it wants to say and then look back at the video to find supporting evidence, which improves coherence and faithfulness. A new large-scale gloss-free dataset with stronger context dependencies rounds out the contribution.

What carries the argument

Ordered sequence of latent thoughts as an explicit middle layer for gradual meaning extraction, combined with plan-then-ground decoding that separates output planning from video grounding.
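
As a rough illustration of that machinery, the sketch below shows how an ordered latent-thought layer and a two-pass plan-then-ground decoder could be wired between a video encoder and a text decoder. It is a minimal sketch under assumed shapes and module names (LatentThoughtLayer, PlanThenGroundDecoder, the thought count, and all hyperparameters are illustrative); the paper's actual architecture may differ.

```python
# Minimal sketch only: module names, shapes, and hyperparameters are assumptions,
# not the paper's architecture. Video features are assumed pre-extracted as (B, T, D).
import torch
import torch.nn as nn

class LatentThoughtLayer(nn.Module):
    """Produce an ordered sequence of K latent 'thought' vectors from video features."""
    def __init__(self, d_model=512, num_thoughts=8, num_heads=8):
        super().__init__()
        # One learned query per thought slot; ordering comes from slot position.
        self.thought_queries = nn.Parameter(torch.randn(num_thoughts, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, video_feats):                      # video_feats: (B, T, D)
        queries = self.thought_queries.unsqueeze(0).expand(video_feats.size(0), -1, -1)
        thoughts, _ = self.cross_attn(queries, video_feats, video_feats)
        return thoughts + self.ffn(thoughts)             # (B, K, D), ordered by slot

class PlanThenGroundDecoder(nn.Module):
    """Two passes: plan the output from the thoughts, then ground it against the video."""
    def __init__(self, d_model=512, num_heads=8, num_layers=2):
        super().__init__()
        plan_layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
        ground_layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
        self.planner = nn.TransformerDecoder(plan_layer, num_layers)    # sees thoughts only
        self.grounder = nn.TransformerDecoder(ground_layer, num_layers) # sees video frames

    def forward(self, text_emb, thoughts, video_feats):  # causal masks omitted for brevity
        plan = self.planner(text_emb, thoughts)          # decide what to say
        return self.grounder(plan, video_feats)          # check it against the evidence
```

Slot position is the only source of ordering here; whatever objective the paper uses to make the thoughts genuinely temporal and non-collapsing is not reproduced.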

If this is right

  • Translations gain coherence and faithfulness by separating the decision of what to say from later evidence checking in the video.
  • The model handles dynamic use of context, space, and movement in signing more naturally than direct-mapping approaches.
  • Consistent performance gains appear over prior gloss-free methods across multiple benchmarks.
  • The released dataset supplies stronger context dependencies for testing future reasoning-based SLT systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent-thought middle layer could be tested on other video-to-text tasks that involve evolving context, such as action captioning.
  • Plan-then-ground decoding might lower hallucination rates in long outputs by forcing explicit grounding checks; a toy version of such a check is sketched after this list.
  • The new dataset could serve as a stress test for any cross-modal model that must track meaning built across an entire sequence rather than local chunks.
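
A toy version of the grounding check imagined in the second bullet, assuming the grounding pass exposes text-to-video cross-attention weights; the threshold and the interface are invented for illustration and do not come from the paper.

```python
# Illustrative only: flags planned tokens the grounding pass cannot tie to the video.
# Assumes cross-attention weights of shape (L, T) for L output tokens over T frames;
# the 0.1 threshold is arbitrary.
import torch

def flag_ungrounded_tokens(attn_weights: torch.Tensor, tokens: list[str],
                           min_support: float = 0.1) -> list[str]:
    """Return tokens whose strongest attention to any video frame is weak."""
    support = attn_weights.max(dim=-1).values            # best frame weight per token, (L,)
    return [tok for tok, s in zip(tokens, support.tolist()) if s < min_support]
```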

Load-bearing premise

That the main failure mode in gloss-free SLT is the assumption of direct chunk-to-word mapping and that latent thoughts plus plan-then-ground can be learned effectively without gloss supervision or extra constraints.

What would settle it

An ablation experiment on the new dataset that removes the latent thoughts layer and plan-then-ground step and still matches or exceeds the full model's translation quality would show the reasoning machinery is not required.
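
A sketch of how that comparison could be scored, assuming a build_model factory that can toggle the two components and a test set of (video, reference) pairs; BLEU via sacrebleu stands in for whatever metrics the paper actually reports.

```python
# Hypothetical ablation harness: the build_model factory, its flags, and the
# model.translate interface are placeholders; only the comparison logic is shown.
import sacrebleu

def corpus_bleu_score(model, test_pairs):
    """Corpus BLEU over (video, reference_sentence) pairs."""
    hyps = [model.translate(video) for video, _ in test_pairs]
    refs = [[ref for _, ref in test_pairs]]              # one reference stream
    return sacrebleu.corpus_bleu(hyps, refs).score

def run_ablation(build_model, test_pairs):
    variants = {
        "full":           build_model(latent_thoughts=True,  plan_then_ground=True),
        "no_thoughts":    build_model(latent_thoughts=False, plan_then_ground=True),
        "no_plan_ground": build_model(latent_thoughts=True,  plan_then_ground=False),
        "direct_mapping": build_model(latent_thoughts=False, plan_then_ground=False),
    }
    # If "direct_mapping" matches "full", the reasoning machinery is not doing the work.
    return {name: corpus_bleu_score(m, test_pairs) for name, m in variants.items()}
```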

Figures

Figures reproduced from arXiv: 2604.15301 by Li Qing, Li Zhang, Xiao-Yong Wei, Yiyang Jiang.

Figure 1: Latent thoughts clarified the context to dis…
Figure 2: An overview of the SignThought framework, which consists of three parts: (i) a sign encoder that maps …
Figure 3: Qualitative visualization on CSL-Daily.
Figure 4: LC-HKSLT scaling trend. Existing Hong Kong Sign Language resources are typically limited in scale or rely heavily on gloss supervision, which is misaligned with learning directly from raw videos and sentence-level translations. To bridge this gap and support our Latent Chain-of-Thought framework for long-range, weakly supervised sign-to-text reasoning, we construct LC-HKSLT, a large-scale corpus …
Original abstract

Many SLT systems quietly assume that brief chunks of signing map directly to spoken-language words. That assumption breaks down because signers often create meaning on the fly using context, space, and movement. We revisit SLT and argue that it is mainly a cross-modal reasoning task, not just a straightforward video-to-text conversion. We thus introduce a reasoning-driven SLT framework that uses an ordered sequence of latent thoughts as an explicit middle layer between the video and the generated text. These latent thoughts gradually extract and organize meaning over time. On top of this, we use a plan-then-ground decoding method: the model first decides what it wants to say, and then looks back at the video to find the evidence. This separation improves coherence and faithfulness. We also built and released a new large-scale gloss-free SLT dataset with stronger context dependencies and more realistic meanings. Experiments across several benchmarks show consistent gains over existing gloss-free methods. Our code and data are available at https://github.com/fletcherjiang/SignThought.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper argues that gloss-free sign language translation (SLT) is primarily a cross-modal reasoning task rather than a direct video-to-text mapping. It introduces a reasoning-driven framework that inserts an ordered sequence of latent thoughts as an explicit middle layer between video input and generated text, with these thoughts gradually extracting and organizing meaning over time. This is combined with a plan-then-ground decoding strategy in which the model first plans the output and then grounds it by referencing the video evidence. A new large-scale gloss-free SLT dataset with stronger context dependencies is released, and experiments report consistent gains over prior gloss-free methods.

Significance. If the latent thoughts can be shown to function as interpretable, temporally ordered reasoning steps rather than arbitrary internal states, the work could meaningfully shift SLT research toward more explicit reasoning paradigms that better handle contextual, spatial, and non-literal aspects of signing. The public release of code and the new dataset would further support reproducibility and progress on gloss-free settings.

major comments (2)
  1. [Abstract] The central claim that an 'ordered sequence of latent thoughts' forms an explicit middle layer that 'gradually extract[s] and organize[s] meaning over time' is load-bearing, yet the abstract provides no description of the training objective, auxiliary losses, or regularization terms that would enforce ordering or semantic content in the latent sequence. Without such constraints the layer could collapse, rendering performance gains attributable to the new dataset or other architectural changes rather than the proposed reasoning paradigm.
  2. [Abstract] The plan-then-ground decoding method is presented as separating planning from grounding to improve coherence and faithfulness, but no implementation details, loss formulations, or architectural diagrams are referenced that would allow verification of how the two stages interact with the latent-thought layer during training and inference.
minor comments (1)
  1. The abstract states that experiments show 'consistent gains over existing gloss-free methods' across 'several benchmarks' but supplies neither the specific benchmarks nor any quantitative results, tables, or ablation studies, which hinders immediate assessment of the magnitude and robustness of the improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our core contributions. We agree that the abstract requires expansion to better support the central claims and have revised it accordingly to include brief references to the training objectives, auxiliary losses, and decoding implementation. We respond point by point to the major comments below.

Point-by-point responses
  1. Referee: [Abstract] The central claim that an 'ordered sequence of latent thoughts' forms an explicit middle layer that 'gradually extract[s] and organize[s] meaning over time' is load-bearing, yet the abstract provides no description of the training objective, auxiliary losses, or regularization terms that would enforce ordering or semantic content in the latent sequence. Without such constraints the layer could collapse, rendering performance gains attributable to the new dataset or other architectural changes rather than the proposed reasoning paradigm.

    Authors: We acknowledge that the original abstract was too concise on this point and did not reference the mechanisms that enforce ordering and semantic content. In the full manuscript, Section 3.2 defines the training objective as a joint loss combining the primary sequence-to-sequence cross-entropy with an auxiliary contrastive loss over consecutive latent thoughts (to promote temporal ordering) and a mutual-information regularization term (to discourage collapse to uninformative states). Ablation experiments in Section 5.3 demonstrate that removing these terms reduces performance to levels comparable with prior gloss-free baselines, supporting that the gains are tied to the reasoning layer rather than solely the new dataset. We have revised the abstract to include a short clause referencing these constraints. revision: yes

  2. Referee: [Abstract] The plan-then-ground decoding method is presented as separating planning from grounding to improve coherence and faithfulness, but no implementation details, loss formulations, or architectural diagrams are referenced that would allow verification of how the two stages interact with the latent-thought layer during training and inference.

    Authors: We agree the abstract omitted implementation specifics. Section 4.2 of the manuscript details the plan-then-ground decoder: the planning stage generates an intermediate plan token sequence conditioned on the latent thoughts, while the grounding stage performs cross-attention to video features using a dedicated grounding loss (a combination of token-level reconstruction and attention alignment). The two stages share parameters with the latent-thought encoder and are trained end-to-end with a staged curriculum that first optimizes planning then jointly optimizes grounding. A diagram illustrating the interaction is provided in Figure 3. We have updated the abstract to reference this section and added a pointer to the supplementary architectural diagram. revision: yes
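
Both responses above are simulated, so the losses they describe are not confirmed by the paper; still, a hedged sketch of how such a joint objective and grounding loss could be written helps make the claims concrete. All shapes, loss weights, and the decorrelation stand-in for the mutual-information term are assumptions.

```python
# Hypothetical sketches of the losses described in the simulated rebuttal above;
# nothing here is confirmed by the paper. Shapes assumed: latent thoughts (B, K, D),
# decoder logits (B, L, V), target token ids (B, L), text-to-video attention (B, L, T).
import torch
import torch.nn.functional as F

def latent_thought_loss(logits, targets, thoughts,
                        lambda_ctr=0.1, lambda_mi=0.01, tau=0.07, pad_id=0):
    """Seq2seq cross-entropy + contrastive ordering term + anti-collapse regularizer."""
    # 1) Primary translation objective.
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         targets.reshape(-1), ignore_index=pad_id)

    # 2) InfoNCE over consecutive thoughts: thought k should match thought k+1
    #    from the same sample rather than thought k+1 from any other position.
    B, K, D = thoughts.shape
    a = F.normalize(thoughts[:, :-1].reshape(-1, D), dim=-1)
    b = F.normalize(thoughts[:, 1:].reshape(-1, D), dim=-1)
    sim = a @ b.t() / tau
    ctr = F.cross_entropy(sim, torch.arange(sim.size(0), device=sim.device))

    # 3) Cheap stand-in for the mutual-information term: decorrelate thought slots
    #    within a sample so the layer cannot collapse to one repeated state.
    t = F.normalize(thoughts, dim=-1)
    off_diag = t @ t.transpose(1, 2) - torch.eye(K, device=t.device)
    mi_reg = off_diag.pow(2).mean()

    return ce + lambda_ctr * ctr + lambda_mi * mi_reg

def grounding_loss(grounded_logits, targets, attn_weights, align_target,
                   lambda_align=0.5, pad_id=0):
    """Token-level reconstruction + attention alignment for the grounding stage."""
    recon = F.cross_entropy(grounded_logits.reshape(-1, grounded_logits.size(-1)),
                            targets.reshape(-1), ignore_index=pad_id)
    # KL from the model's attention to a supervised alignment distribution.
    align = F.kl_div(attn_weights.clamp_min(1e-8).log(), align_target,
                     reduction="batchmean")
    return recon + lambda_align * align
```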

Circularity Check

0 steps flagged

New architectural choices with no derivation chain or self-referential reductions

Full rationale

The paper introduces a reasoning-driven framework with latent thoughts as an explicit middle layer and plan-then-ground decoding, plus a new dataset, as architectural and data contributions rather than mathematical derivations. No equations, fitted parameters, or self-citations appear in the provided text that would reduce any central claim to its own inputs by construction. The claims rest on the novelty of the paradigm and empirical gains, which do not exhibit self-definitional, fitted-prediction, or uniqueness-imported circularity. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract provides insufficient technical detail to enumerate specific free parameters, axioms, or invented entities beyond the high-level introduction of latent thoughts; full methods section would be required for a complete ledger.

invented entities (1)
  • latent thoughts · no independent evidence
    purpose: explicit middle layer for gradual meaning extraction and organization
    Introduced as a new conceptual component without reference to prior independent evidence or formal definition in the abstract.

pith-pipeline@v0.9.0 · 5479 in / 1185 out tokens · 35611 ms · 2026-05-10T11:41:26.177942+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Mol-Debate: Multi-Agent Debate Improves Structural Reasoning in Molecular Design

    cs.AI · 2026-04 · unverdicted · novelty 6.0

    Mol-Debate applies multi-agent debate in an iterative loop with perspective orchestration to achieve state-of-the-art text-guided molecular design, scoring 59.82% exact match on ChEBI-20 and 50.52% weighted success on...

Reference graph

Works this paper leans on

76 extracted references · 26 canonical work pages · cited by 1 Pith paper · 10 internal anchors

  1. [1]

    Nikolas Adaloglou, Theocharis Chatzis, Ilias Papastratis, Andreas Stergioulas, Georgios Th Papadopoulos, Vassia Zacharopoulou, George J Xydopoulos, Klimnis Atzakas, Dimitris Papazachariou, and Petros Daras. 2021. A comprehensive study on deep learning-based methods for sign language recognition. IEEE transactions on multimedia, 24:1750--1762

  2. [2]

    Samuel Albanie, Gül Varol, Liliane Momeni, Triantafyllos Afouras, Joon Son Chung, Neil Fox, and Andrew Zisserman. 2020. BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues. In Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XI 16, pages 35--53. Springer

  3. [3]

    Patricia Cabot Alvarez, Xavier Giró Nieto, and Laia Tarrés Benet. Sign language translation based on transformers for the how2sign dataset

  4. [4]

    Danielle Bragg, Oscar Koller, Mary Bellard, Larwan Berke, Patrick Boudreault, Annelies Braffort, Naomi Caselli, Matt Huenerfauth, Hernisa Kacorri, and Tessa Verhoef. 2019. Sign language recognition, generation, and translation: An interdisciplinary perspective. In Proceedings of the 21st international ACM SIGACCESS conference on computers and accessibilit...

  5. [5]

    Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. 2018. Neural sign language translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7784--7793

  6. [6]

    Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, and Richard Bowden. 2020. Sign language transformers: Joint end-to-end sign language recognition and translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10023--10033

  7. [7]

    Yutong Chen, Fangyun Wei, Xiao Sun, Zhirong Wu, and Stephen Lin. 2022 a . A simple multi-modality transfer learning baseline for sign language translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5120--5130

  8. [8]

    Yutong Chen, Ronglai Zuo, Fangyun Wei, Yu Wu, Shujie Liu, and Brian Mak. 2022 b . Two-stream network for sign language recognition and translation. Advances in Neural Information Processing Systems, 35:17043--17056

  9. [9]

    Zhigang Chen, Benjia Zhou, Yiqing Huang, Jun Wan, Yibo Hu, Hailin Shi, Yanyan Liang, Zhen Lei, and Du Zhang. 2024a. C2RL: Content and context representation learning for gloss-free sign language translation and retrieval

  10. [10]

    Zhigang Chen, Benjia Zhou, Jun Li, Jun Wan, Zhen Lei, Ning Jiang, Quan Lu, and Guoqing Zhao. 2024 b . Factorized learning assisted with large language model for gloss-free sign language translation. pages 7071--7081

  11. [11]

    Zhiwei Chen, Yupeng Hu, Zhiheng Fu, Zixu Li, Jiale Huang, Qinlei Huang, and Yinwei Wei. 2026. Intent: Invariance and discrimination-aware noise mitigation for robust composed image retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 20463--20471

  12. [12]

    Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Xuemeng Song, and Liqiang Nie. 2025. Offset: Segmentation-based focus shift revision for composed image retrieval. In Proceedings of the ACM International Conference on Multimedia, page 6113–6122

  13. [13]

    Ka Leong Cheng, Zhaoyang Yang, Qifeng Chen, and Yu-Wing Tai. 2020. Fully convolutional networks for continuous sign language recognition. In Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XXIV 16, pages 697--714. Springer

  14. [14]

    Kearsy Cormier, David Quinto-Pozos, Zed Sevcikova, and Adam Schembri. 2012. Lexicalisation and de-lexicalisation processes in sign languages: Comparing depicting constructions and viewpoint gestures. Language & communication, 32(4):329--348

  15. [15]

    Haonan Dong, Kehan Jiang, Haoran Ye, Wenhao Zhu, Zhaolu Kang, and Guojie Song. 2026. https://arxiv.org/abs/2604.02972 Neureasoner: Towards explainable, controllable, and unified reasoning via mixture-of-neurons . Preprint, arXiv:2604.02972

  16. [16]

    Amanda Duarte, Shruti Palaskar, Lucas Ventura, Deepti Ghadiyaram, Kenneth DeHaan, Florian Metze, Jordi Torres, and Xavier Giro i Nieto. 2021. https://arxiv.org/abs/2008.08143 How2sign: A large-scale multimodal dataset for continuous american sign language . Preprint, arXiv:2008.08143

  17. [17]

    Jens Forster, Christoph Schmidt, Oscar Koller, Martin Bellgardt, and Hermann Ney. 2014. Extensions of the sign language recognition and translation corpus rwth-phoenix-weather. In LREC, pages 1911--1916

  18. [18]

    Jia Gong, Lin Geng Foo, Yixuan He, Hossein Rahmani, and Jun Liu. 2024. Llms are good sign language translators. In CVPR, pages 18362--18372

  19. [19]

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. 2025. https://arxiv.org/abs/2412.06769 Training large language models to reason in a continuous latent space . Preprint, arXiv:2412.06769

  20. [20]

    Yupeng Hu, Zixu Li, Zhiwei Chen, Qinlei Huang, Zhiheng Fu, Mingzhu Xu, and Liqiang Nie. 2026. Refine: Composed video retrieval via shared and differential semantics enhancement. ACM Transactions on Multimedia Computing, Communications and Applications

  21. [21]

    Jiani Huang, Shijie Wang, Liangbo Ning, Wenqi Fan, Shuaiqiang Wang, Dawei Yin, and Qing Li. 2026. Towards next-generation recommender systems: A benchmark for personalized recommendation assistant with llms. In Proceedings of the Nineteenth ACM International Conference on Web Search and Data Mining, pages 217--226

  22. [22]

    Jiani Huang, Xingchen Zou, Lianghao Xia, and Qing Li. 2025. Mr. rec: Synergizing memory and reasoning for personalized recommendation assistant with llms. arXiv preprint arXiv:2510.14629

  23. [23]

    Kehan Jiang, Haonan Dong, Zhaolu Kang, Zhengzhou Zhu, and Guojie Song. 2026 a . https://arxiv.org/abs/2604.02967 Foe: Forest of errors makes the first solution the best in large reasoning models . Preprint, arXiv:2604.02967

  24. [24]

    Yiyang Jiang, Guangwu Qian, Jiaxin Wu, Qi Huang, Qing Li, Yongkang Wu, and Xiao-Yong Wei. 2026 b . https://doi.org/10.1109/TMI.2025.3637237 Self-paced learning for images of antinuclear antibodies . IEEE Transactions on Medical Imaging, 45(4):1661--1672

  25. [25]

    Yiyang Jiang, Wengyu Zhang, Xulu Zhang, Xiao-Yong Wei, Chang Wen Chen, and Qing Li. 2024. https://doi.org/10.1145/3664647.3681115 Prior knowledge integration via llm encoding and pseudo event regulation for video moment retrieval . In Proceedings of the 32nd ACM International Conference on Multimedia, MM '24, page 7249–7258, New York, NY, USA. Association...

  26. [26]

    Peiqi Jiao, Yuecong Min, and Xilin Chen. 2024. Visual alignment pre-training for sign language translation. In European Conference on Computer Vision, pages 349--367. Springer

  27. [27]

    Oscar Koller, Jens Forster, and Hermann Ney. 2015. Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Computer Vision and Image Understanding, 141:108--125

  28. [28]

    Oscar Koller, Sepehr Zargaran, and Hermann Ney. 2017. https://doi.org/10.1109/CVPR.2017.364 Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent cnn-hmms . In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3416--3424

  29. [29]

    Dongxu Li, Cristian Rodriguez, Xin Yu, and Hongdong Li. 2020 a . Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1459--1469

  30. [30]

    Dongxu Li, Chenchen Xu, Xin Yu, Kaihao Zhang, Benjamin Swift, Hanna Suominen, and Hongdong Li. 2020 b . Tspnet: Hierarchical feature learning via temporal semantic pyramid for sign language translation. Advances in Neural Information Processing Systems, 33:12034--12045

  31. [31]

    Zecheng Li, Wengang Zhou, Weichao Zhao, Kepeng Wu, Hezhen Hu, and Houqiang Li. 2025. Uni-sign: Toward unified sign language understanding at scale. arXiv preprint arXiv:2501.15187

  32. [32]

    Zixu Li, Yupeng Hu, Zhiwei Chen, Qinlei Huang, Guozhi Qiu, Zhiheng Fu, and Meng Liu. 2026 a . Retrack: Evidence-driven dual-stream directional anchor calibration network for composed video retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 23373--23381

  33. [33]

    Zixu Li, Yupeng Hu, Zhiwei Chen, Shiqi Zhang, Qinlei Huang, Zhiheng Fu, and Yinwei Wei. 2026 b . Habit: Chrono-synergia robust progressive learning framework for composed image retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 6762--6770

  34. [34]

    Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74--81

  35. [35]

    Kezhou Lin, Xiaohan Wang, Linchao Zhu, Ke Sun, Bang Zhang, and Yi Yang. 2023. Gloss-free end-to-end sign language translation. arXiv preprint arXiv:2305.12876

  36. [36]

    Peiyang Liu, Ziqiang Cui, Di Liang, and Wei Ye. 2025 a . Who stole your data? a method for detecting unauthorized rag theft. arXiv preprint arXiv:2510.07728

  37. [37]

    Peiyang Liu, Xi Wang, Ziqiang Cui, and Wei Ye. 2025 b . Queries are not alone: Clustering text embeddings for video search. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 874--883

  38. [38]

    Peiyang Liu, Jinyu Yang, Lin Wang, Sen Wang, Yunlai Hao, and Huihui Bai. 2023. Retrieval-based unsupervised noisy label detection on text data. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 4099--4104

  39. [39]

    Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. 2024. Compositional chain-of-thought prompting for large multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14420--14431

  40. [40]

    Chong-Wah Ngo, Yu-Gang Jiang, Xiao-Yong Wei, Wanlei Zhao, Feng Wang, Xiao Wu, and Hung-Khoon Tan. 2008. Beyond semantic search: What you observe may not be what you think. In IEEE Computer Society

  41. [41]

    Shuo Ni, Di Wang, He Chen, Haonan Guo, Ning Zhang, and Jing Zhang. 2025. Unigeoseg: Towards unified open-world segmentation for geospatial scenes. arXiv preprint arXiv:2511.23332

  42. [42]

    Zhe Niu, Ronglai Zuo, Brian Mak, and Fangyun Wei. 2024. https://arxiv.org/abs/2405.00980 A hong kong sign language corpus collected from sign-interpreted tv news . Preprint, arXiv:2405.00980

  43. [43]

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311--318

  44. [44]

    Bowen Shi, Diane Brentari, Greg Shakhnarovich, and Karen Livescu. 2022 a . Open-domain sign language translation learned from online video. In EMNLP

  45. [45]

    Bowen Shi, Diane Brentari, Greg Shakhnarovich, and Karen Livescu. 2022 b . https://arxiv.org/abs/2205.12870 Open-domain sign language translation learned from online video . Preprint, arXiv:2205.12870

  46. [46]

    Laia Tarrés, Gerard I. Gállego, Amanda Duarte, Jordi Torres, and Xavier Giró-i-Nieto. 2023. Sign language translation from instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5624--5634

  47. [47]

    Guobin Tu and Di Weng. 2026. https://arxiv.org/abs/2601.03549 Easlt: Emotion-aware sign language translation . Preprint, arXiv:2601.03549

  48. [48]

    David Uthus, Garrett Tanzer, and Manfred Georg. 2023. https://arxiv.org/abs/2306.15162 Youtube-asl: A large-scale, open-domain american sign language-english parallel corpus . Preprint, arXiv:2306.15162

  49. [49]

    Andreas Voskou, Konstantinos P Panousis, Dimitrios Kosmopoulos, Dimitris N Metaxas, and Sotirios Chatzis. 2021. Stochastic transformer networks with linear competing units: Application to end-to-end sl translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11946--11955

  50. [50]

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. https://arxiv.org/abs/2203.11171 Self-consistency improves chain of thought reasoning in language models . Preprint, arXiv:2203.11171

  51. [51]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. https://arxiv.org/abs/2201.11903 Chain-of-thought prompting elicits reasoning in large language models . Preprint, arXiv:2201.11903

  52. [52]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, and 1 others. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824--24837

  53. [53]

    Ryan Wong, Necati Cihan Camgoz, and Richard Bowden. 2024. Sign2GPT: Leveraging large language models for gloss-free sign language translation. In ICLR

  54. [54]

    Zhen-Qun Yang and Xiao-Yong Wei. 2013. Coaching the exploration and exploitation in active learning for interactive video retrieval. IEEE Transactions on Image Processing, 22(3):955--968

  55. [55]

    Can Xie, Ruotong Pan, Xiangyu Wu, Yunfei Zhang, Jiayi Fu, Tingting Gao, and Guorui Zhou. 2025. Unlocking exploration in rlvr: Uncertainty-aware advantage shaping for deeper reasoning. arXiv preprint arXiv:2510.10649

  56. [56]

    Kevin Xu and Issei Sato. 2025. https://arxiv.org/abs/2509.25239 A formal comparison between chain-of-thought and latent thought . Preprint, arXiv:2509.25239

  57. [57]

    Qianyun Yang, Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, and Liqiang Nie. 2026. Stable: Efficient hybrid nearest neighbor search via magnitude-uniformity and cardinality-robustness. arXiv preprint arXiv:2604.01617

  58. [58]

    Jiayu Yao, He Chen, Yizhuang Xie, Ning Zhang, Mingxu Yang, and Liang Chen. 2025. S2Net: Spatial-aligned and semantic-discriminative network for remote sensing object detection. IEEE Transactions on Geoscience and Remote Sensing

  59. [59]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. https://arxiv.org/abs/2210.03629 React: Synergizing reasoning and acting in language models . Preprint, arXiv:2210.03629

  60. [60]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations

  61. [61]

    Aoxiong Yin, Tianyun Zhong, Li Tang, Weike Jin, Tao Jin, and Zhou Zhao. 2023. Gloss attention for gloss-free sign language translation. In ICCV, pages 2551--2562

  62. [62]

    Kayo Yin, Amit Moryossef, Julie Hochgesang, Yoav Goldberg, and Malihe Alikhani. 2021. Including signed languages in natural language processing. arXiv preprint arXiv:2105.05222

  63. [63]

    Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. 2025. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. arXiv preprint arXiv:2505.17685

  64. [64]

    Biao Zhang, Mathias Müller, and Rico Sennrich. 2023. SLTUNet: A simple unified model for sign language translation. In ICLR

  65. [65]

    Ning Zhang, Shuo Ni, Liang Chen, Tong Wang, and He Chen. 2025 a . High-throughput and energy-efficient fpga-based accelerator for all adder neural networks. IEEE Internet of Things Journal

  66. [66]

    Wengyu Zhang, Qi Tian, Yi Cao, Wenqi Fan, Dongmei Jiang, Yaowei Wang, Qing Li, and Xiao-Yong Wei. 2025 b . Graphatc: advancing multilevel and multi-label anatomical therapeutic chemical classification via atom-level graph learning. Briefings in bioinformatics, 26(2):bbaf194

  67. [67]

    Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. 2024. https://arxiv.org/abs/2302.00923 Multimodal chain-of-thought reasoning in language models . Preprint, arXiv:2302.00923

  68. [68]

    Rui Zhao, Liang Zhang, Biao Fu, Cong Hu, Jinsong Su, and Yidong Chen. 2024. Conditional variational autoencoder for sign language translation with cross-modal alignment. In AAAI, pages 19643--19651

  69. [69]

    Benjia Zhou, Zhigang Chen, Albert Clapés, Jun Wan, Yanyan Liang, Sergio Escalera, Zhen Lei, and Du Zhang. 2023. Gloss-free sign language translation: Improving from visual-language pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20871--20881

  70. [70]

    Hao Zhou, Wengang Zhou, Weizhen Qi, Junfu Pu, and Houqiang Li. 2021. Improving sign language translation with monolingual data by sign back-translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1316--1325

  71. [71]

    Wengang Zhou, Weichao Zhao, Hezhen Hu, Zecheng Li, and Houqiang Li. 2024 a . Scaling up multimodal pre-training for sign language understanding. arXiv preprint arXiv:2408.08544

  72. [72]

    Xiaoling Zhou, Wei Ye, Zhemg Lee, Lei Zou, and Shikun Zhang. 2025. Valuing training data via causal inference for in-context learning. IEEE Transactions on Knowledge and Data Engineering

  73. [73]

    Xiaoling Zhou, Wei Ye, Yidong Wang, Chaoya Jiang, Zhemg Lee, Rui Xie, and Shikun Zhang. 2024 b . Enhancing in-context learning via implicit demonstration augmentation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2810--2828

  74. [74]

    Inge Zwitserlood. 2012. Classifiers. In Sign language: An international handbook. de Gruyter
