CANDLE: Character-level Arabic Noise Deduplication using Lightweight Encoder
Pith reviewed 2026-06-25 23:55 UTC · model grok-4.3
The pith
Treating Arabic character deduplication as CTC sequence alignment allows a lightweight encoder to normalize text without dictionaries or rules.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that Connectionist Temporal Classification (CTC), applied over a character-based encoder, solves Arabic noise deduplication as a sequence alignment problem. The model reaches a sentence error rate as low as 5.37 percent across three benchmarks and outperforms a classification baseline. A distilled two-layer version retains most of the accuracy, and normalization yields up to a 12.8 percent relative reduction in tokenizer fertility.
What carries the argument
Connectionist Temporal Classification (CTC) applied as a sequence alignment mechanism over a character-based encoder to map noisy input to normalized output.
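As a rough illustration of the alignment idea, standard CTC decoding turns a per-frame label sequence into an output string by first merging consecutive identical labels and then deleting blanks. The sketch below shows only that generic collapsing rule. The blank symbol `_`, the function name `ctc_collapse`, and the Latin toy strings are assumptions made for illustration; how CANDLE maps input characters to frames is not described on this page.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, chosen only for this sketch


def ctc_collapse(alignment, blank=BLANK):
    """Apply the standard CTC collapsing rule to a frame-level label string.

    Step 1: merge each run of identical consecutive labels into one label.
    Step 2: delete every blank label.
    """
    merged = [label for label, _run in groupby(alignment)]
    return "".join(label for label in merged if label != blank)


# A blank between two runs of "o" keeps both letters in the output.
print(ctc_collapse("c_oo_o_l"))  # cool
# Without a separating blank, the whole run merges into a single letter.
print(ctc_collapse("cooool"))    # col
```

Under this rule a doubled letter survives only when a blank separates its two occurrences, which is one way a CTC model could keep a legitimate repetition while dropping an elongation.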
If this is right
- The CTC model reaches a sentence error rate as low as 5.37 percent across clean, ambiguous, and social-media benchmarks.
- It outperforms a classification-based baseline by a large margin on all three evaluation sets.
- Distilling the six-layer model into two layers produces a threefold depth reduction with minimal accuracy loss.
- Normalization produces up to 12.8 percent relative reduction in tokenizer fertility across multiple Arabic LLM tokenizers.
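The headline metric in the list above can be made concrete with a short sketch. It assumes the usual reading of sentence error rate, namely the percentage of sentences whose output differs from the reference in any way; the paper's exact matching rule is not given here, and the function name and toy sentences are hypothetical.

```python
def sentence_error_rate(predictions, references):
    """Percentage of sentences whose prediction is not an exact match.

    Assumption: a sentence counts as an error if it differs from its
    reference by even one character.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one sentence is required")
    errors = sum(pred != ref for pred, ref in zip(predictions, references))
    return 100.0 * errors / len(references)


# Toy data: one of four sentences is wrong.
refs = ["so good", "cool", "see you", "hello"]
preds = ["so good", "col", "see you", "hello"]
print(sentence_error_rate(preds, refs))  # 25.0
```

Because a single wrong character fails the whole sentence, this metric is stricter than a character-level error rate on the same outputs.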
Where Pith is reading between the lines
- The same CTC framing could be tested on other languages that use character repetition for emphasis in informal writing.
- Preprocessing pipelines for Arabic LLMs could incorporate this step to improve context-window utilization without extra linguistic resources.
- Further model compression might allow deployment on resource-limited devices while preserving the reported error rates.
Load-bearing premise
That CTC alignment over raw characters alone can separate correct repetitions from noise without morphological or dictionary information.
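To see why this premise carries weight, consider the simplest rule-free alternative: collapse every run of a repeated character to one character. The sketch below is a strawman written for illustration, not the paper's classification baseline. It fixes an elongated spelling of the word for "beautiful" but also damages a word whose correct spelling contains a doubled letter.

```python
import re


def naive_collapse(text):
    """Collapse every run of a repeated character to a single character."""
    return re.sub(r"(.)\1+", r"\1", text)


# jim, mim, ya x4, lam: an elongated spelling of "jamil" (beautiful)
ELONGATED = "\u062c\u0645\u064a\u064a\u064a\u064a\u0644"
JAMIL = "\u062c\u0645\u064a\u0644"

# alif, lam, lam, ha: correctly spelled with two consecutive lams
ALLAH = "\u0627\u0644\u0644\u0647"

print(naive_collapse(ELONGATED) == JAMIL)  # True: the noise is removed
print(naive_collapse(ALLAH) == ALLAH)      # False: a correct doubled letter is lost
```

Any system that meets the premise has to make the second case come out right using only the surrounding characters as evidence.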
What would settle it
A new benchmark of ambiguous elongations where a morphology-aware system achieves lower sentence error rate than the CTC model.
Original abstract
Handling repeated characters in text can be tricky, since they can represent either the correct spelling of a word or informal character elongation often seen in social media posts. We present CANDLE, a lightweight system for character-level Arabic noise deduplication that addresses this challenge without relying on handcrafted rules, dictionaries, or morphological analyzers. At the heart of CANDLE is a novel application of Connectionist Temporal Classification (CTC) to this task, a formulation not previously explored for character deduplication, which frames normalization as a sequence alignment problem over a character-based encoder. Evaluated on three benchmarks spanning clean newspaper, manually curated ambiguous cases, and real-world social media text, the CTC model achieves a Sentence Error Rate (SER) as low as 5.37% and consistently outperforms a classification-based baseline by a large margin. To reduce inference overhead, we distill the 6-layer CTC model into a 2-layer student, achieving a 3× depth reduction with minimal performance degradation. Beyond deduplication accuracy, normalization yields a practical downstream benefit: a relative reduction in tokenizer fertility of up to 12.8% across a diverse set of Arabic LLM tokenizers, directly lowering inference costs and improving context window utilization. We release all code and models publicly to support reproducibility and advance future research (https://github.com/abjadai/candle).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CANDLE, a lightweight character-level system for Arabic noise deduplication that frames the task as a sequence alignment problem solved via Connectionist Temporal Classification (CTC) over raw characters, without rules, dictionaries, or morphological analyzers. A 6-layer CTC encoder is evaluated on three benchmarks (clean newspaper, manually curated ambiguous cases, real-world social media), achieving a minimum Sentence Error Rate (SER) of 5.37% and outperforming a classification baseline by a large margin; the model is distilled to 2 layers (3× depth reduction) with minimal degradation. Normalization is also shown to yield up to 12.8% relative reduction in tokenizer fertility across Arabic LLM tokenizers. All code and models are released publicly.
Significance. If the reported SER numbers and baseline margins hold under scrutiny, the work demonstrates a practical, parameter-light alternative to rule-based or morphology-dependent normalization for Arabic social media text, with direct downstream utility in lowering LLM inference costs via improved tokenization. The public release of code and models is a clear strength that supports reproducibility and enables follow-on work.
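The downstream claim rests on tokenizer fertility, so a small worked example may help. The sketch assumes the common definition of fertility as average tokens per whitespace-delimited word, which may differ from the paper's exact definition. `toy_tokenize` is a hypothetical stand-in that cuts words into three-character pieces; it is not one of the Arabic LLM tokenizers evaluated in the paper, and the Latin strings are toy data.

```python
def toy_tokenize(text, piece_len=3):
    """Hypothetical tokenizer: split each word into fixed-size pieces."""
    return [
        word[i:i + piece_len]
        for word in text.split()
        for i in range(0, len(word), piece_len)
    ]


def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    if n_words == 0:
        raise ValueError("no words to measure")
    return n_tokens / n_words


def relative_reduction(before, after):
    """Relative reduction in percent: 100 * (before - after) / before."""
    return 100.0 * (before - after) / before


noisy = ["sooooo good"]   # 4 pieces over 2 words
clean = ["so good"]       # 3 pieces over 2 words
f_noisy = fertility(noisy, toy_tokenize)
f_clean = fertility(clean, toy_tokenize)
print(f_noisy, f_clean, relative_reduction(f_noisy, f_clean))  # 2.0 1.5 25.0
```

The same three functions could be pointed at a real tokenizer by passing its encode function as `tokenize`; the size of the reduction then depends on how that tokenizer fragments elongated words.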
major comments (2)
- [Abstract and Evaluation] The central empirical claims rest on SER values (minimum 5.37%) and large-margin outperformance of the classification baseline across three benchmarks, yet no dataset sizes, training details, data splits, or error analysis are supplied. This absence prevents independent verification of the numbers and of whether the CTC formulation truly separates correct repetitions from noise on the manually curated ambiguous cases.
- [Distillation paragraph] The claim of a 3× depth reduction (6-layer to 2-layer student) with only minimal performance degradation is load-bearing for the practicality argument, but no quantitative SER or fertility numbers are given for the student model, making it impossible to assess the trade-off.
minor comments (1)
- [Abstract] The footnote citing the GitHub release appears after the final sentence; moving it to the first mention of public release would improve readability.
Simulated Author's Rebuttal
We thank the referee for the detailed review and constructive suggestions. We address each major comment below and will revise the manuscript accordingly to improve clarity and reproducibility.
- Referee: [Abstract and Evaluation] The central empirical claims rest on SER values (minimum 5.37%) and large-margin outperformance of the classification baseline across three benchmarks, yet no dataset sizes, training details, data splits, or error analysis are supplied. This absence prevents independent verification of the numbers and of whether the CTC formulation truly separates correct repetitions from noise on the manually curated ambiguous cases.
  Authors: We agree that the current manuscript lacks sufficient detail on dataset sizes, training procedures, data splits, and error analysis, which limits independent verification. In the revised version we will expand the Experiments section to report the exact sizes of all three benchmarks, the full training hyperparameters and optimization details for the 6-layer CTC encoder, the train/validation/test splits, and a qualitative error analysis on the manually curated ambiguous cases that illustrates how the CTC alignment distinguishes legitimate character repetitions from noise.
  Revision: yes
- Referee: [Distillation paragraph] The claim of a 3× depth reduction (6-layer to 2-layer student) with only minimal performance degradation is load-bearing for the practicality argument, but no quantitative SER or fertility numbers are given for the student model, making it impossible to assess the trade-off.
  Authors: We acknowledge that the manuscript currently provides no quantitative SER or fertility numbers for the distilled 2-layer model. In the revised Distillation subsection we will report the exact SER on each benchmark and the tokenizer fertility reductions achieved by the student model, allowing readers to evaluate the accuracy–efficiency trade-off directly.
  Revision: yes
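For readers weighing the distillation exchange above, the sketch below shows one generic way a student can be trained to mimic a teacher: a temperature-softened KL divergence between per-frame output distributions, in the style of classic knowledge distillation. This page does not say which distillation objective CANDLE uses, so the loss, the temperature value, and all names here are assumptions, and the logits are toy numbers.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one frame of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean per-frame KL(teacher || student) on softened distributions.

    The result is scaled by temperature**2, as is conventional for
    soft-target distillation. Assumption: both models emit one logit
    vector per frame over the same label set.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("frame counts must match")
    total = 0.0
    for t_frame, s_frame in zip(teacher_logits, student_logits):
        p = softmax(t_frame, temperature)
        q = softmax(s_frame, temperature)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * total / len(teacher_logits)


teacher = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]  # toy logits: 2 frames, 3 labels
student = [[1.5, 0.7, -0.5], [0.3, 2.0, 0.4]]
print(kd_loss(teacher, teacher))       # 0.0 when the student matches the teacher
print(kd_loss(teacher, student) > 0)   # True
```

Whatever objective the authors actually used, the referee's point stands: the trade-off can only be judged once student SER and fertility numbers are reported next to the teacher's.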
Circularity Check
No significant circularity
The paper describes an empirical ML system that applies CTC sequence modeling to character-level deduplication on Arabic text. All reported results (SER of 5.37%, margin over baseline, tokenizer fertility reduction) are direct measurements on held-out benchmarks. No equations, derivations, or parameter-fitting steps are described that reduce a claimed prediction back to its own inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. The central claim is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: CTC loss can be used to train an encoder to map noisy repeated-character sequences to normalized forms.