Neural Grammatical Error Correction for Romanian
Pith reviewed 2026-05-08 06:08 UTC · model grok-4.3
The pith
Pretraining a larger Transformer on artificial errors, then finetuning on a new 10k-pair Romanian GEC corpus, reaches an F0.5 of 53.76.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By constructing a 10k-pair GEC dataset for Romanian and adapting an error annotation toolkit, the paper shows that pretraining a larger Transformer on artificially generated errors, created using only a POS tagger, and then finetuning on the real corpus produces the best model, with an F0.5 of 53.76, outperforming a smaller model trained directly on the human-annotated data alone.
What carries the argument
The two-stage process of pretraining a larger Transformer on synthetic grammatical errors generated via a POS tagger before finetuning on the real Romanian GEC corpus.
If this is right
- The method for generating additional training examples extends to any language, since it requires only a POS tagger.
- Pretraining strategies are effective for neural GEC in low-resource settings.
- Larger Transformer models gain more from the artificial pretraining step than smaller models trained directly on the limited real data.
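The POS-tagger-only generation idea in the bullets above can be sketched concretely. The corruption operations below (determiner deletion, preposition duplication, adjacent-word swap) are hypothetical stand-ins, not the paper's actual rules, and sentences are assumed to arrive pre-tagged:

```python
import random

# Illustrative sketch of POS-conditioned synthetic error generation.
# The specific operations here are invented for demonstration; the
# paper's real rules would target Romanian phenomena such as
# definite-article suffixes and agreement.

def corrupt(tagged, rng):
    """Apply one random POS-conditioned corruption to a (token, POS) sentence."""
    ops = []
    for i, (_, pos) in enumerate(tagged):
        if pos == "DET":
            ops.append(("drop", i))   # missing-article error
        if pos == "ADP":
            ops.append(("dup", i))    # repeated-preposition error
        if i + 1 < len(tagged):
            ops.append(("swap", i))   # word-order error
    op, i = rng.choice(ops)
    toks = [tok for tok, _ in tagged]
    if op == "drop":
        del toks[i]
    elif op == "dup":
        toks.insert(i, toks[i])
    else:
        toks[i], toks[i + 1] = toks[i + 1], toks[i]
    return " ".join(toks)

sent = [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
        ("on", "ADP"), ("the", "DET"), ("mat", "NOUN")]
print(corrupt(sent, random.Random(0)))
```

Running a corruptor like this over a large clean corpus yields (corrupted, clean) pairs for pretraining, with no annotation cost beyond tagging.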
Where Pith is reading between the lines
- The same POS-based synthetic pretraining could be applied to GEC for other languages that have a tagger but lack large annotated corpora.
- If performance gains hold across languages, simple rule-based error simulation may reduce reliance on expensive human annotation for GEC.
- One could measure how much the improvement depends on the size gap between the synthetic pretraining set and the 10k real pairs.
Load-bearing premise
Errors artificially generated with only a POS tagger are representative enough of real Romanian grammatical mistakes for the pretraining to transfer usefully when finetuning on the human-annotated corpus.
What would settle it
Evaluating the best pretrained model on a new collection of real human-annotated Romanian sentences and finding its F0.5 score no higher than the 44.38 baseline would indicate the artificial data did not deliver transferable gains.
read the original abstract
Resources for Grammatical Error Correction (GEC) in non-English languages are scarce, while available spellcheckers in these languages are mostly limited to simple corrections and rules. In this paper we introduce a first GEC corpus for Romanian consisting of 10k pairs of sentences. In addition, the German version of ERRANT (ERRor ANnotation Toolkit) scorer was adapted for Romanian to analyze this corpus and extract edits needed for evaluation. Multiple neural models were experimented, together with pretraining strategies, which proved effective for GEC in low-resource settings. Our baseline consists of a small Transformer model trained only on the GEC dataset (F0.5 of 44.38), whereas the best performing model is produced by pretraining a larger Transformer model on artificially generated data, followed by finetuning on the actual corpus (F0.5 of 53.76). The proposed method for generating additional training examples is easily extensible and can be applied to any language, as it requires only a POS tagger.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the first Romanian GEC corpus of 10k sentence pairs, adapts the ERRANT scorer for Romanian error analysis, and evaluates neural Transformer models. The baseline (small Transformer trained only on the corpus) achieves F0.5 = 44.38; the best result (F0.5 = 53.76) is obtained by pretraining a larger Transformer on synthetic errors generated via a POS tagger and then fine-tuning on the real corpus. The synthetic-generation procedure is presented as language-agnostic and requiring only a POS tagger.
Significance. If the central result holds, the work supplies the first public Romanian GEC resource and demonstrates that POS-based synthetic pretraining can improve low-resource GEC performance. The extensible generation method and ERRANT adaptation are practical contributions that could be reused for other under-resourced languages.
major comments (2)
- [Synthetic data generation] Synthetic data generation (method section): the description states that errors are inserted using only POS tags, yet provides no explicit rules, examples, or frequency tables for Romanian-specific phenomena (definite-article placement, case marking, gender/number agreement). Without a quantitative comparison of ERRANT error-type distributions between the synthetic corpus and the 10k human-annotated pairs, the 9.38-point F0.5 lift cannot be confidently attributed to targeted pretraining rather than model scale or generic noise.
- [Results and evaluation] Results and evaluation (experimental section): F0.5 scores are reported for single runs without train/test split sizes, hyper-parameter search details, or statistical significance tests. On a 10k-pair corpus, variance across seeds or bootstrap confidence intervals is needed to establish that the two-stage procedure reliably outperforms the baseline.
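The distribution comparison requested in the first major comment could be as simple as tabulating relative frequencies of ERRANT-style error-type labels in each corpus. A hedged sketch, with invented labels rather than counts from the paper:

```python
from collections import Counter

# Compare relative frequencies of error-type labels between the
# synthetic and human-annotated corpora. The label lists below are
# fabricated for illustration only.

def type_distribution(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

synthetic = ["DET", "DET", "DET", "ORTH", "NOUN:NUM", "ADP"]
human = ["DET", "ORTH", "ORTH", "NOUN:NUM", "VERB:FORM", "ADP"]

syn_dist = type_distribution(synthetic)
hum_dist = type_distribution(human)
for t in sorted(set(syn_dist) | set(hum_dist)):
    print(f"{t:10s} synthetic={syn_dist.get(t, 0.0):.2f} "
          f"human={hum_dist.get(t, 0.0):.2f}")
```

A large divergence between the two columns would suggest the pretraining gain comes from generic noise robustness rather than targeted error coverage.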
minor comments (2)
- [Abstract] Abstract: 'the German version of ERRANT' should be clarified as the adapted Romanian version used for both corpus analysis and final scoring.
- [Evaluation metrics] Notation: the paper uses F0.5 without defining the beta value or confirming it matches the standard GEC convention (beta = 0.5).
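For readers unfamiliar with the convention flagged in the last point: F0.5 is the F-beta score with beta = 0.5, which weights precision twice as heavily as recall, computed over edit-level true positives, false positives, and false negatives. A minimal reference implementation:

```python
# F_beta over edit-level counts; the standard GEC convention sets
# beta = 0.5, weighting precision twice as heavily as recall.
def f_beta(tp, fp, fn, beta=0.5):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision 0.75, recall 0.60: F0.5 sits closer to precision than F1 does.
print(round(f_beta(tp=60, fp=20, fn=40), 4))            # F0.5 -> 0.7143
print(round(f_beta(tp=60, fp=20, fn=40, beta=1.0), 4))  # F1   -> 0.6667
```

Precision weighting reflects the GEC priority that a correction system should avoid introducing new errors even at the cost of missing some.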
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work introducing the first Romanian GEC corpus and POS-based synthetic pretraining. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation and analysis.
read point-by-point responses
-
Referee: [Synthetic data generation] Synthetic data generation (method section): the description states that errors are inserted using only POS tags, yet provides no explicit rules, examples, or frequency tables for Romanian-specific phenomena (definite-article placement, case marking, gender/number agreement). Without a quantitative comparison of ERRANT error-type distributions between the synthetic corpus and the 10k human-annotated pairs, the 9.38-point F0.5 lift cannot be confidently attributed to targeted pretraining rather than model scale or generic noise.
Authors: We agree that the method section would benefit from greater explicitness. Although the generation procedure is intentionally language-agnostic and relies solely on POS tags to insert plausible errors, we will expand it with concrete Romanian examples (e.g., definite-article insertion/deletion, gender/number agreement violations, and case-marking errors) together with the frequency tables used to sample error types. In addition, we will compute and report the ERRANT error-type distributions for both the synthetic pretraining corpus and the 10k human-annotated pairs; this comparison will allow readers to assess how closely the synthetic data mirrors real error patterns and will help attribute the observed F0.5 improvement more confidently to the pretraining strategy. revision: yes
-
Referee: [Results and evaluation] Results and evaluation (experimental section): F0.5 scores are reported for single runs without train/test split sizes, hyper-parameter search details, or statistical significance tests. On a 10k-pair corpus, variance across seeds or bootstrap confidence intervals is needed to establish that the two-stage procedure reliably outperforms the baseline.
Authors: We acknowledge that single-run reporting limits the strength of the claims. In the revised manuscript we will explicitly state the train/test split sizes, document the hyper-parameter search procedure (including the ranges explored and the final selected values), and report F0.5 scores averaged over at least five random seeds together with standard deviations. Where appropriate we will also include bootstrap confidence intervals or paired statistical significance tests to demonstrate that the two-stage pretraining procedure reliably outperforms the baseline. revision: yes
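The significance testing promised in the second response can be realized as a paired bootstrap over per-sentence edit counts. A sketch under the assumption that both systems are scored on the same test sentences, with fabricated counts standing in for real system outputs:

```python
import random

# Paired bootstrap 95% CI for a corpus-level F0.5 difference between
# two systems. Per-sentence (tp, fp, fn) counts are fabricated here.

def f05(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 1.25 * p * r / (0.25 * p + r) if p + r > 0 else 0.0

def corpus_f05(counts):
    return f05(sum(c[0] for c in counts),
               sum(c[1] for c in counts),
               sum(c[2] for c in counts))

def bootstrap_diff_ci(a, b, n_boot=1000, seed=0):
    """95% CI for corpus F0.5(a) - corpus F0.5(b) via paired resampling."""
    rng = random.Random(seed)
    n = len(a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample sentence indices
        diffs.append(corpus_f05([a[i] for i in idx])
                     - corpus_f05([b[i] for i in idx]))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

rng = random.Random(42)
pretrained = [(rng.randint(1, 5), rng.randint(0, 2), rng.randint(0, 2))
              for _ in range(200)]
baseline = [(rng.randint(0, 4), rng.randint(0, 3), rng.randint(0, 3))
            for _ in range(200)]
lo, hi = bootstrap_diff_ci(pretrained, baseline)
print(f"95% CI for F0.5 difference: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, the improvement is unlikely to be resampling noise; reporting it alongside multi-seed averages would address the referee's variance concern.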
Circularity Check
No circularity: standard supervised GEC training and evaluation on held-out data
full rationale
The paper introduces a 10k-sentence Romanian GEC corpus, adapts ERRANT for evaluation, and reports F0.5 scores from Transformer models trained on the corpus (baseline 44.38) or pretrained on synthetic data then finetuned (53.76). These metrics are produced by conventional supervised training and held-out evaluation; no equation, parameter fit, or self-citation reduces the reported scores to the inputs by construction. The synthetic-data generation method is presented as an extensible heuristic requiring only a POS tagger, but it is not used to define or tautologically guarantee the evaluation outcome. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- Transformer model size and architecture hyperparameters
- Synthetic error generation rules and volume
axioms (1)
- domain assumption: A standard Transformer encoder-decoder architecture is suitable for grammatical error correction as a sequence transduction task.
Reference graph
Works this paper leans on
-
[1]
The CoNLL-2014 shared task on grammatical error correction,
H. T. Ng, S. M. Wu, T. Briscoe, C. Hadiwinoto, R. H. Susanto, and C. Bryant, “The CoNLL-2014 shared task on grammatical error correction,” in Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pp. 1–14, 2014
2014
-
[2]
JFLEG: A fluency corpus and benchmark for grammatical error correction,
C. Napoles, K. Sakaguchi, and J. Tetreault, “JFLEG: A fluency corpus and benchmark for grammatical error correction,” arXiv preprint arXiv:1702.04066, 2017
2017
-
[3]
Automatic extraction of learner errors in ESL sentences using linguistically enhanced alignments,
M. Felice, C. Bryant, and E. Briscoe, “Automatic extraction of learner errors in ESL sentences using linguistically enhanced alignments,” Association for Computational Linguistics, 2016
2016
-
[4]
Automatic annotation and evaluation of error types for grammatical error correction,
C. Bryant, M. Felice, and E. Briscoe, “Automatic annotation and evaluation of error types for grammatical error correction,” Association for Computational Linguistics, 2017
2017
-
[5]
Large scale Arabic error annotation: Guidelines and framework,
W. Zaghouani, B. Mohit, N. Habash, O. Obeid, N. Tomeh, A. Rozovskaya, N. Farra, S. Alkuhlani, and K. Oflazer, “Large scale Arabic error annotation: Guidelines and framework,” 2014
2014
-
[6]
Using Wikipedia edits in low resource grammatical error correction,
A. Boyd, “Using Wikipedia edits in low resource grammatical error correction,” in Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pp. 79–84, 2018
2018
-
[7]
Grammar error correction in morphologically rich languages: The case of Russian,
A. Rozovskaya and D. Roth, “Grammar error correction in morphologically rich languages: The case of Russian,” Transactions of the Association for Computational Linguistics, vol. 7, pp. 1–17, 2019
2019
-
[8]
Natural language correction,
J. Náplava, “Natural language correction,” 2017
2017
-
[9]
Overview of grammatical error diagnosis for learning Chinese as a foreign language,
L.-C. Yu, L.-H. Lee, and L.-P. Chang, “Overview of grammatical error diagnosis for learning Chinese as a foreign language,” in Proceedings of the 1st Workshop on Natural Language Processing Techniques for Educational Applications, pp. 42–47, 2014
2014
-
[10]
Mining revision log of language learning SNS for automated Japanese error correction of second language learners,
T. Mizumoto, M. Komachi, M. Nagata, and Y. Matsumoto, “Mining revision log of language learning SNS for automated Japanese error correction of second language learners,” in Proceedings of 5th International Joint Conference on Natural Language Processing, pp. 147–155, 2011
2011
-
[11]
Grammatical error correction in low-resource scenarios,
J. Náplava and M. Straka, “Grammatical error correction in low-resource scenarios,” arXiv preprint arXiv:1910.00353, 2019
2019
-
[12]
Attention is all you need,
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, pp. 5998–6008, 2017
2017
-
[13]
Neural grammatical error correction systems with unsupervised pre-training on synthetic data,
R. Grundkiewicz, M. Junczys-Dowmunt, and K. Heafield, “Neural grammatical error correction systems with unsupervised pre-training on synthetic data,” in Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 252–263, 2019
2019
-
[14]
The BEA-2019 shared task on grammatical error correction,
C. Bryant, M. Felice, Ø. E. Andersen, and T. Briscoe, “The BEA-2019 shared task on grammatical error correction,” in Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 52–75, 2019
2019
-
[15]
Phrase-based machine translation is state-of-the-art for automatic grammatical error correction,
M. Junczys-Dowmunt and R. Grundkiewicz, “Phrase-based machine translation is state-of-the-art for automatic grammatical error correction,” arXiv preprint arXiv:1605.06353, 2016
2016
-
[16]
Neural language correction with character-based attention,
Z. Xie, A. Avati, N. Arivazhagan, D. Jurafsky, and A. Y. Ng, “Neural language correction with character-based attention,” arXiv preprint arXiv:1603.09727, 2016
2016
-
[17]
Approaching neural grammatical error correction as a low-resource machine translation task,
M. Junczys-Dowmunt, R. Grundkiewicz, S. Guha, and K. Heafield, “Approaching neural grammatical error correction as a low-resource machine translation task,” arXiv preprint arXiv:1804.05940, 2018
2018
-
[18]
Improving grammatical error correction via pre-training a copy-augmented architecture with unlabeled data,
W. Zhao, L. Wang, K. Shen, R. Jia, and J. Liu, “Improving grammatical error correction via pre-training a copy-augmented architecture with unlabeled data,” arXiv preprint arXiv:1903.00138, 2019
2019
-
[19]
An empirical study of incorporating pseudo data into grammatical error correction,
S. Kiyono, J. Suzuki, M. Mita, T. Mizumoto, and K. Inui, “An empirical study of incorporating pseudo data into grammatical error correction,” arXiv preprint arXiv:1909.00502, 2019
2019
-
[20]
Encoder-decoder models can benefit from pre-trained masked language models in grammatical error correction,
M. Kaneko, M. Mita, S. Kiyono, J. Suzuki, and K. Inui, “Encoder-decoder models can benefit from pre-trained masked language models in grammatical error correction,” arXiv preprint arXiv:2005.00987, 2020
2020
-
[21]
BERT: Pre-training of deep bidirectional transformers for language understanding,
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018
2018
-
[22]
Language model based grammatical error correction without annotated training data,
C. Bryant and T. Briscoe, “Language model based grammatical error correction without annotated training data,” in Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 247–253, 2018
2018
-
[23]
Corpora generation for grammatical error correction,
J. Lichtarge, C. Alberti, S. Kumar, N. Shazeer, N. Parmar, and S. Tong, “Corpora generation for grammatical error correction,” arXiv preprint arXiv:1904.05780, 2019
2019
-
[24]
Constrained grammatical error correction using statistical machine translation,
Z. Yuan and M. Felice, “Constrained grammatical error correction using statistical machine translation,” in Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pp. 52–61, 2013
2013
-
[25]
ReaderBench goes online: A comprehension-centered framework for educational purposes,
G. Gutu, M. Dascalu, S. Trausan-Matu, and P. Dessus, “ReaderBench goes online: A comprehension-centered framework for educational purposes,” in International Conference on Human-Computer Interaction (RoCHI 2016) (A. Iftene and J. Vanderdonckt, eds.), pp. 95–102, MATRIX ROM, 2016
2016
-
[26]
Google's neural machine translation system: Bridging the gap between human and machine translation,
Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., “Google's neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016
2016
-
[27]
Scalable modified Kneser-Ney language model estimation,
K. Heafield, I. Pouzyrevsky, J. H. Clark, and P. Koehn, “Scalable modified Kneser-Ney language model estimation,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 690–696, 2013
2013
-
[28]
BLEU: a method for automatic evaluation of machine translation,
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318, Association for Computational Linguistics, 2002
2002
-
[29]
Incorporating BERT into neural machine translation,
J. Zhu, Y. Xia, L. Wu, D. He, T. Qin, W. Zhou, H. Li, and T.-Y. Liu, “Incorporating BERT into neural machine translation,” arXiv preprint arXiv:2002.06823, 2020
2020
-
[30]
DIAC+: A professional diacritics recovering system,
D. Tufiș, A. Ceaușu, et al., “DIAC+: A professional diacritics recovering system,” Proceedings of LREC 2008, pp. 167–174, 2008
2008