pith. machine review for the scientific record.

arxiv: 2605.13624 · v1 · submitted 2026-05-13 · 💻 cs.CL


Edit-level Majority Voting Mitigates Over-Correction in LLM-based Grammatical Error Correction


Pith reviewed 2026-05-14 19:55 UTC · model grok-4.3

classification 💻 cs.CL
keywords grammatical error correction · LLM inference · majority voting · over-correction · multilingual evaluation · training-free method

The pith

Edit-level majority voting over multiple LLM candidates reduces over-correction in grammatical error correction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that grammatical error correction with large language models frequently produces unnecessary changes that alter correct text. To address this without any retraining or model changes, the authors generate several correction candidates from one LLM and apply majority voting separately to each individual edit. This approach improves results over standard greedy decoding and minimum Bayes risk decoding on nine benchmarks spanning English, Czech, German, Ukrainian, Korean, Hindi, and Romanian. It also keeps correction quality consistent even when the instruction prompt is varied. Readers who use LLMs for writing assistance would value a lightweight inference step that limits unwanted alterations.

Core claim

Performing majority voting at the level of individual edits, rather than on complete sentences, reliably reduces the over-correction problem that arises when a single large language model is prompted to correct grammar.

What carries the argument

Edit-level majority voting: multiple correction candidates are produced from one LLM, edits are identified and aligned across candidates, and the most frequent version of each edit is retained in the final output.
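As a concrete sketch of this machinery: the following illustrative Python uses difflib for edit extraction and a strict-majority filter. The paper's own alignment tooling may differ (e.g. an ERRANT-style aligner); all function names here are ours.

```python
from collections import Counter
from difflib import SequenceMatcher

def extract_edits(source_tokens, candidate_tokens):
    """Represent each edit as a (start, end, replacement) span over the source."""
    sm = SequenceMatcher(a=source_tokens, b=candidate_tokens, autojunk=False)
    return [
        (i1, i2, tuple(candidate_tokens[j1:j2]))
        for tag, i1, i2, j1, j2 in sm.get_opcodes()
        if tag != "equal"
    ]

def edit_majority_vote(source_tokens, candidates):
    """Keep only the edits proposed by a strict majority of candidates."""
    counts = Counter()
    for cand in candidates:
        counts.update(extract_edits(source_tokens, cand))
    kept = sorted(edit for edit, c in counts.items() if c > len(candidates) / 2)
    # Apply surviving edits right to left so earlier spans stay valid.
    out = list(source_tokens)
    for start, end, replacement in reversed(kept):
        out[start:end] = replacement
    return out
```

For source "He go to school" with candidates "He went to school", "He went to school", and "He goes to the school", only the go → went edit clears the majority threshold; the minority insertion of "the" (a would-be over-correction) is discarded.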

If this is right

  • The method outperforms both greedy and minimum Bayes risk decoding on most of the tested multilingual benchmarks.
  • Correction quality remains stable when different instruction prompts are used.
  • No model modification or additional training data is required to obtain the improvement.
  • The approach applies directly to existing LLM inference pipelines for grammatical error correction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same voting step could be tested on other LLM tasks that suffer from over-generation, such as summarization or style transfer.
  • If edit alignment proves robust across languages, the technique might extend to low-resource settings where prompt sensitivity is high.
  • Combining edit-level voting with lightweight reranking could further reduce the remaining errors without extra training.

Load-bearing premise

The generated candidates must contain enough diversity that correct edits appear in the majority while over-corrections appear only in minorities.

What would settle it

On any of the nine benchmarks, if the edit-voted output introduces more over-corrections or lower overall scores than greedy decoding from the same model and prompt, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.13624 by Takumi Goto, Taro Watanabe, Yusuke Sakai.

Figure 1. Overview of the edit-level majority voting.
Figure 2. Score versus average computation time per sentence (seconds) for each k (k = 1, 2, 4, 8, 16, 32) on CWEB-G-dev, BEA19-dev, and JFLEG-dev.
Figure 3. Relationship between edit frequency and pre…
Figure 4. Instruction template for the English experiments.
Figure 5. Instruction template for datasets other than English; [LANG] is replaced with a language name, such as "Czech" or "German."
read the original abstract

Grammatical error correction using large language models often suffers from the over-correction issue. To mitigate this, we propose a training-free inference method that performs edit-level majority voting over multiple candidates generated by a single model, without requiring model modifications or additional training. Across nine benchmarks covering English, Czech, German, Ukrainian, Korean, Hindi, and Romanian, the proposed method outperforms both greedy and MBR decoding in most cases. Moreover, it yields stable correction quality regardless of the instruction prompts used. We release two repositories supporting GEC dataset loading and LLM inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a training-free inference method for LLM-based grammatical error correction that generates multiple candidate outputs from a single model and applies majority voting at the level of individual edits to mitigate over-correction. It reports that this approach outperforms both greedy decoding and minimum Bayes risk (MBR) decoding across nine benchmarks covering English, Czech, German, Ukrainian, Korean, Hindi, and Romanian, while also producing stable correction quality independent of the specific instruction prompts used. Two supporting repositories for dataset loading and LLM inference are released.

Significance. If the central empirical claims hold after clarification of the method, the work offers a lightweight, training-free technique that could be widely adopted to improve reliability of LLM-based GEC systems without model changes. The cross-lingual scope and reported prompt stability are practically relevant, and the open release of code and data strengthens reproducibility.

major comments (3)
  1. [§3] §3 (Proposed Method): The edit extraction and alignment procedure is not given a formal, language-agnostic definition or pseudocode. The manuscript relies on an implicit alignment step without specifying how insertions, deletions, substitutions, and multi-token edits are canonicalized or how edit boundaries are determined across scripts and morphologies. This is load-bearing for the majority-vote guarantee and risks inconsistent aggregation in languages such as Korean and Hindi.
  2. [§4.2] §4.2 (Experimental Results): The tables reporting outperformance over MBR decoding do not include statistical significance tests, confidence intervals, or variance estimates across runs. Without these, it is unclear whether the gains in 'most cases' are robust or attributable to sampling variance in candidate generation.
  3. [§4.3] §4.3 (Analysis): No quantitative analysis of candidate diversity or inter-candidate edit agreement is provided. The central claim that edit-level voting reliably mitigates over-correction presupposes sufficient diversity; absence of such diagnostics leaves the mechanism under-supported.
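For the significance testing requested in major comment 2, one standard choice is a paired bootstrap over per-sentence score differences. A minimal sketch, with helper name and inputs ours rather than the paper's:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Paired bootstrap: p-value for 'A does not beat B' and a 95% CI
    on the mean per-sentence score difference A - B."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample per-sentence differences with replacement, keep each mean.
    means = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    p_value = sum(m <= 0 for m in means) / n_resamples
    ci_95 = (means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples)])
    return p_value, ci_95
```

A small p-value together with a confidence interval excluding zero would support the claimed gains over MBR decoding being more than sampling variance.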
minor comments (2)
  1. A summary table listing the nine benchmarks, their sizes, error-type distributions, and reference sources would improve clarity and allow readers to assess cross-lingual coverage.
  2. [§3] Notation for edit operations (e.g., how an edit is represented as a tuple or string) should be introduced explicitly in §3 before the voting algorithm is described.
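A language-agnostic notation of the kind this comment asks for could canonicalize each edit as an (op, src_start, src_end, replacement) tuple via token-level Levenshtein alignment. This is an editorial sketch, not the manuscript's procedure; merging adjacent operations into multi-token spans is omitted for brevity.

```python
def align_edits(src, tgt):
    """Align token lists and return canonical (op, start, end, replacement)
    edit tuples over the source, via Levenshtein dynamic programming."""
    n, m = len(src), len(tgt)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    # Backtrace, preferring diagonal moves, then insertions, then deletions.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (src[i - 1] != tgt[j - 1]):
            if src[i - 1] != tgt[j - 1]:
                ops.append(("sub", i - 1, i, (tgt[j - 1],)))
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ops.append(("ins", i, i, (tgt[j - 1],)))
            j -= 1
        else:
            ops.append(("del", i - 1, i, ()))
            i -= 1
    return sorted(ops, key=lambda e: (e[1], e[2]))
```

Because the alignment operates on opaque tokens, it applies unchanged across scripts; what remains language-specific is the tokenizer that produces the token lists.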

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our manuscript. We provide point-by-point responses below.

read point-by-point responses
  1. Referee: [§3] §3 (Proposed Method): The edit extraction and alignment procedure is not given a formal, language-agnostic definition or pseudocode. The manuscript relies on an implicit alignment step without specifying how insertions, deletions, substitutions, and multi-token edits are canonicalized or how edit boundaries are determined across scripts and morphologies. This is load-bearing for the majority-vote guarantee and risks inconsistent aggregation in languages such as Korean and Hindi.

    Authors: We agree that providing a formal, language-agnostic definition and pseudocode for the edit extraction and alignment procedure will enhance reproducibility and address concerns about consistency across languages. In the revised manuscript, we will add a detailed pseudocode algorithm that describes the steps for identifying edits using a standard alignment method (e.g., based on Levenshtein distance at the token level), canonicalizing operations for insertions, deletions, substitutions, and multi-token spans, and handling boundary determination in a script-agnostic manner. This will explicitly support the majority-vote mechanism. revision: yes

  2. Referee: [§4.2] §4.2 (Experimental Results): The tables reporting outperformance over MBR decoding do not include statistical significance tests, confidence intervals, or variance estimates across runs. Without these, it is unclear whether the gains in 'most cases' are robust or attributable to sampling variance in candidate generation.

    Authors: We acknowledge this limitation in the current version. To demonstrate the robustness of our results, we will include statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests) and confidence intervals computed via bootstrapping over multiple independent runs in the revised tables for the comparisons with MBR decoding. revision: yes

  3. Referee: [§4.3] §4.3 (Analysis): No quantitative analysis of candidate diversity or inter-candidate edit agreement is provided. The central claim that edit-level voting reliably mitigates over-correction presupposes sufficient diversity; absence of such diagnostics leaves the mechanism under-supported.

    Authors: We agree that quantifying candidate diversity and edit agreement would better support the mechanism. In the revised analysis section, we will add metrics such as the average number of unique edits per sentence across candidates, the pairwise edit overlap ratio, and the variance in correction quality, to show that there is sufficient diversity for the majority voting to be effective while reducing over-corrections. revision: yes
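Diagnostics of the shape promised above are straightforward to compute. An illustrative helper (ours, not the authors' code) over per-candidate edit sets for a single sentence:

```python
from itertools import combinations

def candidate_diversity(edit_sets):
    """Return (number of unique edits across candidates,
    mean pairwise Jaccard overlap between candidates' edit sets)."""
    unique_edits = set().union(*edit_sets)
    overlaps = [
        len(a & b) / len(a | b) if (a | b) else 1.0
        for a, b in combinations(edit_sets, 2)
    ]
    mean_overlap = sum(overlaps) / len(overlaps) if overlaps else 1.0
    return len(unique_edits), mean_overlap
```

High mean overlap with few unique edits indicates candidates agree and voting is stable; very low overlap would suggest sampling noise could swamp the majority signal.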

Circularity Check

0 steps flagged

No significant circularity; empirical voting heuristic is self-contained

full rationale

The paper describes a training-free inference procedure that generates multiple LLM candidates for grammatical error correction and aggregates them via edit-level majority voting. No equations, fitted parameters, or derivations are present that could reduce to self-definition or input-as-prediction. The method relies on external empirical evaluation across nine benchmarks rather than any self-citation chain, uniqueness theorem, or ansatz imported from prior work. Edit extraction is presented as an implementation detail without circular redefinition of the voting outcome. This is a standard empirical inference setup whose validity is tested against baselines, yielding a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced in the abstract; the approach relies on standard LLM generation and majority voting.

pith-pipeline@v0.9.0 · 5388 in / 1004 out tokens · 56708 ms · 2026-05-14T19:55:17.732370+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 5 internal anchors

  1. [1]

    Abhijeet Awasthi, Sunita Sarawagi, Rasna Goyal, Sabyasachi Ghosh, and Vihari Piratla. 2019. https://doi.org/10.18653/v1/D19-1435 Parallel iterative edit models for local sequence transduction . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing...

  2. [2]

    Adriane Boyd. 2018. https://doi.org/10.18653/v1/W18-6111 Using Wikipedia edits in low resource grammatical error correction. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pages 79--84, Brussels, Belgium. Association for Computational Linguistics

  3. [3]

    Christopher Bryant, Mariano Felice, Øistein E. Andersen, and Ted Briscoe. 2019. https://doi.org/10.18653/v1/W19-4406 The BEA-2019 shared task on grammatical error correction. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 52--75, Florence, Italy. Association for Computational Linguistics

  4. [4]

    Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. https://doi.org/10.18653/v1/P17-1074 Automatic annotation and evaluation of error types for grammatical error correction . In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 793--805, Vancouver, Canada. Association for Computat...

  5. [5]

    Bin Cao, Kai Jiang, Fayu Pan, Chenlei Bao, and Jing Fan. 2024. https://aclanthology.org/2024.lrec-main.772/ Improving grammatical error correction by correction acceptability discrimination . In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 8818--8827, Torin...

  6. [6]

    Teodor-Mihai Cotet, Stefan Ruseti, and Mihai Dascalu. 2020. https://doi.org/10.1109/ICTAI50040.2020.00101 Neural grammatical error correction for romanian . In 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), pages 625--631

  7. [7]

    Christopher Davis, Andrew Caines, Øistein E. Andersen, Shiva Taslimipoor, Helen Yannakoudakis, Zheng Yuan, Christopher Bryant, Marek Rei, and Paula Buttery. 2024. https://doi.org/10.18653/v1/2024.findings-acl.711 Prompting open-source and commercial language models for grammatical error correction of English learner text. In Findings of the Association f...

  8. [8]

    Hiroyuki Deguchi, Yusuke Sakai, Hidetaka Kamigaito, and Taro Watanabe. 2024. https://doi.org/10.18653/v1/2024.emnlp-demo.37 mbrs: A library for minimum Bayes risk decoding. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 351--362, Miami, Florida, USA. Association for Computational L...

  9. [9]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1423 BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long a...

  10. [10]

    Bryan Eikema and Wilker Aziz. 2020. https://doi.org/10.18653/v1/2020.coling-main.398 Is MAP decoding all you need? the inadequacy of the mode in neural machine translation . In Proceedings of the 28th International Conference on Computational Linguistics, pages 4506--4520, Barcelona, Spain (Online). International Committee on Computational Linguistics

  11. [11]

    Mariano Felice, Christopher Bryant, and Ted Briscoe. 2016. https://aclanthology.org/C16-1079/ Automatic extraction of learner errors in ESL sentences using linguistically enhanced alignments . In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers , pages 825--835, Osaka, Japan. The COLING 2016 Orga...

  12. [12]

    Simon Flachs, Ophélie Lacroix, Helen Yannakoudakis, Marek Rei, and Anders Søgaard. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.680 Grammatical error correction in low error density domains: A new benchmark and analyses. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8467--8478, Online. A...

  13. [13]

    Vaibhava Goel and William J Byrne. 2000. https://doi.org/10.1006/csla.2000.0138 Minimum bayes-risk automatic speech recognition . Computer Speech & Language, 14(2):115--135

  14. [14]

    Peiyuan Gong, Xuebo Liu, Heyan Huang, and Min Zhang. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.463 Revisiting grammatical error correction evaluation and beyond . In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6891--6902, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics

  15. [15]

    Takumi Goto, Yusuke Sakai, and Taro Watanabe. 2025a. https://doi.org/10.18653/v1/2025.acl-demo.50 gec-metrics: A unified library for grammatical error correction evaluation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 524--534, Vienna, Austria. Association for Compu...

  16. [16]

    Takumi Goto, Yusuke Sakai, and Taro Watanabe. 2025b. https://doi.org/10.18653/v1/2025.acl-short.92 Rethinking evaluation metrics for grammatical error correction: Why use a different evaluation process than human? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1165--1172, Vienna...

  17. [17]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. https://arxiv.org/abs/2407.21783 The llama 3...

  18. [18]

    Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2023. https://openreview.net/forum?id=sE7-XhLxHA DeBERTaV3 : Improving DeBERTa using ELECTRA -style pre-training with gradient-disentangled embedding sharing . In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net

  19. [19]

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. https://openreview.net/forum?id=rygGQyrFvH The curious case of neural text degeneration . In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020 . OpenReview.net

  20. [20]

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. https://openreview.net/forum?id=nZeVKeeFYf9 LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net

  21. [21]

    Masahiro Kaneko and Naoaki Okazaki. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.619 Reducing sequence length by predicting edit spans with large language models . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10017--10029, Singapore. Association for Computational Linguistics

  22. [22]

    Anisia Katinskaia and Roman Yangarber. 2024. https://aclanthology.org/2024.lrec-main.692/ GPT-3.5 for grammatical error correction. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 7831--7843, Torino, Italia. ELRA and ICCL

  23. [23]

    Masamune Kobayashi, Masato Mita, and Mamoru Komachi. 2024. https://doi.org/10.1162/tacl_a_00676 Revisiting meta-evaluation for grammatical error correction . Transactions of the Association for Computational Linguistics, 12:837--855

  24. [24]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023a. https://doi.org/10.1145/3600006.3613165 Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP 2023, Koblenz, Germany, Oc...

  25. [25]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023b. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

  26. [26]

    Jiehao Liang, Haihui Yang, Shiping Gao, and Xiaojun Quan. 2025. https://aclanthology.org/2025.coling-main.229/ Edit-wise preference optimization for grammatical error correction . In Proceedings of the 31st International Conference on Computational Linguistics, pages 3401--3414, Abu Dhabi, UAE. Association for Computational Linguistics

  27. [27]

    Ruixi Lin and Hwee Tou Ng. 2021. https://aclanthology.org/2021.ranlp-1.94/ System combination for grammatical error correction based on integer programming . In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 824--829, Held Online. INCOMA Ltd

  28. [28]

    Mengsay Loem, Masahiro Kaneko, Sho Takase, and Naoaki Okazaki. 2023. https://doi.org/10.18653/v1/2023.bea-1.18 Exploring effectiveness of GPT-3 in grammatical error correction: A study on performance and controllability in prompt-based methods. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023),...

  29. [29]

    Koki Maeda, Masahiro Kaneko, and Naoaki Okazaki. 2022. https://aclanthology.org/2022.coling-1.316/ IMPARA : Impact-based metric for GEC using parallel data . In Proceedings of the 29th International Conference on Computational Linguistics, pages 3578--3588, Gyeongju, Republic of Korea. International Committee on Computational Linguistics

  30. [30]

    Jakub Náplava and Milan Straka. 2019. https://doi.org/10.18653/v1/D19-5545 Grammatical error correction in low-resource scenarios. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 346--356, Hong Kong, China. Association for Computational Linguistics

  31. [31]

    Courtney Napoles, Keisuke Sakaguchi, Matt Post, and Joel Tetreault. 2015. https://doi.org/10.3115/v1/P15-2097 Ground truth for grammatical error correction metrics . In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), ...

  32. [32]

    Courtney Napoles, Keisuke Sakaguchi, Matt Post, and Joel Tetreault. 2016. https://arxiv.org/abs/1605.02592 GLEU without tuning . Preprint, arXiv:1605.02592

  33. [33]

    Courtney Napoles, Keisuke Sakaguchi, and Joel Tetreault. 2017. https://aclanthology.org/E17-2037/ JFLEG: A fluency corpus and benchmark for grammatical error correction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 229--234, Valencia, Spain. Association fo...

  34. [34]

    Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. https://doi.org/10.3115/v1/W14-1701 The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1--14, Baltimore, Maryland. Associat...

  35. [35]

    Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, and Oleksandr Skurzhanskyi. 2020. https://doi.org/10.18653/v1/2020.bea-1.16 GECToR -- grammatical error correction: Tag, not rewrite. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 163--170, Seattle, WA, USA Online. Association f...

  36. [36]

    Kostiantyn Omelianchuk, Andrii Liubonko, Oleksandr Skurzhanskyi, Artem Chernodub, Oleksandr Korniienko, and Igor Samokhin. 2024. https://aclanthology.org/2024.bea-1.3/ Pillars of grammatical error correction: Comprehensive inspection of contemporary approaches in the era of large language models . In Proceedings of the 19th Workshop on Innovative Use of N...

  37. [37]

    Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. https://doi.org/10.18653/v1/2020.acl-demos.14 Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101--108, Online. Associat...

  38. [38]

    Mengyang Qiu, Tran Minh Nguyen, Zihao Huang, Zelong Li, Yang Gu, Qingyu Gao, Siliang Liu, and Jungyeul Park. 2025. https://aclanthology.org/2025.bea-1.15/ Multilingual grammatical error annotation: Combining language-agnostic framework with language-specific flexibility . In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educationa...

  39. [39]

    Muhammad Reza Qorib, Seung-Hoon Na, and Hwee Tou Ng. 2022. https://doi.org/10.18653/v1/2022.naacl-main.143 Frustratingly easy system combination for grammatical error correction . In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1964--1974, Seattle, Uni...

  40. [40]

    Muhammad Reza Qorib and Hwee Tou Ng. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.785 System combination via quality estimation for grammatical error correction . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12746--12759, Singapore. Association for Computational Linguistics

  41. [41]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. http://jmlr.org/papers/v21/20-074.html Exploring the limits of transfer learning with a unified text-to-text transformer . Journal of Machine Learning Research, 21(140):1--67

  42. [42]

    Vyas Raina and Mark Gales. 2023. https://doi.org/10.18653/v1/2023.ijcnlp-short.12 Minimum Bayes' risk decoding for system combination of grammatical error correction systems. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Lin...

  43. [43]

    Sascha Rothe, Jonathan Mallinson, Eric Malmi, Sebastian Krause, and Aliaksei Severyn. 2021. https://doi.org/10.18653/v1/2021.acl-short.89 A simple recipe for multilingual grammatical error correction . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language ...

  44. [44]

    Yusuke Sakai, Adam Nohejl, Jiangnan Hang, Hidetaka Kamigaito, and Taro Watanabe. 2024. https://doi.org/10.18653/v1/2024.blackboxnlp-1.31 Toward the evaluation of large language models considering score variance across instruction templates . In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 499--529,...

  45. [45]

    Ujjwal Sharma and Pushpak Bhattacharyya. 2025. https://aclanthology.org/2025.coling-main.406/ Hi-GEC: Hindi grammar error correction in low resource scenario. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6063--6075, Abu Dhabi, UAE. Association for Computational Linguistics

  46. [46]

    Alexey Sorokin. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.785 Improved grammatical error correction by ranking elementary edits . In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11416--11429, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics

  47. [47]

    Ryszard Staruch, Filip Gralinski, and Daniel Dzienisiewicz. 2025. https://doi.org/10.18653/v1/2025.bea-1.9 Adapting LLM s for minimal-edit grammatical error correction . In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025), pages 118--128, Vienna, Austria. Association for Computational Linguistics

  48. [48]

    Oleksiy Syvokon and Mariana Romanyshyn. 2023. https://doi.org/10.18653/v1/2023.unlp-1.16 The UNLP 2023 shared task on grammatical error correction for Ukrainian. In Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP), pages 132--137, Dubrovnik, Croatia. Association for Computational Linguistics

  49. [49]

    Chenming Tang, Fanyi Qu, and Yunfang Wu. 2024. https://doi.org/10.18653/v1/2024.naacl-long.99 Ungrammatical-syntax-based in-context example selection for grammatical error correction . In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), p...

  50. [50]

    Maksym Tarnavskyi, Artem Chernodub, and Kostiantyn Omelianchuk. 2022. https://doi.org/10.18653/v1/2022.acl-long.266 Ensembling and knowledge distilling of large sequence taggers for grammatical error correction . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3842--3852, Dublin, Ir...

  51. [51]

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and 197 others. 2025. https://arxiv.org/abs/2503.19786...

  52. [52]

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, and 179 others. 2024. https://arxiv.org/abs/2408.00118 Gemma 2: ...

  53. [53]

    Junrui Wang, Mengyang Qiu, Yang Gu, Zihao Huang, and Jungyeul Park. 2025. https://aclanthology.org/2025.coling-main.52/ Refined evaluation for end-to-end grammatical error correction using an alignment-based approach . In Proceedings of the 31st International Conference on Computational Linguistics, pages 774--785, Abu Dhabi, UAE. Association for Computat...

  54. [54]

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. https://openreview.net/forum?id=1PL1NIMMrw Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenR...

  55. [55]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. https://arxiv.org/abs/2505.09388 Qwen3 technical report . Preprint, arXiv:2505.09388

  56. [56]

    Helen Yannakoudakis, Øistein E Andersen, Ardeshir Geranpayeh, Ted Briscoe, and Diane Nicholls. 2018. https://doi.org/10.1080/08957347.2018.1464447 Developing an automated writing placement system for ESL learners . Applied Measurement in Education, 31(3):251--267

  57. [57]

    Soyoung Yoon, Sungjoon Park, Gyuwan Kim, Junhee Cho, Kihyo Park, Gyu Tae Kim, Minjoon Seo, and Alice Oh. 2023. https://doi.org/10.18653/v1/2023.acl-long.371 Towards standardizing Korean grammatical error correction: Datasets and annotation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)...

  58. [58]

    Ryoma Yoshimura, Masahiro Kaneko, Tomoyuki Kajiwara, and Mamoru Komachi. 2020. https://doi.org/10.18653/v1/2020.coling-main.573 SOME : Reference-less sub-metrics optimized for manual evaluations of grammatical error correction . In Proceedings of the 28th International Conference on Computational Linguistics, pages 6516--6522, Barcelona, Spain (Online). I...

  59. [59]

    Yue Zhang, Bo Zhang, Zhenghua Li, Zuyi Bao, Chen Li, and Min Zhang. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.162 SynGEC: Syntax-enhanced grammatical error correction with a tailored GEC-oriented parser. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2518--2531, Abu Dhabi, United Arab Emirates...

  60. [60]

    Wei Zhao, Liang Wang, Kewei Shen, Ruoyu Jia, and Jingming Liu. 2019. https://doi.org/10.18653/v1/N19-1014 Improving grammatical error correction via pre-training a copy-augmented architecture with unlabeled data. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolog...

  61. [61]

    Yike Zhao, Xiaoman Wang, Yunshi Lan, and Weining Qian. 2025. https://aclanthology.org/2025.coling-demos.5/ UnifiedGEC: Integrating grammatical error correction approaches for multi-languages with a unified framework. In Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations, pages 37--45, Abu Dhabi, UAE. A...