Smarter edits? Post-editing with error highlights and translation suggestions
Pith reviewed 2026-05-21 04:46 UTC · model grok-4.3
The pith
In post-editing, professional translators saw no productivity or quality gains from error highlights or correction suggestions, though they rated highlights derived from automatic post-editing (APE) above those derived from quality estimation (QE) and reported a better experience with the suggestions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a study with professional En-Nl translators, post-editing with APE error highlights and correction suggestions showed no productivity or quality gains compared to regular post-editing or QE-derived highlights, yet APE highlights were better received than QE highlights and correction suggestions improved user experience.
What carries the argument
A four-condition user study that measures productivity (time and edits), final quality, and subjective user-experience ratings while varying the source of error highlights and the presence of correction suggestions.
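The productivity measure can be made concrete with a small sketch. The paper describes productivity as the average number of source characters processed over the text-level edit time; the code below computes that rate per text and averages it per condition. The record layout, the condition labels, and all numbers are hypothetical and serve only as illustration. Averaging per-text rates (rather than dividing total characters by total time) is one possible reading of the definition, not necessarily the authors' choice.

```python
from collections import defaultdict
from statistics import mean


def chars_per_second(source_chars: int, edit_seconds: float) -> float:
    """Source characters processed per second of text-level edit time."""
    if edit_seconds <= 0:
        raise ValueError("edit time must be positive")
    return source_chars / edit_seconds


def productivity_by_condition(records):
    """Average the per-text rate within each condition.

    `records` is an iterable of (condition, source_chars, edit_seconds)
    tuples; this layout is an assumption made for the sketch.
    """
    rates = defaultdict(list)
    for condition, source_chars, edit_seconds in records:
        rates[condition].append(chars_per_second(source_chars, edit_seconds))
    return {condition: mean(values) for condition, values in rates.items()}


# Made-up logs: (condition, source characters, edit time in seconds).
records = [
    ("PE", 1200, 600.0),
    ("PE", 900, 300.0),
    ("QE", 1000, 500.0),
    ("APE", 1500, 500.0),
]
result = productivity_by_condition(records)
print(result)
```

Comparing the resulting per-condition averages is only a first look; any claim about gains would still need the statistical treatment the referee asks for.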
If this is right
- Automatic post-editing highlights can be more acceptable to translators than quality-estimation highlights even when neither improves speed or quality.
- Correction suggestions can raise subjective satisfaction with the post-editing interface without raising objective productivity.
- Standard post-editing without extra highlights remains competitive on both speed and output quality.
- User-experience measures should be tracked separately from productivity when evaluating new post-editing features.
Where Pith is reading between the lines
- Tool designers might try combining highlight sources or making suggestions more interactive to turn the observed experience gains into actual speed improvements.
- The preference for APE highlights could stem from how closely they match the kinds of errors translators naturally notice.
- Results might shift if the study moved to language pairs with very different error profiles or to translators with less experience.
- Future experiments could test whether the same features affect revision behavior when translators work on longer documents or under time pressure.
Load-bearing premise
That the particular LLM-derived highlights and APE suggestions tested here would behave the same way in other real professional workflows, and that results from these En-Nl translators would hold for different language pairs or translator groups.
What would settle it
A replication study using different language pairs or different underlying models that finds measurable increases in words per minute or quality scores when the same highlights and suggestions are provided would show that the no-gain result does not generalize beyond this setup.
Original abstract
As MT quality increases, interest in enhanced post-editing features such as QE-derived error highlights is growing, yet evidence for their usefulness remains limited. In this work, we explore the usefulness of LLM-derived error highlights and correction suggestions based on automatic post-editing (APE). We conduct a study where professional translators (En-Nl) post-edit translations using APE error highlights and correction suggestions and compare productivity, quality and user experience to regular PE and PE with QE-derived highlights. While no condition yielded productivity or quality gains compared to regular PE, APE highlights were better received than QE-derived highlights, and correction suggestions improved overall user experience.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper reports results from a controlled user study with professional translators performing English-to-Dutch post-editing. It compares four conditions: standard post-editing, post-editing with quality-estimation-derived error highlights, post-editing with automatic-post-editing-derived error highlights, and post-editing with automatic-post-editing-derived correction suggestions. Productivity, final translation quality, and user-experience measures are reported. The central findings are that none of the enhanced conditions produced productivity or quality gains relative to standard post-editing, yet APE-derived highlights were rated more favorably than QE-derived highlights and the addition of correction suggestions improved overall user experience.
Significance. If the empirical results hold under broader conditions, the work supplies useful negative evidence on productivity and quality gains from current LLM-based post-editing aids while documenting positive effects on translator satisfaction. Such findings are relevant for MT tool design and for HCI research on translation workflows, indicating that user-experience considerations may matter more for adoption than raw efficiency metrics. The head-to-head comparison of APE versus QE signals is timely given the rapid integration of LLMs into translation pipelines.
major comments (3)
- [Methods] Methods section: The generation procedures for APE error highlights and correction suggestions are described at a high level but without any quantitative assessment of their intrinsic quality (e.g., highlight precision/recall against human error annotations or suggestion acceptance rates during the study). This omission makes it difficult to attribute the reported UX preference for APE over QE to the underlying signal type rather than to incidental differences in the quality of the particular LLM outputs used.
- [Results] Results section: The null findings on productivity and quality are presented without accompanying effect sizes, confidence intervals, or power analysis. Given that user studies with professional translators often involve modest sample sizes, the absence of these statistics leaves open the possibility that meaningful differences were simply undetected.
- [Discussion] Discussion section: The claim that APE highlights are better received than QE-derived highlights is framed as a general advantage, yet the study is restricted to a single language pair (En-Nl) and a specific set of LLM prompts and models. The paper should explicitly discuss the risk that the observed preference is implementation- or domain-specific and outline concrete steps (additional language pairs, alternative models, or ablation of prompt components) that would be needed to test broader applicability.
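The intrinsic-quality checks requested in the first major comment could be sketched as follows. This is a minimal illustration that assumes highlights and human error annotations are both available as half-open character spans over the same translation; the span format, function names, and example values are hypothetical and not taken from the paper.

```python
def span_chars(spans):
    """Expand half-open (start, end) spans into a set of character offsets."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered


def highlight_precision_recall(predicted, gold):
    """Character-level precision and recall of predicted highlight spans."""
    pred = span_chars(predicted)
    ref = span_chars(gold)
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    return precision, recall


def acceptance_rate(accepted: int, shown: int) -> float:
    """Share of displayed correction suggestions that were accepted."""
    return accepted / shown if shown else 0.0


# Made-up spans for one segment.
predicted = [(0, 5), (10, 20)]  # system highlights, 15 characters
gold = [(3, 8), (10, 15)]       # human annotations, 10 characters
precision, recall = highlight_precision_recall(predicted, gold)
print(precision, recall, acceptance_rate(12, 40))
```

Character-level overlap is only one way to score highlights; span-level or word-level matching would give different numbers, so the granularity should be reported alongside the scores.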
minor comments (1)
- [Results] Table 2 or the corresponding results table: Ensure that all condition labels are fully spelled out in the caption so that readers can map them directly to the four experimental arms without cross-referencing the text.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of methodological transparency, statistical reporting, and generalizability that we will address in the revision. We respond to each major comment below.
Referee: [Methods] Methods section: The generation procedures for APE error highlights and correction suggestions are described at a high level but without any quantitative assessment of their intrinsic quality (e.g., highlight precision/recall against human error annotations or suggestion acceptance rates during the study). This omission makes it difficult to attribute the reported UX preference for APE over QE to the underlying signal type rather than to incidental differences in the quality of the particular LLM outputs used.
Authors: We agree that quantitative assessment of the generated highlights and suggestions would strengthen attribution of the UX differences. Our primary focus was the user study outcomes rather than intrinsic system evaluation, and we did not obtain separate human error annotations for precision/recall. However, we did log suggestion acceptance rates during the sessions. In the revised manuscript we will report these acceptance rates and add a brief discussion of how they relate to the observed UX preference. We will also clarify the generation procedures with additional implementation details. revision: partial
Referee: [Results] Results section: The null findings on productivity and quality are presented without accompanying effect sizes, confidence intervals, or power analysis. Given that user studies with professional translators often involve modest sample sizes, the absence of these statistics leaves open the possibility that meaningful differences were simply undetected.
Authors: We accept this point. In the revised results section we will report effect sizes (Cohen’s d) and 95% confidence intervals for all key comparisons. A prospective power analysis was not performed because the study was exploratory and constrained by the limited availability of professional translators; we will add a post-hoc discussion of achieved power and the implications for detecting small-to-medium effects given our sample size. revision: yes
Referee: [Discussion] Discussion section: The claim that APE highlights are better received than QE-derived highlights is framed as a general advantage, yet the study is restricted to a single language pair (En-Nl) and a specific set of LLM prompts and models. The paper should explicitly discuss the risk that the observed preference is implementation- or domain-specific and outline concrete steps (additional language pairs, alternative models, or ablation of prompt components) that would be needed to test broader applicability.
Authors: We agree that the current framing risks over-generalization. In the revised discussion we will explicitly state the limitations of the single En-Nl pair, the chosen models, and prompt design. We will also add a dedicated paragraph outlining concrete next steps: replication with at least two additional language pairs, comparison with alternative LLMs, and systematic prompt ablations to isolate which components drive the preference. revision: yes
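The effect-size reporting promised in the second response could look roughly like the sketch below. It assumes two independent groups and uses the pooled-standard-deviation form of Cohen's d with a common large-sample normal approximation for the 95% interval. If the study design is within-subject, a paired variant would be needed instead. The data are made up and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled SD."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def d_confidence_interval(d, n1, n2, z=1.96):
    """Approximate interval for d (normal approximation; z=1.96 gives ~95%)."""
    se = sqrt((n1 + n2) / (n1 * n2) + d * d / (2 * (n1 + n2)))
    return d - z * se, d + z * se


# Made-up productivity values for two conditions.
condition_a = [2.0, 4.0, 6.0]
condition_b = [1.0, 3.0, 5.0]
d = cohens_d(condition_a, condition_b)
low, high = d_confidence_interval(d, len(condition_a), len(condition_b))
print(d, low, high)
```

With only three observations per group the interval spans zero by a wide margin even though d is 0.5, which illustrates the referee's concern that a null result from a small sample cannot rule out a meaningful difference.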
Circularity Check
No significant circularity: empirical user study with direct measurements
The paper reports an empirical user study comparing four post-editing conditions (regular PE, PE with QE-derived highlights, PE with APE-derived highlights, and PE with APE-derived correction suggestions) on productivity, quality, and UX metrics collected from professional En-Nl translators. It contains no derivation chain, equations, fitted parameters renamed as predictions, or first-principles results that could reduce to inputs by construction. Claims rest on observed experimental outcomes rather than self-definitional loops or load-bearing self-citations. The work is self-contained against its own study data and does not invoke uniqueness theorems or ansatzes from prior author work to force conclusions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Professional translators' self-reported experience and measured productivity accurately reflect real-world post-editing performance.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (relation between the paper passage and the cited Recognition theorem: unclear)
Passage: "While no condition yielded productivity or quality gains compared to regular PE, APE highlights were better received than QE-derived highlights, and correction suggestions improved overall user experience."
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (relation between the paper passage and the cited Recognition theorem: unclear)
Passage: "We compare productivity, measured as the average number of source characters processed over the text-level edit time"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.