PheMT: A Phenomenon-wise Dataset for Machine Translation Robustness on User-Generated Contents

Jun Suzuki; Kaori Abe; Kazuaki Hanawa; Kentaro Inui; Makoto Morishita; Masato Mita; Ryo Fujii

arxiv: 2011.02121 · v1 · submitted 2020-11-04 · 💻 cs.CL

Ryo Fujii, Masato Mita, Kaori Abe, Kazuaki Hanawa, Makoto Morishita, Jun Suzuki, Kentaro Inui

Pith reviewed 2026-05-24 14:24 UTC · model grok-4.3

keywords machine translation · user-generated content · robustness evaluation · Japanese-English translation · linguistic phenomena · dataset construction · neural machine translation

The pith

The PheMT dataset shows that specific phenomena in user-generated Japanese text greatly degrade the English output of both in-house and commercial MT systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates PheMT, a dataset that isolates individual linguistic phenomena in Japanese user-generated content to measure their effect on machine translation quality to English. It addresses the unexplained performance gap between clean text and noisy internet text by testing models on controlled examples of each phenomenon. Experiments find that certain phenomena cause large drops in translation accuracy across multiple systems. A sympathetic reader would care because this breakdown could guide targeted fixes for real-world cross-lingual online communication rather than treating UGC noise as a single undifferentiated problem.

Core claim

The authors establish that a phenomenon-wise dataset for Japanese-English translation can quantify how particular features of user-generated content disturb neural machine translation, and their tests confirm that both custom models and widely used off-the-shelf systems suffer substantial degradation from the presence of these features.

What carries the argument

PheMT dataset, which constructs or selects translation pairs to isolate specific linguistic phenomena found in user-generated content.

If this is right

Machine translation systems must address the identified phenomena individually to close the gap with clean-text performance.
Off-the-shelf commercial systems share the same vulnerabilities to these phenomena as research models.
The dataset provides a standardized way to measure progress on UGC robustness without relying on broad noisy test sets.
Developers can prioritize training or post-processing steps that target the most disruptive phenomena first.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar phenomenon-wise datasets could be built for other language pairs to check whether the same features cause problems elsewhere.
If the most harmful phenomena are few in number, targeted data augmentation or rule-based handling might yield quick robustness gains.
The approach could extend to other noisy domains such as speech transcripts or informal chat logs beyond internet UGC.

Load-bearing premise

The chosen linguistic phenomena account for most of the observed performance difference between clean input and user-generated content, and the dataset examples faithfully represent real UGC problems.

What would settle it

An evaluation on the PheMT dataset that finds no significant translation quality drop for any of the included phenomena, or a comparison showing that real-world UGC errors differ substantially from the dataset's examples.

Figures

Figures reproduced from arXiv: 2011.02121 by Jun Suzuki, Kaori Abe, Kazuaki Hanawa, Kentaro Inui, Makoto Morishita, Masato Mita, Ryo Fujii.

**Figure 2.** Entire flow of our phenomenon-wise dataset creation.

**Figure 3.** Distribution of appropriateness scores for the MTNT dataset. Human evaluators answered the ...

**Figure 4.** Correlation between the accuracy and human judgment scores for each phenomenon (WMT ...

Original abstract

Neural Machine Translation (NMT) has shown drastic improvement in its quality when translating clean input, such as text from the news domain. However, existing studies suggest that NMT still struggles with certain kinds of input with considerable noise, such as User-Generated Contents (UGC) on the Internet. To make better use of NMT for cross-cultural communication, one of the most promising directions is to develop a model that correctly handles these expressions. Though its importance has been recognized, it is still not clear as to what creates the great gap in performance between the translation of clean input and that of UGC. To answer the question, we present a new dataset, PheMT, for evaluating the robustness of MT systems against specific linguistic phenomena in Japanese-English translation. Our experiments with the created dataset revealed that not only our in-house models but even widely used off-the-shelf systems are greatly disturbed by the presence of certain phenomena.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PheMT is a new phenomenon-wise dataset for JP-EN UGC that shows MT degradation on targeted issues, but the abstract gives too little on scale and selection to judge how solid the evidence is.

Hey, the key takeaway is that this paper introduces PheMT, a dataset designed to evaluate machine translation robustness on user-generated content by breaking it down into specific linguistic phenomena for Japanese to English. Their experiments indicate that both their own models and off-the-shelf systems are significantly affected by certain of these phenomena.

What stands out as new is the phenomenon-wise approach to dataset construction, which allows for more targeted analysis than previous work on noisy inputs. This seems like a practical step for understanding the performance gap between clean and UGC text. The paper does well in highlighting that the issue is not just general noise but particular expressions that trip up the systems.

That said, the abstract leaves out specifics like the size of the dataset, the criteria for choosing the phenomena, or any statistical measures from the experiments. This makes it difficult to fully assess how representative the findings are or whether the observed disturbances are robust. The assumption that these phenomena are the main causes could be strengthened with more validation against real-world UGC distributions.

Overall, this is the kind of work that would interest people building or evaluating MT systems for online communication and social media. A reader focused on practical improvements in handling noisy text would get value from the resource. It deserves a serious referee because it offers a new evaluation tool that can be used by others, even if it doesn't propose new modeling techniques. I would recommend sending it for peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces PheMT, a phenomenon-wise dataset for Japanese-English machine translation that targets specific linguistic phenomena common in user-generated content (UGC). The central claim is that experiments on this dataset demonstrate substantial performance degradation in both in-house and off-the-shelf MT systems when these phenomena are present, thereby explaining part of the gap between clean and UGC translation quality.

Significance. If the dataset construction and experimental results are sound, the work provides a useful controlled benchmark for isolating the effects of individual linguistic phenomena on MT robustness. The phenomenon-wise design is a clear strength for diagnostic analysis and could guide targeted improvements in handling real-world noisy text.

major comments (2)

[Dataset Construction] Dataset Construction section: The paper provides no validation (e.g., frequency analysis or human judgment study) that the chosen phenomena are the primary drivers of the performance gap in actual UGC rather than other unmodeled factors such as domain mismatch; this assumption is load-bearing for the motivation and the interpretation of the experimental results.
[Experiments] Experiments section: No information is given on the number of examples per phenomenon, the exact test-set sizes, or any statistical significance testing for the reported disturbances, making it impossible to assess whether the observed effects are reliable or generalizable beyond the constructed examples.

minor comments (2)

[Abstract] Abstract: The abstract states that experiments 'revealed' disturbances but supplies no quantitative summary of the effect sizes or metrics used.
[Experiments] The manuscript would benefit from an explicit comparison table showing clean vs. phenomenon-containing BLEU or other scores for each system.
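The comparison table requested in the last minor comment could be assembled along the following lines. This is a minimal sketch, not the authors' evaluation code: the record layout `(system, phenomenon, condition, correct)`, the condition labels `"clean"` and `"noisy"`, and the function name `accuracy_table` are all hypothetical, and a per-example correct/incorrect judgment is assumed as the metric.

```python
from collections import defaultdict


def accuracy_table(records):
    """Build per-system, per-phenomenon accuracy rows.

    `records` is an iterable of (system, phenomenon, condition, correct)
    tuples. The layout and the "clean"/"noisy" labels are assumptions
    made for this sketch, not the PheMT file format.
    Returns rows of (system, phenomenon, clean_acc, noisy_acc, drop),
    sorted so the most disruptive phenomena come first.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for system, phenomenon, condition, correct in records:
        cell = counts[(system, phenomenon, condition)]
        cell[0] += int(correct)
        cell[1] += 1

    rows = []
    for system, phenomenon in sorted({(s, p) for (s, p, _c) in counts}):
        clean = counts.get((system, phenomenon, "clean"))
        noisy = counts.get((system, phenomenon, "noisy"))
        if clean is None or noisy is None:
            continue  # a drop needs both conditions
        clean_acc = clean[0] / clean[1]
        noisy_acc = noisy[0] / noisy[1]
        rows.append((system, phenomenon, clean_acc, noisy_acc, clean_acc - noisy_acc))

    rows.sort(key=lambda row: row[4], reverse=True)
    return rows


if __name__ == "__main__":
    # Toy records for illustration only; these are not results from the paper.
    toy = [
        ("system_a", "phenomenon_1", "clean", True),
        ("system_a", "phenomenon_1", "clean", True),
        ("system_a", "phenomenon_1", "noisy", True),
        ("system_a", "phenomenon_1", "noisy", False),
    ]
    for row in accuracy_table(toy):
        print(row)
```

Sorting by the size of the drop puts the phenomena that hurt each system most at the top of the table.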

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below, clarifying our claims and outlining planned revisions where appropriate.

Referee: [Dataset Construction] Dataset Construction section: The paper provides no validation (e.g., frequency analysis or human judgment study) that the chosen phenomena are the primary drivers of the performance gap in actual UGC rather than other unmodeled factors such as domain mismatch; this assumption is load-bearing for the motivation and the interpretation of the experimental results.

Authors: We thank the referee for this observation. Our central claim is not that the selected phenomena are the primary or sole drivers of the UGC-clean performance gap, but that PheMT provides a controlled, phenomenon-wise benchmark to isolate and measure their individual effects on MT systems. The motivation draws from existing literature on UGC challenges in Japanese-English translation. We acknowledge that the current manuscript lacks explicit frequency analysis or human validation studies confirming primacy over factors like domain mismatch. In revision, we will add a dedicated subsection in the Dataset Construction section discussing the selection rationale with references to prior linguistic studies and real UGC examples, while explicitly noting that other factors may contribute and that PheMT is diagnostic rather than exhaustive.

revision: partial

Referee: [Experiments] Experiments section: No information is given on the number of examples per phenomenon, the exact test-set sizes, or any statistical significance testing for the reported disturbances, making it impossible to assess whether the observed effects are reliable or generalizable beyond the constructed examples.

Authors: We agree that these details are necessary for evaluating reliability and generalizability. Although the dataset sizes are described at a high level, we will revise the Experiments section to include explicit tables reporting the exact number of examples per phenomenon, precise test-set sizes for each condition, and statistical significance testing (e.g., paired bootstrap or McNemar's test) on the observed performance differences. These additions will be made in the next version of the manuscript.

revision: yes
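For readers unfamiliar with the paired bootstrap mentioned in this response, a minimal sketch follows. It assumes per-example scores (for instance 1 for a correct translation and 0 otherwise) that are paired by example across two conditions. The function name `paired_bootstrap` and its parameters are illustrative and do not come from the paper.

```python
import random


def paired_bootstrap(scores_a, scores_b, num_samples=1000, seed=0):
    """Fraction of bootstrap resamples in which condition A beats condition B.

    `scores_a[i]` and `scores_b[i]` must refer to the same test example,
    e.g. the translation of a clean sentence and of its noisy counterpart.
    A fraction close to 1.0 suggests the gap is unlikely to be an artifact
    of which examples happened to be in the test set.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(scores_a)
    wins = 0
    for _ in range(num_samples):
        # Resample example indices with replacement, keeping pairs aligned.
        indices = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in indices) / n
        mean_b = sum(scores_b[i] for i in indices) / n
        if mean_a > mean_b:
            wins += 1
    return wins / num_samples
```

One way to read the result: `1 - paired_bootstrap(clean_scores, noisy_scores)` acts as an approximate one-sided p-value for the claim that the clean condition scores higher. McNemar's test, the other option the authors name, would instead look only at the examples on which the two conditions disagree.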

Circularity Check

0 steps flagged

No significant circularity; dataset creation and evaluation are self-contained

The paper constructs a new phenomenon-wise dataset (PheMT) for Japanese-English MT robustness evaluation on UGC and reports experimental results showing performance degradation on specific phenomena for both in-house and off-the-shelf systems. No equations, fitted parameters, predictions, or derivations are present that could reduce to inputs by construction. The central claim rests directly on the controlled examples in the new dataset and the observed MT outputs, without self-citation load-bearing, ansatz smuggling, or renaming of prior results. This matches the default case of an honest empirical dataset paper with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset paper with no mathematical derivations, free parameters, or invented entities.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "we present a new dataset, PheMT, for evaluating the robustness of MT systems against specific linguistic phenomena in Japanese-English translation"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Our experiments with the created dataset revealed that not only our in-house models but even widely used off-the-shelf systems are greatly disturbed by the presence of certain phenomena"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 3 internal anchors

[1] Neural machine translation of text from non-native speakers. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3070–3080. Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette ...

[2] Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day

[3] Evaluating discourse phenomena in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1304–1313. Yonatan Belinkov and Yonatan Bisk

[4] Synthetic and natural noise both break neural machine translation. In 6th International Conference on Learning Representations, ICLR 2018. Alexandre Berard, Ioan Calapodescu, Marc Dymetman, Claude Roux, Jean-Luc Meunier, and Vassilina Nikoulina. 2019a. Machine translation of restaurant reviews: New corpus for domain adaptation and robustness. In Procee...

[5] One size does not fit all: Comparing NMT representations of different granularities. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1504–1516. Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Ch...

[6] Achieving Human Parity on Automatic Chinese to English News Translation. arXiv, abs/1803.05567. Georg Heigold, Stalin Varanasi, Günter Neumann, and Josef van Genabith

[7] A challenge set approach to evaluating machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2486–2496. Marcin Junczys-Dowmunt

[8] Training on synthetic noise improves robustness to natural noise in machine translation. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 42–47. Philipp Koehn, Huda Khayrallah, Kenneth Heafield, and Mikel L. Forcada

[9] Findings of the WMT 2018 shared task on parallel corpus filtering. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 726–739. Taku Kudo and John Richardson

[10] SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71. Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto

[11] Applying conditional random fields to Japanese morphological analysis. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 230–237. Xian Li, Paul Michel, Antonios Anastasopoulos, Yonatan Belinkov, Nadir Durrani, Orhan Firat, Philipp Koehn, Graham Neubig, Juan Pino, and Hassan Sajjad

[12] Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421. Paul Michel and Graham Neubig

[13] MTNT: A Testbed for Machine Translation of Noisy Text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 543–553. Makoto Morishita, Jun Suzuki, and Masaaki Nagata

[14] fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu

[15] Sequence-to-sequence neural network models for transliteration. arXiv, abs/1610.09565. Itsumi Saito, Kugatsu Sadamitsu, Hisako Asano, and Yoshihiro Matsuo

[16] Morphological analysis for Japanese noisy text based on character-level and word-level normalization. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1773–1782. Ryohei Sasano, Sadao Kurohashi, and Manabu Okumura

[17] Character-based neural machine translation. arXiv, abs/1511.04586. John S. White, Theresa A. O’Connell, and Francis E. O’Mara

[18] The ARPA MT evaluation methodologies: Evolution, lessons, and future approaches. In Proceedings of the First Conference of the Association for Machine Translation in the Americas, pages 193–205. Score Definition 5 translations that conveys the meaning completely and fluent as target language sentence 4 translations that does not show any lack of informati...
