COMET: A Neural Framework for MT Evaluation
10 Pith papers cite this work. Polarity classification is still indexing.
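For context on what a COMET score is, here is a minimal scoring sketch using the open-source Unbabel `comet` package. The checkpoint name, example sentences, and CPU-only setting are illustrative assumptions, not details taken from this page.

```python
from comet import download_model, load_from_checkpoint  # pip install unbabel-comet

# Reference-based COMET checkpoint (illustrative choice; other checkpoints exist).
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

data = [
    {"src": "Der Hund bellt.", "mt": "The dog barks.", "ref": "The dog is barking."},
    {"src": "Guten Morgen.",   "mt": "Good morning.",  "ref": "Good morning."},
]

# gpus=0 forces CPU; the result holds segment-level scores and a corpus-level system score.
out = model.predict(data, batch_size=8, gpus=0)
print(out.scores, out.system_score)
```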
Citing papers
-
Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations
Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.
-
ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation
ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.
-
LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation
LQM introduces a six-level linguistically motivated error taxonomy for MT evaluation and applies it via expert annotation to LLM outputs on a new 3,850-sentence multi-dialect Arabic corpus.
-
Dynamic Meta-Metrics: Source-Sentence Conditioned Weighting for MT Evaluation
Dynamic Meta-Metrics learns source-sentence-conditioned combinations of MT metrics, with MLP-based hard and soft clustering versions outperforming static linear and Gaussian process ensembles on WMT data (the soft-weighting idea is sketched after this list).
-
A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation
VIDA provides 2,500 visually dependent, ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out of distribution.
-
Towards Visually-Guided Movie Subtitle Translation for Indic Languages
Selective replacement of the worst 20-30% of text-only subtitle segments with visually enhanced outputs raises COMET scores for Indic languages, but full visual grounding is ineffective because of temporal misalignment between subtitles and frames (the selection rule is sketched after this list).
-
Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
-
Effects of Cross-lingual Evidence in Multilingual Medical Question Answering
Combining English and target-language web retrieval boosts medical QA for low-resource languages to match high-resource performance, while English web data benefits high-resource languages most and specialized sources like PubMed lack multilingual coverage.
-
An Empirical Study of Many-Shot In-Context Learning for Machine Translation of Low-Resource Languages
BM25 retrieval makes many-shot ICL for low-resource MT roughly 5x more data-efficient, with 50 retrieved examples matching 250 random ones and 250 matching 1,000 (example retrieval is sketched after this list).
-
Adam's Law: Textual Frequency Law on Large Language Models
Frequently occurring sentence-level text improves LLM prompting and fine-tuning performance across math, translation, commonsense, and tool-use tasks, an effect the paper captures in a proposed frequency law and exploits via curriculum ordering.
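The Dynamic Meta-Metrics entry above describes learning per-sentence weights over base MT metrics. Below is a minimal PyTorch sketch of the soft variant; the network width, choice of base metrics, source-embedding model, and training objective are assumptions not stated in the summary.

```python
import torch
import torch.nn as nn

class SoftMetricWeighter(nn.Module):
    """MLP mapping a source-sentence embedding to softmax weights over K base
    MT metrics (e.g. COMET, chrF, BLEU scores computed separately)."""

    def __init__(self, emb_dim: int = 768, num_metrics: int = 4, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_metrics),
        )

    def forward(self, src_emb: torch.Tensor, metric_scores: torch.Tensor) -> torch.Tensor:
        # src_emb: (batch, emb_dim); metric_scores: (batch, num_metrics)
        weights = torch.softmax(self.mlp(src_emb), dim=-1)  # per-sentence mixture weights
        return (weights * metric_scores).sum(dim=-1)        # combined quality score

# Dummy usage: random tensors stand in for real source embeddings and metric scores.
weighter = SoftMetricWeighter()
combined = weighter(torch.randn(2, 768), torch.rand(2, 4))
```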
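The Indic subtitle entry rests on a simple selection rule: rank text-only segments by an automatic quality score and swap out only the worst fraction. A minimal sketch, assuming scores come from some MT quality metric (the summary mentions COMET) and using a 25% fraction within the reported 20-30% range:

```python
def selective_replace(text_only, visual_enhanced, quality_scores, fraction=0.25):
    """Replace the lowest-scoring fraction of text-only subtitle segments with
    their visually grounded counterparts; keep the rest unchanged.

    text_only, visual_enhanced: candidate translations per segment.
    quality_scores: automatic quality scores for the text-only outputs.
    """
    n_replace = int(len(text_only) * fraction)
    worst = set(sorted(range(len(text_only)), key=lambda i: quality_scores[i])[:n_replace])
    return [visual_enhanced[i] if i in worst else text_only[i]
            for i in range(len(text_only))]

# Illustrative call with dummy data; real scores would come from an MT quality metric.
merged = selective_replace(["seg a", "seg b", "seg c", "seg d"],
                           ["SEG A", "SEG B", "SEG C", "SEG D"],
                           [0.9, 0.4, 0.7, 0.8], fraction=0.25)
```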
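The many-shot ICL entry relies on BM25 to pick in-context examples. A minimal sketch with the `rank_bm25` package follows; the example pool, language pair, tokenization, and downstream prompt assembly are illustrative assumptions.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Hypothetical pool of (source, translation) pairs for a low-resource language pair.
pool = [
    ("good morning", "bonjou"),
    ("where is the market", "kote mache a ye"),
    ("the market opens early", "mache a louvri bone"),
]

bm25 = BM25Okapi([src.split() for src, _ in pool])

def retrieve_shots(query_src: str, k: int = 50):
    """Return the k pool pairs whose source sides best match the query under BM25."""
    scores = bm25.get_scores(query_src.split())
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in ranked[:k]]

# The retrieved pairs would then be formatted as in-context examples before the test source.
shots = retrieve_shots("where does the market open", k=2)
```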