BehaviorBench is a benchmark for foundation models on behavioral tasks that reveals fine-tuned behavioral models outperform general models on distributional alignment while general models lead on individual-level accuracy.
hub
BLEURT : Learning robust metrics for text generation
14 Pith papers cite this work. Polarity classification is still indexing.
hub tools
fields
cs.CL 14representative citing papers
Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.
ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.
A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.
Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.
Human readers prefer human literary translations over AI-generated ones for immersion and clarity despite finding MT adequate and struggling to identify the source.
Diffusion-based localized editing framework for faithful summarization of evolving contexts, introducing the StreamSum benchmark and showing tradeoffs in faithfulness, speed, and preservation.
SCURank ranks multiple summary candidates with Summary Content Units to outperform ROUGE and LLM-based methods in summarization distillation.
LLMs judge their own outputs higher because they assign better scores to lower-perplexity text, even when the text is not self-generated.
Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
HydraQE is a new end-to-end speech translation QE system using Qwen3-ASR backbone, sparsemax layer mixing, bidirectional Transformer, and multi-task curriculum training on human and pseudo labels that outperforms cascaded baselines.
RAG is more effective and cost-efficient than fine-tuning for industrial QA adaptation on automotive datasets.
citing papers explorer
-
BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks
BehaviorBench is a benchmark for foundation models on behavioral tasks that reveals fine-tuned behavioral models outperform general models on distributional alignment while general models lead on individual-level accuracy.
-
Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations
Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.
-
ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation
ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.
-
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.
-
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.
-
AI translation of literary texts is "fine", but readers still prefer human translations
Human readers prefer human literary translations over AI-generated ones for immersion and clarity despite finding MT adequate and struggling to identify the source.
-
Detect, Remask, Repair: Diffusion Editing for Faithful Summarization of Evolving Contexts
Diffusion-based localized editing framework for faithful summarization of evolving contexts, introducing the StreamSum benchmark and showing tradeoffs in faithfulness, speed, and preservation.
-
SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization
SCURank ranks multiple summary candidates with Summary Content Units to outperform ROUGE and LLM-based methods in summarization distillation.
-
Self-Preference Bias in LLM-as-a-Judge
LLMs judge their own outputs higher because they assign better scores to lower-perplexity text, even when the text is not self-generated.
-
Gemini: A Family of Highly Capable Multimodal Models
Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.
-
PaLM 2 Technical Report
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
-
HydraQE: OSU's Submission for the IWSLT 2026 Speech Translation Metrics Shared Task
HydraQE is a new end-to-end speech translation QE system using Qwen3-ASR backbone, sparsemax layer mixing, bidirectional Transformer, and multi-task curriculum training on human and pseudo labels that outperforms cascaded baselines.
-
Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications
RAG is more effective and cost-efficient than fine-tuning for industrial QA adaptation on automotive datasets.
- Dynamic Meta-Metrics: Source-Sentence Conditioned Weighting for MT Evaluation