VOIR DIRE benchmark shows MLLM-as-a-Judge systems decompose into positivity-floor calibration failure and orientation failure on culturally contested items, with persona prompting recovering only the former.
GPTS core: Evaluate as You Desire
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 10representative citing papers
Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.
AsymmetryZero operationalizes expert preferences as stable evaluation contracts for semantic evals, with a study showing 75.9-89.6% criterion agreement between frontier and compact model juries at 4-5% of the cost.
SCURank ranks multiple summary candidates with Summary Content Units to outperform ROUGE and LLM-based methods in summarization distillation.
Short-form factual consistency metrics produce inconsistent scores on semantically equivalent long-document summaries and lose reliability on information-dense claims.
Early-token log-probabilities from LLM decoding are stronger predictors of reasoning quality than full-sequence statistics in multi-agent debate on essay scoring tasks.
In two-agent debate, log-probability confidence aligns with LLM-judged reasoning quality roughly twice as strongly for the Constructor (AUROC 0.804 for critical failure detection) as for the Auditor (0.634).
LLM safety judges resist adjusting evaluations when given contradictory context or new safety definitions, despite some ability to learn from new information.
Constructs multi-video summarization benchmark and evaluates nine MLLMs showing positional bias is domain- and model-dependent with middle positions often weaker and budgets not uniformly fixing it.
A literature survey that organizes prompting, fine-tuning, preference optimization, and context-aware techniques for LLM-based machine translation with emphasis on low-resource languages.
citing papers explorer
-
SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization
SCURank ranks multiple summary candidates with Summary Content Units to outperform ROUGE and LLM-based methods in summarization distillation.