
Can Large Language Models Be an Alternative to Human Evaluations?

Chiang, Cheng-Han, Lee, Hung-yi · 2023 · DOI 10.18653/v1/2023.acl-long.870

16 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 4

citation-polarity summary

background 2 · support 1 · unclear 1

representative citing papers

Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity

cs.CV · 2026-06-12 · unverdicted · novelty 7.0

VOIR DIRE benchmark shows MLLM-as-a-Judge systems decompose into positivity-floor calibration failure and orientation failure on culturally contested items, with persona prompting recovering only the former.

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).

DECK: A Consistency x Confidence Taxonomy of LLM Hallucinations

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

The DECK taxonomy partitions LLM hallucinations into four detectability regimes using consistency and confidence axes, mapping each to scorer families and identifying a universal blind spot for output-level uncertainty quantification on knowledge-gap inputs.

When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation

cs.CL · 2026-05-19 · conditional · novelty 7.0

A new 263k TTCW-annotated story dataset shows non-reasoning fine-tuning of Qwen3 models outperforms reasoning-supervised fine-tuning for fixed-format long-form literary review generation.

NARRA-Gym for Evaluating Interactive Narrative Agents

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

NARRA-Gym is an executable benchmark that generates complete interactive narrative episodes from emotional seeds and logs full model trajectories to expose gaps in coherence, adaptation, and personalization that static story tests miss.

LLM Advertisement based on Neuron Auctions

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Neuron Auctions auction continuous neuron intervention budgets on brand-specific orthogonal subspaces in LLMs to achieve strategy-proof revenue optimization while penalizing user utility loss.

Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

LLMs exhibit positional bias and context-dependent scoring patterns when judging document similarity, with each model showing a stable scoring fingerprint but a shared hierarchy of sensitivity to different semantic perturbations.

Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models

cs.CL · 2025-12-29 · accept · novelty 7.0

Spoken language models exhibit style amnesia and fail to maintain instructed paralinguistic styles across multi-turn conversations, with explicit recall offering partial mitigation.

Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate

cs.CL · 2026-06-09 · unverdicted · novelty 5.0

Early-token log-probabilities from LLM decoding are stronger predictors of reasoning quality than full-sequence statistics in multi-agent debate on essay scoring tasks.

The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge

cs.CL · 2026-06-09 · unverdicted · novelty 5.0

In two-agent debate, log-probability confidence aligns with LLM-judged reasoning quality roughly twice as strongly for the Constructor (AUROC 0.804 for critical failure detection) as for the Auditor (0.634).

RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems

cs.CL · 2026-05-11 · unverdicted · novelty 5.0

RUBEN discovers minimal rule sets explaining RAG LLM outputs via novel pruning and applies them to evaluate LLM safety against adversarial injections.

When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models

cs.AI · 2026-05-06 · unverdicted · novelty 5.0

Sycophancy is a boundary failure between social alignment and epistemic integrity, captured by a three-condition framework plus taxonomy of targets, mechanisms, and severity.

Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization

cs.CL · 2026-04-22 · unverdicted · novelty 5.0

Automatic prompt optimization using lenient LLM judges improves performance and transferability in legal QA evaluations compared to human design or strict judges.

A Graph-Enhanced Defense Framework for Explainable Fake News Detection with LLM

cs.CL · 2026-04-08 · unverdicted · novelty 5.0

G-Defense builds claim-centered graphs from sub-claims, applies RAG for evidence and competing explanations, then uses graph inference to detect fake news veracity and generate intuitive explanation graphs, claiming SOTA results.

Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts

cs.CL · 2026-04-03 · unverdicted · novelty 5.0

Domain-adapted utterance-level retrieval raises Cohen's kappa for tutoring dialogue act annotation to 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi, beating no-retrieval baselines by large margins across three LLMs.

VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents

cs.CL · 2025-06-17 · unverdicted · novelty 5.0

VIDEE introduces a human-in-the-loop system using Monte-Carlo Tree Search for task decomposition, executable pipeline generation, and LLM-based evaluation with visualizations to support non-expert text analytics.

citing papers explorer

Showing 16 of 16 citing papers.

Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity
cs.CV · 2026-06-12 · unverdicted · none · ref 22
VOIR DIRE benchmark shows MLLM-as-a-Judge systems decompose into positivity-floor calibration failure and orientation failure on culturally contested items, with persona prompting recovering only the former.

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges
cs.AI · 2026-06-03 · unverdicted · none · ref 42
LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).

DECK: A Consistency x Confidence Taxonomy of LLM Hallucinations
cs.CL · 2026-06-01 · unverdicted · none · ref 3
The DECK taxonomy partitions LLM hallucinations into four detectability regimes using consistency and confidence axes, mapping each to scorer families and identifying a universal blind spot for output-level uncertainty quantification on knowledge-gap inputs.

When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation
cs.CL · 2026-05-19 · conditional · none · ref 23
A new 263k TTCW-annotated story dataset shows non-reasoning fine-tuning of Qwen3 models outperforms reasoning-supervised fine-tuning for fixed-format long-form literary review generation.

NARRA-Gym for Evaluating Interactive Narrative Agents
cs.CL · 2026-05-08 · unverdicted · none · ref 42
NARRA-Gym is an executable benchmark that generates complete interactive narrative episodes from emotional seeds and logs full model trajectories to expose gaps in coherence, adaptation, and personalization that static story tests miss.

LLM Advertisement based on Neuron Auctions
cs.LG · 2026-05-08 · unverdicted · none · ref 3
Neuron Auctions auction continuous neuron intervention budgets on brand-specific orthogonal subspaces in LLMs to achieve strategy-proof revenue optimization while penalizing user utility loss.

Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring
cs.CL · 2026-04-20 · unverdicted · none · ref 44
LLMs exhibit positional bias and context-dependent scoring patterns when judging document similarity, with each model showing a stable scoring fingerprint but a shared hierarchy of sensitivity to different semantic perturbations.

Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models
cs.CL · 2025-12-29 · accept · none · ref 5
Spoken language models exhibit style amnesia and fail to maintain instructed paralinguistic styles across multi-turn conversations, with explicit recall offering partial mitigation.

Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate
cs.CL · 2026-06-09 · unverdicted · none · ref 21
Early-token log-probabilities from LLM decoding are stronger predictors of reasoning quality than full-sequence statistics in multi-agent debate on essay scoring tasks.

The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge
cs.CL · 2026-06-09 · unverdicted · none · ref 21
In two-agent debate, log-probability confidence aligns with LLM-judged reasoning quality roughly twice as strongly for the Constructor (AUROC 0.804 for critical failure detection) as for the Auditor (0.634).

RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems
cs.CL · 2026-05-11 · unverdicted · none · ref 31
RUBEN discovers minimal rule sets explaining RAG LLM outputs via novel pruning and applies them to evaluate LLM safety against adversarial injections.

When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models
cs.AI · 2026-05-06 · unverdicted · none · ref 12
Sycophancy is a boundary failure between social alignment and epistemic integrity, captured by a three-condition framework plus taxonomy of targets, mechanisms, and severity.

Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization
cs.CL · 2026-04-22 · unverdicted · none · ref 3
Automatic prompt optimization using lenient LLM judges improves performance and transferability in legal QA evaluations compared to human design or strict judges.

A Graph-Enhanced Defense Framework for Explainable Fake News Detection with LLM
cs.CL · 2026-04-08 · unverdicted · none · ref 15
G-Defense builds claim-centered graphs from sub-claims, applies RAG for evidence and competing explanations, then uses graph inference to detect fake news veracity and generate intuitive explanation graphs, claiming SOTA results.

Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts
cs.CL · 2026-04-03 · unverdicted · none · ref 8
Domain-adapted utterance-level retrieval raises Cohen's kappa for tutoring dialogue act annotation to 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi, beating no-retrieval baselines by large margins across three LLMs.

VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents
cs.CL · 2025-06-17 · unverdicted · none · ref 11
VIDEE introduces a human-in-the-loop system using Monte-Carlo Tree Search for task decomposition, executable pipeline generation, and LLM-based evaluation with visualizations to support non-expert text analytics.
