From Generation to Judgment: Opportunities and Challenges of LLM-as-a-Judge
7 Pith papers cite this work, alongside 44 external citations.
2026: 7 representative citing papers
ARA extracts workflow graphs from papers and scores reproducibility, reaching 61% accuracy on 213 ReScience C articles and outperforming prior methods on ReproBench and GoldStandardDB.
Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.
LLM assistance shortens idea-generation periods and reduces creative moments during programming tasks while yielding solutions with comparable idea counts and greater functional correctness.
citing papers explorer
-
Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation
Persistent 'Rock Tokens' in on-policy distillation resist teacher corrections, consume large gradient norms, yet add negligible value to reasoning, allowing targeted bypassing to streamline alignment.
-
Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory
Item response theory applied to 17 LLMs on SciEntsBank and Beetle reveals that models with similar overall scores differ sharply in robustness to difficult responses, with errors clustering on partial-credit labels.
-
Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments
LaaB improves LLM hallucination detection by mapping self-judgment labels back into neural feature space and using mutual learning under logical consistency constraints between responses and meta-judgments.
-
Beyond Overlap Metrics: Rewarding Reasoning and Preferences for Faithful Multi-Role Dialogue Summarization
A reasoning-distillation plus dual-reward GRPO method for multi-role dialogue summarization matches ROUGE and BERTScore baselines while improving factual faithfulness and preference alignment on CSDS and SAMSum.
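One of the citing papers above applies item response theory (IRT) to LLM graders. As a rough illustrative sketch, not that paper's implementation, the standard two-parameter logistic (2PL) IRT model estimates the probability that a grader of ability θ correctly handles a response of difficulty b with discrimination a; the parameter values below are made up for illustration:

```python
import math

def irt_2pl(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic IRT model: probability that a grader
    with ability `theta` correctly scores an item of difficulty `b`
    and discrimination `a`."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical grader with ability 1.5, item discrimination 1.2:
p_hard = irt_2pl(1.5, a=1.2, b=2.0)   # difficult response
p_easy = irt_2pl(1.5, a=1.2, b=-1.0)  # easy response
```

Under this model, two graders with the same average accuracy can still diverge sharply on hard items (large b), which is the kind of robustness difference the summary describes.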