Title resolution pending

Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, Dongyeop Kang · 2024 · arXiv 2309.17012

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 3 dataset 1

citation-polarity summary

background 3 use dataset 1

representative citing papers

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

cs.AI · 2026-04-16 · unverdicted · novelty 7.0

LLM judges display per-document transitivity violations in 33-67% of cases despite low aggregate rates, while conformal prediction set widths serve as reliable indicators of document-level difficulty with cross-judge agreement.

Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

cs.CL · 2026-04-08 · unverdicted · novelty 7.0

Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.

Better & Faster Large Language Models via Multi-token Prediction

cs.CL · 2024-04-30 · conditional · novelty 6.0

Multi-token prediction training yields higher sample efficiency, better benchmark scores on code generation, and up to 3x faster inference than standard next-token prediction for LLMs.

LLM Evaluators Recognize and Favor Their Own Generations

cs.CL · 2024-04-15 · unverdicted · novelty 6.0

LLMs show measurable self-recognition that linearly correlates with self-preference bias in evaluations, supported by fine-tuning experiments and controls for confounders.

U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning

cs.AI · 2026-05-04 · unverdicted · novelty 5.0

U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.

Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

cs.AI · 2026-04-25 · unverdicted · novelty 5.0

Style bias dominates LLM-as-a-Judge systems far more than position bias, with debiasing strategies providing model-dependent gains and public tools released for replication.

A Survey on LLM-as-a-Judge

cs.CL · 2024-11-23 · unverdicted · novelty 4.0

A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.

From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap

cs.SE · 2024-10-28 · unverdicted · novelty 4.0

A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

cs.CL · 2024-12-07 · accept · novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

cs.AI · 2026-05-17

citing papers explorer

Showing 10 of 10 citing papers.

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations cs.AI · 2026-04-16 · unverdicted · none · ref 10
LLM judges display per-document transitivity violations in 33-67% of cases despite low aggregate rates, while conformal prediction set widths serve as reliable indicators of document-level difficulty with cross-judge agreement.
Self-Preference Bias in Rubric-Based Evaluation of Large Language Models cs.CL · 2026-04-08 · unverdicted · none · ref 8
Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.
Better & Faster Large Language Models via Multi-token Prediction cs.CL · 2024-04-30 · conditional · none · ref 9
Multi-token prediction training yields higher sample efficiency, better benchmark scores on code generation, and up to 3x faster inference than standard next-token prediction for LLMs.
LLM Evaluators Recognize and Favor Their Own Generations cs.CL · 2024-04-15 · unverdicted · none · ref 10
LLMs show measurable self-recognition that linearly correlates with self-preference bias in evaluations, supported by fine-tuning experiments and controls for confounders.
U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning cs.AI · 2026-05-04 · unverdicted · none · ref 55
U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.
Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines cs.AI · 2026-04-25 · unverdicted · none · ref 10
Style bias dominates LLM-as-a-Judge systems far more than position bias, with debiasing strategies providing model-dependent gains and public tools released for replication.
A Survey on LLM-as-a-Judge cs.CL · 2024-11-23 · unverdicted · none · ref 69
A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.
From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap cs.SE · 2024-10-28 · unverdicted · none · ref 64
A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods cs.CL · 2024-12-07 · accept · none · ref 116
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps cs.AI · 2026-05-17 · unreviewed · ref 12

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer