super hub Mixed citations

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Siyuan Zhuang, Wei-Lin Chiang, Ying Sheng, Yonghao Zhuang, Zhanghao Wu · 2023 · cs.CL · arXiv 2306.05685

Mixed citation behavior. Most common role is background (47%).

296 Pith papers citing it

Background 47% of classified citations

open full Pith review browse 296 citing papers more from Lianmin Zheng arXiv PDF

abstract

Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. The MT-bench questions, 3K expert votes, and 30K conversations with human preferences are publicly available at https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 15 method 10 dataset 4 baseline 1

citation-polarity summary

background 14 use method 9 use dataset 4 unclear 2 baseline 1

claims ledger

abstract Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-be

authors

Lianmin Zheng Siyuan Zhuang Wei-Lin Chiang Ying Sheng Yonghao Zhuang Zhanghao Wu

co-cited works

representative citing papers

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

cs.CL · 2023-08-28 · unverdicted · novelty 8.0

LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).

Universal and Transferable Adversarial Attacks on Aligned Language Models

cs.CL · 2023-07-27 · accept · novelty 8.0

Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.

FARS: A Fully Automated Research System Deployed at Scale

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

FARS deployed at scale produced 166 AI/ML papers across 67 topics that received 282 structured human reviews indicating some review-worthy outputs alongside recurring failure modes.

EMPATH: A Multilingual Auditor-Judge Benchmark for Safety Evaluation of Emotional-Support Chatbots

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

The paper presents EMPATH, a new multilingual multi-turn benchmark for safety evaluation of emotional-support chatbots that uses separate auditor and judge models and releases its pipeline and rubrics.

CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.

Prompt Injection in Automated R\'esum\'e Screening with Large Language Models: Single and Multi-Injection Settings

cs.AI · 2026-06-25 · unverdicted · novelty 7.0

Experiments show prompt injection boosts resume rankings in LLM screening when rare and qualities homogeneous, but fails when common or qualities differ.

C3-Bench: A Context-Aware Change Captioning Benchmark

cs.CV · 2026-06-24 · unverdicted · novelty 7.0

C3-Bench supplies a multi-domain dataset and LLM-based evaluation protocol that exposes systematic failures in existing change captioning models outside their training regimes.

MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery

cs.CL · 2026-06-23 · unverdicted · novelty 7.0

MEMPROBE is a benchmark for direct recovery of hidden user states from LLM agent memory, showing task success and memory recovery as distinct capabilities with moderate recovery scores around 0.6.

ChartWalker: Benchmarking the Cross-Chart RAG Task with Hierarchical Knowledge Graphs

cs.IR · 2026-06-22 · unverdicted · novelty 7.0

ChartWalker provides a hierarchical knowledge graph construction method and structure-aware sampling to generate cross-chart RAG benchmarks, releasing ChartWalker-Bench that exposes performance gaps across RAG paradigms.

AI Fiction in the Wild

cs.CL · 2026-06-22 · unverdicted · novelty 7.0

Analysis of 500k ChatGPT logs shows over one-third of conversations generate fiction, dominated by power users with repetitive and niche patterns.

Power Systems Agent Benchmark: Executable Evaluation of AI Agents in Electric Power Engineering

cs.AI · 2026-06-18 · unverdicted · novelty 7.0 · 2 refs

Introduces the Power Systems Agent Benchmark with 41 task families across eight power engineering areas for executable evaluation of AI agents using deterministic feasibility checks.

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

cs.AI · 2026-06-17 · unverdicted · novelty 7.0

User facts are internalized as surgical local edits to a hash-keyed Engram memory table with reasoning skill held in a shared adapter, claimed to match LoRA recall, improve indirect reasoning 5.6x on average, and compose across users with 33,000x smaller footprint than per-user adapters.

Bistable by Construction: Wall-Clock-Calibrated State Monitors Have No Moment-Detection Regime at Agent Cadence

cs.SE · 2026-06-15 · conditional · novelty 7.0

Wall-clock leaky-integrator monitors on variable-cadence agent streams exhibit a sharp cliff between constant-alarm and silent regimes, while sample-time CUSUM monitors remain invariant; transition detection avoids the trap.

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

RealMath-Eval benchmark shows LLM judges have an evaluation gap, performing worse on diverse real human math reasoning than on synthetic solutions due to greater error diversity and higher surprisal.

Cherry-pick Override: Unsafe Directional Commitment in LLM Judges under Mixed Evidence

cs.SE · 2026-06-05 · unverdicted · novelty 7.0

The paper defines Cherry-pick Override (CCO) as unauthorized directional commitment by LLM judges under mixed evidence and quantifies its prevalence (>84% on AVeriTeC conflicting subset) while testing intervention ladders and a two-channel reference probe.

Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems

cs.AI · 2026-06-05 · unverdicted · novelty 7.0

MAC-Bench is a new adversarial benchmark that converts legal texts into executable scenarios via the SERV pipeline to measure procedural compliance in multi-agent LLM systems using CSR and MG metrics.

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).

CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

cs.CL · 2026-06-02 · conditional · novelty 7.0

CoEval generates task-specific benchmarks by rotating models through teacher, student, and judge roles, then weights questions by discriminative power and judges by panel consensus to recover accurate model rankings without labels.

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

cs.CL · 2026-05-31 · conditional · novelty 7.0

Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.

RWGBench: Evaluating Scholarly Positioning in Related Work Generation

cs.DL · 2026-05-30 · unverdicted · novelty 7.0

RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Introduces (ε,q,t,A)-behavioral indistinguishability and shows via Qwen/Llama experiments that LoRA distillation boosts semantic similarity but leaves detectable behavioral differences under adversarial evaluation.

Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

A Behavioral Specification interpretive layer improves representational accuracy for AI personalization by compressing user data into patterns, outperforming raw corpora and commercial memory systems on held-out behavioral predictions across 14 autobiographical corpora while reducing context cost.

citing papers explorer

Showing 50 of 296 citing papers.

Why Do Multi-Agent LLM Systems Fail? cs.AI · 2025-03-17 · unverdicted · none · ref 22 · internal anchor
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
ORPO: Monolithic Preference Optimization without Reference Model cs.CL · 2024-03-12 · conditional · none · ref 137 · internal anchor
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding cs.CL · 2023-08-28 · unverdicted · none · ref 132 · internal anchor
LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
Universal and Transferable Adversarial Attacks on Aligned Language Models cs.CL · 2023-07-27 · accept · none · ref 28 · internal anchor
Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.
FARS: A Fully Automated Research System Deployed at Scale cs.AI · 2026-06-30 · unverdicted · none · ref 25 · internal anchor
FARS deployed at scale produced 166 AI/ML papers across 67 topics that received 282 structured human reviews indicating some review-worthy outputs alongside recurring failure modes.
EMPATH: A Multilingual Auditor-Judge Benchmark for Safety Evaluation of Emotional-Support Chatbots cs.AI · 2026-06-29 · unverdicted · none · ref 13 · internal anchor
The paper presents EMPATH, a new multilingual multi-turn benchmark for safety evaluation of emotional-support chatbots that uses separate auditor and judge models and releases its pipeline and rubrics.
CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents cs.AI · 2026-06-29 · unverdicted · none · ref 13 · internal anchor
CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.
Prompt Injection in Automated R\'esum\'e Screening with Large Language Models: Single and Multi-Injection Settings cs.AI · 2026-06-25 · unverdicted · none · ref 5 · internal anchor
Experiments show prompt injection boosts resume rankings in LLM screening when rare and qualities homogeneous, but fails when common or qualities differ.
C3-Bench: A Context-Aware Change Captioning Benchmark cs.CV · 2026-06-24 · unverdicted · none · ref 114 · internal anchor
C3-Bench supplies a multi-domain dataset and LLM-based evaluation protocol that exposes systematic failures in existing change captioning models outside their training regimes.
MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery cs.CL · 2026-06-23 · unverdicted · none · ref 36 · internal anchor
MEMPROBE is a benchmark for direct recovery of hidden user states from LLM agent memory, showing task success and memory recovery as distinct capabilities with moderate recovery scores around 0.6.
ChartWalker: Benchmarking the Cross-Chart RAG Task with Hierarchical Knowledge Graphs cs.IR · 2026-06-22 · unverdicted · none · ref 37 · internal anchor
ChartWalker provides a hierarchical knowledge graph construction method and structure-aware sampling to generate cross-chart RAG benchmarks, releasing ChartWalker-Bench that exposes performance gaps across RAG paradigms.
AI Fiction in the Wild cs.CL · 2026-06-22 · unverdicted · none · ref 154 · internal anchor
Analysis of 500k ChatGPT logs shows over one-third of conversations generate fiction, dominated by power users with repetitive and niche patterns.
Power Systems Agent Benchmark: Executable Evaluation of AI Agents in Electric Power Engineering cs.AI · 2026-06-18 · unverdicted · none · ref 5 · 2 links · internal anchor
Introduces the Power Systems Agent Benchmark with 41 task families across eight power engineering areas for executable evaluation of AI agents using deterministic feasibility checks.
User as Engram: Internalizing Per-User Memory as Local Parametric Edits cs.AI · 2026-06-17 · unverdicted · none · ref 58 · internal anchor
User facts are internalized as surgical local edits to a hash-keyed Engram memory table with reasoning skill held in a shared adapter, claimed to match LoRA recall, improve indirect reasoning 5.6x on average, and compose across users with 33,000x smaller footprint than per-user adapters.
Bistable by Construction: Wall-Clock-Calibrated State Monitors Have No Moment-Detection Regime at Agent Cadence cs.SE · 2026-06-15 · conditional · none · ref 22 · internal anchor
Wall-clock leaky-integrator monitors on variable-cadence agent streams exhibit a sharp cliff between constant-alarm and silent regimes, while sample-time CUSUM monitors remain invariant; transition detection avoids the trap.
RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning cs.AI · 2026-06-08 · unverdicted · none · ref 1 · internal anchor
RealMath-Eval benchmark shows LLM judges have an evaluation gap, performing worse on diverse real human math reasoning than on synthetic solutions due to greater error diversity and higher surprisal.
Cherry-pick Override: Unsafe Directional Commitment in LLM Judges under Mixed Evidence cs.SE · 2026-06-05 · unverdicted · none · ref 14 · internal anchor
The paper defines Cherry-pick Override (CCO) as unauthorized directional commitment by LLM judges under mixed evidence and quantifies its prevalence (>84% on AVeriTeC conflicting subset) while testing intervention ladders and a two-channel reference probe.
Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems cs.AI · 2026-06-05 · unverdicted · none · ref 78 · internal anchor
MAC-Bench is a new adversarial benchmark that converts legal texts into executable scenarios via the SERV pipeline to measure procedural compliance in multi-agent LLM systems using CSR and MG metrics.
Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges cs.AI · 2026-06-03 · unverdicted · none · ref 79 · internal anchor
LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).
CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks cs.CL · 2026-06-02 · conditional · none · ref 1 · internal anchor
CoEval generates task-specific benchmarks by rotating models through teacher, student, and judge roles, then weights questions by discriminative power and judges by panel consensus to recover accurate model rankings without labels.
Low-Resource Safety Failures Are Action Failures, Not Representation Failures cs.CL · 2026-05-31 · conditional · none · ref 52 · internal anchor
Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.
RWGBench: Evaluating Scholarly Positioning in Related Work Generation cs.DL · 2026-05-30 · unverdicted · none · ref 68 · internal anchor
RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.
Bounded Behavioral Indistinguishability for Black-Box LLM Distillation cs.LG · 2026-05-28 · unverdicted · none · ref 19 · internal anchor
Introduces (ε,q,t,A)-behavioral indistinguishability and shows via Qwen/Llama experiments that LoRA distillation boosts semantic similarity but leaves detectable behavioral differences under adversarial evaluation.
Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization cs.CL · 2026-05-27 · unverdicted · none · ref 17 · internal anchor
A Behavioral Specification interpretive layer improves representational accuracy for AI personalization by compressing user data into patterns, outperforming raw corpora and commercial memory systems on held-out behavioral predictions across 14 autobiographical corpora while reducing context cost.
OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents cs.AI · 2026-05-27 · unverdicted · none · ref 46 · internal anchor
OR-Space is a benchmark for LLM agents performing full-lifecycle optimization tasks across Build, Revise, and Explain modes in executable multi-artifact workspaces.
Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm cs.CL · 2026-05-27 · unverdicted · none · ref 133 · internal anchor
Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.
LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis cs.SE · 2026-05-26 · conditional · none · ref 7 · internal anchor
LogDx-CI benchmark shows hybrid grep+tail reducers achieve top diagnosis quality at low cost, agent loops shrink quality variance across reducers, and cross-family LLM summarizers outperform same-family pairs.
MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning cs.AI · 2026-05-25 · unverdicted · none · ref 55 · internal anchor
MuCRASP prunes VLMs in a CoT-aware manner, outperforming baselines by preserving reasoning quality at 30-50% compression rates on models like Qwen2.5-VL-7B.
SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models cs.CL · 2026-05-25 · unverdicted · none · ref 14 · internal anchor
SomaliBench finds large English-to-Somali refusal gaps (0.38 to 0.90) across Llama-3.1-8B, Gemma-2-9B, Qwen-2.5-7B, and Aya-23-8B, with many Somali responses being unclear rather than compliant.
ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions cs.CL · 2026-05-22 · unverdicted · none · ref 96 · internal anchor
ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.
Asking For An Old Friend: Diagnosing and Mitigating Temporal Failure Modes in LLM-based Statutory Question Answering cs.CL · 2026-05-22 · unverdicted · none · ref 19 · internal anchor
LLMs show severe staleness after training cutoffs and recency bias on historical German statutes; RAG with version filtering mitigates both better than web search.
TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization cs.AI · 2026-05-20 · unverdicted · none · ref 49 · internal anchor
A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving 60% success rate across replicates in cantilever and phone-stand tasks.
A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents cs.AI · 2026-05-19 · unverdicted · none · ref 35 · internal anchor
Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows cs.AI · 2026-05-18 · unverdicted · none · ref 57 · internal anchor
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science cs.AI · 2026-05-18 · unverdicted · none · ref 83 · internal anchor
SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.
Recall Isn't Enough: Bounding Commitments in Personalized Language Systems cs.AI · 2026-05-15 · unverdicted · none · ref 13 · internal anchor
CBEA with LCV bounds evidence sets and validates commitments before response generation, achieving zero failures in scoped tests at 0.49-0.60 availability versus near-zero for baselines.
Belief Engine: Configurable and Inspectable Stance Dynamics in Multi-Agent LLM Deliberation cs.AI · 2026-05-14 · unverdicted · none · ref 34 · internal anchor
Belief Engine is a configurable belief-update mechanism for multi-agent LLM systems that uses structured argument extraction and log-odds stance updates to make evidence-grounded deliberation inspectable and controllable.
Test-Time Hinting for Black-Box Vision-Language Models cs.CV · 2026-05-13 · unverdicted · none · ref 16 · internal anchor
Test-Time Hinting trains a hint generator to prepend contextual guidance to VLM prompts, improving accuracy on natural-image VQA benchmarks with generalization to unseen tasks and models.
Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities cs.LG · 2026-05-11 · unverdicted · none · ref 18 · internal anchor
Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.
Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics cond-mat.stat-mech · 2026-05-11 · unverdicted · none · ref 19 · internal anchor
LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.
SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding cs.LG · 2026-05-11 · unverdicted · none · ref 31 · 2 links · internal anchor
SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.
ProactBench: Beyond What The User Asked For cs.LG · 2026-05-09 · unverdicted · none · ref 156 · internal anchor
ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations cs.HC · 2026-05-09 · accept · none · ref 81 · internal anchor
LLMs routinely produce unsupported causal stories for personal sensing anomalies, and richer evidence or constrained prompts do not reliably eliminate this epistemic overreach.
GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection cs.LG · 2026-05-08 · unverdicted · none · ref 45 · internal anchor
GameGen-Verifier decomposes game specifications into keypoints, injects runtime states for targeted checks, and achieves 92.2% accuracy on 100 games while running up to 16.6x faster than agent-based baselines.
Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic cs.LG · 2026-05-08 · unverdicted · none · ref 26 · internal anchor
Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.
EditPropBench: Measuring Factual Edit Propagation in Scientific Manuscripts cs.CL · 2026-05-03 · unverdicted · none · ref 14 · internal anchor
EditPropBench evaluates LLM editors on propagating factual edits to dependent claims in synthetic scientific manuscripts, showing that even the strongest systems miss roughly 30% of required updates on hard cases.
The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice cs.LG · 2026-05-02 · unverdicted · none · ref 38 · internal anchor
An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.
Simple Self-Conditioning Adaptation for Masked Diffusion Models cs.LG · 2026-04-28 · unverdicted · none · ref 32 · internal anchor
SCMDM is a post-training self-conditioning adaptation for masked diffusion models that reduces generative perplexity by nearly 50% on OWT and improves performance on images, molecules, and genomics.
Subliminal Steering: Stronger Encoding of Hidden Signals cs.CL · 2026-04-28 · unverdicted · none · ref 15 · internal anchor
Subliminal steering transfers complex behavioral biases and the underlying steering vector through fine-tuning on innocuous data, achieving higher precision than prior prompt-based methods.
Agentic Witnessing: Pragmatic and Scalable TEE-Enabled Privacy-Preserving Auditing cs.CR · 2026-04-27 · unverdicted · none · ref 40 · internal anchor
Agentic Witnessing enables privacy-preserving auditing of semantic properties in private data by running an LLM auditor in a TEE that answers binary queries and produces cryptographic transcripts of its reasoning.

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer