Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
92 Pith papers cite this work.
abstract
Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. The MT-bench questions, 3K expert votes, and 30K conversations with human preferences are publicly available at https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge.
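The pairwise judging and position-bias mitigation described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's reference implementation: `judge_fn` stands in for any call to a judge model that returns "A", "B", or "tie", and the prompt wording and helper names are assumptions.

```python
def build_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Pairwise comparison prompt: the judge picks A, B, or tie."""
    return (
        "Please act as an impartial judge and decide which assistant "
        "answers the user question better.\n"
        f"[Question]\n{question}\n"
        f"[Assistant A]\n{answer_a}\n"
        f"[Assistant B]\n{answer_b}\n"
        'Reply with "A", "B", or "tie".'
    )

def judge_pair(judge_fn, question, answer_1, answer_2):
    """Query the judge twice with the answers' positions swapped, so a
    position-biased judge cannot favor one slot; if the two orderings
    disagree, the verdict falls back to a tie."""
    first = judge_fn(build_prompt(question, answer_1, answer_2))
    swapped = judge_fn(build_prompt(question, answer_2, answer_1))
    # Map the swapped verdict back to the original answer order.
    swapped = {"A": "B", "B": "A", "tie": "tie"}[swapped]
    return first if first == swapped else "tie"

def agreement(judge_votes, human_votes):
    """Fraction of comparisons where judge and human verdicts match."""
    matches = sum(j == h for j, h in zip(judge_votes, human_votes))
    return matches / len(judge_votes)
```

Swapping positions and collapsing inconsistent verdicts into a tie is one of the mitigations the paper discusses for position bias; the `agreement` helper mirrors the judge-human agreement rate reported in the abstract.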
citing papers explorer
- Why Do Multi-Agent LLM Systems Fail?
  The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
  LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
- Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics
  LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.
- SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding
  SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.
- ProactBench: Beyond What The User Asked For
  ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
- Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations
  LLMs routinely produce unsupported causal stories for personal sensing anomalies, and richer evidence or constrained prompts do not reliably eliminate this epistemic overreach.
- GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection
  GameGen-Verifier decomposes game specifications into keypoints, injects runtime states for targeted checks, and achieves 92.2% accuracy on 100 games while running up to 16.6x faster than agent-based baselines.
- Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic
  Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.
- EditPropBench: Measuring Factual Edit Propagation in Scientific Manuscripts
  EditPropBench evaluates LLM editors on propagating factual edits to dependent claims in synthetic scientific manuscripts, showing that even the strongest systems miss roughly 30% of required updates on hard cases.
- The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice
  An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.
- Subliminal Steering: Stronger Encoding of Hidden Signals
  Subliminal steering transfers complex behavioral biases and the underlying steering vector through fine-tuning on innocuous data, achieving higher precision than prior prompt-based methods.
- Agentic Witnessing: Pragmatic and Scalable TEE-Enabled Privacy-Preserving Auditing
  Agentic Witnessing enables privacy-preserving auditing of semantic properties in private data by running an LLM auditor in a TEE that answers binary queries and produces cryptographic transcripts of its reasoning.
- AsmRAG: LLM-Driven Malware Detection by Retrieving Functionally Similar Assembly Code
  AsmRAG detects malware at 96% F1 and attributes families at 95% F1 by retrieving functionally similar assembly code via LLM embeddings and density-weighted anchor selection, remaining robust to metamorphic obfuscation.
- HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs
  HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.
- More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models
  Position bias scales positively with reasoning trajectory length in CoT models, shown by partial correlations and truncation interventions across multiple benchmarks and model scales.
- Don't Start What You Can't Finish: A Counterfactual Audit of Support-State Triage in LLM Agents
  LLM agents overcommit on non-complete tasks at 41.7% unless given explicit support-state categories, which raise typed deferral accuracy to 91.7%.
- Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
  AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.
- The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents
  A parallel Cognitive Companion architecture reduces repetition in LLM agents by 52-62% on loop-prone tasks using LLM monitoring with 11% overhead or zero-overhead probes on hidden states, with benefits depending on task type.
- CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems
  CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform them and generalize to new domains.
- Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
  Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
- IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling
  IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.
- An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks
  An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83.3% acceptable excerpts and human preference in 64.8% of blind comparisons.
- FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
  FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
- LatentAudit: Real-Time White-Box Faithfulness Monitoring for Retrieval-Augmented Generation with Verifiable Deployment
  LatentAudit monitors RAG faithfulness in real time via Mahalanobis distance on residual-stream activations, reaching 0.942 AUROC on PubMedQA with 0.77 ms overhead and supporting Groth16 verification.
- From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering
  Docling with hierarchical splitting reaches 94.1% RAG accuracy on domain documents, beating naive PDF loading but trailing manual Markdown curation at 97.1%.
- KTO: Model Alignment as Prospect Theoretic Optimization
  KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
  Fine-tuning aligned LLMs compromises safety guardrails even with minimal adversarial examples or benign data, creating new risks not covered by existing inference-time protections.
- WizardLM: Empowering large pre-trained language models to follow complex instructions
  WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.
- SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
  SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.
- Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation
  Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.
- OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
  OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
- BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models
  BioTool dataset enables fine-tuning a 4B-parameter LLM to outperform GPT-5.1 in biomedical tool calling while improving downstream answer quality per human experts.
- RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
  RLearner-LLM achieves up to 6x gains in NLI entailment over standard fine-tuning by using an automated hybrid DPO pipeline that balances logic and fluency across multiple model sizes and domains.
- DocSync: Agentic Documentation Maintenance via Critic-Guided Reflexion
  DocSync fuses AST-aware retrieval with an iterative critic loop to update documentation, outperforming CodeT5-base on semantic alignment and automated judge scores in a proxy code-to-text task.
- Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework
  The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for continuous production evaluation with an open-source implementation.
- VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models
  Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.
- Iterative Finetuning is Mostly Idempotent
  Iterative self-finetuning of LLMs mostly fails to amplify seeded behavioral traits, with amplification limited to specific DPO setups and often harming coherence.
- Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis
  Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on expert-labeled samples.
- Diversity in Large Language Models under Supervised Fine-Tuning
  TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
- Simple Self-Conditioning Adaptation for Masked Diffusion Models
  SCMDM adapts trained masked diffusion models to condition denoising steps on their own prior clean predictions, cutting generative perplexity nearly in half on open-web text while improving discretized image, molecule, and genomic synthesis.
- Where and What: Reasoning Dynamic and Implicit Preferences in Situated Conversational Recommendation
  SiPeR improves recommendation accuracy and response quality in situated conversations by estimating scene transitions and performing Bayesian inverse inference with multimodal LLMs.
- ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks
  ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing rankings between MCQ and LLM-judge scoring.
- CreativeGame: Toward Mechanic-Aware Creative Game Generation
  CreativeGame enables iterative HTML5 game generation via mechanic-guided planning, lineage memory, runtime validation, and programmatic rewards to produce inspectable version-to-version mechanic evolution.
- ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
  ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lower cost.
- LLMs Corrupt Your Documents When You Delegate
  LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
- Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels?
  A calibrated three-model LLM jury scores medical diagnoses and clinical reasoning on real hospital cases with higher agreement to primary expert panels and fewer severe errors than human re-scoring panels.
- Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval
  A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.
- TrajOnco: a multi-agent framework for temporal reasoning over longitudinal EHR for multi-cancer early detection
  TrajOnco uses a chain-of-agents LLM architecture with memory to perform temporal reasoning on longitudinal EHR, achieving 0.64-0.80 AUROC for 1-year multi-cancer risk prediction in zero-shot mode on matched cohorts while matching supervised ML on lung cancer and outperforming single-agent baselines.
- In-situ process monitoring for defect detection in wire-arc additive manufacturing: an agentic AI approach
  A multi-agent AI framework using processing and acoustic agents achieves 91.6% accuracy and 0.821 F1 score for in-situ porosity defect detection in wire-arc additive manufacturing.
- IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
  AI models exhibit identity-contingent withholding, providing better clinical guidance on benzodiazepine tapering to physicians than laypeople in identical scenarios, with a measured decoupling gap of +0.38 and 13.1 percentage point drop in safety-critical action hit rates.