A Survey on LLM-as-a-Judge

Chengjin Xu, Hexiang Tan, Jiawei Gu, Xuehao Zhai, Xuhui Jiang, Zhichao Shi · 2024 · cs.CL · arXiv 2411.15594

Mixed citation behavior. Most common role is background (70%).

177 Pith papers citing it

Background 70% of classified citations

abstract

Accurate and consistent evaluation is crucial for decision-making across numerous fields, yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large Language Models (LLMs) have achieved remarkable success across diverse domains, leading to the emergence of "LLM-as-a-Judge," where LLMs are employed as evaluators for complex tasks. With their ability to process diverse data types and provide scalable, cost-effective, and consistent assessments, LLMs present a compelling alternative to traditional expert-driven evaluations. However, ensuring the reliability of LLM-as-a-Judge systems remains a significant challenge that requires careful design and standardization. This paper provides a comprehensive survey of LLM-as-a-Judge, addressing the core question: How can reliable LLM-as-a-Judge systems be built? We explore strategies to enhance reliability, including improving consistency, mitigating biases, and adapting to diverse assessment scenarios. Additionally, we propose methodologies for evaluating the reliability of LLM-as-a-Judge systems, supported by a novel benchmark designed for this purpose. To advance the development and real-world deployment of LLM-as-a-Judge systems, we also discuss practical applications, challenges, and future directions. This survey serves as a foundational reference for researchers and practitioners in this rapidly evolving field.

citation-role summary

background 18 · method 5

citation-polarity summary

background 16 · use method 5 · unclear 2

representative citing papers

Simulated Customers Never Walk Away: Decision Fidelity of LLM User Simulators Measured Against Real Purchase Outcomes

cs.AI · 2026-06-16 · accept · novelty 8.0

LLM user simulators exhibit a disengagement deficit: they match real buyers but systematically overstate purchase intent among real non-buyers by reducing expressed resistance and increasing deliberation.

FollowTable: A Benchmark for Instruction-Following Table Retrieval

cs.IR · 2026-05-01 · unverdicted · novelty 8.0

FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond topical similarity.

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

cs.AI · 2026-04-20 · accept · novelty 8.0

MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmentation yielding up to 12% gains.

GIANTS: Generative Insight Anticipation from Scientific Literature

cs.CL · 2026-04-10 · unverdicted · novelty 8.0

GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.

MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation

cs.CL · 2025-07-28 · accept · novelty 8.0

MediQAl is a new French medical QA benchmark with 32k exam-sourced questions in three formats and cognitive labels, evaluated on 14 LLMs to reveal gaps between factual recall and reasoning performance.

FARS: A Fully Automated Research System Deployed at Scale

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

FARS deployed at scale produced 166 AI/ML papers across 67 topics that received 282 structured human reviews indicating some review-worthy outputs alongside recurring failure modes.

COHORT: Collaborative Orchestration for Hardening via Offensive Replay on Emulated Topologies

cs.NI · 2026-06-29 · unverdicted · novelty 7.0

COHORT automates mitigation generation for network attacks via collaborative LLMs on emulated topologies with offensive replay evaluation, reporting 46.7% success rate that is 4.4 times higher than a single-agent baseline.

LLM-Based Examination of Eligibility Criteria from Securities Prospectuses at the German Central Bank

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

LLMs are applied in a generative pipeline for extracting, normalizing, and interpreting eligibility criteria from securities prospectuses, achieving up to 91% precision in document-level decisions with a conservative bias.

C3-Bench: A Context-Aware Change Captioning Benchmark

cs.CV · 2026-06-24 · unverdicted · novelty 7.0

C3-Bench supplies a multi-domain dataset and LLM-based evaluation protocol that exposes systematic failures in existing change captioning models outside their training regimes.

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

RealMath-Eval benchmark shows LLM judges have an evaluation gap, performing worse on diverse real human math reasoning than on synthetic solutions due to greater error diversity and higher surprisal.

ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

ReasoningFlow represents LLM reasoning traces as DAGs, finding structural similarity across models and that most erroneous steps are unused in final answers.

CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

cs.CL · 2026-06-02 · conditional · novelty 7.0

CoEval generates task-specific benchmarks by rotating models through teacher, student, and judge roles, then weights questions by discriminative power and judges by panel consensus to recover accurate model rankings without labels.

Synthetic Hallucinations, Real Gains: Hard Negatives from Frontier Models for FIM Hallucination Mitigation

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Using frontier models to synthesize plausible-but-wrong FIM completions as hard negatives for SFT improves Delulu exact match by +18.8 and edit similarity by +0.22 on Qwen2.5-Coder-7B while also lifting HumanEval-Infilling and SAFIM.

Matter to Mechanism: A Benchmark for AI Co-Scientists in Materials and Battery Research

cs.CE · 2026-06-01 · unverdicted · novelty 7.0

Introduces the Matter to Mechanism benchmark of 2,645 structured instances and a composite metric suite for evaluating AI co-scientists on problem-to-hypothesis reasoning in battery materials research.

RWGBench: Evaluating Scholarly Positioning in Related Work Generation

cs.DL · 2026-05-30 · unverdicted · novelty 7.0

RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.

A Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

VIABLE benchmark reveals existing VLM judges are unreliable for VIA tasks (GPT-5.4 at 52.6% diagnostic accuracy with 94.2% self-preference) and proposes VIA-Judge-Agent for improvements.

DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

DirectorBench is a profile-aware diagnostic benchmark that localizes bottlenecks in long-form video generation workflows using structured checkpoints and multi-agent evaluation.

KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

Constructs KVoiceBench, KOpenAudioBench, and KMMAU using agent-driven transfer frameworks from English benchmarks and Korean ASR data, then evaluates eight SpeechLMs to show model-specific gaps and complementary weaknesses between SpokenQA and audio understanding.

Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.

Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation

cs.CL · 2026-05-26 · unverdicted · novelty 7.0

EduVideoBench is a new KSA-grounded benchmark that evaluates five frontier video generation models and finds substantial gaps in educational validity across knowledge, skills, and attitudes.

ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions

cs.CL · 2026-05-22 · unverdicted · novelty 7.0

ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.

GS-QA: A Benchmark for Geospatial Question Answering

cs.DB · 2026-05-21 · unverdicted · novelty 7.0

GS-QA is a new benchmark of 2,800 QA pairs on 28 templates using OSM and Wikipedia data to evaluate LLMs on spatial predicates, multi-source reasoning, and diverse answer types including distances and counts.

Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)

cs.CR · 2026-05-19 · accept · novelty 7.0

Systematic review of thirteen malicious-code prompt corpora for coding LLM refusal evaluation that catalogs construction methods, surfaces gaps in human baselines, cross-corpus comparability, and malware taxonomies, and proposes methodological improvements.

Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

ConsumerSimBench evaluates 13 LLMs on reconstructing crowd reactions from 1,553 Chinese social-media topics using 23,122 auditable yes-no criteria, finding maximum coverage of 47.8% by Gemini-3.1-Pro.

citing papers explorer

Showing 50 of 177 citing papers.

Membership Inference Attacks for Retrieval Based In-Context Learning for Document Question Answering

cs.CR · 2026-05-05 · unverdicted · none · ref 19

Black-box membership inference attacks on retrieval-based in-context learning for document QA succeed via query prefixes, with a novel weighted-averaging method outperforming priors even under paraphrasing.

VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation

cs.CL · 2026-05-03 · unverdicted · none · ref 105

VIDA provides 2,500 visually-dependent ambiguous translation examples and span-level disambiguation metrics; CoT-SFT on LVLMs improves out-of-distribution performance over standard SFT.

Theoria: Rewrite-Acceptability Verification over Informal Reasoning States

cs.AI · 2026-07-01 · unverdicted · none · ref 9 · internal anchor

Theoria rewrites solutions into auditable typed state transitions with justifications, certifying 105 of 185 HLE problems at 91.4% precision and outperforming holistic judges on adversarial poisoned proofs by catching hidden premises.

Rigel: Self-Distilled Score Adaptation for Image and Video Captioning Evaluation

cs.CV · 2026-06-29 · unverdicted · none · ref 1 · internal anchor

Rigel is a self-distilled LLM-based metric for image and video caption evaluation that reports over 10-point gains on ActivityNet-Fact in reference-free settings.

Stop Hand-Holding Your Coding Agent: Engineering the Loops that Replace Step-by-Step Prompting

cs.SE · 2026-06-28 · unverdicted · none · ref 12 · internal anchor

Introduces loop engineering as a distinct practice layer for coding agents, supplies a taxonomy and verification ladder, and analyzes a hand-coded corpus of fifty real loops.

Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries

cs.CR · 2026-06-25 · unverdicted · none · ref 2 · internal anchor

Bandit algorithms learn optimal jailbreaks from noisy exploration and, paired with complexity-enhanced queries in FrankensteinBench, achieve up to 97% attack success on 15 open-weight LLMs.

Escaping the Self-Confirmation Trap: An Execute-Distill-Verify Paradigm for Agentic Experience Learning

cs.CL · 2026-06-23 · unverdicted · none · ref 7 · internal anchor

EDV decouples execution, distillation by a third-party agent, and consensus verification to filter erroneous trajectories in LLM agent experience learning, outperforming baselines on tau2-bench, Mind2Web, and MMTB.

HistoRAG: Embedding Historical Methodology in Retrieval-Augmented Generation Through Critical Technical Practice

cs.CL · 2026-06-16 · unverdicted · none · ref 13 · internal anchor

HistoRAG embeds historiographical principles into RAG via temporal windowing, decoupled retrieval, and contestable LLM relevance judgments, evaluated on 102k Der Spiegel articles from 1950-1979.

Catching One in Five: LLM-as-Judge Blind Spots in Production Multi-Turn Transaction Agents

cs.CL · 2026-06-09 · accept · none · ref 3 · internal anchor

Empirical study of a production multi-turn ordering agent finds LLM-as-judge recall below 25% for human-confirmed defects, missing cross-turn state issues due to limited rubric and routing.

LLM-Orchestrated Conformance Checking in Stroke Care Without Computer-Interpretable Guidelines

cs.AI · 2026-06-08 · unverdicted · none · ref 13 · internal anchor

An LLM-orchestrated framework enables conformance checking in stroke care from unstructured texts, achieving over 86% conformance in hospital data.

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

cs.CV · 2026-06-08 · unverdicted · none · ref 15 · internal anchor

Z-Reward trains a 27B reasoning teacher VLM on score distributions via GDSO and distills it via RISD into a 9B student, reaching 89.6% and 88.6% human preference accuracy with 41.3% optimization gain over SFT baseline.

Learn from Your Mistakes: Tree-like Self-Play for Secure Code LLMs

cs.CR · 2026-06-02 · unverdicted · none · ref 43 · internal anchor

TSP reframes secure code generation as a tree-structured self-play process that supplies dense on-policy signals at vulnerability-prone nodes, yielding higher security pass rates and cross-language generalization than SFT or unstructured self-play.

Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling

cs.CL · 2026-06-02 · unverdicted · none · ref 33 · internal anchor

RL-trained lightweight controller using answer statistics improves trade-offs among correctness, latency, and total samples in adaptive sampling for LLM test-time scaling.

Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification

cs.AI · 2026-06-02 · unverdicted · none · ref 31 · internal anchor

The authors introduce a three-part ontology-based verification system for AI agents that generates regulatory and adversarial test scenarios and issues machine-verifiable trust certificates, with pilot results indicating improved coverage over baselines in four industries.

Capability Advertisement as a Market for Lemons: A Trust Layer for Heterogeneous Agent Networks

cs.MA · 2026-06-02 · unverdicted · none · ref 10 · internal anchor

Proposes a Trust Layer on top of existing agent protocols that adds probabilistic capability descriptors, screening, and reputation to enable separating equilibria and bound delegation reliability.

Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

cs.AI · 2026-06-01 · unverdicted · none · ref 68 · internal anchor

Traj-Evolve combines non-parametric experience retrieval and multi-agent RL with a leave-one-out unification strategy to outperform baselines on lung cancer prediction from up to five years of multimodal EHRs, including in never-smokers.

AMix-2: Establishing Protein as a Native Modality in Large Language Models

q-bio.BM · 2026-05-29 · unverdicted · none · ref 53 · internal anchor

AMix-2 unifies protein sequences and text in one LLM via shared tokens and block-wise diffusion modeling, introduces the ProteinArena benchmark, and reports competitive performance against task-specific protein models and frontier LLMs.

Knowledge Dependency Estimation for Reliable Question Answering

cs.CL · 2026-05-27 · unverdicted · none · ref 2 · internal anchor

Knot estimates QA model sensitivity to candidate knowledge via subset counterfactual training and latent factor coverage, yielding unit rankings that outperform baselines without extra model calls.

Auditing Stance Asymmetry in Generative Explanations

cs.CL · 2026-05-27 · unverdicted · none · ref 6 · internal anchor

Introduces Symmetry Decomposition Evaluation (SDE) to audit stable stance asymmetries in generative explanations using paired situations, role rewrites, and evidence controls on a 32-family prototype suite.

A Multi-Agent LLM Framework for Rating the Quality of Surgical Feedback

cs.CL · 2026-05-25 · unverdicted · none · ref 19 · internal anchor

A multi-agent LLM system discovers criteria such as Encouraging, Urgent, and Clear for surgical feedback and uses them to score 4.2k instances, outperforming prior content-based approaches in predicting trainee behavior changes and trainer approval.

STREAM: A Data-Centric Framework for Mining High-Value Task-Oriented Dialogues from Streaming Media

cs.CL · 2026-05-24 · unverdicted · none · ref 10 · internal anchor

Stream mines streaming media to create and release StreamDial, a dataset of 87,498 structured task-oriented dialogue sessions across automotive, restaurant, and hotel domains using persona construction, Conversational Blueprints, and RAG.

Who judges the judges? Governance from metrics: a runtime framework for continuous LLM compliance monitoring

cs.CL · 2026-05-23 · unverdicted · none · ref 7 · internal anchor

govllm turns compliance into a runtime metric by routing via accumulated scores from a jury of regulatory LLM judges, with disagreement treated as a human-arbitration signal, validated on 49 annotated pairs showing 51.5-69.1% agreement across four SLMs.

AstroMind: A High-Fidelity Benchmark for Spacecraft Behavior Reasoning Based on Large Language Models

cs.CL · 2026-05-23 · unverdicted · none · ref 35 · internal anchor

AstroMind is a new physics-grounded benchmark for LLM reasoning on spacecraft behavior across intent inference, maneuver estimation, and threat assessment, evaluated on several open-weight models.

Distinguishing Right from Wrong in Debates: Attribution Analysis of Chinese Harmful Memes

cs.CL · 2026-05-23 · unverdicted · none · ref 36 · internal anchor

Introduces Ex-ToxiCN-MM dataset and RIKE framework (with AKE and RIR modules) that outperforms baselines on attributing harm in ambiguous Chinese memes using C-HarmKB.

Towards Context-Invariant Safety Alignment for Large Language Models

cs.CL · 2026-05-20 · unverdicted · none · ref 80 · internal anchor

Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.

Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models

cs.CL · 2026-05-20 · unverdicted · none · ref 62 · internal anchor

Evaluation of 6233 MedGPTs finds 25-30% with low factual accuracy, 33.6-54.3% violating operational thresholds, and 57% of action-enabled models lacking privacy disclosures.

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

cs.CL · 2026-05-14 · unverdicted · none · ref 269 · internal anchor

A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.

In-IDE Toolkit for Developers of AI-Based Features

cs.SE · 2026-05-14 · unverdicted · none · ref 13 · internal anchor

Presents an AI Toolkit plugin for JetBrains IDEs that integrates trace capture and evaluation into the Run/Debug loop, guided by practitioner needs and showing early adoption signals in PyCharm.

Given, When, Then, Again: Mining Subscenario Refactoring Candidates in Behaviour-Driven Test Suites with ML Classifiers and LLM-Judge Baselines

cs.SE · 2026-05-14 · unverdicted · none · ref 45 · internal anchor

A pipeline using SBERT/UMAP/HDBSCAN clustering on 339 repositories identifies 692k recurring Gherkin slices, labels 200 of them, and trains an XGBoost model that achieves F1 0.891 for extraction-worthiness, outperforming rule and LLM baselines, with prevalence statistics released.

Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems

cs.AI · 2026-05-14 · unverdicted · none · ref 27 · internal anchor

HEAR uses a stratified hypergraph ontology to orchestrate evidence-driven multi-hop reasoning over heterogeneous business systems, reaching 94.7% accuracy on supply-chain root-cause tasks with open-weight models.

SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle

cs.SE · 2026-05-13 · unverdicted · none · ref 18 · internal anchor

SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.

Just Ask for a Table: A Thirty-Token User Prompt Defeats Sponsored Recommendations in Twelve LLMs

cs.CV · 2026-05-12 · accept · none · ref 4 · internal anchor

A 30-token prompt requesting a neutral comparison table cuts sponsored recommendations in LLMs from roughly 50% to near zero.

PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement

cs.AI · 2026-05-11 · unverdicted · none · ref 6 · internal anchor

PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.

A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability

cs.LG · 2026-05-09 · unverdicted · none · ref 73 · internal anchor

LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.

Characterizing and Mitigating False-Positive Bug Reports in the Linux Kernel

cs.SE · 2026-05-08 · conditional · none · ref 21 · internal anchor

False-positive bug reports in the Linux kernel consume effort comparable to real bugs and can be filtered by LLMs using retrieval-augmented generation at 88% F1.

DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models

cs.CV · 2026-05-07 · unverdicted · none · ref 8 · internal anchor

DynT2I-Eval creates fresh prompts via dimension decomposition and dynamic sampling to evaluate text-to-image models on text alignment, quality, and aesthetics while maintaining a stable leaderboard.

A Survey on LLM-based Conversational User Simulation

cs.CL · 2026-04-27 · unverdicted · none · ref 3 · internal anchor

A survey that introduces a taxonomy for LLM-based conversational user simulation, analyzes core techniques and evaluation methods, and identifies open challenges in the field.

Exploring Audio Hallucination in Egocentric Video Understanding

cs.CV · 2026-04-26 · unverdicted · none · ref 21 · internal anchor

AV-LLMs hallucinate audio from visuals in egocentric videos, scoring only 27.3% accuracy on foreground sounds and 39.5% on background sounds in a 1000-question evaluation.

OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

cs.MM · 2026-04-25 · unverdicted · none · ref 48 · internal anchor

OceanPile is a new multimodal corpus with unified data collection, instruction tuning set, and benchmark to train foundation models for ocean science.

Evian: Towards Explainable Visual Instruction-tuning Data Auditing

cs.CV · 2026-04-22 · unverdicted · none · ref 2 · internal anchor

EVian decomposes vision-language model responses into three cognitive components and audits them along consistency, coherence, and accuracy axes, showing that a small curated subset outperforms much larger training sets.

Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models

cs.AI · 2026-04-16 · unverdicted · none · ref 2 · internal anchor

A generative reward model supplies separate semantic and turn-taking scores for spoken dialogues to enable more reliable reinforcement learning.

Leveraging LLM-GNN Integration for Open-World Question Answering over Knowledge Graphs

cs.CL · 2026-04-15 · unverdicted · none · ref 1 · internal anchor

GLOW integrates a pre-trained GNN for candidate prediction with an LLM for joint symbolic-semantic reasoning over incomplete KGs, reporting up to 53.3% gains on standard benchmarks and a new GLOW-BENCH dataset.

Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation

cs.CL · 2026-04-13 · unverdicted · none · ref 2 · internal anchor

MISE proves that hindsight self-evaluation rewards equal minimizing mutual information plus KL divergence to a proxy policy, and experiments show 7B LLMs reaching GPT-4o-level results on validation tasks.

Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval

cs.AI · 2026-04-13 · unverdicted · none · ref 14 · internal anchor

A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.

TrajOnco: a multi-agent framework for temporal reasoning over longitudinal EHR for multi-cancer early detection

cs.AI · 2026-04-12 · unverdicted · none · ref 50 · internal anchor

TrajOnco uses a chain-of-agents LLM architecture with memory to perform temporal reasoning on longitudinal EHR, achieving 0.64-0.80 AUROC for 1-year multi-cancer risk prediction in zero-shot mode on matched cohorts while matching supervised ML on lung cancer and outperforming single-agent baselines.

Pioneer Agent: Continual Improvement of Small Language Models in Production

cs.AI · 2026-04-10 · unverdicted · none · ref 43 · internal anchor

Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style tasks.

Task-Aware LLM Routing with Multi-Level Task-Profile-Guided Data Synthesis for Cold-Start Scenarios

cs.CL · 2026-04-10 · unverdicted · none · ref 1 · internal anchor

A hierarchical task taxonomy and synthetic data generation framework enables TRouter to perform effective LLM routing in cold-start scenarios by modeling query-conditioned cost and performance via latent task types.

QoS-QoE Translation with Large Language Model

cs.MM · 2026-04-09 · unverdicted · none · ref 11 · internal anchor

A new QoS-QoE Translation dataset is constructed from multimedia literature and fine-tuned LLMs demonstrate strong performance on bidirectional continuous and discrete QoS-QoE predictions.

Rag Performance Prediction for Question Answering

cs.CL · 2026-04-09 · unverdicted · none · ref 23 · internal anchor

A novel supervised predictor modeling semantic relationships among question, retrieved passages, and generated answer best forecasts when RAG improves QA performance.

Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge

cs.AI · 2026-04-07 · unverdicted · none · ref 14 · internal anchor

Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.
