Knowledge-Centric Hallucination Detection
Canonical reference. 77% of citing Pith papers cite this work as background.
citing papers explorer
-
Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts
An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.
-
Why Do Multi-Agent LLM Systems Fail?
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
-
EMPATH: A Multilingual Auditor-Judge Benchmark for Safety Evaluation of Emotional-Support Chatbots
The paper presents EMPATH, a new multilingual multi-turn benchmark for safety evaluation of emotional-support chatbots that uses separate auditor and judge models and releases its pipeline and rubrics.
-
ToolPrivacyBench: Benchmarking Purpose-Bound Privacy in Tool-Using LLM Agents
ToolPrivacyBench is a new benchmark that evaluates purpose-bound privacy over-disclosure in multi-tool LLM agent trajectories by auditing tool arguments against policy knowledge bases across 2,150 cases.
-
EEG Benchmarking Needs a Task Specification Layer: NeuroDoc for Rulebook-Guided, Executable Benchmark Construction
The paper introduces NeuroDoc and NeuroAudit to create a community-reviewed corpus of 53 EEG benchmark entries with 245 task definitions using a rulebook-guided task document and executable kernel.
-
Measuring Semantic Progress in Multi-turn Dialogue via Information Gain
A Gaussian information-gain metric in embedding space quantifies semantic progress in dialogues via uncertainty reduction and shows competitive agreement with human judgments on MT-Bench and UltraFeedback.
-
How Do LLMs Cite? A Mechanistic Interpretation of Attribution in Retrieval-Augmented Generation
Activation patching reveals that citation decisions in Llama-3.1-8B RAG are implemented by a distributed attributional ensemble of heads and layers; targeted interventions fix most missed and spurious citations on PopQA.
-
Compiling Rewrite Rules to Finite-State Transducers with the Worsening Trick
A new worsening-trick construction compiles arbitrary-context rewrite rules A → B / L _ R into FSTs with short uniform formulas that match prior transducers where semantics coincide.
-
Fingerprinting Inference Systems of Large Language Models
Inference system components of LLMs can be fingerprinted from observable prompt-response behavior due to characteristic numerical deviations.
-
LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?
LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.
-
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
ToolMerge decomposes queries into LLM-planned tool calls merged by boolean operators for long-video keyframe retrieval and introduces the M2M benchmark, showing competitive results with 5% gains on caption retrieval.
-
Layer-wise Token Compression for Efficient Document Reranking
Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, with added gains on listwise LLM rerankers and a regularizer effect for long inputs.
-
PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications
PopPy combines an ahead-of-time compiler and runtime to extract parallelism from Python compound AI applications, delivering up to 6.4x end-to-end speedups while preserving sequential semantics.
-
SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.
-
From Table to Cell: Attention for Better Reasoning with TABALIGN
TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.
-
Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization
DIPS fine-tunes LLMs to output ordered feasible decision vectors approximating Pareto fronts for constrained bi-objective convex problems, reaching 95-98% normalized hypervolume with 0.16s inference.
-
SMT-Based Active Learning of Weighted Automata
An SMT-based active learning algorithm learns minimal nondeterministic weighted automata over arbitrary semirings, with partial correctness proofs, a sufficient termination condition, and experiments showing smaller models and fewer queries than baselines.
-
The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences
The primary axis of psychometric variation among LLMs is the degree to which they represent themselves as loci of phenomenal experience rather than systems of behavioral responses.
-
Deep Graph-Language Fusion for Structure-Aware Code Generation
CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.
-
Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference
Two calls per example identify the first two moments of latent correctness probability, enabling exact bounds on the vote-accuracy curve for any majority-vote budget under conditional i.i.d. assumptions.
-
VOW: Verifiable and Oblivious Watermark Detection for Large Language Models
VOW formulates LLM watermark detection as a secure two-party computation using a Verifiable Oblivious Pseudorandom Function to achieve private and cryptographically verifiable detection.
-
When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models
ReaLM-Retrieve uses step-level uncertainty to trigger retrievals during reasoning, achieving 10.1% better F1 scores and 47% fewer calls on multi-hop QA benchmarks.
-
Factual and Edit-Sensitive Graph-to-Sequence Generation via Graph-Aware Adaptive Noising
DLM4G applies graph-aware adaptive noising in a diffusion framework to generate text from graphs, outperforming larger autoregressive and diffusion baselines in factual grounding and edit sensitivity on three datasets plus molecule captioning.
-
Exploring Agentic Visual Analytics: A Co-Evolutionary Framework of Roles and Workflows
A survey of 55 agentic VA systems proposes a co-evolutionary framework defining four agent roles (PLANNER, CREATOR, REVIEWER, CONTEXT MANAGER) mapped to visual analytics pipeline stages along with design guidelines.
-
DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions
DetailVerifyBench supplies 1,000 images and densely annotated long captions to evaluate precise hallucination localization in multimodal large language models.
-
DP-OPD: Differentially Private On-Policy Distillation for Language Models
DP-OPD achieves lower perplexity than DP fine-tuning and synthesis-based private distillation under ε=2.0 by enforcing DP-SGD solely on the student during on-policy training with a frozen teacher.
-
Spectral Tempering for Embedding Compression in Dense Passage Retrieval
Spectral Tempering derives an adaptive scaling factor γ(k) from the embedding eigenspectrum via local SNR analysis and knee-point normalization to achieve near-optimal compression without training or validation.
-
Pramana: Fine-Tuning Large Language Models for Epistemic Reasoning through Navya-Nyaya
Fine-tuning LLMs on Navya-Nyaya's six-phase reasoning structure yields 100% semantic correctness on held-out logical problems despite only 40% strict format adherence.
-
DualGuard: Dual-stream Large Language Model Watermarking Defense against Paraphrase and Spoofing Attack
DualGuard uses adaptive dual-stream watermark signals to detect and trace both paraphrase and spoofing attacks in LLM outputs while preserving text quality.
-
Bayesian Social Deduction with Graph-Informed Language Models
A hybrid Bayesian-graph LLM agent reaches competitive performance against large models and achieves a 67% win rate against humans in controlled Avalon play, outperforming baselines and human teammates.
-
Open Problems in Constitutional Preference Reconstruction
Empirical analysis across three datasets identifies three open problems in constitutional preference reconstruction and shows that principle refinement raises inter-executor agreement from 73% to 78%.
-
OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks
OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.
-
Ontology-Guided Evidence Path Inference for Multi-hop Knowledge Graph Question Answering
OPI introduces a relation-centric ontology graph enabling bidirectional retrieval and iterative refinement, yielding Hit@1/F1 gains of 4.6/5.0 on WebQSP and 8.9/3.3 on CWQ plus near-saturated Hit@1 on MetaQA.
-
INCARBench: A Benchmark for Scientific Configuration in VASP INCAR by Large Language Models
INCARBench evaluates 19 LLMs on VASP INCAR configuration generation and repair, showing high semantic accuracy but lower scientific correctness especially for DFT+U, magnetism, and correlated materials.
-
ThermoLLM: Thermodynamics-Aware HVAC Control with Spatial-Semantic Knowledge Graph
ThermoLLM uses a physics-informed spatial-semantic knowledge graph with LLMs for HVAC control in a five-zone EnergyPlus simulation and reports the best energy-comfort trade-off plus lowest PMV violations among tested methods.
-
APT: Atomic Physical Transitions for Causal Video-Language Understanding
The paper introduces APT chains as ordered causal transition sequences and APT-Tune to improve VLM transition detection while preserving event-level performance.
-
MÖVE: A Holistic LLM Benchmark for the German Public Sector
MÖVE presents a new German-language benchmark evaluating 39 LLMs on performance and governance criteria using ten public-administration datasets.
-
Can LLMs understand LilyPond? A benchmark for symbolic music generation and understanding
LilyBench evaluates open-weight LLMs on zero-shot LilyPond generation, which proves achievable, and on structural understanding tasks, which remain challenging; the authors note disagreements between metrics and release their code.
-
PaperFlow: Profiling, Recommending, and Adapting Across Daily Paper Streams
PaperFlow proposes a Profiling-Recommending-Adapting framework for longitudinal scientific paper recommendation and evaluates it on a new user-day benchmark with 24 simulated users, outperforming five baselines in ranking, behavioral alignment, and blind human evaluation.
-
Multilingual Unlearning in LLMs: Transfer, Dynamics, and Reversibility
Unlearning in multilingual LLMs suppresses knowledge in later layers rather than erasing it; transfer varies with language similarity, and the suppression is reversible via inference-time steering.
-
AdaCodec: A Predictive Visual Code for Video MLLMs
AdaCodec introduces a predictive visual code that cuts visual token use in video MLLMs by sending full frames only on high predictive cost and otherwise encoding inter-frame changes as P-tokens, yielding better benchmark scores at lower budgets.
-
Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
Larger models succeed on rare and complex tasks by reducing gradient interference from common tasks, allowing rare-task features to accumulate, as shown via synthetic task mixtures and OLMo pretraining from 4M to 4B parameters.
-
TCP-MCP: Landscape-Guided Co-Evolution of Prompts and Communication Topologies for Multi-Agent Systems
TCP-MCP co-evolves prompts and topologies for multi-agent systems, reporting 82.66-96.61% accuracy on MMLU-Pro/MMLU/GSM8K while using up to 5.69x fewer tokens than debate baselines.
-
JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors
JuICE is a new multilingual benchmark dataset showing that top LLM judges reach an F1 of only 0.52 on span-level cultural error detection and miss errors that locals readily spot.
-
CRPO: Character-centric Group Relative Policy Optimization for Role-aware Reasoning in Role-playing Agents
CRPO modifies GRPO with three mechanisms—decoupling task and style rewards, adapting constraints to character complexity, and using generic responses as negative baselines—to improve character fidelity in role-playing agents.
-
Graph Alignment Topology as an Inductive Bias for Grounding Detection
A GNN trained on bipartite alignment graphs between references and LLM generations reports state-of-the-art hallucination detection across four datasets, beating prior methods and GPT-4o.
-
Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection
Weasel is a trajectory selection method that improves out-of-domain generalization for web agents while achieving 9.7-12.5x training speedups via importance-diversity optimization, AXTree pruning, and rationale style matching.
-
TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction
TRACE uses cross-layer candidate trajectories inside frozen LLMs to dynamically select and apply one of three correction operators, delivering mean gains of +12.26 MC1 and +8.65 MC2 points across 15 models and 3 benchmarks with no regressions.
-
Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training
Learning-Zone Energy is a new online data selection framework for RL post-training that retains 40% of data per step yet matches or exceeds full-data baselines on math tasks with 36% lower FLOPs.
-
ALSO: Adversarial Online Strategy Optimization for Social Agents
ALSO frames social agent interactions as an adversarial bandit problem with a neural reward predictor to enable online strategy optimization in non-stationary multi-agent simulations.