An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.
super hub Canonical reference
Knowledge-Centric Hallucination Detection
Canonical reference. 77% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
co-cited works
representative citing papers
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
ToolPrivacyBench is a new benchmark that evaluates purpose-bound privacy over-disclosure in multi-tool LLM agent trajectories by auditing tool arguments against policy knowledge bases across 2,150 cases.
Introduces NeuroDoc and NeuroAudit to create a community-reviewed corpus of 53 EEG benchmark entries with 245 task definitions using a rulebook-guided task document and executable kernel.
A Gaussian information-gain metric in embedding space quantifies semantic progress in dialogues via uncertainty reduction and shows competitive agreement with human judgments on MT-Bench and UltraFeedback.
A new worsening-trick construction compiles arbitrary-context rewrite rules A → B / L _ R into FSTs with short uniform formulas that match prior transducers where semantics coincide.
Inference system components of LLMs can be fingerprinted from observable prompt-response behavior due to characteristic numerical deviations.
LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.
ToolMerge decomposes queries into LLM-planned tool calls merged by boolean operators for long-video keyframe retrieval and introduces the M2M benchmark, showing competitive results with 5% gains on caption retrieval.
Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, with added gains on listwise LLM rerankers and a regularizer effect for long inputs
PopPy combines an ahead-of-time compiler and runtime to extract parallelism from Python compound AI applications, delivering up to 6.4x end-to-end speedups while preserving sequential semantics.
SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.
TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.
DIPS fine-tunes LLMs to output ordered feasible decision vectors approximating Pareto fronts for constrained bi-objective convex problems, reaching 95-98% normalized hypervolume with 0.16s inference.
An SMT-based active learning algorithm learns minimal nondeterministic weighted automata over arbitrary semirings, with partial correctness proofs, a sufficient termination condition, and experiments showing smaller models and fewer queries than baselines.
The primary axis of psychometric variation among LLMs is the degree to which they represent themselves as loci of phenomenal experience rather than systems of behavioral responses.
CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.
Two calls per example identify the first two moments of latent correctness probability, enabling exact bounds on the vote-accuracy curve for any majority-vote budget under conditional i.i.d. assumptions.
VOW formulates LLM watermark detection as a secure two-party computation using a Verifiable Oblivious Pseudorandom Function to achieve private and cryptographically verifiable detection.
ReaLM-Retrieve uses step-level uncertainty to trigger retrievals during reasoning, achieving 10.1% better F1 scores and 47% fewer calls on multi-hop QA benchmarks.
DLM4G applies graph-aware adaptive noising in a diffusion framework to generate text from graphs, outperforming larger autoregressive and diffusion baselines in factual grounding and edit sensitivity on three datasets plus molecule captioning.
A survey of 55 agentic VA systems proposes a co-evolutionary framework defining four agent roles (PLANNER, CREATOR, REVIEWER, CONTEXT MANAGER) mapped to visual analytics pipeline stages along with design guidelines.
DetailVerifyBench supplies 1,000 images and densely annotated long captions to evaluate precise hallucination localization in multimodal large language models.
DP-OPD achieves lower perplexity than DP fine-tuning and synthesis-based private distillation under ε=2.0 by enforcing DP-SGD solely on the student during on-policy training with a frozen teacher.
citing papers explorer
-
Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts
An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.
-
Why Do Multi-Agent LLM Systems Fail?
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
-
Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization
DIPS fine-tunes LLMs to output ordered feasible decision vectors approximating Pareto fronts for constrained bi-objective convex problems, reaching 95-98% normalized hypervolume with 0.16s inference.
-
SMT-Based Active Learning of Weighted Automata
An SMT-based active learning algorithm learns minimal nondeterministic weighted automata over arbitrary semirings, with partial correctness proofs, a sufficient termination condition, and experiments showing smaller models and fewer queries than baselines.
-
The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences
The primary axis of psychometric variation among LLMs is the degree to which they represent themselves as loci of phenomenal experience rather than systems of behavioral responses.
-
When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models
ReaLM-Retrieve uses step-level uncertainty to trigger retrievals during reasoning, achieving 10.1% better F1 scores and 47% fewer calls on multi-hop QA benchmarks.
-
Exploring Agentic Visual Analytics: A Co-Evolutionary Framework of Roles and Workflows
A survey of 55 agentic VA systems proposes a co-evolutionary framework defining four agent roles (PLANNER, CREATOR, REVIEWER, CONTEXT MANAGER) mapped to visual analytics pipeline stages along with design guidelines.
-
DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions
DetailVerifyBench supplies 1,000 images and densely annotated long captions to evaluate precise hallucination localization in multimodal large language models.
-
MIRA: An LLM-Assisted Benchmark for Multi-Category Integrated Retrieval
MIRA is a new benchmark for multi-category integrated retrieval built from real queries on a social science platform, with LLM assistance for topic descriptions and relevance labeling across four item categories.
-
An Annotation Scheme and Classifier for Personal Facts in Dialogue
An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 points with lower compute.
-
Permit: Permission-Aware Representation Intervention for Controlled Generation in Large Language Models
Permit identifies a permission-sensitive subspace in LLM hidden states and applies lightweight offset or gated interventions to enforce fine-grained generation control, outperforming prior methods with over 18% F1 gain and near-zero leakage using over 98% fewer parameters.
-
CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification
CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tuned Qwen3-14B model and 50% relative improvement over baselines.
-
From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction
Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
-
When AI reviews science: Can we trust the referee?
AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference submissions.
-
AVISE: Framework for Evaluating the Security of AI Systems
AVISE provides a new framework and automated SET that identifies jailbreak vulnerabilities in language models with 92% accuracy, finding all nine tested models vulnerable to an augmented Red Queen attack.
-
Busemann energy-based attention for emotion analysis in Poincar\'e discs
A fully hyperbolic attention model using Busemann energy in Poincaré discs produces emotion predictions from text that generalize well even at low embedding dimensions.
-
Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge
Fuzzy AHP and DualJudge deliver more stable and calibrated LLM evaluations than direct scoring by breaking assessments into explicit criteria and adaptively fusing intuitive and deliberative judgments.
-
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
-
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Reinforcement learning post-training enables generalization to unseen textual rule variants and visual changes in foundation models, while supervised fine-tuning primarily leads to memorization.
-
Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.
-
Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling
Full-horizon planning with on-demand replanning achieves accuracy parity with single-step planning in tool-calling agents for knowledge base and multi-hop question answering while consuming 2-3 times fewer tokens.
-
A Graph-Enhanced Defense Framework for Explainable Fake News Detection with LLM
G-Defense builds claim-centered graphs from sub-claims, applies RAG for evidence and competing explanations, then uses graph inference to detect fake news veracity and generate intuitive explanation graphs, claiming SOTA results.
-
Out-of-Distribution Generalization in Time Series: A Survey
This is the first comprehensive survey of OOD generalization methodologies for time series, organized across data distribution, representation learning, and OOD evaluation.
-
A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models
Small language models extract structured information from paediatric renal biopsy reports at up to 84.3% accuracy on CPU hardware with minimal clinician review.
-
Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation
A literature survey that organizes prompting, fine-tuning, preference optimization, and context-aware techniques for LLM-based machine translation with emphasis on low-resource languages.
- ANCHOR: Abductive Network Construction with Hierarchical Orchestration for Reliable Probability Inference in Large Language Models
- "Is This Really a Human Peer Supporter?": Misalignments Between Peer Supporters and Experts in LLM-Supported Interactions