Knowledge-Centric Hallucination Detection

Tianyang Xu, Shujin Wu, Shizhe Diao, Xiaoze Liu, Xingyao Wang, Yangyi Chen, Jing Gao · 2024 · DOI 10.18653/v1/2024

Canonical reference. 122 Pith papers cite this work; 77% of classified citations cite it as background.

citation-role summary

background 27 · method 2 · dataset 1

citation-polarity summary

background 23 · support 2 · unclear 2 · use method 2 · use dataset 1

representative citing papers

Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts

cs.CV · 2026-05-12 · unverdicted · novelty 8.0 · 2 refs

An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

EMPATH: A Multilingual Auditor-Judge Benchmark for Safety Evaluation of Emotional-Support Chatbots

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

The paper presents EMPATH, a new multilingual multi-turn benchmark for safety evaluation of emotional-support chatbots that uses separate auditor and judge models and releases its pipeline and rubrics.

ToolPrivacyBench: Benchmarking Purpose-Bound Privacy in Tool-Using LLM Agents

cs.CR · 2026-06-26 · unverdicted · novelty 7.0

ToolPrivacyBench is a new benchmark that evaluates purpose-bound privacy over-disclosure in multi-tool LLM agent trajectories by auditing tool arguments against policy knowledge bases across 2,150 cases.

EEG Benchmarking Needs a Task Specification Layer: NeuroDoc for Rulebook-Guided, Executable Benchmark Construction

cs.LG · 2026-06-22 · unverdicted · novelty 7.0

Introduces NeuroDoc and NeuroAudit to create a community-reviewed corpus of 53 EEG benchmark entries with 245 task definitions using a rulebook-guided task document and executable kernel.

Measuring Semantic Progress in Multi-turn Dialogue via Information Gain

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

A Gaussian information-gain metric in embedding space quantifies semantic progress in dialogues via uncertainty reduction and shows competitive agreement with human judgments on MT-Bench and UltraFeedback.

How Do LLMs Cite? A Mechanistic Interpretation of Attribution in Retrieval-Augmented Generation

cs.IR · 2026-06-09 · unverdicted · novelty 7.0

Activation patching reveals that citation decisions in Llama-3.1-8B RAG are implemented by a distributed attributional ensemble of heads and layers; targeted interventions fix most missed and spurious citations on PopQA.

Compiling Rewrite Rules to Finite-State Transducers with the Worsening Trick

cs.FL · 2026-06-08 · unverdicted · novelty 7.0

A new worsening-trick construction compiles arbitrary-context rewrite rules A → B / L _ R into FSTs with short uniform formulas that match prior transducers where semantics coincide.

Fingerprinting Inference Systems of Large Language Models

cs.CR · 2026-05-28 · unverdicted · novelty 7.0

Inference system components of LLMs can be fingerprinted from observable prompt-response behavior due to characteristic numerical deviations.

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.

Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

ToolMerge decomposes queries into LLM-planned tool calls merged by boolean operators for long-video keyframe retrieval and introduces the M2M benchmark, showing competitive results with 5% gains on caption retrieval.

Layer-wise Token Compression for Efficient Document Reranking

cs.IR · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, with added gains on listwise LLM rerankers and a regularizer effect for long inputs.

PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications

cs.DC · 2026-05-18 · unverdicted · novelty 7.0

PopPy combines an ahead-of-time compiler and runtime to extract parallelism from Python compound AI applications, delivering up to 6.4x end-to-end speedups while preserving sequential semantics.

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.

From Table to Cell: Attention for Better Reasoning with TABALIGN

cs.AI · 2026-05-14 · unverdicted · novelty 7.0

TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.

Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

DIPS fine-tunes LLMs to output ordered feasible decision vectors approximating Pareto fronts for constrained bi-objective convex problems, reaching 95-98% normalized hypervolume with 0.16s inference.

SMT-Based Active Learning of Weighted Automata

cs.FL · 2026-05-08 · unverdicted · novelty 7.0

An SMT-based active learning algorithm learns minimal nondeterministic weighted automata over arbitrary semirings, with partial correctness proofs, a sufficient termination condition, and experiments showing smaller models and fewer queries than baselines.

The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

The primary axis of psychometric variation among LLMs is the degree to which they represent themselves as loci of phenomenal experience rather than systems of behavioral responses.

Deep Graph-Language Fusion for Structure-Aware Code Generation

cs.SE · 2026-05-05 · unverdicted · novelty 7.0

CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.

Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference

cs.LG · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

Two calls per example identify the first two moments of latent correctness probability, enabling exact bounds on the vote-accuracy curve for any majority-vote budget under conditional i.i.d. assumptions.

VOW: Verifiable and Oblivious Watermark Detection for Large Language Models

cs.CR · 2026-04-30 · unverdicted · novelty 7.0

VOW formulates LLM watermark detection as a secure two-party computation using a Verifiable Oblivious Pseudorandom Function to achieve private and cryptographically verifiable detection.

When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models

cs.IR · 2026-04-29 · unverdicted · novelty 7.0

ReaLM-Retrieve uses step-level uncertainty to trigger retrievals during reasoning, achieving 10.1% better F1 scores and 47% fewer calls on multi-hop QA benchmarks.

Factual and Edit-Sensitive Graph-to-Sequence Generation via Graph-Aware Adaptive Noising

cs.CL · 2026-04-27 · unverdicted · novelty 7.0

DLM4G applies graph-aware adaptive noising in a diffusion framework to generate text from graphs, outperforming larger autoregressive and diffusion baselines in factual grounding and edit sensitivity on three datasets plus molecule captioning.

Exploring Agentic Visual Analytics: A Co-Evolutionary Framework of Roles and Workflows

cs.DB · 2026-04-17 · unverdicted · novelty 7.0

A survey of 55 agentic VA systems proposes a co-evolutionary framework defining four agent roles (PLANNER, CREATOR, REVIEWER, CONTEXT MANAGER) mapped to visual analytics pipeline stages along with design guidelines.

citing papers explorer

Showing 22 of 122 citing papers.

LLMs taking shortcuts in test generation: A study with SAP HANA and LevelDB

cs.SE · 2026-04-15 · unverdicted · none · ref 5

LLMs generate compilable but semantically weak tests for unseen proprietary systems like SAP HANA while performing better on open-source LevelDB, indicating reliance on shortcuts rather than robust reasoning.

Query-Conditioned Graph Retrieval for Contextualized LLM Reasoning in Personalized Wearable Data

cs.IR · 2026-04-10 · unverdicted · none · ref 3

WAG builds a query-adaptive knowledge graph from wearable data using hierarchical Bayesian modeling to retrieve relevant context for LLM reasoning and reports ~70% win rate over baselines.

A Graph-Enhanced Defense Framework for Explainable Fake News Detection with LLM

cs.CL · 2026-04-08 · unverdicted · none · ref 67

G-Defense builds claim-centered graphs from sub-claims, applies RAG for evidence and competing explanations, then uses graph inference to detect fake news veracity and generate intuitive explanation graphs, claiming SOTA results.

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

cs.CL · 2026-01-20 · unverdicted · none · ref 158

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference

cs.CL · 2025-10-22 · unverdicted · none · ref 11

DiffAdapt detects problem difficulty via entropy in reasoning traces and applies one of three fixed inference strategies per question, cutting token usage up to 22.4% with comparable or better accuracy across five models and eight benchmarks.

Out-of-Distribution Generalization in Time Series: A Survey

cs.LG · 2025-03-18 · unverdicted · none · ref 110

This is the first comprehensive survey of OOD generalization methodologies for time series, organized across data distribution, representation learning, and OOD evaluation.

Exploring Cross-lingual Latent Transplantation: Mutual Opportunities and Open Challenges

cs.CL · 2024-12-17 · unverdicted · none · ref 23

XTransplant empirically shows that cross-lingual latent transplantation yields mutual benefits for multilingual capability and cultural adaptability in LLMs, especially low-resource ones, while revealing underutilized model potential.

HiLSVA: Design and Evaluation of a Human-in-the-Loop Agentic System for Scientific Visualization

cs.HC · 2026-06-25 · unverdicted · none · ref 83

HiLSVA introduces a plan-first multi-agent LLM system for scientific visualization that incorporates explicit human oversight, stepwise provenance, and learn-at-test-time adaptation, evaluated via case studies and a 12-participant user study.

Peeking Inside LLMs: Leveraging Internal Artifacts of LLMs for Enhancing Reliability in Legal Classification

cs.CL · 2026-06-18 · unverdicted · none · ref 11

Internal LLM artifacts can be used to build classifiers that identify incorrect predictions on legal classification tasks.

Bridging Short Videos and Live Streams: Reasoning-Guided Multimodal LLMs for Cross-Domain Representation Learning

cs.IR · 2026-06-03 · unverdicted · none · ref 1

RGCD-Rep distills cross-domain reasoning from a frozen MLLM teacher and learns decomposed transferable item representations via two-stage training, yielding gains in offline experiments and production A/B tests on a live streaming platform.

ANCHOR: Abductive Network Construction with Hierarchical Orchestration for Reliable Probability Inference in Large Language Models

cs.CL · 2026-05-11 · unverdicted · none · ref 13 · 2 links

ANCHOR uses hierarchical factor construction and causal Bayesian networks to reduce unknown predictions and improve reliability of LLM-based probability inference over prior Naive Bayes approaches.

Tokalator: A Context Engineering Toolkit for Artificial Intelligence Coding Assistants

cs.SE · 2026-04-09 · unverdicted · none · ref 23

Tokalator is a toolkit with VS Code extension, calculators, and community resources to monitor and optimize token usage in AI coding environments.

A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models

cs.CL · 2026-04-05 · conditional · none · ref 26

Small language models extract structured information from paediatric renal biopsy reports at up to 84.3% accuracy on CPU hardware with minimal clinician review.

Qwen2.5-Coder Technical Report

cs.CL · 2024-09-18 · unverdicted · none · ref 40

Qwen2.5-Coder models claim state-of-the-art results on over 10 code benchmarks, outperforming larger models of similar size.

WisPaper: Your AI Scholar Search Engine

cs.IR · 2025-12-07 · unverdicted · none · ref 1

WisPaper integrates semantic search with agent-based validation, library organization, and personalized AI feeds into a closed-loop system that improves academic paper discovery and long-term awareness.

Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding

cs.CV · 2025-08-28 · unverdicted · none · ref 196

A literature survey on abstract concept recognition in videos that catalogs prior tasks and datasets while advocating for foundation models and reuse of decades of community experience.

Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation

cs.CL · 2025-04-02 · unverdicted · none · ref 122

A literature survey that organizes prompting, fine-tuning, preference optimization, and context-aware techniques for LLM-based machine translation with emphasis on low-resource languages.

Training LLMs on HPC Systems: Best Practices from the OpenGPT-X Project

cs.DC · 2025-04-14 · unverdicted · none · ref 22

Engineering report detailing HPC infrastructure, software choices, and performance measurements for training a 7B LLM using 3D parallelism on JUWELS Booster.

Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs

cs.CR · 2026-04-20 · unreviewed · ref 3

Beyond the Crowd: LLM-Augmented Community Notes for Governing Health Misinformation

cs.SI · 2025-10-13 · unreviewed · ref 10

"Is This Really a Human Peer Supporter?": Misalignments Between Peer Supporters and Experts in LLM-Supported Interactions

cs.HC · 2025-06-11 · unreviewed · ref 10

LIFT: A Novel Framework for Enhancing Long-Context Understanding of LLMs via Long Input Fine-Tuning

cs.CL · 2025-02-20 · unreviewed · ref 1
