NAACL-LONG.102

Bhagwatkar, R · 2024 · DOI 10.18653/v1/2024

50 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts

cs.CV · 2026-05-12 · unverdicted · novelty 8.0 · 2 refs

An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.

Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

DIPS fine-tunes LLMs to output ordered feasible decision vectors approximating Pareto fronts for constrained bi-objective convex problems, reaching 95-98% normalized hypervolume with 0.16s inference.

SMT-Based Active Learning of Weighted Automata

cs.FL · 2026-05-08 · unverdicted · novelty 7.0

An SMT-based active learning algorithm learns minimal nondeterministic weighted automata over arbitrary semirings, with partial correctness proofs, a sufficient termination condition, and experiments showing smaller models and fewer queries than baselines.

The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

The primary axis of psychometric variation among LLMs is the degree to which they represent themselves as loci of phenomenal experience rather than systems of behavioral responses.

Deep Graph-Language Fusion for Structure-Aware Code Generation

cs.SE · 2026-05-05 · unverdicted · novelty 7.0

CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.

Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference

cs.LG · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

Two calls per example identify the first two moments of latent correctness probability, enabling exact bounds on the vote-accuracy curve for any majority-vote budget under conditional i.i.d. assumptions.

VOW: Verifiable and Oblivious Watermark Detection for Large Language Models

cs.CR · 2026-04-30 · unverdicted · novelty 7.0

VOW formulates LLM watermark detection as a secure two-party computation using a Verifiable Oblivious Pseudorandom Function to achieve private and cryptographically verifiable detection.

When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models

cs.IR · 2026-04-29 · unverdicted · novelty 7.0

ReaLM-Retrieve uses step-level uncertainty to trigger retrievals during reasoning, achieving 10.1% better F1 scores and 47% fewer calls on multi-hop QA benchmarks.

Factual and Edit-Sensitive Graph-to-Sequence Generation via Graph-Aware Adaptive Noising

cs.CL · 2026-04-27 · unverdicted · novelty 7.0

DLM4G applies graph-aware adaptive noising in a diffusion framework to generate text from graphs, outperforming larger autoregressive and diffusion baselines in factual grounding and edit sensitivity on three datasets plus molecule captioning.

Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs

cs.CR · 2026-04-20 · unverdicted · novelty 7.0

A Merkle-committed SAE feature-trace protocol detects model substitutions in hosted LLMs at a stable threshold where parallel-probe baselines fail, including against adaptive LoRA attackers.

Exploring Agentic Visual Analytics: A Co-Evolutionary Framework of Roles and Workflows

cs.DB · 2026-04-17 · unverdicted · novelty 7.0

A survey of 55 agentic VA systems proposes a co-evolutionary framework defining four agent roles (PLANNER, CREATOR, REVIEWER, CONTEXT MANAGER) mapped to visual analytics pipeline stages along with design guidelines.

DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions

cs.CV · 2026-04-07 · unverdicted · novelty 7.0

DetailVerifyBench supplies 1,000 images and densely annotated long captions to evaluate precise hallucination localization in multimodal large language models.

DP-OPD: Differentially Private On-Policy Distillation for Language Models

cs.LG · 2026-04-06 · unverdicted · novelty 7.0

DP-OPD achieves lower perplexity than DP fine-tuning and synthesis-based private distillation under ε=2.0 by enforcing DP-SGD solely on the student during on-policy training with a frozen teacher.

MIRA: An LLM-Assisted Benchmark for Multi-Category Integrated Retrieval

cs.IR · 2026-05-11 · unverdicted · novelty 6.0

MIRA is a new benchmark for multi-category integrated retrieval built from real queries on a social science platform, with LLM assistance for topic descriptions and relevance labeling across four item categories.

An Annotation Scheme and Classifier for Personal Facts in Dialogue

cs.CL · 2026-05-11 · accept · novelty 6.0

An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 points with lower compute.

ANCHOR: Abductive Network Construction with Hierarchical Orchestration for Reliable Probability Inference in Large Language Models

cs.CL · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

ANCHOR constructs dense hierarchical factor spaces via LLM generation and clustering, then augments Naive Bayes with a causal Bayesian network to reduce unknown predictions and improve reliability of LLM-based probability estimates.

Permit: Permission-Aware Representation Intervention for Controlled Generation in Large Language Models

cs.CR · 2026-05-10 · unverdicted · novelty 6.0

Permit identifies a permission-sensitive subspace in LLM hidden states and applies lightweight offset or gated interventions to enforce fine-grained generation control, outperforming prior methods with over 18% F1 gain and near-zero leakage using over 98% fewer parameters.

Bias and Uncertainty in LLM-as-a-Judge Estimation

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Bias-corrected LLM-as-a-Judge estimators can reverse true model orderings under shared calibration, and the paper supplies judge quality J and cross-model instability ΔJ as practical diagnostics for when such estimates are unreliable.

MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

MASPO jointly optimizes prompts in multi-agent LLM systems via downstream-success evaluation and evolutionary beam search, delivering 2.9 average accuracy gains over prior methods across six tasks.

SkillOS: Learning Skill Curation for Self-Evolving Agents

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

SkillOS is an RL recipe that learns to curate reusable skills for self-evolving LLM agents, outperforming memory-free and memory-based baselines while generalizing across executors and domains.

CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification

cs.CL · 2026-05-05 · unverdicted · novelty 6.0

CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tuned Qwen3-14B model and 50% relative improvement over baselines.

From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction

cs.AI · 2026-04-30 · unverdicted · novelty 6.0

Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.

citing papers explorer

Showing 50 of 50 citing papers.

Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts cs.CV · 2026-05-12 · unverdicted · none · ref 6 · 2 links
An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.
Why Do Multi-Agent LLM Systems Fail? cs.AI · 2025-03-17 · unverdicted · none · ref 40
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems cs.AI · 2026-05-12 · unverdicted · none · ref 31
Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.
Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization cs.AI · 2026-05-12 · unverdicted · none · ref 33
DIPS fine-tunes LLMs to output ordered feasible decision vectors approximating Pareto fronts for constrained bi-objective convex problems, reaching 95-98% normalized hypervolume with 0.16s inference.
SMT-Based Active Learning of Weighted Automata cs.FL · 2026-05-08 · unverdicted · none · ref 32
An SMT-based active learning algorithm learns minimal nondeterministic weighted automata over arbitrary semirings, with partial correctness proofs, a sufficient termination condition, and experiments showing smaller models and fewer queries than baselines.
The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences cs.CL · 2026-05-06 · unverdicted · none · ref 15
The primary axis of psychometric variation among LLMs is the degree to which they represent themselves as loci of phenomenal experience rather than systems of behavioral responses.
Deep Graph-Language Fusion for Structure-Aware Code Generation cs.SE · 2026-05-05 · unverdicted · none · ref 2
CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.
Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference cs.LG · 2026-05-05 · unverdicted · none · ref 27 · 2 links
Two calls per example identify the first two moments of latent correctness probability, enabling exact bounds on the vote-accuracy curve for any majority-vote budget under conditional i.i.d. assumptions.
VOW: Verifiable and Oblivious Watermark Detection for Large Language Models cs.CR · 2026-04-30 · unverdicted · none · ref 24
VOW formulates LLM watermark detection as a secure two-party computation using a Verifiable Oblivious Pseudorandom Function to achieve private and cryptographically verifiable detection.
When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models cs.IR · 2026-04-29 · unverdicted · none · ref 12
ReaLM-Retrieve uses step-level uncertainty to trigger retrievals during reasoning, achieving 10.1% better F1 scores and 47% fewer calls on multi-hop QA benchmarks.
Factual and Edit-Sensitive Graph-to-Sequence Generation via Graph-Aware Adaptive Noising cs.CL · 2026-04-27 · unverdicted · none · ref 19
DLM4G applies graph-aware adaptive noising in a diffusion framework to generate text from graphs, outperforming larger autoregressive and diffusion baselines in factual grounding and edit sensitivity on three datasets plus molecule captioning.
Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs cs.CR · 2026-04-20 · unverdicted · none · ref 3
A Merkle-committed SAE feature-trace protocol detects model substitutions in hosted LLMs at a stable threshold where parallel-probe baselines fail, including against adaptive LoRA attackers.
Exploring Agentic Visual Analytics: A Co-Evolutionary Framework of Roles and Workflows cs.DB · 2026-04-17 · unverdicted · none · ref 68
A survey of 55 agentic VA systems proposes a co-evolutionary framework defining four agent roles (PLANNER, CREATOR, REVIEWER, CONTEXT MANAGER) mapped to visual analytics pipeline stages along with design guidelines.
DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions cs.CV · 2026-04-07 · unverdicted · none · ref 17
DetailVerifyBench supplies 1,000 images and densely annotated long captions to evaluate precise hallucination localization in multimodal large language models.
DP-OPD: Differentially Private On-Policy Distillation for Language Models cs.LG · 2026-04-06 · unverdicted · none · ref 6
DP-OPD achieves lower perplexity than DP fine-tuning and synthesis-based private distillation under ε=2.0 by enforcing DP-SGD solely on the student during on-policy training with a frozen teacher.
MIRA: An LLM-Assisted Benchmark for Multi-Category Integrated Retrieval cs.IR · 2026-05-11 · unverdicted · none · ref 66
MIRA is a new benchmark for multi-category integrated retrieval built from real queries on a social science platform, with LLM assistance for topic descriptions and relevance labeling across four item categories.
An Annotation Scheme and Classifier for Personal Facts in Dialogue cs.CL · 2026-05-11 · accept · none · ref 16
An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 points with lower compute.
ANCHOR: Abductive Network Construction with Hierarchical Orchestration for Reliable Probability Inference in Large Language Models cs.CL · 2026-05-11 · unverdicted · none · ref 13 · 2 links
ANCHOR constructs dense hierarchical factor spaces via LLM generation and clustering, then augments Naive Bayes with a causal Bayesian network to reduce unknown predictions and improve reliability of LLM-based probability estimates.
Permit: Permission-Aware Representation Intervention for Controlled Generation in Large Language Models cs.CR · 2026-05-10 · unverdicted · none · ref 35
Permit identifies a permission-sensitive subspace in LLM hidden states and applies lightweight offset or gated interventions to enforce fine-grained generation control, outperforming prior methods with over 18% F1 gain and near-zero leakage using over 98% fewer parameters.
Bias and Uncertainty in LLM-as-a-Judge Estimation cs.LG · 2026-05-07 · unverdicted · none · ref 15
Bias-corrected LLM-as-a-Judge estimators can reverse true model orderings under shared calibration, and the paper supplies judge quality J and cross-model instability ΔJ as practical diagnostics for when such estimates are unreliable.
MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems cs.AI · 2026-05-07 · unverdicted · none · ref 7
MASPO jointly optimizes prompts in multi-agent LLM systems via downstream-success evaluation and evolutionary beam search, delivering 2.9 average accuracy gains over prior methods across six tasks.
SkillOS: Learning Skill Curation for Self-Evolving Agents cs.AI · 2026-05-07 · unverdicted · none · ref 1
SkillOS is an RL recipe that learns to curate reusable skills for self-evolving LLM agents, outperforming memory-free and memory-based baselines while generalizing across executors and domains.
CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification cs.CL · 2026-05-05 · unverdicted · none · ref 28
CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tuned Qwen3-14B model and 50% relative improvement over baselines.
From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction cs.AI · 2026-04-30 · unverdicted · none · ref 26
Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
NeocorRAG: Less Irrelevant Information, More Explicit Evidence, and More Effective Recall via Evidence Chains cs.IR · 2026-04-30 · unverdicted · none · ref 56
NeocorRAG uses Evidence Chains to achieve SOTA retrieval quality in RAG on HotpotQA, 2WikiMultiHopQA, MuSiQue, and NQ for 3B and 70B models while using under 20% of the tokens of comparable methods.
When AI reviews science: Can we trust the referee? cs.AI · 2026-04-26 · unverdicted · none · ref 30
AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference submissions.
Zero-Shot Detection of LLM-Generated Text via Implicit Reward Model cs.CL · 2026-04-23 · unverdicted · none · ref 30
IRM derives implicit reward signals from off-the-shelf LLMs to detect generated text zero-shot and reports better results than prior zero-shot and supervised detectors on the DetectRL benchmark.
AVISE: Framework for Evaluating the Security of AI Systems cs.CR · 2026-04-22 · unverdicted · none · ref 39
AVISE provides a new framework and automated SET that identifies jailbreak vulnerabilities in language models with 92% accuracy, finding all nine tested models vulnerable to an augmented Red Queen attack.
Autonomous Skeletal Landmark Localization towards Agentic C-Arm Control cs.CV · 2026-04-20 · unverdicted · none · ref 7
Fine-tuned MLLMs achieve competitive skeletal landmark localization on synthetic and real X-ray datasets compared to deep learning baselines and demonstrate reasoning for sequential C-arm navigation.
Privacy-Preserving LLMs Routing cs.CR · 2026-04-17 · unverdicted · none · ref 12
PPRoute achieves plaintext-level LLM routing quality with MPC-based privacy and a 20x speedup over naive encrypted implementations via MPC-friendly encoders, multi-step training, and O(1) communication Top-k search.
The Enforcement and Feasibility of Hate Speech Moderation on Twitter cs.CY · 2026-04-14 · conditional · none · ref 59
80% of hateful tweets remain online after five months with no higher removal rate than non-hateful content, while human-AI moderation pipelines can feasibly cut user exposure below regulatory penalty costs.
Revisiting Anisotropy in Language Transformers: The Geometry of Learning Dynamics cs.CL · 2026-04-09 · unverdicted · none · ref 5
Anisotropy in language transformers arises because training amplifies tangent directions, with activation-based low-rank proxies capturing unusually large gradient energy and anisotropy share compared to controls.
Busemann energy-based attention for emotion analysis in Poincaré discs cs.LG · 2026-04-08 · unverdicted · none · ref 34
A fully hyperbolic attention model using Busemann energy in Poincaré discs produces emotion predictions from text that generalize well even at low embedding dimensions.
Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge cs.AI · 2026-04-04 · unverdicted · none · ref 11
Fuzzy AHP and DualJudge deliver more stable and calibrated LLM evaluations than direct scoring by breaking assessments into explicit criteria and adaptively fusing intuitive and deliberative judgments.
Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control cs.CL · 2026-04-03 · unverdicted · none · ref 2
Emotion vectors in LLMs lie in a circular valence-arousal subspace that supports monotonic control over text affect and bidirectional control over refusal and sycophancy.
Beyond Precision: Importance-Aware Recall for Factuality Evaluation in Long-Form LLM Generation cs.CL · 2026-04-03 · unverdicted · none · ref 10
An importance-aware recall metric for LLM factuality evaluation reveals models are better at avoiding false claims than covering all relevant facts.
Composer Vector: Style-steering Symbolic Music Generation in a Latent Space cs.SD · 2026-04-03 · unverdicted · none · ref 25
Composer Vector steers symbolic music generation models in latent space at inference time to control and blend composer styles without retraining.
Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models cs.CR · 2026-04-01 · conditional · none · ref 34
A new benchmark exposes food-safety gaps in current LLMs and guardrails, and a fine-tuned 4B model is offered as a domain-specific fix.
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training cs.AI · 2025-01-28 · unverdicted · none · ref 7
Reinforcement learning post-training enables generalization to unseen textual rule variants and visual changes in foundation models, while supervised fine-tuning primarily leads to memorization.
Position: LLM Inference Should Be Evaluated as Energy-to-Token Production cs.CE · 2026-05-12 · unverdicted · none · ref 44
LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.
31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding cs.AR · 2026-05-10 · unverdicted · none · ref 14
A ReRAM-on-logic stacked chip delivers 14.08-135.69 tokens/s LLM inference with block-clustered compression and adaptive parallel speculative decoding, yielding 4.46-7.17x speedup over standard methods.
Personalized Alignment Revisited: The Necessity and Sufficiency of User Diversity cs.LG · 2026-05-09 · unverdicted · none · ref 5
A user-diversity condition is necessary and sufficient for personalized alignment to achieve O(1) online regret and log(1/epsilon) offline sample complexity.
Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling cs.CL · 2026-05-08 · unverdicted · none · ref 18
Full-horizon planning with on-demand replanning achieves accuracy parity with single-step planning in tool-calling agents for knowledge base and multi-hop question answering while consuming 2-3 times fewer tokens.
Exploring Interaction Paradigms for LLM Agents in Scientific Visualization cs.AI · 2026-04-30 · unverdicted · none · ref 35 · 2 links
General-purpose coding agents achieve highest success on SciVis tasks but cost more compute, while domain-specific agents are efficient yet less flexible and computer-use agents falter on long workflows.
Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling cs.CL · 2026-04-28 · unverdicted · none · ref 6
Marco-MoE delivers open multilingual MoE models with 5% activation sparsity that outperform similarly sized dense models on English and multilingual benchmarks through efficient upcycling.
LLMs taking shortcuts in test generation: A study with SAP HANA and LevelDB cs.SE · 2026-04-15 · unverdicted · none · ref 5
LLMs generate compilable but semantically weak tests for unseen proprietary systems like SAP HANA while performing better on open-source LevelDB, indicating reliance on shortcuts rather than robust reasoning.
A Graph-Enhanced Defense Framework for Explainable Fake News Detection with LLM cs.CL · 2026-04-08 · unverdicted · none · ref 67
G-Defense builds claim-centered graphs from sub-claims, applies RAG for evidence and competing explanations, then uses graph inference to detect fake news veracity and generate intuitive explanation graphs, claiming SOTA results.
Tokalator: A Context Engineering Toolkit for Artificial Intelligence Coding Assistants cs.SE · 2026-04-09 · unverdicted · none · ref 23
Tokalator is a toolkit with VS Code extension, calculators, and community resources to monitor and optimize token usage in AI coding environments.
A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models cs.CL · 2026-04-05 · conditional · none · ref 26
Small language models extract structured information from paediatric renal biopsy reports at up to 84.3% accuracy on CPU hardware with minimal clinician review.
Qwen2.5-Coder Technical Report cs.CL · 2024-09-18 · unverdicted · none · ref 40
Qwen2.5-Coder models claim state-of-the-art results on over 10 code benchmarks, outperforming larger models of similar size.
