Wang, and Sadid Hasan

Jia He, Mukund Rungta, David Koleczek, Arshdeep Sekhon, Franklin X Wang, Sadid Hasan · 2024 · arXiv 2411.10541

19 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 2 · support 1

representative citing papers

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

Rosetta Memory: Adaptive Memory for Cross-LLM Agents

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

Rosetta Memory trains two profile-conditioned operators with a minimum-gain sampling curriculum and performance-gap reward to enable memory transfer between LLMs, showing gains on multi-hop QA benchmarks and robustness to unseen models.

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.

IE as Cache: Information Extraction Enhanced Agentic Reasoning

cs.CL · 2026-04-16 · unverdicted · novelty 7.0

IE-as-Cache framework repurposes information extraction as a dynamic cognitive cache to improve agentic reasoning accuracy in LLMs on challenging benchmarks.

Breaking Validity-Induced Boundaries to Expand Algorithm Search Space: A Two-Stage AST-Based Operator for LLM-Driven Automated Heuristic Evolution

cs.NE · 2026-04-03 · conditional · novelty 7.0

A two-stage AST-based crossover and mutation operator with LLM repair expands the search space in LLM-driven heuristic evolution and improves performance on TSP and online bin packing.

Persona Non Grata: LLM Persona-Driven Generations in MCQA are Unstable in Distinct Dimensions

cs.CL · 2026-07-01 · unverdicted · novelty 6.0

Persona-driven generations by LLMs in MCQA tasks exhibit instability that differs systematically by model family, size, domain, and prompt format.

AI Conversational Interviewing: Scaling Up Semi-Structured and In-depth Interviews

cs.HC · 2026-06-18 · unverdicted · novelty 6.0

AI Conversational Interviewing enables scalable open-ended interviews that capture diverse mental models on topics like migration policy beyond closed-ended surveys, as shown in a 571-respondent study comparing voice, chat, and free-choice modes.

Benchmarking and Exploring the Capabilities of LLMs for Attack Investigations

cs.CR · 2026-06-09 · unverdicted · novelty 6.0

AuditBench is a new benchmark of audit logs from 50+ malicious and benign scenarios that evaluates five LLMs on four security investigation tasks and analyzes their performance and error profiles.

Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

cs.CL · 2026-05-28 · unverdicted · novelty 6.0

Domain specialization does not consistently improve clinical LLM robustness to meaning-preserving prompt variations, as shown by new sensitivity metrics on DiagnosisQA and MedQA.

Empirical Bayes Conformal Prediction for Vision and Language Models

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Empirical Bayes conformal prediction converts score variability into r-value nonconformity scores that preserve target coverage while reducing inclusion of high-variance false candidates in image classification, CLIP VLMs, and LLMs.

Divide-Prompt-Refine: a Training-Free, Structure-Aware Framework for Biomedical Abstract Generation

cs.CL · 2026-05-20 · unverdicted · novelty 6.0 · 2 refs

DPR-BAG generates biomedical abstracts from full texts via BOMRC decomposition, parallel LLM summarization, and refinement, showing higher abstractive novelty than baselines while preserving factual consistency on a 46k-article PMC dataset.

LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling

cs.LG · 2026-05-14 · conditional · novelty 6.0

LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on difficult examples.

Pop Quiz Attack: Black-box Membership Inference Attacks Against Large Language Models

cs.CR · 2026-05-07 · unverdicted · novelty 6.0

PopQuiz Attack infers LLM training data membership by turning examples into quiz questions and measuring answer accuracy, reaching 0.873 average ROC-AUC across six models and outperforming prior methods by 20.6%.

EHRAG: Bridging Semantic Gaps in Lightweight GraphRAG via Hybrid Hypergraph Construction and Retrieval

cs.AI · 2026-04-19 · unverdicted · novelty 6.0

EHRAG constructs structural hyperedges from sentence co-occurrence and semantic hyperedges from entity embedding clusters, then applies hybrid diffusion plus topic-aware PPR to retrieve top-k documents, outperforming baselines on four datasets with linear indexing cost and zero token overhead.

Visual Compositional Tuning

cs.CV · 2025-04-30 · unverdicted · novelty 6.0

COMPACT synthesizes compositional visual instruction data to reduce VIT training data by 90% while achieving 100.2% of full performance across eight multimodal benchmarks.

Benchmarking Local Language Models for Social Robots using Edge Devices

cs.RO · 2026-05-04 · unverdicted · novelty 5.0

Benchmarking 25 LLMs on Raspberry Pi hardware shows Granite4 Tiny Hybrid (7B) balances 2.5 tokens/s, 0.90 tokens/J, and 54.6% MMLU while teaching effectiveness does not require high general knowledge scores.

MIRAGE: A Micro-Interaction Relational Architecture for Grounded Exploration in Multi-Figure Artworks

cs.CV · 2026-04-26 · unverdicted · novelty 5.0

MIRAGE improves VLM analysis of multi-figure art by inserting a verifiable structured representation of micro-interactions between spatial grounding and narrative output.

Can LLMs Make (Personalized) Access Control Decisions?

cs.CR · 2025-11-25 · unverdicted · novelty 5.0

LLMs reflect users' privacy preferences in access control decisions with up to 86% agreement and can promote safer behavior, but personalization trades off higher individual match for potentially less secure results when users over-permission.

Generative AI Technologies, Techniques & Tensions: A Primer

cs.CY · 2026-04-19 · unverdicted · novelty 2.0

Generative AI systems arise from statistical data processing that produces human-like outputs, creating a mismatch with traditional computer expectations and positioning educational researchers to lead in studying and applying them.

