31 Pith papers cite this work.
citing papers explorer
-
AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks
AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction paradox where privacy instructions increase discussion of sensitive information.
-
Adaptive Stopping for Multi-Turn LLM Reasoning
MiCP is the first conformal prediction method for multi-turn LLM pipelines that allocates per-turn error budgets to enable adaptive stopping with an overall coverage guarantee, shown to reduce turns and cost on RAG and ReAct benchmarks.
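The budget-allocation idea can be sketched with ordinary split conformal prediction. This is not MiCP's actual allocation rule: it is a minimal sketch assuming the simplest scheme, an even (Bonferroni-style) split of the total miscoverage budget across turns, with a per-turn conformal quantile computed from held-out nonconformity scores. All names and the toy data are illustrative.

```python
import numpy as np

def per_turn_thresholds(cal_scores_by_turn, alpha=0.1):
    """Split a total miscoverage budget alpha evenly across turns
    (a Bonferroni-style allocation; the paper's scheme may differ)
    and compute a split-conformal quantile for each turn."""
    T = len(cal_scores_by_turn)
    alpha_t = alpha / T                       # per-turn error budget
    thresholds = []
    for scores in cal_scores_by_turn:
        n = len(scores)
        # standard split-conformal quantile level ceil((n+1)(1-alpha_t))/n
        level = min(1.0, np.ceil((n + 1) * (1 - alpha_t)) / n)
        thresholds.append(np.quantile(scores, level, method="higher"))
    return thresholds

# toy calibration: nonconformity scores for a 3-turn pipeline
rng = np.random.default_rng(0)
cal = [rng.uniform(0, 1, 200) for _ in range(3)]
ths = per_turn_thresholds(cal, alpha=0.1)
```

At inference, a turn whose prediction set (scores below the turn's threshold) is already decisive would allow the pipeline to stop early while the union bound preserves overall coverage.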
-
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
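The gating mechanism is concrete enough to sketch. Below is a minimal numpy version of noisy top-k gating in the spirit of the paper: softplus-scaled Gaussian noise is added to the gate logits, only the k largest survive, and the softmax runs over those k alone. Dimensions and names are illustrative.

```python
import numpy as np

def noisy_topk_gate(x, w_gate, w_noise, k=2, rng=None):
    """Sparse gate weights: keep top-k noisy logits, softmax over them only."""
    rng = rng or np.random.default_rng(0)
    logits = x @ w_gate
    noise_scale = np.log1p(np.exp(x @ w_noise))            # softplus
    noisy = logits + rng.standard_normal(logits.shape) * noise_scale
    topk = np.argsort(noisy)[..., -k:]                     # k largest per row
    masked = np.full_like(noisy, -np.inf)
    np.put_along_axis(masked, topk, np.take_along_axis(noisy, topk, -1), -1)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)               # rows sum to 1

n_experts, d = 8, 16
rng = np.random.default_rng(1)
x = rng.standard_normal((4, d))
g = noisy_topk_gate(x, rng.standard_normal((d, n_experts)),
                    rng.standard_normal((d, n_experts)), k=2, rng=rng)
```

Each input then activates only k of the n experts, which is the source of the sub-linear compute.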
-
Unlocking Prompt Infilling Capability for Diffusion Language Models
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
-
CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge
The CresOWLve benchmark shows that frontier LLMs retrieve relevant real-world facts but struggle to form creative connections, scoring up to 17% lower on creative questions than on factual ones.
-
BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence
BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
-
Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models
A procedural engine generates 200k+ synthetic geometry diagrams to fine-tune VLMs for referring image segmentation on abstract diagrams, yielding 49% IoU and 85% Buffered IoU with Florence-2 versus under 1% zero-shot.
-
Student-in-the-Loop Chain-of-Thought Distillation via Generation-Time Selection
Gen-SSD improves chain-of-thought distillation by letting the student model guide the teacher's generation process through real-time selection of learnable reasoning branches, yielding 5.9-point gains over standard KD on math benchmarks.
-
M2-Verify: A Large-Scale Multidomain Benchmark for Checking Multimodal Claim Consistency
M2-Verify is a new multidomain benchmark for multimodal scientific claim consistency; it reveals that state-of-the-art models drop from 85.8% to 61.6% Micro-F1 under complex perturbations and produce hallucinated explanations.
-
Stochastic Attention: Connectome-Inspired Randomized Routing for Expressive Linear-Time Attention
Stochastic Attention applies random permutations to token sequences in sliding-window attention to achieve exponentially growing receptive fields and full coverage in logarithmic layers, outperforming standard SWA in language model pre-training and inference.
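The coverage claim can be simulated directly. The sketch below is a simplified model of the routing (not the paper's implementation): each layer applies a random permutation and then a local window, and we track which source tokens can influence token 0. The receptive field multiplies roughly by the window size per layer, hence full coverage in logarithmically many layers.

```python
import numpy as np

def coverage_after_layers(n=64, window=4, layers=6, seed=0):
    """Tokens reachable by position 0 after stacking permute-then-window layers."""
    rng = np.random.default_rng(seed)
    reach = [{i} for i in range(n)]        # reach[i]: tokens feeding token i
    cov = []
    for _ in range(layers):
        perm = rng.permutation(n)          # perm[slot] = token placed at slot
        pos = np.empty(n, dtype=int)
        pos[perm] = np.arange(n)           # pos[token] = slot after permutation
        new = []
        for i in range(n):
            s = set()
            for slot in range(max(0, pos[i] - window),
                              min(n, pos[i] + window + 1)):
                s |= reach[perm[slot]]     # union over the local window
            new.append(s)
        reach = new
        cov.append(len(reach[0]))
    return cov

cov = coverage_after_layers()
```

Coverage is monotone (each token sits inside its own window) and saturates at the full sequence length within a handful of layers.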
-
Avoiding Overthinking and Underthinking: Curriculum-Aware Budget Scheduling for LLMs
BACR adaptively schedules token budgets for LLM reasoning via curriculum learning and a unified policy, improving accuracy by up to 8.3% under tight budgets while cutting token use by 34% on math benchmarks.
-
QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation
QuanBench+ is a new multi-framework benchmark showing LLMs reach 43-60% Pass@1 on quantum code tasks across three libraries, rising to 67-83% with error-feedback repair, yet performance remains strongly framework-dependent.
-
LoRA: Low-Rank Adaptation of Large Language Models
Adapting large language models by training only a low-rank decomposition BA added to frozen weight matrices matches full fine-tuning while cutting trainable parameters by orders of magnitude and adding no inference latency.
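The update rule is simple enough to state in a few lines. A minimal numpy sketch of a LoRA linear layer, including the merge that eliminates inference latency (shapes are illustrative; the paper initializes B to zero so training starts from the frozen model):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=4):
    """y = x W^T + (alpha/r) x A^T B^T: frozen W plus low-rank update B A."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

d_in, d_out, r = 32, 16, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trained, small random init
B = rng.standard_normal((d_out, r)) * 0.01  # trained (zero-init in the paper)
x = rng.standard_normal((8, d_in))

y = lora_forward(x, W, A, B, alpha=16, r=r)
# merging B A into W at deployment removes all added inference latency
W_merged = W + (16 / r) * (B @ A)
```

Only A and B are trained: r * (d_in + d_out) parameters versus d_in * d_out for full fine-tuning.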
-
When AI Agents Disagree Like Humans: Reasoning Trace Analysis for Human-AI Collaborative Moderation
Agent verdict agreement in multi-agent hate speech moderation correlates with lower human annotator disagreement, with large effect sizes, motivating uncertainty-surfacing designs over consensus-seeking.
-
Testing the Limits of Truth Directions in LLMs
Truth directions in LLMs are not universal but depend heavily on model layer, task type and difficulty, and prompt instructions.
-
Align then Train: Efficient Retrieval Adapter Learning
A two-stage adapter method aligns query and document embedding spaces to improve dense retrieval for complex queries using lightweight encoders and few labels.
-
Flash-Mono: Feed-Forward Accelerated Gaussian Splatting Monocular SLAM
Flash-Mono uses a recurrent feed-forward frontend with cross-attention to predict poses and 2D Gaussian surfel attributes for monocular SLAM, achieving 10x speedup and state-of-the-art tracking and mapping.
-
Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.
-
Generative Frontiers: Why Evaluation Matters for Diffusion Language Models
Generative perplexity and entropy are shown to be the two additive components of KL divergence to a reference distribution, motivating generative frontiers as a principled evaluation method for diffusion language models.
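The decomposition is the standard identity KL(q || p) = H(q, p) - H(q); reading log generative perplexity as the cross-entropy of model samples under the reference, and the entropy term as the model's own entropy, is my interpretation of the summary. A numeric check on toy distributions:

```python
import numpy as np

def entropy(q):
    """H(q): the model's own entropy."""
    return -np.sum(q * np.log(q))

def cross_entropy(q, p):
    """H(q, p) = -E_q[log p]; its exponential is the perplexity of
    q's samples under the reference p."""
    return -np.sum(q * np.log(p))

def kl(q, p):
    return np.sum(q * np.log(q / p))

q = np.array([0.5, 0.3, 0.2])   # model distribution
p = np.array([0.4, 0.4, 0.2])   # reference distribution
# identity: KL(q || p) = cross_entropy(q, p) - entropy(q)
```

Holding KL fixed thus traces a frontier: lower generative perplexity must be paid for with lower entropy, which is why reporting either number alone is misleading.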
-
Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments
LLM alignments redirect stereotypes to implicit tasks instead of removing them, producing bias score divergences up to 0.43 across explicit and implicit probes in audits of seven models.
-
Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge
RL with an LLM judge provides rewards on unlabeled data for knowledge distillation, yielding gains on math benchmarks when mixed with verifiable rewards.
-
No Single Best Model for Diversity: Learning a Router for Sample Diversity
No single LLM is best for response diversity; a router selecting the per-prompt best model raises diversity coverage from 23.8% to 26.3% on NB-Wildchat and generalizes to new data.
-
ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis
ATBench is a new trajectory-level benchmark with 1,000 diverse and realistic scenarios for assessing safety in LLM agents.
-
Multimodal Language Models Cannot Spot Spatial Inconsistencies
Multimodal LLMs significantly underperform humans at spotting objects that break 3D consistency in multi-view image pairs.
-
Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models
Scene Dynamic Field integrates physics simulators into MLLM fine-tuning to boost intuitive physics understanding, delivering up to 20.7% gains on fluid tasks with generalization to unseen domains.
-
Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents
Citation URLs from LLMs and research agents are hallucinated 3-13% of the time and non-resolving 5-18% of the time, with a released tool that reduces failures by 6-79x.
-
Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts
Domain-adapted utterance-level retrieval raises Cohen's kappa for tutoring dialogue act annotation to 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi, beating no-retrieval baselines by large margins across three LLMs.
-
Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs
Random sampling matches active preference learning on win-rate gains in online DPO yet both degrade benchmark performance, making active selection's overhead hard to justify.
-
Grading the Unspoken: Evaluating Tacit Reasoning in Quantum Field Theory and String Theory with LLMs
LLMs reach near-ceiling performance on explicit QFT and string theory derivations but degrade when required to reconstruct omitted reasoning steps or resolve implicit tensions under global consistency constraints.
-
Embedding-Only Uplink for Onboard Retrieval Under Shift in Remote Sensing
Embedding-only uplink enables flexible onboard retrieval for remote sensing under distribution shifts, with kNN superior for cloud classification and centroids for temporal change detection.
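The two retrieval modes compared can be sketched in a few lines; this is a generic embedding-space classifier pair, not the paper's pipeline, and the Gaussian toy data stands in for uplinked scene embeddings.

```python
import numpy as np

def knn_predict(query, emb, labels, k=5):
    """Majority vote over the k nearest stored embeddings (Euclidean)."""
    d = np.linalg.norm(emb - query, axis=1)
    nearest = labels[np.argsort(d)[:k]]
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]

def centroid_predict(query, emb, labels):
    """Nearest class centroid instead of raw neighbors."""
    classes = np.unique(labels)
    cents = np.stack([emb[labels == c].mean(axis=0) for c in classes])
    return classes[np.argmin(np.linalg.norm(cents - query, axis=1))]

rng = np.random.default_rng(0)
emb = np.concatenate([rng.normal(0, 1, (50, 8)),    # class 0 embeddings
                      rng.normal(3, 1, (50, 8))])   # class 1 embeddings
labels = np.array([0] * 50 + [1] * 50)
q = rng.normal(3, 1, 8)                             # new onboard sample
```

The paper's finding is that the better choice of the two depends on the task: raw neighbors for cloud classification, centroids for temporal change detection.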
-
Distilling Genomic Models for Efficient mRNA Representation Learning via Embedding Matching
Embedding-based distillation shrinks a large genomic model 200-fold into a compact mRNA specialist that reaches state-of-the-art results among similarly sized models.
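The embedding-matching objective can be illustrated with a deliberately tiny stand-in: a frozen nonlinear "teacher" embedder and a linear student fit in closed form to minimize the Frobenius gap between the two embedding sets. Everything here (shapes, the tanh teacher, the linear student) is an illustrative assumption, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_teacher = 500, 64, 32
X = rng.standard_normal((n, d_in))          # stand-in for mRNA features
W_teacher = rng.standard_normal((d_in, d_teacher))
T = np.tanh(X @ W_teacher)                  # frozen teacher embeddings

# embedding-matching objective: min_W ||X W - T||_F^2 (closed-form here;
# a real student would minimize the same loss by gradient descent)
W_student, *_ = np.linalg.lstsq(X, T, rcond=None)
S = X @ W_student                           # compact student embeddings
mse = np.mean((S - T) ** 2)
```

The student never sees downstream labels; it inherits the teacher's representation geometry directly, which is what lets a 200x smaller model stay competitive.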