Title resolution pending

316 Pith papers cite this work. Polarity classification is still indexing.

316 Pith papers citing it

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 3 method 1

citation-polarity summary

background 2 unclear 1 use method 1

representative citing papers

Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

cs.CL · 2026-04-29 · unverdicted · novelty 8.0

TIDE enables the first cross-architecture distillation of dLLMs, improving a 0.6B student by 1.53 average points over baselines when trained from 8B dense and 16B MoE teachers.

JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models

cs.LG · 2026-04-17 · unverdicted · novelty 8.0

JumpLoRA uses JumpReLU gating to induce adaptive sparsity in LoRA blocks, achieving dynamic parameter isolation that prevents task interference and improves continual learning performance over IncLoRA and ELLA.

Context Over Content: Exposing Evaluation Faking in Automated Judges

cs.AI · 2026-04-16 · conditional · novelty 8.0

LLM judges exhibit up to 9.8 percentage point leniency bias from stakes signaling in prompts, acting implicitly without mentioning it in chain-of-thought.

InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

cs.CL · 2026-04-14 · unverdicted · novelty 8.0

InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.

Exact Certification of Neural Networks and Partition Aggregation Ensembles against Label Poisoning

cs.LG · 2026-04-13 · unverdicted · novelty 8.0

EnsembleCert and ScaLabelCert enable tighter and exact certificates for neural network robustness against label-flipping attacks by leveraging white-box information and neural tangent kernel equivalence.

Steered LLM Activations are Non-Surjective

cs.AI · 2026-04-10 · unverdicted · novelty 8.0 · 2 refs

Steered LLM activations are non-surjective: under practical assumptions, they lie outside the set of states reachable from any discrete prompt.

AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks

cs.AI · 2026-04-01 · unverdicted · novelty 8.0

AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction paradox where privacy instructions increase discussion of sensitive information.

Adaptive Stopping for Multi-Turn LLM Reasoning

cs.CL · 2026-04-01 · unverdicted · novelty 8.0

MiCP is the first conformal prediction method for multi-turn LLM pipelines that allocates per-turn error budgets to enable adaptive stopping with an overall coverage guarantee, shown to reduce turns and cost on RAG and ReAct benchmarks.

Parameterized Hardness of Zonotope Containment and Neural Network Verification

cs.CC · 2025-09-26 · unverdicted · novelty 8.0

The paper proves W[1]-hardness parameterized by dimension d for positivity, zonotope containment, max approximation, and L_p-Lipschitz constants in 2- and 3-layer ReLU networks, showing enumeration methods are optimal under ETH.

RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks

cs.CR · 2025-09-25 · conditional · novelty 8.0

RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

Score-Based Generative Modeling through Stochastic Differential Equations

cs.LG · 2020-11-26 · unverdicted · novelty 8.0

Introduces an SDE-based framework for score-based generative modeling that unifies prior methods, enables predictor-corrector sampling and neural ODE likelihoods, and achieves SOTA unconditional image generation on CIFAR-10.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

cs.LG · 2017-01-23 · accept · novelty 8.0

A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.

Adam: A Method for Stochastic Optimization

cs.LG · 2014-12-22 · accept · novelty 7.5

A first-order stochastic optimizer that maintains bias-corrected exponential moving averages of the gradient and its square, dividing the former by the square root of the latter to set per-parameter step sizes.

AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism

cs.LG · 2026-04-29 · unverdicted · novelty 7.0

AutoSP automates sequence parallelism and long-context activation checkpointing via compilation, enabling up to 2.7x longer training contexts on NVIDIA hardware with negligible throughput loss.

VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

cs.LG · 2026-04-28 · unverdicted · novelty 7.0

VLM judges exhibit task-dependent uncertainty in their scores, with conformal prediction revealing wide intervals for complex tasks and a decoupling between good ranking performance and poor absolute scoring reliability.

Cooperate to Compete: Strategic Coordination in Multi-Agent Conquest

cs.AI · 2026-04-28 · conditional · novelty 7.0

C2C is a new testbed where LM agents negotiate differently from humans and targeted prompting raises their win rate from 22.2% to 32.7% across 1,100+ games.

XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation

cs.AI · 2026-04-27 · unverdicted · novelty 7.0

XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.

GraphPlanner: Graph Memory-Augmented Agentic Routing for Multi-Agent LLMs

cs.CL · 2026-04-26 · unverdicted · novelty 7.0

GraphPlanner augments multi-agent LLM routing with a heterogeneous graph memory and RL-optimized MDP workflow generation, delivering up to 9.3% higher accuracy and over 99% lower GPU cost than prior routers while supporting zero-shot generalization.

MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

cs.IR · 2026-04-25 · unverdicted · novelty 7.0

MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning

cs.LG · 2026-04-24 · unverdicted · novelty 7.0

A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.

Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

cs.CL · 2026-04-24 · unverdicted · novelty 7.0

Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.

Directional Confusions Reveal Divergent Inductive Biases Through Rate-Distortion Geometry in Human and Machine Vision

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

Humans show broad weak directional confusions while DNNs show sparse strong collapses; these structures shift rate-distortion geometry differently and reveal divergent inductive biases.

Modulating Cross-Modal Convergence with Single-Stimulus, Intra-Modal Dispersion

q-bio.NC · 2026-04-23 · unverdicted · novelty 7.0

Stimuli with low intra-modal dispersion among vision models elicit up to twice the cross-modal alignment with language models compared to high-dispersion stimuli.

citing papers explorer

Showing 50 of 74 citing papers after filters.

Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models cs.CL · 2026-04-29 · unverdicted · none · ref 3
TIDE enables the first cross-architecture distillation of dLLMs, improving a 0.6B student by 1.53 average points over baselines when trained from 8B dense and 16B MoE teachers.
InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis cs.CL · 2026-04-14 · unverdicted · none · ref 37
InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
Adaptive Stopping for Multi-Turn LLM Reasoning cs.CL · 2026-04-01 · unverdicted · none · ref 39
MiCP is the first conformal prediction method for multi-turn LLM pipelines that allocates per-turn error budgets to enable adaptive stopping with an overall coverage guarantee, shown to reduce turns and cost on RAG and ReAct benchmarks.
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection cs.CL · 2024-10-06 · unverdicted · none · ref 91
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
GraphPlanner: Graph Memory-Augmented Agentic Routing for Multi-Agent LLMs cs.CL · 2026-04-26 · unverdicted · none · ref 3
GraphPlanner augments multi-agent LLM routing with a heterogeneous graph memory and RL-optimized MDP workflow generation, delivering up to 9.3% higher accuracy and over 99% lower GPU cost than prior routers while supporting zero-shot generalization.
Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought cs.CL · 2026-04-24 · unverdicted · none · ref 49
Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.
The Consensus Trap: Rescuing Multi-Agent LLMs from Adversarial Majorities via Token-Level Collaboration cs.CL · 2026-04-18 · unverdicted · none · ref 3
Token-level interleaving in multi-agent LLMs allows honest agents to overpower adversarial majorities through dynamic logic chaining, unlike brittle response-level majority voting.
CocoaBench: Evaluating Unified Digital Agents in the Wild cs.CL · 2026-04-13 · unverdicted · none · ref 3
CocoaBench shows the best tested unified digital agents succeed on only 45.1% of human-designed tasks that demand integrated vision, search, and coding.
Many-Tier Instruction Hierarchy in LLM Agents cs.CL · 2026-04-10 · unverdicted · none · ref 40
ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.
Multilingual Embedding Probes Fail to Generalize Across Learner Corpora cs.CL · 2026-04-08 · conditional · none · ref 3
Multilingual embedding probes achieve strong in-distribution CEFR prediction (QWK ≈ 0.7) but fail to generalize across corpora, converging to uniform predictions and capturing corpus-specific features instead of language-general proficiency.
Learning to Interrupt in Language-based Multi-agent Communication cs.CL · 2026-04-07 · unverdicted · none · ref 50
HANDRAISER learns optimal interruption points in multi-agent LLM communication using estimated future reward and cost, achieving 32.2% lower communication cost with comparable or better task results across games, scheduling, and debate.
Attention Flows: Tracing LLM Conceptual Engagement via Story Summaries cs.CL · 2026-04-07 · unverdicted · none · ref 54
LLM novel summaries emphasize endings more than human ones, measured by aligning summary sentences to referenced chapters.
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks cs.CL · 2026-04-07 · unverdicted · none · ref 38
FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
Unlocking Prompt Infilling Capability for Diffusion Language Models cs.CL · 2026-04-04 · unverdicted · none · ref 29
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge cs.CL · 2026-04-03 · unverdicted · none · ref 3
CresOWLve benchmark shows frontier LLMs retrieve relevant real-world facts but struggle to form creative connections, with up to 17% lower performance on creative questions than factual ones.
BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence cs.CL · 2026-04-03 · unverdicted · none · ref 3
BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
Student-in-the-Loop Chain-of-Thought Distillation via Generation-Time Selection cs.CL · 2026-04-03 · unverdicted · none · ref 3
Gen-SSD improves chain-of-thought distillation by letting the student model guide the teacher's generation process through real-time selection of learnable reasoning branches, yielding 5.9-point gains over standard KD on math benchmarks.
M2-Verify: A Large-Scale Multidomain Benchmark for Checking Multimodal Claim Consistency cs.CL · 2026-04-01 · unverdicted · none · ref 3
M2-Verify is a new multidomain benchmark dataset for multimodal scientific claim consistency that reveals state-of-the-art models drop from 85.8% to 61.6% Micro-F1 on complex perturbations and produce hallucinated explanations.
Stochastic Attention: Connectome-Inspired Randomized Routing for Expressive Linear-Time Attention cs.CL · 2026-04-01 · unverdicted · none · ref 3
Stochastic Attention applies random permutations to token sequences in sliding-window attention to achieve exponentially growing receptive fields and full coverage in logarithmic layers, outperforming standard SWA in language model pre-training and inference.
Avoiding Overthinking and Underthinking: Curriculum-Aware Budget Scheduling for LLMs cs.CL · 2026-03-29 · unverdicted · none · ref 3
BACR adaptively schedules token budgets for LLM reasoning via curriculum learning and a unified policy, improving accuracy by up to 8.3% under tight budgets while cutting token use by 34% on math benchmarks.
MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models cs.CL · 2025-11-13 · conditional · none · ref 3
MTR-DuplexBench is a multi-round benchmark for full-duplex speech language models that evaluates turn consistency, dialogue quality, instruction following, and safety.
Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models cs.CL · 2025-11-11 · unverdicted · none · ref 3
Think-at-Hard selectively triggers latent iterations only on hard tokens via a neural decider and depth-aware LoRA, yielding 3.8-6.8% gains over baselines on nine reasoning benchmarks while iterating on just 7% of tokens.
LayerNorm Induces Recency Bias in Transformer Decoders cs.CL · 2025-09-25 · unverdicted · none · ref 26
Stacked causal self-attention combined with LayerNorm induces recency bias in Transformer decoders, reversing the earlier-token bias seen in attention alone.
LogitTrace: Detecting Benchmark Contamination via Layerwise Logit Trajectories cs.CL · 2025-09-25 · unverdicted · none · ref 38
LogitTrace detects benchmark contamination by showing that contaminated inputs produce earlier stabilization in layerwise logit trajectories while clean inputs show more gradual accumulation.
Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation cs.CL · 2024-12-19 · unverdicted · none · ref 44
S^2-Bench is a new one-to-many benchmark for natural language-driven molecule generation with three tasks, and OpenMolIns is an instruction dataset enabling Llama3.1-8B to outperform GPT-4o and Claude-3.5 on it.
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering cs.CL · 2024-10-09 · unverdicted · none · ref 35
MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.
LoRA: Low-Rank Adaptation of Large Language Models cs.CL · 2021-06-17 · accept · none · ref 65
Adapting large language models by training only a low-rank decomposition BA added to frozen weight matrices matches full fine-tuning while cutting trainable parameters by orders of magnitude and adding no inference latency.
A paradox of AI fluency cs.CL · 2026-04-28 · unverdicted · none · ref 43
Fluent AI users adopt an active, iterative collaboration mode that produces more visible failures but better recovery and success on hard tasks, whereas novices experience more invisible failures from passive use.
LLMs Corrupt Your Documents When You Delegate cs.CL · 2026-04-17 · unverdicted · none · ref 3
LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models cs.CL · 2026-04-16 · unverdicted · none · ref 36
VLMs show answer inertia in CoT reasoning and remain influenced by misleading textual cues even with sufficient visual evidence, making CoT an incomplete window into modality reliance.
Adaptive Conformal Prediction for Improving Factuality of Generations by Large Language Models cs.CL · 2026-04-15 · unverdicted · none · ref 3
An adaptive conformal prediction approach for LLMs enables prompt-dependent calibration that improves conditional coverage for factuality while preserving marginal guarantees and supporting selective prediction.
Beyond Arrow's Impossibility: Fairness as an Emergent Property of Multi-Agent Collaboration cs.CL · 2026-04-15 · unverdicted · none · ref 10
Fairness emerges from multi-agent negotiation in a hospital triage task, where joint allocations satisfy ethical criteria that neither aligned nor biased agent achieves in isolation.
BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation cs.CL · 2026-04-10 · unverdicted · none · ref 2
BERT-as-a-Judge fine-tunes a BERT encoder on synthetic question-candidate-reference triplets to judge answer correctness, outperforming lexical baselines and matching larger LLM judges across 36 models and 15 tasks.
Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts cs.CL · 2026-04-09 · conditional · none · ref 105
Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-parameter model on the full dataset.
Linear Representations of Hierarchical Concepts in Language Models cs.CL · 2026-04-09 · unverdicted · none · ref 33
Language models encode concept hierarchies as linear transformations that are domain-specific yet structurally similar across domains.
Document Optimization for Black-Box Retrieval via Reinforcement Learning cs.CL · 2026-04-06 · unverdicted · none · ref 3
Document optimization via GRPO fine-tuning transforms documents to improve black-box retrieval, enabling smaller models to outperform larger ones on code and VDR benchmarks.
Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation cs.CL · 2026-04-06 · unverdicted · none · ref 31
Localizing judge prompts to five languages shows that LLM backbones interact with language in agent-as-a-judge evaluations, inverting rankings and revealing no universal best model with low inter-judge agreement.
Testing the Limits of Truth Directions in LLMs cs.CL · 2026-04-04 · unverdicted · none · ref 3
Truth directions in LLMs are not universal but depend heavily on model layer, task type and difficulty, and prompt instructions.
Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments cs.CL · 2026-04-03 · accept · none · ref 3
LLM alignments redirect stereotypes to implicit tasks instead of removing them, producing bias score divergences up to 0.43 across explicit and implicit probes in audits of seven models.
Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge cs.CL · 2026-04-03 · unverdicted · none · ref 3
RL with an LLM judge provides rewards on unlabeled data for knowledge distillation, yielding gains on math benchmarks when mixed with verifiable rewards.
No Single Best Model for Diversity: Learning a Router for Sample Diversity cs.CL · 2026-04-02 · unverdicted · none · ref 3
No single LLM is best for response diversity; a router selecting the per-prompt best model raises diversity coverage from 23.8% to 26.3% on NB-Wildchat and generalizes to new data.
LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks cs.CL · 2026-03-20 · unverdicted · none · ref 3
LiveClawBench is a pilot benchmark and Triple-Axis Complexity Framework for evaluating LLM agents on compositional real-world assistant tasks derived from real usage data.
Multi-User Large Language Model Agents cs.CL · 2026-03-19 · unverdicted · none · ref 2
Frontier LLMs show systematic failures in stable prioritization under conflicting objectives, increasing privacy violations over turns, and efficiency bottlenecks in multi-user coordination.
Robust LLM Performance Certification via Constrained Maximum Likelihood Estimation cs.CL · 2026-03-11 · unverdicted · none · ref 3
Constrained MLE fuses human calibration data, LLM judge labels, and judge performance bounds to yield accurate low-variance estimates of LLM failure rates.
Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates cs.CL · 2025-12-04 · conditional · none · ref 102
SSU mitigates catastrophic forgetting in low-resource LLM target-language adaptation by scoring and column-wise freezing source-critical parameters, reducing source degradation to ~3% versus ~20% for full fine-tuning while matching target performance.
Structured Uncertainty guided Clarification for LLM Agents cs.CL · 2025-11-11 · unverdicted · none · ref 3
Structured uncertainty with EVPI enables more efficient clarification and better training for tool-calling LLM agents on ambiguous tasks.
The Realignment Problem: When Right becomes Wrong in LLMs cs.CL · 2025-11-04 · unverdicted · none · ref 26
TRACE is a three-stage optimization framework that realigns LLMs to new policies by categorizing preference conflicts, scoring impact via bi-level optimization, and applying hybrid losses without new human annotations.
Graph-Based Alternatives to LLMs for Human Simulation cs.CL · 2025-11-03 · conditional · none · ref 104
GEMS formulates close-ended human-behavior simulation as link prediction on a heterogeneous graph and matches or exceeds LLM performance with three orders of magnitude fewer parameters across three datasets and three evaluation settings.
Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning cs.CL · 2025-09-30 · unverdicted · none · ref 48
KG-R1 trains a single RL agent to retrieve from and reason over knowledge graphs in one loop, achieving higher accuracy with fewer tokens than multi-module baselines and transferring to unseen graphs.
Localizing Task Recognition and Task Learning in In-Context Learning via Attention Head Analysis cs.CL · 2025-09-29 · unverdicted · none · ref 51
A new framework using Task Subspace Logit Attribution localizes attention heads specialized for task recognition and task learning in in-context learning, showing they align and rotate hidden states within a task subspace.

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer