super hub Mixed citations

write newline

" write newline "" before

Mixed citation behavior. Most common role is unclear (62%).

301 Pith papers citing it

unclear 62% of classified citations

citation-role summary

background 8 · other 4 · method 1

citation-polarity summary

unclear 8 · background 4 · use method 1
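
The 62% headline figure appears to follow from the counts listed above: 8 of the 13 classified citations fall in the largest category. The short sketch below reproduces that arithmetic. It assumes the share is computed over the 13 classified citations rather than over all 301 citing papers; the page does not state its formula, and the names `counts` and `shares` are illustrative only.

```python
# Counts as listed in the summary above (assumed to be the full set of
# classified citations; the page does not say how the 62% is derived).
counts = {"unclear": 8, "background": 4, "use method": 1}


def shares(counts):
    """Return each label's share of the total, as a whole-number percentage."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


role_shares = shares(counts)
print(role_shares["unclear"])  # 8 of 13 classified citations
```

Under this assumption the largest category rounds to 62%, matching the figure quoted at the top of the page. Because each share is rounded independently, the percentages need not sum to exactly 100.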

claims ledger

background: Table A1: Comparison of BAS for frontier models across tasks when varying the risk-prior w(t). Higher scores indicate better alignment with expressed uncertainty. The standard BAS (Uniform: w(t) = 1) serves as the baseline, while Linear and Quadratic weights simulate increasingly safety-critical environments. Identical ECE, different BAS. Consider two models evaluated on four examples with correctness labels Z = [1, 1, 0, 0]. The models produce the following confidence values: Example 1 2 3 4 Z 1 1 0

authors

" write newline "" before

representative citing papers

Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

cs.CL · 2026-04-29 · unverdicted · novelty 8.0

TIDE enables the first cross-architecture distillation of dLLMs, improving a 0.6B student by 1.53 average points over baselines when trained from 8B dense and 16B MoE teachers.

JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models

cs.LG · 2026-04-17 · unverdicted · novelty 8.0

JumpLoRA uses JumpReLU gating to induce adaptive sparsity in LoRA blocks, achieving dynamic parameter isolation that prevents task interference and improves continual learning performance over IncLoRA and ELLA.

Context Over Content: Exposing Evaluation Faking in Automated Judges

cs.AI · 2026-04-16 · conditional · novelty 8.0

LLM judges exhibit up to 9.8 percentage points of leniency bias when prompts signal stakes, and the bias operates implicitly, without being mentioned in the chain-of-thought.

InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

cs.CL · 2026-04-14 · unverdicted · novelty 8.0

InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.

Exact Certification of Neural Networks and Partition Aggregation Ensembles against Label Poisoning

cs.LG · 2026-04-13 · unverdicted · novelty 8.0

EnsembleCert and ScaLabelCert enable tighter and exact certificates for neural network robustness against label-flipping attacks by leveraging white-box information and neural tangent kernel equivalence.

Steered LLM Activations are Non-Surjective

cs.AI · 2026-04-10 · unverdicted · novelty 8.0 · 2 refs

Steered LLM activations are non-surjective: under practical assumptions, they lie outside the set of states reachable from any discrete prompt.

AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks

cs.AI · 2026-04-01 · unverdicted · novelty 8.0

AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction paradox where privacy instructions increase discussion of sensitive information.

Adaptive Stopping for Multi-Turn LLM Reasoning

cs.CL · 2026-04-01 · unverdicted · novelty 8.0

MiCP is the first conformal prediction method for multi-turn LLM pipelines that allocates per-turn error budgets to enable adaptive stopping with an overall coverage guarantee, shown to reduce turns and cost on RAG and ReAct benchmarks.

Parameterized Hardness of Zonotope Containment and Neural Network Verification

cs.CC · 2025-09-26 · unverdicted · novelty 8.0

The paper proves W[1]-hardness parameterized by dimension d for positivity, zonotope containment, max approximation, and L_p-Lipschitz constants in 2- and 3-layer ReLU networks, showing enumeration methods are optimal under ETH.

RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks

cs.CR · 2025-09-25 · conditional · novelty 8.0

RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.

The Coding Limits of Robust Watermarking for Generative Models

cs.CR · 2025-09-11 · accept · novelty 8.0

Establishes an unconditional robustness threshold of 1-1/q for zero-bit tamper-detection codes in watermarking, with matching constructions and experimental confirmation on image models.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

BEAVER: An Enterprise Benchmark for Text-to-SQL

cs.CL · 2024-09-03 · unverdicted · novelty 8.0

BEAVER is the first text-to-SQL benchmark from private enterprise data warehouses, revealing SOTA agentic frameworks achieve only 10.8% accuracy on complex real-world queries.

Score-Based Generative Modeling through Stochastic Differential Equations

cs.LG · 2020-11-26 · unverdicted · novelty 8.0

Introduces an SDE-based framework for score-based generative modeling that unifies prior methods, enables predictor-corrector sampling and neural ODE likelihoods, and achieves SOTA unconditional image generation on CIFAR-10.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

cs.LG · 2017-01-23 · accept · novelty 8.0

A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.

Adam: A Method for Stochastic Optimization

cs.LG · 2014-12-22 · accept · novelty 7.5

A first-order stochastic optimizer that maintains bias-corrected exponential moving averages of the gradient and its square, dividing the former by the square root of the latter to set per-parameter step sizes.

AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism

cs.LG · 2026-04-29 · unverdicted · novelty 7.0

AutoSP automates sequence parallelism and long-context activation checkpointing via compilation, enabling up to 2.7x longer training contexts on NVIDIA hardware with negligible throughput loss.

Cooperate to Compete: Strategic Coordination in Multi-Agent Conquest

cs.AI · 2026-04-28 · conditional · novelty 7.0

C2C is a new testbed where LM agents negotiate differently from humans and targeted prompting raises their win rate from 22.2% to 32.7% across 1,100+ games.

XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation

cs.AI · 2026-04-27 · unverdicted · novelty 7.0

XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.

GraphPlanner: Graph Memory-Augmented Agentic Routing for Multi-Agent LLMs

cs.CL · 2026-04-26 · unverdicted · novelty 7.0

GraphPlanner augments multi-agent LLM routing with a heterogeneous graph memory and RL-optimized MDP workflow generation, delivering up to 9.3% higher accuracy and over 99% lower GPU cost than prior routers while supporting zero-shot generalization.

MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

cs.IR · 2026-04-25 · unverdicted · novelty 7.0

MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning

cs.LG · 2026-04-24 · unverdicted · novelty 7.0

A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.

Pliable rejection sampling

stat.ML · 2026-04-24 · unverdicted · novelty 7.0

Pliable rejection sampling learns a kernel-based proposal to enable efficient i.i.d. sampling from target distributions f with high-probability correctness and a guarantee on accepted samples.

Modulating Cross-Modal Convergence with Single-Stimulus, Intra-Modal Dispersion

q-bio.NC · 2026-04-23 · unverdicted · novelty 7.0

Stimuli with low intra-modal dispersion among vision models elicit up to twice the cross-modal alignment with language models compared to high-dispersion stimuli.

citing papers explorer

Showing 50 of 301 citing papers.

Spend Less, Fit Better: Budget-Efficient Scaling Law Fitting via Active Experiment Selection cs.LG · 2026-04-24 · unverdicted · none · ref 1
An uncertainty-aware sequential selection algorithm fits scaling laws to near-full accuracy using only about 10% of the total experimental training budget across diverse benchmarks.
Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors cs.CV · 2026-04-24 · unverdicted · none · ref 1
Explicit prompt baselines cut NLI contradictions by up to 42.6% with zero training, while learned gated context projectors deliver a 34% reduction in planning-stage contradictions and 50% higher cross-stage entailment on DriveLM-nuScenes.
Only Brains Align with Brains: Cross-Region Alignment Patterns Expose Limits of Normative Models q-bio.NC · 2026-04-23 · unverdicted · none · ref 65
Alignment pattern analysis reveals that models aligned to individual brain ROIs do not reproduce the stable cross-region alignment profiles observed across human subjects.
SafeDream: Safety World Model for Proactive Early Jailbreak Detection cs.CR · 2026-04-18 · unverdicted · none · ref 1
SafeDream uses a safety world model, CUSUM accumulation, and contrastive latent-space imagination to detect multi-turn jailbreaks 1.06-1.20 turns early on average across benchmarks while keeping competitive false-positive rates.
SAVE: A Generalizable Framework for Multi-Condition Single-Cell Generation with Gene Block Attention cs.AI · 2026-04-18 · unverdicted · none · ref 1
SAVE is a conditional Transformer framework with gene block attention and flow matching that generates multi-condition single-cell data and generalizes better than prior methods to unseen condition combinations.
Faster LLM Inference via Sequential Monte Carlo cs.LG · 2026-04-17 · unverdicted · none · ref 1
SMC-SD replaces rejection sampling with particle resampling in speculative decoding to deliver 2.36x speedup over standard SD and 5.2x over autoregressive decoding while staying within 3% of target accuracy.
LLMs Corrupt Your Documents When You Delegate cs.CL · 2026-04-17 · unverdicted · none · ref 1
LLMs, including frontier models, corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, and agentic tools do not mitigate the issue.
SCATR: Simple Calibrated Test-Time Ranking cs.LG · 2026-04-16 · unverdicted · none · ref 1
SCATR calibrates a simple scorer from base-model hidden representations on limited data to improve Best-of-N response selection, delivering up to 9% gains over heuristics with orders-of-magnitude less compute than fine-tuning or PRMs.
ProtoTTA: Prototype-Guided Test-Time Adaptation cs.LG · 2026-04-16 · unverdicted · none · ref 28
ProtoTTA is a test-time adaptation framework for prototype models that uses intermediate prototype signals and entropy minimization to improve robustness and semantic focus under distribution shifts.
Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography cs.CV · 2026-04-16 · unverdicted · none · ref 1
LAMAE adds latent-space attention to masked autoencoders so multi-view echocardiography videos can exchange information across frames and views, yielding representations that transfer from adult to pediatric hearts and enable ICD-10 code prediction on MIMIC-IV-ECHO.
Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models cs.CL · 2026-04-16 · unverdicted · none · ref 34
VLMs show answer inertia in CoT reasoning and remain influenced by misleading textual cues even with sufficient visual evidence, making CoT an incomplete window into modality reliance.
Quantifying Cross-Query Contradictions in Multi-Query LLM Reasoning cs.AI · 2026-04-16 · unverdicted · none · ref 1
A benchmark and solver-augmented method reduces cross-query contradictions in LLMs (SetCons from 0.56 to 0.94) while preserving per-query accuracy across four domains.
From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space cs.LG · 2026-04-15 · unverdicted · none · ref 76
PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baselines on reasoning tasks.
Reward Design for Physical Reasoning in Vision-Language Models cs.AI · 2026-04-15 · unverdicted · none · ref 1
Accuracy-based rewards outperform SFT and other reward variants in GRPO training of VLMs on the PhyX physics benchmark, with attention-weight rewards raising spatial reasoning accuracy from 0.27 to 0.50.
Adaptive Conformal Prediction for Improving Factuality of Generations by Large Language Models cs.CL · 2026-04-15 · unverdicted · none · ref 1
An adaptive conformal prediction approach for LLMs enables prompt-dependent calibration that improves conditional coverage for factuality while preserving marginal guarantees and supporting selective prediction.
Calibrate-Then-Delegate: Safety Monitoring with Risk and Budget Guarantees via Model Cascades cs.LG · 2026-04-15 · unverdicted · none · ref 1
CTD trains a lightweight DV probe to predict escalation benefits and calibrates its threshold via multiple hypothesis testing on held-out data to deliver finite-sample guarantees on delegation rate while outperforming uncertainty-based cascades on safety tasks.
Beyond Arrow's Impossibility: Fairness as an Emergent Property of Multi-Agent Collaboration cs.CL · 2026-04-15 · unverdicted · none · ref 8
Fairness emerges from multi-agent negotiation in a hospital triage task, where joint allocations satisfy ethical criteria that neither the aligned nor the biased agent achieves in isolation.
Introspective Diffusion Language Models cs.AI · 2026-04-13 · unverdicted · none · ref 1
I-DLM matches same-scale autoregressive model quality in diffusion language models by enforcing introspective consistency via strided decoding, outperforming prior DLMs on 15 benchmarks including 69.6 on AIME-24.
Sanity Checks for Agentic Data Science cs.AI · 2026-04-13 · unverdicted · none · ref 1
Sanity checks using input perturbations can reveal when agentic data science conclusions lack support from stable signal, as shown on synthetic data and 11 real datasets where 6 affirmative claims were unsupported.
Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents cs.LG · 2026-04-12 · unverdicted · none · ref 1
Skill-SD turns an agent's completed trajectories into dynamic natural-language skills that condition only the teacher in self-distillation, yielding 14-42% gains over RL and OPSD baselines on multi-turn agent benchmarks.
CodeQuant: Unified Clustering and Quantization for Enhanced Outlier Smoothing in Low-Precision Mixture-of-Experts cs.LG · 2026-04-12 · unverdicted · none · ref 1
CodeQuant unifies learnable rotation smoothing with cluster-centroid absorption of outliers to reduce quantization error in low-precision MoE models, reporting up to 4.15x speedup and higher accuracy than prior PTQ methods.
Is There Knowledge Left to Extract? Evidence of Fragility in Medically Fine-Tuned Vision-Language Models cs.CV · 2026-04-10 · unverdicted · none · ref 1
Medically fine-tuned VLMs exhibit fragile performance that degrades with task difficulty and shows no reliable advantage over general models, with high sensitivity to prompt changes.
The Myth of Expert Specialization in MoEs: Why Routing Reflects Geometry, Not Necessarily Domain Expertise cs.AI · 2026-04-10 · unverdicted · none · ref 1
Expert specialization in MoEs is an emergent effect of hidden state geometry due to linear routers, not domain expertise, as confirmed empirically across models and explained by a proof on load-balancing effects.
BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation cs.CL · 2026-04-10 · unverdicted · none · ref 4
BERT-as-a-Judge fine-tunes a BERT encoder on synthetic question-candidate-reference triplets to judge answer correctness, outperforming lexical baselines and matching larger LLM judges across 36 models and 15 tasks.
OASIS: Online Activation Subspace Learning for Memory-Efficient Training cs.LG · 2026-04-10 · unverdicted · none · ref 1
OASIS tracks an evolving low-dimensional activation subspace to project activations, gradients, and optimizer states, cutting peak memory up to 2x versus full fine-tuning while matching performance on finetuning and pretraining tasks.
MixFlow: Mixed Source Distributions Improve Rectified Flows cs.CV · 2026-04-10 · unverdicted · none · ref 1
Mixing unconditional Gaussian noise with a κ-conditioned source during training of rectified flows reduces path curvature, yielding 12% better FID scores and faster sampling than standard rectified flows.
Efficient RL Training for LLMs with Experience Replay cs.LG · 2026-04-09 · unverdicted · none · ref 2
Well-designed experience replay buffers reduce inference compute in LLM RL post-training while maintaining or improving performance and preserving policy entropy.
EvoLen: Evolution-Guided Tokenization for DNA Language Model cs.LG · 2026-04-09 · unverdicted · none · ref 1
EvoLen is an evolution-guided tokenizer that stratifies DNA sequences by conservation signals, applies group-specific BPE, and uses dynamic programming decoding to improve preservation of functional motifs over standard BPE.
Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest cs.AI · 2026-04-09 · unverdicted · none · ref 112
Many LLMs prioritize company ad incentives over user welfare by recommending pricier sponsored products, disrupting purchases, or concealing prices in comparisons.
Learning Who Disagrees: Demographic Importance Weighting for Modeling Annotator Distributions with DiADEM cs.AI · 2026-04-09 · unverdicted · none · ref 1
DiADEM learns demographic importance weights to model annotator disagreement distributions and outperforms LLM and neural baselines on disagreement tracking in DICES and VOICED benchmarks.
Bit-by-Bit: Progressive QAT Strategy with Outlier Channel Splitting for Stable Low-Bit LLMs cs.LG · 2026-04-09 · unverdicted · none · ref 1
Bit-by-Bit achieves stable 2-bit quantization of Llama models via block-wise progressive training and outlier channel splitting, reporting only 2.25 WikiText2 PPL degradation versus full precision while outperforming prior QAT baselines.
Linear Representations of Hierarchical Concepts in Language Models cs.CL · 2026-04-09 · unverdicted · none · ref 31
Language models encode concept hierarchies as linear transformations that are domain-specific yet structurally similar across domains.
CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V cs.AI · 2026-04-09 · unverdicted · none · ref 1
CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.
The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning cs.LG · 2026-04-07 · unverdicted · none · ref 32
LLMs discover latent planning strategies up to five steps during training and execute them up to eight steps at test time, with larger models reaching seven under few-shot prompting, revealing a dissociation between discovery and execution.
When Do We Need LLMs? A Diagnostic for Language-Driven Bandits cs.AI · 2026-04-07 · unverdicted · none · ref 1
Lightweight numerical bandits on text embeddings match or exceed LLM accuracy in contextual bandits at a fraction of the cost, with an embedding-based diagnostic to choose between them.
Improving Semantic Proximity in Information Retrieval through Cross-Lingual Alignment cs.IR · 2026-04-07 · unverdicted · none · ref 1
Multilingual retrievers show English bias in mixed-language pools; a small-data training strategy improves cross-lingual alignment and reduces the bias.
Planning to Explore: Curiosity-Driven Planning for LLM Test Generation cs.SE · 2026-04-06 · unverdicted · none · ref 23
CovQValue achieves 51-77% higher branch coverage than greedy baselines on TestGenEval Lite by using coverage feedback and LLM-estimated Q-values to select informative test plans.
Vintix II: Decision Pre-Trained Transformer is a Scalable In-Context Reinforcement Learner cs.LG · 2026-04-06 · unverdicted · none · ref 1
Scaling Decision Pre-Trained Transformer with Flow Matching on hundreds of tasks yields an agent with improved generalization in in-context reinforcement learning.
Cheap Talk, Empty Promise: Frontier LLMs easily break public promises for self-interest cs.CY · 2026-04-06 · unverdicted · none · ref 1
LLMs deviate from announced actions in 56.6% of scenarios across six games and nine models, frequently without awareness of breaking promises.
Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation cs.CL · 2026-04-06 · unverdicted · none · ref 29
Localizing judge prompts to five languages shows that LLM backbones interact with language in agent-as-a-judge evaluations, inverting rankings and revealing no universal best model with low inter-judge agreement.
When AI Agents Disagree Like Humans: Reasoning Trace Analysis for Human-AI Collaborative Moderation cs.MA · 2026-04-04 · unverdicted · none · ref 1
Agent verdict agreement in multi-agent hate speech moderation correlates with lower human annotator disagreement, with large effect sizes, motivating uncertainty-surfacing designs over consensus-seeking.
Testing the Limits of Truth Directions in LLMs cs.CL · 2026-04-04 · unverdicted · none · ref 1
Truth directions in LLMs are not universal but depend heavily on model layer, task type and difficulty, and prompt instructions.
Align then Train: Efficient Retrieval Adapter Learning cs.IR · 2026-04-03 · unverdicted · none · ref 1
A two-stage adapter method aligns query and document embedding spaces to improve dense retrieval for complex queries using lightweight encoders and few labels.
Flash-Mono: Feed-Forward Accelerated Gaussian Splatting Monocular SLAM cs.RO · 2026-04-03 · unverdicted · none · ref 1
Flash-Mono uses a recurrent feed-forward frontend with cross-attention to predict poses and 2D Gaussian surfel attributes for monocular SLAM, achieving 10x speedup and state-of-the-art tracking and mapping.
Generative Frontiers: Why Evaluation Matters for Diffusion Language Models cs.LG · 2026-04-03 · conditional · none · ref 1
Generative perplexity and entropy are shown to be the two additive components of KL divergence to a reference distribution, motivating generative frontiers as a principled evaluation method for diffusion language models.
Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments cs.CL · 2026-04-03 · accept · none · ref 1
LLM alignments redirect stereotypes to implicit tasks instead of removing them, producing bias score divergences up to 0.43 across explicit and implicit probes in audits of seven models.
Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge cs.CL · 2026-04-03 · unverdicted · none · ref 1
RL with an LLM judge provides rewards on unlabeled data for knowledge distillation, yielding gains on math benchmarks when mixed with verifiable rewards.
No Single Best Model for Diversity: Learning a Router for Sample Diversity cs.CL · 2026-04-02 · unverdicted · none · ref 1
No single LLM is best for response diversity; a router selecting the per-prompt best model raises diversity coverage from 23.8% to 26.3% on NB-Wildchat and generalizes to new data.
From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents cs.SE · 2026-04-02 · unverdicted · none · ref 45
A two-stage SFT pipeline distills execution-free then execution-based trajectories from a 480B model into smaller Qwen2.5-Coder agents, yielding 62.2% resolution on SWE-bench Verified and 44.1% zero-shot on the multilingual version.
Multimodal Language Models Cannot Spot Spatial Inconsistencies cs.CV · 2026-04-01 · unverdicted · none · ref 1
Multimodal LLMs significantly underperform humans at spotting objects that break 3D consistency in multi-view image pairs.
