31 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
-
AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks
AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction paradox where privacy instructions increase discussion of sensitive information.
-
Adaptive Stopping for Multi-Turn LLM Reasoning
MiCP is the first conformal prediction method for multi-turn LLM pipelines that allocates per-turn error budgets to enable adaptive stopping with an overall coverage guarantee, shown to reduce turns and cost on RAG and ReAct benchmarks.
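MiCP's exact calibration procedure is not reproduced here, but the core idea of per-turn error budgets can be sketched generically: split a total miscoverage budget α across T turns (a union-bound allocation, here simply α/T), take the standard split-conformal quantile of each turn's calibration nonconformity scores, and stop at the first turn whose score clears its threshold. Function names, the uniform allocation, and the synthetic scores below are illustrative assumptions, not the paper's method.

```python
import numpy as np

def per_turn_thresholds(cal_scores_by_turn, alpha=0.1):
    """Union-bound budget split: per-turn budgets sum to alpha, so the
    overall miscoverage is at most alpha. Each turn then gets a standard
    split-conformal quantile with the finite-sample correction."""
    T = len(cal_scores_by_turn)
    alpha_t = alpha / T                              # uniform allocation (simplest scheme)
    thresholds = []
    for scores in cal_scores_by_turn:
        n = len(scores)
        q = np.ceil((n + 1) * (1 - alpha_t)) / n     # finite-sample corrected level
        thresholds.append(np.quantile(scores, min(q, 1.0)))
    return thresholds

def adaptive_stop(turn_scores, thresholds):
    """Stop at the first turn whose nonconformity score clears its threshold."""
    for t, (s, tau) in enumerate(zip(turn_scores, thresholds)):
        if s <= tau:
            return t                                  # answer accepted at turn t
    return len(turn_scores) - 1                       # budget exhausted: keep last turn

# Synthetic calibration data: 3 turns, 500 uniform nonconformity scores each.
rng = np.random.default_rng(0)
cal = [rng.uniform(0, 1, 500) for _ in range(3)]
taus = per_turn_thresholds(cal, alpha=0.1)
stop = adaptive_stop([0.999, 0.5, 0.2], taus)         # turn 0 fails, turn 1 passes
```

The uniform α/T split is the bluntest allocation; the summary's "allocates per-turn error budgets" suggests the actual method chooses the split adaptively.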
-
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
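The gating mechanism fits in a few lines. This NumPy sketch uses random placeholder weights (in the real layer both projections are trained, and the noise scale is the softplus of a learned projection); only the top-k experts get nonzero gates, which is what makes the compute sub-linear in expert count.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_topk_gate(x, W_g, W_noise, k=2):
    """Noisy top-k gating: perturb the gate logits with input-dependent noise,
    keep the k largest, softmax over them, and zero out the rest so only k
    experts are dispatched to."""
    noise_scale = np.log1p(np.exp(x @ W_noise))          # softplus
    logits = x @ W_g + rng.standard_normal(W_g.shape[1]) * noise_scale
    topk = np.argsort(logits)[-k:]                        # indices of the k largest
    gates = np.zeros_like(logits)
    e = np.exp(logits[topk] - logits[topk].max())
    gates[topk] = e / e.sum()                             # softmax over top-k only
    return gates

d, n_experts = 8, 16
x = rng.standard_normal(d)
W_g = rng.standard_normal((d, n_experts))
W_noise = rng.standard_normal((d, n_experts))
g = noisy_topk_gate(x, W_g, W_noise, k=2)                 # 2 of 16 experts active
```

The noise term matters for load balancing during training: it randomizes which experts land in the top k for borderline inputs, spreading traffic across experts.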
-
MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents
Joint training of a primary SAE with a meta SAE that applies a decomposability penalty on decoder directions produces more atomic latents, shown by 7.5% lower mean absolute phi and 7.6% higher fuzzing scores on GPT-2.
-
CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge
CresOWLve benchmark shows frontier LLMs retrieve relevant real-world facts but struggle to form creative connections, with up to 17% lower performance on creative questions than factual ones.
-
BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence
BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
-
Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models
A procedural engine generates 200k+ synthetic geometry diagrams to fine-tune VLMs for referring image segmentation on abstract diagrams, yielding 49% IoU and 85% Buffered IoU with Florence-2 versus under 1% zero-shot.
-
Student-in-the-Loop Chain-of-Thought Distillation via Generation-Time Selection
Gen-SSD improves chain-of-thought distillation by letting the student model guide the teacher's generation process through real-time selection of learnable reasoning branches, yielding 5.9-point gains over standard KD on math benchmarks.
-
M2-Verify: A Large-Scale Multidomain Benchmark for Checking Multimodal Claim Consistency
M2-Verify is a new multidomain benchmark dataset for multimodal scientific claim consistency that reveals state-of-the-art models drop from 85.8% to 61.6% Micro-F1 on complex perturbations and produce hallucinated explanations.
-
Stochastic Attention: Connectome-Inspired Randomized Routing for Expressive Linear-Time Attention
Stochastic Attention applies random permutations to token sequences in sliding-window attention to achieve exponentially growing receptive fields and full coverage in logarithmic layers, outperforming standard SWA in language model pre-training and inference.
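A toy, non-causal simulation (not the paper's implementation) of why random permutations widen the receptive field: if each layer permutes the sequence before local window mixing, the set of positions that can influence a token multiplies by roughly the window width per layer, so covering all n positions takes O(log n) layers. The function name and parameters are illustrative.

```python
import numpy as np

def reachable_after_layers(n=512, window=4, layers=6, seed=0):
    """Smallest receptive-field size over all tokens after `layers` rounds of
    (random permutation -> mixing within +/- `window` positions)."""
    rng = np.random.default_rng(seed)
    reach = [{i} for i in range(n)]          # each token starts seeing only itself
    for _ in range(layers):
        perm = rng.permutation(n)            # perm[p] = token occupying slot p
        slot = np.argsort(perm)              # slot[i] = where token i landed
        new_reach = []
        for i in range(n):
            p = slot[i]
            s = set()
            for q in range(max(0, p - window), min(n, p + window + 1)):
                s |= reach[perm[q]]          # union over window neighbours
            new_reach.append(s)
        reach = new_reach
    return min(len(r) for r in reach)

one_layer = reachable_after_layers(layers=1)   # plain local mixing: ~window-sized
six_layers = reachable_after_layers(layers=6)  # coverage of all 512 positions
```

With a fixed (unpermuted) sliding window, the receptive field instead grows only linearly, by one window width per layer.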
-
HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models
HumorRank ranks nine LLMs on textual humor using GTVH-grounded pairwise tournaments and Adaptive Swiss aggregation on the SemEval-2026 MWAHAHA dataset, finding that comedic mechanism mastery matters more than scale.
-
QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation
QuanBench+ is a new multi-framework benchmark showing LLMs reach 43-60% Pass@1 on quantum code tasks across three libraries, rising to 67-83% with error-feedback repair, yet performance remains strongly framework-dependent.
-
LoRA: Low-Rank Adaptation of Large Language Models
Adapting large language models by training only a low-rank decomposition BA added to frozen weight matrices matches full fine-tuning while cutting trainable parameters by orders of magnitude and adding no inference latency.
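The mechanism fits in a few lines. This NumPy sketch (dimensions and the scaling α are arbitrary choices for illustration) shows the two defining properties: the update starts at zero because B is zero-initialized, and BA can be folded back into the frozen weight after training, so a merged model pays no extra inference cost.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4                  # r << d is the low-rank bottleneck

W0 = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, small Gaussian init
B = np.zeros((d_out, r))                    # trainable, zero init: delta starts at 0
alpha = 8.0                                 # scaling hyperparameter

def lora_forward(x):
    # h = W0 x + (alpha / r) * B A x  -- only A and B receive gradients
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# Before training, behavior is identical to the frozen model.
assert np.allclose(lora_forward(x), W0 @ x)

# After training, BA merges into W0, so inference is a single matmul again.
W_merged = W0 + (alpha / r) * (B @ A)
```

Trainable parameters here are 2·r·d = 512 versus d² = 4096 for full fine-tuning of this matrix; the gap widens with model size since r stays small.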
-
Architecture Determines Observability of Transformers
Architecture and training determine whether transformers retain a readable internal signal that lets activation monitors catch errors missed by output confidence.
-
When AI Agents Disagree Like Humans: Reasoning Trace Analysis for Human-AI Collaborative Moderation
Agent verdict agreement in multi-agent hate speech moderation correlates with lower human annotator disagreement, with large effect sizes, motivating uncertainty-surfacing designs over consensus-seeking.
-
Testing the Limits of Truth Directions in LLMs
Truth directions in LLMs are not universal but depend heavily on model layer, task type and difficulty, and prompt instructions.
-
Align then Train: Efficient Retrieval Adapter Learning
A two-stage adapter method aligns query and document embedding spaces to improve dense retrieval for complex queries using lightweight encoders and few labels.
-
Flash-Mono: Feed-Forward Accelerated Gaussian Splatting Monocular SLAM
Flash-Mono uses a recurrent feed-forward frontend with cross-attention to predict poses and 2D Gaussian surfel attributes for monocular SLAM, achieving 10x speedup and state-of-the-art tracking and mapping.
-
Generative Frontiers: Why Evaluation Matters for Diffusion Language Models
Generative perplexity and entropy are shown to be the two additive components of KL divergence to a reference distribution, motivating generative frontiers as a principled evaluation method for diffusion language models.
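The stated decomposition follows directly from the definition of KL divergence. Writing q for the model's generative distribution and p for the reference:

```latex
D_{\mathrm{KL}}(q \,\|\, p)
  = \mathbb{E}_{x \sim q}\!\left[\log \frac{q(x)}{p(x)}\right]
  = \underbrace{\mathbb{E}_{x \sim q}\!\left[-\log p(x)\right]}_{\text{log generative perplexity}}
  \; - \; \underbrace{\mathbb{E}_{x \sim q}\!\left[-\log q(x)\right]}_{\text{entropy } H(q)}
```

One reading of why this motivates a frontier: low generative perplexity alone can be achieved by collapsing entropy (a near-deterministic model parroting high-probability text), so closeness to the reference requires trading off both terms, which a perplexity-entropy frontier makes visible.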
-
Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments
LLM alignments redirect stereotypes to implicit tasks instead of removing them, producing bias score divergences up to 0.43 across explicit and implicit probes in audits of seven models.
-
Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge
RL with an LLM judge provides rewards on unlabeled data for knowledge distillation, yielding gains on math benchmarks when mixed with verifiable rewards.
-
No Single Best Model for Diversity: Learning a Router for Sample Diversity
No single LLM is best for response diversity; a router selecting the per-prompt best model raises diversity coverage from 23.8% to 26.3% on NB-Wildchat and generalizes to new data.
-
From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents
A two-stage SFT pipeline distills execution-free then execution-based trajectories from a 480B model into smaller Qwen2.5-Coder agents, yielding 62.2% resolution on SWE-bench Verified and 44.1% zero-shot on the multilingual version.
-
Multimodal Language Models Cannot Spot Spatial Inconsistencies
Multimodal LLMs significantly underperform humans at spotting objects that break 3D consistency in multi-view image pairs.
-
Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models
Scene Dynamic Field integrates physics simulators into MLLM fine-tuning to boost intuitive physics understanding, delivering up to 20.7% gains on fluid tasks with generalization to unseen domains.
-
Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents
Citation URLs from LLMs and research agents are hallucinated 3-13% of the time and non-resolving 5-18% of the time, with a released tool that reduces failures by 6-79x.
-
Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts
Domain-adapted utterance-level retrieval raises Cohen's kappa for tutoring dialogue act annotation to 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi, beating no-retrieval baselines by large margins across three LLMs.
-
Grading the Unspoken: Evaluating Tacit Reasoning in Quantum Field Theory and String Theory with LLMs
LLMs reach near-ceiling performance on explicit QFT and string theory derivations but degrade when required to reconstruct omitted reasoning steps or resolve implicit tensions under global consistency constraints.
-
Embedding-Only Uplink for Onboard Retrieval Under Shift in Remote Sensing
Embedding-only uplink enables flexible onboard retrieval for remote sensing under distribution shifts, with kNN superior for cloud classification and centroids for temporal change detection.
-
Distilling Genomic Models for Efficient mRNA Representation Learning via Embedding Matching
Embedding-based distillation shrinks a large genomic model 200-fold into a compact mRNA specialist that reaches state-of-the-art results among similarly sized models.
-
Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance
VLMs show systematic fragility in visual invariance under geometric transformations, with sharp performance drops as semantic content thins across sketches, photos, and art.