31 Pith papers cite this work.
citing papers explorer
-
AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks
AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction paradox where privacy instructions increase discussion of sensitive information.
-
Adaptive Stopping for Multi-Turn LLM Reasoning
MiCP is the first conformal prediction method for multi-turn LLM pipelines that allocates per-turn error budgets to enable adaptive stopping with an overall coverage guarantee, shown to reduce turns and cost on RAG and ReAct benchmarks.
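The budget-allocation idea can be sketched with ordinary split conformal prediction. This is not MiCP's actual allocation rule: it is a minimal sketch assuming the simplest scheme, an even (Bonferroni-style) split of the total miscoverage budget across turns, with a per-turn conformal quantile computed from held-out nonconformity scores. All names and the toy data are illustrative.

```python
import numpy as np

def per_turn_thresholds(cal_scores_by_turn, alpha=0.1):
    """Split a total miscoverage budget alpha evenly across turns
    (a Bonferroni-style allocation; the paper's scheme may differ)
    and compute a split-conformal quantile for each turn."""
    T = len(cal_scores_by_turn)
    alpha_t = alpha / T                       # per-turn error budget
    thresholds = []
    for scores in cal_scores_by_turn:
        n = len(scores)
        # standard split-conformal quantile level ceil((n+1)(1-alpha_t))/n
        level = min(1.0, np.ceil((n + 1) * (1 - alpha_t)) / n)
        thresholds.append(np.quantile(scores, level, method="higher"))
    return thresholds

# toy calibration: nonconformity scores for a 3-turn pipeline
rng = np.random.default_rng(0)
cal = [rng.uniform(0, 1, 200) for _ in range(3)]
ths = per_turn_thresholds(cal, alpha=0.1)
```

At inference, a turn whose prediction set (scores below the turn's threshold) is already decisive would allow the pipeline to stop early while the union bound preserves overall coverage.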
-
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
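The gating mechanism is concrete enough to sketch. Below is a minimal numpy version of noisy top-k gating in the spirit of the paper: softplus-scaled Gaussian noise is added to the gate logits, only the k largest survive, and the softmax runs over those k alone. Dimensions and names are illustrative.

```python
import numpy as np

def noisy_topk_gate(x, w_gate, w_noise, k=2, rng=None):
    """Sparse gate weights: keep top-k noisy logits, softmax over them only."""
    rng = rng or np.random.default_rng(0)
    logits = x @ w_gate
    noise_scale = np.log1p(np.exp(x @ w_noise))            # softplus
    noisy = logits + rng.standard_normal(logits.shape) * noise_scale
    topk = np.argsort(noisy)[..., -k:]                     # k largest per row
    masked = np.full_like(noisy, -np.inf)
    np.put_along_axis(masked, topk, np.take_along_axis(noisy, topk, -1), -1)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)               # rows sum to 1

n_experts, d = 8, 16
rng = np.random.default_rng(1)
x = rng.standard_normal((4, d))
g = noisy_topk_gate(x, rng.standard_normal((d, n_experts)),
                    rng.standard_normal((d, n_experts)), k=2, rng=rng)
```

Each input then activates only k of the n experts, which is the source of the sub-linear compute.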
-
Unlocking Prompt Infilling Capability for Diffusion Language Models
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
-
CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge
The CresOWLve benchmark shows that frontier LLMs retrieve relevant real-world facts but struggle to form creative connections, scoring up to 17% lower on creative questions than on factual ones.
-
BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence
BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
-
Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models
A procedural engine generates 200k+ synthetic geometry diagrams to fine-tune VLMs for referring image segmentation on abstract diagrams, yielding 49% IoU and 85% Buffered IoU with Florence-2 versus under 1% zero-shot.
-
Student-in-the-Loop Chain-of-Thought Distillation via Generation-Time Selection
Gen-SSD improves chain-of-thought distillation by letting the student model guide the teacher's generation process through real-time selection of learnable reasoning branches, yielding 5.9-point gains over standard KD on math benchmarks.
-
M2-Verify: A Large-Scale Multidomain Benchmark for Checking Multimodal Claim Consistency
M2-Verify is a new multidomain benchmark for multimodal scientific claim consistency; it reveals that state-of-the-art models drop from 85.8% to 61.6% Micro-F1 under complex perturbations and produce hallucinated explanations.
-
Stochastic Attention: Connectome-Inspired Randomized Routing for Expressive Linear-Time Attention
Stochastic Attention applies random permutations to token sequences in sliding-window attention to achieve exponentially growing receptive fields and full coverage in logarithmic layers, outperforming standard SWA in language model pre-training and inference.
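The coverage claim can be simulated directly. The sketch below is a simplified model of the routing (not the paper's implementation): each layer applies a random permutation and then a local window, and we track which source tokens can influence token 0. The receptive field multiplies roughly by the window size per layer, hence full coverage in logarithmically many layers.

```python
import numpy as np

def coverage_after_layers(n=64, window=4, layers=6, seed=0):
    """Tokens reachable by position 0 after stacking permute-then-window layers."""
    rng = np.random.default_rng(seed)
    reach = [{i} for i in range(n)]        # reach[i]: tokens feeding token i
    cov = []
    for _ in range(layers):
        perm = rng.permutation(n)          # perm[slot] = token placed at slot
        pos = np.empty(n, dtype=int)
        pos[perm] = np.arange(n)           # pos[token] = slot after permutation
        new = []
        for i in range(n):
            s = set()
            for slot in range(max(0, pos[i] - window),
                              min(n, pos[i] + window + 1)):
                s |= reach[perm[slot]]     # union over the local window
            new.append(s)
        reach = new
        cov.append(len(reach[0]))
    return cov

cov = coverage_after_layers()
```

Coverage is monotone (each token sits inside its own window) and saturates at the full sequence length within a handful of layers.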
-
Avoiding Overthinking and Underthinking: Curriculum-Aware Budget Scheduling for LLMs
BACR adaptively schedules token budgets for LLM reasoning via curriculum learning and a unified policy, improving accuracy by up to 8.3% under tight budgets while cutting token use by 34% on math benchmarks.
-
QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation
QuanBench+ is a new multi-framework benchmark showing LLMs reach 43-60% Pass@1 on quantum code tasks across three libraries, rising to 67-83% with error-feedback repair, yet performance remains strongly framework-dependent.
-
LoRA: Low-Rank Adaptation of Large Language Models
Adapting large language models by training only a low-rank decomposition BA added to frozen weight matrices matches full fine-tuning while cutting trainable parameters by orders of magnitude and adding no inference latency.
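The update rule is simple enough to state in a few lines. A minimal numpy sketch of a LoRA linear layer, including the merge that eliminates inference latency (shapes are illustrative; the paper initializes B to zero so training starts from the frozen model):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=4):
    """y = x W^T + (alpha/r) x A^T B^T: frozen W plus low-rank update B A."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

d_in, d_out, r = 32, 16, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trained, small random init
B = rng.standard_normal((d_out, r)) * 0.01  # trained (zero-init in the paper)
x = rng.standard_normal((8, d_in))

y = lora_forward(x, W, A, B, alpha=16, r=r)
# merging B A into W at deployment removes all added inference latency
W_merged = W + (16 / r) * (B @ A)
```

Only A and B are trained: r * (d_in + d_out) parameters versus d_in * d_out for full fine-tuning.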
-
When AI Agents Disagree Like Humans: Reasoning Trace Analysis for Human-AI Collaborative Moderation
Agent verdict agreement in multi-agent hate speech moderation correlates with lower human annotator disagreement, with large effect sizes, motivating uncertainty-surfacing designs over consensus-seeking.
-
Testing the Limits of Truth Directions in LLMs
Truth directions in LLMs are not universal but depend heavily on model layer, task type and difficulty, and prompt instructions.
-
Align then Train: Efficient Retrieval Adapter Learning
A two-stage adapter method aligns query and document embedding spaces to improve dense retrieval for complex queries using lightweight encoders and few labels.
-
Flash-Mono: Feed-Forward Accelerated Gaussian Splatting Monocular SLAM
Flash-Mono uses a recurrent feed-forward frontend with cross-attention to predict poses and 2D Gaussian surfel attributes for monocular SLAM, achieving 10x speedup and state-of-the-art tracking and mapping.
-
Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.
-
Generative Frontiers: Why Evaluation Matters for Diffusion Language Models
Generative perplexity and entropy are shown to be the two additive components of KL divergence to a reference distribution, motivating generative frontiers as a principled evaluation method for diffusion language models.
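The decomposition is the standard identity KL(q || p) = H(q, p) - H(q); reading log generative perplexity as the cross-entropy of model samples under the reference, and the entropy term as the model's own entropy, is my interpretation of the summary. A numeric check on toy distributions:

```python
import numpy as np

def entropy(q):
    """H(q): the model's own entropy."""
    return -np.sum(q * np.log(q))

def cross_entropy(q, p):
    """H(q, p) = -E_q[log p]; its exponential is the perplexity of
    q's samples under the reference p."""
    return -np.sum(q * np.log(p))

def kl(q, p):
    return np.sum(q * np.log(q / p))

q = np.array([0.5, 0.3, 0.2])   # model distribution
p = np.array([0.4, 0.4, 0.2])   # reference distribution
# identity: KL(q || p) = cross_entropy(q, p) - entropy(q)
```

Holding KL fixed thus traces a frontier: lower generative perplexity must be paid for with lower entropy, which is why reporting either number alone is misleading.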
-
Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments
LLM alignments redirect stereotypes to implicit tasks instead of removing them, producing bias score divergences up to 0.43 across explicit and implicit probes in audits of seven models.
-
Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge
RL with an LLM judge provides rewards on unlabeled data for knowledge distillation, yielding gains on math benchmarks when mixed with verifiable rewards.
-
No Single Best Model for Diversity: Learning a Router for Sample Diversity
No single LLM is best for response diversity; a router selecting the per-prompt best model raises diversity coverage from 23.8% to 26.3% on NB-Wildchat and generalizes to new data.
-
ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis
ATBench is a new trajectory-level benchmark with 1,000 diverse and realistic scenarios for assessing safety in LLM agents.
-
Multimodal Language Models Cannot Spot Spatial Inconsistencies
Multimodal LLMs significantly underperform humans at spotting objects that break 3D consistency in multi-view image pairs.
-
Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models
Scene Dynamic Field integrates physics simulators into MLLM fine-tuning to boost intuitive physics understanding, delivering up to 20.7% gains on fluid tasks with generalization to unseen domains.
-
Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents
Citation URLs from LLMs and research agents are hallucinated 3-13% of the time and non-resolving 5-18% of the time, with a released tool that reduces failures by 6-79x.
-
Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts
Domain-adapted utterance-level retrieval raises Cohen's kappa for tutoring dialogue act annotation to 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi, beating no-retrieval baselines by large margins across three LLMs.
-
Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs
Random sampling matches active preference learning on win-rate gains in online DPO yet both degrade benchmark performance, making active selection's overhead hard to justify.
-
Grading the Unspoken: Evaluating Tacit Reasoning in Quantum Field Theory and String Theory with LLMs
LLMs reach near-ceiling performance on explicit QFT and string theory derivations but degrade when required to reconstruct omitted reasoning steps or resolve implicit tensions under global consistency constraints.
-
Embedding-Only Uplink for Onboard Retrieval Under Shift in Remote Sensing
Embedding-only uplink enables flexible onboard retrieval for remote sensing under distribution shifts, with kNN superior for cloud classification and centroids for temporal change detection.
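The two retrieval modes compared can be sketched in a few lines; this is a generic embedding-space classifier pair, not the paper's pipeline, and the Gaussian toy data stands in for uplinked scene embeddings.

```python
import numpy as np

def knn_predict(query, emb, labels, k=5):
    """Majority vote over the k nearest stored embeddings (Euclidean)."""
    d = np.linalg.norm(emb - query, axis=1)
    nearest = labels[np.argsort(d)[:k]]
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]

def centroid_predict(query, emb, labels):
    """Nearest class centroid instead of raw neighbors."""
    classes = np.unique(labels)
    cents = np.stack([emb[labels == c].mean(axis=0) for c in classes])
    return classes[np.argmin(np.linalg.norm(cents - query, axis=1))]

rng = np.random.default_rng(0)
emb = np.concatenate([rng.normal(0, 1, (50, 8)),    # class 0 embeddings
                      rng.normal(3, 1, (50, 8))])   # class 1 embeddings
labels = np.array([0] * 50 + [1] * 50)
q = rng.normal(3, 1, 8)                             # new onboard sample
```

The paper's finding is that the better choice of the two depends on the task: raw neighbors for cloud classification, centroids for temporal change detection.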
-
Distilling Genomic Models for Efficient mRNA Representation Learning via Embedding Matching
Embedding-based distillation shrinks a large genomic model 200-fold into a compact mRNA specialist that reaches state-of-the-art results among similarly sized models.
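The embedding-matching objective can be illustrated with a deliberately tiny stand-in: a frozen nonlinear "teacher" embedder and a linear student fit in closed form to minimize the Frobenius gap between the two embedding sets. Everything here (shapes, the tanh teacher, the linear student) is an illustrative assumption, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_teacher = 500, 64, 32
X = rng.standard_normal((n, d_in))          # stand-in for mRNA features
W_teacher = rng.standard_normal((d_in, d_teacher))
T = np.tanh(X @ W_teacher)                  # frozen teacher embeddings

# embedding-matching objective: min_W ||X W - T||_F^2 (closed-form here;
# a real student would minimize the same loss by gradient descent)
W_student, *_ = np.linalg.lstsq(X, T, rcond=None)
S = X @ W_student                           # compact student embeddings
mse = np.mean((S - T) ** 2)
```

The student never sees downstream labels; it inherits the teacher's representation geometry directly, which is what lets a 200x smaller model stay competitive.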