128 Pith papers cite this work.
-
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
-
DRACULA: Hunting for the Actions Users Want Deep Research Agents to Execute
DRACULA is the first dataset of user feedback on intermediate actions for deep research agents, showing that LLMs predict preferred actions better with full user history and that history-based action generation leads to higher user selection rates.
-
Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
-
Layerwise Dynamics for In-Context Classification in Transformers
Enforcing feature- and label-permutation equivariance in transformers for in-context classification yields an identifiable emergent update rule driven by mixed feature-label Gram matrices that amplifies class separation.
-
The Linear Representation Hypothesis and the Geometry of Large Language Models
Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
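As a pointer for readers, the paper's candidate construction is usually stated as the inner product induced by the inverse covariance of the unembedding vectors (a sketch; notation follows the paper loosely):

```latex
\[
\langle \bar{\gamma}_W, \bar{\gamma}_Z \rangle_C
\;=\;
\bar{\gamma}_W^{\top}\,\mathrm{Cov}(\gamma)^{-1}\,\bar{\gamma}_Z ,
\]
```

under which the directions of causally separable concepts become orthogonal, making probing and steering consistent with each other.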
-
Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning
EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.
-
Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems
Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.
-
Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization
DIPS fine-tunes LLMs to output ordered feasible decision vectors approximating Pareto fronts for constrained bi-objective convex problems, reaching 95-98% normalized hypervolume with 0.16s inference.
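For readers unfamiliar with the metric behind the reported 95-98% figure, here is a minimal sketch of 2D hypervolume for bi-objective minimization (function and variable names are ours, not the paper's):

```python
def hypervolume_2d(points, ref):
    """Area dominated by a set of bi-objective minimization points,
    bounded above by the reference point `ref`."""
    # Keep only points strictly inside the reference box.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:  # f1 ascending; the front has f2 descending
        if f2 < prev_f2:
            hv += (ref[0] - f1) * (prev_f2 - f2)  # add the new slice
            prev_f2 = f2
    return hv

# Normalized hypervolume divides by the area of the reference box
# spanned by an ideal point: hv / ((ref[0]-ideal[0]) * (ref[1]-ideal[1])).
```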
-
Deep Minds and Shallow Probes
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
-
Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking
BICR uses blind-image contrastive ranking on frozen LVLM hidden states to train a lightweight probe that penalizes confidence on blacked-out inputs, yielding top calibration and discrimination across five models and multiple tasks at low parameter cost.
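A minimal sketch of the contrastive-ranking idea as we read it: a margin loss that pushes the probe's confidence on the real image above its confidence on a blacked-out image. The linear probe and all names are illustrative, not the paper's code:

```python
import torch
import torch.nn as nn

hidden_dim = 4096  # example size; set to the frozen LVLM's hidden width
probe = nn.Linear(hidden_dim, 1)  # the only trainable parameters

def bicr_loss(h_real, h_blind, margin=0.1):
    """h_real / h_blind: [batch, hidden_dim] pooled hidden states for the
    same prompt with the true image vs. a blacked-out image (backbone frozen)."""
    s_real = probe(h_real).squeeze(-1)    # confidence with visual evidence
    s_blind = probe(h_blind).squeeze(-1)  # confidence while "blind"
    # Rank real above blind: positive loss whenever s_blind + margin > s_real.
    return torch.relu(margin - (s_real - s_blind)).mean()
```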
-
MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image
MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.
-
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.
-
Task-Aware Calibration: Provably Optimal Decoding in LLMs
Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.
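For reference, the sampling-based MBR rule the paper analyzes selects, from a hypothesis set H drawn from the model, the candidate with the highest expected utility under the model's own distribution:

```latex
\[
y^{\mathrm{MBR}}
\;=\;
\underset{y \in \mathcal{H}}{\arg\max}\;
\frac{1}{|\mathcal{H}|} \sum_{y' \in \mathcal{H}} u(y, y'),
\]
```

where u is a utility such as BLEU or BERTScore; the paper's claim is that calibrating the model's distribution in a latent task space makes this estimator provably optimal.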
-
SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators
SalesSim benchmarks MLLMs as retail user simulators, finds gaps in persona adherence and over-persuasion, and introduces UserGRPO RL to raise decision alignment by 13.8%.
-
OBLIQ-Bench: Exposing Overlooked Bottlenecks in Modern Retrievers with Latent and Implicit Queries
OBLIQ-Bench reveals that modern retrievers fail to surface documents for latent and implicit queries even though LLMs reliably recognize relevance when those documents are provided.
-
Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers
SIOP enables turn-level credit assignment in LLM agents via semantic clustering of final answers as latent outcomes, improving performance on reasoning benchmarks without verifiers.
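A heavily hedged sketch of our reading of the mechanism: cluster sampled final answers and treat cluster mass as a verifier-free outcome signal per trajectory. Clustering is reduced here to exact match after canonicalization, whereas the paper uses semantic clustering; all names are illustrative:

```python
from collections import Counter

def pseudo_rewards(final_answers, canonicalize=str.strip):
    """Trajectories whose final answers fall in larger clusters of the
    sampled answer set receive a higher verifier-free reward."""
    clusters = Counter(canonicalize(a) for a in final_answers)
    n = len(final_answers)
    return [clusters[canonicalize(a)] / n for a in final_answers]
```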
-
Deep Graph-Language Fusion for Structure-Aware Code Generation
CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.
-
Toward Structural Multimodal Representations: Specialization, Selection, and Sparsification via Mixture-of-Experts
S3 decomposes multimodal data into selectable semantic experts, routes them adaptively, and sparsifies to achieve higher accuracy on MultiBench benchmarks with peak performance at intermediate sparsity levels.
-
Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese
Prosa shows that rubric-based binary scoring with multi-judge filtering yields cross-judge agreement on all 16 LLM rankings on Brazilian Portuguese chats, versus only 7 of 16 under holistic scoring, while widening score gaps by 47%.

-
On Stable Long-Form Generation: Benchmarking and Mitigating Length Volatility
VOLTBench quantifies length volatility in LLM long-form generation; GLoBo, a logits-boosting decoder, increases mean length by 148% and cuts volatility by 69% while preserving quality.
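One plausible form of logits boosting consistent with the summary: suppress end-of-sequence until a target length, with a penalty that decays as generation proceeds. This is our illustration, not the paper's decoder:

```python
import numpy as np

def length_boosted_logits(logits, eos_id, step, target_len, boost=5.0):
    """Discourage early termination: subtract a decaying penalty from the
    EOS logit while the generation is shorter than `target_len`."""
    out = logits.copy()
    if step < target_len:
        out[eos_id] -= boost * (1.0 - step / target_len)
    return out
```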
-
How Generative AI Disrupts Search: An Empirical Study of Google Search, Gemini, and AI Overviews
AI Overviews and Gemini retrieve substantially different sources than traditional Google search (Jaccard similarity <0.2), favor Google-owned content, appear for 51.5% of queries especially controversial ones, and are less consistent across repeated or slightly edited queries.
-
Green Shielding: A User-Centric Approach Towards Trustworthy AI
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.
-
Factual and Edit-Sensitive Graph-to-Sequence Generation via Graph-Aware Adaptive Noising
DLM4G applies graph-aware adaptive noising in a diffusion framework to generate text from graphs, outperforming larger autoregressive and diffusion baselines in factual grounding and edit sensitivity on three datasets plus molecule captioning.
-
A Large-Scale, Cross-Disciplinary Corpus of Systematic Reviews
A new corpus of 301,871 systematic reviews across all sciences is released with extracted method artifacts to support retrieval benchmarking and meta-research.
-
Multilingual and Domain-Agnostic Tip-of-the-Tongue Query Generation for Simulated Evaluation
An LLM simulation framework generates multilingual tip-of-the-tongue queries, validated by rank correlation with real queries, producing the first large-scale ToT benchmarks for four languages.
-
CT Open: An Open-Access, Uncontaminated, Live Platform for the Open Challenge of Clinical Trial Outcome Prediction
CT Open is a new live platform with an automated LLM-powered decontamination pipeline that supplies uncontaminated benchmarks for predicting clinical trial outcomes.
-
Defragmenting Language Models: An Interpretability-based Approach for Vocabulary Expansion
Interpretability-based selection of vocabulary items plus FragMend initialization reduces token over-fragmentation and improves performance for non-Latin script languages by roughly 20 points over baselines.
-
Psychological Steering of Large Language Models
Mean-difference residual-stream injections outperform personality prompting for OCEAN trait steering in most LLMs; hybrids of the two perform best, and steering is approximately linear but produces trait covariances unlike human ones.
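A minimal sketch of mean-difference steering under standard assumptions; the choice of layer, pooling, and injection scale are details the paper would pin down:

```python
import torch

def steering_vector(h_high, h_low):
    """h_high / h_low: [n, d] residual-stream activations collected at one
    layer on prompts exhibiting high vs. low levels of an OCEAN trait."""
    return h_high.mean(dim=0) - h_low.mean(dim=0)

def add_steering_hook(layer_module, v, alpha=4.0):
    """Inject alpha * v into the residual stream on every forward pass
    (alpha is an illustrative scale)."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * v
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)
```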
-
Validity-Calibrated Reasoning Distillation
Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.
-
AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning
A single query-specific poisoned document, built by extracting and iteratively refining an adversarial chain-of-thought, can substantially degrade reasoning accuracy in retrieval-augmented LLM systems.
-
Sign-to-Speech Prosody Transfer via Sign Reconstruction-based GAN
SignRecGAN trains on separate sign and speech datasets via adversarial and reconstruction objectives to inject sign-derived prosody into TTS output using the S2PFormer model.
-
The Character Error Vector: Decomposable errors for page-level OCR evaluation
The Character Error Vector is a decomposable bag-of-characters evaluator for page-level OCR that remains defined under parsing errors and bridges parsing metrics with local CER.
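A minimal sketch of a bag-of-characters error vector as described: per-character count differences between OCR output and reference, which stay defined even when line or reading-order parsing fails. The scalar normalization below is our choice:

```python
from collections import Counter

def char_error_vector(ocr_text, ref_text):
    """Signed per-character count difference; defined even when
    alignment-based CER cannot be computed (e.g., reading-order failures)."""
    diff = Counter(ocr_text)
    diff.subtract(Counter(ref_text))
    return {ch: n for ch, n in diff.items() if n != 0}

def bag_of_chars_error(ocr_text, ref_text):
    """Scalar summary: total absolute count mismatch per reference character."""
    vec = char_error_vector(ocr_text, ref_text)
    return sum(abs(n) for n in vec.values()) / max(len(ref_text), 1)
```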
-
DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions
DetailVerifyBench supplies 1,000 images and densely annotated long captions to evaluate precise hallucination localization in multimodal large language models.
-
Can You Trust the Vectors in Your Vector Database? Black-Hole Attack from Embedding Space Defects
Injecting a few malicious vectors near the centroid exploits centrality-driven hubness in high-dimensional embeddings, causing them to dominate top-k retrievals in up to 99.85% of cases.
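A toy numpy demonstration of why a centroid-adjacent vector becomes a hub under cosine similarity; the synthetic anisotropy here is mild, while the paper targets real embedding spaces whose defects make the effect far stronger (up to the reported 99.85% domination):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 10_000
# Anisotropic embeddings: a shared mean component plus noise,
# a common defect of real embedding spaces.
mean_dir = rng.normal(size=d)
docs = mean_dir + 0.5 * rng.normal(size=(n, d))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# Attacker injects a single vector at the (normalized) centroid.
black_hole = docs.mean(axis=0)
black_hole /= np.linalg.norm(black_hole)

queries = mean_dir + 0.5 * rng.normal(size=(100, d))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

doc_sims = queries @ docs.T        # [100, n] query-document cosines
hole_sims = queries @ black_hole   # [100] query-attacker cosines
# Fraction of queries for which the injected vector outranks every real doc:
print((hole_sims > doc_sims.max(axis=1)).mean())
```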
-
Attribution Bias in Large Language Models
LLMs exhibit large systematic disparities in quote attribution accuracy across race, gender, and intersectional groups, along with widespread and uneven suppression of attributions.
-
Faster Superword Tokenization
Frequency aggregation of supermerge candidates and a two-phase formulation make BoundlessBPE and SuperBPE training over 600x faster on 1GB data while preserving identical results, with open-source Python and Rust code.
-
ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces
ClawsBench is a benchmark using high-fidelity mock services to evaluate LLM agents on 44 productivity tasks, finding 39-64% success rates and 7-33% unsafe action rates depending on scaffolding.
-
StoryScope: Investigating idiosyncrasies in AI fiction
StoryScope extracts narrative features showing AI stories favor tidy plots and over-explain themes while human stories show more moral ambiguity and temporal complexity, enabling strong detection and attribution.
-
AlphaEvolve: A coding agent for scientific and algorithmic discovery
AlphaEvolve is an LLM-orchestrated evolutionary coding agent that discovered a 4x4 complex matrix multiplication algorithm using 48 scalar multiplications, the first improvement over Strassen's algorithm in 56 years, plus optimizations for Google data centers and hardware.
-
Extending Context Window of Large Language Models via Positional Interpolation
Position Interpolation linearly down-scales position indices to extend RoPE context windows to 32768 tokens with 1000-step fine-tuning, delivering strong long-context results on LLaMA 7B-65B while preserving short-context quality.
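The core mechanism is a one-line rescaling: map extended positions back into the trained range before computing RoPE angles, rather than extrapolating past it. A minimal sketch (helper names ours; the paper then fine-tunes briefly at the extended length):

```python
import numpy as np

def rope_angles(position, dim, base=10000.0):
    """RoPE rotation angles for a token at `position` (head dim `dim`)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return position * inv_freq

def interpolated_position(position, train_ctx=2048, target_ctx=32768):
    """Position Interpolation: linearly down-scale positions in
    [0, target_ctx) into the trained range [0, train_ctx)."""
    return position * (train_ctx / target_ctx)

# A token at raw position 30000 is treated as position 1875 for the angles.
angles = rope_angles(interpolated_position(30000), dim=128)
```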
-
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.
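A minimal PyTorch rendering of the DPO objective: a logistic loss on the difference of implicit rewards, each a policy-versus-reference log-ratio (sequence-level log-probabilities assumed precomputed):

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Sequence log-probs under the trained policy and a frozen reference."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Binary classification: prefer the chosen response over the rejected one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```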
-
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
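Concretely, the RAG-Sequence model marginalizes retrieved passages z as latent variables:

```latex
\[
p(y \mid x)
\;\approx\;
\sum_{z \,\in\, \mathrm{top\text{-}}k\left(p_\eta(\cdot \mid x)\right)}
p_\eta(z \mid x)\,
\prod_{i} p_\theta\!\left(y_i \mid x, z, y_{1:i-1}\right),
\]
```

where p_eta is the DPR retriever and p_theta the BART generator.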
-
OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention
OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.
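A heavily hedged numpy sketch of our reading: a standard delta-rule fast-weight update with a per-coordinate (diagonal) preconditioner maintained online. The paper's actual preconditioning rule and chunkwise-parallel form are not reproduced here:

```python
import numpy as np

def delta_rule_step(S, k, v, d, beta=1.0, eps=1e-6):
    """One fast-weight update. S: [dv, dk] state; k: [dk] key; v: [dv] value;
    d: [dk] running diagonal statistics (accumulated k*k)."""
    d = d + k * k                       # online second-moment accumulator
    k_pre = k / np.sqrt(d + eps)        # diagonally preconditioned key
    pred = S @ k_pre                    # current prediction for this key
    S = S + beta * np.outer(v - pred, k_pre)  # delta-rule correction
    return S, d
```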
-
Task-Adaptive Embedding Refinement via Test-time LLM Guidance
Test-time LLM feedback refines query embeddings to deliver up to 25% relative gains on zero-shot literature search, intent detection, and related benchmarks.
-
Not How Many, But Which: Parameter Placement in Low-Rank Adaptation
Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.
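A minimal sketch of gradient-informed placement under our reading: score candidate weight matrices by gradient norm on a calibration batch, then attach LoRA adapters only to the top-scoring ones. The `loss_fn(model, batch)` interface is illustrative:

```python
import torch

def rank_modules_by_gradient(model, loss_fn, batch, top_k=8):
    """Return the names of the top_k nn.Linear layers by gradient norm."""
    model.zero_grad()
    loss_fn(model, batch).backward()
    scores = {
        name: module.weight.grad.norm().item()
        for name, module in model.named_modules()
        if isinstance(module, torch.nn.Linear)
        and module.weight.grad is not None
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```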
-
Domain Restriction via Multi SAE Layer Transitions
Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
-
LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents
LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.
-
MIRA: An LLM-Assisted Benchmark for Multi-Category Integrated Retrieval
MIRA is a new benchmark for multi-category integrated retrieval built from real queries on a social science platform, with LLM assistance for topic descriptions and relevance labeling across four item categories.
-
Evaluating the False Trust engendered by LLM Explanations
A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.
-
An Annotation Scheme and Classifier for Personal Facts in Dialogue
An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 points with lower compute.