ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
15 Pith papers cite this work.
fields
cs.LG 4 stat.ME 3 cs.CL 2 astro-ph.GA 1 cs.CY 1 physics.gen-ph 1 quant-ph 1 stat.CO 1 stat.ML 1
years
2026 15
citing papers explorer
-
ProactBench: Beyond What The User Asked For
ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
-
The finite-shot help-harm boundary of zero-noise extrapolation
Zero-noise extrapolation has a finite-shot help-harm boundary below which it increases local mean-squared error due to variance penalties outweighing bias reduction.
-
JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems
JudgeSense benchmark shows LLM judge consistency does not reliably improve with model scale, with coherence most sensitive to prompt changes and factuality more stable.
-
How to quantify direct correlations between variables
Jensen-Shannon regularized analogues of KL-based direct-correlation measures are introduced, taking values in [0,1] and accompanied by alphabet-size-dependent upper bounds under the observed marginal p(x,z).
-
Bayesian Modeling and Prediction of Generalized Contact Matrices
A Bayesian model for multi-feature contact matrices that uses tensor structures and contingency table theory to satisfy structural constraints and impute missing contact features, validated on simulations and US/German survey data.
-
Sequential Bayesian Monitoring for Recoverable and Drifting Processes
Bayesian procedures are derived to compute the posterior probability that a recoverable process is currently in control or that a drifting latent parameter lies in an acceptable region.
-
A Semi-Supervised Kernel Two-Sample Test
A semi-supervised kernel two-sample test integrates unlabeled covariate data to achieve asymptotic normality under the null, higher power than standard kernel tests, and consistency against fixed and local alternatives.
-
The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability
Task-aligned supervised geometric stability predicts linear steerability with high accuracy while unsupervised stability detects representational drift earlier and with lower false alarms than CKA or Procrustes.
-
The Autocorrelation Blind Spot: Why 42% of Turn-Level Findings in LLM Conversation Analysis May Be Spurious
42% of significant turn-level associations in LLM conversation analysis are spurious due to unaccounted autocorrelation, with a validated two-stage correction framework improving replication.
-
Cell-induced densification and tether formation in fibrous extracellular matrices with biomimetic physics-informed neural networks
Bio-PINNs with a near-to-far curriculum and deformation-uncertainty proxy recover cell-induced densified phases and tether morphologies more reliably than standard adaptive PINN baselines in single-cell and multicellular settings.
-
Fast and principled equation discovery from chaos to climate
Bayesian-ARGOS is a hybrid frequentist-Bayesian method that discovers equations from limited noisy observations more efficiently than SINDy or bootstrap-ARGOS while adding uncertainty quantification.
-
Power Studies For Two-Sample and Goodness-of-Fit Methods For Multivariate Data
No single goodness-of-fit or two-sample test reliably detects deviations across all multivariate scenarios, so the authors recommend a small combination of methods that together cover the simulated cases.
-
Spatially Resolved Kinematics of SLACS Lens Galaxies. II: Breaking Degeneracies with Lensing and Dynamical Models
Spatially resolved kinematics show SLACS lens galaxies have nearly isothermal total mass profiles (mean γ=2.04) with average mass-sheet parameter λ_int=1.01, consistent with no measurable bias from power-law assumptions in cosmography.
-
Latent Profiles of AI Risk Perception and Their Differential Association with Community Driving Safety Concerns: A Person-Centered Analysis
Four latent profiles of AI risk perception were identified in U.S. adults, with higher AI concern generally linked to greater perceived driving-hazard severity except for AI-versus-human driving comparisons.
-
Constraining Dark Energy Dynamics in Curved Spacetime with Current Observations
Observational constraints on a dark energy EoS parametrization in curved spacetime yield α ≈ 0.35 (0.56) and Ω_k0 that changes sign with ANN data reconstruction.