6 Pith papers cite this work.
citing papers explorer
- Open-World Evaluations for Measuring Frontier AI Capabilities
  Open-world evaluations using qualitative review of real-world tasks can give earlier warnings of frontier AI capabilities than automated benchmarks, as demonstrated by an AI agent publishing a simple iOS app with one minor human fix.
- Reliable model selection in the presence of parameter non-identifiability
  Proposes adaptive multiple importance sampling for robust Bayesian model evidence estimation under parameter non-identifiability, shown to outperform deterministic methods on ecological case studies while being cheaper than MCMC.
- Stability selection enables robust learning of partial differential equations from limited noisy data
  PDE-STRIDE applies stability-based model selection to sparse regression for robust, parameter-free recovery of PDEs from noisy data.
- A tool to determine the degrees of freedom in tree-structured varying coefficient models
  A formula approximating degrees of freedom for tree-structured varying coefficient models is proposed to improve BIC model selection over naive parameter counting.
- U-SEG: Uncertainty in SEGmentation -- A systematic multi-variable exploration
  Systematic multi-variable experiments show panoptic segmentation yields poorer uncertainty quality than semantic, with high variance across datasets and backbones, limited value from time-series samples, calibration gains from sample diversity, and conditional benefits from ensembles over single det
- Factor multivariate stochastic volatility models of high dimension
  Proposes fMSV framework using factor decomposition, two-stage estimation, and derived asymptotics for high-dimensional multivariate stochastic volatility, tested via simulations and portfolio applications.