HypoSpace: A Diagnostic Benchmark for Set-Valued Hypothesis Generation under Underdetermination and Sublinear Coverage Bounds

Anirudh Goyal; Beibei Lin; Dianbo Liu; Hongyu He; Qiran Zou; Tingting Chen; Yew-Soon Ong; Zifeng Yuan

arxiv: 2510.15614 · v3 · pith:OUD2UJ4Dnew · submitted 2025-10-17 · 💻 cs.CL

HypoSpace: A Diagnostic Benchmark for Set-Valued Hypothesis Generation under Underdetermination and Sublinear Coverage Bounds

Tingting Chen , Beibei Lin , Zifeng Yuan , Qiran Zou , Hongyu He , Anirudh Goyal , Yew-Soon Ong , Dianbo Liu This is my paper

classification 💻 cs.CL

keywords hypospacehypothesisbenchmarkinferencespacesadmissiblecoveragediagnostic

0 comments

read the original abstract

Many scientific problems are underdetermined: multiple distinct hypotheses are equally consistent with the same observations. In such settings, effective inference requires not only producing valid explanations, but also systematically exploring and covering the admissible hypothesis set. We introduce HypoSpace, a benchmark that treats large language models (LLMs) as samplers over finite hypothesis spaces and evaluates them on three metrics: Validity, Uniqueness, and Recovery. HypoSpace spans three structured domains (causal graph inference, gravity-constrained 3D voxel reconstruction, and Boolean genetic interaction modeling) with deterministic validators and exactly enumerable solution spaces, plus real-world anchored case studies. Empirically, HypoSpace reveals a capability- and scale-dependent coverage failure: models can maintain high Validity while exhibiting reduced Uniqueness and Recovery as admissible hypothesis spaces become larger or more combinatorial. We further show that the analysis on stratified decoding partially mitigates this collapse, demonstrating HypoSpace's utility as a diagnostic benchmark for set-valued inference. Code is available at: https://github.com/CTT-Pavilion/_HypoSpace.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Unlocking LLM Creativity in Science through Analogical Reasoning
cs.AI 2026-05 conditional novelty 6.0

Analogical reasoning increases LLM solution diversity by 90-173% and novelty rate to over 50%, delivering up to 13-fold gains on biomedical tasks including perturbation prediction and cell communication.
GlowGS: Generative Semantic Feature Learning for 3D Gaussian Splatting in Nighttime Glow Scenes
cs.CV 2026-05 unverdicted novelty 5.0

GlowGS improves 3D Gaussian Splatting in nighttime glow scenes via semantic feature generation from diffusion models and novel-view semantic learning with vision foundation models.