Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks, with frontier-model rejection rates of 0.6-7.7%, lower than the error rates of existing curated databases.
Daniel L. Landsness
9 Pith papers cite this work. Polarity classification is still indexing.
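The summary above quotes a per-task "frontier-model rejection rate" of 0.6-7.7%. As a rough illustration only (the task name and audit data below are invented, not from the paper), such a rate could be computed by having a frontier model accept or reject each extracted record and tallying rejections per task:

```python
# Hypothetical sketch of a per-task rejection-rate tally.
# Task names and audit verdicts here are made up for illustration.
from collections import defaultdict

def rejection_rates(verdicts):
    """verdicts: iterable of (task, accepted) pairs from a frontier-model audit.
    Returns {task: fraction of records the auditor rejected}."""
    totals = defaultdict(int)
    rejected = defaultdict(int)
    for task, accepted in verdicts:
        totals[task] += 1
        if not accepted:
            rejected[task] += 1
    return {task: rejected[task] / totals[task] for task in totals}

# 100 audited records for one invented task, 3 rejected -> 3% rejection rate
audit = [("gene-disease", True)] * 97 + [("gene-disease", False)] * 3
print(rejection_rates(audit))  # {'gene-disease': 0.03}
```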
[Citation-role and citation-polarity summary charts: 9 citations in 2026; polarity counts shown include 1 "other" and 1 "unclear".]
Citing papers
-
Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks with 0.6-7.7% frontier-model rejection rates, lower than error rates on existing curated databases.
-
AI co-mathematician: Accelerating mathematicians with agentic AI
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
-
Optimizing ground state preparation protocols with autoresearch
AI coding agents evolve simple ground-state protocols into improved versions for VQE, DMRG, and AFQMC on spin models and molecules by using executable energy scores under fixed compute budgets.
-
AI scientists produce results without reasoning scientifically
LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
-
Unlocking LLM Creativity in Science through Analogical Reasoning
Analogical reasoning increases LLM solution diversity by 90-173% and novelty rate to over 50%, delivering up to 13-fold gains on biomedical tasks including perturbation prediction and cell communication.
-
Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery
Expert mathematicians using an AI coding agent for discovery engage in repeated cycles of intentmaking to define goals and sensemaking to interpret outputs.
-
Hypothesis generation and updating in large language models
LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.
-
SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning
SciResearcher automates creation of diverse scientific reasoning tasks from academic evidence to train an 8B model that sets new SOTA at 19.46% on HLE-Bio/Chem-Gold and gains 13-15% on SuperGPQA-Hard-Biology and TRQA-Literature.
-
PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research
PRL-Bench evaluates frontier LLMs on 100 real physics research tasks and finds the best models score below 50, exposing a gap to autonomous discovery.