A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.
Delgado-Chaves, Matthew J
4 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (4)
Representative citing papers
A new 507-leaf taxonomy and a 4x6 Target x Technique matrix are used to audit six LLM attack benchmarks, finding that they cover at most 25% of the threat surface and leave entire STRIDE categories untested.
LLMs given only research questions from 1000 arXiv CS papers recommend a narrower set of methods than the original papers, with effective model-entity diversity dropping from 1232 to 59-96 and stronger agreement among LLMs than with papers.
Only 12.2% of 3,967 eligible prediction model studies share code, with shared repositories frequently lacking dependency specifications and modular structure needed for reproducibility.
citing papers explorer
- Code Sharing In Prediction Model Research: A Scoping Review