Journal of Clinical Epidemiology 181 (2025)
3 Pith papers cite this work.
Citing papers by year: 2026 (3). Verdicts: unverdicted (3).
Citing papers:
- Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio
  LLM agents reach 90.9% retrieval recall at K=200 but recover at most 52.7% of ground-truth included studies because they cannot reliably apply PI/ECO eligibility criteria to topically similar distractors.
- A Reproducible Optimisation Protocol for Calibrating Prompt-Based Large Language Model Workflows in Evidence Synthesis
  The paper introduces a reproducible optimization protocol for prompt-based LLM workflows in evidence synthesis that separates task definitions from prompt harnesses, optimizes the harness against metrics and examples, and preserves the result as an inspectable artefact.
- Understanding LLMs in Title-Abstract Screening: From Disagreements to Recommendations
  An analysis of LLM-human disagreements in six software engineering systematic reviews reveals recurring causes, such as term ambiguity, and proposes recommendations for LLM deployment.