GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves a 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
AI can learn scientific taste
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years: 2026 (7)
roles: baseline (1)
polarities: baseline (1)
representative citing papers
LLM-generated research ideas cluster more around bridge-like opportunities and synthesis methods than the broader distribution seen in human papers.
SoundnessBench shows frontier LLMs exhibit pervasive optimism bias when rating the soundness of ML research proposals, frequently calling low-soundness ideas sound under standard prompts.
GraphReview models paper evaluation as LLM-driven message passing on a semantic paper graph that links intrinsic quality, contemporaneous papers, and prior work, then applies Personalized PageRank for ranking and review generation.
FAME models scientific topic trajectories in continuous time to forecast paper impact more accurately than LLMs by aligning manuscripts with field momentum in a dynamic latent space.
ForeSci is a temporally controlled benchmark with 500 tasks for assessing LLM agents on forward-looking AI research judgments in four domains using cutoff-aligned knowledge bases.
ARIS is a three-layer open-source system that uses cross-model adversarial collaboration plus claim-auditing pipelines to make LLM-driven research workflows more reliable.
citing papers explorer
ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment