ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
arXiv preprint arXiv:2501.10326
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 4
citing papers explorer
- ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution
  ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
- Sakana Fugu Technical Report
  Sakana Fugu trains LLM orchestrators using fine-tuning, evolutionary algorithms, and RL to build query-adaptive multi-agent scaffolds, claiming SOTA results on benchmarks including SWE-Bench Pro and GPQA-Diamond.
- PeerCheck: Enhancing LLM-Generated Academic Reviews Towards Human-Level Quality
  PeerCheck finds that chain-of-thought prompting improves LLM academic reviews while retrieval-augmented generation sometimes lowers quality, and that LLMs and humans emphasize different aspects of papers.
- LLM-Based Scientific Peer Review: Methods, Benchmarks, and Reliability Challenges
  A survey synthesizing LLM methods for peer review critique generation and score prediction, including taxonomies, benchmark limitations, domain biases, and robustness risks such as prompt injection.