LLM4SR: A Survey on Large Language Models for Scientific Research

2025 · arXiv 2501.04306

Mixed citation behavior. The most common role is background (60% of classified citations).

19 Pith papers cite it.

citation-role summary

background: 4 · dataset: 1

citation-polarity summary

background: 3 · unclear: 1 · use dataset: 1

representative citing papers

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences

cs.AI · 2026-02-11 · accept · novelty 8.0

ReplicatorBench evaluates LLM agents on replicating social and behavioral science claims across retrieval, computation, and interpretation stages, finding strength in experiment execution but weakness in resource retrieval.

Taking a Pulse on How Generative AI is Reshaping the Software Engineering Research Landscape

cs.SE · 2026-04-13 · accept · novelty 7.0

A survey of 457 SE researchers finds widespread GenAI use concentrated in writing and ideation, with productivity gains but persistent concerns over accuracy, bias, and the need for clearer governance rules.

AlphaEvolve: A coding agent for scientific and algorithmic discovery

cs.AI · 2025-06-16 · unverdicted · novelty 7.0

AlphaEvolve is an LLM-orchestrated evolutionary coding agent that discovered a 4x4 complex matrix multiplication algorithm using 48 scalar multiplications, the first improvement over Strassen's algorithm in 56 years, plus optimizations for Google data centers and hardware.

Graph2Idea: Retrieval-Augmented Scientific Idea Generation with Graph-Structured Contexts

cs.AI · 2026-06-08 · unverdicted · novelty 6.0

Graph2Idea builds dynamic knowledge graphs from retrieved literature to supply compact, relational contexts that guide LLMs in generating novel, feasible, and high-quality scientific ideas, outperforming flat-text baselines on automatic metrics.

GIScholarBench: Benchmarking LLM Overconfidence in GIS Research

cs.IR · 2026-06-06 · unverdicted · novelty 6.0

GIScholarBench shows LLMs exhibit consistent overconfidence across three scholarly tasks in GIS, with different manifestations in factual retrieval, citation expansion, and idea generation.

When AI reviews science: Can we trust the referee?

cs.AI · 2026-04-26 · unverdicted · novelty 6.0

AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference submissions.

CacheClip: Accelerating RAG with Effective KV Cache Reuse

cs.LG · 2025-10-11 · unverdicted · novelty 6.0

CacheClip accelerates RAG prefill by up to 3.33x via auxiliary-model-guided selective KV recomputation while retaining 85-91% of full-attention quality on NIAH and LongBench.

RExBench: Can coding agents autonomously implement AI research extensions?

cs.CL · 2025-06-27 · unverdicted · novelty 6.0

RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.

Scientific discovery as meta-optimization: a combinatorial optimization case study

cs.AI · 2026-06-25 · unverdicted · novelty 5.0

Introduces consensus objective aggregation for meta-optimization of scientific discovery and reports improved scaling and speedup for 3-SAT algorithm discovery using digital MemComputing machines.

PeerCheck: Enhancing LLM-Generated Academic Reviews Towards Human-Level Quality

cs.CL · 2026-06-18 · unverdicted · novelty 5.0

PeerCheck finds that chain-of-thought prompting improves LLM academic reviews while retrieval-augmented generation sometimes lowers quality, and that LLMs and humans emphasize different aspects of papers.

SciLens: Multi-modal Scientific Claim Verification with Agentic Entailment and Grounding

cs.CL · 2026-06-18 · unverdicted · novelty 5.0

SciLens introduces an evidence-conditioned atomic entailment framework that grounds claims to modality-specific witnesses in tables and figures, achieving 79.2% macro-F1 on SciClaimEval.

EvoGens: A Population-Based Heuristic Search Framework for Scientific Idea Generation

cs.CL · 2026-05-29 · unverdicted · novelty 5.0

EvoGens uses rank-based mutation, semantic-aware crossover, and lightweight evaluation to evolve populations of LLM-generated scientific ideas, boosting novelty and diversity metrics.

SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

cs.AI · 2026-05-02 · unverdicted · novelty 5.0

SciResearcher is a new agentic data-construction framework that trains an 8B model via supervised fine-tuning and reinforcement learning to reach 19.46% on HLE-Bio/Chem-Gold and 13-15% gains on related biology and literature benchmarks.

MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

cs.CL · 2026-04-07 · unverdicted · novelty 5.0

MedConclusion is a 5.7M-instance benchmark dataset for generating biomedical conclusions from structured PubMed abstracts, with LLM evaluations showing conclusion writing differs from summarization and that judge choice affects scores.

Hephaestus: Toward a Cybersecurity AI Scientist

cs.CR · 2026-06-29 · unverdicted · novelty 4.0

The paper proposes the Cybersecurity AI Scientist as a modular multi-agent architecture for automating cybersecurity research, distinguished by its focus on non-stationary threats and anchored in a four-zeros risk-trust-incident-energy frame.

LLM-Based Scientific Peer Review: Methods, Benchmarks, and Reliability Challenges

cs.CL · 2026-06-23 · unverdicted · novelty 4.0

A survey synthesizing LLM methods for peer review critique generation and score prediction, including taxonomies, benchmark limitations, domain biases, and robustness risks such as prompt injection.

AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery

cs.AI · 2026-05-22 · unverdicted · novelty 4.0

A survey organizing AI-powered research automation into five workflow stages, defining AutoResearch and Vibe Research, and proposing five evaluation dimensions while noting domain-conditioned limits on autonomy.

AI for Auto-Research: Roadmap & User Guide

cs.AI · 2026-05-18 · unverdicted · novelty 4.0

The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.

Evolving Roles of LLMs in Scientific Innovation: Assistant, Collaborator, Scientist, and Evaluator

cs.DL · 2025-07-16 · unverdicted · novelty 4.0

The paper proposes a four-role framework for LLMs in scientific innovation and reviews methods, benchmarks, and limitations across Assistant, Collaborator, Scientist, and Evaluator roles.

citing papers explorer

Showing 19 of 19 citing papers.

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences · cs.AI · 2026-02-11 · accept · none · ref 9
Taking a Pulse on How Generative AI is Reshaping the Software Engineering Research Landscape · cs.SE · 2026-04-13 · accept · none · ref 27
AlphaEvolve: A coding agent for scientific and algorithmic discovery · cs.AI · 2025-06-16 · unverdicted · none · ref 66
Graph2Idea: Retrieval-Augmented Scientific Idea Generation with Graph-Structured Contexts · cs.AI · 2026-06-08 · unverdicted · none · ref 18
GIScholarBench: Benchmarking LLM Overconfidence in GIS Research · cs.IR · 2026-06-06 · unverdicted · none · ref 17
When AI reviews science: Can we trust the referee? · cs.AI · 2026-04-26 · unverdicted · none · ref 6
CacheClip: Accelerating RAG with Effective KV Cache Reuse · cs.LG · 2025-10-11 · unverdicted · none · ref 4
RExBench: Can coding agents autonomously implement AI research extensions? · cs.CL · 2025-06-27 · unverdicted · none · ref 33
Scientific discovery as meta-optimization: a combinatorial optimization case study · cs.AI · 2026-06-25 · unverdicted · none · ref 20
PeerCheck: Enhancing LLM-Generated Academic Reviews Towards Human-Level Quality · cs.CL · 2026-06-18 · unverdicted · none · ref 40
SciLens: Multi-modal Scientific Claim Verification with Agentic Entailment and Grounding · cs.CL · 2026-06-18 · unverdicted · none · ref 7
EvoGens: A Population-Based Heuristic Search Framework for Scientific Idea Generation · cs.CL · 2026-05-29 · unverdicted · none · ref 22
SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning · cs.AI · 2026-05-02 · unverdicted · none · ref 23
MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts · cs.CL · 2026-04-07 · unverdicted · none · ref 15
Hephaestus: Toward a Cybersecurity AI Scientist · cs.CR · 2026-06-29 · unverdicted · none · ref 27
LLM-Based Scientific Peer Review: Methods, Benchmarks, and Reliability Challenges · cs.CL · 2026-06-23 · unverdicted · none · ref 38
AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery · cs.AI · 2026-05-22 · unverdicted · none · ref 1
AI for Auto-Research: Roadmap & User Guide · cs.AI · 2026-05-18 · unverdicted · none · ref 126
Evolving Roles of LLMs in Scientific Innovation: Assistant, Collaborator, Scientist, and Evaluator · cs.DL · 2025-07-16 · unverdicted · none · ref 118
