{"total":28,"items":[{"citing_arxiv_id":"2606.31651","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FARS: A Fully Automated Research System Deployed at Scale","primary_cat":"cs.AI","submitted_at":"2026-06-30T13:30:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FARS deployed at scale produced 166 AI/ML papers across 67 topics that received 282 structured human reviews indicating some review-worthy outputs alongside recurring failure modes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31478","ref_index":115,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"One Reflection Is Not Enough: Self-Correcting Autonomous Research via Multi-Hypothesis Failure Attribution","primary_cat":"cs.AI","submitted_at":"2026-06-30T10:54:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SAGE with MHFA improves failure recovery in autonomous research agents, raising metrics-bearing outputs from 42% to 92% on a 12-topic benchmark versus single-reflection baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31229","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agentic-Ideation: Sample Efficient Agentic Trajectories Synthesis for Scientific Ideation Agents","primary_cat":"cs.AI","submitted_at":"2026-06-30T07:07:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agentic-Ideation uses oracle-guided multi-agent synthesis to generate efficient training trajectories for scientific ideation agents, reporting 11.91% quality gains and over 10x sample efficiency versus workflow 
baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27416","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Glite ARF: Verifier-Driven Research with Parallel LLM Coding Agents","primary_cat":"cs.MA","submitted_at":"2026-06-25T17:52:26+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Glite ARF introduces a verifier-driven three-role framework for parallel LLM coding agents, demonstrated by first- and second-place finishes in the BEA 2026 vocabulary-difficulty shared task across three languages with 29.9-35.9% RMSE reduction at ~$450 API cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02258","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Matter to Mechanism: A Benchmark for AI Co-Scientists in Materials and Battery Research","primary_cat":"cs.CE","submitted_at":"2026-06-01T13:45:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the Matter to Mechanism benchmark of 2,645 structured instances and a composite metric suite for evaluating AI co-scientists on problem-to-hypothesis reasoning in battery materials research.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30961","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EvoGens: A Population-Based Heuristic Search Framework for Scientific Idea Generation","primary_cat":"cs.CL","submitted_at":"2026-05-29T07:56:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EvoGens uses rank-based mutation, semantic-aware crossover, and lightweight evaluation 
to evolve populations of LLM-generated scientific ideas, boosting novelty and diversity metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07591","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research","primary_cat":"cs.LG","submitted_at":"2026-05-28T16:27:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ResearchClawBench is a new benchmark that evaluates autonomous AI research agents on 40 tasks grounded in published papers using expert rubrics, finding that top systems score only 20-26 out of 100.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26340","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence","primary_cat":"cs.AI","submitted_at":"2026-05-25T21:30:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ScientistOne introduces Chain-of-Evidence and an audit system that achieves zero hallucinated references, perfect score verification, and top method-code alignment while matching or beating human experts on five frontier tasks and generalizing to six more.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21825","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Toward AI VIS Co-Scientists: A General and End-to-End Agent Harness for Solving Complex Data Visualization 
Tasks","primary_cat":"cs.AI","submitted_at":"2026-05-20T23:49:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A multi-agent harness autonomously generates functional single-page VIS apps with linked views for scientific data tasks using coordinated skills for analysis, planning, implementation, and evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21481","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists","primary_cat":"cs.AI","submitted_at":"2026-05-20T17:59:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AiraXiv is a proposed AI-driven platform for open preprints that supports human and AI authors with interactive UI and MCP-based interactions, validated by serving as the submission system for ICAIS 2025.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22878","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research","primary_cat":"cs.AI","submitted_at":"2026-05-20T16:03:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"SciAtlas builds a large-scale multi-disciplinary academic knowledge graph and a neuro-symbolic retrieval system to support automated scientific research tasks such as literature review and idea positioning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18661","ref_index":200,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AI for Auto-Research: Roadmap & User 
Guide","primary_cat":"cs.AI","submitted_at":"2026-05-18T17:08:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"7 ChatPaper [124] GitHub'23 Generation 19K+ GitHub stars; arXiv summarization tool 8 PaperQA [128] arXiv'23 Generation 8K+ GitHub stars; RAG for scientific Q&A 9 AutoSurvey [214] arXiv'24 Generation First end-to-end LLM survey drafting system 10 GPT Researcher[38] GitHub'24 Generation 26K+ GitHub stars; comprehensive report generation 11 LLMs for Lit. Review [200] arXiv'24 Generation - Hallucination analysis; models still generate errors† 12 STORM [178] arXiv'24 Generation Multi-perspective question-asking for outlines 13 Agentic AutoSurvey [119] arXiv'25 Generation - Multi-agent role decomposition† 14 Citegeist [11] arXiv'25 Generation - Dynamic RAG pipeline on arXiv corpus 15 IterSurvey [250] arXiv'25 Generation"},{"citing_arxiv_id":"2605.17373","ref_index":49,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics","primary_cat":"cs.LG","submitted_at":"2026-05-17T10:30:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FML-Bench shows a simple greedy hill-climber nearly matches tree search on dense-opportunity tasks while an adaptive agent that broadens search on stagnation outperforms six baselines across 18 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16616","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility","primary_cat":"cs.LG","submitted_at":"2026-05-15T20:35:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological failures, and that the cheapest system outperforms the most expensive by a wide margin","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14790","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation","primary_cat":"cs.CL","submitted_at":"2026-05-14T12:57:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GoR extracts citation DAGs using position, frequency, predecessor links and time, then fine-tunes Qwen2.5-7B on 498 seed papers to generate ideas, claiming SOTA over gpt-4o baselines via LLM judges.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10813","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation","primary_cat":"cs.AI","submitted_at":"2026-05-11T16:33:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NanoResearch introduces a 
tri-level co-evolving framework of skills, memory, and policy to personalize LLM-powered research automation across projects and users.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"progressively refines itself to produce better research at lower cost over successive cycles. Code Dataset 1 Introduction LLM-powered multi-agent systems [1] have recently transformed end-to-end research automation from a long-standing aspiration [15, 31] into working reality. Systems such as The AI Scientist [21], AI Scientist-v2 [35], EvoScientist [22], and AI-Researcher [27] can now autonomously traverse the full research lifecycle [32, 26, 18], surveying literature, generating hypotheses, implementing experiments, and writing papers within a single pipeline. These advances mark genuine progress: tasks that once required weeks of researcher effort can now be completed in hours at modest cost [37]. Yet the ability to complete the pipeline does not guarantee that its outputs are usable by any particular researcher."},{"citing_arxiv_id":"2605.10530","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Personalized Deep Research: A User-Centric Framework, Dataset, and Hybrid Evaluation for Knowledge Discovery","primary_cat":"cs.IR","submitted_at":"2026-05-11T13:14:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PDR is a user-context-aware framework for LLM research agents that improves report relevance over static baselines, supported by a new dataset and hybrid evaluation.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"The input consists of the paper title, selected keywords from the original abstract, and the author's prior publications. 
The expected output is a rewritten abstract maintaining scientific accuracy while reflecting the author's personal writing style. We curated this subset by filtering the LongLaMP [14] corpus, derived from the Citation Network Dataset (V14) [36]. We retained twenty users whose abstracts exceed 2,000 characters to ensure the task remains knowledge-intensive and provides sufficient content for meaningful personalization assessment. 3.2 Task 2: Personalized Topic Writing This task focuses on generating complete Reddit posts reflecting the author's creative style, including sarcasm, irony, and subreddit-"},{"citing_arxiv_id":"2605.10425","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Toward an Engineering of Science: Rebalancing Generation and Verification in the Age of AI","primary_cat":"cs.CY","submitted_at":"2026-05-11T12:00:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AI lowers the cost of generating plausible scientific artifacts without lowering verification costs, so the paper proposes blueprints as typed graph components that decompose claims, evidence, and assumptions to enable cheaper downstream verification.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"When ai co-scientists fail: Spot-a benchmark for automated verification of scientific research. ArXiv, abs/2505.11855, 2025. URL https://api.semanticscholar.org/CorpusID:278740501. [33] Diomidis Spinellis. False authorship: an explorative case study around an ai-generated article published under my name. Research Integrity and Peer Review, 10(1):8, 2025. [34] Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. Ai-researcher: Autonomous scientific innovation. arXiv preprint arXiv:2505.18705, 2025. [35] Nitya Thakkar, Mert Yuksekgonul, Jake Silberg, Animesh Garg, Nanyun Peng, Fei Sha, Rose Yu, Carl Vondrick, and James Zou.
A large-scale randomized study of large language model feedback in peer review."},{"citing_arxiv_id":"2605.08678","ref_index":92,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI","primary_cat":"cs.LG","submitted_at":"2026-05-09T04:29:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MLS-Bench is a benchmark with 140 tasks that evaluates AI agents on inventing generalizable and scalable ML methods, finding they lag human performance especially in insight-driven invention rather than tuning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Generalize: evaluation of whether the same method or artifact works across multiple settings.Scalability: scale-sensitive tasks or scalable evaluation design.Control: editable or submitted artifact restricted to the problem-relevant part under frozen evaluation. 
Benchmark Setting Count # Scope New Method Generalize Scalability Control Reference ML-Bench [92] ML code 169 18 repos✗ ✗△ △Pass@K MLAgentBench [34] ML experimentation 13 4 categories✗ ✗ ✗ ✗baselines MLE-bench [10] ML engineering 75 15 categories✗ ✗ ✓ ✓medals MLE-Dojo [76] ML engineering 200+ 4 domains✗ ✗ ✓△H-Rank PostTrainBench [78] LLM post-training 28 7 evals✗△✓ ✓instruct AutoResearch [44] LLM training 1 1 setup△✗△ △val_bpb MLGym [66] ML experiments 13 4 domains△✗△ △baselines"},{"citing_arxiv_id":"2605.07208","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution","primary_cat":"cs.LG","submitted_at":"2026-05-08T03:57:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FAME models scientific topic trajectories in continuous time to forecast paper impact more accurately than LLMs by aligning manuscripts with field momentum in a dynamic latent space.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"large-scale human study with 100+ nlp researchers.arXiv preprint arXiv:2409.04109, 2024. [26] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025. [27] Alex J Smola and Bernhard Schölkopf. A tutorial on support vector regression.Statistics and computing, 14(3):199-222, 2004. [28] Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. Ai-researcher: Autonomous scientific innovation.arXiv preprint arXiv:2505.18705, 2025. [29] Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 
Galactica: A large language model for science.arXiv preprint arXiv:2211."},{"citing_arxiv_id":"2605.13874","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GEAR: Genetic AutoResearch for Agentic Code Evolution","primary_cat":"cs.NE","submitted_at":"2026-05-08T00:25:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GEAR applies genetic algorithms to maintain and evolve multiple research states in autonomous code agents, outperforming single-path baselines by continuing to discover improvements over extended runs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06607","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents","primary_cat":"physics.flu-dyn","submitted_at":"2026-05-07T17:27:23+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures missed by solver checks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05851","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hypothesis generation and updating in large language models","primary_cat":"cs.LG","submitted_at":"2026-05-07T08:24:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed 
data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25256","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery","primary_cat":"cs.AI","submitted_at":"2026-04-28T06:05:17+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Agent laboratory: Using llm agents as research assistants.Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977-6043, 2025. [3] Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist. arXiv preprint arXiv:2502.18864, 2025. [4] Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. Ai-researcher: Autonomous scientific innovation.arXiv preprint arXiv:2505.18705, 2025. [5] Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. 
Agent laboratory: Using llm agents as research assistants. Findings of the Association for Computational Linguistics: EMNLP 2025,"},{"citing_arxiv_id":"2604.20622","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"pAI/MSc: ML Theory Research with Humans on the Loop","primary_cat":"cs.AI","submitted_at":"2026-04-22T14:38:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript draft in ML theory.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"AI Scientist [60] and its successor [61] pushed toward fully autonomous ML research with tree search and automated review. Agent Laboratory [62] showed that human feedback at stage boundaries materially improves quality. CycleResearcher [63] coupled generation with review loops, AI-Researcher [64] introduced Scientist-Bench for open-ended evaluation, and AgentRxiv [65] enabled cross-run collaboration through a preprint-server abstraction. freephdlabor [66] argued for fully dynamic, user-customizable workflows. Relative to these systems, pAI/MSc places more weight on a tighter artifact contract, theory-experiment coordination, and bounded workflow control.
Autonomous discovery and open platforms.Broader systems target scientific discovery rather than"},{"citing_arxiv_id":"2604.19606","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AblateCell: A Reproduce-then-Ablate Agent for Virtual Cell Repositories","primary_cat":"cs.AI","submitted_at":"2026-04-21T15:55:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AblateCell reproduces baselines in three single-cell perturbation repositories with 88.9% success and recovers ground-truth critical components with 93.3% accuracy via closed-loop ablation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14116","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration","primary_cat":"cs.AI","submitted_at":"2026-04-15T17:38:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"DiscoveryWorld [16], effectively evaluate AI-agents in relevant scenarios but do not provide a systematic evaluation dataset tailored for LLM training-a gap to be addressed. 2.3. Automated Model Training AutoML.Traditional autonomous machine learning [9,20] primarily focuses on automating model selection and hyperparameter configuration. More recently, a line of work [41] has leveraged LLMs to generate architecture variants [5] and synthesize post-training objectives [28]. 
These methods remain constrained by predefined search spaces or focus on optimizing isolated components. In contrast, our work explores a more open-ended setting, directly automating the entire LLM training lifecycle. AI for Data Construction.Recent studies extensively utilize LLMs for data synthesis [31], evolutionary"},{"citing_arxiv_id":"2507.11810","ref_index":171,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evolving Roles of LLMs in Scientific Innovation: Assistant, Collaborator, Scientist, and Evaluator","primary_cat":"cs.DL","submitted_at":"2025-07-16T00:11:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper proposes a four-role framework for LLMs in scientific innovation and reviews methods, benchmarks, and limitations across Assistant, Collaborator, Scientist, and Evaluator roles.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Thought (CoT) [192] and Reinforcement Learning from Human Feedback (RLHF) [ 6] allow LLMs to break down complex problems, simulate human reasoning, and align with expert feedback, allowing quick adaptation to new scientific challenges. Multimodal learning [198, 57] lets AI systems integrate diverse scientific data, while Reasoning Language Models (RLMs) [8] and multi-agent systems [171] further enhance scientific inference and collaborative research. As a result, LLM-driven innovation is shifting from simple tool use to advanced reasoning and collaborative intelligence. Figure 1: Trends in annual publication counts for traditional AI-driven versus LLM-driven scientific innovation. We retrieved relevant literature from OpenAlex using two distinct search strategies."}],"limit":50,"offset":0}