Unlocking LLM Creativity in Science through Analogical Reasoning
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-13 01:49 UTC · model grok-4.3
The pith
Analogical reasoning enables LLMs to generate more diverse and novel solutions for open-ended scientific problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Analogical reasoning (AR) generates analogies to cross-domain problems based on shared relational structure, then uses those analogies to search for novel solutions. Compared to baselines, AR improves solution diversity metrics by 90-173 percent, produces novel solutions over 50 percent of the time versus as little as 1.6 percent for baselines, and yields high-quality analogies. When the resulting approaches are implemented on four biomedical problems, they deliver consistent quantitative gains: a nearly 13-fold improvement on distributional metrics for perturbation effect prediction, better AUPRC for cell-cell communication prediction, a high Spearman correlation (ρ = 0.729) with published methods for brain region interactions, and state-of-the-art performance on two oligonucleotide property prediction datasets.
What carries the argument
Analogical reasoning (AR), which creates cross-domain analogies based on shared relational structure and then applies them to search for novel solutions in the target scientific problem.
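The analogy representation AR operates over can be sketched as a small data structure. This is a minimal illustration based on the analogy schema described in the paper's appendix prompts (object mappings with functional rationales, plus shared relations mapped by function rather than surface similarity); the concrete example analogy below is hypothetical, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class ObjectMapping:
    source: str     # object in the target scientific problem
    target: str     # analogous object in the cross-domain problem
    rationale: str  # why the FUNCTIONS align, not surface features

@dataclass
class Analogy:
    target_domain: str
    title: str
    object_mappings: list   # list[ObjectMapping]
    shared_relations: list  # relational structure preserved across domains

# Hypothetical example, for illustration only.
analogy = Analogy(
    target_domain="logistics",
    title="Drug delivery as last-mile routing",
    object_mappings=[ObjectMapping(
        source="nanoparticle carrier",
        target="delivery vehicle",
        rationale="both deliver a payload to a constrained destination",
    )],
    shared_relations=["carrier transports payload through a network with barriers"],
)
```

The key constraint the schema encodes is that mappings are justified by functional role ("delivers payload"), which is what distinguishes structure mapping from surface similarity.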
Load-bearing premise
Analogies produced by the LLM reliably reflect true shared relational structures across domains and translate into genuinely novel, high-quality scientific solutions rather than superficial or invalid ones.
What would settle it
A controlled experiment in which domain experts rate the generated analogies as lacking valid relational similarity or in which AR-generated solutions show no performance advantage over baselines when evaluated on independent, held-out biomedical datasets.
Figures
read the original abstract
Autonomous science promises to augment scientific discovery, particularly in complex fields like biomedicine. However, this requires AI systems that can consistently generate novel and diverse solutions to open-ended problems. We evaluate LLMs on the task of open-ended solution generation and quantify their tendency to mode collapse into low-diversity generations. To mitigate this mode collapse, we introduce analogical reasoning (AR) as a new approach to solution generation. AR generates analogies to cross-domain problems based on shared relational structure, then uses those analogies to search for novel solutions. Compared to baselines, AR discovers significantly more diverse generations (improving solution diversity metrics by 90-173%), generates novel solutions over 50% of the time (compared to as little as 1.6% for baselines), and produces high-quality analogies. To validate the real-world feasibility of AR, we implement AR-generated solutions across four biomedical problems, yielding consistent quantitative gains. AR-generated approaches achieve a nearly 13-fold improvement on distributional metrics for perturbation effect prediction, outperform all baselines on AUPRC when predicting cell-cell communication, infer brain region interactions with a high Spearman correlation ($\rho$=0.729) to published methods, and establish state-of-the-art performance on 2 datasets for oligonucleotide property prediction. The novel and diverse solutions produced by AR can be used to augment the search space of existing solution generation methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs exhibit mode collapse in open-ended scientific solution generation, producing low-diversity outputs. It introduces Analogical Reasoning (AR), which generates cross-domain analogies based on shared relational structure and transfers them to the target problem to yield novel solutions. Evaluations on four tasks report 90-173% gains in diversity metrics, >50% novelty rate (vs. 1.6% baselines), high-quality analogies, and real-world biomedical validation with a 13-fold improvement on perturbation effect prediction metrics, superior AUPRC on cell-cell communication, Spearman ρ=0.729 on brain interactions, and SOTA on two oligonucleotide datasets.
Significance. If the central attribution to relational structure mapping holds, the work offers a promising direction for mitigating creativity limitations in LLMs for autonomous science, with concrete empirical support from multiple quantitative metrics and direct implementation on biomedical problems. Strengths include the reproducible performance claims across tasks and the attempt to link LLM outputs to downstream scientific utility.
major comments (3)
- [§4] §4 (Evaluation setup): Baseline implementations lack sufficient detail on prompt length, number of reasoning steps, temperature, and few-shot examples. Without these controls or an ablation comparing AR to other multi-step open-ended prompts of matched complexity, the diversity (90-173%) and novelty (>50%) gains cannot be confidently attributed to the analogical mechanism rather than prompting format differences.
- [§3.1] §3.1 (Analogy generation): The description states that analogies are generated 'based on shared relational structure,' but no independent verification is provided (e.g., human-rated mapping quality, predicate alignment score, or contrast against surface-similarity baselines). This leaves open the possibility that gains arise from increased textual variation rather than structure-mapping as claimed.
- [§5.1] §5.1 (Biomedical results): No statistical significance tests, confidence intervals, or multiple-run variance are reported for the 13-fold distributional metric improvement or other gains. This is load-bearing for the claim of consistent outperformance, as single-run results on LLM outputs are known to be sensitive to sampling.
minor comments (2)
- [Abstract] Abstract: The four biomedical problems are listed via their metrics but not named explicitly; adding the problem names (e.g., perturbation prediction, cell-cell communication) would improve clarity.
- [§4.2] Notation: The diversity and novelty metrics are introduced without a dedicated equation or table defining their exact formulas (e.g., how 'novel' is operationalized against a reference set), which could be added for reproducibility.
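For concreteness, one plausible way such metrics could be operationalized (an illustrative sketch, not the paper's actual definitions): diversity as mean pairwise cosine distance over solution embeddings, and novelty as maximum similarity to a reference set falling below a threshold tau, where tau is a hypothetical parameter.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def diversity(embs):
    """Mean pairwise cosine distance over generated-solution embeddings."""
    pairs = [(i, j) for i in range(len(embs)) for j in range(i + 1, len(embs))]
    return sum(1 - cosine(embs[i], embs[j]) for i, j in pairs) / len(pairs)

def is_novel(emb, reference, tau=0.8):
    """Novel if max cosine similarity to the reference set is below tau."""
    return max(cosine(emb, r) for r in reference) < tau
```

Under this sketch, a mode-collapsed generator that emits near-identical solutions scores a diversity near zero, which is the failure mode the paper quantifies.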
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below and will incorporate revisions to improve clarity, reproducibility, and evidential support for our claims.
read point-by-point responses
-
Referee: [§4] §4 (Evaluation setup): Baseline implementations lack sufficient detail on prompt length, number of reasoning steps, temperature, and few-shot examples. Without these controls or an ablation comparing AR to other multi-step open-ended prompts of matched complexity, the diversity (90-173%) and novelty (>50%) gains cannot be confidently attributed to the analogical mechanism rather than prompting format differences.
Authors: We agree that the current description of baselines is insufficient for full reproducibility and attribution. In the revised manuscript, we will expand §4 to include complete prompt templates, exact token lengths, number of reasoning steps, temperature settings (0.7 across methods), and few-shot counts (3 examples for all conditions). We will also add a new ablation subsection comparing AR against other multi-step open-ended prompting strategies (e.g., extended chain-of-thought and iterative self-refinement) that are matched in total prompt length and number of generation steps. This will allow readers to isolate the contribution of relational structure mapping from general prompting effects. revision: yes
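The matched-complexity ablation committed to above could be configured along these lines. The shared settings reflect the values stated in the response (temperature 0.7, 3 few-shot examples); the step count, token budget, and condition names are placeholders.

```python
# Generation settings held constant across every condition, so that any
# diversity/novelty gap is attributable to the reasoning strategy itself.
SHARED_CONFIG = {
    "temperature": 0.7,        # stated in the response
    "few_shot_examples": 3,    # stated in the response
    "reasoning_steps": 4,      # placeholder: matched across conditions
    "max_prompt_tokens": 2048, # placeholder: matched across conditions
}

# Multi-step prompting strategies matched in length and step count.
CONDITIONS = [
    "analogical_reasoning",
    "extended_chain_of_thought",
    "iterative_self_refinement",
]

runs = [{"condition": c, **SHARED_CONFIG} for c in CONDITIONS]
```

Holding everything but the strategy fixed is what lets the ablation isolate relational structure mapping from generic prompting effects.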
-
Referee: [§3.1] §3.1 (Analogy generation): The description states that analogies are generated 'based on shared relational structure,' but no independent verification is provided (e.g., human-rated mapping quality, predicate alignment score, or contrast against surface-similarity baselines). This leaves open the possibility that gains arise from increased textual variation rather than structure-mapping as claimed.
Authors: We acknowledge that the manuscript currently supports the relational-structure claim primarily through downstream performance metrics rather than direct verification of the analogies themselves. To strengthen this, we will add a human evaluation protocol in the revised §3.1 in which independent annotators rate a sample of generated analogies on relational mapping fidelity and relevance (using a standardized rubric). We will also introduce a surface-similarity baseline that constructs analogies via lexical overlap instead of structure mapping, allowing direct comparison of diversity and novelty outcomes. These additions will provide independent evidence that the observed gains derive from the intended mechanism. revision: yes
-
Referee: [§5.1] §5.1 (Biomedical results): No statistical significance tests, confidence intervals, or multiple-run variance are reported for the 13-fold distributional metric improvement or other gains. This is load-bearing for the claim of consistent outperformance, as single-run results on LLM outputs are known to be sensitive to sampling.
Authors: We agree that the absence of statistical reporting and variance estimates limits confidence in the biomedical results. In the revised manuscript, we will re-execute all four biomedical experiments across at least five independent runs with different random seeds. We will report means and standard deviations for all metrics, 95% confidence intervals, and results of appropriate statistical tests (paired t-tests or Wilcoxon signed-rank tests with Bonferroni correction) comparing AR against baselines. These changes will be added to §5.1 and the corresponding figures/tables. revision: yes
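As a sketch of the planned statistical reporting, a paired t-test over per-seed scores can be computed with the standard library alone. The scores below are hypothetical; 2.776 is the two-sided critical t at α = 0.05 for df = 4, and a Bonferroni correction across the four tasks would tighten α to 0.0125.

```python
import math
import statistics

# Hypothetical per-seed scores (5 seeds) on one biomedical task.
ar       = [0.81, 0.79, 0.83, 0.80, 0.82]
baseline = [0.70, 0.72, 0.69, 0.71, 0.70]

diffs = [a - b for a, b in zip(ar, baseline)]
n = len(diffs)
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)           # sample std dev (ddof = 1)
t = mean_d / (sd_d / math.sqrt(n))       # paired t-statistic, df = n - 1

T_CRIT_95 = 2.776                        # two-sided, df = 4, alpha = 0.05
print(f"t = {t:.2f}, significant at 0.05: {abs(t) > T_CRIT_95}")
```

With only five seeds per condition, a two-sided Wilcoxon signed-rank test cannot reach p < 0.05 at all (its minimum attainable p is 0.0625), so the parametric paired test or more seeds would be needed for the corrected comparisons.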
Circularity Check
No circularity: empirical method validated against external baselines
full rationale
The paper introduces analogical reasoning (AR) as a prompting-based approach to increase diversity and novelty in LLM solution generation, then reports direct empirical gains (diversity +90-173%, novelty >50%, biomedical task improvements) measured against independent baselines and external published methods. No derivations, equations, fitted parameters, or self-citations are invoked to establish the central claims; results rest on falsifiable comparisons outside the paper's own definitions or inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can generate meaningful cross-domain analogies based on shared relational structure when prompted appropriately
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear · "AR generates analogies to cross-domain problems based on shared relational structure, then uses those analogies to search for novel solutions."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear · "We utilize the definition of analogy established by the structure-mapping framework... Object Mappings and Shared Relations"