Pith · machine review for the scientific record

arxiv: 2605.10032 · v2 · submitted 2026-05-11 · 💻 cs.CL

Recognition: no theorem link

PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning

Liqing Zhang, Sajib Acharjee Dip, Song Li

Pith reviewed 2026-05-13 04:07 UTC · model grok-4.3

classification 💻 cs.CL
keywords plant marker genes · evidence classification · scientific literature · benchmark dataset · language models · cell-type markers · multi-species · information extraction

The pith

A new benchmark reveals language models handle direct plant marker evidence well but confuse functional and indirect types from literature.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PlantMarkerBench to evaluate how language models interpret evidence for cell-type marker genes drawn from full-text scientific papers across four plant species. It uses a pipeline of literature retrieval, hybrid search, species grounding, extraction, and human review to create 5,550 annotated sentence instances labeled for evidence validity, type, and support strength. Two tasks test whether a sentence provides valid marker evidence for a gene-cell-type pair and classify the evidence as expression, localization, function, indirect, or negative. Frontier models show solid results on direct expression but drop sharply on functional, indirect, and weak-support cases, with evidence-type confusion as the main error; open-weight models add high false-positive rates in ambiguous contexts. This setup matters because trustworthy extraction of marker evidence from papers can improve curation of plant biology resources and support AI-assisted discovery.

Core claim

PlantMarkerBench is a multi-species benchmark built through modular literature retrieval, hybrid search, species-aware biological grounding, structured extraction, and targeted human review. It contains 5,550 sentence-level instances from Arabidopsis, maize, rice, and tomato, each annotated for marker-evidence validity, evidence type, and support strength. The benchmark defines two tasks: deciding if a candidate sentence supplies valid marker evidence for a gene-cell-type pair, and classifying the evidence into expression, localization, function, indirect, or negative categories. Benchmarking of open-weight and closed-source models shows relatively strong performance on direct expression evidence but substantial drops on functional, indirect, and weak-support cases, with evidence-type confusion as the dominant failure mode.
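To make the two tasks concrete, a minimal sketch of what a benchmark instance and a per-evidence-type scorer might look like. All field and function names here are hypothetical illustrations; the paper's actual schema is not reproduced in this summary.

```python
from dataclasses import dataclass

# Label space described in the paper; field names below are hypothetical.
EVIDENCE_TYPES = ["expression", "localization", "function", "indirect", "negative"]

@dataclass
class MarkerInstance:
    sentence: str          # candidate sentence from a full-text paper
    gene: str              # grounded gene identifier
    cell_type: str         # grounded cell-type label
    species: str           # Arabidopsis, maize, rice, or tomato
    is_valid: bool         # Task 1 label: does the sentence support the pair?
    evidence_type: str     # Task 2 label: one of EVIDENCE_TYPES
    support_strength: str  # e.g. "strong" or "weak"

def task2_accuracy_by_type(instances, predictions):
    """Per-evidence-type accuracy for the Task 2 classification,
    the kind of breakdown behind the expression-vs-indirect gap."""
    totals, correct = {}, {}
    for inst, pred in zip(instances, predictions):
        t = inst.evidence_type
        totals[t] = totals.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (pred == t)
    return {t: correct[t] / totals[t] for t in totals}
```

Aggregating this way, rather than over all instances at once, is what exposes the reported drop on functional and indirect evidence even when overall accuracy looks strong.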

What carries the argument

PlantMarkerBench, a dataset of 5,550 sentence instances annotated for marker-evidence validity, evidence type, and support strength that tests models on validity determination and evidence classification from biological papers.

If this is right

  • Models need stronger methods for distinguishing functional and indirect evidence from ambiguous contexts in biology papers.
  • Open-weight models require better calibration to lower false-positive rates when support is weak.
  • The benchmark supplies a reproducible test bed for developing AI systems that attribute scientific claims accurately in plant biology.
  • Performance gaps on non-expression evidence point to the value of targeted prompting or fine-tuning strategies for evidence classification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar benchmarks could be built for other scientific domains where evidence strength varies, such as medical literature.
  • Extending the tasks from sentences to full papers might expose additional reasoning challenges in evidence integration.
  • The observed confusion patterns could guide development of models that explicitly track support strength when extracting biological facts.

Load-bearing premise

The modular curation pipeline that combines literature retrieval, hybrid search, species-aware grounding, structured extraction, and targeted human review produces accurate annotations of marker-evidence validity, evidence type, and support strength without substantial bias or error.
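The paper does not specify how its hybrid search fuses lexical and semantic retrieval; one common realization is score fusion after min-max normalization. A sketch under that assumption (the `alpha` weight is illustrative, not a value from the paper):

```python
def hybrid_rank(lexical_scores, semantic_scores, alpha=0.5):
    """Blend lexical scores (e.g. BM25) with semantic scores (e.g. embedding
    cosine similarity) for the same candidate list, then rank.
    Returns candidate indices sorted from best to worst."""
    def norm(xs):
        # Min-max normalize so the two score scales are comparable.
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    lex, sem = norm(lexical_scores), norm(semantic_scores)
    fused = [alpha * l + (1 - alpha) * s for l, s in zip(lex, sem)]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
```

If the pipeline instead uses rank-based fusion (e.g. reciprocal rank fusion) the details change, but the load-bearing point is the same: candidate sentences survive retrieval only if at least one scoring view favors them, which shapes what the annotators ever see.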

What would settle it

Independent biologists re-annotating a random subset of the 5,550 sentences and finding large disagreements with the original labels on evidence type or validity.
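"Large disagreements" can be made quantitative with a chance-corrected agreement statistic between the original labels and the re-annotation. A minimal sketch using Cohen's kappa for the two label sets (the choice of statistic is ours, not the paper's):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences, e.g. the
    original benchmark labels vs. an independent re-annotation.
    1.0 = perfect agreement, 0.0 = agreement no better than chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    if expected == 1.0:  # degenerate: both annotators used a single label
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa well below conventional thresholds (roughly 0.6-0.8 for substantial agreement) on evidence type or validity would be the damaging outcome described above.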

Figures

Figures reproduced from arXiv: 2605.10032 by Liqing Zhang, Sajib Acharjee Dip, Song Li.

Figure 1
Figure 1: Example evidence-grounded reasoning instances in PlantMarkerBench. Positive examples include expression and localization evidence supporting gene–cell-type associations. Hard negative examples illustrate biologically challenging failure modes including spurious alias matching, wrong-gene attribution, and cell-type granularity mismatch. PlantMarkerBench evaluates whether models can ground the correct gene a…
Figure 2
Figure 2: PlantMarkerBench dataset overview. (A) Dataset scale across four plant species. (B) Evidence-type composition showing diverse biological reasoning regimes including expression, localization, functional, indirect, and negative evidence. (C) Long-tail support-strength distributions reveal that most literature evidence is weakly supported, reflecting realistic scientific ambiguity.
Figure 3
Figure 3: PlantMarkerBench dataset overview and benchmark composition. PlantMarkerBench is a multi-species, evidence-grounded benchmark for plant cell-type marker reasoning constructed from full-text literature across four plant species: Arabidopsis thaliana, maize, rice, and tomato. The benchmark contains 5,550 sentence-level evidence instances spanning 1,036 unique genes and 127 observed cell types. Evidence insta…
Figure 4
Figure 4: Error taxonomy across representative PlantMarkerBench runs. Evidence-type mismatch is the dominant failure mode across most settings, while open-weight models exhibit substantially higher false-positive rates.
Original abstract

Cell-type-specific marker genes are fundamental to plant biology, yet existing resources primarily rely on curated databases or high-throughput studies without explicitly modeling the supporting evidence found in scientific literature. We introduce PlantMarkerBench, a multi-species benchmark for evaluating literature-grounded plant marker evidence interpretation from full-text biological papers. PlantMarkerBench is constructed using a modular curation pipeline integrating large-scale literature retrieval, hybrid search, species-aware biological grounding, structured evidence extraction, and targeted human review. The benchmark spans four plant species -- Arabidopsis, maize, rice, and tomato -- and contains 5,550 sentence-level evidence instances annotated for marker-evidence validity, evidence type, and support strength. We define two benchmark tasks: determining whether a candidate sentence provides valid marker evidence for a gene-cell-type pair, and classifying the evidence into expression, localization, function, indirect, or negative categories. We benchmark diverse open-weight and closed-source language models across species and prompting strategies. Although frontier models achieve relatively strong performance on direct expression evidence, performance drops substantially on functional, indirect, and weak-support evidence, with evidence-type confusion emerging as a dominant failure mode. Open-weight models additionally exhibit elevated false-positive rates under ambiguous biological contexts. PlantMarkerBench provides a challenging and reproducible evaluation framework for literature-grounded biological evidence attribution and supports future research on trustworthy scientific information extraction and AI-assisted plant biology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PlantMarkerBench, a multi-species benchmark for literature-grounded plant marker reasoning constructed via a modular pipeline of literature retrieval, hybrid search, species-aware grounding, structured extraction, and targeted human review. It contains 5,550 sentence-level annotations across Arabidopsis, maize, rice, and tomato for two tasks: validating marker evidence for gene-cell-type pairs and classifying evidence into expression, localization, function, indirect, or negative categories. Benchmarking of open-weight and closed-source LLMs shows relatively strong performance on direct expression evidence but substantial drops on functional, indirect, and weak-support evidence, with evidence-type confusion as the dominant failure mode and elevated false positives in open-weight models under ambiguity.

Significance. If the annotation quality holds, the benchmark would provide a valuable, reproducible framework for evaluating AI systems on nuanced scientific evidence attribution in plant biology, a domain where literature-grounded interpretation is critical but under-served by existing resources. The identification of specific failure modes (evidence-type confusion and weakness on indirect/functional evidence) offers concrete directions for improving model trustworthiness in information extraction. The multi-species scope and emphasis on full-text papers strengthen its utility for future work on AI-assisted biology.

major comments (2)
  1. [Abstract / benchmark construction] The benchmark construction description (abstract and associated methods section) outlines the curation pipeline with targeted human review but reports no inter-annotator agreement scores, no held-out validation metrics, and no error analysis for the 5,550 labels on evidence type and support strength. This is load-bearing for the central claim that observed performance drops and evidence-type confusion reflect model limitations rather than annotation artifacts, given that boundaries between functional, indirect, and weak-support evidence are biologically subtle.
  2. [Results / evaluation] The results section states that frontier models achieve relatively strong performance on direct expression evidence with substantial drops elsewhere, yet the abstract supplies no exact performance numbers, per-species breakdowns, statistical significance tests, or comparison to simple baselines. Without these, it is difficult to evaluate whether the headline distinctions (e.g., expression vs. functional) are robust or sensitive to label noise.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., F1 scores on direct vs. indirect evidence) to ground the qualitative claims.
  2. [Task definition] Notation for evidence categories (expression, localization, function, indirect, negative) should be defined explicitly with examples in the main text for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address each major comment point by point below and describe the revisions we will make to strengthen the reporting of annotation quality and evaluation details.

Point-by-point responses
  1. Referee: [Abstract / benchmark construction] The benchmark construction description (abstract and associated methods section) outlines the curation pipeline with targeted human review but reports no inter-annotator agreement scores, no held-out validation metrics, and no error analysis for the 5,550 labels on evidence type and support strength. This is load-bearing for the central claim that observed performance drops and evidence-type confusion reflect model limitations rather than annotation artifacts, given that boundaries between functional, indirect, and weak-support evidence are biologically subtle.

    Authors: We agree that quantitative assessment of annotation reliability is necessary to support claims about model limitations versus label quality, especially for biologically nuanced categories. The submitted manuscript described the targeted human review but did not include agreement metrics or error analysis. In the revision we will add inter-annotator agreement scores (Fleiss' kappa) computed on a double-annotated subset, a held-out validation analysis, and a dedicated error analysis section that examines sources of disagreement and resolution procedures. These additions will directly address concerns about potential annotation artifacts. revision: yes

  2. Referee: [Results / evaluation] The results section states that frontier models achieve relatively strong performance on direct expression evidence with substantial drops elsewhere, yet the abstract supplies no exact performance numbers, per-species breakdowns, statistical significance tests, or comparison to simple baselines. Without these, it is difficult to evaluate whether the headline distinctions (e.g., expression vs. functional) are robust or sensitive to label noise.

    Authors: The full results section contains detailed performance tables across models and species, but we acknowledge that the abstract and high-level statements remain qualitative. We will revise the abstract to report key quantitative metrics (F1 scores per evidence type), include per-species breakdowns, add comparisons to simple baselines such as lexical or rule-based classifiers, and report statistical significance tests (e.g., bootstrap confidence intervals and McNemar's test) for the observed differences. These changes will make the robustness of the expression-versus-other distinctions clearer. revision: yes
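The significance testing the authors commit to can be sketched concretely. McNemar's test compares two models on the same items using only the discordant pairs; a minimal exact version (function name and interface are illustrative):

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar test on discordant pairs:
    b = items model A got right and model B got wrong, c = the reverse.
    Returns the p-value under the null that both models err equally often."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: nothing to distinguish
    k = min(b, c)
    # Exact binomial tail with p = 0.5, doubled for a two-sided test.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)
```

Because the paired design discards items both models get right or wrong, it is well suited to the paper's setting, where headline gaps (expression vs. functional) could otherwise be driven by a shared easy subset.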

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent annotations and direct evaluation

full rationale

The paper constructs PlantMarkerBench via a literature-retrieval and human-review pipeline, then measures model performance on the resulting 5,550 labeled instances. No equations, fitted parameters, or predictions are claimed; the two tasks (validity detection and evidence-type classification) are evaluated directly against the curated labels. The central claims about performance drops on functional/indirect/weak evidence therefore rest on the external validity of the annotations rather than any self-referential reduction or self-citation chain. This is a standard empirical benchmark paper with no load-bearing derivation that collapses to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the assumption that the described curation process yields reliable ground-truth labels; no free parameters, mathematical derivations, or new physical entities are introduced.

axioms (1)
  • domain assumption: Human review combined with automated retrieval produces accurate annotations for marker-evidence validity, type, and support strength.
    The benchmark's utility as an evaluation framework depends on this assumption about annotation quality.

pith-pipeline@v0.9.0 · 5546 in / 1262 out tokens · 97277 ms · 2026-05-13T04:07:54.342219+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 7 internal anchors

  1. [2]

    The Arabidopsis Information Resource: making and mining the "gold standard" annotated reference plant genome

    Tanya Z Berardini, Leonore Reiser, Donghui Li, Yarik Mezheritsky, Robert Muller, Emily Strait, and Eva Huala. The Arabidopsis Information Resource: making and mining the "gold standard" annotated reference plant genome. Genesis, 53(8): 474--485, 2015

  2. [3]

    A gene expression map of the Arabidopsis root

    Kenneth Birnbaum, Dennis E Shasha, Jean Y Wang, Jee W Jung, Georgina M Lambert, David W Galbraith, and Philip N Benfey. A gene expression map of the Arabidopsis root. Science, 302(5652): 1956--1960, 2003

  3. [4]

    A high-resolution root spatiotemporal map reveals dominant expression patterns

    Siobhan M Brady, David A Orlando, Ji-Young Lee, Jean Y Wang, Jeremy Koch, José R Dinneny, Daniel Mace, Uwe Ohler, and Philip N Benfey. A high-resolution root spatiotemporal map reveals dominant expression patterns. Science, 318(5851): 801--806, 2007

  4. [5]

    Biomedical natural language processing

    Kevin Bretonnel Cohen and Dina Demner-Fushman. Biomedical natural language processing. 2014

  5. [6]

    Reconstructing spatiotemporal gene expression data from partial observations

    Dustin A Cartwright, Siobhan M Brady, David A Orlando, Bernd Sturmfels, and Philip N Benfey. Reconstructing spatiotemporal gene expression data from partial observations. Bioinformatics, 25(19): 2581--2587, 2009

  6. [7]

    PlantscRNAdb: a database for plant single-cell RNA analysis

    Hongyu Chen, Xinxin Yin, Longbiao Guo, Jie Yao, Yiwen Ding, Xiaoxu Xu, Lu Liu, Qian-Hao Zhu, Qinjie Chu, and Longjiang Fan. PlantscRNAdb: a database for plant single-cell RNA analysis. Molecular Plant, 14(6): 855--857, 2021

  7. [8]

    Information commons for rice (IC4R)

    IC4R Project Consortium. Information commons for rice (IC4R). Nucleic Acids Research, 44(D1): D1172--D1180, 2016

  8. [9]

    Spatiotemporal developmental trajectories in the Arabidopsis root revealed using high-throughput single-cell RNA sequencing

    Tom Denyer, Xiaoli Ma, Simon Klesen, Emanuele Scacchi, Kay Nieselt, and Marja CP Timmermans. Spatiotemporal developmental trajectories in the Arabidopsis root revealed using high-throughput single-cell RNA sequencing. Developmental Cell, 48(6): 840--852, 2019

  9. [10]

    The Sol Genomics Network (SGN)—from genotype to phenotype to breeding

    Noe Fernandez-Pozo, Naama Menda, Jeremy D Edwards, Surya Saha, Isaak Y Tecle, Susan R Strickler, Aureliano Bombarely, Thomas Fisher-York, Anuradha Pujar, Hartmut Foerster, et al. The Sol Genomics Network (SGN)—from genotype to phenotype to breeding. Nucleic Acids Research, 43(D1): D1036--D1041, 2015

  10. [12]

    Retrieval augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International Conference on Machine Learning, pages 3929--3938. PMLR, 2020

  11. [13]

    Integrated analysis of multimodal single-cell data

    Yuhan Hao, Stephanie Hao, Erica Andersen-Nissen, William M Mauck, Shiwei Zheng, Andrew Butler, Maddie J Lee, Aaron J Wilk, Charlotte Darby, Michael Zager, et al. Integrated analysis of multimodal single-cell data. Cell, 184(13): 3573--3587, 2021

  12. [14]

    scPlantDB: a comprehensive database for exploring cell types and markers of plant cell atlases

    Zhaohui He, Yuting Luo, Xinkai Zhou, Tao Zhu, Yangming Lan, and Dijun Chen. scPlantDB: a comprehensive database for exploring cell types and markers of plant cell atlases. Nucleic Acids Research, 52(D1): D1629--D1638, 2024

  13. [15]

    Towards reasoning in large language models: A survey

    Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1049--1065, 2023

  14. [18]

    Dynamics of gene expression in single root cells of Arabidopsis thaliana

    Ken Jean-Baptiste, José L McFaline-Figueroa, Cristina M Alexandre, Michael W Dorrity, Lauren Saunders, Kerry L Bubb, Cole Trapnell, Stanley Fields, Christine Queitsch, and Josh T Cuperus. Dynamics of gene expression in single root cells of Arabidopsis thaliana. The Plant Cell, 31(5): 993--1011, 2019

  15. [19]

    PCMDB: a curated and comprehensive resource of plant cell markers

    Jingjing Jin, Peng Lu, Yalong Xu, Jiemeng Tao, Zefeng Li, Shuaibin Wang, Shizhou Yu, Chen Wang, Xiaodong Xie, Junping Gao, et al. PCMDB: a curated and comprehensive resource of plant cell markers. Nucleic Acids Research, 50(D1): D1448--D1455, 2022

  16. [20]

    Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data

    Yoshihiro Kawahara, Melissa de la Bastide, John P Hamilton, Hiroyuki Kanamori, W Richard McCombie, Shu Ouyang, David C Schwartz, Tsuyoshi Tanaka, Jianzhong Wu, Shiguo Zhou, et al. Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice, 6(1): 4, 2013

  17. [21]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33: 9459--9474, 2020

  18. [22]

    MaizeGDB 2018: the maize multi-genome genetics and genomics database

    John L Portwood, Margaret R Woodhouse, Ethalinda K Cannon, Jack M Gardiner, Lisa C Harper, Mary L Schaeffer, Jesse R Walsh, Taner Z Sen, Kyoung Tak Cho, David A Schott, et al. MaizeGDB 2018: the maize multi-genome genetics and genomics database. Nucleic Acids Research, 47(D1): D1146--D1154, 2019

  19. [23]

    Sentence-BERT: Sentence embeddings using siamese BERT-networks

    Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982--3992, 2019

  20. [24]

    Towards building a plant cell atlas

    Seung Y Rhee, Kenneth D Birnbaum, and David W Ehrhardt. Towards building a plant cell atlas. Trends in Plant Science, 24(4): 303--310, 2019

  21. [25]

    Single-cell-based analysis highlights a surge in cell-to-cell molecular variability preceding irreversible commitment in a differentiation process

    Angélique Richard, Loïs Boullu, Ulysse Herbach, Arnaud Bonnafoux, Valérie Morin, Elodie Vallin, Anissa Guillemin, Nan Papili Gao, Rudiyanto Gunawan, Jérémie Cosette, et al. Single-cell-based analysis highlights a surge in cell-to-cell molecular variability preceding irreversible commitment in a differentiation process. PLoS Biology, 14, 2016

  22. [26]

    The probabilistic relevance framework: BM25 and beyond, volume 4

    Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond, volume 4. Now Publishers Inc, 2009

  23. [27]

    Single-cell RNA sequencing resolves molecular relationships among individual plant cells

    Kook Hui Ryu, Ling Huang, Hyun Min Kang, and John Schiefelbein. Single-cell RNA sequencing resolves molecular relationships among individual plant cells. Plant Physiology, 179(4): 1444--1456, 2019

  24. [28]

    High-throughput single-cell transcriptome profiling of plant cell types

    Christine N Shulse, Benjamin J Cole, Doina Ciobanu, Junyan Lin, Yuko Yoshinaga, Mona Gouran, Gina M Turco, Yiwen Zhu, Ronan C O'Malley, Siobhan M Brady, et al. High-throughput single-cell transcriptome profiling of plant cell types. Cell Reports, 27(7): 2241--2247, 2019

  25. [29]

    Comprehensive integration of single-cell data

    Tim Stuart, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M Mauck, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. Comprehensive integration of single-cell data. Cell, 177(7): 1888--1902, 2019

  26. [32]

    A single-cell RNA sequencing profiles the developmental landscape of Arabidopsis root

    Tian-Qi Zhang, Zhou-Geng Xu, Guan-Dong Shang, and Jia-Wei Wang. A single-cell RNA sequencing profiles the developmental landscape of Arabidopsis root. Molecular Plant, 12(5): 648--660, 2019

  27. [33]

    Spatiotemporal developmental trajectories in the Arabidopsis root revealed using high-throughput single-cell RNA sequencing

    Spatiotemporal developmental trajectories in the Arabidopsis root revealed using high-throughput single-cell RNA sequencing. Developmental Cell, 2019

  28. [34]

    Dynamics of gene expression in single root cells of Arabidopsis thaliana

    Dynamics of gene expression in single root cells of Arabidopsis thaliana. The Plant Cell, 2019

  29. [35]

    High-throughput single-cell transcriptome profiling of plant cell types

    High-throughput single-cell transcriptome profiling of plant cell types. Cell Reports, 2019

  30. [36]

    Single-cell-based analysis highlights a surge in cell-to-cell molecular variability preceding irreversible commitment in a differentiation process

    Single-cell-based analysis highlights a surge in cell-to-cell molecular variability preceding irreversible commitment in a differentiation process. PLoS Biology, 2016

  31. [37]

    Single-cell RNA sequencing resolves molecular relationships among individual plant cells

    Single-cell RNA sequencing resolves molecular relationships among individual plant cells. Plant Physiology, 2019

  32. [38]

    PCMDB: a curated and comprehensive resource of plant cell markers

    PCMDB: a curated and comprehensive resource of plant cell markers. Nucleic Acids Research, 2022

  33. [39]

    PlantscRNAdb: a database for plant single-cell RNA analysis

    PlantscRNAdb: a database for plant single-cell RNA analysis. Molecular Plant, 2021

  34. [40]

    scPlantDB: a comprehensive database for exploring cell types and markers of plant cell atlases

    scPlantDB: a comprehensive database for exploring cell types and markers of plant cell atlases. Nucleic Acids Research, 2024

  35. [41]

    Towards building a plant cell atlas

    Towards building a plant cell atlas. Trends in Plant Science, 2019

  36. [42]

    Massively parallel digital transcriptional profiling of single cells

    Massively parallel digital transcriptional profiling of single cells. Nature Communications, 2017

  37. [43]

    Comprehensive integration of single-cell data

    Comprehensive integration of single-cell data. Cell, 2019

  38. [44]

    Integrated analysis of multimodal single-cell data

    Integrated analysis of multimodal single-cell data. Cell, 2021

  39. [45]

    A high-resolution root spatiotemporal map reveals dominant expression patterns

    A high-resolution root spatiotemporal map reveals dominant expression patterns. Science, 2007

  40. [46]

    A gene expression map of the Arabidopsis root

    A gene expression map of the Arabidopsis root. Science, 2003

  41. [47]

    Reconstructing spatiotemporal gene expression data from partial observations

    Reconstructing spatiotemporal gene expression data from partial observations. Bioinformatics, 2009

  42. [48]

    Biomedical natural language processing

    Biomedical natural language processing. 2014

  43. [49]

    Towards reasoning in large language models: A survey

    Towards reasoning in large language models: A survey. Findings of the Association for Computational Linguistics: ACL 2023

  44. [50]

    Retrieval augmented language model pre-training

    Retrieval augmented language model pre-training. International Conference on Machine Learning, 2020

  45. [51]

    GPT-4 Technical Report

    GPT-4 technical report. arXiv preprint arXiv:2303.08774

  46. [52]

    OpenAI GPT-5 System Card

    OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267

  47. [53]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

  48. [54]

    Qwen2.5-Coder Technical Report

    Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186

  49. [55]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122

  50. [56]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948

  51. [57]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33

  52. [58]

    Atlas: Few-shot learning with retrieval augmented language models

    Atlas: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299, 2022

  53. [59]

    ReAct: Synergizing Reasoning and Acting in Language Models

    ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629

  54. [60]

    The Arabidopsis Information Resource: making and mining the "gold standard" annotated reference plant genome

    The Arabidopsis Information Resource: making and mining the "gold standard" annotated reference plant genome. Genesis, 2015

  55. [61]

    Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data

    Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice, 2013

  56. [62]

    Information commons for rice (IC4R)

    Information commons for rice (IC4R). Nucleic Acids Research, 2016

  57. [63]

    MaizeGDB 2018: the maize multi-genome genetics and genomics database

    MaizeGDB 2018: the maize multi-genome genetics and genomics database. Nucleic Acids Research, 2019

  58. [64]

    The Sol Genomics Network (SGN)—from genotype to phenotype to breeding

    The Sol Genomics Network (SGN)—from genotype to phenotype to breeding. Nucleic Acids Research, 2015

  59. [65]

    A single-cell RNA sequencing profiles the developmental landscape of Arabidopsis root

    A single-cell RNA sequencing profiles the developmental landscape of Arabidopsis root. Molecular Plant, 2019

  60. [66]

    The probabilistic relevance framework: BM25 and beyond

    The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc, 2009

  61. [67]

    Sentence-BERT: Sentence embeddings using siamese BERT-networks

    Sentence-BERT: Sentence embeddings using siamese BERT-networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)