PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
Pith reviewed 2026-05-13 04:07 UTC · model grok-4.3
The pith
A new benchmark shows that language models handle direct plant marker evidence well but confuse functional and indirect evidence types drawn from the literature.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PlantMarkerBench is a multi-species benchmark built through modular literature retrieval, hybrid search, species-aware biological grounding, structured extraction, and targeted human review. It contains 5,550 sentence-level instances from Arabidopsis, maize, rice, and tomato, each annotated for marker-evidence validity, evidence type, and support strength. The benchmark defines two tasks: deciding if a candidate sentence supplies valid marker evidence for a gene-cell-type pair, and classifying the evidence into expression, localization, function, indirect, or negative categories. Benchmarking of open-weight and closed-source models shows relatively strong performance on direct expression evidence but substantial drops on functional, indirect, and weak-support evidence, with evidence-type confusion emerging as the dominant failure mode.
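Both tasks reduce to standard classification scoring. A minimal sketch of metrics for the binary validity task, assuming only paired gold/predicted labels (the function and metric set are illustrative, not the authors' evaluation code); the false-positive rate is included because elevated false positives under ambiguity is a reported failure mode:

```python
from typing import List, Dict

def validity_metrics(gold: List[bool], pred: List[bool]) -> Dict[str, float]:
    """Score the binary marker-evidence validity task.

    gold[i] is True when sentence i genuinely supports the
    gene-cell-type pair; pred[i] is the model's judgment.
    """
    assert len(gold) == len(pred)
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    tn = sum((not g) and (not p) for g, p in zip(gold, pred))
    negatives = fp + tn
    return {
        "accuracy": (tp + tn) / len(gold),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # The rate the review flags for open-weight models under
        # ambiguous biological contexts.
        "false_positive_rate": fp / negatives if negatives else 0.0,
    }
```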
What carries the argument
PlantMarkerBench, a dataset of 5,550 sentence instances annotated for marker-evidence validity, evidence type, and support strength that tests models on validity determination and evidence classification from biological papers.
If this is right
- Models need stronger methods for distinguishing functional and indirect evidence from ambiguous contexts in biology papers.
- Open-weight models require better calibration to lower false-positive rates when support is weak.
- The benchmark supplies a reproducible test bed for developing AI systems that attribute scientific claims accurately in plant biology.
- Performance gaps on non-expression evidence point to the value of targeted prompting or fine-tuning strategies for evidence classification.
Where Pith is reading between the lines
- Similar benchmarks could be built for other scientific domains where evidence strength varies, such as medical literature.
- Extending the tasks from sentences to full papers might expose additional reasoning challenges in evidence integration.
- The observed confusion patterns could guide development of models that explicitly track support strength when extracting biological facts.
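Those confusion patterns are easy to surface once per-sentence predictions exist. A sketch, assuming gold and predicted labels drawn from the paper's five evidence categories (the helper names are mine, not from the paper):

```python
from collections import Counter

# Label set from the paper's classification task.
EVIDENCE_TYPES = ["expression", "localization", "function", "indirect", "negative"]

def confusion_counts(gold, pred):
    """Tally (gold_type, predicted_type) pairs for the 5-way task."""
    return Counter(zip(gold, pred))

def most_confused(gold, pred, k=3):
    """Top-k off-diagonal cells, i.e. the dominant confusion patterns."""
    counts = confusion_counts(gold, pred)
    off_diag = {cell: n for cell, n in counts.items() if cell[0] != cell[1]}
    return sorted(off_diag.items(), key=lambda kv: -kv[1])[:k]
```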
Load-bearing premise
The modular curation pipeline that combines literature retrieval, hybrid search, species-aware grounding, structured extraction, and targeted human review produces accurate annotations of marker-evidence validity, evidence type, and support strength without substantial bias or error.
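The abstract does not say how the hybrid search combines its signals. A common recipe, offered purely as an illustrative sketch rather than the authors' method, min-max normalizes a lexical (BM25-style) score and a dense embedding score over the same candidate list, then ranks by a weighted sum:

```python
def _minmax(scores):
    """Rescale scores to [0, 1]; degenerate lists map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_rank(bm25_scores, dense_scores, alpha=0.5):
    """Fuse lexical and dense retrieval scores for one candidate list.

    alpha weights the lexical channel; returns candidate indices
    sorted best-first by the fused score.
    """
    b, d = _minmax(bm25_scores), _minmax(dense_scores)
    fused = [alpha * bi + (1 - alpha) * di for bi, di in zip(b, d)]
    return sorted(range(len(fused)), key=lambda i: -fused[i])
```

Reciprocal-rank fusion is an equally common alternative; nothing in the abstract indicates which variant the pipeline uses.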
What would settle it
Independent biologists re-annotating a random subset of the 5,550 sentences and finding large disagreements with the original labels on evidence type or validity.
Figures
Original abstract
Cell-type-specific marker genes are fundamental to plant biology, yet existing resources primarily rely on curated databases or high-throughput studies without explicitly modeling the supporting evidence found in scientific literature. We introduce PlantMarkerBench, a multi-species benchmark for evaluating literature-grounded plant marker evidence interpretation from full-text biological papers. PlantMarkerBench is constructed using a modular curation pipeline integrating large-scale literature retrieval, hybrid search, species-aware biological grounding, structured evidence extraction, and targeted human review. The benchmark spans four plant species -- Arabidopsis, maize, rice, and tomato -- and contains 5,550 sentence-level evidence instances annotated for marker-evidence validity, evidence type, and support strength. We define two benchmark tasks: determining whether a candidate sentence provides valid marker evidence for a gene-cell-type pair, and classifying the evidence into expression, localization, function, indirect, or negative categories. We benchmark diverse open-weight and closed-source language models across species and prompting strategies. Although frontier models achieve relatively strong performance on direct expression evidence, performance drops substantially on functional, indirect, and weak-support evidence, with evidence-type confusion emerging as a dominant failure mode. Open-weight models additionally exhibit elevated false-positive rates under ambiguous biological contexts. PlantMarkerBench provides a challenging and reproducible evaluation framework for literature-grounded biological evidence attribution and supports future research on trustworthy scientific information extraction and AI-assisted plant biology.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PlantMarkerBench, a multi-species benchmark for literature-grounded plant marker reasoning constructed via a modular pipeline of literature retrieval, hybrid search, species-aware grounding, structured extraction, and targeted human review. It contains 5,550 sentence-level annotations across Arabidopsis, maize, rice, and tomato for two tasks: validating marker evidence for gene-cell-type pairs and classifying evidence into expression, localization, function, indirect, or negative categories. Benchmarking of open-weight and closed-source LLMs shows relatively strong performance on direct expression evidence but substantial drops on functional, indirect, and weak-support evidence, with evidence-type confusion as the dominant failure mode and elevated false positives in open-weight models under ambiguity.
Significance. If the annotation quality holds, the benchmark would provide a valuable, reproducible framework for evaluating AI systems on nuanced scientific evidence attribution in plant biology, a domain where literature-grounded interpretation is critical but under-served by existing resources. The identification of specific failure modes (evidence-type confusion and weakness on indirect/functional evidence) offers concrete directions for improving model trustworthiness in information extraction. The multi-species scope and emphasis on full-text papers strengthen its utility for future work on AI-assisted biology.
major comments (2)
- [Abstract / benchmark construction] The benchmark construction description (abstract and associated methods section) outlines the curation pipeline with targeted human review but reports no inter-annotator agreement scores, no held-out validation metrics, and no error analysis for the 5,550 labels on evidence type and support strength. This is load-bearing for the central claim that observed performance drops and evidence-type confusion reflect model limitations rather than annotation artifacts, given that boundaries between functional, indirect, and weak-support evidence are biologically subtle.
- [Results / evaluation] The results section states that frontier models achieve relatively strong performance on direct expression evidence with substantial drops elsewhere, yet the abstract supplies no exact performance numbers, per-species breakdowns, statistical significance tests, or comparison to simple baselines. Without these, it is difficult to evaluate whether the headline distinctions (e.g., expression vs. functional) are robust or sensitive to label noise.
minor comments (2)
- [Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., F1 scores on direct vs. indirect evidence) to ground the qualitative claims.
- [Task definition] Notation for evidence categories (expression, localization, function, indirect, negative) should be defined explicitly with examples in the main text for clarity.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. We address each major comment point by point below and describe the revisions we will make to strengthen the reporting of annotation quality and evaluation details.
Point-by-point responses
Referee: [Abstract / benchmark construction] The benchmark construction description (abstract and associated methods section) outlines the curation pipeline with targeted human review but reports no inter-annotator agreement scores, no held-out validation metrics, and no error analysis for the 5,550 labels on evidence type and support strength. This is load-bearing for the central claim that observed performance drops and evidence-type confusion reflect model limitations rather than annotation artifacts, given that boundaries between functional, indirect, and weak-support evidence are biologically subtle.
Authors: We agree that quantitative assessment of annotation reliability is necessary to support claims about model limitations versus label quality, especially for biologically nuanced categories. The submitted manuscript described the targeted human review but did not include agreement metrics or error analysis. In the revision we will add inter-annotator agreement scores (Fleiss' kappa) computed on a double-annotated subset, a held-out validation analysis, and a dedicated error analysis section that examines sources of disagreement and resolution procedures. These additions will directly address concerns about potential annotation artifacts. revision: yes
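The promised Fleiss' kappa needs no external tooling; it follows directly from a table of per-item category counts on a double-annotated subset (the tabulation of raw annotations into that table is assumed here):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table where ratings[i][j] counts the raters
    who assigned item i to category j; each row must sum to the same
    number of raters n (n >= 2)."""
    N = len(ratings)
    n = sum(ratings[0])
    k = len(ratings[0])
    # Per-item observed agreement P_i, then its mean P_bar.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N
    # Chance agreement P_e from marginal category proportions.
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```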
Referee: [Results / evaluation] The results section states that frontier models achieve relatively strong performance on direct expression evidence with substantial drops elsewhere, yet the abstract supplies no exact performance numbers, per-species breakdowns, statistical significance tests, or comparison to simple baselines. Without these, it is difficult to evaluate whether the headline distinctions (e.g., expression vs. functional) are robust or sensitive to label noise.
Authors: The full results section contains detailed performance tables across models and species, but we acknowledge that the abstract and high-level statements remain qualitative. We will revise the abstract to report key quantitative metrics (F1 scores per evidence type), include per-species breakdowns, add comparisons to simple baselines such as lexical or rule-based classifiers, and report statistical significance tests (e.g., bootstrap confidence intervals and McNemar's test) for the observed differences. These changes will make the robustness of the expression-versus-other distinctions clearer. revision: yes
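Both proposed procedures are simple to sketch. Assuming per-item correctness vectors for two models on the same instances (an assumption about how results would be tabulated), a continuity-corrected McNemar statistic and a percentile bootstrap interval for accuracy look like:

```python
import random

def mcnemar_statistic(correct_a, correct_b):
    """Continuity-corrected McNemar chi-square on paired correctness."""
    b = sum(ca and not cb for ca, cb in zip(correct_a, correct_b))  # A right, B wrong
    c = sum(cb and not ca for ca, cb in zip(correct_a, correct_b))  # B right, A wrong
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)

def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy."""
    rng = random.Random(seed)
    n = len(correct)
    accs = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lo = accs[int(alpha / 2 * n_boot)]
    hi = accs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```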
Circularity Check
No circularity: empirical benchmark with independent annotations and direct evaluation
full rationale
The paper constructs PlantMarkerBench via a literature-retrieval and human-review pipeline, then measures model performance on the resulting 5,550 labeled instances. No equations, fitted parameters, or predictions are claimed; the two tasks (validity detection and evidence-type classification) are evaluated directly against the curated labels. The central claims about performance drops on functional/indirect/weak evidence therefore rest on the external validity of the annotations rather than any self-referential reduction or self-citation chain. This is a standard empirical benchmark paper with no load-bearing derivation that collapses to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human review combined with automated retrieval produces accurate annotations for marker-evidence validity, type, and support strength.