SynRXN: An Open Benchmark and Curated Dataset for Computational Reaction Modeling

Nhu-Ngoc Nguyen Song; Peter F. Stadler; Tieu-Long Phan

arxiv: 2601.01943 · v1 · submitted 2026-01-05 · 💻 cs.LG

Pith reviewed 2026-05-16 17:23 UTC · model grok-4.3

keywords CASP · reaction modeling · benchmarking · synthesis planning · dataset curation · machine learning · computational chemistry · atom mapping

The pith

SynRXN assembles heterogeneous public reaction data into versioned datasets for five standardized CASP task families with leakage-aware splits and reproducible evaluation protocols.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SynRXN as a unified benchmarking resource that decomposes computer-aided synthesis planning into five distinct task families: reaction rebalancing, atom-to-atom mapping, reaction classification, reaction property prediction, and synthesis route design. It curates data from multiple public sources into harmonized representations that carry explicit provenance, license tags, checksums, and manifests. Transparent splitting functions create train, validation, and test partitions designed to minimize leakage, while sensitive tasks are supplied only as evaluation sets. Scripted build processes ensure bitwise-reproducible regeneration of the corpora. The overall goal is to remove dataset heterogeneity so that different methods can be compared fairly over time and across the full reaction-informatics pipeline.
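As an illustration of how a checksum-and-row-count manifest can guard a regenerated corpus, here is a minimal Python sketch. The manifest layout (`files`, `path`, `sha256`, `rows`), the file name `classification.csv`, and the helper names are assumptions made for this example; the summary above does not describe SynRXN's actual manifest schema.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large corpora need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_rows(path: Path) -> int:
    """Count data rows in a line-oriented file, excluding one header line."""
    with path.open("rb") as handle:
        return max(sum(1 for _ in handle) - 1, 0)


def verify(manifest: dict, root: Path) -> list[str]:
    """Return human-readable mismatches; an empty list means the build matches."""
    problems = []
    for entry in manifest["files"]:  # hypothetical manifest layout
        path = root / entry["path"]
        if sha256_of(path) != entry["sha256"]:
            problems.append(f"{entry['path']}: checksum mismatch")
        if count_rows(path) != entry["rows"]:
            problems.append(f"{entry['path']}: row count mismatch")
    return problems


# Usage: record a tiny file in a manifest, verify it, then tamper with it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    target = root / "classification.csv"  # hypothetical file name
    target.write_bytes(b"rxn,label\nCCO>>CC=O,oxidation\n")
    manifest = {
        "files": [
            {"path": "classification.csv", "sha256": sha256_of(target), "rows": 1}
        ]
    }
    clean = verify(manifest, root)
    with target.open("ab") as handle:
        handle.write(b"CC>>CC,none\n")  # simulate drift after the manifest was written
    tampered = verify(manifest, root)
```

A byte-level checksum is what makes a "bitwise-reproducible" claim testable: any change in row order, line endings, or float formatting alters the digest even when the row count stays the same.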

Core claim

SynRXN decomposes end-to-end synthesis planning into five task families, assembles curated provenance-tracked reaction corpora from heterogeneous public sources into a harmonized representation, and packages them as versioned datasets together with leakage-aware splitting functions, standardized evaluation workflows, and metric suites tailored to each setting.

What carries the argument

The harmonized representation of reaction corpora packaged as versioned datasets for each of the five task families, together with provenance metadata, machine-readable manifests, and leakage-aware train-validation-test splitting functions.
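One common way to make a split leakage-aware is to assign whole groups of related reactions, rather than individual rows, to a partition. The Python sketch below does this with a stable hash. The grouping rule (reactions sharing the same product string), the `seed` value, and the function names are assumptions for illustration; the summary does not state how SynRXN's splitting functions define leakage.

```python
import hashlib
from collections import defaultdict


def group_key(reaction_smiles: str) -> str:
    """Hypothetical grouping rule: reactions with the same product side share a group.

    The product string is used as-is; a real pipeline would canonicalize it first.
    """
    return reaction_smiles.split(">>")[-1]


def assign_partition(key: str, seed: str = "v1",
                     val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Map a group key to a partition via a stable hash, independent of row order."""
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # deterministic value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "valid"
    return "train"


def split(reactions, **kwargs):
    """Partition reactions so that every group lands in exactly one partition."""
    parts = defaultdict(list)
    for rxn in reactions:
        parts[assign_partition(group_key(rxn), **kwargs)].append(rxn)
    return dict(parts)


# Usage: the first two reactions share a product, as do the last two.
reactions = ["CCO>>CC=O", "CC(O)O>>CC=O", "CC>>CCO", "C=C.O>>CCO"]
parts = split(reactions)
```

Because the assignment depends only on the seed and the group key, the same record lands in the same partition on every machine and in every release, which is what lets results be compared across versions. The trade-off is that partition sizes only approximate the requested fractions.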

If this is right

Different CASP methods can be compared longitudinally on identical, versioned data partitions.
Researchers can run controlled ablations and stress tests across the entire reaction-informatics pipeline.
Practitioners obtain more robust and directly comparable performance numbers for real-world synthesis workloads.
Contamination-sensitive tasks remain isolated as evaluation-only sets, reducing the risk of inflated results.
Reproducible build scripts allow the community to regenerate or extend the corpora without format drift.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be extended to include multi-task models that solve several of the five families simultaneously.
It may serve as a reference point for comparing reaction informatics progress against benchmarks in other molecular prediction domains.
Adoption could accelerate the creation of ensemble systems that chain the five tasks into end-to-end planners.
Future work might test whether the harmonized splits reveal systematic weaknesses in current atom-mapping or route-design algorithms.

Load-bearing premise

Assembling heterogeneous public sources into a harmonized representation preserves all necessary information for the five task families without introducing curation artifacts or biases that would affect downstream evaluations.

What would settle it

Observation that models trained or evaluated on SynRXN partitions achieve markedly different performance when tested on independently collected, non-public industrial reaction records that were never part of the original public sources.

Original abstract

We present SynRXN, a unified benchmarking framework and open-data resource for computer-aided synthesis planning (CASP). SynRXN decomposes end-to-end synthesis planning into five task families, covering reaction rebalancing, atom-to-atom mapping, reaction classification, reaction property prediction, and synthesis route design. Curated, provenance-tracked reaction corpora are assembled from heterogeneous public sources into a harmonized representation and packaged as versioned datasets for each task family, with explicit source metadata, licence tags, and machine-readable manifests that record checksums and row counts. For every task, SynRXN provides transparent splitting functions that generate leakage-aware train, validation, and test partitions, together with standardized evaluation workflows and metric suites tailored to classification, regression, and structured prediction settings. For sensitive benchmarking, we combine public training and validation data with held-out gold-standard test sets, and contamination-prone tasks such as reaction rebalancing and atom-to-atom mapping are distributed only as evaluation sets and are explicitly not intended for model training. Scripted build recipes enable bitwise-reproducible regeneration of all corpora across machines and over time, and the entire resource is released under permissive open licences to support reuse and extension. By removing dataset heterogeneity and packaging transparent, reusable evaluation scaffolding, SynRXN enables fair longitudinal comparison of CASP methods, supports rigorous ablations and stress tests along the full reaction-informatics pipeline, and lowers the barrier for practitioners who seek robust and comparable performance estimates for real-world synthesis planning workloads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SynRXN gives a practical unified benchmark for five CASP tasks with reproducible builds and leakage-aware splits, but the harmonization step has no visible fidelity checks.

Hi, the main point on this paper is that it packages a single open resource for reaction modeling benchmarks in computer-aided synthesis planning. It breaks the problem into five clear task families and supplies harmonized datasets plus evaluation scaffolding that should make comparisons across papers more consistent than the scattered sources we have now. That is the real contribution here.

They assembled data from multiple public origins, kept provenance and license info, added checksum manifests, and wrote scripts for bitwise-reproducible rebuilds. The leakage-aware splits and task-specific metrics are sensible engineering choices, and holding back some evaluation-only sets for mapping and rebalancing avoids obvious contamination. This setup lowers the friction for running ablations or tracking progress over time.

What it does well is the packaging: versioned releases, machine-readable manifests, and permissive licenses make the resource easy to adopt and extend. The decomposition into rebalancing, atom mapping, classification, property prediction, and route design gives a coherent pipeline view that prior separate datasets lacked.

The soft spot is the harmonization claim. The paper asserts that the single representation preserves everything needed for all five tasks, yet it shows no quantitative checks on retention of stereodescriptors, solvent fields, atom-mapping completeness, or condition data. Without before-and-after comparisons or fidelity metrics, it is possible that curation quietly drops or alters details that matter downstream. That leaves the soundness thinner than the abstract suggests.

This work is aimed at the CASP and reaction-informatics crowd who need a common reference for method evaluation. Anyone running synthesis-planning models would get immediate use from the splits and workflows.

I would send it to peer review. The resource is useful on its own, and referees can ask for the missing validation numbers without killing the contribution.

Referee Report

1 major / 1 minor

Summary. The manuscript presents SynRXN, a unified open benchmarking framework and curated dataset resource for computer-aided synthesis planning (CASP). It decomposes end-to-end synthesis planning into five task families (reaction rebalancing, atom-to-atom mapping, reaction classification, reaction property prediction, and synthesis route design), assembles harmonized provenance-tracked corpora from heterogeneous public sources with explicit metadata and machine-readable manifests, supplies leakage-aware splitting functions and standardized evaluation workflows, and releases everything under permissive licenses with reproducible build scripts.

Significance. If the harmonization and curation steps preserve task-critical information without introducing artifacts, SynRXN would provide a valuable standardized resource that enables fair longitudinal comparisons of CASP methods, supports rigorous ablations across the full reaction-informatics pipeline, and lowers the barrier to obtaining robust, comparable performance estimates for synthesis planning workloads. The transparent splitting, versioned datasets, and emphasis on contamination prevention for sensitive tasks are particular strengths.

major comments (1)

[Dataset assembly and harmonization (described in abstract and methods)] The central claim that the harmonized corpora preserve all information needed for the five task families without curation artifacts rests on an unverified assumption. The manuscript provides no quantitative fidelity metrics (e.g., retention rates for stereodescriptors, solvent/condition fields, or atom-mapping completeness) comparing source datasets before and after harmonization, nor any task-specific performance comparison pre- versus post-harmonization. This gap directly affects the reliability of the promised fair comparisons and ablations.
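To make the requested fidelity metric concrete, the Python sketch below computes a per-field retention rate between source and harmonized records. The record layout, the field names (`solvent`, `stereo`, `temperature`), and the exact-match criterion are assumptions for illustration only; neither the manuscript nor this report defines how such a rate should be measured.

```python
def retention_rates(source_rows, harmonized_rows, fields, key="id"):
    """Per field: the share of source records with a non-empty value whose
    harmonized counterpart carries the identical value.

    Returns None for a field that no source record populates.
    """
    harmonized = {row[key]: row for row in harmonized_rows}
    rates = {}
    for field in fields:
        present = [row for row in source_rows if row.get(field) not in (None, "")]
        kept = sum(
            1
            for row in present
            if harmonized.get(row[key], {}).get(field) == row[field]
        )
        rates[field] = kept / len(present) if present else None
    return rates


# Usage with made-up records: harmonization drops the stereo tag of r1.
source = [
    {"id": "r1", "solvent": "THF", "stereo": "@"},
    {"id": "r2", "solvent": "DMF", "stereo": ""},
    {"id": "r3", "solvent": "", "stereo": "@@"},
]
harmonized = [
    {"id": "r1", "solvent": "THF", "stereo": ""},
    {"id": "r2", "solvent": "DMF", "stereo": ""},
    {"id": "r3", "solvent": "", "stereo": "@@"},
]
rates = retention_rates(source, harmonized, ["solvent", "stereo", "temperature"])
# rates == {"solvent": 1.0, "stereo": 0.5, "temperature": None}
```

Exact matching is deliberately strict: a harmonization step that legitimately normalizes values (for example, mapping solvent synonyms to one name) would be counted as loss, so a real audit would compare normalized forms or report altered and dropped values separately.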

minor comments (1)

[Methods] The abstract and methods would benefit from an explicit table listing the source datasets, their original licenses, and the exact harmonization transformations applied to each field.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of SynRXN's potential value for the CASP community. We address the single major comment on dataset harmonization below.

Referee: [Dataset assembly and harmonization (described in abstract and methods)] The central claim that the harmonized corpora preserve all information needed for the five task families without curation artifacts rests on an unverified assumption. The manuscript provides no quantitative fidelity metrics (e.g., retention rates for stereodescriptors, solvent/condition fields, or atom-mapping completeness) comparing source datasets before and after harmonization, nor any task-specific performance comparison pre- versus post-harmonization. This gap directly affects the reliability of the promised fair comparisons and ablations.

Authors: We agree that explicit quantitative fidelity metrics would strengthen the manuscript and directly support the claim of artifact-free harmonization. In the revised version we will add a new subsection (Methods, Section 3.3) reporting retention rates for stereodescriptors, solvent/condition fields, atom-mapping completeness, and other task-critical attributes across all source-to-harmonized transitions. We will also include a supplementary table with task-specific baseline performance (e.g., accuracy for classification tasks, MAE for property prediction) evaluated on both original source data and the harmonized corpora to demonstrate preservation of information. These additions will be generated from the existing reproducible build scripts and will be accompanied by the corresponding code and data manifests.

revision: yes

Circularity Check

0 steps flagged

No circularity; benchmark curation is independent of fitted quantities

The paper describes assembly of public reaction corpora into harmonized, provenance-tracked datasets for five task families, together with leakage-aware splits, evaluation workflows, and reproducible build scripts. No equations, parameter fitting, predictions, or derivations appear; the contribution is the resource packaging itself. All load-bearing steps (harmonization, splitting, metric definition) are explicit curation choices with external source metadata rather than reductions to self-citations or fitted inputs, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the feasibility of harmonizing heterogeneous public reaction data without critical loss and on the chosen task decomposition adequately covering the CASP pipeline.

axioms (1)

domain assumption: Heterogeneous public reaction databases can be successfully harmonized into a single representation suitable for all five task families.
Basis: the paper describes assembling heterogeneous public sources into a harmonized representation.

pith-pipeline@v0.9.0 · 5579 in / 1333 out tokens · 76125 ms · 2026-05-16T17:23:33.917059+00:00 · methodology
