arxiv: 2604.03476 · v2 · submitted 2026-04-03 · 💻 cs.CV · cs.AI· q-bio.BM

Recognition: no theorem link

Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition

Haocheng Tang , Xingyu Dang , Junmei Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-13 19:33 UTC · model grok-4.3

classification 💻 cs.CV cs.AIq-bio.BM

keywords OCSRmolecular structure recognitionSMILES generationfine-tuningvision-language modelsLoRAimage-to-sequencechemical diagrams

0 comments

The pith

MolSeek-OCR matches exact matching accuracy of top image-to-sequence models on molecular structure recognition

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts DeepSeek-OCR-2 to convert 2D molecular diagrams into SMILES strings by treating the task as image-conditioned sequence generation. Direct full-parameter fine-tuning proves unstable, so the authors introduce a two-stage process that starts with LoRA and shifts to selective full-parameter updates using split learning rates. Training combines large-scale synthetic images from PubChem with realistic patent diagrams from USPTO-MOL to build coverage. The resulting model reaches performance levels comparable to leading image-to-sequence approaches while remaining behind state-of-the-art image-to-graph methods, and reinforcement or curation post-training steps do not raise strict sequence fidelity.

Core claim

By formulating OCSR as image-conditioned SMILES generation and applying a two-stage progressive supervised fine-tuning strategy that begins with parameter-efficient LoRA and transitions to selective full-parameter fine-tuning with split learning rates, on a corpus of synthetic PubChem renderings plus realistic USPTO-MOL patent images, the adapted MolSeek-OCR model achieves exact matching accuracies comparable to the best-performing image-to-sequence models, though it remains inferior to state-of-the-art image-to-graph models. Reinforcement-style post-training and data-curation refinements fail to improve the strict sequence-level fidelity required for exact SMILES matching.

What carries the argument

Two-stage progressive supervised fine-tuning strategy that starts with LoRA adaptation and moves to selective full-parameter fine-tuning with split learning rates to stabilize image-to-SMILES generation.

If this is right

General vision-language models can be specialized for OCSR through progressive fine-tuning without task-specific architectures from scratch.
Combining synthetic renderings and real patent images yields robustness gains for molecular diagram recognition.
Image-to-sequence models can reach competitive exact-match rates on SMILES output but do not surpass image-to-graph methods.
Reinforcement-style post-training and curation steps do not raise strict sequence-level fidelity for exact SMILES matching.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-stage tuning pattern may transfer to other scientific diagram-to-sequence tasks such as equation or reaction recognition.
Integrating graph-decoding heads into the sequence model could narrow the remaining gap to image-to-graph leaders.
Scaling synthetic data generation from chemical databases may support diagram understanding in adjacent domains like materials science.
Observed instabilities in full fine-tuning could be architecture-dependent and less severe in newer vision-language models.

Load-bearing premise

Direct full-parameter supervised fine-tuning fails due to instabilities for this task, and the two-stage progressive strategy reliably overcomes those instabilities.

What would settle it

A direct full-parameter fine-tuning run of DeepSeek-OCR-2 on the same PubChem-plus-USPTO-MOL corpus that reaches equal or higher exact SMILES matching accuracy without instabilities would falsify the necessity of the two-stage strategy.

Figures

Figures reproduced from arXiv: 2604.03476 by Haocheng Tang, Junmei Wang, Xingyu Dang.

**Figure 1.** Figure 1: The overall architecture of DeepSeek-OCR-2. the ChemDraw-like style, and realistic USPTO-MOL images, resulting in 192k training instances in total. In the subsequent progressive full fine-tuning stage, we enlarge the corpus to 800k examples, consisting of 300k MolScribelike rendered molecules from PubChem, 300k ChemDrawlike rendered molecules from PubChem, and 200k realistic molecular images from USPTO… view at source ↗

**Figure 2.** Figure 2: The 2 stages fine-tuning. A. Train loss in stage 1. B. Train accuracy in stage 2. C. Train loss in stage 2. C. Train accuracy in stage 2. PubChem molecules using on-the-fly rendering. One subset follows a MolScribe-like style with stronger appearance augmentation, while the other follows a cleaner ChemDrawlike style with less augmentation. A third subset is built from realistic molecule images from USPTO-… view at source ↗

read the original abstract

Optical Chemical Structure Recognition (OCSR) is critical for converting 2D molecular diagrams from printed literature into machine-readable formats. While Vision-Language Models have shown promise in end-to-end OCR tasks, their direct application to OCSR remains challenging, and direct full-parameter supervised fine-tuning often fails. In this work, we adapt DeepSeek-OCR-2 for molecular optical recognition by formulating the task as image-conditioned SMILES generation. To overcome training instabilities, we propose a two-stage progressive supervised fine-tuning strategy: starting with parameter-efficient LoRA and transitioning to selective full-parameter fine-tuning with split learning rates. We train our model on a large-scale corpus combining synthetic renderings from PubChem and realistic patent images from USPTO-MOL to improve coverage and robustness. Our fine-tuned model, MolSeek-OCR, demonstrates competitive capabilities, achieving exact matching accuracies comparable to the best-performing image-to-sequence model. However, it remains inferior to state-of-the-art image-to-graph modelS. Furthermore, we explore reinforcement-style post-training and data-curation-based refinement, finding that they fail to improve the strict sequence-level fidelity required for exact SMILES matching.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper offers a practical two-stage fine-tuning recipe for adapting DeepSeek-OCR-2 to molecular structure recognition, but the lack of quantitative results leaves the performance claims hard to evaluate.

read the letter

The paper's main contribution is showing that a two-stage fine-tuning approach—starting with LoRA and then doing selective full-parameter updates with different learning rates—lets them adapt DeepSeek-OCR-2 to generate SMILES from molecular images without the training blowing up. They train on a mix of synthetic PubChem images and real USPTO patent scans, and the resulting MolSeek-OCR model reaches accuracy levels that match other sequence-based methods.

Referee Report

2 major / 2 minor

Summary. The paper adapts DeepSeek-OCR-2 for Optical Chemical Structure Recognition (OCSR) by casting the task as image-conditioned SMILES generation. It introduces a two-stage progressive supervised fine-tuning strategy (LoRA followed by selective full-parameter tuning with split learning rates) to mitigate instabilities, trains on combined synthetic PubChem renderings and USPTO-MOL patent images, and reports that the resulting MolSeek-OCR model reaches exact-match accuracies comparable to the strongest prior image-to-sequence models while remaining inferior to image-to-graph SOTA; reinforcement-style post-training and data-curation refinements are shown to be ineffective for strict sequence fidelity.

Significance. If the quantitative claims hold, the work would demonstrate a practical, stable adaptation recipe for vision-language models on exact SMILES extraction, addressing a known pain point in OCSR. The use of mixed synthetic and realistic patent data is a positive step toward robustness, but the absence of any numerical results, baseline tables, ablations, or error analysis prevents assessment of whether the two-stage strategy actually delivers the claimed gains or merely reproduces existing performance.

major comments (2)

[Abstract and §4] Abstract and §4 (Results): the central claim of 'competitive exact matching accuracies comparable to the best-performing image-to-sequence model' is stated without any reported numbers, tables, or comparisons to specific baselines (e.g., no mention of what the prior best image-to-sequence accuracy is or how MolSeek-OCR compares on the same test set). This is load-bearing for the performance contribution and cannot be evaluated from the supplied text.
[§3] §3 (Training Strategy): the assertion that direct full-parameter fine-tuning 'often fails due to instabilities' and that the proposed LoRA-then-selective-full-parameter schedule with split learning rates 'reliably overcomes' them is presented without loss curves, instability diagnostics, or an ablation comparing the two-stage schedule against direct full fine-tuning or standard LoRA-only training.

minor comments (2)

[Abstract] Abstract: 'modelS' contains a stray capital S; correct to 'models'.
[§4] The manuscript should include at minimum a results table with exact-match accuracy, SMILES validity rate, and comparison to at least two published image-to-sequence and image-to-graph baselines on the same USPTO-MOL test split.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and will revise the manuscript to supply the requested quantitative evidence and supporting analyses.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Results): the central claim of 'competitive exact matching accuracies comparable to the best-performing image-to-sequence model' is stated without any reported numbers, tables, or comparisons to specific baselines (e.g., no mention of what the prior best image-to-sequence accuracy is or how MolSeek-OCR compares on the same test set). This is load-bearing for the performance contribution and cannot be evaluated from the supplied text.

Authors: We agree that the performance claim cannot be assessed without numerical results. The current manuscript states comparability in the abstract and §4 but does not report specific exact-match accuracies, name the leading image-to-sequence baseline, or provide side-by-side numbers on identical test sets. In the revised version we will insert a results table with these exact figures, baseline references, and test-set details to substantiate the claim. revision: yes
Referee: [§3] §3 (Training Strategy): the assertion that direct full-parameter fine-tuning 'often fails due to instabilities' and that the proposed LoRA-then-selective-full-parameter schedule with split learning rates 'reliably overcomes' them is presented without loss curves, instability diagnostics, or an ablation comparing the two-stage schedule against direct full fine-tuning or standard LoRA-only training.

Authors: We acknowledge that §3 presents the two-stage schedule as a solution to instabilities without accompanying empirical support. The manuscript contains no loss curves, instability metrics, or ablation experiments comparing direct full-parameter tuning, LoRA-only, and the proposed progressive approach. We will add these diagnostics and ablations to §3 in the revision to demonstrate the claimed stability gains. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical fine-tuning study

full rationale

The paper reports an empirical adaptation of DeepSeek-OCR-2 via a two-stage LoRA-then-selective-full-parameter fine-tuning procedure on PubChem and USPTO-MOL data, followed by accuracy measurements against external baselines. No derivation chain, equations, or first-principles predictions exist that could reduce to fitted inputs or self-citations. The central claim is a performance outcome on held-out test sets; all steps are externally falsifiable via reproduction on the cited public datasets. No self-definitional, fitted-input, or uniqueness-imported steps are present.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on standard transfer-learning assumptions about data coverage and training stability; no new physical or mathematical entities are introduced.

free parameters (2)

LoRA rank and scaling
Hyperparameters chosen to stabilize the first training stage; values not reported in abstract.
split learning rates
Separate rates for different parameter groups in the second stage; selection method not detailed.

axioms (2)

domain assumption Vision-language models can be successfully adapted to domain-specific image-to-sequence tasks via staged supervised fine-tuning
Invoked to justify the two-stage protocol as a solution to direct-training failures.
domain assumption Mixture of synthetic PubChem renderings and USPTO-MOL patent images supplies adequate coverage and robustness for the target distribution
Used to motivate the training corpus construction.

pith-pipeline@v0.9.0 · 5505 in / 1499 out tokens · 67474 ms · 2026-05-13T19:33:09.330341+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1]

arXiv preprint arXiv:2308.13418 , year=

URL https://arxiv.org/abs/ 2308.13418. Cao, H., Liu, Z., Lu, X., Yao, Y ., and Li, Y . Instructmol: Multi-modal integration for building a versatile and reli- able molecular assistant in drug discovery,

work page arXiv
[2]

org/abs/2508.10696

URL https://arxiv. org/abs/2508.10696. Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., and Park, S. Ocr- free document understanding transformer,

work page arXiv
[3]

Kim, S., Thiessen, P

URL https://arxiv.org/abs/2111.15664. Kim, S., Thiessen, P. A., Bolton, E. E., Chen, J., Fu, G., Gindulyte, A., Han, L., He, J., He, S., Shoemaker, B. A., Wang, J., Yu, B., Zhang, J., and Bryant, S. H. Pub- chem substance and compound databases.Nucleic Acids Research, 44(D1):D1202–D1213, 09

work page arXiv
[4]

Langley, P

doi: 10.1093/nar/gkv951. Langley, P. Crafting papers on machine learning. In Langley, P. (ed.),Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stan- ford, CA,

work page doi:10.1093/nar/gkv951 2000
[5]

Liu, H., Yin, H., Luo, Z., and Wang, X

URLhttps://arxiv.org/abs/2408.07246. Liu, H., Yin, H., Luo, Z., and Wang, X. Integrating chem- istry knowledge in large language models via prompt engineering.Synthetic and Systems Biotechnology, 10(1): 23–38, 3

work page arXiv
[6]

W., and Barzilay, R

Qian, Y ., Guo, J., Tu, Z., Li, Z., Coley, C. W., and Barzilay, R. Molscribe: Robust molecular structure recognition with image-to-graph generation.Journal of Chemical Information and Modeling, 63(7):1925–1934,

work page 1925
[7]

Rajan, K., Brinkhaus, H

doi: 10.1021/acs.jcim.2c01480. Rajan, K., Brinkhaus, H. O., Zielesny, A., and Steinbeck, C. A review of optical chemical structure recognition tools. Journal of cheminformatics, 12(1):60, 10

work page doi:10.1021/acs.jcim.2c01480
[8]

doi: 10.1186/s13321-020-00465-0

ISSN 1758-2946. doi: 10.1186/s13321-020-00465-0. Rajan, K., Brinkhaus, H. O., Agea, M. I., Zielesny, A., and Steinbeck, C. Decimer.ai: an open platform for auto- mated optical chemical structure identification, segmen- tation and recognition in scientific publications.Nature communications, 14(1):5045, 8

work page doi:10.1186/s13321-020-00465-0
[9]

doi: 10.1038/s41467-023-40782-0

ISSN 2041-1723. doi: 10.1038/s41467-023-40782-0. Staker, J., Marshall, K., Abel, R., and McQuaw, C. M. Molecular structure extraction from documents using deep learning.Journal of Chemical Information and Modeling, 59(3):1017–1029,

work page doi:10.1038/s41467-023-40782-0 2041
[10]

Performance and Analysis of the Alchemical Transfer Method for Binding-Free-Energy Predictions of Diverse Ligands

doi: 10.1021/acs. jcim.8b00669. Tang, H., Long, J., Ji, B., and Wang, J. Auxiliary discrmi- nator sequence generative adversarial networks for few sample molecule generation.Journal of Chemical Infor- mation and Modeling, 65(19):10311–10322,

work page doi:10.1021/acs
[11]

Wang, J., He, Y ., Yang, H., Wu, J., Ge, L., Wei, X., Wang, Y ., Li, L., Ao, H., Liu, C., Wang, B., Wu, L., and He, C

doi: 10.1021/acs.jcim.5c01737. Wang, J., He, Y ., Yang, H., Wu, J., Ge, L., Wei, X., Wang, Y ., Li, L., Ao, H., Liu, C., Wang, B., Wu, L., and He, C. Gtr-cot: Graph traversal as visual chain of thought for molecular structure recognition,

work page doi:10.1021/acs.jcim.5c01737
[12]

Wei, H., Sun, Y ., and Li, Y

URL https: //arxiv.org/abs/2506.07553. Wei, H., Sun, Y ., and Li, Y . Deepseek-ocr: Contexts optical compression.arXiv preprint arXiv:2510.18234,

work page arXiv
[13]

Deepseek-ocr 2: Visual causal flow.arXiv preprint arXiv:2601.20552,

Wei, H., Sun, Y ., and Li, Y . Deepseek-ocr 2: Visual causal flow.arXiv preprint arXiv:2601.20552,

work page arXiv