No theorem link
Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition
Pith reviewed 2026-05-13 19:33 UTC · model grok-4.3
The pith
MolSeek-OCR matches the exact-match accuracy of top image-to-sequence models on molecular structure recognition
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Formulating OCSR as image-conditioned SMILES generation, the authors apply a two-stage progressive supervised fine-tuning strategy, beginning with parameter-efficient LoRA and transitioning to selective full-parameter fine-tuning with split learning rates, on a corpus of synthetic PubChem renderings plus realistic USPTO-MOL patent images. The adapted MolSeek-OCR model achieves exact-match accuracies comparable to the best-performing image-to-sequence models, though it remains inferior to state-of-the-art image-to-graph models. Reinforcement-style post-training and data-curation refinements fail to improve the strict sequence-level fidelity required for exact SMILES matching.
What carries the argument
Two-stage progressive supervised fine-tuning strategy that starts with LoRA adaptation and moves to selective full-parameter fine-tuning with split learning rates to stabilize image-to-SMILES generation.
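The split-learning-rate idea can be sketched as optimizer parameter groups. A minimal sketch, assuming the two stages differ only in which groups are trainable and at what rate; the module names (`lora_adapters`, `vision_projector`, `language_decoder`) and the rates are illustrative placeholders, not values from the paper:

```python
# Sketch of the two-stage schedule: stage 1 trains only low-rank LoRA
# adapters; stage 2 additionally unfreezes selected base-model modules
# at a smaller learning rate than the adapters ("split" learning rates).
# Keys mirror PyTorch's optimizer param-group convention; the string
# values stand in for the actual parameter lists.

def build_param_groups(stage, base_lr=1e-5, lora_lr=1e-4):
    """Return optimizer parameter groups for the given training stage."""
    if stage == 1:
        # Parameter-efficient phase: only LoRA adapters receive updates.
        return [{"params": "lora_adapters", "lr": lora_lr}]
    # Progressive phase: selectively unfreeze parts of the base model
    # (hypothetically, the projector and decoder) at a lower rate so the
    # pretrained weights drift slowly while the adapters keep adapting.
    return [
        {"params": "lora_adapters", "lr": lora_lr},
        {"params": "vision_projector", "lr": base_lr},
        {"params": "language_decoder", "lr": base_lr},
    ]

stage2 = build_param_groups(stage=2)
assert {g["lr"] for g in stage2} == {1e-5, 1e-4}  # split learning rates
```

In a real run, each group's `"params"` entry would hold the corresponding tensors and the groups would be passed straight to an optimizer constructor.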
If this is right
- General vision-language models can be specialized for OCSR through progressive fine-tuning, without building task-specific architectures from scratch.
- Combining synthetic renderings and real patent images yields robustness gains for molecular diagram recognition.
- Image-to-sequence models can reach competitive exact-match rates on SMILES output but do not surpass image-to-graph methods.
- Reinforcement-style post-training and curation steps do not raise strict sequence-level fidelity for exact SMILES matching.
Where Pith is reading between the lines
- The same two-stage tuning pattern may transfer to other scientific diagram-to-sequence tasks such as equation or reaction recognition.
- Integrating graph-decoding heads into the sequence model could narrow the remaining gap to image-to-graph leaders.
- Scaling synthetic data generation from chemical databases may support diagram understanding in adjacent domains like materials science.
- Observed instabilities in full fine-tuning could be architecture-dependent and less severe in newer vision-language models.
Load-bearing premise
Direct full-parameter supervised fine-tuning fails due to instabilities for this task, and the two-stage progressive strategy reliably overcomes those instabilities.
What would settle it
A direct full-parameter fine-tuning run of DeepSeek-OCR-2 on the same PubChem-plus-USPTO-MOL corpus that reaches equal or higher exact SMILES matching accuracy without instabilities would falsify the necessity of the two-stage strategy.
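Settling that question requires an agreed-upon notion of "instability". As a rough illustration, assuming instability shows up as sudden loss spikes, a toy detector could flag steps whose loss jumps well above the recent running mean; the window size and spike factor below are arbitrary choices, not from the paper:

```python
# Toy instability diagnostic: flag any training step whose loss exceeds
# `factor` times the mean of the previous `window` losses.

def find_loss_spikes(losses, window=5, factor=2.0):
    spikes = []
    for i in range(window, len(losses)):
        recent = losses[i - window:i]
        mean = sum(recent) / window
        if losses[i] > factor * mean:
            spikes.append(i)
    return spikes

# A stable run settles smoothly; an unstable run spikes mid-training.
stable = [2.0, 1.5, 1.2, 1.0, 0.9, 0.85, 0.8, 0.78]
unstable = [2.0, 1.5, 1.2, 1.0, 0.9, 6.0, 1.1, 0.9]
assert find_loss_spikes(stable) == []
assert find_loss_spikes(unstable) == [5]
```

A direct full-parameter run that yields zero such flags while matching the two-stage model's accuracy would be the falsifying evidence described above.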
Original abstract
Optical Chemical Structure Recognition (OCSR) is critical for converting 2D molecular diagrams from printed literature into machine-readable formats. While Vision-Language Models have shown promise in end-to-end OCR tasks, their direct application to OCSR remains challenging, and direct full-parameter supervised fine-tuning often fails. In this work, we adapt DeepSeek-OCR-2 for molecular optical recognition by formulating the task as image-conditioned SMILES generation. To overcome training instabilities, we propose a two-stage progressive supervised fine-tuning strategy: starting with parameter-efficient LoRA and transitioning to selective full-parameter fine-tuning with split learning rates. We train our model on a large-scale corpus combining synthetic renderings from PubChem and realistic patent images from USPTO-MOL to improve coverage and robustness. Our fine-tuned model, MolSeek-OCR, demonstrates competitive capabilities, achieving exact matching accuracies comparable to the best-performing image-to-sequence model. However, it remains inferior to state-of-the-art image-to-graph modelS. Furthermore, we explore reinforcement-style post-training and data-curation-based refinement, finding that they fail to improve the strict sequence-level fidelity required for exact SMILES matching.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper adapts DeepSeek-OCR-2 for Optical Chemical Structure Recognition (OCSR) by casting the task as image-conditioned SMILES generation. It introduces a two-stage progressive supervised fine-tuning strategy (LoRA followed by selective full-parameter tuning with split learning rates) to mitigate instabilities, trains on combined synthetic PubChem renderings and USPTO-MOL patent images, and reports that the resulting MolSeek-OCR model reaches exact-match accuracies comparable to the strongest prior image-to-sequence models while remaining inferior to image-to-graph SOTA; reinforcement-style post-training and data-curation refinements are shown to be ineffective for strict sequence fidelity.
Significance. If the quantitative claims hold, the work would demonstrate a practical, stable adaptation recipe for vision-language models on exact SMILES extraction, addressing a known pain point in OCSR. The use of mixed synthetic and realistic patent data is a positive step toward robustness, but the absence of any numerical results, baseline tables, ablations, or error analysis prevents assessment of whether the two-stage strategy actually delivers the claimed gains or merely reproduces existing performance.
major comments (2)
- [Abstract and §4] Abstract and §4 (Results): the central claim of 'competitive exact matching accuracies comparable to the best-performing image-to-sequence model' is stated without any reported numbers, tables, or comparisons to specific baselines (e.g., no mention of what the prior best image-to-sequence accuracy is or how MolSeek-OCR compares on the same test set). This is load-bearing for the performance contribution and cannot be evaluated from the supplied text.
- [§3] §3 (Training Strategy): the assertion that direct full-parameter fine-tuning 'often fails due to instabilities' and that the proposed LoRA-then-selective-full-parameter schedule with split learning rates 'reliably overcomes' them is presented without loss curves, instability diagnostics, or an ablation comparing the two-stage schedule against direct full fine-tuning or standard LoRA-only training.
minor comments (2)
- [Abstract] Abstract: 'modelS' contains a stray capital S; correct to 'models'.
- [§4] The manuscript should include at minimum a results table with exact-match accuracy, SMILES validity rate, and comparison to at least two published image-to-sequence and image-to-graph baselines on the same USPTO-MOL test split.
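For illustration, the two headline metrics requested above could be computed as below. This is a sketch: exact match is strict string equality, whereas in practice both sides would first be canonicalized with a toolkit such as RDKit, and the toy validity checker stands in for a real parser (e.g. RDKit's `MolFromSmiles` returning non-None); neither is the paper's procedure.

```python
# Sketch of the two headline metrics for the requested results table.

def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to their reference SMILES.

    Strict string equality; real evaluations canonicalize both sides
    first so that equivalent SMILES spellings compare equal."""
    assert len(predictions) == len(references)
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)

def validity_rate(predictions, is_valid):
    """Fraction of predictions accepted by a validity checker."""
    return sum(map(is_valid, predictions)) / len(predictions)

preds = ["CCO", "c1ccccc1", "C1CC"]  # last one: unclosed ring bond
refs = ["CCO", "c1ccccc1", "C1CCC1"]
toy_valid = lambda s: s.count("1") % 2 == 0  # crude ring-closure parity
em = exact_match_accuracy(preds, refs)  # 2 of 3 match exactly
vr = validity_rate(preds, toy_valid)    # 2 of 3 pass the toy check
```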
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below and will revise the manuscript to supply the requested quantitative evidence and supporting analyses.
Point-by-point responses
- Referee: [Abstract and §4] Abstract and §4 (Results): the central claim of 'competitive exact matching accuracies comparable to the best-performing image-to-sequence model' is stated without any reported numbers, tables, or comparisons to specific baselines (e.g., no mention of what the prior best image-to-sequence accuracy is or how MolSeek-OCR compares on the same test set). This is load-bearing for the performance contribution and cannot be evaluated from the supplied text.
  Authors: We agree that the performance claim cannot be assessed without numerical results. The current manuscript states comparability in the abstract and §4 but does not report specific exact-match accuracies, name the leading image-to-sequence baseline, or provide side-by-side numbers on identical test sets. In the revised version we will insert a results table with these exact figures, baseline references, and test-set details to substantiate the claim. Revision: yes.
- Referee: [§3] §3 (Training Strategy): the assertion that direct full-parameter fine-tuning 'often fails due to instabilities' and that the proposed LoRA-then-selective-full-parameter schedule with split learning rates 'reliably overcomes' them is presented without loss curves, instability diagnostics, or an ablation comparing the two-stage schedule against direct full fine-tuning or standard LoRA-only training.
  Authors: We acknowledge that §3 presents the two-stage schedule as a solution to instabilities without accompanying empirical support. The manuscript contains no loss curves, instability metrics, or ablation experiments comparing direct full-parameter tuning, LoRA-only, and the proposed progressive approach. We will add these diagnostics and ablations to §3 in the revision to demonstrate the claimed stability gains. Revision: yes.
Circularity Check
No circularity: purely empirical fine-tuning study
full rationale
The paper reports an empirical adaptation of DeepSeek-OCR-2 via a two-stage LoRA-then-selective-full-parameter fine-tuning procedure on PubChem and USPTO-MOL data, followed by accuracy measurements against external baselines. No derivation chain, equations, or first-principles predictions exist that could reduce to fitted inputs or self-citations. The central claim is a performance outcome on held-out test sets; all steps are externally falsifiable via reproduction on the cited public datasets. No self-definitional, fitted-input, or uniqueness-imported steps are present.
Axiom & Free-Parameter Ledger
free parameters (2)
- LoRA rank and scaling
- split learning rates
axioms (2)
- Domain assumption: Vision-language models can be successfully adapted to domain-specific image-to-sequence tasks via staged supervised fine-tuning.
- Domain assumption: A mixture of synthetic PubChem renderings and USPTO-MOL patent images supplies adequate coverage and robustness for the target distribution.
Reference graph
Works this paper leans on
- [1] Cao, H., Liu, Z., Lu, X., Yao, Y., and Li, Y. InstructMol: Multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. arXiv preprint arXiv:2308.13418. URL https://arxiv.org/abs/2308.13418.
- [2] Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., and Park, S. OCR-free document understanding transformer. URL https://arxiv.org/abs/2111.15664.
- [3] Kim, S., Thiessen, P. A., Bolton, E. E., Chen, J., Fu, G., Gindulyte, A., Han, L., He, J., He, S., Shoemaker, B. A., Wang, J., Yu, B., Zhang, J., and Bryant, S. H. PubChem substance and compound databases. Nucleic Acids Research, 44(D1):D1202–D1213. doi: 10.1093/nar/gkv951.
- [4] Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA.
- [5] Liu, H., Yin, H., Luo, Z., and Wang, X. Integrating chemistry knowledge in large language models via prompt engineering. Synthetic and Systems Biotechnology, 10(1):23–38. URL https://arxiv.org/abs/2408.07246.
- [6] Qian, Y., Guo, J., Tu, Z., Li, Z., Coley, C. W., and Barzilay, R. MolScribe: Robust molecular structure recognition with image-to-graph generation. Journal of Chemical Information and Modeling, 63(7):1925–1934. doi: 10.1021/acs.jcim.2c01480.
- [7] Rajan, K., Brinkhaus, H. O., Zielesny, A., and Steinbeck, C. A review of optical chemical structure recognition tools. Journal of Cheminformatics, 12(1):60. doi: 10.1186/s13321-020-00465-0.
- [8] Rajan, K., Brinkhaus, H. O., Agea, M. I., Zielesny, A., and Steinbeck, C. DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications. Nature Communications, 14(1):5045. doi: 10.1038/s41467-023-40782-0.
- [9] Staker, J., Marshall, K., Abel, R., and McQuaw, C. M. Molecular structure extraction from documents using deep learning. Journal of Chemical Information and Modeling, 59(3):1017–1029. doi: 10.1021/acs.jcim.8b00669.
- [10] Tang, H., Long, J., Ji, B., and Wang, J. Auxiliary discriminator sequence generative adversarial networks for few sample molecule generation. Journal of Chemical Information and Modeling, 65(19):10311–10322. doi: 10.1021/acs.jcim.5c01737.
- [11] Wang, J., He, Y., Yang, H., Wu, J., Ge, L., Wei, X., Wang, Y., Li, L., Ao, H., Liu, C., Wang, B., Wu, L., and He, C. GTR-CoT: Graph traversal as visual chain of thought for molecular structure recognition. URL https://arxiv.org/abs/2506.07553.
- [12] Wei, H., Sun, Y., and Li, Y. DeepSeek-OCR: Contexts optical compression. arXiv preprint arXiv:2510.18234.
- [13] Wei, H., Sun, Y., and Li, Y. DeepSeek-OCR 2: Visual causal flow. arXiv preprint arXiv:2601.20552.