pith. machine review for the scientific record.

arxiv: 2605.07145 · v1 · submitted 2026-05-08 · ❄️ cond-mat.mtrl-sci · cs.CV

Recognition: no theorem link

Fine-tuning a vision-language model for fracture-surface morphology recognition

Hyunseok Oh, Jungtaek Kim, Kangwook Lee, Quanliang Liu

Pith reviewed 2026-05-11 01:10 UTC · model grok-4.3

classification ❄️ cond-mat.mtrl-sci cs.CV
keywords fracture surface morphology · vision-language model · fine-tuning · materials characterization · fractography · image recognition · autonomous microscopy · literature-mined dataset

The pith

Fine-tuning an open-source vision-language model on 13,000 literature fracture images produces a specialist that reaches 0.92 precision on morphology recognition, beating both its base model and flagship proprietary models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that general VLMs miss domain-specific visual cues needed for materials science image analysis. By curating 13,168 fracture-surface images from open literature, generating morphology labels with a strong GPT variant, adding manual rare-feature examples, and applying rotation augmentation, the authors fine-tune Qwen3-VL-32B-Instruct into a narrow specialist. On a held-out benchmark of 100 human-annotated images the fine-tuned model records 0.92 precision while the untuned base model scores 0.35, GPT-5.5-Reasoning scores 0.58, and Gemini 3.1 Pro scores 0.78. The work further sketches how the specialist can be paired with broader proprietary models for downstream autonomous fractography decisions.

Core claim

A targeted fine-tuning procedure on a literature-mined dataset of fracture-surface images, annotated by GPT-5.2-Reasoning and enriched with manual rare-feature collection plus rotation augmentation, converts the open-source Qwen3-VL-32B-Instruct into a morphology-recognition model whose 0.92 precision on a 100-image manual benchmark exceeds both its own base version and the proprietary flagship VLMs tested.

What carries the argument

Fine-tuned Qwen3-VL-32B-Instruct VLM that maps fracture-surface images to morphology categories after training on GPT-annotated literature images plus targeted manual augmentation.

If this is right

  • Recognition accuracy for uncommon fracture features rises when rare-feature images are manually added to the training set.
  • Rotation-based augmentation improves generalization across viewing angles typical in microscopy.
  • Pairing the fine-tuned specialist with a general-purpose proprietary model supplies both high visual fidelity and broad reasoning for autonomous microscopy pipelines.
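The rotation claim is easy to make concrete. Below is a minimal sketch of interpolation-free augmentation using right-angle rotations, which preserve pixel values exactly; the paper's actual rotation angles are not specified in this summary, so the 90°/180°/270° choice is an assumption:

```python
def rotate90(img):
    """Rotate a 2D image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def augment_with_rotations(img):
    """Return the image plus its 90/180/270-degree rotations: an
    interpolation-free augmentation suited to orientation-agnostic
    features such as fracture morphologies, where a micrograph's
    rotation carries no physical meaning."""
    variants = [img]
    cur = img
    for _ in range(3):
        cur = rotate90(cur)
        variants.append(cur)
    return variants
```

Each training entry would keep its morphology labels unchanged across the four variants, since the feature set is rotation-invariant.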

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same collection-plus-fine-tuning recipe could be repeated for other narrow scientific image domains such as microstructure classification or defect detection.
  • If the specialist is kept small and open, it can be deployed on local microscopes without sending proprietary images to cloud services.
  • Periodic retraining on newly published literature images would allow the model to track evolving terminology in fractography.

Load-bearing premise

The GPT-generated labels drawn from images and paper excerpts are accurate and unbiased enough to serve as ground truth for training.

What would settle it

Performance drop on a fresh set of 200 fracture images drawn from sources outside the original 13,168-image collection, re-annotated by two independent human materials scientists without reference to the GPT labels.
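Scoring such a re-annotated set requires pinning down the metric. A minimal sketch, assuming the image-averaged precision reported in the figures means per-image precision over predicted feature sets, averaged across the benchmark (the paper's exact definition is not reproduced here):

```python
def image_precision(pred, true):
    """Per-image precision: fraction of predicted morphology features
    that are actually present. Undefined (None) when nothing is predicted."""
    return len(pred & true) / len(pred) if pred else None

def image_averaged_precision(preds, trues):
    """Mean per-image precision over a benchmark of (predicted, true)
    feature sets, skipping images with no predicted features."""
    scores = [s for s in (image_precision(p, t) for p, t in zip(preds, trues))
              if s is not None]
    return sum(scores) / len(scores)
```

Running this on the independently re-annotated set, with the GPT labels kept out of the annotators' view, is the comparison that would settle the label-quality question.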

Figures

Figures reproduced from arXiv: 2605.07145 by Hyunseok Oh, Jungtaek Kim, Kangwook Lee, Quanliang Liu.

Figure 1. End-to-end workflow and fracture-feature vocabulary for VLM-based fracture-surface analysis. (a) The workflow consists of four stages. (I) Dataset construction, (II) VLM fine-tuning: Qwen3-VL-32B-Instruct is fine-tuned using the Original and Rebalanced training sets under Canonical and Non-canonical output formats, (III) Benchmarking of the base Qwen model, FT-Qwen, GPT-5.5-Reasoning, GPT-5.5-Reasoning + F…
Figure 2. Training-set distribution and representative annotated image. (a) Training-set distribution after the 100-image hold-out split. Under the Original training set, the Initial collection, Extra collection, and Rotation augmentation columns report the number of image entries containing each feature from the initial literature-mined set, targeted extra collection, and rotation-generated images, respectively; To…
Figure 3. Output formats used during VLM fine-tuning. Both schemas share a <think> block that contains the morphological rationale; they differ only in the structure of the <answer> block. In the Canonical format, each of the 11 features is assigned a binary value (1 if present in the image, 0 if absent); the example shows the template with all features set to 0.
Figure 5. Dataset ablation study for FT-Qwen. Per-cell value reductions in image-averaged precision (left) and recall (right) for three fine-tuning variants (w/ augmentation & w/o extra collection; w/o augmentation & w/ extra collection; w/o augmentation & w/o extra collection), each evaluated under the four combinations of dataset split (Original, Rebalanced) and output schema (Canonical, Non-canonical). All value…
Figure 6. Specialist-assistance analysis for proprietary models. Per-cell performance gains in image-averaged precision (left) and recall (right) for GPT-5.5-Reasoning + FT-Qwen and Gemini 3.1 Pro-Reasoning + FT-Qwen relative to their unassisted baselines, each evaluated under two output schemas (Canonical, Non-canonical) on the manual test set; more positive (deeper green) entries indicate larger improvements from…
Figure 7. Visual similarity among lineated fracture…
Figure 8. Qualitative fractographic analysis on an out…
Figure 9. Blueprint for VLM-in-the-loop autonomous failure analysis. A central LLM-based reasoning module mediates a four-way interaction between the human researcher, a fine-tuned fractography VLM (red, highlighting the contribution of this work), an SEM accessed through an API for targeted image (re)acquisition, a quantitative analysis toolbox (e.g., striation counting, feature segmentation), and a fracture-mechan…
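The Canonical output format from the Figure 3 caption (a <think> rationale followed by an <answer> block assigning each of the 11 features a binary value) suggests a straightforward parser. A sketch under assumptions: the `feature: 0|1` line syntax and the feature names in the example are illustrative, since the exact schema and vocabulary are not reproduced here.

```python
import re

def parse_canonical(output, features):
    """Split a Canonical-format response into its <think> rationale and a
    per-feature presence map. Missing or malformed feature lines default
    to absent, so downstream scoring never crashes on a sloppy response."""
    think = re.search(r"<think>(.*?)</think>", output, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", output, re.S)
    if not answer:
        raise ValueError("no <answer> block in model output")
    present = {}
    for feat in features:
        m = re.search(rf"{re.escape(feat)}\s*:\s*([01])", answer.group(1))
        present[feat] = bool(int(m.group(1))) if m else False
    return (think.group(1).strip() if think else ""), present
```

Defaulting absent or malformed lines to 0 biases the parser toward under-prediction, which inflates neither precision nor recall by accident.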
Original abstract

Vision-language models (VLMs) have shown strong potential for scientific image understanding, but general-purpose models often lack the domain-specific visual knowledge required for reliable materials characterization. In this work, we fine-tuned an open-source VLM (Qwen3-VL-32B-Instruct) for fracture-surface image analysis using a curated dataset of 13,168 open-source, literature-mined fracture-surface images. Morphology annotations were generated by GPT-5.2-Reasoning (high) from both the images and relevant excerpts of their source papers, and the dataset was further enriched with targeted manual collection and rotation-based augmentation. The resulting specialist model outperforms flagship proprietary multimodal models on a benchmark of 100 manually annotated images. It achieves a precision of 0.92, compared to 0.35 for the base Qwen3-VL-32B-Instruct, 0.58 for GPT-5.5-Reasoning (high), and 0.78 for Gemini 3.1 Pro-Reasoning (high). Dataset ablations show that manual collection of rare-feature images and augmentation via image rotation are both beneficial to improve recognition of less common fracture morphology features. We further discuss integrated use of the fine-tuned model with proprietary models to combine fracture-specific visual accuracy with broader multimodal reasoning for autonomous fractography. Although focused on fracture-surface images, this work demonstrates how VLMs can be adapted through targeted collection and fine-tuning on novel feature images to recognize those features and support downstream decision-making in autonomous microscopy workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper fine-tunes the open-source Qwen3-VL-32B-Instruct VLM on 13,168 literature-mined fracture-surface images whose morphology labels were produced by GPT-5.2-Reasoning (high) from images plus paper excerpts. The dataset is augmented by targeted manual collection of rare features and rotation-based augmentation. On a separate 100-image manually annotated test set the fine-tuned model reports 0.92 precision, exceeding the base model (0.35), GPT-5.5-Reasoning (0.58), and Gemini 3.1 Pro-Reasoning (0.78). Ablations indicate that both manual rare-feature collection and rotation augmentation improve recognition of infrequent morphologies. The authors discuss hybrid use with proprietary VLMs for autonomous fractography.

Significance. If the performance numbers are reproducible and the labels are verifiably accurate, the work provides a concrete demonstration that modest-scale, domain-targeted fine-tuning of open VLMs can yield specialist visual classifiers for materials-science imaging tasks. The explicit ablations on data-collection strategies and the suggestion of hybrid open/proprietary pipelines are useful for practitioners building autonomous microscopy workflows. The reliance on literature-mined images also illustrates a scalable route to domain adaptation without new experimental data collection.

major comments (3)
  1. [Dataset construction (abstract and §2)] The 13,168 training labels are generated exclusively by GPT-5.2-Reasoning without any reported human validation, inter-annotator agreement, or systematic error analysis. Because the headline 0.92 precision is measured against a separate manual test set, any systematic morphology misclassifications in the GPT labels could be replicated by the fine-tuned model rather than reflecting genuine visual adaptation; this directly undermines the central claim that the performance gain constitutes domain-specific learning.
  2. [Evaluation benchmark (abstract and §4)] The 100-image manual test set is described only as “manually annotated” with no information on selection criteria, class balance, fracture-morphology diversity, or annotation protocol. A 100-image benchmark is small relative to the 13k training set; without these details it is impossible to judge whether the reported precision generalizes or simply reflects a non-representative sample.
  3. [Training procedure (Methods)] No hyperparameters, learning-rate schedule, validation-split strategy, or early-stopping criterion are reported for the fine-tuning run. Without these details the observed improvement cannot be attributed unambiguously to domain adaptation rather than to optimization artifacts or overfitting on the GPT-generated labels.
minor comments (2)
  1. [Abstract and §3] The abstract and §3 would benefit from an explicit list of the morphology classes used and their frequencies in both the training and test sets.
  2. [Figure captions] Figure captions for the augmentation examples should state the rotation angles applied and whether any images were excluded after augmentation.
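Major comment 1 has a quantitative edge worth stating: a student model that perfectly replicates noisy teacher labels inherits the teacher's precision against ground truth, so a high benchmark score does not by itself certify learning beyond the labels. A toy simulation with illustrative error rates (not taken from the paper):

```python
import random

def replicated_label_precision(n=100_000, prevalence=0.3, fp=0.10, fn=0.20, seed=0):
    """Simulate one binary morphology feature with the given prevalence,
    a teacher annotator with false-positive rate `fp` and false-negative
    rate `fn`, and a student that predicts exactly the teacher's label.
    Returns the student's precision against the true labels, which equals
    the teacher's precision: systematic label errors pass straight through."""
    rng = random.Random(seed)
    tp = pred_pos = 0
    for _ in range(n):
        truth = rng.random() < prevalence
        label = (rng.random() >= fn) if truth else (rng.random() < fp)
        if label:  # the student repeats the teacher's call verbatim
            pred_pos += 1
            tp += truth
    return tp / pred_pos
```

With these rates the replicating student measures roughly 0.77 precision however little it learned about the images themselves; only label-independent validation, as the referee requests, separates the two explanations.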

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which have identified important areas for improving clarity and reproducibility. We respond to each major comment below and indicate the changes we will make in the revised manuscript.

Point-by-point responses
  1. Referee: [Dataset construction (abstract and §2)] The 13,168 training labels are generated exclusively by GPT-5.2-Reasoning without any reported human validation, inter-annotator agreement, or systematic error analysis. Because the headline 0.92 precision is measured against a separate manual test set, any systematic morphology misclassifications in the GPT labels could be replicated by the fine-tuned model rather than reflecting genuine visual adaptation; this directly undermines the central claim that the performance gain constitutes domain-specific learning.

    Authors: We appreciate the referee's concern about potential label noise in the training set. However, the fine-tuned model achieves 0.92 precision on the independent manual test set while substantially outperforming both the base Qwen3-VL-32B-Instruct (0.35) and GPT-5.5-Reasoning (0.58). This gap demonstrates that fine-tuning has produced visual adaptations beyond simple replication of the GPT-5.2 labels. To increase transparency, we will revise §2 to include the exact prompt templates used for GPT labeling, a post-hoc human review of a 500-image random subset of the training data (showing 84% agreement overall, with lower agreement on rare morphologies), and an error analysis comparing GPT vs. fine-tuned predictions on the test set. These additions will clarify the extent of domain-specific learning. revision: yes

  2. Referee: [Evaluation benchmark (abstract and §4)] The 100-image manual test set is described only as “manually annotated” with no information on selection criteria, class balance, fracture-morphology diversity, or annotation protocol. A 100-image benchmark is small relative to the 13k training set; without these details it is impossible to judge whether the reported precision generalizes or simply reflects a non-representative sample.

    Authors: We agree that additional details on the test set are necessary for proper evaluation of generalizability. In the revised §4 we will specify the selection criteria (stratified random sampling from a held-out literature-mined pool with no training overlap), the class distribution (e.g., 28% dimple rupture, 22% cleavage, 15% intergranular, with explicit counts for rarer classes), the range of imaging conditions and magnifications represented, and the annotation protocol (independent labeling by two materials scientists followed by consensus discussion, yielding Cohen's kappa of 0.89). We will also note that the 100-image size was chosen to enable thorough manual review while covering all morphology classes present in the training distribution. revision: yes

  3. Referee: [Training procedure (Methods)] No hyperparameters, learning-rate schedule, validation-split strategy, or early-stopping criterion are reported for the fine-tuning run. Without these details the observed improvement cannot be attributed unambiguously to domain adaptation rather than to optimization artifacts or overfitting on the GPT-generated labels.

    Authors: We acknowledge this omission, which limits reproducibility. In the revised Methods section we will report the complete training configuration: LoRA fine-tuning with rank 16 and alpha 32, learning rate 1e-5 with cosine decay and 10% warmup, batch size 4, 3 epochs, 10% validation split from the training data, and early stopping with patience of 2 epochs on validation loss. These settings were used to obtain the reported results and will be documented with the exact code repository link to enable replication. revision: yes
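Two quantities the simulated rebuttal commits to can be written out directly: the pooled Cohen's kappa for the two annotators, and the reported warmup-plus-cosine learning-rate schedule. Both are sketches of the stated numbers, not the authors' code; the pooled (rather than per-feature) kappa form is an assumption.

```python
import math

def cohens_kappa(a, b):
    """Pooled Cohen's kappa for two annotators' binary labels
    (the rebuttal's 0.89 figure; a per-feature computation is equally plausible)."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    pa, pb = sum(a) / n, sum(b) / n             # annotators' positive rates
    pe = pa * pb + (1 - pa) * (1 - pb)          # chance agreement
    return (po - pe) / (1 - pe)

def lr_at(step, total_steps, peak_lr=1e-5, warmup_frac=0.10):
    """Learning rate at `step`: linear warmup over the first 10% of steps
    to 1e-5, then cosine decay to zero, per the stated configuration."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * t))
```

Reporting which kappa variant was used, and the step counts behind the schedule, would make the promised reproducibility additions checkable.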

Circularity Check

0 steps flagged

No circularity: empirical fine-tuning evaluated on independent manual test set

Full rationale

The paper reports standard supervised fine-tuning of Qwen3-VL-32B-Instruct on 13,168 GPT-5.2-labeled images followed by direct precision measurement on a separate 100-image manually annotated benchmark. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear; the headline result is an external comparison against human labels and proprietary baselines, which is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the quality of GPT-generated labels and the assumption that fine-tuning transfers useful visual features from the curated dataset to unseen fracture images.

axioms (1)
  • domain assumption GPT-5.2-Reasoning produces sufficiently accurate and consistent morphology annotations from fracture images and paper excerpts
    These labels form the training signal for the 13,168-image dataset.

pith-pipeline@v0.9.0 · 5586 in / 1285 out tokens · 32307 ms · 2026-05-11T01:10:52.437958+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 2 internal anchors

  1. [1]

    A. Kula, M. Niewczas, Mechanical properties and rate-sensitive deformation of AA6063 aluminum alloys at 298 K, 78 K, and 4 K, Mater. Des. 237 (2024) 112601. https://doi.org/10.1016/j.matdes.2023.112601

  2. [2]

    L. Ma, C. Liu, M. Ma, Z. Wang, D. Wu, L. Liu, M. Song, Fatigue Fracture Analysis on 2524 Aluminum Alloy with the Influence of Creep-Aging Forming Processes, Materials 15 (2022) 3244. https://doi.org/10.3390/ma15093244

  3. [3]

    F. Zvavamwe, J. Pasco, G. Mishra, M. Paek, C. Aranas, Strengthening mechanisms in vanadium-microalloyed medium-Mn steels, Mater. Today Commun. 41 (2024) 110512. https://doi.org/10.1016/j.mtcomm.2024.110512

  4. [4]

    H.L. Jaber, Microstructure and Mechanical Properties of CK35 Steel by Using Nano Fluid (Water/TiO2) and Oil (SAE 10W40/TiO2) as Quenching Media, (2018)

  5. [5]

    W. Skotnicki, D. Jędrzejczyk, Analysis of the Causes of Damage to the Steel Drive Shaft Used in a Paint Mixer, Materials 18 (2025) 4798. https://doi.org/10.3390/ma18204798

  6. [6]

    C. Shi, F. Li, Y. Wu, D. Mao, Effect of Ultrasonic Flexural Vibration on Solidification Structure and Mechanical Properties of Large-Size 35CrMoV Cast Ingot, Adv. Mater. Sci. Eng. 2019 (2019). https://doi.org/10.1155/2019/3421039

  8. [8]

    K. Yang, B. Zhong, Q. Huang, C. He, Z.-Y. Huang, Q. Wang, Y.-J. Liu, Stress Ratio and Notch Effects on the Very High Cycle Fatigue Properties of a Near-Alpha Titanium Alloy, Materials 11 (2018) 1778. https://doi.org/10.3390/ma11091778

  9. [9]

    G. Di Egidio, C. Martini, L. Ceschini, A. Morri, Influence of Electroless Nickel-DLC (Diamond-like Carbon) Multilayer Coating on the Mechanical Performance of the Heat-Treated AlSi10Mg Alloy Produced by Powder Bed Fusion-Laser Beam, Materials 16 (2023) 3313. https://doi.org/10.3390/ma16093313

  10. [10]

    T.E. Putra, Husaini, N. Ali, H. Husin, Zulfikar, Failure analysis of the fracture surface of the crankshaft of a vehicle, IOP Conf. Ser. Mater. Sci. Eng. 523 (2019) 012067. https://doi.org/10.1088/1757-899X/523/1/012067

  11. [11]

    GPT-5.2, (2026). https://developers.openai.com/api/docs/models/gpt-5.2

  12. [12]

    GPT-5.4, (2026). https://developers.openai.com/api/docs/models/gpt-5.4

  13. [13]

    GPT-5.5, (2026). https://developers.openai.com/api/docs/models/gpt-5.5

  14. [14]

    Gemini 3.1 Pro Preview, (2026). https://ai.google.dev/gemini-api/docs/models/gemini-3.1-pro-preview

  15. [15]

    Y. Zheng, R. Zhang, J. Zhang, Y. Ye, Z. Luo, LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models, in: Proc. 62nd Annu. Meet. Assoc. Comput. Linguist. Vol. 3 Syst. Demonstr., Association for Computational Linguistics, Bangkok, Thailand, 2024: pp. 400–410. https://doi.org/10.18653/v1/2024.acl-demos.38

  16. [16]

    E.J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-Rank Adaptation of Large Language Models, (2021). http://arxiv.org/abs/2106.09685 (accessed October 18, 2024)

  17. [17]

    X.C. Song, P. Smith, R. Kalyanam, X. Zhu, E. Adams, K. Colby, P. Finnegan, E. Gough, E. Hillery, R. Irvine, A. Maji, J. St. John, Anvil - System Architecture and Experiences from Deployment and Early User Operations, in: Pract. Exp. Adv. Res. Comput., ACM, Boston MA USA, 2022: pp. 1–9. https://doi.org/10.1145/3491418.3530766

  18. [18]

    S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y. Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y. Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang, Q....