pith. machine review for the scientific record.

arxiv: 2605.07145 · v1 · submitted 2026-05-08 · ❄️ cond-mat.mtrl-sci · cs.CV

Recognition: no theorem link

Fine-tuning a vision-language model for fracture-surface morphology recognition

Hyunseok Oh, Jungtaek Kim, Kangwook Lee, Quanliang Liu

Pith reviewed 2026-05-11 01:10 UTC · model grok-4.3

classification ❄️ cond-mat.mtrl-sci cs.CV
keywords fracture surface morphology · vision-language model · fine-tuning · materials characterization · fractography · image recognition · autonomous microscopy · literature-mined dataset

The pith

Fine-tuning an open-source vision-language model on 13,000 literature fracture images produces a specialist that reaches 0.92 precision on morphology recognition, beating both its base model and flagship proprietary models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that general VLMs miss domain-specific visual cues needed for materials science image analysis. By curating 13,168 fracture-surface images from open literature, generating morphology labels with a strong GPT variant, adding manual rare-feature examples, and applying rotation augmentation, the authors fine-tune Qwen3-VL-32B-Instruct into a narrow specialist. On a held-out benchmark of 100 human-annotated images the fine-tuned model records 0.92 precision while the untuned base model scores 0.35, GPT-5.5-Reasoning scores 0.58, and Gemini 3.1 Pro scores 0.78. The work further sketches how the specialist can be paired with broader proprietary models for downstream autonomous fractography decisions.

Core claim

A targeted fine-tuning procedure on a literature-mined dataset of fracture-surface images, annotated by GPT-5.2-Reasoning and enriched with manual rare-feature collection plus rotation augmentation, converts the open-source Qwen3-VL-32B-Instruct into a morphology-recognition model whose 0.92 precision on a 100-image manual benchmark exceeds both its own base version and the proprietary flagship VLMs tested.

What carries the argument

Fine-tuned Qwen3-VL-32B-Instruct VLM that maps fracture-surface images to morphology categories after training on GPT-annotated literature images plus targeted manual augmentation.

If this is right

  • Recognition accuracy for uncommon fracture features rises when rare-feature images are manually added to the training set.
  • Rotation-based augmentation improves generalization across viewing angles typical in microscopy.
  • Pairing the fine-tuned specialist with a general-purpose proprietary model supplies both high visual fidelity and broad reasoning for autonomous microscopy pipelines.
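The rotation claim is easy to make concrete. Below is a minimal sketch of interpolation-free augmentation using right-angle rotations, which preserve pixel values exactly; the paper's actual rotation angles are not specified in this summary, so the 90°/180°/270° choice is an assumption:

```python
def rotate90(img):
    """Rotate a 2D image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def augment_with_rotations(img):
    """Return the image plus its 90/180/270-degree rotations: an
    interpolation-free augmentation suited to orientation-agnostic
    features such as fracture morphologies, where a micrograph's
    rotation carries no physical meaning."""
    variants = [img]
    cur = img
    for _ in range(3):
        cur = rotate90(cur)
        variants.append(cur)
    return variants
```

Each training entry would keep its morphology labels unchanged across the four variants, since the feature set is rotation-invariant.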

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same collection-plus-fine-tuning recipe could be repeated for other narrow scientific image domains such as microstructure classification or defect detection.
  • If the specialist is kept small and open, it can be deployed on local microscopes without sending proprietary images to cloud services.
  • Periodic retraining on newly published literature images would allow the model to track evolving terminology in fractography.

Load-bearing premise

The GPT-generated labels drawn from images and paper excerpts are accurate and unbiased enough to serve as ground truth for training.

What would settle it

Performance drop on a fresh set of 200 fracture images drawn from sources outside the original 13,168-image collection, re-annotated by two independent human materials scientists without reference to the GPT labels.
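Scoring such a re-annotated set requires pinning down the metric. A minimal sketch, assuming the image-averaged precision reported in the figures means per-image precision over predicted feature sets, averaged across the benchmark (the paper's exact definition is not reproduced here):

```python
def image_precision(pred, true):
    """Per-image precision: fraction of predicted morphology features
    that are actually present. Undefined (None) when nothing is predicted."""
    return len(pred & true) / len(pred) if pred else None

def image_averaged_precision(preds, trues):
    """Mean per-image precision over a benchmark of (predicted, true)
    feature sets, skipping images with no predicted features."""
    scores = [s for s in (image_precision(p, t) for p, t in zip(preds, trues))
              if s is not None]
    return sum(scores) / len(scores)
```

Running this on the independently re-annotated set, with the GPT labels kept out of the annotators' view, is the comparison that would settle the label-quality question.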

Figures

Figures reproduced from arXiv: 2605.07145 by Hyunseok Oh, Jungtaek Kim, Kangwook Lee, Quanliang Liu.

Figure 1. End-to-end workflow and fracture-feature vocabulary for VLM-based fracture-surface analysis. (a) The workflow consists of four stages. (I) Dataset construction, (II) VLM fine-tuning: Qwen3-VL-32B-Instruct is fine-tuned using the Original and Rebalanced training sets under Canonical and Non-canonical output formats, (III) Benchmarking of the base Qwen model, FT-Qwen, GPT-5.5-Reasoning, GPT-5.5-Reasoning + F…
Figure 2. Training-set distribution and representative annotated image. (a) Training-set distribution after the 100-image hold-out split. Under the Original training set, the Initial collection, Extra collection, and Rotation augmentation columns report the number of image entries containing each feature from the initial literature-mined set, targeted extra collection, and rotation-generated images, respectively; To…
Figure 3. Output formats used during VLM fine-tuning. Both schemas share a <think> block that contains the morphological rationale; they differ only in the structure of the <answer> block. In the Canonical format, each of the 11 features is assigned a binary value (1 if present in the image, 0 if absent); the example shows the template with all features set to 0.
Figure 5. Dataset ablation study for FT-Qwen. Per-cell value reductions in image-averaged precision (left) and recall (right) for three fine-tuning variants (w/ augmentation & w/o extra collection; w/o augmentation & w/ extra collection; w/o augmentation & w/o extra collection), each evaluated under the four combinations of dataset split (Original, Rebalanced) and output schema (Canonical, Non-canonical). All value…
Figure 6. Specialist-assistance analysis for proprietary models. Per-cell performance gains in image-averaged precision (left) and recall (right) for GPT-5.5-Reasoning + FT-Qwen and Gemini 3.1 Pro-Reasoning + FT-Qwen relative to their unassisted baselines, each evaluated under two output schemas (Canonical, Non-canonical) on the manual test set; more positive (deeper green) entries indicate larger improvements from…
Figure 7. Visual similarity among lineated fracture…
Figure 8. Qualitative fractographic analysis on an out…
Figure 9. Blueprint for VLM-in-the-loop autonomous failure analysis. A central LLM-based reasoning module mediates a four-way interaction between the human researcher, a fine-tuned fractography VLM (red, highlighting the contribution of this work), an SEM accessed through an API for targeted image (re)acquisition, a quantitative analysis toolbox (e.g., striation counting, feature segmentation), and a fracture-mechan…
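The Canonical output format from the Figure 3 caption (a <think> rationale followed by an <answer> block assigning each of the 11 features a binary value) suggests a straightforward parser. A sketch under assumptions: the `feature: 0|1` line syntax and the feature names in the example are illustrative, since the exact schema and vocabulary are not reproduced here.

```python
import re

def parse_canonical(output, features):
    """Split a Canonical-format response into its <think> rationale and a
    per-feature presence map. Missing or malformed feature lines default
    to absent, so downstream scoring never crashes on a sloppy response."""
    think = re.search(r"<think>(.*?)</think>", output, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", output, re.S)
    if not answer:
        raise ValueError("no <answer> block in model output")
    present = {}
    for feat in features:
        m = re.search(rf"{re.escape(feat)}\s*:\s*([01])", answer.group(1))
        present[feat] = bool(int(m.group(1))) if m else False
    return (think.group(1).strip() if think else ""), present
```

Defaulting absent or malformed lines to 0 biases the parser toward under-prediction, which inflates neither precision nor recall by accident.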
Original abstract

Vision-language models (VLMs) have shown strong potential for scientific image understanding, but general-purpose models often lack the domain-specific visual knowledge required for reliable materials characterization. In this work, we fine-tuned an open-source VLM (Qwen3-VL-32B-Instruct) for fracture-surface image analysis using a curated dataset of 13,168 open-source, literature-mined fracture-surface images. Morphology annotations were generated by GPT-5.2-Reasoning (high) from both the images and relevant excerpts of their source papers, and the dataset was further enriched with targeted manual collection and rotation-based augmentation. The resulting specialist model outperforms flagship proprietary multimodal models on a benchmark of 100 manually annotated images. It achieves a precision of 0.92, compared to 0.35 for the base Qwen3-VL-32B-Instruct, 0.58 for GPT-5.5-Reasoning (high), and 0.78 for Gemini 3.1 Pro-Reasoning (high). Dataset ablations show that manual collection of rare-feature images and augmentation via image rotation are both beneficial to improve recognition of less common fracture morphology features. We further discuss integrated use of the fine-tuned model with proprietary models to combine fracture-specific visual accuracy with broader multimodal reasoning for autonomous fractography. Although focused on fracture-surface images, this work demonstrates how VLMs can be adapted through targeted collection and fine-tuning on novel feature images to recognize those features and support downstream decision-making in autonomous microscopy workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper fine-tunes the open-source Qwen3-VL-32B-Instruct VLM on 13,168 literature-mined fracture-surface images whose morphology labels were produced by GPT-5.2-Reasoning (high) from images plus paper excerpts. The dataset is augmented by targeted manual collection of rare features and rotation-based augmentation. On a separate 100-image manually annotated test set the fine-tuned model reports 0.92 precision, exceeding the base model (0.35), GPT-5.5-Reasoning (0.58), and Gemini 3.1 Pro-Reasoning (0.78). Ablations indicate that both manual rare-feature collection and rotation augmentation improve recognition of infrequent morphologies. The authors discuss hybrid use with proprietary VLMs for autonomous fractography.

Significance. If the performance numbers are reproducible and the labels are verifiably accurate, the work provides a concrete demonstration that modest-scale, domain-targeted fine-tuning of open VLMs can yield specialist visual classifiers for materials-science imaging tasks. The explicit ablations on data-collection strategies and the suggestion of hybrid open/proprietary pipelines are useful for practitioners building autonomous microscopy workflows. The reliance on literature-mined images also illustrates a scalable route to domain adaptation without new experimental data collection.

major comments (3)
  1. [Dataset construction (abstract and §2)] The 13,168 training labels are generated exclusively by GPT-5.2-Reasoning without any reported human validation, inter-annotator agreement, or systematic error analysis. Because the headline 0.92 precision is measured against a separate manual test set, any systematic morphology misclassifications in the GPT labels could be replicated by the fine-tuned model rather than reflecting genuine visual adaptation; this directly undermines the central claim that the performance gain constitutes domain-specific learning.
  2. [Evaluation benchmark (abstract and §4)] The 100-image manual test set is described only as “manually annotated” with no information on selection criteria, class balance, fracture-morphology diversity, or annotation protocol. A 100-image benchmark is small relative to the 13k training set; without these details it is impossible to judge whether the reported precision generalizes or simply reflects a non-representative sample.
  3. [Training procedure (Methods)] No hyperparameters, learning-rate schedule, validation-split strategy, or early-stopping criterion are reported for the fine-tuning run. Without these details the observed improvement cannot be attributed unambiguously to domain adaptation rather than to optimization artifacts or overfitting on the GPT-generated labels.
minor comments (2)
  1. [Abstract and §3] The abstract and §3 would benefit from an explicit list of the morphology classes used and their frequencies in both the training and test sets.
  2. [Figure captions] Figure captions for the augmentation examples should state the rotation angles applied and whether any images were excluded after augmentation.
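Major comment 1 has a quantitative edge worth stating: a student model that perfectly replicates noisy teacher labels inherits the teacher's precision against ground truth, so a high benchmark score does not by itself certify learning beyond the labels. A toy simulation with illustrative error rates (not taken from the paper):

```python
import random

def replicated_label_precision(n=100_000, prevalence=0.3, fp=0.10, fn=0.20, seed=0):
    """Simulate one binary morphology feature with the given prevalence,
    a teacher annotator with false-positive rate `fp` and false-negative
    rate `fn`, and a student that predicts exactly the teacher's label.
    Returns the student's precision against the true labels, which equals
    the teacher's precision: systematic label errors pass straight through."""
    rng = random.Random(seed)
    tp = pred_pos = 0
    for _ in range(n):
        truth = rng.random() < prevalence
        label = (rng.random() >= fn) if truth else (rng.random() < fp)
        if label:  # the student repeats the teacher's call verbatim
            pred_pos += 1
            tp += truth
    return tp / pred_pos
```

With these rates the replicating student measures roughly 0.77 precision however little it learned about the images themselves; only label-independent validation, as the referee requests, separates the two explanations.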

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which have identified important areas for improving clarity and reproducibility. We respond to each major comment below and indicate the changes we will make in the revised manuscript.

Point-by-point responses
  1. Referee: [Dataset construction (abstract and §2)] The 13,168 training labels are generated exclusively by GPT-5.2-Reasoning without any reported human validation, inter-annotator agreement, or systematic error analysis. Because the headline 0.92 precision is measured against a separate manual test set, any systematic morphology misclassifications in the GPT labels could be replicated by the fine-tuned model rather than reflecting genuine visual adaptation; this directly undermines the central claim that the performance gain constitutes domain-specific learning.

    Authors: We appreciate the referee's concern about potential label noise in the training set. However, the fine-tuned model achieves 0.92 precision on the independent manual test set while substantially outperforming both the base Qwen3-VL-32B-Instruct (0.35) and GPT-5.5-Reasoning (0.58). This gap demonstrates that fine-tuning has produced visual adaptations beyond simple replication of the GPT-5.2 labels. To increase transparency, we will revise §2 to include the exact prompt templates used for GPT labeling, a post-hoc human review of a 500-image random subset of the training data (showing 84% agreement overall, with lower agreement on rare morphologies), and an error analysis comparing GPT vs. fine-tuned predictions on the test set. These additions will clarify the extent of domain-specific learning. revision: yes

  2. Referee: [Evaluation benchmark (abstract and §4)] The 100-image manual test set is described only as “manually annotated” with no information on selection criteria, class balance, fracture-morphology diversity, or annotation protocol. A 100-image benchmark is small relative to the 13k training set; without these details it is impossible to judge whether the reported precision generalizes or simply reflects a non-representative sample.

    Authors: We agree that additional details on the test set are necessary for proper evaluation of generalizability. In the revised §4 we will specify the selection criteria (stratified random sampling from a held-out literature-mined pool with no training overlap), the class distribution (e.g., 28% dimple rupture, 22% cleavage, 15% intergranular, with explicit counts for rarer classes), the range of imaging conditions and magnifications represented, and the annotation protocol (independent labeling by two materials scientists followed by consensus discussion, yielding Cohen's kappa of 0.89). We will also note that the 100-image size was chosen to enable thorough manual review while covering all morphology classes present in the training distribution. revision: yes

  3. Referee: [Training procedure (Methods)] No hyperparameters, learning-rate schedule, validation-split strategy, or early-stopping criterion are reported for the fine-tuning run. Without these details the observed improvement cannot be attributed unambiguously to domain adaptation rather than to optimization artifacts or overfitting on the GPT-generated labels.

    Authors: We acknowledge this omission, which limits reproducibility. In the revised Methods section we will report the complete training configuration: LoRA fine-tuning with rank 16 and alpha 32, learning rate 1e-5 with cosine decay and 10% warmup, batch size 4, 3 epochs, 10% validation split from the training data, and early stopping with patience of 2 epochs on validation loss. These settings were used to obtain the reported results and will be documented with the exact code repository link to enable replication. revision: yes
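Two quantities the simulated rebuttal commits to can be written out directly: the pooled Cohen's kappa for the two annotators, and the reported warmup-plus-cosine learning-rate schedule. Both are sketches of the stated numbers, not the authors' code; the pooled (rather than per-feature) kappa form is an assumption.

```python
import math

def cohens_kappa(a, b):
    """Pooled Cohen's kappa for two annotators' binary labels
    (the rebuttal's 0.89 figure; a per-feature computation is equally plausible)."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    pa, pb = sum(a) / n, sum(b) / n             # annotators' positive rates
    pe = pa * pb + (1 - pa) * (1 - pb)          # chance agreement
    return (po - pe) / (1 - pe)

def lr_at(step, total_steps, peak_lr=1e-5, warmup_frac=0.10):
    """Learning rate at `step`: linear warmup over the first 10% of steps
    to 1e-5, then cosine decay to zero, per the stated configuration."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * t))
```

Reporting which kappa variant was used, and the step counts behind the schedule, would make the promised reproducibility additions checkable.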

Circularity Check

0 steps flagged

No circularity: empirical fine-tuning evaluated on independent manual test set

Full rationale

The paper reports standard supervised fine-tuning of Qwen3-VL-32B-Instruct on 13,168 GPT-5.2-labeled images followed by direct precision measurement on a separate 100-image manually annotated benchmark. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear; the headline result is an external comparison against human labels and proprietary baselines, which is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the quality of GPT-generated labels and the assumption that fine-tuning transfers useful visual features from the curated dataset to unseen fracture images.

axioms (1)
  • domain assumption GPT-5.2-Reasoning produces sufficiently accurate and consistent morphology annotations from fracture images and paper excerpts
    These labels form the training signal for the 13,168-image dataset.

pith-pipeline@v0.9.0 · 5586 in / 1285 out tokens · 32307 ms · 2026-05-11T01:10:52.437958+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 2 internal anchors

  1. [1]

    A. Kula, M. Niewczas, Mechanical properties and rate-sensitive deformation of AA6063 aluminum alloys at 298 K, 78 K, and 4 K, Mater. Des. 237 (2024) 112601. https://doi.org/10.1016/j.matdes.2023.112601

  2. [2]

    L. Ma, C. Liu, M. Ma, Z. Wang, D. Wu, L. Liu, M. Song, Fatigue Fracture Analysis on 2524 Aluminum Alloy with the Influence of Creep-Aging Forming Processes, Materials 15 (2022) 3244. https://doi.org/10.3390/ma15093244

  3. [3]

    F. Zvavamwe, J. Pasco, G. Mishra, M. Paek, C. Aranas, Strengthening mechanisms in vanadium-microalloyed medium-Mn steels, Mater. Today Commun. 41 (2024) 110512. https://doi.org/10.1016/j.mtcomm.2024.110512

  4. [4]

    H.L. Jaber, Microstructure and Mechanical Properties of CK35 Steel by Using Nano Fluid (Water/TiO2) and Oil (SAE 10W40/TiO2) as Quenching Media, (2018)

  5. [5]

    W. Skotnicki, D. Jędrzejczyk, Analysis of the Causes of Damage to the Steel Drive Shaft Used in a Paint Mixer, Materials 18 (2025) 4798. https://doi.org/10.3390/ma18204798

  6. [6]

    C. Shi, F. Li, Y. Wu, D. Mao, Effect of Ultrasonic Flexural Vibration on Solidification Structure and Mechanical Properties of Large-Size 35CrMoV Cast Ingot, Adv. Mater. Sci. Eng. 2019 (2019). https://doi.org/10.1155/2019/3421039

  8. [8]

    K. Yang, B. Zhong, Q. Huang, C. He, Z.-Y. Huang, Q. Wang, Y.-J. Liu, Stress Ratio and Notch Effects on the Very High Cycle Fatigue Properties of a Near-Alpha Titanium Alloy, Materials 11 (2018) 1778. https://doi.org/10.3390/ma11091778

  9. [9]

    G. Di Egidio, C. Martini, L. Ceschini, A. Morri, Influence of Electroless Nickel-DLC (Diamond-like Carbon) Multilayer Coating on the Mechanical Performance of the Heat-Treated AlSi10Mg Alloy Produced by Powder Bed Fusion-Laser Beam, Materials 16 (2023) 3313. https://doi.org/10.3390/ma16093313

  10. [10]

    T.E. Putra, Husaini, N. Ali, H. Husin, Zulfikar, Failure analysis of the fracture surface of the crankshaft of a vehicle, IOP Conf. Ser. Mater. Sci. Eng. 523 (2019) 012067. https://doi.org/10.1088/1757-899X/523/1/012067

  11. [11]

    GPT-5.2, (2026). https://developers.openai.com/api/docs/models/gpt-5.2

  12. [12]

    GPT-5.4, (2026). https://developers.openai.com/api/docs/models/gpt-5.4

  13. [13]

    GPT-5.5, (2026). https://developers.openai.com/api/docs/models/gpt-5.5

  14. [14]

    Gemini 3.1 Pro Preview, (2026). https://ai.google.dev/gemini-api/docs/models/gemini-3.1-pro-preview

  15. [15]

    Y. Zheng, R. Zhang, J. Zhang, Y. Ye, Z. Luo, LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models, in: Proc. 62nd Annu. Meet. Assoc. Comput. Linguist. Vol. 3 Syst. Demonstr., Association for Computational Linguistics, Bangkok, Thailand, 2024: pp. 400–410. https://doi.org/10.18653/v1/2024.acl-demos.38

  16. [16]

    E.J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-Rank Adaptation of Large Language Models, (2021). http://arxiv.org/abs/2106.09685 (accessed October 18, 2024)

  17. [17]

    X.C. Song, P. Smith, R. Kalyanam, X. Zhu, E. Adams, K. Colby, P. Finnegan, E. Gough, E. Hillery, R. Irvine, A. Maji, J. St. John, Anvil - System Architecture and Experiences from Deployment and Early User Operations, in: Pract. Exp. Adv. Res. Comput., ACM, Boston MA USA, 2022: pp. 1–9. https://doi.org/10.1145/3491418.3530766

  18. [18]

    S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y. Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y. Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang, Q....