pith. sign in

arxiv: 2606.03401 · v1 · pith:HN2OWY7Wnew · submitted 2026-06-02 · 💻 cs.CV

Towards Characterizing Scientific Image Utility and Upgradability

Pith reviewed 2026-06-28 11:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords scientific image evaluationmultimodal modelserror detectionimage corruption taxonomycorrection feasibilityAI-generated contentscientific validitybenchmark dataset
0
0 comments X

The pith

Current multimodal systems cannot reliably detect scientific errors in images or generate faithful corrections, revealing a gap between visual perception and scientific validity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the SIU²A framework to evaluate scientific images on two axes: utility, which measures the ability to detect errors and judge whether they can be repaired, and upgradability, which measures whether a correction actually restores scientific accuracy without damaging correct parts. It defines four corruption types—Detail Distortion, Incompleteness, False Content, and Entity Confusion—and releases an expert-annotated benchmark that tests both detection and repair. Experiments on this benchmark show that existing multimodal models perform poorly on both error identification and faithful correction. This matters because scientific images function as primary evidence in research, and undetected or poorly fixed errors can propagate false findings. The work therefore supplies a concrete way to quantify how far current AI systems remain from handling scientific visual data in a trustworthy manner.

Core claim

The central claim is that the SIU²A framework, built on a four-category taxonomy of scientific image corruptions and an expert-annotated benchmark, exposes clear limitations in current multimodal systems: they fail to detect scientific inaccuracies and fail to produce corrections that preserve scientific validity, demonstrating a separation between general visual perception and domain-specific scientific usability.

What carries the argument

The SIU²A framework, which splits evaluation into a Utility stage (error detection plus repair-instruction generation) and an Upgradability stage (whether the resulting correction restores validity without altering accurate information), applied to the four corruption categories on the expert-annotated SIU²A-Benchmark.

If this is right

  • Perceptual quality metrics do not track scientific validity, so new evaluation methods are required.
  • Multimodal models require domain-specific verification capabilities to handle scientific images.
  • Faithful correction must preserve accurate information while repairing errors.
  • The benchmark provides a standardized test for measuring progress on scientific image tasks.
  • Current systems exhibit a measurable gap between visual perception and scientific usability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be applied to evaluate AI assistance in figure preparation for scientific papers.
  • Training data that includes explicit scientific validity labels might close the observed gap.
  • Similar taxonomies and benchmarks could be developed for other scientific modalities such as diagrams or plots.
  • Automated tools built on this approach might eventually flag questionable figures during peer review.

Load-bearing premise

The four corruption categories form a complete taxonomy of scientific image issues, and expert annotations on the benchmark reliably capture scientific validity.

What would settle it

A multimodal model that scores near ceiling on the SIU²A-Benchmark error-detection and repair tasks yet still produces scientifically invalid corrected images when applied to real research figures would falsify the claim that the benchmark measures scientific usability.

Figures

Figures reproduced from arXiv: 2606.03401 by Chunyi Li, Farong Wen, Guangtao Zhai, Junying Wang, Liang Chen, Qihang Yan, Wenzhe Li, Yijin Guo, Zicheng Zhang.

Figure 1
Figure 1. Figure 1: Visual plausibility is not scientific validity. Traditional assessors (LLM/S-IQA) are fooled [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The SIU2A framework for scientific image assessment: (a) utility via error detection and correction feasibility, (b) upgradability through a diagnosis-to-correction pipeline, and (c) comparative model performance. 3.1 SIU2A Definition Scientific Images Failure Summary We summarize the common failure modes in scientific images into four structurally distinct categories: (i) Detail Distortion, where low-leve… view at source ↗
Figure 3
Figure 3. Figure 3: Pipeline for constructing the SIU2A-Benchmark dataset, including high-quality scientific image filtering, controlled degradation generation, and expert annotation. 3.2 SIU2A-Benchmark Construction. Base Image Collection To support the above formulation, we construct SIU2A-Benchmark, a dataset that jointly evaluates diagnosis, instruction generation, and editing under controlled scientific corruptions as sh… view at source ↗
Figure 4
Figure 4. Figure 4: Overview of the SIU2A data structure. Each instance contains a ground-truth image, a corrupted image, error detection and correction feasibility labels, structured error descriptions, correction instructions, and a corresponding scientific QA pair. disentangles functional correctness (task completion) from semantic faithfulness (scientific validity preservation), enabling a comprehensive evaluation of both… view at source ↗
Figure 5
Figure 5. Figure 5: Ablation study on the impact of error description quality on correction performance. We [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Ablation study on upgradability: com￾paring ground-truth versus predicted correction instructions to assess their impact on editing per￾formance. Upgradability Dependence on Correction In￾struction Quality for Advanced Model The ablation results in [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Our custom annotation interface. The tool presents the scientific figure to the expert [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Utility 1 — Error Detection. The figure displays the input images alongside the ground￾truth expert annotations and the outputs from eleven diagnostic VLMs. Each model’s prediction is encoded with a color bar (green for Detect: YES, red for Detect: NO), allowing for immediate assessment of detection accuracy against the known ground truth. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Utility 2 — Correction Feasiblity (page 1 of 2). Results for the first five diagnostic models. The full-width gold EXPERT row reproduces the human GT-Instruction as a reference. A predicted no-error chip replaces a missing instruction when a model declined to flag an error. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Utility 2 — Correction Feasiblity (page 2 of 2). Results for the remaining six diagnostic models, with identical layout and semantics as [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Upgradability — Image Restoration (page 1 of 2). Results for the first five image editing models. The PROMPT band surfaces both prompts side-by-side. The OmniGen-2 Pred-cond cell carries an instruction too long tag, indicating its sensitivity to instruction length. H Additional Ablations This section presents two key ablation studies that dissect the performance of our SIU2A framework. The first study iso… view at source ↗
Figure 12
Figure 12. Figure 12: Upgradability — Image Restoration (page 2 of 2). Results for the remaining four image editing models, with identical layout and semantics as [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗
read the original abstract

Scientific images function as critical evidence in research communication, yet their integrity faces unprecedented threats from AI-generated content that introduces subtle but consequential errors. Existing evaluation paradigms prove inadequate: perceptual quality metrics poorly correlate with scientific validity, while language models lack domain-specific verification capabilities. To address this gap, we propose the \textbf{S}cientific \textbf{I}mage \textbf{U}tility and \textbf{U}pgradability \textbf{A}ssessment (\textbf{SIU$^2$A}) framework, which introduces two complementary dimensions for scientific image evaluation. \textbf{Utility} encompasses \textit{error detection} (identifying scientific inaccuracies) and \textit{correction feasibility} (assessing whether errors can be reliably repaired). \textbf{Upgradability} measures the quality of correction. We categorize scientific image corruption into four fundamental types: Detail Distortion, Incompleteness, False Content, and Entity Confusion. Based on this taxonomy, we construct SIU$^2$A-Benchmark, a dataset with expert annotations for error identification and repair. The framework implements a two-stage evaluation protocol: the \textit{Utility} stage evaluates error detection capability and repair instruction generation, while the \textit{Upgradability} stage assesses whether corrections faithfully restore scientific validity without compromising existing accurate information. Experiments reveal that current multimodal systems exhibit significant limitations in both scientific error assessment and faithful correction, exposing a fundamental gap between visual perception and scientific usability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes the SIU²A framework to evaluate scientific images along utility (error detection and correction feasibility) and upgradability (correction quality) dimensions. It introduces a four-category taxonomy of corruptions (Detail Distortion, Incompleteness, False Content, Entity Confusion), constructs the SIU²A-Benchmark with expert annotations, and describes a two-stage evaluation protocol. Experiments are claimed to show that current multimodal systems have significant limitations in scientific error assessment and faithful correction, revealing a fundamental gap between visual perception and scientific usability.

Significance. If the taxonomy, benchmark, and experimental results hold after validation, the work could establish a domain-specific evaluation paradigm for AI handling of scientific imagery that goes beyond perceptual metrics, potentially guiding development of more reliable multimodal systems for research communication.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'experiments reveal that current multimodal systems exhibit significant limitations... exposing a fundamental gap' is unsupported because the abstract (and by extension the manuscript) supplies no quantitative results, error metrics, dataset statistics, validation procedures, or inter-rater reliability scores for the expert annotations. This directly undermines the load-bearing experimental evidence for the claimed gap.
  2. [Abstract] Abstract (taxonomy and benchmark construction): The four corruption categories are asserted to be 'fundamental' and used to build SIU²A-Benchmark with expert annotations for error identification and repair, yet no coverage analysis, overlap assessment, or validation that the taxonomy is complete (e.g., versus calibration artifacts or modality-specific noise) is provided. Without this, performance gaps on the benchmark may reflect construction artifacts rather than inherent system limitations.
minor comments (1)
  1. [Abstract] The abstract introduces the acronym SIU²A but does not expand it on first use in a manner consistent with standard academic style.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline revisions to strengthen the presentation of quantitative evidence and taxonomy validation.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'experiments reveal that current multimodal systems exhibit significant limitations... exposing a fundamental gap' is unsupported because the abstract (and by extension the manuscript) supplies no quantitative results, error metrics, dataset statistics, validation procedures, or inter-rater reliability scores for the expert annotations. This directly undermines the load-bearing experimental evidence for the claimed gap.

    Authors: We agree the abstract is high-level and omits specific metrics. The full manuscript's Experiments section reports quantitative results (system error rates on detection and correction tasks, SIU²A-Benchmark statistics, and inter-rater reliability scores such as Cohen's kappa for annotations). To address the concern directly, we will revise the abstract to incorporate key quantitative highlights supporting the claimed gap. revision: yes

  2. Referee: [Abstract] Abstract (taxonomy and benchmark construction): The four corruption categories are asserted to be 'fundamental' and used to build SIU²A-Benchmark with expert annotations for error identification and repair, yet no coverage analysis, overlap assessment, or validation that the taxonomy is complete (e.g., versus calibration artifacts or modality-specific noise) is provided. Without this, performance gaps on the benchmark may reflect construction artifacts rather than inherent system limitations.

    Authors: The taxonomy was developed via expert consultation on prevalent scientific image issues. We acknowledge the absence of explicit coverage analysis and completeness validation in the current draft. We will add a dedicated subsection describing taxonomy construction, domain coverage, category overlap assessment, and checks against additional corruption types (e.g., calibration artifacts) to confirm the benchmark reflects genuine system limitations rather than construction artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: framework and benchmark proposal is self-contained

full rationale

The paper introduces the SIU²A framework and SIU²A-Benchmark as a new taxonomy-based evaluation approach for scientific images. No equations, fitted parameters, or predictions appear that reduce to inputs by construction. The four corruption categories are presented as a proposed taxonomy rather than derived from prior results or self-citations. Experiments evaluate multimodal systems on the new benchmark without any self-referential fitting or renaming of known results. This matches the default expectation of a non-circular proposal paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 3 invented entities

The paper introduces a new evaluation framework and associated concepts; the ledger captures the domain assumption motivating the work and the newly postulated framework elements.

axioms (1)
  • domain assumption Existing evaluation paradigms prove inadequate: perceptual quality metrics poorly correlate with scientific validity, while language models lack domain-specific verification capabilities.
    Directly stated in the abstract as the basis for proposing the new framework.
invented entities (3)
  • SIU²A framework no independent evidence
    purpose: Evaluate scientific image utility (error detection and correction feasibility) and upgradability (correction quality)
    Newly proposed in the paper.
  • Four corruption types (Detail Distortion, Incompleteness, False Content, Entity Confusion) no independent evidence
    purpose: Categorize scientific image corruptions for systematic evaluation
    Introduced as fundamental types in the paper.
  • SIU²A-Benchmark no independent evidence
    purpose: Dataset with expert annotations for testing error identification and repair
    Constructed based on the new taxonomy in this work.

pith-pipeline@v0.9.1-grok · 5812 in / 1456 out tokens · 48406 ms · 2026-06-28T11:08:26.927679+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

45 extracted references · 20 canonical work pages · 9 internal anchors

  1. [1]

    Introducing claude opus 4.6, Feb

    Anthropic. Introducing claude opus 4.6, Feb. 2026. URL https://www.anthropic.com/ news/claude-opus-4-6. Accessed: 2026-04-23

  2. [2]

    Bosse, D

    S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek. Deep neural networks for no- reference and full-reference image quality assessment.IEEE Transactions on Image Processing, 27(1):206–219, 2018. doi: 10.1109/TIP.2017.2760518

  3. [3]

    Brooks, A

    T. Brooks, A. Holynski, and A. A. Efros. Instructpix2pix: Learning to follow image editing instructions.arXiv preprint arXiv:2211.09800, 2022

  4. [4]

    ByteDance.Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity, Feb. 2026. URL https://lf3-static.bytednsdoc.com/obj/eden-cn/lapzild-tss/ ljhwZthlaukjlkulzlp/seed2/0214/Seed2.0%20Model%20Card.pdf. Official model card. Accessed: 2026-04-23

  5. [5]

    H. Cao, Z. Liu, X. Lu, Y . Yao, and Y . Li. Instructmol: Multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. InProceedings of the 31st International Conference on Computational Linguistics, pages 354–379, 2025

  6. [6]

    Gemini 3 pro image generation model

    Google. Gemini 3 pro image generation model. https://aistudio.google.com/models/ gemini-3-pro-image, 2026. Accessed: 2026-04-30

  7. [7]

    Gemini 2.5 flash image, 2025

    Google DeepMind. Gemini 2.5 flash image, 2025. URL https://ai.google.dev/ gemini-api/docs/models/gemini. Accessed: 2026-04-23

  8. [8]

    Gemini 3.1 pro preview, 2026

    Google DeepMind. Gemini 3.1 pro preview, 2026. URL https://ai.google.dev/ gemini-api/docs/models/gemini-3.1-pro-preview. Accessed: 2026-04-23

  9. [9]

    B. F. Labs. FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025

  10. [10]

    J. Li, D. Zhang, X. Wang, Z. Hao, J. Lei, Q. Tan, C. Zhou, W. Liu, Y . Yang, X. Xiong, et al. Chemvlm: Exploring the power of multimodal large language models in chemistry area. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 415–423, 2025

  11. [11]

    W. Li, L. Chen, J. Wang, Y . Guo, Y . Shen, F. Wen, C. Li, Z. Zhang, and G. Zhai. Siqa: Toward reliable scientific image quality assessment.arXiv preprint arXiv:2603.06700, 2026

  12. [12]

    M. Liu, Z. Fan, Z. Wang, L. Gu, Z. Zhu, Y . He, Y . Yang, C. Tian, X. Zhao, N. Liao, et al. Grade: Benchmarking discipline-informed reasoning in image editing.arXiv preprint arXiv:2603.12264, 2026

  13. [13]

    S. Liu, Y . Han, P. Xing, F. Yin, R. Wang, W. Cheng, J. Liao, Y . Wang, H. Fu, C. Han, G. Li, Y . Peng, Q. Sun, J. Wu, Y . Cai, Z. Ge, R. Ming, L. Xia, X. Zeng, Y . Zhu, B. Jiao, X. Zhang, G. Yu, and D. Jiang. Step1x-edit: A practical framework for general image editing.arXiv preprint arXiv:2504.17761, 2025

  14. [14]

    P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022

  15. [15]

    T. Lv, Y . Huang, J. Chen, Y . Zhao, Y . Jia, L. Cui, S. Ma, Y . Chang, S. Huang, W. Wang, et al. Kosmos-2.5: A multimodal literate model.arXiv preprint arXiv:2309.11419, 2023

  16. [16]

    Llama 3.2 model cards and prompt formats

    Meta. Llama 3.2 model cards and prompt formats. https://www.llama.com/docs/ model-cards-and-prompt-formats/llama3$_$2/, 2025

  17. [17]

    Introducing gpt-image-1.5, Dec

    OpenAI. Introducing gpt-image-1.5, Dec. 2025. URL https://openai.com/zh-Hans-CN/ index/new-chatgpt-images-is-here/. Accessed: 2026-04-23

  18. [18]

    Gpt-image-2: A multimodal image generation model

    OpenAI. Gpt-image-2: A multimodal image generation model. https://openai.com, 2025. Proprietary model, accessed 2026. 10

  19. [19]

    Introducing gpt-5.4, 2026

    OpenAI. Introducing gpt-5.4, 2026. URL https://openai.com/zh-Hans-CN/index/ introducing-gpt-5-4/. Accessed: 2026-04-23

  20. [20]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5

  21. [21]

    Qwen3.6-Plus: Towards real world agents, April 2026

    Qwen Team. Qwen3.6-Plus: Towards real world agents, April 2026. URL https://qwen.ai/ blog?id=qwen3.6

  22. [22]

    Rombach, A

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models, 2021

  23. [23]

    S. Su, Q. Yan, Y . Zhu, C. Zhang, X. Ge, J. Sun, and Y . Zhang. Blindly assess image quality in the wild guided by a self-adaptive hyper network. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  24. [24]

    H. Tao, C. Huang, N. Wang, H. Lyu, L. Zhang, G. Ke, and X. Fang. Omniscience: A large-scale multi-modal dataset for scientific image understanding.arXiv preprint arXiv:2602.13758, 2026

  25. [25]

    Galactica: A Large Language Model for Science

    R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V . Kerkez, and R. Stojnic. Galactica: A large language model for science.arXiv preprint arXiv:2211.09085, 2022

  26. [26]

    K. Team, T. Bai, Y . Bai, Y . Bao, S. Cai, Y . Cao, Y . Charles, H. Che, C. Chen, G. Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

  27. [27]

    V . Team, W. Hong, W. Yu, X. Gu, G. Wang, G. Gan, H. Tang, J. Cheng, J. Qi, J. Ji, L. Pan, S. Duan, W. Wang, Y . Wang, Y . Cheng, Z. He, Z. Su, Z. Yang, Z. Pan, A. Zeng, B. Wang, B. Chen, B. Shi, C. Pang, C. Zhang, D. Yin, F. Yang, G. Chen, J. Xu, J. Zhu, J. Chen, J. Chen, J. Chen, J. Lin, J. Wang, J. Chen, L. Lei, L. Gong, L. Pan, M. Liu, M. Xu, M. Zhang...

  28. [28]

    J. Wang, J. Wang, H. Duan, J. Kang, G. Zhai, and X. Min. I2i-bench: A comprehensive benchmark suite for image-to-image editing models.arXiv preprint arXiv:2512.04660, 2025

  29. [29]

    W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  30. [30]

    X. Wang, Z. Hu, P. Lu, Y . Zhu, J. Zhang, S. Subramaniam, A. R. Loomba, S. Zhang, Y . Sun, and W. Wang. SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models. InProceedings of the Forty-First International Conference on Machine Learning, 2024

  31. [31]

    Z. Wang, P. Yin, X. Zhao, C. Tian, Y . Qiao, W. Wang, J. Dai, and G. Luo. Genexam: A multidisciplinary text-to-image exam.arXiv preprint arXiv:2509.14232, 2025

  32. [32]

    H. Wei, H. Liu, Z. Wang, Y . Peng, B. Xu, S. Wu, X. Zhang, X. He, Z. Liu, P. Wang, X. Song, Y . Li, Y . Liu, and Y . Zhou. Skywork unipic 3.0: Unified multi-image composition via sequence modeling, 2026. URLhttps://arxiv.org/abs/2601.15664

  33. [33]

    Y . Weng, M. Zhu, Q. Xie, Q. Sun, Z. Lin, S. Liu, and Y . Zhang. Deepscientist: Advancing frontier-pushing scientific findings progressively. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=cZFgsLq8Gs

  34. [34]

    C. Wu, J. Li, J. Zhou, J. Lin, K. Gao, K. Yan, S. ming Yin, S. Bai, X. Xu, Y . Chen, Y . Chen, Z. Tang, Z. Zhang, Z. Wang, A. Yang, B. Yu, C. Cheng, D. Liu, D. Li, H. Zhang, H. Meng, H. Wei, J. Ni, K. Chen, K. Cao, L. Peng, L. Qu, M. Wu, P. Wang, S. Yu, T. Wen, W. Feng, X. Xu, Y . Wang, Y . Zhang, Y . Zhu, Y . Wu, Y . Cai, and Z. Liu. Qwen-image technical report,

  35. [35]

    URLhttps://arxiv.org/abs/2508.02324. 11

  36. [36]

    C. Wu, P. Zheng, R. Yan, S. Xiao, X. Luo, Y . Wang, W. Li, X. Jiang, Y . Liu, J. Zhou, Z. Liu, Z. Xia, C. Li, H. Deng, J. Wang, K. Luo, B. Zhang, D. Lian, X. Wang, Z. Wang, T. Huang, and Z. Liu. Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871, 2025

  37. [37]

    H. Wu, Z. Zhang, W. Zhang, C. Chen, L. Liao, C. Li, Y . Gao, A. Wang, E. Zhang, W. Sun, et al. Q-align: Teaching lmms for visual scoring via discrete text-defined levels.arXiv preprint arXiv:2312.17090, 2023

  38. [38]

    Grok-4.20-0309-reasoning

    xAI. Grok-4.20-0309-reasoning. https://docs.x.ai/developers/models/grok-4. 20-0309-reasoning, 2026. Accessed: 2026-04-24

  39. [39]

    Z. Xi, G. Li, Y . Fan, H. Guo, Y . Liu, X. Fan, J. Liu, J. Ding, W. Zuo, Z. Yin, L. Bai, T. Ji, T. Gui, Q. Zhang, and X. Huang. Bmmr: A large-scale bilingual multimodal multi-discipline reasoning dataset, 2025. URLhttps://arxiv.org/abs/2507.03483

  40. [40]

    Z. Xu, H. Duan, B. Liu, G. Ma, J. Wang, L. Yang, S. Gao, X. Wang, J. Wang, X. Min, et al. Lmm4edit: Benchmarking and evaluating multimodal image editing with lmms. InProceedings of the 33rd ACM International Conference on Multimedia, pages 6908–6917, 2025

  41. [41]

    Zhang, H

    Z. Zhang, H. Wu, C. Li, Y . Zhou, W. Sun, X. Min, Z. Chen, X. Liu, W. Lin, and G. Zhai. A- bench: Are lmms masters at evaluating ai-generated images?arXiv preprint arXiv:2406.03070, 2024

  42. [42]

    Zhang, T

    Z. Zhang, T. Kou, S. Wang, C. Li, W. Sun, W. Wang, X. Li, Z. Wang, X. Cao, X. Min, et al. Q-eval-100k: Evaluating visual quality and alignment level for text-to-vision content. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10621–10631, 2025

  43. [43]

    Zhang, J

    Z. Zhang, J. Wang, F. Wen, Y . Guo, et al. Large multimodal models evaluation: A survey. SCIENCE CHINA Information Sciences, 68(12):221301–221369, 2025. doi: https://doi.org/10. 1007/s11432-025-4676-4

  44. [44]

    found": true,

    Z. Zhao, D. Ma, L. Chen, L. Sun, Z. Li, Y . Xia, B. Chen, H. Xu, Z. Zhu, S. Zhu, et al. Chemdfm: a large language foundation model for chemistry.arXiv preprint arXiv:2401.14818, 2024. A Limitations This work introduces a novel evaluation framework (SIU2A), formulates a new task, and constructs a corresponding benchmark dataset. However, it does not propos...

  45. [45]

    Proteasomal degradation

    Add an arrow originating from the polyubiquitinated Ino80 (the Ino80 molecule with the chain of 4 Ub moieties attached at the top left of the figure) pointing to a new label that reads 'Proteasomal degradation of ubiquitinated Ino80' to include the missing degradation step. 2. In the replication fork region, add the text label 'H2Aub' adjacent to each Ub ...