Faithful, Enriched, and Precise: Benchmarking Natural-Science Illustration Generation by T2I models

Jianwen Sun; Jiaxin Ai; Kaipeng Zhang; Liangliang Zhao; Minghao Liu; Siqi Luo; Yifan Chang; Yihao Liu; Yuandong Pu; Yuchen Ren

arxiv: 2606.05949 · v2 · pith:TWKK6UN6new · submitted 2026-06-04 · 💻 cs.CV

Yifan Chang, Jiaxin Ai, Jianwen Sun, Yuandong Pu, Siqi Luo, Liangliang Zhao, Yuchen Ren, Minghao Liu, Yunfei Yu, Yu Qiao, Kaipeng Zhang, Yihao Liu

Pith reviewed 2026-06-28 02:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords scientific illustration · text-to-image models · benchmark · instruction faithfulness · reasoning enrichment · semantic precision · atom set annotation

The pith

A new benchmark reveals that even leading closed-source text-to-image models still fail at accurate text rendering and balanced reasoning in scientific diagrams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FEPBench, a benchmark of high-quality natural-science illustrations annotated at the level of individual visual, textual, relational, and layout atoms. It measures text-to-image models on three axes: how faithfully they follow input instructions, how much they enrich outputs through scientific reasoning, and how precisely they maintain semantic accuracy without excess or omission. Evaluation of current models shows persistent shortfalls in text rendering, limited enrichment, and difficulty trading off richness against precision, even among the strongest closed-source systems.

Core claim

Even state-of-the-art closed-source models such as GPT Image 2 and Nano Banana Pro still suffer from text-rendering bottlenecks, limited reasoning enrichment, and difficulty balancing generation richness with precision when producing scientific illustrations.

What carries the argument

FEPBench benchmark with atom-set annotations that decompose outputs into visual, textual, relation, and layout elements for scoring instruction faithfulness, reasoning enrichment, and semantic precision.
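A minimal sketch of how atom-set scoring along the three axes might look. The paper's exact formulas are not reproduced on this page, so the set ratios below are assumptions made for illustration, and the names `Atom`, `ratio`, and `score_illustration` are hypothetical. Here instruction faithfulness (IF) is taken as the share of instruction atoms realized, reasoning enrichment (RE) as the share of reasoning atoms realized, and semantic precision (SP) as the share of realized atoms that appear in the gold set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """One annotated element; `kind` is visual, textual, relation, or layout."""
    kind: str
    label: str


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def score_illustration(instruction_atoms, reasoning_atoms, realized_atoms):
    """Assumed set-ratio scores; not the paper's published formulas."""
    gold = instruction_atoms | reasoning_atoms
    return {
        "IF": ratio(len(realized_atoms & instruction_atoms), len(instruction_atoms)),
        "RE": ratio(len(realized_atoms & reasoning_atoms), len(reasoning_atoms)),
        "SP": ratio(len(realized_atoms & gold), len(realized_atoms)),
    }


# Toy example with made-up atoms.
instruction = {Atom("visual", "chloroplast"), Atom("textual", "CO2")}
reasoning = {Atom("relation", "light drives ATP synthesis"), Atom("textual", "ATP")}
realized = {
    Atom("visual", "chloroplast"),
    Atom("textual", "ATP"),
    Atom("visual", "mitochondrion"),  # not in the gold set, so it lowers SP
}

scores = score_illustration(instruction, reasoning, realized)
print(scores)
```

In this toy case the unsupported mitochondrion atom lowers SP without affecting IF or RE, which is the kind of richness-versus-precision tension the benchmark is described as measuring.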

If this is right

Text rendering must be treated as a first-class capability rather than an afterthought for scientific use.
Models need explicit mechanisms to add domain reasoning without drifting from the prompt.
Generation systems will require tunable controls that let users trade richness for precision on demand.
Evaluation of future models should report separate scores for visual, textual, relational, and layout atoms.
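As a rough illustration of per-type reporting, the sketch below breaks a single recall-style score down by atom kind. The tuple layout `(kind, label)` and the function name `recall_by_kind` are hypothetical, and recall is used only as a stand-in for whichever metric is being decomposed.

```python
from collections import defaultdict

KINDS = ("visual", "textual", "relation", "layout")


def recall_by_kind(gold, realized):
    """Share of gold atoms realized, reported separately for each atom kind.

    `gold` and `realized` are sets of (kind, label) tuples. A kind with no
    gold atoms maps to None so it is not confused with a score of zero.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for atom in gold:
        kind = atom[0]
        totals[kind] += 1
        if atom in realized:
            hits[kind] += 1
    return {k: (hits[k] / totals[k] if totals[k] else None) for k in KINDS}


# Toy example: the image draws everything but omits the text label.
gold = {
    ("visual", "cell"),
    ("visual", "nucleus"),
    ("textual", "Nucleus"),
    ("relation", "nucleus inside cell"),
}
realized = {
    ("visual", "cell"),
    ("visual", "nucleus"),
    ("relation", "nucleus inside cell"),
}

breakdown = recall_by_kind(gold, realized)
print(breakdown)
```

A breakdown like this makes a text-rendering gap visible even when an aggregate score looks acceptable.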

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the benchmark holds, current models are better suited to rough concept sketches than to publication-ready figures.
The three-dimensional scoring could be applied to non-scientific domains where precise diagrams matter, such as technical manuals.
Releasing the atom annotations may let other researchers test whether particular architectural choices drive the observed bottlenecks.

Load-bearing premise

The chosen scientific illustrations and their atom annotations accurately capture what counts as faithful, enriched, and precise generation across disciplines.

What would settle it

A new text-to-image model that scores near the top of FEPBench on all three dimensions yet produces diagrams judged unusable by practicing scientists in a blind review.

Figures

Figures reproduced from arXiv: 2606.05949 by Jianwen Sun, Jiaxin Ai, Kaipeng Zhang, Liangliang Zhao, Minghao Liu, Siqi Luo, Yifan Chang, Yihao Liu, Yuandong Pu, Yuchen Ren, Yunfei Yu, Yu Qiao.

**Figure 2.** Pipeline of our benchmark. Definition of Atom Set: We evaluate generated scientific illustrations via a state-based accounting framework over atoms. Let $A$ denote the gold atom set of a target illustration, partitioned into instruction atoms $A_{\mathrm{ins}}$ and reasoning atoms $A_{\mathrm{rea}}$, and let $\hat{A}$ denote the realized atom set of a generated illustration. $a \in A_{\mathrm{ins}}$ means an atom that can find evidence in instructions (prompt…

**Figure 3.** Overall model comparison from fine-grained metric gaps to the capability frontier.

**Figure 4.** Effects of illustration layout and prompt format on fine-grained model performance.

**Figure 5.** Model robustness to increasing semantic complexity. For each model, we plot IF, RE, and …

**Figure 6.** Comparison of generated results from different models under different prompt formats for …

**Figure 7.** Prompts for Atom Set Verifier and Precision Verifier of MLLM.

**Figure 8.** Prompt used for rewriting free-form prompts into structured prompts.

**Figure 9.** Sample-level score distributions across models. For each model, we show the distributions …

**Figure 10.** Correlation among IF, RE, and SP.

**Figure 11.** Closed-source model generations on Physics and Materials tasks.

**Figure 12.** Closed-source model generations on Geography and Ecology tasks.

Original abstract

Scientific illustrations are essential tools for communicating research findings, especially in natural science, where they visualize complex concepts and processes. As Text-to-Image (T2I) models become increasingly capable, researchers have started to use them for scientific illustration generation. However, existing benchmarks often assess outputs at a holistic level, overlooking fine-grained elements, while scientific reasoning ability and output conciseness remain under-quantified. We introduce FEPBench, a benchmark built from carefully selected high-quality scientific illustrations across multiple disciplines and layout types. With the assistance of multimodal large language models (MLLMs) and human experts, we provide fine-grained atom set annotations and systematically evaluate T2I models along three dimensions: instruction faithfulness, reasoning enrichment, and semantic precision. Our evaluation further decomposes model performance across visual, textual, relation, and layout elements. Results show that even state-of-the-art (SOTA) closed-source models, such as GPT Image 2 and Nano Banana Pro, still suffer from text-rendering bottlenecks, limited reasoning enrichment, and difficulty balancing generation richness with precision. These findings provide practical guidance for improving and deploying T2I models in scientific illustration generation. Benchmark data, atom set annotations, and evaluation code will be released by us.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FEPBench adds a decomposed benchmark for T2I scientific illustrations but the construction details are too thin to support the model-failure claims yet.

The paper's main contribution is FEPBench, a new benchmark drawn from scientific illustrations across disciplines. It supplies atom-set annotations and breaks evaluation into three axes—faithfulness to the prompt, reasoning enrichment, and semantic precision—then further splits performance by visual, textual, relation, and layout elements. That decomposition is a clear step past the usual holistic scores in T2I papers.

What works is the focus itself. Scientific illustration is a real use case where generic image benchmarks miss the mark, and releasing the annotations plus evaluation code is the right move. The abstract states that even closed-source models like GPT Image 2 still struggle with text rendering and balancing richness against precision; if the numbers hold, that observation is useful for anyone trying to adapt these models to research communication.

The soft spot is exactly the one the stress-test flags. The abstract says the illustrations were “carefully selected” and the atom sets were made “with the assistance of MLLMs and human experts,” but gives no selection criteria, no inter-annotator numbers, and no description of how MLLM suggestions were filtered or corrected. Without those steps, it is hard to know whether the reported bottlenecks are properties of the models or artifacts of how the test set was built. The soundness score in the reader’s note looks right on that point.

This is the kind of paper that belongs in a CV venue that cares about domain-specific evaluation. A reader working on generative models for science or education would get value from the benchmark once the methods section is expanded. It is worth sending to peer review; the core idea is sound and the release plan is concrete, but referees will need to see the annotation protocol and any error analysis before the performance claims can be taken at face value.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces FEPBench, a benchmark for Text-to-Image (T2I) models focused on natural-science illustration generation. It is constructed from carefully selected high-quality illustrations across disciplines and layout types, with fine-grained atom set annotations produced via MLLM assistance and human experts. The benchmark evaluates models on three dimensions—instruction faithfulness, reasoning enrichment, and semantic precision—decomposed across visual, textual, relation, and layout elements. Results indicate that even SOTA closed-source models (e.g., GPT Image 2, Nano Banana Pro) exhibit text-rendering bottlenecks, limited reasoning enrichment, and difficulty balancing richness with precision. The authors plan to release the benchmark data, annotations, and evaluation code.

Significance. If the benchmark construction proves robust and representative, this work offers a fine-grained, domain-specific evaluation framework that addresses limitations of holistic T2I benchmarks. It identifies concrete, actionable weaknesses in current models relevant to scientific communication and provides a template for decomposed assessment. The commitment to releasing data and code supports reproducibility and extension by the community.

major comments (2)

[Benchmark construction] Benchmark construction section: The manuscript refers to 'carefully selected high-quality scientific illustrations' and atom set annotations created 'with the assistance of MLLMs and human experts' but supplies no explicit, reproducible selection criteria, sampling strategy across disciplines, or detailed annotation protocol (including how MLLM outputs were validated by experts and any inter-annotator agreement metrics). This is load-bearing for the central claims, because the representativeness of the atom sets directly determines whether the decomposed results reliably demonstrate text-rendering bottlenecks, limited reasoning enrichment, and richness-precision imbalance.
[Evaluation and results] Evaluation and results section: The quantitative definitions and aggregation rules for the three core metrics (instruction faithfulness, reasoning enrichment, semantic precision) and their element-wise decomposition are not provided. Without these, it is impossible to determine whether the reported model shortcomings are robust to annotation choices or sensitive to the MLLM-assisted process, undermining the strength of the performance conclusions.
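One common way to report the inter-annotator agreement requested above is Cohen's kappa over paired labels. The sketch below assumes a simplified setup that the manuscript does not confirm: two annotators each label the same list of atoms as "present" or "absent". The function name `cohens_kappa` and the example labels are illustrative only.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up labels for four atoms.
annotator_1 = ["present", "present", "absent", "absent"]
annotator_2 = ["present", "absent", "absent", "absent"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

Kappa corrects raw agreement for chance, so it is more informative than a plain match rate when one label dominates.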

minor comments (2)

[Introduction] The abstract and introduction would benefit from a brief table or paragraph explicitly contrasting FEPBench with prior T2I benchmarks (e.g., on granularity and scientific focus) to strengthen the novelty claim.
Figure captions for generated examples should consistently include the input prompt, the specific atom-set elements being evaluated, and the observed failure mode to aid reader interpretation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving reproducibility and clarity. We address each major comment below and will revise the manuscript accordingly.

Referee: [Benchmark construction] Benchmark construction section: The manuscript refers to 'carefully selected high-quality scientific illustrations' and atom set annotations created 'with the assistance of MLLMs and human experts' but supplies no explicit, reproducible selection criteria, sampling strategy across disciplines, or detailed annotation protocol (including how MLLM outputs were validated by experts and any inter-annotator agreement metrics). This is load-bearing for the central claims, because the representativeness of the atom sets directly determines whether the decomposed results reliably demonstrate text-rendering bottlenecks, limited reasoning enrichment, and richness-precision imbalance.

Authors: We agree that the current manuscript provides only high-level descriptions and lacks the requested explicit details. In the revised version, we will add a dedicated subsection with: (1) explicit selection criteria for illustrations (e.g., requirements for scientific accuracy, visual clarity, and disciplinary diversity), (2) the sampling strategy used to cover disciplines and layout types, and (3) the full annotation protocol, including MLLM prompting details, expert validation steps, and inter-annotator agreement metrics. These additions will directly support claims about representativeness. revision: yes
Referee: [Evaluation and results] Evaluation and results section: The quantitative definitions and aggregation rules for the three core metrics (instruction faithfulness, reasoning enrichment, semantic precision) and their element-wise decomposition are not provided. Without these, it is impossible to determine whether the reported model shortcomings are robust to annotation choices or sensitive to the MLLM-assisted process, undermining the strength of the performance conclusions.

Authors: We acknowledge that the manuscript does not include the formal quantitative definitions or aggregation rules. In the revision, we will add precise mathematical formulations for each metric, the element-wise decomposition (visual, textual, relation, layout), and the aggregation procedures. We will also specify how MLLM-assisted annotations are handled in scoring to allow assessment of robustness. revision: yes

Circularity Check

0 steps flagged

No circularity; the benchmark is an externally grounded evaluation framework

The paper constructs FEPBench from external high-quality scientific illustrations selected across disciplines, with atom-set annotations produced via MLLM assistance plus human experts. Evaluation metrics for faithfulness, enrichment, and precision are defined independently of any model outputs or fitted parameters. No equations, self-referential definitions, fitted-input predictions, or load-bearing self-citations appear in the derivation of the central claims. The reported model limitations follow directly from applying these external annotations to T2I outputs, making the evaluation self-contained against the benchmark inputs rather than reducing to them by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that selected illustrations and MLLM/human annotations provide a valid ground truth for scientific illustration quality; no free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: High-quality scientific illustrations can be curated across disciplines and annotated at the atom level to serve as reliable ground truth for faithfulness, enrichment, and precision.
This underpins the benchmark construction and evaluation as described in the abstract.

pith-pipeline@v0.9.1-grok · 5793 in / 1267 out tokens · 51010 ms · 2026-06-28T02:56:03.574467+00:00 · methodology

Reference graph

Works this paper leans on

45 extracted references · 10 canonical work pages · 2 internal anchors

[1]

Supported Models and Capabilities Overview: Qwen Image Models

Alibaba Cloud. Supported Models and Capabilities Overview: Qwen Image Models. https://www.alibabacloud.com/help/en/model-studio/models, 2026. Accessed: 2026-05-06

2026
[2]

FLUX.2: Next Generation Image Generation

Black Forest Labs. FLUX.2: Next Generation Image Generation. https://bfl.ai/models/flux-2,
[4]

Seedream 5.0 Lite

ByteDance Seed Team. Seedream 5.0 Lite. https://seed.bytedance.com/en/seedream5_0_lite,
[6]

HunyuanImage 3.0 Technical Report

Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, et al. HunyuanImage 3.0 Technical Report. arXiv preprint arXiv:2509.23951, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Sridbench: Benchmark of scientific research illustration drawing of image generation model

Yifan Chang, Yukang Feng, Jianwen Sun, Jiaxin Ai, Chuanhao Li, S. Kevin Zhou, and Kaipeng Zhang. Sridbench: Benchmark of scientific research illustration drawing of image generation model. arXiv preprint arXiv:2505.22126, 2025

work page arXiv 2025
[8]

Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-to-image generation

Jaemin Cho, Yushi Hu, Jason Baldridge, Roopal Garg, Peter Anderson, Ranjay Krishna, Mohit Bansal, Jordi Pont-Tuset, and Su Wang. Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-to-image generation. In International Conference on Learning Representations, 2024

2024
[9]

MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding

Hejun Dong, Junbo Niu, Bin Wang, Weijun Zeng, Wentao Zhang, and Conghui He. MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding. arXiv preprint arXiv:2603.22458, 2026

work page arXiv 2026
[10]

Introducing nano banana pro

Google DeepMind. Introducing nano banana pro. Google Blog, 2025. URL https://blog.google/technology/google-deepmind/nano-banana-pro/. Accessed 2026-04-27

2025
[11]

Gemini 3 Pro Image – Nano Banana Pro

Google DeepMind. Gemini 3 Pro Image – Nano Banana Pro. https://deepmind.google/models/gemini-image/pro/, 2026. Accessed: 2026-05-06

2026
[12]

Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A. Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20406–20417, 2023

2023
[13]

T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation

Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. In Advances in Neural Information Processing Systems, volume 36, 2023

2023
[14]

Evaluating numerical reasoning in text-to-image models

Ivana Kajić, Olivia Wiles, Isabela Albuquerque, Matthias Bauer, Su Wang, Jordi Pont-Tuset, and Aida Nematzadeh. Evaluating numerical reasoning in text-to-image models. In Advances in Neural Information Processing Systems, Datasets and Benchmarks Track, 2024

2024
[15]

Easier painting than thinking: Can text-to-image models set the stage, but not direct the play?

Ouxiang Li, Yuan Wang, Xinting Hu, Huijuan Huang, Rui Chen, Jiarong Ou, Xin Tao, Pengfei Wan, Xiaojuan Qi, and Fuli Feng. Easier painting than thinking: Can text-to-image models set the stage, but not direct the play? In International Conference on Learning Representations, 2026

2026
[16]

Bizgeneval: A systematic benchmark for commercial visual content generation

Yan Li, Zezi Zeng, Ziwei Zhou, Xin Gao, Muzhao Tian, Yifan Yang, Mingxi Cheng, Qi Dai, Yuqing Yang, Lili Qiu, Zhendong Wang, Zhengyuan Yang, Xue Yang, Lijuan Wang, Ji Li, and Chong Luo. Bizgeneval: A systematic benchmark for commercial visual content generation. arXiv preprint arXiv:2603.25732, 2026

work page arXiv 2026
[17]

Scientific image synthesis: Benchmarking, methodologies, and downstream utility

Honglin Lin et al. Scientific image synthesis: Benchmarking, methodologies, and downstream utility. arXiv preprint arXiv:2601.17027, 2026

work page arXiv 2026
[18]

Evaluating text-to-visual generation with image-to-text generation

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. In European Conference on Computer Vision, pages 366–384, 2024.

2024
[19]

Mmmg: A massive, multidisciplinary, multi-tier generation benchmark for text-to-image reasoning

Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue, Yuwei Yang, Fatima Zohra Daha, Ji Li, and Zhouhui Lian. Mmmg: A massive, multidisciplinary, multi-tier generation benchmark for text-to-image reasoning. arXiv preprint arXiv:2506.10963, 2025

work page arXiv 2025
[20]

David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman, San Francisco, 1982

1982
[21]

GPT Image 1.5 Model

OpenAI. GPT Image 1.5 Model. https://developers.openai.com/api/docs/models/gpt-image-1.5, 2025. Accessed: 2026-05-06

2025
[22]

GPT-5.4 Model

OpenAI. GPT-5.4 Model. https://developers.openai.com/api/docs/models/gpt-5.4, 2026. Accessed: 2026-05-06

2026
[23]

Gpt image 2 model

OpenAI. Gpt image 2 model. OpenAI API Documentation, 2026. URL https://developers.openai.com/api/docs/models/gpt-image-2. Accessed 2026-04-27

2026
[24]

Qwen3.5: Towards Native Multimodal Agents

Qwen Team. Qwen3.5: Towards Native Multimodal Agents. https://qwen.ai/blog?id=qwen3.5. Accessed: 2026-05-06

2026
[26]

Qwen-Image-2.0: Professional Infographics, Exquisite Text, and More

Qwen Team. Qwen-Image-2.0: Professional Infographics, Exquisite Text, and More. https://qwen.ai/blog?id=qwen-image-2.0, 2026. Accessed: 2026-05-06

2026
[27]

T2i-reasonbench: Benchmarking reasoning-informed text-to-image generation

Kaiyue Sun, Rongyao Fang, Chengqi Duan, Xian Liu, and Xihui Liu. T2i-reasonbench: Benchmarking reasoning-informed text-to-image generation. arXiv preprint arXiv:2508.17472, 2025

work page arXiv 2025
[28]

The Visual Display of Quantitative Information

Edward R. Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT, 1983

1983
[29]

Ovis-Image Technical Report

Guo-Hua Wang, Liangfu Cao, Tianyu Cui, Minghao Fu, Xiaohao Chen, Pengxin Zhan, Jianshan Zhao, Lan Li, Bowen Fu, Jiaqi Liu, and Qing-Guo Chen. Ovis-Image Technical Report. arXiv preprint arXiv:2511.22982, 2025

work page arXiv 2025
[30]

From words to structured visuals: A benchmark and framework for text-to-diagram generation and editing

Jingxuan Wei, Cheng Tan, Qi Chen, Gaowei Wu, Siyuan Li, Zhangyang Gao, Linzhuang Sun, Bihui Yu, and Ruifeng Guo. From words to structured visuals: A benchmark and framework for text-to-diagram generation and editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13315–13325, 2025

2025
[31]

Revisiting text-to-image evaluation with gecko

Olivia Wiles, Isabela Albuquerque, Ivana Kajic, Jordi Pont-Tuset, Matthias Bauer, Su Wang, and Aida Nematzadeh. Revisiting text-to-image evaluation with gecko. In International Conference on Learning Representations, 2025

2025
[32]

Conceptmix: A compositional image generation benchmark with controllable difficulty

Xindi Wu, Dingli Yu, Yangsibo Huang, Olga Russakovsky, and Sanjeev Arora. Conceptmix: A compositional image generation benchmark with controllable difficulty. In Advances in Neural Information Processing Systems, Datasets and Benchmarks Track, 2024

2024
[33]

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Shijie Huang, et al. Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer. arXiv preprint arXiv:2511.22699, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Paperbanana: Automating academic illustration for ai scientists

Dawei Zhu, Rui Meng, Yale Song, Xiyu Wei, Sujian Li, Tomas Pfister, and Jinsung Yoon. Paperbanana: Automating academic illustration for ai scientists. arXiv preprint arXiv:2601.23265, 2026

work page arXiv 2026
[35]

Autofigure: Generating and refining publication-ready scientific illustrations

Minjun Zhu, Zhen Lin, Yixuan Weng, Panzhong Lu, Qiujie Xie, Yifan Wei, Sifan Liu, Qiyao Sun, and Yue Zhang. Autofigure: Generating and refining publication-ready scientific illustrations. In International Conference on Learning Representations, 2026.

A Prompt used in MLLM. Here, we provide the prompts used for atom-set scoring and unexpected-atom detection ...

2026
[37]

present" |

compact gold graph atoms from final_annotation_json: - text entities - visual entities - relations - layout constraints Your task is to verify the whole gold graph against the generated image in ONE pass. For each gold text entity, output only: - entity_id - presence_status: "present" | "absent" - exact_match: 1 | 0 - readable: 1 | 0 - attachment_match: 1...
[38]

a generated scientific figure image
[39]

compact allowed atoms from the gold graph: - required/optional allowed texts - allowed visual entities - allowed relations - gold visual entity count limits Inspect the generated image directly and identify unsupported scientific content in ONE pass. Output:
[40]

supported_texts: realized scientific texts that align to required or optional gold text atoms
[41]

unsupported_texts: realized scientific texts that align to neither required nor optional gold text atoms
[42]

unsupported_visual_entities: salient generated visual entities with scientific meaning that are not allowed by goldatoms
[43]

unsupported_relations: generated scientific relations that are not allowed by gold relations
[44]

supported_texts

generated_visual_entity_counts: generated counts for supported visual entity kinds Rules: - Do not use or infer anything from the original generation prompt. - Report only content with clear scientific meaning. - Ignore harmless decoration, layout fillers, watermark-like noise, and unreadable artifacts. - Do not output importance, confidence, notes, or ex...
[45]

Key Scientific Entities
[46]

Relationships and Process Flow
[47]

Legend and Visual Encoding
[48]

Keep each section compact

Style Only include sections that are supported by the input. Keep each section compact. Use short bullet points inside sections if useful. Emphasize: core topic and research object key visual and textual entities logical relations, mechanisms, and process flow layout and panel structure color palette and visual style Output quality target: The result shou...

2026

[1] [1]

Supported Models and Capabilities Overview: Qwen Image Models

Alibaba Cloud. Supported Models and Capabilities Overview: Qwen Image Models. https://www. alibabacloud.com/help/en/model-studio/models, 2026. Accessed: 2026-05-06

2026

[2] [2]

FLUX.2: Next Generation Image Generation

Black Forest Labs. FLUX.2: Next Generation Image Generation. https://bfl.ai/models/flux-2,

[3] [4]

Seedream 5.0 Lite

ByteDance Seed Team. Seedream 5.0 Lite. https://seed.bytedance.com/en/seedream5_0_lite,

[4] [6]

HunyuanImage 3.0 Technical Report

Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, et al. HunyuanImage 3.0 Technical Report. arXiv preprint arXiv:2509.23951, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [7]

Kevin Zhou, and Kaipeng Zhang

Yifan Chang, Yukang Feng, Jianwen Sun, Jiaxin Ai, Chuanhao Li, S. Kevin Zhou, and Kaipeng Zhang. Sridbench: Benchmark of scientific research illustration drawing of image generation model.arXiv preprint arXiv:2505.22126, 2025

work page arXiv 2025

[6] [8]

Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-to-image generation

Jaemin Cho, Yushi Hu, Jason Baldridge, Roopal Garg, Peter Anderson, Ranjay Krishna, Mohit Bansal, Jordi Pont-Tuset, and Su Wang. Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-to-image generation. InInternational Conference on Learning Representations, 2024

2024

[7] [9]

MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding.arXiv preprint arXiv:2603.22458, 2026

Hejun Dong, Junbo Niu, Bin Wang, Weijun Zeng, Wentao Zhang, and Conghui He. MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding.arXiv preprint arXiv:2603.22458, 2026

work page arXiv 2026

[8] [10]

Introducing nano banana pro

Google DeepMind. Introducing nano banana pro. Google Blog, 2025. URL https://blog.google/ technology/google-deepmind/nano-banana-pro/. Accessed 2026-04-27

2025

[9] [11]

Gemini 3 Pro Image – Nano Banana Pro

Google DeepMind. Gemini 3 Pro Image – Nano Banana Pro. https://deepmind.google/models/ gemini-image/pro/, 2026. Accessed: 2026-05-06

2026

[10] [12]

Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A. Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20406–20417, 2023

2023

[11] [13]

T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation

Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. InAdvances in Neural Information Processing Systems, volume 36, 2023

2023

[12] [14]

Evaluating numerical reasoning in text-to-image models

Ivana Kaji´c, Olivia Wiles, Isabela Albuquerque, Matthias Bauer, Su Wang, Jordi Pont-Tuset, and Aida Nematzadeh. Evaluating numerical reasoning in text-to-image models. InAdvances in Neural Information Processing Systems, Datasets and Benchmarks Track, 2024

2024

[13] [15]

Easier painting than thinking: Can text-to-image models set the stage, but not direct the play? InInternational Conference on Learning Representations, 2026

Ouxiang Li, Yuan Wang, Xinting Hu, Huijuan Huang, Rui Chen, Jiarong Ou, Xin Tao, Pengfei Wan, Xiaojuan Qi, and Fuli Feng. Easier painting than thinking: Can text-to-image models set the stage, but not direct the play? InInternational Conference on Learning Representations, 2026

2026

[14] [16]

Yan Li, Zezi Zeng, Ziwei Zhou, Xin Gao, Muzhao Tian, Yifan Yang, Mingxi Cheng, Qi Dai, Yuqing Yang, Lili Qiu, Zhendong Wang, Zhengyuan Yang, Xue Yang, Lijuan Wang, Ji Li, and Chong Luo. Bizgeneval: A systematic benchmark for commercial visual content generation. arXiv preprint arXiv:2603.25732, 2026

[15] [17]

Honglin Lin et al. Scientific image synthesis: Benchmarking, methodologies, and downstream utility. arXiv preprint arXiv:2601.17027, 2026

[16] [18]

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. In European Conference on Computer Vision, pages 366–384, 2024

[17] [19]

Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue, Yuwei Yang, Fatima Zohra Daha, Ji Li, and Zhouhui Lian. Mmmg: A massive, multidisciplinary, multi-tier generation benchmark for text-to-image reasoning. arXiv preprint arXiv:2506.10963, 2025

[18] [20]

David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman, San Francisco, 1982

[19] [21]

OpenAI. GPT Image 1.5 Model. https://developers.openai.com/api/docs/models/gpt-image-1.5, 2025. Accessed: 2026-05-06

[20] [22]

OpenAI. GPT-5.4 Model. https://developers.openai.com/api/docs/models/gpt-5.4, 2026. Accessed: 2026-05-06

[21] [23]

OpenAI. Gpt image 2 model. OpenAI API Documentation, 2026. URL https://developers.openai.com/api/docs/models/gpt-image-2. Accessed 2026-04-27

[22] [24]

Qwen Team. Qwen3.5: Towards Native Multimodal Agents. https://qwen.ai/blog?id=qwen3.5, 2026. Accessed: 2026-05-06

[24] [26]

Qwen Team. Qwen-Image-2.0: Professional Infographics, Exquisite Text, and More. https://qwen.ai/blog?id=qwen-image-2.0, 2026. Accessed: 2026-05-06

[25] [27]

Kaiyue Sun, Rongyao Fang, Chengqi Duan, Xian Liu, and Xihui Liu. T2i-reasonbench: Benchmarking reasoning-informed text-to-image generation. arXiv preprint arXiv:2508.17472, 2025

[26] [28]

Edward R. Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT, 1983

[27] [29]

Guo-Hua Wang, Liangfu Cao, Tianyu Cui, Minghao Fu, Xiaohao Chen, Pengxin Zhan, Jianshan Zhao, Lan Li, Bowen Fu, Jiaqi Liu, and Qing-Guo Chen. Ovis-Image Technical Report. arXiv preprint arXiv:2511.22982, 2025

[28] [30]

Jingxuan Wei, Cheng Tan, Qi Chen, Gaowei Wu, Siyuan Li, Zhangyang Gao, Linzhuang Sun, Bihui Yu, and Ruifeng Guo. From words to structured visuals: A benchmark and framework for text-to-diagram generation and editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13315–13325, 2025

[29] [31]

Olivia Wiles, Isabela Albuquerque, Ivana Kajic, Jordi Pont-Tuset, Matthias Bauer, Su Wang, and Aida Nematzadeh. Revisiting text-to-image evaluation with gecko. In International Conference on Learning Representations, 2025

[30] [32]

Xindi Wu, Dingli Yu, Yangsibo Huang, Olga Russakovsky, and Sanjeev Arora. Conceptmix: A compositional image generation benchmark with controllable difficulty. In Advances in Neural Information Processing Systems, Datasets and Benchmarks Track, 2024

[31] [33]

Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Shijie Huang, et al. Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer. arXiv preprint arXiv:2511.22699, 2025

[32] [34]

Dawei Zhu, Rui Meng, Yale Song, Xiyu Wei, Sujian Li, Tomas Pfister, and Jinsung Yoon. Paperbanana: Automating academic illustration for ai scientists. arXiv preprint arXiv:2601.23265, 2026

[33] [35]

Minjun Zhu, Zhen Lin, Yixuan Weng, Panzhong Lu, Qiujie Xie, Yifan Wei, Sifan Liu, Qiyao Sun, and Yue Zhang. Autofigure: Generating and refining publication-ready scientific illustrations. In International Conference on Learning Representations, 2026

A Prompt used in MLLM

Here, we provide the prompts used for atom-set scoring and unexpected-atom detection ...

compact gold graph atoms from final_annotation_json:
- text entities
- visual entities
- relations
- layout constraints

Your task is to verify the whole gold graph against the generated image in ONE pass.

For each gold text entity, output only:
- entity_id
- presence_status: "present" | "absent"
- exact_match: 1 | 0
- readable: 1 | 0
- attachment_match: 1...
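The per-entity verdict requested by this prompt can be checked mechanically before scoring. The following is a minimal sketch in Python, assuming the MLLM reply has already been decoded into a dictionary; the names `TextEntityVerdict` and `parse_text_verdict` are hypothetical and do not come from the paper. Only the four fields that appear in full above are validated, since the rest of the schema is truncated in the source.

```python
from dataclasses import dataclass

# Values permitted by the prompt for presence_status.
ALLOWED_PRESENCE = {"present", "absent"}
# Fields the prompt declares as binary flags (1 | 0).
BINARY_FIELDS = ("exact_match", "readable")


@dataclass(frozen=True)
class TextEntityVerdict:
    """Hypothetical container for one gold text entity's verdict."""
    entity_id: str
    presence_status: str
    exact_match: int
    readable: int


def parse_text_verdict(raw: dict) -> TextEntityVerdict:
    """Validate one decoded verdict and reject out-of-schema values."""
    status = raw["presence_status"]
    if status not in ALLOWED_PRESENCE:
        raise ValueError(f"unexpected presence_status: {status!r}")
    for field in BINARY_FIELDS:
        value = raw[field]
        # Booleans compare equal to 0/1 in Python, so exclude them explicitly.
        if isinstance(value, bool) or value not in (0, 1):
            raise ValueError(f"{field} must be 0 or 1, got {value!r}")
    return TextEntityVerdict(
        entity_id=str(raw["entity_id"]),
        presence_status=status,
        exact_match=raw["exact_match"],
        readable=raw["readable"],
    )
```

Rejecting malformed verdicts early keeps a judge's formatting slips from being silently counted as matches or misses.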

a generated scientific figure image

compact allowed atoms from the gold graph:
- required/optional allowed texts
- allowed visual entities
- allowed relations
- gold visual entity count limits

Inspect the generated image directly and identify unsupported scientific content in ONE pass.

Output:
- supported_texts: realized scientific texts that align to required or optional gold text atoms
- unsupported_texts: realized scientific texts that align to neither required nor optional gold text atoms
- unsupported_visual_entities: salient generated visual entities with scientific meaning that are not allowed by gold atoms
- unsupported_relations: generated scientific relations that are not allowed by gold relations
- generated_visual_entity_counts: generated counts for supported visual entity kinds

Rules:
- Do not use or infer anything from the original generation prompt.
- Report only content with clear scientific meaning.
- Ignore harmless decoration, layout fillers, watermark-like noise, and unreadable artifacts.
- Do not output importance, confidence, notes, or ex...
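The five output fields named in this prompt lend themselves to a simple tally. The sketch below, in Python, assumes the reply is a dictionary whose four text/entity/relation fields are lists; `summarize_report` and `total_unsupported` are hypothetical names, and the tally is an illustration, not the paper's precision metric.

```python
# Output fields named in the prompt that are assumed here to hold lists.
LIST_FIELDS = (
    "supported_texts",
    "unsupported_texts",
    "unsupported_visual_entities",
    "unsupported_relations",
)
# Field assumed to hold per-kind counts; only its presence is checked.
COUNT_FIELD = "generated_visual_entity_counts"


def summarize_report(report: dict) -> dict:
    """Count items per field and total the unsupported ones."""
    missing = [k for k in (*LIST_FIELDS, COUNT_FIELD) if k not in report]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    summary = {name: len(report[name]) for name in LIST_FIELDS}
    summary["total_unsupported"] = sum(
        summary[name] for name in LIST_FIELDS if name.startswith("unsupported_")
    )
    return summary
```

A usage example: a report with two supported texts, one unsupported text, and one unsupported relation yields `total_unsupported == 2`.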

Key Scientific Entities
Relationships and Process Flow
Legend and Visual Encoding
Style

Only include sections that are supported by the input. Keep each section compact. Use short bullet points inside sections if useful.

Emphasize:
- core topic and research object
- key visual and textual entities
- logical relations, mechanisms, and process flow
- layout and panel structure
- color palette and visual style

Output quality target: The result shou...