
arxiv: 2509.14232 · v5 · pith:6MWN3ELSnew · submitted 2025-09-17 · 💻 cs.CV

GenExam: A Multidisciplinary Text-to-Image Exam

Pith reviewed 2026-05-18 15:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-image generation · benchmark · multidisciplinary evaluation · exam-style assessment · semantic correctness · visual plausibility · generative models · reasoning integration

The pith

GenExam introduces a benchmark of 1,000 multidisciplinary exam prompts to test text-to-image models on integrated understanding, reasoning and generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GenExam as the first benchmark that treats text-to-image generation as a multidisciplinary exam. It supplies 1,000 curated prompts spanning 10 subjects, structured by a four-level taxonomy, each accompanied by ground-truth images and fine-grained scoring points. Experiments across 17 models establish that the tasks are difficult and expose a consistent performance gap favoring closed-source systems over open-source ones. The design measures whether models can combine factual knowledge, logical steps and visual output at expert level. This framing supplies a concrete way to track progress toward generative systems that handle complex, exam-like demands.

Core claim

GenExam is introduced as a benchmark consisting of 1,000 exam-style prompts across 10 subjects under a four-level taxonomy, each paired with ground-truth images and fine-grained scoring points. Testing 17 text-to-image and unified models reveals the benchmark's difficulty and a significant gap where open-source models lag behind leading closed-source ones. The benchmark assesses the integration of understanding, reasoning, and generation in image synthesis.

What carries the argument

The GenExam benchmark, organized by a four-level taxonomy of exam-style prompts together with 1,000 samples, ground-truth images, and fine-grained scoring points that jointly evaluate semantic correctness and visual plausibility.
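
To make the scoring machinery concrete, here is a minimal Python sketch of how per-prompt scoring points and the three plausibility ratings might be combined. The paper's figure caption states only that semantic correctness lies in 0-1, that spelling, logical consistency, and readability are each rated 0/1/2, and that the four dimensions feed a strict and a relaxed score. Everything else below is an assumption made for illustration: the `ScoringPoint` class, the requirement that weights sum to 1, the all-perfect rule in `strict_score`, and the 0.7 default weight in `relaxed_score`.

```python
from dataclasses import dataclass


@dataclass
class ScoringPoint:
    """Hypothetical container for one yes/no scoring point."""
    question: str   # yes/no question about the generated image
    weight: float   # share of the semantic score (assumed to sum to 1 per prompt)
    passed: bool    # True if the judge answered "Yes"


def semantic_correctness(points):
    """Weighted fraction of scoring points answered 'Yes', in [0, 1]."""
    total = sum(p.weight for p in points)
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"weights sum to {total}, expected 1")
    return sum(p.weight for p in points if p.passed)


def strict_score(semantic, spelling, logic, readability):
    """ASSUMED rule: full credit only when every dimension is perfect."""
    perfect = semantic >= 1.0 - 1e-9 and min(spelling, logic, readability) == 2
    return 1.0 if perfect else 0.0


def relaxed_score(semantic, spelling, logic, readability, w_semantic=0.7):
    """ASSUMED rule: weighted mix of semantic score and rescaled 0/1/2 ratings."""
    w_each = (1.0 - w_semantic) / 3.0
    # Each 0/1/2 rating is rescaled to [0, 1] before weighting.
    plausibility = (spelling + logic + readability) / 2.0
    return w_semantic * semantic + w_each * plausibility


if __name__ == "__main__":
    # Illustrative scoring points, not taken from the benchmark.
    points = [
        ScoringPoint("Does the image show a graph of a function?", 0.2, True),
        ScoringPoint("Does the curve pass through (0, 1)?", 0.3, True),
        ScoringPoint("Is the curve strictly increasing?", 0.5, False),
    ]
    s = semantic_correctness(points)
    print(s, strict_score(s, 2, 1, 2), relaxed_score(s, 2, 1, 2))
```

Under these assumptions the strict score is all-or-nothing, which would explain low absolute numbers on a hard benchmark, while the relaxed score gives partial credit. The released evaluation code is the authority on the actual aggregation.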

If this is right

  • Text-to-image models must improve their ability to combine conceptual understanding with step-by-step reasoning before producing images.
  • Open-source models require targeted advances to close the observed gap on expert-level generation tasks.
  • GenExam supplies a standard for measuring whether generative systems are approaching integrated expert intelligence.
  • Existing image-generation benchmarks miss the rigorous evaluation of drawing exams that require precise semantic and visual fidelity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy could be reused to generate synthetic training data that forces models to produce explicit reasoning traces before rendering an image.
  • Similar exam-style benchmarks may be constructed for other generative domains such as video synthesis or scientific diagram creation.
  • The performance difference suggests that future scaling efforts should incorporate explicit reasoning objectives rather than optimizing visual fidelity alone.
  • Human validation studies on the scoring points could be run on a held-out subset to quantify how well the automated metric aligns with expert judgment.
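
The human validation study suggested in the last bullet could be summarized with standard agreement statistics. The sketch below is a dependency-free Python illustration, assuming one automated score and one expert score per image on the same scale. The function names are hypothetical and the choice of Pearson's r, Spearman's ρ, and mean absolute error is one reasonable option, not a description of the paper's protocol.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_absolute_error(x, y):
    """Mean absolute difference between paired scores."""
    if len(x) != len(y) or not x:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)
```

A high rank correlation with a large mean absolute error would indicate that the automated judge orders images like the experts do but is miscalibrated in absolute terms, which matters more for the strict score than for model rankings.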

Load-bearing premise

The assumption that the four-level taxonomy, 1,000 curated samples, and fine-grained scoring points together provide a precise and representative measure of semantic correctness and visual plausibility for expert-level image generation tasks.

What would settle it

An open-source model that matches or exceeds the scores of the leading closed-source models on the full set of GenExam prompts would falsify the reported performance gap.

Figures

Figures reproduced from arXiv: 2509.14232 by Changyao Tian, Gen Luo, Jifeng Dai, Penghao Yin, Wenhai Wang, Xiangyu Zhao, Yu Qiao, Zhaokai Wang.

Figure 1
Examples and performance of state-of-the-art text-to-image models on GenExam. Orange dashed circles indicate scoring points. GenExam contains complex and diverse prompts resembling human exams and poses a great challenge to existing models.
Figure 2
Examples from GenExam. GenExam contains 1,000 exam-style prompts spanning 10 subjects, with corresponding reference images, for a multidisciplinary text-to-image exam.
Figure 3
Figure 4
Data curation pipeline. We use a pre-defined taxonomy to collect web images and existing datasets, then annotate and filter them using GPT-5 and manual checks.
Figure 5
Evaluation protocol of GenExam. Employing MLLM-as-a-judge, we use the scoring points of each prompt to calculate semantic correctness (0-1), and also rate the image in terms of spelling, logical consistency, and readability (0/1/2). The four dimensions are used to calculate a strict score and a relaxed score.
Manual Check and Filtering. Three PhD annotators perform rigorous manual inspection on all the images,…
Figure 6
Comparison across models on four evaluation dimensions.
Closed-source Models: GPT-Image-1 (GPT-4o) (OpenAI, 2025b), Gemini-2.5-Flash-Image (Nano Banana) (Google, 2025), Imagen-4-Ultra (Deepmind, 2025), Seedream 4.0 (Seed, 2025), Seedream 3.0 (Gao et al., 2025), and FLUX.1 Kontext max (Batifol et al., 2025). These proprietary models represent the state-of-the-art methods in text-to-image generation. Open-so…
Figure 7
Examples of images generated by top-performing models. Semantic, Read, Logic, and Spell denote semantic correctness, readability, logical consistency, and spelling, respectively.
… of the models as evaluator and calculate the accuracy of answering scoring points, MAE of visual plausibility, and three correlation metrics (Kendall’s τ, Spearman’s ρ, and Pearson’s r). As shown in Tab. 4, the latest model GPT-5 with low reas…
Figure 8
Level-2 taxonomy of GenExam. The detailed four-level taxonomy is in Sec. D.
Figure 9
More statistics: (a) Subject distribution on GenExam-Full; (b) Subject distribution on GenExam-Mini; (c) Proportion of each image source.
Figure 10
Word clouds of prompts in each subject.
Figure 11
Visualization of generated images.
Figure 12
Visualization of generated images.
Figure 13
Visualization of generated images.

Original abstract

Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility. Experiments on 17 text-to-image and unified models demonstrate the great challenge of GenExam and the huge gap where open-source models consistently lag behind the leading closed-source ones. By framing image generation as an exam, GenExam offers a rigorous assessment of models' ability to integrate understanding, reasoning, and generation, providing insights for on the path to intelligent generative models. Our benchmark and evaluation code are released at https://github.com/OpenGVLab/GenExam.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GenExam, the first benchmark for multidisciplinary text-to-image exams. It features 1,000 curated samples across 10 subjects, organized under a four-level taxonomy, with each problem supplied with ground-truth images and fine-grained scoring points to assess semantic correctness and visual plausibility. Experiments on 17 text-to-image and unified models show that the benchmark is challenging and reveal a consistent performance gap in which open-source models lag behind leading closed-source models. The work frames image generation as an exam to evaluate integrated understanding, reasoning, and generation capabilities.

Significance. If the scoring methodology can be shown to be reliable and reproducible, GenExam would provide a useful new tool for evaluating expert-level generative capabilities that go beyond existing world-knowledge illustration benchmarks. The public release of the benchmark, ground-truth images, scoring points, and evaluation code is a clear strength that supports reproducibility and future extensions by the community.

major comments (2)
  1. [§3] §3 (Benchmark Construction): The central claim that GenExam enables 'precise evaluation of semantic correctness and visual plausibility' rests on the fine-grained scoring points, yet the manuscript supplies no information on how these points were authored, aligned to the four-level taxonomy, subjected to expert review, or pilot-tested. No inter-rater reliability statistics are reported. This directly undermines the reliability of the 17-model comparison.
  2. [§4] §4 (Experiments): The reported performance gap between open- and closed-source models is presented as evidence of GenExam's rigor, but without validation of the scoring rubric (e.g., agreement metrics or sensitivity analysis), it remains possible that differences in how 'visual plausibility' is interpreted across the 10 subjects drive the observed results rather than genuine capability differences.
minor comments (2)
  1. [Abstract] The abstract contains a minor grammatical issue ('insights for on the path').
  2. [Figure 1] Ensure that any figures illustrating the four-level taxonomy include concrete examples from multiple subjects to improve clarity for readers unfamiliar with the taxonomy.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We have carefully reviewed the concerns regarding the details of scoring point creation and the validation of the evaluation methodology. Below we respond point by point to the major comments. Where appropriate, we indicate revisions that will be incorporated into the next version of the paper.

Point-by-point responses
  1. Referee: [§3] §3 (Benchmark Construction): The central claim that GenExam enables 'precise evaluation of semantic correctness and visual plausibility' rests on the fine-grained scoring points, yet the manuscript supplies no information on how these points were authored, aligned to the four-level taxonomy, subjected to expert review, or pilot-tested. No inter-rater reliability statistics are reported. This directly undermines the reliability of the 17-model comparison.

    Authors: We acknowledge that the original manuscript provides insufficient detail on the development of the fine-grained scoring points. The points were created by the core author team in consultation with domain experts for each of the 10 subjects, with explicit mapping to the four-level taxonomy: level-1 and level-2 criteria focus on semantic correctness (object presence, attribute accuracy, relational structure), while level-3 and level-4 criteria address visual plausibility (style consistency, spatial layout, and absence of artifacts). A small-scale pilot was conducted internally on 50 samples to refine the rubric before full annotation. In the revised manuscript we will expand §3 with a dedicated subsection describing this process, including the expert consultation steps and pilot outcomes. We did not collect formal inter-rater reliability statistics during the initial construction; therefore we cannot report Cohen’s kappa or similar metrics at this time. revision: partial

  2. Referee: [§4] §4 (Experiments): The reported performance gap between open- and closed-source models is presented as evidence of GenExam's rigor, but without validation of the scoring rubric (e.g., agreement metrics or sensitivity analysis), it remains possible that differences in how 'visual plausibility' is interpreted across the 10 subjects drive the observed results rather than genuine capability differences.

    Authors: We agree that stronger validation of the rubric would increase confidence in the reported gap. The scoring rubric was applied uniformly by the same annotation team across all subjects, and the performance ordering remains consistent when results are broken down by individual subject. Nevertheless, we accept that without sensitivity analysis it is difficult to fully rule out rubric-interpretation effects. In the revision we will add a sensitivity study that re-scores a subset of generations under varied weightings of semantic versus visual components and show that the open- versus closed-source gap persists. We maintain that the gap primarily reflects capability differences, but we will explicitly note the limitation of the current validation in the revised text. revision: partial
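
The sensitivity study promised in the second response could be sketched as a sweep over the weight given to the semantic component. This Python illustration is an assumption about what such a study might look like, not the authors' procedure: `combined_score` and `gap_over_weights` are hypothetical helpers, both components are assumed to be normalized to [0, 1], and the numbers in the demo are toy placeholders rather than benchmark results.

```python
def combined_score(semantic, visual, w_semantic):
    """Convex combination of semantic and visual components, both in [0, 1]."""
    return w_semantic * semantic + (1.0 - w_semantic) * visual


def gap_over_weights(closed_src, open_src, weights):
    """Mean closed-source minus mean open-source score for each weight.

    closed_src and open_src are lists of (semantic, visual) pairs, one pair
    per re-scored generation.
    """
    if not closed_src or not open_src:
        raise ValueError("both groups need at least one generation")
    gaps = {}
    for w in weights:
        mean_closed = sum(combined_score(s, v, w) for s, v in closed_src) / len(closed_src)
        mean_open = sum(combined_score(s, v, w) for s, v in open_src) / len(open_src)
        gaps[w] = mean_closed - mean_open
    return gaps


if __name__ == "__main__":
    # Toy placeholder values for illustration only, not results from the paper.
    closed_src = [(0.6, 0.8), (0.4, 0.6)]
    open_src = [(0.2, 0.5), (0.1, 0.3)]
    weights = [0.0, 0.25, 0.5, 0.75, 1.0]
    for w, gap in gap_over_weights(closed_src, open_src, weights).items():
        print(f"w_semantic={w:.2f}  gap={gap:+.3f}")
```

If the gap keeps the same sign across the whole sweep, the ordering does not depend on how the two components are weighted. A sign change at some weight would support the referee's concern that rubric interpretation drives the result.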

standing simulated objections not resolved
  • Formal inter-rater reliability statistics (e.g., Cohen’s kappa) were not collected during benchmark construction and therefore cannot be reported without additional annotation effort.
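
For reference, the statistic named in this objection is straightforward to compute once two raters have answered the same scoring-point questions. The following Python sketch implements the standard definition of Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The function name and the "Yes"/"No" labels in the usage note are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)
```

For example, `cohens_kappa(["Yes", "Yes", "No", "No"], ["Yes", "No", "No", "No"])` has observed agreement 0.75 and chance agreement 0.5, so kappa is 0.5. Collecting a second annotator's answers on even a small subset would be enough to report this number.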

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or self-referential predictions

Full rationale

This is an empirical benchmark paper that introduces GenExam with 1,000 curated samples, a four-level taxonomy, ground-truth images, and fine-grained scoring points, then reports performance of 17 models. The abstract and provided text contain no equations, no claimed first-principles derivations, no predictions that reduce to fitted parameters, and no load-bearing self-citations that justify uniqueness or ansatzes. The central claim (open-source models lag closed-source ones on integrated understanding/reasoning/generation) rests on direct experimental comparison rather than any reduction to the benchmark's own inputs by construction. No steps satisfy the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This paper introduces an evaluation benchmark rather than a theoretical model. It contains no free parameters, mathematical axioms, or newly postulated entities.




Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.

  2. PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks

    cs.CV 2026-02 unverdicted novelty 7.0

    PlanViz is a new benchmark with three sub-tasks and PlanScore metric to evaluate planning-oriented image generation and editing by unified multimodal models for computer-use tasks.

  3. How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning

    cs.CL 2026-03 unverdicted novelty 6.0

    Reinforcement learning with three causal constraints enables multimodal models to internalize diagram-reasoning links in geometry, unlike SFT which only mimics surface format and harms performance.
