arxiv: 2605.31212 · v1 · pith:FOV2E7PYnew · submitted 2026-05-29 · 💻 cs.CV · cs.AI · cs.CL

Benchmarking and Enhancing Text-to-Image Models for Generating Visual Representations in Early Arithmetic Education

Pith reviewed 2026-06-28 23:10 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords text-to-image generation · educational AI · arithmetic education · benchmark · visual representations · equation-to-visual · AI content creation

The pith

Current text-to-image models often fail to create accurate visual aids from arithmetic equations for teaching young students.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces equation-to-visual generation, a task where models must produce images from arithmetic equations that preserve exact numerical values and relational structures. Informed by teacher interviews and analysis of classroom materials, the authors build E2V-Bench, a benchmark covering four visual types with automatic metrics that score whether generated images correctly represent the equations. Evaluation of recent models shows frequent failures, mainly from producing the wrong number of objects or breaking the intended connections between them. The authors test several benchmark-guided enhancement strategies that raise performance on representative models.
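A minimal sketch of what a quantity check for this task might look like, assuming addition-only equations of the form `a + b = c` and assuming an upstream detector has already grouped and counted the objects in a generated image. The names `parse_addition`, `quantity_correct`, and `detected_groups` are hypothetical; this summary does not give the paper's actual metric definitions.

```python
from __future__ import annotations

import re
from collections import Counter


def parse_addition(equation: str) -> tuple[list[int], int]:
    """Parse 'a + b = c' into ([a, b], c); reject anything else."""
    match = re.fullmatch(r"\s*(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)\s*", equation)
    if match is None:
        raise ValueError(f"unsupported equation: {equation!r}")
    a, b, c = (int(group) for group in match.groups())
    if a + b != c:
        raise ValueError(f"equation does not hold: {equation!r}")
    return [a, b], c


def quantity_correct(equation: str, detected_groups: list[int]) -> bool:
    """True when the detected group sizes equal the operands, in any order.

    detected_groups is an assumed detector output: one object count per
    visually separated group in the generated image.
    """
    operands, _ = parse_addition(equation)
    return Counter(detected_groups) == Counter(operands)
```

For the equation shown in Figure 1, `quantity_correct("6 + 7 = 13", [7, 6])` holds, while `[6, 8]` fails because one group has the wrong number of objects. A relational check (grouping, spacing, containment) would have to be layered on top of this.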

Core claim

Recent text-to-image models frequently fail on equation-to-visual generation, with errors dominated by incorrect object counts and broken relational structure. Benchmark-guided enhancement strategies improve representative models, while the remaining gap calls for stronger numerical and relational grounding in future T2I models.

What carries the argument

E2V-Bench, a benchmark spanning four pedagogically grounded visual types together with automatic metrics for visual correctness.

If this is right

  • T2I models must develop stronger mechanisms for counting objects and maintaining relational structure when generating from equations.
  • Benchmark-guided strategies can measurably raise performance on the four visual types.
  • The task demands precise preservation of both numerical values and relational structure, unlike standard image generation.
  • A persistent performance gap remains even after enhancements, indicating the need for improved numerical and relational capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same counting and relation failures would likely appear in other early-education domains such as simple science diagrams.
  • Automatic metrics developed here could be adapted to check generated visuals for other structured educational content.
  • Future work could test whether the identified error patterns persist when models are trained on larger amounts of equation-image pairs.

Load-bearing premise

The four visual types and automatic metrics in E2V-Bench match what teachers judge as meaningful representations for early arithmetic.

What would settle it

A direct comparison in which teachers rate the same set of generated images for correctness using the same criteria as the automatic metrics, revealing whether the metric scores align with human judgments.
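One way such a comparison could be scored is chance-corrected agreement between the automatic metric and teacher ratings. The sketch below computes Cohen's kappa for two lists of binary judgments (1 = correct, 0 = incorrect) on the same images. The function name and the choice of kappa are assumptions made for illustration, not something the paper reports.

```python
from __future__ import annotations


def cohens_kappa(auto: list[int], human: list[int]) -> float:
    """Cohen's kappa for two binary raters judging the same items."""
    if len(auto) != len(human) or not auto:
        raise ValueError("need two non-empty lists of equal length")
    n = len(auto)
    # Fraction of items on which the two raters agree.
    observed = sum(a == h for a, h in zip(auto, human)) / n
    # Agreement expected by chance from each rater's marginal rate of 1s.
    p_auto = sum(auto) / n
    p_human = sum(human) / n
    expected = p_auto * p_human + (1 - p_auto) * (1 - p_human)
    if expected == 1.0:
        # Both raters gave the same constant label; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Values near 1 would indicate that the automatic scores track teacher judgments; values near 0 would indicate agreement no better than chance.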

Figures

Figures reproduced from arXiv: 2605.31212 by April Yi Wang, Boqi Chen, Heejin Do, Junling Wang, Mrinmaya Sachan, Mubashara Akhtar.

Figure 1
Figure 1: Failure cases of text-to-image models. All images are generated from the equation “6 + 7 = 13” using four visual types (e.g., spatial-based; see Sec. 3).
… and backgrounds, rather than used as fixed illustrations (Singh et al., 2023; Lee et al., 2025). However, manual visual creation is time-consuming and cannot support the immediate, adaptive scaffolding required in such contexts (Xu et al., 2021; Kaitera…
Figure 2
Figure 2: Standardized equation-to-VD layer used in E2V-Bench. For each equation, we use an LLM to generate four VDs, one for each visual type. These VDs are used as prompts for T2I models to produce visuals.
… for training and 300 for testing (statistics in Tab. 4), evenly balanced across the four visual types. 4.2 Evaluation Metrics: We evaluate model outputs using two criteria informed by prior work and interviews …
Figure 3
Figure 3: Overall accuracy across quantity ranges. Model performance generally declines as object counts increase, highlighting quantity control as a major challenge in E2V generation.
Diffusion-based models perform poorly overall: Stable Diffusion-3.5-large achieves low overall accuracy, while Flux.1-dev shows modest improvements but still struggles to reliably control object counts. Layout-to-image models show …
Figure 4
Figure 4: Overall accuracy across visual types. Performance varies by visual type, showing that grouping cues introduce different levels of difficulty for T2I models.
Figure 5
Figure 5: Bagel training pipeline. The training process begins with SFT on filtered GPT-Image-1 generated data, continues with SFT on a mixed dataset of synthetic data and GPT-Image-1 data, and concludes with iterative rejection-sampling supervised fine-tuning.
… generative diffusion architecture, making it more robust for further enhancement. This is consistent with prior findings on the limitations of diffusion mode…
Figure 6
Figure 6: Bagel performance across training stages. Overall accuracy improves after synthetic augmentation and RSFT, showing gains from structured data curation. Star indicates the best-performing checkpoint.
To verify that the gains were not simply due to additional training and did not merely reflect closer alignment to the automatic evaluation metric, we conducted two additional analyses, with component-level ab…
Figure 8
Figure 8: Example of a visual generated by GPT-Image
Figure 9
Figure 9: Example of a visual generated by GPT-Image
Figure 10
Figure 10: Overall accuracy across quantity ranges for cartoon-style visuals. The corresponding realistic-style results are shown in …
Figure 12
Figure 12: Human evaluation interface. Annotators compared each generated visual against the corresponding ground-truth educational visual and assigned binary judgments for Quantity Accuracy and Overall Accuracy. Clicking the button indicates a correct judgment, while leaving it unclicked indicates an incorrect one. Visuals are shown in random order for each question.
… single visual could exhibit multiple problems, e…
Figure 13
Figure 13: Example visuals from four T2I models on the E2V-Bench task. Each column contains one of the four …
Figure 14
Figure 14: Distribution of manually coded error types
Figure 15
Figure 15: Example visuals generated by the DSL-based pipeline across different visual types and styles.
Original abstract

AI systems are increasingly used to support educational content creation, yet it remains unclear whether they can generate outputs that faithfully represent the pedagogical concepts they are intended to teach. Thus, we introduce equation-to-visual generation, a task that, in contrast to conventional image generation, requires producing pedagogically meaningful visuals from arithmetic equations while precisely preserving their numerical and relational structure. Informed by interviews with teachers and an analysis of educational materials, we construct E2V-Bench, a benchmark spanning four pedagogically grounded visual types, along with automatic metrics for evaluating visual correctness. Our evaluation reveals that recent text-to-image (T2I) models frequently fail on this task, with errors dominated by incorrect object counts and broken relational structure. Building on this, we explore benchmark-guided enhancement strategies. These strategies improve representative models, while the remaining gap calls for stronger numerical and relational grounding in future T2I models.
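The Figure 5 caption names iterative rejection-sampling supervised fine-tuning as one enhancement stage. The sketch below shows only the data-filtering step of one such round, under stated assumptions: `generate` and `is_correct` are hypothetical callables standing in for a T2I model and a correctness check, and `num_candidates` is an illustrative budget. Whether the paper filters with its benchmark metrics in exactly this way is not stated in this summary.

```python
from __future__ import annotations

from typing import Any, Callable


def rejection_sample(
    prompts: list[str],
    generate: Callable[[str], Any],
    is_correct: Callable[[str, Any], bool],
    num_candidates: int = 4,
) -> list[tuple[str, Any]]:
    """Keep the first candidate per prompt that passes the check.

    Prompts with no passing candidate within the budget contribute
    nothing to the returned fine-tuning set.
    """
    accepted: list[tuple[str, Any]] = []
    for prompt in prompts:
        for _ in range(num_candidates):
            image = generate(prompt)
            if is_correct(prompt, image):
                accepted.append((prompt, image))
                break
    return accepted
```

In an iterative setup, the accepted pairs would be used for a round of supervised fine-tuning, and the updated model would then serve as `generate` in the next round.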

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the equation-to-visual (E2V) generation task for producing pedagogically meaningful visuals from arithmetic equations. It constructs E2V-Bench, spanning four visual types informed by teacher interviews and educational materials, along with automatic metrics. Evaluation of recent T2I models shows frequent failures dominated by incorrect object counts and broken relational structure; benchmark-guided enhancements improve representative models but a gap remains, calling for stronger numerical and relational grounding in future T2I systems.

Significance. If the automatic metrics prove to align with teacher judgments of pedagogical correctness, the work identifies a concrete limitation of current T2I models in educational settings and provides a benchmark to drive progress on numerical and relational fidelity. The teacher-informed construction of the benchmark is a strength that grounds the evaluation in domain needs.

major comments (2)
  1. [E2V-Bench construction and automatic metrics] The central claim that T2I model errors are 'dominated by incorrect object counts and broken relational structure' (Abstract) depends on E2V-Bench metrics faithfully measuring pedagogical value. No quantitative validation is reported (e.g., correlation of auto scores with teacher ratings of model outputs or inter-rater reliability across the four visual types), leaving open the possibility that the metrics over-weight count/relation failures that teachers tolerate or miss other critical issues.
  2. [Evaluation and enhancement sections] Abstract states evaluation outcomes and improvement strategies but provides no details on metric definitions, dataset construction, statistical significance testing, or exact enhancement methods. This absence makes the performance claims and the 'remaining gap' conclusion difficult to assess from the provided text.
minor comments (1)
  1. [Abstract] Abstract could more explicitly note the absence of human correlation studies for the automatic metrics to set reader expectations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where the manuscript can be strengthened. We address each major comment below, indicating planned revisions where appropriate.

  1. Referee: [E2V-Bench construction and automatic metrics] The central claim that T2I model errors are 'dominated by incorrect object counts and broken relational structure' (Abstract) depends on E2V-Bench metrics faithfully measuring pedagogical value. No quantitative validation is reported (e.g., correlation of auto scores with teacher ratings of model outputs or inter-rater reliability across the four visual types), leaving open the possibility that the metrics over-weight count/relation failures that teachers tolerate or miss other critical issues.

    Authors: We agree that direct quantitative validation, such as correlation with teacher ratings of generated outputs, would provide stronger evidence for the metrics' alignment with pedagogical value. The metrics were constructed based on teacher interviews and educational material analysis to target numerical accuracy and relational structure as primary failure modes. However, the current version does not include teacher ratings of model outputs or inter-rater reliability statistics. In revision, we will expand the benchmark construction section to include a more explicit discussion of the design process, add a limitations paragraph acknowledging the absence of correlation analysis, and outline a plan for future teacher validation studies. This addresses the concern without overstating current evidence. revision: partial

  2. Referee: [Evaluation and enhancement sections] Abstract states evaluation outcomes and improvement strategies but provides no details on metric definitions, dataset construction, statistical significance testing, or exact enhancement methods. This absence makes the performance claims and the 'remaining gap' conclusion difficult to assess from the provided text.

    Authors: The abstract summarizes results concisely by design, while the full manuscript contains dedicated sections on E2V-Bench (with metric definitions and dataset construction details), the evaluation setup, and the benchmark-guided enhancements. To improve assessability, we will revise the manuscript to include a summary table of metrics and enhancements, report statistical significance where applicable, and provide precise parameter details for the enhancement strategies. These additions will make the claims easier to evaluate without altering the core findings. revision: yes
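A sketch of how significance could be reported when comparing a base and an enhanced model on the same benchmark items: a paired bootstrap over per-item binary correctness. The function name, the resample count, and the 95% interval are illustrative assumptions; the rebuttal does not say which test the authors would use.

```python
from __future__ import annotations

import random


def paired_bootstrap_diff(
    base: list[int],
    enhanced: list[int],
    num_resamples: int = 2000,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (observed accuracy gain, 95% CI lower, 95% CI upper).

    base[i] and enhanced[i] are 0/1 correctness on the same item i,
    so each resample draws items, not independent model outputs.
    """
    if len(base) != len(enhanced) or not base:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    n = len(base)
    diffs = []
    for _ in range(num_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(enhanced[i] - base[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * num_resamples)]
    upper = diffs[int(0.975 * num_resamples) - 1]
    observed = (sum(enhanced) - sum(base)) / n
    return observed, lower, upper
```

An interval that excludes zero would support the claim that an enhancement helps on this item set; it says nothing about whether the metric itself matches teacher judgment.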

Circularity Check

0 steps flagged

No significant circularity; benchmark and evaluations are externally grounded

The paper introduces E2V-Bench as a newly constructed benchmark informed by teacher interviews and educational material analysis, then applies it to evaluate external T2I models and explore enhancements. No equations, fitted parameters, self-citations, or ansatzes are described that would reduce any claim to its own inputs by construction. The central findings (model failures on count/relation errors) follow from applying the benchmark metrics to outside models rather than tautological redefinition. This is the most common honest outcome for a benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that teacher interviews and material analysis yield valid visual categories and metrics; no free parameters or invented entities are described.

axioms (1)
  • domain assumption: Teacher interviews and analysis of educational materials produce four visual types that faithfully represent pedagogical requirements for early arithmetic.
    Stated as the basis for constructing E2V-Bench.

pith-pipeline@v0.9.1-grok · 5704 in / 1219 out tokens · 13110 ms · 2026-06-28T23:10:58.465099+00:00 · methodology


Reference graph

Works this paper leans on

16 extracted references · 4 canonical work pages · 4 internal anchors

  1. [1]

    Emerging Properties in Unified Multimodal Pretraining

    Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683. Utkarsh Dwivedi, Nitendra Rajput, Prasenjit Dey, and Blessin Varkey. 2017. Visualmath: An automated visualization system for understanding math word-problems. In Companion Proceedings of the 22nd International Conference on Intelligent User Interfaces, pages 105–108. ...

  2. [2]

    In Forty-first International Conference on Machine Learning

    Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning. Maria Evagorou, Sibel Erduran, and Terhi Mäntylä

  3. [3]

    International Journal of STEM Education, 2:1–13

    The role of visual representations in scientific practices: from conceptual understanding and knowledge generation to ‘seeing’ how science works. International Journal of STEM Education, 2:1–13. flaticon. 2025. flaticon. https://www.flaticon.com. [Accessed 31-12-2025]. fun2dolabs. 2025. fun2dolabs. https://fun2dolabs.com. [Accessed 22-12-2025]. Hanan Ga...

  4. [4]

    John Hoven and Barry Garelick

    The effects of stimulus type on performance in a color-form sorting task with preschool, kindergarten, first-grade, and third-grade children. Child Development, pages 177–191. John Hoven and Barry Garelick. 2007. Singapore math: Simple or complex? Educational Leadership, 65(3):28. Susanna Kaitera and Sari Harmoinen. 2022. Developing mathematical problem...

  5. [5]

    Featured Certification

    LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. Transactions on Machine Learning Research. Featured Certification. Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. 2025. Grounding dino: Marrying...

  6. [6]

    Ministry of General Education and Instruction, Republic of South Sudan

    College student web use, perceptions of information credibility, and verification behavior. Computers & Education, 41(3):271–290. Ministry of General Education and Instruction, Republic of South Sudan. 2018. Primary Mathematics: Pupil’s Book 2. Mountain Top Publishers Ltd., Nairobi, Kenya. Funded by the Global Partnership for Education. Wenyi Mo, Ti...

  7. [7]

    How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

    Young infants readily use proximity to organize visual pattern information. Acta Psychologica, 127(2):289–298. Rahul Ramachandran, Ali Garjani, Roman Bachmann, Andrei Atanov, Oğuzhan Fatih Kar, and Amir Zamir. 2025. How well does gpt-4o understand vision? evaluating multimodal foundation models on standard computer vision tasks. arXiv preprint arXiv...

  8. [8]

    Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset

    Enhancing textbooks with visuals from the web for improved learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 11931–11944, Singapore. Association for Computational Linguistics. Marian Small and Amy Lin. 2025. Eyes on math: A visual approach to teaching math concepts. Teachers College Press. Junling W...

  9. [9]

    Show-o2: Improved Native Unified Multimodal Models

    Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564. Yi Xu, Roger Smeets, and Rafael Bidarra. 2021. Procedural generation of problems for elementary math education. International Journal of Serious Games, 8(2):49–66. Shih-Ying Yeh, Yu-Guan Hsieh, Zhidong Gao, Bernard BW Yang, Giyeong Oh, and Yanmin Gong

  10. [10]

    Create a cartoon style image to visualize this equation:3+4=7

    Navigating text-to-image customization: From lycoris fine-tuning to model evaluation. In The Twelfth International Conference on Learning Representations. A Details of Thematic Analysis of Visual Types. A.1 Procedure. We conducted a thematic analysis to identify recurring visual types from visuals collected across six educational sources, including three...

  11. [11]

    You can draw limited amount of other objects to make the whole image realistic, but the quantity of objects specified in the prompt should be accurate

  12. [12]

    Bounding box should reflect the shape of the object, and the object mentioned in the prompt should be the focus of the image and their bounding box should be BIG for visualization

  13. [13]

    If the prompt involve same type of objects in different color, group objects of the same color together

    If not specified in the prompt, make sure same type of objects are grouping together. If the prompt involve same type of objects in different color, group objects of the same color together

  14. [14]

    If there are too many objects, you can use a top-down view as indicated in the Background prompt

    Please place the bounding boxes in a natural and spatially sensible way: for example, objects should not be floating in the air. If there are too many objects, you can use a top-down view as indicated in the Background prompt. Similarly, if the objects are inside a container, you may also use a top view to make both the container and the objects visible

  15. [15]

    Example:

    Make sure no bounding box exceeds the image boundary. Example:

  16. [16]

    A short distance away, there are eight green balloons also floating

    Input prompt: There are five balloons floating in the air. A short distance away, there are eight green balloons also floating. Output Bounding Box:[('balloon', [8, 62, 95, 100]), ('balloon', [115, 62, 96, 102]), ('balloon', [9, 177, 93, 98]), ('balloon', [118, 176, 96, 101]), ('balloon', [14, 293, 97, 97]), ('balloon', [294, 27, 97, 103]), ('balloon', [4...