MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
Pith reviewed 2026-05-10 11:31 UTC · model grok-4.3
The pith
MM-WebAgent coordinates hierarchical planning and self-reflection to produce coherent multimodal webpages from isolated AI elements.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MM-WebAgent is a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. It jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. Experiments on a new benchmark with multi-level evaluation demonstrate that it outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration.
What carries the argument
Hierarchical agentic framework that uses planning and iterative self-reflection to coordinate global layout with local multimodal content generation and integration.
If this is right
- Webpage generation achieves higher visual consistency by treating layout and content as a single optimization problem rather than separate steps.
- Multimodal elements integrate more effectively into the overall design than when produced in isolation.
- The method surpasses code-generation and standard agent baselines on metrics for element quality and coherence.
- A new benchmark and multi-level evaluation protocol become available for systematic testing of multimodal webpage systems.
- Automated design workflows require fewer post-processing fixes to reach usable results.
Where Pith is reading between the lines
- The same planning-plus-reflection pattern could transfer to consistency challenges in generating mobile app screens or interactive documents.
- Chained generative pipelines in other creative fields might adopt similar self-correction loops to reduce visible artifacts.
- Wider use could lower the barrier for non-experts to produce polished web interfaces without design expertise.
- Testing on pages with heavy dynamic content or strict accessibility constraints would reveal whether the overhead remains acceptable.
Load-bearing premise
Hierarchical planning combined with iterative self-reflection can reliably resolve style inconsistency and poor global coherence from isolated AIGC element generation without introducing new artifacts or excessive overhead.
What would settle it
A set of MM-WebAgent-generated webpages that still exhibit mismatched visual styles across elements or incoherent overall layouts would demonstrate that the coordination step fails to deliver the claimed consistency.
read the original abstract
The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code & Data: https://aka.ms/mm-webagent.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation. It addresses style inconsistency and poor global coherence from isolated AIGC element generation by coordinating via hierarchical planning and iterative self-reflection, jointly optimizing global layout, local multimodal content, and their integration to produce coherent webpages. The authors introduce a new benchmark for multimodal webpage generation and a multi-level evaluation protocol, with experiments showing outperformance over code-generation and agent-based baselines, especially on multimodal element generation and integration.
Significance. If the empirical results and framework details hold under scrutiny, this could meaningfully advance automated UI/UX design by providing a structured coordination mechanism for AIGC tools. The benchmark and multi-level protocol represent a concrete contribution that could enable standardized evaluation in the area. Open-sourcing of code and data strengthens potential impact and reproducibility.
major comments (2)
- [§3 (Framework)] The central claim that hierarchical planning plus iterative self-reflection reliably resolves style inconsistency and poor global coherence (without introducing new artifacts or excessive overhead) is load-bearing but rests on an assumption whose validation is not yet visible. The framework description must include concrete implementation details of the reflection loop, decision criteria, and any safeguards against artifact introduction.
- [§4 (Experiments and Evaluation)] The outperformance claim, especially on multimodal element generation and integration, cannot be assessed without the concrete metrics, error bars, baseline implementations, and benchmark construction details. The multi-level evaluation protocol needs explicit definitions of coherence scores and how they aggregate across levels.
minor comments (2)
- [Abstract and §2] The abstract states the approach and claims but defers all methodological and evaluation specifics to the full text; ensure the introduction and related-work sections explicitly contrast the hierarchical mechanism against prior agentic and code-generation baselines.
- [Code & Data link] Confirm that the provided code repository includes the exact benchmark datasets, prompt templates, and evaluation scripts used in the reported experiments.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and detailed comments on our manuscript. We appreciate the opportunity to clarify the framework details and evaluation protocol. We address each major comment below and indicate the planned revisions.
read point-by-point responses
Referee: [§3 (Framework)] The central claim that hierarchical planning plus iterative self-reflection reliably resolves style inconsistency and poor global coherence (without introducing new artifacts or excessive overhead) is load-bearing but rests on an assumption whose validation is not yet visible. The framework description must include concrete implementation details of the reflection loop, decision criteria, and any safeguards against artifact introduction.
Authors: We agree that the current Section 3 provides a high-level description of the hierarchical planning and iterative self-reflection mechanisms. In the revision, we will expand this section with concrete implementation details, including: (1) the exact structure of the reflection loop (e.g., number of iterations, how it alternates between global layout and local element refinement); (2) decision criteria for accepting/revising outputs (e.g., thresholds on style consistency scores computed via CLIP-based similarity and layout overlap metrics); and (3) safeguards such as style reference injection, artifact detection via perceptual loss checks, and fallback to previous valid states. We will also add pseudocode and a detailed diagram. These additions will make the validation of our central claim explicit without altering the core approach. revision: yes
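The accept/revise loop the rebuttal outlines can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the 0.8 threshold, and the iteration cap are all hypothetical; the score callback stands in for the CLIP-based style-consistency check the authors mention.

```python
def reflect(generate, score, max_iters=3, threshold=0.8):
    """Hypothetical reflection loop: regenerate with critique until a
    style-consistency score clears a threshold, keeping the best-scoring
    output as a fallback so the loop never regresses to a worse state."""
    best_output, best_score = None, float("-inf")
    feedback = None
    for _ in range(max_iters):
        output = generate(feedback)   # regenerate, conditioned on critique
        s = score(output)             # e.g. a CLIP-based style similarity
        if s > best_score:            # safeguard: retain last valid state
            best_output, best_score = output, s
        if s >= threshold:            # decision criterion: accept and stop
            break
        feedback = f"style score {s:.2f} below {threshold}; fix inconsistencies"
    return best_output, best_score
```

The fallback-to-best-state step is the kind of safeguard against artifact introduction that the referee asks the authors to make explicit.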
Referee: [§4 (Experiments and Evaluation)] The outperformance claim, especially on multimodal element generation and integration, cannot be assessed without the concrete metrics, error bars, baseline implementations, and benchmark construction details. The multi-level evaluation protocol needs explicit definitions of coherence scores and how they aggregate across levels.
Authors: We acknowledge that the current presentation of results in Section 4 would benefit from greater explicitness. In the revised version, we will: (1) report all concrete metrics with numerical values and error bars from 5 independent runs; (2) provide detailed descriptions of baseline implementations, including prompt templates and adaptation steps for code-generation (e.g., GPT-4 with HTML/CSS output) and agent-based methods; (3) include full benchmark construction details (dataset size, source of webpages, annotation process for multimodal elements); and (4) explicitly define coherence scores at each level (global layout coherence via IoU and alignment metrics, local content coherence via multimodal similarity, integration coherence via cross-element consistency) along with the aggregation formula (weighted sum with weights 0.4/0.3/0.3). These will be added to the main text and appendix. revision: yes
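The aggregation the authors propose is a plain weighted sum. A minimal sketch: the 0.4/0.3/0.3 weights come from the rebuttal itself, while the level names and example values are illustrative assumptions.

```python
# Weighted aggregation of the three coherence levels named in the rebuttal.
# Weights (0.4/0.3/0.3) are quoted from the authors' response; the level
# keys and the example scores below are illustrative.
WEIGHTS = {"global_layout": 0.4, "local_content": 0.3, "integration": 0.3}

def aggregate(scores: dict) -> float:
    """Weighted sum of per-level coherence scores, each assumed in [0, 1]."""
    return sum(WEIGHTS[level] * scores[level] for level in WEIGHTS)

# Example: a page strong on layout but weaker on integration.
overall = aggregate({"global_layout": 0.9, "local_content": 0.8, "integration": 0.6})
# 0.4*0.9 + 0.3*0.8 + 0.3*0.6 = 0.78
```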
Circularity Check
No significant circularity in derivation chain
full rationale
The paper proposes a novel hierarchical agentic framework MM-WebAgent that uses planning and iterative self-reflection to coordinate AIGC element generation for coherent webpages. It introduces a new benchmark and multi-level evaluation protocol, then reports experimental outperformance versus code-generation and agent baselines. No equations, fitted parameters renamed as predictions, self-definitional reductions, or load-bearing self-citation chains appear in the provided abstract or description. The central claims rest on external empirical comparisons rather than internal construction from the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Hierarchical planning and iterative self-reflection in agents can effectively coordinate isolated multimodal generations to achieve global coherence.
invented entities (1)
- MM-WebAgent (no independent evidence)
Appendix prompt templates
The paper's supplementary material specifies the planner's tools and the evaluation prompts. The recoverable content:
- Planner tools: code_generation (generates the HTML layout of the webpage), image_generation (runs after layout planning, on image placeholders extracted from the plan), video_generation (generates videos from detailed visual descriptions), and data_visualization (renders a user-provided dataset as an ECharts HTML file and integrates it into the webpage).
- Planning guidelines: every visual element carries an explicit asset reference, e.g. "Hero image shows a cozy café interior at sunrise (path: assets/hero_cafe.png, width: 1200px, height: 600px)"; extraction steps then emit save_path entries consistent with the code references (.png for images, .mp4 for videos, .html for charts), each with section context.
- Code generation constraints: output only valid HTML with no explanations, comments, or markdown formatting, directly savable as a .html file and openable in a browser; strictly preserve all (path: xxx) references in the corresponding <img> elements, <source> tags, or CSS background-image URLs; include minimal inline CSS or internal <style> tags so the page faithfully reflects the described sections, visual hierarchy, color palette, and typography.
- Chart constraints: render with ECharts, use a transparent background so the chart blends into the host page, fill the container or iframe responsively with no margins or padding, and include no layout chrome (no header, footer, captions, or descriptive text).
- Evaluation prompts: layout and style evaluators take the user design prompt, the generated HTML, and a rendered screenshot, output all detected issues, and assign penalty values; an aesthetic rubric scores five dimensions (layout balance and spacing, typography and readability, color harmony and hierarchy, visual clarity and polish, overall professional aesthetic) on a 0.2 to 1.0 scale.
- Element extraction prompts: extract image, video, and chart descriptions verbatim into strict JSON of the form {"image": [...], "video": [...], "chart": [...]}; include any provided dataset in full as markdown; do not extract icons, abstract style moods ("minimalist look"), or layout-relative phrases ("to the left").
- Self-reflection prompts: compare the extracted multimodal elements against already-generated assets to identify missing ones; per-image checks compare the asset, a cropped screenshot of it as rendered in the page, and the relevant HTML/CSS excerpt, distinguish image issues (wrong style, artifacts, watermarks) from embedding issues (cropping, clipping, alignment), score from 1.0 minus 0.2 per distinct issue with a floor of 0, and propose fixes in two categories (edit the image vs. fix the code); per-video checks apply the same procedure to extracted video frames.
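The appendix's per-image scoring rule (start from 1.0, subtract 0.2 per distinct issue, never go below 0) is simple enough to state directly; a small sketch, with an illustrative issue list:

```python
# Sketch of the appendix's image-check scoring rule: start from 1.0 and
# subtract 0.2 for each distinct detected issue, clamped at 0.
# The issue strings here are illustrative, not from the paper.
def image_score(issues: list) -> float:
    """Score an embedded image given its list of detected issues."""
    return max(0.0, 1.0 - 0.2 * len(issues))

clean = image_score([])                           # no issues: full score
flawed = image_score(["watermark", "bad crop"])   # two issues deducted
```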