pith. machine review for the scientific record.

arxiv: 2604.13648 · v1 · submitted 2026-04-15 · 💻 cs.SE

Recognition: unknown

Figma2Code: Automating Multimodal Design to Code in the Wild

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 13:15 UTC · model grok-4.3

classification 💻 cs.SE
keywords: design to code · Figma · multimodal large language models · UI code generation · layout responsiveness · code maintainability · front-end automation

The pith

Figma metadata helps models match designs visually, yet they still produce code with poor layout responsiveness and low maintainability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Figma2Code as a multimodal task that converts real Figma design files into production UI code by using both images and embedded metadata. Current image-only methods miss critical layout rules and asset details that Figma files already contain, which raises the cost and error rate of front-end work. The authors built a dataset of 3,055 processed samples and curated 213 high-quality cases, then tested ten open-source and proprietary multimodal models on them. Results indicate that proprietary models copy visual appearance more accurately but still fail to generate code that adapts across screen sizes or remains easy for developers to edit. Ablation tests confirm the gap arises because models tend to copy raw visual attributes instead of inferring structural relationships.

Core claim

Incorporating Figma metadata alongside images advances design-to-code automation, yet even the strongest proprietary multimodal models remain limited in producing responsive layouts and maintainable code, largely because they directly map primitive visual attributes from the metadata rather than reasoning about UI structure.

What carries the argument

The Figma2Code task and its 213-case curated dataset, which pairs design images with metadata files and evaluates models on visual fidelity, layout responsiveness, and code maintainability.

Load-bearing premise

That the 213 high-quality cases drawn from the 3,055 processed Figma samples represent typical real-world usage, and that the judgments of layout responsiveness and code maintainability are consistent and reproducible.

What would settle it

Re-running the ten models on an independent set of 500 new Figma files from varied domains and measuring responsiveness with automated layout checks would falsify the claimed limitations if scores rise substantially.
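
One way to operationalize "automated layout checks" is sketched below. It assumes details the page does not specify: each generated page is a standalone HTML file under a `generated_pages/` directory, Playwright with Chromium is installed, and horizontal overflow at a few common viewport widths stands in for responsiveness.

```python
# Sketch of an automated layout-responsiveness check. The viewport widths and
# the overflow heuristic are illustrative choices, not the paper's protocol.
from pathlib import Path
from playwright.sync_api import sync_playwright

VIEWPORT_WIDTHS = [375, 768, 1440]  # phone, tablet, desktop (assumed breakpoints)

def horizontal_overflow(page) -> int:
    """Pixels by which the document overflows the viewport horizontally."""
    return page.evaluate(
        "document.documentElement.scrollWidth - document.documentElement.clientWidth"
    )

def responsiveness_report(html_path: Path) -> dict:
    """Render one generated page at several widths and record the overflow."""
    report = {}
    with sync_playwright() as p:
        browser = p.chromium.launch()
        for width in VIEWPORT_WIDTHS:
            page = browser.new_page(viewport={"width": width, "height": 900})
            page.goto(html_path.resolve().as_uri())
            report[width] = horizontal_overflow(page)
            page.close()
        browser.close()
    return report

if __name__ == "__main__":
    for path in sorted(Path("generated_pages").glob("*.html")):  # assumed layout
        print(path.name, responsiveness_report(path))
```

A page that keeps near-zero overflow across all three widths passes this crude check; persistent overflow at narrow widths is the failure mode the review attributes to fixed-pixel layouts.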

Figures

Figures reproduced from arXiv: 2604.13648 by Dongping Chen, Hai Jin, Jiawan Zhang, Philip S Yu, Shilin He, Tianran Ma, Wenbin Jiang, Xuanhua Shi, Yao Wan, Yi Gui, Yina Wang, Zhou Zhao.

Figure 1: Comparison of UI generation from image-only and multimodal (Figma) design.
Figure 2: The pipeline of constructing the FIGMA2CODE dataset.
Figure 4: Estimated token distribution over differ…
Figure 5: Ablation study on five metadata components, using the ERNIE 4.5 424B VL model.
Figure 6: HTML snippet generated by Grok4 (left) and the golden implementation (right).
Figure 7: Prompt for using GPT-4o in additional annotation.
Figure 8: The custom-built interface for manual annotation of design pages.
Figure 9: Distribution of dataset samples across multiple attributes, including platform, complexity, …
Figure 10: Estimated total token length distribution.
Figure 12: Node counts distribution (node types: FRAME, RECTANGLE, TEXT, INSTANCE, GROUP).
Figure 14: Estimated metadata size distribution.
Figure 16: Different content type distribution across the dataset.
Figure 17: Word cloud of the most frequent terms in the dataset.
Figure 18: System prompt for the JSON-only modality, illustrating the explicit definition of input…
Figure 19: System prompt for the multi-modal (JSON + Screenshot) modality, illustrating the explicit…
Figure 20: The prompt used to instruct the Critic persona for self-evaluation.
Figure 21: The prompt used to instruct the Refiner persona for code correction and improvement.
Figure 22: Scatter plot between pixel-level Mean Absolute Error (MAE)…
Figure 23: Qualitative cases summarizing the relationship between MAE…
Figure 24: Illustration of a Dataset Sample.
Figure 25: Illustration of Metadata Refinement.
Figure 26: Illustration of Samples of different Complexities.
Figures 27–36: Case Studies 1–10.
Original abstract

Front-end development constitutes a substantial portion of software engineering, yet converting design mockups into production-ready User Interface (UI) code remains tedious and costly. While recent work has explored automating this process with Multimodal Large Language Models (MLLMs), existing approaches typically rely solely on design images. As a result, they must infer complex UI details from images alone, often leading to degraded results. In real-world development workflows, however, design mockups are usually delivered as Figma files, a widely used tool for front-end design, that embed rich multimodal information (e.g., metadata and assets) essential for generating high-quality UI. To bridge this gap, we introduce Figma2Code, a new task that advances design-to-code into a multimodal setting and aims to automate design-to-code in the wild. Specifically, we collect paired design images and their corresponding metadata files from the Figma community. We then apply a series of processing operations, including rule-based filtering, human- and MLLM-based annotation and screening, and metadata refinement. This process yields 3,055 samples, from which designers curate a balanced dataset of 213 high-quality cases. Using this dataset, we benchmark ten state-of-the-art open-source and proprietary MLLMs. Our results show that while proprietary models achieve superior visual fidelity, they remain limited in layout responsiveness and code maintainability. Further experiments across modalities and ablation studies corroborate this limitation, partly due to models' tendency to directly map primitive visual attributes from Figma metadata.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. This paper introduces the Figma2Code task to automate converting Figma design mockups (including images and metadata) to UI code using MLLMs. It describes collecting 3,055 samples from the Figma community, processing them with rule-based filtering, annotations, and human screening to create a curated dataset of 213 high-quality cases. The authors benchmark ten MLLMs on this dataset, concluding that proprietary models achieve better visual fidelity but are limited in layout responsiveness and code maintainability due to directly mapping primitive attributes from metadata. Modality experiments and ablations are used to support the findings.

Significance. The work is significant in shifting design-to-code from image-only to multimodal settings, which aligns better with real-world Figma workflows. The dataset curation and benchmarking provide a new resource and baseline for the field. If the limitations identified are confirmed with more rigorous metrics, it could influence the development of better MLLMs for UI generation tasks.

major comments (3)
  1. [Dataset Construction] The description of how the 213 cases were selected from 3,055 samples by designers does not include the specific criteria used for 'high-quality' or any measure of inter-annotator agreement for the screening process. Since the benchmark results and conclusions depend on this subset, this omission affects the reliability of the representativeness claim.
  2. [Experiments] The evaluation of layout responsiveness and code maintainability is described qualitatively without accompanying quantitative proxies (such as counts of media queries, relative CSS units, or code duplication metrics). This makes the central claim about model limitations difficult to assess objectively or reproduce. (A sketch of such proxies follows the comment lists.)
  3. [Ablation Studies] The attribution of limitations to models 'directly map[ping] primitive visual attributes from Figma metadata' is not supported by specific measurements or examples from the ablation studies showing this behavior.
minor comments (2)
  1. [Abstract] The abstract states results show proprietary models are limited but does not provide any specific quantitative metrics or examples to illustrate the limitations.
  2. [References] Ensure all related work on design-to-code is cited, particularly recent MLLM applications in UI generation.
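
To make the second major comment concrete, here is a minimal sketch of the kind of static proxies it asks for. The regexes, the set of relative units, and the duplication heuristic are assumptions chosen for illustration, not metrics defined in the paper; the sketch assumes each model output is a single HTML string with embedded CSS or Tailwind classes.

```python
# Illustrative static proxies for responsiveness and maintainability.
# All patterns are crude, assumed heuristics, not the paper's metrics.
import re
from collections import Counter

def style_proxies(html: str) -> dict:
    # Responsiveness proxies: media queries and relative vs. fixed units.
    media_queries = len(re.findall(r"@media\b", html))
    fixed_px = len(re.findall(r"\b\d+(?:\.\d+)?px\b", html))
    relative = len(re.findall(r"\b\d+(?:\.\d+)?(?:rem|em|vw|vh|fr|%)", html))
    total_units = fixed_px + relative

    # Maintainability proxy: how often an identical declaration block recurs.
    blocks = Counter(m.group(1).strip() for m in re.finditer(r"\{([^{}]+)\}", html))
    duplicated_blocks = sum(n - 1 for n in blocks.values() if n > 1)

    return {
        "media_queries": media_queries,
        "relative_unit_ratio": relative / total_units if total_units else 0.0,
        "duplicated_declaration_blocks": duplicated_blocks,
    }
```

Averaging these per-sample numbers across each model's outputs would give the reproducible comparison the comment asks for.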

Simulated Authors' Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We value the suggestions for enhancing the rigor of our dataset description, experimental evaluations, and ablation analyses. We will update the paper to address these points and provide more detailed explanations and quantitative support where needed.

Point-by-point responses
  1. Referee: [Dataset Construction] The description of how the 213 cases were selected from 3,055 samples by designers does not include the specific criteria used for 'high-quality' or any measure of inter-annotator agreement for the screening process. Since the benchmark results and conclusions depend on this subset, this omission affects the reliability of the representativeness claim.

    Authors: We agree that explicit criteria and inter-annotator agreement would improve transparency. In the revised manuscript, we will expand the Dataset Construction section to detail the specific high-quality selection criteria used by the designers (including UI component diversity, visual complexity, real-world applicability, and code generation feasibility) and report inter-annotator agreement metrics from the human screening process to better substantiate the 213-case subset. revision: yes

  2. Referee: [Experiments] The evaluation of layout responsiveness and code maintainability is described qualitatively without accompanying quantitative proxies (such as counts of media queries, relative CSS units, or code duplication metrics). This makes the central claim about model limitations difficult to assess objectively or reproduce.

    Authors: We acknowledge the value of quantitative proxies for objectivity. In the revised Experiments section, we will introduce and report measurable proxies including average counts of media queries, proportion of relative CSS units (e.g., %, em, vw) versus fixed pixels, and code duplication metrics such as repeated style definitions or class overlaps. These will provide reproducible evidence supporting our claims on responsiveness and maintainability limitations. revision: yes

  3. Referee: [Ablation Studies] The attribution of limitations to models 'directly map[ping] primitive visual attributes from Figma metadata' is not supported by specific measurements or examples from the ablation studies showing this behavior.

    Authors: We recognize the need for more explicit linkage. In the revised Ablation Studies section, we will add concrete code examples demonstrating direct mapping of primitive Figma attributes (e.g., fixed pixel dimensions leading to non-responsive layouts) and include quantitative measurements such as correlations between metadata inclusion and absolute positioning usage across ablation variants to strengthen the attribution (a minimal sketch of such a comparison follows these responses). revision: yes
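
Below is a minimal sketch of the comparison promised in response 3, under assumptions not stated in the paper: "primitive attribute" usage is approximated by a few regex patterns over the generated HTML, and outputs are paired per design (with metadata vs. image-only) across ablation variants.

```python
# Sketch of a with/without-metadata comparison of "primitive attribute" usage.
# The patterns and the pairing of output files are assumptions of this sketch.
import re
from statistics import mean

PRIMITIVE_PATTERNS = [
    r"position\s*:\s*absolute",          # absolute positioning in embedded/inline CSS
    r'class="[^"]*\babsolute\b[^"]*"',   # Tailwind `absolute` utility class
    r"\b(?:w|h|top|left)-\[\d+px\]",     # Tailwind arbitrary fixed-pixel values
]

def primitive_count(html: str) -> int:
    """Number of primitive-attribute occurrences in one generated page."""
    return sum(len(re.findall(p, html)) for p in PRIMITIVE_PATTERNS)

def metadata_effect(pairs: list[tuple[str, str]]) -> float:
    """Mean increase in primitive-attribute usage when Figma metadata is used.

    Each pair holds (html_with_metadata, html_image_only) for the same design.
    """
    return mean(primitive_count(with_md) - primitive_count(img_only)
                for with_md, img_only in pairs)
```

A positive `metadata_effect` across ablation variants would support the claim that models copy fixed positions and sizes straight from the metadata rather than inferring layout structure.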

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with direct observations

full rationale

The paper presents an empirical task definition, dataset curation from Figma files, and benchmarking of ten MLLMs on visual fidelity, layout responsiveness, and code maintainability. No mathematical derivations, equations, fitted parameters, or predictions appear. Results are stated as direct observations from the 213 curated cases and ablation experiments. No self-citations are invoked as load-bearing premises, and no step reduces a claimed result to its own inputs by construction. The reduction from 3,055 to 213 samples and qualitative judgments are methodological choices, not circular derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The paper is entirely empirical and introduces no new mathematical axioms, free parameters, or invented physical entities. The central contribution is a task definition and curated dataset built from existing Figma files and standard MLLM evaluation practices.

invented entities (1)
  • Figma2Code task · no independent evidence
    purpose: To formalize and automate multimodal design-to-code conversion using Figma metadata in addition to images
    Newly defined task that extends image-only design-to-code methods; no independent falsifiable evidence is provided beyond the dataset itself.

pith-pipeline@v0.9.0 · 5608 in / 1361 out tokens · 53861 ms · 2026-05-10T13:15:09.602023+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation

    cs.CV · 2026-05 · accept · novelty 8.0

    Vision2Code is a multi-domain benchmark that evaluates image-to-code generation via rendered outputs scored by a VLM rater with dataset-specific rubrics, revealing domain-dependent model performance and enabling impro...

Reference graph

Works this paper leans on

16 extracted references · 2 canonical work pages · cited by 1 Pith paper · 1 internal anchor
