arxiv: 2606.24333 · v1 · pith:3Y7MRK2Tnew · submitted 2026-06-23 · 💻 cs.CV

UniTranslator: A Unified Multi-modal Framework for End-to-end In-Image Machine Translation

Pith reviewed 2026-06-26 00:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords in-image machine translation · unified multimodal framework · scene text translation · image generation · alignment module · spatial mask decoder · mutual reinforcement

The pith

UniTranslator adds alignment and spatial mask modules to a unified multimodal model to translate and render scene text without conflicts or misalignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UniTranslator as a framework that performs in-image machine translation by combining visual-text understanding and image generation in one system. It targets two problems in existing unified models: inconsistencies between the translation inferred in understanding and the text used in generation, plus inaccurate placement of rendered text. The approach introduces an Understand-Generation Alignment Module to enforce semantic consistency across the two stages and a Spatial Mask Decoder that applies pixel-level supervision to text regions. Experiments on multiple benchmarks show top results across languages and layouts, along with evidence that the understanding and generation tasks reinforce each other.

Core claim

UniTranslator is a unified multimodal framework for end-to-end in-image machine translation that tightly couples translation understanding and text editing; an Understand-Generation Alignment Module bridges the representation gap to encourage semantic consistency between translated content prediction and text rendering, while a Spatial Mask Decoder supplies pixel-level supervision over text regions to improve spatial grounding, geometric alignment, and layout controllability, yielding state-of-the-art performance and a mutual reinforcement effect between understanding and generation.

What carries the argument

Understand-Generation Alignment Module (UGAM) that bridges representation gaps between understanding and generation, together with Spatial Mask Decoder (SMD) that supplies pixel-level supervision over text regions.
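
As a rough illustration of what "pixel-level supervision over text regions" can mean in practice, the sketch below scores a predicted text-region mask against a ground-truth mask. The abstract does not state which loss the SMD uses; the combination of binary cross-entropy and soft Dice, the function name `bce_dice_loss`, and the list-of-lists mask format are all assumptions made for this example.

```python
import math

def bce_dice_loss(pred, target, eps=1e-6):
    """Pixel-level loss over a text-region mask.

    pred:   2D list of predicted probabilities in [0, 1]
    target: 2D list of ground-truth labels (1 = text pixel, 0 = background)
    The BCE + Dice combination is an assumption of this sketch,
    not the loss stated by the paper.
    """
    # Clamp probabilities so that log() never sees 0 or 1.
    flat_p = [min(max(p, eps), 1.0 - eps) for row in pred for p in row]
    flat_t = [float(t) for row in target for t in row]
    if len(flat_p) != len(flat_t):
        raise ValueError("pred and target must have the same shape")

    # Binary cross-entropy averaged over all pixels.
    bce = -sum(t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
               for p, t in zip(flat_p, flat_t)) / len(flat_p)

    # Soft Dice term: 1 - 2|P∩T| / (|P| + |T|). It is less sensitive than BCE
    # to the imbalance between small text regions and large backgrounds.
    inter = sum(p * t for p, t in zip(flat_p, flat_t))
    dice = 1.0 - (2.0 * inter + eps) / (sum(flat_p) + sum(flat_t) + eps)
    return bce + dice

target = [[0, 1], [0, 1]]
good = [[0.1, 0.9], [0.1, 0.9]]   # mask concentrated on the text column
bad = [[0.9, 0.1], [0.9, 0.1]]    # mask concentrated on the background
print(bce_dice_loss(good, target) < bce_dice_loss(bad, target))
```

A mask that covers the text column scores a lower loss than one that covers the background, which is the signal such supervision would feed back into generation.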

If this is right

  • State-of-the-art performance across diverse language directions and complex real-world layouts.
  • Strong mutual reinforcement effect between translation understanding and image generation.
  • Improved semantic consistency between translated content prediction and text rendering.
  • Enhanced geometric alignment and layout controllability during generation.
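
One way to make the semantic-consistency claim checkable is to compare the translation predicted by the understanding branch with the text actually read back from the generated image. The sketch below does this with a character error rate based on Levenshtein distance. It is an illustrative measurement, not a metric taken from the paper: `consistency_cer` is a hypothetical helper, and `rendered_text` is assumed to come from some OCR step that is not shown.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def consistency_cer(predicted_translation, rendered_text):
    """Character error rate of the rendered text against the predicted translation.

    0.0 means the understanding and generation outputs agree exactly.
    """
    if not predicted_translation:
        return 0.0 if not rendered_text else 1.0
    return edit_distance(predicted_translation, rendered_text) / len(predicted_translation)

print(consistency_cer("Exit", "Exit"))  # identical strings
print(consistency_cer("Exit", "Exat"))  # one substituted character out of four
```

A value of 0.0 means the rendered text matches the predicted translation character for character; larger values indicate the kind of understanding-generation conflict the paper targets.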

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Joint training of understanding and generation may transfer to other multimodal editing tasks that require both content prediction and visual output.
  • The spatial supervision approach could be tested on video or live camera feeds where text must be translated and overlaid in real time.
  • If the mutual reinforcement holds, similar unified backbones might benefit from adding alignment and mask-style supervision in non-text domains such as object editing.

Load-bearing premise

Adding the alignment module and spatial mask decoder to an existing unified multimodal backbone will resolve understanding-generation conflicts and spatial misalignment.

What would settle it

Ablation experiments on the benchmarks showing no gain in translation accuracy, consistency, or spatial alignment when the Understand-Generation Alignment Module or Spatial Mask Decoder is removed would falsify the necessity of these components.

Figures

Figures reproduced from arXiv: 2606.24333 by Can Ma, Jiahao Lyu, Jiahui Yang, Jian Luan, Pei Fu, Shaojie Zhang, Yu Zhou, Zhenbo Luo, Zhenhang Li.

Figure 1. (a) Performance comparison of unified multimodal models and specialized image-to-image translation models on the Translatotron-V(ision) benchmark. (b) Illustration of the semantic conflict between the understanding and generation objectives in unified multimodal models for the IIMT task, which could cause the generated output to deviate from the target translation semantics. (c) Illustration of the posit…
Figure 2. Overview of UniTranslator, a unified multimodal framework for in-image machine translation. It introduces UGAM for understand-generation alignment and SMD for spatially grounded text editing.
… existing UMMs still fall short on IIMT, which requires tight coordination between text understanding and image editing, especially for content preservation and style-consistent text rendering. Our work addresses thi…
Figure 3. Qualitative results of zero-shot inference on the AnyTrans [27] dataset. The first row presents Chinese-to-English translation, while the second row presents English-to-German translation.
Figure 4. Qualitative comparison on the Translatotron-V benchmark. The RGB values shown below each image correspond to the background color, providing a direct comparison of the predicted background colors produced by different models.
6 Conclusion. We presented UniTranslator, a unified multi-modal framework for in-image machine translation that jointly optimizes translation understanding and image generation. To r…
Figure 5. Failure cases on highly stylized typography. UniTranslator can usually localize the target region and generate semantically correct translated content, but may fail to fully preserve fine-grained font attributes such as stroke erosion, glow intensity, or cursive deformation.
Panel labels: Source Image, Seedream 5.0, Wan 2.7, UniTranslator.
Figure 6. Failure cases on images with complicated backgrounds. While UniTranslator generally preserves the overall scene structure and text layout, it may still introduce unintended changes in background texture, color, or local appearance in difficult cases.
…out preservation, and source-faithful rendering, challenging cases remain under highly stylized typography and complex backgrounds. As shown in …
Figure 7. Qualitative comparison on the AnyTrans benchmark [27] in zero-shot settings.
D Details of the Structure-BLEU. Structure-BLEU (S-BLEU) is proposed by Translatotron-V [16], aiming to evaluate text location information for IIMT. S-BLEU extends the conventional BLEU metric by incorporating spatial structure information. Instead of directly comparing text sequences, S-BLEU first extracts text instances and th…
Figure 8. Qualitative comparison on the IIMT30k benchmark [39]. Images with red borders denote the source images, images with yellow borders correspond to the ground-truth targets provided in the original dataset, and images with green borders represent the outputs generated by our UniTranslator.
These results indicate that UniTranslator better balances translation correctness, layout alignment, and font-aware vis…
Figure 9. Qualitative comparison on the PRIM benchmark [38]. Red-bordered images denote the source images, green-bordered images represent the outputs generated by our UniTranslator, and yellow-bordered images correspond to the ground-truth targets provided in the dataset. The two ground-truth images are produced by first translating the source text using GPT-4 and Google Translate, respectively, and then rendering …
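
The Figure 7 excerpt describes S-BLEU as an extension of conventional BLEU. For readers unfamiliar with the base metric, the following is a minimal sketch of single-reference, sentence-level BLEU (clipped n-gram precision up to 4-grams with a brevity penalty, as defined by Papineni et al. [26]). It is a simplification: standard BLEU is computed at corpus level, whitespace tokenization is an assumption of this example, and the spatial-structure step of S-BLEU is not reproduced because the excerpt is truncated.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Single-reference sentence BLEU with uniform weights and no smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # unsmoothed BLEU is zero when any precision is zero
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: 1 if the candidate is longer than the reference,
    # otherwise exp(1 - r/c), which equals 1 when the lengths match.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1.0 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

print(sentence_bleu("the cat sat on the mat", "the cat sat on the mat"))
print(sentence_bleu("a b c d", "w x y z"))
```

An exact match scores 1.0 and a candidate with no overlapping n-grams scores 0.0; production evaluations typically add smoothing and language-specific tokenization.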
Original abstract

In-Image Machine Translation (IIMT) aims to translate scene text in an image and render the translated text back into the original regions while preserving the overall visual appearance. Recent unified multimodal models provide a promising solution by combining visual-text understanding and image generation within a single framework. However, directly adapting such models to IIMT remains challenging. In particular, they often suffer from understanding-generation conflicts, where the translation inferred during understanding is inconsistent with the text supervision used in generation, and spatial position misalignment, where the rendered text does not accurately match the target text regions. To address these issues, we present UniTranslator, a unified multimodal framework for IIMT that tightly couples translation understanding and text editing. Specifically, we introduce an Understand-Generation Alignment Module (UGAM) to bridge the representation gap between understanding and generation, encouraging semantic consistency between translated content prediction and text rendering. We further propose a Spatial Mask Decoder (SMD) with pixel-level supervision over text regions to improve spatial grounding, geometric alignment, and layout controllability during generation. Extensive experiments on multiple benchmarks demonstrate that UniTranslator achieves state-of-the-art performance across diverse language directions and complex real-world layouts. Moreover, our results reveal a strong mutual reinforcement effect between translation understanding and image generation, highlighting the advantage of unified translation multimodal learning. Code is available at https://github.com/SeerRay-Lab/Unitranslator.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes UniTranslator, a unified multimodal framework for end-to-end In-Image Machine Translation (IIMT). It identifies two challenges when adapting unified multimodal models—understanding-generation conflicts and spatial position misalignment—and introduces an Understand-Generation Alignment Module (UGAM) to enforce semantic consistency between translated content prediction and text rendering, plus a Spatial Mask Decoder (SMD) with pixel-level supervision to improve spatial grounding and layout controllability. The manuscript claims that these additions yield state-of-the-art results across diverse language directions and complex real-world layouts on multiple benchmarks, while also demonstrating a mutual reinforcement effect between translation understanding and image generation.

Significance. If the experimental results hold, the work would be significant for multimodal machine translation and scene-text editing. It provides a concrete architectural approach to coupling understanding and generation in a single backbone, with modules that directly target identified failure modes. The reported mutual reinforcement between the two tasks is a potentially useful empirical observation that could inform future unified multimodal designs.

major comments (1)
  1. [Abstract] The central claim of state-of-the-art performance is stated without any accompanying metrics, baselines, ablation results, or dataset details. This absence prevents verification that UGAM and SMD are responsible for the reported gains and that the mutual-reinforcement effect is robust.
minor comments (1)
  1. The GitHub link for code release is provided; confirming that the repository contains the full training and evaluation pipelines would strengthen reproducibility claims.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the feedback. The sole major comment concerns the abstract's lack of quantitative details supporting the SOTA claim. We agree this is a valid observation and will revise the abstract to include key metrics, baselines, and dataset references while preserving its conciseness.

  1. Referee: [Abstract] The central claim of state-of-the-art performance is stated without any accompanying metrics, baselines, ablation results, or dataset details. This absence prevents verification that UGAM and SMD are responsible for the reported gains and that the mutual-reinforcement effect is robust.

    Authors: We agree the abstract would benefit from added specificity. In the revision we will incorporate concise quantitative results (e.g., BLEU improvements over prior SOTA on the primary benchmarks), name the main datasets and baselines, and briefly note that ablations isolating UGAM and SMD appear in Section 4. The mutual-reinforcement analysis remains in the experiments; the abstract will reference it without expanding into full details.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper introduces architectural modules (UGAM and SMD) to a multimodal backbone and supports its SOTA and mutual-reinforcement claims solely through benchmark experiments. No equations, derivations, or predictions appear that reduce by construction to fitted inputs, self-citations, or renamed ansatzes. The argument is empirical and externally falsifiable on public datasets, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities beyond the two named modules can be identified.

pith-pipeline@v0.9.1-grok · 5810 in / 1136 out tokens · 14363 ms · 2026-06-26T00:48:31.282090+00:00 · methodology

Reference graph

Works this paper leans on

56 extracted references · 1 canonical work page

  1. [1]

    AI, Z.: Introducing our latest image generation model in the API. https://openai.com/index/image-generation-api/ (2025), accessed: 2025-04-23

  2. [2]

    Bautista, D., Atienza, R.: Scene text recognition with permuted autoregressive sequence models. In: European conference on computer vision. pp. 178–196. Springer (2022)

  3. [3]

    Cettolo, M., Federico, M., Bentivogli, L., Niehues, J., Stüker, S., Sudoh, K., Yoshino, K., Federmann, C.: Overview of the iwslt 2017 evaluation campaign. In: Proceedings of the 14th International Conference on Spoken Language Translation. pp. 2–14 (2017)

  4. [4]

    Cettolo, M., Niehues, J., Stüker, S., Bentivogli, L., Federico, M.: Report on the 11th iwslt evaluation campaign. In: Proceedings of the 11th International Workshop on Spoken Language Translation: Evaluation Campaign. pp. 2–17 (2014)

  5. [5]

    Chang, H., Zhang, H., Jiang, L., Liu, C., Freeman, W.T.: Maskgit: Masked generative image transformer (2022)

  6. [6]

    Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025)

  7. [7]

    Chen, Z., Zhao, F., Shu, Y., Liu, Y., Yu, L., Zhou, Y.: Styletextgen: Style-conditioned multilingual scene text generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7643–7653 (2026)

  8. [8]

    Chen, Z., Yin, F., Zhang, X.Y., Yang, Q., Liu, C.L.: Cross-lingual text image recognition via multi-task sequence to sequence learning. In: 2020 25th International Conference on Pattern Recognition (ICPR). pp. 3122–3129. IEEE (2021)

  9. [9]

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

  10. [10]

    Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)

  11. [11]

    Elliott, D., Frank, S., Sima’an, K., Specia, L.: Multi30k: Multilingual english-german image descriptions. In: ACL (2016)

  12. [12]

    Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 12873–12883 (June 2021)

  13. [13]

    Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12873–12883 (2021)

  14. [14]

    Huang, R., Wang, C., Yang, J., Lu, G., Yuan, Y., Han, J., Hou, L., Zhang, W., Hong, L., Zhao, H., et al.: Illume+: Illuminating unified mllm with dual visual tokenization and diffusion refinement. arXiv preprint arXiv:2504.01934 (2025)

  15. [15]

    Jin, Y., Li, J., Gu, T., Liu, Y., Zhao, B., Lai, J., Gan, Z., Wang, Y., Wang, C., Tan, X., et al.: Efficient multimodal large language models: A survey. Visual Intelligence 3(1), 27 (2025)

  16. [16]

    Lan, Z., Niu, L., Meng, F., Zhou, J., Zhang, M., Su, J.: Translatotron-v(ison): An end-to-end model for in-image machine translation. In: Findings of the Association for Computational Linguistics ACL 2024. pp. 5472–5485 (2024)

  17. [17]

    Lan, Z., Niu, L., Meng, F., Zhou, J., Zhang, M., Su, J.: Translatotron-v(ison): An end-to-end model for in-image machine translation (2024), https://arxiv.org/abs/2407.02894

  18. [18]

    Lan, Z., Yu, J., Li, X., Zhang, W., Luan, J., Wang, B., Huang, D., Su, J.: Exploring better text image translation with multimodal codebook. arXiv preprint arXiv:2305.17415 (2023)

  19. [19]

    Li, G., Zhang, C., Liang, Y., Shen, H., Zhang, Y., Lyu, P., Wang, W., Wan, X., Zeng, G., Hu, H., et al.: Mmtit-bench: A multilingual and multi-scenario benchmark with cognition-perception-reasoning guided text-image machine translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16593–16602 (2026)

  20. [20]

    Li, Z., Shu, Y., Zeng, W., Yang, D., Zhou, Y.: First creating backgrounds then rendering texts: A new paradigm for visual text blending. arXiv preprint arXiv:2410.10168 (2024)

  21. [21]

    Liang, Y., Zhang, Y., Ma, C., Zhang, Z., Zhao, Y., Xiang, L., Zong, C., Zhou, Y.: Document image machine translation with dynamic multi-pre-trained models assembling. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pp. 7084–7095 (2024)

  22. [22]

    Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025)

  23. [23]

    Lin, H., Wang, T., Ge, Y., Ge, Y., Lu, Z., Wei, Y., Zhang, Q., Sun, Z., Shan, Y.: Toklip: Marry visual tokens to clip for multimodal comprehension and generation. arXiv preprint arXiv:2505.05422 (2025)

  24. [24]

    Lyu, J., Fu, P., Li, Z., Zeng, W., Zhang, S., Yang, J., Ma, C., Zhou, Y., Luo, Z., Luan, J.: Imtbench: A multi-scenario cross-modal collaborative evaluation benchmark for in-image machine translation. arXiv preprint arXiv:2603.10495 (2026)

  25. [25]

    OpenAI: GLM-Image: Auto-regressive for Dense-knowledge and High-fidelity Image Generation. https://z.ai/blog/glm-image/ (2026), accessed: 2026-01-04

  26. [26]

    Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. pp. 311–318 (2002)

  27. [27]

    Qian, Z., Zhang, P., Yang, B., Fan, K., Ma, Y., Wong, D., Sun, X., Ji, R.: Anytrans: Translate anytext in the image with large scale models. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 2432–2444 (2024)

  28. [28]

    Rei, R., Stewart, C., Farinha, A.C., Lavie, A.: COMET: A neural framework for MT evaluation. In: Webber, B., Cohn, T., He, Y., Liu, Y. (eds.) Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 2685–2702. Association for Computational Linguistics, Online (Nov 2020). https://doi.org/10.18653/v1/2020.emnlp-main.213, https:...

  29. [29]

    Salesky, E., Koehn, P., Post, M.: Benchmarking visually-situated translation of text in natural images. In: Proceedings of the Ninth Conference on Machine Translation. pp. 1167–1182 (2024)

  30. [30]

    Shi, Q., Bai, J., Zhao, Z., Chai, W., Yu, K., Wu, J., Song, S., Tong, Y., Li, X., Li, X., et al.: Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model. arXiv preprint arXiv:2505.23606 (2025)

  31. [31]

    Shu, Y., Zeng, W., Zhao, F., Chen, Z., Li, Z., Yang, X., Zhou, Y., Rota, P., Bai, X., Jin, L., et al.: Visual text processing: A comprehensive review and unified evaluation. arXiv preprint arXiv:2504.21682 (2025)

  32. [32]

    Su, T., Liu, S., Zhou, S.: Rtnet: An end-to-end method for handwritten text image translation. In: International Conference on Document Analysis and Recognition. pp. 99–113. Springer (2021)

  33. [33]

    Swerdlow, A., Prabhudesai, M., Gandhi, S., Pathak, D., Fragkiadaki, K.: Unified multimodal discrete diffusion. arXiv preprint arXiv:2503.20853 (2025)

  34. [34]

    Team, C.: Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818 (2024)

  35. [35]

    Team, M.L., Ma, H., Tan, H., Huang, J., Wu, J., He, J.Y., Gao, L., Xiao, S., Wei, X., Ma, X., Cai, X., Guan, Y., Hu, J.: Longcat-image technical report (2025), https://arxiv.org/abs/2512.07584

  36. [36]

    Tian, K., Jiang, Y., Yuan, Z., Peng, B., Wang, L.: Visual autoregressive modeling: Scalable image generation via next-scale prediction (2024)

  37. [37]

    Tian, Y., Li, X., Liu, Z., Guo, Y., Wang, B.: In-image neural machine translation with segmented pixel sequence-to-sequence model. In: Findings of the Association for Computational Linguistics: EMNLP 2023. pp. 15046–15057 (2023)

  38. [38]

    Tian, Y., Liu, Z., Liu, Z., Feng, C., Li, X., Huang, H., Guo, Y.: Prim: Towards practical in-image multilingual machine translation. arXiv preprint arXiv:2509.05146 (2025)

  39. [39]

    Tian, Y., Liu, Z., Liu, Z., Guo, Y.: Exploring in-image machine translation with real-world background. arXiv preprint arXiv:2505.15282 (2025)

  40. [40]

    Vaidya, S., Sharma, A.K., Gatti, P., Mishra, A.: Show me the world in my language: Establishing the first baseline for scene-text to scene-text translation. In: International Conference on Pattern Recognition. pp. 312–328. Springer (2024)

  41. [41]

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need (2023)

  42. [42]

    Wang, J., Lai, Y., Li, A., Zhang, S., Sun, J., Kang, N., Wu, C., Li, Z., Luo, P.: Fudoki: Discrete flow-based unified understanding and generation via kinetic-optimal velocities. arXiv preprint arXiv:2505.20147 (2025)

  43. [43]

    Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., Zhang, F., Wang, Y., Li, Z., Yu, Q., et al.: Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869 (2024)

  44. [44]

    Wang, X., Pan, J., Liu, Y., Zhao, X., Lyu, C., Wu, M., Biemann, C., Wang, L., Xu, L., Luo, W., et al.: Rethinking multilingual vision-language translation: Dataset, evaluation, and adaptation. arXiv preprint arXiv:2506.11820 (2025)

  45. [45]

    Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4), 600–612 (2004)

  46. [46]

    Wang, Z., Guan, T., Fu, P., Duan, C., Jiang, Q., Guo, Z., Guo, S., Luo, J., Shen, W., Yang, X.: Marten: Visual question answering with mask generation for multi-modal document understanding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 14460–14471 (2025)

  47. [47]

    Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025)

  48. [48]

    Wu, S., Zhang, W., Xu, L., Jin, S., Wu, Z., Tao, Q., Liu, W., Li, W., Loy, C.C.: Harmonizing visual representations for unified multimodal understanding and generation. arXiv preprint arXiv:2503.21979 (2025)

  49. [49]

    Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: Omnigen: Unified image generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 13294–13304 (2025)

  50. [50]

    Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528 (2024)

  51. [51]

    Yang, L., Tian, Y., Li, B., Zhang, X., Shen, K., Tong, Y., Wang, M.: Mmada: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809 (2025)

  52. [52]

    Zeng, W., Shu, Y., Li, Z., Yang, D., Zhou, Y.: Textctrl: Diffusion-based scene text editing with prior guidance control. Advances in Neural Information Processing Systems 37, 138569–138594 (2024)

  53. [53]

    Zhang, X., Guo, J., Zhao, S., Fu, M., Duan, L., Hu, J., Chng, Y.X., Wang, G.H., Chen, Q.G., Xu, Z., Luo, W., Zhang, K.: Unified multimodal understanding and generation models: Advances, challenges, and opportunities. arXiv preprint arXiv:2505.02567 (2025)

  54. [54]

    Zhang, Z., Zhang, Y., Liang, Y., Ma, C., Xiang, L., Zhao, Y., Zhou, Y., Zong, C.: Reading when translating: Multi-modal document image machine translation with reading flow prediction. IEEE Transactions on Audio, Speech and Language Processing (2025)

  55. [55]

    Zhang, Z., Zhang, Y., Liang, Y., Ma, C., Xiang, L., Zhao, Y., Zhou, Y., Zong, C.: Understand layout and translate text: Unified feature-conductive end-to-end document image translation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

  56. [56]

    Zhu, S., Li, S., Lei, Y., Xiong, D.: Peit: bridging the modality gap with pre-trained models for end-to-end image translation. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 13433–13447 (2023)