Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering
4 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 4
verdicts
UNVERDICTED 4
representative citing papers
citing papers explorer
- MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing
  MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.
- POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation
  POCA combines Pareto optimization with curriculum alignment to improve multi-reward reinforcement learning for visual text generation without relying on weighted sums.
- Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model
  Seedream 2.0 is a native Chinese-English bilingual diffusion model that integrates a self-developed LLM text encoder, Glyph-Aligned ByT5, and Scaled ROPE to reach claimed state-of-the-art results in prompt following, aesthetics, text rendering, and human preference alignment via RLHF.
- Evaluating Reasoning Fidelity in Visual Text Generation
  T2I models frequently exhibit semantic errors, logical inconsistencies, and incorrect reasoning steps in visual text generation tasks, unlike text-only models.