Token-to-Token Alignment of Text Embeddings for Semantic Blending

Daniel Cohen-Or; Or Patashnik; Ron Mokady; Saar Huberman

arxiv: 2606.24021 · v1 · pith:RYIXCMTRnew · submitted 2026-06-22 · 💻 cs.CV · cs.GR

Pith reviewed 2026-06-26 08:31 UTC · model grok-4.3

keywords: text embeddings · token alignment · semantic blending · text-to-image · linear interpolation · prompt structure · embedding alignment · continuous editing

The pith

Text embedding spaces in text-to-image models contain continuous semantic structure that token-to-token alignment makes usable for interpolation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Token sequences from text prompts vary in wording and order even for similar images, so direct interpolation between them yields incoherent results. The paper argues this is due to misalignment of representations rather than an absence of underlying structure. It proposes a two-stage Token-to-Token alignment that first rephrases prompts into a shared structured form and then matches token embeddings by semantic similarity. With correspondences established, linear interpolation between embeddings produces smooth semantic transitions. This shows that semantic control can be obtained by organizing existing representations instead of retraining or modifying the generative model.

Core claim

Text embedding spaces in text-to-image models implicitly encode a continuous semantic structure that becomes accessible once representations are properly aligned; the Token-to-Token alignment framework maps semantically corresponding concepts to consistent positions across prompts and aligns their embeddings, after which simple linear interpolation yields smooth and coherent semantic transitions for blending and continuous editing.

What carries the argument

Token-to-Token alignment: a two-stage process of structural alignment that rephrases prompts into shared form followed by embedding-level alignment that matches tokens by semantic similarity.
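As a rough illustration of the structural stage only, the sketch below serializes two scene descriptions under one fixed field order, so that corresponding concepts land at matching positions in both prompts. The field names, the `to_structured_prompt` helper, and the "unspecified" filler are assumptions made for this example. The Figure 12 caption indicates that the paper uses an LLM to produce aligned JSON descriptions; this hand-written serializer is a stand-in, not that pipeline.

```python
# Hypothetical shared schema: a fixed field order (not the paper's actual schema).
FIELDS = ["objects", "lighting", "background", "style"]

def to_structured_prompt(description: dict) -> str:
    """Serialize a scene description with every field in the same order."""
    parts = []
    for field in FIELDS:
        # Missing fields get a neutral filler so positions stay consistent.
        value = description.get(field, "unspecified")
        parts.append(f"{field}: {value}")
    return "; ".join(parts)

# The input dictionaries list their fields in different orders on purpose.
prompt_a = to_structured_prompt({"lighting": "warm sunset light", "objects": "a cat on a sofa"})
prompt_b = to_structured_prompt({"objects": "a lion on a rock", "style": "photo", "lighting": "cold moonlight"})
print(prompt_a)
print(prompt_b)
```

Both outputs list the same fields in the same order, which is the coarse, text-level correspondence that the embedding-level stage then refines.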

If this is right

Linear interpolation between aligned embeddings produces smooth and coherent semantic transitions.
Image blending and continuous control of edits become feasible operations.
Semantic control is achievable by organizing existing representations rather than modifying the generative model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same alignment principle could be tested on prompt pairs that differ in style or abstraction level to measure how far the structure generalizes.
If the method works, it might reduce reliance on hand-crafted prompt templates by automatically standardizing token positions.
Extending the approach to video or 3D generation would test whether the continuous structure holds across temporal or spatial dimensions.

Load-bearing premise

Arbitrary prompts can be rephrased into a shared structured form while preserving their original semantics, and semantic similarity between tokens can be determined reliably enough to produce valid correspondences.

What would settle it

Apply the alignment to a set of prompts, perform linear interpolation between the aligned embeddings, and check whether the generated images exhibit gradual semantic changes or instead show abrupt discontinuities and artifacts.
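One way to make this check concrete is to compare the sizes of consecutive steps along the interpolation path. The sketch below assumes each generated image has already been reduced to a feature vector by some image encoder (not shown, and not specified by the paper); the `has_abrupt_jump` helper and its `ratio` threshold are arbitrary illustrative choices, not a metric from the paper.

```python
import math

def step_sizes(features):
    """Euclidean distance between consecutive feature vectors."""
    return [math.dist(a, b) for a, b in zip(features, features[1:])]

def has_abrupt_jump(features, ratio=3.0):
    """Flag a sequence whose largest step exceeds `ratio` times the median step."""
    steps = sorted(step_sizes(features))
    median = steps[len(steps) // 2]
    return steps[-1] > ratio * median

# Toy 2-D features standing in for per-image embeddings along the path.
smooth = [[0.0, 0.0], [0.25, 0.0], [0.5, 0.0], [0.75, 0.0], [1.0, 0.0]]
abrupt = [[0.0, 0.0], [0.02, 0.0], [0.04, 0.0], [0.98, 0.0], [1.0, 0.0]]
print(has_abrupt_jump(smooth), has_abrupt_jump(abrupt))
```

A gradual transition produces steps of similar size, while a discontinuity shows up as one step that dwarfs the rest.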

Figures

Figures reproduced from arXiv: 2606.24021 by Daniel Cohen-Or, Or Patashnik, Ron Mokady, Saar Huberman.

**Figure 1.** Semantic interpolation through aligned text embeddings. Direct interpolation between prompt embeddings yields inconsistent and entangled transitions. By aligning prompts at both the structural and embedding levels, we enforce token-wise semantic correspondence, enabling linear interpolation to follow a meaningful path in the text embedding space. This reveals a continuous semantic manifold that diffusion m…

**Figure 2.** Applications enabled by aligned text embedding interpolation. Once representations are aligned, simple linear interpolation produces coherent semantic transitions. This unified framework supports continuous synthesis (top), image editing (middle), and image blending (bottom), generating meaningful intermediate states across tasks. Red boxes denote input images.

**Figure 3.** Method overview. Inputs are transformed into structured descriptions under a shared schema with aligned textual phrasing, establishing correspondence between scene components and how they are expressed. Their token embeddings are then aligned in the text encoder space to enforce token-wise semantic correspondence. Interpolation between aligned embeddings produces intermediate representations, which the gen…

**Figure 4.** Concept-level structural alignment. Inputs are transformed into structured descriptions under a shared schema, where scene components are decomposed into semantic fields (e.g., lighting, objects) with consistent ordering. Corresponding elements are expressed using aligned textual phrasing within each field, establishing coarse correspondence at the textual level.

**Figure 5.** Embedding-level alignment. We compute a pairwise cosine similarity matrix between token embeddings, then refine it using field-based masking and a locality bias, and apply a temperature-scaled softmax to obtain alignment weights. These weights define a projection that maps source embeddings onto the target token structure, re-expressing their semantic content in the target layout and establishing token-wis…

**Figure 6.** Qualitative comparison with continuous editing methods. Existing methods often rely on appearance-based interpolation or introduce abrupt semantic changes, whereas our method produces smooth and semantically coherent transitions with meaningful intermediate states. [Panel labels: Input A, Input B; DiffMorpher, FreeMorph, Vibe Space, T2T (Ours); intermediate transitions for 0 < α < 1.]

**Figure 7.** Qualitative comparison with continuous blending methods. While prior approaches may produce mixed or drifting attributes along the transition, our method interpolates in aligned text embedding space, yielding coherent and semantically valid intermediate images at every step.

**Figure 8.** User study results. Win rates (%) of our method in pairwise comparisons against editing (left) and blending (right) baselines across evaluation criteria.

**Figure 10.** Qualitative alignment ablation study. Direct interpolation produces abrupt transitions. Text-level alignment improves continuity but lacks fine-grained detail, while embedding alignment alone leads to unstable transformations. Combining both yields smooth and coherent transitions.

**Figure 11.** User study interface used for pairwise evaluation of continuous blending. Participants …

**Figure 12.** Structural text alignment pipeline for continuous blending. Stage 1 prompts the LLM to jointly generate aligned source/edit JSONs from two images while enforcing shared structure, object count, field order, and consistent wording for unaffected regions. Stage 2 performs image-aware bidirectional refinement: unchanged fields are force-copied, while affected fields are rewritten in parallel with minimal wor…

**Figure 13.** Qualitative comparison with continuous blending methods. Results generated using …

**Figure 14.** Qualitative comparison with continuous blending methods. Results generated using …

**Figure 15.** Additional continuous blending results. Each row shows a gradual transition between two …

**Figure 16.** Qualitative comparison with continuous editing methods. For the instruction “change the scene to nighttime,” prior approaches mainly rely on appearance-based transitions such as progressive darkening, whereas our method produces a more coherent semantic transition with gradual changes in illumination and shadow placement consistent with the changing time of day.

**Figure 17.** Qualitative comparison with continuous editing methods. Top example: results generated …

**Figure 18.** Additional continuous editing results. Each row shows a gradual edit applied to a reference …
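The embedding-level alignment described in the Figure 5 caption (cosine similarity, field-based masking, a locality bias, and a temperature-scaled softmax that yields projection weights) can be sketched roughly as follows. This is a minimal reading of the caption, not the authors' implementation: the form of the locality penalty, the `temperature` and `locality` values, and the fallback for target tokens with no same-field source token are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def align_to_target(src, tgt, src_fields, tgt_fields, temperature=0.1, locality=0.5):
    """Re-express source token embeddings in the target token layout."""
    aligned = []
    for j, t in enumerate(tgt):
        scores = []
        for i, s in enumerate(src):
            if src_fields[i] != tgt_fields[j]:
                scores.append(-math.inf)  # field-based masking
                continue
            # Assumed locality bias: penalize matches far apart in relative position.
            rel_gap = abs(i / len(src) - j / len(tgt))
            scores.append((cosine(s, t) - locality * rel_gap) / temperature)
        top = max(scores)
        if top == -math.inf:
            # Assumption: with no same-field source token, keep the target embedding.
            aligned.append(list(t))
            continue
        exps = [math.exp(x - top) for x in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        # Projection: weighted combination of source embeddings for this target slot.
        aligned.append([sum(w * s[d] for w, s in zip(weights, src)) for d in range(len(t))])
    return aligned

src = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
src_fields = ["objects", "objects", "lighting"]
tgt = [[0.0, 1.0], [0.6, 0.8]]
tgt_fields = ["objects", "lighting"]
aligned = align_to_target(src, tgt, src_fields, tgt_fields)
print(aligned)
```

The result has one embedding per target token, so the aligned source and the target can be interpolated position by position. A low temperature makes the weights close to a hard match; a higher one blends several source tokens into each target slot.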

Original abstract

In modern generative models, images are specified and controlled through text prompts. In practice, images are generated from sequences of tokens derived from these prompts. However, the space of token sequences lacks a consistent accessible structure: semantically similar images may correspond to sequences that differ in wording, ordering, and placement of concepts, while similar token sequences may encode very different semantics. This apparent lack of structure makes it difficult to perform smooth transitions in this space, hindering applications such as image blending and continuous control of edits. We argue that this limitation stems not from the absence of semantic structure, but from misalignment between representations. To address this misalignment, we introduce Token-to-Token alignment, a framework that establishes explicit semantic correspondence between tokens across prompts. Our approach transforms prompts into a structured representation in which semantically corresponding concepts are mapped to consistent positions across prompts, and then aligns their token embeddings based on semantic similarity. Concretely, the method consists of two stages: a structural alignment that rephrases prompts into a shared structured form, followed by an embedding-level alignment that matches token representations across prompts. With this alignment in place, simple linear interpolation becomes a meaningful operation, producing smooth and coherent semantic transitions and enabling applications such as blending and continuous editing. Our results show that text embedding spaces in text-to-image models implicitly encode a continuous semantic structure that becomes accessible once representations are properly aligned, suggesting that semantic control can be achieved by organizing existing representations rather than modifying the generative model.
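Once two prompts have the same number of tokens in corresponding positions, the interpolation the abstract refers to is a token-wise blend with a weight α between 0 and 1. The sketch below uses toy 2-D vectors in place of real text-encoder outputs; the `lerp_embeddings` name is hypothetical, and feeding the result to a generative model is not shown.

```python
def lerp_embeddings(aligned_src, tgt, alpha):
    """Token-wise linear interpolation: (1 - alpha) * source + alpha * target."""
    if len(aligned_src) != len(tgt):
        raise ValueError("aligned sequences must have the same number of tokens")
    return [
        [(1.0 - alpha) * s + alpha * t for s, t in zip(src_tok, tgt_tok)]
        for src_tok, tgt_tok in zip(aligned_src, tgt)
    ]

src = [[1.0, 0.0], [0.0, 2.0]]
tgt = [[0.0, 1.0], [2.0, 0.0]]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, lerp_embeddings(src, tgt, alpha))
```

At α = 0 the source embeddings are returned unchanged, at α = 1 the target embeddings, and values in between give the intermediate conditioning sequences.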

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's two-stage token alignment lets linear interpolation work for semantic blending in text-to-image models, but the structural rephrasing step likely supplies the consistency rather than uncovering it in the embeddings.

The core contribution is a two-stage process: first rephrase prompts into a shared structured form, then match token embeddings by semantic similarity so that linear interpolation produces coherent image transitions. This targets the practical problem that different wordings for similar concepts produce inconsistent token sequences in current pipelines.

The approach is new in its explicit focus on token-to-token correspondences for blending and editing tasks. It correctly notes that the issue is misalignment rather than missing structure and shows how organizing existing representations can enable continuous control without retraining or modifying the generative model. That framing is useful for anyone already working inside Stable Diffusion or similar systems who needs smoother prompt-based edits.

The soft spot is the structural alignment stage. The description gives no details on how rephrasing is performed, whether it uses an LLM, templates, or something else. If that step imposes consistent positions and concepts, then the subsequent embedding interpolation may simply be operating on artificially similar inputs rather than revealing intrinsic continuity in the original token space. The abstract supplies no ablations that test interpolation with and without the rephrasing, no quantitative metrics, and no direct comparisons to prior interpolation or prompt-rewriting baselines. Without those, the central claim that the embedding space already encodes the needed structure remains unverified.

The stress-test concern holds up on the given description: the method may be engineering the desired continuity instead of discovering it. A reader would need the full implementation details and results to judge whether the alignment adds value beyond the rephrasing.

This is for practitioners and researchers in generative computer vision who build editing tools. It deserves peer review because the idea is concrete and addresses a real workflow pain point, even if the current write-up leaves the evidence thin.

Referee Report

2 major / 1 minor

Summary. The paper claims that token sequences in text-to-image models lack consistent structure due to misalignment of representations, and introduces a Token-to-Token alignment framework consisting of a structural alignment stage (rephrasing arbitrary prompts into a shared structured form) followed by an embedding-level alignment stage (matching tokens across prompts by semantic similarity). With this alignment, linear interpolation becomes meaningful, enabling smooth semantic transitions for applications such as blending and continuous editing. The central thesis is that text embedding spaces already encode continuous semantic structure implicitly, which becomes accessible once representations are properly aligned, allowing semantic control by organizing existing representations rather than modifying the generative model.

Significance. If the central claim holds after verification, the work would indicate that semantic blending and continuous control can be achieved without retraining or architectural changes to the underlying text-to-image model, potentially lowering the barrier for applications that require smooth interpolation in prompt space. The emphasis on revealing rather than imposing structure distinguishes it from purely template-driven or external-rephrasing approaches, though this distinction requires explicit testing.

major comments (2)

[Abstract] The claim that alignment 'reveals' implicit continuous semantic structure (rather than the structural rephrasing stage imposing it) is load-bearing for the central thesis, yet the description provides no implementation details on the rephrasing procedure, no ablation removing the rephrasing step, and no comparison showing that correspondences arise from the embedding space itself. Without such evidence the observed coherence could be an artifact of the external structuring step.
[Abstract] The assertion that 'simple linear interpolation becomes a meaningful operation' and produces 'smooth and coherent semantic transitions' is presented without any quantitative results, error metrics, baseline comparisons, or failure-case analysis. The soundness assessment therefore rests on an unverified procedural description rather than falsifiable predictions or empirical grounding.

minor comments (1)

The abstract is written at a high level of generality; expanding the method description with at least one concrete example of prompt rephrasing and token correspondence would improve readability without altering the claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to strengthen the empirical support and clarity of our claims.

Referee: [Abstract] The claim that alignment 'reveals' implicit continuous semantic structure (rather than the structural rephrasing stage imposing it) is load-bearing for the central thesis, yet the description provides no implementation details on the rephrasing procedure, no ablation removing the rephrasing step, and no comparison showing that correspondences arise from the embedding space itself. Without such evidence the observed coherence could be an artifact of the external structuring step.

Authors: We agree that explicit evidence is required to substantiate that the embedding space itself supplies the continuous structure. The rephrasing stage only normalizes prompt syntax to enable positional consistency; token correspondences are then computed exclusively via cosine similarity in the frozen text embedding space. In revision we will (1) detail the rephrasing template and procedure, (2) add an ablation that performs embedding-level alignment directly on un-rephrased prompts, and (3) include a random-matching baseline to isolate the contribution of semantic similarity. These additions will be placed in the method and experiments sections.

Revision: yes

Referee: [Abstract] The assertion that 'simple linear interpolation becomes a meaningful operation' and produces 'smooth and coherent semantic transitions' is presented without any quantitative results, error metrics, baseline comparisons, or failure-case analysis. The soundness assessment therefore rests on an unverified procedural description rather than falsifiable predictions or empirical grounding.

Authors: The body of the manuscript reports qualitative blending results and application examples, yet the abstract does not cite quantitative metrics. We will revise the abstract to reference the specific metrics (e.g., CLIP-based semantic consistency scores and user-study preference rates) already computed in the experiments. We will also expand the experiments section with direct baseline comparisons against unaligned interpolation and a failure-case analysis. These changes will supply the requested empirical grounding.

Revision: yes
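The random-matching baseline proposed in the first response could look roughly like the toy comparison below: match each target token either to its most similar source token or to a random one, then compare the average similarity of the matched pairs. The vectors, the hard argmax matching, and the 200-trial average are all illustrative assumptions; the simulated rebuttal does not specify how the baseline would actually be run.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_matched_similarity(src, tgt, matching):
    """Average cosine similarity over (source index, target index) pairs."""
    return sum(cosine(src[i], tgt[j]) for i, j in matching) / len(matching)

def similarity_matching(src, tgt):
    """Match each target token to its most similar source token."""
    return [(max(range(len(src)), key=lambda i: cosine(src[i], t)), j)
            for j, t in enumerate(tgt)]

def random_matching(src, tgt, rng):
    """Match each target token to a uniformly random source token."""
    return [(rng.randrange(len(src)), j) for j in range(len(tgt))]

src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt = [[0.0, 0.9, 0.1], [0.1, 0.0, 0.9], [0.9, 0.1, 0.0]]
rng = random.Random(0)  # fixed seed so the comparison is repeatable
sim_score = mean_matched_similarity(src, tgt, similarity_matching(src, tgt))
rand_scores = [mean_matched_similarity(src, tgt, random_matching(src, tgt, rng))
               for _ in range(200)]
rand_score = sum(rand_scores) / len(rand_scores)
print(sim_score, rand_score)
```

If similarity-based matching did no better than random matching on real embeddings, that would support the referee's concern that the rephrasing stage, not the embedding space, supplies the structure.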

Circularity Check

0 steps flagged

No circularity detected; method is procedural without self-referential derivations

The paper describes a two-stage procedural framework (structural rephrasing of prompts into shared form, followed by token embedding alignment) and concludes that embedding spaces implicitly encode continuous semantic structure. No equations, quantitative predictions, fitted parameters, or derivations are present that reduce to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The central claim follows from applying the method rather than from any self-definitional or fitted-input loop, making the presentation self-contained as a methodological contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No technical details, equations, or experimental sections are available from the abstract, so no free parameters, axioms, or invented entities can be identified.

Reference graph

Works this paper leans on

60 extracted references · 25 canonical work pages · 11 internal anchors

[1] Mikel Artetxe, Gorka Labaka, and Eneko Agirre. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 789–798, 2018.
[2] Elie Bakouch, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Lewis Tunstall, Carlos Miguel Patiño, Edward Beeching, Aymeric Roucher, Aksel Joonas Reedi, Quentin Gallouédec, Kashif Rasul, Nathan Habib, Clémentine Fourrier, Hynek Kydlicek, Guilherme Penedo, Hugo Larcher, Mathieu Morlon, Vaibhav Srivastav, Joshua Lochner, Xuan-Son Nguyen, Colin Raffel, Lean... SmolLM3: smol, multilingual, long-context reasoner, 2025.
[3] Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints, pages arXiv–2506, 2025.
[4] Stefan Andreas Baumann, Felix Krause, Michael Neumayr, Nick Stracke, Melvin Sevi, Vincent Tao Hu, and Björn Ommer. Continuous, subject-specific attribute control in t2i models by identifying semantic directions. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13231–13241, 2025.
[5] Jonathan Berant and Percy Liang. Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1415–1425, 2014.
[6] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.
[7] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023.
[8] Yukang Cao, Chenyang Si, Jinghao Wang, and Ziwei Liu. Freemorph: Tuning-free generalized image morphing with diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18111–18120, 2025.
[9] Pin-Yen Chiu, I Fang, Jun-Cheng Chen, et al. Text slider: Efficient and plug-and-play continuous concept control for image/video synthesis via lora adapters. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 613–622, 2026.
[10] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
[11] Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Word translation without parallel data. arXiv preprint arXiv:1710.04087, 2017.
[12] Omer Dahary, Benaya Koren, Daniel Garibi, and Daniel Cohen-Or. On-the-fly repulsion in the contextual space for rich diversity in diffusion transformers. arXiv preprint arXiv:2603.28762, 2026.
[13] Yusuf Dalva and Pinar Yanardag. Noiseclr: A contrastive learning approach for unsupervised discovery of interpretable directions in diffusion models, 2023.
[14] Yusuf Dalva, Kavana Venkatesh, and Pinar Yanardag. Fluxspace: Disentangled semantic editing in rectified flow transformers, 2024.
[15] Amil Dravid, Yossi Gandelsman, Kuan-Chieh Wang, Rameen Abdal, Gordon Wetzstein, Alexei A Efros, and Kfir Aberman. Interpreting the weight space of customized diffusion models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
[16] Yigit Ekin and Yossi Gandelsman. The unreasonable effectiveness of text embedding interpolation for continuous image steering. arXiv preprint arXiv:2603.17998, 2026.
[17] Rohit Gandikota, Joanna Materzyńska, Tingrui Zhou, Antonio Torralba, and David Bau. Concept sliders: Lora adaptors for precise control in diffusion models. In European Conference on Computer Vision, pages 172–188. Springer, 2024.
[18] Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, and Tali Dekel. Tokenverse: Versatile multi-concept personalization in token modulation space. URL https://arxiv.org/abs/2501.12224
[20] Eyal Gutflaish, Eliran Kachlon, Hezi Zisman, Tal Hacham, Nimrod Sarid, Alexander Visheratin, Saar Huberman, Gal Davidi, Guy Bukchin, Kfir Goldberg, and Ron Mokady. Generating an image from 1,000 words: Enhancing text-to-image with structured captions. arXiv preprint arXiv:2511.06876, abs/2511.06876, 2025.
[21] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. Ganspace: Discovering interpretable gan controls. Advances in neural information processing systems, 33:9841–9850, 2020.
[22] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[23] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
[24] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, 2020.
[25] Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024.
[26] James Y Huang, Kuan-Hao Huang, and Kai-Wei Chang. Disentangling semantics and syntax in sentence embeddings with pre-trained language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1372–1379, 2021.
[27] Saar Huberman, Or Patashnik, Omer Dahary, Ron Mokady, and Daniel Cohen-Or. Image generation from contextually-contradictory prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14885–14894, 2026.
[28] Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Pnp inversion: Boosting diffusion-based editing with 3 lines of code. In The Twelfth International Conference on Learning Representations, 2024.
[29] Eliran Kachlon, Alexander Visheratin, Nimrod Sarid, Tal Hacham, Eyal Gutflaish, Saar Huberman, Hezi Zisman, David Ruppin, and Ron Mokady. Bbq-to-image: Numeric bounding box and qolor control in large-scale text-to-image models. arXiv preprint arXiv:2602.20672, 2026.
[30] Ronen Kamenetsky, Sara Dorfman, Daniel Garibi, Roni Paiss, Or Patashnik, and Daniel Cohen-Or. Saedit: Token-level control for continuous image editing via sparse autoencoder. arXiv preprint arXiv:2510.05081, 2025.
[31] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
[32] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Conference on Computer Vision and Pattern Recognition 2023, 2023.
[33] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157, 2021.
[34] Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas, and Tomer Michaeli. Flowedit: Inversion-free text-based editing using pre-trained flow models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19721–19730, 2025.
[35] Black Forest Labs. FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2, 2025.
[36]

Playground v3: Improving text-to-image alignment with deep-fusion large language models

Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Chase Lambert, Joao Souza, Suhail Doshi, and Daiqing Li. Playground v3: Improving text-to-image alignment with deep-fusion large language models. arXiv preprint arXiv:2409.10695, 2024

[37]

Llm4gen: Leveraging semantic representation of llms for text-to-image generation

Mushui Liu, Yuhang Ma, Zhen Yang, Jun Dan, Yunlong Yu, Zeng Zhao, Zhipeng Hu, Bai Liu, and Changjie Fan. Llm4gen: Leveraging semantic representation of llms for text-to-image generation. In Proceedings of the AAAI conference on Artificial Intelligence, volume 39, pages 5523–5531, 2025

[38]

Tokendial: Continuous attribute control in text-to-video via spatiotemporal token offsets

Zhixuan Liu, Peter Schaldenbrand, Yijun Li, Long Mai, Aniruddha Mahapatra, Cusuh Ham, Jean Oh, and Jui-Hsien Wang. Tokendial: Continuous attribute control in text-to-video via spatiotemporal token offsets. URL https://arxiv.org/abs/2603.27520

[40]

SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021

[41]

Video Analysis and Generation via a Semantic Progress Function

Gal Metzer, Sagi Polaczek, Ali Mahdavi-Amiri, Raja Giryes, and Daniel Cohen-Or. Video analysis and generation via a semantic progress function. arXiv preprint arXiv:2604.22554, 2026

[42]

Exploiting Similarities among Languages for Machine Translation

Tomas Mikolov, Quoc V Le, and Ilya Sutskever. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013

[43]

Kontinuous kontext: Continuous strength control for instruction-based image editing

Rishubh Parihar, Or Patashnik, Daniil Ostashev, R Venkatesh Babu, Daniel Cohen-Or, and Kuan-Chieh Wang. Kontinuous kontext: Continuous strength control for instruction-based image editing. arXiv preprint arXiv:2510.08532, 2025

[44]

Styleclip: Text-driven manipulation of stylegan imagery

Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2085–2094, 2021

[45]

Computational optimal transport: With applications to data science

Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Now Foundations and Trends, 2019

[46]

Zero-Shot Text-to-Image Generation

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation, 2021. URL https://arxiv.org/abs/2102.12092

[47]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

[48]

Pathways on the image manifold: Image editing via video generation

Noam Rotstein, Gal Yona, Daniel Silver, Roy Velich, David Bensaïd, and Ron Kimmel. Pathways on the image manifold: Image editing via video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7857–7866, 2025

[49]

Dynamic programming algorithm optimization for spoken word recognition

Hiroaki Sakoe and Seibi Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE transactions on acoustics, speech, and signal processing, 26(1):43–49, 1978

[50]

Interfacegan: Interpreting the disentangled face representation learned by gans

Yujun Shen, Ceyuan Yang, Xiaoou Tang, and Bolei Zhou. Interfacegan: Interpreting the disentangled face representation learned by gans. IEEE transactions on pattern analysis and machine intelligence, 44(4):2004–2018, 2020

[51]

Plug-and-play diffusion features for text-driven image-to-image translation

Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1921–1930, 2023

[52]

Continuous control of editing models via adaptive-origin guidance

Alon Wolf, Chen Katzir, Kfir Aberman, and Or Patashnik. Continuous control of editing models via adaptive-origin guidance. arXiv preprint arXiv:2602.03826, 2026

[53]

Uncovering the disentanglement capability in text-to-image diffusion models

Qiucheng Wu, Yujian Liu, Handong Zhao, Ajinkya Kale, Trung Bui, Tong Yu, Zhe Lin, Yang Zhang, and Shiyu Chang. Uncovering the disentanglement capability in text-to-image diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1900–1910, 2023

[54]

Numerikontrol: Adding numeric control to diffusion transformers for instruction-based image editing

Zhenyu Xu, Xiaoqi Shen, Haotian Nan, and Xinyu Zhang. Numerikontrol: Adding numeric control to diffusion transformers for instruction-based image editing. arXiv preprint arXiv:2511.23105, 2025

[55]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[56]

Vibe spaces for creatively connecting and expressing visual concepts

Huzheng Yang, Katherine Xu, Andrew Lu, Michael D Grossberg, Yutong Bai, and Jianbo Shi. Vibe spaces for creatively connecting and expressing visual concepts. arXiv preprint arXiv:2512.14884, 2025

[57]

Uncovering the text embedding in text-to-image diffusion models

Hu Yu, Hao Luo, Fan Wang, and Feng Zhao. Uncovering the text embedding in text-to-image diffusion models. arXiv preprint arXiv:2404.01154, 2024

[58]

Slideredit: Continuous image editing with fine-grained instruction control

Arman Zarei, Samyadeep Basu, Mobina Pournemat, Sayan Nag, Ryan Rossi, and Soheil Feizi. Slideredit: Continuous image editing with fine-grained instruction control. arXiv preprint arXiv:2511.09715, 2025

[59]

Diffmorpher: Unleashing the capability of diffusion models for image morphing

Kaiwen Zhang, Yifan Zhou, Xudong Xu, Bo Dai, and Xingang Pan. Diffmorpher: Unleashing the capability of diffusion models for image morphing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7912–7921, 2024

[60]

Group relative attention guidance for image editing

Xuanpu Zhang, Xuesong Niu, Ruidong Chen, Dan Song, Jianhao Zeng, Penghui Du, Haoxiang Cao, Kai Wu, and An-an Liu. Group relative attention guidance for image editing. arXiv preprint arXiv:2510.24657, 2025

A Additional details

A.1 Benchmarks

Morph4Data. Following the protocol of FreeMorph [8], we evaluate semantic interpolation on Morph4Data, a curated data...
