{"work":{"id":"f0270d36-2952-47fb-84c1-95e3ec341126","openalex_id":null,"doi":null,"arxiv_id":"2112.10752","raw_key":null,"title":"High-Resolution Image Synthesis with Latent Diffusion Models","authors":null,"authors_text":"Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer","year":2021,"venue":"cs.CV","abstract":"By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. 
Code is available at https://github.com/CompVis/latent-diffusion .","external_url":"https://arxiv.org/abs/2112.10752","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-14T20:12:53.544941+00:00","pith_arxiv_id":"2112.10752","created_at":"2026-05-09T02:29:38.824953+00:00","updated_at":"2026-05-14T20:12:53.544941+00:00","title_quality_ok":true,"display_title":"High-Resolution Image Synthesis with Latent Diffusion Models","render_title":"High-Resolution Image Synthesis with Latent Diffusion Models"},"hub":{"state":{"work_id":"f0270d36-2952-47fb-84c1-95e3ec341126","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":42,"external_cited_by_count":null,"distinct_field_count":10,"first_pith_cited_at":"2022-04-13T01:10:33+00:00","last_pith_cited_at":"2026-05-13T03:16:12+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-14T20:46:11.550563+00:00","tier_text":"hub"},"tier":"hub","role_counts":[],"polarity_counts":[],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T17:49:01.560155+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Denoising Diffusion Probabilistic Models","work_id":"dc023f4e-7c79-471c-b713-deeb559ba010","shared_citers":14},{"title":"Classifier-Free Diffusion Guidance","work_id":"acf2c588-c088-4a6c-938e-150ad7c666d7","shared_citers":12},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":12},{"title":"SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis","work_id":"8034c587-fba6-4941-87ba-c98f2ac962cb","shared_citers":12},{"title":"Denoising Diffusion Implicit 
Models","work_id":"8fa2128b-d18c-405c-ac92-0e669cf89ac0","shared_citers":11},{"title":"Learning Transferable Visual Models From Natural Language Supervision","work_id":"6de86bb5-27bd-4d5c-8b89-967ebfc52659","shared_citers":11},{"title":"Score-Based Generative Modeling through Stochastic Differential Equations","work_id":"d9110e53-a5d4-4794-a4c5-a575e91c31ad","shared_citers":11},{"title":"Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding","work_id":"af16442b-a46f-469d-8818-c37b53a504c7","shared_citers":9},{"title":"Diffusion Models Beat GANs on Image Synthesis","work_id":"2eb944bb-93ba-462c-8111-4e8c915dd873","shared_citers":7},{"title":"Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow","work_id":"a1989e1b-d66d-4533-be3a-fb9c5fd62290","shared_citers":7},{"title":"Scalable Diffusion Models with Transformers","work_id":"a3a05169-18b1-42bb-8775-eada50163437","shared_citers":7},{"title":"Deep Unsupervised Learning using Nonequilibrium Thermodynamics","work_id":"986277c3-5997-4593-942c-17cdec737a72","shared_citers":6},{"title":"Make-A-Video: Text-to-Video Generation without Text-Video Data","work_id":"52a801fc-a707-45a1-a8cd-0d6702f124ab","shared_citers":6},{"title":"Elucidating the Design Space of Diffusion-Based Generative Models","work_id":"a80a774d-caed-4e4c-9b69-471be05076e6","shared_citers":5},{"title":"GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models","work_id":"34430d19-7919-48ce-88a5-17b3bfe2192e","shared_citers":5},{"title":"(page 13)","work_id":"578102d7-413d-4e1e-b8a7-a532ef75f1a6","shared_citers":5},{"title":"Progressive Distillation for Fast Sampling of Diffusion Models","work_id":"fd04f498-ff85-4de3-bcc7-31ef072b2ceb","shared_citers":5},{"title":"Scaling Rectified Flow Transformers for High-Resolution Image Synthesis","work_id":"4dc55d76-271e-42dd-878f-c20546599c69","shared_citers":5},{"title":"Adding Conditional Control to Text-to-Image Diffusion 
Models","work_id":"226c10cc-8cc0-4c01-9358-f23db6c470c7","shared_citers":4},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":4},{"title":"arXiv.csabs/2305.08891(2023) 1","work_id":"abdf465f-a666-4bc8-bcc6-1148df861601","shared_citers":4},{"title":"Auto-Encoding Variational Bayes","work_id":"97d95295-30e1-42b4-bbf6-85f0fa4edb44","shared_citers":4},{"title":"CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer","work_id":"f38fc088-12aa-4bf4-9ecd-08d3e797ccb7","shared_citers":4},{"title":"Consistency Models","work_id":"502bf494-8fcd-434f-828f-0566ab606719","shared_citers":4}],"time_series":[{"n":2,"year":2022},{"n":2,"year":2023},{"n":1,"year":2024},{"n":36,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T17:48:32.324673+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T17:48:36.546468+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"High-Resolution Image Synthesis with Latent Diffusion Models","claims":[{"claim_text":"By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. 
However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks High-Resolution Image Synthesis with Latent Diffusion Models because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T17:49:04.927222+00:00"}},"summary":{"title":"High-Resolution Image Synthesis with Latent Diffusion Models","claims":[{"claim_text":"By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. 
To enable DM training on limited computational resources while retaining their quality and ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks High-Resolution Image Synthesis with Latent Diffusion Models because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Denoising Diffusion Probabilistic Models","work_id":"dc023f4e-7c79-471c-b713-deeb559ba010","shared_citers":14},{"title":"Classifier-Free Diffusion Guidance","work_id":"acf2c588-c088-4a6c-938e-150ad7c666d7","shared_citers":12},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":12},{"title":"SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis","work_id":"8034c587-fba6-4941-87ba-c98f2ac962cb","shared_citers":12},{"title":"Denoising Diffusion Implicit Models","work_id":"8fa2128b-d18c-405c-ac92-0e669cf89ac0","shared_citers":11},{"title":"Learning Transferable Visual Models From Natural Language Supervision","work_id":"6de86bb5-27bd-4d5c-8b89-967ebfc52659","shared_citers":11},{"title":"Score-Based Generative Modeling through Stochastic Differential Equations","work_id":"d9110e53-a5d4-4794-a4c5-a575e91c31ad","shared_citers":11},{"title":"Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding","work_id":"af16442b-a46f-469d-8818-c37b53a504c7","shared_citers":9},{"title":"Diffusion Models Beat GANs on Image Synthesis","work_id":"2eb944bb-93ba-462c-8111-4e8c915dd873","shared_citers":7},{"title":"Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow","work_id":"a1989e1b-d66d-4533-be3a-fb9c5fd62290","shared_citers":7},{"title":"Scalable Diffusion Models with Transformers","work_id":"a3a05169-18b1-42bb-8775-eada50163437","shared_citers":7},{"title":"Deep Unsupervised Learning using Nonequilibrium Thermodynamics","work_id":"986277c3-5997-4593-942c-17cdec737a72","shared_citers":6},{"title":"Make-A-Video: 
Text-to-Video Generation without Text-Video Data","work_id":"52a801fc-a707-45a1-a8cd-0d6702f124ab","shared_citers":6},{"title":"Elucidating the Design Space of Diffusion-Based Generative Models","work_id":"a80a774d-caed-4e4c-9b69-471be05076e6","shared_citers":5},{"title":"GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models","work_id":"34430d19-7919-48ce-88a5-17b3bfe2192e","shared_citers":5},{"title":"(page 13)","work_id":"578102d7-413d-4e1e-b8a7-a532ef75f1a6","shared_citers":5},{"title":"Progressive Distillation for Fast Sampling of Diffusion Models","work_id":"fd04f498-ff85-4de3-bcc7-31ef072b2ceb","shared_citers":5},{"title":"Scaling Rectified Flow Transformers for High-Resolution Image Synthesis","work_id":"4dc55d76-271e-42dd-878f-c20546599c69","shared_citers":5},{"title":"Adding Conditional Control to Text-to-Image Diffusion Models","work_id":"226c10cc-8cc0-4c01-9358-f23db6c470c7","shared_citers":4},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":4},{"title":"arXiv.csabs/2305.08891(2023) 1","work_id":"abdf465f-a666-4bc8-bcc6-1148df861601","shared_citers":4},{"title":"Auto-Encoding Variational Bayes","work_id":"97d95295-30e1-42b4-bbf6-85f0fa4edb44","shared_citers":4},{"title":"CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer","work_id":"f38fc088-12aa-4bf4-9ecd-08d3e797ccb7","shared_citers":4},{"title":"Consistency Models","work_id":"502bf494-8fcd-434f-828f-0566ab606719","shared_citers":4}],"time_series":[{"n":2,"year":2022},{"n":2,"year":2023},{"n":1,"year":2024},{"n":36,"year":2026}],"dependency_candidates":[]},"authors":[]}}