{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:UTRDYT6CJUBBSHFVA7YYXGCZKC","short_pith_number":"pith:UTRDYT6C","schema_version":"1.0","canonical_sha256":"a4e23c4fc24d02191cb507f18b9859509c58f3dc6a9de13d2dce55948f469b3c","source":{"kind":"arxiv","id":"2302.08453","version":2},"attestation_state":"computed","paper":{"title":"T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Lightweight adapters align external signals with the internal knowledge of frozen text-to-image diffusion models.","cross_cats":["cs.AI","cs.LG","cs.MM"],"primary_cat":"cs.CV","authors_text":"Chong Mou, Jian Zhang, Liangbin Xie, Xiaohu Qie, Xintao Wang, Yanze Wu, Ying Shan, Zhongang Qi","submitted_at":"2023-02-16T17:56:08Z","abstract_excerpt":"The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to ``dig out\" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to al"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2302.08453","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2023-02-16T17:56:08Z","cross_cats_sorted":["cs.AI","cs.LG","cs.MM"],"title_canon_sha256":"3493f9a6a3b749cc00c722a39ae608f90bd5448d955798810d630ed3d7ddd30a","abstract_canon_sha256":"e545ed690231d735387208fe8386e4305dfae265b4b9b2599e1aabb9efe9bccf"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:46.340209Z","signature_b64":"RNnCUV31pQzYoHrrOm9SuWEBcRf6GnK+mtacLRvHj2kCVHOX4fxlovkq0xFJUQ8qr4JQyAtcLk6wSYwGUZGTDw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"a4e23c4fc24d02191cb507f18b9859509c58f3dc6a9de13d2dce55948f469b3c","last_reissued_at":"2026-05-17T23:38:46.339664Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:46.339664Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Lightweight adapters align external signals with the internal knowledge of frozen text-to-image diffusion models.","cross_cats":["cs.AI","cs.LG","cs.MM"],"primary_cat":"cs.CV","authors_text":"Chong Mou, Jian Zhang, Liangbin Xie, Xiaohu Qie, Xintao Wang, Yanze Wu, Ying Shan, Zhongang Qi","submitted_at":"2023-02-16T17:56:08Z","abstract_excerpt":"The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to ``dig out\" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to al"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the internal knowledge implicitly learned by large T2I models can be effectively aligned with external control signals using simple lightweight adapters without degrading generative quality or requiring full model retraining.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"T2I-Adapters are lightweight modules that enable fine-grained control over color and structure in text-to-image diffusion models by aligning external conditions with the frozen model's internal knowledge.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Lightweight adapters align external signals with the internal knowledge of frozen text-to-image diffusion models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"b527280d0607c40b6b5f40438bcb17f874526f01f80c619b9587e7395a95ffab"},"source":{"id":"2302.08453","kind":"arxiv","version":2},"verdict":{"id":"7c68bf8c-0a5c-4e4e-8952-60b6d2ec499c","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T22:44:07.595923Z","strongest_claim":"we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models.","one_line_summary":"T2I-Adapters are lightweight modules that enable fine-grained control over color and structure in text-to-image diffusion models by aligning external conditions with the frozen model's internal knowledge.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the internal knowledge implicitly learned by large T2I models can be effectively aligned with external control signals using simple lightweight adapters without degrading generative quality or requiring full model retraining.","pith_extraction_headline":"Lightweight adapters align external signals with the internal knowledge of frozen text-to-image diffusion models."},"references":{"count":47,"sample":[{"doi":"","year":2022,"title":"eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers","work_id":"2cd7b629-ab37-4ce5-b51e-aa4d99547468","ref_index":1,"cited_arxiv_id":"2211.01324","is_internal_anchor":true},{"doi":"","year":2018,"title":"Coco- stuff: Thing and stuff classes in context","work_id":"649caf1e-4b1d-47df-83fd-95d8f230ac97","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Vision transformer adapter for dense predictions","work_id":"2371694e-3e9f-4892-a1a2-f64b28d4b349","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Openmmlab pose estimation toolbox and benchmark","work_id":"c2f0a614-b463-40d8-9007-d3ab5f2f0d14","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"Gen- erative adversarial networks: An overview","work_id":"9a75560f-8b87-4ce7-a03c-f2bba15dbf27","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":47,"snapshot_sha256":"c82d31b5ac7b3de42f604c9fad087aa16850081e6f79bd3bd5c5a34c20732e36","internal_anchors":7},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2302.08453","created_at":"2026-05-17T23:38:46.339750+00:00"},{"alias_kind":"arxiv_version","alias_value":"2302.08453v2","created_at":"2026-05-17T23:38:46.339750+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2302.08453","created_at":"2026-05-17T23:38:46.339750+00:00"},{"alias_kind":"pith_short_12","alias_value":"UTRDYT6CJUBB","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"UTRDYT6CJUBBSHFV","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"UTRDYT6C","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":26,"internal_anchor_count":26,"sample":[{"citing_arxiv_id":"2405.18716","citing_title":"SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"2511.16766","citing_title":"SVG360: Editable Multiview Vector Graphics from a Single SVG","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20807","citing_title":"Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21431","citing_title":"iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18010","citing_title":"Functionalization via Structure Completion and Motion Rectification","ref_index":125,"is_internal_anchor":true},{"citing_arxiv_id":"2506.18871","citing_title":"OmniGen2: Towards Instruction-Aligned Multimodal Generation","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2509.17458","citing_title":"CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2401.07519","citing_title":"InstantID: Zero-shot Identity-Preserving Generation in Seconds","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2302.05543","citing_title":"Adding Conditional Control to Text-to-Image Diffusion Models","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2601.05127","citing_title":"LooseRoPE: Content-aware Attention Manipulation for Semantic Harmonization","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2310.19512","citing_title":"VideoCrafter1: Open Diffusion Models for High-Quality Video Generation","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2503.21755","citing_title":"VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2403.14608","citing_title":"Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey","ref_index":242,"is_internal_anchor":true},{"citing_arxiv_id":"2404.02101","citing_title":"CameraCtrl: Enabling Camera Control for Text-to-Video Generation","ref_index":135,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26244","citing_title":"MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09622","citing_title":"Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10019","citing_title":"The two clocks and the innovation window: When and how generative models learn rules","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00781","citing_title":"Map2World: Segment Map Conditioned Text to 3D World Generation","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00707","citing_title":"PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00658","citing_title":"UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2504.17761","citing_title":"Step1X-Edit: A Practical Framework for General Image Editing","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19954","citing_title":"Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07940","citing_title":"Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2308.06721","citing_title":"IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2307.04725","citing_title":"AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning","ref_index":13,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/UTRDYT6CJUBBSHFVA7YYXGCZKC","json":"https://pith.science/pith/UTRDYT6CJUBBSHFVA7YYXGCZKC.json","graph_json":"https://pith.science/api/pith-number/UTRDYT6CJUBBSHFVA7YYXGCZKC/graph.json","events_json":"https://pith.science/api/pith-number/UTRDYT6CJUBBSHFVA7YYXGCZKC/events.json","paper":"https://pith.science/paper/UTRDYT6C"},"agent_actions":{"view_html":"https://pith.science/pith/UTRDYT6CJUBBSHFVA7YYXGCZKC","download_json":"https://pith.science/pith/UTRDYT6CJUBBSHFVA7YYXGCZKC.json","view_paper":"https://pith.science/paper/UTRDYT6C","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2302.08453&json=true","fetch_graph":"https://pith.science/api/pith-number/UTRDYT6CJUBBSHFVA7YYXGCZKC/graph.json","fetch_events":"https://pith.science/api/pith-number/UTRDYT6CJUBBSHFVA7YYXGCZKC/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/UTRDYT6CJUBBSHFVA7YYXGCZKC/action/timestamp_anchor","attest_storage":"https://pith.science/pith/UTRDYT6CJUBBSHFVA7YYXGCZKC/action/storage_attestation","attest_author":"https://pith.science/pith/UTRDYT6CJUBBSHFVA7YYXGCZKC/action/author_attestation","sign_citation":"https://pith.science/pith/UTRDYT6CJUBBSHFVA7YYXGCZKC/action/citation_signature","submit_replication":"https://pith.science/pith/UTRDYT6CJUBBSHFVA7YYXGCZKC/action/replication_record"}},"created_at":"2026-05-17T23:38:46.339750+00:00","updated_at":"2026-05-17T23:38:46.339750+00:00"}