Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture with a new paired dataset.
Canonical reference
T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models
Canonical reference. 89% of citing Pith papers cite this work as background.
abstract
The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to "dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.
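The adapter idea in the abstract can be pictured with a toy sketch: a frozen model produces features, and small per-condition adapters contribute extra features that are combined with them. Everything below is illustrative and assumed rather than taken from the abstract: the names `frozen_encoder`, `make_adapter`, and `guided_features` are hypothetical, features are plain lists of floats instead of network tensors, and the additive injection with per-adapter weights is one plausible way to express composability.

```python
# Illustrative sketch only: toy "features" are lists of floats, not real tensors.
from typing import Callable, List, Sequence

Features = List[float]
Adapter = Callable[[Sequence[float]], Features]


def frozen_encoder(image: Sequence[float]) -> Features:
    """Stand-in for a frozen T2I model stage; its fixed weight (2.0) never changes."""
    return [2.0 * x for x in image]


def make_adapter(scale: float) -> Adapter:
    """Hypothetical lightweight adapter: maps a control signal to same-shape features."""
    def adapter(condition: Sequence[float]) -> Features:
        return [scale * c for c in condition]
    return adapter


def guided_features(
    image: Sequence[float],
    conditions: Sequence[Sequence[float]],
    adapters: Sequence[Adapter],
    weights: Sequence[float],
) -> Features:
    """Add a weighted sum of adapter outputs to the frozen features (assumed scheme)."""
    out = frozen_encoder(image)
    for cond, adapter, w in zip(conditions, adapters, weights):
        out = [o + w * a for o, a in zip(out, adapter(cond))]
    return out


if __name__ == "__main__":
    structure_adapter = make_adapter(0.5)   # e.g. driven by a sketch-like signal
    color_adapter = make_adapter(0.25)      # e.g. driven by a color-like signal
    feats = guided_features(
        [1.0, 2.0],
        [[1.0, 0.0], [0.0, 1.0]],
        [structure_adapter, color_adapter],
        [1.0, 2.0],
    )
    print(feats)
```

With no adapters supplied, `guided_features` returns the frozen features unchanged, which mirrors the point that the base model itself is left untouched and only the adapters carry the control signal.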
representative citing papers
Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.
LooseRoPE modulates RoPE in diffusion attention maps to continuously trade off between preserving a pasted object's identity and harmonizing it with its new surroundings.
SVG360 lifts a single SVG to a view-conditioned representation, uses spatial memory to propagate consistent parts across views, and applies structure-aware vectorization to produce editable multiview SVGs.
A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.
ControlNet adds spatial conditioning controls to pretrained text-to-image diffusion models via zero convolutions for stable fine-tuning on small or large datasets.
Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.
Map2World produces scale-consistent 3D worlds from text and arbitrary segment maps via a detail enhancer that incorporates global structure information.
PhysEdit introduces adaptive reasoning depth and spatial masking to make image editing faster and more instruction-aligned without retraining the base model.
UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.
MetaSR adaptively orchestrates metadata in a DiT-based generative SR model to deliver up to 1 dB PSNR gains and 50% bitrate savings across diverse content and degradations.
Viewpoint tokens learned on a mixed 3D-rendered and photorealistic dataset enable precise camera control in text-to-image generation while factorizing geometry from appearance and transferring to unseen object categories.
VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs, and anomaly detection methods.
SketchDeco performs training-free sketch colourisation via diffusion inversion to insert user colors followed by custom self-attention blending for local fidelity and global harmony.
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
InstantID enables zero-shot identity-preserving image generation from one facial image via a novel IdentityNet that combines strong semantic and weak spatial conditioning with text prompts in diffusion models.
Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.
IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
PhyDrawGen is a neuro-symbolic pipeline that extracts typed scene graphs via LLM, converts them to physically constrained PSLGs via deterministic solver, and refines via fine-tuned Qwen-VL, claiming superior performance over GPT-5-image and Gemini models on 1,449 physics problems.
A two-stage method predicts an intermediate Canny map for structure then renders the image conditioned on appearance and structure, paired with a 100k text-aware dataset, to improve detail preservation in subject-driven generation.
DiffKT3D transfers priors from video diffusion models to 3D radiotherapy dose prediction via modality-specific embeddings and clinically guided RL, reducing voxel MAE from 2.07 to 1.93 and claiming SOTA over the GDP-HMM challenge winner.
CARINOX unifies noise optimization and exploration with human-correlated reward selection to boost compositional alignment in diffusion models, reporting +16% on T2I-CompBench++ and +11% on HRS while keeping quality and diversity.
OmniGen2 introduces a unified generative model with two distinct decoding pathways and a decoupled image tokenizer that achieves competitive results on text-to-image and editing benchmarks plus state-of-the-art consistency among open-source models on the new OmniContext benchmark.
citing papers explorer
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.