A Comprehensive Ecosystem for Open-Domain Customized Video Generation
Pith reviewed 2026-06-27 10:03 UTC · model grok-4.3
The pith
CustoMDiT adapts a pretrained diffusion transformer for customized video generation with only 8 percent additional learnable parameters, trained on a new million-scale dataset.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Releasing PexelsCustom-1M together with the parameter-efficient CustoMDiT framework allows a pretrained multimodal Diffusion Transformer to become a customized video generator that preserves specific identities across open-domain prompts while adding only eight percent learnable parameters, and this combination outperforms prior state-of-the-art approaches.
What carries the argument
CustoMDiT, the parameter-efficient adaptation framework that inserts a small set of learnable parameters into a pretrained multimodal Diffusion Transformer to steer it toward identity-specific video output.
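To make the scale of such an adaptation concrete, the sketch below counts the extra parameters a low-rank adapter (LoRA) adds to a single linear layer. The method excerpt quoted in the reference graph on this page says the reference image is injected through LoRA with the backbone frozen, and the setup excerpt gives a LoRA rank of 128. The layer width of 3072 and the function name `lora_param_fraction` are assumptions for illustration, not values taken from the paper.

```python
def lora_param_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Extra LoRA parameters relative to a frozen weight matrix W of shape (d_out, d_in).

    LoRA learns a low-rank update B @ A, with A of shape (rank, d_in)
    and B of shape (d_out, rank), so it adds rank * (d_in + d_out) parameters.
    """
    base = d_in * d_out
    extra = rank * (d_in + d_out)
    return extra / base


# Assumed (hypothetical) layer width; rank 128 is the value quoted in the setup excerpt.
hidden = 3072
rank = 128
fraction = lora_param_fraction(hidden, hidden, rank)
print(f"extra parameters per adapted square layer: {fraction:.2%}")
```

Under these assumed dimensions the ratio is 2r/d = 256/3072, about 8.3 percent per adapted square layer. The model-wide fraction depends on which layers carry adapters, which the text here does not specify, so this should not be read as a derivation of the reported 8 percent.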
If this is right
- Video customization becomes feasible across thousands of categories instead of the roughly one hundred classes covered by older benchmarks.
- Only a small fraction of parameters must be trained while the pretrained backbone stays frozen, lowering the compute needed for adaptation.
- The OpenCustom benchmark supplies a more demanding and diverse test than prior small-scale sets.
- Public release of the full dataset, benchmark, and code creates a shared starting point for further work on identity-preserving generation.
Where Pith is reading between the lines
- The same light adaptation approach might transfer to other modalities such as audio or 3D content without full retraining.
- Curated large datasets of this kind could reduce reliance on massive pretraining runs by supplying better-aligned examples from the start.
- Combining the framework with explicit motion or temporal modules might further improve long-video coherence.
Load-bearing premise
The one million triplets in PexelsCustom-1M capture real identity attributes across thousands of categories without major curation biases or labeling mistakes that would distort training or evaluation.
What would settle it
If CustoMDiT retrained on PexelsCustom-1M preserves target identities on the OpenCustom benchmark no better than earlier methods, the central performance claim would not hold.
Original abstract
Recent progress in video generation has shown impressive visual synthesis capabilities. However, open-domain customized video generation remains limited by the lack of large-scale, annotated datasets capturing diverse identity-specific attributes. To address this, we introduce PexelsCustom-1M, the first publicly available million-scale dataset for identity-preserving video generation, containing one million curated <identity, text, video> triplets across 8,000+ categories. Leveraging this, we propose CustoMDiT, a parameter-efficient framework that adapts a pretrained multimodal Diffusion Transformer into a customized video generator with only 8% additional learnable parameters. Our method surpasses prior state-of-the-art. However, benchmarks such as DreamBooth cover only 100 classes, which is insufficient for real-world applications. To overcome this, we construct OpenCustom, a new benchmark with 1,000+ categories, created via cross-dataset knowledge fusion from ImageNet and MS-COCO. Extensive experiments confirm the advantages of both our dataset and model. We will open-source the entire ecosystem--including dataset, pipeline, benchmark, and implementations--to support further research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PexelsCustom-1M, the first publicly available million-scale dataset containing one million curated <identity, text, video> triplets across 8,000+ categories for identity-preserving video generation. It proposes CustoMDiT, a parameter-efficient framework adapting a pretrained multimodal Diffusion Transformer into a customized video generator using only 8% additional learnable parameters. The work also constructs the OpenCustom benchmark with 1,000+ categories via cross-dataset fusion from ImageNet and MS-COCO, claims that the method surpasses prior state-of-the-art, and states that extensive experiments confirm the advantages of both the dataset and model, with plans to open-source the full ecosystem.
Significance. If the reported superiority holds under rigorous validation, the work would be significant for enabling open-domain customized video generation by addressing the lack of large-scale annotated datasets and providing an efficient adaptation approach that avoids full retraining. The commitment to open-sourcing the dataset, pipeline, benchmark, and implementations is a clear strength that would support reproducibility and further research in the field.
major comments (2)
- [Abstract / Dataset Description] Abstract / Dataset section: The central claim that PexelsCustom-1M 'accurately capture[s] diverse identity-specific attributes' and enables SOTA performance rests on the dataset being free of substantial curation biases or annotation errors, yet the manuscript supplies no validation statistics, inter-annotator agreement scores, diversity metrics (e.g., motion variety or category balance across 8,000+ categories), or error analysis for the one million triplets. This is load-bearing for both the fine-tuning of CustoMDiT and the reliability of the OpenCustom benchmark.
- [Abstract / Experiments] Abstract / Experiments section: The assertions that 'our method surpasses prior state-of-the-art' and that 'extensive experiments confirm the advantages' are made without reference to specific quantitative metrics, baselines (e.g., DreamBooth comparisons), ablation results on the 8% parameter adaptation, or error analysis. If these details are missing or insufficiently reported in the experimental sections, the superiority claim cannot be substantiated.
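One of the diversity statistics requested above, category balance, can be summarized with a single number. The sketch below uses normalized Shannon entropy over category labels, where 1.0 means perfectly uniform categories and values near 0 mean a few categories dominate. The choice of metric, the function name `category_balance`, and the toy labels are assumptions for illustration; the manuscript does not say which statistic the authors would report.

```python
import math
from collections import Counter


def category_balance(labels):
    """Normalized Shannon entropy of the category distribution, in [0, 1]."""
    counts = Counter(labels)
    if len(counts) < 2:
        # Zero or one category: no balance to measure.
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # Divide by the maximum possible entropy for this number of categories.
    return entropy / math.log(len(counts))


# Hypothetical toy labels standing in for per-triplet category annotations.
balanced = ["dog", "cat", "dog", "cat"]
skewed = ["dog", "dog", "dog", "cat"]
print(category_balance(balanced), category_balance(skewed))
```

A score like this would complement, not replace, per-category counts, since two very different long-tail distributions can share the same entropy.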
minor comments (1)
- [Abstract] Abstract: The phrasing 'Our method surpasses prior state-of-the-art.' is imprecise; it should specify the exact prior methods, metrics (e.g., identity preservation scores, video quality), and benchmark settings for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on dataset validation and experimental reporting. We address each major comment below and will revise the manuscript to strengthen these aspects.
Point-by-point responses
-
Referee: [Abstract / Dataset Description] Abstract / Dataset section: The central claim that PexelsCustom-1M 'accurately capture[s] diverse identity-specific attributes' and enables SOTA performance rests on the dataset being free of substantial curation biases or annotation errors, yet the manuscript supplies no validation statistics, inter-annotator agreement scores, diversity metrics (e.g., motion variety or category balance across 8,000+ categories), or error analysis for the one million triplets. This is load-bearing for both the fine-tuning of CustoMDiT and the reliability of the OpenCustom benchmark.
Authors: We agree that explicit validation statistics are needed to support the dataset claims. The current manuscript describes the curation process but does not report inter-annotator agreement, diversity metrics, or error analysis. In the revision we will add a new subsection with these details, including inter-annotator agreement computed on a held-out sample, category balance and motion variety statistics across the 8,000+ categories, and a quantitative error analysis on a random subset of triplets. Revision: yes.
-
Referee: [Abstract / Experiments] Abstract / Experiments section: The assertions that 'our method surpasses prior state-of-the-art' and that 'extensive experiments confirm the advantages' are made without reference to specific quantitative metrics, baselines (e.g., DreamBooth comparisons), ablation results on the 8% parameter adaptation, or error analysis. If these details are missing or insufficiently reported in the experimental sections, the superiority claim cannot be substantiated.
Authors: The experimental section contains quantitative comparisons and ablations, but the abstract and high-level claims do not explicitly reference the metrics or tie them to the 8% adaptation results. We will revise the abstract to cite specific metrics (e.g., FID, CLIP similarity) and ensure the experiments section clearly tabulates DreamBooth baselines, the parameter-efficiency ablation, and error analysis. This will make the superiority statements directly traceable to the reported numbers. Revision: partial.
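The rebuttal names CLIP similarity as one candidate metric. A common way to turn embedding similarity into an identity-preservation score is to average the cosine similarity between a reference-image embedding and each generated frame's embedding. The sketch below assumes the embeddings have already been produced by some image encoder; the function names and toy vectors are hypothetical, and this is not the paper's evaluation protocol.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def identity_score(reference, frames):
    """Mean cosine similarity between the reference embedding and each frame embedding."""
    if not frames:
        raise ValueError("at least one frame embedding is required")
    return sum(cosine(reference, f) for f in frames) / len(frames)


# Toy 2-D embeddings: the first frame matches the reference, the second is orthogonal.
score = identity_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(score)  # 0.5
```

Reporting the per-frame distribution alongside the mean would show whether identity drifts over the course of a clip, which a single average can hide.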
Circularity Check
No circularity; claims rest on new dataset, benchmark, and empirical adaptation.
Full rationale
The paper's central contributions are the creation of PexelsCustom-1M (1M triplets across 8000+ categories), the OpenCustom benchmark (via cross-dataset fusion), and CustoMDiT (8% parameter adaptation of a pretrained DiT). No equations, first-principles derivations, or 'predictions' are described that reduce by construction to fitted inputs or self-citations. Superiority claims are supported by experiments on the new resources rather than self-referential fitting. This matches the reader's assessment of no load-bearing circularity in the derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: A pretrained multimodal Diffusion Transformer can be adapted for identity-preserving customized video generation using a small fraction of additional parameters.
Reference graph
Works this paper leans on
-
[1]
Customized Video Generation (CVG) seeks to preserve visual identities while embedding them into diverse scenarios guided by text
INTRODUCTION The rapid advancement of video generation has intensified demands for customizable content creation in domains such as advertising and digital media. Customized Video Generation (CVG) seeks to preserve visual identities while embedding them into diverse scenarios guided by text. Although prior works in customized image [1, 2, 3, 4, 5, 6, 7,...
-
[2]
A Comprehensive Ecosystem for Open-Domain Customized Video Generation
OPEN-DOMAIN DATA CURATION 2.1. Data Pre-Processing Pexels-400K contains high-quality videos, each accompanied by a descriptive caption. However, these captions primarily focus on the main subject and its motion, while lacking descriptions of other present identities. To address this limitation, we employ a vision-language model (VLM) [22] to generate s...
-
[3]
2 summarizes the training and inference pipeline of CustoMDiT
METHOD Fig. 2 summarizes the training and inference pipeline of CustoMDiT. Following OminiControl [6], we inject the reference image via a Low-Rank Adapter (LoRA) while keeping the pretrained backbone frozen. Prior approaches [3, 10, 14] typically rely on a learned feature extractor or an off-the-shelf image encoder (e.g., CLIP), which often emphasizes ...
-
[4]
EXPERIMENTS 4.1. Experimental Setup Implementation Details. We use CogVideoX-5B [29] as the base model for CustoMDiT, setting both the LoRA rank and LoRA alpha to 128. CustoMDiT is trained on PexelsCustom-1M for 8,000 steps (global batch size 128) without data augmentation, using 64 NVIDIA A100 GPUs for 60 hours, followed by an additional 2,000 training ...
-
[5]
Building on this, we develop an efficient CVG framework via LoRA-adapted MMDiT
CONCLUSION We present a large-scale open-domain dataset for customized video generation (CVG). Building on this, we develop an efficient CVG framework via LoRA-adapted MMDiT. To rigorously evaluate open-domain generalization, we introduce a benchmark covering over 1,000 categories. We will open-source all resources to support future research. While our...
-
[6]
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation,
Nataniel Ruiz et al., “Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 22500–22510
2023
-
[7]
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Rinon Gal et al., “An image is worth one word: Personalizing text-to-image generation using textual inversion,” arXiv preprint arXiv:2208.01618, 2022
-
[8]
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
Hu Ye et al., “Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models,” arXiv preprint arXiv:2308.06721, 2023
-
[9]
Ms-diffusion: Multi-subject zero-shot image personalization with layout guidance,
Xiaowei Wang et al., “Ms-diffusion: Multi-subject zero-shot image personalization with layout guidance,” arXiv preprint arXiv:2406.07209, 2024
-
[10]
Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing,
Dongxu Li et al., “Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing,” Advances in Neural Information Processing Systems, vol. 36, pp. 30146–30166, 2023
2023
-
[11]
Ominicontrol: Minimal and universal control for diffusion transformer,
Zhenxiong Tan et al., “Ominicontrol: Minimal and universal control for diffusion transformer,” arXiv preprint arXiv:2411.15098, vol. 3, 2024
-
[12]
Ssr-encoder: Encoding selective subject representation for subject-driven generation,
Yuxuan Zhang et al., “Ssr-encoder: Encoding selective subject representation for subject-driven generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8069–8078
2024
-
[13]
T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,
Chong Mou et al., “T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,” in Proceedings of the AAAI conference on artificial intelligence, 2024, vol. 38, pp. 4296–4304
2024
-
[14]
Adding conditional control to text-to-image diffusion models,
Lvmin Zhang et al., “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 3836–3847
2023
-
[15]
Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation,
Yuxiang Wei et al., “Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15943–15953
2023
-
[16]
Motionbooth: Motion-aware customized text-to-video generation,
Jianzong Wu et al., “Motionbooth: Motion-aware customized text-to-video generation,” arXiv preprint arXiv:2406.17758, 2024
-
[17]
Dreamvideo: Composing your dream videos with customized subject and motion,
Yujie Wei et al., “Dreamvideo: Composing your dream videos with customized subject and motion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 6537–6549
2024
-
[18]
Tao Wu et al., “Customcrafter: Customized video generation with preserving motion and concept composition abilities,” arXiv preprint arXiv:2408.13239, 2024
-
[19]
Videobooth: Diffusion-based video generation with image prompts,
Yuming Jiang et al., “Videobooth: Diffusion-based video generation with image prompts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 6689–6700
2024
-
[20]
Id-animator: Zero-shot identity-preserving human video generation,
Xuanhua He et al., “Id-animator: Zero-shot identity-preserving human video generation,” arXiv preprint arXiv:2404.15275, 2024
-
[21]
Identity-preserving text-to-video generation by frequency decomposition,
Shenghai Yuan et al., “Identity-preserving text-to-video generation by frequency decomposition,” arXiv preprint arXiv:2411.17440, 2024
-
[22]
Still-moving: Customized video generation without customized video data,
Hila Chefer et al., “Still-moving: Customized video generation without customized video data,” ACM Transactions on Graphics (TOG), vol. 43, no. 6, pp. 1–11, 2024
2024
-
[23]
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Yuwei Guo et al., “Animatediff: Animate your personalized text-to-image diffusion models without specific tuning,” arXiv preprint arXiv:2307.04725, 2023
-
[24]
Dreamvideo-2: Zero-shot subject-driven video customization with precise motion control,
Yujie Wei et al., “Dreamvideo-2: Zero-shot subject-driven video customization with precise motion control,” arXiv preprint arXiv:2410.13830, 2024
-
[25]
Imagenet large scale visual recognition challenge,
Olga Russakovsky et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, pp. 211–252, 2015
2015
-
[27]
Florence-2: Advancing a unified representation for a variety of vision tasks,
Bin Xiao et al., “Florence-2: Advancing a unified representation for a variety of vision tasks,” arXiv preprint arXiv:2311.06242, 2023
-
[28]
Grounded sam: Assembling open-world models for diverse visual tasks,
Tianhe Ren et al., “Grounded sam: Assembling open-world models for diverse visual tasks,” 2024
2024
-
[29]
Grounding dino 1.5: Advance the “edge” of open-set object detection,
Tianhe Ren et al., “Grounding dino 1.5: Advance the “edge” of open-set object detection,” arXiv preprint arXiv:2405.10300, 2024
-
[30]
Segment anything,
Alexander Kirillov et al., “Segment anything,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4015–4026
2023
-
[31]
Movie gen: A cast of media foundation models,
Adam Polyak et al., “Movie gen: A cast of media foundation models,” 2025
2025
-
[32]
Multi-concept customization of text-to-image diffusion,
Nupur Kumari et al., “Multi-concept customization of text-to-image diffusion,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 1931–1941
2023
-
[33]
Customvideo: Customizing text-to-video generation with multiple subjects,
Zhao Wang et al., “Customvideo: Customizing text-to-video generation with multiple subjects,” arXiv preprint arXiv:2401.09962, 2024
-
[34]
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang et al., “Cogvideox: Text-to-video diffusion models with an expert transformer,” arXiv preprint arXiv:2408.06072, 2024
-
[35]
Imagenet: A large-scale hierarchical image database,
Jia Deng et al., “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255
2009
-
[36]
Microsoft coco: Common objects in context,
Tsung-Yi Lin et al., “Microsoft coco: Common objects in context,” in Computer vision–ECCV 2014: 13th European conference, Zurich, Switzerland, September 6-12, 2014, proceedings, part V 13. Springer, 2014, pp. 740–755
2014
-
[37]
Learning transferable visual models from natural language supervision,
Alec Radford et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763
2021
-
[38]
Emerging properties in self-supervised vision transformers,
Mathilde Caron et al., “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9650–9660
2021
-
[39]
VBench: Comprehensive benchmark suite for video generative models,
Ziqi Huang et al., “VBench: Comprehensive benchmark suite for video generative models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2024
-
[40]
Character to video generation,
Vidu et al., “Character to video generation,” 2025
2025