Recognition: 2 theorem links
· Lean TheoremDSH-Bench: A Difficulty- and Scenario-Aware Benchmark with Hierarchical Subject Taxonomy for Subject-Driven Text-to-Image Generation
Pith reviewed 2026-05-15 15:13 UTC · model grok-4.3
The pith
DSH-Bench supplies a hierarchical taxonomy and difficulty-scenario labels to expose where subject-driven text-to-image models lose identity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DSH-Bench samples subjects from a hierarchical taxonomy that spans 58 fine-grained categories, classifies each prompt by subject difficulty level and scenario type, measures identity preservation with the Subject Identity Consistency Score that correlates 9.4 percent better with human judgments than prior metrics, and extracts diagnostic patterns from evaluations of 19 models to direct future training and data work.
What carries the argument
The Subject Identity Consistency Score (SICS) together with the hierarchical taxonomy sampling mechanism and the difficulty-scenario classification scheme, which together turn raw model outputs into granular, actionable performance maps.
If this is right
- Models can now be ranked separately on easy versus hard subjects and on different prompt scenarios, revealing weaknesses hidden by aggregate scores.
- Training data construction can target the specific fine-grained categories and difficulty levels where current models fail most often.
- Future subject-driven systems can incorporate the diagnostic patterns to adjust loss weights or data sampling during training.
- Evaluation protocols for new models can adopt SICS as a primary subject-preservation measure because of its tighter link to human judgment.
Where Pith is reading between the lines
- The same taxonomy and labeling approach could be applied to video or 3D generation benchmarks to create comparable difficulty-aware test suites.
- Automated dataset curators could use the taxonomy tree to balance training collections across rare subject categories before model training begins.
- Widespread use of the difficulty labels might help surface demographic or cultural biases that appear only under specific scenario conditions.
Load-bearing premise
The chosen taxonomy and difficulty-scenario labels are comprehensive enough to represent real usage without systematic bias, and the reported correlation gain for SICS holds for other human raters and model families.
What would settle it
Fresh human ratings on images from models outside the original study set show that SICS no longer correlates more strongly with people than existing subject-preservation metrics.
Figures
read the original abstract
Significant progress has been achieved in subject-driven text-to-image (T2I) generation, which aims to synthesize new images depicting target subjects according to user instructions. However, evaluating these models remains a significant challenge. Existing benchmarks exhibit critical limitations: 1) insufficient diversity and comprehensiveness in subject images, 2) inadequate granularity in assessing model performance across different subject difficulty levels and prompt scenarios, and 3) a profound lack of actionable insights and diagnostic guidance for subsequent model refinement. To address these limitations, we propose DSH-Bench, a comprehensive benchmark that enables systematic multi-perspective analysis of subject-driven T2I models through four principal innovations: 1) a hierarchical taxonomy sampling mechanism ensuring comprehensive subject representation across 58 fine-grained categories, 2) an innovative classification scheme categorizing both subject difficulty level and prompt scenario for granular capability assessment, 3) a novel Subject Identity Consistency Score (SICS) metric demonstrating a 9.4\% higher correlation with human evaluation compared to existing measures in quantifying subject preservation, and 4) a comprehensive set of diagnostic insights derived from the benchmark, offering critical guidance for optimizing future model training paradigms and data construction strategies. Through an extensive empirical evaluation of 19 leading models, DSH-Bench uncovers previously obscured limitations in current approaches, establishing concrete directions for future research and development.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DSH-Bench, a benchmark for subject-driven text-to-image generation. It proposes four innovations: hierarchical taxonomy sampling across 58 fine-grained categories, classification of subject difficulty levels and prompt scenarios, a new Subject Identity Consistency Score (SICS) metric with a claimed 9.4% higher correlation to human evaluations than existing measures, and diagnostic insights obtained by evaluating 19 leading models.
Significance. If the SICS correlation improvement and taxonomy comprehensiveness are rigorously validated, DSH-Bench would supply a more granular evaluation framework than prior benchmarks, enabling targeted diagnosis of model weaknesses in subject preservation across difficulty and scenario dimensions. The evaluation of 19 models provides a useful empirical snapshot that could guide data and training improvements.
major comments (1)
- [Abstract] Abstract: the headline claim that SICS yields a 9.4% higher correlation with human evaluation is load-bearing for the metric's novelty and the benchmark's diagnostic value, yet the text supplies no information on the correlation coefficient employed, number/qualification of raters, inter-rater agreement statistics, significance testing, or whether the comparison used the same image set or held-out data. Without these details the reported gain cannot be assessed for robustness or generalizability.
minor comments (1)
- [Abstract] The abstract refers to 'extensive empirical evaluation' and 'comprehensive set of diagnostic insights' but does not preview any specific quantitative results or tables that would allow readers to judge the scale of the uncovered limitations.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for greater transparency around the SICS validation protocol. The concern is well-taken; while the full experimental details appear in Section 4.3, the abstract does not summarize them. We will revise the abstract and add a concise validation summary to improve accessibility.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline claim that SICS yields a 9.4% higher correlation with human evaluation is load-bearing for the metric's novelty and the benchmark's diagnostic value, yet the text supplies no information on the correlation coefficient employed, number/qualification of raters, inter-rater agreement statistics, significance testing, or whether the comparison used the same image set or held-out data. Without these details the reported gain cannot be assessed for robustness or generalizability.
Authors: We agree that the abstract should convey these essential details. In the current manuscript, Section 4.3 describes the human study: Pearson correlation was used; 15 raters with computer-vision background participated; inter-rater agreement reached Fleiss’ κ = 0.82; significance was assessed with a paired t-test (p < 0.01); and all comparisons were performed on the identical set of 1,200 generated images. We will (1) expand the abstract to include a one-sentence summary of the protocol and (2) add a short “Validation of SICS” paragraph in Section 3.3 that explicitly lists the coefficient, rater count/qualifications, agreement statistic, significance test, and data-split information. These changes will be present in the revised version. revision: yes
Circularity Check
No circularity: DSH-Bench claims are independent empirical contributions
full rationale
The paper introduces a hierarchical taxonomy, difficulty/scenario labels, the SICS metric, and diagnostic insights as four distinct innovations. The 9.4% human-correlation improvement for SICS is presented as an external empirical result rather than a definitional or fitted tautology. No equations, self-definitional reductions, fitted-input predictions, or load-bearing self-citation chains appear in the abstract or described contributions. The taxonomy and labels are sampling and classification mechanisms, not quantities derived from the metric itself. This is a self-contained benchmark paper whose central claims rest on external human evaluation and model testing rather than internal construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Existing benchmarks suffer from insufficient diversity, inadequate granularity, and lack of actionable insights
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
a novel Subject Identity Consistency Score (SICS) metric demonstrating a 9.4% higher correlation with human evaluation
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
hierarchical taxonomy sampling mechanism ensuring comprehensive subject representation across 58 fine-grained categories
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
ACM Transactions on Graphics (TOG) 42(6), 1–10 (2023)
Alaluf, Y., Richardson, E., Metzer, G., Cohen-Or, D.: A neural space-time repre- sentation for text-to-image personalization. ACM Transactions on Graphics (TOG) 42(6), 1–10 (2023)
work page 2023
-
[2]
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report (2025),https://arxiv.org/abs/2502.13923
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Zhang, Q., Kreis, K., Aittala, M., Aila, T., Laine, S., et al.: ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[4]
1 kontext: Flow matching for in- context image generation and editing in latent space
Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., et al.: Flux. 1 kontext: Flow matching for in- context image generation and editing in latent space. arXiv e-prints pp. arXiv–2506 (2025)
work page 2025
-
[5]
In: Proceedings of the IEEE/CVF international conference on computer vision
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)
work page 2021
-
[6]
In: International Conference on Machine Learning
Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang, M.H., Murphy, K.P., Freeman, W.T., Rubinstein, M., et al.: Muse: Text-to-image generation via masked generative transformers. In: International Conference on Machine Learning. pp. 4055–4075. PMLR (2023)
work page 2023
-
[7]
Advances in Neural Information Processing Systems36, 30286–30305 (2023)
Chen, W., Hu, H., Li, Y., Ruiz, N., Jia, X., Chang, M.W., Cohen, W.W.: Subject- driven text-to-image generation via apprenticeship learning. Advances in Neural Information Processing Systems36, 30286–30305 (2023)
work page 2023
- [8]
-
[9]
In: 2009 IEEE conference on computer vision and pattern recognition
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. Ieee (2009)
work page 2009
-
[10]
Advances in neural information processing systems34, 19822–19835 (2021) 16 Z
Ding, M., Yang, Z., Hong, W., Zheng, W., Zhou, C., Yin, D., Lin, J., Zou, X., Shao, Z., Yang, H., et al.: Cogview: Mastering text-to-image generation via transformers. Advances in neural information processing systems34, 19822–19835 (2021) 16 Z. Hu et al
work page 2021
-
[11]
Dong, R., Han, C., Peng, Y., Qi, Z., Ge, Z., Yang, J., Zhao, L., Sun, J., Zhou, H., Wei, H., et al.: Dreamllm: Synergistic multimodal comprehension and creation. In: ICLR (2024)
work page 2024
-
[12]
In: Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R
Dosovitskiy, A., Brox, T.: Generating images with perceptual similarity metrics based on deep networks. In: Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 29. Curran Associates, Inc. (2016),https://proceedings.neurips.cc/paper_files/ paper/2016/file/371bce7dc83817b7893bcdeed13799b5...
work page 2016
-
[13]
In: Advances in Neural Information Processing Systems
Fu, S., Tamir, N., Sundaram, S., Chai, L., Zhang, R., Dekel, T., Isola, P.: Dreamsim: Learning new dimensions of human visual similarity using synthetic data. In: Advances in Neural Information Processing Systems. vol. 36, pp. 50742–50768 (2023)
work page 2023
-
[14]
In: European Conference on Computer Vision
Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D., Taigman, Y.: Make- a-scene: Scene-based text-to-image generation with human priors. In: European Conference on Computer Vision. pp. 89–106. Springer (2022)
work page 2022
-
[15]
Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen-Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion (2022).https://doi.org/10.48550/ARXIV.2208.01618 , https://arxiv.org/abs/2208.01618
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2208.01618 2022
-
[16]
Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen-or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion. In: The Eleventh International Conference on Learning Representations (2023),https://openreview.net/forum?id=NAQvF08TcyG
work page 2023
-
[17]
ACM Trans- actions on Graphics (TOG)42(4), 1–13 (2023)
Gal, R., Arar, M., Atzmon, Y., Bermano, A.H., Chechik, G., Cohen-Or, D.: Encoder- based domain tuning for fast personalization of text-to-image models. ACM Trans- actions on Graphics (TOG)42(4), 1–13 (2023)
work page 2023
-
[18]
Google: Introducing gemini 2.5 flash image, our state-of-the-art image model (2025), https://developers.googleblog.com/introducing-gemini-2-5-flash-image/ , accessed: 2025-12-15
work page 2025
-
[19]
arXiv preprint arXiv:2303.08767 (2023)
Han, I., Yang, S., Kwon, T., Ye, J.C.: Highly personalized text embedding for image manipulation by stable diffusion. arXiv preprint arXiv:2303.08767 (2023)
-
[20]
arXiv preprint arXiv:2306.00971 (2023)
Hao, S., Han, K., Zhao, S., Wong, K.Y.K.: Vico: Plug-and-play visual condition for personalized text-to-image generation. arXiv preprint arXiv:2306.00971 (2023)
- [21]
- [22]
-
[23]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Hu, H., Chan, K.C., Su, Y.C., Chen, W., Li, Y., Sohn, K., Zhao, Y., Ben, X., Gong, B., Cohen, W., et al.: Instruct-imagen: Image generation with multi-modal instruction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4754–4763 (2024)
work page 2024
- [24]
- [25]
- [26]
-
[27]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Kang, M., Zhu, J.Y., Zhang, R., Park, J., Shechtman, E., Paris, S., Park, T.: Scaling up gans for text-to-image synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10124–10134 (2023)
work page 2023
-
[28]
In: Proceedings of the IEEE/CVF international conference on computer vision
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4015–4026 (2023)
work page 2023
-
[29]
Advances in Neural Information Processing Systems36, 36652–36663 (2023)
Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., Levy, O.: Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems36, 36652–36663 (2023)
work page 2023
-
[30]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y.: Multi-concept cus- tomization of text-to-image diffusion. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1931–1941 (2023)
work page 1931
- [31]
-
[32]
Advances in Neural Information Processing Systems36, 30146–30166 (2023)
Li, D., Li, J., Hoi, S.: Blip-diffusion: Pre-trained subject representation for con- trollable text-to-image generation and editing. Advances in Neural Information Processing Systems36, 30146–30166 (2023)
work page 2023
- [33]
-
[34]
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Computer vision– ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13. pp. 740–755. Springer (2014)
work page 2014
-
[35]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Liu,Z.,Rodriguez-Opazo,C.,Teney,D.,Gould,S.:Imageretrievalonreal-lifeimages with pre-trained vision-and-language models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2125–2134 (2021)
work page 2021
- [36]
- [37]
-
[38]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Mao,C.,Zhang,J.,Pan,Y.,Jiang,Z.,Han,Z.,Liu,Y.,Zhou,J.:Ace++:Instruction- based image creation and editing via context-aware content filling. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1958–1966 (2025)
work page 1958
-
[39]
Mao, Z., Huang, M., Ding, F., Liu, M., He, Q., Zhang, Y.: Realcustom++: Represent- ing images as real-word for real-time customization. arXiv e-prints pp. arXiv–2408 (2024)
work page 2024
-
[40]
In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers
Mou, C., Wu, Y., Wu, W., Guo, Z., Zhang, P., Cheng, Y., Luo, Y., Ding, F., Zhang, S., Li, X., et al.: Dreamo: A unified framework for image customization. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers. pp. 1–12 (2025)
work page 2025
-
[41]
OpenAI: Introducing gpt-4o and more tools to chatgpt free users (2024),https:// openai.com/index/gpt-4o-and-more-tools-to-chatgpt-free/ , accessed: 2024- 06-15
work page 2024
- [42]
-
[43]
arXiv preprint arXiv:2402.05195 (2024)
Patel, M., Jung, S., Baral, C., Yang, Y.:λ-eclipse: Multi-concept personalized text-to-image diffusion models by leveraging clip latent space. arXiv preprint arXiv:2402.05195 (2024)
-
[44]
Peebles, W., Xie, S.: Scalable diffusion models with transformers (2023),https: //arxiv.org/abs/2212.09748
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[45]
Peng, Y., Cui, Y., Tang, H., Qi, Z., Dong, R., Bai, J., Han, C., Ge, Z., Zhang, X., Xia, S.T.: Dreambench++: A human-aligned benchmark for personalized image gen- eration. In: The Thirteenth International Conference on Learning Representations (2025),https://openreview.net/forum?id=4GSOESJrk6
work page 2025
-
[46]
pin: https://www.pinterest.com/.https://www.pinterest.com/
-
[47]
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis (2023),https://arxiv.org/abs/2307.01952
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[48]
In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Prashnani, E., Cai, H., Mostofi, Y., Sen, P.: Pieapp: Perceptual image-error assess- ment through pairwise preference. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1808–1817 (2018).https://doi.org/10.1109/ CVPR.2018.00194
-
[49]
Ad- vances in Neural Information Processing Systems36, 79320–79362 (2023)
Qiu, Z., Liu, W., Feng, H., Xue, Y., Feng, Y., Liu, Z., Zhang, D., Weller, A., Schölkopf, B.: Controlling text-to-image diffusion by orthogonal finetuning. Ad- vances in Neural Information Processing Systems36, 79320–79362 (2023)
work page 2023
-
[50]
In: International conference on machine learning
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)
work page 2021
-
[51]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
work page 2022
-
[52]
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2022),https://arxiv.org/abs/2112. 10752
work page 2022
- [53]
-
[54]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 22500–22510 (2023)
work page 2023
-
[55]
Advances in neural information processing systems35, 36479–36494 (2022)
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text- to-image diffusion models with deep language understanding. Advances in neural information processing systems35, 36479–36494 (2022)
work page 2022
-
[56]
Shi, J., Xiong, W., Lin, Z., Jung, H.J.: Instantbooth: Personalized text-to-image generation without test-time finetuning (2023), https://arxiv.org/abs/2304. 03411
work page 2023
-
[57]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Sun, Q., Cui, Y., Zhang, X., Zhang, F., Yu, Q., Wang, Y., Rao, Y., Liu, J., Huang, T., Wang, X.: Generative multimodal models are in-context learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14398–14409 (2024)
work page 2024
-
[58]
Sun, S., Qu, B., Liang, X., Fan, S., Gao, W.: Ie-bench: Advancing the measure- ment of text-driven image editing for human perception alignment. arXiv preprint arXiv:2501.09927 (2025) DSH-Bench: A comprehensive benchmark for Subject-Driven T2I 19
-
[59]
Tan, Z., Liu, S., Yang, X., Xue, Q., Wang, X.: Ominicontrol: Minimal and universal control for diffusion transformer. arXiv preprint arXiv:2411.15098 (2024)
-
[60]
uns: https://unsplash.com/.https://unsplash.com/
- [61]
-
[62]
arXiv preprint arXiv:2404.02733 (2024)
Wang, H., Spinelli, M., Wang, Q., Bai, X., Qin, Z., Chen, A.: Instantstyle: Free lunch towards style-preserving in text-to-image generation. arXiv preprint arXiv:2404.02733 (2024)
-
[63]
arXiv preprint arXiv:2406.07209 (2024)
Wang, X., Fu, S., Huang, Q., He, W., Jiang, H.: Ms-diffusion: Multi-subject zero- shot image personalization with layout guidance. arXiv preprint arXiv:2406.07209 (2024)
-
[64]
Unified Reward Model for Multimodal Understanding and Generation
Wang, Y., Zang, Y., Li, H., Jin, C., Wang, J.: Unified reward model for multimodal understanding and generation. arXiv preprint arXiv:2503.05236 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[65]
In: Advances in Neural Information Processing Systems (NeurIPS) (2022)
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E.H., Le, Q.V., Zhou, D.: Chain-of-thought prompting elicits reasoning in large language models. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)
work page 2022
-
[66]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Wei, Y., Zhang, Y., Ji, Z., Bai, J., Zhang, L., Zuo, W.: Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15943–15953 (2023)
work page 2023
-
[67]
arXiv preprint arXiv:2504.02160 (2025)
Wu, S., Huang, M., Wu, W., Cheng, Y., Ding, F., He, Q.: Less-to-more general- ization: Unlocking more controllability by in-context generation. arXiv preprint arXiv:2504.02160 (2025)
-
[68]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Wu, X., Sun, K., Zhu, F., Zhao, R., Li, H.: Human preference score: Better aligning text-to-image models with human preference. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2096–2105 (2023)
work page 2096
-
[69]
arXiv preprint arXiv:2409.11340 (2024)
Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: Omnigen: Unified image generation. arXiv preprint arXiv:2409.11340 (2024)
-
[70]
Xiong, Z., Xiong, W., Shi, J., Zhang, H., Song, Y., Jacobs, N.: Groundingbooth: Grounding text-to-image customization (2025), https://arxiv.org/abs/2409. 08520
work page 2025
-
[71]
arXiv preprint arXiv:2412.21059 (2024)
Xu, J., Huang, Y., Cheng, J., Yang, Y., Xu, J., Wang, Y., Duan, W., Yang, S., Jin, Q., Li, S., et al.: Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation. arXiv preprint arXiv:2412.21059 (2024)
-
[72]
Advances in Neural Information Processing Systems36, 15903–15935 (2023)
Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., Dong, Y.: Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems36, 15903–15935 (2023)
work page 2023
-
[73]
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [74]
-
[75]
In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 586–595 (2018).https://doi. org/10.1109/CVPR.2018.00068
-
[76]
Zhang, Y., Song, Y., Liu, J., Wang, R., Yu, J., Tang, H., Li, H., Tang, X., Hu, Y., Pan, H., et al.: Ssr-encoder: Encoding selective subject representation for subject- 20 Z. Hu et al. driven generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8069–8078 (2024)
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.