DiffBlender: Composable and Versatile Multimodal Text-to-Image Diffusion Models

This repository contains the models from our paper DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models.

Code | Project Page

*Teaser figure*

Abstract

In this study, we aim to enhance the capabilities of diffusion-based text-to-image (T2I) generation models by integrating diverse modalities beyond textual descriptions within a unified framework. To this end, we categorize widely used conditional inputs into three modality types: structure, layout, and attribute. We propose a multimodal T2I diffusion model, which is capable of processing all three modalities within a single architecture without modifying the parameters of the pre-trained diffusion model, as only a small subset of components is updated. Our approach sets new benchmarks in multimodal generation through extensive quantitative and qualitative comparisons with existing conditional generation methods. We demonstrate that DiffBlender effectively integrates multiple sources of information and supports diverse applications in detailed image synthesis.

Model details

Model type: DiffBlender synthesizes images from complex combinations of input modalities. It enables flexible manipulation of conditions, providing customized generation aligned with user preferences. We designed its structure to extend intuitively to additional modalities while achieving a low training cost through a partial update of hypernetworks.

We provide the model checkpoint, trained with six modalities: sketch, depth map, grounding box, keypoints, color palette, and style embedding. It is available as `./checkpoint_latest.pth`.
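The paper groups conditional inputs into three modality types (structure, layout, and attribute); the sketch below shows how we read the six released conditions as mapping onto those types. This is an illustrative grouping only, not an interface exposed by this repository.

```python
# Illustrative grouping only -- not an API of this repository.
# The six released conditions, arranged by the paper's three modality types.
MODALITY_TYPES = {
    "structure": ["sketch", "depth_map"],               # spatially dense conditions
    "layout":    ["grounding_box", "keypoints"],        # spatially sparse conditions
    "attribute": ["color_palette", "style_embedding"],  # non-spatial, global conditions
}

if __name__ == "__main__":
    for group, conditions in MODALITY_TYPES.items():
        print(f"{group}: {', '.join(conditions)}")
```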

License: Apache 2.0 License

Where to send questions or comments about the model: https://github.com/sungnyun/diffblender/issues

Quick Start

Install the necessary packages with:

$ pip install -r requirements.txt

Download the DiffBlender model checkpoint from this Hugging Face model, and place it under ./diffblender_checkpoints/.
Also, prepare the Stable Diffusion model from this link (we used CompVis/sd-v1-4.ckpt).
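If you prefer scripting the download, a minimal sketch using `huggingface_hub` is shown below. The `repo_id` and `filename` are assumptions for illustration; replace them with the actual values from the model page.

```python
# Minimal download sketch using huggingface_hub (pip install huggingface_hub).
# NOTE: repo_id and filename below are placeholders -- replace them with the
# actual values shown on the Hugging Face model page.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="sungnyun/diffblender",         # assumed repo id
    filename="checkpoint_latest.pth",       # assumed checkpoint filename
    local_dir="./diffblender_checkpoints",  # place it under the expected directory
)
print(f"Checkpoint saved to: {ckpt_path}")
```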

Try Multimodal T2I Generation with DiffBlender

$ python inference.py --ckpt_path=./diffblender_checkpoints/{CKPT_NAME}.pth \
                      --official_ckpt_path=/path/to/sd-v1-4.ckpt \
                      --save_name={SAVE_NAME} 

Results will be saved under ./inference/{SAVE_NAME}/, formatted as {conditions + generated image}.
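To quickly inspect the outputs, a small sketch like the following lists and opens the saved images. The exact file names inside the folder are an assumption, so adjust the glob pattern as needed.

```python
# Quick sketch to browse generated results under ./inference/{SAVE_NAME}/.
# File naming is assumed; adjust the glob pattern to match your outputs.
import glob
from PIL import Image

save_name = "my_run"  # replace with the --save_name you used
for path in sorted(glob.glob(f"./inference/{save_name}/*.png")):
    img = Image.open(path)
    print(path, img.size)
```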

Training dataset

Microsoft COCO 2017 dataset

Citation

If you find our work useful or helpful for your research, please cite our paper as below.

@article{kim2023diffblender,
  title={DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models},
  author={Kim, Sungnyun and Lee, Junsoo and Hong, Kibeom and Kim, Daesik and Ahn, Namhyuk},
  journal={arXiv preprint arXiv:2305.15194},
  year={2023}
}