> [**HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation**](https://arxiv.org/pdf/2505.04512) <br>

## 🔥🔥🔥 News!!

* June 6, 2025: 💃 We release the inference code and model weights for audio-driven and video-driven video customization, powered by [OmniV2V](https://arxiv.org/abs/2506.01801).
* May 13, 2025: 🎉 HunyuanCustom has been integrated into [ComfyUI-HunyuanVideoWrapper](https://github.com/kijai/ComfyUI-HunyuanVideoWrapper/blob/develop/example_workflows/hyvideo_custom_testing_01.json) by [Kijai](https://github.com/kijai).
* May 12, 2025: 🔥 HunyuanCustom is available on Cloud Native Build (CNB): [HunyuanCustom](https://cnb.cool/tencent/hunyuan/HunyuanCustom).
* May 8, 2025: 👋 We release the inference code and model weights of HunyuanCustom. [Download](models/README.md).
## 📑 Open-source Plan

- Single-Subject Video Customization
  - [x] Inference
  - [x] Checkpoints
  - [x] ComfyUI
- Audio-Driven Video Customization
  - [x] Inference
  - [x] Checkpoints
  - [ ] ComfyUI
- Video-Driven Video Customization
  - [x] Inference
  - [x] Checkpoints
  - [ ] ComfyUI
- Multi-Subject Video Customization
## Contents

## Installation

```bash
# 4. Install pip dependencies
python -m pip install -r requirements.txt
# 5. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/[email protected]
```

In case of running into a floating point exception (core dump) on specific GPU types, you may try the following solutions:

```bash
# Option 1: Install a matching cuBLAS build and point LD_LIBRARY_PATH at it
pip install nvidia-cublas-cu12==12.4.5.8
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/nvidia/cublas/lib/

# Option 2: Force the CUDA 11.8 compiled version of PyTorch and all the other packages
pip uninstall -r requirements.txt  # uninstall all packages
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
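After applying either option, a quick way to confirm the environment is healthy is to run a small cuBLAS operation from Python: a crash here reproduces the core dump, while clean output means the fix took. A minimal sketch, assuming PyTorch is installed in the active environment:

```python
# Quick sanity check for the floating point exception workaround
# (a minimal sketch; assumes PyTorch is installed in this environment).
import torch

print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
assert torch.cuda.is_available(), "CUDA device not visible"

# A small matmul on the GPU exercises cuBLAS; if the mismatched-library
# crash is fixed, this completes instead of core-dumping.
a = torch.randn(512, 512, device="cuda")
b = torch.randn(512, 512, device="cuda")
print("matmul OK, mean =", (a @ b).mean().item())
```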
Additionally, you can also use the HunyuanVideo Docker image. Use the following commands to pull and run it; the container starts in detached mode, so attach to it with `docker exec -it hunyuanvideo bash` before running the `pip install` step:

```bash
# For CUDA 12.4 (updated to avoid the floating point exception)
docker pull hunyuanvideo/hunyuanvideo:cuda_12
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_12
pip install gradio==3.39.0 diffusers==0.33.0 transformers==4.41.2

# For CUDA 11.8
docker pull hunyuanvideo/hunyuanvideo:cuda_11
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_11
pip install gradio==3.39.0 diffusers==0.33.0 transformers==4.41.2
```
The details of downloading pretrained models are shown [here](models/README.md).

## 🚀 Multi-gpu Inference

For example, to generate a video with 8 GPUs, you can use the following command:

### Run Single-Subject Video Customization

```bash
cd HunyuanCustom

export MODEL_BASE="./models"
export PYTHONPATH=./
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
    --ref-image './assets/images/seg_woman_01.png' \
    --pos-prompt "Realistic, High-quality. A woman is drinking coffee at a café." \
    --neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
    --ckpt ${MODEL_BASE}"/hunyuancustom_720P/mp_rank_00_model_states.pt" \
    --save-path './results/sp_720p'
```
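If you want to sweep several reference images or prompts, the documented command can be wrapped in a small launcher. A minimal sketch that reuses only the flags shown above; the job list itself is illustrative:

```python
# Minimal batch launcher around sample_batch.py; the flags mirror the
# command documented above, and the (ref_image, prompt) pairs are
# illustrative. Run from the repo root with MODEL_BASE and PYTHONPATH
# exported as shown earlier (subprocess inherits the environment).
import subprocess

jobs = [
    ("./assets/images/seg_woman_01.png",
     "Realistic, High-quality. A woman is drinking coffee at a café."),
    # append more (reference image, prompt) pairs here
]

for ref_image, prompt in jobs:
    subprocess.run(
        ["torchrun", "--nnodes=1", "--nproc_per_node=8", "--master_port", "29605",
         "hymm_sp/sample_batch.py",
         "--ref-image", ref_image,
         "--pos-prompt", prompt,
         "--ckpt", "./models/hunyuancustom_720P/mp_rank_00_model_states.pt",
         "--save-path", "./results/sp_720p"],
        check=True,  # stop the sweep if one job fails
    )
```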
### Run Video-Driven Video Customization (Video Editing)

```bash
cd HunyuanCustom

export MODEL_BASE="./models"
export PYTHONPATH=./
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
    --ref-image './assets/images/sed_red_panda.png' \
    --input-video './assets/input_videos/001_bg.mp4' \
    --mask-video './assets/input_videos/001_mask.mp4' \
    --expand-scale 5 \
    --video-condition \
    --pos-prompt "Realistic, High-quality. A red panda is walking on a stone road." \
    --neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
    --ckpt ${MODEL_BASE}"/hunyuancustom_editing_720P/mp_rank_00_model_states.pt" \
    --seed 1024 \
    --infer-steps 50 \
    --flow-shift-eval-video 5.0 \
    --save-path './results/sp_editing_720p'
    # --pose-enhance  # Enable for human videos to improve pose generation quality.
```
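The editing mode conditions on a background video plus an aligned per-frame mask of the subject region, with `--expand-scale` growing that mask before conditioning. If you need to build a mask video for your own footage, the sketch below shows one plausible OpenCV recipe; the file names, kernel size, and the `segment_subject` helper are illustrative assumptions, not part of the repository:

```python
# Hypothetical preparation of a --mask-video input (illustrative only; the
# repository ships ready-made examples under ./assets/input_videos/).
import cv2
import numpy as np

def segment_subject(frame):
    """Placeholder: plug in your own segmenter (e.g. SAM) returning a
    uint8 mask with values in {0, 255}."""
    raise NotImplementedError

reader = cv2.VideoCapture("my_clip.mp4")   # assumed input footage
fps = reader.get(cv2.CAP_PROP_FPS)
size = (int(reader.get(cv2.CAP_PROP_FRAME_WIDTH)),
        int(reader.get(cv2.CAP_PROP_FRAME_HEIGHT)))
writer = cv2.VideoWriter("my_mask.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, size)

kernel = np.ones((15, 15), np.uint8)       # assumed dilation kernel
while True:
    ok, frame = reader.read()
    if not ok:
        break
    mask = segment_subject(frame)
    mask = cv2.dilate(mask, kernel)        # grow the region, as --expand-scale does at inference
    writer.write(cv2.cvtColor(mask, cv2.COLOR_GRAY2BGR))

reader.release()
writer.release()
```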
### Run Audio-Driven Video Customization

```bash
cd HunyuanCustom

export MODEL_BASE="./models"
export PYTHONPATH=./
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
    --ref-image './assets/images/seg_man_01.png' \
    --input-audio './assets/audios/milk_man.mp3' \
    --audio-strength 0.8 \
    --audio-condition \
    --pos-prompt "Realistic, High-quality. In the study, a man sits at a table featuring a bottle of milk while delivering a product presentation." \
    --neg-prompt "Two people, two persons, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
    --ckpt ${MODEL_BASE}"/hunyuancustom_audio_720P/mp_rank_00_model_states.pt" \
    --seed 1026 \
    --video-size 720 1280 \
    --sample-n-frames 129 \
    --cfg-scale 7.5 \
    --infer-steps 30 \
    --use-deepcache 1 \
    --flow-shift-eval-video 13.0 \
    --save-path './results/sp_audio_720p'
```
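One practical check before launching: `--sample-n-frames 129` fixes the clip length, so the driving audio should cover at least that many frames at the model's output frame rate. A minimal sketch; the 25 fps figure is an assumption, so confirm it against the repository's sampler configuration:

```python
# Verify the driving audio covers the requested clip length (minimal sketch;
# ASSUMED_FPS is a guess; check hymm_sp's config for the real frame rate).
import librosa  # needs librosa >= 0.10 for the `path=` keyword

SAMPLE_N_FRAMES = 129  # matches --sample-n-frames above
ASSUMED_FPS = 25.0     # assumption, not a repository value

audio_seconds = librosa.get_duration(path="./assets/audios/milk_man.mp3")
needed_seconds = SAMPLE_N_FRAMES / ASSUMED_FPS
print(f"audio: {audio_seconds:.2f}s, clip needs ~{needed_seconds:.2f}s")
if audio_seconds < needed_seconds:
    print("warning: audio is shorter than the generated clip")
```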
## 🔑 Single-gpu Inference

For example, to generate a video with 1 GPU, you can use the following command:

```bash
cd HunyuanCustom

export MODEL_BASE="./models"
export DISABLE_SP=1
export PYTHONPATH=./
python hymm_sp/sample_gpu_poor.py \
    --ref-image './assets/images/seg_woman_01.png' \
    --pos-prompt "Realistic, High-quality. A woman is drinking coffee at a café." \
    --neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
    --ckpt ${MODEL_BASE}"/hunyuancustom_720P/mp_rank_00_model_states_fp8.pt"
```
To further reduce GPU memory usage, you can additionally enable CPU offload:

```bash
export MODEL_BASE="./models"
export CPU_OFFLOAD=1
export PYTHONPATH=./
python hymm_sp/sample_gpu_poor.py \
    --ref-image './assets/images/seg_woman_01.png' \
    --pos-prompt "Realistic, High-quality. A woman is drinking coffee at a café." \
    --neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
    --ckpt ${MODEL_BASE}"/hunyuancustom_720P/mp_rank_00_model_states_fp8.pt"
```
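Whether offload is worth the speed penalty depends on how much VRAM is actually free; a quick probe, where the threshold is an illustrative assumption rather than a documented requirement:

```python
# Probe free VRAM to decide on CPU_OFFLOAD (minimal sketch; the threshold
# below is an illustrative assumption, not a documented requirement).
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()  # current CUDA device
free_gib = free_bytes / 1024**3
print(f"free VRAM: {free_gib:.1f} GiB of {total_bytes / 1024**3:.1f} GiB")
if free_gib < 60:  # assumed comfort margin for 720p generation
    print("consider: export CPU_OFFLOAD=1")
```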
You can also launch a Gradio demo for each customization mode:

```bash
cd HunyuanCustom

# Single-Subject Video Customization
bash ./scripts/run_gradio.sh

# Video-Driven Video Customization
bash ./scripts/run_gradio.sh --video

# Audio-Driven Video Customization
bash ./scripts/run_gradio.sh --audio
```
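To drive a running demo from code rather than the browser, `gradio_client` can call its endpoints; the URL, endpoint name, and argument slots below are placeholders, so query the live app for the real signature first:

```python
# Hypothetical programmatic call to the running Gradio demo (URL, endpoint
# name, and argument slots are placeholders; inspect view_api() first).
from gradio_client import Client

client = Client("http://127.0.0.1:7860/")  # assumed address of run_gradio.sh
print(client.view_api())                   # prints the demo's real endpoints
# result = client.predict(
#     "./assets/images/seg_woman_01.png",  # reference image (assumed slot)
#     "Realistic, High-quality. A woman is drinking coffee at a café.",
#     api_name="/generate",                # placeholder endpoint name
# )
```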
## 🔗 BibTeX

If you find [HunyuanCustom](https://arxiv.org/abs/2505.04512) useful for your research and applications, please cite using this BibTeX:

```BibTeX
@misc{hu2025hunyuancustom,
      title={HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation},
      author={Teng Hu and Zhentao Yu and Zhengguang Zhou and Sen Liang and Yuan Zhou and Qin Lin and Qinglin Lu},
      year={2025},
      eprint={2505.04512},
      archivePrefix={arXiv}
}
```

## Acknowledgements

We would like to thank the contributors to the [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), [HunyuanVideo-Avatar](https://github.com/Tencent-Hunyuan/HunyuanVideo-Avatar), [MimicMotion](https://github.com/Tencent/MimicMotion), [SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [FLUX](https://github.com/black-forest-labs/flux), [Llama](https://github.com/meta-llama/llama), [LLaVA](https://github.com/haotian-liu/LLaVA), [Xtuner](https://github.com/InternLM/xtuner), [diffusers](https://github.com/huggingface/diffusers) and [HuggingFace](https://huggingface.co) repositories, for their open research and exploration.