How does an RTX 4090 with 24GB memory run this model?

#18
by MillionMeng - opened

When I try this model on an RTX 4090 with 24GB memory, it reports torch.OutOfMemoryError:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB. GPU 0 has a total capacity of 23.53 GiB of which 44.44 MiB is free. Including non-PyTorch memory, this process has 22.92 GiB memory in use. Of the allocated memory 22.36 GiB is allocated by PyTorch, and 196.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Then I referred to a similar question posted in the Qwen-Image discussions and changed my code as below, adding a device_map config in from_pretrained:
import torch
from diffusers import QwenImageEditPipeline

pipeline = QwenImageEditPipeline.from_pretrained(
    "/home/tmp/Qwen-Image-Edit",
    device_map="balanced")
print("pipeline loaded")
pipeline.to(torch.bfloat16)
pipeline.to("cuda")
pipeline.set_progress_bar_config(disable=None)

but it reports:
ValueError: It seems like you have activated a device mapping strategy on the pipeline which doesn't allow explicit device placement using to(). You can call reset_device_map() to remove the existing device map from the pipeline.

How do I modify my code so that I can run this model on my 24GB RTX 4090? Are there any code examples or documents I can refer to?

With the 4-step LoRA my RTX 4090 takes about 14 seconds... If you need help, I suggest you consider doing what I did: go ask Qwen yourself at https://chat.qwen.ai/, LOL. Surprisingly, it was very aware of its image functionality and corrected my script without any hesitation.
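
Roughly, the setup looks like this, provided your diffusers version has LoRA support for the Qwen-Image pipelines; treat the LoRA repo name, weight filename, and the true_cfg_scale value as placeholders to double-check against the actual Lightning release:

import torch
from PIL import Image
from diffusers import QwenImageEditPipeline

pipeline = QwenImageEditPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16)

# 4-step distillation LoRA (repo and filename are assumptions, check the release)
pipeline.load_lora_weights(
    "lightx2v/Qwen-Image-Lightning",
    weight_name="Qwen-Image-Edit-Lightning-4steps-V1.0.safetensors")

# keeps only the active sub-model on the GPU, which is what lets 24GB work
pipeline.enable_model_cpu_offload()

image = pipeline(
    image=Image.open("input.png"),
    prompt="make the sky purple",
    num_inference_steps=4,   # 4 steps with the distilled LoRA
    true_cfg_scale=1.0,      # distilled LoRAs are usually run without CFG
).images[0]
image.save("output.png")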

You can quantize it; see this conversation: https://huggingface.co/Qwen/Qwen-Image-Edit/discussions/6
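
For example, a rough sketch of a 4-bit bitsandbytes setup through diffusers (the linked thread may use a different method such as GGUF, so double-check the class names and the subfolder against it):

import torch
from diffusers import (QwenImageEditPipeline, QwenImageTransformer2DModel,
                       BitsAndBytesConfig)

# quantize the large transformer to NF4 so it fits alongside the other components
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16)

transformer = QwenImageTransformer2DModel.from_pretrained(
    "Qwen/Qwen-Image-Edit", subfolder="transformer",
    quantization_config=quant_config, torch_dtype=torch.bfloat16)

pipeline = QwenImageEditPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit", transformer=transformer,
    torch_dtype=torch.bfloat16)
pipeline.enable_model_cpu_offload()  # text encoder / VAE move to CPU when idle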

Try commenting out the line pipeline.to("cuda"); that should work.
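
Something like this, passing the dtype at load time instead of via .to() so the device map stays in charge of placement:

import torch
from diffusers import QwenImageEditPipeline

pipeline = QwenImageEditPipeline.from_pretrained(
    "/home/tmp/Qwen-Image-Edit",
    torch_dtype=torch.bfloat16,
    device_map="balanced")
print("pipeline loaded")
pipeline.set_progress_bar_config(disable=None)
# no pipeline.to("cuda") here - that is exactly what raises the ValueError
# once a device map is active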

You're welcome to try our inference framework https://github.com/KE-AI-ENG/FastDM;
it can run Qwen-Image on 24GB VRAM cards.

On dual 3090s it needs 7 seconds.
CUDA 0: holds the text encoder and transformer.
CUDA 1: holds the VAE, because otherwise you get an OOM error.

Optimally, manage the VRAM explicitly, because CPU offload makes it slower.
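
If you want to see where the memory actually goes before deciding on a split, a quick check with plain PyTorch (nothing Qwen-specific) is:

import torch

torch.cuda.reset_peak_memory_stats(0)
# ... run the edit pipeline here ...
print(f"peak allocated on cuda:0: {torch.cuda.max_memory_allocated(0) / 2**30:.2f} GiB")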

Yes, we don't support multi-card mode yet. The whole pipeline can run on a single 24GB VRAM card if the generated resolution is < 768, but we run the text_encoder on the CPU.
