Is there a way how to train SDXL Controlnet on 24 GB VRAM card #9669

blacklig · 2024-10-14T10:52:25Z

blacklig
Oct 14, 2024

Hi, anyone successfully ran this demo script:

https://github.com/huggingface/diffusers/blob/main/examples/controlnet/README_sdxl.md

on something like RTX 3090 or RTX 4090?

I have not been successful and always ran into OOM..

tried xformes but probably way more optimization things would be necessary..

I can use 2x 3090 or 4090 but only pcie connected so not even sure if that would be any help, probably would be very slow but if some sharding can be implemented to at least fit into VRAM?

Any help would be greatly appreciated :) Thanks

a-r-r-o-w · 2024-10-14T15:53:28Z

a-r-r-o-w
Oct 14, 2024
Maintainer

It is definitely possible to do the training on one or more 24 GB GPUs. I would recommend trying a finetuning toolkit like SimpleTuner as a starting point, but if you'd like to dive deep into how low memory training could be done generally, I would recommend trying to read up on and use the following techniques/suggestions:

Gradient Checkpointing
Precomputing latents and prompt embeddings, so you don't have to load VAE or Text Encoder into memory for training
DeepSpeed
Fully Sharded Data Parallel
Low-bit optimizers (8-bit, 4-bit, CPU Offloading) from TorchAO or bitsandbytes
Quantized training from TorchAO
Custom triton kernels for specific operations in memory-efficient manner

Typically, DeepSpeed + Gradient checkpointing is more than enough to finetune billion parameter models under 24 GB at low batch sizes (1-4). DeepSpeed should be quite easy to enable - you just have to follow the questions and create appropriate config using accelerate config. Gradient checkpointing can be enabled in the SDXL script by passing --gradient_checkpointing

1 reply

blacklig Oct 15, 2024
Author

thanks a lot for nice overview for possibilities.

I have already proceed into some options like accelerate with deepspeed which really allows to run training of bigger models on 2 GPUs under 24 GB VRAM, but problem is, it is very slow.. GPUs run like only at 150ish W out of 450W max.. (also pcie running at only 4@4 is issue).

So how could I ideally run traininig just on one card without sharding and deepspeed? As that way I know it would max out card not like this.

I dont know if going for 8 bit I wont compromise model too much? Thought about it as well but I am quite doubious here.

If anyone has more insights into it, would be very appreciated

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is there a way how to train SDXL Controlnet on 24 GB VRAM card #9669

{{title}}

Replies: 1 comment 1 reply

{{title}}

{{title}}

Select a reply

Is there a way how to train SDXL Controlnet on 24 GB VRAM card #9669

blacklig Oct 14, 2024

Replies: 1 comment · 1 reply

a-r-r-o-w Oct 14, 2024 Maintainer

blacklig Oct 15, 2024 Author

blacklig
Oct 14, 2024

Replies: 1 comment 1 reply

a-r-r-o-w
Oct 14, 2024
Maintainer

blacklig Oct 15, 2024
Author