I'm running the example code straight from the Hugging Face Stable Diffusion 3.5 model page, and it is extremely slow, averaging about 90 seconds per iteration. For reference, with Stable Diffusion XL Base I get ~5.25 iterations per second.
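To put that gap in numbers (since 90 s/it is time-per-iteration and 5.25 it/s is iterations-per-time, the ratio is their product):

```python
# Observed speeds from the report above.
sd35_seconds_per_it = 90.0   # SD3.5 Turbo: seconds per iteration
sdxl_its_per_second = 5.25   # SDXL Base: iterations per second

# Slowdown = (time per SD3.5 iteration) / (time per SDXL iteration)
#          = 90 s/it / (1/5.25 s/it) = 90 * 5.25
slowdown = sd35_seconds_per_it * sdxl_its_per_second
print(round(slowdown))  # -> 472
```

So SD3.5 Turbo is running roughly 470x slower per iteration than SDXL Base on the same machine, which points to something more than normal model-size overhead.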
Code I am running:
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large-turbo",
    torch_dtype=torch.float16,
    token="token_here",
)
pipe = pipe.to("cuda")

image = pipe(
    "A capybara holding a sign that reads Hello Fast World",
    num_inference_steps=4,
    guidance_scale=7.0,
).images[0]
image.save("capybara.png")
My specs: i7-13700K (13th gen), 32 GB RAM, RTX 4080 Super.
And here are my packages and versions (Python 3.10.4, CUDA 12.1):
accelerate==1.1.1
certifi==2024.8.30
charset-normalizer==3.4.0
colorama==0.4.6
diffusers==0.31.0
filelock==3.13.1
fsspec==2024.2.0
huggingface-hub==0.26.2
idna==3.10
importlib_metadata==8.5.0
Jinja2==3.1.3
MarkupSafe==2.1.5
mpmath==1.3.0
networkx==3.2.1
numpy==1.26.3
packaging==24.2
pillow==10.2.0
protobuf==5.28.3
psutil==6.1.0
PyYAML==6.0.2
regex==2024.11.6
requests==2.32.3
safetensors==0.4.5
sentencepiece==0.2.0
sympy==1.13.1
tokenizers==0.20.3
torch==2.5.1+cu121
torchaudio==2.5.1+cu121
torchvision==0.20.1+cu121
tqdm==4.67.0
transformers==4.46.3
typing_extensions==4.9.0
urllib3==2.2.3
zipp==3.21.0
Things I have tried:
- Updating the Nvidia drivers
- Overwriting pretrained_model_name_or_path with a custom directory where I downloaded SD3.5 Turbo
- Enabling enable_model_cpu_offload()
- Creating a fresh venv and installing only the packages needed for Stable Diffusion
My hope is to reach 5 iterations per second or better with SD3.5 Turbo.
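One back-of-envelope check I did (parameter counts are approximations from public descriptions of the models, not something I measured): the fp16 weights of the SD3.5 Large transformer plus the T5-XXL text encoder alone come to roughly 25 GB, which is more than the 16 GB of VRAM on a 4080 Super, so I suspect the weights are spilling into shared system memory:

```python
# Rough fp16 weight-size estimate. Parameter counts are approximate
# (~8B for the SD3.5 Large MMDiT, ~4.7B for the T5-XXL text encoder).
BYTES_PER_FP16_PARAM = 2
mmdit_params = 8.0e9
t5_xxl_params = 4.7e9

weights_gb = (mmdit_params + t5_xxl_params) * BYTES_PER_FP16_PARAM / 1e9
print(round(weights_gb, 1))  # -> 25.4 (GB), vs. 16 GB VRAM on a 4080 Super
```

If that estimate is in the right ballpark, the weights alone would not fit on the card, which could explain iteration times this far off from SDXL.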