I've been trying to follow the instructions here to run StableDiffusion locally, but the code appears to just hang.
I've cloned the repo, and installed dependencies:
$ git show --stat
commit cf1d67a6fd5ea1aa600c4df58e5b47da45f6bdbf (HEAD -> main, origin/main, origin/HEAD)
Author: hardmaru <[email protected]>
Date: Sat Mar 25 11:24:20 2023 +0900
Update modelcard.md
modelcard.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
$ pip freeze
absl-py==2.1.0
aiohappyeyeballs==2.4.6
aiohttp==3.11.12
[...]
websockets==15.0
Werkzeug==3.1.3
yarl==1.18.3
And I've downloaded the weights as suggested here.
I've made two changes to the code to work around errors that showed up on my first run:
$ git diff
diff --git ldm/modules/diffusionmodules/util.py ldm/modules/diffusionmodules/util.py
index daf35da..3e82831 100644
--- ldm/modules/diffusionmodules/util.py
+++ ldm/modules/diffusionmodules/util.py
@@ -130,7 +130,7 @@ class CheckpointFunction(torch.autograd.Function):
ctx.input_tensors = list(args[:length])
ctx.input_params = list(args[length:])
ctx.gpu_autocast_kwargs = {"enabled": torch.is_autocast_enabled(),
- "dtype": torch.get_autocast_gpu_dtype(),
+ "dtype": torch.get_autocast_dtype('cuda'),
"cache_enabled": torch.is_autocast_cache_enabled()}
with torch.no_grad():
output_tensors = ctx.run_function(*ctx.input_tensors)
diff --git scripts/txt2img.py scripts/txt2img.py
index 9d955e3..5a1be9b 100644
--- scripts/txt2img.py
+++ scripts/txt2img.py
@@ -27,7 +27,7 @@ def chunk(it, size):
def load_model_from_config(config, ckpt, device=torch.device("cuda"), verbose=False):
print(f"Loading model from {ckpt}")
- pl_sd = torch.load(ckpt, map_location="cpu")
+ pl_sd = torch.load(ckpt, map_location="cpu", weights_only=False)
if "global_step" in pl_sd:
print(f"Global Step: {pl_sd['global_step']}")
sd = pl_sd["state_dict"]
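I believe both patches were needed because I'm running a newer PyTorch than this repo targets: torch.get_autocast_gpu_dtype() has been replaced by torch.get_autocast_dtype(device_type), and recent releases default torch.load to weights_only=True. For completeness, the relevant versions can be checked with:

import torch

# Which PyTorch build am I actually on? The repo predates the 2.x API
# changes that both patches above work around.
print(torch.__version__)        # e.g. 2.6.0+cu124
print(torch.version.cuda)       # CUDA toolkit the wheel was built against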
But the script appears to hang forever (at least, its output hasn't changed in >10 minutes):
$ python scripts/txt2img.py --prompt "A dog having a nice time in a park" --ckpt /home/scubbo/.cache/huggingface/hub/models--stabilityai--stable-diffusion-2-1/snapshots/5cae40e6a2745ae2b01ad92ae5043f95f23644d6/v2-1_768-ema-pruned.ckpt --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768
Global seed set to 42
Loading model from /home/scubbo/.cache/huggingface/hub/models--stabilityai--stable-diffusion-2-1/snapshots/5cae40e6a2745ae2b01ad92ae5043f95f23644d6/v2-1_768-ema-pruned.ckpt
Global Step: 110000
No module 'xformers'. Proceeding without it.
LatentDiffusion: Running in v-prediction mode
DiffusionWrapper has 865.91 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Creating invisible watermark encoder (see https://github.com/ShieldMnt/invisible-watermark)...
Sampling: 0%| | 0/3 [00:00<?, ?it/s]
/home/scubbo/Code/stable-diffusion-testing/stablediffusion/ldm/models/diffusion/ddim.py:36: DeprecationWarning: __array_wrap__ must accept context and return_scalar arguments (positionally) in the future. (Deprecated NumPy 2.0)
self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod.cpu())))
/home/scubbo/Code/stable-diffusion-testing/stablediffusion/ldm/models/diffusion/ddim.py:37: DeprecationWarning: __array_wrap__ must accept context and return_scalar arguments (positionally) in the future. (Deprecated NumPy 2.0)
self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod.cpu())))
/home/scubbo/Code/stable-diffusion-testing/stablediffusion/ldm/models/diffusion/ddim.py:38: DeprecationWarning: __array_wrap__ must accept context and return_scalar arguments (positionally) in the future. (Deprecated NumPy 2.0)
self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod.cpu())))
/home/scubbo/Code/stable-diffusion-testing/stablediffusion/ldm/models/diffusion/ddim.py:39: DeprecationWarning: __array_wrap__ must accept context and return_scalar arguments (positionally) in the future. (Deprecated NumPy 2.0)
self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu())))
/home/scubbo/Code/stable-diffusion-testing/stablediffusion/ldm/models/diffusion/ddim.py:40: DeprecationWarning: __array_wrap__ must accept context and return_scalar arguments (positionally) in the future. (Deprecated NumPy 2.0)
self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu() - 1)))
/home/scubbo/Code/stable-diffusion-testing/stablediffusion/.venv/lib/python3.11/site-packages/torch/_tensor.py:1077: DeprecationWarning: __array_wrap__ must accept context and return_scalar arguments (positionally) in the future. (Deprecated NumPy 2.0)
return self.reciprocal() * other
/home/scubbo/Code/stable-diffusion-testing/stablediffusion/ldm/modules/diffusionmodules/util.py:76: DeprecationWarning: __array_wrap__ must accept context and return_scalar arguments (positionally) in the future. (Deprecated NumPy 2.0)
sigmas = eta * np.sqrt((1 - alphas_prev) / (1 - alphas) * (1 - alphas / alphas_prev))
/home/scubbo/Code/stable-diffusion-testing/stablediffusion/ldm/models/diffusion/ddim.py:49: DeprecationWarning: __array_wrap__ must accept context and return_scalar arguments (positionally) in the future. (Deprecated NumPy 2.0)
self.register_buffer('ddim_sqrt_one_minus_alphas', np.sqrt(1. - ddim_alphas))
Data shape for DDIM sampling is (3, 4, 96, 96), eta 0.0
Running DDIM Sampling with 50 timesteps
DDIM Sampler: 0%| | 0/50 [00:00<?, ?it/s]
This is on a machine with an Nvidia Quadro P1000, with nothing else using the card. I recognize it isn't the most high-end card, but I would expect some progress in ~20 minutes if any progress were going to be made:
$ nvidia-smi
Mon Feb 17 19:29:08 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01 Driver Version: 535.216.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Quadro P1000 On | 00000000:05:00.0 Off | N/A |
| 34% 27C P8 N/A / N/A | 4MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
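(For diagnosis: a quick sanity check that the PyTorch install can see the card at all, since a CPU-only wheel would also produce exactly this kind of crawl:)

import torch

# If any of these come back False / 0 / missing, PyTorch can't see the
# GPU and everything is silently running on the CPU.
print(torch.cuda.is_available())      # expect True
print(torch.cuda.device_count())      # expect 1
print(torch.cuda.get_device_name(0))  # expect the Quadro P1000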
By comparison, I can run the StableDiffusion Web UI on this machine and get image generation in approx. 2 minutes, but I want to be able to call it programmatically.
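(I know that if the Web UI is launched with --api it exposes a REST endpoint, so calling it over HTTP is a possible fallback. A sketch, assuming an AUTOMATIC1111-style server on the default port:)

import base64
import requests

# POST a prompt to the Web UI's txt2img endpoint; the response carries
# base64-encoded PNGs in its "images" list.
resp = requests.post(
    "http://127.0.0.1:7860/sdapi/v1/txt2img",
    json={"prompt": "A dog having a nice time in a park", "steps": 50},
)
resp.raise_for_status()
with open("dog.png", "wb") as f:
    f.write(base64.b64decode(resp.json()["images"][0]))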
EDIT: I unintentionally left the process running in the background for over 4 hours, and when I returned, the DDIM Sampler progress was only at 8%. Given that the StableDiffusion Web UI can complete an image in ~2 minutes, should I conclude that the txt2img.py script is not using my GPU?
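(If it helps, one way I could confirm this would be to add a couple of hypothetical debug lines at the end of load_model_from_config in scripts/txt2img.py, where model is the object the function returns:)

# Hypothetical debug lines; a device of "cpu" here would confirm that
# the sampler is running entirely on the CPU.
print(next(model.parameters()).device)
print(f"{torch.cuda.memory_allocated() / 2**20:.0f} MiB allocated on GPU")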
EDIT2: I tried following the guide here, and got the following error when executing pipe.to("cuda"):
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacity of 3.94 GiB of which 64.56 MiB is free. Including non-PyTorch memory, this process has 3.88 GiB memory in use. Of the allocated memory 3.76 GiB is allocated by PyTorch, and 83.33 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
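My suspicion is that the full fp32 pipeline simply doesn't fit in 4 GiB. If the guide in question uses diffusers' StableDiffusionPipeline, the usual low-VRAM options would be half precision, CPU offload, and attention slicing. A sketch, assuming the same stabilityai/stable-diffusion-2-1 checkpoint:

import torch
from diffusers import StableDiffusionPipeline

# Load in half precision to roughly halve VRAM usage.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
)
# Stream weights to the GPU on demand instead of pipe.to("cuda");
# this keeps peak usage under the card's 4 GiB (needs accelerate installed).
pipe.enable_model_cpu_offload()
# Trade some speed for a smaller attention memory footprint.
pipe.enable_attention_slicing()

image = pipe("A dog having a nice time in a park").images[0]
image.save("dog.png")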