[WIP] Asynchronous model mover for lowvram #14855

Draft · wants to merge 16 commits into base: dev
Conversation

@wfjsw (Contributor) commented Feb 7, 2024

Description

  • This is an attempt to speed up `--lowvram` by taking the model moving out of the forward loop.
  • Model moving is made asynchronous by creating a separate CUDA stream dedicated to moving the model and using CUDA events to synchronize back to the default stream.
  • A lookahead buffer keeps the model moving ahead of the forward pass, so the GPU always has something to do.

I'm getting 3.7 it/s on a 3060 Laptop with half the VRAM usage of `--medvram`; it was originally 1.65 it/s. For reference, the `--medvram` speed is 5.8 it/s.
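For illustration only, here is a minimal sketch of the design described above: a dedicated copy stream prefetches upcoming modules into a bounded lookahead queue, records a per-module "ready" event, and the default stream waits on that event just before running the module. The names `run_module`, `MAX_PREFETCH`, and the loop structure are made up for this sketch and are not the PR's actual code.

```python
import collections
import torch

copy_stream = torch.cuda.Stream()      # "stream B": only does host -> device copies
lookahead = collections.deque()        # buffer of modules whose weights are already in flight
MAX_PREFETCH = 3                       # tweakable lookahead size (hypothetical setting)

def prefetch(module):
    """Copy one module's (pinned) CPU weights to the GPU on the side stream."""
    with torch.cuda.stream(copy_stream):
        gpu_params = {n: p.to("cuda", non_blocking=True) for n, p in module.named_parameters()}
        ready = torch.cuda.Event()
        ready.record()                 # marks "this module's weights are usable on the GPU"
    lookahead.append((gpu_params, ready))

def forward_all(modules, x):
    for m in modules[:MAX_PREFETCH]:   # fill the lookahead buffer before computing
        prefetch(m)
    for i, module in enumerate(modules):
        gpu_params, ready = lookahead.popleft()
        torch.cuda.current_stream().wait_event(ready)  # compute waits only for this module
        x = run_module(module, gpu_params, x)          # hypothetical: forward using the GPU copies
        done = torch.cuda.Event()
        done.record()                                  # default stream: finished with these weights
        copy_stream.wait_event(done)                   # copy stream won't recycle them too early
        if i + MAX_PREFETCH < len(modules):
            prefetch(modules[i + MAX_PREFETCH])
    return x
```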

Concerns

  • This is still a prototype, and not all original semantics are followed.
  • CUDA streams and CUDA events are used, and they are CUDA specific. I think IPEX has similar facilities, but DML has nothing comparable.
  • The size of the lookahead buffer is a tweakable setting. A larger buffer increases VRAM usage; a smaller buffer probably makes the forward pass a bit slower. The speed gained from a larger buffer has a limit.

Checklist:

@wfjsw changed the base branch from master to dev · February 7, 2024 08:36
@wfjsw (Contributor, Author) commented Feb 9, 2024

Smart mover

The smart mover does something similar to Forge: it only moves tensors from CPU to GPU, never back.

At some point I was somehow able to get the same or even 2x the speed of sd-webui-forge under `--always-no-vram` (which is roughly equivalent to `--lowvram`) when `max_prefetch` is 5-8. Now I can only get it as fast as Forge, and the output is broken somehow. Unfortunately I did not save the file at that time. There are bugs hidden somewhere in the code, but I am getting tired of trying to find them.

I'm going to leave it as is and come back when I get interested again.

@wfjsw (Contributor, Author) commented Feb 12, 2024

The broken images seem to be caused by not synchronizing back to the creation stream after usage. Fixed.

Also changed to layer-wise movement.

@wfjsw (Contributor, Author) commented Feb 12, 2024

There might be a problem with extra networks. Haven't looked into that.

@AUTOMATIC1111 (Owner):

This looks very cool, but please don't change the formatting of those existing lines in lowvram.py (newlines and quotes), put the new classes into a separate file, and write a short comment there on how the performance gain is achieved. Also, maybe add an option to use the old method even if streaming is supported.

@wfjsw (Contributor, Author) commented Feb 17, 2024

Need some help making this support LoRA/ControlNets.

Since these probably alter weights and biases, the tensors cached in the mover may be outdated, and a slow path will be taken.

@AUTOMATIC1111 (Owner):

I'll be honest with you, I don't know how it works, so I can't help either. The "not moving from GPU to CPU" part is smart and reasonable, and it can be implemented with ease, but the CUDA streams part would need me to get a lot more involved to understand.

Plus, there is FP8 support now, maybe that one can work better than lowvram for people who need it?

@wfjsw (Contributor, Author) commented Feb 17, 2024

The CUDA stream is used because I want to overlap memcpy with compute. Streams can be thought of as threads for the GPU.

Briefly speaking, it does several things (all non-blocking from Python's point of view):

  • On stream B, the CPU tensor is copied to CUDA.
  • On stream B, it calls `record_event`, which can be seen as a timeline marker. This marks the tensor as ready.
  • On the default stream, it waits for the ready event and then computes the forward pass with the tensor.
  • On the default stream, it calls `record_event` to mark that the work on this tensor is done.
  • On stream B, it waits for the done event, so the tensor is only deallocated after it has finished its job.
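A rough, self-contained sketch of that sequence for a single weight tensor (illustrative names only, not the PR's code):

```python
import torch

copy_stream = torch.cuda.Stream()                    # "stream B"
compute_stream = torch.cuda.current_stream()         # default stream
ready, done = torch.cuda.Event(), torch.cuda.Event()

cpu_weight = torch.randn(4096, 4096).pin_memory()    # pinned so the async copy can overlap compute
x = torch.randn(1, 4096, device="cuda")

with torch.cuda.stream(copy_stream):
    gpu_weight = cpu_weight.to("cuda", non_blocking=True)  # 1. copy the CPU tensor to CUDA on stream B
    ready.record()                                          # 2. record_event: tensor is ready

compute_stream.wait_event(ready)                     # 3. default stream waits for "ready"...
y = x @ gpu_weight                                   #    ...then computes with the tensor
done.record(compute_stream)                          # 4. record_event: work on this tensor is done

copy_stream.wait_event(done)                         # 5. stream B waits for "done" before the
del gpu_weight                                       #    tensor is deleted / its memory reused
```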

Apart from the moving itself, I also have to:

  • Track the CUDA tensor so it is the one used for the forward pass
  • Save the events so they can be waited on from the other stream
  • Maintain references to tensors so they are deallocated at the right time

Can the "not moving from GPU to CPU" part really be implemented easily? Torch moves modules in-place, which is a major pain that prevents me from keeping the implementation simple and forces me to place hooks on `torch.nn.functional`. I assume I would have to deep-copy the modules to achieve this, and that sounds costly. CUDA streams, on the other hand, look easier.
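For context, a bare-bones sketch of what a `torch.nn.functional` hook can look like (purely illustrative; the `gpu_weight_cache` dict and how it gets filled are assumptions, not the PR's implementation):

```python
import torch
import torch.nn.functional as F

# hypothetical cache: CPU parameter tensor (keyed by identity) -> prefetched GPU copy
gpu_weight_cache: dict[torch.Tensor, torch.Tensor] = {}

_original_linear = F.linear

def patched_linear(input, weight, bias=None):
    # Use the prefetched GPU copy if the mover has one, instead of relying on the
    # module itself having been moved in-place with .to().
    weight = gpu_weight_cache.get(weight, weight)
    if bias is not None:
        bias = gpu_weight_cache.get(bias, bias)
    return _original_linear(input, weight, bias)

F.linear = patched_linear
```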

Regarding FP8, I think it does not hurt to have more options.

@wfjsw (Contributor, Author) commented Feb 17, 2024

Actually, there are two main pain points that drove me here:

  • To do "not moving from GPU to CPU" at the module level, I need to clone the module and use the clone for the forward pass. That can't be done with a forward hook and can't be reliably done by monkeypatching `forward`.
  • To actually gain a performance benefit, I need to know the next N tensors while hooking, and I don't know how to do this in a reasonable way. An alternative is to avoid CUDA stream synchronization in the middle of the computation, so I can queue all jobs before they run; in that case the result would not be immediately available to the Python world after `forward`. AFAIK Torch performs synchronization on every module's `forward`, however, so this is hard to do.

@wfjsw marked this pull request as draft February 19, 2024 10:31
@wfjsw marked this pull request as ready for review February 21, 2024 03:09
@wfjsw (Contributor, Author) commented Feb 21, 2024

A better approach is implemented here, which uses the async nature of CUDA. One thing to note: for the acceleration to work, the weights and biases of the UNet must be placed in non-pageable (pinned) memory (they will fall back to pageable memory if the module is somehow `.to()`-ed). LoRA is tested.

However, if any extension/module touches the weights and biases of the model (by using `to()`, for example), it needs to re-pin them with `._apply(lambda x: x.pin_memory())`. Otherwise the mover falls back to the slow path.
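A hedged example of what that looks like from an extension's point of view (`module` here is just a stand-in `nn.Linear`):

```python
import torch

module = torch.nn.Linear(1024, 1024)        # stand-in for e.g. a UNet block kept on the CPU

module.to(torch.float16)                    # any .to() call leaves the weights in ordinary pageable memory
module._apply(lambda x: x.pin_memory())     # re-pin them so async copies can overlap with compute again
```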

@light-and-ray (Contributor) commented Feb 21, 2024

As I understand it, this requires more VRAM than the old lowvram. Maybe you should disable it by default?

Also, with FP8, 2 GB is enough for medvram, and lowvram only saves about 500 MB. With this patch, this already small VRAM difference will become even smaller.

@wfjsw (Contributor, Author) commented Feb 21, 2024

> As I understand it, this requires more VRAM than the old lowvram. Maybe you should disable it by default?

I profiled with `PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync` and without FP8. The original implementation takes 166 MB, while this implementation takes 387 MB. The difference is negligible compared with the difference in sampling step time, which is 625 ms vs 229 ms.

Also, the two streams do not go out of sync by a big margin.

[profiler screenshots]

> Also, with FP8, 2 GB is enough for medvram, and lowvram only saves about 500 MB. With this patch, this already small VRAM difference will become even smaller.

False. This takes significantly less VRAM: 890 MB vs 350 MB. The speed difference is 200 ms per step vs 260 ms per step.

[screenshots]

@light-and-ray (Contributor):

I saw on Discord that async lowvram keeps more than one layer on the GPU. But maybe it really requires even less VRAM, I don't know.

> The original implementation takes 755 MB

Is that total VRAM usage? I tested on my 2 GB card and it ate about 1.7 GB in FP16 lowvram mode.

I will test this patch against the original lowvram and medvram on it.

@wfjsw (Contributor, Author) commented Feb 21, 2024

> Is that total VRAM usage? I tested on my 2 GB card and it ate about 1.7 GB in FP16 lowvram mode.

It is the peak usage recorded by Nsight.
`PYTORCH_CUDA_ALLOC_CONF` makes a big difference here; the `native` backend does consume ~1.6 GB of VRAM. But I think it is a matter of GC and will resolve itself when there is VRAM pressure.

@wfjsw (Contributor, Author) commented Feb 21, 2024

A closer look shows that it was just the horizontal scale of the diagram; the actual usage is smaller. See the tooltips on the new screenshots.

@light-and-ray (Contributor) commented Feb 21, 2024

GPU MX150 2GB

ARGS:
export PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
./webui.sh --xformers "$@"
+ fp8

4 steps 512x512


--lowvram
Time taken: 22.1 sec.
A: 0.96 GB, R: 1.19 GB, Sys: 1.3/1.95508 GB (66.0%)

--medvram
Time taken: 18.7 sec.
A: 1.33 GB, R: 1.72 GB, Sys: 1.8/1.95508 GB (93.2%)

this patch --lowvram
Time taken: 17.2 sec.
A: 1.17 GB, R: 1.84 GB, Sys: 1.9/1.95508 GB (99.7%)


torch: 2.1.2+cu121  •  xformers: 0.0.23.post1

Hm, this patch really requires more VRAM for me.

Maybe it ignores `PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync`? Maybe I need to update the Nvidia driver? Or maybe `--xformers` is the problem?

@light-and-ray (Contributor) commented Feb 21, 2024

The same VRAM usage, but slower...

This patch, no xformers and no PYTORCH_CUDA_ALLOC_CONF
Time taken: 23.2 sec.
A: 1.25 GB, R: 1.85 GB, Sys: 1.9/1.95508 GB (99.2%)

@wfjsw (Contributor, Author) commented Feb 21, 2024

Maybe your actual compute work is lagging behind. Use `nsys` to figure it out.

I can add a synchronization mark there to constrain it, but that hurts performance by a lot.

Without xformers it will be slower.

One interesting thing is that lowvram goes faster than medvram. Can you upload an nsys profile?

@light-and-ray (Contributor) commented Feb 21, 2024

> One interesting thing is that lowvram goes faster than medvram. Can you upload an nsys profile?

Yes, but higher VRAM usage: 93% vs 99% XD

@light-and-ray (Contributor):

> Maybe your actual compute work is lagging behind. Use `nsys` to figure it out.

Can't install it... I installed it with `sudo apt install nsight-systems`, but there is only `nsys-ui`, which doesn't work: `Cannot mix incompatible Qt library (5.15.10) with this library (5.15.2)` (I hate these Qt compatibility issues).

@wfjsw (Contributor, Author) commented Feb 21, 2024

You can use the nsys CLI. Collect this data:

  • Collect CUDA trace: On
  • Collect CUDA's GPU memory usage: On

@light-and-ray (Contributor) commented Feb 21, 2024

I only have `nsys-ui` after installation. Maybe I need to reboot the PC, but I'm afraid because of the Qt incompatibility error. It's a bad sign, maybe I won't be able to boot my KDE XD. I already had a similar issue, and I have to be online today.

I will try to collect this data.

@wfjsw (Contributor, Author) commented Feb 21, 2024

@light-and-ray (Contributor):

@wfjsw check Discord PM

@wfjsw (Contributor, Author) commented Feb 23, 2024

IPEX does not seem to support `pin_memory` right now.

@wfjsw (Contributor, Author) commented Feb 24, 2024

To fix for default options:

Traceback (most recent call last):
File "threading.py", line 973, in _bootstrap
File "threading.py", line 1016, in _bootstrap_inner
File "<enhanced_experience vendors.sentry_sdk.integrations.threading>", line 70, in run
File "E:\novelai-webui\py310\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
result = context.run(func, *args)
File "E:\novelai-webui\py310\lib\site-packages\gradio\utils.py", line 707, in wrapper
response = f(*args, **kwargs)
File "E:\novelai-webui\modules\ui_extra_networks.py", line 419, in pages_html
return refresh()
File "E:\novelai-webui\modules\ui_extra_networks.py", line 425, in refresh
pg.refresh()
File "E:\novelai-webui\modules\ui_extra_networks_textual_inversion.py", line 13, in refresh
sd_hijack.model_hijack.embedding_db.load_textual_inversion_embeddings(force_reload=True)
File "E:\novelai-webui\modules\textual_inversion\textual_inversion.py", line 222, in load_textual_inversion_embeddings
self.expected_shape = self.get_expected_shape()
File "E:\novelai-webui\modules\textual_inversion\textual_inversion.py", line 154, in get_expected_shape
vec = shared.sd_model.cond_stage_model.encode_embedding_init_text( ",", 1)
File "E:\novelai-webui\modules\shared_items.py", line 128, in sd_model
return modules.sd_models.model_data.get_sd_model()
File "E:\novelai-webui\modules\sd_models.py", line 574, in get_sd_model
errors.display(e, "loading stable diffusion model", full_traceback=True)
File "E:\novelai-webui\modules\sd_models.py", line 571, in get_sd_model
load_model()
File "E:\novelai-webui\modules\sd_models.py", line 698, in load_model
load_model_weights(sd_model, checkpoint_info, state_dict, timer)
File "E:\novelai-webui\modules\sd_models.py", line 441, in load_model_weights
module.to(torch.float8_e4m3fn)
File "E:\novelai-webui\py310\lib\site-packages\torch\nn\modules\module.py", line 825, in _apply
param_applied = fn(param)
File "E:\novelai-webui\modules\sd_models.py", line 441, in <lambda>
module.to(torch.float8_e4m3fn)
RuntimeError: cannot pin 'CUDAFloat8_e4m3fnType' only dense CPU tensors can be pinned
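A minimal sketch of the kind of guard that would avoid this error (pin only dense CPU tensors); this is just an illustration of the needed fix, not necessarily what ends up in the PR:

```python
import torch

def pin_if_possible(t: torch.Tensor) -> torch.Tensor:
    # Only dense CPU tensors can be pinned; leave CUDA or non-strided tensors
    # (e.g. weights already cast/moved for FP8) untouched instead of raising.
    if t.device.type == "cpu" and t.layout == torch.strided:
        return t.pin_memory()
    return t

# usage (hypothetical): diff_model.time_embed._apply(pin_if_possible)
```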


if use_streamlined_lowvram:
    # put it into pinned memory to achieve data transfer overlap
    diff_model.time_embed._apply(lambda x: x.pin_memory())
A Contributor commented on this diff:

Specifying the `device` parameter will let `pin_memory` offload to other non-CUDA backends (e.g. IPEX):

Suggested change:
-    diff_model.time_embed._apply(lambda x: x.pin_memory())
+    diff_model.time_embed._apply(lambda x: x.pin_memory(device=devices.get_optimal_device_name()))

@Nuullll (Contributor) commented Feb 25, 2024

Intel A750 8G (IPEX backend): this improves performance from 0.7 it/s to 1.5 it/s with no significant VRAM usage increase.

@wfjsw (Contributor, Author) commented Feb 25, 2024

Someone says LoRA is not actually working. Pending test.

UPDATE: I cannot reproduce it.

UPDATE: For FP16 LoRAs, it has a hard time trying to apply them on the CPU. A cast is needed here.

@wfjsw (Contributor, Author) commented Feb 27, 2024

TODO: add a queue somewhere to constrain the speed.

@wfjsw marked this pull request as draft February 27, 2024 09:01
@wfjsw (Contributor, Author) commented Mar 10, 2024

@light-and-ray can you try this? It should no longer OOM now.
nvm, I implemented it wrongly.

@light-and-ray (Contributor):

It still uses more VRAM than medvram.

Time taken: 17.2 sec.
A: 1.27 GB, R: 1.85 GB, Sys: 2.0/1.95508 GB (99.9%)

@wfjsw (Contributor, Author) commented Mar 11, 2024

There is a new setting in the optimization folder. Reduce it and see what happens.

You can go with 1 or 2.

@light-and-ray (Contributor):

Maximum number of loaded modules in low VRAM mode = 1
Time taken: 18.9 sec.
A: 1.19 GB, R: 1.76 GB, Sys: 1.9/1.95508 GB (95.6%)

Maximum number of loaded modules in low VRAM mode = 2
Time taken: 16.9 sec.
A: 1.19 GB, R: 1.75 GB, Sys: 1.9/1.95508 GB (94.7%)

On the first few runs there was 99% usage.

According to the VRAM usage graph, the 95-99% peak is in the VAE stages.

Disabled HyperTile VAE + Maximum number of loaded modules in low VRAM mode = 2
Time taken: 16.9 sec.
A: 1.18 GB, R: 1.73 GB, Sys: 1.8/1.95508 GB (93.8%)

Last test without your patch, still much lower VRAM:
Time taken: 21.7 sec.
A: 0.96 GB, R: 1.41 GB, Sys: 1.5/1.95508 GB (75.5%)

@AndreyRGW (Contributor):

Any progress on this?

@wfjsw (Contributor, Author) commented Jul 23, 2024

I still need an Nsight Systems profile from low-end cards to find out why the max block limit does not seem to work.
