[WIP] Asynchronous model mover for lowvram #14855

Draft · wants to merge 16 commits into base: dev
Conversation

@wfjsw (Contributor) commented Feb 7, 2024

Description

  • This is an attempt to speed up `--lowvram` by taking the model moving out of the forward loop.
  • Model moving is made asynchronous by creating a separate CUDA stream dedicated to moving the model and using CUDA events to synchronize back to the default stream.
  • A lookahead buffer keeps the model moving ahead of the forward pass, so the GPU always has something to do.

I'm getting 3.7 it/s on a 3060 Laptop with half the VRAM usage of `--medvram`; it was originally 1.65 it/s. For reference, the `--medvram` speed is 5.8 it/s.
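For illustration only, here is a minimal sketch of the design described above: a dedicated copy stream prefetches upcoming modules into a bounded lookahead queue, records a per-module "ready" event, and the default stream waits on that event just before running the module. The names `run_module`, `MAX_PREFETCH`, and the loop structure are made up for this sketch and are not the PR's actual code.

```python
import collections
import torch

copy_stream = torch.cuda.Stream()      # "stream B": only does host -> device copies
lookahead = collections.deque()        # buffer of modules whose weights are already in flight
MAX_PREFETCH = 3                       # tweakable lookahead size (hypothetical setting)

def prefetch(module):
    """Copy one module's (pinned) CPU weights to the GPU on the side stream."""
    with torch.cuda.stream(copy_stream):
        gpu_params = {n: p.to("cuda", non_blocking=True) for n, p in module.named_parameters()}
        ready = torch.cuda.Event()
        ready.record()                 # marks "this module's weights are usable on the GPU"
    lookahead.append((gpu_params, ready))

def forward_all(modules, x):
    for m in modules[:MAX_PREFETCH]:   # fill the lookahead buffer before computing
        prefetch(m)
    for i, module in enumerate(modules):
        gpu_params, ready = lookahead.popleft()
        torch.cuda.current_stream().wait_event(ready)  # compute waits only for this module
        x = run_module(module, gpu_params, x)          # hypothetical: forward using the GPU copies
        done = torch.cuda.Event()
        done.record()                                  # default stream: finished with these weights
        copy_stream.wait_event(done)                   # copy stream won't recycle them too early
        if i + MAX_PREFETCH < len(modules):
            prefetch(modules[i + MAX_PREFETCH])
    return x
```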

Concerns

  • This is still a prototype, and not all original semantics are followed.
  • CUDA streams and CUDA events are used, and they are CUDA specific. I think IPEX has similar facilities, but DML has nothing comparable.
  • The size of the lookahead buffer is a tweakable setting. A larger buffer increases VRAM usage; a smaller buffer probably makes the forward pass a bit slower. The speed gained from a larger buffer has a limit.

Checklist:

@wfjsw changed the base branch from master to dev · February 7, 2024 08:36
@wfjsw (Contributor, Author) commented Feb 9, 2024

Smart mover

The smart mover does something similar to Forge: it only moves tensors from CPU to GPU, never back.

At some point I was somehow able to get the same or even 2x the speed of sd-webui-forge under `--always-no-vram` (which is roughly equivalent to `--lowvram`) when `max_prefetch` is 5-8. Now I can only get it as fast as Forge, and the output is broken somehow. Unfortunately I did not save the file at that time. There are bugs hidden somewhere in the code, but I am getting tired of trying to find them.

I'm going to leave it as is and come back when I get interested again.

@wfjsw (Contributor, Author) commented Feb 12, 2024

The broken images seem to be caused by not synchronizing back to the creation stream after usage. Fixed.

Also changed to layer-wise movement.

@wfjsw (Contributor, Author) commented Feb 12, 2024

There might be a problem with extra networks. Haven't looked into that.

@AUTOMATIC1111 (Owner):

This looks very cool, but please don't change the formatting of those existing lines in lowvram.py (newlines and quotes), put the new classes into a separate file, and write a short comment there on how the performance gain is achieved. Also, maybe add an option to use the old method even if streaming is supported.

@wfjsw (Contributor, Author) commented Feb 17, 2024

Need some help making this support LoRA/ControlNets.

Since these probably alter weights and biases, the tensors cached in the mover may be outdated, and a slow path will be taken.

@AUTOMATIC1111 (Owner):

I'll be honest with you, I don't know how it works, so I can't help either. The "not moving from GPU to CPU" part is smart and reasonable, and it can be implemented with ease, but the CUDA streams part would need me to get a lot more involved to understand.

Plus, there is FP8 support now, maybe that one can work better than lowvram for people who need it?

@wfjsw (Contributor, Author) commented Feb 17, 2024

The CUDA stream is used because I want to overlap memcpy with compute. Streams can be thought of as threads for the GPU.

Briefly speaking, it does several things (all non-blocking from Python's point of view):

  • On stream B, the CPU tensor is copied to CUDA.
  • On stream B, it calls `record_event`, which can be seen as a timeline marker. This marks the tensor as ready.
  • On the default stream, it waits for the ready event and then computes the forward pass with the tensor.
  • On the default stream, it calls `record_event` to mark that the work on this tensor is done.
  • On stream B, it waits for the done event, so the tensor is only deallocated after it has finished its job.
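A rough, self-contained sketch of that sequence for a single weight tensor (illustrative names only, not the PR's code):

```python
import torch

copy_stream = torch.cuda.Stream()                    # "stream B"
compute_stream = torch.cuda.current_stream()         # default stream
ready, done = torch.cuda.Event(), torch.cuda.Event()

cpu_weight = torch.randn(4096, 4096).pin_memory()    # pinned so the async copy can overlap compute
x = torch.randn(1, 4096, device="cuda")

with torch.cuda.stream(copy_stream):
    gpu_weight = cpu_weight.to("cuda", non_blocking=True)  # 1. copy the CPU tensor to CUDA on stream B
    ready.record()                                          # 2. record_event: tensor is ready

compute_stream.wait_event(ready)                     # 3. default stream waits for "ready"...
y = x @ gpu_weight                                   #    ...then computes with the tensor
done.record(compute_stream)                          # 4. record_event: work on this tensor is done

copy_stream.wait_event(done)                         # 5. stream B waits for "done" before the
del gpu_weight                                       #    tensor is deleted / its memory reused
```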

Apart from the moving itself, I also have to:

  • Track the CUDA tensor so it is the one used for the forward pass
  • Save the events so they can be waited on from the other stream
  • Maintain references to tensors so they are deallocated at the right time

Can the "not moving from GPU to CPU" part really be implemented easily? Torch moves modules in-place, which is a major pain that prevents me from keeping the implementation simple and forces me to place hooks on `torch.nn.functional`. I assume I would have to deep-copy the modules to achieve this, and that sounds costly. CUDA streams, on the other hand, look easier.
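For context, a bare-bones sketch of what a `torch.nn.functional` hook can look like (purely illustrative; the `gpu_weight_cache` dict and how it gets filled are assumptions, not the PR's implementation):

```python
import torch
import torch.nn.functional as F

# hypothetical cache: CPU parameter tensor (keyed by identity) -> prefetched GPU copy
gpu_weight_cache: dict[torch.Tensor, torch.Tensor] = {}

_original_linear = F.linear

def patched_linear(input, weight, bias=None):
    # Use the prefetched GPU copy if the mover has one, instead of relying on the
    # module itself having been moved in-place with .to().
    weight = gpu_weight_cache.get(weight, weight)
    if bias is not None:
        bias = gpu_weight_cache.get(bias, bias)
    return _original_linear(input, weight, bias)

F.linear = patched_linear
```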

Regarding FP8, I think it does not hurt to have more options.

@wfjsw (Contributor, Author) commented Feb 17, 2024

Actually, there are two main pain points that drove me here:

  • To do "not moving from GPU to CPU" at the module level, I need to clone the module and use the clone for the forward pass. That can't be done with a forward hook and can't be reliably done by monkeypatching `forward`.
  • To actually gain a performance benefit, I need to know the next N tensors while hooking, and I don't know how to do this in a reasonable way. An alternative is to avoid CUDA stream synchronization in the middle of the computation, so I can queue all jobs before they run; in that case the result would not be immediately available to the Python world after `forward`. AFAIK Torch performs synchronization on every module's `forward`, however, so this is hard to do.

@wfjsw marked this pull request as draft February 19, 2024 10:31
@wfjsw marked this pull request as ready for review February 21, 2024 03:09
@wfjsw (Contributor, Author) commented Feb 21, 2024

A better approach is implemented here, which uses the async nature of CUDA. One thing to note: for the acceleration to work, the weights and biases of the UNet must be placed in non-pageable (pinned) memory (they will fall back to pageable memory if the module is somehow `.to()`-ed). LoRA is tested.

However, if any extension/module touches the weights and biases of the model (by using `to()`, for example), it needs to re-pin them with `._apply(lambda x: x.pin_memory())`. Otherwise the mover falls back to the slow path.
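A hedged example of what that looks like from an extension's point of view (`module` here is just a stand-in `nn.Linear`):

```python
import torch

module = torch.nn.Linear(1024, 1024)        # stand-in for e.g. a UNet block kept on the CPU

module.to(torch.float16)                    # any .to() call leaves the weights in ordinary pageable memory
module._apply(lambda x: x.pin_memory())     # re-pin them so async copies can overlap with compute again
```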

@light-and-ray (Contributor) commented Feb 21, 2024

As I understand it, this requires more VRAM than the old lowvram. Maybe you should disable it by default?

Also, with FP8, 2 GB is enough for medvram, and lowvram only saves about 500 MB. With this patch, this already small VRAM difference will become even smaller.

@wfjsw (Contributor, Author) commented Feb 21, 2024

> As I understand it, this requires more VRAM than the old lowvram. Maybe you should disable it by default?

I profiled with `PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync` and without FP8. The original implementation takes 166 MB, while this implementation takes 387 MB. The difference is negligible compared with the difference in sampling step time, which is 625 ms vs 229 ms.

Also, the two streams do not go out of sync by a big margin.

[profiler screenshots]

> Also, with FP8, 2 GB is enough for medvram, and lowvram only saves about 500 MB. With this patch, this already small VRAM difference will become even smaller.

False. This takes significantly less VRAM: 890 MB vs 350 MB. The speed difference is 200 ms per step vs 260 ms per step.

[screenshots]

@light-and-ray (Contributor):

I saw on Discord that async lowvram keeps more than one layer on the GPU. But maybe it really requires even less VRAM, I don't know.

> The original implementation takes 755 MB

Is that total VRAM usage? I tested on my 2 GB card and it ate about 1.7 GB in FP16 lowvram mode.

I will test this patch against the original lowvram and medvram on it.

@wfjsw (Contributor, Author) commented Feb 21, 2024

> Is that total VRAM usage? I tested on my 2 GB card and it ate about 1.7 GB in FP16 lowvram mode.

It is the peak usage recorded by Nsight.
`PYTORCH_CUDA_ALLOC_CONF` makes a big difference here; the `native` backend does consume ~1.6 GB of VRAM. But I think it is a matter of GC and will resolve itself when there is VRAM pressure.

@wfjsw (Contributor, Author) commented Feb 21, 2024

A closer look shows that it was just the horizontal scale of the diagram; the actual usage is smaller. See the tooltips on the new screenshots.

@light-and-ray (Contributor) commented Feb 21, 2024

GPU MX150 2GB

ARGS:
export PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
./webui.sh --xformers "$@"
+ fp8

4 steps 512x512


--lowvram
Time taken: 22.1 sec.
A: 0.96 GB, R: 1.19 GB, Sys: 1.3/1.95508 GB (66.0%)

--medvram
Time taken: 18.7 sec.
A: 1.33 GB, R: 1.72 GB, Sys: 1.8/1.95508 GB (93.2%)

this patch --lowvram
Time taken: 17.2 sec.
A: 1.17 GB, R: 1.84 GB, Sys: 1.9/1.95508 GB (99.7%)


torch: 2.1.2+cu121  •  xformers: 0.0.23.post1

Hm, this patch really requires more VRAM for me.

Maybe it ignores `PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync`? Maybe I need to update the Nvidia driver? Or maybe `--xformers` is the problem?

@light-and-ray (Contributor) commented Feb 21, 2024

The same VRAM usage, but slower...

This patch, no xformers and no PYTORCH_CUDA_ALLOC_CONF
Time taken: 23.2 sec.
A: 1.25 GB, R: 1.85 GB, Sys: 1.9/1.95508 GB (99.2%)

@wfjsw (Contributor, Author) commented Feb 21, 2024

Maybe your actual compute work is lagging behind. Use `nsys` to figure it out.

I can add a synchronization mark there to constrain it, but that hurts performance by a lot.

Without xformers it will be slower.

One interesting thing is that lowvram goes faster than medvram. Can you upload an nsys profile?

@light-and-ray (Contributor) commented Feb 21, 2024

> One interesting thing is that lowvram goes faster than medvram. Can you upload an nsys profile?

Yes, but higher VRAM usage: 93% vs 99% XD

@light-and-ray (Contributor):

> Maybe your actual compute work is lagging behind. Use `nsys` to figure it out.

Can't install it... I installed it with `sudo apt install nsight-systems`, but there is only `nsys-ui`, which doesn't work: `Cannot mix incompatible Qt library (5.15.10) with this library (5.15.2)` (I hate these Qt compatibility issues).

@wfjsw (Contributor, Author) commented Feb 21, 2024

You can use the nsys CLI. Collect this data:

  • Collect CUDA trace: On
  • Collect CUDA's GPU memory usage: On

@light-and-ray (Contributor) commented Feb 21, 2024

I only have `nsys-ui` after installation. Maybe I need to reboot the PC, but I'm afraid because of the Qt incompatibility error. It's a bad sign, maybe I won't be able to boot my KDE XD. I already had a similar issue, and I have to be online today.

I will try to collect this data.

@wfjsw (Contributor, Author) commented Feb 21, 2024

@light-and-ray (Contributor):

@wfjsw check Discord PM

@wfjsw (Contributor, Author) commented Feb 23, 2024

IPEX does not seem to support `pin_memory` right now.

@wfjsw (Contributor, Author) commented Feb 24, 2024

To fix for default options:

Traceback (most recent call last):
File "threading.py", line 973, in _bootstrap
File "threading.py", line 1016, in _bootstrap_inner
File "<enhanced_experience vendors.sentry_sdk.integrations.threading>", line 70, in run
File "E:\novelai-webui\py310\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
result = context.run(func, *args)
File "E:\novelai-webui\py310\lib\site-packages\gradio\utils.py", line 707, in wrapper
response = f(*args, **kwargs)
File "E:\novelai-webui\modules\ui_extra_networks.py", line 419, in pages_html
return refresh()
File "E:\novelai-webui\modules\ui_extra_networks.py", line 425, in refresh
pg.refresh()
File "E:\novelai-webui\modules\ui_extra_networks_textual_inversion.py", line 13, in refresh
sd_hijack.model_hijack.embedding_db.load_textual_inversion_embeddings(force_reload=True)
File "E:\novelai-webui\modules\textual_inversion\textual_inversion.py", line 222, in load_textual_inversion_embeddings
self.expected_shape = self.get_expected_shape()
File "E:\novelai-webui\modules\textual_inversion\textual_inversion.py", line 154, in get_expected_shape
vec = shared.sd_model.cond_stage_model.encode_embedding_init_text( ",", 1)
File "E:\novelai-webui\modules\shared_items.py", line 128, in sd_model
return modules.sd_models.model_data.get_sd_model()
File "E:\novelai-webui\modules\sd_models.py", line 574, in get_sd_model
errors.display(e, "loading stable diffusion model", full_traceback=True)
File "E:\novelai-webui\modules\sd_models.py", line 571, in get_sd_model
load_model()
File "E:\novelai-webui\modules\sd_models.py", line 698, in load_model
load_model_weights(sd_model, checkpoint_info, state_dict, timer)
File "E:\novelai-webui\modules\sd_models.py", line 441, in load_model_weights
module.to(torch.float8_e4m3fn)
File "E:\novelai-webui\py310\lib\site-packages\torch\nn\modules\module.py", line 825, in _apply
param_applied = fn(param)
File "E:\novelai-webui\modules\sd_models.py", line 441, in <lambda>
module.to(torch.float8_e4m3fn)
RuntimeError: cannot pin 'CUDAFloat8_e4m3fnType' only dense CPU tensors can be pinned
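A minimal sketch of the kind of guard that would avoid this error (pin only dense CPU tensors); this is just an illustration of the needed fix, not necessarily what ends up in the PR:

```python
import torch

def pin_if_possible(t: torch.Tensor) -> torch.Tensor:
    # Only dense CPU tensors can be pinned; leave CUDA or non-strided tensors
    # (e.g. weights already cast/moved for FP8) untouched instead of raising.
    if t.device.type == "cpu" and t.layout == torch.strided:
        return t.pin_memory()
    return t

# usage (hypothetical): diff_model.time_embed._apply(pin_if_possible)
```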


if use_streamlined_lowvram:
    # put it into pinned memory to achieve data transfer overlap
    diff_model.time_embed._apply(lambda x: x.pin_memory())
A Contributor commented on this diff:

Specifying the `device` parameter will let `pin_memory` offload to other non-CUDA backends (e.g. IPEX):

Suggested change:
-    diff_model.time_embed._apply(lambda x: x.pin_memory())
+    diff_model.time_embed._apply(lambda x: x.pin_memory(device=devices.get_optimal_device_name()))

@Nuullll (Contributor) commented Feb 25, 2024

Intel A750 8G (IPEX backend): this improves performance from 0.7 it/s to 1.5 it/s with no significant VRAM usage increase.

@wfjsw (Contributor, Author) commented Feb 25, 2024

Someone says LoRA is not actually working. Pending test.

UPDATE: I cannot reproduce it.

UPDATE: For FP16 LoRAs, it has a hard time trying to apply them on the CPU. A cast is needed here.

@wfjsw (Contributor, Author) commented Feb 27, 2024

TODO: add a queue somewhere to constrain the speed.

@wfjsw marked this pull request as draft February 27, 2024 09:01
@wfjsw (Contributor, Author) commented Mar 10, 2024

@light-and-ray can you try this? It should no longer OOM now.
nvm, I implemented it wrongly.

@light-and-ray (Contributor):

It still uses more VRAM than medvram.

Time taken: 17.2 sec.
A: 1.27 GB, R: 1.85 GB, Sys: 2.0/1.95508 GB (99.9%)

@wfjsw (Contributor, Author) commented Mar 11, 2024

There is a new setting in the optimization folder. Reduce it and see what happens.

You can go with 1 or 2.

@light-and-ray (Contributor):

Maximum number of loaded modules in low VRAM mode = 1
Time taken: 18.9 sec.
A: 1.19 GB, R: 1.76 GB, Sys: 1.9/1.95508 GB (95.6%)

Maximum number of loaded modules in low VRAM mode = 2
Time taken: 16.9 sec.
A: 1.19 GB, R: 1.75 GB, Sys: 1.9/1.95508 GB (94.7%)

On the first few runs there was 99% usage.

According to the VRAM usage graph, the 95-99% peak is in the VAE stages.

Disabled HyperTile VAE + Maximum number of loaded modules in low VRAM mode = 2
Time taken: 16.9 sec.
A: 1.18 GB, R: 1.73 GB, Sys: 1.8/1.95508 GB (93.8%)

Last test without your patch, still much lower VRAM:
Time taken: 21.7 sec.
A: 0.96 GB, R: 1.41 GB, Sys: 1.5/1.95508 GB (75.5%)

@AndreyRGW (Contributor):

Any progress on this?

@wfjsw (Contributor, Author) commented Jul 23, 2024

I still need an Nsight Systems profile from low-end cards to find out why the max block limit does not seem to work.
