Replies: 6 comments 36 replies
-
Just to make sure: are you using the same transcription options in openai/whisper, especially the same beam size? Also, how are you running multiple instances with openai/whisper? Do you use the same technique to run multiple instances of faster-whisper?
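For what it's worth, here is a minimal sketch of what matching the options would look like, assuming the standard APIs of both libraries; the model name, file path, and option values below are placeholders, not taken from your setup:

    import whisper  # openai/whisper
    from faster_whisper import WhisperModel

    audio_path = "example.wav"  # placeholder

    # openai/whisper: beam_size is forwarded to the decoder options
    ow_model = whisper.load_model("large-v2")
    ow_result = ow_model.transcribe(audio_path, beam_size=5, temperature=0.0)

    # faster-whisper: use the same beam size so the comparison is apples to apples
    fw_model = WhisperModel("large-v2", device="cuda", compute_type="float16")
    segments, info = fw_model.transcribe(audio_path, beam_size=5, temperature=0.0)
    fw_text = "".join(segment.text for segment in segments)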
-
If you have powerful hardware, the better parallelization technique is batched execution. I also rather suspect a RAM bottleneck: your screenshot shows 32 GB of RAM, while the RTX 6000 has 48 GB of VRAM; a rule of thumb is that you should have RAM ≥ VRAM.
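faster-whisper does not expose a single batched decode call here, so one common approximation is to share one model between a few Python threads and raise num_workers so the transcriptions actually run concurrently. A rough sketch, with placeholder file names and worker counts:

    from concurrent.futures import ThreadPoolExecutor
    from faster_whisper import WhisperModel

    # One model shared by several threads; num_workers > 1 lets the underlying
    # CTranslate2 model serve those threads in parallel instead of serializing them.
    model = WhisperModel("large-v2", device="cuda", compute_type="float16", num_workers=4)

    def transcribe_file(path):
        segments, info = model.transcribe(path, beam_size=5)
        return path, "".join(segment.text for segment in segments)

    files = ["a.wav", "b.wav", "c.wav", "d.wav"]  # placeholders
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = dict(pool.map(transcribe_file, files))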
-
I was noticing low GPU usage, but I assumed my system was messed up and have since wiped it. I'll post my stats anyway in case they're useful: 3090 Ti, i9-10900K, 128 GB RAM. Tested on Windows 10 native and WSL2 (the full 128 GB is also available in WSL2), beam_size 5, everything else left at defaults, power measured at the wall.
1) Sanity check with a bad filename (no transcription)
2) The 13-minute YouTube video, Windows native
3) The 13-minute YouTube video, WSL2
So the 3090 Ti takes 1.35x longer in Whisper, but 3.26x longer in faster-whisper. I could have sworn that when I first benchmarked faster-whisper on that 13-minute file, the 3090 was close to your 54 seconds. In the time since, I was trying to optimize Stable Diffusion and left a trail of weird CUDA DLLs all over my system, so I have since burned it all down.

If you or anyone else who gets close to 54 seconds happens to remember, can you post the exact versions (direct links if you can) of CUDA, cuDNN, cuBLAS, and anything else you installed that has a specific version, and mention whether it's Linux, WSL2, or Windows native? There's some stuff like FP32 accumulate that is faster on the V100, but I'm pretty sure the 3090 shouldn't be a third of the speed. I know how to install everything so that faster-whisper works, but my guess is that not all versions are equal. I don't know what's up with WSL2.

Other stuff runs at 250+ watts; I'm just going to wipe that too. I quite like the efficient power usage when I'm at the computer, and it's still more than 2x Whisper, but cranking up the speed and the GPU would be a nice option when I'm not present. If I'm going to set CUDA up again from scratch, I want to start with a clean install and then rerun the bench every time I tweak something, to make sure I don't break it again.

Edit: Set this up on a friend's 4090 in WSL2. Start to finish, including model loading time and language detection, 51 seconds on the 13-minute video.
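For anyone posting timings, a small sketch of how the relevant versions could be dumped next to the numbers; it assumes PyTorch happens to be installed for the CUDA/cuDNN queries (faster-whisper itself does not need it):

    import platform
    import ctranslate2
    import faster_whisper
    import torch

    print("OS:", platform.platform())
    print("faster-whisper:", faster_whisper.__version__)
    print("CTranslate2:", ctranslate2.__version__)
    print("CUDA devices seen by CTranslate2:", ctranslate2.get_cuda_device_count())
    print("PyTorch CUDA toolkit:", torch.version.cuda)
    print("cuDNN:", torch.backends.cudnn.version())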
-
Thanks everyone for the feedback. I possibly have a solution for this issue in OpenNMT/CTranslate2#1177. Can you help test this change and see if it makes a difference? To install this development branch:
-
When I use faster-whisper on a 1070 Ti to transcribe a short sentence it takes about 1 s, but when I use an RTX 4090 it takes about 500 ms, only a 2x speedup. But some benchmark websites say the 4090's int8 inference speed is 8x faster than the 1070 Ti's. So is this the same problem?
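Part of the wall time on a clip that short is fixed overhead (model load, the first CUDA calls, host-side decoding logic), which dilutes the raw throughput difference between the two cards. A minimal sketch that separates a warm-up run from the timed runs, with a placeholder file name:

    import time
    from faster_whisper import WhisperModel

    model = WhisperModel("large-v2", device="cuda", compute_type="float16")

    # Warm-up: pays one-time costs (CUDA context, first-run allocations) outside the timing
    list(model.transcribe("short_sentence.wav", language="en", beam_size=1)[0])

    # Timed steady-state runs
    runs = 10
    start = time.perf_counter()
    for _ in range(runs):
        segments, _ = model.transcribe("short_sentence.wav", language="en", beam_size=1)
        _ = [segment.text for segment in segments]
    print("average seconds per run:", (time.perf_counter() - start) / runs)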
-
I'm also seeing similar patterns: when I do a tight %timeit model.transcribe(30_sec_data, language='en', ...) and monitor the usage with nvidia-smi, sometimes I see 50-60% or 90%, but sometimes it's 0-1-2%. I'll try your branch now. Another thing I noticed is the following:

    import time
    import cProfile

    def transcribe(inp):
        # exhaust the segment generator so the whole transcription actually runs
        return list(model.transcribe(
            inp,  # ~15 seconds of audio
            beam_size=1,
            temperature=0,
            vad_filter=False,
            best_of=1,
            condition_on_previous_text=False,
            language='en',
            without_timestamps=True,
        )[0])

    # model is a WhisperModel created earlier; chunk is the ~15 s audio buffer
    for i in range(10):
        start = time.perf_counter()
        with cProfile.Profile() as pr:
            tr = transcribe(chunk)
            text = [seg.text for seg in tr]
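Not part of the original snippet, but one way to see where that time goes is to dump the collected profile sorted by cumulative time:

    import pstats

    # `pr` is the cProfile.Profile object from the loop above
    stats = pstats.Stats(pr)
    stats.sort_stats("cumulative").print_stats(15)  # top 15 entries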
-
Even with parallel processing, I am generally not able to get over 50% GPU utilization.
Windows 10
Nvidia Quadro RTX 6000 Ada Gen
Model Large-v2
Float16
I tried the same with https://github.com/openai/whisper, and with a few instances in parallel I can get about 98% GPU utilization. Four instances is the maximum I can load into memory, and that is the most effective setup for processing multiple files (a rough faster-whisper sketch of this multi-instance approach is at the end of this comment). The best time efficiency I can get there is with FP32 precision.
So the result is that for mixed-size audio files (from 5 to about 50 minutes), I get better speed with https://github.com/openai/whisper than with faster-whisper. For multiple (hundreds of) small files, faster-whisper is really faster.
I am really not sure how, or if, it is possible to improve GPU utilization and speed in general. It is possible that the poor GPU utilization is connected to the older CUDA (11.8) used by PyTorch not fully supporting new GPUs.
Any advice is appreciated.
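For concreteness, a rough sketch of the kind of multi-instance setup I mean, written against the public faster-whisper API; the file names, model size, and worker count are placeholders rather than my exact script:

    from concurrent.futures import ProcessPoolExecutor

    _model = None

    def _init_worker():
        # Each worker process loads its own copy of the model once,
        # mirroring the "four instances in parallel" approach.
        global _model
        from faster_whisper import WhisperModel
        _model = WhisperModel("large-v2", device="cuda", compute_type="float16")

    def _transcribe(path):
        segments, _ = _model.transcribe(path, beam_size=5)
        return path, "".join(segment.text for segment in segments)

    if __name__ == "__main__":
        files = ["a.wav", "b.wav", "c.wav", "d.wav"]  # placeholders
        with ProcessPoolExecutor(max_workers=4, initializer=_init_worker) as pool:
            for path, text in pool.map(_transcribe, files):
                print(path, len(text), "characters")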