Replies: 6 comments 36 replies
-
Just to make sure: are you using the same transcription options in openai/whisper, especially the same beam size? Also, how are you running multiple instances with openai/whisper? Do you use the same technique to run multiple instances of faster-whisper?
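For what it's worth, here is a minimal sketch of what matching the options would look like, assuming the standard APIs of both libraries; the model name, file path, and option values below are placeholders, not taken from your setup:

    import whisper  # openai/whisper
    from faster_whisper import WhisperModel

    audio_path = "example.wav"  # placeholder

    # openai/whisper: beam_size is forwarded to the decoder options
    ow_model = whisper.load_model("large-v2")
    ow_result = ow_model.transcribe(audio_path, beam_size=5, temperature=0.0)

    # faster-whisper: use the same beam size so the comparison is apples to apples
    fw_model = WhisperModel("large-v2", device="cuda", compute_type="float16")
    segments, info = fw_model.transcribe(audio_path, beam_size=5, temperature=0.0)
    fw_text = "".join(segment.text for segment in segments)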
-
If you have powerful hardware, the better parallelization technique is batched execution. I also rather suspect a RAM bottleneck: your screenshot shows 32 GB of RAM, while the RTX 6000 has 48 GB of VRAM; a rule of thumb is that you should have RAM ≥ VRAM.
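faster-whisper does not expose a single batched decode call here, so one common approximation is to share one model between a few Python threads and raise num_workers so the transcriptions actually run concurrently. A rough sketch, with placeholder file names and worker counts:

    from concurrent.futures import ThreadPoolExecutor
    from faster_whisper import WhisperModel

    # One model shared by several threads; num_workers > 1 lets the underlying
    # CTranslate2 model serve those threads in parallel instead of serializing them.
    model = WhisperModel("large-v2", device="cuda", compute_type="float16", num_workers=4)

    def transcribe_file(path):
        segments, info = model.transcribe(path, beam_size=5)
        return path, "".join(segment.text for segment in segments)

    files = ["a.wav", "b.wav", "c.wav", "d.wav"]  # placeholders
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = dict(pool.map(transcribe_file, files))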
-
I was noticing low GPU usage, but I assumed my system was messed up and have since wiped it. I'll post my stats anyway in case they're useful: 3090 Ti, i9-10900K, 128 GB RAM. Tested on Windows 10 native and WSL2 (the full 128 GB is also available in WSL2), beam_size 5, everything else left at defaults, power measured at the wall.
1) Sanity check with a bad filename (no transcription)
2) The 13-minute YouTube video, Windows native
3) The 13-minute YouTube video, WSL2
So the 3090 Ti takes 1.35x longer in Whisper, but 3.26x longer in faster-whisper. I could have sworn that when I first benchmarked faster-whisper on that 13-minute file, the 3090 was close to your 54 seconds. In the time since, I was trying to optimize Stable Diffusion and left a trail of weird CUDA DLLs all over my system, so I have since burned it all down.

If you or anyone else who gets close to 54 seconds happens to remember, can you post the exact versions (direct links if you can) of CUDA, cuDNN, cuBLAS, and anything else you installed that has a specific version, and mention whether it's Linux, WSL2, or Windows native? There's some stuff like FP32 accumulate that is faster on the V100, but I'm pretty sure the 3090 shouldn't be a third of the speed. I know how to install everything so that faster-whisper works, but my guess is that not all versions are equal. I don't know what's up with WSL2.

Other stuff runs at 250+ watts; I'm just going to wipe that too. I quite like the efficient power usage when I'm at the computer, and it's still more than 2x Whisper, but cranking up the speed and the GPU would be a nice option when I'm not present. If I'm going to set CUDA up again from scratch, I want to start with a clean install and then rerun the bench every time I tweak something, to make sure I don't break it again.

Edit: Set this up on a friend's 4090 in WSL2. Start to finish, including model loading time and language detection, 51 seconds on the 13-minute video.
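For anyone posting timings, a small sketch of how the relevant versions could be dumped next to the numbers; it assumes PyTorch happens to be installed for the CUDA/cuDNN queries (faster-whisper itself does not need it):

    import platform
    import ctranslate2
    import faster_whisper
    import torch

    print("OS:", platform.platform())
    print("faster-whisper:", faster_whisper.__version__)
    print("CTranslate2:", ctranslate2.__version__)
    print("CUDA devices seen by CTranslate2:", ctranslate2.get_cuda_device_count())
    print("PyTorch CUDA toolkit:", torch.version.cuda)
    print("cuDNN:", torch.backends.cudnn.version())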
-
Thanks everyone for the feedback. I possibly have a solution for this issue in OpenNMT/CTranslate2#1177. Can you help test this change and see if it makes a difference? To install this development branch:
-
When I use faster-whisper on a 1070 Ti to transcribe a short sentence it takes about 1 s, but when I use an RTX 4090 it takes about 500 ms, only a 2x speedup. But some benchmark websites say the 4090's int8 inference speed is 8x faster than the 1070 Ti's. So is this the same problem?
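Part of the wall time on a clip that short is fixed overhead (model load, the first CUDA calls, host-side decoding logic), which dilutes the raw throughput difference between the two cards. A minimal sketch that separates a warm-up run from the timed runs, with a placeholder file name:

    import time
    from faster_whisper import WhisperModel

    model = WhisperModel("large-v2", device="cuda", compute_type="float16")

    # Warm-up: pays one-time costs (CUDA context, first-run allocations) outside the timing
    list(model.transcribe("short_sentence.wav", language="en", beam_size=1)[0])

    # Timed steady-state runs
    runs = 10
    start = time.perf_counter()
    for _ in range(runs):
        segments, _ = model.transcribe("short_sentence.wav", language="en", beam_size=1)
        _ = [segment.text for segment in segments]
    print("average seconds per run:", (time.perf_counter() - start) / runs)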
-
I'm also seeing similar patterns: when I do a tight %timeit model.transcribe(30_sec_data, language='en', ...) and monitor the usage with nvidia-smi, sometimes I see 50-60% or 90%, but sometimes it's 0-1-2%. I'll try your branch now. Another thing I noticed is the following:

    import time
    import cProfile

    def transcribe(inp):
        # exhaust the segment generator so the whole transcription actually runs
        return list(model.transcribe(
            inp,  # ~15 seconds of audio
            beam_size=1,
            temperature=0,
            vad_filter=False,
            best_of=1,
            condition_on_previous_text=False,
            language='en',
            without_timestamps=True,
        )[0])

    # model is a WhisperModel created earlier; chunk is the ~15 s audio buffer
    for i in range(10):
        start = time.perf_counter()
        with cProfile.Profile() as pr:
            tr = transcribe(chunk)
            text = [seg.text for seg in tr]
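Not part of the original snippet, but one way to see where that time goes is to dump the collected profile sorted by cumulative time:

    import pstats

    # `pr` is the cProfile.Profile object from the loop above
    stats = pstats.Stats(pr)
    stats.sort_stats("cumulative").print_stats(15)  # top 15 entries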
-
Even with parallel processing, I am generally not able to get over 50% GPU utilization.
Windows 10
Nvidia Quadro RTX 6000 Ada Gen
Model Large-v2
Float16
I tried the same with https://github.com/openai/whisper, and with a few instances in parallel I can get about 98% GPU utilization. Four instances is the maximum I can load into memory, and that is the most effective setup for processing multiple files (a rough faster-whisper sketch of this multi-instance approach is at the end of this comment). The best time efficiency I can get there is with FP32 precision.
So the result is that for mixed-size audio files (from 5 to about 50 minutes), I get better speed with https://github.com/openai/whisper than with faster-whisper. For multiple (hundreds of) small files, faster-whisper is really faster.
I am really not sure how, or if, it is possible to improve GPU utilization and speed in general. It is possible that the poor GPU utilization is connected to the older CUDA (11.8) used by PyTorch not fully supporting new GPUs.
Any advice is appreciated.
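For concreteness, a rough sketch of the kind of multi-instance setup I mean, written against the public faster-whisper API; the file names, model size, and worker count are placeholders rather than my exact script:

    from concurrent.futures import ProcessPoolExecutor

    _model = None

    def _init_worker():
        # Each worker process loads its own copy of the model once,
        # mirroring the "four instances in parallel" approach.
        global _model
        from faster_whisper import WhisperModel
        _model = WhisperModel("large-v2", device="cuda", compute_type="float16")

    def _transcribe(path):
        segments, _ = _model.transcribe(path, beam_size=5)
        return path, "".join(segment.text for segment in segments)

    if __name__ == "__main__":
        files = ["a.wav", "b.wav", "c.wav", "d.wav"]  # placeholders
        with ProcessPoolExecutor(max_workers=4, initializer=_init_worker) as pool:
            for path, text in pool.map(_transcribe, files):
                print(path, len(text), "characters")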