Whisper ONNX model error during decoding: Non-zero status code returned while running Expand node #66

cvl01 · 2024-07-31T11:01:10Z

I am getting the following error when using the whisper engine with align.

2024-07-31 13:00:06.2345858 [E:onnxruntime:, sequential_executor.cc:516 onnxruntime::ExecuteKernel] Non-zero status code returned while running Expand node. Name:'/blocks.0/attn/Expand' Status Message: invalid expand shape
Error: Non-zero status code returned while running Expand node. Name:'/blocks.0/attn/Expand' Status Message: invalid expand shape

Command I am running, for reference:
echogarden align "audio.wav" "transcript.txt" "result.srt" "result.json" --language=nl --crop=false --engine=whisper


rotemdan · 2024-07-31T11:04:41Z

Never seen it before. Could be a lot of things, including an ONNX runtime issue. Can you send the audio and transcript that produces it?

cvl01 · 2024-07-31T11:08:24Z

No, unfortunately I cannot share the audio, it is confidential.

I am getting this error on about 10 out of 20 audio files that I processed.

rotemdan · 2024-07-31T11:18:13Z

The error looks like it has to do with some issue with an input tensor or its dimensions.

Are you using the cpu provider (there can be all sorts of errors with DirectML or CUDA)?
What OS are you using? I've tested it on Windows mostly.
Do you know which exact line produces this error? That would help a lot. I could narrow it down to things like which of the two models (encoder / decoder) it happens in, and when exactly.
Does it happen with other sizes of Whisper models, like base or small?

Edit: you can run with --debug to get the full error.

cvl01 · 2024-07-31T11:32:53Z

I just ran it with --whisper.encoderProvider=cpu --whisper.decoderProvider=cpu --debug but I get the same error.

I am using Windows

This is the output. I have redacted some parts, but not the last words.

Prepare audio part at time position 1440.60.. 2.8ms
Extract mel spectogram from audio part.. 65.9ms
Normalize mel spectogram.. 16.8ms
Encode mel spectogram with Whisper encoder model.. 59.7ms
Decode text tokens with Whisper decoder model.. [REDACTED...] Daar kan hier echter geen sprake van2024-07-31 13:30:11.7071441 [E:onnxruntime:, sequential_executor.cc:516 onnxruntime::ExecuteKernel] Non-zero status code returned while running Expand node. Name:'/blocks.0/attn/Expand' Status Message: invalid expand shape
Error: Non-zero status code returned while running Expand node. Name:'/blocks.0/attn/Expand' Status Message: invalid expand shape
    at Immediate.<anonymous> (C:\Users\luik001c\echogarden-github\node_modules\onnxruntime-node\dist\backend.js:45:108)
    at process.processImmediate (node:internal/timers:483:21)

rotemdan · 2024-07-31T11:43:31Z

It happens during the call to decode a single next token.

Based on the fact that it shows that other tokens were decoded before it, the problem isn't with initializing the decoder, it's with the actual decoding itself, which narrows it down.

I've tested the Whisper implementation with many different inputs. I've never encountered this particular error. Without a way to reproduce it I can't know what exactly causes it.

You could try running the Whisper recognition model on the same audio input and see if you get an error. Most likely you wouldn't.

If you don't get it, it could have something to do with the particular tokens that are decoded using the forced decoding. Maybe something about the language being Dutch. I don't know, maybe special tokens that are used there. It's really hard to determine.

You say it's common in the inputs you are trying. You can also try English inputs to see if it happens with them as well.

If you happen to have anything that produces this error that you can send, it will really help.

cvl01 · 2024-07-31T12:08:49Z

I just ran it successfully with the Whisper small model. Seems to be an issue related to the tiny model specifically, then.

cvl01 · 2024-07-31T15:06:34Z

Update: still getting the error on some files, although on fewer files when using small instead of tiny.

2024-07-31 17:03:36.2064689 [E:onnxruntime:, sequential_executor.cc:516 onnxruntime::ExecuteKernel] Non-zero status code returned while running Reshape node. Name:'/blocks.0/attn/Reshape_3' Status Message: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\MLOperatorAuthorImpl.cpp(2538)\onnxruntime.dll!00007FFFFE1C07D4: (caller: 00007FFFFE7D6FDA) Exception(2) tid(6134) 8007023E {Application Error}
The exception %s (0x
Error: Non-zero status code returned while running Reshape node. Name:'/blocks.0/attn/Reshape_3' Status Message: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\MLOperatorAuthorImpl.cpp(2538)\onnxruntime.dll!00007FFFFE1C07D4: (caller: 00007FFFFE7D6FDA) Exception(2) tid(6134) 8007023E {Application Error}
The exception %s (0x

rotemdan · 2024-07-31T15:12:18Z

It looks like a different exception.

DmlExecutionProvider means it's using DirectML (Windows GPU acceleration). I get a lot of errors with that provider but usually on other models, not Whisper. May be an issue with the ONNX runtime. They only added support for GPU processing on onnxruntime-node in the past few months.
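To make that comparison less error-prone across many files, the two error lines in this thread can be picked apart mechanically. The following is only a rough sketch: `parse_ort_error` is a hypothetical helper (not part of Echogarden or the ONNX runtime), the regular expression is derived solely from the two messages quoted above, and checking for the `DmlExecutionProvider` substring is a heuristic that assumes the provider's source path appears in the status message, as it does here.

```python
import re
from typing import Dict, Optional

# Pattern based only on the two error lines quoted in this thread.
_ORT_ERROR = re.compile(
    r"Non-zero status code returned while running (?P<op>\w+) node\. "
    r"Name:'(?P<name>[^']*)' Status Message: (?P<message>.*)",
    re.DOTALL,
)


def parse_ort_error(line: str) -> Optional[Dict[str, object]]:
    """Split an ONNX runtime error line into operator, node name and message.

    Returns None when the line does not look like this kind of error.
    """
    match = _ORT_ERROR.search(line)
    if match is None:
        return None
    message = match.group("message").strip()
    return {
        "op": match.group("op"),
        "name": match.group("name"),
        "message": message,
        # Heuristic: DirectML failures in this thread carry the provider's
        # source path inside the status message.
        "mentions_dml": "DmlExecutionProvider" in message,
    }


if __name__ == "__main__":
    first = (
        "Error: Non-zero status code returned while running Expand node. "
        "Name:'/blocks.0/attn/Expand' Status Message: invalid expand shape"
    )
    print(parse_ort_error(first))
```

Applied to the first report, this gives an `Expand` failure that does not mention DirectML, while the later `Reshape_3` report does, which is consistent with the two being different exceptions.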

cvl01 · 2024-07-31T15:13:42Z

Then I'll try rerunning in cpu mode

rotemdan · 2024-10-04T08:22:19Z

The new 1.6.0 release now uses onnxruntime-node v1.19.2 instead of v1.18.x (previously).

Try to see if these issues still occur with the new version.

Anyway, based on my testing I've still never encountered them myself. It could be related to the particular combination of OS/hardware I'm testing on.

It's unlikely that a particular whisper model has an issue since these ONNX models are derived from the original ones from OpenAI.

Maybe something in the internal configuration of one of the models (parameters like number of heads, constants, etc.) is triggering the issue, but not really causing it.

rotemdan · 2024-12-05T12:32:38Z

I've finally reproduced this (pretty much by accident), with the latest onnxruntime-node (v1.20.1), and with the base.en model (CPU decoding provider).

I also tried the small.en, base, tiny and tiny.en models on the same exact input and parameters, but the error did not occur with them.

I tried with the same model (base.en) but with whisper.decoderProvider=dml, meaning it used GPU processing with DirectML. I was surprised but the issue did occur with it. That was useful information.

The JavaScript call stack doesn't show anything, seems to be some sort of async callback used internally within the ONNX runtime.

After some web searching, the closest issue I could find on the ONNX runtime repository is this one, opened March 19, 2021, and still unresolved. The issue has a more reduced test case, but I can't really understand it, or draw any useful information from it.

Whatever is causing this, given its intermittent, unpredictable nature, it's likely an issue, odd behavior, or edge case in the way the ONNX runtime deals with a particular tensor input to a particular node. The decoder model internally uses dynamic tensor dimensions, which means there could be rare cases where the tensor dimensions cause an issue, but that would be hard to isolate. The error message doesn't give enough information to understand which dimensions are involved.
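For context on what "invalid expand shape" means in general: the ONNX Expand operator broadcasts its input against a requested shape using numpy-style rules, so the two shapes are right-aligned and each pair of dimensions must either match or contain a 1. The sketch below re-implements only that rule in plain Python. `expand_output_shape` is a hypothetical name, and the shapes in the example are made up for illustration; they are not taken from the Whisper decoder, and this does not explain why the runtime ends up with incompatible dimensions in the first place.

```python
from typing import List, Sequence


def expand_output_shape(
    input_shape: Sequence[int], target_shape: Sequence[int]
) -> List[int]:
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    rank = max(len(input_shape), len(target_shape))
    # Right-align both shapes by padding the shorter one with leading 1s.
    a = [1] * (rank - len(input_shape)) + list(input_shape)
    b = [1] * (rank - len(target_shape)) + list(target_shape)

    result = []
    for dim_a, dim_b in zip(a, b):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            # Neither dimension is 1 and they differ: broadcasting fails.
            raise ValueError(
                f"invalid expand shape: {list(input_shape)} vs {list(target_shape)}"
            )
    return result


if __name__ == "__main__":
    # Hypothetical shapes: a size-1 axis can be stretched to 10.
    print(expand_output_shape([1, 6, 1, 64], [1, 6, 10, 64]))
    # Hypothetical shapes: 9 cannot be broadcast to 10, so this raises.
    try:
        expand_output_shape([1, 6, 9, 64], [1, 6, 10, 64])
    except ValueError as error:
        print(error)
```

With dynamic dimensions, a mismatch of this kind would only show up for the specific sequence lengths that produce it, which would fit the error appearing partway through decoding rather than at the start.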

The models are just the original OpenAI models, exported to ONNX using the standard PyTorch ONNX functionality. The issue seems to happen in the middle of decoding a part, not at the beginning or end, so it's not clear what exactly is happening so far.

I'll keep experimenting. I'll see if I can uncover anything further.

rotemdan added bug Something isn't working external Issues that are related to external sources labels Dec 5, 2024

rotemdan changed the title ~~Error in onnxruntime while aligning speech to transcript using whisper~~ Whisper ONNX model error during decoding: Non-zero status code returned while running Expand node Dec 6, 2024

rotemdan added the recognition Issue related to speech recognition label Dec 6, 2024
