The last version of whisper (v20240930) doesn't seem to be supported ('NoneType' object has no attribute 'shape') #212

Closed
mfucci opened this issue Oct 1, 2024 · 9 comments
Labels: bug (Something isn't working)

@mfucci
mfucci commented Oct 1, 2024

When I installed it (from git or with pip), it crashed with:

  File "/Users/mfucci/miniconda3/lib/python3.11/site-packages/whisper_timestamped-1.15.4-py3.11.egg/whisper_timestamped/transcribe.py", line 777, in hook_attention_weights
    if w.shape[-2] > 1:
       ^^^^^^^
AttributeError: 'NoneType' object has no attribute 'shape'

I had to downgrade the whisper library to get it to work:
pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git@v20231117

@Jeronymous
Member

Wait, what is your version of whisper-timestamped?
I remember a bug like this being fixed quite some time ago.

@Jeronymous
Member

OK, my bad: your version is visible (1.15.4) and it's the latest.

Indeed, if a new version of openai-whisper was released, whisper-timestamped probably needs to adapt.

@Jeronymous added the bug label on Oct 1, 2024
@villesau
Contributor

villesau commented Oct 2, 2024

Facing the same issue with: https://huggingface.co/openai/whisper-large-v3

whisper_timestamped.load_model("openai/whisper-large-v3", device="cuda")

Edit: Apparently this also happens when using a gibberish model name: whisper_timestamped.load_model("lasdfdsafdsa", device="cuda")

Edit: I wrongly assumed the model was at fault, but it was the whisper package version. Pinning openai-whisper==20240927 in requirements.txt helped! I believe this prevents using the new turbo model, though.
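
For readers who want to check their environment before pinning, here is a minimal sketch. It assumes openai-whisper keeps its date-style version strings (for example 20240930). The helper names `installed_whisper_version` and `is_reported_affected` are hypothetical and not part of either library; the cutoff is simply the first release reported as failing in this thread.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# First release reported in this thread as breaking whisper-timestamped 1.15.4.
FIRST_AFFECTED_RELEASE = 20240930


def installed_whisper_version() -> Optional[str]:
    """Return the installed openai-whisper version string, or None if absent."""
    try:
        return version("openai-whisper")
    except PackageNotFoundError:
        return None


def is_reported_affected(whisper_version: str) -> bool:
    """True if a date-style version is at or after the first affected release.

    Assumption: versions look like YYYYMMDD, optionally prefixed with "v".
    """
    digits = whisper_version.strip().lstrip("v")[:8]
    if len(digits) != 8 or not digits.isdigit():
        raise ValueError(f"unexpected version format: {whisper_version!r}")
    return int(digits) >= FIRST_AFFECTED_RELEASE


if __name__ == "__main__":
    found = installed_whisper_version()
    if found is None:
        print("openai-whisper is not installed")
    else:
        status = "affected" if is_reported_affected(found) else "not affected"
        print(found, status)
```

The comparison only reflects what was reported here for whisper-timestamped 1.15.4; a later whisper-timestamped release may work with newer whisper versions.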

@Alptimus
Alptimus commented Oct 2, 2024

I had to downgrade whisper library to get it to work: pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git@v20231117

❤️

@neonwatty

openai-whisper==20240927

With this pinned version of openai-whisper I can use the new turbo large v3 model.

Running on mac hardware.

@neonwatty
neonwatty commented Oct 3, 2024

Here's the traceback I received when using the new turbo model with openai-whisper==20240930, starting from whisper_timestamped's call into OpenAI's whisper.

Here's the release comparison of 20240930 vs 20240927.

The issue appears to be in the decoder, which was heavily pruned for turbo.

File ~/Desktop/speech_app/venv/lib/python3.12/site-packages/whisper_timestamped/transcribe.py:888, in _transcribe_timestamped_efficient(model, audio, remove_punctuation_from_words, compute_word_confidence, include_punctuation_in_confidence, refine_whisper_precision_nframes, alignment_heads, plot_word_alignment, word_alignement_most_top_layers, detect_disfluencies, trust_whisper_timestamps, use_timestamps_for_alignment, **whisper_options)
    885     if compute_word_confidence or no_speech_threshold is not None:
    886         all_hooks.append(model.decoder.ln.register_forward_hook(hook_output_logits))
--> 888     transcription = model.transcribe(audio, **whisper_options)
    890 finally:
    891 
    892     # Remove hooks
    893     for hook in all_hooks:

File ~/Desktop/speech_app/venv/lib/python3.12/site-packages/whisper/transcribe.py:279, in transcribe(model, audio, verbose, temperature, compression_ratio_threshold, logprob_threshold, no_speech_threshold, condition_on_previous_text, initial_prompt, word_timestamps, prepend_punctuations, append_punctuations, clip_timestamps, hallucination_silence_threshold, **decode_options)
    276 mel_segment = pad_or_trim(mel_segment, N_FRAMES).to(model.device).to(dtype)
    278 decode_options["prompt"] = all_tokens[prompt_reset_since:]
--> 279 result: DecodingResult = decode_with_fallback(mel_segment)
    280 tokens = torch.tensor(result.tokens)
    282 if no_speech_threshold is not None:
    283     # no voice activity check

File ~/Desktop/speech_app/venv/lib/python3.12/site-packages/whisper/transcribe.py:195, in transcribe.<locals>.decode_with_fallback(segment)
    192     kwargs.pop("best_of", None)
    194 options = DecodingOptions(**kwargs, temperature=t)
--> 195 decode_result = model.decode(segment, options)
    197 needs_fallback = False
    198 if (
    199     compression_ratio_threshold is not None
    200     and decode_result.compression_ratio > compression_ratio_threshold
    201 ):

File ~/Desktop/speech_app/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File ~/Desktop/speech_app/venv/lib/python3.12/site-packages/whisper/decoding.py:824, in decode(model, mel, options, **kwargs)
    821 if kwargs:
    822     options = replace(options, **kwargs)
--> 824 result = DecodingTask(model, options).run(mel)
    826 return result[0] if single else result

File ~/Desktop/speech_app/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File ~/Desktop/speech_app/venv/lib/python3.12/site-packages/whisper/decoding.py:737, in DecodingTask.run(self, mel)
    734 tokens = tokens.repeat_interleave(self.n_group, dim=0).to(audio_features.device)
    736 # call the main sampling loop
--> 737 tokens, sum_logprobs, no_speech_probs = self._main_loop(audio_features, tokens)
    739 # reshape the tensors to have (n_audio, n_group) as the first two dimensions
    740 audio_features = audio_features[:: self.n_group]

File ~/Desktop/speech_app/venv/lib/python3.12/site-packages/whisper/decoding.py:687, in DecodingTask._main_loop(self, audio_features, tokens)
    685 try:
    686     for i in range(self.sample_len):
--> 687         logits = self.inference.logits(tokens, audio_features)
    689         if (
    690             i == 0 and self.tokenizer.no_speech is not None
    691         ):  # save no_speech_probs
    692             probs_at_sot = logits[:, self.sot_index].float().softmax(dim=-1)

File ~/Desktop/speech_app/venv/lib/python3.12/site-packages/whisper/decoding.py:163, in PyTorchInference.logits(self, tokens, audio_features)
    159 if tokens.shape[-1] > self.initial_token_length:
    160     # only need to use the last token except in the first forward pass
    161     tokens = tokens[:, -1:]
--> 163 return self.model.decoder(tokens, audio_features, kv_cache=self.kv_cache)

File ~/Desktop/speech_app/venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
   1551     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1552 else:
-> 1553     return self._call_impl(*args, **kwargs)

File ~/Desktop/speech_app/venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
   1557 # If we don't have any hooks, we want to skip the rest of the logic in
   1558 # this function, and just call forward.
   1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1560         or _global_backward_pre_hooks or _global_backward_hooks
   1561         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562     return forward_call(*args, **kwargs)
   1564 try:
   1565     result = None

File ~/Desktop/speech_app/venv/lib/python3.12/site-packages/whisper/model.py:242, in TextDecoder.forward(self, x, xa, kv_cache)
    239 x = x.to(xa.dtype)
    241 for block in self.blocks:
--> 242     x = block(x, xa, mask=self.mask, kv_cache=kv_cache)
    244 x = self.ln(x)
    245 logits = (
    246     x @ torch.transpose(self.token_embedding.weight.to(x.dtype), 0, 1)
    247 ).float()

File ~/Desktop/speech_app/venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
   1551     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1552 else:
-> 1553     return self._call_impl(*args, **kwargs)

File ~/Desktop/speech_app/venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
   1557 # If we don't have any hooks, we want to skip the rest of the logic in
   1558 # this function, and just call forward.
   1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1560         or _global_backward_pre_hooks or _global_backward_hooks
   1561         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562     return forward_call(*args, **kwargs)
   1564 try:
   1565     result = None

File ~/Desktop/speech_app/venv/lib/python3.12/site-packages/whisper/model.py:169, in ResidualAttentionBlock.forward(self, x, xa, mask, kv_cache)
    167 x = x + self.attn(self.attn_ln(x), mask=mask, kv_cache=kv_cache)[0]
    168 if self.cross_attn:
--> 169     x = x + self.cross_attn(self.cross_attn_ln(x), xa, kv_cache=kv_cache)[0]
    170 x = x + self.mlp(self.mlp_ln(x))
    171 return x

File ~/Desktop/speech_app/venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
   1551     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1552 else:
-> 1553     return self._call_impl(*args, **kwargs)

File ~/Desktop/speech_app/venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1616, in Module._call_impl(self, *args, **kwargs)
   1614     hook_result = hook(self, args, kwargs, result)
   1615 else:
-> 1616     hook_result = hook(self, args, result)
   1618 if hook_result is not None:
   1619     result = hook_result

File ~/Desktop/speech_app/venv/lib/python3.12/site-packages/whisper_timestamped/transcribe.py:882, in _transcribe_timestamped_efficient.<locals>.<lambda>(layer, ins, outs, index)
    878     if i < nblocks - word_alignement_most_top_layers:
    879         continue
    880     all_hooks.append(
    881         block.cross_attn.register_forward_hook(
--> 882             lambda layer, ins, outs, index=j: hook_attention_weights(layer, ins, outs, index))
    883     )
    884     j += 1
    885 if compute_word_confidence or no_speech_threshold is not None:

File ~/Desktop/speech_app/venv/lib/python3.12/site-packages/whisper_timestamped/transcribe.py:777, in _transcribe_timestamped_efficient.<locals>.hook_attention_weights(layer, ins, outs, index)
    775 w = outs[-1]
    776 # Only the last attention weights is useful
--> 777 if w.shape[-2] > 1:
    778     w = w[:, :, -1:, :]
    779 segment_attweights[index].append(w.cpu())

AttributeError: 'NoneType' object has no attribute 'shape'
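
The last frame shows the hook reading `outs[-1]` and calling `.shape` on it without checking for None. A minimal sketch of a more defensive check follows. `last_query_weights` is a hypothetical helper, not code from whisper-timestamped, and it assumes the weights object behaves like a 4-D tensor (it exposes `.shape` and supports slicing).

```python
def last_query_weights(outs):
    """Return the attention weights for the last query position only.

    `outs` is the tuple a cross-attention forward hook receives as the
    module output; its final element is expected to hold the weights.
    """
    w = outs[-1]
    if w is None:
        # Fail with an actionable message instead of an AttributeError.
        raise RuntimeError(
            "cross-attention returned no weights (None); "
            "word alignment needs them"
        )
    # Same slicing as in the traceback: keep only the last query position.
    if w.shape[-2] > 1:
        w = w[:, :, -1:, :]
    return w
```

Raising here does not fix the alignment itself; it only turns the failure into a message that points at the missing weights.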

@jonasrenault

The issue is related to patch #2359, which uses F.scaled_dot_product_attention if available. In that case, the attention weights returned by whisper seem to be None.

A workaround is to use the disable_sdpa context manager, introduced in the same patch, when calling transcribe, though this will limit the performance improvement introduced by the latest version and turbo model of whisper:

import whisper_timestamped as whisperts
from whisper.model import disable_sdpa

audio = whisperts.load_audio("AUDIO.wav")
model = whisperts.load_model("turbo")
with disable_sdpa():
    results = whisperts.transcribe(model, audio)
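
Since `disable_sdpa` only exists in whisper releases that include that patch, code that has to run against older pinned versions as well may want a fallback. A minimal sketch, assuming a no-op context manager is acceptable when the import fails; `run_without_sdpa` is a hypothetical wrapper, not part of whisper or whisper-timestamped.

```python
import contextlib

try:
    # Present in the openai-whisper release discussed in this thread.
    from whisper.model import disable_sdpa
except ImportError:
    # Older release (or whisper not installed): there is nothing to disable,
    # so fall back to a context manager that does nothing.
    disable_sdpa = contextlib.nullcontext


def run_without_sdpa(func, *args, **kwargs):
    """Call func(*args, **kwargs) with SDPA disabled when whisper supports it."""
    with disable_sdpa():
        return func(*args, **kwargs)
```

With the names from the snippet above, the call would look like `results = run_without_sdpa(whisperts.transcribe, model, audio)`.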

@Med280
Med280 commented Oct 11, 2024

You shouldn't face this issue if you pin the version in your requirements:
openai-whisper==20231117

@Jeronymous changed the title from "The last version of whisper (v20240930) doesn't seem to be supported" to "The last version of whisper (v20240930) doesn't seem to be supported ('NoneType' object has no attribute 'shape')" on Oct 29, 2024
@Jeronymous
Member

Thanks a lot @jonasrenault, I pushed a workaround based on your suggestion so that people are no longer stuck.

(Sorry for the delay, this has been broken for a month now... Unfortunately, I have much less time to be active on the whisper-timestamped project.)
