Issue with segments when the model doesn't output end time for segment. #1100
Is my hypothesis correct? Can somebody reply?
@MahmoudAshraf97, any idea? Do you think the code handles the case where the model does not output any end timestamp token?
The model needs to decide where the next segment is going to start; it can do this using three methods, in order:
If you do not want the model to use one of these three methods, use batched inference instead.
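For reference, a minimal sketch of the batched-inference route, following the faster-whisper README (assuming faster-whisper >= 1.1; the model name and audio path are placeholders):

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# Batched inference splits the audio with VAD instead of relying on the
# sequential seek logic, so segment boundaries do not depend on the model
# emitting an end timestamp token.
segments, info = batched_model.transcribe("audio.wav", batch_size=16)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```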
@sanchit-gandhi, @MahmoudAshraf97: the code is doing something like this. Say the audio is 6 s long, the model is not outputting any end timestamp token, and word_timestamps=True. First, I think it takes the 6 s of audio, extracts the features, and pads them to 3000 frames (as if it were a 30 s file).
My question: the first set of tokens is generated after the whole audio has already been consumed by the encoder and decoder. Why, on the second pass, are we processing part of the audio (i.e., [2.5 s, end of audio]) again?
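For context, a tiny illustration of that padding step (not faster-whisper's actual code; Whisper's log-mel features use a 10 ms hop, so one 30 s window is 3000 frames):

```python
import numpy as np

N_FRAMES_PER_WINDOW = 3000          # 30 s at 100 frames per second
N_MELS = 80

features_6s = np.zeros((N_MELS, 600), dtype=np.float32)  # stand-in for real 6 s features
pad = N_FRAMES_PER_WINDOW - features_6s.shape[1]
padded = np.pad(features_6s, ((0, 0), (0, pad)))
print(padded.shape)                 # (80, 3000), i.e. one full 30 s window
```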
Because that's how the sequential Whisper algorithm works: it starts the new segment at the end of the last word of the previous segment.
But the first segment itself is created after consuming the whole audio. Why are we processing part of the whole audio again?
Maybe I am being naive; I have not understood the process completely.
You have a point here. This might need to be handled as an edge case: when the whole audio has been consumed, the loop should stop.
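Roughly, the guard being suggested would look something like this (a hedged sketch, not the actual faster-whisper code; all names and numbers are illustrative):

```python
N_FRAMES_PER_WINDOW = 3000      # one 30 s window of log-mel frames
content_frames = 600            # a 6 s clip at 100 frames per second

seek = 0
while seek < content_frames:
    # ... decode one window starting at `seek` (omitted) ...
    model_emitted_end_timestamp = False   # the failure mode in this issue
    last_word_end_frame = 254             # last word ends at 2.54 s

    if model_emitted_end_timestamp:
        seek += N_FRAMES_PER_WINDOW       # simplified; real code uses the timestamp
    elif seek + N_FRAMES_PER_WINDOW >= content_frames:
        # Proposed edge case: the window already covered the rest of the
        # audio, so jump to the end instead of seeking back to the last
        # word; otherwise the tail is re-decoded and may hallucinate.
        seek = content_frames
    else:
        seek += last_word_end_frame
```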
I will share some audio files. Do you also need the model to reproduce the issue?
Yes, all steps to reproduce.
Could you share your email ID? I want to share three audio files and the model via Google Drive.
I have shared them with you. Let me know if you need anything else.
@MahmoudAshraf97, did you receive the Google Drive link?
I did receive it, but I don't have the capacity to work on it yet.
But were you able to reproduce the error?
Yes, I was. For the time being there are several workarounds: you can disable word timestamps, which will produce the correct transcription, and then use forced alignment to recover word timings; or you can use Whisper to generate word timestamps manually from the encoder output.
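A minimal sketch of the first workaround (assuming the usual faster-whisper API; the model and audio paths are placeholders), with word timings left to a separate forced-alignment step:

```python
from faster_whisper import WhisperModel

model = WhisperModel("path/to/fine-tuned-ct2-model", device="cuda", compute_type="float16")

# With word_timestamps=False the sequential loop never seeks back to the
# last word, so the transcription comes out correct for this model.
segments, info = model.transcribe("audio.wav", word_timestamps=False)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
    # Word-level timings would then come from a separate forced aligner
    # (e.g. a CTC-based aligner) run on segment.text against the audio.
```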
I have fine-tuned a Hugging Face model. This model does not output an end timestamp token for the segment.
When I run faster-whisper with this model and word_timestamps=False, it gives the correct number of segments and the text is also correct.
Say the audio file is 6 s long.
But when I run it with word_timestamps=True (and my model does not emit an end timestamp token) on the whole 6 s audio, it computes the first segment correctly (say segment_end=2.54s). The code then takes another window from 2.54s to 6s and produces weird/unnecessary/hallucinated output.
Exactly at line 1241 in transcribe.py, the expression evaluates to True, and we assign seek to the end of the segment, i.e. the timestamp of the last word (2.54s in this case). But seek should have been 6.
If the model outputs an end timestamp, the expression evaluates to False and seek becomes 6.
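To make the reported behaviour concrete, here is a hedged paraphrase of that seek update (made-up names such as no_end_timestamp and next_seek; not a copy of transcribe.py):

```python
def next_seek(seek, segment_size, last_word_end, no_end_timestamp):
    """Return the frame offset of the next decoding window.

    Frames are 10 ms each, so 2.54 s -> 254 frames and 6 s -> 600 frames.
    """
    if no_end_timestamp and last_word_end is not None:
        # The branch reported here: seek jumps back to the last word's end
        # (2.54 s), so the 2.54 s - 6 s tail is decoded again and can
        # hallucinate.
        return seek + int(round(last_word_end * 100))
    # When the model does emit an end timestamp token, the whole window is
    # consumed and seek lands at 6 s, as expected.
    return seek + segment_size


print(next_seek(0, 600, 2.54, True))   # 254 -> tail gets re-decoded
print(next_seek(0, 600, None, False))  # 600 -> correct behaviour
```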