
Issue with segments when the model doesn't output an end time for a segment #1100

Open
bchinnari opened this issue Oct 28, 2024 · 17 comments
@bchinnari

bchinnari commented Oct 28, 2024

I have fine-tuned a Hugging Face model. This model does not output an end-timestamp token for the segment.
When I run faster-whisper with this model and "word_timestamps=False", it gives the correct number of segments and the "text" is also correct.

Say the audio file is 6 s long.
But when I run it with "word_timestamps=True" (and my model does not emit an end-timestamp token) on the whole 6 s of audio, it computes the first segment correctly (say segment_end = 2.54 s). The code then takes a window from 2.54 s to 6 s again and produces some weird/unnecessary/hallucinated output.

At line 1241 in transcribe.py, the expression evaluates to True, and we assign "seek" to the end of the segment / the timestamp of the last word (2.54 s in this case). But "seek" should have been 6 s.
If the model outputs an end-timestamp token, the expression evaluates to False and seek would be 6 s.
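For context, this is roughly what I think that condition does (a paraphrased sketch with my own names, not the actual transcribe.py code):

```python
def advance_seek(seek, window_frames, single_timestamp_ending,
                 last_word_end, time_offset, frames_per_second=100):
    """Paraphrased sketch (my own shorthand, not the real transcribe.py code)."""
    if not single_timestamp_ending:
        # No ending timestamp token from the model: jump only to the end of
        # the last aligned word (2.54 s here), so the tail is decoded again.
        if last_word_end is not None and last_word_end > time_offset:
            return round(last_word_end * frames_per_second)
    # Ending timestamp present (or no usable word end): the whole decoded
    # window is considered consumed (6 s here).
    return seek + window_frames
```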

@bchinnari
Author

Is my hypothesis correct? Can somebody reply?

@bchinnari
Author

@MahmoudAshraf97, any idea? Do you think the code handles the case where the model does not output any end-timestamp token?

@MahmoudAshraf97
Collaborator

The model needs to decide where the next segment is going to start. It can do this using three methods, in this order:

  1. the model predicted a timestamp for its last token, and that timestamp is used
  2. word timestamps are enabled, and the end of the last word is used
  3. naive 30 s chunking is used

If you do not want the model to use one of these three methods, use batched inference.
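Roughly, the priority looks like this (a simplified sketch; the helper and its arguments are made up for illustration, not the actual implementation):

```python
def next_window_start(seek, window_frames, last_token_timestamp,
                      word_timestamps_enabled, last_word_end,
                      frames_per_second=100):
    """Illustrative sketch of how the start of the next segment is chosen."""
    # 1. The model predicted a timestamp for its last token
    #    (a time relative to the current window): trust it.
    if last_token_timestamp is not None:
        return seek + round(last_token_timestamp * frames_per_second)
    # 2. Word timestamps are enabled: start at the end of the last word
    #    (an absolute time in the audio).
    if word_timestamps_enabled and last_word_end is not None:
        return round(last_word_end * frames_per_second)
    # 3. Fall back to naive 30 s chunking: consume the whole window.
    return seek + window_frames
```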

@bchinnari
Author

@sanchit-gandhi, @MahmoudAshraf97 .. The code is doing something like this.

Let's say the audio is 6 s long, the model does not output any end-timestamp token, and word_timestamps=True.

First, I think it takes the 6 s of audio, extracts features, and pads them to 3000 frames (as for a 30 s file).

  1. The encoder runs on the 30 s input and outputs a 1500×768 matrix.
  2. generate() is called and returns a sequence of tokens in which the first token is 50364 (the <|0.00|> timestamp), and there is no timestamp token at the end from my fine-tuned model.
  3. A segment is created whose start is 0 and whose end is the length of the audio.
  4. Word timestamps are created using add_word_timestamps(). Here, word boundaries are created for each word and the "end" of the whole segment is also modified.
  5. Then the start of the next segment is determined from last_word_end, say 2.5 s.
  6. The whole process (all the steps above) begins again on the new audio window [2.5 s, end_of_audio].

My question: the first set of tokens was generated after the whole audio had been consumed by the encoder and decoder. Why, the second time, are we processing part of the audio (i.e. [2.5 s, end_of_audio]) again?
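In other words, schematically the sequential loop I am describing is something like this (a rough sketch where all the helpers are illustrative placeholders, not the real faster-whisper API):

```python
def sequential_transcribe(features, content_frames, decode_window,
                          build_segments, add_word_timestamps_to,
                          next_window_start):
    """Rough sketch of the sequential loop as I understand it.
    All callables are illustrative placeholders, not the real API."""
    all_segments = []
    seek = 0
    while seek < content_frames:                   # ~600 frames for 6 s of audio
        window = features[:, seek:seek + 3000]     # up to 30 s, padded to 3000 frames
        tokens = decode_window(window)             # encoder + generate()
        segments = build_segments(tokens, seek)    # start/end from timestamp tokens
        add_word_timestamps_to(segments)           # refines word and segment ends
        all_segments.extend(segments)
        # With no ending timestamp token, seek only advances to the end of the
        # last word (2.5 s here), so the tail [2.5 s, 6 s] is decoded again.
        seek = next_window_start(segments, seek)
    return all_segments
```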

@MahmoudAshraf97
Collaborator

Because that's how the sequential Whisper algorithm works: it starts the new segment at the end of the last word of the previous segment.

@bchinnari
Author

bchinnari commented Nov 6, 2024

But the first segment itself is created after consuming the whole audio. Why are we processing part of that audio again?

@bchinnari
Author

bchinnari commented Nov 6, 2024

Maybe I am being naive and did not understand the process completely. Only one segment is generated after consuming the whole audio. Why are we trying to process part of the audio again if the whole audio already gave only one segment?

@MahmoudAshraf97
Collaborator

You have a point here. This might need to be handled as an edge case: when the whole audio has been consumed, the loop should stop.
Can you upload the audio file so the problem can be reproduced? It'd be great if you have more than one example.

@bchinnari
Author

I will share some audio files. Do you also need the model to reproduce the issue?

@MahmoudAshraf97
Collaborator

Yes, all steps to reproduce

@bchinnari
Author

bchinnari commented Nov 7, 2024

Could you share your email ID? I want to share 3 audio files and the model via Google Drive.

@MahmoudAshraf97
Collaborator

[email protected]

@bchinnari
Author

I have shared them with you. Let me know if you need anything else.

@bchinnari
Author

@MahmoudAshraf97, did you receive the Google Drive link?

@MahmoudAshraf97
Collaborator

I did receive it, but I don't have the capacity to work on it yet.

@bchinnari
Author

But were you able to reproduce the error?

@MahmoudAshraf97
Collaborator

Yes, I was. For the time being there are several workarounds: you can disable word timestamps, which will output the correct transcription, and then use forced alignment; or use Whisper to generate word timestamps manually from the encoder output.
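For reference, a minimal sketch of the first workaround in faster-whisper (the model path and audio file name are placeholders):

```python
from faster_whisper import WhisperModel

# Load the fine-tuned model; the path is a placeholder.
model = WhisperModel("path/to/finetuned-model", device="cpu", compute_type="int8")

# With word_timestamps=False the segment boundaries come from the decoder's
# timestamp tokens (or 30 s chunking), not from the last word's end time.
segments, info = model.transcribe("audio_6s.wav", word_timestamps=False)

for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")

# Word-level timing can then be recovered separately with a forced aligner.
```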
