
Issue with segments when the model doesn't output an end time for a segment #1100

Open
bchinnari opened this issue Oct 28, 2024 · 17 comments
@bchinnari

bchinnari commented Oct 28, 2024

I have fine-tuned a Hugging Face model. This model does not output an end-timestamp token for the segment.
When I run faster-whisper with this model and "word_timestamps=False", it gives the correct number of segments and the "text" is also correct.

Say the audio file is 6 s long.
But when I run it with "word_timestamps=True" (and my model does not emit an end-timestamp token) on the whole 6 s of audio, it computes the first segment correctly (say segment_end = 2.54 s). The code then takes a window from 2.54 s to 6 s again and produces some weird/unnecessary/hallucinated output.

At line 1241 in transcribe.py, the expression evaluates to True, and we assign "seek" to the end of the segment / the timestamp of the last word (2.54 s in this case). But "seek" should have been 6 s.
If the model outputs an end-timestamp token, the expression evaluates to False and seek would be 6 s.
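For context, this is roughly what I think that condition does (a paraphrased sketch with my own names, not the actual transcribe.py code):

```python
def advance_seek(seek, window_frames, single_timestamp_ending,
                 last_word_end, time_offset, frames_per_second=100):
    """Paraphrased sketch (my own shorthand, not the real transcribe.py code)."""
    if not single_timestamp_ending:
        # No ending timestamp token from the model: jump only to the end of
        # the last aligned word (2.54 s here), so the tail is decoded again.
        if last_word_end is not None and last_word_end > time_offset:
            return round(last_word_end * frames_per_second)
    # Ending timestamp present (or no usable word end): the whole decoded
    # window is considered consumed (6 s here).
    return seek + window_frames
```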

@bchinnari
Author

Is my hypothesis correct? Can somebody reply?

@bchinnari
Author

@MahmoudAshraf97, any idea? Do you think the code handles the case where the model does not output any end-timestamp token?

@MahmoudAshraf97
Collaborator

The model needs to decide where the next segment is going to start. It can do this using three methods, in this order:

  1. the model predicted a timestamp for its last token, and that timestamp is used
  2. word timestamps are enabled, and the end of the last word is used
  3. naive 30 s chunking is used

If you do not want the model to use one of these three methods, use batched inference.
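Roughly, the priority looks like this (a simplified sketch; the helper and its arguments are made up for illustration, not the actual implementation):

```python
def next_window_start(seek, window_frames, last_token_timestamp,
                      word_timestamps_enabled, last_word_end,
                      frames_per_second=100):
    """Illustrative sketch of how the start of the next segment is chosen."""
    # 1. The model predicted a timestamp for its last token
    #    (a time relative to the current window): trust it.
    if last_token_timestamp is not None:
        return seek + round(last_token_timestamp * frames_per_second)
    # 2. Word timestamps are enabled: start at the end of the last word
    #    (an absolute time in the audio).
    if word_timestamps_enabled and last_word_end is not None:
        return round(last_word_end * frames_per_second)
    # 3. Fall back to naive 30 s chunking: consume the whole window.
    return seek + window_frames
```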

@bchinnari
Author

@sanchit-gandhi, @MahmoudAshraf97 .. The code is doing something like this.

Let's say the audio is 6 s long, the model does not output any end-timestamp token, and word_timestamps=True.

First, I think it takes the 6 s of audio, extracts features, and pads them to 3000 frames (as for a 30 s file).

  1. The encoder runs on the 30 s input and outputs a 1500×768 matrix.
  2. generate() is called and returns a sequence of tokens in which the first token is 50364 (the <|0.00|> timestamp), and there is no timestamp token at the end from my fine-tuned model.
  3. A segment is created whose start is 0 and whose end is the length of the audio.
  4. Word timestamps are created using add_word_timestamps(). Here, word boundaries are created for each word and the "end" of the whole segment is also modified.
  5. Then the start of the next segment is determined from last_word_end, say 2.5 s.
  6. The whole process (all the steps above) begins again on the new audio window [2.5 s, end_of_audio].

My question: the first set of tokens was generated after the whole audio had been consumed by the encoder and decoder. Why, the second time, are we processing part of the audio (i.e. [2.5 s, end_of_audio]) again?
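In other words, schematically the sequential loop I am describing is something like this (a rough sketch where all the helpers are illustrative placeholders, not the real faster-whisper API):

```python
def sequential_transcribe(features, content_frames, decode_window,
                          build_segments, add_word_timestamps_to,
                          next_window_start):
    """Rough sketch of the sequential loop as I understand it.
    All callables are illustrative placeholders, not the real API."""
    all_segments = []
    seek = 0
    while seek < content_frames:                   # ~600 frames for 6 s of audio
        window = features[:, seek:seek + 3000]     # up to 30 s, padded to 3000 frames
        tokens = decode_window(window)             # encoder + generate()
        segments = build_segments(tokens, seek)    # start/end from timestamp tokens
        add_word_timestamps_to(segments)           # refines word and segment ends
        all_segments.extend(segments)
        # With no ending timestamp token, seek only advances to the end of the
        # last word (2.5 s here), so the tail [2.5 s, 6 s] is decoded again.
        seek = next_window_start(segments, seek)
    return all_segments
```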

@MahmoudAshraf97
Collaborator

Because that's how the sequential Whisper algorithm works: it starts the new segment at the end of the last word of the previous segment.

@bchinnari
Author

bchinnari commented Nov 6, 2024

But the first segment itself is created after consuming the whole audio. Why are we processing part of that audio again?

@bchinnari
Author

bchinnari commented Nov 6, 2024

Maybe I am being naive and did not understand the process completely. Only one segment is generated after consuming the whole audio. Why are we trying to process part of the audio again if the whole audio already gave only one segment?

@MahmoudAshraf97
Collaborator

You have a point here. This might need to be handled as an edge case: when the whole audio has been consumed, the loop should stop.
Can you upload the audio file so the problem can be reproduced? It'd be great if you have more than one example.

@bchinnari
Author

I will share some audio files. Do you also need the model to reproduce the issue?

@MahmoudAshraf97
Collaborator

Yes, all steps to reproduce

@bchinnari
Author

bchinnari commented Nov 7, 2024

Could you share your email ID? I want to share 3 audio files and the model via Google Drive.

@MahmoudAshraf97
Collaborator

[email protected]

@bchinnari
Author

I have shared them with you. Let me know if you need anything else.

@bchinnari
Author

@MahmoudAshraf97, did you receive the Google Drive link?

@MahmoudAshraf97
Collaborator

I did receive it, but I don't have the capacity to work on it yet.

@bchinnari
Author

But were you able to reproduce the error?

@MahmoudAshraf97
Collaborator

Yes, I was. For the time being there are several workarounds: you can disable word timestamps, which will output the correct transcription, and then use forced alignment; or use Whisper to generate word timestamps manually from the encoder output.
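For reference, a minimal sketch of the first workaround in faster-whisper (the model path and audio file name are placeholders):

```python
from faster_whisper import WhisperModel

# Load the fine-tuned model; the path is a placeholder.
model = WhisperModel("path/to/finetuned-model", device="cpu", compute_type="int8")

# With word_timestamps=False the segment boundaries come from the decoder's
# timestamp tokens (or 30 s chunking), not from the last word's end time.
segments, info = model.transcribe("audio_6s.wav", word_timestamps=False)

for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")

# Word-level timing can then be recovered separately with a forced aligner.
```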
