Streaming Inference of the FS-EEND system #16
Hi, I think there may be a couple of areas that need some clarification.
Actually, the network architecture of FS-EEND is causal/online, with a masked self-attention module along the time dimension. This means the output for each frame depends only on its preceding context (setting aside the few look-ahead frames, for simplicity of description). Therefore, performing inference directly yields online diarization results thanks to the masking in the time dimension, and this is essentially equivalent to iteratively performing inference over time steps.
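To illustrate, here is a minimal sketch (not the actual FS-EEND code) of how a causal mask restricts each frame's self-attention to its previous context:

```python
import torch

T = 6  # number of frames
# Boolean upper-triangular mask: True entries are blocked, so frame t
# cannot attend to any frame later than t.
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

attn = torch.nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
x = torch.randn(1, T, 256)
# With this mask, the output at frame t depends only on frames 0..t.
out, _ = attn(x, x, x, attn_mask=causal_mask)
```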
We follow the data-preparation recipe in EEND, which generates Kaldi-format data including wav.scp/utt2spk/spk2utt/segments/rttm. The wav.scp file records the waveform paths, and the other files are used to generate ground-truth labels. If you want to use another dataset for fine-tuning or inference, you should prepare these Kaldi-format files, which can be done by following Kaldi's recipes. If you only want to input the waveform for inference without reference labels, please modify KaldiDiarizationDataset and test_step. By the way, direct inference without fine-tuning usually yields poorer results due to domain differences.
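For reference, these Kaldi-format files look roughly like this (recording/speaker IDs and paths are illustrative; spk2utt is simply the inverse mapping of utt2spk):

```
# wav.scp: <recording-id> <path-to-wav>
rec1 /path/to/rec1.wav

# segments: <utterance-id> <recording-id> <start-sec> <end-sec>
rec1_spk1_0001 rec1 0.00 3.25

# utt2spk: <utterance-id> <speaker-id>
rec1_spk1_0001 spk1

# rttm: SPEAKER <recording-id> 1 <onset> <duration> <NA> <NA> <speaker-id> <NA> <NA>
SPEAKER rec1 1 0.00 3.25 <NA> <NA> spk1 <NA> <NA>
```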
Yes sir, thank you for your previous reply. I understand the inference process. I have performed the evaluation on the AMI corpus test set with the pretrained model given by your team in the readme. I would like to know whether streaming inference can be done. For example, if I run a particular code block, can diarization be performed on the fly on speech coming in through the microphone (real-time streaming)? I don't find scripts relating to online-diarization inference. Even though the system depends only on causal frames, is it possible to perform diarization in a streaming or real-time setting (low latency) with audio/speech coming from a microphone, just like the DIART system, as depicted in the pic?
Thank you for your comments, I understand now. FS-EEND can perform online/streaming inference by changing the masked parallel form into an iterative inference paradigm. The two paradigms are equivalent in terms of output results. We have modified the TransformerEncoder with masked self-attention for streaming inference and validated the equivalence between the two paradigms. The code updates can be found in nnet/modules/streaming_tfm.py. However, streaming inference of the entire system, like DIART, still requires additional engineering work, such as the decoder and Conv1D parts. This will take some extra time, and we will update the code later. It is important to emphasize that these are engineering implementation differences and do not affect the conclusion in the paper that FS-EEND can perform streaming inference.
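To make the equivalence concrete, here is a minimal sketch (independent of the actual streaming_tfm.py implementation) comparing a masked parallel pass with per-frame iterative inference on a single self-attention layer:

```python
import torch

torch.manual_seed(0)
T, D = 8, 256
attn = torch.nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
attn.eval()
x = torch.randn(1, T, D)

# Parallel paradigm: one pass over all frames with a causal mask.
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
with torch.no_grad():
    parallel_out, _ = attn(x, x, x, attn_mask=mask)

# Iterative paradigm: at step t, the query is frame t and the
# keys/values are the cached frames 0..t.
outs = []
with torch.no_grad():
    for t in range(T):
        q = x[:, t:t + 1]      # current frame
        kv = x[:, :t + 1]      # history including the current frame
        o, _ = attn(q, kv, kv)
        outs.append(o)
iterative_out = torch.cat(outs, dim=1)

print(torch.allclose(parallel_out, iterative_out, atol=1e-6))  # True
```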
Thank you for your kind reply, sir. I understand the differences. Thank you for providing your answer and adding a new feature to your project. I was trying to implement something similar, not exact streaming inference, but buffer-wise diarization. Here is the code; please help me if I am doing anything wrong so far.
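For illustration, this is a minimal sketch of the buffer-wise idea (`model`, `frontend`, the 0.5 threshold, and the sounddevice-based microphone capture are all placeholder assumptions, not the actual FS-EEND API):

```python
import numpy as np
import sounddevice as sd  # assumed library for microphone capture
import torch

# Hypothetical handles; the real FS-EEND entry points differ.
model = ...       # loaded, eval-mode model: (1, T, F) features -> (1, T, S) activities
frontend = ...    # callable turning raw samples into (T, F) features

SR = 16000
BUFFER_SEC = 2.0          # process audio in 2-second buffers
history = []              # feature history kept as causal context

def on_buffer(indata, frames, time_info, status):
    """Called by sounddevice for each incoming audio buffer."""
    samples = indata[:, 0].astype(np.float32)
    feats = frontend(samples)                                 # (T_new, F)
    history.append(feats)
    # Re-run over history + new buffer; with causal attention, outputs
    # for earlier frames are unchanged, so only the tail is new.
    full = torch.from_numpy(np.concatenate(history))[None]    # (1, T_total, F)
    with torch.no_grad():
        activity = model(full)                                # (1, T_total, S)
    new = activity[0, -feats.shape[0]:]                       # this buffer's frames
    print((new > 0.5).int())                                  # rough speaker decisions

with sd.InputStream(samplerate=SR, channels=1,
                    blocksize=int(SR * BUFFER_SEC), callback=on_buffer):
    sd.sleep(30 * 1000)                                       # run for 30 seconds
```

Note that re-running over the entire history grows in cost over time; an iterative encoder with cached states, as in streaming_tfm.py, would avoid this.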
Thank you for your interest and suggestions. If you have any questions, please feel free to raise them at any time.
Can't we just use a post-clustering process?
Online speaker diarization can be achieved by extracting global speaker embeddings at the segment level and then performing online clustering. This requires pretraining a speaker verification network to extract global speaker embeddings. However, the speaker embeddings learned in the EEND framework are local, meaning the embedding of the same speaker may vary across different utterances, as shown in Fig. 4 of EEND-EDA. Therefore, online diarization by clustering segment-level embeddings without attending to previous context is not straightforward within the EEND framework.
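For context, here is a minimal sketch of the segment-level online-clustering approach being contrasted here (the cosine threshold and the source of the embeddings are assumptions):

```python
import numpy as np

def online_cluster(embeddings, threshold=0.6):
    """Greedy online clustering: assign each incoming segment embedding to the
    closest existing centroid (by cosine similarity), or start a new speaker."""
    centroids, labels = [], []
    for e in embeddings:                      # embeddings arrive one segment at a time
        e = e / np.linalg.norm(e)
        if centroids:
            sims = [float(c @ e) for c in centroids]
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                labels.append(best)
                # Running update of the matched centroid.
                centroids[best] = centroids[best] + e
                centroids[best] /= np.linalg.norm(centroids[best])
                continue
        centroids.append(e)                   # no match: new speaker
        labels.append(len(centroids) - 1)
    return labels

# This works only if embeddings of the same speaker stay globally consistent,
# which is exactly what EEND's local embeddings do not guarantee.
```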