scripts for SummaryMixing SSL #9
base: SummaryMixing_w2v2
The main branch of this repository will keep tracking the latest version of SpeechBrain available. Unfortunately the results reported in our [publication](https://arxiv.org/abs/2307.07421) and below in the Table were obtained with SpeechBrain v0.5 and may not be exactly reproduced with the current code. If you want the exact same results, please use our dedicated
[branch](https://github.com/SamsungLabs/SummaryMixing/tree/speechbrain_v0.5) that contains the code compatible with SpeechBrain v0.5!
# SummaryMixing wav2vec 2.0
We should not erase the previous Readme, it should be combined.
@@ -0,0 +1,344 @@
""" SummaryMixing © 2023 by Samsung Electronics is licensed under CC BY-NC 4.0.
Why do we need to copy the train.py? Did you change something in it?
@@ -0,0 +1,342 @@
""" SummaryMixing © 2023 by Samsung Electronics is licensed under CC BY-NC 4.0.
Same question.
@@ -0,0 +1,90 @@
""" SummaryMixing © 2023 by Samsung Electronics is licensed under CC BY-NC 4.0.
We should create a PR for wav2vec 2.0 pretraining in SpeechBrain with standard MHSA.
latents = self.modules.normalize(
    latents, wav_lens, epoch=current_epoch
).detach()
elif self.hparams.frontend_type == "mel_v2":
All these `if` branches should be removed, with only the good one staying.
    mask_prob=hparams["mask_prob"],
    mask_length=hparams["mask_length"],
)
elif hparams["frontend_type"] == "mel_cnn_base":
Same.
@@ -1,489 +0,0 @@
""" SummaryMixing © 2023 by Samsung Electronics is licensed under CC BY-NC 4.0.
This file should not be deleted!
@@ -1,359 +0,0 @@
# ############################################################################
This file should not be deleted!
@@ -1,1044 +0,0 @@
""" SummaryMixing © 2023 by Samsung Electronics is licensed under CC BY-NC 4.0.
Don't delete!
@@ -1,665 +0,0 @@
""" SummaryMixing © 2023 by Samsung Electronics is licensed under CC BY-NC 4.0.
Don't delete!
Hi @shucongzhang, a few questions about this prep script.
- For step 2, what do you mean by the VAD script? (I'm using `cut_by_vad.py`, but there is another VAD script.)
- Also, from a brief look at the lengths of the audio files, I believe you could be removing the majority of the data by limiting utterances to 20.2 seconds with the following code. Do you know how many hours are left after this?
def make_csv_for_each(subpath_1_csv_file_folder, max_length=20.2):
    # other code
    if duration_seconds > max_length:
        continue
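One way to answer the "how many hours are left" question is to total the durations before and after the cutoff. The following is only a minimal sketch under stated assumptions: `hours_retained` is a hypothetical helper that is not part of the PR, and it assumes the per-utterance durations (in seconds) have already been read into a list. It mirrors the filter above by dropping anything strictly longer than `max_length`.

```python
def hours_retained(durations_seconds, max_length=20.2):
    """Return (kept_hours, total_hours) for a list of utterance durations.

    Hypothetical helper: an utterance is dropped when its duration is
    strictly greater than max_length, as in the filter quoted above.
    """
    total_seconds = sum(durations_seconds)
    kept_seconds = sum(d for d in durations_seconds if d <= max_length)
    return kept_seconds / 3600.0, total_seconds / 3600.0


if __name__ == "__main__":
    # Made-up durations for illustration only, not measured on Libri-Light.
    example_durations = [12.0, 18.0, 55.0, 61.0]
    kept_h, total_h = hours_retained(example_durations)
    print(f"kept {kept_h:.4f} h out of {total_h:.4f} h")
```

Comparing the two numbers on a real split would show directly how much audio the 20.2-second limit discards.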
Just to give an estimate: for the large set, I think you would only have about 100 hours of audio (instead of 51k).
Hello @whettenr, thank you for your question. The VAD script I'm referring to is "cut_by_vad.py" in the "libri-light" GitHub repo. It cuts the books into utterances as close as possible to `target_len_sec`. There are some issues with our server that holds the whole of libri-light, so I have tested the scripts with the small split. What I did is:
python cut_by_vad.py --input_dir libri-light/small/ --output_dir libri-light/small_20s_vad/ --target_len_sec 20
python make_librilight_csv.py small_20s_vad small_20s_vad_csv
With this, I got 356 hours of data, with 25th/50th/75th percentile utterance lengths of 12.4s/15.8s/18.1s.
Could you also try the steps above on the small subset? Please let me know if the amount of data you get differs from the numbers above.
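For anyone comparing their numbers against the figures above, the total hours and quartiles can be computed from the utterance durations with the standard library. This is a rough sketch, not the code used to obtain those figures: `summarize_durations` is a hypothetical helper, the durations are assumed to be available as a list of seconds, and the "inclusive" quantile method is an arbitrary choice that may differ slightly from whatever tool produced the reported percentiles.

```python
import statistics


def summarize_durations(durations_seconds):
    """Return total hours and 25th/50th/75th percentile durations (seconds).

    Hypothetical helper; uses statistics.quantiles with n=4 to get quartiles.
    """
    q25, q50, q75 = statistics.quantiles(
        durations_seconds, n=4, method="inclusive"
    )
    return {
        "hours": sum(durations_seconds) / 3600.0,
        "p25": q25,
        "p50": q50,
        "p75": q75,
    }


if __name__ == "__main__":
    # Made-up durations for illustration only.
    print(summarize_durations([10.0, 12.0, 14.0, 16.0, 18.0]))
```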
Thanks for the quick response! I did not pass `--target_len_sec 20`. That could definitely make a huge difference. I think that is why, for me, the VAD was cutting files into lengths of around 50 to 60 seconds. I will try with the small split and let you know.
@shucongzhang did it and got the same 356 hours of data with 12.4s/15.8s/18.1s 25th/50th/75t
This PR provides the necessary code and recipes to reproduce the results of the SummaryMixing SSL paper.