scripts for SummaryMixing SSL #9

Open
wants to merge 1 commit into
base: SummaryMixing_w2v2

shucongzhang

This PR provides the code and recipes necessary to reproduce the results of the SummaryMixing SSL paper.

@shucongzhang shucongzhang requested a review from TParcollet June 20, 2024 17:33

The main branch of this repository will keep tracking the latest version of SpeechBrain available. Unfortunately the results reported in our [publication](https://arxiv.org/abs/2307.07421) and below in the table were obtained with SpeechBrain v0.5 and may not be exactly reproduced with the current code. If you want the exact same results, please use our dedicated
[branch](https://github.com/SamsungLabs/SummaryMixing/tree/speechbrain_v0.5) that contains the code compatible with SpeechBrain v0.5!
# SummaryMixing wav2vec 2.0
Contributor

We should not erase the previous README; the two should be combined.

@@ -0,0 +1,344 @@
""" SummaryMixing © 2023 by Samsung Electronics is licensed under CC BY-NC 4.0.
Contributor

Why do we need to copy the train.py? Did you change something in it?

@@ -0,0 +1,342 @@
""" SummaryMixing © 2023 by Samsung Electronics is licensed under CC BY-NC 4.0.
Contributor

Same question.

@@ -0,0 +1,90 @@
""" SummaryMixing © 2023 by Samsung Electronics is licensed under CC BY-NC 4.0.
Contributor

We should create a PR for wav2vec 2.0 pretraining on SpeechBrain with standard MHSA.

latents = self.modules.normalize(
latents, wav_lens, epoch=current_epoch
).detach()
elif self.hparams.frontend_type == "mel_v2":
Contributor

All these `if` branches should be removed, keeping only the good one.

mask_prob=hparams["mask_prob"],
mask_length=hparams["mask_length"],
)
elif hparams["frontend_type"] == "mel_cnn_base":
Contributor

Same.

@@ -1,489 +0,0 @@
""" SummaryMixing © 2023 by Samsung Electronics is licensed under CC BY-NC 4.0.
Contributor

This file should not be deleted!!!!

@@ -1,359 +0,0 @@
# ############################################################################
Contributor

This file should not be deleted!

@@ -1,1044 +0,0 @@
""" SummaryMixing © 2023 by Samsung Electronics is licensed under CC BY-NC 4.0.
Contributor

Don't delete!

@@ -1,665 +0,0 @@
""" SummaryMixing © 2023 by Samsung Electronics is licensed under CC BY-NC 4.0.
Contributor

Don't delete!

@TParcollet TParcollet mentioned this pull request Jun 28, 2024
3 tasks

@whettenr whettenr Sep 11, 2024

Hi @shucongzhang, a few questions about this prep script.

  • For step 2, what do you mean by "the VAD script"? (I'm using cut_by_vad.py, but there is another VAD script.)
  • Also, from a brief look at the lengths of the audio files, I believe you could be removing the majority of the data by limiting utterances to 20.2 seconds with the following code. Do you know how many hours are left after this?
def make_csv_for_each(subpath_1_csv_file_folder, max_length=20.2):
    # other code
    if duration_seconds > max_length:
        continue

To give a rough estimate: for the large set, I think you would be left with only about 100 hours of audio (instead of 51k).
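A quick way to check how much audio survives the length filter is sketched below. This is only an illustration, not the PR's code: `hours_kept` is a hypothetical helper, the durations are made-up values, and the only thing taken from the snippet above is the rule that utterances longer than `max_length` seconds are skipped.

```python
def hours_kept(durations_seconds, max_length=20.2):
    """Return (kept_hours, total_hours) for a list of utterance durations.

    Mirrors the filter quoted above: an utterance is dropped when its
    duration is strictly greater than max_length seconds.
    """
    total = sum(durations_seconds)
    kept = sum(d for d in durations_seconds if d <= max_length)
    return kept / 3600.0, total / 3600.0


# Made-up durations in seconds, for illustration only.
durations = [12.0, 18.0, 55.0, 61.0, 20.2, 48.0]
kept_h, total_h = hours_kept(durations)
print(f"kept {kept_h:.4f} h of {total_h:.4f} h ({100 * kept_h / total_h:.1f}%)")
```

Running this on the real per-file durations, before and after VAD segmentation, would show directly whether the 20.2-second cap is discarding most of the data.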

Author

Hello @whettenr, thank you for your question. The VAD script I'm referring to is "cut_by_vad.py" in the "libri-light" GitHub repo. It cuts the books into utterances as close as possible to target_len_sec. There are some issues with the server that holds the whole of libri-light, so I have tested the scripts with the small split. What I did is:

  1. python cut_by_vad.py --input_dir libri-light/small/ --output_dir libri-light/small_20s_vad/ --target_len_sec 20
  2. python make_librilight_csv.py small_20s_vad small_20s_vad_csv

With this, I got 356 hours of data, with 25th/50th/75th percentile utterance lengths of 12.4s/15.8s/18.1s.
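For anyone comparing numbers, the total hours and the quartiles of utterance length can be computed along the following lines. This is a minimal sketch under stated assumptions: `summarize` is a hypothetical helper, the durations are placeholder values, and reading the durations from the CSV written by make_librilight_csv.py is left out because its column layout is not shown here. The quartile method (`inclusive`) is also an assumption, so small differences from the figures above are possible.

```python
import statistics


def summarize(durations_seconds):
    """Return total hours and the (25th, 50th, 75th) percentile durations."""
    total_hours = sum(durations_seconds) / 3600.0
    # n=4 yields the three quartile cut points.
    q1, median, q3 = statistics.quantiles(
        durations_seconds, n=4, method="inclusive"
    )
    return total_hours, (q1, median, q3)


# Placeholder durations in seconds, for illustration only.
durations = [10.0, 12.0, 14.0, 16.0, 18.0]
total_hours, quartiles = summarize(durations)
print(f"{total_hours:.4f} h, quartiles (s): {quartiles}")
```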

Could you also try the steps above on the small subset? Please let me know if the amount of data you get differs from the numbers above.

Thanks for the quick response! I did not pass --target_len_sec 20, and that could definitely make a huge difference. I think that is why the VAD was cutting my files into lengths of around 50 to 60 seconds. I will try with the small split and let you know.

@shucongzhang I did it and got the same 356 hours of data with 12.4s/15.8s/18.1s 25th/50th/75th percentile utterance lengths.
