Hi,

I tried fine-tuning with a small clean dataset of Vietnamese speech, self-collected from YouTube (about 100 hours of audio). Here are a few audio demos. However, the results did not meet my expectations.
Here’s how I prepared the data:

- Clean dataset: the Vietnamese data mentioned above. I filtered out the collected audio segments shorter than 3 seconds so that every clip covers `sub_sample_length = 3.072` (see the sketch after this list).
- Noise dataset: the DNS Interspeech 2020 noise data, downloaded from here: DNS-Challenge noise data.
- RIR dataset: downloaded from the release page here: RIR dataset.
- Test dataset: the test set from the DNS-Challenge: Test set.
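For reference, this is roughly how I drop the too-short clips before training. It is just a small sketch: the directory path and the use of `soundfile` are my own choices, not anything prescribed by the repo.

```python
# Keep only clips long enough to cover one training sub-sample (3.072 s).
# CLEAN_DIR is a placeholder path for my self-collected Vietnamese speech.
from pathlib import Path

import soundfile as sf

CLEAN_DIR = Path("data/clean_vi")   # placeholder; point this at your clean speech folder
MIN_DURATION_S = 3.072              # must cover sub_sample_length

def keep_long_enough(wav_paths):
    """Yield only the clips whose duration covers one training sub-sample."""
    for wav_path in wav_paths:
        info = sf.info(str(wav_path))
        duration_s = info.frames / info.samplerate
        if duration_s >= MIN_DURATION_S:
            yield wav_path

if __name__ == "__main__":
    kept = list(keep_long_enough(CLEAN_DIR.rglob("*.wav")))
    print(f"kept {len(kept)} clips >= {MIN_DURATION_S} s")
```

Only the surviving clips go into the clean file list I feed to the dataloader.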
I used an RTX 3080 GPU with a batch size of 12 and gradient accumulation steps set to 3. The model was initialized from the checkpoint `fullsubnet_best_model_58epochs.tar`, roughly as sketched below.
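For clarity, here is a minimal PyTorch sketch of what I mean by that setup (effective batch size 12 × 3 = 36). The tiny `Linear` model, the random batches, and the commented-out checkpoint key are placeholders and assumptions on my side, not the repo's actual trainer or model.

```python
# Minimal gradient-accumulation sketch: batch_size=12, 3 accumulation steps,
# weights initialized from a pretrained checkpoint before fine-tuning.
import torch

ACCUM_STEPS = 3     # gradient accumulation steps from my config
BATCH_SIZE = 12

model = torch.nn.Linear(257, 257)   # placeholder; in practice, the FullSubNet model from the repo
# ckpt = torch.load("fullsubnet_best_model_58epochs.tar", map_location="cpu")
# model.load_state_dict(ckpt["model"])   # the "model" key is an assumption; check the checkpoint dict

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # smaller lr than training from scratch
criterion = torch.nn.MSELoss()

# Placeholder "loader": random (noisy, clean) feature batches standing in for the real dataloader.
loader = [(torch.randn(BATCH_SIZE, 257), torch.randn(BATCH_SIZE, 257)) for _ in range(9)]

optimizer.zero_grad()
for step, (noisy, clean) in enumerate(loader):
    loss = criterion(model(noisy), clean) / ACCUM_STEPS   # scale so accumulated grads average out
    loss.backward()
    if (step + 1) % ACCUM_STEPS == 0:                     # one optimizer step per 3 micro-batches
        optimizer.step()
        optimizer.zero_grad()
```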
I trained for 15 epochs. However, the loss decreased only for the first few epochs and then started increasing. When I ran inference on a few samples, the model left more residual noise than the original pretrained checkpoint did.
Am I missing something in the fine-tuning process?
Do you have any advice for me?
Thank you!