This repository contains resources (dataset and notebooks) for reproducing the experiments in the paper *BembaSpeech: A Speech Recognition Corpus for the Bemba Language*.
Please consider citing as follows if you use part of the code or data in your work or project:
```bibtex
@InProceedings{sikasote-anastasopoulos:2022:LREC,
  author    = {Sikasote, Claytone and Anastasopoulos, Antonios},
  title     = {BembaSpeech: A Speech Recognition Corpus for the Bemba Language},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month     = {June},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {7277--7283},
  abstract  = {We present a preprocessed, ready-to-use automatic speech recognition corpus, BembaSpeech, consisting over 24 hours of read speech in the Bemba language, a written but low-resourced language spoken by over 30\% of the population in Zambia. To assess its usefulness for training and testing ASR systems for Bemba, we explored different approaches; supervised pre-training (training from scratch), cross-lingual transfer learning from a monolingual English pre-trained model using DeepSpeech on the portion of the dataset and fine-tuning large scale self-supervised Wav2Vec2.0 based multilingual pre-trained models on the complete BembaSpeech corpus. From our experiments, the 1 billion XLS-R parameter model gives the best results. The model achieves a word error rate (WER) of 32.91\%, results demonstrating that model capacity significantly improves performance and that multilingual pre-trained models transfers cross-lingual acoustic representation better than monolingual pre-trained English model on the BembaSpeech for the Bemba ASR. Lastly, results also show that the corpus can be used for building ASR systems for Bemba language.},
  url       = {https://aclanthology.org/2022.lrec-1.790}
}
```
In this project we used the DeepSpeech v0.8.2 release for our experiments. We refer the reader to Mozilla DeepSpeech for the latest updates.
The data used in this project is a 17-hour portion of the BembaSpeech corpus, consisting of audio files no longer than 10 seconds each, per the DeepSpeech input pipeline requirement (a filtering sketch follows the table below).
ID | Dataset | CSV file | No. of Utterances | Duration | Description |
---|---|---|---|---|---|
1 | training | train.csv | 10,200 | 14 hrs 20 min | Used for training |
2 | development | dev.csv | 1,437 | 2 hrs | Used for validation |
3 | testing | test.csv | 756 | 1 hr 18 min | Used for testing |
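Because DeepSpeech's input pipeline rejects long clips, only utterances of at most 10 seconds were kept. Below is a minimal sketch, not from the repository, of how such filtering could be done, assuming the WAV files sit in a hypothetical `audio/` directory:

```python
# Minimal sketch: keep only clips whose duration is at most 10 seconds.
# Assumes a hypothetical `audio/` directory of WAV files.
import os
import soundfile as sf

MAX_DURATION = 10.0  # seconds, per the DeepSpeech input constraint

kept, dropped = [], []
for name in os.listdir("audio"):
    if not name.endswith(".wav"):
        continue
    path = os.path.join("audio", name)
    info = sf.info(path)                      # reads only the file header
    duration = info.frames / info.samplerate  # length in seconds
    (kept if duration <= MAX_DURATION else dropped).append(path)

print(f"kept {len(kept)} clips, dropped {len(dropped)} clips")
```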
To create the language models for our experiments, we used two sets of Bemba text: the transcripts from the train and dev sets, denoted LM1, and a combination of those transcripts and the JW300 corpus, denoted LM2.
You can run and follow the lm.ipynb notebook, which walks through creating the different N-gram language models with the KenLM tool; a minimal sketch of the core steps follows.
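For reference, here is a minimal sketch of the KenLM steps, assuming KenLM's `lmplz` and `build_binary` binaries are on the PATH and that `bemba_text.txt` is a hypothetical file holding the LM training text (e.g. the LM1 transcripts):

```python
# Minimal sketch: build a 5-gram KenLM language model from raw text.
import subprocess

# 1. Estimate a 5-gram LM in ARPA format from the text file.
with open("bemba_text.txt") as text, open("lm.arpa", "w") as arpa:
    subprocess.run(["lmplz", "-o", "5"], stdin=text, stdout=arpa, check=True)

# 2. Convert the ARPA file to KenLM's compact binary format.
subprocess.run(["build_binary", "lm.arpa", "lm.binary"], check=True)
```

DeepSpeech v0.8.2 then packages the binary LM and its vocabulary into a `.scorer` file with its `generate_scorer_package` tool; see lm.ipynb for the exact packaging step.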
In the notebooks folder, you will find the notebooks used to train the DeepSpeech Bemba ASR models:
- lm.ipynb - used to create the N-gram language models
- baseline.ipynb - used to train the baseline model for our experiments
- ft_model.ipynb - used to fine-tune the DeepSpeech English pre-trained model without a language model
- ftune_5glm_trans.ipynb - used to fine-tune the DeepSpeech English pre-trained model with the 5-gram LM scorer built from the LM1 Bemba text (a sketch of such a run follows this list)
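For orientation, here is a minimal sketch of what a DeepSpeech v0.8.2 fine-tuning run along these lines looks like. The paths and hyperparameters are purely illustrative; consult the notebooks for the exact commands and settings used:

```python
# Minimal sketch: fine-tune DeepSpeech v0.8.2 from an English checkpoint.
# Assumes the DeepSpeech repo is checked out, an English checkpoint is
# unpacked in `ckpt_en/`, and `alphabet.txt` covers the Bemba character set.
import subprocess

subprocess.run([
    "python", "DeepSpeech.py",
    "--train_files", "train.csv",
    "--dev_files", "dev.csv",
    "--test_files", "test.csv",
    "--alphabet_config_path", "alphabet.txt",
    "--load_checkpoint_dir", "ckpt_en",      # start from the English model
    "--save_checkpoint_dir", "ckpt_bem",     # save fine-tuned checkpoints
    "--drop_source_layers", "1",             # re-initialise the output layer
    "--scorer_path", "bemba_5gram.scorer",   # external LM scorer (optional)
    "--epochs", "30",                        # illustrative values only
    "--learning_rate", "0.0001",
], check=True)
```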
You can download the models (both the acoustic model and the scorer) that achieved our best DeepSpeech result, a WER of 54.78%.
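A minimal sketch of running inference with the downloaded models, assuming the `deepspeech` pip package (v0.8.2) and hypothetical file names for the acoustic model and scorer:

```python
# Minimal sketch: transcribe one WAV file with the released models.
# `bemba.pbmm` and `bemba.scorer` are placeholder names for the downloads.
import wave
import numpy as np
import deepspeech

model = deepspeech.Model("bemba.pbmm")
model.enableExternalScorer("bemba.scorer")

with wave.open("utterance.wav", "rb") as wav:   # expects 16 kHz, 16-bit mono
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(model.stt(audio))  # prints the Bemba transcription
```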
The code used to fine-tune the XLS-R models on BembaSpeech can be found HERE.
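For completeness, a minimal sketch, not the repository's code, of transcribing audio with a fine-tuned XLS-R checkpoint through Hugging Face `transformers`; the model identifier below is a placeholder:

```python
# Minimal sketch: greedy CTC decoding with a fine-tuned Wav2Vec2/XLS-R model.
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

MODEL_ID = "path/to/finetuned-xlsr-bemba"  # placeholder identifier
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

speech, _ = librosa.load("utterance.wav", sr=16000)  # model expects 16 kHz
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])  # prints the transcription
```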