Project: Aphasia
Author: Koray Poyraz
Description
Patients who suffer from Aphasia have difficulty comprehending and/or formulating language. The cause is usually brain damage in the language center. Recovery from Aphasia is rarely complete, and rehabilitation can take years, but it does help patients. Even so, Aphasia is usually very stressful for patients, including during rehabilitation sessions. Specialists at the Rijndam rehabilitation institute in Rotterdam treat patients with Aphasia. Their impression is that the stress patients experience may be amplified by human-to-human interaction, in which patients feel the 'embarrassment' of not being able to communicate correctly. The rehabilitation stress could possibly be reduced by having patients do exercises on a computer rather than talk to a person. The first goal of this project is to see whether we can accurately transcribe what Aphasia patients say to text and identify where they are likely to make mistakes in their language.
- Communication
- Domain Knowledge
- Data Collection
- Data Preparation
- Predictive Models
- Data Visualization
- Oversampling
- Model Selection
- Evaluation
- Diagnostics
- Extra
- Paper
For each subject, the sections below describe the studies performed, the techniques used, the references to literature and the results.
- API = "Application Programming Interface": a program that communicates with another program according to agreed protocols. E.g. a program I developed communicates with Google Speech to Text.
- Notebook = a web application that allows you to create and share documents that contain live code, equations, visualisations and narrative text.
- MFCCs = Mel Frequency Cepstral Coefficients: audio features that are widely used in automatic speech and speaker recognition.
- Scraper = a program for web scraping, i.e. extracting data from websites.
- Library = a collection of reusable functions that you can use to develop a program or script.
- SPHINX = a ready-made Speech to Text tool / engine with which you can develop your own Speech to Text.
- Phoneme boundary generator = a generator that generates phoneme boundaries.
- STT = Speech to Text (Google Service)
- Literature
- Provides a picture of the techniques that are applied
- A Review on Speech Recognition Technique
- Provides the steps taken to extract the features from the audio signals
- Speech Processing for Machine Learning MFCCs
- A GitHub repository with useful information about existing repositories for Speech Recognition systems
- awesome-python-scientific-audio
- DTW (Dynamic Time Warping) is used to compare word signals. In time series, dynamic time warping (DTW) is one of the algorithms for measuring the similarity between two temporal sequences, which can vary in speed.
- Understanding Dynamic Time Warping
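To make the DTW idea above concrete, here is a minimal sketch of the classic dynamic-programming formulation on two short one-dimensional sequences. It is an illustration only, not the code used in this project: the function name `dtw_distance` and the absolute-difference cost are assumptions, and real word signals would be compared on feature vectors such as MFCCs rather than raw numbers.

```python
def dtw_distance(a, b):
    """Return the DTW distance between two 1-D sequences of numbers."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])  # local distance (assumed: absolute difference)
            # extend the cheapest of the three allowed predecessor alignments
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# The same shape spoken "slower" still aligns perfectly.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because one element of a sequence may be matched to several elements of the other, the stretched sequence has distance 0 to the original, which is what makes DTW tolerant to differences in speaking speed.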
"Data Collection" is an important input for the "Data Preparation" stage. Below are the tasks that were performed to collect the data for the project and store it in a structured way. Each topic can include desk research and notebooks that perform one or more tasks relevant to the project. Data has been collected from different sources. The sources are "VoxForge", "Uva" and "CORPUS".
fon.hum.uva is a website that offers a free database of spoken audio files with accompanying texts, link to the website. To retrieve the data from Uva I wrote a scraper. The scraper was needed because the database cannot be downloaded as a whole: each file has to be downloaded from the website with a separate click, which takes a lot of time.
- Notebook
VoxForge is a website that offers spoken audio files with accompanying texts for free, link to the website. For VoxForge I also wrote a scraper, because, as with Uva, downloading file by file takes a lot of time.
- Notebook
CORPUS is a large dataset consisting of Dutch spoken audio with the related words. This data is available free of charge at this link. The data was downloaded and stored on the server by our project manager. I wrote a transformer to put the data into the structure we need; it is described under "Data Preparation". This data collection is important for the Phoneme Boundary Classifier.
Below are topics of tasks performed to prepare data for the project. Each topic can include desk research and notebooks to perform a specific task or tasks that are relevant to the project.
I developed this API to automate the conversion of audio files to text, which would otherwise have to be done manually and takes a lot of time. In addition, the API can return the timestamps of each word in an audio signal. This was important for creating a dataset for future use, e.g. for a neural network.
To realize this I created a project on GitHub called "Aphasia project". I also prepared an installation guide for my project colleagues so that they can use the API.
- Aphasia-project Github
To get an overview of the existing Speech to Text services, I did desk research. I concluded that, of the services offered by large companies, only Google's supports the Dutch language. In order to link Google's Speech to Text to my API, I consulted the following literature.
- Literature
Google imposes a number of limits on transcribing audio to text. Without Cloud Storage, a single request may not contain more than 1 minute of audio. Since we have audio files that are longer than a minute, another solution had to be found.
The first solution was a function that cuts an audio file into segments of at most one minute without cutting through a spoken word. I implemented this function to split the audio files and transcribe each segment to text. Cut function:
The second solution was to enable a Cloud Storage service and link it to the Aphasia API. This makes it possible to transcribe audio that is longer than a minute.
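The first solution, cutting only where no word is being spoken, can be sketched as follows. This is not the project's cut function: the name `plan_cuts`, the idea of passing in pause times that were detected beforehand, and the exact 60-second limit are assumptions made for illustration.

```python
def plan_cuts(pauses, duration, max_len=60.0):
    """Return chunk boundaries (seconds) so that no chunk is longer than max_len.

    pauses   -- sorted times (seconds) that fall in silence between words (assumed input)
    duration -- total length of the recording in seconds
    """
    cuts = [0.0]
    while duration - cuts[-1] > max_len:
        start = cuts[-1]
        # pauses that keep the current chunk within the limit
        candidates = [p for p in pauses if start < p <= start + max_len]
        if not candidates:
            raise ValueError("no pause within %.1f s after %.2f s" % (max_len, start))
        cuts.append(candidates[-1])  # cut as late as possible to get few chunks
    cuts.append(duration)
    return cuts


# Hypothetical pause times for a 150-second recording.
print(plan_cuts([10, 55, 70, 110, 130], 150))
```

Each pair of consecutive boundaries describes one segment that could be sent to the Speech to Text service on its own. Cutting at the latest usable pause keeps the number of requests low.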
The architecture of Aphasia API:
For this topic I converted the Aphasia API into a notebook with additional functions to batch-process the "Voxforge" data collection: it transforms a folder full of audio files into word timestamps and writes them to CSV files as a dataset. This dataset consists of the columns "begin", "end", "word" and "audiopath", which will eventually be used with the "Phoneme boundary generator". See the notebook for more information.
- Desk research into existing tools that can extract the word timestamps from an audio signal.
- Notebook
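As an illustration of the dataset layout described above, the following sketch writes word timestamps to CSV with the four columns "begin", "end", "word" and "audiopath". The helper name `write_word_rows`, the example rows and the use of seconds for the timestamps are assumptions; the actual notebook may structure this differently.

```python
import csv
import io

COLUMNS = ["begin", "end", "word", "audiopath"]


def write_word_rows(rows, handle):
    """Write one CSV row per word, with a header line."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical word timestamps for a single audio file.
rows = [
    {"begin": 0.0, "end": 0.4, "word": "goede", "audiopath": "audio/example.wav"},
    {"begin": 0.5, "end": 1.1, "word": "morgen", "audiopath": "audio/example.wav"},
]

buffer = io.StringIO()  # stands in for an open CSV file
write_word_rows(rows, buffer)
print(buffer.getvalue())
```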
An aligner script has been developed for this project. The aligner was important to be able to generate data as training set for SPHINX (a ready-to-use Speech to Text tool). The aligner is mainly intended for the data collection from "UVA" because the sentences are not aligned. The Aeneas library was used to realize this.
- Aeneas documentation
- Aeneas library
- Notebook
For the CORPUS data, a transformer has been written that converts the data to the desired structure, consisting of the columns "begin", "end", "word" and "audiopath", and saves it as a CSV file. This data is used with the Phoneme Boundary Generator, which then generates a new dataset for the Phoneme Boundary Classifier. See the notebook for further information. This data has also been used by my project colleagues.
- Notebook
After the "Data Collection" and "Data Preparation" topics, which put the data in the desired structure, a Phoneme Boundary Generator was developed. This generator produces phoneme boundaries as data by concatenating the last N milliseconds of a word and the first N milliseconds of the next word. This dataset is used to train a Phoneme Boundary Classifier.
Two types of generators have been developed. "V2" stores the concatenated N-millisecond segments as described above, and "V3" stores only the difference between the last N milliseconds of a word and the first N milliseconds of the next word. With this I want to see which approach produces a better validation accuracy and recall score.
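The two generator variants can be sketched on raw samples as follows. This is a simplified illustration, not the notebook code: the function names, the use of plain Python lists instead of audio arrays, timestamps in seconds, and the sign of the V3 difference (end of word minus start of next word) are all assumptions.

```python
def edge_segments(samples, sample_rate, word_end, next_begin, n_ms):
    """Return (tail, head): the last n_ms of a word and the first n_ms of the next word."""
    n = round(sample_rate * n_ms / 1000)         # N milliseconds expressed in samples
    end_idx = round(word_end * sample_rate)      # assumed: timestamps are in seconds
    begin_idx = round(next_begin * sample_rate)
    tail = samples[max(0, end_idx - n):end_idx]
    head = samples[begin_idx:begin_idx + n]
    return tail, head


def boundary_v2(samples, sample_rate, word_end, next_begin, n_ms):
    """V2-style region: both segments concatenated."""
    tail, head = edge_segments(samples, sample_rate, word_end, next_begin, n_ms)
    return tail + head


def boundary_v3(samples, sample_rate, word_end, next_begin, n_ms):
    """V3-style region: element-wise difference of the two segments (assumed tail - head)."""
    tail, head = edge_segments(samples, sample_rate, word_end, next_begin, n_ms)
    return [t - h for t, h in zip(tail, head)]


# Toy signal: 100 samples at 1 kHz, a word ending at 10 ms, the next starting at 20 ms.
samples = list(range(100))
print(boundary_v2(samples, 1000, 0.010, 0.020, n_ms=3))
print(boundary_v3(samples, 1000, 0.010, 0.020, n_ms=3))
```

A V2 region is twice as long as a V3 region for the same N, so the two datasets give the classifier inputs of different size.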
For feature extraction of the audio signals and obtaining the MFCCs, the following library and source were used:
- Source
- Library
- Notebook
In this project, not only data collection or data preparation was important, but also the development and training of a Phoneme Boundary Classifier. In order to train a Phoneme Boundary Classifier model with the collected Dutch-language CORPUS data, a number of machine and deep learning models have been tested. The models are:
- Machine learning
  - Random Forest Classifier
- Deep learning
  - MLP (Multi Layer Perceptron)
  - Bi-LSTM (bi-directional Long Short-Term Memory)
Some of the models above were built with the Scikit-Learn and Tensorflow Core libraries.
Tensorflow Core was chosen because it offers more customization options, such as selecting the GPU cores and applying an activation function per neural network layer, and because it is better suited to developing deep learning networks.
These models have been trained with the data generated by the Phoneme Boundary Generator (CORPUS NL) to develop a Phoneme Boundary Classifier.
The goal of trying out these models is to ultimately choose a model to train it with the dataset.
- Random Forest Classifier
- MLP classifier (Multi Layer Perceptron)
- Bi-LSTM classifier
The dataset generated with the V2 phoneme boundary generator is used in the following topics: Oversampling, Model Selection, Evaluation and Diagnosis. It is used because it gives a better validation accuracy and recall score than the V3 dataset.
Below is a visualization of the datasets generated by V2 and V3 generators. Each dataset consists of the columns "region", "label", "sample_rate", "begin", "end" and "audiopath".
Overview dataset
Info datatypes
Visualization of the N-millisecond audio signal and MFCCs
Overview dataset
Info datatypes
Visualization of the N-millisecond audio signal and MFCCs
To counter skewed classes, a "generateMoreData()" function has been written that balances the ratio between labels 0 and 1. This function is implemented in the notebooks of the models.
Before oversampling - here you can clearly see that the dataset suffers from skewed classes because the ratio between label 0 and 1 is not balanced.
After oversampling - here it can be clearly seen that the dataset is well balanced.
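One common way to balance two labels is random oversampling, where rows of the smaller class are duplicated until both classes have the same size. The sketch below shows that technique; it is not the project's own oversampling function, and the name `oversample`, the row format and the fixed seed are assumptions.

```python
import random


def oversample(rows, label_key="label", seed=0):
    """Duplicate random rows of the smaller classes until all classes are equally large."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # draw the missing rows with replacement from the same class
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical skewed dataset: 8 rows with label 0 and 2 rows with label 1.
rows = [{"label": 0}] * 8 + [{"label": 1}] * 2
balanced = oversample(rows)
print(sum(r["label"] == 0 for r in balanced), sum(r["label"] == 1 for r in balanced))
```

Oversampling should be applied to the training split only, otherwise duplicated rows can leak into the validation data and inflate the scores.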
In this section the most promising values for the hyperparameters are selected. For each model the relevant parameters are listed, along with a plotted result. For further information see the notebook per model. Before model selection is carried out, oversampling is applied to balance the ratio between labels 0 and 1. Otherwise the dataset would suffer from skewed classes.
This section shows which value is interesting to use for the hyper parameters "max depth" and "estimators"
- Max depth
The plot above shows that the model becomes more complex as the max depth grows, which leads to overfitting once "max depth" is higher than 5. This makes it possible to fix a max depth before training the model on a larger dataset.
- Estimator
Below, the model is retrained, but with a dataset of 1 million samples. To obtain a dataset of a desired size, a function called "getBatchData()" has been written that returns a balanced dataset with the requested number of samples, see notebook. In this selection, the focus is on the "estimators" value.
The plots above show that there is very little difference beyond 8 estimators, even with 100 estimators. The train and validation accuracy lines are not far apart, which indicates that there is no under- or overfitting. With 32 estimators the model reaches the highest validation accuracy.
Here we look at which value is interesting to use with the hyperparameters "number of neurons", "learning rate" and "number of layers"
- Number of neurons
In the plot of the number of neurons we can see that the model overfits beyond 60 neurons.
- Learning rate
In the learning rate plot we can see that the validation accuracy decreases with a higher learning rate.
- Number of layers
In the plot of the number of layers we can see that the validation accuracy decreases and the training accuracy increases with more layers, so the model overfits.
Here we look at the classification report scores for different values of the hyperparameters "num_neurons" and "num layers".
- number of neurons
In these plots we can see that the Recall at 70 neurons is highest at class 1 and lowest at class 0. Since the focus is on class 1, 70 neurons is interesting.
- number of layers
In these plots we can see that the Recall scores worse with more than 1 layer at class 1. Since the focus is on class 1, 1 layer is interesting.
From the results above we see that 70 neurons with 1 layer gives the highest Recall score on class 1. We will use these values to create an MLP classifier.
- notebook
Here we look at which value can best be used with the hyper parameters "num neurons", "learning rate" and "learningsteps".
- Number of neurons
In these plots we can see that the Recall at 70 neurons is highest at class 1 and lowest at class 0. In the left plot we see that the model overfits.
- Learning rate
In the plot of learning rate, we can see that the validation accuracy and the Recall score at class 1 decrease with a higher learning rate.
- Number of training steps
In the plot of learning steps we can see that the Recall score at class 1 is highest at approximately 8200 learning steps. However, the left plot shows that the model overfits.
In this section, oversampling was first performed for each model to balance the ratio between labels 0 and 1. For this I wrote the function "generateMoreData()", see the topic Oversampling.
This part evaluates the results of the trained models. Finally, one model is chosen (model selection), which then proceeds to the last stage, "Diagnosis".
After model selection of the value for "max depth" and "estimators", the model was trained with the full datasets.
- Train, validation acc., Recall and Precision score
After selecting the values for "num neurons", "learning rate" and "num layers", the model was trained with the complete datasets.
- Train, validation acc., Recall and Precision score
- notebook
After selecting the values for "num neurons", "learning rate" and "number of training steps", the model was trained with the complete datasets.
- Train and validation accuracy %
- Recall and Precision score
- A score table
[0] = class 0, [1] = class 1
| Score | RFC | MLP | Bi-LSTM |
|---|---|---|---|
| Training acc. | 0.56 | 0.62 | 0.76 |
| Validation acc. | 0.56 | 0.62 | 0.58 |
| Recall [0] | 0.54 | 0.67 | 0.61 |
| Recall [1] | 0.59 | 0.57 | 0.51 |
| Precision [0] | 0.57 | 0.61 | 0.64 |
| Precision [1] | 0.56 | 0.63 | 0.49 |
| F1 score [0] | 0.55 | 0.63 | 0.62 |
| F1 score [1] | 0.57 | 0.60 | 0.50 |
From the score table we can see that the MLP model has the best results. It scores highest on validation accuracy and on Precision for class 1, and second best on Recall for class 1. This means that we proceed with the MLP to the Diagnosis stage.
In this section we continue with the chosen MLP model. Here we look at the problems the model faces, e.g. high bias or high variance. For further information, see the notebook "phoneme boundary MLP classifier diagnostics".
In the plot of iterations we can see that we are dealing with high variance (overfitting), so the model is too complex.
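The F1 rows in the score table are the harmonic mean of precision and recall. As a check, the sketch below recomputes the class 1 value for the MLP from the rounded precision and recall in the table; the function name `f1_score` is chosen here for illustration. Because the inputs are already rounded, a recomputed value can differ from the table by 0.01 in the last digit for some entries.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# MLP, class 1: precision 0.63 and recall 0.57, taken from the score table.
print(round(f1_score(0.63, 0.57), 2))
```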
A zoomed-in plot on the learning process
As a solution we will use regularization.
- Regularization
By first plotting the results for a range of regularization lambda values, we can see which value gives the best result.
Regularization values: [0, 1e-05, 0.001, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
Above we see high variance at low lambda values and high bias at high lambda values. The lambda values 0.64, 1.28 and 2.56 give better generalization.
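The effect of lambda can be illustrated with a penalty term added to the training loss. The sketch below assumes L2 regularization (weight decay); the source does not state which kind of regularization was used, and the function names and example weights are hypothetical. It also rebuilds the list of lambda values given above, where the values from 0.01 onward double at each step.

```python
def l2_penalty(weights, lam):
    """L2 penalty: lambda times the sum of squared weights (assumed form)."""
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam):
    """Loss on the data plus the penalty for large weights."""
    return data_loss + l2_penalty(weights, lam)


# The grid from the text: three small values, then 0.01 doubled ten times.
LAMBDAS = [0, 1e-05, 0.001] + [0.01 * 2 ** k for k in range(11)]

weights = [1.0, -2.0]  # hypothetical model weights
for lam in (0, 0.64, 10.24):
    print(lam, regularized_loss(1.0, weights, lam))
```

With lambda at 0 the penalty vanishes and the model is free to fit the training data closely (high variance). With a large lambda the penalty dominates the loss and pushes the weights toward zero (high bias).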
After regularization, the best value was chosen to reduce overfitting and underfitting.
Below is the plot of the final model with the selected lambda value.
In this plot we can see that the model neither overfits nor underfits, but generalizes.
- notebook
- Dataset with MFCCs and words, containing only the words starting with "st".
- Dataset with MFCCs and words.
- Dataset with MFCCs, words with sounds and a phoneme code list.