Replies: 9 comments
>>> dan.bmh
[January 25, 2021, 2:00pm]
As I already mentioned some time ago, I'm working on implementing new STT networks using TensorFlow 2.
Over the last few days I made a lot of progress and wanted to share it, along with a request to you and a suggestion for the future development procedure.
My current network, which implements QuartzNet15x5
(paper), reaches a WER of 3.7%
on LibriSpeech (test-clean) using your official English scorer.
The network also has far fewer parameters (19M vs. 48M) and thus should
be faster at inference than the current DeepSpeech1 network. There is
also the option of using the even smaller QuartzNet5x5 model (6.7M
params) if required; I reached a WER of 4.5% with it.
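For reference, WER here is the word error rate: the word-level edit distance between the reference transcript and the hypothesis, divided by the number of reference words. A minimal self-contained sketch of the computation (this is not the official scorer mentioned above, which is a language-model package applied during decoding):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # ~0.167
```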
You can find the pretrained networks and the source code here:
(Please note that the training code is still highly experimental and a
lot of features DeepSpeech had are still missing, but I hope to add most
of them over time.)
GitLab: Files · dspol · Jaco-Assistant / DeepSpeech-Polyglot
Training code and checkpoints for DeepSpeech in multiple languages
Now I would like to ask if we could make those models usable with the
deepspeech bindings. The problem is that this will require some changes
in the native client code, because apart from the network architecture
I also had to change the input pipeline.
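To illustrate what such an input-pipeline difference can involve, here is a hedged sketch of a typical TF2 log-mel filterbank front end built with tf.signal; it is only an example of the style of pipeline, not necessarily the exact parameters or steps used in DeepSpeech-Polyglot or the native client:

```python
import tensorflow as tf

def log_mel_features(audio, sample_rate=16000, n_mels=64,
                     frame_length=400, frame_step=160, fft_length=512):
    """Compute log-mel filterbank features from a mono float32 waveform."""
    # Short-time Fourier transform -> magnitude spectrogram
    stft = tf.signal.stft(audio, frame_length=frame_length,
                          frame_step=frame_step, fft_length=fft_length)
    spectrogram = tf.abs(stft)

    # Warp the linear-frequency bins onto the mel scale
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=n_mels,
        num_spectrogram_bins=fft_length // 2 + 1,
        sample_rate=sample_rate,
        lower_edge_hertz=20.0,
        upper_edge_hertz=7600.0)
    mel = tf.matmul(spectrogram, mel_matrix)

    # Log compression with a small floor for numerical stability
    return tf.math.log(mel + 1e-6)

# Example: one second of silence at 16 kHz
feats = log_mel_features(tf.zeros([16000]))
print(feats.shape)  # (num_frames, 64)
```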
It would be great if you could look into these changes and update the
client bindings accordingly. We would also need to think about a new
procedure for streaming inference, but some parts of the reference
implementation from Nvidia
(link)
should be usable for that.
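As a rough illustration of what a streaming procedure could look like (a minimal sketch only, assuming fixed-size overlapping windows and a hypothetical `acoustic_model` callable; the Nvidia reference implementation handles context frames and decoder state far more carefully):

```python
import numpy as np

def stream_audio(chunks, acoustic_model, window_size=16000, hop_size=8000):
    """Feed overlapping windows of streamed audio to an acoustic model.

    `chunks` is an iterable of 1-D float32 numpy arrays (raw audio),
    `acoustic_model` is any callable mapping one window to an output.
    """
    buffer = np.zeros(0, dtype=np.float32)
    for chunk in chunks:
        buffer = np.concatenate([buffer, chunk])
        # Emit a prediction each time a full window is available
        while len(buffer) >= window_size:
            window = buffer[:window_size]
            yield acoustic_model(window)
            buffer = buffer[hop_size:]  # keep some overlap as left context

# Usage sketch: fake 0.5 s chunks and a dummy model
fake_chunks = (np.zeros(8000, dtype=np.float32) for _ in range(6))
for out in stream_audio(fake_chunks, acoustic_model=lambda w: w.shape):
    print(out)
```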
Besides the request for updating the bindings, I would like to make
another suggestion, just as an idea to think about, which I think could
improve development in the future: splitting DeepSpeech into three
parts covering usage, training, and datasets.
We would keep the GitHub repo as the main repository and entry point, but
split out the training part into DeepSpeech-Polyglot. This should save
you a lot of time compared to updating DeepSpeech and hopefully gives me
some more development support. I would also give you access to the
repository then.
Splitting the downloading and preparation of the datasets into its
own tool would make it usable for other STT projects too, and therefore
new datasets might be added faster. I would suggest using
corcua for that, which I
created with the focus of making the addition of new datasets as easy as
possible (I first tried audiomate, but I found their architecture too
complicated).
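Purely to illustrate the idea (a hypothetical minimal interface, not corcua's actual API), a shared dataset tool could expose something along these lines, with one small reader class per dataset:

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Sample:
    audio_path: str   # path to an audio file
    transcript: str   # ground-truth text
    speaker: str = ""


class DatasetReader:
    """Hypothetical base class: one subclass per dataset."""

    def download(self, target_dir: str) -> None:
        raise NotImplementedError

    def load_samples(self, target_dir: str) -> Iterator[Sample]:
        raise NotImplementedError


class LibriSpeechReader(DatasetReader):
    def load_samples(self, target_dir: str) -> Iterator[Sample]:
        # Walk the extracted archive and yield (audio, transcript) pairs;
        # details omitted, this only sketches the shape of the interface.
        yield Sample(audio_path=f"{target_dir}/example.wav",
                     transcript="example transcript")
```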
What do you think about the two ideas?
Greetings
Daniel
(Notes on the above checkpoint: I transferred the pretrained models
from Nvidia, who used PyTorch, to TensorFlow. While this works well
for the network itself, I had some problems with the input pipeline. The
spectrogram + filterbank calculation has a slightly different output in
TensorFlow, which increases the WER of the transferred networks by about
1%. The problem could be reduced somewhat by training some additional
epochs on LibriSpeech, but I think we could still improve this by about
0.6% if we either resolve the pipeline difference or run a longer training
over the transferred checkpoint.)
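To give an idea of what the transfer step looks like, here is a hedged sketch for a plain 1-D convolution only (the real QuartzNet blocks also contain separable convolutions and batch-norm parameters): PyTorch stores Conv1d kernels as (out_channels, in_channels, kernel_size), while Keras Conv1D expects (kernel_size, in_channels, out_channels), so each kernel has to be transposed before loading.

```python
import numpy as np
import tensorflow as tf

# Pretend this came from a PyTorch state_dict, e.g. state_dict["conv.weight"].numpy()
pt_kernel = np.random.randn(256, 64, 11).astype(np.float32)  # (out, in, kernel)
pt_bias = np.zeros(256, dtype=np.float32)

# Build an equivalent Keras layer and load the transposed kernel
layer = tf.keras.layers.Conv1D(filters=256, kernel_size=11, padding="same")
layer.build(input_shape=(None, None, 64))          # (batch, time, channels)
layer.set_weights([pt_kernel.transpose(2, 1, 0),   # -> (kernel, in, out)
                   pt_bias])

# Quick shape check on a dummy input
out = layer(tf.zeros([1, 100, 64]))
print(out.shape)  # (1, 100, 256)
```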
[This is an archived STT discussion thread from discourse.mozilla.org/t/integration-of-deepspeech-polyglots-new-networks]