Generate publicly releasable corpus to train the language model on #1244

Closed
kdavis-mozilla opened this issue Feb 15, 2018 · 4 comments

Comments

@kdavis-mozilla
Contributor

Generate a text corpus on which the language model can be trained and which can be released under the current licensing.

@kdavis-mozilla
Contributor Author

In light of the test results of #1237, the corpus will be LibriSpeech's training data set.

@kmonachopoulos

Hello,

Did you generate the full Mozilla vocab.txt? There is a LibriSpeech LM here: http://www.openslr.org/11/ but librispeech-lm-norm.txt.gz contains transcripts in capital letters. Can we rebuild the LM using this? Does that make a difference?

Thanks

@kdavis-mozilla
Contributor Author

@kmonachopoulos I've tried building with that and the quality of recognition goes down.
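
For anyone who wants to repeat that experiment, here is a minimal sketch of converting the uppercase transcripts to lowercase before building an LM. It assumes the downloaded file is gzipped UTF-8 text with one sentence per line; the function name `lowercase_corpus` and the output path are hypothetical and not part of this project. Per the comment above, an LM built from this corpus gave worse recognition, so treat this purely as a way to reproduce the test.

```python
import gzip


def lowercase_corpus(src_path, dst_path):
    """Stream a gzipped text corpus, lowercase each line, write plain text.

    Returns the number of non-empty lines written.
    Assumption: the input is UTF-8 with one sentence per line.
    """
    written = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            text = line.strip().lower()
            if text:  # skip blank lines
                dst.write(text + "\n")
                written += 1
    return written
```

Usage would look like `lowercase_corpus("librispeech-lm-norm.txt.gz", "vocab-lower.txt")`, where the output file name is only an example.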


@lock lock bot locked and limited conversation to collaborators Jan 2, 2019