Merge pull request #8 from translatorswb/main
Congolese Swahili models by TWB
Showing 4 changed files with 725 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
Terms and conditions | ||
|
||
TWB Gamayun Portal Access and License Agreement | ||
|
||
Please read the following Terms and Conditions (“Terms”) carefully. By registering for access to, or by accessing, the Available Data (as defined below), you hereby acknowledge that you have read, and agree to be bound by, these Terms. | ||
As used herein, the terms “Translators without Borders”, “TWB”, “we” or “us” shall mean Translators without Borders - US, Inc. and any of its affiliates and subsidiaries. The terms “you” and “user” shall mean any individual, partnership, corporation, trust, limited liability company, governmental authority, or other entity that registers for access to the Available Data. | ||
|
||
1. License Grant: Subject to these Terms, and the License Restrictions set forth below, TWB hereby grants to you a revocable, limited, royalty free, worldwide, non-exclusive right and license to access, use, extract, copy, modify, and create Derivative Works of the Available Data (the “License”). For purposes of this Agreement, “Available Data” shall mean any language and translation-related information and data made available to you through the TWB Open Data Portal (the web portal that you will have access to once you complete this registration process by accepting these Terms), and “Derivative Works” shall mean any data or information in any form created as a result of the use, modification, combination, calculation, conversion, or manipulation of the Available Data or any portion thereof. TWB shall have no liability whatsoever related to your use of any Available Data, and you shall indemnify, hold harmless and defend TWB, and their respective directors, partners, officers, employees, representatives, and agents (collectively, the “Indemnitees”) from and against any and all third party claims, liabilities, losses, reasonable and necessary expenses actually incurred (including reasonable attorneys' fees), fines, penalties, taxes or damages (collectively "Liabilities") incurred by you to the extent such Liabilities result from your use or misuse of the Available Data. | ||
Except for the License granted above, neither party grants any intellectual property rights to the other pursuant to this Agreement unless expressly stated otherwise. | ||
TWB reserves the right to revoke the License at any time if it determines, in its sole discretion, that you have violated any of these Terms or the License Restrictions set forth below. | ||
|
||
2. License Restrictions. The following restrictions shall apply to the License (“License Restrictions”): | ||
a. Public or Commercial Use; Notice. The License shall not prohibit any (i) public use (e.g. use of the Available Data on any websites or applications made available to the general public) or (ii) commercial use, of the Available Data or Derivative Works; provided, however, that prior to engaging in any such public or commercial use, you provide notice to TWB, which notice shall include a brief description of the proposed public or commercial use. | ||
b. Open Nature of Derivative Works. The License shall not restrict your ability to create or distribute Derivative Works; provided, however, that if you provide any license to such Derivative Works, such license shall contain elements substantially similar to those contained in this License, including that such Derivative Work is made available at no cost and in an open manner. | ||
c. Attribution. The License is being granted to you on the condition that if and when you redistribute the Available Data or Derivative Works, you provide modest attribution to TWB by including the following copyright notice where redistributed: | ||
“Copyright 2020, Translators Without Borders – US, Inc.” | ||
d. Prohibited Uses. The License shall not grant the right to use any Available Data for any Prohibited Uses. As used herein, a “Prohibited Use” shall mean any use of the Available Data to: | ||
Create, distribute or otherwise transmit any information or content that is unlawful, harmful, threatening, embarrassing, abusive, harassing, tortuous, defamatory, vulgar, obscene, libelous, deceptive, fraudulent, contains explicit or graphic descriptions or accounts of sexual acts, invasive of another’s privacy, false or purposefully deceptive, or hateful; | ||
Create, distribute or otherwise transmit any information or content that victimizes, harasses, degrades, or intimidates an individual or group of individuals on the basis of religion, gender, sexual orientation, race, ethnicity, age, or disability; | ||
Harm minors in any way; | ||
Create, distribute or otherwise transmit any information or content that you do not have a right to transmit under any law or under contractual or fiduciary relationships (such as inside information, proprietary and confidential information learned or disclosed as part of employment relationships or under nondisclosure agreements); | ||
Create, distribute or otherwise transmit any information or content that infringes any patent, trademark, trade secret, copyright or other proprietary or confidentiality rights of any party; | ||
Create, distribute or otherwise transmit any information or content any unsolicited or unauthorized advertising, promotional materials, “junk mail,” “Spam,” or any other form of solicitation; | ||
Create, distribute or otherwise transmit any information or content any material that contains software viruses, Trojan horses, worms, time bombs, cancel bots, or any other computer code, files or programs designed to interrupt, destroy, or limit the functionality of any computer software or hardware or telecommunications equipment or any other similarly destructive activity, or surreptitiously intercept or expropriate any system, data or personal information; | ||
Engage in any activity that is contrary to or which would adversely affect the purpose or intention TWB’s humanitarian-focused mission of using language to increase access to critical knowledge and information; or | ||
Intentionally or unintentionally violate any applicable law. | ||
|
||
3. LIMITATION OF LIABILITY: YOU ACKNOWLEDGE AND AGREE THAT YOU ACCESS AND UTILIZE THE AVAILABLE DATA AT YOUR OWN DISCRETION AND RISK. THE AVAILABLE DATA IS PROVIDED “AS IS” AND “AS AVAILABLE.” TWB: | ||
IS NOT PROVIDING ANY WARRANTIES AND REPRESENTATIONS REGARDING THE AVAILABLE DATA; | ||
DISCLAIMS ALL WARRANTIES AND REPRESENTATIONS OF ANY KIND WITH REGARD TO THE AVAILABLE DATA, INCLUDING ANY IMPLIED WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT OF THIRD PARTY RIGHTS, OR FITNESS FOR A PARTICULAR PURPOSE; | ||
DOES NOT WARRANT THE ACCURACY, ADEQUACY, AVAILABILITY, APPROPRIATENESS, COMPLETENESS, RELIABILITY, TIMELINESS, USEFULNESS, OR OTHERWISE OF THE CONTENT OR INFORMATION CONTAINED IN THE AVAILABLE DATA AND EXPRESSLY DISCLAIMS LIABILITY FOR ERRORS OR OMISSIONS IN THE INFORMATION AND CONTENT; AND | ||
WILL NOT BE LIABLE FOR ANY PROBLEMS EXPERIENCED BY YOU DUE TO CAUSES BEYOND OUR CONTROL. | ||
IN NO EVENT WILL TWB, ITS OFFICERS, DIRECTORS, EMPLOYEES, AGENTS, PARENTS, AFFILIATES, SUCCESSORS OR ASSIGNS, BE LIABLE TO YOU OR ANY OTHER PARTY (i) FOR ANY INDIRECT, SPECIAL, PUNITIVE, INCIDENTAL OR CONSEQUENTIAL DAMAGES OR ANY OTHER DAMAGES ARISING IN ANY WAY OUT OF THE AVAILABILITY, USE, RELIANCE ON, OR INABILITY TO USE THE AVAILABLE DATA, EVEN IF TWB OR ITS AGENTS SHALL HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES, AND REGARDLESS OF THE FORM OF ACTION, WHETHER IN CONTRACT, TORT, OR OTHERWISE; OR (ii) FOR ANY CLAIM ATTRIBUTABLE TO ERRORS, OMISSIONS, OR OTHER INACCURACIES IN, OR DESTRUCTIVE PROPERTIES OF THE AVAILABLE DATA. BECAUSE SOME STATES OR JURISDICTIONS DO NOT ALLOW THE EXCLUSION OR THE LIMITATION OF LIABILITY FOR CONSEQUENTIAL OR INCIDENTAL DAMAGES, IN SUCH STATES OR JURISDICTIONS, TWB’S LIABILITY SHALL BE LIMITED TO THE EXTENT PERMITTED BY LAW. | ||
|
||
4. Assignment: You may not assign the License without the prior written consent of TWB. | ||
|
||
5. Notice. All notices required to be given pursuant to these Terms shall be sufficient if sent either certified mail, return receipt requested, to the addresses set forth below: | ||
Translators without Borders | ||
30 Main Street, | ||
Danbury CT USA 06810 | ||
Attn: Grace Tang | ||
|
||
For purposes of Section 2(a) only, electronic notice shall be sufficient if sent to the following email address: [email protected]. | ||
|
||
6. Enforceability and Governing Law: In the event any of the terms or provisions of these Terms and Conditions shall be held to be unenforceable, the remaining terms and provisions shall be unimpaired and the unenforceable term or provision shall be replaced by such enforceable term or provision as comes closest to the intention underlying the unenforceable term or provision. These Terms shall be subject to any other agreements you have entered into with TWB. These Terms, and your access to and use of the Available Data, shall be governed by the laws of the State of Massachusetts. | ||
Any action against TWB arising from or relating to these Terms or your access to and use of the Available Data must be brought by you in state or federal court located in the State of Massachusetts. You consent to the jurisdiction and venue of the state and federal courts located within the State of Massachusetts for the adjudication of all claims arising from or relating to your access to and use of the Available Data and the provisions of these Terms. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,115 @@ | ||
# Model card for Congolese Swahili STT | ||
|
||
Jump to section: | ||
|
||
- [Model details](#model-details) | ||
- [Intended use](#intended-use) | ||
- [Performance Factors](#performance-factors) | ||
- [Metrics](#metrics) | ||
- [Training data](#training-data) | ||
- [Training parameters](#training-parameters) | ||
- [Language models](#language-models) | ||
- [Evaluation data](#evaluation-data) | ||
- [Ethical considerations](#ethical-considerations) | ||
- [Caveats and recommendations](#caveats-and-recommendations) | ||
|
||
## Model details | ||
|
||
- Person and organization developing model: [Alp Öktem](https://alpoktem.github.io/) @[Clear Global/Translators without Borders](https://clearglobal.org/). | ||
- Model language: Swahili (Congo) / `swc` / `sw-cd` | ||
- Model date: August 26, 2021 | ||
- Model type: `Speech-to-Text` | ||
- Model version: `v0.3.0` | ||
- Compatible with 🐸 STT version: `v0.10.0a13` | ||
- License: Custom (`LICENSE.txt`) | ||
- Citation details: `@techreport{swc-stt, author = {\"Oktem, Alp}, title = {SWC STT 0.3}, institution = {Translators without Borders}, address = {\url{https://github.com/coqui-ai/STT-models}}, year = {2021}, month = {June}, number = {STT-SWC-0.3} }` | ||
- Official page: [https://gamayun.translatorswb.org/data/](https://gamayun.translatorswb.org/data/swc-stt-model) | ||
- Where to send questions or comments about the model: Email [Alp Öktem](mailto:[email protected]) directly, open an issue on [`STT-models` issues](https://github.com/coqui-ai/STT-models/issues), start a new discussion on [`STT-models` discussions](https://github.com/coqui-ai/STT-models/discussions), or chat with us on [Gitter](https://gitter.im/coqui-ai/). | ||
|
||
## Intended use | ||
|
||
Speech-to-Text for the Congolese dialect of [Swahili Language](https://en.wikipedia.org/wiki/Swahili_language) on 16kHz, mono-channel audio. | ||
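
As a minimal sketch of checking the audio constraint above before transcription, the snippet below uses Python's standard `wave` module to test whether a WAV file is 16kHz and mono. The function name `is_16khz_mono` is hypothetical, and the check does not look at sample width or encoding.

```python
import wave


def is_16khz_mono(path):
    """Return True if the WAV file at `path` is sampled at 16 kHz with one channel."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1
```

Files that fail the check would need resampling or downmixing with an audio tool of your choice before being passed to the model.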
|
||
## Performance Factors | ||
|
||
Factors relevant to Speech-to-Text performance include but are not limited to speaker demographics, recording quality, and background noise. Read more about STT performance factors [here](https://stt.readthedocs.io/en/latest/DEPLOYMENT.html#how-will-a-model-perform-on-my-data). | ||
|
||
## Metrics | ||
|
||
STT models are usually evaluated in terms of their transcription accuracy, deployment Real-Time Factor, and model size on disk. | ||
|
||
#### Transcription Accuracy | ||
|
||
The following Word Error Rates (WER) and Character Error Rates (CER) are reported on the TICO-19 devset and the [Congolese Swahili Commands dataset](https://gamayun.translatorswb.org/data/), each decoded with the scorer shown. | ||
|
||
|Test Corpus|Scorer|WER|CER| | ||
|-----------|---|---|---| | ||
|TICO-19 devset|swc-general|18.31\%|6.15\%| | ||
|Congolese Swahili Commands|swc-commands|21.08\%|20.82\%| | ||
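
For readers who want to compute comparable figures on their own transcripts, here is a minimal sketch of how WER and CER are commonly defined: Levenshtein edit distance divided by the reference length, counted over words and over characters respectively. This is a generic implementation, not the evaluation script used for this model; the function names are illustrative, and the reference is assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of one hypothesis against one reference transcript."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate of one hypothesis against one reference transcript."""
    return edit_distance(reference, hypothesis) / len(reference)
```

Corpus-level rates are usually obtained by summing edit distances and reference lengths over all utterances before dividing, rather than by averaging per-utterance rates.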
|
||
#### Real-Time Factor | ||
|
||
Real-Time Factor (RTF) is defined as `processing-time / length-of-audio`. The exact real-time factor of an STT model will depend on the hardware setup, so you may experience a different RTF. | ||
|
||
Recorded average RTF on laptop CPU: ` ` | ||
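
The RTF definition above can be measured with a few lines of standard-library Python. This is a sketch under assumptions: `transcribe` stands in for whatever callable runs the model on your setup, and `audio_seconds` is the clip length you supply; neither name comes from the STT API.

```python
import time


def real_time_factor(transcribe, audio, audio_seconds):
    """Time one call to `transcribe(audio)` and divide by the audio length in seconds."""
    start = time.perf_counter()
    transcribe(audio)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds
```

An RTF below 1 means the audio is processed faster than real time. Averaging over several clips gives a steadier estimate than a single call.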
|
||
#### Model Size | ||
|
||
- `swc-stt-0.3.pbmm`: 188.9 MB | ||
- `swc-stt-0.3.tflite`: 47.3 MB | ||
- `swc-general.scorer`: 158.6 MB | ||
- `swc-commands.scorer`: 2.9 KB | ||
|
||
### Approaches to uncertainty and variability | ||
|
||
Confidence scores and multiple paths from the decoding beam can be used to measure model uncertainty and provide multiple, variable transcripts for any processed audio. | ||
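
A rough sketch of reading several candidate transcripts together with their confidence scores follows. It assumes a metadata object shaped like the one Coqui STT's Python API returns from `Model.sttWithMetadata(audio, num_results)`: a `transcripts` list whose items carry a `confidence` value and a list of `tokens`, each with a `text` field. Verify that shape against the STT version you have installed; `candidate_transcripts` is a hypothetical helper name.

```python
def candidate_transcripts(metadata):
    """Return (text, confidence) pairs, highest confidence first."""
    pairs = [
        ("".join(token.text for token in transcript.tokens), transcript.confidence)
        for transcript in metadata.transcripts
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

Comparing the top score with the runner-up is one simple way to flag utterances where the decoder was unsure.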
|
||
## Training data | ||
|
||
The acoustic model was trained on top of the English STT model using portions of the [Congolese Swahili audio mini-kit](https://gamayun.translatorswb.org/download/congolese-swahili-audio-mini-kit/) and the [TICO-19 Congolese Swahili testing set](https://gamayun.translatorswb.org/download/congolese-swahili-tico-19-audio-test-set/). The audio was converted to 16kHz WAV before training. | ||
|
||
- Total train size: 8.93 hours (mini-kit) + 3.27 hours (TICO-19 testset) = 12.2 hours | ||
- Dev size: 0.49 hours (mini-kit) | ||
- Test size: 1.71 hours (TICO-19 devset) | ||
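
The arithmetic behind these sizes can be checked quickly. The sketch below only restates the figures given above and derives each split's share of the total; the variable names are illustrative.

```python
# Hours per split, as stated in this model card.
train_hours = 8.93 + 3.27  # mini-kit + TICO-19 testset
dev_hours = 0.49
test_hours = 1.71

total_hours = train_hours + dev_hours + test_hours

# Share of the total audio that each split represents, in percent.
shares = {
    name: round(100 * hours / total_hours, 1)
    for name, hours in [("train", train_hours), ("dev", dev_hours), ("test", test_hours)]
}
```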
|
||
## Training parameters | ||
|
||
|Parameter|Value| | ||
|---------|-----| | ||
|Epochs|200| | ||
|Drop source layers|2| | ||
|Learning rate|0.001| | ||
|Dropout rate|0.2| | ||
|augment frequency_mask|[p=0.8,n=2:4,size=2:4]| | ||
|augment time_mask|[p=0.8,n=2:4,size=10:50,domain=spectrogram] | | ||
|Train/test/dev batch size|32| | ||
|
||
## Language models | ||
|
||
The model is packaged with two language models (scorers): | ||
- *General purpose language model* (`swc-general.scorer`) is trained on a 37.7M-word mixed Swahili text corpus. | ||
- *Commands language model* (`swc-commands.scorer`) is trained on 12 commands (numbers from 1 to 10 and yes/no) which are listed in `vocab-commands.txt`. | ||
|
||
## Evaluation data | ||
|
||
The model was evaluated on two different sets: | ||
- [Congolese Swahili audio commands corpus](https://gamayun.translatorswb.org/download/swc-audio-commands/): 185-sample subset (1.8 minutes) consisting of 5 speakers uttering numbers 1 to 10 and yes/no in Congolese Swahili. For this evaluation, the `swc-commands` language model was used. | ||
- [Congolese Swahili TICO-19 audio development set](https://gamayun.translatorswb.org/download/swc-tico-19-audio-devset/): 536-sample subset (1.71 hours) consisting of TICO-19 domain sentences spoken by a male and a female speaker (listed in `swc-tico-test.csv`). For this evaluation, the `swc-general` language model was used. | ||
|
||
## Ethical considerations | ||
|
||
Deploying a Speech-to-Text model into any production setting has ethical implications. You should consider these implications before use. | ||
|
||
### Demographic Bias | ||
|
||
You should assume every machine learning model has demographic bias unless proven otherwise. For STT models, it is often the case that transcription accuracy is better for men than it is for women. If you are using this model in production, you should acknowledge this as a potential issue. | ||
|
||
### Surveillance | ||
|
||
Speech-to-Text may be misused to invade the privacy of others by recording and mining information from private conversations. This kind of individual privacy is protected by law in many countries. You should not assume consent to record and analyze private speech. | ||
|
||
## Caveats and recommendations | ||
|
||
Machine learning models (like this STT model) perform best on data that is similar to the data on which they were trained. Read about what to expect from an STT model with regard to your data [here](https://stt.readthedocs.io/en/latest/DEPLOYMENT.html#how-will-a-model-perform-on-my-data). | ||
|
||
In most applications, it is recommended that you [train your own language model](https://stt.readthedocs.io/en/latest/LANGUAGE_MODEL.html) to improve transcription accuracy on your speech data. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
|
||
a | ||
b | ||
c | ||
d | ||
e | ||
f | ||
g | ||
h | ||
i | ||
j | ||
k | ||
l | ||
m | ||
n | ||
o | ||
p | ||
q | ||
r | ||
s | ||
t | ||
u | ||
v | ||
w | ||
x | ||
y | ||
z |
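
The 27-line file above reads as the model's character alphabet, one symbol per line. Assuming the first line (rendered blank in the diff) holds a space character, a small sketch like the following can flag characters in a transcript that fall outside that alphabet, which helps when preparing text for training or scoring. The names `ALPHABET` and `out_of_alphabet` are hypothetical.

```python
import string

# Assumption: the alphabet is a space followed by the lowercase letters a-z.
ALPHABET = set(" " + string.ascii_lowercase)


def out_of_alphabet(text):
    """Return the sorted distinct characters in `text` that the alphabet lacks."""
    return sorted(set(text) - ALPHABET)
```

Text containing uppercase letters, digits, or punctuation would need normalizing (for example lowercasing and spelling out numbers) before it matches this alphabet.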