bengali model metadata
alpoktem committed Jun 16, 2021
1 parent 5f9dba3 commit 666d94c
Showing 4 changed files with 2,228 additions and 0 deletions.
46 changes: 46 additions & 0 deletions bengali/twb/v0.1.0/LICENSE
@@ -0,0 +1,46 @@
Terms and conditions

TWB Gamayun Portal Access and License Agreement

Please read the following Terms and Conditions (“Terms”) carefully. By registering for access to, or by accessing, the Available Data (as defined below), you hereby acknowledge that you have read, and agree to be bound by, these Terms.
As used herein, the terms “Translators without Borders”, “TWB”, “we” or “us” shall mean Translators without Borders - US, Inc. and any of its affiliates and subsidiaries. The terms “you” and “user” shall mean any individual, partnership, corporation, trust, limited liability company, governmental authority, or other entity that registers for access to the Available Data.

1. License Grant: Subject to these Terms, and the License Restrictions set forth below, TWB hereby grants to you a revocable, limited, royalty free, worldwide, non-exclusive right and license to access, use, extract, copy, modify, and create Derivative Works of the Available Data (the “License”). For purposes of this Agreement, “Available Data” shall mean any language and translation-related information and data made available to you through the TWB Open Data Portal (the web portal that you will have access to once you complete this registration process by accepting these Terms), and “Derivative Works” shall mean any data or information in any form created as a result of the use, modification, combination, calculation, conversion, or manipulation of the Available Data or any portion thereof. TWB shall have no liability whatsoever related to your use of any Available Data, and you shall indemnify, hold harmless and defend TWB, and their respective directors, partners, officers, employees, representatives, and agents (collectively, the “Indemnitees”) from and against any and all third party claims, liabilities, losses, reasonable and necessary expenses actually incurred (including reasonable attorneys' fees), fines, penalties, taxes or damages (collectively "Liabilities") incurred by you to the extent such Liabilities result from your use or misuse of the Available Data.
Except for the License granted above, neither party grants any intellectual property rights to the other pursuant to this Agreement unless expressly stated otherwise.
TWB reserves the right to revoke the License at any time if it determines, in its sole discretion, that you have violated any of these Terms or the License Restrictions set forth below.

2. License Restrictions. The following restrictions shall apply to the License (“License Restrictions”):
a. Public or Commercial Use; Notice. The License shall not prohibit any (i) public use (e.g. use of the Available Data on any websites or applications made available to the general public) or (ii) commercial use, of the Available Data or Derivative Works; provided, however, that prior to engaging in any such public or commercial use, you provide notice to TWB, which notice shall include a brief description of the proposed public or commercial use.
b. Open Nature of Derivative Works. The License shall not restrict your ability to create or distribute Derivative Works; provided, however, that if you provide any license to such Derivative Works, such license shall contain elements substantially similar to those contained in this License, including that such Derivative Work is made available at no cost and in an open manner.
c. Attribution. The License is being granted to you on the condition that if and when you redistribute the Available Data or Derivative Works, you provide modest attribution to TWB by including the following copyright notice where redistributed:
“Copyright 2020, Translators Without Borders – US, Inc.”
d. Prohibited Uses. The License shall not grant the right to use any Available Data for any Prohibited Uses. As used herein, a “Prohibited Use” shall mean any use of the Available Data to:
Create, distribute or otherwise transmit any information or content that is unlawful, harmful, threatening, embarrassing, abusive, harassing, tortuous, defamatory, vulgar, obscene, libelous, deceptive, fraudulent, contains explicit or graphic descriptions or accounts of sexual acts, invasive of another’s privacy, false or purposefully deceptive, or hateful;
Create, distribute or otherwise transmit any information or content that victimizes, harasses, degrades, or intimidates an individual or group of individuals on the basis of religion, gender, sexual orientation, race, ethnicity, age, or disability;
Harm minors in any way;
Create, distribute or otherwise transmit any information or content that you do not have a right to transmit under any law or under contractual or fiduciary relationships (such as inside information, proprietary and confidential information learned or disclosed as part of employment relationships or under nondisclosure agreements);
Create, distribute or otherwise transmit any information or content that infringes any patent, trademark, trade secret, copyright or other proprietary or confidentiality rights of any party;
Create, distribute or otherwise transmit any information or content any unsolicited or unauthorized advertising, promotional materials, “junk mail,” “Spam,” or any other form of solicitation;
Create, distribute or otherwise transmit any information or content any material that contains software viruses, Trojan horses, worms, time bombs, cancel bots, or any other computer code, files or programs designed to interrupt, destroy, or limit the functionality of any computer software or hardware or telecommunications equipment or any other similarly destructive activity, or surreptitiously intercept or expropriate any system, data or personal information;
Engage in any activity that is contrary to or which would adversely affect the purpose or intention TWB’s humanitarian-focused mission of using language to increase access to critical knowledge and information; or
Intentionally or unintentionally violate any applicable law.

3. LIMITATION OF LIABILITY: YOU ACKNOWLEDGE AND AGREE THAT YOU ACCESS AND UTILIZE THE AVAILABLE DATA AT YOUR OWN DISCRETION AND RISK. THE AVAILABLE DATA IS PROVIDED “AS IS” AND “AS AVAILABLE.” TWB:
IS NOT PROVIDING ANY WARRANTIES AND REPRESENTATIONS REGARDING THE AVAILABLE DATA;
DISCLAIMS ALL WARRANTIES AND REPRESENTATIONS OF ANY KIND WITH REGARD TO THE AVAILABLE DATA, INCLUDING ANY IMPLIED WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT OF THIRD PARTY RIGHTS, OR FITNESS FOR A PARTICULAR PURPOSE;
DOES NOT WARRANT THE ACCURACY, ADEQUACY, AVAILABILITY, APPROPRIATENESS, COMPLETENESS, RELIABILITY, TIMELINESS, USEFULNESS, OR OTHERWISE OF THE CONTENT OR INFORMATION CONTAINED IN THE AVAILABLE DATA AND EXPRESSLY DISCLAIMS LIABILITY FOR ERRORS OR OMISSIONS IN THE INFORMATION AND CONTENT; AND
WILL NOT BE LIABLE FOR ANY PROBLEMS EXPERIENCED BY YOU DUE TO CAUSES BEYOND OUR CONTROL.
IN NO EVENT WILL TWB, ITS OFFICERS, DIRECTORS, EMPLOYEES, AGENTS, PARENTS, AFFILIATES, SUCCESSORS OR ASSIGNS, BE LIABLE TO YOU OR ANY OTHER PARTY (i) FOR ANY INDIRECT, SPECIAL, PUNITIVE, INCIDENTAL OR CONSEQUENTIAL DAMAGES OR ANY OTHER DAMAGES ARISING IN ANY WAY OUT OF THE AVAILABILITY, USE, RELIANCE ON, OR INABILITY TO USE THE AVAILABLE DATA, EVEN IF TWB OR ITS AGENTS SHALL HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES, AND REGARDLESS OF THE FORM OF ACTION, WHETHER IN CONTRACT, TORT, OR OTHERWISE; OR (ii) FOR ANY CLAIM ATTRIBUTABLE TO ERRORS, OMISSIONS, OR OTHER INACCURACIES IN, OR DESTRUCTIVE PROPERTIES OF THE AVAILABLE DATA. BECAUSE SOME STATES OR JURISDICTIONS DO NOT ALLOW THE EXCLUSION OR THE LIMITATION OF LIABILITY FOR CONSEQUENTIAL OR INCIDENTAL DAMAGES, IN SUCH STATES OR JURISDICTIONS, TWB’S LIABILITY SHALL BE LIMITED TO THE EXTENT PERMITTED BY LAW.

4. Assignment: You may not assign the License without the prior written consent of TWB.

5. Notice. All notices required to be given pursuant to these Terms shall be sufficient if sent either certified mail, return receipt requested, to the addresses set forth below:
Translators without Borders
30 Main Street,
Danbury CT USA 06810
Attn: Grace Tang

For purposes of Section 2(a) only, electronic notice shall be sufficient if sent to the following email address: [email protected].

6. Enforceability and Governing Law: In the event any of the terms or provisions of these Terms and Conditions shall be held to be unenforceable, the remaining terms and provisions shall be unimpaired and the unenforceable term or provision shall be replaced by such enforceable term or provision as comes closest to the intention underlying the unenforceable term or provision. These Terms shall be subject to any other agreements you have entered into with TWB. These Terms, and your access to and use of the Available Data, shall be governed by the laws of the State of Massachusetts.
Any action against TWB arising from or relating to these Terms or your access to and use of the Available Data must be brought by you in state or federal court located in the State of Massachusetts. You consent to the jurisdiction and venue of the state and federal courts located within the State of Massachusetts for the adjudication of all claims arising from or relating to your access to and use of the Available Data and the provisions of these Terms.
107 changes: 107 additions & 0 deletions bengali/twb/v0.1.0/MODEL_CARD.md
@@ -0,0 +1,107 @@
# Model card for Bengali STT

Jump to section:

- [Model details](#model-details)
- [Intended use](#intended-use)
- [Performance Factors](#performance-factors)
- [Metrics](#metrics)
- [Training data](#training-data)
- [Training parameters](#training-parameters)
- [Evaluation data](#evaluation-data)
- [Ethical considerations](#ethical-considerations)
- [Caveats and recommendations](#caveats-and-recommendations)

## Model details

- Person and organization developing model: [Alp Öktem](https://alpoktem.github.io/) @[Clear Global/Translators without Borders](https://clearglobal.org/).
- Model language: Bengali / বাংলা / `bn` / `ben`
- Model date: June 9, 2021
- Model type: `Speech-to-Text`
- Model version: `v0.1.0`
- Compatible with 🐸 STT version: `v0.10.0a6`
- License: Custom
- Citation details: `@techreport{bengali-stt, author = {\"Oktem, Alp}, title = {Bengali STT 0.1}, institution = {Translators without Borders}, address = {\url{https://github.com/coqui-ai/STT-models}}, year = {2021}, month = {June}, number = {STT-BN-0.1} }`
- Official page: [https://gamayun.translatorswb.org/data/](https://gamayun.translatorswb.org/download/bengali-asr-model/)
- Where to send questions or comments about the model: You can open an issue on [`STT-models` issues](https://github.com/coqui-ai/STT-models/issues), start a new discussion on [`STT-models` discussions](https://github.com/coqui-ai/STT-models/discussions), or chat with us on [Gitter](https://gitter.im/coqui-ai/).

## Intended use

Speech-to-Text for the [Bengali Language](https://en.wikipedia.org/wiki/Bengali_language) on 16kHz, mono-channel audio.
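Because the model expects 16kHz, mono-channel audio, it can help to check a recording before transcribing it. The following is a minimal sketch using only the Python standard library; the function name `is_stt_compatible` is hypothetical, and the 16-bit sample width check is an assumption that is not stated in this model card.

```python
import wave


def is_stt_compatible(path, expected_rate=16000):
    """Return True if the WAV file is mono and sampled at the expected rate."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 1                  # mono-channel audio
            and wav.getframerate() == expected_rate  # 16kHz sample rate
            and wav.getsampwidth() == 2              # assumption: 16-bit PCM samples
        )
```

Files that fail this check would need to be resampled or downmixed with an audio tool before being passed to the model.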

## Performance Factors

Factors relevant to Speech-to-Text performance include but are not limited to speaker demographics, recording quality, and background noise. Read more about STT performance factors [here](https://stt.readthedocs.io/en/latest/DEPLOYMENT.html#how-will-a-model-perform-on-my-data).

## Metrics

STT models are usually evaluated in terms of their transcription accuracy, deployment Real-Time Factor, and model size on disk.

#### Transcription Accuracy

The following Word Error Rate (WER) and Character Error Rate (CER) were measured on the [Large Bengali ASR training data set](https://www.openslr.org/53/).

|Test Corpus|WER|CER|
|-----------|---|---|
|Large Bengali ASR training data set|30.6\%|11.0\%|
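For readers unfamiliar with these metrics, both are edit distances normalized by the length of the reference transcript: WER counts word-level edits and CER counts character-level edits. The sketch below illustrates the definitions; it is not the evaluation code used to produce the figures above, and the function names are hypothetical. It treats each Unicode code point as one character, which is an assumption that affects how Bengali combining signs are counted.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Corpus-level scores are commonly computed by summing edits and reference lengths over all utterances before dividing, rather than averaging per-utterance rates.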

#### Real-Time Factor

Real-Time Factor (RTF) is defined as `processing-time / length-of-audio`. The exact real-time factor of an STT model will depend on the hardware setup, so you may experience a different RTF.

Recorded average RTF on laptop CPU: ` `
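To measure RTF on your own hardware, you can time a transcription call and divide by the audio duration. This is a minimal sketch: `transcribe` stands for any callable that runs the model on the audio (a hypothetical placeholder), and `audio_seconds` is the clip length you supply.

```python
import time


def real_time_factor(transcribe, audio, audio_seconds):
    """Return processing-time / length-of-audio for one transcription call."""
    start = time.perf_counter()
    transcribe(audio)  # hypothetical callable wrapping the STT model
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds
```

An RTF below 1 means the audio was processed faster than real time. Averaging over several clips gives a more stable estimate than a single measurement.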

#### Model Size

- `bn-model.pbmm`: 189.3M
- `general-bn.scorer`: 71.9M

### Approaches to uncertainty and variability

Confidence scores and multiple paths from the decoding beam can be used to measure model uncertainty and provide multiple, variable transcripts for any processed audio.
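As a rough illustration, the helper below ranks candidate transcripts by confidence. It assumes a metadata object shaped like the one returned by the 🐸 STT Python package's `sttWithMetadata` call, that is, a `transcripts` list whose entries carry a `confidence` value and a list of `tokens` with a `text` field. The helper name `ranked_transcripts` is hypothetical, and the commented usage is a sketch that requires the model files listed above.

```python
def ranked_transcripts(metadata):
    """Return (confidence, text) pairs, most confident candidate first."""
    candidates = [
        (candidate.confidence, "".join(token.text for token in candidate.tokens))
        for candidate in metadata.transcripts
    ]
    return sorted(candidates, key=lambda pair: pair[0], reverse=True)


# Hypothetical usage (assumes the `stt` package and the model files are available):
# from stt import Model
# model = Model("bn-model.pbmm")
# model.enableExternalScorer("general-bn.scorer")
# metadata = model.sttWithMetadata(audio_int16, 3)  # audio_int16: 16kHz mono int16 samples
# for confidence, text in ranked_transcripts(metadata):
#     print(confidence, text)
```

Confidence values are most useful for comparing candidates for the same audio; treating them as calibrated probabilities would be an additional assumption.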

## Training data

The acoustic model was trained on top of an English STT model using the [Large Bengali ASR training data set](https://www.openslr.org/53/). The audio was converted to 16kHz WAV before training.

- Train size: 203067 samples, 199.99 hours
- Dev size: 10690 samples, 10.55 hours

The language model was trained on OSCAR and the Bengali portions of English-Bengali parallel corpora available from [OPUS](https://opus.nlpl.eu/).

- Lines: 782827
- Tokens: 13953256

## Training parameters

|Parameter|Value|
|---------|-----|
|Epochs|200|
|Drop source layers|2|
|Learning rate|0.001|
|Dropout rate|0.2|
|augment frequency_mask|[p=0.8,n=2:4,size=2:4]|
|augment time_mask|[p=0.8,n=2:4,size=10:50,domain=spectrogram] |
|Train/test/dev batch size|32|

## Evaluation data

The model was evaluated on a 2000-sample subset (1.84 hours) of the [Large Bengali ASR training data set](https://www.openslr.org/53/). Testing set filenames and transcriptions are included with the model.

## Ethical considerations

Deploying a Speech-to-Text model into any production setting has ethical implications. You should consider these implications before use.

### Demographic Bias

You should assume every machine learning model has demographic bias unless proven otherwise. For STT models, it is often the case that transcription accuracy is better for men than it is for women. If you are using this model in production, you should acknowledge this as a potential issue.

### Surveillance

Speech-to-Text may be misused to invade the privacy of others by recording and mining information from private conversations. This kind of individual privacy is protected by law in many countries. You should not assume consent to record and analyze private speech.

## Caveats and recommendations

Machine learning models (like this STT model) perform best on data that is similar to the data on which they were trained. Read about what to expect from an STT model with regard to your data [here](https://stt.readthedocs.io/en/latest/DEPLOYMENT.html#how-will-a-model-perform-on-my-data).

In most applications, it is recommended that you [train your own language model](https://stt.readthedocs.io/en/latest/LANGUAGE_MODEL.html) to improve transcription accuracy on your speech data.
75 changes: 75 additions & 0 deletions bengali/twb/v0.1.0/alphabet.txt
@@ -0,0 +1,75 @@

ি