Added flag for running *only* one or more languages #56

Merged: 2 commits, Jan 28, 2019
8 changes: 7 additions & 1 deletion README.rst
@@ -19,12 +19,18 @@ Usage
===========


-Given the ``clips.tsv`` file dumped from the Common Voice database one creates a corpora in the directory ``corpora`` as follows
+Given the ``clips.tsv`` file dumped from the Common Voice database, you can create a corpus (for each language in the ``clips.tsv`` file) as follows:

``CorporaCreator$ create-corpora -d corpora -f clips.tsv``

This will create the corpora in the directory ``corpora`` from the ``clips.tsv`` file.

If you would like to create corpora for only some language(s), you can pass the ``--langs`` flag as follows:

``CorporaCreator$ create-corpora -d corpora -f clips.tsv --langs en fr``

This will create the corpora only for English and French.

Each created corpus will contain the files ``valid.tsv``, containing the validated clips; ``invalid.tsv``, containing the invalidated clips; and ``other.tsv``, containing clips that don't have sufficient votes to be considered valid or invalid. In addition, it will contain the files ``train.tsv``, the valid clips in the training set; ``dev.tsv``, the valid clips in the validation set; and ``test.tsv``, the valid clips in the test set.

The split of ``valid.tsv`` into ``train.tsv``, ``dev.tsv``, and ``test.tsv`` is done such that the number of clips in ``dev.tsv`` or ``test.tsv`` is a "statistically significant" sample relative to the number of clips in ``train.tsv``. More specifically, if the population size is the number of clips in ``train.tsv``, then the number of clips in ``dev.tsv`` or ``test.tsv`` is the sample size required for a confidence level of 99% and a margin of error of 1% for the ``train.tsv`` population size.
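
The sample-size rule described in the README can be illustrated with the standard finite-population formula. This is only a sketch: the function name, the z-score of 2.576 for a 99% confidence level, and the worst-case proportion of 0.5 are assumptions made for illustration, and the project's own implementation (including how it resolves the train/dev/test sizes against each other) may differ.

```python
import math


def sample_size(population_size, z_score=2.576, margin_of_error=0.01, proportion=0.5):
    """Sample size for a finite population (hypothetical helper).

    z_score=2.576 corresponds to a 99% confidence level; proportion=0.5 is
    the most conservative choice. Both are assumptions of this sketch.
    """
    if population_size <= 0:
        return 0
    # Sample size required for an effectively infinite population.
    n0 = (z_score ** 2) * proportion * (1 - proportion) / (margin_of_error ** 2)
    # Finite-population correction shrinks the sample for small populations.
    n = n0 / (1 + (n0 - 1) / population_size)
    # Never ask for more clips than the population contains.
    return min(population_size, math.ceil(n))


for train_clips in (100, 10_000, 1_000_000):
    print(train_clips, sample_size(train_clips))
```

Under these assumptions the required sample grows quickly for small populations and then levels off, so ``dev.tsv`` and ``test.tsv`` stay at a bounded size even when ``train.tsv`` is very large.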
7 changes: 7 additions & 0 deletions src/corporacreator/argparse.py
@@ -66,6 +66,13 @@ def parse_args(args):
         help="Path to the Common Voice tsv for all languages",
         dest="tsv_filename",
     )
+    parser.add_argument(
+        "-l",
+        "--langs",
+        required=False,
+        nargs='+',
+        help="Which language(s) you want to make corpora for",
+    )
     parser.add_argument(
         "-d",
         "--directory",
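
To show what the new ``--langs`` argument produces, here is a small standalone sketch. It rebuilds only this one argument on a fresh parser (the ``prog`` name is taken from the README; everything else about the real parser is omitted) and prints the parsed value with and without the flag.

```python
import argparse

# A stand-in parser that defines only the argument added in this PR.
parser = argparse.ArgumentParser(prog="create-corpora")
parser.add_argument(
    "-l",
    "--langs",
    required=False,
    nargs="+",
    help="Which language(s) you want to make corpora for",
)

with_flag = parser.parse_args(["--langs", "en", "fr"])
without_flag = parser.parse_args([])

print(with_flag.langs)     # ['en', 'fr']
print(without_flag.langs)  # None
```

With ``nargs='+'`` the value is a plain Python list, and it is ``None`` when the flag is omitted, which is why the caller can test it with a simple ``if``.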
13 changes: 12 additions & 1 deletion src/corporacreator/corpora.py
@@ -7,6 +7,7 @@

 from corporacreator import Corpus
 from corporacreator.preprocessors import common
+import argparse

 _logger = logging.getLogger(__name__)

@@ -42,7 +43,17 @@ def create(self):
         corpora_data[["sentence", "up_votes", "down_votes"]] = corpora_data[
             ["sentence", "up_votes", "down_votes"]
         ].swifter.apply(func=lambda arg: common_wrapper(*arg), axis=1)
-        for locale in corpora_data.locale.unique():
+        if self.args.langs:
+            # check if all languages provided at command line are actually
+            # in the clips.tsv file, if not, throw error
+            if set(self.args.langs).issubset(corpora_data.locale.unique()):  # langs is a list; issubset is a set method
+                locales = self.args.langs
+            else:
+                raise argparse.ArgumentTypeError("ERROR: You have requested languages which do not exist in clips.tsv")
+        else:
+            locales = corpora_data.locale.unique()
+
+        for locale in locales:
Contributor

You should check that self.args.langs is a subset of corpora_data.locale.unique() and print an error message if it's not and also stop the program.

Contributor Author

@kdavis-mozilla -- requested changes made to README and catch if entered languages are actually subset of possible languages...

also learned that Python has a x.issubset(y) function:)

             _logger.info("Selecting %s corpus data..." % locale)
             corpus_data = corpora_data.loc[
                 lambda df: df.locale == locale,
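
The subset check requested in the review can be sketched in isolation as follows. The helper name ``select_locales`` and the use of ``ValueError`` are assumptions for illustration, not the project's code, and a plain list stands in for the ``locale`` column so the example runs without pandas. Note that ``issubset`` is a method of ``set``, so the list produced by ``--langs`` has to be converted before calling it.

```python
def select_locales(requested, available_locales):
    """Return the locales to process (hypothetical helper).

    `requested` mimics args.langs: a list of locale codes, or None when the
    flag is omitted. `available_locales` stands in for the locale column.
    """
    # Deduplicate while keeping first-seen order.
    available = list(dict.fromkeys(available_locales))
    if not requested:
        return available
    # issubset accepts any iterable as its argument, but the receiver must
    # be a set, so the requested list is converted first.
    if not set(requested).issubset(available):
        missing = sorted(set(requested) - set(available))
        raise ValueError("Languages not found in clips.tsv: " + ", ".join(missing))
    return list(requested)


print(select_locales(["en", "fr"], ["en", "de", "fr", "en"]))  # ['en', 'fr']
print(select_locales(None, ["en", "de", "fr", "en"]))          # ['en', 'de', 'fr']
```

Reporting which locales are missing, rather than only that some are, is a design choice of this sketch; it makes a typo such as a wrong locale code easier to spot.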