Fixed #9 (Preprocessors do not have access to the user_id and sentence) #10

Merged
merged 1 commit into from
Dec 12, 2018

kdavis-mozilla
Contributor

No description provided.

@tilmankamp left a comment

What's the use case for having the user_id? Is any information accessible through this id? If so, why not pass that information as an object/hash?

@kdavis-mozilla
Contributor Author

See issue #9 (Preprocessors do not have access to the user_id and sentence).

@tilmankamp

OK - got it. But in practice, how would the preprocessor (or its author) get the sample path from a given user_id and sentence?
Another design problem: This approach ties all preprocessors to the "already recorded" use-case.

@kdavis-mozilla
Contributor Author

Common Voice will have an alpha release containing audio plus clips.tsv. Contributors will download the alpha release and then write a preprocessor for their language against it.

So preprocessor writers will have the alpha release, with audio, on their hard drive. While writing the preprocessor, they can hear the audio for a particular (user_id, sentence) pairing by looking up that pairing in the tsv and finding the corresponding audio clip.
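
For illustration, the lookup described above could be sketched roughly as follows. This is not code from this pull request: the column names `user_id`, `path`, and `sentence`, the helper `find_clips`, and the sample rows are all assumptions made for the example, and the real clips.tsv layout may differ.

```python
import csv
import io


def find_clips(tsv_file, user_id, sentence):
    """Return the audio paths of all rows matching a (user_id, sentence) pair."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        row["path"]  # assumed column holding the clip's audio file name
        for row in reader
        if row["user_id"] == user_id and row["sentence"] == sentence
    ]


# Tiny in-memory stand-in for clips.tsv; the rows are made up for illustration.
SAMPLE_TSV = (
    "user_id\tpath\tsentence\n"
    "u1\tclip_0001.mp3\tI am in room 2049\n"
    "u2\tclip_0002.mp3\tI am in room 2049\n"
    "u1\tclip_0003.mp3\tHello there\n"
)

matches = find_clips(io.StringIO(SAMPLE_TSV), "u1", "I am in room 2049")
print(matches)
```

With a real release on disk, the same function could be called on `open("clips.tsv", encoding="utf-8", newline="")` instead of the in-memory sample.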

Yes, this ties all preprocessors to the "already recorded" use case. That is the problem at hand, as issue #9 (Preprocessors do not have access to the user_id and sentence) illustrates. It is impossible to always handle the "not recorded" use case correctly, because we cannot know how to convert "I am in room 2049" to words without hearing the audio.
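
To make the ambiguity concrete, here is a minimal sketch of a preprocessor that receives both the user_id and the sentence. The function name `preprocess`, the `HEARD` table, the user ids, and the convention of returning `None` to drop a clip are hypothetical and chosen only for illustration; they are not taken from this pull request.

```python
# Hypothetical per-clip readings, filled in by the preprocessor author after
# listening to the audio for each (user_id, sentence) pair.
HEARD = {
    ("u1", "I am in room 2049"): "I am in room twenty forty nine",
    ("u2", "I am in room 2049"): "I am in room two zero four nine",
}


def preprocess(user_id, sentence):
    """Return the transcript for one clip, or None if it cannot be resolved."""
    key = (user_id, sentence)
    if key in HEARD:
        # The author has heard this clip, so the spoken form is known.
        return HEARD[key]
    if any(ch.isdigit() for ch in sentence):
        # Digits with no known reading are ambiguous; assume the clip is dropped.
        return None
    # No digits: the written sentence is taken as the transcript.
    return sentence
```

The same written sentence maps to different transcripts for different speakers, which is only possible when the preprocessor can see who recorded the clip.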

@tilmankamp

In the "not yet recorded" case we would just define how to speak "I am in room 2049".
LGTM

@kdavis-mozilla
Contributor Author

Yes, but then we have to ask Common Voice (again) not to put digits into the sentences to be read. Hopefully they'll listen this time 😄

@kdavis-mozilla merged commit 6f02488 into master Dec 12, 2018

@tilmankamp

In my understanding, the preprocessors should also be used to preprocess sentences before the actual recording, or even before new sentences are merged into the sentence collection.

@tilmankamp

They could even introduce a commit-hook that prohibits committing sentences with numbers.
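
A rough sketch of what such a check might look like, assuming sentences are stored one per line in plain-text files. The function names and the way file paths are passed in are hypothetical; nothing here describes how Common Voice actually stores or validates sentences.

```python
def lines_with_digits(lines):
    """Return (line_number, text) for every sentence containing a digit."""
    return [
        (number, text.rstrip("\n"))
        for number, text in enumerate(lines, start=1)
        # str.isdigit() also matches non-ASCII digit characters.
        if any(ch.isdigit() for ch in text)
    ]


def main(paths):
    """Check each sentence file; return 1 (reject) if any digit is found."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for number, text in lines_with_digits(handle):
                print(f"{path}:{number}: contains digits: {text}")
                failed = True
    return 1 if failed else 0
```

A Git pre-commit hook aborts the commit when it exits with a non-zero status, so a hook script built on this sketch could end with `sys.exit(main(sys.argv[1:]))`, given some way of collecting the staged sentence files.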

@kdavis-mozilla
Contributor Author

This is only to be used after the fact to create corpora, i.e. this is used once everything is recorded.

@kdavis-mozilla
Contributor Author

Yes, they could introduce a commit-hook, but they haven't ☹️

@kdavis-mozilla deleted the issue9 branch December 13, 2018 05:46
2 participants