Fixed #9 (Preprocessors do not have access to the user_id and sentence) #10

kdavis-mozilla · 2018-12-12T05:57:39Z

No description provided.

tilmankamp

What's the use case for having the user_id? Is any information accessible through this id? If so, why not pass that information as an object/hash?

kdavis-mozilla · 2018-12-12T08:11:28Z

See issue #9 (Preprocessors do not have access to the user_id and sentence).

tilmankamp · 2018-12-12T08:33:02Z

OK - got it. But how would the preprocessor/user get the sample path with a given user_id and sentence practically?
Another design problem: This approach ties all preprocessors to the "already recorded" use-case.

kdavis-mozilla · 2018-12-12T08:42:22Z

Common Voice will have an alpha release. This alpha release will contain audio + clips.tsv. Contributors will download the alpha release and then write a preprocessor for their language against it.

So preprocessor writers will have the alpha release with audio on their hard drive. When they write the preprocessor, they can hear the audio of a particular (user_id, sentence) pairing by looking up that pairing in the tsv and finding the corresponding audio clip.
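As a rough illustration of that lookup, here is a minimal sketch. It assumes clips.tsv is tab-separated with a header row and columns named `user_id`, `sentence`, and `path`; those column names, the helper `find_clips`, and the sample rows are assumptions made for the example, not something confirmed in this thread.

```python
import csv
import io


def find_clips(tsv_file, user_id, sentence):
    """Return the clip paths listed for one (user_id, sentence) pairing.

    `tsv_file` is any open text file object holding tab-separated rows.
    The column names used here are assumed, not taken from the release.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        row["path"]
        for row in reader
        if row["user_id"] == user_id and row["sentence"] == sentence
    ]


# Tiny in-memory stand-in for clips.tsv (made-up rows).
SAMPLE_TSV = (
    "user_id\tsentence\tpath\n"
    "abc123\tI am in room 2049\tclips/0001.mp3\n"
    "def456\tHello world\tclips/0002.mp3\n"
)

matches = find_clips(io.StringIO(SAMPLE_TSV), "abc123", "I am in room 2049")
print(matches)
```

With the real file, the same function could be called as `find_clips(open("clips.tsv", encoding="utf-8"), user_id, sentence)`, adjusting the column names to whatever the release actually uses.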

Yes, this ties all preprocessors to the "already recorded" use-case. That is the problem at hand, as issue #9 (Preprocessors do not have access to the user_id and sentence) illustrates. It is impossible to always handle the "not recorded" use-case correctly, as we never know how to convert "I am in room 2049" to words without hearing the audio.
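To make the point concrete, here is a hypothetical sketch of why the pairing matters: two speakers can read the same written sentence differently, so the correction has to be keyed by (user_id, sentence). The function name `preprocess`, the `HEARD` table, and its entries are invented for illustration and are not the project's actual preprocessor interface.

```python
# Hypothetical per-clip transcriptions, written down after listening to the audio.
HEARD = {
    ("abc123", "I am in room 2049"): "I am in room twenty forty nine",
    ("def456", "I am in room 2049"): "I am in room two zero four nine",
}


def preprocess(user_id, sentence):
    """Return what was actually spoken for this clip, if known.

    Falls back to the written sentence when no correction is recorded.
    """
    return HEARD.get((user_id, sentence), sentence)


print(preprocess("abc123", "I am in room 2049"))
print(preprocess("def456", "I am in room 2049"))
```

Without the user_id, both clips would have to map to a single transcription, and at least one of them would be wrong.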

tilmankamp · 2018-12-12T08:57:19Z

In the "not yet recorded" case we would just define how to speak "I am in room 2049".
LGTM

kdavis-mozilla · 2018-12-12T09:03:19Z

Yes, but then we have to ask Common Voice (again) not to put digits into the sentences to be read. Hopefully they'll listen this time 😄

tilmankamp · 2018-12-12T09:06:18Z

In my understanding the preprocessors should also be used to preprocess sentences before the actual recording - if not even before merging new sentences into the sentence collection.

tilmankamp · 2018-12-12T09:07:42Z

They could even introduce a commit-hook that prohibits committing sentences with numbers.
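A minimal sketch of the check such a hook might run, assuming sentences are stored one per line in plain-text files. The function names and the file layout are assumptions for illustration; nothing here reflects an existing Common Voice hook.

```python
def sentences_with_digits(lines):
    """Return (line_number, sentence) pairs for lines containing any digit."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if any(ch.isdigit() for ch in line)
    ]


def check_files(paths):
    """Report offending sentences and return a hook-style exit code (0 = ok)."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for number, sentence in sentences_with_digits(handle):
                print(f"{path}:{number}: contains digits: {sentence}")
                failed = True
    return 1 if failed else 0


print(sentences_with_digits(["I am in room 2049\n", "Hello world\n"]))
```

A pre-commit script could pass the staged sentence files to `check_files` and exit with its return value, so a non-zero result blocks the commit.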

kdavis-mozilla · 2018-12-12T09:08:43Z

This is only to be used after the fact to create corpora, i.e. this is used once everything is recorded.

kdavis-mozilla · 2018-12-12T09:09:38Z

Yes, they could introduce a commit-hook, but they haven't ☹️

Fixed #9 (Preprocessors do not have access to the user_id and sentence)

567dade

kdavis-mozilla requested a review from tilmankamp December 12, 2018 05:57

tilmankamp reviewed Dec 12, 2018

kdavis-mozilla merged commit 6f02488 into master Dec 12, 2018

kdavis-mozilla deleted the issue9 branch December 13, 2018 05:46
