removed unnecessary quotationmarks from de-lang sentences #54

simnotes · 2019-01-24T14:14:58Z

Noticed multiple strange cases of sentences containing a lot of quotation marks in reoccurring patters in german language corpus.

Patterns found:

"""content""content."
"content""content""content."
"content."

Examples

"""Beim Wrestling ist alles nur Show"", behauptet Willi."
"Das Prinzip heißt ""teile und herrsche""."
"Probier's mal mit Gemütlichkeit!"

Only checked the french corpus as well, the problem did not seem to occur in it. So applied the fix in the de.py preprocessor. (Although I did not check any other language corpus, so your mileage may vary.)

kdavis-mozilla · 2019-01-24T16:37:50Z

Are you sure that some of these examples

"Das Prinzip heißt ""teile und herrsche""."

shouldn't be fixed as

Das Prinzip heißt „teile und herrsche".

or maybe

Das Prinzip heißt «teile und herrsche».

simnotes · 2019-01-25T09:34:42Z

I gave it a good long thought... in my opinion the only valid options are:

Das Prinzip heißt „teile und herrsche".

or

Das Prinzip heißt teile und herrsche.

The first one is probably the spelling-wise correct version, but in the case of training a neural network for speech-to-text and other tasks I probably would use the second one as input, since the quotationmarks won't actually be spoken/pronounced. So it depends - if the main use of Common Voice Project is giving ppl a good dataset for training models on, I'd use the second one. In every other case, I'd use the first one. What do you think?

kdavis-mozilla · 2019-01-26T11:42:52Z

For speech to text I think it's currently hard for an engine to place quotes in the correct position. However, I don't think it's out of the realm of possibility for an engine to place quotes correctly in the near future, one to two years. (Actually I'd be surprised if its not possible in research code now.)

The logic of keeping the quotes in is that it future proofs the data set, allowing it to be of use now and in future.

One further question: Why do you think guillemets are never valid in German? Or are you only talking about the direction of the guillemets? In other words...

Das Prinzip heißt »teile und herrsche«.

kdavis-mozilla

Thanks for the PR!

I'm wondering, as in the discussion we've had, if it's possible to retain quotes that are correct.

For example,

"Das Prinzip heißt ""teile und herrsche""."

would be converted to

Das Prinzip heißt „teile und herrsche“.

I guess in another PR you or I could convert guillemets, « and », and single guillemets, ‹ and ›, to this form too, if there are any.

kdavis-mozilla · 2019-01-27T06:39:26Z

Just grepped through the currently generated German csv's and found no guillemets, « and », or single guillemets, ‹ and ›. So the extra PR will no be required.

simnotes · 2019-01-27T12:47:09Z

Will incorporate the requested changes tomorrow so that conversion will be like

Das Prinzip heißt „teile und herrsche“.

As for the guillemets I found in the german book Duden (which is the standard reference for german spelling and grammar, that only inverted guillements like »…« are accepted, but those are only used in novel writing and nowhere else used. So since CV is about spoken language, there should not be any »…« in the dataset.

…tent" to content"content"content

simnotes · 2019-01-28T09:15:19Z

Ok, I think now the changes are ready to be merged with master 😄

kdavis-mozilla · 2019-01-28T11:53:17Z

Thanks for the updates! The changes look good to my eye.

However, I just ran them and realized that there are lots of other quote problems in de....

"""""""Sie werden doch nicht etwa"""", setzte ich zum Protest an, doch es war bereits zu spät."""
"""""""Das kann kein Dauerzustand sein"""", stellte ein Sprecher vor laufenden Kameras klar."""
"""""""Wollen wir nicht ein Stück Sacher Torte essen, wenn wir schon einmal in Wien sind?"""", schlug Toni vor."""
...

So I'm wondering if it makes more sense, and is possible, to solve them all at once! Or should we just solve them piecemeal?

What's your take on this?

simnotes · 2019-01-28T12:55:35Z

I noticed the same problems in clips.tsv.
However since clips.tsv will be loaded with the following method call in CorporaCreator

df = pd.read_csv('../corpora/de/output/de/train.tsv', 
        sep="\t",
        parse_dates=False,
        engine="python",
        encoding="utf-8",
        error_bad_lines=False,
        quotechar='"',
        quoting=csv.QUOTE_NONE,)

The sentence
"""""""Sie werden doch nicht etwa"""", setzte ich zum Protest an, doch es war bereits zu spät."""
will get stripped to
"""Sie werden doch nicht etwa"", setzte ich zum Protest an, doch es war bereits zu spät."
which in turn will get transformed by the discussed PR at hand to
Sie werden doch nicht etwa, setzte ich zum Protest an, doch es war bereits zu spät.

But since we made the decision to leave the quotationmarks in the dataset in the before mentioned example, we could argue to change it to
"Sie werden doch nicht etwa", setzte ich zum Protest an, doch es war bereits zu spät.

So... maybe I make one more change to my PR and we resolve it to be like the last example?

Another thing I noticed we should discuss:
The save method in corpus.py is currently the following:

dataframe.to_csv(
  path, sep="\t", header=True, index=False, encoding="utf-8", escapechar='"'
)

but in my opinion it should be

dataframe.to_csv(
  path, sep="\t", header=True, index=False, encoding="utf-8", escapechar='\\', quoting=csv.QUOTE_NONE
)

Because if we leave it, like it is, then all the formerly removed multi-quotes will return 😄
So sentences like
Das Prinzip heißt „teile und herrsche“.
will get turned to
"Das Prinzip heißt ""teile und herrsche""."
again on save. (The quotes will get escaped by another quotation-mark and the whole sentence will get quoted again, because it contains quotation-marks.)

kdavis-mozilla · 2019-01-28T13:56:18Z

So... maybe I make one more change to my PR and we resolve it to be like the last example?

This sounds good to me

Nice catch on the switch from

dataframe.to_csv(
  path, sep="\t", header=True, index=False, encoding="utf-8", escapechar='"'
)

to

dataframe.to_csv(
  path, sep="\t", header=True, index=False, encoding="utf-8", escapechar='\\', quoting=csv.QUOTE_NONE
)

I can add another PR with this change. I'll ask for your review.

simnotes · 2019-01-28T14:04:38Z

I think we are done here 😄

kdavis-mozilla

LGTM

removed unnecessary quotationmarks from de-lang sentences

7202ba0

kdavis-mozilla suggested changes Jan 26, 2019

View reviewed changes

changed quote-replacement behaviour of pattern "content""content""con…

82b6d79

…tent" to content"content"content

change quote-replacement of pattern """content""content"

db5f902

kdavis-mozilla approved these changes Jan 28, 2019

View reviewed changes

kdavis-mozilla merged commit 63f9be1 into common-voice:master Jan 28, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

removed unnecessary quotationmarks from de-lang sentences #54

removed unnecessary quotationmarks from de-lang sentences #54

simnotes commented Jan 24, 2019

kdavis-mozilla commented Jan 24, 2019

simnotes commented Jan 25, 2019 •

edited

Loading

kdavis-mozilla commented Jan 26, 2019 •

edited

Loading

kdavis-mozilla left a comment

kdavis-mozilla commented Jan 27, 2019

simnotes commented Jan 27, 2019 •

edited

Loading

simnotes commented Jan 28, 2019

kdavis-mozilla commented Jan 28, 2019

simnotes commented Jan 28, 2019 •

edited

Loading

kdavis-mozilla commented Jan 28, 2019

simnotes commented Jan 28, 2019

kdavis-mozilla left a comment

removed unnecessary quotationmarks from de-lang sentences #54

removed unnecessary quotationmarks from de-lang sentences #54

Conversation

simnotes commented Jan 24, 2019

kdavis-mozilla commented Jan 24, 2019

simnotes commented Jan 25, 2019 • edited Loading

kdavis-mozilla commented Jan 26, 2019 • edited Loading

kdavis-mozilla left a comment

Choose a reason for hiding this comment

kdavis-mozilla commented Jan 27, 2019

simnotes commented Jan 27, 2019 • edited Loading

simnotes commented Jan 28, 2019

kdavis-mozilla commented Jan 28, 2019

simnotes commented Jan 28, 2019 • edited Loading

kdavis-mozilla commented Jan 28, 2019

simnotes commented Jan 28, 2019

kdavis-mozilla left a comment

Choose a reason for hiding this comment

simnotes commented Jan 25, 2019 •

edited

Loading

kdavis-mozilla commented Jan 26, 2019 •

edited

Loading

simnotes commented Jan 27, 2019 •

edited

Loading

simnotes commented Jan 28, 2019 •

edited

Loading