Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

removed unnecessary quotationmarks from de-lang sentences #54

Merged
merged 3 commits into from
Jan 28, 2019

Conversation

simnotes
Copy link
Collaborator

Noticed multiple strange cases of sentences containing a lot of quotation marks in reoccurring patters in german language corpus.

Patterns found:

  • """content""content."
  • "content""content""content."
  • "content."

Examples

  • """Beim Wrestling ist alles nur Show"", behauptet Willi."
  • "Das Prinzip heißt ""teile und herrsche""."
  • "Probier's mal mit Gemütlichkeit!"

Only checked the french corpus as well, the problem did not seem to occur in it. So applied the fix in the de.py preprocessor. (Although I did not check any other language corpus, so your mileage may vary.)

@kdavis-mozilla
Copy link
Contributor

Are you sure that some of these examples

"Das Prinzip heißt ""teile und herrsche""."

shouldn't be fixed as

Das Prinzip heißt „teile und herrsche".

or maybe

Das Prinzip heißt «teile und herrsche».

@simnotes
Copy link
Collaborator Author

simnotes commented Jan 25, 2019

I gave it a good long thought... in my opinion the only valid options are:

Das Prinzip heißt „teile und herrsche".

or

Das Prinzip heißt teile und herrsche.

The first one is probably the spelling-wise correct version, but in the case of training a neural network for speech-to-text and other tasks I probably would use the second one as input, since the quotationmarks won't actually be spoken/pronounced. So it depends - if the main use of Common Voice Project is giving ppl a good dataset for training models on, I'd use the second one. In every other case, I'd use the first one. What do you think?

@kdavis-mozilla
Copy link
Contributor

kdavis-mozilla commented Jan 26, 2019

For speech to text I think it's currently hard for an engine to place quotes in the correct position. However, I don't think it's out of the realm of possibility for an engine to place quotes correctly in the near future, one to two years. (Actually I'd be surprised if its not possible in research code now.)

The logic of keeping the quotes in is that it future proofs the data set, allowing it to be of use now and in future.

One further question: Why do you think guillemets are never valid in German? Or are you only talking about the direction of the guillemets? In other words...

Das Prinzip heißt »teile und herrsche«.

Copy link
Contributor

@kdavis-mozilla kdavis-mozilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

I'm wondering, as in the discussion we've had, if it's possible to retain quotes that are correct.

For example,

"Das Prinzip heißt ""teile und herrsche""."

would be converted to

Das Prinzip heißt „teile und herrsche“.

I guess in another PR you or I could convert guillemets, « and », and single guillemets, ‹ and ›, to this form too, if there are any.

@kdavis-mozilla
Copy link
Contributor

Just grepped through the currently generated German csv's and found no guillemets, « and », or single guillemets, ‹ and ›. So the extra PR will no be required.

@simnotes
Copy link
Collaborator Author

simnotes commented Jan 27, 2019

Will incorporate the requested changes tomorrow so that conversion will be like

Das Prinzip heißt „teile und herrsche“.

As for the guillemets I found in the german book Duden (which is the standard reference for german spelling and grammar, that only inverted guillements like »…« are accepted, but those are only used in novel writing and nowhere else used. So since CV is about spoken language, there should not be any »…« in the dataset.

@simnotes
Copy link
Collaborator Author

Ok, I think now the changes are ready to be merged with master 😄

@kdavis-mozilla
Copy link
Contributor

Thanks for the updates! The changes look good to my eye.

However, I just ran them and realized that there are lots of other quote problems in de....

"""""""Sie werden doch nicht etwa"""", setzte ich zum Protest an, doch es war bereits zu spät."""
"""""""Das kann kein Dauerzustand sein"""", stellte ein Sprecher vor laufenden Kameras klar."""
"""""""Wollen wir nicht ein Stück Sacher Torte essen, wenn wir schon einmal in Wien sind?"""", schlug Toni vor."""
...

So I'm wondering if it makes more sense, and is possible, to solve them all at once! Or should we just solve them piecemeal?

What's your take on this?

@simnotes
Copy link
Collaborator Author

simnotes commented Jan 28, 2019

I noticed the same problems in clips.tsv.
However since clips.tsv will be loaded with the following method call in CorporaCreator

df = pd.read_csv('../corpora/de/output/de/train.tsv', 
        sep="\t",
        parse_dates=False,
        engine="python",
        encoding="utf-8",
        error_bad_lines=False,
        quotechar='"',
        quoting=csv.QUOTE_NONE,)

The sentence
"""""""Sie werden doch nicht etwa"""", setzte ich zum Protest an, doch es war bereits zu spät."""
will get stripped to
"""Sie werden doch nicht etwa"", setzte ich zum Protest an, doch es war bereits zu spät."
which in turn will get transformed by the discussed PR at hand to
Sie werden doch nicht etwa, setzte ich zum Protest an, doch es war bereits zu spät.

But since we made the decision to leave the quotationmarks in the dataset in the before mentioned example, we could argue to change it to
"Sie werden doch nicht etwa", setzte ich zum Protest an, doch es war bereits zu spät.

So... maybe I make one more change to my PR and we resolve it to be like the last example?

Another thing I noticed we should discuss:
The save method in corpus.py is currently the following:

dataframe.to_csv(
  path, sep="\t", header=True, index=False, encoding="utf-8", escapechar='"'
)

but in my opinion it should be

dataframe.to_csv(
  path, sep="\t", header=True, index=False, encoding="utf-8", escapechar='\\', quoting=csv.QUOTE_NONE
)

Because if we leave it, like it is, then all the formerly removed multi-quotes will return 😄
So sentences like
Das Prinzip heißt „teile und herrsche“.
will get turned to
"Das Prinzip heißt ""teile und herrsche""."
again on save. (The quotes will get escaped by another quotation-mark and the whole sentence will get quoted again, because it contains quotation-marks.)

@kdavis-mozilla
Copy link
Contributor

So... maybe I make one more change to my PR and we resolve it to be like the last example?

This sounds good to me

Nice catch on the switch from

dataframe.to_csv(
  path, sep="\t", header=True, index=False, encoding="utf-8", escapechar='"'
)

to

dataframe.to_csv(
  path, sep="\t", header=True, index=False, encoding="utf-8", escapechar='\\', quoting=csv.QUOTE_NONE
)

I can add another PR with this change. I'll ask for your review.

@simnotes
Copy link
Collaborator Author

I think we are done here 😄

Copy link
Contributor

@kdavis-mozilla kdavis-mozilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kdavis-mozilla kdavis-mozilla merged commit 63f9be1 into common-voice:master Jan 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants