-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
removed unnecessary quotationmarks from de-lang sentences #54
Conversation
Are you sure that some of these examples
shouldn't be fixed as
or maybe
|
I gave it a good long thought... in my opinion the only valid options are:
or
The first one is probably the spelling-wise correct version, but in the case of training a neural network for speech-to-text and other tasks I probably would use the second one as input, since the quotationmarks won't actually be spoken/pronounced. So it depends - if the main use of Common Voice Project is giving ppl a good dataset for training models on, I'd use the second one. In every other case, I'd use the first one. What do you think? |
For speech to text I think it's currently hard for an engine to place quotes in the correct position. However, I don't think it's out of the realm of possibility for an engine to place quotes correctly in the near future, one to two years. (Actually I'd be surprised if its not possible in research code now.) The logic of keeping the quotes in is that it future proofs the data set, allowing it to be of use now and in future. One further question: Why do you think guillemets are never valid in German? Or are you only talking about the direction of the guillemets? In other words...
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
I'm wondering, as in the discussion we've had, if it's possible to retain quotes that are correct.
For example,
"Das Prinzip heißt ""teile und herrsche""."
would be converted to
Das Prinzip heißt „teile und herrsche“.
I guess in another PR you or I could convert guillemets, « and », and single guillemets, ‹ and ›, to this form too, if there are any.
Just grepped through the currently generated German csv's and found no guillemets, « and », or single guillemets, ‹ and ›. So the extra PR will no be required. |
Will incorporate the requested changes tomorrow so that conversion will be like
As for the guillemets I found in the german book Duden (which is the standard reference for german spelling and grammar, that only inverted guillements like »…« are accepted, but those are only used in novel writing and nowhere else used. So since CV is about spoken language, there should not be any »…« in the dataset. |
…tent" to content"content"content
Ok, I think now the changes are ready to be merged with master 😄 |
Thanks for the updates! The changes look good to my eye. However, I just ran them and realized that there are lots of other quote problems in de....
So I'm wondering if it makes more sense, and is possible, to solve them all at once! Or should we just solve them piecemeal? What's your take on this? |
I noticed the same problems in clips.tsv. df = pd.read_csv('../corpora/de/output/de/train.tsv',
sep="\t",
parse_dates=False,
engine="python",
encoding="utf-8",
error_bad_lines=False,
quotechar='"',
quoting=csv.QUOTE_NONE,) The sentence But since we made the decision to leave the quotationmarks in the dataset in the before mentioned example, we could argue to change it to So... maybe I make one more change to my PR and we resolve it to be like the last example? Another thing I noticed we should discuss: dataframe.to_csv(
path, sep="\t", header=True, index=False, encoding="utf-8", escapechar='"'
) but in my opinion it should be dataframe.to_csv(
path, sep="\t", header=True, index=False, encoding="utf-8", escapechar='\\', quoting=csv.QUOTE_NONE
) Because if we leave it, like it is, then all the formerly removed multi-quotes will return 😄 |
This sounds good to me Nice catch on the switch from dataframe.to_csv(
path, sep="\t", header=True, index=False, encoding="utf-8", escapechar='"'
) to dataframe.to_csv(
path, sep="\t", header=True, index=False, encoding="utf-8", escapechar='\\', quoting=csv.QUOTE_NONE
) I can add another PR with this change. I'll ask for your review. |
I think we are done here 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Noticed multiple strange cases of sentences containing a lot of quotation marks in reoccurring patters in german language corpus.
Patterns found:
Examples
Only checked the french corpus as well, the problem did not seem to occur in it. So applied the fix in the de.py preprocessor. (Although I did not check any other language corpus, so your mileage may vary.)