Empty string sentence in cv-corpus-5-2020-06-22/en/test.tsv #108

Open
antimora opened this issue Jul 22, 2020 · 8 comments

Comments

@antimora

While processing entries from cv-corpus-5-2020-06-22/en/test.tsv, I have discovered an empty string sentence ("") on line #557 referencing common_voice_en_16759015.mp3. This entry also exists in validated.tsv. I haven't checked if there are more of the same type of errors.

@kdavis-mozilla
Contributor

Strange, I thought the code prevented this, but there it is.

Oh, I think I see what happened. The string is not empty; it contains two quote marks.

It looks like the string originally was something like...

"<b> </b>..."

which contains HTML tags, a non-printable character " " (&#160;), and likely other characters removed by common.py#L69.

After the HTML and other stuff was removed it was then turned into...

""

which, because it still contains the two quote characters, makes it through the check at corpus.py#L55 that should remove empty strings.
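
A minimal sketch of how that can happen. `strip_markup` and `is_empty` are hypothetical stand-ins written for illustration; they are not the actual code at common.py#L69 or corpus.py#L55, and the input string is only a guess at the shape of the original:

```python
import re

def strip_markup(sentence: str) -> str:
    """Hypothetical cleaner: drop HTML tags and non-breaking spaces, then trim."""
    no_tags = re.sub(r"<[^>]*>", "", sentence)
    return no_tags.replace("\u00a0", "").strip()

def is_empty(sentence: str) -> bool:
    """Naive emptiness check: only a zero-length string counts as empty."""
    return len(sentence) == 0

raw = '"<b>\u00a0</b>"'        # assumed shape of the original sentence
cleaned = strip_markup(raw)    # the two quote characters survive the cleaning
print(repr(cleaned), is_empty(cleaned))
```

The cleaned value is two characters long, so a length-based check does not see it as empty.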

Have you listened to common_voice_en_16759015.mp3? Maybe that will give us more info on what the string originally was and how it got validated!?

@kdavis-mozilla
Contributor

Just listened to it. It's someone reading back HTML tags.

I guess it was originally something like

"<HTML>one equals<em>"

and common.py#L69 turned that into

""

One nice thing is that's the only occurrence of this problem in the en test set and it doesn't occur in the en dev or train set.

@kdavis-mozilla
Contributor

I guess maybe the solution is to add a language-specific preprocessor that removes strings that are just ""?
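
Something along these lines, as a rough sketch only. The function name `is_effectively_empty` is made up here and nothing like it exists in CorporaCreator as far as this thread shows:

```python
def is_effectively_empty(sentence: str) -> bool:
    """Treat a sentence as empty if only quotes and whitespace remain."""
    # str.strip with an argument removes any of the listed characters
    # from both ends, so '""', '" "' and '   ' all reduce to ''.
    return sentence.strip('" \t\r\n\u00a0') == ""

sentences = ['""', '"Hello there"', "   ", "Hello there", '" "']
kept = [s for s in sentences if not is_effectively_empty(s)]
print(kept)
```

This only drops quote-only entries; sentences that are kept are left exactly as they were.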

@antimora
Author

Thanks for looking into this.

Couldn't the solution be more general than removing "" (double quotes)?

If we could strip the beginning and ending quotes, then the preprocessor would have caught "empty" string sentences. To me, these two lines are the same:

"Hello there"
Hello there

I see plenty of instances where transcripts start and end with quotes. Also, from a developer's perspective, it is confusing to see that some transcripts are quoted and others are not.

Related to the quote topic, I have noticed that many transcripts contain "" (doubled quotes). Is this an artifact of HTML cleansing logic similar to what you described? Could we also collapse doubled quotes, just as we do with double spaces? I saw specific logic for that in the common.py preprocessor.

Here are examples of doubled quotes from the test set:


"It was also known as the ""Sunflower""."
"Alston commented that he felt the cartoonist ""might have had some racial intent""."
"Karina Smirnoff of ""Dancing With The Stars"" hosted the following month."
"""The wind told me that you know about love"" the boy said to the sun."
"""Like everybody learns,"" he said."
"""Bambalio"" refers to a tendency to stammer."
"""I can work for the rest of today,"" the boy answered."
"The episode ""Father's Day"" depicts two younger versions of Jackie also played by Coduri."
"""Fatima,"" the girl said, averting her eyes."
"""This desert was once a sea,"" he said."
"It included a new production of ""Passion"" directed by Jamie Lloyd."
"He jumped up and turned quickly to face the imagined terror screaming ""Get back!"""
"You might hear ""font families"" more than ""typefaces"", even though they could mean the same thing."
"""Getting to play someone as unrestricted as a vampire is a thrill,"" she says."
"""Good-bye,"" said the boy."
"Philip then tells his parents that he was suspended for ""singing"" the National Anthem."
"It was later revealed that the letter was a prank concocted by ""The eXile""."
"The launch was flawless; all systems were ""go"", except for Doctor Wang's experiment."
"For centuries after her death, Welshmen cried-out ""Revenge for Gwenllian"" when engaging in battle."
"""And I'm certain you'll find it,"" the alchemist said."
"System B, however, does not depend explicitly on ""t"" so it is time-invariant."
"Beachley narrates the Seven Network factual series ""Beach Cops""."
"""Let's stop this,"" another commander said."
"Anna Austen asked about the acceptation of the word ""alliteration""."
"""Should I understand the Emerald Tablet?"" the boy asked."
"This bridge is unofficially referred to as ""Blackwater Bridge"" by Coalition Forces operating there."
"""This is the first phase of the job,"" he said."

@kdavis-mozilla
Contributor

In my experience

"Hello there"

and

Hello there

are not spoken in the same manner.

The first, with quotes, is spoken with a bit more inflection, with a rising tone on the first word to signal that the speaker is quoting someone else, whereas the second is spoken with no such effect. Stripping the quotes thus removes information.

Doubled quotes generally indicate escaped quotes, though this may not be the case for all doubled quotes in the text. For example,

"It included a new production of ""Passion"" directed by Jamie Lloyd."

has escaped quotes around the word "Passion".
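
To illustrate the escaping convention, here is a small sketch that uses Python's `csv` module as a stand-in reader. The file names `a.mp3`, `b.mp3`, and `c.mp3` are made up, and this is not necessarily how any given consumer of the dataset reads the TSV:

```python
import csv
import io

tsv = (
    "path\tsentence\n"
    'a.mp3\t"It included a new production of ""Passion"" directed by Jamie Lloyd."\n'
    'b.mp3\t"Hello there"\n'
    "c.mp3\tHello there\n"
)

# Splitting on tabs by hand leaves the field quoting and the doubled quotes in place.
raw_field = tsv.splitlines()[1].split("\t")[1]

# A CSV-aware reader removes the field quoting and un-doubles the escaped quotes.
rows = list(csv.reader(io.StringIO(tsv), delimiter="\t"))
sentences = [row[1] for row in rows[1:]]

print(raw_field)
print(sentences)
```

Note that to a CSV-aware reader the quotes wrapping a whole field are part of the file format, so the `b.mp3` and `c.mp3` rows parse to the same text.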

@antimora
Author

In that case, this needs to be made clearer (in documentation or instructions), because from the context neither the reader nor the validator would know that quoted text should be read any differently. But I agree that if the transcript contains quoted text, then it is read differently: He said "hello there", for example.

However, I still believe most quotes surrounding the transcripts are text processing artifacts of some sort. These two transcripts from my previous comment, for instance, contain quotes within quotes. If quoted text really is read differently, as you said, then how should quotes within quotes be read?

"""Getting to play someone as unrestricted as a vampire is a thrill,"" she says."
"""Good-bye,"" said the boy."

I get that, in general, as much information as possible should be preserved, but in these cases I believe readers, validators, and developers regard quotes at the beginning and end of a transcript as nothing more than quotes surrounding the text.

Perhaps this is not the right place to address the quote issue; it should probably be addressed somewhere upstream. If you could point me in the right direction, I'll be happy to follow up.

@kdavis-mozilla
Contributor

Just as we don't document that one raises the tone of one's voice when reading a question, we also don't document the change in intonation when reading a quotation. This is simply part of what's entailed in "reading aloud".

However, I agree with you in that I also do not believe most quotes surrounding the various sentences are there to indicate that the sentence is a quotation. In the majority of quoted text, e.g.

"Hello, how are you?"

the quotes simply are a means to delineate the text from its surroundings.

However, there are some cases in which I think the text contains quotes, but I'd have to look in detail at the entire pipeline to really differentiate between these two cases. I think @phirework has a much better view of the entire pipeline than I do. So maybe phirework could chime in?

@phirework
Contributor

Re: OP - Kelly's correct, the original sentence was <html lang%3D"en">, and it was added to our db early enough that it was probably from before we had any proper sanitization. We can certainly add logic to the CorporaCreator to not pick entries that are just an empty string like that.

On the question of too many quotes, it looks like it has to do with the settings we're using for fast-csv, the library we use to write to TSV, and how it quotes fields that can potentially contain the delimiter (\t in our case). Can I get you to report that in https://github.com/Common-Voice/common-voice-bundler/ instead and I'll tweak the config? Feel free to just link to this discussion, since I'll be the one fixing it. Unfortunately I can't just move this issue since the repositories are in separate GitHub orgs.
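
For anyone following along, the general effect can be reproduced with Python's `csv` writer. This is an analogy only: the bundler uses fast-csv, whose actual settings are not shown in this thread, and `write_tsv_row` and the file names are hypothetical:

```python
import csv
import io

def write_tsv_row(fields):
    """Write one tab-separated row with minimal quoting and return it as text."""
    buf = io.StringIO()
    writer = csv.writer(
        buf, delimiter="\t", quoting=csv.QUOTE_MINIMAL, lineterminator="\n"
    )
    writer.writerow(fields)
    return buf.getvalue()

# No special characters: the field is written as-is.
plain = write_tsv_row(["a.mp3", "Hello there"])

# The field contains quote characters, so the writer wraps it in quotes
# and doubles every quote inside it.
quoted = write_tsv_row(["b.mp3", '"Good-bye," said the boy.'])

print(plain, quoted, sep="")
```

With minimal quoting, only fields that contain a quote, the delimiter, or a line break get wrapped, which is consistent with the doubled-quote examples listed earlier in this thread.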

Thanks!
