
Support bulk-ban or bulk-remove sentences #4491

Open
laubonghaudoi opened this issue May 27, 2024 · 4 comments

@laubonghaudoi
Contributor

laubonghaudoi commented May 27, 2024

I originally thought that this issue was specific to the zh-hk locale, but I later realized that it is quite widespread and seriously harming the data quality of many languages. Currently, some languages are full of junk sentences, and it's not just a few but thousands of them. As the screenshot below shows, zh-cn, zh-hk, and possibly nan-tw are full of sentences that shouldn't have been added in the first place (I suspect they were added before the sentence validator was in place).
[screenshot: examples of junk sentences in the zh-hk corpus]

It is impractical to simply ask our volunteers to report these sentences because there are too many, and reporting them one by one is inefficient. These junk sentences discourage volunteers from recording and disrupt the normal dataset distribution. As we expand Common Voice to lower-resource languages, we need a way to bulk remove or ban sentences, because when a text corpus is hard to find for a language, it is easy for people to simply dump a dictionary into Common Voice.

The ???? in the screenshot is only one extreme example. In zh-CN, there are lots of unreadable, highly repetitive sentences like this:

[screenshot: repetitive zh-CN sentences]

I believe these are early Wikipedia dumps; they are obviously extracted from the first sentence of every Wikipedia article.

@HarikalarKutusu
Contributor

HarikalarKutusu commented May 27, 2024

I think those sentences with "???" inside were introduced by an encoding bug when the old Sentence Collector database was incorporated into CV. It is unfortunately irreversible, and in some cases (Western languages whose alphabets are mostly Latin/ASCII with a few Unicode additions) the sentences still get recorded, because the human brain can deduce the missing characters. But for Eastern languages it is quite a problem.

See:
#4048
#4138

There is one attempt to remove them from the released corpus, but it has not been merged yet:
common-voice/CorporaCreator#127

It might also be caused by wrong encoding in other inclusion methods of course.
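To make the failure mode concrete: re-encoding CJK text through a codec that cannot represent it, with lossy error handling, replaces every unrepresentable character with "?". This is a minimal sketch of the kind of bug described above, not the actual migration code:

```python
# Each CJK character that the target codec cannot represent becomes "?".
# Once the "?" string is written back to the database, the original
# characters are gone, which is why the corruption is irreversible.
sentence = "我哋一齊講廣東話"  # a Cantonese sentence, 8 characters
damaged = sentence.encode("ascii", errors="replace").decode("ascii")
print(damaged)  # ????????
```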

@jessicarose
Collaborator

I think this is a really useful discussion, and it mirrors both concerns we've heard from other language community members and internal discussions. I'll bring this into planning meetings next week and will come back to you with more information as team discussions expand on this and we do some research into technical explorations. Thank you so much for flagging this; it's an incredibly useful issue at a really useful time for the team, and I appreciate you both raising it.

@irvin
Member

irvin commented Jun 21, 2024

(Adding some background info.) The wiki dump for zh-cn came from the really early days, when we needed a work-in-progress STT model besides English and had to build a text corpus quickly so we could contract recording firms from China to do the recordings.

At that time each sentence was only recorded once, so fetching Wikipedia seemed to be the only way to get hundreds of thousands of sentences in a really short time.

We tried hard to adjust the parameters to raise the quality, and this was the best we could get at the time.

@irvin
Member

irvin commented Jun 21, 2024

For bulk-remove:

As a core contributor to both the nan-TW and zh-tw corpora, this is a very necessary tool for us if we want to ensure the quality of the text corpus and the CV database.

Before the Sentence Collector was published on the official site, we proofread all sentences before they went online, but nowadays it's totally out of our control: everyone can add sentences, and we don't have a way to evaluate them beforehand.

We have more or less given up on ensuring quality now, so it would be much appreciated if we could have this to do QC in some way.
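As a thought experiment, a bulk-QC pass over an exported sentence list could flag the two junk patterns discussed in this thread: runs of "?" (the encoding-damage signature) and exact duplicates. The function name and input format here are assumptions for illustration, not actual Common Voice tooling:

```python
import re

def flag_junk(sentences):
    """Flag sentences showing encoding damage or exact duplication."""
    seen = set()
    flagged = []
    for s in sentences:
        if re.search(r"\?{2,}", s):    # encoding damage like "????"
            flagged.append((s, "encoding-damage"))
        elif s in seen:                 # exact duplicate of an earlier sentence
            flagged.append((s, "duplicate"))
        seen.add(s)
    return flagged

sample = ["正常嘅句子", "壞咗嘅????句子", "正常嘅句子"]
print(flag_junk(sample))
# [('壞咗嘅????句子', 'encoding-damage'), ('正常嘅句子', 'duplicate')]
```

A real implementation would need language-aware heuristics (for example, near-duplicate detection for the repetitive Wikipedia first-sentence pattern), but even this simple pass would surface thousands of candidates for bulk review.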
