
Support bulk-ban or bulk-remove sentences #4491

Open
laubonghaudoi opened this issue May 27, 2024 · 4 comments

@laubonghaudoi
Contributor

laubonghaudoi commented May 27, 2024

I originally thought that this issue was specific to the zh-hk locale, but I later realized that it is quite widespread and seriously harming the data quality of many languages. Currently, some languages are full of junk sentences, and it's not just a few but thousands of them. As the screenshot below shows, zh-cn, zh-hk, and possibly nan-tw are full of sentences that shouldn't have been added in the first place (I suspect they were added before the sentence validator was in place).
[screenshot: examples of junk sentences in the zh-hk corpus]

It is impractical to simply ask our volunteers to report these sentences because there are too many, and reporting them one by one is inefficient. These junk sentences discourage volunteers from recording and disrupt the normal dataset distribution. As we expand Common Voice to lower-resource languages, we need a way to bulk remove or ban sentences, because when a text corpus is hard to find for a language, it is easy for people to simply dump a dictionary into Common Voice.

The ???? in the screenshot is only one extreme example. In zh-CN, there are lots of unreadable, highly repetitive sentences like this:

[screenshot: repetitive zh-CN sentences]

I believe these are early Wikipedia dumps; they are obviously extracted from the first sentence of every Wikipedia article.

@HarikalarKutusu
Contributor

HarikalarKutusu commented May 27, 2024

I think those sentences with "???" inside were introduced by an encoding bug when the old Sentence Collector database was incorporated into CV. It is unfortunately irreversible, and in some cases (Western languages whose alphabets are mostly Latin/ASCII with a few Unicode additions) the sentences still get recorded, because the human brain can deduce the missing characters. But for Eastern languages it is quite a problem.

See:
#4048
#4138

There is one attempt to remove them from the released corpus, but it has not been merged yet:
common-voice/CorporaCreator#127

It might also be caused by wrong encoding in other inclusion methods of course.
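To make the failure mode concrete: re-encoding CJK text through a codec that cannot represent it, with lossy error handling, replaces every unrepresentable character with "?". This is a minimal sketch of the kind of bug described above, not the actual migration code:

```python
# Each CJK character that the target codec cannot represent becomes "?".
# Once the "?" string is written back to the database, the original
# characters are gone, which is why the corruption is irreversible.
sentence = "我哋一齊講廣東話"  # a Cantonese sentence, 8 characters
damaged = sentence.encode("ascii", errors="replace").decode("ascii")
print(damaged)  # ????????
```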

@jessicarose
Collaborator

I think this is a really useful discussion, and it mirrors both concerns we've heard from other language community members and internal discussions. I'll bring this into planning meetings next week and will come back to you with more information as team discussions expand on this and we do some research into technical explorations. Thank you so much for flagging this; it's an incredibly useful issue at a really useful time for the team, and I appreciate you both raising it.

@irvin
Member

irvin commented Jun 21, 2024

(Adding some background info.) The wiki dump for zh-cn came from the really early days, when we needed a work-in-progress STT model besides English and had to build a text corpus quickly so we could contract recording firms from China to do the recordings.

At that time each sentence was only recorded once, so fetching Wikipedia seemed to be the only way to get hundreds of thousands of sentences in a really short time.

We tried hard to adjust the parameters to raise the quality, and this was the best we could get at the time.

@irvin
Member

irvin commented Jun 21, 2024

For bulk-remove:

As a core contributor to both the nan-TW and zh-tw corpora, this is a very necessary tool for us if we want to ensure the quality of the text corpus and the CV database.

Before the Sentence Collector was published on the official site, we proofread all sentences before they went online, but nowadays it's totally out of our control: everyone can add sentences, and we don't have a way to evaluate them beforehand.

We have more or less given up on ensuring quality now, so it would be much appreciated if we could have this to do QC in some way.
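As a thought experiment, a bulk-QC pass over an exported sentence list could flag the two junk patterns discussed in this thread: runs of "?" (the encoding-damage signature) and exact duplicates. The function name and input format here are assumptions for illustration, not actual Common Voice tooling:

```python
import re

def flag_junk(sentences):
    """Flag sentences showing encoding damage or exact duplication."""
    seen = set()
    flagged = []
    for s in sentences:
        if re.search(r"\?{2,}", s):    # encoding damage like "????"
            flagged.append((s, "encoding-damage"))
        elif s in seen:                 # exact duplicate of an earlier sentence
            flagged.append((s, "duplicate"))
        seen.add(s)
    return flagged

sample = ["正常嘅句子", "壞咗嘅????句子", "正常嘅句子"]
print(flag_junk(sample))
# [('壞咗嘅????句子', 'encoding-damage'), ('正常嘅句子', 'duplicate')]
```

A real implementation would need language-aware heuristics (for example, near-duplicate detection for the repetitive Wikipedia first-sentence pattern), but even this simple pass would surface thousands of candidates for bulk review.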
