created or.toml #107

psubhashish · 2020-06-09T07:15:04Z

No description provided.

MichaelKohler

Thanks! You can find the sample extraction here: https://github.com/Common-Voice/cv-sentence-extractor/suites/774458492/artifacts/8112008 (see https://discourse.mozilla.org/t/scraper-automatic-sample-sentences-extracted-in-pull-request/55217 for a full explanation).

I see a few issues at first glance:

There are English sentences in the Odia (OR) Wikipedia? Looking at the allowed_symbols config option instead of relying on "disallowed_symbols" might help here.
There seem to be more abbreviations
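To illustrate why an allowlist can catch English sentences that a denylist misses, here is a minimal Python sketch. It is not the extractor's implementation and does not use its rule-file syntax. It assumes the allowed set is the Odia (Oriya) Unicode block U+0B00-U+0B7F plus the danda signs, whitespace, and a few punctuation marks. The names `ALLOWED` and `is_allowed` are hypothetical.

```python
import re

# Assumption: allowed characters are the Odia (Oriya) Unicode block
# U+0B00-U+0B7F, the danda signs U+0964/U+0965, whitespace, and a
# handful of common punctuation marks. Adjust to the real rule set.
ALLOWED = re.compile(r"[\u0B00-\u0B7F\u0964\u0965\s.,!?\-]+")

def is_allowed(sentence: str) -> bool:
    """Return True only if every character is in the allowed set."""
    return ALLOWED.fullmatch(sentence) is not None

sentences = [
    "ଓଡ଼ିଆ ଏକ ଭାରତୀୟ ଭାଷା।",
    "This is an English sentence.",
    "ଓଡ଼ିଆ Wikipedia",
]

# Sentences containing any Latin letter are rejected without having to
# enumerate every disallowed symbol one by one.
kept = [s for s in sentences if is_allowed(s)]
```

With a denylist, every unwanted Latin letter would have to be listed explicitly. The allowlist rejects anything outside the expected script by default.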

In addition, I highly suggest adding a blocklist as well: https://github.com/Common-Voice/cv-sentence-extractor#using-disallowed-words

Happy to help if you have any questions.

psubhashish · 2020-06-11T11:22:20Z

Hey Michael, thanks for flagging these. As a Wikipedia editor myself, I couldn't stop myself from fixing some of the issues that you flagged. :-) So, here goes: I have started checking the English sentences. Some are actual content, and the rest are quotes (someone saying something about a person, place, or incident; some articles keep original quotes untranslated), but fixing them will take longer. The good news is that many articles were due for maintenance tags and deletion (oops), so this became a good excuse for some cleanup. Pat yourself on the back, as you indirectly contributed to Wikipedia! I have yet to work on the blocklist.

In the meantime, is it possible to run the code and create such sample text that contains English? Maybe something I can share with the Wikipedia community so more helping hands can clean up. Also, the extractor needs to be told to not collect the citations or footnotes. It's "References" in English Wikipedia and "ଆଧାର" or "ଟୀକା" in Odia. I see some such citations included in the file that you sent.

Your comment says "requested changes". Does that mean that I need to work on the disallowed word list and this article both? I am a bit unsure what is the ask for this very file "or.toml" and would appreciate if you can help.

MichaelKohler · 2020-06-11T12:16:36Z

> In the meantime, is it possible to run the code and create such sample text that contains English?

You can run it as explained in the README with the --no_check option, for example:

cargo run -- extract -l or -d ../wikiextractor/text/ --no_check >> wiki.or.all.txt

Note that this will take quite some time, and we will not be able to use the resulting file, as there is a limit on the number of sentences we can take per article.

It might be easier to take the WikiExtractor output and extract the sentences from there; then you don't have to run this script just to identify all sentences. However, you'll need to run it anyway to generate the blocklist, so it's probably a win-win if you do.
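As a rough illustration of how a blocklist could be derived from a full extraction such as wiki.or.all.txt, the Python sketch below flags rarely occurring words. Treating low word frequency as the selection criterion is an assumption here, not a description of the README's procedure. The function name `rare_words` and the threshold `max_frequency` are hypothetical.

```python
from collections import Counter

def rare_words(sentences, max_frequency):
    """Return words occurring at most `max_frequency` times overall.

    Assumption: words are separated by whitespace, and rare words are
    the candidates for a blocklist.
    """
    counts = Counter(word for s in sentences for word in s.split())
    return sorted(w for w, n in counts.items() if n <= max_frequency)

# Tiny hypothetical sample; a real run would read one sentence per line
# from the full extraction file instead.
sample = ["କ ଖ ଗ", "କ ଖ", "କ abc"]
blocklist = rare_words(sample, max_frequency=1)
```

Whatever threshold is chosen is the kind of parameter worth recording alongside the blocklist, so the list can be regenerated later.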

> Also, the extractor needs to be told to not collect the citations or footnotes. It's "References" in English Wikipedia and "ଆଧାର" or "ଟୀକା" in Odia. I see some such citations included in the file that you sent.

As we're using WikiExtractor before running our script, we do not have that info. And as far as I can see there is no such option in WikiExtractor?

> Your comment says "requested changes". Does that mean that I need to work on the disallowed word list and this article both? I am a bit unsure what is the ask for this very file "or.toml" and would appreciate if you can help.

In the end we can merge this PR and run the extraction once the following is achieved:

Error rate is below 7% (I think, I always forget the exact number) - Get at least 3 different native speakers (ideally linguists) to review a random sample of 100-500 sentences, estimate the average error ratio, and comment (or link their comment) in the PR.
Rules file is correct from a technical standpoint (this is currently the case, but you most likely will require more changes down the road to decrease the error rate, so I will review again then)
(if using a blocklist) We know with what parameters this blocklist got generated
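The review step in the first criterion can be sketched in Python as follows. This is only an illustration: the 7% threshold is the uncertain figure quoted above, the per-reviewer error counts are made up, and `draw_sample` and `average_error_rate` are hypothetical helper names.

```python
import random

def draw_sample(sentences, size, seed=0):
    """Draw a reproducible random sample of up to `size` sentences."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(size, len(sentences)))

def average_error_rate(errors_per_reviewer, sample_size):
    """Average the per-reviewer error ratios (errors / sample size)."""
    rates = [errors / sample_size for errors in errors_per_reviewer]
    return sum(rates) / len(rates)

# Hypothetical counts: three reviewers each checked the same 100
# sentences and flagged 5, 8, and 6 of them as erroneous.
rate = average_error_rate([5, 8, 6], sample_size=100)

# Assumption: 7% is the acceptance threshold mentioned in the thread.
passes = rate < 0.07
```

Fixing the random seed lets every reviewer work from the same sample, which makes their error ratios comparable.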

If it's achievable to get the error rate down to an acceptable level only with the rule file and no blocklist, that's an option too, but I heavily doubt that as we've seen a lot of improvement for other languages once a blocklist was added (as described in the README).

created or.toml

8b5fb0f

MichaelKohler requested changes Jun 9, 2020

MichaelKohler added the waiting on feedback label Jul 14, 2020

MichaelKohler marked this pull request as draft September 1, 2020 16:10

MichaelKohler changed the base branch from master to main October 27, 2020 17:41
