
Question: Cirrus Extractor vs. "normal" Extractor - who creates cleaner texts? #282

Open
PhilipMay opened this issue Feb 21, 2022 · 1 comment

Comments

@PhilipMay

Hi,

Which extractor do you think creates cleaner text: the Cirrus extractor or the "normal" extractor?

I am asking because I want to use the Wikipedia texts to train language models on them; see https://en.wikipedia.org/wiki/BERT_(language_model)

Thanks
Philip

@PhilipMay PhilipMay changed the title Question: Cirrus Extractor vs. "normal" Extractor Question: Cirrus Extractor vs. "normal" Extractor - who creates cleaner texts? Feb 21, 2022

adno commented Dec 30, 2022

Hi,

I just finished (a first version of) a word list project based on the "normal" extractor and XML dumps. I managed to do a reasonably good job by adding additional cleanup, but if I were to start from scratch I would use the cirrus dumps instead.

The output of the "normal" extractor is a mess (see #300) – you just cannot use it as is if you want clean text.

The cirrus dumps are already cleaned up, so only minimal processing is needed. That said, the current cirrus-extract.py script in this project doesn't work with current cirrus dumps, where articles have "_type":"_doc" (the script requires "_type":"page"). Also, even though the cirrus dump is already relatively clean (compared to wikiextractor output), it would be reasonable to do a little more cleanup than cirrus-extract.py does. This seems like a good start for doing that: https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py (Note that it's specifically for Japanese, so one would need to adjust it based on the target language.)
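For what it's worth, here is a minimal sketch (not part of this repo) of how one might iterate over a current cirrus dump while accepting both the old "_type":"page" and the newer "_type":"_doc". The dump layout below (Elasticsearch bulk format, i.e. alternating metadata and document lines) matches what I see in recent dumps; the field names used ("namespace", "title", "text") are assumptions to verify against your dump version.

```python
import gzip
import json
import sys

def iter_cirrus_pages(path):
    """Yield (metadata, document) pairs from a cirrussearch dump (.json.gz).

    The dump is in Elasticsearch bulk format: lines alternate between an
    index action ({"index": {"_type": ..., "_id": ...}}) and the document
    itself ({"title": ..., "text": ..., ...}).
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        while True:
            meta_line = f.readline()
            if not meta_line:
                break  # end of file
            doc_line = f.readline()
            meta = json.loads(meta_line)["index"]
            # Accept both the old "page" type and the newer "_doc" type.
            if meta.get("_type") not in ("page", "_doc"):
                continue
            yield meta, json.loads(doc_line)

if __name__ == "__main__":
    for meta, page in iter_cirrus_pages(sys.argv[1]):
        # Namespace 0 is the article namespace; skip talk/help/etc. pages.
        if page.get("namespace") == 0 and page.get("text"):
            print(page["title"])
```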

There are also various minor differences, e.g. headings (and perhaps some other parts of the text included in the XML dumps/wikiextractor output) are omitted from the cirrus search dump text.

To sum up: If you need clean text, you can choose from the following options:

  1. Modify my script for word lists, wikipedia-word-frequency-clean, to clean up wikiextractor output. It should be super easy: just process the return values of remove_markup(line) as you need. (Note that the original English BERT language model by Google was also trained on wikiextractor output with additional cleanup.)

  2. Modify the script for Japanese BERT to clean up cirrus search dumps: https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py (Slightly more work, but I would bet on better results.)

  3. As a last resort, just use the "text":… of each page from the cirrus search dumps "as is" (a minimal sketch follows below). It will still be cleaner than wikiextractor output.
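Here is a minimal sketch of option 3 with a little extra cleanup, reusing iter_cirrus_pages() from the sketch above. It is only loosely inspired by the kind of cleanup the Japanese BERT script does (normalization, filtering); the output format (one document per line), the minimum-length threshold, and the function name are arbitrary choices for illustration:

```python
import sys
import unicodedata

def extract_clean_text(dump_path, out_path, min_chars=20):
    """Write one cleaned document per line.

    Relies on iter_cirrus_pages() from the previous sketch.
    """
    with open(out_path, "w", encoding="utf-8") as out:
        for _meta, page in iter_cirrus_pages(dump_path):
            text = page.get("text", "")
            # NFKC normalization folds full-width characters, ligatures, etc.
            text = unicodedata.normalize("NFKC", text).strip()
            if len(text) >= min_chars:
                out.write(text.replace("\n", " ") + "\n")

if __name__ == "__main__":
    extract_clean_text(sys.argv[1], sys.argv[2])
```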

The good thing about wikiextractor is that you can modify it to do custom processing of various Wikipedia markup (templates, links, etc.). But if all you need is clean text, it just doesn't cut it.
