PhilipMay changed the title from "Question: Cirrus Extractor vs. "normal" Extractor" to "Question: Cirrus Extractor vs. "normal" Extractor - who creates cleaner texts?" on Feb 21, 2022.
I just finished (a first version of) a word list project based on the "normal" extractor and XML dumps. I managed to do a reasonably good job by adding additional cleanup, but if I were to start from scratch I would use the cirrus dumps instead.
The output of the "normal" extractor is a mess (see #300) – you just cannot use it as is if you want clean text.
The cirrus dumps are already cleaned up, so only minimal processing is needed. That said, the cirrus-extract.py script in this project doesn't work with current cirrus dumps, where articles have "_type":"_doc" (the script requires "_type":"page"). Also, even though the cirrus dump is already relatively clean (compared to wikiextractor output), it is reasonable to do a little more cleanup than cirrus-extract.py does. This seems like a good starting point: https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py (note that it is specifically for Japanese, so one would need to adjust it for the target language).
There are also minor differences: for example, headings (and perhaps some other parts of the text included in the XML dumps/wikiextractor output) are omitted from the cirrus search dump text.
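For what it's worth, here is a minimal sketch of reading a current cirrus content dump directly, without cirrus-extract.py. It assumes the usual layout (a gzipped file of alternating index/document JSON lines) and the field names "namespace", "title", and "text"; the file name in the usage comment is just an illustration, so double-check the details against your own dump:

```python
import gzip
import json

def iter_cirrus_pages(path):
    """Yield (title, text) pairs from a CirrusSearch content dump."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # The dump is newline-delimited JSON: a metadata line (the "index"
        # action, with "_type": "_doc" in current dumps) followed by the
        # document line that carries the actual fields.
        for _meta_line, doc_line in zip(f, f):
            doc = json.loads(doc_line)
            if doc.get("namespace") != 0:  # keep main-namespace articles only
                continue
            text = doc.get("text")
            if text:
                yield doc.get("title", ""), text

# Example usage (the file name is just an illustration):
# for title, text in iter_cirrus_pages("enwiki-20220221-cirrussearch-content.json.gz"):
#     print(title)
```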
To sum up: If you need clean text, you can choose from the following options:
1. Modify my script for word lists, wikipedia-word-frequency-clean, to clean up wikiextractor output. It should be super easy: just process the return values of remove_markup(line) as you need. (Note that the original English BERT language model by Google was also trained on wikiextractor output with additional cleanup.) A rough sketch of this kind of post-processing follows after this list.
2. As a last resort, just use the "text" field of each page from the cirrus search dumps as is. It will still be cleaner than wikiextractor output.
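To make option 1 a bit more concrete, here is a rough sketch of the kind of extra cleanup meant there. This is not the wikipedia-word-frequency-clean script and not part of wikiextractor itself, just an illustration that assumes wikiextractor's default output with `<doc ...>` / `</doc>` wrappers; the length threshold is an arbitrary placeholder:

```python
import re
import sys

# Matches the wrapper line wikiextractor puts before each article.
DOC_OPEN = re.compile(r'<doc id="[^"]*" url="[^"]*" title="[^"]*">')

def clean_lines(lines, min_len=30):
    """Rough post-processing of wikiextractor plain-text output."""
    for line in lines:
        line = line.strip()
        # Drop the <doc ...> / </doc> wrappers and empty lines.
        if not line or line == "</doc>" or DOC_OPEN.match(line):
            continue
        # Heuristic (arbitrary threshold): short lines without sentence
        # punctuation are usually headings or list fragments, not prose.
        if len(line) < min_len and not line.endswith((".", "!", "?")):
            continue
        yield line

if __name__ == "__main__":
    for line in clean_lines(sys.stdin):
        print(line)
```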
The good thing about wikiextractor is that you can modify it for custom processing of various Wikipedia markup (templates, links, etc.). But if all you need is clean text, it just doesn't cut it.
Hi,
Which extractor do you think creates cleaner text: the Cirrus extractor or the "normal" extractor?
I am asking because I want to use the Wikipedia texts to train language models on them, see https://en.wikipedia.org/wiki/BERT_(language_model).
Thanks
Philip