The package contains a large amount of Persian text, collected from the following sources:
- Common Crawl: 65GB (link)
- MirasText: 12GB
- W2C – Web to Corpus: 1GB (link)
- Persian Wikipedia (March 2020 dump): 787MB (link)
- Leipzig Corpora: 424MB (link)
- VOA corpus: 66MB (link)
- Persian poems corpus: 61MB (link)
- TEP: Tehran English-Persian parallel corpus: 33MB (link)
Each resource has been modified to exclude non-text content (URLs, HTML, non-UTF-8 content, etc.). I have also dropped lines that do not contain any Persian text. I have not done any deduplication, so there may be repeated content.
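As a rough illustration, a line-level "contains Persian text" filter of the kind described above can be written with a Unicode range check. This is a sketch under assumptions: the regex and helper name are illustrative, not the exact filter used for this corpus.

```python
import re

# Matches any character in the Arabic/Persian Unicode block (U+0600-U+06FF).
# Illustrative approximation of the "contains Persian text" check.
PERSIAN_RE = re.compile(r"[\u0600-\u06FF]")

def has_persian(line: str) -> bool:
    """Return True if the line contains at least one Persian/Arabic-block character."""
    return bool(PERSIAN_RE.search(line))

lines = ["hello world", "سلام دنیا", "mixed سلام line", "12345"]
kept = [l for l in lines if has_persian(l)]
print(kept)  # ['سلام دنیا', 'mixed سلام line']
```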
The overall data is here (~70GB, ~13.5 million paragraphs).
Note: since the files are relatively large, you probably should not download them in your browser.
A good way to download the files is to use gsutil
(see here for more), which reports the total download size, download progress, etc.:
```
$ gsutil -m cp -R gs://danielk-files/farsi-text/merged_files/all_text_merged_cleaned.txt .
Copying gs://danielk-files/farsi-text/merged_files/all_text_merged_cleaned.txt...
/ [0/1 files][600.2 MiB/ 69.8 GiB]   0% Done
```
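Once downloaded, the merged file is far too large to load into memory at once, so a lazy line-by-line reader is safer. A minimal sketch, assuming one paragraph per line (the helper name is illustrative; the demo uses a tiny synthetic file standing in for the real corpus):

```python
import os
import tempfile

def iter_paragraphs(path):
    # Yield non-empty paragraphs one at a time so memory use stays
    # constant regardless of file size (the merged file is ~70GB).
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line

# Tiny demo on a synthetic file standing in for the downloaded corpus.
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8")
tmp.write("سلام دنیا\n\nاین یک آزمایش است\n")
tmp.close()
paragraphs = list(iter_paragraphs(tmp.name))
print(len(paragraphs))  # 2
os.unlink(tmp.name)
```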
You can also use tools like wget:
```
$ wget https://storage.googleapis.com/danielk-files/farsi-text/merged_files/commoncrawl_fa_merged.txt
--2020-05-17 14:53:08--  https://storage.googleapis.com/danielk-files/farsi-text/merged_files/commoncrawl_fa_merged.txt
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.195.128
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.195.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 68720495550 (64G) [text/plain]
Saving to: ‘commoncrawl_fa_merged.txt.1’

commoncrawl_fa_merged.txt.1   0%[                    ] 542.30M  55.9MB/s    eta 17m 44s
```
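Since no deduplication was done on the corpus, you may want to drop exact duplicate lines yourself after downloading. A streaming sketch that keeps only a set of hashes in memory rather than the text itself (the function name and file paths are assumptions for illustration):

```python
import hashlib
import os
import tempfile

def dedup_file(src_path, dst_path):
    # Stream src line by line, writing each line to dst only the first
    # time its MD5 hash is seen. Memory is bounded by the hash set,
    # not by the (potentially ~70GB) file.
    seen = set()
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            h = hashlib.md5(line.encode("utf-8")).digest()
            if h not in seen:
                seen.add(h)
                dst.write(line)
                kept += 1
    return kept

# Tiny demo on a synthetic file containing one repeated paragraph.
src = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8")
src.write("سطر اول\nسطر دوم\nسطر اول\n")
src.close()
kept = dedup_file(src.name, src.name + ".dedup")
print(kept)  # 2
os.unlink(src.name)
os.unlink(src.name + ".dedup")
```

For real runs over the full corpus, storing 8-byte truncated hashes (or using a Bloom filter) would cut the memory footprint further at a small risk of false-positive collisions.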
If you find this corpus useful, please include a reference to this repository.