The package contains a large amount of Persian text, collected from the following sources:
- Common Crawl: 65GB (link)
- MirasText: 12GB
- W2C – Web to Corpus: 1GB (link)
- Persian Wikipedia (March 2020 dump): 787MB (link)
- Leipzig Corpora: 424MB (link)
- VOA corpus: 66MB (link)
- Persian poems corpus: 61MB (link)
- TEP: Tehran English-Persian parallel corpus: 33MB (link)
Each resource has been modified to exclude non-text content (URLs, HTML, non-UTF-8 content, etc.). I have also dropped lines that do not contain any Persian text. I have not done any deduplication, so there may be repeated content.
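As a rough illustration, a line-level "contains Persian text" filter of the kind described above can be written with a Unicode range check. This is a sketch under assumptions: the regex and helper name are illustrative, not the exact filter used for this corpus.

```python
import re

# Matches any character in the Arabic/Persian Unicode block (U+0600-U+06FF).
# Illustrative approximation of the "contains Persian text" check.
PERSIAN_RE = re.compile(r"[\u0600-\u06FF]")

def has_persian(line: str) -> bool:
    """Return True if the line contains at least one Persian/Arabic-block character."""
    return bool(PERSIAN_RE.search(line))

lines = ["hello world", "سلام دنیا", "mixed سلام line", "12345"]
kept = [l for l in lines if has_persian(l)]
print(kept)  # ['سلام دنیا', 'mixed سلام line']
```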
The overall data is here (~70GB, ~13.5 million paragraphs).
Note: since the files are relatively large, you probably should not download them in your browser.
A good way to download the files is to use gsutil
(see here for more), which reports the total download size, download progress, etc.:
```
$ gsutil -m cp -R gs://danielk-files/farsi-text/merged_files/all_text_merged_cleaned.txt .
Copying gs://danielk-files/farsi-text/merged_files/all_text_merged_cleaned.txt...
/ [0/1 files][600.2 MiB/ 69.8 GiB]   0% Done
```
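Once downloaded, the merged file is far too large to load into memory at once, so a lazy line-by-line reader is safer. A minimal sketch, assuming one paragraph per line (the helper name is illustrative; the demo uses a tiny synthetic file standing in for the real corpus):

```python
import os
import tempfile

def iter_paragraphs(path):
    # Yield non-empty paragraphs one at a time so memory use stays
    # constant regardless of file size (the merged file is ~70GB).
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line

# Tiny demo on a synthetic file standing in for the downloaded corpus.
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8")
tmp.write("سلام دنیا\n\nاین یک آزمایش است\n")
tmp.close()
paragraphs = list(iter_paragraphs(tmp.name))
print(len(paragraphs))  # 2
os.unlink(tmp.name)
```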
You can also use tools like wget:
```
$ wget https://storage.googleapis.com/danielk-files/farsi-text/merged_files/commoncrawl_fa_merged.txt
--2020-05-17 14:53:08--  https://storage.googleapis.com/danielk-files/farsi-text/merged_files/commoncrawl_fa_merged.txt
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.195.128
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.195.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 68720495550 (64G) [text/plain]
Saving to: ‘commoncrawl_fa_merged.txt.1’

commoncrawl_fa_merged.txt.1   0%[                    ] 542.30M  55.9MB/s    eta 17m 44s
```
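Since no deduplication was done on the corpus, you may want to drop exact duplicate lines yourself after downloading. A streaming sketch that keeps only a set of hashes in memory rather than the text itself (the function name and file paths are assumptions for illustration):

```python
import hashlib
import os
import tempfile

def dedup_file(src_path, dst_path):
    # Stream src line by line, writing each line to dst only the first
    # time its MD5 hash is seen. Memory is bounded by the hash set,
    # not by the (potentially ~70GB) file.
    seen = set()
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            h = hashlib.md5(line.encode("utf-8")).digest()
            if h not in seen:
                seen.add(h)
                dst.write(line)
                kept += 1
    return kept

# Tiny demo on a synthetic file containing one repeated paragraph.
src = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8")
src.write("سطر اول\nسطر دوم\nسطر اول\n")
src.close()
kept = dedup_file(src.name, src.name + ".dedup")
print(kept)  # 2
os.unlink(src.name)
os.unlink(src.name + ".dedup")
```

For real runs over the full corpus, storing 8-byte truncated hashes (or using a Bloom filter) would cut the memory footprint further at a small risk of false-positive collisions.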
If you find this corpus useful, please include a reference to this repository.