Fetching the data
ogrisel edited this page Jan 6, 2011
You can get the latest Wikipedia dump of the English articles here (around 5.4 GB compressed, 23 GB uncompressed):
enwiki-latest-pages-articles.xml.bz2
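Assuming the standard Wikimedia download mirror (the exact URL is not spelled out on this page, so verify it first), the dump can be fetched and decompressed from the command line:

```shell
# Assumed URL: check https://dumps.wikimedia.org/enwiki/latest/ for the
# current location of the dump before running this.
DUMP=enwiki-latest-pages-articles.xml.bz2
wget "https://dumps.wikimedia.org/enwiki/latest/$DUMP"

# Decompress while keeping the original archive (-k); the uncompressed
# XML needs around 23 GB of free disk space.
bunzip2 -k "$DUMP"
```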
The DBpedia links and entity types datasets are available here (16.4 GB compressed):
Index of individual DBpedia 3.5.1 dumps
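The DBpedia 3.5.1 index lists many individual files; a sketch for fetching a couple of them (the base URL and filenames below are taken from the usual DBpedia 3.5.1 release layout and should be checked against the index before downloading):

```shell
# Assumed base URL and filenames; verify against the DBpedia 3.5.1
# dump index before running.
BASE=http://downloads.dbpedia.org/3.5.1/en
for f in instance_types_en.nt.bz2 page_links_en.nt.bz2; do
    wget "$BASE/$f"
done
```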
All of these datasets are also available in the Amazon cloud as public EBS snapshots:
Wikipedia XML dataset EBS snapshot: snap-8041f2e9 (all languages, 500 GB)
DBpedia Triples dataset EBS snapshot: snap-63cf3a0a (all languages, 67 GB)
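To use the EBS route, a volume is first created from the public snapshot in the same availability zone as your instance, then attached and mounted. A sketch using the AWS CLI (the region, availability zone, device name, and mount point below are assumptions; adjust them to your setup):

```shell
# Assumed availability zone and device name; the snapshot ID is the
# public Wikipedia XML snapshot listed above.
aws ec2 create-volume \
    --snapshot-id snap-8041f2e9 \
    --availability-zone us-east-1a

# Once the volume is available, attach it to your instance
# (replace vol-... and i-... with your real IDs) and mount it:
aws ec2 attach-volume \
    --volume-id vol-XXXXXXXX \
    --instance-id i-XXXXXXXX \
    --device /dev/sdf

sudo mount /dev/sdf /mnt/wikipedia
```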
See the wiki page Running pignlproc scripts on an EC2 Hadoop cluster for instructions on how to set up your Hadoop cluster on EC2.