The initial list was provided by Kevyn Collins-Thompson from the University of Michigan School of Information.
-
Long general-purpose list of datasets:
-
This website has dozens of public datasets - some fun, some a bit, well... quirky. External link:
-
The Academic Torrents site has a growing number of datasets, including a few text collections that might be of interest (Wikipedia, email, Twitter, academic, etc.) for current or future projects.
-
Google Books n-gram corpus
- External link: http://books.google.com/ngrams
- Dataset: external link: http://aws.amazon.com/datasets/8172056142375670
-
Common Crawl
- Currently 6 billion Web documents (81 TB)
- Amazon S3 Public Data Set
-
Business/commercial data: Yelp. External link:
- http://www.yelp.com/developers/documentation/v2/search_api
- Note: Yelp announced on June 28, 2017 that API v2 would be deprecated on June 30, 2018.
-
Internet Archive (huge, ever-growing archive of the Web going back to the 1990s) external link:
-
Wikidata:
-
World Food Facts
-
Data USA - a variety of census data
-
U.S. Government open data - datasets from 75 agencies and subagencies
-
NASA data portal - space and earth science