Here we implement a method for finding the pages within a category that are most related to a given page or search query. We use Natural Language Processing, specifically Latent Semantic Analysis, to perform the matching, and Wikipedia's publicly available API to collect document content. Our system is portable, built upon a series of Docker containers:
- a Python container, `joshuacook/miniconda`, based upon the `busybox` and `miniconda` images
- a Mongo DB container from the latest public `mongo` image
- a data container for Mongo DB from `tianon/true`

Usage of this repository requires the installation of Docker. Please refer to the Docker documentation for installation on your system.
Once Docker has been installed, properly configured and launched, no additional work is necessary.
We provide a basic API to this functionality via the following command-line scripts:
- `./bin/categories #CATEGORY#`
  - will display a category and its associated sub-categories
- `./bin/download`
- `./bin/notebook`
  - launches an interactive notebook
  - allows the user to manage categories, pages, and queries
  - provides easy access to an IPython shell to the database and object models
- `./bin/page #PAGE#`
  - will display the content of a particular page
- `./bin/pages #CATEGORY#`
  - will display the titles of pages associated with a given category
- `./bin/search #CATEGORY# #N_OF_MATCHES# #QUERY_STRING#`
  - will display `N` articles in a `#CATEGORY#` that are most similar to a passed query string
  - the query string may or may not be a stored page
- `./bin/start_db`
  - starts the `mongo` and `mongodata` containers
  - necessary to run the system

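To make the matching behind `./bin/search` concrete, here is a minimal sketch of Latent Semantic Analysis using only NumPy. It is not the code in this repository: the function names (`tokenize`, `build_tfidf`, `vectorize`, `search`), the smoothed TF-IDF weighting, and the default of two latent topics are all assumptions made for illustration. The idea is to build a TF-IDF matrix for the pages in a category, keep the top singular vectors as latent topics, and rank pages by cosine similarity to the query in that reduced space.

```python
import re
from collections import Counter

import numpy as np


def tokenize(text):
    """Lowercase a string and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_tfidf(docs):
    """Return (term index, idf vector, matrix) with one TF-IDF row per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for row, toks in enumerate(tokenized):
        for term, n in Counter(toks).items():
            counts[row, index[term]] = n
    doc_freq = (counts > 0).sum(axis=0)
    # Smoothed inverse document frequency (an assumed weighting scheme).
    idf = np.log((1 + len(docs)) / (1 + doc_freq)) + 1.0
    return index, idf, counts * idf


def vectorize(text, index, idf):
    """Weight a new string against an existing vocabulary; unknown terms are dropped."""
    vec = np.zeros(len(index))
    for term, n in Counter(tokenize(text)).items():
        if term in index:
            vec[index[term]] = n
    return vec * idf


def search(query, titles, docs, n_matches=2, n_topics=2):
    """Rank documents by cosine similarity to the query in the latent space."""
    index, idf, tfidf = build_tfidf(docs)
    # Truncated SVD: keep only the top n_topics right singular vectors.
    _, _, vt = np.linalg.svd(tfidf, full_matrices=False)
    topics = vt[:n_topics]
    doc_vecs = tfidf @ topics.T
    query_vec = vectorize(query, index, idf) @ topics.T
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    # Guard against division by zero for empty documents or queries.
    scores = doc_vecs @ query_vec / np.where(norms == 0, 1.0, norms)
    order = np.argsort(scores)[::-1][:n_matches]
    return [(titles[i], float(scores[i])) for i in order]


if __name__ == "__main__":
    # Toy stand-ins for stored pages; real page content would come from the database.
    titles = ["Linear map", "Determinant", "Nash equilibrium", "Auction theory"]
    docs = [
        "matrix vector eigenvalue matrix",
        "vector matrix determinant",
        "player strategy equilibrium player strategy",
        "strategy payoff player auction",
    ]
    for title, score in search("eigenvalue matrix", titles, docs):
        print(title, round(score, 3))
```

Because the query is just a string, the same function covers both cases mentioned above: the text of a stored page or a free-form query. On a real category the number of topics would typically be much larger than two.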
$ ./bin/download #CATEGORY#
We use Wikipedia's publicly available API. This process is a fairly straightforward API call.
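As an illustration of what such a call can look like, the sketch below lists the direct members of a category through the MediaWiki `categorymembers` query. It is an assumption about how the download step could work, not the repository's actual script: the helper names (`category_members_url`, `parse_members`, `fetch_category`) and the `User-Agent` string are hypothetical, and storing the results in Mongo DB is left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def category_members_url(category, limit=50, cmcontinue=None):
    """Build the query URL for one batch of members of a category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": limit,
        "format": "json",
    }
    if cmcontinue:
        params["cmcontinue"] = cmcontinue
    return f"{API_URL}?{urlencode(params)}"


def parse_members(payload):
    """Split a response into page titles, sub-category titles, and a continuation token."""
    pages, subcategories = [], []
    for member in payload["query"]["categorymembers"]:
        # Namespace 14 is the Category namespace; namespace 0 holds articles.
        if member["ns"] == 14:
            subcategories.append(member["title"])
        elif member["ns"] == 0:
            pages.append(member["title"])
    next_token = payload.get("continue", {}).get("cmcontinue")
    return pages, subcategories, next_token


def fetch_category(category):
    """Fetch all direct members of a category, following continuation tokens."""
    pages, subcategories, token = [], [], None
    while True:
        request = Request(
            category_members_url(category, cmcontinue=token),
            headers={"User-Agent": "lsa-readme-example/0.1"},  # hypothetical agent string
        )
        with urlopen(request) as response:
            payload = json.load(response)
        new_pages, new_subcats, token = parse_members(payload)
        pages += new_pages
        subcategories += new_subcats
        if token is None:
            return pages, subcategories


if __name__ == "__main__":
    pages, subcategories = fetch_category("Machine learning")
    print(len(pages), "pages,", len(subcategories), "sub-categories")
```

Keeping URL construction and response parsing separate from the network call makes both easy to check without contacting Wikipedia.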
Note that it is also possible to pass a YAML file containing a list of categories in the following format:
categories:
- Machine learning
- Game theory
- Algorithms
- Linear algebra
$ ./bin/download data/these_categories.yml
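One way a script could accept either form of argument is sketched below. This is an assumption, not the repository's code: the function name `load_categories` is hypothetical, the choice to detect a YAML file by its extension is a guess, and the example relies on the third-party PyYAML package.

```python
import yaml  # PyYAML, a third-party package (assumed to be installed)


def load_categories(argument):
    """Return the category names described by a download argument.

    A path ending in .yml or .yaml is read as a file with a top-level
    `categories` list; anything else is treated as a single category name.
    """
    if argument.endswith((".yml", ".yaml")):
        with open(argument, encoding="utf-8") as handle:
            config = yaml.safe_load(handle)
        return [str(name) for name in config["categories"]]
    return [argument]
```

With the file shown above, `load_categories("data/these_categories.yml")` would yield the four category names in order, while `load_categories("Game theory")` would yield a one-element list.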