Matching Wikipedia Articles in a Category via Latent Semantic Analysis

Here we implement a method for finding the pages within a category that are most related to a given page or search query. We perform the matching with Latent Semantic Analysis, a Natural Language Processing technique, and collect document content through Wikipedia's publicly available API. Our system is portable, built upon a series of Docker containers:

  1. a Python container, joshuacook/miniconda, based upon the busybox and miniconda images
  2. a MongoDB container from the latest public mongo image
  3. a data container for MongoDB from tianon/true

Installation

Usage of this repository requires the installation of Docker. Please refer to the Docker documentation for installation on your system.

Once Docker has been installed, properly configured and launched, no additional work is necessary.

API

We provide a basic API to this functionality via the following command-line scripts:

  • ./bin/categories #CATEGORY#
    • will display a category and its associated sub-categories
  • ./bin/download #CATEGORY#
    • uses Wikipedia's publicly accessible API to download the pages associated with a given category. These pages are stored in MongoDB. Returns an error if the category is misspelled.
    • It is not necessary to put categories in quotes.
    • Alternatively, a YAML file can be passed; see Document Collection below.
  • ./bin/notebook
    • launches an interactive notebook
    • allows the user to manage categories, pages, and queries
    • provides easy access, through an IPython shell, to the database and object models
  • ./bin/page #PAGE#
    • will display the content of a particular page
  • ./bin/pages #CATEGORY#
    • will display the titles of pages associated with a given category
  • ./bin/search #CATEGORY# #N_OF_MATCHES# #QUERY_STRING#
    • will display the N articles in #CATEGORY# that are most similar to the passed query string
    • the query string may or may not be a stored page
  • ./bin/start_db
    • starts the mongo and mongodata containers
    • necessary to run the system
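
To give a feel for what ./bin/search does conceptually, here is a minimal sketch of Latent Semantic Analysis ranking. It is not the repository's implementation: the function name lsa_search, the sample documents, the use of NumPy, and the choice of raw term counts (rather than TF-IDF weighting) are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical sample documents standing in for pages stored in the database.
SAMPLE_DOCS = {
    "Gradient descent": "gradient descent optimization minimizes a loss function",
    "Nash equilibrium": "game theory players strategy equilibrium payoff",
    "Stochastic gradient descent":
        "stochastic gradient descent optimization uses random samples of the loss",
}


def lsa_search(docs, query, n_matches=2, n_topics=2):
    """Rank documents by cosine similarity to the query in a latent space."""
    titles = list(docs)
    tokenized = [docs[title].lower().split() for title in titles]
    vocab = sorted({word for words in tokenized for word in words})
    index = {word: i for i, word in enumerate(vocab)}

    # Term-document matrix of raw counts (rows: terms, columns: documents).
    term_doc = np.zeros((len(vocab), len(titles)))
    for j, words in enumerate(tokenized):
        for word in words:
            term_doc[index[word], j] += 1

    # Truncated SVD: keep the n_topics strongest latent directions.
    left_vectors, _, _ = np.linalg.svd(term_doc, full_matrices=False)
    basis = left_vectors[:, :n_topics]

    # Project the documents and the query into the latent space.
    doc_vecs = (basis.T @ term_doc).T
    query_counts = np.zeros(len(vocab))
    for word in query.lower().split():
        if word in index:  # words outside the vocabulary are ignored
            query_counts[index[word]] += 1
    query_vec = basis.T @ query_counts

    # Cosine similarity; the epsilon guards against zero-length vectors.
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    scores = doc_vecs @ query_vec / (norms + 1e-12)
    order = np.argsort(-scores)[:n_matches]
    return [(titles[i], float(scores[i])) for i in order]
```

Calling lsa_search(SAMPLE_DOCS, "gradient optimization loss") returns the two gradient descent pages ahead of the game theory page, because the query shares no latent direction with the latter. The query does not need to be a stored page; any string is projected into the same space.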

Document Collection

$ ./bin/download #CATEGORY#

We use Wikipedia's publicly available API to collect documents. This process is a fairly straightforward API call.
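
As a rough illustration of the kind of API call involved, the sketch below lists the members of a category through the MediaWiki `categorymembers` query. The function names, the User-Agent string, and the default limit are assumptions for this example; the repository's own download script may work differently. The API returns results in batches, and continuation handling is omitted here.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def category_members_url(category, limit=50):
    """Build the query URL that lists the members of a Wikipedia category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urllib.parse.urlencode(params)}"


def fetch_category_members(category, limit=50):
    """Return the titles of the first `limit` members of a category."""
    request = urllib.request.Request(
        category_members_url(category, limit),
        # Wikimedia asks API clients to send a descriptive User-Agent;
        # this value is a placeholder.
        headers={"User-Agent": "lsa-readme-example/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [member["title"] for member in payload["query"]["categorymembers"]]
```

For example, fetch_category_members("Machine learning") would request the first batch of pages and sub-categories in that category; it needs network access, so only the URL builder is suitable for offline checks.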

Note that it is also possible to pass a YAML file containing a list of categories in the following format:

categories:
  - Machine learning
  - Game theory
  - Algorithms
  - Linear algebra

$ ./bin/download data/these_categories.yml
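
To show how such a file maps to a list of category names, here is a minimal sketch that reads only the flat format shown above using the standard library. The function name read_categories is hypothetical, and this is not a general YAML parser; a real implementation would more likely rely on a YAML library.

```python
EXAMPLE = """\
categories:
  - Machine learning
  - Game theory
  - Algorithms
  - Linear algebra
"""


def read_categories(text):
    """Extract the entries listed under the top-level `categories:` key."""
    categories = []
    in_list = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "categories:":
            in_list = True
        elif in_list and stripped.startswith("- "):
            categories.append(stripped[2:].strip())
        elif stripped:
            # Any other non-blank line ends the list.
            in_list = False
    return categories
```

Each name returned by read_categories(EXAMPLE) could then be handled like a single #CATEGORY# argument to ./bin/download.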