Matching Wikipedia Articles in a Category via Latent Semantic Analysis

Here we implement a method for finding the pages within a Wikipedia category that are most related to a given page or search query. We perform the matching with a Natural Language Processing technique, Latent Semantic Analysis (LSA), and collect document content through Wikipedia's publicly available API. The system is portable, built upon a series of Docker containers:

a Python container, joshuacook/miniconda, based upon the busybox and miniconda images
a MongoDB container from the latest public mongo image
a data container for MongoDB from tianon/true

Installation

Usage of this repository requires the installation of Docker. Please refer to the Docker documentation for installation on your system.

Once Docker has been installed, properly configured and launched, no additional work is necessary.

API

We provide a basic API to this functionality via the following command-line scripts:

./bin/categories #CATEGORY#
- will display a category and its associated sub-categories
./bin/download #CATEGORY#
- uses Wikipedia's publicly accessible API to download the pages associated with a given category and stores them in MongoDB; returns an error if the category is misspelled
- it is not necessary to put categories in quotes
- alternatively, a YAML file can be passed (see Document Collection below)
./bin/notebook
- launches an interactive notebook
- allows the user to manage categories, pages, and queries
- provides an IPython shell with easy access to the database and object models
./bin/page #PAGE#
- will display the content of a particular page
./bin/pages #CATEGORY#
- will display the titles of pages associated with a given category
./bin/search #CATEGORY# #N_OF_MATCHES# #QUERY_STRING#
- will display the #N_OF_MATCHES# articles in #CATEGORY# that are most similar to the passed query string
- the query string may or may not be a stored page
./bin/start_db
- starts the mongo and mongodata containers
- necessary to run the system
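
To make the ranking behind ./bin/search more concrete, here is a minimal sketch of Latent Semantic Analysis using NumPy. It is not the repository's implementation: the function names, the TF-IDF weighting, the `n_topics` parameter, and the sample pages are all assumptions made for illustration. The sketch builds a term-document matrix, reduces it with a truncated SVD, folds the query into the same latent space, and ranks pages by cosine similarity.

```python
import re
from collections import Counter

import numpy as np


def tokenize(text):
    """Lowercase a string and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lsa_search(pages, query, n_matches, n_topics=2):
    """Rank pages by similarity to query in a reduced latent space.

    pages is a dict of title -> text (a hypothetical stand-in for the
    pages stored in the database). Returns (title, similarity) pairs.
    """
    titles = list(pages)
    docs = [Counter(tokenize(pages[title])) for title in titles]
    vocab = sorted(set().union(*docs))
    index = {term: i for i, term in enumerate(vocab)}

    # Term-document matrix of raw counts (rows: terms, columns: pages).
    counts = np.zeros((len(vocab), len(titles)))
    for j, doc in enumerate(docs):
        for term, count in doc.items():
            counts[index[term], j] = count

    # Smoothed inverse document frequency weighting (an assumed choice).
    doc_freq = np.count_nonzero(counts, axis=1)
    idf = np.log((1 + len(titles)) / (1 + doc_freq)) + 1
    tfidf = counts * idf[:, None]

    # Truncated SVD: keep only the k largest singular values.
    u, s, vt = np.linalg.svd(tfidf, full_matrices=False)
    k = min(n_topics, len(s))
    u_k = u[:, :k]
    doc_vecs = vt[:k].T * s[:k]  # one row per page, k latent coordinates

    # Fold the query into the same space by projecting onto u_k.
    q = np.zeros(len(vocab))
    for term, count in Counter(tokenize(query)).items():
        if term in index:
            q[index[term]] = count * idf[index[term]]
    q_vec = q @ u_k

    # Cosine similarity, leaving zero-length vectors at similarity 0.
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec)
    sims = np.divide(doc_vecs @ q_vec, norms,
                     out=np.zeros(len(titles)), where=norms > 0)
    best = np.argsort(-sims)[:n_matches]
    return [(titles[i], float(sims[i])) for i in best]


# Hypothetical sample data, not taken from Wikipedia.
sample_pages = {
    "Gradient descent": "gradient descent optimization learning rate loss function minimization",
    "Neural network": "neural network learning layers gradient backpropagation loss",
    "Nash equilibrium": "game theory players strategy equilibrium payoff",
    "Prisoner's dilemma": "game theory players cooperation strategy payoff dilemma",
}
for title, score in lsa_search(sample_pages, "strategy payoff game", 2):
    print(f"{score:.3f}  {title}")
```

Projecting both pages and queries onto the same leading singular vectors is what lets a query match pages that share a topic rather than only exact words. In practice the number of retained topics would be tuned to the size of the category.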

Document Collection

$ ./bin/download #CATEGORY#

We use Wikipedia's publicly available API to collect pages. This process is a fairly straightforward API call.
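
As an illustration of what such a call can look like, the sketch below lists the members of a category through the MediaWiki Action API (`action=query` with `list=categorymembers`) using only the Python standard library. The helper names, the User-Agent string, and the choice of the English Wikipedia endpoint are assumptions; the repository's own download code may be organized differently.

```python
import json
import urllib.parse
import urllib.request

# English Wikipedia endpoint (assumed; other language editions differ).
API_URL = "https://en.wikipedia.org/w/api.php"


def category_members_url(category, limit=50, cmcontinue=None):
    """Build the request URL for one batch of category members."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": limit,
        "format": "json",
    }
    if cmcontinue:
        # Continuation token returned by the previous batch.
        params["cmcontinue"] = cmcontinue
    return f"{API_URL}?{urllib.parse.urlencode(params)}"


def fetch_category_members(category):
    """Yield (pageid, title) pairs, following continuation tokens."""
    cmcontinue = None
    while True:
        request = urllib.request.Request(
            category_members_url(category, cmcontinue=cmcontinue),
            # Hypothetical User-Agent; identify your own client here.
            headers={"User-Agent": "lsa-readme-example/0.1"},
        )
        with urllib.request.urlopen(request) as response:
            data = json.load(response)
        for member in data["query"]["categorymembers"]:
            yield member["pageid"], member["title"]
        cmcontinue = data.get("continue", {}).get("cmcontinue")
        if not cmcontinue:
            break
```

Calling `list(fetch_category_members("Machine learning"))` would perform the network requests. A category that does not exist typically comes back as an empty member list, so a caller that wants the error behaviour described above would need to check for that case explicitly.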

Note that it is also possible to pass a YAML file containing a list of categories in the following format:

categories:
  - Machine learning
  - Game theory
  - Algorithms
  - Linear algebra

$ ./bin/download data/these_categories.yml
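
The two invocation styles (a category name, possibly unquoted, or a YAML file) could be told apart along the following lines. This is a hedged sketch, not the repository's code: `read_category_list` and `resolve_categories` are hypothetical names, and the reader handles only the simple `categories:` list shown above. A real implementation would more likely rely on a YAML library such as PyYAML (`yaml.safe_load`).

```python
def read_category_list(path):
    """Read category names from a file in the simple list format.

    Handles only a top-level `categories:` key followed by `- item`
    lines; this is a stand-in for a full YAML parser.
    """
    categories = []
    in_list = False
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            if line == "categories:":
                in_list = True
            elif in_list and line.startswith("- "):
                categories.append(line[2:].strip())
            else:
                in_list = False  # any other key ends the list
    return categories


def resolve_categories(args):
    """Turn command-line arguments into a list of category names."""
    if len(args) == 1 and args[0].endswith((".yml", ".yaml")):
        return read_category_list(args[0])
    # An unquoted multi-word category arrives as several arguments,
    # so the pieces are joined back into a single name.
    return [" ".join(args)]
```

Under these assumptions, `resolve_categories(["Machine", "learning"])` yields a single category name, while a lone argument ending in `.yml` or `.yaml` is read as a category file.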
