Skip to content

Latest commit

 

History

History
38 lines (23 loc) · 1.99 KB

README.md

File metadata and controls

38 lines (23 loc) · 1.99 KB

LightLDA

Latent Dirichlet Allocation - inference in topic models (optimized by Metropolis - Hastings - Walker)

Description:

Implementation of an unsupervised text-clustering machine learning algorithm. Latent Dirichlet Allocation (LDA) is a probabilistic generative model that extracts thematic structure in a documentation of texts (corpus). The algorithm assumes that every topic is a collection of words with certain probabilities and every article (text) in the corpus is a distribution of these topics.

A good introduction to Latent Dirichlet Allocation can be found on Edwin Chen's data sciene blog

Reducing sampling complexity:

The usual implementation of LDA -the collapsed Gibbs sampling approach- can be expensive if we try to capture hundreds, thousands, or hundred of thousands of topic. Researches has shown that another decomposition of the collapsed conditional probability , using Walker's Alias table and Metropolis-Hasting, can yield an order of magnitude speed up.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

Prerequisites

Project uses Python and many relevant statistical, graphical libaries, notably:

  • panda
  • numpy
  • matplotlib

Information on how to install these packages can be found only. I highly recommend using Anaconda platform.

conda install pandas

Dataset:

The example uses the dataset from the Associated Press (AP) - which contains 2,246 articles and 10,473 words

Authors

  • Tuan Nguyen Doan - Initial work - tuangauss

This is a self-learning project and I hope to learn from the expertise of the community. Please reach out to me if you have any suggestion or ideas.