Short project to de-duplicate records based on fuzzy matching and machine learning, using Python and ElasticSearch

ogierpaul/suricate

Suricate

A simple and effective framework for finding duplicate entities between datasets. It draws heavily on fuzzy matching, using both tf-idf and the python-levenshtein (fuzzywuzzy) package. Built on a modular architecture of Pandas and Scikit-learn base classes (transformers), it is fully customizable and can be composed into pipelines.
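To make the idea of a similarity transformer concrete, here is a minimal sketch, not suricate's actual API. The class name `PairSimilarity` and the column names `name_left` and `name_right` are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for a python-levenshtein ratio so the example needs only pandas and scikit-learn.

```python
from difflib import SequenceMatcher

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer


class PairSimilarity(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: scores each (left, right) text pair two ways."""

    def __init__(self, left_col="name_left", right_col="name_right"):
        self.left_col = left_col
        self.right_col = right_col

    def fit(self, X, y=None):
        # Learn one shared character n-gram vocabulary from both columns.
        self.vectorizer_ = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
        self.vectorizer_.fit(pd.concat([X[self.left_col], X[self.right_col]]))
        return self

    def transform(self, X):
        left = self.vectorizer_.transform(X[self.left_col])
        right = self.vectorizer_.transform(X[self.right_col])
        # tf-idf rows are L2-normalised by default, so the row-wise
        # dot product is the cosine similarity of each pair.
        tfidf_cos = np.asarray(left.multiply(right).sum(axis=1)).ravel()
        # Edit-based ratio in [0, 1]; difflib is a stand-in for Levenshtein.
        edit_ratio = [
            SequenceMatcher(None, a.lower(), b.lower()).ratio()
            for a, b in zip(X[self.left_col], X[self.right_col])
        ]
        return np.column_stack([tfidf_cos, edit_ratio])


pairs = pd.DataFrame({
    "name_left": ["Acme Corporation", "Acme Corporation"],
    "name_right": ["ACME Corp.", "Initech Ltd"],
})
# One row per pair, columns: [tf-idf cosine, edit ratio]
features = PairSimilarity().fit_transform(pairs)
```

Because the class follows the scikit-learn `fit`/`transform` convention, it could be placed in a `Pipeline` ahead of a classifier.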

Aim: Using machine learning to find duplicate records

Examples

Duplicate records, and the need for record matching, arise in different settings:

  • Merging two information systems (e.g. two ERP systems), where you need to identify which supplier companies are the same
  • Finding the same person in two databases (e.g. an online survey with email vs. a Windows login)
  • ...

The aim is to compare one dataframe (target) with another (right):

  • create a similarity matrix between the two sets of records
  • label each pair as 0 (not a match) or 1 (match)
  • train a Classifier on the data
  • predict
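The four steps above can be sketched end to end with plain pandas and scikit-learn. This is an illustrative sketch under stated assumptions, not the suricate implementation: the toy records, the `ix` and `name` columns, the two similarity features, and the choice of `LogisticRegression` are all hypothetical, and `difflib` again stands in for a Levenshtein ratio.

```python
from difflib import SequenceMatcher

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy records; all names, ids and columns are hypothetical.
left = pd.DataFrame({"ix": ["a1", "a2", "a3"],
                     "name": ["Acme Corporation", "Globex GmbH", "Initech Ltd"]})
right = pd.DataFrame({"ix": ["b1", "b2", "b3"],
                      "name": ["ACME Corp.", "Globex", "Umbrella Inc"]})


def edit_ratio(a, b):
    """Edit-based similarity in [0, 1] (difflib stand-in for Levenshtein)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def first_token_equal(a, b):
    """1.0 if both names start with the same word, else 0.0."""
    return float(a.lower().split()[0] == b.lower().split()[0])


# 1. Similarity matrix: one row per (left, right) pair, one column per score.
pairs = left.merge(right, how="cross", suffixes=("_left", "_right"))
pairs = pairs.set_index(["ix_left", "ix_right"])
names = list(zip(pairs["name_left"], pairs["name_right"]))
X = pd.DataFrame(
    {
        "edit_ratio": [edit_ratio(a, b) for a, b in names],
        "first_token_equal": [first_token_equal(a, b) for a, b in names],
    },
    index=pairs.index,
)

# 2. Label the pairs: 1 for a match, 0 otherwise (hand-labelled here).
known_matches = {("a1", "b1"), ("a2", "b2")}
y = pd.Series([int(ix in known_matches) for ix in X.index],
              index=X.index, name="y_true")

# 3. Train a classifier on the similarity scores.
# Weak regularisation (large C) because the toy sample is tiny.
clf = LogisticRegression(C=100.0).fit(X, y)

# 4. Predict which pairs are duplicates.
y_pred = pd.Series(clf.predict(X), index=X.index, name="y_pred")
```

Here the model is scored on the same pairs it was trained on, which is only acceptable for a toy illustration. With real data, the labelled pairs would be split into training and held-out sets, and the full cross join would usually be replaced by a cheaper candidate-selection step.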

Project Structure

  • suricate: python package
    • contains the code, including the base class that performs the deduplication
  • tests:
    • test library for the suricate package (in progress)
  • tutorial:
    • Jupyter notebooks showing how to use the suricate package
