A Pig UDF to compute and use the HyperLogLog algorithm

xadrnd/pig-hyperloglog

 
 


pig-hyperloglog

Several Apache Pig user-defined functions (UDFs) to compute and use the HyperLogLog algorithm.

Other implementations exist (for example, this one). This project was written to complement the hyperloglog MySQL plugin and uses the exact same implementation. It therefore lets you compute an HLL string in a Pig script, import the results into MySQL, and then invoke the MySQL HLL functions on the data to obtain cardinality estimates.

Usage

Four separate UDFs exist: HLL_CREATE, HLL_COMPUTE, HLL_MERGE, and HLL_MERGE_COMPUTE.
These are exactly the same functions as in the hyperloglog MySQL plugin, so check out its documentation.
See UdfTest.java for examples.
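
For readers new to HyperLogLog, the following is a minimal Java sketch of the three operations the UDF names suggest: building a sketch from values, merging sketches, and computing a cardinality estimate. It is an illustration only and is not this project's C++ implementation. The class name `HllSketch`, the register count (`P = 12`), and the hash function are all assumptions, and serialization to an HLL string is not shown. The plugin documentation remains the authoritative description of each function.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative HyperLogLog sketch. Not the project's C++ implementation. */
final class HllSketch {
    private static final int P = 12;       // index bits (assumed value)
    private static final int M = 1 << P;   // number of registers
    private final byte[] registers = new byte[M];

    // 64-bit FNV-1a plus a MurmurHash3-style finalizer (assumed hash, chosen for brevity).
    private static long hash(String value) {
        long h = 0xcbf29ce484222325L;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    /** Folds one value into the sketch. */
    void add(String value) {
        long h = hash(value);
        int idx = (int) (h >>> (64 - P));   // top P bits select a register
        long rest = h << P;                 // remaining bits, left-aligned
        // Rank = position of the first 1 bit, capped at the number of remaining bits + 1.
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - P + 1);
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    /** Register-wise maximum: the result equals the sketch of the union of both inputs. */
    HllSketch merge(HllSketch other) {
        HllSketch out = new HllSketch();
        for (int i = 0; i < M; i++) {
            out.registers[i] = (byte) Math.max(registers[i], other.registers[i]);
        }
        return out;
    }

    /** Cardinality estimate with the standard small-range (linear counting) correction. */
    double estimate() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += 1.0 / (1L << r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1 + 1.079 / M);
        double raw = alpha * M * M / sum;
        if (raw <= 2.5 * M && zeros > 0) {
            return M * Math.log((double) M / zeros);
        }
        return raw;
    }

    public static void main(String[] args) {
        HllSketch monday = new HllSketch();
        HllSketch tuesday = new HllSketch();
        for (int i = 0; i < 50_000; i++) monday.add("user-" + i);
        for (int i = 30_000; i < 100_000; i++) tuesday.add("user-" + i);
        // Users seen on both days are counted once after merging.
        System.out.printf("monday=%.0f tuesday=%.0f merged=%.0f%n",
                monday.estimate(), tuesday.estimate(), monday.merge(tuesday).estimate());
    }
}
```

Because merging is a register-wise maximum, sketches computed separately (for example, one per day) can be stored and combined later without access to the raw values, which is what makes the compute-in-Pig, analyze-in-MySQL workflow possible.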

Note: When using the UDFs from Apache Pig, you need to register the project jar file and also make sure that the libpighll.so file (or the DLL on Windows) can be found on the Java library path.
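
If the native library cannot be loaded, a quick check of the Java library path can help narrow down the problem. The following is a small diagnostic sketch, not part of this project: the class name `NativeLibCheck` is hypothetical, and the base name `pighll` is inferred from the file name libpighll.so mentioned above. It looks in each directory of `java.library.path` for the platform-specific file name.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/** Hypothetical helper: reports where a native library would be found. */
final class NativeLibCheck {
    /** Returns the first match for libName in the given path list, if any. */
    static Optional<Path> find(String libName, String pathList) {
        // Platform file name, e.g. "libpighll.so" on Linux or "pighll.dll" on Windows.
        String fileName = System.mapLibraryName(libName);
        for (String dir : pathList.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            Path candidate = Paths.get(dir, fileName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String path = System.getProperty("java.library.path", "");
        Optional<Path> hit = find("pighll", path);
        System.out.println(hit.map(p -> "Found " + p)
                .orElse("pighll not found on java.library.path: " + path));
    }
}
```

The path itself can be set with the standard `-Djava.library.path=...` JVM option. The check has to run with the same JVM settings as the process that loads the UDFs for the result to be meaningful.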

What if I do not use Apache Pig?

The HyperLogLog class is a Java class that wraps the underlying C++ implementation.
It can be used from Hadoop MapReduce, Hive, HBase, or any other JVM-based program.

Compilation

Prerequisites: You should have CMake and Maven 2 installed.

git submodule update --init
cd jni/mysql-hyperloglog
git submodule update --init
cd ..
cmake .
make
cd ..
mvn package

Note: Tested on Ubuntu, but it should work on most platforms.
