This project provides several Apache Pig user-defined functions (UDFs) for building and querying HyperLogLog (HLL) cardinality estimates.
Other implementations exist. This project was written to complement the HyperLogLog MySQL plugin and uses exactly the same implementation. You can therefore compute an HLL string in a Pig script, import the results into MySQL, and then call the MySQL HLL functions on that data to obtain cardinality estimates.
Four UDFs are provided: HLL_CREATE, HLL_COMPUTE, HLL_MERGE, and HLL_MERGE_COMPUTE.
They are exactly the same functions as in the HyperLogLog MySQL plugin, so see that plugin's documentation for details.
UdfTest.java also contains usage examples.
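To make the create, merge, and compute roles more concrete, here is a minimal pure-Java sketch of the general HyperLogLog technique. It is not this project's code (which wraps a C++ implementation) and it does not produce HLL strings compatible with the MySQL plugin. The class name ToyHll, the SHA-256 based hash, and the 4096-register size are all assumptions chosen for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative HyperLogLog sketch; hypothetical class, not part of this project.
class ToyHll {
    private static final int P = 12;       // number of index bits (assumed)
    private static final int M = 1 << P;   // 4096 registers
    private final byte[] registers = new byte[M];

    // Derive a 64-bit hash from the first 8 bytes of a SHA-256 digest.
    private static long hash64(String value) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xffL);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // "Create" step: fold one value into the sketch.
    void add(String value) {
        long h = hash64(value);
        int idx = (int) (h >>> (64 - P));   // top P bits select a register
        long rest = h << P;                 // remaining bits, left-aligned
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - P + 1);
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    // "Merge" step: register-wise maximum of two sketches.
    void merge(ToyHll other) {
        for (int i = 0; i < M; i++) {
            registers[i] = (byte) Math.max(registers[i], other.registers[i]);
        }
    }

    // "Compute" step: estimate the number of distinct values seen.
    double estimate() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1 + 1.079 / M);
        double raw = alpha * M * M / sum;
        // Small-range correction: fall back to linear counting.
        if (raw <= 2.5 * M && zeros > 0) {
            return M * Math.log((double) M / zeros);
        }
        return raw;
    }

    public static void main(String[] args) {
        ToyHll monday = new ToyHll();
        ToyHll tuesday = new ToyHll();
        for (int i = 0; i < 30000; i++) monday.add("user-" + i);
        for (int i = 20000; i < 50000; i++) tuesday.add("user-" + i);
        monday.merge(tuesday);
        System.out.println("Estimated distinct users: " + monday.estimate());
    }
}
```

Because merging is a register-wise maximum, merging two sketches gives the same result as building one sketch over the union of their inputs. That property is what makes it practical to build partial results in one system and combine them in another.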
Note: When using the UDFs from Apache Pig, you must register the project's jar file and also make sure that the libpighll.so file (or the corresponding DLL on Windows) can be found on the Java library path (java.library.path).
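If the native library cannot be loaded, a quick check like the following may help. It is a small diagnostic sketch, not part of the project: the class name NativeLibCheck and the helper findOnLibraryPath are hypothetical, and the base name "pighll" is an assumption derived from the libpighll.so file name mentioned above.

```java
import java.io.File;

// Hypothetical diagnostic helper; not part of this project.
class NativeLibCheck {

    // Returns the first matching library file found in libraryPath, or null.
    static File findOnLibraryPath(String baseName, String libraryPath) {
        // Maps "pighll" to the platform file name, e.g. libpighll.so on Linux.
        String fileName = System.mapLibraryName(baseName);
        if (libraryPath == null) {
            return null;
        }
        for (String dir : libraryPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            File candidate = new File(dir, fileName);
            if (candidate.isFile()) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String path = System.getProperty("java.library.path");
        File found = findOnLibraryPath("pighll", path); // assumed base name
        if (found != null) {
            System.out.println("Found " + found.getAbsolutePath());
        } else {
            System.out.println("Native library not found on java.library.path: " + path);
        }
    }
}
```

Run it with the same JVM options you pass to Pig, so that it reports the library path the UDFs will actually see.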
The HyperLogLog class is a Java class that wraps the underlying C++ implementation.
It can be used from Hadoop MapReduce, Hive, HBase, or any other JVM-based program.
Prerequisites: You should have CMake and Maven 2 installed.
git submodule update --init
cd jni/mysql-hyperloglog
git submodule update --init
cd ..
cmake .
make
cd ..
mvn package
Note: Tested on Ubuntu, but it should work on most platforms.