Columnix

Columnix is a columnar storage format, similar to Parquet and ORC.

The goal of the experiment was to beat Parquet's read performance in Spark for flat schemas, while also reducing the disk footprint by using newer compression algorithms such as lz4 and zstd.

Columnix supports:

1. row groups
2. indexes (at the row group level, and file level)
3. vectorized reads
4. predicate pushdown
5. lazy reads
6. AVX2 and AVX512 predicate matching
7. memory-mapped IO

Spark's Parquet reader supports items 1-4 (row groups, indexes, vectorized reads and predicate pushdown), but it has no support for lazy reads, has only limited SIMD support (whatever the JVM provides), and does its IO through HDFS.
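To make the interaction between row groups, indexes, predicate pushdown and lazy reads more concrete, here is a minimal Python sketch. It is not the Columnix implementation or file layout: the `RowGroup` class, its min/max `index` and the `scan_greater_than` function are hypothetical names invented for illustration, and the "index" is assumed to be a simple (min, max) pair per column.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """Hypothetical in-memory row group: one list of values per column."""
    columns: Dict[str, List[int]]
    # Assumed index shape: column name -> (min, max), built once up front.
    index: Dict[str, Tuple[int, int]] = field(default_factory=dict)

    def __post_init__(self):
        self.index = {
            name: (min(values), max(values))
            for name, values in self.columns.items()
        }


def scan_greater_than(groups, filter_col, threshold, project_col):
    """Return project_col values for rows where filter_col > threshold.

    Also returns how many row groups actually had their data scanned.
    """
    out = []
    groups_scanned = 0
    for group in groups:
        _, highest = group.index[filter_col]
        if highest <= threshold:
            # Predicate pushdown: the index alone proves no row can match,
            # so the column data of this group is never looked at.
            continue
        groups_scanned += 1
        matches = [
            i for i, value in enumerate(group.columns[filter_col])
            if value > threshold
        ]
        if not matches:
            continue
        # Lazy read: the projected column is only touched after at least
        # one row in the group has matched the predicate.
        projected = group.columns[project_col]
        out.extend(projected[i] for i in matches)
    return out, groups_scanned
```

With two row groups whose `score` ranges are 10-30 and 40-60, a scan for `score > 35` only needs to scan the second group; the first is eliminated by its index.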

Support for complex schemas was not a goal of the project. The format has no support for Parquet's Dremel-style definition & repetition levels or ORC's compound types (struct, list, map, union).

The library does not currently support encoding of data prior to (or instead of) compression, for example run-length or dictionary encoding, despite placeholders in the code alluding to it. It was next on the TODO list, but I'd like to explore alternative approaches such as github.com/chriso/treecomp.
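For readers unfamiliar with the two encodings mentioned above, the following Python sketch shows the general idea of each. It is purely illustrative: Columnix does not implement either encoding, and the function names here are hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encoding: collapse each run of equal values to (value, count)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


def dict_encode(values):
    """Dictionary encoding: store each distinct value once, plus integer codes."""
    dictionary = []
    positions = {}
    codes = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        codes.append(positions[value])
    return dictionary, codes
```

Both transforms tend to shrink repetitive columns before a general-purpose compressor such as lz4 or zstd ever sees the data.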

The following bindings are provided:

Python (ctypes): ./contrib/columnix.py
Spark (JNI): chriso/columnix-spark

One major caveat: the library uses mmap for reads. There is no HDFS compatibility, so real-world use is limited for the time being.
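As a rough illustration of what mmap-based reads look like, and why they assume a local file rather than HDFS, here is a small Python sketch using the standard `mmap` module. The file written here is just a flat array of native-endian 32-bit integers; that layout and the function names are assumptions for the example, not the Columnix format.

```python
import mmap
import os
import struct
import tempfile


def write_int32_column(path, values):
    """Write values as a flat array of native-endian 32-bit integers."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"{len(values)}i", *values))


def sum_int32_column(path):
    """Sum the column by mapping the file instead of calling read()."""
    with open(path, "rb") as f:
        # mmap needs a real file descriptor, which is why a local
        # filesystem is assumed here.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # View the mapped bytes as int32 without copying them first.
            with memoryview(mapped) as raw, raw.cast("i") as column:
                return sum(column)
```

For example, after `write_int32_column(path, [1, 2, 3, -4])`, calling `sum_int32_column(path)` pages the file in on demand and returns the total.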
