
Cloudproof Spark Library


The Cloudproof Spark library provides a Spark-friendly API to Cosmian's Cloudproof Encryption.

Cloudproof Encryption secures data repositories and applications in the cloud with advanced application-level encryption and encrypted search.

Licensing

The library is available under a dual licensing scheme: Affero GPL v3 and commercial. See LICENSE.md for details.

Cryptographic primitives

The library is based on:

  • The CoverCrypt algorithm, which allows creating ciphertexts for a set of attributes and issuing user keys with access policies over these attributes. CoverCrypt offers post-quantum resistance (see the illustrative sketch below).
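
As a purely illustrative sketch (the axis, attribute, and policy names below are hypothetical and not taken from this repository), CoverCrypt attributes and access policies are commonly written as boolean expressions over "Axis::Attribute" pairs:

// Hypothetical example: attributes attached to a ciphertext, and a user access
// policy that a user's key must satisfy in order to decrypt it.
val encryptionAttributes = Seq("Department::Finance", "Security::Confidential")
val userAccessPolicy     = "Department::Finance && Security::Confidential"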

Getting Started

Using in Java projects

This library is open-source software and is available on Maven Central.

<dependency>
    <groupId>com.cosmian.cloudproof.spark</groupId>
    <artifactId>cloudproof_spark</artifactId>
    <version>1.0.0</version>
</dependency>
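
For an sbt build, the same coordinates should translate to the following (a sketch assuming the artifact is consumed as a plain Java dependency):

libraryDependencies += "com.cosmian.cloudproof.spark" % "cloudproof_spark" % "1.0.0"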

From this repository

1/ Install SBT

  • For Linux, download and extract the ZIP file

2/ Install Spark

3/ Download the CSV file organizations-2000000.csv from https://www.datablist.com/learn/csv/download-sample-csv-files and place it in the root folder of this repository

wget https://github.com/datablist/sample-csv-files/raw/main/files/organizations/organizations-2000000.csv
7za x organizations-2000000.csv

4/ Execute:

mvn package && spark-submit --class "CloudproofSpark" --master "local[*]" target/cloudproof_spark-1.0.0.jar

or:

sbt assembly && spark-submit --class "CloudproofSpark" --master "local[*]" target/scala-2.12/CloudproofSpark-assembly-1.0.0.jar

Reading the code

  • src/main/scala/com/cosmian/cloudproof/spark/CloudproofSpark.scala is the main entry point; it contains the Spark code that reads the CSV, writes the encrypted Parquet files, and reads the encrypted Parquet files back (with different keys)
  • src/main/java/com/cosmian/cloudproof/spark/CoverCryptCryptoFactory.java is the class responsible for encrypting/decrypting the files and the columns with CoverCrypt (a minimal wiring sketch follows this list)
  • src/main/java/com/cosmian/cloudproof/spark/EncryptionMapping.java is a simple class that encapsulates the mapping in string form (because the Spark configuration only works with strings), reads it back, and selects the correct policy for a specific file/column.
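
The sketch below shows, in a minimal and hedged form, how a crypto factory is typically wired into a Spark job through Parquet Modular Encryption: parquet.crypto.factory.class is the standard Parquet property naming the factory, while the additional properties expected by CoverCryptCryptoFactory (keys, encryption mapping) are repository-specific and only hinted at in comments.

import org.apache.spark.sql.SparkSession

object EncryptedParquetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("EncryptedParquetSketch")
      .master("local[*]")
      .getOrCreate()

    // Standard Parquet Modular Encryption property: which crypto factory to use.
    spark.sparkContext.hadoopConfiguration.set(
      "parquet.crypto.factory.class",
      "com.cosmian.cloudproof.spark.CoverCryptCryptoFactory")

    // The keys and the file/column encryption mapping required by the factory
    // are passed as extra (repository-specific) properties; see
    // CloudproofSpark.scala and EncryptionMapping.java for the actual names.

    val df = spark.read.option("header", "true").csv("organizations-2000000.csv")

    // Write encrypted Parquet files, then read them back.
    df.write.mode("overwrite").parquet("organizations.parquet.encrypted")
    spark.read.parquet("organizations.parquet.encrypted").show(5)

    spark.stop()
  }
}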

Parquet format

Parquet format is described here.

Benchmarks

  • Timings include:
    • Spark boot
    • reading the CSV dataset
    • writing the output in Parquet format
  • Due to JVM execution and Spark boot, timings are unstable: the given values only give an idea of the performance.
  • Parquet encryption applies to both files and columns
  • CPU: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
  • Datasets are CSV files:
    • 100 000 lines (14M)
    • 500 000 lines (71M)
    • 1 000 000 lines (141M)
    • 2 000 000 lines (283M)
  • The size of the output is the total size of all .parquet and .crc files

Quick summary

Without post-quantum resistance, the CoverCrypt scheme's size overhead and performance are equivalent to those of a classic symmetric encryption algorithm such as AES256-GCM, while providing a hybrid cryptographic system with multiple additional benefits.

Parquet without encryption

                 100_000 lines   500_000 lines   1_000_000 lines   2_000_000 lines
Size of output   17M             66M             104M              169M
Timings          7s              24s             31s               31s

Parquet with classic AES256-GCM encryption

                 100_000 lines   500_000 lines   1_000_000 lines   2_000_000 lines
Size of output   19M             77M             117M              183M
Timings          6s              23s             31s               36s

Parquet with CoverCrypt encryption

                 100_000 lines   500_000 lines   1_000_000 lines   2_000_000 lines
Size of output   20M             78M             118M              185M
Timings          9s              24s             33s               40s

Parquet with CoverCrypt encryption (post quantum resistant)

                 100_000 lines   500_000 lines   1_000_000 lines   2_000_000 lines
Size of output   21M             85M             126M              192M
Timings          9s              24s             36s               42s
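
From the tables above, a rough worked comparison at 2_000_000 lines (relative to the 169M unencrypted output) illustrates how close the CoverCrypt overhead stays to AES256-GCM:

AES256-GCM:                183M / 169M  ≈ +8 %
CoverCrypt:                185M / 169M  ≈ +9 %
CoverCrypt (post-quantum): 192M / 169M  ≈ +14 %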

Cryptographic overhead

A description of the cryptographic overhead is given here.

Testing

To run the tests in TestCloudproof.scala, run:

sbt "test:testOnly -- -oD"