The Cloudproof Java library provides a Spark-friendly API to Cosmian's Cloudproof Encryption.
Cloudproof Encryption secures data repositories and applications in the cloud with advanced application-level encryption and encrypted search.
- Licensing
- Cryptographic primitives
- Getting Started
- Reading the code
- Parquet format
- Benchmarks
- Cryptographic overhead
- Testing
The library is available under a dual licensing scheme: Affero GPL v3 and commercial. See LICENSE.md for details.
The library is based on:
- the CoverCrypt algorithm, which allows creating ciphertexts for a set of attributes and issuing user keys with access policies over these attributes. CoverCrypt offers post-quantum resistance.
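The idea can be illustrated with a deliberately simplified conceptual sketch. This is not the Cloudproof Java API: real CoverCrypt access policies are boolean expressions over attributes and the payload is actually encrypted; the types and the subset check below are purely illustrative.

```scala
// Conceptual sketch only: models attribute-based encryption with access
// policies. It is NOT the Cloudproof/CoverCrypt API.
object CoverCryptConcept {
  // A ciphertext is produced for a set of attributes, e.g. ("Department", "HR").
  final case class Attribute(axis: String, name: String)
  final case class Ciphertext(attributes: Set[Attribute], payload: Array[Byte])
  // A user key embeds an access policy: here, the set of attributes it may open.
  final case class UserKey(accessPolicy: Set[Attribute])

  def encrypt(attributes: Set[Attribute], clearText: Array[Byte]): Ciphertext =
    Ciphertext(attributes, clearText) // the real scheme performs hybrid encryption here

  def decrypt(key: UserKey, ct: Ciphertext): Option[Array[Byte]] =
    // Decryption succeeds only if the key's policy covers the ciphertext attributes.
    if (ct.attributes.subsetOf(key.accessPolicy)) Some(ct.payload) else None

  def main(args: Array[String]): Unit = {
    val hr           = Attribute("Department", "HR")
    val confidential = Attribute("Security", "Confidential")
    val ct = encrypt(Set(hr, confidential), "salary data".getBytes("UTF-8"))

    val hrKey        = UserKey(Set(hr, confidential))
    val marketingKey = UserKey(Set(Attribute("Department", "Marketing")))
    println(decrypt(hrKey, ct).isDefined)        // true: the policy covers the attributes
    println(decrypt(marketingKey, ct).isDefined) // false: the attributes are not covered
  }
}
```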
This library is open-source software and is available on Maven Central.
```xml
<dependency>
    <groupId>com.cosmian.cloudproof.spark</groupId>
    <artifactId>cloudproof_spark</artifactId>
    <version>1.0.0</version>
</dependency>
```
1/ Install SBT
2/ Install Spark
- Download and extract the Spark distribution
3/ Download the CSV file organizations-2000000.csv from https://www.datablist.com/learn/csv/download-sample-csv-files and put it in the project root folder:

```sh
wget https://github.com/datablist/sample-csv-files/raw/main/files/organizations/organizations-2000000.csv
7za x organizations-2000000.csv
```
4/ Execute:

```sh
mvn package && spark-submit --class "CloudproofSpark" --master "local[*]" target/cloudproof_spark-1.0.0.jar
```

or:

```sh
sbt assembly && spark-submit --class "CloudproofSpark" --master "local[*]" target/scala-2.12/CloudproofSpark-assembly-1.0.0.jar
```
- src/main/scala/com/cosmian/cloudproof/spark/CloudproofSpark.scala is the main entrypoint. It contains the Spark code that reads the CSV, writes the encrypted Parquet files, and reads the encrypted Parquet files again (with different keys); a hedged sketch of this flow is given after this list.
- src/main/java/com/cosmian/cloudproof/spark/CoverCryptCryptoFactory.java is the class responsible for encrypting/decrypting the files and the columns with CoverCrypt.
- src/main/java/com/cosmian/cloudproof/spark/EncryptionMapping.java is a simple class that encapsulates the mapping in string form (because the Spark config only works with strings), reads it back, and chooses the correct policy for a specific file/column.
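As a rough sketch of that flow (not the actual project code): Parquet Modular Encryption lets a custom properties factory be plugged in through the standard `parquet.crypto.factory.class` Hadoop property, and `CoverCryptCryptoFactory` is the class name taken from this repository. The paths, CSV options, and the absence of any policy/mapping configuration keys are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of the CSV -> encrypted Parquet -> read-back flow.
// Only "parquet.crypto.factory.class" is a standard Parquet property; paths and
// options below are placeholders, and the real entrypoint also configures the
// encryption mapping and CoverCrypt keys.
object CloudproofSparkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CloudproofSparkSketch")
      .master("local[*]")
      .getOrCreate()

    // Plug the CoverCrypt factory into Parquet Modular Encryption.
    spark.sparkContext.hadoopConfiguration.set(
      "parquet.crypto.factory.class",
      "com.cosmian.cloudproof.spark.CoverCryptCryptoFactory")

    // Read the CSV dataset and write it as encrypted Parquet files.
    val df = spark.read.option("header", "true").csv("organizations-2000000.csv")
    df.write.mode("overwrite").parquet("organizations.parquet.encrypted")

    // Read the encrypted Parquet files back (decryption requires a user key
    // whose access policy covers the file/column attributes).
    spark.read.parquet("organizations.parquet.encrypted").show(5)

    spark.stop()
  }
}
```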
The Parquet format is described here.
- Timings include:
  - Spark boot
  - reading the CSV dataset
  - writing the output in Parquet format
- Due to JVM warm-up and Spark boot, timings are unstable: the values given are only an indication of the performance.
- Parquet encryption can be applied to whole files and to individual columns
- CPU: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
- Datasets are CSV files:
  - 100 000 lines (14M)
  - 500 000 lines (71M)
  - 1 000 000 lines (141M)
  - 2 000 000 lines (283M)
- The size of output is the combined size of all .parquet and .crc files (a small sketch for computing it follows this list)
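The "size of output" figure can be reproduced by summing the sizes of the generated .parquet and .crc files. A small sketch follows; the output directory name is an assumption, not the one used by the benchmarks.

```scala
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Sum the sizes of all .parquet and .crc files under the Spark output directory.
object OutputSize {
  def main(args: Array[String]): Unit = {
    val dir = Paths.get("organizations.parquet.encrypted") // placeholder path
    val totalBytes = Files.walk(dir).iterator().asScala
      .filter(p => Files.isRegularFile(p))
      .filter(p => p.toString.endsWith(".parquet") || p.toString.endsWith(".crc"))
      .map(p => Files.size(p))
      .sum
    println(s"Size of output: ${totalBytes / (1024 * 1024)}M")
  }
}
```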
Without post-quantum resistance, the CoverCrypt scheme's size overhead and performance are comparable to those of a classic symmetric encryption algorithm such as AES256-GCM, while still providing the benefits of a hybrid cryptographic system.
|                | 100_000 lines | 500_000 lines | 1_000_000 lines | 2_000_000 lines |
|----------------|---------------|---------------|-----------------|-----------------|
| Size of output | 17M           | 66M           | 104M            | 169M            |
| Timings        | 7s            | 24s           | 31s             | 31s             |

|                | 100_000 lines | 500_000 lines | 1_000_000 lines | 2_000_000 lines |
|----------------|---------------|---------------|-----------------|-----------------|
| Size of output | 19M           | 77M           | 117M            | 183M            |
| Timings        | 6s            | 23s           | 31s             | 36s             |

|                | 100_000 lines | 500_000 lines | 1_000_000 lines | 2_000_000 lines |
|----------------|---------------|---------------|-----------------|-----------------|
| Size of output | 20M           | 78M           | 118M            | 185M            |
| Timings        | 9s            | 24s           | 33s             | 40s             |

|                | 100_000 lines | 500_000 lines | 1_000_000 lines | 2_000_000 lines |
|----------------|---------------|---------------|-----------------|-----------------|
| Size of output | 21M           | 85M           | 126M            | 192M            |
| Timings        | 9s            | 24s           | 36s             | 42s             |
A description of the cryptographic overhead is given here.
To run the tests in TestCloudproof.scala, execute:

```sh
sbt "test:testOnly -- -oD"
```