This guide provides instructions for setting up and configuring the supported cloud storage systems, together with usage examples for transferring data stored in the supported file formats.
- Prerequisites
- Exasol Named Connection Object
- Amazon S3
- Google Cloud Storage
- Azure Blob Storage
- Azure Data Lake Gen1 Storage
- Azure Data Lake Gen2 Storage
Before you start, deploy the latest jar file to an Exasol BucketFS bucket and create the UDF scripts for importing and exporting data from cloud storage filesystems. Please follow the instructions in the deployment guide if you have not done so already.
Additionally, you can read the user guide in order to get familiar with cloud-storage-etl-udfs in general.
In this guide, we provide examples using the RETAIL schema and the SALES_POSITIONS table.
Moreover, we show how to provide secure access credentials using UDF parameters and Exasol named connection objects. We highly recommend using connection objects to provide credentials to UDFs so that secrets are not displayed in audit logs.
Amazon Web Services uses security credentials, for example an access key and a secret key, to verify whether you have permission to access a resource such as S3.
Thus, if you want to import data files from an Amazon Simple Storage Service (Amazon S3) bucket, your credentials must allow that access.
Therefore, in order to access an Amazon S3 bucket, you should provide the following property keys to the UDF scripts, with values taken from your AWS access key and secret key.
- S3_ACCESS_KEY
- S3_SECRET_KEY
When using Multi-Factor Authentication (MFA), you can also provide an additional session token using the S3_SESSION_TOKEN parameter.
Please follow the Amazon credentials management best practices when creating credentials.
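To quickly verify that an access key and secret key pair works before wiring it into the UDF, you could, for example, call the AWS STS service with the AWS CLI. This assumes the AWS CLI is installed; the placeholders are your own credential values:

AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY> AWS_SECRET_ACCESS_KEY=<AWS_SECRET_KEY> \
  aws sts get-caller-identity

If the credentials are valid, the command prints the account ID and the ARN of the calling identity.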
Additionally, you should provide the S3 endpoint via the S3_ENDPOINT parameter, for example, s3.eu-central-1.amazonaws.com.
An endpoint is the URL of the entry point for an AWS resource. For example, https://dynamodb.us-west-2.amazonaws.com is the endpoint for the Amazon DynamoDB service in the US West (Oregon) Region.
First of all, create a named connection object and encode the credentials as key-value pairs separated by a semicolon (;).
Using AWS access and secret keys:
CREATE OR REPLACE CONNECTION S3_CONNECTION
TO ''
USER ''
IDENTIFIED BY 'S3_ACCESS_KEY=<AWS_ACCESS_KEY>;S3_SECRET_KEY=<AWS_SECRET_KEY>';
Or together with a session token:
CREATE OR REPLACE CONNECTION S3_CONNECTION
TO ''
USER ''
IDENTIFIED BY 'S3_ACCESS_KEY=<AWS_ACCESS_KEY>;S3_SECRET_KEY=<AWS_SECRET_KEY>;S3_SESSION_TOKEN=<AWS_SESSION_TOKEN>';
IMPORT INTO RETAIL.SALES_POSITIONS
FROM SCRIPT ETL.IMPORT_PATH WITH
BUCKET_PATH = 's3a://<BUCKET>/import/orc/sales_positions/*'
DATA_FORMAT = 'ORC'
S3_ENDPOINT = 's3.<REGION>.amazonaws.com'
CONNECTION_NAME = 'S3_CONNECTION'
PARALLELISM = 'nproc()*2';
EXPORT RETAIL.SALES_POSITIONS
INTO SCRIPT ETL.EXPORT_PATH WITH
BUCKET_PATH = 's3a://<BUCKET>/export/parquet/sales_positions/'
DATA_FORMAT = 'PARQUET'
S3_ENDPOINT = 's3.<REGION>.amazonaws.com'
CONNECTION_NAME = 'S3_CONNECTION'
PARALLELISM = 'iproc(), floor(random()*2)';
Alternatively, you can provide the access credentials directly to the UDF as key-value parameters.
IMPORT INTO RETAIL.SALES_POSITIONS
FROM SCRIPT ETL.IMPORT_PATH WITH
BUCKET_PATH = 's3a://<BUCKET>/import/orc/sales_positions/*'
DATA_FORMAT = 'ORC'
S3_ACCESS_KEY = '<AWS_ACCESS_KEY>'
S3_SECRET_KEY = '<AWS_SECRET_KEY>'
S3_SESSION_TOKEN = '<AWS_SESSION_TOKEN>'
S3_ENDPOINT = 's3.<REGION>.amazonaws.com'
PARALLELISM = 'nproc()*2';
EXPORT RETAIL.SALES_POSITIONS
INTO SCRIPT ETL.EXPORT_PATH WITH
BUCKET_PATH = 's3a://<BUCKET>/export/parquet/sales_positions/'
DATA_FORMAT = 'PARQUET'
S3_ACCESS_KEY = '<AWS_ACCESS_KEY>'
S3_SECRET_KEY = '<AWS_SECRET_KEY>'
S3_ENDPOINT = 's3.<REGION>.amazonaws.com'
PARALLELISM = 'iproc(), floor(random()*2)';
Currently, we only support the s3a URI scheme; therefore, the BUCKET_PATH should start with it.
Similar to Amazon S3, you need to provide security credentials to the UDF in order to access Google Cloud Storage.
For this, you need to set two property keys when running the UDF:
- GCS_PROJECT_ID
- GCS_KEYFILE_PATH
The GCS_PROJECT_ID is a Google Cloud Platform (GCP) project identifier. It is a unique string for your project, composed of the project name and a randomly assigned number. Please check out the GCP creating and managing projects page for more information.
The GCS_KEYFILE_PATH is the path to the GCP service account private key file, usually stored in JSON format.
A Google Cloud Platform service account is an identity that an instance or an application can use to authenticate itself and perform authorized tasks on Google Cloud resources. It is a special type of Google account intended to represent a non-human user that needs to authenticate and be authorized to access Google APIs. Please check out the GCP documentation pages on the introduction to service accounts, understanding service accounts and generating service account private keys.
Once the service account is created, grant it sufficient permissions to access the Google Cloud Storage objects and download its private key as a JSON file.
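For example, assuming the Google Cloud SDK (gcloud and gsutil) is installed, this could look roughly as follows. The service account name is a placeholder, and the objectAdmin role is only a suggestion that covers both imports (read) and exports (write); adjust it to your own security requirements. Grant the service account access to the bucket:

gsutil iam ch serviceAccount:<SERVICE_ACCOUNT_NAME>@<GCP_PROJECT_ID>.iam.gserviceaccount.com:roles/storage.objectAdmin gs://<GCS-STORAGE>

Then download the service account private key as a JSON file:

gcloud iam service-accounts keys create gcp-<PROJECT_ID>-service-keyfile.json \
  --iam-account=<SERVICE_ACCOUNT_NAME>@<GCP_PROJECT_ID>.iam.gserviceaccount.com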
To be able to use the service account private key in the UDF, upload it to a bucket in Exasol BucketFS.
Upload GCP service account keyfile:
curl -X PUT -T gcp-<PROJECT_ID>-service-keyfile.json \
http://w:[email protected]:2580/<BUCKET_NAME>/gcp-<PROJECT_ID>-service-keyfile.json
Please make sure that the bucket is secure and only readable by users who run the UDFs.
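To check that the upload succeeded, you can, for example, list the bucket contents over HTTP using the bucket's read password (the hostname, port and password below are placeholders for your own setup):

curl http://r:read-password@exasol.datanode.domain.com:2580/<BUCKET_NAME>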
IMPORT INTO RETAIL.SALES_POSITIONS
FROM SCRIPT ETL.IMPORT_PATH WITH
BUCKET_PATH = 'gs://<GCS-STORAGE>/import/avro/sales_positions/*'
DATA_FORMAT = 'AVRO'
GCS_PROJECT_ID = '<GCP_PROJECT_ID>'
GCS_KEYFILE_PATH = '/buckets/bfsdefault/<BUCKET_NAME>/gcp-<PROJECT_ID>-service-keyfile.json'
PARALLELISM = 'nproc()*4';
EXPORT RETAIL.SALES_POSITIONS
INTO SCRIPT ETL.EXPORT_PATH WITH
BUCKET_PATH = 'gs://<GCS-STORAGE>/export/parquet/sales_positions/'
DATA_FORMAT = 'PARQUET'
GCS_PROJECT_ID = '<GCP_PROJECT_ID>'
GCS_KEYFILE_PATH = '/buckets/bfsdefault/<BUCKET_NAME>/gcp-<PROJECT_ID>-service-keyfile.json'
PARALLELISM = 'iproc(), floor(random()*4)';
Note that the UDF accesses GCS objects using the gs URI scheme.
The Exasol user defined functions have access to buckets under the predefined file path /buckets/bfsdefault/. As a result, you can refer to your bucket when running cloud-storage-etl-udfs by providing a path such as /buckets/bfsdefault/<BUCKET_NAME>/. Please check out the Exasol manual to learn more about the Exasol synchronous cluster filesystem BucketFS.
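As an optional sanity check, you can verify that the keyfile is visible to UDFs under that path with a small helper script. This is only a sketch; the schema and script name (ETL.CHECK_BUCKET_FILE) are placeholders and not part of cloud-storage-etl-udfs:

CREATE OR REPLACE PYTHON SCALAR SCRIPT ETL.CHECK_BUCKET_FILE(file_path VARCHAR(2000))
RETURNS BOOLEAN AS
import os

def run(ctx):
    # True if the given BucketFS path is visible from inside the UDF
    return os.path.isfile(ctx.file_path)
/

SELECT ETL.CHECK_BUCKET_FILE('/buckets/bfsdefault/<BUCKET_NAME>/gcp-<PROJECT_ID>-service-keyfile.json');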
Azure Blob Storage containers can be accessed using two possible authorization mechanisms.
- AZURE_SECRET_KEY
- AZURE_SAS_TOKEN
The AZURE_SECRET_KEY is a 512-bit storage access key that is generated when you create a storage account. It is used to authorize access to the storage account.
The AZURE_SAS_TOKEN is a Shared Access Signature (SAS) that provides secure access to a storage account with granular control over how clients can access the data.
You should provide one of these parameters when using cloud-storage-etl-udfs to access Azure Blob Storage containers.
Please refer to the Azure documentation on creating a storage account, managing storage access keys and using shared access signatures (SAS).
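If you have the Azure CLI installed, one way to obtain these values is, for example, as follows. The resource group, account name, container name and expiry date are placeholders, and the rl (read, list) permissions are only an example; adjust them to your needs. List the storage account access keys:

az storage account keys list --resource-group <RESOURCE_GROUP> --account-name <AZURE_ACCOUNT_NAME>

Generate a SAS token for the container:

az storage container generate-sas --account-name <AZURE_ACCOUNT_NAME> --account-key <AZURE_SECRET_KEY> \
  --name <AZURE_CONTAINER_NAME> --permissions rl --expiry <EXPIRY_DATE> --output tsv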
Create a named connection object for use with an Azure secret key:
CREATE OR REPLACE CONNECTION AZURE_BLOB_SECRET_CONNECTION
TO ''
USER ''
IDENTIFIED BY 'AZURE_SECRET_KEY=<AZURE_SECRET_KEY>';
Or for use with an Azure SAS token:
CREATE OR REPLACE CONNECTION AZURE_BLOB_SAS_CONNECTION
TO ''
USER ''
IDENTIFIED BY 'AZURE_SAS_TOKEN=<AZURE_SAS_TOKEN>';
IMPORT INTO RETAIL.SALES_POSITIONS
FROM SCRIPT ETL.IMPORT_PATH WITH
BUCKET_PATH = 'wasbs://<AZURE_CONTAINER_NAME>@<AZURE_ACCOUNT_NAME>.blob.core.windows.net/import/orc/*'
DATA_FORMAT = 'ORC'
CONNECTION_NAME = 'AZURE_BLOB_SECRET_CONNECTION'
PARALLELISM = 'nproc()';
IMPORT INTO RETAIL.SALES_POSITIONS
FROM SCRIPT ETL.IMPORT_PATH WITH
BUCKET_PATH = 'wasbs://<AZURE_CONTAINER_NAME>@<AZURE_ACCOUNT_NAME>.blob.core.windows.net/import/orc/*'
DATA_FORMAT = 'ORC'
CONNECTION_NAME = 'AZURE_BLOB_SAS_CONNECTION'
PARALLELISM = 'nproc()';
EXPORT RETAIL.SALES_POSITIONS
INTO SCRIPT ETL.EXPORT_PATH WITH
BUCKET_PATH = 'wasbs://<AZURE_CONTAINER_NAME>@<AZURE_ACCOUNT_NAME>.blob.core.windows.net/export/parquet/'
DATA_FORMAT = 'PARQUET'
CONNECTION_NAME = 'AZURE_BLOB_SECRET_CONNECTION'
PARALLELISM = 'iproc()';
EXPORT RETAIL.SALES_POSITIONS
INTO SCRIPT ETL.EXPORT_PATH WITH
BUCKET_PATH = 'wasbs://<AZURE_CONTAINER_NAME>@<AZURE_ACCOUNT_NAME>.blob.core.windows.net/export/parquet/'
DATA_FORMAT = 'PARQUET'
CONNECTION_NAME = 'AZURE_BLOB_SAS_CONNECTION'
PARALLELISM = 'iproc()';
IMPORT INTO RETAIL.SALES_POSITIONS
FROM SCRIPT ETL.IMPORT_PATH WITH
BUCKET_PATH = 'wasbs://<AZURE_CONTAINER_NAME>@<AZURE_ACCOUNT_NAME>.blob.core.windows.net/import/orc/*'
DATA_FORMAT = 'ORC'
AZURE_SECRET_KEY = '<AZURE_SECRET_KEY>'
PARALLELISM = 'nproc()';
IMPORT INTO RETAIL.SALES_POSITIONS
FROM SCRIPT ETL.IMPORT_PATH WITH
BUCKET_PATH = 'wasbs://<AZURE_CONTAINER_NAME>@<AZURE_ACCOUNT_NAME>.blob.core.windows.net/import/orc/*'
DATA_FORMAT = 'ORC'
AZURE_SAS_TOKEN = '<AZURE_SAS_TOKEN>'
PARALLELISM = 'nproc()';
EXPORT RETAIL.SALES_POSITIONS
INTO SCRIPT ETL.EXPORT_PATH WITH
BUCKET_PATH = 'wasbs://<AZURE_CONTAINER_NAME>@<AZURE_ACCOUNT_NAME>.blob.core.windows.net/export/parquet/'
DATA_FORMAT = 'PARQUET'
AZURE_SECRET_KEY = '<AZURE_SECRET_KEY>'
PARALLELISM = 'iproc()';
EXPORT RETAIL.SALES_POSITIONS
INTO SCRIPT ETL.EXPORT_PATH WITH
BUCKET_PATH = 'wasbs://<AZURE_CONTAINER_NAME>@<AZURE_ACCOUNT_NAME>.blob.core.windows.net/export/parquet/'
DATA_FORMAT = 'PARQUET'
AZURE_SAS_TOKEN = '<AZURE_SAS_TOKEN>'
PARALLELISM = 'iproc()';
The Azure Blob Storage container path URI scheme can be wasbs or wasb.
The following properties should be provided to the UDF in order to access the Azure Data Lake (Gen1) Storage.
- AZURE_CLIENT_ID
- AZURE_CLIENT_SECRET
- AZURE_DIRECTORY_ID
The AZURE_CLIENT_ID is the Azure Active Directory (AD) App registration Application ID.
The AZURE_CLIENT_SECRET is the secret key generated for the Application ID.
The AZURE_DIRECTORY_ID is the Active Directory (AD) Directory (Tenant) ID.
Please check out the Azure documentation on service-to-service authentication using Azure Active Directory and on creating an Azure AD application and service principal. These documentation pages show how to obtain the required configuration settings.
Finally, make sure that the client ID has access permissions to the Gen1 storage container or its child directories.
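If you use the Azure CLI, creating a service principal prints all three values at once; for example (the application name is a placeholder):

az ad sp create-for-rbac --name <APP_NAME>

In the JSON output, appId corresponds to AZURE_CLIENT_ID, password to AZURE_CLIENT_SECRET and tenant to AZURE_DIRECTORY_ID.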
Create a named connection object that includes secure credentials for Azure ADLS Storage in the identification field:
CREATE OR REPLACE CONNECTION AZURE_ADLS_CONNECTION
TO ''
USER ''
IDENTIFIED BY 'AZURE_CLIENT_ID=<AZURE_CLIENT_ID>;AZURE_CLIENT_SECRET=<AZURE_CLIENT_SECRET>;AZURE_DIRECTORY_ID=<AZURE_DIRECTORY_ID>';
IMPORT INTO RETAIL.SALES_POSITIONS
FROM SCRIPT ETL.IMPORT_PATH WITH
BUCKET_PATH = 'adl://<AZURE_CONTAINER_NAME>.azuredatalakestore.net/import/avro/*'
DATA_FORMAT = 'AVRO'
CONNECTION_NAME = 'AZURE_ADLS_CONNECTION'
PARALLELISM = 'nproc()';
EXPORT RETAIL.SALES_POSITIONS
INTO SCRIPT ETL.EXPORT_PATH WITH
BUCKET_PATH = 'adl://<AZURE_CONTAINER_NAME>.azuredatalakestore.net/export/parquet/'
DATA_FORMAT = 'PARQUET'
CONNECTION_NAME = 'AZURE_ADLS_CONNECTION'
PARALLELISM = 'iproc()';
IMPORT INTO RETAIL.SALES_POSITIONS
FROM SCRIPT ETL.IMPORT_PATH WITH
BUCKET_PATH = 'adl://<AZURE_CONTAINER_NAME>.azuredatalakestore.net/import/avro/*'
DATA_FORMAT = 'AVRO'
AZURE_CLIENT_ID = '<AZURE_CLIENT_ID>'
AZURE_CLIENT_SECRET = '<AZURE_CLIENT_SECRET>'
AZURE_DIRECTORY_ID = '<AZURE_DIRECTORY_ID>'
PARALLELISM = 'nproc()';
EXPORT RETAIL.SALES_POSITIONS
INTO SCRIPT ETL.EXPORT_PATH WITH
BUCKET_PATH = 'adl://<AZURE_CONTAINER_NAME>.azuredatalakestore.net/export/parquet/'
DATA_FORMAT = 'PARQUET'
AZURE_CLIENT_ID = '<AZURE_CLIENT_ID>'
AZURE_CLIENT_SECRET = '<AZURE_CLIENT_SECRET>'
AZURE_DIRECTORY_ID = '<AZURE_DIRECTORY_ID>'
PARALLELISM = 'iproc()';
The container path should start with the adl URI scheme.
Currently, the Azure Data Lake Gen2 Storage UDF requires the storage account secret key in order to authenticate itself.
- AZURE_SECRET_KEY
Please refer to the Azure documentation on creating a storage account and managing storage access keys.
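If you do not have the key at hand, one way to look it up is, for example, via the Azure CLI (resource group and account name are placeholders):

az storage account keys list --resource-group <RESOURCE_GROUP> --account-name <AZURE_ACCOUNT_NAME>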
Create a named connection object that includes secret key for Azure Data Lake Gen2 Storage in the identification field:
CREATE OR REPLACE CONNECTION AZURE_ABFS_CONNECTION
TO ''
USER ''
IDENTIFIED BY 'AZURE_SECRET_KEY=<AZURE_SECRET_KEY>';
IMPORT INTO RETAIL.SALES_POSITIONS
FROM SCRIPT ETL.IMPORT_PATH WITH
BUCKET_PATH = 'abfs://<AZURE_CONTAINER_NAME>@<AZURE_ACCOUNT_NAME>.dfs.core.windows.net/import/orc/*'
DATA_FORMAT = 'ORC'
CONNECTION_NAME = 'AZURE_ABFS_CONNECTION'
PARALLELISM = 'nproc()';
EXPORT RETAIL.SALES_POSITIONS
INTO SCRIPT ETL.EXPORT_PATH WITH
BUCKET_PATH = 'abfss://<AZURE_CONTAINER_NAME>@<AZURE_ACCOUNT_NAME>.dfs.core.windows.net/export/parquet/'
DATA_FORMAT = 'PARQUET'
CONNECTION_NAME = 'AZURE_ABFS_CONNECTION'
PARALLELISM = 'iproc(), floor(random()*2)';
The bucket path should start with the abfs or abfss URI scheme.