- How to Contribute
- Architecture of the solution
- Local development environment
- Continuous integration environment
- Running system tests
We'd love to accept your patches and contributions to this project. There are just a few small guidelines you need to follow.
Contributions to this project must be accompanied by a Contributor License Agreement. You (or your employer) retain the copyright to your contribution; this simply gives us permission to use and redistribute your contributions as part of the project. Head over to https://cla.developers.google.com/ to see your current agreements on file or to sign a new one.
You generally only need to submit a CLA once, so if you've already submitted one (even if it was for a different project), you probably don't need to do it again.
All submissions, including submissions by project members, require review. We use GitHub pull requests for this purpose. Consult GitHub Help for more information on using pull requests.
See the Code of Conduct.
The latest documentation about the design of the Oozie to Airflow converter can be found in the Design Document. Please take a look at it to understand how the conversion process works.
You can easily set up your local environment to modify the code and run tests and conversions. The unit tests and conversions can all be run locally; they require neither an Oozie-enabled cluster nor a running Apache Airflow instance.
The environment can be set up in a virtualenv. You can easily create such a virtualenv using virtualenvwrapper.
An example of such a local environment setup (with virtualenvwrapper):
mkvirtualenv -p python3.8 oozie-to-airflow
pip install -e .
Later, you can switch back to this virtualenv by running:
workon oozie-to-airflow
After installing o2a with pip install -e ., the o2a converter is added to your path and your local sources are installed via symbolic links. In other words, the project is installed in editable mode (i.e. setuptools "develop mode") from the local project path.
While in your virtualenv, you can re-install all the requirements via pip install -r requirements.txt, or run pip install -e . again to repeat the "develop mode" installation.
You can also add the bin subdirectory to your PATH; then all the scripts described later in the documentation can be run without the ./bin prefix.
This can be done, for example, by adding a line similar to the following to your .bash_profile or to bin/postactivate of your virtual environment:
export PATH=${PATH}:<INSERT_PATH_TO_YOUR_OOZIE_PROJECT>/bin
Otherwise, you need to run all the scripts with the bin subdirectory prefix, for example:
./bin/o2a --help
In all the example commands below it is assumed that the bin directory is in your PATH.
We use a number of code quality checks. They are verified during the Travis CI build, but you can also install them locally:
Pre-commit hook by running:
pre-commit install
Pre-push hook by running:
pre-commit install --hook-type pre-push
You can also run all the checks manually by running:
pre-commit run --all-files
You might need to install xmllint and docker if you do not have them locally. The former can be installed with apt install libxml2-utils on Linux or brew install xmlstarlet on macOS. The latter can be installed by following the official Docker installation instructions.
You can always skip running the checks by passing the --no-verify flag to the git commit command.
You can find all the commands of the pre-commit framework at https://pre-commit.com/
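A couple of handy invocations; the pylint hook id below is hypothetical, so check .pre-commit-config.yaml for the hook ids actually configured in this project:

```bash
# Run a single hook instead of all checks (hook id is hypothetical)
pre-commit run pylint --all-files

# Commit without running any of the hooks
git commit --no-verify -m "WIP: skip hooks for this commit"
```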
While you are in your local virtualenv, you can run the unit tests. Currently, the test directory is set up in such a way that the folders in the tests directory mirror the structure of the o2a directory.
Unit tests are run automatically in Travis CI and when you have pre-commit hooks installed. You can also run all unit tests using the o2a-run-all-unit-tests script.
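You can also run a subset of the tests directly; a minimal sketch, assuming pytest is installed in your virtualenv (the test path is only an illustration of the mirrored layout):

```bash
# Run the full suite via the helper script
o2a-run-all-unit-tests

# Or run a single, illustrative part of the test tree with pytest
python -m pytest tests/converter -v
```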
All example conversions can be run via the o2a-run-all-conversions script. It is also executed during automated tests.
You can generate dependency graphs automatically from the code via the o2a-generate-dependency-graph script, but you need graphviz installed locally.
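Graphviz can typically be installed from your system package manager, for example:

```bash
# Debian/Ubuntu
apt install graphviz

# macOS with Homebrew
brew install graphviz
```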
You can also see dependency cycles, in case there are any, in o2a-dependency-cycles.png.
The project integrates with Travis CI. To enable saving of the build artifacts, you must configure authorization for Google Cloud Storage. For this purpose, it is necessary to set two environment variables: GCP_SERVICE_ACCOUNT and GCP_BUCKET_NAME.
To do this, follow these steps:
- To simplify the instructions, set the following environment variables:
export PROJECT_ID="$(gcloud config get-value project)"
export ACCOUNT_NAME=o2a-build-artifacts-travis-ci
export ACCOUNT_EMAIL="${ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
export BUCKET_NAME=o2a-build-artifacts
- Create the service account that will be used by Travis
gcloud iam service-accounts create "${ACCOUNT_NAME}"
- Create a new private key for the service account, and save a copy of it in the
o2a-build-artifacts-sa.json
file.
gcloud iam service-accounts keys create --iam-account "${ACCOUNT_EMAIL}" o2a-build-artifacts-sa.json
- Create the bucket
gsutil mb "gs://${BUCKET_NAME}"
- Enable the Bucket Policy Only feature on the Cloud Storage bucket:
gsutil bucketpolicyonly set on "gs://${BUCKET_NAME}"
- Grant permission to make a bucket's objects publicly readable:
gsutil iam ch allUsers:objectViewer "gs://${BUCKET_NAME}"
- Grant the service account permission to create and overwrite the bucket's objects:
gsutil iam ch "serviceAccount:${ACCOUNT_EMAIL}:objectAdmin" "gs://${BUCKET_NAME}"
- Set the environment variables on Travis CI:
travis env set GCP_SERVICE_ACCOUNT "$(cat o2a-build-artifacts-sa.json)" --private
travis env set GCP_BUCKET_NAME "${BUCKET_NAME}" --public
- Remove the service account key from the local disk:
rm o2a-build-artifacts-sa.json
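The two variables can then be consumed during the Travis CI build to authenticate and upload artifacts to the bucket. The snippet below is only a sketch of such a step, not the project's actual configuration; the temporary key path, the local artifact directory and the use of TRAVIS_BUILD_NUMBER are assumptions:

```bash
# Recreate the service account key from the Travis environment variable (assumed temporary path)
echo "${GCP_SERVICE_ACCOUNT}" > /tmp/o2a-build-artifacts-sa.json

# Authenticate gcloud and gsutil with the service account
gcloud auth activate-service-account --key-file=/tmp/o2a-build-artifacts-sa.json

# Upload build artifacts to the public bucket (local artifact directory is an assumption)
gsutil -m cp -r output-artifacts "gs://${GCP_BUCKET_NAME}/${TRAVIS_BUILD_NUMBER}/"
```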
Oozie to Airflow has a set of system tests that test the end-to-end functionality of converting and executing workflows in a cloud environment with Cloud Dataproc and Cloud Composer, as described in the README.md.
The examples defined in the examples folder can be run as system tests. The system tests use an existing Composer instance, a Dataproc cluster, and Oozie running on the Dataproc cluster to prepare the HDFS application folder structure and trigger the tests automatically.
You can run the tests using this command:
o2a-run-sys-test --application <APPLICATION> --phase <PHASE>
The default phase is convert - it only converts the Oozie workflow to an Airflow DAG without running the tests on either Oozie or Composer.
When you run the script with --help, you can see all the options. You can set up autocomplete with the -A option - this way you do not have to remember all the options.
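For example:

```bash
# Show all available options
o2a-run-sys-test --help

# Set up shell autocomplete for o2a-run-sys-test
o2a-run-sys-test -A
```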
Current options:
Usage: o2a-run-sys-test [FLAGS] [-A|-S|-K|-W]
Executes prepare or run phase for integration testing of O2A converter.
Flags:
-h, --help
Shows this help message.
-a, --application <APPLICATION>
Application (from examples dir) to run the tests on. Must be specified unless -S or -A are specified.
One of [childwf decision demo el fs git mapreduce pig shell spark ssh subwf]
-p, --phase <PHASE>
Phase of the test to run. One of [prepare-configuration convert prepare-dataproc test-composer test-oozie test-compare-artifacts]. Defaults to convert.
-C, --composer-name <COMPOSER_NAME>
Composer instance used to run the operations on. Defaults to o2a-integration
-L, --composer-location <COMPOSER_LOCATION>
Composer locations. Defaults to europe-west1
-c, --cluster <CLUSTER>
Cluster used to run the operations on. Defaults to oozie-51
-b, --bucket <BUCKET>
Airflow Composer DAG bucket used. Defaults to bucket that is used by Composer.
-r, --region <REGION>
GCP Region where the cluster is located. Defaults to europe-west3
-v, --verbose
Add even more verbosity when running the script.
-d, --dot
Creates files in the DOT representation.
If you have the graphviz program in PATH, the files will also be converted to the PNG format.
If you have the graphviz program and the imgcat programs in PATH, the files will also be displayed in the console
Optional commands to execute:
-K, --ssh-to-composer-worker
Open shell access to Airflow's worker. This allows you to test commands in the context of the Airflow instance.
It is worth noting that it is possible to access the database.
The kubectl exec command is used internally, so not all SSH features are available.
-S, --ssh-to-dataproc-master
SSH to Dataproc's cluster master. All SSH features are available with this option.
Arguments after -- are passed to gcloud compute ssh command as extra args.
-W, --open-oozie-web-ui
Creates a SOCKS5 proxy server that redirects traffic through Dataproc's cluster master and
opens Google Chrome with a proxy configuration and a tab with the Oozie web interface.
-A, --setup-autocomplete
Sets up autocomplete for o2a-run-sys-tests
Once you have run the script with your chosen flags, you do not need to specify the parameters again. The latest parameters used are stored and cached locally in .ENVIRONMENT_NAME files in the .o2a-run-sys-test-cache-dir directory and reused the next time you run the script.
In case you want to clean up the cache, simply remove all the files from that directory.
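For example, one way to clear the cache:

```bash
# Delete all cached .ENVIRONMENT_NAME parameter files
find .o2a-run-sys-test-cache-dir -type f -delete
```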
The following phases are defined for the system tests:
- prepare-configuration - prepares the configuration based on the passed Dataproc/Composer parameters
- convert - converts the example application workflow to a DAG and stores it in the output/<APPLICATION> directory
- prepare-dataproc - prepares the Dataproc cluster to execute both Composer and Oozie jobs. The preparation includes:
  - Local filesystem: the ${HOME}/o2a/<APPLICATION> directory contains the application to be uploaded to HDFS
  - Local filesystem: the ${HOME}/o2a/<APPLICATION>.properties property file used to run the Oozie job
  - HDFS: /user/${user.name}/examples/apps/ - the application is stored in this HDFS directory
- test-composer - runs tests on the Composer instance. Artifacts are downloaded to the output-artifacts/<APPLICATION>/composer directory.
- test-oozie - runs tests on Oozie in the Hadoop cluster. Artifacts are downloaded to the output-artifacts/<APPLICATION>/oozie directory.
- test-compare-artifacts - runs tests on both Oozie and the Composer instance and displays a comparison of artifact differences.
The typical scenarios to run the tests are:
Running an application via Oozie:
o2a-run-sys-test --phase prepare-dataproc --application <APP> --cluster <CLUSTER>
o2a-run-sys-test --phase test-oozie
Running an application via Composer:
o2a-run-sys-test --phase prepare-dataproc --application <APP> --cluster <CLUSTER>
o2a-run-sys-test --phase test-composer
In order to run system tests with sub-workflows, you need to have the sub-workflow application already present in HDFS; therefore you need to run at least:
o2a-run-sys-test --phase prepare-dataproc --application <SUBWORKFLOW_APP>
For example, in the case of the demo application, you need to run at least once:
o2a-run-sys-test --phase prepare-dataproc --application childwf
because childwf is used as a sub-workflow in the demo application.
In order to upload a new version to PyPI, you need to have the appropriate credentials. There are scripts that package the application and upload it to the test or the production PyPI instance:
- o2a-package-upload-test - prepares and uploads the package to the test PyPI
- o2a-package-upload - prepares and uploads the package to the production PyPI
Make sure to update the version of the package in setup.py before packaging and uploading.
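A typical release sketch, assuming the version in setup.py has already been bumped (the grep is only a quick sanity check):

```bash
# Sanity-check the version declared in setup.py
grep "version" setup.py

# Upload to the test PyPI instance first, then to production
o2a-package-upload-test
o2a-package-upload
```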