A place for Agora's ETL, data testing, and data analysis
This configuration-driven data pipeline uses a config file, which is easy for engineers, analysts, and project managers to understand, to drive the entire ETL process. The code in src/agoradatatools uses parameters defined in the config file to determine which extraction and transformation steps a particular dataset needs to go through before the resulting data is serialized as json files that can be loaded into Agora's data repository.
In the spirit of importing datasets with the minimum amount of transformations, one can simply add a dataset to the config file, and run the tool.
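The config-driven idea described above can be illustrated with a minimal sketch. This is not the code in src/agoradatatools: the function names, the dict-based config (used here instead of YAML to avoid extra dependencies), the `syn000` placeholder ID, and the sample rows are all assumptions made for illustration. Only the key names `datasets`, `files`, and `column_rename` and the `<dataset>.json` output name come from this README.

```python
import json
import tempfile
from pathlib import Path


def standardize_column_names(rows):
    """Lower-case column names and replace spaces with underscores."""
    return [
        {key.strip().lower().replace(" ", "_"): value for key, value in row.items()}
        for row in rows
    ]


def rename_columns(rows, mapping):
    """Rename columns according to mapping; unmapped columns are kept as-is."""
    return [{mapping.get(key, key): value for key, value in row.items()} for row in rows]


def run_pipeline(config, extract, out_dir):
    """Extract, transform, and serialize every dataset named in the config."""
    written = []
    for name, spec in config["datasets"].items():
        rows = extract(spec["files"])          # extraction is injected by the caller
        rows = standardize_column_names(rows)  # default transformation (assumed)
        rows = rename_columns(rows, spec.get("column_rename", {}))
        path = Path(out_dir) / f"{name}.json"  # one <dataset>.json per dataset
        path.write_text(json.dumps(rows, indent=2))
        written.append(path)
    return written


# Hypothetical config; "syn000" is a placeholder, not a real Synapse ID.
EXAMPLE_CONFIG = {
    "datasets": {
        "genes": {
            "files": [{"name": "genes_raw", "id": "syn000", "format": "csv"}],
            "column_rename": {"gene_id": "ensembl_gene_id"},
        }
    }
}


def fake_extract(files):
    """Stand-in for downloading source files from Synapse."""
    return [{"Gene ID": "ENSG01", "Symbol": "APOE"}]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in run_pipeline(EXAMPLE_CONFIG, fake_extract, tmp):
            print(path.name, path.read_text())
```

In this sketch, adding a dataset means adding one entry under `datasets`; the loop itself does not change, which is the property the paragraph above describes.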
This src/agoradatatools implementation was influenced by the "Modern Config Driven ELT Framework for Building a Data Lake" talk given at the Data + AI Summit of 2021.
Python notebooks that describe the custom logic for various datasets are located in /data_analysis/notebooks.
The json files generated by src/agoradatatools are written to folders in the Agora Synapse project by default, although you can modify the destination Synapse folder in the config file.
Note that running the pipeline does not automatically update the Agora database in any environment. Ingestion of generated json files into the Agora databases is handled by agora-data-manager.
You can run the pipeline in any of the following ways:
- Seqera Platform: the simplest, but least flexible, way to run the pipeline; it does not require obtaining Synapse permissions, creating a Synapse PAT, or setting up the Synapse Python client.
- Locally: requires installing Python and Pipenv, obtaining the required Synapse permissions, creating a Synapse PAT, and setting up the Synapse Python client.
- Docker: requires installing Docker, obtaining the required Synapse permissions, and creating a Synapse PAT.
When running the pipeline, you must specify the config file that will be used. There are two config files that are checked into this repo:
- test_config.yaml places the transformed datasets in the Agora Testing Data folder in Synapse; write files to this folder to perform data validation.
- config.yaml places the transformed datasets in the Agora Live Data Synapse folder; write files to this folder once you've validated that the ETL process is generating files suitable for release. Note that files in the Agora Live Data folder are not automatically released, so if 'bad' file versions do get written to this folder it's not the end of the world. A releasable manifest file can be generated by a subsequent ETL processing run into the folder, or manually if necessary.
You may also create a custom config file to use locally to target specific dataset(s) or transforms of interest, and/or to write the generated json files to a different Synapse location. See the config file section for additional information.
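As a rough illustration of deriving a custom config, the sketch below copies a base config, keeps only the datasets of interest, and optionally changes the default destination. It assumes the config has already been loaded into a Python dict and that `datasets` is a mapping keyed by dataset name; the real YAML layout may differ. The helper name `make_custom_config` and the `syn000`/`syn111` placeholders are hypothetical.

```python
import copy


def make_custom_config(base, keep, destination=None):
    """Return a copy of base limited to the datasets named in keep."""
    missing = [name for name in keep if name not in base["datasets"]]
    if missing:
        raise KeyError(f"datasets not found in base config: {missing}")

    custom = copy.deepcopy(base)  # never mutate the checked-in config
    custom["datasets"] = {name: custom["datasets"][name] for name in keep}
    if destination is not None:
        custom["destination"] = destination  # new default target folder
    return custom


# Hypothetical base config with placeholder Synapse IDs.
BASE_CONFIG = {
    "destination": "syn000",
    "datasets": {
        "genes": {"final_format": "json"},
        "proteins": {"final_format": "json"},
    },
}

if __name__ == "__main__":
    print(make_custom_config(BASE_CONFIG, ["genes"], destination="syn111"))
```

Note that a dataset-level `destination` override, if present, would still take precedence over the default set here.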
This pipeline can be executed without any local installation, permissions, or credentials; the Sage Bionetworks Seqera Platform workspace is configured to use Agora's Synapse credentials, which can be found in LastPass in the "Shared-Agora" Folder.
The instructions to trigger the workflow can be found at Sage-Bionetworks-Workflows/nf-agora
- Obtain download access to all required source files in Synapse, including accepting the terms of use on the AD Knowledge Portal backend here. If you see a green unlocked lock icon, then you should be good to go.
- Obtain write access to the destination Synapse project, e.g. Agora Synapse project
- Create a Synapse personal access token (PAT)
- Set up your Synapse Python client locally
Your configured Synapse credentials can be used to run this package both locally and using Docker, as outlined below.
Perform the following one-time steps to set up your local environment and obtain the required Synapse permissions:
- This package uses Python. If you have not already, please install pyenv to manage your Python versions. This package supports Python versions >=3.7 and <3.11. If you do not install pyenv, make sure that Python and pip are installed correctly and have been added to your PATH by running python3 --version and pip3 --version. If your installation was successful, your terminal will return the versions of Python and pip that you installed. Note: If you have pyenv, it will install a specific version of Python for you.
- Install pipenv by running pip install pipenv.
- Install git, if you have not done so already, using these instructions.
- Clone this GitHub repository to your local machine by opening your terminal, navigating to the directory where you want the repository to be cloned, and running git clone https://github.com/Sage-Bionetworks/agora-data-tools.git. After cloning is complete, navigate into the newly created agora-data-tools directory.
- Install agoradatatools locally using pipenv:
    pipenv install
    # To develop locally, add --dev:
    # pipenv install --dev
    pipenv shell
- You can check if the package was installed correctly by running adt --help in the terminal. If it returns instructions about how to use the CLI, installation was successful and you can run the pipeline by providing the desired config file as an argument. Be sure to review these instructions prior to executing a processing run. The following example command will execute the pipeline using test_config.yaml and the default options:
    adt test_config.yaml
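If you want to confirm that your interpreter falls inside the supported range stated in the setup steps (>=3.7 and <3.11), a small check like the following may help. It is only a sketch; the constant and function names are made up for illustration and are not part of agoradatatools.

```python
import sys

MIN_VERSION = (3, 7)   # inclusive lower bound from the setup steps
MAX_VERSION = (3, 11)  # exclusive upper bound from the setup steps


def is_supported(version=None):
    """Return True if the (major, minor) version is >=3.7 and <3.11."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if is_supported() else "not supported"
    print(f"Python {current} is {status}")
```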
There is a publicly available GHCR repository automatically built via GitHub Actions. That said, you may want to develop using Docker locally on a feature branch.
If you don't want to deal with Python paths and dependencies, you can use Docker to run the pipeline. Perform the following one-time step to set up your Docker environment and obtain the required Synapse permissions:
- Install Docker.
Once you have completed the one-time setup step outlined above, execute the pipeline by running the following command and providing your PAT and the desired config file as an argument. The following example command will execute the pipeline in Docker using test_config.yaml:
# This creates a local Docker image
docker build -t agora-data-tools .
docker run -e SYNAPSE_AUTH_TOKEN=<your PAT> agora-data-tools adt test_config.yaml
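If you would rather not type the PAT on the command line, one option is to export SYNAPSE_AUTH_TOKEN in your shell and let Docker copy it into the container: passing -e with a variable name and no value forwards the variable from the calling environment. The sketch below builds that command from Python. The helper names are hypothetical, and the image name assumes you built the image locally with the tag shown above.

```python
import os


def token_is_set(env=None):
    """Return True if SYNAPSE_AUTH_TOKEN is present and non-empty."""
    env = os.environ if env is None else env
    return bool(env.get("SYNAPSE_AUTH_TOKEN"))


def docker_run_command(config_file, image="agora-data-tools"):
    """Build the argv list for running the pipeline in Docker.

    "-e SYNAPSE_AUTH_TOKEN" (no "=value") asks Docker to forward the
    variable from the calling environment, so the PAT never appears
    in the command itself.
    """
    return ["docker", "run", "-e", "SYNAPSE_AUTH_TOKEN", image, "adt", config_file]


if __name__ == "__main__":
    if not token_is_set():
        raise SystemExit("Set SYNAPSE_AUTH_TOKEN before running the pipeline")
    # To actually launch the container, pass the list to subprocess.run(..., check=True).
    print(" ".join(docker_run_command("test_config.yaml")))
```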
In order to test the GitHub Actions workflow locally:
- install act and Docker
- create a .secrets file in the root directory of the folder with a SYNAPSE_USER and a SYNAPSE_PASS value*
Then run:
act -v --secret-file .secrets
The repository is currently using Agora's credentials for Synapse. Those can be found in LastPass in the "Shared-Agora" Folder.
Unit tests can be run by calling pytest from the command line.
python -m pytest
Parameters:
- destination: Defines the default target location (folder) that the generated json files are written to; this value can be overridden on a per-dataset basis
- staging_path: Defines the location of the staging folder that the generated json files are written to
- gx_folder: Defines the Synapse ID of the folder that generated GX reports are written to. This key must always be present in the config file. A valid Synapse ID assigned to gx_folder is required if gx_enabled is set to true for any dataset. If this key is missing from the dataset, or if it is set to none when gx_enabled is true for any dataset, an error will be thrown.
- gx_table: Defines the Synapse ID of the table that generated GX reporting is posted to. This key must always be present in the config file. A valid Synapse ID assigned to gx_table is required if gx_enabled is set to true for any dataset. If this key is missing from the dataset, or if it is set to none when gx_enabled is true for any dataset, an error will be thrown.
- sources/<source>: Source files for each dataset are defined in the sources section of the config file.
- sources/<source>/<source>_files: A list of source file information for the dataset.
- sources/<source>/<source>_files/name: The name of the source file/dataset.
- sources/<source>/<source>_files/id: The Synapse ID of the source file. Dot notation is supported to indicate the version of the file to use.
- sources/<source>/<source>_files/format: The format of the source file.
- datasets/<dataset>: Each generated json file is named <dataset>.json
- datasets/<dataset>/files: A list of source files for the dataset
  - name: The name of the source file (this name is the reference the code will use to retrieve a file from the configuration)
  - id: Synapse id of the file
  - format: The format of the source file
- datasets/<dataset>/final_format: The format of the generated output file.
- datasets/<dataset>/gx_enabled: Whether or not GX validation should be run on the dataset. true will run GX validation, false or the absence of this key will skip GX validation.
- datasets/<dataset>/gx_nested_columns: A list of nested columns that should be validated using GX nested validation. Failure to include this key and a valid list of columns will result in an error because the nested fields will not be converted to a JSON-parseable string prior to validation. This key is not needed if gx_enabled is not set to true or if the dataset does not have nested fields.
- datasets/<dataset>/provenance: The Synapse id of each entity that the dataset is derived from, used to populate the generated file's Synapse provenance. (The Synapse API calls this "Activity")
- datasets/<dataset>/destination: Override the default destination for a specific dataset by specifying a synID, or use *dest to use the default destination
- datasets/<dataset>/column_rename: Columns to be renamed prior to data transformation
- datasets/<dataset>/agora_rename: Columns to be renamed after data transformation, but prior to json serialization
- datasets/<dataset>/custom_transformations: The list of additional transformations to apply to the dataset; a value of 1 indicates the default transformation
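Two of the rules in the parameter list lend themselves to a short illustration: the dot notation for file versions and the gx_folder/gx_table requirement. The sketch below is not the package's own validation code. It assumes the config has been loaded into a dict with `datasets` as a mapping, that Synapse IDs look like "syn" followed by digits, and that an unset value may appear either as Python None or as the string "none". The function names and the `syn123`/`syn456` values are hypothetical.

```python
import re

# Assumed ID shape: "syn" + digits, optionally followed by ".<version>".
SYN_ID = re.compile(r"^(syn\d+)(?:\.(\d+))?$")


def parse_synapse_id(value):
    """Split 'syn123.4' into ('syn123', 4); version is None when absent."""
    match = SYN_ID.match(value)
    if match is None:
        raise ValueError(f"not a Synapse ID: {value!r}")
    syn_id, version = match.groups()
    return syn_id, (int(version) if version else None)


def check_gx_settings(config):
    """Raise ValueError if the GX keys break the rules in the parameter list."""
    gx_needed = any(
        spec.get("gx_enabled") is True
        for spec in config.get("datasets", {}).values()
    )
    for key in ("gx_folder", "gx_table"):
        if key not in config:
            raise ValueError(f"'{key}' must always be present in the config file")
        if gx_needed and config[key] in (None, "none"):
            raise ValueError(f"'{key}' needs a Synapse ID when gx_enabled is true")


if __name__ == "__main__":
    print(parse_synapse_id("syn123.4"))
    check_gx_settings(
        {"gx_folder": "syn123", "gx_table": "syn456",
         "datasets": {"genes": {"gx_enabled": True}}}
    )
```

Checking these rules before any download starts means a misconfigured run fails quickly rather than partway through processing.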