Skip to content

Latest commit

 

History

History
130 lines (74 loc) · 6.49 KB

etl.adoc

File metadata and controls

130 lines (74 loc) · 6.49 KB

Appy ETL

Getting the apps into the Appy.

Tip

ETL has been automated via cron in the Elastic Beanstalk cluster.

As such, ETL-specific set-up of your local machine is no longer necessary; and, cron commands do not use the argcmdr manage command.

Nonetheless, careful review of the ETL crontab is necessary each year, to ensure that it is suited to current expectations.

And, the below documentation (and the argcmdr manage.py) are suggested background materials for understanding Appy ETL and its crontab.

As with many of Appy’s management commands, ETL is implemented within Appy’s Django framework; and, argcmdr management commands are available to handle common invocations of these, as well as to factor out execution via Docker container. This way, the application may be managed from an appropriately configured, local shell.

Note that, in addition to the local set-up described in README.md, it is assumed that the local environment has been configured with appropriate variables, here mainly relating to the remote target database:

export PGHOST=localhost
export PGPORT=5434
export PGUSER=appy_reviews
export PGDATABASE=appy_reviews
export PGPASSWORD=!p@ssw0rd?
export DATABASE_URL=postgres://$PGUSER:$PGPASSWORD@$PGHOST:$PGPORT/$PGDATABASE

Though only DATABASE_URL above is actually required, the above form is recommended, to additionally enable easy access to the target database via psql.

Also note that, in the above, the database host is set to localhost — this reflects that the remote target database most likely cannot be connected to directly, and its connection must instead be tunneled. (For easy management of SSH tunnels, consider tunneldb.)

ETL also requires the Wufoo API key:

export WUFOO_API_KEY=AAAA-BBBB-CCCC-DDDD

With the above environment variables set, Appy’s management commands will forward these to the ETL process.

Application stage

The period before the fellowship’s application deadline is here considered the "application stage." Management commands are capable of executing ETL across all available data; however, this is not recommended. Instead, we’d like only to process data for the current stage of the current program year. Importing data for the current program year can be as simple as the following:

# import wufoo survey data for 2020
# (this works but is not recommended)
manage etl --year=2020 wufoo

However, ETL management commands feature configuration preset flags for the stages of the process, which can be expected to ensure the best combination of settings for the given stage:

manage etl --stage=application wufoo
Note

Because the current program year is not supplied to the above command, it must first be configured in Appy’s settings.py: REVIEW_PROGRAM_YEAR.

Testing

It is strongly recommended that all ETL commands are tested, at least once, for the program year. This ensures validity against changes in input data, etc.

As documented by its --help text, the above command enables you to either cache survey data as local CSV, or to run without actually writing to the database ("dry run"), or both:

manage etl -v2 --stage=application wufoo -n --csv

Execution

Application-stage ETL may be performed with the following individual commands.

  1. read Wufoo survey data into (raw data) database tables of the form survey_NAME_YEAR:

    manage etl -v2 --stage=application wufoo
  2. upsert records into Appy’s application tables:

    manage etl -v2 --stage=application apps
Note

Step 2 may be very slow! If you’re concerned, you might ensure its operation and monitor its progress by inspection of database table pg_stat_activity.

Note also that the above two commands may be invoked at once with the --all flag:

manage etl -v2 --stage=application --all

Review stage

Review-stage ETL differs from the former in that:

  1. Application survey data is ignored — tables relating to either parts 1 or 2 of the application will not be updated.

  2. Reviewer registrations — previously extracted from Wufoo, and loaded into a survey table, but otherwise ignored — will be read, and invitation emails sent to reviewers to begin the application-reviewing process.

Configuration

ETL in this stage includes sending reviewer invitation emails, and which may include data we won’t commit to the repository, such as the SMTP server credentials and a URL to join a Slack group.

The SLACK_URL in particular may be specified via the shell, and will be passed through to the ETL process. This can be configured in the shell environment, or for example in Bash passed directly to the command process:

SLACK_URL="https://.../" manage etl ...

However, the above form is not recommended for server credentials, and so the shell environment should at least be extended for these:

export SMTP_USER="user-name"
export SMTP_PASSWORD="p@!ssw0rd"
export SLACK_URL="https://.../"

Testing

ETL command testing in the review stage may focus more narrowly on the processing of reviewer registrations and sending of reviewer invitation emails.

For example, the following command variant filters the reviewer registrations under consideration to those given by the argument(s) --invite-only:

manage etl apps [--dry-run] [email protected] [--invite-only=...]

(If --dry-run is also suppplied, no database changes will be persisted, and no emails will be sent.)

If it’s necessary to insert additional (test) reviewer records for consideration by the above command, this can be done via the Wufoo survey, (and loaded via the wufoo subcommand). Or, these can be inserted directly into the survey database table, (though this will result in mild inconsistencies between Wufoo and Appy data); for example:

INSERT INTO survey_reviewer_2020 ("EntryId", "Field1", "Field2", "Field3", "Field7", "Field8") VALUES ('98', 'First Test1', 'Last', '[email protected]', 'sure', 'yep');

Execution

On the day following the application deadline, first run the above application-stage ETL — one last time.

Then begin running the review-stage command:

manage etl -v2 --stage=review --all

Note that it is important to continue running ETL, for at least a little while following the application deadline. Not only may additional reviewers (be goaded to) register. As important, recommendation letters will continue to trickle in, for at least a little while.