Getting the apps into the Appy.
Tip

ETL has been automated via cron in the Elastic Beanstalk cluster. As such, ETL-specific set-up of your local machine is no longer necessary; and, cron commands do not use the argcmdr management commands described below. Nonetheless, careful review of the ETL crontab is necessary each year, to ensure that it is suited to current expectations. And, the below documentation (and the argcmdr commands) remain useful for manual invocation and testing. See: cron-leader.config.
As with many of Appy’s management commands, ETL is implemented within Appy’s Django framework; and, argcmdr management commands are available to handle common invocations of these, as well as to factor out execution via Docker container. This way, the application may be managed from an appropriately configured local shell.
Note that, in addition to the local set-up described in README.md, it is assumed that the local environment has been configured with appropriate variables, here mainly relating to the remote target database:
export PGHOST=localhost
export PGPORT=5434
export PGUSER=appy_reviews
export PGDATABASE=appy_reviews
export PGPASSWORD='!p@ssw0rd?'
export DATABASE_URL=postgres://$PGUSER:$PGPASSWORD@$PGHOST:$PGPORT/$PGDATABASE

(Note that the password is single-quoted: in an interactive Bash shell, an unquoted or double-quoted ! would otherwise trigger history expansion.)
Though only DATABASE_URL above is actually required, the above form is recommended, to additionally enable easy access to the target database via psql.
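For example, once the above variables are exported (and the database is reachable — see the note on tunneling just below), connectivity may be verified with a trivial query:

# psql picks up PGHOST, PGPORT, PGUSER, PGDATABASE and PGPASSWORD automatically
psql -c 'select 1'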
Also note that, in the above, the database host is set to localhost. This reflects that the remote target database most likely cannot be connected to directly, and its connection must instead be tunneled. (For easy management of SSH tunnels, consider tunneldb.)
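Absent tunneldb, such a tunnel may be opened with plain ssh; a minimal sketch, in which the bastion and database host names are hypothetical:

# forward local port 5434 (per PGPORT above) to the remote database's port 5432,
# via a bastion host (host names here are hypothetical)
ssh -N -L 5434:appy-db.example.internal:5432 user@bastion.example.com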
ETL also requires the Wufoo API key:
export WUFOO_API_KEY=AAAA-BBBB-CCCC-DDDD
With the above environment variables set, Appy’s management commands will forward these to the ETL process.
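If desired, the key itself can be sanity-checked against the Wufoo API, which accepts the API key as the HTTP basic-auth user name (with an arbitrary password); the account subdomain below is hypothetical:

# list the account's forms; SUBDOMAIN is hypothetical (use the account's own)
curl -u "$WUFOO_API_KEY:footastic" https://SUBDOMAIN.wufoo.com/api/v3/forms.json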
The period before the fellowship’s application deadline is here considered the "application stage." Management commands are capable of executing ETL across all available data; however, this is not recommended. Instead, we’d like only to process data for the current stage of the current program year. Importing data for the current program year can be as simple as the following:
# import wufoo survey data for 2020
# (this works but is not recommended)
manage etl --year=2020 wufoo
However, ETL management commands feature configuration preset flags for the stages of the process, which select the recommended combination of settings for the given stage:
manage etl --stage=application wufoo
Note

Because the current program year is not supplied to the above command, it must first be configured in Appy’s settings.
It is strongly recommended that all ETL commands be tested, at least once, for the program year. This ensures validity against changes in input data, etc.
As documented by its --help text, the above command enables you to either cache survey data as local CSV, or to run without actually writing to the database ("dry run"), or both:
manage etl -v2 --stage=application wufoo -n --csv
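Assuming the cached CSVs land in the working directory (their location and naming may differ), their header rows can then be spot-checked before a real run:

# header row of each cached survey CSV (location and naming assumed)
head -n1 *.csv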
Application-stage ETL may be performed with the following individual commands.
1. read Wufoo survey data into (raw data) database tables of the form survey_NAME_YEAR:

   manage etl -v2 --stage=application wufoo
2. upsert records into Appy’s application tables:

   manage etl -v2 --stage=application apps
Note

Step 2 may be very slow! If you’re concerned, you might ensure its operation and monitor its progress by inspection of the relevant database tables.
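For instance, a row count repeated over time gives a rough sense of step 2’s progress; the table name below is hypothetical, so substitute whichever application table is being populated:

# re-run a count every 30 seconds; table name is hypothetical
watch -n 30 "psql -c 'select count(*) from application_application'"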
Note also that the above two commands may be invoked at once with the --all flag:
manage etl -v2 --stage=application --all
Review-stage ETL differs from the former in that:
- Application survey data is ignored: tables relating to either part 1 or part 2 of the application will not be updated.
- Reviewer registrations (previously extracted from Wufoo, and loaded into a survey table, but otherwise ignored) will be read, and invitation emails sent to reviewers to begin the application-reviewing process.
ETL in this stage includes sending reviewer invitation emails, which may require data we won’t commit to the repository, such as the SMTP server credentials and a URL to join a Slack group.
The SLACK_URL in particular may be specified via the shell, and will be passed through to the ETL process. This can be configured in the shell environment, or, for example in Bash, passed directly to the command process:
SLACK_URL="https://.../" manage etl ...
However, the above form is not recommended for server credentials, and so the shell environment should at least be extended for these:
export SMTP_USER="user-name"
export SMTP_PASSWORD='p@!ssw0rd'
export SLACK_URL="https://.../"

(As above, the password is single-quoted to protect the ! from Bash history expansion.)
ETL command testing in the review stage may focus more narrowly on the processing of reviewer registrations and sending of reviewer invitation emails.
For example, the following command variant filters the reviewer registrations under consideration to those given by the argument(s) --invite-only:
manage etl apps [--dry-run] --invite-only=reviewer@example.com [--invite-only=...]
(If --dry-run is also supplied, no database changes will be persisted, and no emails will be sent.)
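Putting these together, a safe end-to-end test of the invitation path might look like the following, assuming the flags compose as documented above (the address is illustrative):

manage etl -v2 --stage=review apps --dry-run --invite-only=reviewer@example.com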
If it’s necessary to insert additional (test) reviewer records for consideration by the above command, this can be done via the Wufoo survey (and loaded via the wufoo subcommand). Or, these can be inserted directly into the survey database table (though this will result in mild inconsistencies between Wufoo and Appy data); for example:
INSERT INTO survey_reviewer_2020
    ("EntryId", "Field1", "Field2", "Field3", "Field7", "Field8")
VALUES
    ('98', 'First Test1', 'Last', 'test@example.com', 'sure', 'yep');
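The inserted record can then be verified via psql before re-running the apps subcommand; a minimal check, given the example values above:

# confirm the test registration landed in the survey table
psql -c "select * from survey_reviewer_2020 where \"EntryId\" = '98'"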
On the day following the application deadline, first run the above application-stage ETL one last time.
Then begin running the review-stage command:
manage etl -v2 --stage=review --all
Note that it is important to continue running ETL for at least a little while following the application deadline: not only may additional reviewers (be goaded to) register; as important, recommendation letters will continue to trickle in.