Quickstart Tutorial

The quickest way to get your first workflow deployed on Aqueduct


Installation and Setup

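First things first, we'll install the Aqueduct pip package and start the Aqueduct server. The leading ! runs each line as a shell command from a notebook cell; drop it if you are typing directly into a terminal: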

!pip3 install aqueduct-ml
!aqueduct start

Next, we import everything we need and create our Aqueduct client:

from aqueduct import Client, op, metric, check
import pandas as pd

client = Client()

Note that the API key associated with the server can also be found in the output of the aqueduct start command.


Accessing Data

The base data for our workflow is the hotel reviews dataset in the pre-built Demo database that comes with the Aqueduct server. The code below does two things: (1) it loads a connection to the demo database, and (2) it runs a SQL query against that database and returns a pointer to the resulting dataset.

demo_db = client.resource("Demo")
reviews_table = demo_db.sql("select * from hotel_reviews;")

# You will see the type of `reviews_table` is an Aqueduct TableArtifact.
print(type(reviews_table))

# Calling .get() retrieves the underlying data from the TableArtifact and
# returns it as a Python object.
reviews_table.get()

Output:

     hotel_name                        review_date  reviewer_nationality  review
0    H10 Itaca                         2017-08-03   Australia             Damaged bathroom shower screen sealant and ti...
1    De Vere Devonport House           2016-03-28   United Kingdom        No Negative The location and the hotel was ver...
2    Ramada Plaza Milano               2016-05-15   Kosovo                No Negative Im a frequent traveler i visited m...
3    Aloft London Excel                2016-11-05   Canada                Only tepid water for morning shower They said ...
4    The Student Hotel Amsterdam City  2016-07-31   Australia             No Negative The hotel had free gym table tenni...
...  ...                               ...          ...                   ...
95   The Chesterfield Mayfair          2015-08-25   Denmark               Bad Reading light And light in bathNo Positive
96   Hotel V Nesplein                  2015-08-27   Turkey                Nothing except the construction going on the s...
97   Le Parisis Paris Tour Eiffel      2015-10-20   Australia             When we arrived we had to bring our own baggag...
98   NH Amsterdam Museum Quarter       2016-01-26   Belgium               No stairs even to go the first floor Restaura...
99   Barcel Raval                      2017-07-07   United Kingdom        Air conditioning a little zealous Nice atmosp...

100 rows × 4 columns

In Aqueduct terminology, reviews_table is an Artifact, which is simply a wrapper around some data. It will now serve as the base data for our workflow, and we can transform it by applying Python functions to it.


Transforming Data

A piece of Python code that transforms an Artifact is called an Operator, which is simply a decorated Python function. Here, we'll write a simple operator that takes in our reviews table and calculates the length of each review string. It's not too exciting, but it should give you a sense of how Aqueduct works.

@op
def transform_data(reviews):
    '''
    This simple Python function takes in a DataFrame with hotel reviews
    and adds a column called strlen that has the string length of the
    review.    
    '''
    reviews['strlen'] = reviews['review'].str.len()
    return reviews

strlen_table = transform_data(reviews_table)

Notice that we added @op above our function definition: This tells Aqueduct that we want to run this function as a part of an Aqueduct workflow. A function decorated with @op can be called like a regular Python function, and Aqueduct takes note of this call to begin constructing a workflow.

Now that we have our string length operator, we can get a preview of our data by calling .get():

strlen_table.get()

Output:

     hotel_name                        review_date  reviewer_nationality  review                                             strlen
0    H10 Itaca                         2017-08-03   Australia             Damaged bathroom shower screen sealant and ti...   82
1    De Vere Devonport House           2016-03-28   United Kingdom        No Negative The location and the hotel was ver...  84
2    Ramada Plaza Milano               2016-05-15   Kosovo                No Negative Im a frequent traveler i visited m...  292
3    Aloft London Excel                2016-11-05   Canada                Only tepid water for morning shower They said ...  368
4    The Student Hotel Amsterdam City  2016-07-31   Australia             No Negative The hotel had free gym table tenni...  167
...  ...                               ...          ...                   ...                                                ...
95   The Chesterfield Mayfair          2015-08-25   Denmark               Bad Reading light And light in bathNo Positive     47
96   Hotel V Nesplein                  2015-08-27   Turkey                Nothing except the construction going on the s...  456
97   Le Parisis Paris Tour Eiffel      2015-10-20   Australia             When we arrived we had to bring our own baggag...  672
98   NH Amsterdam Museum Quarter       2016-01-26   Belgium               No stairs even to go the first floor Restaura...   156
99   Barcel Raval                      2017-07-07   United Kingdom        Air conditioning a little zealous Nice atmosp...   72

100 rows × 5 columns
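
The idea that calling a decorated function both runs it and registers it as a workflow step can be pictured with a toy decorator. The sketch below is not Aqueduct's implementation: toy_op, ToyArtifact, and recorded_steps are hypothetical names, and the review strings are made up. It only illustrates the general pattern of wrapping a function so that each call leaves a record behind.

```python
import functools

# Toy illustration only: NOT Aqueduct's implementation of @op.
# `toy_op`, `ToyArtifact`, and `recorded_steps` are hypothetical names.
recorded_steps = []  # operator names, in the order they were called


class ToyArtifact:
    """Wraps a plain Python value, loosely mimicking an artifact."""

    def __init__(self, value):
        self._value = value

    def get(self):
        """Return the wrapped value."""
        return self._value


def toy_op(func):
    """Record each call, run `func` on the unwrapped inputs, wrap the result."""

    @functools.wraps(func)
    def wrapper(*artifacts):
        recorded_steps.append(func.__name__)
        inputs = [artifact.get() for artifact in artifacts]
        return ToyArtifact(func(*inputs))

    return wrapper


@toy_op
def review_lengths(reviews):
    return [len(review) for review in reviews]


@toy_op
def longest(lengths):
    return max(lengths)


# Made-up review strings, used only for this illustration.
reviews = ToyArtifact(["Nice stay", "Too noisy at night"])
lengths = review_lengths(reviews)
longest_length = longest(lengths)

print(lengths.get())         # [9, 18]
print(longest_length.get())  # 18
print(recorded_steps)        # ['review_lengths', 'longest']
```

The decorated functions are still called like ordinary Python functions, but the list of recorded steps is enough to reconstruct the order in which they were chained together.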


Adding Metrics

We're going to apply a Metric to our strlen_table, which will calculate a numerical summary of our data (in this case, just the mean string length).

@metric
def average_strlen(strlen_table):
    return (strlen_table["strlen"]).mean()

avg_strlen = average_strlen(strlen_table)
avg_strlen.get()

Output:

223.18

Note that metrics are denoted with the @metric decorator. Metrics can be computed over the output of any operator, and even over other metrics.
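
Since the bodies of transform_data and average_strlen are ordinary pandas code, the same logic can be tried on a small hand-made DataFrame before involving Aqueduct at all. The following is a minimal sketch under that assumption; the two review strings and the names sample and sample_avg are made up for illustration and do not come from the demo database.

```python
import pandas as pd

# Made-up reviews for illustration; these rows are not from the demo database.
sample = pd.DataFrame({"review": ["Clean room", "Friendly staff, nice"]})

# Same logic as the body of transform_data, without the @op decorator.
sample["strlen"] = sample["review"].str.len()

# Same logic as the body of average_strlen, without the @metric decorator.
sample_avg = sample["strlen"].mean()

print(sample["strlen"].tolist())  # [10, 20]
print(sample_avg)                 # 15.0
```

Checking the logic on a tiny input like this makes it easier to tell a bug in the function body apart from a problem with the workflow itself.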


Adding Checks

Let's say that we want to make sure the average strlen of hotel reviews always stays below 250 characters. We can add a check over the avg_strlen metric.

@check(severity="error")
def limit_avg_strlen(avg_strlen):
    return avg_strlen < 250

limit_avg_strlen(avg_strlen)

Output:

<aqueduct.artifacts.bool_artifact.BoolArtifact at 0x7f7e65b46ee0>

Note that checks are denoted with the @check decorator. Checks can also be computed over any operator or metric. Setting the severity to "error" will automatically fail the workflow if this check is ever violated. Check severity can also be set to "warning" (the default), which only prints a warning message on any violation.
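
The difference between the two severities can be pictured with a small stand-alone model. This is a hedged sketch, not Aqueduct's code: apply_check is a hypothetical helper in which raising an exception stands in for failing the workflow and warnings.warn stands in for printing a warning.

```python
import warnings


def apply_check(passed, severity="warning"):
    """Toy model of check severity; hypothetical, not Aqueduct's code."""
    if passed:
        return True
    if severity == "error":
        # Stands in for "fail the workflow".
        raise RuntimeError("check failed with severity 'error'")
    # Stands in for "print a warning and carry on".
    warnings.warn("check failed with severity 'warning'")
    return False


avg_strlen_value = 223.18  # the metric value shown earlier in this tutorial
print(apply_check(avg_strlen_value < 250, severity="error"))  # True
```

With the value from this tutorial the check passes, so neither branch fires. A failing check would stop execution under "error" and continue under "warning".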


Saving Data

Finally, we can save the transformed table strlen_table back to the Aqueduct demo database. See the Aqueduct documentation for more details on using resource objects.

demo_db.save(strlen_table, table_name="strlen_table", update_mode="replace")

Note that this save is not performed until the flow is actually published.


Publishing the Flow

Publishing creates the flow in Aqueduct. The call below prints a URL that takes you to the Aqueduct UI, where you can see the status of your workflow runs and inspect the data.

client.publish_flow(name="review_strlen", artifacts=[strlen_table])

Output:

<aqueduct.flow.Flow at 0x7f7e61d9cdc0>

And we're done! We've created our first workflow together, and you're off to the races.


There is a lot more you can do with Aqueduct, including having flows run automatically on a cadence, parameterizing flows, and reading from and writing to many different data resources (S3, Postgres, etc.). Check out the other tutorials and examples here for a deeper dive!