A generic scriptable Python client for downloading datasets from the Web Services offered by the USGS/EPA Water Quality Portal: an alternative to manual use of the WQP website.
The project consists of the following components:
-
A client module,
pywqp_client.py
, which obtains WQP data in CSV format or in native Water Quality XML. This module is suitable for inclusion in Python programs. -
A support module,
wqx_mappings.py
, which can be used independently:-
Defines the relationship between the WQX-Outbound 2.0 XML format and the fundamental tabular forms represented in CSV and TSV results;
-
Provides utility methods to create
pandas
DataFrame objects from WQP XML query responses (which do not suffer from the vulnerabilities of character escaping that plague CSV and TSV formats.)
-
-
A convenience wrapper,
pywqp.py
, which manages common query-and-convert actions from the command line. When it is invoked, the commandline parameters are sent to an instance of pywqp_client.py. -
A parameter validation module,
pywqp_validator.py
How do I download WQP data from my Python program?
How do I convert my download to a pandas dataframe?
How do I stash my download to the local filesystem?
How do I run the module tests?
What can I do with wqx_mappings?
You will need to ensure that the pywqp
folder is in your system path before you can do anything or else you will not be able to import pywqp_client
. The easiest way to do this is simply to use pip to pull the package directly from github.
$ pip install git+https://github.com/USGS-CIDA/pywqp.git
To upgrade to the latest version, simply use the --upgrade flag:
$ pip install git+https://github.com/USGS-CIDA/pywqp.git --upgrade
Alternatively, you can download the package and add it manually to your path.
The core resource of pywqp_client
is the class RESTClient
. Instantiation is fairly simple:
import pywpq_client
client_instance = pywqp_client.RESTClient()
client_instance
is now ready to run any of the functions exposed by RESTClient
. Examples of the important ones are shown below. In all cases, the example name "client_instance" is reused for simplicity, but that name has no particular significance. Name your objects as you wish.
This function makes a call to the Water Quality Portal server specified in the host_url
argument. The other arguments are as follows:
-
verb
: a literal String representing the HTTP method of the Request. This method accepts only'get'
or'head'
: WQP currently doesn't support any other HTTP methods. -
resource_label
: an identifier for the kind of data being requested. The defined labels are the keys of theRESTClient.resource_types
Dictionary. At time of writing, these are supported:-
'station'
: Station, or "Site", data refers to locations at which sampling is deemed to have occurred. -
'result'
: Result data refers to actual measurements. Will always include Station, date/time of observation, and metadata about the observation event. -
'simplestation'
: a very small subset of Station information, used mostly for interaction with geospatial systems.
-
-
params
is a Dictionary containing WQP REST parameters. -
There is one standard WQP parameter which is not recognized in
params
:mimeType
. This one is given its own Python parameter,mime-type
, because currently pywqp supports only CSV and XML download formats. There are only two accepted values for this parameter:-
'text/xml'
-
'text/csv'
(which is the default value if this parameter is omitted.)
-
Example: downloading CSV data for Stations in Boone County, Iowa, US that have made pH observations.
verb = 'get'
host_url = 'http://waterqualitydata.us'
resource_label = 'station'
params = {'countrycode': 'US', 'statecode': 'US:19', 'countycode': 'US:19:015', 'characteristicName': 'pH'}
result = client_instance.request_wqp_data(verb, host_url, resource_label, params, mime_type='text/csv')
When working with a module like pywqp, it's often very helpful to be able to produce a query that duplicates the one being issued by the module. The duplicate query can be run independently though a utility such as curl (or a browser, as long as the browser handles outbound query parameter urlencoding correctly.)
pywqp provides this via create_rest_url
, a function that takes the same host_url
, resource_label
, params
, and mime_type
arguments that are made to a call to request_wqp_data
. Instead of making a call to WQP and returning a requests.response
object, create_rest_url
returns a paste-ready URL that can be set from a different client.
host_url = 'http://waterqualitydata.us'
resource_label = 'station'
params = {'countrycode': 'US', 'statecode': 'US:19', 'countycode': 'US:19:015', 'characteristicName': 'pH'}
equivalent_url = client_instance.create_rest_url(host_url, resource_label, params, mime_type='text/csv')
print(equivalent_url)
will print
http://waterqualitydata.us/Station/search?characteristicName=pH&mimeType=csv&zip=no&statecode=US%3A19&countrycode=US&countycode=US%3A19%3A015
request_wqp_data
returns a requests.response
object. pywqp_client
lets you do two things with that response
:
-
Convert the dataset to an in-memory pandas dataframe.
-
Stash the dataset on your local filesystem.
The next two examples show how to do those things.
dataframe = client_instance.response_as_pandas_dataframe(response)
Note that the filepathname
argument can be either relative or absolute. If it's relative, the stash_response
method will coerce it to an absolute based on the current directory. However, sometimes during Python execution the "current directory" is not obvious. Absolute filepathnme
s are recommended.
filepathname = '/home/whb/examples/wqp_example.csv'
client_instance.stash_response(response, filepathname)
As a convenience, pywqp also allows the storage of a complete HTTP response message, including status line and headers. This is done by setting the optional boolean parameter raw_http=True
.
filepathname = '/home/whb/examples/wqp_example.csv'
client_instance.stash_response(response, filepathname, raw_http=True)
This will give you a file on disk that opens with content something like this:
HTTP/1.1 200 OK
Date: Thu, 24 Jul 2014 15:42:52 GMT
NWIS-Site-Count: 49
Total-Site-Count: 203
STORET-Site-Count: 154
WQP-Job-ID: 14242
STEWARDS-Site-Count: 0
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Total-Result-Count
Access-Control-Expose-Headers: Total-Site-Count
Content-Type: text/csv
Note that stashing HTTP Response data to disk is a simple convenience to incorporate. On the other hand, pywqp does not support saving pandas dataframes to disk. If you're sufficiently advanced to do that, you probably already know how to use HDF5; if not, there are plenty of resources out there (e.g. Python and HDF5
The project also contains a BDD test suite written in lettuce. This is located in the tests
folder. Shocking, I know.
You can run pywqp's tests whenever you like. You should run them whenever you have made significant local changes. Especially if you want to submit a pull request, of course.
If you don't take direct advantage of the virtualenv setup information (dev_setup.sh
and requirements.txt
), you can still use them as a guide to ensuring that you know which needed versions and libraries to install. The only dependency for running the tests should be lettuce
itself.
Lettuce is extremely simple to run. From the pywqp root:
cd tests
lettuce
Its output is pretty straightforward to read, too.
Although pywqp_client.py
uses wqx_mappings.py
to manage all clientside XML-to-DataFrame work, there are other, independent, uses for wqx_mappings
.
The first use is that wqx_mappings
contains a logically complete description of the mappings between WQX-Outbound 2.0 XML and the "canonical" tabular forms represented by CSV and TSV content. These mappings are represented by the module-level data structures:
-
context_descriptors
, which identifies the logically significant container nodes in WQX content, as XPath-like expressions; -
column_mappings
, which is a dictionary whose keys are XPath-like expressions and whose values are column names; -
tabular_defs
, which is a dictionary whose keys are tabular definition types, and whose values are tuples of column names, defining the sequence in which columns appear in the table. -
val_xpaths
, which is a dictionary whose keys are context node type names, and whose values are dictionaries mapping column names (keys) to RELATIVE XPath-like expressions that identify the node containing the text to be entered into a cell for any row constructed while the node is "in context".
This class exposes some helpful methods and properties. Instantiation is simple:
import wqx_mappings
mapper_instance = wqx_mappings.WQXMapper()</tt>
table_type = mapper_instance.determine_table_type(response)