A repository dedicated to developing a geospatial data science prototype (see issue: https://github.com/developmentseed/labs/issues/292).
- The Objective
- Literature Support
- The Challenge
- Proposed open-source, available datasets
- Proposed Methodology
- Hypothesis
- Setting up your local environment
- Reproducing the Results
The objective is to explore the use of machine learning techniques on publicly available, open-source datasets to demonstrate the potential to predict cholera in endemic regions of the world. We aim to develop a proof of concept (PoC) based only on open-source data that showcases ML capabilities in this space, could be developed further into a public health planning and decision-making tool for humanitarian organizations, and provides more context on cholera patterns than case counts alone.
In cholera-endemic countries, there is evidence for environmental signatures of seasonal outbreaks, which could be explored and used to develop a framework for an early warning system. See also The seasonality of cholera in sub-Saharan Africa: a statistical modelling study for supporting work in this area.
- Cholera can be both endemic (where seasonal patterns are more likely) and epidemic (arising from a perfect storm of conditions). Environmental factors are just one component; a number of complex factors are at play (e.g., human development indices, access to basic hand-washing, natural disasters).
- Surveillance isn’t perfect. The resources allocated to disease surveillance vary widely across the globe (as we’ve seen with Covid), so reported cases aren’t always representative of the true picture.
Focus on an area where cholera has been identified as a major issue, and where subnational and sub-annual surveillance data are available: Sub-Saharan Africa. The 2010-2019 time frame covered by these data also allows us to take advantage of a number of remotely sensed variables captured over the same period.
- Cholera surveillance: published cholera outbreak data for research purposes (see linked repo and associated manuscript, Cholera outbreaks in sub-Saharan Africa 2010-2019). Only outbreak data will be extracted and used for demonstration purposes in this PoC; see data/outbreak_data.csv.
Below is a list of potential indicator datasets for inclusion in the Cholera Lab study, based on literature support (Gwenzi & Sanganyado 2019; Lessler et al. 2018; Perez-Saez et al. 2022; Moore et al. 2017, and others outlined more specifically below):
- A review of the risk of cholera outbreaks and urbanization in sub-Saharan Africa
- Hydroclimatology 👈 we’ll focus predominantly on these
- Geographic location of urban areas
- Urban environment - sanitation
- Urban behavior
- The Impact of Climate Change on Cholera: A Review on the Global Status and Future Challenges
- Precipitation/flood = increase in disease potential due to disruption of water systems/Increased spread
- Drought = increased spread
- The seasonality of cholera in sub-Saharan Africa: a statistical modelling study
- mean monthly fraction of area flooded
- mean monthly air temperature
- cumulative monthly precipitation
- Estimating cholera risk from an exploratory analysis of its association with satellite-derived land surface temperatures (dataset from which outbreak data was extracted)
- Precipitation (anomalies)
- Air temperature (anomalies)
- Land surface temperature
- Findings: shows LST anomalies estimated 2 months in advance of cholera incidence for each pixel in the five locations, with all regions revealing varying degrees of warm and cold temperature pixels in the analyses
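The anomaly-based indicators listed above can be illustrated with a small sketch. It assumes one common convention, in which an anomaly is the monthly value minus the long-term mean for that calendar month; the cited studies may define anomalies differently. The function name and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def monthly_anomalies(series):
    """Return {(year, month): value - mean of that calendar month}.

    `series` maps (year, month) to a monthly value, e.g. mean LST in Kelvin.
    """
    by_month = defaultdict(list)
    for (_, month), value in series.items():
        by_month[month].append(value)
    # Long-term mean per calendar month (the "climatology").
    climatology = {month: mean(values) for month, values in by_month.items()}
    return {
        (year, month): value - climatology[month]
        for (year, month), value in series.items()
    }


# Hypothetical January/February LST values (Kelvin) for three years.
lst = {
    (2010, 1): 300.0, (2011, 1): 302.0, (2012, 1): 304.0,
    (2010, 2): 301.0, (2011, 2): 301.0, (2012, 2): 304.0,
}
anomalies = monthly_anomalies(lst)
print(anomalies[(2012, 1)])  # positive values are warmer than the January mean
```

By construction the anomalies for each calendar month sum to zero, which is a quick sanity check when applying the same idea to real rasters.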
Based on the indicators available for both the spatial and temporal extent of our area of interest (AOI; Sub-Saharan Africa from 2010-2019), we will extract the following environmental parameters for our investigation.
Variable | Temporal Resolution | Spatial Resolution | Data Availability | Data Source |
---|---|---|---|---|
Land Surface Temperature | monthly | 1.11 km | 1995-2020 | CEDA |
Precipitation | monthly | 5 km | 1981 to near present | CHIRPS, with multiple access points, including UCSB Storage and SERVIR GLOBAL |
Soil Moisture | daily | 0.25 degrees; approx 27-28 km | 1991-2021 | ESA Climate Data Dashboard |
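As a rough check on the "0.25 degrees; approx 27-28 km" figure in the table, the sketch below converts a resolution in degrees to kilometres. It assumes a spherical approximation of about 111.32 km per degree at the equator; the function name is hypothetical and not part of this repository.

```python
import math

KM_PER_DEGREE = 111.32  # approximate length of one degree at the equator


def cell_size_km(resolution_deg, latitude_deg=0.0):
    """Return the approximate (north-south, east-west) cell size in km."""
    north_south = resolution_deg * KM_PER_DEGREE
    # East-west extent shrinks with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


ns, ew = cell_size_km(0.25)
print(f"0.25 degrees at the equator: {ns:.1f} km x {ew:.1f} km")
ns, ew = cell_size_km(0.25, latitude_deg=15.0)
print(f"0.25 degrees at 15 degrees latitude: {ns:.1f} km x {ew:.1f} km")
```

The east-west cell size narrows away from the equator, so a single kilometre value is only an approximation across the AOI.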
- Data collection and spatial exploratory data analysis. We’ll explore what patterns, over both space and time, can be observed from the cholera outbreaks themselves. We’ll also explore the literature to understand which remotely sensed environmental factors (e.g., precipitation, temperature) have been suggested as drivers of disease spread.
- Development of a pre-processing pipeline for remotely sensed EO data. We’ll develop a pre-processing pipeline to ensure our satellite data is assembled and aggregated at the same level as our outbreak data (i.e., monthly values for each district) and is ready to be ingested into an ML model.
- ML model exploration. We’ll explore a number of ML approaches (e.g., Random Forest, SVMs, etc.) to understand the patterns between cholera outbreaks and the environmental drivers we have identified.
- Visualize model results and share findings. We’ll provide visuals of our model results and share our findings in a collection of Jupyter notebooks.
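The aggregation step in the pre-processing pipeline can be sketched as follows. This is a minimal illustration using only the standard library, not the pipeline implemented in this repository; the record layout, district names, and values are hypothetical, and the daily values are assumed to be already averaged over each district.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_district_means(daily_records):
    """Average daily values into one value per (district, year, month)."""
    groups = defaultdict(list)
    for district, day, value in daily_records:
        groups[(district, day.year, day.month)].append(value)
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical daily soil-moisture values per district.
daily = [
    ("district_a", date(2015, 1, 1), 0.20),
    ("district_a", date(2015, 1, 2), 0.30),
    ("district_a", date(2015, 2, 1), 0.40),
    ("district_b", date(2015, 1, 1), 0.10),
]
features = monthly_district_means(daily)

# Hypothetical outbreak labels keyed the same way as the features.
outbreaks = {("district_a", 2015, 1)}
rows = [
    {"district": d, "year": y, "month": m, "soil_moisture": v,
     "outbreak": (d, y, m) in outbreaks}
    for (d, y, m), v in sorted(features.items())
]
for row in rows:
    print(row)
```

Keying both the features and the outbreak labels on (district, year, month) is what lets the two sources line up as rows of a single model-ready table.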
Environmental factors alone won’t unravel this very complex relationship, but they can help identify spatio-temporal patterns that could assist in allocating resources and support.
If you are running macOS, consider installing Homebrew, if not already installed, as there are macOS-specific instructions below that use Homebrew to simplify the setup process.
This repository contains files larger than 50 MB, and thus requires the use of Git Large File Storage (LFS) for managing them. In order to obtain these large files during repository cloning, you must install Git Large File Storage.
On macOS, the easiest way to install Git LFS is via Homebrew:
brew install git-lfs
Once installed, initialize it:
git lfs install
To track new types of large files (larger than 50 MB), you must tell Git LFS to track them, typically by extension. For example, to track all Shapefiles:
git lfs track "*.shp"
You can then add and commit such files like any other file in the repository.
Note that the git lfs track command will modify the .gitattributes file when given a new pattern to track. When this occurs, be sure to add .gitattributes to your commit, along with the newly tracked large files.
Install conda. The recommended way to do this is by installing miniforge:
brew install miniforge
conda init
Then, close your terminal and open a new terminal session.
Once conda is installed, run the following commands in your terminal from the root of this repository to create the environment used for this repository:
conda env create
conda activate geo-ds-cholera
Whenever you modify the environment.yml file, run the following command to update your conda environment:
conda env update
If you haven't already done so, create a .env file at the root of this repository (ignored by git) by making a copy of .env-example, like so:
# This copies .env-example to .env, unless .env already exists
cp -n .env-example .env
Edit your .env file, setting values as appropriate for yourself. This file is not committed to git, and thus is not shared with others, because it is intended to contain sensitive, user-specific values. Some parts of the code in this repository load values from your .env file, and thus may either fail to run or skip certain parts of their logic if your .env file does not contain properly configured values.
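To illustrate why a missing or empty value can cause code to fail, here is a minimal sketch of reading a .env file and checking for a required setting. It is not the loading mechanism used by this repository (which may rely on a dedicated library); the parser handles only simple KEY=VALUE lines, and EXAMPLE_API_KEY is a hypothetical variable name.

```python
import tempfile
from pathlib import Path


def load_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require(values, name):
    """Return a configured value or fail with a clear message."""
    if not values.get(name):
        raise RuntimeError(f"{name} is missing or empty in your .env file")
    return values[name]


# Demonstration with a throwaway file; EXAMPLE_API_KEY is a hypothetical name.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text("# comment\nEXAMPLE_API_KEY=abc123\nEMPTY_VALUE=\n")
    settings = load_env_file(env_path)

print(require(settings, "EXAMPLE_API_KEY"))
```

Failing early with a named variable, as require does here, makes a misconfigured .env file easier to diagnose than an error raised deep inside a notebook.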
In order to allow notebooks in this repository to import modules in this repository, you must perform a local, editable pip install:
pip install -e .
To aid development, this repository uses the pre-commit tool, which is installed into the conda environment created above. To install the pre-commit hooks defined in .pre-commit-config.yaml, you must run the following command from the root of your cloned repository working directory:
pre-commit install --install-hooks
To run the pre-commit hooks and check your changes before committing them to git, run the following command. Note that the hooks ignore files that are untracked by git, so you must at least stage any untracked files with git add for them to be checked:
pre-commit run -a
After setting up your local environment (see above), you may reproduce our results as follows:
- Run exploration/zonal-means.ipynb to reproduce the individual zonal means CSV files under the data directory. The inputs to this notebook are the outbreaks.csv and shapefile found under the src/cholera/resources path.
- Run exploration/aggregate-zonal-means.ipynb to reproduce the aggregate zonal means CSV file under the data directory. The inputs are the individual zonal means produced by the previous step.
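Conceptually, the aggregation step combines several per-variable CSV files into a single table. The sketch below shows one way this could work, joining rows on shared key columns; it is an assumption about the general approach, not the logic in the notebook. The file names, column names, and values are hypothetical and are written to a temporary directory.

```python
import csv
import tempfile
from pathlib import Path


def aggregate_csvs(input_paths, output_path, key_fields):
    """Merge per-variable CSVs into one table, joining rows on key_fields."""
    merged = {}
    for path in input_paths:
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle):
                key = tuple(row[field] for field in key_fields)
                merged.setdefault(key, {}).update(row)
    # Key columns first, then every other column in first-seen order.
    fieldnames = list(key_fields)
    for row in merged.values():
        for name in row:
            if name not in fieldnames:
                fieldnames.append(name)
    with open(output_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        for key in sorted(merged):
            writer.writerow(merged[key])
    return fieldnames


# Hypothetical inputs standing in for individual zonal means files.
with tempfile.TemporaryDirectory() as tmp:
    lst_csv = Path(tmp) / "lst.csv"
    precip_csv = Path(tmp) / "precip.csv"
    lst_csv.write_text("district,month,lst\nA,2015-01,301.5\n")
    precip_csv.write_text("district,month,precip\nA,2015-01,42.0\n")
    out_csv = Path(tmp) / "aggregate.csv"
    columns = aggregate_csvs([lst_csv, precip_csv], out_csv, ["district", "month"])
    result = out_csv.read_text()

print(columns)
print(result)
```

Rows that are missing from one input are still written, with the absent columns left empty, so gaps in a variable's coverage remain visible in the aggregate file.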