This Code Pattern guides you through using scikit-learn and Python in Watson Studio to predict opioid prescribers based on a 2014 Kaggle dataset.
Opioid prescriptions and overdoses are an increasingly overwhelming problem for the United States, even leading to a declared state of emergency in recent months. Though we, as data scientists, may not be able to single-handedly fix this problem, we can dive into the data and figure out what exactly is going on and what may happen in the future given current circumstances.
This Code Pattern aims to do just that: it dives into a Kaggle dataset that covers opioid overdose deaths by state, along with individual physicians, their credentials, their specialties, whether or not they prescribed opioids in 2014, and the specific names of the prescriptions they wrote. Follow along to see how to explore the data in a Watson Studio notebook and visualize a few initial findings in a variety of ways, including geographically, using PixieDust. PixieDust is a great library to use when you need to explore your data visually very quickly; it needs only one line of code!

Once that initial exploration is complete, this Code Pattern uses the machine learning library scikit-learn to train several models and figure out which make the most accurate predictions of opioid prescriptions. Scikit-learn, if you're unfamiliar, is a machine learning library commonly used by data scientists due to its ease of use. Specifically, the library gives you easy access to a number of machine learning classifiers that you can implement with relatively few lines of code. Even more, scikit-learn makes it easy to visualize your output and showcase your findings. Because of this, the library is often used in machine learning classes to teach what different classifiers do, much like the comparative output this Code Pattern highlights! Ready to dive in?
- opioid-prescription-prediction.ipynb: The notebook we'll be using with no output.
- Log into IBM's Watson Studio service.
- Upload the data as a data asset into Watson Studio.
- Start a notebook in Watson Studio and input the data asset previously created.
- Explore the data with pandas.
- Create data visualizations with PixieDust.
- Train machine learning models with scikit-learn.
- Evaluate their prediction performance.
- IBM Watson Studio: Analyze data using RStudio, Jupyter, and Python in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.
- Jupyter Notebook: An open source web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text.
- PixieDust: Provides a Python helper library for IPython Notebook.
- Data Science: Systems and scientific methods to analyze structured and unstructured data in order to extract knowledge and insights.
- Python: Python is a programming language that lets you work more quickly and integrate your systems more effectively.
- pandas: A Python library providing high-performance, easy-to-use data structures.
This Code Pattern consists of two activities:
- Log into IBM's Watson Studio. Once in, you'll land on the dashboard.
- Create a new project by clicking `+ New project` and choosing `Data Science`.
- Enter a name for the project and click `Create`.
- NOTE: By creating a project in Watson Studio, a free tier `Object Storage` service and `Watson Machine Learning` service will be created in your IBM Cloud account. Select the `Free` storage type to avoid fees.
- Upon a successful project creation, you are taken to a dashboard view of your project. Take note of the `Assets` and `Settings` tabs; we'll be using them to associate our project with any external assets (datasets and notebooks) and any IBM Cloud services.
- From the new project `Overview` panel, click `+ Add to project` on the top right and choose the `Notebook` asset type.
- Fill in the following information:
  - Select the `From URL` tab. [1]
  - Enter a `Name` for the notebook and optionally a description. [2]
  - Under `Notebook URL` provide the following url: https://github.com/IBM/predict-opioid-prescribers/blob/master/notebooks/opioid-prescription-prediction.ipynb [3]
  - For `Runtime` select the `Python 3.5` option. [4]
- Click the `Create` button.
- TIP: Once successfully imported, the notebook should appear in the `Notebooks` section of the `Assets` tab.
- This notebook uses the datasets: opioids.csv, overdoses.csv, and prescriber-info.csv. We need to load these assets to our project.
- From the new project `Overview` panel, click `+ Add to project` on the top right and choose the `Data` asset type.
- A panel on the right of the screen will appear to assist you in uploading data. Follow the numbered steps in the image below.
  - Ensure you're on the `Load` tab. [1]
  - Click on the `browse` option. From your machine, browse to the location of the `opioids.csv`, `overdoses.csv` and `prescriber-info.csv` files in this repository, and upload them. [not numbered]
  - Once uploaded, go to the `Files` tab. [2]
  - Ensure the files appear. [3]
- Now all assets should appear in your project overview.
- Click the `(►) Run` button to start stepping through the notebook.
- Stop at the `Insert Pandas Data Frame` sections.
- Click on the `1001` data icon in the top right. The data files should show up.
- Click on each and select `Insert Pandas Data Frame`. Once you do that, a block of generated code will show up in the highlighted cell.
- Make sure your `opioids.csv` is saved as `df_data_1`, `overdoses.csv` is saved as `df_data_2` and `prescriber-info.csv` is saved as `df_data_3` so that they are consistent with the original notebook.
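If you want to try the notebook outside Watson Studio, the inserted code can be approximated with plain pandas. This is only a sketch under assumptions: the helper `load_datasets` is hypothetical, it assumes the three CSV files sit in one local folder, and the prescriber file name is a parameter because your copy may be named differently. The code Watson Studio generates reads from Object Storage and will look different.

```python
from pathlib import Path

import pandas as pd


def load_datasets(data_dir=".", prescriber_file="prescriber-info.csv"):
    """Read the three CSVs and return them in the order the notebook expects.

    Hypothetical helper: `data_dir` and `prescriber_file` are assumptions
    about where and how the files are stored on your machine.
    """
    data_dir = Path(data_dir)
    df_data_1 = pd.read_csv(data_dir / "opioids.csv")      # opioid drug names
    df_data_2 = pd.read_csv(data_dir / "overdoses.csv")    # overdose deaths by state
    df_data_3 = pd.read_csv(data_dir / prescriber_file)    # prescriber records
    return df_data_1, df_data_2, df_data_3


# Example usage (requires the CSV files to exist in ./data):
# df_data_1, df_data_2, df_data_3 = load_datasets("data")
```

Keeping the variable names `df_data_1`, `df_data_2` and `df_data_3` means the remaining notebook cells can run unchanged.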
To get familiar with your data, explore it with visualizations and by looking at subsets of the data. For example, although California has the highest number of overdoses, once we correct for population we see that West Virginia actually has the highest rate of overdoses per capita.
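The population correction can be sketched with pandas as follows. The column names `State`, `Population` and `Deaths`, the helper `add_death_rate`, and the numbers themselves are all illustrative assumptions (two made-up states, not values from the dataset); check the actual columns in your DataFrame before reusing this.

```python
import pandas as pd

# Made-up numbers for two hypothetical states, NOT values from the dataset.
overdoses = pd.DataFrame({
    "State": ["Large State", "Small State"],
    "Population": [1_000_000, 100_000],
    "Deaths": [200, 50],
})


def add_death_rate(df, per=100_000):
    """Return a copy with deaths expressed per `per` residents."""
    out = df.copy()
    out["DeathRate"] = out["Deaths"] / out["Population"] * per
    return out


rates = add_death_rate(overdoses)

# The state with the most deaths is not necessarily the one with the highest rate.
most_deaths = rates.loc[rates["Deaths"].idxmax(), "State"]
highest_rate = rates.loc[rates["DeathRate"].idxmax(), "State"]
print(f"Most deaths: {most_deaths}; highest rate: {highest_rate}")
```

In this toy example the larger state has more deaths in absolute terms, while the smaller state has the higher rate per 100,000 residents, which is the same effect described above.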
Every dataset has its imperfections. Let's clean ours up by making the state values consistent and converting our columns so that we can use them as integers.
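A cleaning step of this kind might look like the sketch below. It assumes, purely for illustration, that state codes vary in case and whitespace and that numeric columns arrive as strings with thousands separators; the helper `clean`, the column names, and the sample values are hypothetical, and the notebook's own cleaning code may differ.

```python
import pandas as pd

# Made-up sample rows illustrating the assumed problems, not real data.
raw = pd.DataFrame({
    "State": [" wv", "CA ", "ca"],
    "Population": ["1,000", "25,000", "25,000"],
})


def clean(df):
    """Normalize state codes and convert comma-formatted numbers to integers."""
    out = df.copy()
    # Trim whitespace and upper-case so "ca" and "CA " become the same value.
    out["State"] = out["State"].str.strip().str.upper()
    # Drop thousands separators, then cast the strings to integers.
    out["Population"] = (
        out["Population"].str.replace(",", "", regex=False).astype(int)
    )
    return out


cleaned = clean(raw)
print(cleaned)
```

Working on a copy keeps the raw DataFrame available for comparison if a conversion goes wrong.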
You can check out the output in the notebook or in the image below. In this step we run several machine learning models in order to evaluate which is the most effective at predicting opioid prescribers. Though it is beyond the scope of this pattern, predicting these opioid prescribers lays the groundwork for predicting the likelihood that a certain type of doctor prescribes opioids. Additionally, if we had more years of data (beyond 2014) we could also predict future rates of overdoses. For now, we'll just take a look at the models.
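Comparing several classifiers with scikit-learn generally follows the pattern sketched here. To stay self-contained, this example uses synthetic data from `make_classification` in place of the prescriber features, and the choice of models and hyperparameters is an assumption for illustration, not the notebook's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prescriber features and the opioid-prescriber label.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Illustrative set of classifiers; swap in whichever models you want to compare.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from most to least accurate.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Scoring every model on the same held-out split is what makes the accuracies comparable.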
After running various classifiers, we find that Random Forest, Gradient Boosting and our Ensemble models had the best performance on predicting opioid prescribers.
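One common way to build an ensemble in scikit-learn is a `VotingClassifier`, sketched below on synthetic data. This is an assumption about how such an ensemble could be constructed; the notebook may combine its models differently, and the member models chosen here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the prescriber dataset.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hard voting: each member predicts a class and the majority wins.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)

# Mean accuracy of the combined prediction on the held-out split.
ensemble_accuracy = ensemble.score(X_test, y_test)
print(f"Ensemble accuracy: {ensemble_accuracy:.3f}")
```

Majority voting tends to help when the member models make different kinds of errors, which is one reason to mix tree-based and linear models.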
Awesome job following along! Now go try and take this further or apply it to a different use case!
- Data Analytics Code Patterns: Enjoyed this Code Pattern? Check out our other Data Analytics Code Patterns
- AI and Data Code Pattern Playlist: Bookmark our playlist with all of our Code Pattern videos
- Watson Studio: Master the art of data science with IBM's Watson Studio
This code pattern is licensed under the Apache Software License, Version 2. Separate third party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 (DCO) and the Apache Software License, Version 2.