Tools Used: Excel, MySQL, Tableau
- What is an EDA?
EDA stands for Exploratory Data Analysis. It's the process where we explore and analyze a dataset to understand its main characteristics and patterns. In simple terms, EDA helps us to dig into the data and find interesting information about it. We look at different aspects of the data, such as its size, and distribution. We also identify any missing or unusual values. By visualizing the data using graphs or charts, we can spot trends or relationships between variables. Overall, EDA helps us get a better understanding of the data before diving into more complex analysis or making decisions based on the information it provides.
- What is the dataset about?
The "Hospice - General Information" dataset obtained from CMS (Centers for Medicare & Medicaid Services) provides a comprehensive overview of hospice providers. It encompasses crucial details including addresses, phone numbers, and ownership information. This dataset serves as a valuable resource for gaining insights into the characteristics of various hospice facilities. By analyzing this data, researchers, policymakers, and healthcare professionals can better understand the distribution, ownership patterns, and other key aspects of hospice care provision. The dataset offers a foundation for conducting research, improving service delivery, and making informed decisions in the field of end-of-life care.
- View of the Dataset
SELECT *
FROM hospice.gen_info
There are 6,091 rows and 12 columns in the dataset
The columns are: CMS_Certification_Number, Facility_Name, Address_Line_1, Address_Line_2, City, State, Zip_Code, County_Name, Phone_Number, CMS_Region, Ownership_Type, Certification_Date
- Finding Missing Values
SELECT
SUM(CASE WHEN CMS_Certification_Number IS NULL THEN 1 ELSE 0 END) AS Missing_CMS_Certification_Number,
SUM(CASE WHEN Facility_Name IS NULL THEN 1 ELSE 0 END) AS Missing_Facility_Name,
SUM(CASE WHEN Address_Line_1 IS NULL THEN 1 ELSE 0 END) AS Missing_Address_Line_1,
SUM(CASE WHEN Address_Line_2 IS NULL THEN 1 ELSE 0 END) AS Missing_Address_Line_2,
SUM(CASE WHEN City IS NULL THEN 1 ELSE 0 END) AS Missing_City,
SUM(CASE WHEN State IS NULL THEN 1 ELSE 0 END) AS Missing_State,
SUM(CASE WHEN Zip_Code IS NULL THEN 1 ELSE 0 END) AS Missing_Zip_Code,
SUM(CASE WHEN County_Name IS NULL THEN 1 ELSE 0 END) AS Missing_County_Name,
SUM(CASE WHEN Phone_Number IS NULL THEN 1 ELSE 0 END) AS Missing_Phone_Number,
SUM(CASE WHEN CMS_Region IS NULL THEN 1 ELSE 0 END) AS Missing_CMS_Region,
SUM(CASE WHEN Ownership_Type IS NULL THEN 1 ELSE 0 END) AS Missing_Ownership_Type,
SUM(CASE WHEN Certification_Date IS NULL THEN 1 ELSE 0 END) AS Missing_Certification_Date
FROM hospice.gen_info;
There were no missing values in the dataset
- Q1: How many distinct hospitals are in the dataset?
SELECT
COUNT(DISTINCT Facility_Name) as num_of_facilities
FROM hospice.gen_info
- Q2: How many cities, states, and counties are in the dataset?
SELECT
COUNT(DISTINCT City) as num_of_cities,
COUNT(DISTINCT State) as num_of_states,
COUNT(DISTINCT County_Name) as num_of_counties
FROM hospice.gen_info
After some research, the reason why there's 55 states is because 'GU' (Guam), 'DC' (District of Columbia), 'MP' (Northern Mariana Islands), 'PR' (Puerto Rico), 'VI' (U.S. Virgin Islands) were also included
- Q3: How many hospitals are in each state?
SELECT
DISTINCT state,
COUNT(facility_name) as num_of_hospitals
FROM hospice.gen_info
GROUP BY state
ORDER BY num_of_hospitals DESC
- **Q4: How many hospitals were there among each ownership type?
SELECT Ownership_Type,
COUNT(*) AS Hospital_Count
FROM hospice.gen_info
GROUP BY Ownership_Type;
- **Q5: How many hospitals were there among each CMS Region?
SELECT CMS_Region,
COUNT(*) AS Hospital_Count
FROM hospice.gen_info
GROUP BY CMS_Region
ORDER BY Hospital_Count DESC;
- Q6: What were the top 5 most common area codes among all hospitals?
SELECT SUBSTRING(Phone_Number, 1, 4) AS Area_Code,
COUNT(*) AS Hospital_Count
FROM hospice.gen_info
GROUP BY Area_Code
ORDER BY Hospital_Count DESC
LIMIT 5;
- Q7: How many hospitals were certified by each year?
SELECT YEAR(STR_TO_DATE(Certification_Date, '%m/%d/%y')) AS Certification_Year,
COUNT(*) AS Hospital_Count
FROM hospice.gen_info
GROUP BY Certification_Year
ORDER BY Certification_Year;
- Q8: What were the top 5 cities with the most hospitals?
SELECT City,
COUNT(*) AS Hospital_Count
FROM hospice.gen_info
GROUP BY City
ORDER BY Hospital_Count DESC
LIMIT 5;
- Q9: What were the top 5 counties with the most hospitals?
SELECT County_Name,
COUNT(*) AS Hospital_Count
FROM hospice.gen_info
GROUP BY County_Name
ORDER BY Hospital_Count DESC
LIMIT 5;
- Q10: What percentage of hospitals were for-profit and non-profit?
SELECT
Ownership_Type,
COUNT(*) AS Hospital_Count,
ROUND((COUNT(*)/(SELECT COUNT(*) FROM hospice.gen_info)) * 100, 2) AS Percentage
FROM hospice.gen_info
WHERE Ownership_Type IN ('non-profit', 'for-profit')
GROUP BY Ownership_Type;