GitHub-Insight-ANZ-Lab/copilot-challenge-data-engineer-python

COVID19 Worldwide Testing Data

INTRODUCTION

    The dataset tested_worldwide.csv originates from Kaggle. It records the number of COVID-19 tests conducted over time, which helps put daily reported case counts in context and shows how COVID-19 is actually spreading in each country.

INSTRUCTIONS

    1. Use Copilot Chat to create a new notebook in your project: run the /newnotebook command and name the notebook "COVID19 Worldwide Testing Data".
    2. Use Copilot and Copilot Chat to develop the exercise and support your learning.

EXERCISE

    Our analysis tries to provide an answer to this question: **Which countries have reported the highest number of positive cases in relation to the number of tests conducted?**

1. Understanding the Data

    1.1. Import the dataset and display the first 5 rows of the dataframe. 

    1.2. Display the number of rows and columns in the dataframe. 

    1.3. Display the data types of each column. 

    1.4. Display the number of missing values in each column. 

    1.5. Display the number of unique values in each column.
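
As a rough illustration of steps 1.1 to 1.5, the sketch below runs the usual pandas inspection calls on a tiny in-memory stand-in for the CSV. The column names (`Date`, `Country_Region`, `Province_State`, `positive`, `daily_tested`, `daily_positive`) are assumptions made for this example; check them against the real file before relying on them.

```python
import io

import pandas as pd

# Tiny stand-in for tested_worldwide.csv. The column names and values are
# assumptions for illustration, not taken from the real dataset.
sample_csv = io.StringIO(
    "Date,Country_Region,Province_State,positive,daily_tested,daily_positive\n"
    "2020-03-01,Italy,All States,1000,500,100\n"
    "2020-03-02,Italy,All States,1200,600,200\n"
    "2020-03-01,Canada,Ontario,50,300,\n"
    "2020-03-02,Canada,Ontario,70,400,20\n"
)

# With the real file this would be: pd.read_csv("tested_worldwide.csv")
df = pd.read_csv(sample_csv)

first_rows = df.head(5)            # 1.1 first five rows
n_rows, n_cols = df.shape          # 1.2 number of rows and columns
column_types = df.dtypes           # 1.3 data type of each column
missing_counts = df.isna().sum()   # 1.4 missing values per column
unique_counts = df.nunique()       # 1.5 unique values per column (NaN excluded)

print(first_rows)
print(f"{n_rows} rows, {n_cols} columns")
print(column_types)
print(missing_counts)
print(unique_counts)
```

In a notebook, each of these expressions can also be placed on the last line of its own cell so the result is rendered as a table.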

2. Data Cleaning

    2.1. Drop the columns that are not needed for the analysis. 

    2.2. Rename the columns to make them more readable. 

    2.3. Drop the rows that have missing values. 

    2.4. Convert the data types of the columns to the appropriate types. 

    2.5. Display the number of missing values in each column.
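
One possible shape for the cleaning steps 2.1 to 2.5 is sketched below on a small hand-made dataframe. Which columns to drop, the new names (`date`, `country`, `tests`, `positives`), and the target data types are all assumptions chosen for this example; the real dataset may call for different choices.

```python
import pandas as pd

# Hand-made sample; the column names are assumptions for illustration.
raw = pd.DataFrame(
    {
        "Date": ["2020-03-01", "2020-03-02", "2020-03-03"],
        "Country_Region": ["Italy", "Italy", "Canada"],
        "Province_State": ["All States", "All States", "Ontario"],
        "daily_tested": [500.0, 600.0, None],
        "daily_positive": [100.0, 200.0, 20.0],
    }
)

clean = (
    raw.drop(columns=["Province_State"])            # 2.1 drop unneeded columns
    .rename(
        columns={                                   # 2.2 more readable names
            "Date": "date",
            "Country_Region": "country",
            "daily_tested": "tests",
            "daily_positive": "positives",
        }
    )
    .dropna()                                       # 2.3 drop rows with missing values
    .astype({"tests": "int64", "positives": "int64"})  # 2.4 counts as integers
)
clean["date"] = pd.to_datetime(clean["date"])       # 2.4 dates as datetimes

remaining_missing = clean.isna().sum()              # 2.5 should now be all zeros
print(remaining_missing)
```

Converting the counts to integers only after `dropna()` matters here: a column that still contains missing values cannot be cast to `int64`.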

3. Extracting the Top Ten Countries with the Most COVID-19 Cases

    3.1. Create a new dataframe that contains the total number of positive cases for each country. 

    3.2. Sort the dataframe in descending order of the total number of positive cases. 

    3.3. Display the top ten countries with the most positive cases.
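
Steps 3.1 to 3.3 amount to a group, sum, and sort. A minimal sketch, assuming a cleaned dataframe with hypothetical `country` and `positives` columns holding daily counts:

```python
import pandas as pd

# Sample daily counts; column names and values are assumptions for illustration.
clean = pd.DataFrame(
    {
        "country": ["Italy", "Italy", "Canada", "Spain"],
        "positives": [100, 200, 20, 150],
    }
)

positives_by_country = (
    clean.groupby("country", as_index=False)["positives"].sum()  # 3.1 total per country
    .sort_values("positives", ascending=False)                   # 3.2 largest first
)
top_positive = positives_by_country.head(10)                     # 3.3 top ten

print(top_positive)
```

Summing is only appropriate if the column holds daily counts. If the real column is cumulative, taking the maximum per country would be the closer equivalent.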

4. Extracting the Top Ten Countries with the Most Tests Conducted

    4.1. Create a new dataframe that contains the total number of tests conducted for each country. 

    4.2. Sort the dataframe in descending order of the total number of tests conducted. 

    4.3. Display the top ten countries with the most tests conducted.
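
Because steps 4.1 to 4.3 repeat the pattern of the previous section with a different column, one option is to wrap it in a small helper. The function name `top_countries_by_total` and the `country` and `tests` column names are hypothetical, introduced only for this sketch.

```python
import pandas as pd


def top_countries_by_total(df, value_column, n=10):
    """Return the n countries with the largest total of value_column.

    Assumes df has a 'country' column (a hypothetical name for this sketch).
    """
    totals = df.groupby("country", as_index=False)[value_column].sum()
    return totals.sort_values(value_column, ascending=False).head(n)


# Sample daily test counts; values are made up for illustration.
clean = pd.DataFrame(
    {
        "country": ["Italy", "Italy", "Canada", "Canada", "Spain"],
        "tests": [500, 600, 300, 400, 500],
    }
)

top_tested = top_countries_by_total(clean, "tests")  # 4.1 to 4.3 in one call
print(top_tested)
```

The same helper could be reused for positive cases by passing a different `value_column`.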

5. Identifying the Top Three Countries with the Highest Number of Positive Cases Relative to Tests Conducted

    5.1. Merge the two dataframes created in the previous steps. 

    5.2. Create a new column that contains the ratio of positive cases to the number of tests conducted. 

    5.3. Sort the dataframe in descending order of the ratio of positive cases to the number of tests conducted. 

    5.4. Display the top three countries with the highest ratio of positive cases to the number of tests conducted.
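
The merge and ratio steps 5.1 to 5.4 might look like the following. The two input dataframes stand in for the per-country totals built earlier; their names, columns, and values are assumptions for this sketch.

```python
import pandas as pd

# Stand-ins for the per-country totals from the previous sections.
positives_by_country = pd.DataFrame(
    {"country": ["Italy", "Canada", "Spain", "Peru"], "positives": [300, 20, 150, 40]}
)
tests_by_country = pd.DataFrame(
    {"country": ["Italy", "Canada", "Spain", "Peru"], "tests": [1100, 700, 500, 400]}
)

# 5.1 keep only countries present in both tables
merged = positives_by_country.merge(tests_by_country, on="country", how="inner")

# 5.2 share of conducted tests that came back positive
merged["positive_ratio"] = merged["positives"] / merged["tests"]

# 5.3 and 5.4 sort by the ratio and keep the top three
top_ratio = merged.sort_values("positive_ratio", ascending=False).head(3)
print(top_ratio)
```

A design point worth considering: if the two inputs have already been cut down to their top ten rows, an inner merge silently drops any country that appears in only one list, so merging the full per-country totals and truncating afterwards may give a more complete ranking.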

6. Displaying the Results

    6.1. Display the results in a chart that shows the top three countries with the highest ratio of positive cases to the number of tests conducted.

    6.2. Display the results in a chart that shows the top ten countries with the most positive cases.

    6.3. Display the results in a chart that shows the top ten countries with the most tests conducted.
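
For the charts in 6.1 to 6.3, a bar chart per result table is one reasonable choice. The sketch below draws the ratio chart with matplotlib; the dataframe contents, column names, and labels are assumptions for illustration, and the other two charts would follow the same pattern with a different dataframe and value column.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the top-three ratio table; values are made up for illustration.
top_ratio = pd.DataFrame(
    {"country": ["Spain", "Italy", "Peru"], "positive_ratio": [0.30, 0.27, 0.10]}
)

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(top_ratio["country"], top_ratio["positive_ratio"])  # one bar per country
ax.set_title("Top three countries by positive-to-tested ratio")
ax.set_xlabel("Country")
ax.set_ylabel("Positive cases / tests conducted")
fig.tight_layout()
```

In a Jupyter notebook the figure is normally rendered when the cell finishes; in a plain script, `plt.show()` or `fig.savefig(...)` would be needed to see it.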

7. Conclusion

    7.1. What are your conclusions? 

    7.2. What are the limitations of this analysis? 

    7.3. What are the next steps you would take to improve this analysis?