For this project, I used one of Udacity's curated datasets and investigated it with NumPy and Pandas. I chose the TMDB dataset, which has over 10,000 observations, and applied the entire data analysis process, starting by posing a question and finishing by sharing my findings.
The supporting lesson content introduced the key steps in the data analysis process:
- Choosing a dataset
- Asking questions
- Data wrangling
- Exploratory data analysis
- Drawing conclusions
I then applied those lessons to see how the steps fit together to answer my questions. I used Python and some of its libraries to wrangle, explore, analyze, and visualize the data, which made the process much easier to implement.
The project requires Python 3 plus the following Python libraries:
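As a rough illustration of how these steps fit together, here is a minimal sketch on a tiny made-up table. The column names (`title`, `release_year`, `budget`, `revenue`), the values, and the question asked are all assumptions for illustration; they are not the actual TMDB data or the analysis in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataset: column names and values are invented.
movies = pd.DataFrame({
    "title": ["A", "B", "B", "C", "D"],
    "release_year": [2010, 2010, 2010, 2011, 2011],
    "budget": [100.0, 0.0, 0.0, 50.0, 80.0],
    "revenue": [250.0, 40.0, 40.0, 30.0, 200.0],
})

# Data wrangling: drop duplicate rows and treat a zero budget as missing.
clean = movies.drop_duplicates().copy()
clean["budget"] = clean["budget"].replace(0.0, np.nan)
clean = clean.dropna(subset=["budget"])

# Exploratory analysis for the question
# "which release year had the higher mean profit?"
clean["profit"] = clean["revenue"] - clean["budget"]
mean_profit_by_year = clean.groupby("release_year")["profit"].mean()

# Drawing conclusions: pick the year with the highest mean profit.
best_year = mean_profit_by_year.idxmax()
print(mean_profit_by_year)
print("Year with highest mean profit:", best_year)
```

Treating a zero budget as missing is one possible wrangling choice, shown here only to make the step concrete.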
- Pandas
- NumPy
- Matplotlib
I ran the code in Jupyter Notebook.
After completing the project, I learned the following:
- The key steps in a typical data analysis process
- How to pose and answer questions with a given dataset
- How to investigate problems in a dataset and wrangle the data into a usable format
- How to communicate the results of an analysis
- How to use vectorized operations in NumPy and Pandas to speed up data analysis code
- How Pandas' Series and DataFrame objects make accessing data more convenient
- How to use Matplotlib to produce plots that show the findings
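
To make the points about vectorized operations, Series and DataFrame access, and plotting more concrete, here is a small sketch. The movie labels, column names, and numbers are invented for illustration and do not come from the TMDB dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up numbers for illustration only.
df = pd.DataFrame(
    {"budget": [100.0, 50.0, 80.0], "revenue": [250.0, 30.0, 200.0]},
    index=["Movie A", "Movie B", "Movie C"],
)

# Loop version: one Python-level subtraction per row.
profit_loop = [rev - bud for rev, bud in zip(df["revenue"], df["budget"])]

# Vectorized version: a single expression over whole columns (Series).
df["profit"] = df["revenue"] - df["budget"]

# NumPy functions also operate on whole arrays at once.
total_profit = np.sum(df["profit"].to_numpy())

# Label-based access on a Series and on a DataFrame.
profit_a = df["profit"]["Movie A"]
budget_c = df.loc["Movie C", "budget"]

# Bar chart of profit per movie.
fig, ax = plt.subplots()
ax.bar(list(df.index), df["profit"])
ax.set_xlabel("Movie")
ax.set_ylabel("Profit")
ax.set_title("Profit per movie (illustrative data)")
```

Both versions produce the same profits. On large tables the vectorized form is usually faster because the arithmetic runs in compiled code instead of a Python loop. Outside a notebook, calling `plt.show()` displays the figure.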