This project covers basic data preprocessing with Spark and builds a complete end-to-end machine learning pipeline. It also explores how to run SQL queries in PySpark and how to work with Spark DataFrames.
To run the code, you need to install:
1. Spark
2. Jupyter Notebook
3. Python 3.x
4. pandas and NumPy
Step # 1 Install Python
Step # 2 To install Spark, I recommend using this tutorial
Step # 3 Install Jupyter Notebook
Step # 4 After installing Python, you can install pandas and NumPy using pip.
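
Once the steps above are done, a quick check like the following can confirm that the Python side of the setup is visible to the interpreter. This is only a sketch: the import names `pyspark` and `notebook` are assumptions about how Spark and Jupyter Notebook were installed (for example, a JupyterLab-only install may not provide `notebook`), and the helper names `missing_packages` and `python_is_3` are hypothetical, not part of this project.

```python
import importlib.util
import sys

# Import names assumed for the requirements listed above.
# "pyspark" and "notebook" are assumptions; adjust them to match your install.
REQUIRED = ["pyspark", "notebook", "pandas", "numpy"]


def missing_packages(names):
    """Return the names from `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_3(version_info=sys.version_info):
    """Return True when the major version of the interpreter is 3."""
    return version_info[0] == 3


if __name__ == "__main__":
    if not python_is_3():
        print("Python 3.x is required, found:", sys.version.split()[0])
    missing = missing_packages(REQUIRED)
    if missing:
        print("Not found:", ", ".join(missing))
    else:
        print("All required Python packages were found.")
```

The check only looks packages up without importing them, so it does not start Spark or verify that a Java runtime is available.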