This project covers basic data preprocessing with Spark and builds a complete end-to-end machine learning pipeline on top of it. It also explores how to work with SQL in PySpark and how to work with DataFrames in Spark.
To run the code, you need to install:
1. Spark
2. Jupyter Notebook
3. Python 3.x
4. Pandas and NumPy
Step # 1 Install Python: https://www.python.org/downloads/
Step # 2 Install Spark. I recommend following this tutorial: https://bigdata-madesimple.com/guide-to-install-spark-and-use-pyspark-from-jupyter-in-windows/
Step # 3 Install Jupyter Notebook: https://jupyter.org/install
Step # 4 After installing Python, install Pandas and NumPy using pip: pip install pandas numpy
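
Once the four steps are done, a quick check like the one below can confirm that the Python side of the setup is visible to the interpreter. This is only a minimal sketch and not part of the project code: the function name check_environment is hypothetical, and the module names "pyspark" and "notebook" are assumptions about how Spark's Python API and Jupyter Notebook ended up installed on your machine.

```python
import importlib.util
import sys

# Module names to look for. "pyspark" and "notebook" are assumptions about
# how Spark's Python API and Jupyter Notebook were installed; adjust as needed.
REQUIRED_MODULES = ["pyspark", "notebook", "pandas", "numpy"]


def check_environment(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    # Report the interpreter version first, since the project expects Python 3.x.
    print("Python version:", sys.version.split()[0])
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Run it with the same Python interpreter that Jupyter uses. A module reported as MISSING usually means it was installed into a different Python environment or not installed at all.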