Basic Data Preprocessing and Feature Engineering for Machine Learning Using Spark

This project covers basic data preprocessing with Spark and builds a complete end-to-end machine learning pipeline. It also explores how to work with SQL in PySpark and with Spark DataFrames.

Prerequisites

To run the code, you need to install:

1. Spark
2. Jupyter Notebook
3. Python 3.x
4. pandas and NumPy

Installing the Prerequisites

Step # 1

Install Python from https://www.python.org/downloads/

Step # 2

Install Spark. I recommend following this tutorial: https://bigdata-madesimple.com/guide-to-install-spark-and-use-pyspark-from-jupyter-in-windows/

Step # 3

Install Jupyter Notebook by following https://jupyter.org/install

Step # 4

After installing Python, install pandas and NumPy using pip (for example, pip install pandas numpy).
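Once the steps above are done, a quick check like the following can show which Python packages are visible to the interpreter. This is a minimal sketch that uses only the standard library. The helper name check_prerequisites is hypothetical, and the module names (pyspark, notebook, pandas, numpy) are assumptions about how the prerequisites are exposed to Python. If Spark was installed standalone, pyspark may be reported as missing until SPARK_HOME is configured or a helper such as findspark is used.

```python
import importlib.util
import sys


def check_prerequisites(modules=("pyspark", "notebook", "pandas", "numpy")):
    """Return a dict mapping each module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    # Report the interpreter version, then the status of each package.
    print("Python", sys.version.split()[0])
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

The check only looks the modules up without importing them, so it runs quickly and does not start Spark.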