Basic Data Preprocessing and Feature Engineering for Machine Learning Using Spark

This project covers basic data preprocessing with Spark and builds a complete end-to-end machine learning pipeline. It also explores how to work with SQL in PySpark and with DataFrames in Spark.

Prerequisites

To run the code, you need to install:

1. Spark
2. Jupyter Notebook
3. Python 3.x
4. pandas and NumPy

Installing the Prerequisites

Step # 1 Install Python from https://www.python.org/downloads/

Step # 2 To install Spark, I recommend following this tutorial: https://bigdata-madesimple.com/guide-to-install-spark-and-use-pyspark-from-jupyter-in-windows/

Step # 3 To install Jupyter Notebook, follow the instructions at https://jupyter.org/install

Step # 4 After installing Python, install pandas and NumPy using pip.
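
Once the steps above are complete, a quick check like the one below may help confirm that the prerequisites are visible to your Python interpreter. This is only a minimal sketch and not part of the project code. The function names are hypothetical, and the module names are assumptions: `notebook` is the import name of Jupyter Notebook, and `pyspark` is found only if Spark's Python bindings are on the interpreter's path, which may not be the case for a standalone Spark install until its environment variables are configured.

```python
import importlib.util
import sys

# Assumed import names for the prerequisites listed above.
# "notebook" is the import name of Jupyter Notebook; "pyspark" is only
# discoverable if Spark's Python bindings are on the interpreter's path.
REQUIRED_MODULES = ["pyspark", "notebook", "pandas", "numpy"]


def check_prerequisites(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


def python_is_3x():
    """Return True when the running interpreter is Python 3."""
    return sys.version_info.major == 3


if __name__ == "__main__":
    print("Python 3.x:", python_is_3x())
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

The check only locates each module without importing it, so it runs quickly and does not start Spark. Any module reported as missing points back to the corresponding step above.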
