Husnain08/Building-data-pipelines-in-Spark

Basic Data Preprocessing and Feature Engineering for Machine Learning Using Spark

This project covers basic data preprocessing with Spark and builds a complete end-to-end machine learning pipeline. It also explores how to work with DataFrames in Spark and how to run SQL queries in PySpark. A minimal sketch of these ideas is shown below.
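The notebooks walk through these steps on the project's own dataset; as a rough, self-contained sketch of the same ideas (the toy data and column names here are illustrative, not the project's actual schema), a PySpark DataFrame, a SQL query, and a small feature-engineering pipeline can be combined like this:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Start a local Spark session
spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

# Toy DataFrame standing in for the project's dataset
df = spark.createDataFrame(
    [("a", 1.0, 0), ("b", 2.0, 1), ("a", 3.0, 1), ("b", 0.5, 0)],
    ["category", "amount", "label"],
)

# Query the DataFrame with SQL by registering it as a temporary view
df.createOrReplaceTempView("records")
spark.sql(
    "SELECT category, AVG(amount) AS avg_amount FROM records GROUP BY category"
).show()

# Chain feature engineering and a model into one end-to-end ML pipeline
indexer = StringIndexer(inputCol="category", outputCol="category_idx")
assembler = VectorAssembler(inputCols=["category_idx", "amount"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[indexer, assembler, lr])
model = pipeline.fit(df)
model.transform(df).select("category", "amount", "prediction").show()

spark.stop()
```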

Prerequisites

To run the code, you need to install:

1. Spark
2. Jupyter Notebook
3. Python 3.x
4. Pandas and NumPy

Installing the Prerequisites

Step # 1 Install Python from https://www.python.org/downloads/

Step # 2 To install Spark, I recommend following this tutorial: https://bigdata-madesimple.com/guide-to-install-spark-and-use-pyspark-from-jupyter-in-windows/

Step # 3 To install Jupyter Notebook, follow https://jupyter.org/install

Step # 4 After installing Python, install Pandas and NumPy using pip (e.g. `pip install pandas numpy`).
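Once everything is installed, one quick way to confirm the environment works is to import each library and print its version (a minimal sanity check, assuming the `pyspark` package is importable from the same Python environment that Jupyter uses):

```python
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession

# Confirm the data libraries are importable
print("pandas:", pd.__version__)
print("numpy:", np.__version__)

# Confirm a local Spark session can be created
spark = SparkSession.builder.appName("setup-check").getOrCreate()
print("spark:", spark.version)
spark.stop()
```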
