Basic Data Preprocessing and Feature Engineering for Machine Learning Using Spark

This project covers basic data preprocessing with Spark and builds a complete end-to-end machine learning pipeline. It also explores how to work with SQL in PySpark and with Spark DataFrames.

Prerequisites

To run the code, you need to install:

1. Spark
2. Jupyter Notebook
3. Python 3.x
4. Pandas and NumPy

Installing the Prerequisites

Step # 1 Install Python from https://www.python.org/downloads/

Step # 2 To install Spark, I recommend following this tutorial (written for Windows): https://bigdata-madesimple.com/guide-to-install-spark-and-use-pyspark-from-jupyter-in-windows/

Step # 3 To install Jupyter Notebook, follow https://jupyter.org/install

Step # 4 After installing Python, install Pandas and NumPy using pip, for example: pip install pandas numpy
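
Once the steps above are done, a quick check like the following can confirm that the Python-side packages are visible to your interpreter. This is a minimal sketch, not part of the project: the helper name check_prerequisites is hypothetical, and the import names pyspark (Spark's Python API) and notebook (Jupyter Notebook) are assumptions that may differ depending on how you installed them.

```python
import importlib.util
import sys

# Import names for the prerequisites listed above.
# Assumption: Spark's Python API is importable as "pyspark" and
# Jupyter Notebook as "notebook"; adjust if your setup differs.
REQUIRED_MODULES = ("pyspark", "notebook", "pandas", "numpy")


def check_prerequisites(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

The check only tests whether each package can be located; it does not start Spark, so a working Java and Spark installation still needs to be verified by following the tutorial in Step # 2.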

