This project predicts delivery time on Zomato using a distributed computing approach with Sparkling Water, a framework by H2O.ai that integrates H2O’s machine learning capabilities with Apache Spark. Running on a multinode Spark cluster lets the workload scale to large datasets and spreads preprocessing and model training across several machines.
The project involves the following steps:
- Setting up a multinode Spark cluster.
- Initiating a Spark session and creating an H2O context across the cluster.
- Loading and preprocessing the data to handle missing values and prepare it for model training.
- Training the machine learning model using H2O’s algorithms.
- Evaluating the model’s performance and making predictions.
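The steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of `spark3.py`: the master URL, CSV path, target column name, the choice of a gradient boosting estimator, and the 80/20 split are all assumptions made for the example, and dropping rows with missing values stands in for whatever preprocessing the project actually performs.

```python
from typing import List


def feature_columns(columns: List[str], target: str) -> List[str]:
    """Return every column except the target, preserving the original order."""
    return [c for c in columns if c != target]


def run_pipeline(master_url: str, csv_path: str, target: str) -> float:
    """Train a model on the cluster and return its RMSE on a held-out split."""
    # Imported lazily so this file can be loaded on a machine without Spark or H2O.
    from pyspark.sql import SparkSession
    from pysparkling import H2OContext
    from h2o.estimators import H2OGradientBoostingEstimator

    # Start a Spark session against the cluster master, then an H2O context on top of it.
    spark = (
        SparkSession.builder
        .master(master_url)
        .appName("delivery-time-prediction")  # hypothetical application name
        .getOrCreate()
    )
    hc = H2OContext.getOrCreate()

    # Load the data and drop rows with missing values (assumed, simplest possible cleaning).
    df = spark.read.csv(csv_path, header=True, inferSchema=True).na.drop()

    # Hand the Spark DataFrame to H2O and split it into training and test frames.
    frame = hc.asH2OFrame(df)
    train, test = frame.split_frame(ratios=[0.8], seed=42)

    # Train an H2O model; gradient boosting is an assumed choice for this sketch.
    model = H2OGradientBoostingEstimator(seed=42)
    model.train(
        x=feature_columns(frame.columns, target),
        y=target,
        training_frame=train,
    )

    # Evaluate on the held-out frame.
    rmse = model.model_performance(test_data=test).rmse()
    spark.stop()
    return rmse
```

A call might look like `run_pipeline("spark://master-host:7077", "data/train.csv", "time_taken")`, where all three arguments are hypothetical placeholders for your own master URL, dataset file, and target column.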
- `data/`: Directory containing the dataset files used for training and testing the model.
- `sparkling_water_running_images/`: Directory containing images of various stages of the project.
- `spark3.py`: The main Python script to run the project.
- `requirements.txt`: Lists the dependencies required for the project.
- `Project Work Outline.txt`: Provides a detailed outline of the project work.
- `pyspark_multinode_instructions.txt`: Contains instructions for setting up the multinode cluster in an Ubuntu environment and running the project.
The `data` folder contains the dataset files used for training and testing the model.
The `sparkling_water_running_images` folder contains images of various stages of the project, including the setup and prediction outputs.
- `spark3.py`: The main Python script to run the project.
- `requirements.txt`: Lists the dependencies required for the project.
- `Project Work Outline.txt`: Provides a detailed outline of the project work.
- `instructions.txt`: Contains instructions for setting up and running the project.
Detailed instructions for setting up and running the project are available in the `instructions.txt` file.
The project requirements are listed in the `requirements.txt` file.
To run the project, follow the steps in the `instructions.txt` file.
- The Root Mean Square Error (RMSE) of our prediction model is 4.41.
- A screenshot of the prediction output is available below and in the `sparkling_water_running_images` folder with the filename `model_training_prediction.png`.
- A screenshot of the actual and predicted delivery time is available below and in the `sparkling_water_running_images` folder with the filename `prediction_plot.png`.
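For readers unfamiliar with the metric, RMSE is the square root of the mean squared difference between actual and predicted values, expressed in the same units as the target. The sketch below shows the calculation in plain Python; the function name and the sample numbers are illustrative only and are not taken from the project's data or code.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error between two equal-length, non-empty sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Square each residual, average the squares, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up delivery times: residuals are 3, -4, 0, 0, so the mean square is 25 / 4.
example = rmse([30, 25, 40, 35], [27, 29, 40, 35])
print(example)  # 2.5
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.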
Contributions are welcome! Please open an issue or submit a pull request for any changes.