This project focuses on classifying wine samples based on their chemical composition using the K-Nearest Neighbors (KNN) algorithm. We explore the dataset to understand feature distributions, evaluate data correlations, and implement data transformations for optimal model performance. This repository provides code, analysis, and insights gained from applying machine learning techniques to the Wine dataset.
- Project Overview
- Data Description
- Methodology
- Results
- Conclusion
- References
- Images and Visuals
- Installation and Usage
In this study, we utilize the Wine dataset to classify wines into categories based on their chemical attributes. Through extensive exploratory data analysis (EDA) and preprocessing techniques, we prepare the data to ensure optimal model performance. The analysis includes normalization, standardization, and parameter tuning, providing a structured workflow for machine learning classification.
The Wine dataset, obtained from the UCI Machine Learning Repository, contains 13 features representing chemical attributes of different wine samples, plus a class label identifying three distinct wine types.
Key features include:
- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline
Each sample is labeled as one of three wine types, facilitating supervised classification.
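The dataset also ships with scikit-learn, so it can be loaded without downloading the UCI files. A minimal loading sketch (the notebook in this repository may load the data differently):

```python
import pandas as pd
from sklearn.datasets import load_wine

# Load the Wine dataset: 178 samples, 13 chemical features, 3 classes
wine = load_wine()
X = pd.DataFrame(wine.data, columns=wine.feature_names)
y = pd.Series(wine.target, name="class")

print(X.shape)           # (178, 13)
print(y.value_counts())  # samples per wine type
```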
We conducted a detailed EDA to explore feature distributions, check for outliers, and analyze correlations (a brief pandas sketch follows this list):
- Distribution Analysis: Understanding the spread and skew of data in each feature.
- Correlation Analysis: Identifying correlations among features (e.g., Total Phenols and Flavanoids) to gauge their influence on wine classification.
- Data Separation: Examining features like Proline and Color Intensity, which exhibit distinct separation across classes.
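A brief EDA sketch with pandas, assuming the `X` and `y` objects from the loading sketch above:

```python
# Distribution analysis: spread and skew of each feature
print(X.describe())
print(X.skew())

# Correlation analysis, e.g. total phenols vs. flavanoids
corr = X.corr()
print(corr.loc["total_phenols", "flavanoids"])

# Class-wise separation of a single feature such as proline
print(X.groupby(y)["proline"].describe())
```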
Normalization
To bring all features onto a common scale, we normalized the data to the range [0, 1], ensuring that no single feature dominates the distance calculations in KNN.
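A minimal normalization sketch using scikit-learn's `MinMaxScaler`; the split ratio and random seed are illustrative, and the scaler is fit on the training data only to avoid leakage:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hold out a test set before scaling so the scaler only sees training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scaler = MinMaxScaler()                       # rescales each feature to [0, 1]
X_train_norm = scaler.fit_transform(X_train)
X_test_norm = scaler.transform(X_test)        # reuse the training min/max
```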
Standardization
As an alternative scaling, we standardized the dataset to have a mean of 0 and a standard deviation of 1. This is equally important for distance-based models like KNN, and it is less sensitive to outliers than min-max normalization.
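The equivalent sketch with `StandardScaler`, reusing the split from the normalization step:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()                     # zero mean, unit variance per feature
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)         # reuse the training mean and std
```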
Using `KNeighborsClassifier` from scikit-learn, we configured key parameters (a configuration sketch follows the list):
- `n_neighbors`: Number of nearest neighbors used in the vote.
- `weights`: How neighbor votes are weighted (uniformly, or inversely by distance).
- `metric`: Distance calculation method (e.g., Minkowski, which yields Euclidean distance for p = 2).
- `p`: Power parameter for the Minkowski distance.
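A minimal configuration sketch; the values shown are illustrative, not the tuned values from the notebook:

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(
    n_neighbors=5,        # number of neighbors in the vote
    weights="distance",   # weight votes by inverse distance
    metric="minkowski",   # Minkowski family of distances
    p=2,                  # p=2 corresponds to Euclidean distance
)
```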
Following data preparation, we applied KNN and evaluated the model’s performance on both normalized and standardized datasets.
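A sketch of that comparison, assuming the scaled arrays and the `knn` estimator defined above (actual accuracies depend on the tuned parameters):

```python
from sklearn.metrics import accuracy_score

for name, (X_tr, X_te) in {
    "normalized": (X_train_norm, X_test_norm),
    "standardized": (X_train_std, X_test_std),
}.items():
    knn.fit(X_tr, y_train)
    acc = accuracy_score(y_test, knn.predict(X_te))
    print(f"{name}: accuracy = {acc:.3f}")
```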
With optimal parameter tuning and preprocessing, both normalization and standardization produced accuracies above 95%, suggesting that:
- Data Separability: The dataset's inherent feature separability contributed to model accuracy.
- Preprocessing: Effective scaling enhanced KNN’s ability to generalize.
- Hyperparameter Selection: Carefully selected hyperparameters such as `n_neighbors` and `weights` were critical to achieving high accuracy (a grid-search sketch follows this list).
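One sketch of how such tuning can be done with `GridSearchCV`; the parameter grid is illustrative, not necessarily the one used in this repository:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    "n_neighbors": list(range(1, 21)),
    "weights": ["uniform", "distance"],
    "p": [1, 2],  # Manhattan vs. Euclidean distance
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train_std, y_train)   # tune on the standardized training data
print(search.best_params_, search.best_score_)
```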
This study confirms that feature separability, appropriate preprocessing, and hyperparameter tuning significantly impact classification accuracy in KNN models. Our findings support the efficacy of KNN in distinguishing wine types based on chemical composition.
- Aich S., Al-Absi A.A., et al. (2018) - A classification approach using various feature sets for predicting wine quality using machine learning.
- Arauzo-Azofra A., et al. (2011) - Feature selection methods in classification problems.
- [Further references related to this study]
To set up the project locally, follow these steps:
- Clone this repository: `git clone https://github.com/yourusername/wine-dataset-analysis.git`
- Install dependencies: use the `requirements.txt` file to install the necessary libraries: `pip install -r requirements.txt`
- Run the Jupyter Notebook: launch Jupyter Notebook or JupyterLab and open the main analysis notebook to execute the code.