BUISNESSCASE & GOAL OF PROJECT: BASED ON GIVEN FEATURE OF DATASET WE NEED TO PREDICT THE PERFOMANCE RATING OF EMPLOYEE
INX Future Inc Employee Performance - Project
The Data science project which is given here is an analysis of employee performance
A trained model which can predict the employee performance based on factors as inputs. This will be used to hire employees Recommendations to improve the employee performance based on insights from analysis The given Employee dataset consist of 1200 rows. The features present in the data are 28 columns. The shape of the dataset is 1200x28. The 28 features are classified into quantitative and qualitative where 19 features are quantitative (11 columns consists numeric data & 8 columns consists ordinal data) and 8 features are qualitative. EmpNumber consist alphanumerical data (distinct values) which doesn't play a role as a relevant feature for performance rating.
From Correlation we can get the important aspects of the data, Correlation between features and Performance Rating.Correlation is a statistical measure that expresses the extent to which two variables are linearly related.The analysis of the project has gone through the stage of Univariate,Bivariate & Multivariate analysis, correlation analysis and analysis by each department to satisfy the project goal.
The dataset consists of Categorical data and Numerical data. The Target variable consist of ordinal data, so this is a classification problem.The multiple machine learning model used in this project is Support vector classifier, Random forest classifier & Artifical neural network[Multilayer percepton]. from above all models Artifical neural network[Multilayer percepton] predicts higher accuracy 95.80%.
One of the important goal of this project is to find the important feature affecting the performance rating. The important features were predicted using the machine learning model feature importance technique. The main technique used in the preprocessing data using the Mannual & Frequency encoding method to convert the string - categorical data into numerical data, because, Most of machine learning methods are based on numerical methods where strings are not supportive. The overall project was performed and achieved the goals by using the machine learning model and visualization techniques.
The data was given from the IABAC for this project where the collected source is IABAC™. The data is based on INX Future Inc, (referred as INX ). It is one of the leading data analytics and automation solutions provider with over 15 years of global business presence. INX is consistently rated as top 20 best employers past 5 years. The data is not from the real organization. The whole project was done in Jupiter notebook with python platform.
Data were analyzed by describing the features present in the data. the features play the bigger part in the analysis. The features tell the relation between the dependent and independent variables. Pandas also help to describe the datasets answering following questions early in our project. The data present in the dataset are divided into numerical and categorical data.
Categorical Features EmpNumber Gender EducationBackground MaritalStatus EmpDepartment EmpJobRole BusinessTravelFrequency OverTime Attrition
Numerical Features Age DistanceFromHome EmpHourlyRate NumCompaniesWorked EmpLastSalaryHikePercent TotalWorkExperienceInYears TrainingTimesLastYear ExperienceYearsAtThisCompany ExperienceYearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
Ordinal Features EmpEducationLevel EmpEnvironmentSatisfaction EmpJobInvolvement EmpJobLevel EmpJobSatisfaction EmpRelationshipSatisfaction EmpWorkLifeBalance PerformanceRating
-
Univariate Analysis: In univariate analysis we get the unique labels of categorical features, as well as get the range & density of numbers
-
Bivariate Analysis: In bivariate analysis we check the feature relationship with target veriable.
-
Multivariate Analysis: In multivariate Analysis check the relationship between two veriable with respect to the target veriable.
CONCLUSION There are some features are positively correlated with performance rating( Target variable) [Emp Environment Satisfaction,Emp Last Salary Hike Percent,Emp Work Life Balance]
Basic Check & Statistical Measures* Their is no constant column is present in Numerical as well as categoriacl data.
In general, one of the first few steps in exploring the data would be to have a rough idea of how the features are distributed with one another. To do so, we shall invoke the familiar distplot function from the Seaborn plotting library. The distribution has been done by both numerical features. it will show the overall idea about the density and majority of data present in a different level.
The age distribution is starting from 18 to 60 where the most of the employees are laying between 30 to 40 age count Employees are worked in the multiple companies up to 8 companies where most of the employees worked up to 2 companies before getting to work here. The hourly rate range is 65 to 95 for majority employees work in this company. In General, Most of Employees work up to 5 years in this company. Most of the employees get 11% to 15% of salary hike in this company.
-
Check Missing Value: Their is no missing value in data
-
Categorical Data Conversion: Handel categorical data with the help of frequency and mannual encoding, because feature is contain lot's of labels
-
Mannual Encoding: Mannual encoding is a best techinque to handel categorical feature with the help of map function, map the labels based on frequency.
-
Frequency Encoding: Frequency encoding is an encoding technique to transform an original categorical variable to a numerical variable by considering the frequency distribution of the data getting value counts.
-
Outlier Handling Some features are contain outliers so we are impute this outlier with the help of IQR because in all features data is not normally distributed
-
Feature Transformation: In YearsSinceLastPromotion some skewed & kurtosis is present, so we are use Square Root Transformation techinque
- Square root transformation: Square root transformation is one of the many types of standard transformations.This transformation is used for count data (data that follow a Poisson distribution) or small whole numbers. Each data point is replaced by its square root. Negative data is converted to positive by adding a constant, and then transformed.
- Q-Q Plot: Q–Q plot is a probability plot, a graphical method for comparing two probability distributions by plotting their quantiles against each other.
- Scaling The Data: scaling the data with the help of Standard scalar
- Standard Scaling: Standardization is the process of scaling the feature, it assumes the feature follow normal distribution and scale the feature between mean and standard deviation, here mean is 0 and standard deviation is always 1.
-
Drop unique and constant feature: Dropping employee number because this is a constant column as well as drop Years Since Last Promotion because we create a new feaure using square root transformation
-
Checking Correlation: Checking correlation with the help of heat map, and get the their is no highly correlated feature is present.
Heatmap: A heatmap is a graphical representation of data that uses a system of color-coding to represent different values.
-
Check Duplicates: In this data Their is no dupicates is present.
-
PCA: Use pca to reduce the dimension of data, Data is contain total 27 feature after dropping unique and constant column,from PCA it shows the 25 feature has less varaince loss, so we are going to select 25 feature.
Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and enabling the visualization of multidimensional data. Formally, PCA is a statistical technique for reducing the dimensionality of a dataset.
- Saving Pre-Process Data: save the all preprocess data in new file and add target feature to it.
-
Define Dependant and Independant Features:
-
Balancing the data: The data is imbalance, so we need to balance the data with the help of SMOTE
SMOTE: SMOTE (synthetic minority oversampling technique) is one of the most commonly used oversampling methods to solve the imbalance problem. It aims to balance class distribution by randomly increasing minority class examples by replicating them. SMOTE synthesises new minority instances between existing minority instances. 3.Splitting Training And Testing Data: 80% data use for training & 20% data used for testing
- AIM: Create a sweet spot model (Low bias, Low variance)
HERE WE WILL BE EXPERIMENTING WITH THREE ALGORITHM
- Random Forest
- Decision Tree
- Gradient Boosting
Save model with the helpof pickle file
Jupyter
Pandas Numpy Matplotlib Seaborn pylab Scipy Sklearn Pickle
PLOT USED
- Violinplot: It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared.
- CountPlot: countplot is used to Show the counts of observations in each categorical bin using bars. Sales: The Performace rating level 3 is more in the sales department. The male performance rating the little bit higher compared to female.
-
Human Resources: The majority of the employees lying under the level 3 performance . The older people are performing low in this department. The female employees in HR department doing really well in their performance.
-
Development: The maximum number of employees are level 3 performers. Employees of all age are performing at the level of 3 only. The gender-based performance is nearly same for both.
-
Data Science: The highest average of level 3 performance is in data science department. Data science is the only department where less number of level 2 performers. The overall performance is higher compared to all departments. Male employees are doing good in this department.
-
Research & Development: The age factor is not deviating from the level of performance here where different employees with different age are there in every level of performance. The R&D has the good female employees in their performance.
-
Finance: The finance department performance is exponentially decreasing when age increases. The male employees are doing good. The experience factor is inversely relating to the performance level.
The top three important features affecting the performance rating are ordered with their importance level as follows,
- EmpLastSalaryHikePercent
- EmpEnvironmentSatisfaction
- YearsSinceLastPromotion
- Age
- DistanceFromHome
- EmpEducationLevel
- Age
- DistanceFromHome
- EmpEducationLevel
The trained model is created using the machine learning algorithm as follows with the accuracy score,
- 1.Random Forest Classifier: 95.04% accuracy
- 2.Dicision Tree classifier: 77.71% accuracy
- 3.Gradient Boosting Classifier: 94.48% accuracy
- The overall employee performance can be achieved by employee environment satisfaction. The company needs to focus more on the employee environment satisfaction.
- The salary hike will give the boost to the employees to perform well.
- Promote the employee ervery 6th month
- Improve Employee's work-life balance this affects the performance rating.
- While recruiting for HR, consider the female candidates where they perform well compared to male.
- The development and sales department is having an overall higher performance comparing to rest of the departments. While some of the employees who gives feedback like Low & Medium from Job Satisfaction & Relationship Satisfaction feature, such employees gives Excellent performance more in number. So company should focus on them. umber. So company should focus on them.