ProblemSolving-FrameWork (Note: in Progress)

Framework to convert Business Problem into Data Problem

The framework I used:

Understand Business problem using TOSCAR framework
- T : Trouble
- O : Owner
- S : Success Criteria
- C : Constraints
- A : Actors
- R : References

| T - Trouble | O - Owner | S - Success Criteria | C - Constraints | A - Actors | R - References |
|---|---|---|---|---|---|
| Why? | Who will own the implementation? | How will we measure success? | How frequently can a customer be contacted in a day/week/month? | CEO (in our case) | Have we tried this in the past? |
| Why now? | CEO (in our case) | Average spending of an existing customer? | How much can we spend? | Operations team, maybe next | What were the results? |
| Why existing customers? | Sales Director, maybe next | Percentage of customers buying over a fixed time period? | Do we have a CRM? | Customer management | How are cross-sell and up-sell currently handled? |
| Any competition benchmark? | Customer management director, maybe next | What to achieve in which time frame? | How much past data do we have? | Sales team | |
| Do we have any cross-sell going on? | What does the CEO feel about this problem? | Can we achieve the same success in other ways? | Do we capture behavioral data? | Director | |
| Where does it fit in your (CEO's) priorities? | Do others in the organization even feel this is a problem? | | Any regulations to follow or keep in mind? | | |

Frame Problem Statement
Break the problem statement into smaller problems
Convert the smaller problems into data problems

Classification Problem
Regression Problem

Getting Started

When given a new ML project, the workflow could be as follows:

Define Problem
Investigate and characterize the problem and clarify the project goal.
Summarize Data
Use descriptive statistics and visualization techniques to get a grasp of the data.
- Descriptive Statistics
  data dimensions, attribute types, attribute summaries (count, mean, std, min/max, percentiles), class distribution, correlations between attributes, skew of univariate distributions
- Visualization
  univariate plots (histograms, density plots, box plots), multivariate plots (correlation matrix plot, scatter plot matrix)
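
The descriptive statistics listed above can be sketched with pandas. This is a minimal illustration on synthetic data; the column names (`age`, `income`, `visits`, `churned`) are hypothetical and stand in for whatever dataset the project actually uses.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a real dataset (all column names are hypothetical).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 70, size=200),
    "income": rng.lognormal(mean=10, sigma=0.5, size=200),
    "visits": rng.poisson(3, size=200),
})
df["churned"] = (df["visits"] < 2).astype(int)  # toy class label

print(df.shape)                      # data dimensions (rows, columns)
print(df.dtypes)                     # attribute types
print(df.describe())                 # count, mean, std, min/max, percentiles
print(df["churned"].value_counts())  # class distribution
print(df.corr())                     # pairwise Pearson correlations
print(df.skew())                     # skew of each univariate distribution
```

For the visual side, pandas also offers `df.hist()`, `df.plot(kind="box")` and `pandas.plotting.scatter_matrix(df)`, which need matplotlib installed.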
Data Preprocessing [Incomplete]

3.1. Transformation
Data is preprocessed because different algorithms make different assumptions about the data and therefore require different transformations. Here are some common preprocessing techniques:
- Rescaling
  Rescales attributes with varying ranges so that they all lie between 0 and 1. Useful for algorithms that weight inputs (regression, neural networks) and algorithms that use distance measures (kNN).
- Standardization
  Transforms attributes with a Gaussian distribution to a standard Gaussian distribution (mean 0 and standard deviation 1). Useful for linear regression, logistic regression and LDA.
- Normalization
  Rescales each observation (row) to have a length of 1 (called a unit norm, or a vector of length 1, in linear algebra). Useful for sparse datasets with attributes of varying scales, and for algorithms that weight inputs (neural networks) or use distance measures (kNN).
- Binarization
  Transforms data using a binary threshold (1 for values above the threshold, 0 for values at or below it).
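
The four transformations above map onto classes in `sklearn.preprocessing`. The sketch below applies each to a tiny hand-made array; the array and the threshold of 2.0 are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, MinMaxScaler, Normalizer, StandardScaler

# Two attributes on very different scales (made-up values).
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Rescaling: each column is mapped to the range [0, 1].
rescaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Standardization: each column gets mean 0 and standard deviation 1.
standardized = StandardScaler().fit_transform(X)

# Normalization: each row is scaled to unit L2 norm.
normalized = Normalizer(norm="l2").fit_transform(X)

# Binarization: values above the threshold become 1, all others 0.
binarized = Binarizer(threshold=2.0).fit_transform(X)

print(rescaled)
print(standardized)
print(normalized)
print(binarized)
```

In a real project the scaler would be fitted on the training folds only and then applied to the held-back data, for example by placing it in a `sklearn.pipeline.Pipeline`.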
3.2. Feature Selection
Irrelevant or partially relevant features can hurt model performance, for example by decreasing the accuracy of many models. Feature selection picks the features that contribute most to the output variable you want to predict. It can help reduce overfitting, improve accuracy and reduce training time. Here are some common techniques:
- Statistical Test Selection with *chi-2*
  Selects the features that have the strongest relationship with the output variable.
- Recursive Feature Elimination (RFE)
  Recursively removes attributes and builds a model on the attributes that remain.
- Principal Component Analysis (PCA)
  A data reduction technique. It uses linear algebra to transform the dataset into a compressed form; you choose the number of dimensions (principal components) to keep in the transformed result.
- Feature importance
  Uses ensembles of decision trees such as Random Forest and Extra Trees to estimate the importance of features.
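
A minimal sketch of the four techniques above using scikit-learn, on a synthetic dataset from `make_classification`. The number of features to keep (4, 3, 3) and the choice of logistic regression as the RFE estimator are arbitrary assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: 8 features, of which 3 are informative and 2 redundant.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=2, random_state=0)

# Chi-squared test: requires non-negative features, so rescale to [0, 1] first.
X_pos = MinMaxScaler().fit_transform(X)
kbest = SelectKBest(score_func=chi2, k=4).fit(X_pos, y)
X_kbest = kbest.transform(X_pos)
print("chi2 scores:", kbest.scores_)

# RFE: repeatedly fit the model and drop the weakest feature until 3 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)
print("RFE selected mask:", rfe.support_)

# PCA: project onto the first 3 principal components.
pca = PCA(n_components=3).fit(X)
X_pca = pca.transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Feature importance from an ensemble of randomized trees.
forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
print("importances:", forest.feature_importances_)
```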
Algorithm Evaluation
- Separate train/test dataset (Resampling)
  In most cases, k-fold cross-validation (e.g. k = 3, 5 or 10) is used to estimate algorithm performance with less variance than a single train/test split. The dataset is first split into k parts (folds). The algorithm is then trained on k-1 folds and tested on the held-back fold, and this is repeated so that each fold serves once as the held-back test set. Finally, summarize the result using the mean and standard deviation of the k performance scores.
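
The k-fold procedure described above can be sketched with `KFold` and `cross_val_score`. The dataset is synthetic, and logistic regression with k = 5 is only an example choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary classification data (stand-in for a real dataset).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Split into k = 5 folds; each fold is held back once as the test set.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

# One accuracy score per fold.
scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```
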
- Performance Metrics
  The choice of metrics influences how the performance of ML algorithms is measured and compared, because it determines how you weight the importance of different characteristics of the results and, ultimately, which algorithm you choose.
  - For Classification Problem
    1. Classification Accuracy
      Classification accuracy is the ratio of the number of correct predictions to the number of all predictions. It is only suitable when each class has a roughly equal number of observations and all predictions and prediction errors are equally important.
    2. Logarithmic Loss
      Logarithmic loss (log loss) evaluates predicted probabilities of membership in a given class. Correct and incorrect predictions are rewarded or punished in proportion to the confidence of the prediction. Smaller log loss is better, with 0 for a perfect prediction.
    3. Area Under ROC Curve (AUC)
      AUC is used to evaluate binary classification problems. It represents a model's ability to discriminate between the positive and negative classes (1 for perfect prediction, 0.5 for as good as random).
      The ROC curve plots sensitivity (true positive rate) against 1 - specificity (false positive rate) across classification thresholds.
    4. Confusion Matrix
      A confusion matrix tabulates predicted classes against actual classes. For a good model, the majority of the predictions fall on the diagonal of the matrix.
    5. Classification Report
      A report provided by scikit-learn that includes precision, recall and F1-score for each class.
  - For Regression Problem
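    1. Mean Absolute Error (MAE) L1-norm
      MAE is the mean of the absolute differences between predictions and actual values.
    2. Mean Squared Error (MSE) L2-norm
      MSE is the mean of the squared differences between predictions and actual values. Taking its square root gives the Root Mean Squared Error (RMSE), which is in the same units as the output variable.
    3. R^2
      R^2 indicates the goodness of fit of a set of predictions to the actual values: 1 is a perfect fit, 0 is no better than always predicting the mean, and it can be negative for fits worse than that. Statistically, it is called the coefficient of determination.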
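
The classification and regression metrics above are all available in `sklearn.metrics`. The sketch below computes them on small hand-made arrays (the values are invented for illustration), which makes it easy to check the numbers by hand.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, log_loss, mean_absolute_error,
                             mean_squared_error, r2_score, roc_auc_score)

# --- Classification: true labels and predicted probabilities of class 1 ---
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.6])
y_pred = (y_prob >= 0.5).astype(int)  # hard labels at a 0.5 threshold

accuracy = accuracy_score(y_true, y_pred)  # 4 of 6 correct
logloss = log_loss(y_true, y_prob)         # uses probabilities, not labels
auc = roc_auc_score(y_true, y_prob)        # 7 of 9 pos/neg pairs ranked correctly
cm = confusion_matrix(y_true, y_pred)      # rows: actual, columns: predicted
print(accuracy, logloss, auc)
print(cm)
print(classification_report(y_true, y_pred))

# --- Regression: true values and predictions ---
r_true = np.array([3.0, 5.0, 2.0, 7.0])
r_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(r_true, r_pred)  # mean of |error| = 0.875
mse = mean_squared_error(r_true, r_pred)   # mean of error^2 = 1.3125
rmse = np.sqrt(mse)                        # back in the units of the target
r2 = r2_score(r_true, r_pred)              # 1 - SS_res / SS_tot
print(mae, mse, rmse, r2)
```
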
- Spot-Checking Algorithm
  Spot-checking is a way to discover which algorithms perform well on an ML problem. Since we do not know in advance which algorithms will work best on our dataset, we try a handful of them quickly and then dig deeper into the promising ones. Here are some common algorithms:
  - Classification
    - Logistic Regression (Linear)
    - Linear Discriminant Analysis (Linear)
    - k-Nearest Neighbors (Non-linear)
    - Naive Bayes (Non-linear)
    - Classification and Regression Trees (CART) (Non-linear)
    - Support Vector Machine (Non-linear)
  - Regression
    - Linear Regression (Linear)
    - Ridge Regression (Linear)
    - LASSO Linear Regression (Linear)
    - Elastic Net Regression (Linear)
    - k-Nearest Neighbors (Non-linear)
    - Classification and Regression Trees (CART) (Non-linear)
    - Support Vector Machine (Non-linear)
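
Spot-checking the classification algorithms listed above can be sketched as a loop that scores each one with the same k-fold split. The data is synthetic, all models use default settings, and the short names in the dictionary are just labels chosen here.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for a real dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "NB": GaussianNB(),
    "CART": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}

# Use the same folds for every model so the scores are comparable.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} (std {scores.std():.3f})")
```

The same loop works for regression by swapping in regressors (for example `LinearRegression`, `Ridge`, `Lasso`, `ElasticNet`) and a regression scoring string such as `"neg_mean_squared_error"`.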
Improve Results
- Ensemble
  Ensemble learning improves machine learning results by combining several models. It often gives better predictive performance than a single model.
  1. Bagging
    Bagging trains similar learners on random samples of the training data (drawn with replacement) and then averages their predictions (or takes a majority vote for classification).
  2. Boosting
    Boosting is an iterative technique that adjusts the weight of each observation based on the previous classification. If an observation was classified incorrectly, its weight is increased, and vice versa. Boosting in general decreases the bias error and builds strong predictive models. However, it may sometimes overfit the training data.
  3. Stacking
    Stacking uses a learner to combine the outputs of different learners. This can decrease either bias or variance error, depending on the combining learner used.
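
A minimal sketch of the three ensemble styles above with scikit-learn, compared by 5-fold cross-validated accuracy on synthetic data. The base learners and the number of estimators are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for a real dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

ensembles = {
    # Bagging: decision trees (the default base learner) on bootstrap samples.
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
    # Boosting: learners added sequentially, re-weighting misclassified samples.
    "boosting": AdaBoostClassifier(n_estimators=50, random_state=0),
    # Stacking: logistic regression combines the outputs of two base learners.
    "stacking": StackingClassifier(
        estimators=[("knn", KNeighborsClassifier()),
                    ("cart", DecisionTreeClassifier(random_state=0))],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
}

ensemble_scores = {}
for name, model in ensembles.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    ensemble_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f}")
```
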
- Params Tuning
  Algorithm tuning is the final step in applied machine learning before finalizing the model. In scikit-learn, there are two simple methods for parameter tuning:
  1. Grid Search Param Tuning
    Methodically builds and evaluates a model for each combination of algorithm parameters specified in a grid.
  2. Random Search Param Tuning
    Samples algorithm parameters from a random distribution (e.g. uniform) for a fixed number of iterations. A model is constructed and evaluated for each combination of parameters chosen.
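
Both tuning methods can be sketched with `GridSearchCV` and `RandomizedSearchCV`. The example below tunes k-NN on synthetic data; the candidate values for `n_neighbors` and `p` are illustrative assumptions, not recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (stand-in for a real dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Grid search: evaluate every combination in the grid (4 x 2 = 8 candidates).
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [2, 4, 8, 16], "p": [2, 3]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
print("grid search:", grid.best_params_, round(grid.best_score_, 3))

# Random search: sample 10 candidates; n_neighbors is drawn uniformly from 2..16.
rand = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions={"n_neighbors": randint(2, 17), "p": [2, 3]},
    n_iter=10,
    cv=5,
    scoring="accuracy",
    random_state=0,
)
rand.fit(X, y)
print("random search:", rand.best_params_, round(rand.best_score_, 3))
```

Grid search cost grows with the product of the list lengths, while random search cost is fixed by `n_iter`, which is why random search is often preferred when many parameters are tuned at once.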

Using hyper-parameter optimization, we can also find the right penalty (regularization) to use.

| Model | Optimize | Range of values |
|---|---|---|
| `Linear Regression` | fit_intercept; normalize | True/False; True/False |
| `Logistic Regression` | penalty; C | l1 or l2; 0.001, 0.01, ..., 10, ..., 100 |
| `k-NN` | n_neighbors; p | 2, 4, 8, 16, ...; 2, 3 |
| `SVM` | C; gamma; class_weight | 0.001, 0.01, ..., 10, ..., 100, ..., 1000; scale, auto or float, RS*; balanced, None |
| `Ridge` | alpha; fit_intercept; normalize | 0.01, 0.1, 1.0, 10, 100; True/False; True/False |
| `Lasso` | alpha; normalize | 0.1, 1.0, 10; True/False |
| `Decision Tree` | criterion; splitter; max_depth; min_samples_split; min_samples_leaf | gini, entropy; best, random; 5, 8, 10 |
| `Random Forest` | n_estimators; max_depth; min_samples_split; min_samples_leaf; max_features | 120, 300, 500, 800, 1200; 5, 8, 15, 25, 30, None; 1, 2, 5, 10, 15, 100; 1, 2, 5, 10; log2, sqrt, None |
| `XGBoost` | eta; gamma; max_depth; min_child_weight; subsample; colsample_bytree; lambda; alpha | 0.01, 0.015, 0.025, 0.05, 0.1; 0.05-0.1, 0.3, 0.5, 0.7, 0.9, 1.0; 3, 5, 7, 9, 12, 15, 17, 25; 1, 3, 5, 7; 0.6, 0.7, 0.8, 0.9, 1.0; 0.6, 0.7, 0.8, 0.9, 1.0; 0.01-0.1, 1.0, RS*; 0, 0.1, 0.5, 1.0, RS* |
| `AdaBoost` | n_estimators; algorithm; DT_criterion; DT_splitter; DT_min_samples_split | 50, 250, 500; SAMME, SAMME.R; gini, entropy; random, best; 5, 10, 20 |
| `Stacking Classifier` | xgboost; random forest; voting classifier; logistic regression; passthrough=True | |

Show Results
- Predictions on validation dataset
- Create standalone model on entire training dataset
- Save model for later use
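
The three steps above can be sketched as follows. This assumes `pickle` from the standard library for persistence (`joblib` is a common alternative); the file name `final_model.pkl`, the synthetic data and the logistic regression model are hypothetical choices for illustration.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, split into a training set and a held-out validation set.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Create a standalone model on the entire training dataset.
final_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predictions on the validation dataset.
val_predictions = final_model.predict(X_val)
print("validation accuracy:", final_model.score(X_val, y_val))

# Save the model for later use, then load it back and check it still agrees.
model_path = Path(tempfile.mkdtemp()) / "final_model.pkl"  # hypothetical file name
with open(model_path, "wb") as f:
    pickle.dump(final_model, f)

with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

same_predictions = bool((loaded_model.predict(X_val) == val_predictions).all())
print("loaded model reproduces predictions:", same_predictions)
```

Pickled models should only be loaded from trusted sources and with the same scikit-learn version that saved them.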

This is a private link for now.

Write an email to --> [email protected]

MvMukesh/ProblemSolving-FrameWork-ML
