Variable Types
- Categorical Variables: Nominal and Ordinal
- Numerical Variables: Discrete and Continuous
- Mixed variables: strings and numbers
- Datetime variables
Variable Types | Code + Blog Link | Video Link |
---|---|---|
Variable Characteristics
- Missing Data
- Cardinality
- Category Frequency
- Distributions
- Outliers
- Magnitude
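As a rough illustration of the characteristics listed above, the sketch below measures missing data, cardinality, category frequency and outliers on two toy columns. The data and the 1.5 x IQR outlier rule are assumptions chosen for the example, and plain Python is used only to keep the mechanics visible.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical toy columns; None marks a missing value.
ages = [22, 25, None, 31, 29, 24, None, 95, 27, 30]
cities = ["Paris", "Paris", "Lyon", "Paris", "Nice",
          "Lyon", "Paris", "Paris", "Lyon", "Paris"]

# Missing data: fraction of rows without a value.
missing_fraction = sum(v is None for v in ages) / len(ages)

# Cardinality: number of distinct categories.
cardinality = len(set(cities))

# Category frequency: share of rows per category.
category_frequency = {c: n / len(cities) for c, n in Counter(cities).items()}

# Outliers: observed values outside the 1.5 * IQR fences (one common rule of thumb).
observed = [v for v in ages if v is not None]
q1, _, q3 = quantiles(observed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in observed if v < lower or v > upper]

print(missing_fraction, cardinality, category_frequency, outliers)
```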
Variable Characteristics | Code + Blog Link | Video Link |
---|---|---|
Missing Data Imputation
- For Numerical Variables
- Mean and Median Imputation
- Arbitrary value imputation
- End of Tail Imputation
- For Categorical Variables
- Frequent category imputation
- Adding a missing category
- Random Sample Imputation
- Adding a missing indicator
- Imputation with Scikit-learn
- Imputation with Feature-engine
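A minimal sketch of three of the univariate techniques listed above: median imputation with a missing indicator, frequent category imputation, and adding a "Missing" category. The function names and toy data are hypothetical; in practice the fill values would be learned on the training set only and then reused on new data.

```python
from collections import Counter
from statistics import median

def impute_median_with_indicator(values):
    """Replace None with the median of the observed values and flag imputed rows."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    indicator = [int(v is None) for v in values]
    return imputed, indicator

def impute_frequent_category(values):
    """Replace None with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]

def impute_missing_category(values, label="Missing"):
    """Treat missingness as its own category."""
    return [label if v is None else v for v in values]

# Hypothetical toy columns.
income = [30.0, None, 45.0, 50.0, None, 38.0]
colour = ["red", None, "blue", "red", "red", None]

income_imputed, income_was_missing = impute_median_with_indicator(income)
colour_frequent = impute_frequent_category(colour)
colour_flagged = impute_missing_category(colour)
```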
Missing Data Imputation | Code + Blog Link | Video Link |
---|---|---|
Multivariate Imputation
- MICE
- KNN imputation
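KNN imputation can be sketched with scikit-learn's `KNNImputer`, which fills each missing entry from the rows that are closest on the observed features. The toy matrix and `n_neighbors=2` are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrix with two missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing value is replaced by the mean of that feature
# over the 2 nearest rows that have the feature observed.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```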
Multivariate Imputation | Code + Blog Link | Video Link |
---|---|---|
Categorical Variable Encoding
- Traditional Techniques
- One hot encoding: simple and of frequent categories
- Ordinal / Label encoding: arbitrary and ordered
- Count / Frequency encoding
- Monotonic Relationship
- Target mean encoding
- Weight of evidence
- Ordered label encoding
- Alternative Techniques
- Binary encoding
- Feature hashing
- Probability Ratio
- For Rare Labels
- One hot encoding of frequent categories
- Grouping of rare categories
- Rare Label encoding
- Encoding with Scikit-learn
- Encoding with category encoders
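The sketch below shows three of the encodings listed above in plain Python: count encoding, target mean encoding, and grouping of rare categories. The function names, the 20% rarity threshold and the toy data are assumptions. Target-based encodings should be learned on training data only to avoid leaking the target.

```python
from collections import Counter, defaultdict

def count_encode(categories):
    """Replace each category by the number of rows in which it appears."""
    counts = Counter(categories)
    return [counts[c] for c in categories]

def target_mean_encode(categories, target):
    """Replace each category by the mean of the target within that category."""
    sums, counts = defaultdict(float), Counter()
    for c, y in zip(categories, target):
        sums[c] += y
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in counts}
    return [means[c] for c in categories], means

def group_rare_labels(categories, min_frequency=0.2, rare_label="Rare"):
    """Collapse categories seen in fewer than min_frequency of the rows."""
    counts = Counter(categories)
    n = len(categories)
    return [c if counts[c] / n >= min_frequency else rare_label for c in categories]

# Hypothetical toy data: a categorical feature and a binary target.
city = ["A", "A", "B", "B", "B", "C", "A", "D", "B", "A"]
bought = [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]

city_counts = count_encode(city)
city_target_mean, mapping = target_mean_encode(city, bought)
city_grouped = group_rare_labels(city)
```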
Categorical Variable Encoding | Code + Blog Link | Video Link |
---|---|---|
Variable Transformation
- Mathematical Transformations
- Logarithmic
- Exponential / Power
- Reciprocal
- Box-Cox
- Yeo-Johnson
- Discretisation
- Unsupervised
- Equal-width
- Equal-frequency
- K-means
- Supervised
- Decision Tree
- Unsupervised
- Other
- Transformation with Scikit-learn
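The mathematical transformations listed above can be written as small functions. This is a sketch with a fixed `lmbda` passed in by hand; libraries usually estimate it from the data (for example by maximum likelihood), which is not shown here.

```python
import math

def log_transform(x):
    """Natural logarithm; defined for x > 0."""
    return math.log(x)

def reciprocal_transform(x):
    """1 / x; defined for x != 0."""
    return 1.0 / x

def power_transform(x, exponent=0.5):
    """x ** exponent; the default is the square root."""
    return x ** exponent

def box_cox(x, lmbda):
    """Box-Cox: (x**lmbda - 1) / lmbda, or log(x) when lmbda == 0. Requires x > 0."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lmbda == 0:
        return math.log(x)
    return (x ** lmbda - 1) / lmbda

def yeo_johnson(x, lmbda):
    """Yeo-Johnson: an extension of Box-Cox that also accepts zero and negative values."""
    if x >= 0:
        if lmbda == 0:
            return math.log1p(x)
        return ((x + 1) ** lmbda - 1) / lmbda
    if lmbda == 2:
        return -math.log1p(-x)
    return -(((-x + 1) ** (2 - lmbda) - 1) / (2 - lmbda))
```

With `lmbda = 1`, Yeo-Johnson leaves values unchanged and Box-Cox only shifts them by one, which is a handy sanity check.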
Variable Transformation | Code + Blog Link | Video Link |
---|---|---|
Discretisation
- Arbitrary
- Equal-frequency discretisation
- Equal-width discretisation
- K-means discretisation
- Discretisation with trees
- Discretisation with Scikit-learn
- Discretisation with Feature-engine
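To contrast equal-width and equal-frequency discretisation, the sketch below computes bin edges both ways and assigns each value to a bin. The helper names and toy data are hypothetical; intervals are treated as left-closed, with the maximum falling in the last bin.

```python
from bisect import bisect_right
from statistics import quantiles

def equal_width_edges(values, n_bins):
    """Edges that split the range [min, max] into bins of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

def equal_frequency_edges(values, n_bins):
    """Edges at quantiles, so each bin holds roughly the same number of rows."""
    cuts = quantiles(values, n=n_bins, method="inclusive")
    return [min(values)] + cuts + [max(values)]

def assign_bins(values, edges):
    """Return the bin index (0-based) of each value, using the inner edges as cut points."""
    inner = edges[1:-1]
    return [bisect_right(inner, v) for v in values]

# Hypothetical skewed variable: one large value stretches the range.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

width_bins = assign_bins(values, equal_width_edges(values, 3))
frequency_bins = assign_bins(values, equal_frequency_edges(values, 5))
```

On skewed data like this, equal-width bins put almost every row in the first bin, while equal-frequency bins spread the rows evenly.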
Discretisation | Code + Blog Link | Video Link |
---|---|---|
Outliers
- Discretisation
- Capping / Censoring
- Trimming / Truncation
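Capping and trimming differ only in what happens to values beyond the limits: capping replaces them with the limit, trimming drops the rows. The sketch below assumes limits from the 1.5 x IQR rule, which is one of several possible choices (Gaussian limits or fixed percentiles are others).

```python
from statistics import quantiles

def iqr_limits(values, fold=1.5):
    """Lower and upper limits at fold * IQR beyond the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr

def cap(values, lower, upper):
    """Capping / censoring: pull values outside the limits back to the limits."""
    return [min(max(v, lower), upper) for v in values]

def trim(values, lower, upper):
    """Trimming / truncation: drop values outside the limits."""
    return [v for v in values if lower <= v <= upper]

# Hypothetical toy variable with one extreme value.
values = [10, 12, 11, 13, 12, 11, 14, 90]

lower, upper = iqr_limits(values)
capped = cap(values, lower, upper)
trimmed = trim(values, lower, upper)
```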
Outliers | Code + Blog Link | Video Link |
---|---|---|
Feature Scaling
- Standardisation (common one)
- MinMaxScaling (common one)
- MaxAbsoluteScaling
- RobustScaling
- Scaling to absolute maxima
- Scaling to median & quantiles
- Scaling to unit norm
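The scaling methods above reduce to a few formulas, sketched here in plain Python on a toy column. The function names are hypothetical. Note that the first four operate on a feature (column), whereas scaling to unit norm operates on an observation (row).

```python
import math
from statistics import mean, median, pstdev, quantiles

def standardise(values):
    """(x - mean) / std, using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max_scale(values):
    """(x - min) / (max - min), mapping the column to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def max_abs_scale(values):
    """x / max(|x|), mapping the column to [-1, 1]."""
    biggest = max(abs(v) for v in values)
    return [v / biggest for v in values]

def robust_scale(values):
    """(x - median) / IQR, which is less sensitive to outliers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    med = median(values)
    return [(v - med) / (q3 - q1) for v in values]

def unit_norm(row):
    """Divide one observation (a row vector) by its Euclidean length."""
    length = math.sqrt(sum(v * v for v in row))
    return [v / length for v in row]

# Hypothetical toy column.
column = [2.0, 4.0, 6.0, 8.0, 10.0]
standardised = standardise(column)
```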
Models affected by feature magnitude
- Linear & Logistic Regression
- SVM
- KNN
- K-means Clustering
- LDA
- PCA
- Neural Networks
Models insensitive to feature magnitude
- Tree Based Models
- Classification & Regression Trees
- Random Forest
- Gradient Boosted Trees
Feature Scaling | Code + Blog Link | Video Link |
---|---|---|
Mixed variables
- Creating new variables from strings and numbers
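Mixed variables tend to come in two flavours: columns where some rows hold numbers and others hold strings, and values that combine letters and digits (a cabin code such as "C85", for instance). The helpers below are hypothetical sketches of how each could be split into a numerical and a categorical part.

```python
import re

def split_numeric_and_label(value):
    """For a column mixing numbers and strings across rows:
    return (number, None) if the value parses as a number, else (None, label)."""
    try:
        return float(value), None
    except ValueError:
        return None, value

def split_letters_and_digits(value):
    """For a value such as 'C85': return the letter part and the numeric part."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", value)
    if match is None:
        return None, None
    return match.group(1), int(match.group(2))

# Hypothetical toy values.
open_accounts = ["3", "A", "0", "D"]
numeric_part = [split_numeric_and_label(v)[0] for v in open_accounts]
label_part = [split_numeric_and_label(v)[1] for v in open_accounts]
```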
Mixed variables | Code + Blog Link | Video Link |
---|---|---|
Datetime Variables
- Extracting day, month, week, semester, year, etc.
- Extracting hour, minute, second, etc.
- Capturing Elapsed time
- Time between transactions
- Age
- Working with timezones
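The datetime features above can be sketched with the standard library alone. The timestamps are made up, and a fixed UTC-5 offset stands in for a real timezone so that the example does not depend on a timezone database.

```python
from datetime import datetime, timedelta, timezone

def date_parts(ts):
    """Extract calendar and clock components from a timestamp."""
    return {
        "year": ts.year,
        "semester": 1 if ts.month <= 6 else 2,
        "quarter": (ts.month - 1) // 3 + 1,
        "month": ts.month,
        "week": ts.isocalendar()[1],   # ISO week number
        "day": ts.day,
        "day_of_week": ts.weekday(),   # Monday = 0
        "hour": ts.hour,
        "minute": ts.minute,
        "second": ts.second,
    }

def age_in_years(birth, reference):
    """Completed years between a birth date and a reference date."""
    had_birthday = (reference.month, reference.day) >= (birth.month, birth.day)
    return reference.year - birth.year - (0 if had_birthday else 1)

# Hypothetical timestamps.
first = datetime(2021, 3, 15, 14, 30, 5)
second = datetime(2021, 3, 18, 14, 30, 5)

parts = date_parts(first)
days_between = (second - first).days              # time between transactions
age = age_in_years(datetime(1990, 6, 20), first)  # age at the first transaction

# Timezones: attach UTC, then convert to a fixed UTC-5 offset.
utc_time = first.replace(tzinfo=timezone.utc)
local_time = utc_time.astimezone(timezone(timedelta(hours=-5)))
```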
Datetime Variables | Code + Blog Link | Video Link |
---|---|---|
Text
- Characters, Words, Unique words
- Lexical diversity
- Sentences, Paragraphs
- Bag of Words
- TF-IDF
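A minimal sketch of the text features listed above. The tokeniser is a deliberately naive regular expression, and the TF-IDF variant is the textbook one (term frequency times log(N / document frequency)); libraries such as scikit-learn apply smoothing and normalisation, so their numbers will differ.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Naive tokeniser: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def text_features(text):
    """Simple counts plus lexical diversity (unique words / total words)."""
    words = tokenize(text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "n_unique_words": len(set(words)),
        "lexical_diversity": len(set(words)) / len(words) if words else 0.0,
        "n_sentences": len(sentences),
    }

def bag_of_words(docs):
    """Return the sorted vocabulary and one row of word counts per document."""
    tokenized = [Counter(tokenize(d)) for d in docs]
    vocabulary = sorted({w for counts in tokenized for w in counts})
    return vocabulary, [[counts[w] for w in vocabulary] for counts in tokenized]

def tf_idf(docs):
    """Unsmoothed TF-IDF weights, one dict per document."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(w for words in tokenized for w in set(words))
    weights = []
    for words in tokenized:
        counts = Counter(words)
        weights.append({
            w: (c / len(words)) * math.log(len(docs) / doc_freq[w])
            for w, c in counts.items()
        })
    return weights

# Hypothetical toy corpus.
docs = ["the cat sat", "the dog sat", "the cat ran"]
features = text_features("The cat sat. The cat ran!")
vocabulary, matrix = bag_of_words(docs)
weights = tf_idf(docs)
```

A word that appears in every document, such as "the" here, gets a weight of zero under this unsmoothed variant.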
Transactions & Time Series
- Aggregate data
- Number of payments in last 3, 6, 12 months
- Time since last transaction
- Total spending in last month
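The aggregates above can be sketched for a single customer as follows. The transactions and the reference date are made up, and months are approximated as 30-day windows, which is an assumption of this example rather than a rule.

```python
from datetime import date, timedelta

# Hypothetical transaction history for one customer: (date, amount).
transactions = [
    (date(2023, 1, 10), 50.0),
    (date(2023, 4, 2), 20.0),
    (date(2023, 5, 28), 75.0),
    (date(2023, 6, 15), 30.0),
]
reference = date(2023, 6, 30)  # the date at which features are computed

def in_window(transactions, reference, days):
    """Transactions in the `days` days up to and including the reference date."""
    start = reference - timedelta(days=days)
    return [(d, a) for d, a in transactions if start < d <= reference]

features = {
    # Months approximated as 30 days each (assumption).
    "n_payments_last_3m": len(in_window(transactions, reference, 90)),
    "n_payments_last_6m": len(in_window(transactions, reference, 180)),
    "n_payments_last_12m": len(in_window(transactions, reference, 365)),
    "total_spend_last_month": sum(a for _, a in in_window(transactions, reference, 30)),
    "days_since_last_transaction": (reference - max(d for d, _ in transactions)).days,
}
```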
Feature Combination
- Ratio: total debt divided by income --> debt-to-income ratio
- Sum: debt across different credit cards --> total debt
- Subtraction: income minus expenses --> disposable income
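The three combinations above, sketched on one hypothetical applicant record. The field names are assumptions made for the example.

```python
def combine_features(record):
    """Derive sum, ratio and difference features from a single applicant record."""
    total_debt = record["debt_card_a"] + record["debt_card_b"]      # sum
    income = record["income"]
    debt_to_income = total_debt / income if income else None        # ratio (undefined for zero income)
    disposable_income = income - record["expenses"]                 # subtraction
    return {
        "total_debt": total_debt,
        "debt_to_income": debt_to_income,
        "disposable_income": disposable_income,
    }

# Hypothetical applicant.
applicant = {
    "income": 4000.0,
    "expenses": 2500.0,
    "debt_card_a": 1200.0,
    "debt_card_b": 800.0,
}
combined = combine_features(applicant)
```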
Pipelines
- Classification Pipeline
- Regression Pipeline
- Pipeline with cross-validation
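A minimal sketch of a classification pipeline evaluated with cross-validation in scikit-learn. The synthetic dataset, the choice of steps and the model are assumptions for illustration. The point is that imputation and scaling are re-fitted inside each fold, so nothing is learned from the held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Feature engineering steps followed by the model, all in one estimator.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation: every step is fitted on the training folds only.
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```

A regression pipeline follows the same pattern with a regressor as the final step.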
Pipelines | Code + Blog Link | Video Link |
---|---|---|