Feature Engineering Checklist

Rahul Gupta
7 min read · Nov 15, 2020

There are many good reasons for using a checklist while doing feature engineering in a machine learning project. I have listed a few below that I found to hold true in my own projects.

  • To eliminate mistakes
  • To ensure consistency
  • To ensure that everything necessary is completed and nothing is missed
  • To reduce decision fatigue by not forcing us to remember every little thing

Here is the checklist to guide you through your machine learning project.

1. Feature Selection

2. Handling Mixed Variables

3. Feature Extraction/Feature Generation

4. Imputation: Replace missing values

5. Outliers

6. Rare Label Engineering

7. Convert categorical columns/Feature Encoding

8. Feature Scaling

9. Extracting Date

10. Feature Transformation

Feature Selection

Feature Selection means selecting only the most useful features to train on. When presented with very high-dimensional data, models usually struggle because:

  • The cost of data collection is high
  • Training time grows rapidly with the number of features
  • The risk of overfitting increases with the number of features

Remove features that are highly correlated with each other (Multicollinearity).

def correlation(dataset, threshold):
    # Set of the names of correlated columns
    col_corr = set()
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            # we are interested in the absolute coefficient value
            if abs(corr_matrix.iloc[i, j]) > threshold:
                # getting the name of the column
                colname = corr_matrix.columns[i]
                col_corr.add(colname)
    return col_corr

# To get the set of correlated features and drop them
corr_features = correlation(X_train, 0.7)
X_train = X_train.drop(corr_features, axis=1)

Remove features that have low variance (i.e. the same value for the majority or all of the observations).

from sklearn.feature_selection import VarianceThreshold

var_thres = VarianceThreshold(threshold=0)
var_thres.fit(data)

# To see the remaining columns
print(data.columns[var_thres.get_support()])

# To see the dropped (constant) columns
constant_columns = [col for col in data.columns
                    if col not in data.columns[var_thres.get_support()]]
for feature in constant_columns:
    print(feature)

Feature Selection Techniques

  1. Filter Based
  • Correlation
  • Chi-square
  • ANOVA test (Analysis of Variance)

2. Wrapper Based

  • RFE (Recursive Feature Elimination)
  • Forward Selection
  • Backward Elimination

3. Embedded

  • Lasso
  • Ridge
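
As a minimal, non-authoritative sketch of two of these techniques with scikit-learn (assuming X_train and y_train are already defined), recursive feature elimination and Lasso-based embedded selection might look like this:

from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import Lasso, LogisticRegression

# Wrapper based: RFE recursively drops the weakest features
# (assumes a classification target y_train)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X_train, y_train)
print(X_train.columns[rfe.support_])

# Embedded: Lasso shrinks weak coefficients to zero
# (more natural for a regression target)
sfm = SelectFromModel(Lasso(alpha=0.01))
sfm.fit(X_train, y_train)
print(X_train.columns[sfm.get_support()])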

Mixed Variables

  • Extract the categorical part into one variable and the numerical part into another, to see whether that adds value to the predictive model.
  • The categorical part may require further processing, e.g. conversion to upper case or removal of special characters.
  • Drop the original variable.
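
A minimal sketch of this split, assuming a hypothetical mixed column 'cabin' with values like 'C123':

import pandas as pd

# Split the hypothetical mixed variable 'cabin' into its two parts
df['cabin_cat'] = df['cabin'].str.extract(r'([A-Za-z]+)', expand=False).str.upper()
df['cabin_num'] = pd.to_numeric(df['cabin'].str.extract(r'(\d+)', expand=False))

# Drop the original mixed variable
df = df.drop('cabin', axis=1)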

Feature Extraction/Feature Generation

  • Combine existing features to produce a more useful one. For example, a car’s mileage may be highly correlated with its age, so the two can be merged into one feature that represents the car’s wear and tear.
  • Add derived columns and check their correlation with the target feature. If the correlation is better than that of the base columns individually, keep the derived column; otherwise drop it.
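
A minimal sketch of the second point, assuming hypothetical columns mileage, age and a target column price:

# Derived feature: a rough proxy for wear and tear (hypothetical columns)
df['wear_and_tear'] = df['mileage'] * df['age']

# Keep the derived column only if it correlates with the target
# better than the base columns do individually
print(df[['mileage', 'age', 'wear_and_tear']].corrwith(df['price']))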

Imputation: Replace missing values

There are a lot of techniques that can be used to impute missing values. The choice of technique depends on the dataset under consideration.

  • Do Nothing

Imputing values is often not required for tree-based models, as many implementations can handle missing data themselves. You just let the algorithm handle the missing data.

Some algorithms can factor in the missing values and learn the best imputation values for the missing data based on the training loss reduction (Ex- XGBoost)

  • Drop the column

If the percentage of missing data in a column is very high.

  • Drop the row

If the percentage of missing data in a row is very high.

  • Mean, Median, Mode imputation (missing completely at random)

This is the most commonly used technique. It doesn’t factor in the correlations between features and only works at the column level. It gives poor results on encoded categorical features (do NOT use it on categorical features). The mean can also be calculated grouped by a highly correlated feature.

  • Imputation by an arbitrary value, e.g. 999 (not missing at random)

It captures the importance of missingness if there is one. The rationale is that if the value is missing, it is for a reason.

  • Add a variable to denote missing data

It captures the importance of missingness (if any).

  • Fill missing data by using an algorithm like KNN
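
A minimal sketch of a missing-data indicator, mean imputation, and KNN imputation with scikit-learn, assuming a DataFrame df with a hypothetical numeric column 'Age':

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Add a variable to denote missing data before imputing
df['Age_missing'] = df['Age'].isnull().astype(int)

# Mean imputation for a single column
df['Age'] = SimpleImputer(strategy='mean').fit_transform(df[['Age']]).ravel()

# KNN imputation across all numeric columns
numeric_cols = df.select_dtypes(include=np.number).columns
df[numeric_cols] = KNNImputer(n_neighbors=5).fit_transform(df[numeric_cols])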

Outliers

  • An outlier is a data point which is significantly different from the remaining data.

Outlier detection

  • For continuous variables: use a boxplot to check for the presence of outliers.
  • For discrete variables: calculate the percentage of observations for each value. Values that appear in less than 1% of the observations can be considered outliers.
  • Find outliers by using one of the two methods listed below:
  1. Mean & Standard deviation method
  • If the distribution is Gaussian, outliers will lie outside the mean plus or minus 3 times the standard deviation of the variable.
upper_boundary = df['Age'].mean() + 3 * df['Age'].std()
lower_boundary = df['Age'].mean() - 3 * df['Age'].std()
print(lower_boundary, upper_boundary, df['Age'].mean())

2. IQR method

  • If the variable is not normally distributed, a general approach is to calculate the quartiles and then the inter-quartile range (IQR).
Q1 = train[sel_col[i]].quantile(0.25)
Q3 = train[sel_col[i]].quantile(0.75)
IQR = Q3 - Q1
print("Lower Bound", Q1 - 1.5 * IQR)
print("Upper Bound", Q3 + 1.5 * IQR)
  • Investigate the presence of outliers in the data to see whether they depend on other variables.
  • Some algorithms, like AdaBoost, are very sensitive to outliers, while decision trees largely ignore their presence.

Techniques

  • Capping or top-coding

It is a technique that replaces outliers with the upper boundary value. Use it for discrete values that appear in less than 1% of the observations.

  • Discretization, also known as Binning

It is the process of transforming continuous variables into discrete variables. In other words, it converts numbers to categories. It can improve model performance by grouping similar values, and it helps handle outliers by placing them into the lowest or highest intervals.

  • Mean/Median/Mode imputation

As shown above in the imputation section

  • Discard outliers

Drop the rows, only if the dataset is huge.
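
A minimal sketch of capping at the IQR boundaries and of binning, assuming a hypothetical numeric column 'Fare':

import pandas as pd

Q1 = df['Fare'].quantile(0.25)
Q3 = df['Fare'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Capping/top-coding: clip values to the boundaries
df['Fare_capped'] = df['Fare'].clip(lower=lower, upper=upper)

# Discretization/Binning: 5 equal-width bins absorb the extreme values
df['Fare_binned'] = pd.cut(df['Fare'], bins=5, labels=False)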

Rare Labels

Rare labels are labels within a categorical variable that are only present for a small percentage of the observations.

Cases & Techniques

  • One predominant category (95%-99%) (Low Variance)

These types of variables are often not useful for prediction and should be removed from the set of features.

  • A small number of categories (less than 5)

Engineering rare labels in variables with very few categories will not improve the performance of the algorithm. So, no changes are required.

  • High cardinality

Replace rare labels with the most frequent label, or group the rare labels together in a separate category called ‘Rare’.
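
A minimal sketch of grouping rare labels into a 'Rare' category, assuming a hypothetical categorical column 'Neighborhood':

import numpy as np

# Labels present in less than 1% of the observations are treated as rare
freq = df['Neighborhood'].value_counts(normalize=True)
rare_labels = freq[freq < 0.01].index

df['Neighborhood'] = np.where(df['Neighborhood'].isin(rare_labels),
                              'Rare', df['Neighborhood'])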

Feature Encoding

Techniques

  1. Nominal
  • One Hot Encoding

Replace the categorical variable with Boolean variables that indicate whether a certain label is present for that observation.

  • One Hot Encoding for Multi Categorical Variables

Perform rare label engineering and then encode the feature.

  • Mean Encoding

Replace the label by mean of the target for that label.

  • Count/Frequency Encoding

Replace each label with its count/frequency in the dataset. If two labels appear the same number of times, that is, contain the same number of observations, they will be merged, which may lose valuable information. This encoding also assigns somewhat arbitrary numbers, and therefore weights, to the labels, which may not be related to their predictive power.
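
A minimal sketch of one hot encoding and count/frequency encoding with pandas, assuming hypothetical categorical columns 'Embarked' and 'City':

import pandas as pd

# One Hot Encoding: one Boolean column per label
one_hot = pd.get_dummies(df['Embarked'], prefix='Embarked')
df = pd.concat([df.drop('Embarked', axis=1), one_hot], axis=1)

# Count/Frequency Encoding: replace each label by how often it appears
counts = df['City'].value_counts().to_dict()
df['City_encoded'] = df['City'].map(counts)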

2. Ordinal

  • Label Encoding

Replace the label by some ordinal number if meaningful.

  • Target Guided Ordinal encoding (prone to overfitting)

These methods create a monotonic relationship between the categorical variable and the target. These methods can also be used for numeric variables after discretization.

  • Ordering the labels according to the target

It means assigning a number to each label, but this numbering, this ordering, is informed by the mean of the target within the label. We calculate the mean of the target for each label/category, order the labels according to these means from smallest to largest, and number them accordingly.

  • Replacing labels by the risk factor/Mean Encoding

Replacing labels by the risk factor essentially means replacing the label with the mean of the target for that label. It is the same as ordering the labels according to the target, but here no ordinal numbering is done; the labels are replaced directly by the mean value.

  • Probability ratio encoding

For each label, we calculate the mean of target=1, that is, the probability of the target being 1 (P(1)), and the probability of the target being 0 (P(0)), which is calculated as 1 minus the mean. We then calculate the ratio P(1)/P(0) and replace the labels by that ratio.
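
A minimal sketch of these target guided encodings, assuming a hypothetical categorical column 'Cabin' and a binary target 'Survived':

# Mean of the target per label
means = df.groupby('Cabin')['Survived'].mean()

# Ordering the labels according to the target: rank labels by target mean
ordinal_map = {label: i for i, label in enumerate(means.sort_values().index)}
df['Cabin_ordinal'] = df['Cabin'].map(ordinal_map)

# Mean (risk factor) encoding: replace the label by the target mean
df['Cabin_mean'] = df['Cabin'].map(means)

# Probability ratio encoding: P(1) / P(0) per label
# (assumes no label has a target mean of exactly 1)
df['Cabin_ratio'] = df['Cabin'].map(means / (1 - means))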

Feature Scaling

Algorithms based on Euclidean distance and gradient descent are very sensitive to the scale of the features. They don’t perform well when input numerical attributes have very different scales, and gradient descent converges much faster when scaling is done.

Feature scaling is, however, not required for algorithms that are not distance based, for example decision trees and random forests.

Techniques

  1. Normalization or Min-Max scaling (0 to 1)
  • The transformed variable ranges from 0 to 1.
  • It is very sensitive to outliers
Z = (X - Xmin) / (Xmax - Xmin)

2. Standardization (Most frequently used)

  • The procedure involves subtracting the mean from each observation and then dividing by the standard deviation.
Z = (X - µ) / σ
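
A minimal sketch of both scalers with scikit-learn, assuming X_train and X_test are already defined (fit on the training set only, then transform both):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Normalization / Min-Max scaling to the [0, 1] range
minmax = MinMaxScaler()
X_train_minmax = minmax.fit_transform(X_train)
X_test_minmax = minmax.transform(X_test)

# Standardization to zero mean and unit variance
standard = StandardScaler()
X_train_std = standard.fit_transform(X_train)
X_test_std = standard.transform(X_test)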

Feature Transformation

Gaussian transformation means converting a feature so that it follows an approximately Gaussian/normal distribution. It mainly benefits algorithms that assume normally distributed inputs, such as linear regression and logistic regression.

  1. Visualize data distribution
  • Histogram
  • Q-Q plot

2. Techniques (to transform a variable toward a Gaussian distribution)

  • Logarithmic transformation
  • Reciprocal transformation
  • Square root transformation
  • Exponential/power transformation (more generally, any exponent can be used)
  • Box-Cox transformation
  • Discretization/Binning, which can be used to remediate both outliers and skewed data distributions
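
A minimal sketch of a few of these transformations and a Q-Q plot, assuming a strictly positive hypothetical column 'Fare':

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Logarithmic and square root transformations (values must be positive)
df['Fare_log'] = np.log(df['Fare'])
df['Fare_sqrt'] = np.sqrt(df['Fare'])

# Box-Cox transformation (values must be strictly positive)
df['Fare_boxcox'], _ = stats.boxcox(df['Fare'])

# Q-Q plot to check how close the transformed variable is to Gaussian
stats.probplot(df['Fare_log'], dist='norm', plot=plt)
plt.show()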

Gaussian distribution

Properties of Gaussian Distribution:

  • Unimodal: one mode
  • Symmetrical: left and right halves are mirror images
  • Bell-shaped: maximum height (mode) at the mean
  • Mean, Mode, and Median are all located in the center

CONCLUSION

To summarize, it is very important to follow a checklist for feature engineering in a machine learning project. The objective of this post is simply to give you a high-level peek into this field.

In each step of the checklist, multiple techniques are available. The choice of technique depends heavily on the dataset and the objective of the project. In some cases, several techniques will need to be tried separately to find out what gives better accuracy.
