Why do feature selection at all? Isn't more data always better for prediction? Well... not quite. Here are several reasons for selecting features in an ML model:

  • Models with fewer features are more explainable
  • Machine learning models with fewer features are easier to implement
  • Fewer features improve generalization by reducing overfitting
  • Feature selection removes redundant data
  • Models with fewer features train significantly faster
  • Models with fewer features are less prone to errors

Useful code

df_run.isna().sum() -> Count the missing values for each column in the DataFrame
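
A minimal runnable sketch of the same check (df_run here is a small made-up frame; in the real workflow it is the loaded dataset):

    import pandas as pd

    # Hypothetical stand-in for the real dataset.
    df_run = pd.DataFrame({
        "feat_a": [1.0, None, 3.0],
        "feat_b": [0.5, 0.7, None],
        "label": ["BBB+", "BBB-", "BBB+"],
    })

    # Missing-value count per column.
    print(df_run.isna().sum())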

Points to consider when doing Feature Selection

  • Check for extremely large values. What is the threshold? -> 1e10 (see the sketch after this list)
  • Check for quasi-constant features (say, the same value in 99% of rows)? -> Not needed at this stage
  • Correlation threshold -> 0.8 (sketched below)
  • What is the label of interest? -> BBB, I suppose (+ / -)
  • Label encoding is also important when training ML models. Some literature uses +1/-1 encoding, but for standard use let's use 1 for BBB+ and 0 for BBB- (see the encoding sketch below).
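
A sketch of the extreme-value check using the 1e10 threshold above (screening only numeric columns is an assumption):

    import numpy as np

    # Assumption: only numeric feature columns are screened.
    numeric = df_run.select_dtypes(include=[np.number])

    # Columns containing any value whose magnitude exceeds 1e10.
    extreme_cols = numeric.columns[(numeric.abs() > 1e10).any()]
    print(extreme_cols.tolist())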
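
For the 0.8 correlation threshold, a standard upper-triangle filter works; this is a sketch, not a final pipeline step:

    import numpy as np

    numeric = df_run.select_dtypes(include=[np.number])

    # Absolute pairwise correlations between features.
    corr = numeric.corr().abs()

    # Keep the upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

    # Drop one feature from every pair correlated above 0.8.
    to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
    df_reduced = df_run.drop(columns=to_drop)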
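
And the label encoding, assuming the label column is named "label" (the column name is a guess):

    # Map BBB+ -> 1 and BBB- -> 0; unexpected labels become NaN, which is worth checking.
    df_run["target"] = df_run["label"].map({"BBB+": 1, "BBB-": 0})
    assert df_run["target"].notna().all(), "unexpected label values"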