Why do feature selection at all? Isn't more data always better for prediction? Well... not quite. Here are several reasons for selecting features in an ML model:

  • Models with fewer features are more explainable
  • Machine learning models with fewer features are easier to implement
  • Fewer features improve generalization by reducing overfitting
  • Feature selection removes redundant data
  • Models with fewer features train significantly faster
  • Models with fewer features are less prone to errors

Useful code

df_run.isna().sum() -> Count the missing values for each column in the DataFrame
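
A minimal runnable sketch of the same check (df_run here is a small made-up frame; in the real workflow it is the loaded dataset):

    import pandas as pd

    # Hypothetical stand-in for the real dataset.
    df_run = pd.DataFrame({
        "feat_a": [1.0, None, 3.0],
        "feat_b": [0.5, 0.7, None],
        "label": ["BBB+", "BBB-", "BBB+"],
    })

    # Missing-value count per column.
    print(df_run.isna().sum())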

Points to consider when doing Feature Selection

  • Check for extremely large values. What is the threshold? -> 1e10 (see the sketch after this list)
  • Check for quasi-constant features (say, the same value in 99% of rows)? -> Not needed at this stage
  • Correlation threshold -> 0.8 (sketched below)
  • What is the label of interest? -> BBB, I suppose (+ / -)
  • Label encoding is also important when training ML models. Some literature uses +1/-1 encoding, but for standard use let's use 1 for BBB+ and 0 for BBB- (see the encoding sketch below).
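
A sketch of the extreme-value check using the 1e10 threshold above (screening only numeric columns is an assumption):

    import numpy as np

    # Assumption: only numeric feature columns are screened.
    numeric = df_run.select_dtypes(include=[np.number])

    # Columns containing any value whose magnitude exceeds 1e10.
    extreme_cols = numeric.columns[(numeric.abs() > 1e10).any()]
    print(extreme_cols.tolist())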
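
For the 0.8 correlation threshold, a standard upper-triangle filter works; this is a sketch, not a final pipeline step:

    import numpy as np

    numeric = df_run.select_dtypes(include=[np.number])

    # Absolute pairwise correlations between features.
    corr = numeric.corr().abs()

    # Keep the upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

    # Drop one feature from every pair correlated above 0.8.
    to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
    df_reduced = df_run.drop(columns=to_drop)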
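
And the label encoding, assuming the label column is named "label" (the column name is a guess):

    # Map BBB+ -> 1 and BBB- -> 0; unexpected labels become NaN, which is worth checking.
    df_run["target"] = df_run["label"].map({"BBB+": 1, "BBB-": 0})
    assert df_run["target"].notna().all(), "unexpected label values"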