June 30 - B3DB Feature Selection
Why do feature selection? Doesn't more data always give better predictions? Well... not precisely. Here are several reasons for selecting features in an ML model:
- Models with fewer features are easier to explain
- It is easier to implement machine learning models with reduced features
- Fewer features can reduce overfitting, which in turn improves generalization
- Feature selection removes data redundancy
- Training time of models with fewer features is usually lower
- Models with fewer features are less prone to errors
Useful code
df_run.isna().sum() -> Count the missing values in each column of the DataFrame
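A minimal sketch of how that one-liner behaves, using a toy stand-in for df_run. The column names and values below are made up for illustration and are not taken from B3DB.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_run; column names and values are hypothetical.
df_run = pd.DataFrame({
    "MolWt": [180.2, np.nan, 310.4, 151.1],
    "LogP": [1.2, 0.4, np.nan, np.nan],
    "TPSA": [63.6, 40.5, 20.2, 90.1],
})

# One count per column: how many entries are NaN/None.
missing_per_column = df_run.isna().sum()
print(missing_per_column)

# Keep only the names of columns that actually have missing values.
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_with_missing)
```

The result is a Series indexed by column name, so it can be filtered or sorted like any other Series before deciding what to drop or impute.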
Points to consider when doing Feature Selection
- Check for extremely large values, what is the threshold? -> 1e10 is the threshold
- Check for quasi-constant features? Let's say the same value in 99% of samples -> Not needed at this stage
- Correlation Threshold -> 0.8
- What is the label of interest? -> BBB, I suppose (+ / -)
- Label encoding is also important when training ML models. Some literature uses +1/-1 encoding but for standard use let's use 1 for BBB+ and 0 for BBB-.
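The points above could be sketched roughly as follows. This is only an illustration under several assumptions: the helper names (drop_extreme_columns, drop_correlated_columns) are hypothetical, correlation is taken as absolute Pearson correlation, the later feature of each highly correlated pair is the one dropped, and the label values are assumed to be the strings "BBB+" and "BBB-". The toy data is made up.

```python
import numpy as np
import pandas as pd

LARGE_VALUE_THRESHOLD = 1e10   # from the notes above
CORRELATION_THRESHOLD = 0.8    # from the notes above


def drop_extreme_columns(features, threshold=LARGE_VALUE_THRESHOLD):
    """Drop numeric columns with any absolute value above the threshold."""
    numeric = features.select_dtypes(include="number")
    too_large = (numeric.abs() > threshold).any(axis=0)
    return features.drop(columns=too_large[too_large].index)


def drop_correlated_columns(features, threshold=CORRELATION_THRESHOLD):
    """For each pair with |Pearson r| above the threshold, drop the later column."""
    corr = features.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)


# Toy data: "b" is almost a multiple of "a", "d" contains an extreme value.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],
    "c": [1.0, -1.0, 1.0, -1.0],
    "d": [0.5, 1e12, 0.1, 0.2],
    "label": ["BBB+", "BBB-", "BBB+", "BBB-"],
})

X = df.drop(columns="label")
X = drop_extreme_columns(X)
X = drop_correlated_columns(X)

# Encode the label as 1 for BBB+ and 0 for BBB- (assumed string values).
y = df["label"].map({"BBB+": 1, "BBB-": 0})

print(list(X.columns))
print(y.tolist())
```

Dropping the later column of each correlated pair is an arbitrary choice here; keeping whichever feature is easier to interpret or correlates better with the label would be equally valid.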