July 06 - B3DB Feature Selection
It's time to integrate the different data cleaning and feature selection strategies into a single Python script.
It will have the following structure:
- Load data
- Deal with extreme values (for example infinite)
- Deal with NaN Values
- Remove constant features using the variance threshold in Scikit-Learn
- Remove duplicate features using transpose with pandas
- Drop correlated features
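A minimal sketch of how these steps could be chained, assuming df_X holds only numeric features. The function name clean_and_select, the choice to drop columns that contain NaN (rather than impute), and the 0.95 correlation cutoff are illustrative assumptions, not the final script.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def clean_and_select(df_X, extreme=1.e10, corr_threshold=0.95):
    """Hypothetical helper chaining the steps listed above."""
    df = df_X.copy()

    # Extreme values: treat +/-inf and anything beyond `extreme` as missing.
    df = df.replace([np.inf, -np.inf], np.nan)
    df = df.mask(df.abs() > extreme)

    # NaN values: columns containing any NaN are simply dropped here
    # (an assumption; imputation is an equally valid choice).
    df = df.dropna(axis=1)

    # Constant features: VarianceThreshold(threshold=0.0) keeps only
    # columns whose variance is non-zero.
    selector = VarianceThreshold(threshold=0.0)
    selector.fit(df)
    df = df.loc[:, selector.get_support()]

    # Duplicate features: transpose, drop duplicated rows, transpose back.
    df = df.T.drop_duplicates().T

    # Correlated features: look at the upper triangle of the absolute
    # correlation matrix and drop the later column of each pair above the cutoff.
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return df.drop(columns=to_drop)


# Toy data, made up for illustration only
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, 2.0, 3.0, 4.0],     # duplicate of a
    "c": [5.0, 5.0, 5.0, 5.0],     # constant
    "d": [2.0, 4.0, 6.0, 8.1],     # almost perfectly correlated with a
    "e": [1.0, np.inf, 3.0, 4.0],  # contains an infinite value
    "f": [4.0, 1.0, 3.0, 2.0],     # weakly correlated with the rest
})
result = clean_and_select(toy)
```

Note: transposing a frame with mixed dtypes turns everything into object dtype, so this sketch is only meant for all-numeric features.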
Useful code for Data Wrangling
NaN values
df.isnull().sum()
-> will show the number of missing values in each column
df.isnull().sum().sum()
-> will show the total missing values in the data frame.
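A small worked example of the two calls on a toy frame (the data and the variable names per_column, total and cols_with_nan are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, np.nan, 6.0],
    "z": [7.0, 8.0, 9.0],
})

# Missing values per column: x has 1, y has 2, z has 0
per_column = df.isnull().sum()

# Missing values in the whole data frame: 3
total = df.isnull().sum().sum()

# Names of the columns that contain at least one NaN
cols_with_nan = per_column[per_column > 0].index.tolist()
```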
Max & Min values in dataframe
df_X.max(axis = 0).sort_values(ascending = False).head(n=25)
-> maximum value of each column, sorted from largest to smallest, first 25 elements.
df_X.max(axis = 1).sort_values(ascending = False).head(n=25)
-> maximum value of each instance (row), sorted from largest to smallest, first 25 elements.
Note: the min method can be used in the same way.
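To make the difference between axis = 0 and axis = 1 concrete, here is a toy example (feature names f1..f3 and the row labels are made up):

```python
import pandas as pd

df_X = pd.DataFrame(
    {"f1": [1.0, 50.0, 3.0], "f2": [200.0, 2.0, 4.0], "f3": [0.5, 0.1, 0.9]},
    index=["mol_a", "mol_b", "mol_c"],
)

# axis=0 collapses the rows: one maximum per column (feature)
col_max = df_X.max(axis=0).sort_values(ascending=False)

# axis=1 collapses the columns: one maximum per row (instance)
row_max = df_X.max(axis=1).sort_values(ascending=False)

# Same idea with min, sorted from smallest to largest
col_min = df_X.min(axis=0).sort_values(ascending=True)
```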
Extreme values
df_X.max(axis = 0).sort_values(ascending = False).where(lambda x : x > 1.e6).count()
-> count the columns whose max value is an extreme value (here, above 1.e6).
df_X.where(lambda x : x > 1.e6).count().sum()
-> count the number of occurrences of extreme values in the whole data frame
df_X.mask(df_X > 1.e10, np.nan, inplace = True)
-> replace all extreme values (here, above 1.e10) with NaN, in place
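One thing to keep in mind: a comparison like x > threshold only catches large positive values (including +inf), not large negative ones or -inf. A possible variant using the absolute value is sketched below; the toy data and the names n_high, n_extreme and df_clean are made up for illustration.

```python
import numpy as np
import pandas as pd

df_X = pd.DataFrame({
    "f1": [1.0, 2.0e12, 3.0],
    "f2": [np.inf, 5.0, -np.inf],
    "f3": [7.0, 8.0, 9.0],
})
threshold = 1.e10  # same cutoff as the mask call above

# Cells above the threshold, positive side only: 2.0e12 and +inf
n_high = int((df_X > threshold).sum().sum())

# Cells beyond the threshold in either direction: also counts -inf
n_extreme = int((df_X.abs() > threshold).sum().sum())

# Replace them with NaN, returning a new frame instead of modifying in place
df_clean = df_X.mask(df_X.abs() > threshold)
```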