July 06 - B3DB Feature Selection | The DataChemist's Journey

It's time to combine the different data cleaning and feature selection strategies, each based on its own criterion, into a single Python script. The script will have the following structure:

Load data
Deal with extreme values (for example, infinite values)
Deal with NaN values
Remove constant features using VarianceThreshold in Scikit-Learn
Remove duplicate features by transposing the data frame with pandas
Drop correlated features
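
The steps above could be chained roughly as in the sketch below. This is only a minimal illustration, not the final script: the helper name `clean_and_select`, the cut-off `extreme=1.0e6`, the correlation limit `corr_limit=0.95`, and the choice of median imputation for NaN values are all assumptions made for the example, and the toy data frame stands in for the real descriptors.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def clean_and_select(df_X, extreme=1.0e6, corr_limit=0.95):
    """Hypothetical helper that chains the cleaning steps listed above."""
    X = df_X.copy()

    # Extreme values: turn infinities and very large magnitudes into NaN.
    X = X.replace([np.inf, -np.inf], np.nan)
    X = X.mask(X.abs() > extreme)

    # NaN values: drop all-NaN columns, impute the rest (assumed: median).
    X = X.dropna(axis=1, how="all")
    X = X.fillna(X.median())

    # Constant features: keep only columns with non-zero variance.
    selector = VarianceThreshold(threshold=0.0)
    selector.fit(X)
    X = X.loc[:, selector.get_support()]

    # Duplicate features: transpose, drop duplicated rows, transpose back.
    X = X.T.drop_duplicates().T

    # Correlated features: drop one column of each highly correlated pair.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_limit).any()]
    return X.drop(columns=to_drop)


# Toy data standing in for the real feature matrix.
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, 2.0, 3.0, 4.0],     # duplicate of "a"
    "c": [5.0, 5.0, 5.0, 5.0],     # constant
    "d": [2.0, 4.1, 6.0, 8.2],     # almost perfectly correlated with "a"
    "e": [1.0, np.inf, 0.5, 3.0],  # contains an infinite value
})
result = clean_and_select(demo)
print(list(result.columns))
```

Transposing is convenient for small and medium tables, but it can be slow on very wide data frames, so a different duplicate check may be preferable there.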

Useful code for Data Wrangling

`NaN` values

df.isnull().sum() -> shows the number of missing values in each column.
df.isnull().sum().sum() -> shows the total number of missing values in the data frame.
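
As a quick illustration of the two calls above, here is a small sketch on a made-up data frame (the column names and values are invented for the example):

```python
import numpy as np
import pandas as pd

# Toy data frame with three missing values in total.
df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, np.nan, 6.0],
    "z": [7.0, 8.0, 9.0],
})

nan_per_column = df.isnull().sum()   # one count per column
total_nan = df.isnull().sum().sum()  # a single number for the whole frame

print(nan_per_column)
print(total_nan)
```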

Max & Min values in dataframe

df_X.max(axis = 0).sort_values(ascending = False).head(n=25) -> maximum of each column, sorted from largest to smallest, first 25 elements.
df_X.max(axis = 1).sort_values(ascending = False).head(n=25) -> maximum of each instance (row), sorted from largest to smallest, first 25 elements.
Note: the min method can be used in the same way to inspect the smallest values.
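
A small sketch of these calls on invented data (the frame `df_X` below is a stand-in with made-up column names) may help to see what each axis returns:

```python
import pandas as pd

# Toy feature matrix: three instances (rows), three features (columns).
df_X = pd.DataFrame({
    "f1": [1.0, 50.0, 3.0],
    "f2": [200.0, 2.0, 1.5],
    "f3": [0.5, 0.1, 0.2],
})

# Maximum of each column, largest first.
col_max = df_X.max(axis=0).sort_values(ascending=False).head(n=25)

# Maximum of each row (instance), largest first.
row_max = df_X.max(axis=1).sort_values(ascending=False).head(n=25)

# Same idea with min, smallest first.
col_min = df_X.min(axis=0).sort_values(ascending=True).head(n=25)

print(col_max)
print(row_max)
print(col_min)
```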

Extreme values

df_X.max(axis = 0).sort_values(ascending = False).where(lambda x : x > 1.e6).count() -> count the columns whose maximum value exceeds 1e6.
df_X.where(lambda x : x > 1.e6).count().sum() -> count the number of values above 1e6 in the whole data frame.
df_X.mask(df_X > 1.e10, np.nan, inplace = True) -> replace every value above 1e10 with NaN, modifying df_X in place (requires numpy imported as np).
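
The snippets above only look at large positive values and use two different cut-offs. A possible variant, sketched below on invented data, uses one assumed threshold (`threshold = 1.0e6`) and compares absolute values, so that infinities and large negative values are caught as well. It also returns a new frame instead of modifying `df_X` in place.

```python
import numpy as np
import pandas as pd

threshold = 1.0e6  # assumed cut-off, adjust to the data at hand

# Toy feature matrix with three extreme entries.
df_X = pd.DataFrame({
    "f1": [1.0, 2.0e7, 3.0],
    "f2": [np.inf, 5.0, -4.0e8],
    "f3": [0.1, 0.2, 0.3],
})

# Boolean frame marking every entry whose magnitude exceeds the threshold.
is_extreme = df_X.abs() > threshold

# Number of columns containing at least one extreme value.
cols_with_extremes = int(is_extreme.any(axis=0).sum())

# Number of extreme values in the whole data frame.
n_extremes = int(is_extreme.to_numpy().sum())

# Replace the extreme entries with NaN (mask uses NaN by default).
df_clean = df_X.mask(is_extreme)

print(cols_with_extremes, n_extremes)
print(df_clean)
```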
