It's time to integrate the different data cleaning and feature selection strategies into a single Python script. The script will have the following structure:

  1. Load data
  2. Deal with extreme values (for example infinite)
  3. Deal with NaN Values
  4. Remove constant features using VarianceThreshold in Scikit-Learn
  5. Remove duplicate features using transpose with pandas
  6. Drop correlated features
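
The steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the helper name clean_features, the cutoffs (1.0e10 for extreme values, 0.95 for correlation), the use of the absolute value when flagging extremes, and the median fill for NaN are all illustrative choices. Step 1 is replaced by a small in-memory frame so the example is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def clean_features(df_X, extreme=1.0e10, corr_limit=0.95):
    """Hypothetical helper: apply steps 2-6 to a numeric feature frame."""
    # 2. Treat infinities and values beyond the assumed cutoff as missing.
    df = df_X.replace([np.inf, -np.inf], np.nan)
    df = df.mask(df.abs() > extreme)
    # 3. Fill NaN with the column median (one option among several).
    df = df.fillna(df.median())
    # 4. Drop constant features (zero variance).
    selector = VarianceThreshold(threshold=0.0)
    selector.fit(df)
    df = df.loc[:, selector.get_support()]
    # 5. Drop duplicate features: transpose, drop duplicate rows, transpose back.
    df = df.T.drop_duplicates().T
    # 6. Drop one feature from each highly correlated pair.
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_limit).any()]
    return df.drop(columns=to_drop)


# 1. Load data (a toy frame stands in for the real data source here).
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, 2.0, 3.0, 4.0],        # duplicate of "a"
    "c": [5.0, 5.0, 5.0, 5.0],        # constant
    "d": [2.0, 4.1, 6.0, 8.2],        # almost perfectly correlated with "a"
    "e": [1.0, np.inf, 0.5, np.nan],  # contains an infinite and a missing value
})
cleaned = clean_features(demo)
```

On this toy frame only "a" and "e" should survive: "c" is constant, "b" duplicates "a", and "d" is dropped for being too correlated with "a".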

Useful code for Data Wrangling

NaN values

df.isnull().sum() -> shows the number of missing values in each column.
df.isnull().sum().sum() -> shows the total number of missing values in the data frame.
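
As a quick illustration of the two calls above, here is a small sketch on a made-up frame (the column names and values are invented for the example):

```python
import numpy as np
import pandas as pd

# Toy frame: "a" has one NaN, "b" has two, "c" has none.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

nan_per_column = df.isnull().sum()  # Series indexed by column name
total_nan = df.isnull().sum().sum()  # single number for the whole frame
```

Here nan_per_column holds 1, 2 and 0 for "a", "b" and "c", and total_nan is 3.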

Max & Min values in dataframe

df_X.max(axis = 0).sort_values(ascending = False).head(n=25) -> maximum value of each column, sorted from largest to smallest, first 25 elements.
df_X.max(axis = 1).sort_values(ascending = False).head(n=25) -> maximum value of each instance (row), sorted from largest to smallest, first 25 elements.
Note: the min method can be used in the same way to get minimum values.
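
A small sketch of these calls on a made-up frame (the feature names f1, f2, f3 and the values are invented). The min variant sorts in ascending order so the smallest values come first, which is an assumption about how one would usually want to read it:

```python
import pandas as pd

df_X = pd.DataFrame({
    "f1": [1.0, 50.0, 3.0],
    "f2": [200.0, 5.0, 6.0],
    "f3": [7.0, 8.0, 9.0],
})

# Largest value in each column, biggest first.
col_max = df_X.max(axis=0).sort_values(ascending=False).head(n=25)
# Largest value in each row (instance), biggest first.
row_max = df_X.max(axis=1).sort_values(ascending=False).head(n=25)
# Smallest value in each column, smallest first.
col_min = df_X.min(axis=0).sort_values(ascending=True).head(n=25)
```

With this data col_max is ordered f2, f1, f3 and row_max is ordered by row index 0, 1, 2.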

Extreme values

df_X.max(axis = 0).sort_values(ascending = False).where(lambda x : x > 1.e6).count() -> counts the columns whose maximum is an extreme value (here, greater than 1.e6).
df_X.where(lambda x : x > 1.e6).count().sum() -> counts the occurrences of extreme values (greater than 1.e6) in the whole data frame.
df_X.mask(df_X > 1.e10, np.nan, inplace = True) -> replaces all values greater than 1.e10 with NaN. This requires import numpy as np.
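
The three calls above can be tried on a made-up frame as follows. This is only a sketch: the data is invented, the 1.0e6 and 1.0e10 cutoffs are copied from the snippets above, and the replacement is written as an assignment rather than with inplace = True, which is a stylistic choice that gives the same result here.

```python
import numpy as np
import pandas as pd

df_X = pd.DataFrame({
    "f1": [1.0, 2.0e6, 3.0],
    "f2": [5.0e11, 5.0, 7.0e6],
    "f3": [7.0, 8.0, 9.0],
})

# Number of columns whose maximum exceeds 1.0e6.
n_extreme_cols = df_X.max(axis=0).where(lambda x: x > 1.0e6).count()
# Number of individual cells exceeding 1.0e6.
n_extreme_vals = df_X.where(lambda x: x > 1.0e6).count().sum()
# Replace cells above 1.0e10 with NaN (only 5.0e11 qualifies here).
df_X = df_X.mask(df_X > 1.0e10, np.nan)
n_nan_after = df_X.isnull().sum().sum()
```

Two columns (f1 and f2) have an extreme maximum, three cells exceed 1.0e6, and one cell is turned into NaN by the stricter 1.0e10 cutoff.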