June 23 - Exploratory Data Analysis | The DataChemist's Journey

Exploratory Data Analysis (EDA) is the groundwork for any project built on a dataset. Usually, data is neither clean nor neatly formatted. Much of the time is spent dealing with NaN (missing) values, odd data, and packages that do not work as they are supposed to. For this initial EDA I will look at the NaN values present in the chemical descriptor dataset.

Notes for this process:

Count the number of occurrences of a given value in a pandas DataFrame.
If I use B3DB[B3DB.threshold == 'NaN'].count(), every column is listed with its number of non-missing entries among the rows where the threshold column equals 'NaN'. Note that this comparison only matches the literal string 'NaN'; values that pandas treats as truly missing (np.nan) have to be selected with B3DB.threshold.isna() instead.
I notice that I often confuse df.count() and df.sum(). The count method works on numeric or categorical columns and returns the number of non-missing entries. To count how often each value occurs, the method to use is value_counts(): for a column with the elements [man, woman, man, man, woman], .value_counts() yields man = 3, woman = 2, whereas .count() yields 5. The sum method, on the other hand, is meant for numerical data: for a column with the numbers [3, 5, 4, 1], .sum() gives 13. The tricky part comes when these methods are combined. A boolean filter such as B3DB.threshold == 'NaN' produces a column of True/False values, and because True counts as 1 and False as 0, calling .sum() on it has the same effect as counting the rows that match the filter.
The command len(B3DB[B3DB.threshold == 'NaN']) also gives the number of rows that meet the filter condition, in this case 'NaN'.
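To keep these notes straight, here is a minimal sketch on a toy table. The DataFrame `df`, its columns, and its values are made up for illustration and only stand in for B3DB; the missing entries are assumed to be real `np.nan` values rather than the string 'NaN'.

```python
import numpy as np
import pandas as pd

# Toy stand-in for B3DB (hypothetical columns and values).
df = pd.DataFrame({
    "threshold": [0.5, np.nan, 1.2, np.nan, 0.7],
    "group": ["man", "woman", "man", "man", "woman"],
})

# Boolean mask: True where threshold is missing.
mask = df["threshold"].isna()

# Summing a boolean mask counts the True entries (True is treated as 1).
n_missing_sum = int(mask.sum())

# len() of the filtered frame counts the rows that pass the filter.
n_missing_len = len(df[mask])

# count() returns the number of non-missing entries in the column.
n_present = int(df["threshold"].count())

# value_counts() returns how often each distinct value occurs.
per_category = df["group"].value_counts().to_dict()

# Comparing against the string "NaN" does not find real missing values.
string_match = int((df["threshold"] == "NaN").sum())

print(n_missing_sum, n_missing_len, n_present, per_category, string_match)
```

The two ways of counting missing rows agree, while the string comparison finds nothing in a numeric column, which is worth checking before trusting a count of 'NaN' values.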

Cumulative Frequency Plot

These plots are graphical representations of distributions over discrete variables. They are useful for seeing how the accumulated frequency of a variable grows over a range of classes. They allow us to answer questions of the following type:

What percentage of the data is above/below a value N?
Which values in the data lie above/below N% of the dataset?
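Both questions can be answered numerically before drawing anything. The sketch below uses NumPy on a small made-up array; the helper names `percent_at_or_below` and `value_at_percent` are hypothetical and not taken from any library.

```python
import numpy as np

# Made-up data, for illustration only.
values = np.array([0, 0, 1, 2, 2, 3, 5, 8, 40, 90])

def percent_at_or_below(data, n):
    """Percentage of entries that are less than or equal to n."""
    return 100.0 * np.count_nonzero(data <= n) / len(data)

def value_at_percent(data, pct):
    """Value below which roughly pct percent of the data lies."""
    return float(np.percentile(data, pct))

pct_below_3 = percent_at_or_below(values, 3)   # first question
median_value = value_at_percent(values, 50)    # second question

print(pct_below_3, median_value)
```

The first helper reads the cumulative curve from the value axis to the percentage axis, and the second reads it in the opposite direction.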

For my use case (the B3DB dataset), I am interested in the cumulative distribution of NaN values per molecular descriptor. This database is made up of ~7500 molecules, each with ~1700 molecular descriptors calculated with Mordred. As one might expect, this number of features is far too large for an interpretable model. The scatter plot below shows the number of missing data points per descriptor against their frequency.
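The numbers behind such a scatter plot can be obtained in two steps: count the missing values in each descriptor column, then count how many descriptors share each missing-value count. The following is only a sketch on a hypothetical four-molecule table; the name `descriptors` and its columns are assumptions and not the real B3DB layout.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table: rows are molecules, columns are descriptors.
descriptors = pd.DataFrame({
    "desc_a": [1.0, 2.0, 3.0, 4.0],
    "desc_b": [1.0, np.nan, 3.0, 4.0],
    "desc_c": [np.nan, np.nan, np.nan, 4.0],
    "desc_d": [1.0, 2.0, np.nan, 4.0],
})

# Step 1: number of missing values in each descriptor column.
missing_per_descriptor = descriptors.isna().sum()

# Step 2: how many descriptors have 0, 1, 2, ... missing values.
frequency = missing_per_descriptor.value_counts().sort_index()

print(missing_per_descriptor.to_dict())
print(frequency.to_dict())
```

Here `frequency.index` holds the missing-value counts and `frequency.values` holds the number of descriptors with that count, which are the two axes of the scatter plot.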

scatter plot

As the plot suggests, the distribution is concentrated at both ends: there are many molecular descriptors (high frequency) with only a few missing data points, and also a significant number of molecular descriptors with a lot of missing data points. This can be problematic for further analysis. If the scatter plot is transformed into a cumulative frequency plot, the same information appears as follows.

cumulative frequency

Here we can see that about 10% of all descriptors are missing between ~500 and ~7000 values. After that, the curve quickly fades, meaning that about 80% of the descriptors have very few missing values. Nevertheless, that is a lot of low-quality molecular features. Because of this analysis, new descriptors will be calculated using PaDEL.
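One possible way to build a curve of this kind is to sort the per-descriptor missing counts from largest to smallest and pair them with the cumulative share of descriptors. This is a sketch under assumptions: the counts below are invented, the helper `share_with_at_least` is hypothetical, and the original figure may have been constructed differently.

```python
import numpy as np

# Invented missing-value counts, one entry per descriptor.
missing_counts = np.array([0, 0, 0, 1, 2, 2, 5, 500, 3000, 7000])

def share_with_at_least(counts, n_missing):
    """Fraction of descriptors with at least n_missing missing values."""
    return np.count_nonzero(counts >= n_missing) / len(counts)

# Sort descending so the curve starts at the worst descriptors and fades out.
sorted_desc = np.sort(missing_counts)[::-1]

# Cumulative share of descriptors covered at each position of the sorted array.
cumulative_share = np.arange(1, len(sorted_desc) + 1) / len(sorted_desc)

# Share of descriptors that are missing 500 values or more.
share_heavy = share_with_at_least(missing_counts, 500)

print(sorted_desc, cumulative_share, share_heavy)
```

Plotting `cumulative_share` against `sorted_desc` gives the curve, and `share_heavy` is the kind of number read off it when deciding how many descriptors are too incomplete to keep.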