Today I am coming back to my main project 3BDB with the task of scripting several common classification Machine Learning (ML) models for this database. First of all, it is worth going back to basics and remembering what type of problem we are tackling. In this case it is classification, one of the two families of Supervised Learning (the other being regression).

Just as we will be doing here, there are multiple methods to build a classification model. In our case, the task is to predict whether a molecule can permeate the Blood-Brain Barrier: a molecule carries a + tag if it is permeable and a - tag when it lacks this property. Grouping the data into two classes (+/-) makes it a classification problem, and because every molecule is labelled, it is a supervised learning approach.

When applied to classification problems, the main families of techniques or common algorithms can be grouped into Support Vector Machines (SVM), K Nearest Neighbours (KNN), Decision Trees and Ensemble Methods. Today I will be focusing on SVM and its implementation using the scikit-learn package for Python.

Support Vector Machines are a set of supervised learning methods used for classification, regression and outlier detection. For our case, we will only focus on classification tasks. The classification subset of SVM is called Support Vector Classification (SVC). More about it can be read in the scikit-learn documentation.
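As a minimal sketch of what SVC looks like in scikit-learn (the two descriptor values per molecule here are made up just to show the fit/predict cycle):

```python
from sklearn.svm import SVC

# Toy feature matrix (two descriptors per molecule) and binary labels
X = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
y = ["BBB-", "BBB-", "BBB+", "BBB+"]

clf = SVC(kernel="rbf")  # RBF is the default kernel
clf.fit(X, y)

# Predict the class of two new, unlabelled points
print(clf.predict([[0.1, 0.05], [1.0, 1.0]]))
```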

For all models, we will need to look for the best hyperparameters. This process is called hyperparameter tuning, and it can be done with GridSearchCV, which trains the model at every combination of hyperparameter values in a given grid and cross-validates each one. For SVC, the hyperparameters of interest are its kernel, the C constant and gamma.
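A sketch of such a search over the three SVC hyperparameters; the synthetic data and the grid values here are illustrative, not the ones used for the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the real descriptor matrix
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Grid over the three SVC hyperparameters of interest
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01, 0.1],
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)  # fits and cross-validates every combination

print(search.best_params_)
print(search.best_score_)
```

Note that gamma only matters for the RBF kernel; GridSearchCV still iterates over it for the linear kernel, which wastes a few fits but keeps the grid simple.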

Merging two databases

Here is the code I used to merge the chemical feature dataframe (df_descriptors) with the molecules in the original dataset (df_target) and their category data (BBB+/BBB-).

import pandas as pd

df_target = pd.read_csv("b3db_all_molecules.csv", sep=",", usecols=['ID', 'category'])
df_descriptors = pd.read_csv('selected_padel_descriptors.csv')

# Inner merge keeps only molecules present in both files;
# validate='1:1' raises if either key column contains duplicates
df_new = pd.merge(df_target, df_descriptors, left_on='ID', right_on='Name',
                  how='inner', validate='1:1').drop('Name', axis=1)
df_new.to_csv('feature_matrix_y_x.csv', sep=',', index=False)
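To illustrate what the merge does, here is a toy version with invented IDs and a single made-up descriptor (MW); only the column names match the real files:

```python
import pandas as pd

df_target = pd.DataFrame({"ID": ["mol1", "mol2", "mol3"],
                          "category": ["BBB+", "BBB-", "BBB+"]})
df_descriptors = pd.DataFrame({"Name": ["mol1", "mol3"],
                               "MW": [180.2, 250.3]})

# mol2 has no descriptors, so the inner merge drops it;
# the redundant 'Name' key column is removed afterwards
df_new = pd.merge(df_target, df_descriptors,
                  left_on="ID", right_on="Name",
                  how="inner", validate="1:1").drop("Name", axis=1)
print(df_new)
```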

Notes on saving data

Here is a collection of snippets that might be useful (most probably not) in the future: there are examples of saving a classification report to a TXT file, dumping trained models, etc.

# Save the classification report to a text file
from sklearn.metrics import classification_report
import pandas as pd

report = classification_report(y_test, y_pred, output_dict=True)
report = pd.DataFrame(report).transpose()
report.to_csv('{}_report.txt'.format(args.input), sep=" ")  # args.input comes from argparse

Saving (pickling) a trained model from scikit-learn:

from joblib import dump, load

dump(clf, 'filename.joblib')   # save the fitted model
clf = load('filename.joblib')  # load it back later

USEFUL CODE: For multiple classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC

# Define the classifiers
classifiers = [LogisticRegression(), LinearSVC(), SVC(), KNeighborsClassifier()]

# Fit the classifiers
for c in classifiers:
    c.fit(X, y)
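A self-contained version of the same idea, which also scores each classifier on held-out data; the synthetic features stand in for the real descriptor matrix:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in for the descriptor matrix and BBB labels
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifiers = [LogisticRegression(), LinearSVC(), SVC(), KNeighborsClassifier()]

# Fit each classifier and record its test accuracy
scores = {}
for c in classifiers:
    c.fit(X_train, y_train)
    scores[type(c).__name__] = c.score(X_test, y_test)

print(scores)
```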

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Instantiate an RBF SVM
svm = SVC()

# Instantiate the GridSearchCV object and run the search
parameters = {'C': [0.1, 1, 10], 'gamma': [0.00001, 0.0001, 0.001, 0.01, 0.1]}
searcher = GridSearchCV(svm, parameters)
searcher.fit(X_train, y_train)

# Report the best parameters and the corresponding score
print("Best CV params", searcher.best_params_)
print("Best CV accuracy", searcher.best_score_)

# Report the test accuracy using these best parameters
print("Test accuracy of best grid search hypers:", searcher.score(X_test, y_test))