This month will be the most work-intensive of the internship. Most of the issues and possible extensions of the work have already been reviewed; now it is time to invest serious hours in the main projects. Here are the notes of the day ...

Notes on AutoML (auto-sklearn)

Available classifiers are: AdaBoost, Bernoulli naïve Bayes, decision tree, extremely randomized trees, Gaussian naïve Bayes, gradient boosting, kNN, LDA, linear SVM, kernel SVM, multinomial naïve Bayes, passive aggressive, QDA, random forest, and linear classification via stochastic gradient descent. The implementation is very well described in their paper.

auto-sklearn considers more classifiers than we are "manually" evaluating, but on the other hand, it includes neither logistic regression nor XGBoost.

If our "manual" implementation keeps failing because of the size of the search space, we could definitely try this package, as it is really easy to use.
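A minimal sketch of what that could look like (assuming auto-sklearn is installed and that X_train, y_train, X_test, and y_test are already defined; the time budgets are placeholder values, not recommendations):

import autosklearn.classification
from sklearn.metrics import accuracy_score

# Fit an AutoML classifier under a fixed time budget.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,  # total search budget, in seconds (placeholder)
    per_run_time_limit=300,        # cap for each candidate model (placeholder)
    seed=123,
)
automl.fit(X_train, y_train)
y_pred = automl.predict(X_test)
print('Test accuracy:', accuracy_score(y_test, y_pred))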

Notes on Bayesian optimization

This technique seems especially useful when computation time is constrained or when we want finely tuned hyperparameters. It will definitely be worth having in the toolbox.

Further thoughts:

  • Another reason for the runtime problems we have experienced may be the nested CV currently implemented, as it multiplies the number of model fits by a factor of 50 (10 outer CV folds x 5 inner CV folds); see the sketch after this list.
  • Just out of curiosity, is there any difference in computation time or queue waiting between Cedar and Graham?
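
To make the factor of 50 concrete, here is a minimal nested-CV sketch (hypothetical classifier and parameter grid, just to illustrate the cost; X and y are assumed to be defined):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Hypothetical grid with 4 candidate settings.
param_grid = {'n_estimators': [100, 200], 'max_depth': [5, 10]}

# Inner loop: 5-fold CV grid search; outer loop: 10-fold CV evaluation.
inner_search = GridSearchCV(RandomForestClassifier(random_state=123),
                            param_grid, cv=5)
outer_scores = cross_val_score(inner_search, X, y, cv=10)
# Every candidate setting is fitted 10 x 5 = 50 times across the two loops,
# which is where the factor of 50 in model generation steps comes from.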

ML hyperparameter tuning workflow with BayesianOptimization:

import time

from bayes_opt import BayesianOptimization
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Objective function: mean 5-fold CV accuracy of a Gradient Boosting Machine
# (assumes X_train and y_train are already defined).
def gbm_cl_bo(max_depth, max_features, learning_rate, n_estimators, subsample):
    params_gbm = {}
    params_gbm['max_depth'] = round(max_depth)        # integer-valued, so round
    params_gbm['max_features'] = max_features
    params_gbm['learning_rate'] = learning_rate
    params_gbm['n_estimators'] = round(n_estimators)  # integer-valued, so round
    params_gbm['subsample'] = subsample
    scores = cross_val_score(GradientBoostingClassifier(random_state=123, **params_gbm),
                             X_train, y_train, scoring='accuracy', cv=5)
    return scores.mean()

# Run Bayesian optimization over the hyperparameter bounds
start = time.time()
params_gbm = {
    'max_depth': (3, 10),
    'max_features': (0.8, 1),
    'learning_rate': (0.01, 1),
    'n_estimators': (80, 150),
    'subsample': (0.8, 1),
}
gbm_bo = BayesianOptimization(gbm_cl_bo, params_gbm, random_state=111)
gbm_bo.maximize(init_points=20, n_iter=4)
print('Best result:', gbm_bo.max)
print('It takes %s minutes' % ((time.time() - start) / 60))

Data Balancing Methods

There are two general strategies to balance data: oversampling and undersampling. Undersampling methods reduce the number of samples in the majority class to find a balance between the minority and majority classes. For our specific use case (Drug Discovery), undersampling techniques are not preferred, as valuable (and expensive) information is lost when dropping instances from the dataset. The main methods are listed below, with a usage sketch after the list.

  • RUS (Random under-sampling): randomly removes majority class samples until they match the number of minority class samples, creating a balanced dataset.
  • ROS (Random over-sampling): randomly duplicates minority class samples until they match the number of majority class samples, creating a balanced dataset.
  • SMOTE (Synthetic minority over-sampling technique): creates synthetic samples for the minority class by interpolating between existing minority samples, rather than oversampling with replacement.
  • ADASYN (Adaptive synthetic): generates more synthetic samples for minority class instances that are harder to learn than for those that are easier to learn. This is how the method reduces learning bias and adaptively shifts the decision boundary to focus on the hard-to-learn samples.
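
A minimal sketch of how these four methods could be applied with the imbalanced-learn package (assuming it is installed and that X and y hold the feature matrix and labels):

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN

# Each sampler returns a rebalanced copy of the dataset.
X_rus, y_rus = RandomUnderSampler(random_state=123).fit_resample(X, y)  # RUS
X_ros, y_ros = RandomOverSampler(random_state=123).fit_resample(X, y)   # ROS
X_sm, y_sm = SMOTE(random_state=123).fit_resample(X, y)                 # SMOTE
X_ada, y_ada = ADASYN(random_state=123).fit_resample(X, y)              # ADASYN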