Today @fwmeng88 and I had a very fruitful discussion about some modifications that could be made to the model-training scripts, which have been running into problems, especially memory-related ones. These issues led us to reevaluate the way we perform the nested cross-validation (CV) in the script, since it is tied to the way the hyperparameters are optimized.

The current situation is the following: we have implemented an outer fold of k=10, which means that in a 10-iteration loop we take 90% of the dataset ((k-1)/k) as training data and use the remaining 10% as the test dataset. The first problem is that once we apply an inner fold of, say, k=5, only 80% of that training data is used to fit the desired ML model. In this outer/inner-CV configuration, only 72% of the total dataset (0.9 * 0.8) goes into the "real" model training. We think that leaving out this much data is detrimental to the model we would like to construct, and we both agree that we should maximize the data that goes into training the model. Having 90% of the dataset actually go into training may be ideal, but 80% would be an acceptable minimum. This is where we started weighing the trade-offs of increasing/decreasing the number of outer and inner folds and how this affects the hyperparameter optimization. Here I briefly show the structure of our scripts in pseudocode (if the pseudocode is confusing, you can check the actual script in this link):

outer_cv = 10-k-fold

for fold in outer_cv:
	training_data, test_data = [k-1, 1] folds in outer_cv
	search_space = hyperparameter_ranges
	optimized_hyperparams = optimize(objective_function(training_data), search_space, iterations=n)
	test_score = best_classifier(optimized_hyperparams).score(test_data)

As we can see, in each of the k loops a subset of k-1 folds is used to train the classifier, and this feeds into an optimization of the hyperparameters based on an objective function. Then the test_score is calculated on unseen data using a classifier with the optimized hyperparameters.
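To make this structure more concrete, here is a minimal runnable sketch of the same outer/inner scheme using scikit-learn. The classifier (RandomForestClassifier), the toy dataset, and the small search grid are placeholders I picked for illustration, not necessarily what our actual scripts use, and GridSearchCV stands in for whatever optimizer runs over the search space.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Toy data, just so the sketch runs end to end.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)
search_space = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}

test_scores = []
for train_idx, test_idx in outer_cv.split(X):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # Inner 5-fold CV used only to optimize hyperparameters on the training folds.
    search = GridSearchCV(RandomForestClassifier(random_state=0), search_space, cv=5)
    search.fit(X_train, y_train)

    # Score the best classifier on the held-out (unseen) test fold.
    test_scores.append(search.best_estimator_.score(X_test, y_test))

print(sum(test_scores) / len(test_scores))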

The pseudocode of the objective function to be optimized (the most relevant piece in this discussion) is the following:

def objective_function(training_data):
	my_classifier = classifier(*hyperparams)
	error = inner_cv_score(my_classifier, training_data, cv=5-k-fold)  # inner cross-validation
	return error  # the optimizer minimizes this value

Here is where the inner cross-validation occurs: the objective function seeks to minimize the error resulting from training the classifier with different hyperparameters using a k-fold cross-validation strategy. Put generally, we would like to calculate this error score using as much of the available data as possible, since this is what guides the optimization towards better hyperparameters.
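For reference, here is a hedged sketch of what such an objective function can look like, with scikit-learn's cross_val_score doing the inner 5-fold CV. The classifier and the closure pattern are illustrative assumptions on my part; any optimizer that minimizes a scalar (hyperopt, scikit-optimize, etc.) could call the returned function.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def make_objective(X_train, y_train):
    def objective(hyperparams):
        # Placeholder classifier; hyperparams is whatever the optimizer proposes.
        clf = RandomForestClassifier(random_state=0, **hyperparams)
        # Mean accuracy over the 5 inner folds of the (outer) training data.
        cv_accuracy = cross_val_score(clf, X_train, y_train, cv=5).mean()
        # Return an error so the optimizer has something to minimize.
        return 1.0 - cv_accuracy
    return objective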

We discussed several alternatives to the problem of increasing the portion of the dataset that goes into training. One of those was simply increasing both the inner and outer CV folds to k=10, which would put 81% of the data (0.9 * 0.9) into the actual fit. We also discussed another take, in which we would remove the inner cross-validation and adapt the objective_function to do the following:

def objective_function(training_data):
	training, validation = split(training_data)
	my_classifier = classifier(*hyperparams)
	my_classifier.fit(training)
	error = my_classifier.score(validation)
	return error  # the optimizer minimizes this value

With this strategy we would "manually" further split the training_data, giving us specific control over the portion of the dataset that goes into training, and then calculate the error on the validation data. Since we are constantly iterating over the objective function to choose the best hyperparameters, it is the validation part that guides the optimization and avoids overfitting to the training set. Still, after the optimization finishes its iterations, the "real" error would be calculated using the test dataset, which is genuinely unseen data, both by the fitting method and by the validation error.
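Again a hedged sketch, this time of the single train/validation split variant described above. The train_size value of 0.9 and the classifier are illustrative assumptions; the point is that train_size gives us direct control over how much of the outer training fold goes into the fit.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def make_objective(X_train, y_train, train_size=0.9):
    def objective(hyperparams):
        # One explicit split instead of an inner k-fold CV.
        X_fit, X_val, y_fit, y_val = train_test_split(
            X_train, y_train, train_size=train_size, random_state=0)
        clf = RandomForestClassifier(random_state=0, **hyperparams)
        clf.fit(X_fit, y_fit)
        # The validation error guides the optimizer; the outer test fold stays unseen.
        return 1.0 - clf.score(X_val, y_val)
    return objective

With train_size=0.9 and a 10-fold outer CV, 81% of the full dataset (0.9 * 0.9) would go into each actual fit, which is within the range we are aiming for.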