Evaluation, Cross-Validation, and Model Selection

By Heiko Strathmann - heiko.strathmann@gmail.com - github.com/karlnapf - herrstrathmann.de. Based on the model selection framework of his Google summer of code 2011 project | Saurabh Mahindre - github.com/Saurabh7 as a part of Google Summer of Code 2014 project mentored by - Heiko Strathmann

This notebook illustrates the evaluation of prediction algorithms in Shogun using cross-validation, and selecting their parameters using grid-search. We demonstrate this for a toy example on Binary Classification using Support Vector Machines and also a regression problem on a real world dataset.

General Idea

Cross validation aims to estimate an algorithm's performance on unseen data. For example, one might be interested in the average classification accuracy of a Support Vector Machine when being applied to new data, that it was not trained on. This is important in order to compare the performance different algorithms on the same target. Most crucial is the point that the data that was used for running/training the algorithm is not used for testing. Different algorithms here also can mean different parameters of the same algorithm. Thus, cross-validation can be used to tune parameters of learning algorithms, as well as comparing different families of algorithms against each other. Cross-validation estimates are related to the marginal likelihood in Bayesian statistics in the sense that using them for selecting models avoids overfitting.

Evaluating an algorithm's performance on training data should be avoided since the learner may adjust to very specific random features of the training data which are not very important to the general relation. This is called overfitting. Maximising performance on the training examples usually results in algorithms explaining the noise in data (rather than actual patterns), which leads to bad performance on unseen data. This is one of the reasons behind splitting the data and using different splits for training and testing, which can be done using cross-validation. Let us generate some toy data for binary classification to try cross validation on.

In []:
%pylab inline
%matplotlib inline
# include all Shogun classes
from modshogun import *
# generate some ultra easy training data
gray()
n=20
title('Toy data for binary classification')
X=hstack((randn(2,n), randn(2,n)+1))
Y=hstack((-ones(n), ones(n)))
_=scatter(X[0], X[1], c=Y , s=100)
p1 = Rectangle((0, 0), 1, 1, fc="w")
p2 = Rectangle((0, 0), 1, 1, fc="k")
legend((p1, p2), ["Class 1", "Class 2"], loc=2)

# training data in Shogun representation
features=RealFeatures(X)
labels=BinaryLabels(Y)
Populating the interactive namespace from numpy and matplotlib

Types of splitting strategies

As said earlier Cross-validation is based upon splitting the data into multiple partitions. Shogun has various strategies for this. The base class for them is CSplittingStrategy.

K-fold cross-validation

Formally, this is achieved via partitioning a dataset $X$ of size $|X|=n$ into $k \leq n$ disjoint partitions $X_i\subseteq X$ such that $X_1 \cup X_2 \cup \dots \cup X_n = X$ and $X_i\cap X_j=\emptyset$ for all $i\neq j$. Then, the algorithm is executed on all $k$ possibilities of merging $k-1$ partitions and subsequently tested on the remaining partition. This results in $k$ performances which are evaluated in some metric of choice (Shogun support multiple ones). The procedure can be repeated (on different splits) in order to obtain less variance in the estimate. See [1] for a nice review on cross-validation using different performance measures.

In []:
k=5
normal_split=CrossValidationSplitting(labels, k)

Stratified cross-validation

On classificaiton data, the best choice is stratified cross-validation. This divides the data in such way that the fraction of labels in each partition is roughly the same, which reduces the variance of the performance estimate quite a bit, in particular for data with more than two classes. In Shogun this is implemented by CStratifiedCrossValidationSplitting class.

In []:
stratified_split=StratifiedCrossValidationSplitting(labels, k)

Leave One Out cross-validation

Leave One Out Cross-validation holds out one sample as the validation set. It is thus a special case of K-fold cross-validation with $k=n$ where $n$ is number of samples. It is implemented in LOOCrossValidationSplitting class.

Let us visualize the generated folds on the toy data.

In []:
split_strategies=[stratified_split, normal_split]
In []:
#code to visualize splitting
def get_folds(split, num):
    split.build_subsets()
    x=[]
    y=[]
    lab=[]
    for j in range(num):
        indices=split.generate_subset_indices(j)
        x_=[]
        y_=[]
        lab_=[]
        for i in range(len(indices)):
            x_.append(X[0][indices[i]])
            y_.append(X[1][indices[i]])
            lab_.append(Y[indices[i]])
        x.append(x_)
        y.append(y_)
        lab.append(lab_)
    return x, y, lab
    
def plot_folds(split_strategies, num):
    for i in range(len(split_strategies)):
        x, y, lab=get_folds(split_strategies[i], num)
        figure(figsize=(18,4))
        gray()
        suptitle(split_strategies[i].get_name(), fontsize=12)
        for j in range(0, num):
            subplot(1, num, (j+1), title='Fold %s' %(j+1))
            scatter(x[j], y[j], c=lab[j], s=100)
        
_=plot_folds(split_strategies, 4)

Stratified splitting takes care that each fold has almost the same number of samples from each class. This is not the case with normal splitting which usually leads to imbalanced folds.

Toy example: Binary Support Vector Classification

Following the example from above, we will tune the performance of a SVM on the binary classification problem. We will

  • demonstrate how to evaluate a loss function or metric on a given algorithm
  • then learn how to estimate this metric for the algorithm performing on unseen data
  • and finally use those techniques to tune the parameters to obtain the best possible results.

The involved methods are

  • LibSVM as the binary classification algorithms
  • the area under the ROC curve (AUC) as performance metric
  • three different kernels to compare
In []:
# define SVM with a small rbf kernel (always normalise the kernel!)
C=1
kernel=GaussianKernel(2, 0.001)
kernel.init(features, features)
kernel.set_normalizer(SqrtDiagKernelNormalizer())
classifier=LibSVM(C, kernel, labels)

# train
_=classifier.train()

Ok, we now have performed classification on the training data. How good did this work? We can easily do this for many different performance measures.

In []:
# instanciate a number of Shogun performance measures
metrics=[ROCEvaluation(), AccuracyMeasure(), ErrorRateMeasure(), F1Measure(), PrecisionMeasure(), RecallMeasure(), SpecificityMeasure()]

for metric in metrics:
    print metric.get_name(), metric.evaluate(classifier.apply(features), labels)
ROCEvaluation 1.0
AccuracyMeasure 1.0
ErrorRateMeasure 0.0
F1Measure 1.0
PrecisionMeasure 1.0
RecallMeasure 1.0
SpecificityMeasure 1.0

Note how for example error rate is 1-accuracy. All of those numbers represent the training error, i.e. the ability of the classifier to explain the given data.

Now, the training error is zero. This seems good at first. But is this setting of the parameters a good idea? No! A good performance on the training data alone does not mean anything. A simple look up table is able to produce zero error on training data. What we want is that our methods generalises the input data somehow to perform well on unseen data. We will now use cross-validation to estimate the performance on such.

We will use CStratifiedCrossValidationSplitting, which accepts a reference to the labels and the number of partitions as parameters. This instance is then passed to the class CCrossValidation, which does the estimation using the desired splitting strategy. The latter class can take all algorithms that are implemented against the CMachine interface.

In []:
metric=AccuracyMeasure()
cross=CrossValidation(classifier, features, labels, stratified_split, metric)

# perform the cross-validation, note that this call involved a lot of computation
result=cross.evaluate()

# the result needs to be casted to CrossValidationResult
result=CrossValidationResult.obtain_from_generic(result)

# this class contains a field "mean" which contain the mean performance metric
print "Testing", metric.get_name(), result.mean
Testing AccuracyMeasure 0.475

Now this is incredibly bad compared to the training error. In fact, it is very close to random performance (0.5). The lesson: Never judge your algorithms based on the performance on training data!

Note that for small data sizes, the cross-validation estimates are quite noisy. If we run it multiple times, we get different results.

In []:
print "Testing", metric.get_name(), [CrossValidationResult.obtain_from_generic(cross.evaluate()).mean for _ in range(10)]
Testing AccuracyMeasure [0.5, 0.55, 0.45, 0.45, 0.5, 0.575, 0.5, 0.5, 0.5, 0.5]

It is better to average a number of different runs of cross-validation in this case. A nice side effect of this is that the results can be used to estimate error intervals for a given confidence rate.

In []:
# 25 runs and 95% confidence intervals
cross.set_num_runs(25)
cross.set_conf_int_alpha(0.05)

# perform x-validation (now even more expensive)
cross.evaluate()
result=cross.evaluate()
result=CrossValidationResult.obtain_from_generic(result)

print "Testing is in [%.2f, %.2f] with mean %.2f with %.0f%% confidence" \
%(result.conf_int_low, result.conf_int_up, result.mean, (1-result.conf_int_alpha)*100)
Testing is in [0.50, 0.52] with mean 0.51 with 95% confidence

Using this machinery, it is very easy to compare multiple kernel parameters against each other to find the best one. It is even possible to compare a different kernel.

In []:
widths=2**linspace(-5,25,10)
results=zeros(len(widths))
upper=zeros(len(widths))
lower=zeros(len(widths))

for i in range(len(results)):
    kernel.set_width(widths[i])
    result=CrossValidationResult.obtain_from_generic(cross.evaluate())
    results[i]=result.mean
    upper[i]=result.conf_int_up
    lower[i]=result.conf_int_low
    
plot(log2(widths), results, 'blue')
fill_between(log2(widths), lower, upper, color='blue', alpha=0.3)
xlabel("log2 Kernel width")
ylabel(metric.get_name())
_=title("Accuracy for different kernel widths")

print "Best Gaussian kernel width %.2f" % widths[results.argmax()], "gives", results.max()

# compare this with a linear kernel
classifier.set_kernel(LinearKernel())
lin_k=CrossValidationResult.obtain_from_generic(cross.evaluate())
plot([log2(widths[0]), log2(widths[len(widths)-1])], [lin_k.mean,lin_k.mean], 'r')

# please excuse this horrible code :)
fill_between([log2(widths[0]), log2(widths[len(widths)-1])], [lin_k.conf_int_low,lin_k.conf_int_low],\
             [lin_k.conf_int_up, lin_k.conf_int_up], color="red", alpha=0.3)
print "Linear kernel gives", lin_k.mean

_=legend(["Gaussian", "Linear"], loc="lower center")
Best Gaussian kernel width 33554432.00 gives 0.771
Linear kernel gives 0.772

This gives a brute-force way to select paramters of any algorithm implemented under the CMachine interface. The cool thing about this is, that it is also possible to compare different model families against each other. Below, we compare a a number of regression models in Shogun on the Boston Housing dataset.

Regression problem and cross-validation

Various regression models in Shogun are now used to predict house prices using the boston housing dataset. Cross-validation is used to find best parameters and also test the performance of the models.

In []:
feats=RealFeatures(CSVFile('../../../data/multiclass/categorical_dataset/fm_housing.dat'))
labels=RegressionLabels(CSVFile('../../../data/multiclass/categorical_dataset/housing_label.dat'))
preproc=RescaleFeatures()
preproc.init(feats)
feats.add_preprocessor(preproc)
feats.apply_preprocessor(True)

#Regression models
ls=LeastSquaresRegression(feats, labels)

tau=1
rr=LinearRidgeRegression(tau, feats, labels)

width=1
tau=1
kernel=GaussianKernel(feats, feats, width)
kernel.set_normalizer(SqrtDiagKernelNormalizer())
krr=KernelRidgeRegression(tau, kernel, labels)


regression_models=[ls, rr, krr]

Let us use cross-validation to compare various values of tau paramter for ridge regression (Regression notebook). We will use MeanSquaredError as the performance metric. Note that normal splitting is used since it might be impossible to generate "good" splits using Stratified splitting in case of regression since we have continous values for labels.

In []:
n=30
taus = logspace(-4, 1, n)

#5-fold cross-validation
k=5
split=CrossValidationSplitting(labels, k)
metric=MeanSquaredError()
cross=CrossValidation(rr, feats, labels, split, metric)
cross.set_num_runs(50)

errors=[]
for tau in taus:
    #set necessary parameter
    rr.set_tau(tau)
    result=cross.evaluate()
    result=CrossValidationResult.obtain_from_generic(result)
    #Enlist mean error for all runs
    errors.append(result.mean)

figure(figsize=(20,6))
suptitle("Finding best (tau) parameter using cross-validation", fontsize=12)
p=subplot(121)
title("Ridge Regression")
plot(taus, errors, linewidth=3)
p.set_xscale('log')
p.set_ylim([0, 80])
xlabel("Taus")
ylabel("Mean Squared Error")

cross=CrossValidation(krr, feats, labels, split, metric)
cross.set_num_runs(50)

errors=[]
for tau in taus:
    krr.set_tau(tau)
    result=cross.evaluate()
    result=CrossValidationResult.obtain_from_generic(result)
    #print tau, "error", result.mean
    errors.append(result.mean)

p2=subplot(122)
title("Kernel Ridge regression")
plot(taus, errors, linewidth=3)
p2.set_xscale('log')
xlabel("Taus")
_=ylabel("Mean Squared Error")
-c:15: RuntimeWarning: [WARN] In file /home/buildslave/nightly_default/build/src/shogun/evaluation/CrossValidation.cpp line 107: LinearRidgeRegression does not support locking. Autolocking is skipped. Set autolock flag to false to get rid of warning.


A low value of error certifies a good pick for the tau paramter which should be easy to conclude from the plots. In case of Ridge Regression the value of tau i.e. the amount of regularization doesn't seem to matter but does seem to in case of Kernel Ridge Regression. One interpretation of this could be the lack of over fitting in the feature space for ridge regression and the occurence of over fitting in the new kernel space in which Kernel Ridge Regression operates.
Next we will compare a range of values for the width of Gaussian Kernel used in Kernel Ridge Regression

In []:
n=50
widths=logspace(-2, 3, n)
krr.set_tau(0.1)
metric=MeanSquaredError()
k=5
split=CrossValidationSplitting(labels, k)
cross=CrossValidation(krr, feats, labels, split, metric)
cross.set_num_runs(10)
errors=[]
for width in widths:
    kernel.set_width(width)
    result=cross.evaluate()
    result=CrossValidationResult.obtain_from_generic(result)
    #print width, "error", result.mean
    errors.append(result.mean)
    
figure(figsize=(15,5))
p=subplot(121)
title("Finding best width using cross-validation")
plot(widths, errors, linewidth=3)
p.set_xscale('log')
xlabel("Widths")
_=ylabel("Mean Squared Error")

The values for the kernel parameter and tau may not be independent of each other, so the values we have may not be optimal. A brute force way to do this would be to try all the pairs of these values but it is only feasible for a low number of parameters.

In []:
n=40
taus = logspace(-3, 0, n)
widths=logspace(-1, 4, n)

cross=CrossValidation(krr, feats, labels, split, metric)
cross.set_num_runs(1)

x, y=meshgrid(taus, widths)
grid=array((ravel(x), ravel(y)))
print grid.shape

errors=[]
for i in range(0, n*n):
    krr.set_tau(grid[:,i][0])
    kernel.set_width(grid[:,i][1])
    result=cross.evaluate()
    result=CrossValidationResult.obtain_from_generic(result)
    errors.append(result.mean)
errors=array(errors).reshape((n, n))
(2, 1600)

In []:
from mpl_toolkits.mplot3d import Axes3D
#taus = logspace(0.5, 1, n)
jet()
fig=figure(figsize(15,7))
ax=subplot(121)
c=pcolor(x, y, errors)
_=contour(x, y, errors, linewidths=1, colors='black')
_=colorbar(c)
xlabel('Taus')
ylabel('Widths')
ax.set_xscale('log')
ax.set_yscale('log')

ax1=fig.add_subplot(122, projection='3d')
ax1.plot_wireframe(log10(y),log10(x), errors, linewidths=2, alpha=0.6)
ax1.view_init(30,-40)
xlabel('Taus')
ylabel('Widths')
_=ax1.set_zlabel('Error')
<matplotlib.figure.Figure at 0x4598950>

Let us approximately pick the good parameters using the plots. Now that we have the best parameters, let us compare the various regression models on the data set.

In []:
#use the best parameters
rr.set_tau(1)
krr.set_tau(0.05)
kernel.set_width(2)

title_='Performance on Boston Housing dataset'
print "%50s" %title_
for machine in regression_models:
    metric=MeanSquaredError()
    cross=CrossValidation(machine, feats, labels, split, metric)
    cross.set_num_runs(25)
    result=cross.evaluate()
    result=CrossValidationResult.obtain_from_generic(result)
    print "-"*80
    print "|", "%30s" % machine.get_name(),"|", "%20s" %metric.get_name(),"|","%20s" %result.mean ,"|"  
print "-"*80
             Performance on Boston Housing dataset
--------------------------------------------------------------------------------
-c:12: RuntimeWarning: [WARN] In file /home/buildslave/nightly_default/build/src/shogun/evaluation/CrossValidation.cpp line 107: LeastSquaresRegression does not support locking. Autolocking is skipped. Set autolock flag to false to get rid of warning.

-c:12: RuntimeWarning: [WARN] In file /home/buildslave/nightly_default/build/src/shogun/evaluation/CrossValidation.cpp line 107: LinearRidgeRegression does not support locking. Autolocking is skipped. Set autolock flag to false to get rid of warning.



|         LeastSquaresRegression |     MeanSquaredError |        29.8998485354 |
--------------------------------------------------------------------------------
|          LinearRidgeRegression |     MeanSquaredError |        30.0281735774 |
--------------------------------------------------------------------------------
|          KernelRidgeRegression |     MeanSquaredError |        10.9085462437 |
--------------------------------------------------------------------------------

A standard way of selecting the best parameters of a learning algorithm is by Grid Search. This is done by an exhaustive search of a specified parameter space. CModelSelectionParameters is used to select various parameters and their ranges to be used for model selection. A tree like structure is used where the nodes can be CSGObject or the parameters to the object. The range of values to be searched for the parameters is set using build_values() method.

In []:
#Root
param_tree_root=ModelSelectionParameters()

#Parameter tau
tau=ModelSelectionParameters("tau")
param_tree_root.append_child(tau)

# also R_LINEAR/R_LOG is available as type
min_value=0.01
max_value=1
type_=R_LINEAR
step=0.05
base=2
tau.build_values(min_value, max_value, type_, step, base)

Next we will create CModelSelectionParameters instance with a kernel object which has to be appended the root node. The kernel object itself will be append with a kernel width parameter which is the parameter we wish to search.

In []:
#kernel object
param_gaussian_kernel=ModelSelectionParameters("kernel", kernel)
gaussian_kernel_width=ModelSelectionParameters("width")
gaussian_kernel_width.build_values(0.1, 6.0, R_LINEAR, 0.5, 2.0)

#kernel parameter 
param_gaussian_kernel.append_child(gaussian_kernel_width)
param_tree_root.append_child(param_gaussian_kernel)
In []:
# cross validation instance used
cross_validation=CrossValidation(krr, feats, labels, split, metric)
cross_validation.set_num_runs(1)
In []:
# model selection instance
model_selection=GridSearchModelSelection(cross_validation, param_tree_root)
In []:
print_state=False
best_parameters=model_selection.select_model(print_state)

best_parameters.apply_to_machine(krr)
best_parameters.print_tree()
result=cross_validation.evaluate()
result=CrossValidationResult.obtain_from_generic(result)

print 'Error with Best parameters:', result.mean
Error with Best parameters: 9.45366891915

The error value using the parameters obtained from Grid Search is pretty close (and should be better) to the one we had seen in the last section. Grid search suffers from the curse of dimensionality though, which can lead to huge computation costs in higher dimensions.

References

[1] Forman, G. and Scholz, M. (2009). Apples-to-apples in cross-validation studies: Pitfalls in classifier performance measurement. Technical report, HP Laboratories.