Evaluation, Cross-Validation, and Model Selection

By Heiko Strathmann - heiko.strathmann@gmail.com - github.com/karlnapf - herrstrathmann.de. Based on the model selection framework of his Google summer of code 2011 project

This notebook illustrates the evaluation of prediction algorithms in Shogun using cross-validation, and their parameters selection using grid-search. We demonstrate this for a toy example on Binary Classification using Support Vector Machines.

General Idea

Cross validation is aims to estimate an algorithms performance on unseen data. For example, one might be interested in the average classification accuracy of a Support Vector Machine when being applied to new data, that it was not trained on. This is important in order to compare the performance different algorithms on the same target. Most crucial is the point that the data that was used for running/training the algorithm is not used for testing. Different algorithms here also can mean different parameters of the same algorithm. Thus, cross-validation can be used to tune parameters of learning algorithms, as well as comparing different families of algorithms against each other. Cross-validation estimates are related to the marginal likelihood in Bayesian statistics in the sense that using them for selecting models avoids overfitting.

Formally, this is achieved via partitioning a dataset \(X\) of size \(|X|=n\) into \(k<n\) disjoint partitions \(X_i\subseteq X\) such that \(X_1 \cup X_2 \cup \dots \cup X_n\) such that \(X_i\cap X_j=\emptyset\) for all \(i\neq j\). Then, the algorithm is executed on all \(k\) possibilities of merging \(k-1\) partitions and subsequently tested on the remaining partition. This results in \(k\) performances which are evaluated in some metric of choice (Shogun support multiple ones). The procedure can be repeated (on different splits) in order to obtain less variance in the estimate. See TODO for more information, and [1] for a nice review on cross-validation using different performance measures.

There are various strategies to generate splits for cross-validation. We start by a very simply illustrational case.

Toy example: Binary Support Vector Classification

Following the example from above, we will tune the performance of a SVM on a binary classification problem. We will

The involved methods are

In []:
# include all Shogun classes
from modshogun import *
In []:
# generate some ultra easy training data
X=hstack((randn(2,n), randn(2,n)+1))
y=hstack((-ones(n), ones(n)))
#plot(X[0,y==-1],X[1,y==-1], 'bo')
#plot(X[0,y==1],X[1,y==1], 'ro')
#_=legend(["Class 1", "Class 2"])
In []:
# training data in Shogun representation

# define SVM with a small rbf kernel (always normalise the kernel!)
kernel=GaussianKernel(2, 0.001)
kernel.init(features, features)
classifier=LibSVM(C, kernel, labels)

# train

Ok, we now have performed classification on the training data. How good did this work? We can easily do this for many different performance measures.

In []:
# instanciate a number of Shogun performance measures
metrics=[ROCEvaluation(), AccuracyMeasure(), ErrorRateMeasure(), F1Measure(), PrecisionMeasure(), RecallMeasure(), SpecificityMeasure()]

for metric in metrics:
    print metric.get_name(), metric.evaluate(classifier.apply(features), labels)
ROCEvaluation 1.0
AccuracyMeasure 1.0
ErrorRateMeasure 0.0
F1Measure 1.0
PrecisionMeasure 1.0
RecallMeasure 1.0
SpecificityMeasure 1.0

Note how for example error rate is 1-accuracy. All of those numbers represent the training error, i.e. the ability of the classifier to explain the given data.

Now, the training error is zero. This seems good at first. But is this setting of the parameters a good idea? No! A good performance on the training data alone does not mean anything. A simple look up table is able to produce zero error on training data. What we want is that our methods generalises the input data somehow to perform well on unseen data. We will now use cross-validation to estimate the performance on such.

Cross-validation is based upon splitting the data into multiple partitions. Shogun has various strategies for this. On classificaiton data, the best choice is stratified cross-validation (TODO link). This divided the data in such way that the fraction of labels in each partition is roughly the same, which reduces the variance of the performance estimate quite a bit, in particular for data with more than two classes.

In Shogun this is realised via instance of the abstract base class CSplittingStrategy. We will use CStratifiedCrossValidationSplitting, which accepts a reference to the labels and the number of partitions as parameters. This instance is then passed to the class CCrossValidation, which does the estimation using the desired splitting strategy. The latter class can take all algorithms that are implemented against the CMachine interface.

In []:
split=StratifiedCrossValidationSplitting(labels, k)
cross=CrossValidation(classifier, features, labels, split, metric)

# perform the cross-validation, note that this call involved a lot of computation

# the result needs to be casted to CrossValidationResult

# this class contains a field "mean" which contain the mean performance metric
print "Testing", metric.get_name(), result.mean
Testing AccuracyMeasure 0.357142857143

Now this is incredibly bad compared to the training error. In fact, it is very close to random performance (0.5). The lesson: Never judge your algorithms based on the performance on training data!

Note that for small data sizes, the cross-validation estimates are quite noisy. If we run it multiple times, we get different results.

In []:
print "Testing", metric.get_name(), [CrossValidationResult.obtain_from_generic(cross.evaluate()).mean for _ in range(10)]
Testing AccuracyMeasure [0.35714285714285715, 0.35714285714285715, 0.35714285714285715, 0.35714285714285715, 0.35714285714285715, 0.35714285714285715, 0.35714285714285715, 0.35714285714285715, 0.35714285714285715, 0.35714285714285715]

It is better to average a number of different runs of cross-validation in this case. A nice side effect of this is that the results can be used to estimate error intervals for a given confidence rate.

In []:
# 20 runs and 95% confidence intervals

# perform x-validation (now even more expensive)

print "Testing is in [%.2f, %.2f] with mean %.2f with %.0f%% confidence" \
%(result.conf_int_low, result.conf_int_up, result.mean, (1-result.conf_int_alpha)*100)
Testing is in [0.34, 0.36] with mean 0.35 with 95% confidence

Using this machinery, it is very easy to compare multiple kernel parameters against each other to find the best one. It is even possible to compare a different kernel.

In []:

for i in range(len(results)):
plot(log2(widths), results, 'blue')
fill_between(log2(widths), lower, upper, color='blue', alpha=0.3)
xlabel("log2 Kernel width")
_=title("Accuracy for different kernel widths")

print "Best Gaussian kernel width %.2f" % widths[results.argmax()], "gives", results.max()

# compare this with a linear kernel
plot([log2(widths[0]), log2(widths[len(widths)-1])], [lin_k.mean,lin_k.mean], 'r')

# please excuse this horrible code :)
fill_between([log2(widths[0]), log2(widths[len(widths)-1])], [lin_k.conf_int_low,lin_k.conf_int_low],\
             [lin_k.conf_int_up, lin_k.conf_int_up], color="red", alpha=0.3)
print "Linear kernel gives", lin_k

_=legend(["Gaussian", "Linear"], loc="lower center")
Best Gaussian kernel width 32.00 gives 0.650476190476
Linear kernel gives CrossValidationResult

This gives a brute-force way to select paramters of any algorithm implemented under the CMachine interface. The cool thing about this is, that it is also possible to compare different model families against each other. Below, we compare a a number of binary classifiers in Shogun.


[1] Forman, G. and Scholz, M. (2009). Apples-to-apples in cross-validation studies: Pitfalls in classifier performance measurement. Technical report, HP Laboratories.

Coming Soon

In []: