In this notebook we will see how machine learning problems are generally represented and solved in Shogun. As a primer to Shogun's many capabilities, we will see how various types of data and its attributes are handled and also how prediction is done.
Machine learning concerns the construction and study of systems that can learn from data via exploiting certain types of structure within these. The uncovered patterns are then used to predict future data, or to perform other kinds of decision making. Two main classes (among others) of Machine Learning algorithms are: predictive or supervised learning and descriptive or Unsupervised learning. Shogun provides functionality to address those (and more) problem classes.
#To import all Shogun classes
from modshogun import *
For the unsupervised learning approach we are only given inputs, \(\mathcal{D} = {(x_i)}^{\text N}_{i=1}\) , and the goal is to find “interesting patterns” in the data.
Let us consider an example, we have a dataset about various attributes of individuals and we know whether or not they are diabetic. The data reveals certain configurations of attributes that correspond to diabetic patients and others that correspond to non-diabetic patients. When given a set of attributes for a new patient, the goal is to predict whether the patient is diabetic or not. This type of learning problem falls under Supervised learning, in particular, classification.
Shogun provides the capability to load datasets of different formats using CFile.
A real world dataset: Pima Indians Diabetes data set is used now. We load the LibSVM
format file using Shogun's LibSVMFile class. The LibSVM
format is: \[\space \text {label}\space \text{attribute1:value1 attribute2:value2 }...\]\[\space.\]\[\space .\] LibSVM uses the so called "sparse" format where zero values do not need to be stored.
#Load the file
data_file=LibSVMFile('../../../data/toy/diabetes_scale.svm')
This results in a LibSVMFile object which we will later use to access the data.
SpareRealFeatures
(sparse features handling 64 bit float
type data) are used to get the data from the file. Since LibSVM
format files have labels included in the file, load_with_labels
method of SpareRealFeatures
is used. In this case it is interesting to play with two attributes, Plasma glucose concentration and Body Mass Index (BMI) and try to learn something about their relationship with the disease. We get hold of the feature matrix using get_full_feature_matrix
and row vectors 1 and 5 are extracted. These are the attributes we are interested in.
f=SparseRealFeatures()
trainlab=f.load_with_labels(data_file)
mat=f.get_full_feature_matrix()
#exatract 2 attributes
glucose_conc=mat[1]
BMI=mat[5]
#generate a numpy array
feats=array(glucose_conc)
feats=vstack((feats, array(BMI)))
print feats, feats.shape
In numpy, this is a matrix of 2 row-vectors of dimension 768. However, in Shogun, this will be a matrix of 768 column vectors of dimension 2. This is beacuse each data sample is stored in a column-major fashion, meaning each column here corresponds to an individual sample and each row in it to an atribute like BMI, Glucose concentration etc. To convert the extracted matrix into Shogun format, RealFeatures
are used which are nothing but the above mentioned Dense features of 64bit Float
type. To do this call RealFeatures
with the matrix (this should be a 64bit 2D numpy array) as the argument.
#convert to shogun format
feats_train=RealFeatures(feats)
Some of the general methods you might find useful are:
get_feature_matrix()
: The feature matrix can be accessed using this.get_num_features()
: The total number of attributes can be accesed using this.get_num_vectors()
: To get total number of samples in data.get_feature_vector()
: To get all the attribute values (A.K.A feature vector) for a particular sample by passing the index of the sample as argument.
#Get number of features(attributes of data) and num of vectors(samples)
feat_matrix=feats_train.get_feature_matrix()
num_f=feats_train.get_num_features()
num_s=feats_train.get_num_vectors()
print('Number of attributes: %s and number of samples: %s' %(num_f, num_s))
print('Number of rows of feature matrix: %s and number of columns: %s' %(feat_matrix.shape[0], feat_matrix.shape[1]))
print('First column of feature matrix (Data for first individual):')
print feats_train.get_feature_vector(0)
In this particular problem, our data can be of two types: diabetic or non-diabetic, so we need binary labels. This makes it a Binary Classification problem, where the data has to be classified in two groups.
#convert to shogun format labels
labels=BinaryLabels(trainlab)
The labels can be accessed using get_labels
and the confidence vector using get_values
. The total number of labels is available using get_num_labels
.
n=labels.get_num_labels()
print 'Number of labels:', n
The training data will now be preprocessed using CPruneVarSubMean. This will basically remove data with zero variance and subtract the mean. Passing a True
to the constructor makes the class normalise the varaince of the variables. It basically dividies every dimension through its standard-deviation. This is the reason behind removing dimensions with constant values. It is required to initialize the preprocessor by passing the feature object to init
before doing anything else. The raw and processed data is now plotted.
preproc=PruneVarSubMean(True)
preproc.init(feats_train)
feats_train.add_preprocessor(preproc)
feats_train.apply_preprocessor()
# Store preprocessed feature matrix.
preproc_data=feats_train.get_feature_matrix()
# Plot the raw training data.
figure(figsize=(13,6))
pl1=subplot(121)
gray()
_=scatter(feats[0, :], feats[1,:], c=labels, s=50)
vlines(0, -1, 1, linestyle='solid', linewidths=2)
hlines(0, -1, 1, linestyle='solid', linewidths=2)
title("Raw Training Data")
_=xlabel('Plasma glucose concentration')
_=ylabel('Body mass index')
p1 = Rectangle((0, 0), 1, 1, fc="w")
p2 = Rectangle((0, 0), 1, 1, fc="k")
pl1.legend((p1, p2), ["Non-diabetic", "Diabetic"], loc=2)
#Plot preprocessed data.
pl2=subplot(122)
_=scatter(preproc_data[0, :], preproc_data[1,:], c=labels, s=50)
vlines(0, -5, 5, linestyle='solid', linewidths=2)
hlines(0, -5, 5, linestyle='solid', linewidths=2)
title("Training data after preprocessing")
_=xlabel('Plasma glucose concentration')
_=ylabel('Body mass index')
p1 = Rectangle((0, 0), 1, 1, fc="w")
p2 = Rectangle((0, 0), 1, 1, fc="k")
pl2.legend((p1, p2), ["Non-diabetic", "Diabetic"], loc=2)
gray()
Horizontal and vertical lines passing through zero are included to make the processing of data clear. Note that the now processed data has zero mean.
train()
the machine on some training data to be able to learn from it. Then we apply()
it to test data to get predictions. Some of these are:
Moving on to the prediction part, Liblinear, a linear SVM is used to do the classification (more on SVMs in this notebook). A linear SVM will find a linear separation with the largest possible margin. Here C is a penalty parameter on the loss function.
#prameters to svm
C=0.9
svm=LibLinear(C, feats_train, labels)
svm.set_liblinear_solver_type(L2R_L2LOSS_SVC)
#train
svm.train()
size=100
We will now apply
on test features to get predictions. For visualising the classification boundary, the whole XY is used as test data, i.e. we predict the class on every point in the grid.
x1=linspace(-5.0, 5.0, size)
x2=linspace(-5.0, 5.0, size)
x, y=meshgrid(x1, x2)
#Generate X-Y grid test data
grid=RealFeatures(array((ravel(x), ravel(y))))
#apply on test grid
predictions = svm.apply(grid)
#get output labels
z=predictions.get_values().reshape((size, size))
#plot
jet()
figure(figsize=(9,6))
title("Classification")
c=pcolor(x, y, z)
_=contour(x, y, z, linewidths=1, colors='black', hold=True)
_=colorbar(c)
_=scatter(preproc_data[0, :], preproc_data[1,:], c=trainlab, cmap=gray(), s=50)
_=xlabel('Plasma glucose concentration')
_=ylabel('Body mass index')
p1 = Rectangle((0, 0), 1, 1, fc="w")
p2 = Rectangle((0, 0), 1, 1, fc="k")
legend((p1, p2), ["Non-diabetic", "Diabetic"], loc=2)
gray()
Let us have a look at the weight vector of the separating hyperplane. It should tell us about the linear relationship between the features. The decision boundary is now plotted by solving for \(\bf{w}\cdot\bf{x}\) + \(\text{b}=0\). Here \(\text b\) is a bias term which allows the linear function to be offset from the origin of the used coordinate system. Methods get_w()
and get_bias()
are used to get the necessary values.
w=svm.get_w()
b=svm.get_bias()
x1=linspace(-2.0, 3.0, 100)
#solve for w.x+b=0
def solve (x1):
return -( ( (w[0])*x1 + b )/w[1] )
x2=map(solve, x1)
#plot
figure(figsize=(7,6))
plot(x1,x2, linewidth=2)
title("Decision boundary using w and bias")
_=scatter(preproc_data[0, :], preproc_data[1,:], c=trainlab, cmap=gray(), s=50)
_=xlabel('Plasma glucose concentration')
_=ylabel('Body mass index')
p1 = Rectangle((0, 0), 1, 1, fc="w")
p2 = Rectangle((0, 0), 1, 1, fc="k")
legend((p1, p2), ["Non-diabetic", "Diabetic"], loc=2)
print 'w :', w
print 'b :', b
For this problem, a linear classifier does a reasonable job in distinguishing labelled data. An interpretation could be that individuals below a certain level of BMI and glucose are likely to have no Diabetes. For problems where the data cannot be separated linearly, there are more advanced classification methods, as for example all of Shogun's kernel machines, but more on this later. To play with this interactively have a look at this: web demo
How do you assess the quality of a prediction? Shogun provides various ways to do this using CEvaluation. The preformance is evaluated by comparing the predicted output and the expected output. Some of the base classes for performance measures are:
Evaluating on training data should be avoided since the learner may adjust to very specific random features of the training data which are not very important to the general relation. This is called overfitting. Maximising performance on the training examples usually results in algorithms explaining the noise in data (rather than actual patterns), which leads to bad performance on unseen data. The dataset will now be split into two, we train on one part and evaluate performance on other using CAccuracyMeasure.
#split features for training and evaluation
num_train=700
feats=array(glucose_conc)
feats_t=feats[:num_train]
feats_e=feats[num_train:]
feats=array(BMI)
feats_t1=feats[:num_train]
feats_e1=feats[num_train:]
feats_t=vstack((feats_t, feats_t1))
feats_e=vstack((feats_e, feats_e1))
feats_train=RealFeatures(feats_t)
feats_evaluate=RealFeatures(feats_e)
Let's see the accuracy by applying on test features.
label_t=trainlab[:num_train]
labels=BinaryLabels(label_t)
label_e=trainlab[num_train:]
labels_true=BinaryLabels(label_e)
svm=LibLinear(C, feats_train, labels)
svm.set_liblinear_solver_type(L2R_L2LOSS_SVC)
#train and evaluate
svm.train()
output=svm.apply(feats_evaluate)
#use AccuracyMeasure to get accuracy
acc=AccuracyMeasure()
acc.evaluate(output,labels_true)
accuracy=acc.get_accuracy()*100
print 'Accuracy(%):', accuracy
To evaluate more efficiently cross-validation is used. As you might have wondered how are the parameters of the classifier selected? Shogun has a model selection framework to select the best parameters. More description of these things in this notebook.
This section will demonstrate another type of machine learning problem on real world data.
The task is to estimate prices of houses in Boston using the Boston Housing Dataset provided by StatLib library. The attributes are: Weighted distances to employment centres and percentage lower status of the population. Let us see if we can predict a good relationship between the pricing of houses and the attributes. This type of problems are solved using Regression analysis.
The data set is now loaded using LibSVMFile as in the previous sections and the attributes required (7th and 12th vector ) are converted to Shogun format features.
temp_feats=RealFeatures(CSVFile('../../../data/multiclass/categorical_dataset/fm_housing.dat'))
labels=RegressionLabels(CSVFile('../../../data/multiclass/categorical_dataset/housing_label.dat'))
#rescale to 0...1
preproc=RescaleFeatures()
preproc.init(temp_feats)
temp_feats.add_preprocessor(preproc)
temp_feats.apply_preprocessor(True)
mat = temp_feats.get_feature_matrix()
dist_centres=mat[7]
lower_pop=mat[12]
feats=array(dist_centres)
feats=vstack((feats, array(lower_pop)))
print feats, feats.shape
#convert to shogun format features
feats_train=RealFeatures(feats)
The tool we will use here to perform regression is Kernel ridge regression. Kernel Ridge Regression is a non-parametric version of ridge regression where the kernel trick is used to solve a related linear ridge regression problem in a higher-dimensional space, whose results correspond to non-linear regression in the data-space. Again we train on the data and apply on the XY grid to get predicitions.
from mpl_toolkits.mplot3d import Axes3D
size=100
x1=linspace(0, 1.0, size)
x2=linspace(0, 1.0, size)
x, y=meshgrid(x1, x2)
#Generate X-Y grid test data
grid=RealFeatures(array((ravel(x), ravel(y))))
#Train on data(both attributes) and predict
width=1.0
tau=0.5
kernel=GaussianKernel(feats_train, feats_train, width)
krr=KernelRidgeRegression(tau, kernel, labels)
krr.train(feats_train)
kernel.init(feats_train, grid)
out = krr.apply().get_labels()
The out
variable now contains a relationship between the attributes. Below is an attempt to establish such relationship between the attributes individually. Separate feature instances are created for each attribute. You could skip the code and have a look at the plots directly if you just want the essence.
#create feature objects for individual attributes.
feats_test=RealFeatures(x1.reshape(1,len(x1)))
feats_t0=array(dist_centres)
feats_train0=RealFeatures(feats_t0.reshape(1,len(feats_t0)))
feats_t1=array(lower_pop)
feats_train1=RealFeatures(feats_t1.reshape(1,len(feats_t1)))
#Regression with first attribute
kernel=GaussianKernel(feats_train0, feats_train0, width)
krr=KernelRidgeRegression(tau, kernel, labels)
krr.train(feats_train0)
kernel.init(feats_train0, feats_test)
out0 = krr.apply().get_labels()
#Regression with second attribute
kernel=GaussianKernel(feats_train1, feats_train1, width)
krr=KernelRidgeRegression(tau, kernel, labels)
krr.train(feats_train1)
kernel.init(feats_train1, feats_test)
out1 = krr.apply().get_labels()
#Visualization of regression
fig=figure(figsize(20,6))
#first plot with only one attribute
fig.add_subplot(131)
title("Regression with 1st attribute")
_=scatter(feats[0, :], labels.get_labels(),c=ones(560), cmap=gray(), s=20)
_=xlabel('Weighted distances to employment centres ')
_=ylabel('Median value of homes')
_=plot(x1,out0, linewidth=3)
#second plot with only one attribute
fig.add_subplot(132)
title("Regression with 2nd attribute")
_=scatter(feats[1, :], labels.get_labels(),c=ones(560), cmap=gray(), s=20)
_=xlabel('% lower status of the population')
_=ylabel('Median value of homes')
_=plot(x1,out1, linewidth=3)
#Both attributes and regression output
ax=fig.add_subplot(133, projection='3d')
z=out.reshape((size, size))
gray()
title("Regression")
ax.plot_wireframe(y, x, z, linewidths=2, alpha=0.4)
ax.set_xlabel('% lower status of the population')
ax.set_ylabel('Distances to employment centres ')
ax.set_zlabel('Median value of homes')
ax.view_init(25, 40)