Machine Learning with Shogun

By Saurabh Mahindre - as a part of Google Summer of Code 2014 project mentored by - Heiko Strathmann - -

In this notebook we will see how machine learning problems are generally represented and solved in Shogun. As a primer to Shogun's many capabilities, we will see how various types of data and its attributes are handled and also how prediction is done.


Machine learning concerns the construction and study of systems that can learn from data via exploiting certain types of structure within these. The uncovered patterns are then used to predict future data, or to perform other kinds of decision making. Two main classes (among others) of Machine Learning algorithms are: predictive or supervised learning and descriptive or Unsupervised learning. Shogun provides functionality to address those (and more) problem classes.

In [1]:
%pylab inline
%matplotlib inline
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
#To import all Shogun classes
from shogun import *
Populating the interactive namespace from numpy and matplotlib

In a general problem setting for the supervised learning approach, the goal is to learn a mapping from inputs $x_i\in\mathcal{X} $ to outputs $y_i \in \mathcal{Y}$, given a labeled set of input-output pairs $ \mathcal{D} = {(x_i,y_i)}^{\text N}_{i=1} $$\subseteq \mathcal{X} \times \mathcal{Y}$. Here $ \mathcal{D}$ is called the training set, and $\text N$ is the number of training examples. In the simplest setting, each training input $x_i$ is a $\mathcal{D}$ -dimensional vector of numbers, representing, say, the height and weight of a person. These are called $\textbf {features}$, attributes or covariates. In general, however, $x_i$ could be a complex structured object, such as an image.

  • When the response variable $y_i$ is categorical and discrete, $y_i \in$ {1,...,C} (say male or female) it is a [classification]( problem.
  • When it is continuous (say the prices of houses) it is a [regression]( problem.
For the unsupervised learning approach we are only given inputs, $\mathcal{D} = {(x_i)}^{\text N}_{i=1}$ , and the goal is to find “interesting patterns” in the data.

Using datasets

Let us consider an example, we have a dataset about various attributes of individuals and we know whether or not they are diabetic. The data reveals certain configurations of attributes that correspond to diabetic patients and others that correspond to non-diabetic patients. When given a set of attributes for a new patient, the goal is to predict whether the patient is diabetic or not. This type of learning problem falls under Supervised learning, in particular, classification.

Shogun provides the capability to load datasets of different formats using CFile.</br> A real world dataset: Pima Indians Diabetes data set is used now. We load the LibSVM format file using Shogun's LibSVMFile class. The LibSVM format is: $$\space \text {label}\space \text{attribute1:value1 attribute2:value2 }...$$$$\space.$$$$\space .$$ LibSVM uses the so called "sparse" format where zero values do not need to be stored.

In [2]:
#Load the file
data_file=LibSVMFile(os.path.join(SHOGUN_DATA_DIR, 'uci/diabetes/diabetes_scale.svm'))

This results in a LibSVMFile object which we will later use to access the data.

Feature representations

To get off the mark, let us see how Shogun handles the attributes of the data using CFeatures class. Shogun supports wide range of feature representations. We believe it is a good idea to have different forms of data, rather than converting them all into matrices. Among these are: $\hspace {20mm}$

  • [String features]( Implements a list of strings. Not limited to character strings, but could also be sequences of floating point numbers etc. Have varying dimensions.
  • [Dense features]( Implements dense feature matrices
  • [Sparse features]( Implements sparse matrices.
  • [Streaming features]( For algorithms working on data streams (which are too large to fit into memory)

SpareRealFeatures (sparse features handling 64 bit float type data) are used to get the data from the file. Since LibSVM format files have labels included in the file, load_with_labels method of SpareRealFeatures is used. In this case it is interesting to play with two attributes, Plasma glucose concentration and Body Mass Index (BMI) and try to learn something about their relationship with the disease. We get hold of the feature matrix using get_full_feature_matrix and row vectors 1 and 5 are extracted. These are the attributes we are interested in.

In [3]:

#exatract 2 attributes

#generate a numpy array
feats=vstack((feats, array(BMI)))
print(feats, feats.shape)
(array([[ 0.487437  , -0.145729  ,  0.839196  , ...,  0.21608   ,
         0.266332  , -0.0653266 ],
       [ 0.00149028, -0.207153  , -0.305514  , ..., -0.219076  ,
        -0.102832  , -0.0938897 ]]), (2, 768))

In numpy, this is a matrix of 2 row-vectors of dimension 768. However, in Shogun, this will be a matrix of 768 column vectors of dimension 2. This is beacuse each data sample is stored in a column-major fashion, meaning each column here corresponds to an individual sample and each row in it to an atribute like BMI, Glucose concentration etc. To convert the extracted matrix into Shogun format, RealFeatures are used which are nothing but the above mentioned Dense features of 64bit Float type. To do this call RealFeatures with the matrix (this should be a 64bit 2D numpy array) as the argument.

In [4]:
#convert to shogun format

Some of the general methods you might find useful are:

  • get_feature_matrix(): The feature matrix can be accessed using this.
  • get_num_features(): The total number of attributes can be accesed using this.
  • get_num_vectors(): To get total number of samples in data.
  • get_feature_vector(): To get all the attribute values (A.K.A feature vector) for a particular sample by passing the index of the sample as argument.</li></ul>
In [5]:
#Get number of features(attributes of data) and num of vectors(samples)

print('Number of attributes: %s and number of samples: %s' %(num_f, num_s))
print('Number of rows of feature matrix: %s and number of columns: %s' %(feat_matrix.shape[0], feat_matrix.shape[1]))
print('First column of feature matrix (Data for first individual):')
Number of attributes: 2 and number of samples: 768
Number of rows of feature matrix: 2 and number of columns: 768
First column of feature matrix (Data for first individual):
[ 0.487437    0.00149028]

Assigning labels

In supervised learning problems, training data is labelled. Shogun provides various types of labels to do this through Clabels. Some of these are:

  • [Binary labels]( Binary Labels for binary classification which can have values +1 or -1.
  • [Multiclass labels]( Multiclass Labels for multi-class classification which can have values from 0 to (num. of classes-1).
  • [Regression labels]( Real-valued labels used for regression problems and are returned as output of classifiers.
  • [Structured labels]( Class of the labels used in Structured Output (SO) problems
</br> In this particular problem, our data can be of two types: diabetic or non-diabetic, so we need binary labels. This makes it a Binary Classification problem, where the data has to be classified in two groups.

In [6]:
#convert to shogun format labels

The labels can be accessed using get_labels and the confidence vector using get_values. The total number of labels is available using get_num_labels.

In [7]:
print('Number of labels:', n)
('Number of labels:', 768)

Preprocessing data

It is usually better to preprocess data to a standard form rather than handling it in raw form. The reasons are having a well behaved-scaling, many algorithms assume centered data, and that sometimes one wants to de-noise data (with say PCA). Preprocessors do not change the domain of the input features. It is possible to do various type of preprocessing using methods provided by CPreprocessor class. Some of these are:

  • [Norm one]( Normalize vector to have norm 1.
  • [PruneVarSubMean]( Substract the mean and remove features that have zero variance.
  • [Dimension Reduction]( Lower the dimensionality of given simple features.
    • [PCA]( Principal component analysis.
    • [Kernel PCA]( PCA using kernel methods.
    </li></ul> The training data will now be preprocessed using CPruneVarSubMean. This will basically remove data with zero variance and subtract the mean. Passing a True to the constructor makes the class normalise the varaince of the variables. It basically dividies every dimension through its standard-deviation. This is the reason behind removing dimensions with constant values. It is required to initialize the preprocessor by passing the feature object to init before doing anything else. The raw and processed data is now plotted.

In [8]:

# Store preprocessed feature matrix.
In [9]:
# Plot the raw training data.
_=scatter(feats[0, :], feats[1,:], c=labels, s=50)
vlines(0, -1, 1, linestyle='solid', linewidths=2)
hlines(0, -1, 1, linestyle='solid', linewidths=2)
title("Raw Training Data")
_=xlabel('Plasma glucose concentration')
_=ylabel('Body mass index')
p1 = Rectangle((0, 0), 1, 1, fc="w")
p2 = Rectangle((0, 0), 1, 1, fc="k")
pl1.legend((p1, p2), ["Non-diabetic", "Diabetic"], loc=2)

#Plot preprocessed data.
_=scatter(preproc_data[0, :], preproc_data[1,:], c=labels, s=50)
vlines(0, -5, 5, linestyle='solid', linewidths=2)
hlines(0, -5, 5, linestyle='solid', linewidths=2)
title("Training data after preprocessing")
_=xlabel('Plasma glucose concentration')
_=ylabel('Body mass index')
p1 = Rectangle((0, 0), 1, 1, fc="w")
p2 = Rectangle((0, 0), 1, 1, fc="k")
pl2.legend((p1, p2), ["Non-diabetic", "Diabetic"], loc=2)

Horizontal and vertical lines passing through zero are included to make the processing of data clear. Note that the now processed data has zero mean.

CMachine is Shogun's interface for general learning machines. Basically one has to train() the machine on some training data to be able to learn from it. Then we apply() it to test data to get predictions. Some of these are:

  • [Kernel machine]( Kernel based learning tools.
  • [Linear machine]( Interface for all kinds of linear machines like classifiers.
  • [Distance machine]( A distance machine is based on a a-priori choosen distance.
  • [Gaussian process machine]( A base class for Gaussian Processes.
  • And many more

Moving on to the prediction part, Liblinear, a linear SVM is used to do the classification (more on SVMs in this notebook). A linear SVM will find a linear separation with the largest possible margin. Here C is a penalty parameter on the loss function.

In [10]:
#prameters to svm

svm=LibLinear(C, feats_train, labels)



We will now apply on test features to get predictions. For visualising the classification boundary, the whole XY is used as test data, i.e. we predict the class on every point in the grid.

In [11]:
x1=linspace(-5.0, 5.0, size)
x2=linspace(-5.0, 5.0, size)
x, y=meshgrid(x1, x2)
#Generate X-Y grid test data
grid=RealFeatures(array((ravel(x), ravel(y))))

#apply on test grid
predictions = svm.apply(grid)
#get output labels
z=predictions.get_values().reshape((size, size))

c=pcolor(x, y, z)
_=contour(x, y, z, linewidths=1, colors='black', hold=True)

_=scatter(preproc_data[0, :], preproc_data[1,:], c=trainlab, cmap=gray(), s=50)
_=xlabel('Plasma glucose concentration')
_=ylabel('Body mass index')
p1 = Rectangle((0, 0), 1, 1, fc="w")
p2 = Rectangle((0, 0), 1, 1, fc="k")
legend((p1, p2), ["Non-diabetic", "Diabetic"], loc=2)
<matplotlib.figure.Figure at 0x7fe879073e50>

Let us have a look at the weight vector of the separating hyperplane. It should tell us about the linear relationship between the features. The decision boundary is now plotted by solving for $\bf{w}\cdot\bf{x}$ + $\text{b}=0$. Here $\text b$ is a bias term which allows the linear function to be offset from the origin of the used coordinate system. Methods get_w() and get_bias() are used to get the necessary values.

In [12]:

x1=linspace(-2.0, 3.0, 100)

#solve for w.x+b=0
def solve (x1):
    return -( ( (w[0])*x1 + b )/w[1] )
x2=list(map(solve, x1))

plot(x1,x2, linewidth=2)
title("Decision boundary using w and bias")
_=scatter(preproc_data[0, :], preproc_data[1,:], c=trainlab, cmap=gray(), s=50)
_=xlabel('Plasma glucose concentration')
_=ylabel('Body mass index')
p1 = Rectangle((0, 0), 1, 1, fc="w")
p2 = Rectangle((0, 0), 1, 1, fc="k")
legend((p1, p2), ["Non-diabetic", "Diabetic"], loc=2)

print('w :', w)
print('b :', b)
('w :', array([-0.42804589, -0.21630841]))
('b :', 0.3170020152494004)

For this problem, a linear classifier does a reasonable job in distinguishing labelled data. An interpretation could be that individuals below a certain level of BMI and glucose are likely to have no Diabetes. For problems where the data cannot be separated linearly, there are more advanced classification methods, as for example all of Shogun's kernel machines, but more on this later. To play with this interactively have a look at this: web demo

Evaluating performance and Model selection

How do you assess the quality of a prediction? Shogun provides various ways to do this using CEvaluation. The preformance is evaluated by comparing the predicted output and the expected output. Some of the base classes for performance measures are:

Evaluating on training data should be avoided since the learner may adjust to very specific random features of the training data which are not very important to the general relation. This is called overfitting. Maximising performance on the training examples usually results in algorithms explaining the noise in data (rather than actual patterns), which leads to bad performance on unseen data. The dataset will now be split into two, we train on one part and evaluate performance on other using CAccuracyMeasure.

In [13]:
#split features for training and evaluation
feats_t=vstack((feats_t, feats_t1))
feats_e=vstack((feats_e, feats_e1))


Let's see the accuracy by applying on test features.

In [14]:

svm=LibLinear(C, feats_train, labels)

#train and evaluate

#use AccuracyMeasure to get accuracy
print('Accuracy(%):', accuracy)
('Accuracy(%):', 73.52941176470588)

To evaluate more efficiently cross-validation is used. As you might have wondered how are the parameters of the classifier selected? Shogun has a model selection framework to select the best parameters. More description of these things in this notebook.

More predictions: Regression

This section will demonstrate another type of machine learning problem on real world data.</br> The task is to estimate prices of houses in Boston using the Boston Housing Dataset provided by StatLib library. The attributes are: Weighted distances to employment centres and percentage lower status of the population. Let us see if we can predict a good relationship between the pricing of houses and the attributes. This type of problems are solved using Regression analysis.

The data set is now loaded using LibSVMFile as in the previous sections and the attributes required (7th and 12th vector ) are converted to Shogun format features.

In [15]:
temp_feats=RealFeatures(CSVFile(os.path.join(SHOGUN_DATA_DIR, 'uci/housing/fm_housing.dat')))
labels=RegressionLabels(CSVFile(os.path.join(SHOGUN_DATA_DIR, 'uci/housing/housing_label.dat')))

#rescale to 0...1
mat = temp_feats.get_feature_matrix()


feats=vstack((feats, array(lower_pop)))
print(feats, feats.shape)
#convert to shogun format features
(array([[ 0.26920314,  0.34896198,  0.34896198, ...,  0.09438114,
         0.11451409,  0.12507161],
       [ 0.08967991,  0.2044702 ,  0.06346578, ...,  0.10789183,
         0.13107064,  0.16970199]]), (2, 506))

The tool we will use here to perform regression is Kernel ridge regression. Kernel Ridge Regression is a non-parametric version of ridge regression where the kernel trick is used to solve a related linear ridge regression problem in a higher-dimensional space, whose results correspond to non-linear regression in the data-space. Again we train on the data and apply on the XY grid to get predicitions.

In [16]:
from mpl_toolkits.mplot3d import Axes3D
x1=linspace(0, 1.0, size)
x2=linspace(0, 1.0, size)
x, y=meshgrid(x1, x2)
#Generate X-Y grid test data
grid=RealFeatures(array((ravel(x), ravel(y))))

#Train on data(both attributes) and predict
kernel=GaussianKernel(feats_train, feats_train, width)
krr=KernelRidgeRegression(tau, kernel, labels)
kernel.init(feats_train, grid)
out = krr.apply().get_labels()

The out variable now contains a relationship between the attributes. Below is an attempt to establish such relationship between the attributes individually. Separate feature instances are created for each attribute. You could skip the code and have a look at the plots directly if you just want the essence.

In [17]:
#create feature objects for individual attributes.

#Regression with first attribute
kernel=GaussianKernel(feats_train0, feats_train0, width)
krr=KernelRidgeRegression(tau, kernel, labels)
kernel.init(feats_train0, feats_test)
out0 = krr.apply().get_labels()

#Regression with second attribute 
kernel=GaussianKernel(feats_train1, feats_train1, width)
krr=KernelRidgeRegression(tau, kernel, labels)
kernel.init(feats_train1, feats_test)
out1 = krr.apply().get_labels()
In [18]:
#Visualization of regression
#first plot with only one attribute
title("Regression with 1st attribute")
_=scatter(feats[0, :], labels.get_labels(), cmap=gray(), s=20)
_=xlabel('Weighted distances to employment centres ')
_=ylabel('Median value of homes')

_=plot(x1,out0, linewidth=3)

#second plot with only one attribute
title("Regression with 2nd attribute")
_=scatter(feats[1, :], labels.get_labels(), cmap=gray(), s=20)
_=xlabel('% lower status of the population')
_=ylabel('Median value of homes')
_=plot(x1,out1, linewidth=3)

#Both attributes and regression output
ax=fig.add_subplot(133, projection='3d')
z=out.reshape((size, size))
ax.plot_wireframe(y, x, z, linewidths=2, alpha=0.4)
ax.set_xlabel('% lower status of the population')
ax.set_ylabel('Distances to employment centres ')
ax.set_zlabel('Median value of homes')
ax.view_init(25, 40)