We aim to find the linear function (line) that best explains the data, i.e. that minimises some measure of deviation to the training data $\mathcal{D}$. One such measure is the sum of squared distances. The Ordinary Least Sqaures method minimizes the sum of squared distances between the observed responses in the dataset and the responses predicted by the linear approximation.

In []:

```
%pylab inline
%matplotlib inline
# import all shogun classes
from modshogun import *
slope=3
X_train=rand(30)*10
y_train=slope*(X_train)+random.randn(30)*2+2
y_true=slope*(X_train)+2
X_test=concatenate((linspace(0,10, 50),X_train))
#Convert data to shogun format features
feats_train=RealFeatures(X_train.reshape(1,len(X_train)))
feats_test=RealFeatures(X_test.reshape(1,len(X_test)))
labels_train=RegressionLabels(y_train)
```

`train()`

it. This also generates the $\text w$ from the general equation described above. To access $\text w$ use `get_w()`

.

In []:

```
ls=LeastSquaresRegression(feats_train, labels_train)
ls.train()
w=ls.get_w()
print 'Weights:'
print w
```

`apply`

this trained machine to our test data to get the ouput values.

In []:

```
out=ls.apply(feats_test).get_labels()
```

In []:

```
figure(figsize=(20,5))
#Regression and true plot
pl1=subplot(131)
title('Regression')
_=plot(X_train,labels_train, 'ro')
_=plot(X_test,out, color='blue')
_=plot(X_train, y_true, color='green')
p1 = Rectangle((0, 0), 1, 1, fc="r")
p2 = Rectangle((0, 0), 1, 1, fc="b")
p3 = Rectangle((0, 0), 1, 1, fc="g")
pl1.legend((p1, p2, p3), ["Training samples", "Predicted output", "True relationship"], loc=2)
xlabel('Samples (X)', fontsize=12)
ylabel('Response variable (Y)', fontsize=12)
#plot residues
pl2=subplot(132)
title("Squared error and output")
_=plot(X_test,out, linewidth=2)
gray()
_=scatter(X_train,labels_train,c=ones(50) ,cmap=gray(), s=40)
for i in range(50,80):
plot([X_test[i],X_test[i]],[out[i],y_train[i-50]] , linewidth=2, color='red')
p1 = Rectangle((0, 0), 1, 1, fc="r")
p2 = Rectangle((0, 0), 1, 1, fc="b")
pl2.legend((p1, p2), ["Error/residuals to be squared", "Predicted output"], loc=2)
xlabel('Samples (X)', fontsize=12)
ylabel('Response variable (Y)', fontsize=12)
jet()
```

The function we choose should not only best fit the training data but also generalise well. If the coefficients/weights are unconstrained, they are susceptible to high variance and overfitting. To control variance, one has to regularize the coefficients i.e. control how large the coefficients grow. This is what is done in Ridge regression which is L2 (sum of squared components of $\bf w$) regularized form of least squares. A penalty is imposed on the size of coefficients. The error to be minimized is:

$$E({\bf{w}}) = \sum_{i=1}^N(y_i-{\bf w}\cdot {\bf x}_i)^2 + \tau||{\bf w}||^2$$

Here $\tau$ imposes a penalty on the weights.

By differentiating the regularised training error and equating to zero, we find the optimal $\bf w$, given by:

$${\bf w} = \left(\tau {\bf I}+ \sum_{i=1}^N{\bf x}_i{\bf x}_i^T\right)^{-1}\left(\sum_{i=1}^N y_i{\bf x}_i\right)$$

In []:

```
tau=0.8
rr=LinearRidgeRegression(tau, feats_train, labels_train)
rr.train()
w=rr.get_w()
print w
out=rr.apply(feats_test).get_labels()
```

In []:

```
figure(figsize=(20,5))
#Regression and true plot
pl1=subplot(131)
title('Ridge Regression')
_=plot(X_train,labels_train, 'ro')
_=plot(X_test, out, color='blue')
_=plot(X_train, y_true, color='green')
p1 = Rectangle((0, 0), 1, 1, fc="r")
p2 = Rectangle((0, 0), 1, 1, fc="b")
p3 = Rectangle((0, 0), 1, 1, fc="g")
pl1.legend((p1, p2, p3), ["Training samples", "Predicted output", "True relationship"], loc=2)
xlabel('Samples (X)', fontsize=12)
ylabel('Response variable (Y)', fontsize=12)
jet()
```

`set_tau()`

method is used to set the necessary parameter.

In []:

```
#Generate Data
def generate_data(N, D):
w = randn(D,1)
X=zeros((N,D))
y=zeros((N,1))
for i in range(N):
x=randn(1,D)
for j in range(D):
X[i][j]=x[0][j]
y=dot(X,w) + randn(N,1);
y.reshape(N,)
return X,y.T
def generate_weights(taus, feats_train, labels_train):
preproc=PruneVarSubMean(True)
preproc.init(feats_train)
feats_train.add_preprocessor(preproc)
feats_train.apply_preprocessor()
weights=[]
rr=LinearRidgeRegression(tau, feats_train, labels_train)
#vary regularization
for t in taus:
rr.set_tau(t)
rr.train()
weights.append(rr.get_w())
return weights, rr
def plot_regularization(taus, weights):
ax = gca()
ax.set_color_cycle(['b', 'r', 'g', 'c', 'k', 'y', 'm'])
ax.plot(taus, weights, linewidth=2)
xlabel('Tau', fontsize=12)
ylabel('Weights', fontsize=12)
ax.set_xscale('log')
```

The mean squared error (MSE) of an estimator measures the average of the squares of the errors. CMeanSquaredError class is used to compute the MSE as :

$$\frac{1}{|L|} \sum_{i=1}^{|L|} (L_i - R_i)^2$$

Here $L$ is the vector of predicted labels and $R$ is the vector of real labels.

We use 5-fold cross-validation to compute MSE and have a look at how MSE varies with regularisation.

In []:

```
def xval_results(taus):
errors=[]
for t in taus:
rr.set_tau(t)
splitting_strategy=CrossValidationSplitting(labels_train, 5)
# evaluation method
evaluation_criterium=MeanSquaredError()
# cross-validation instance
cross_validation=CrossValidation(rr, feats_train, labels_train, splitting_strategy, evaluation_criterium, False)
cross_validation.set_num_runs(100)
cross_validation.set_conf_int_alpha(0.05)
result=cross_validation.evaluate()
result=CrossValidationResult.obtain_from_generic(result)
errors.append(result.mean)
return errors
```

Data with dimension: 10 and number of samples: 200 is now sampled.

In []:

```
n = 500
taus = logspace(-6, 4, n)
figure(figsize=(20,6))
suptitle('Effect of Regularisation for 10-dimensional data with 200 samples', fontsize=12)
matrix, y=generate_data(200,10)
feats_train = RealFeatures(matrix.T)
labels_train= RegressionLabels(y[0])
weights, rr=generate_weights(taus, feats_train, labels_train)
errors=xval_results(taus)
p1=subplot(121)
plot_regularization(taus, weights)
p2=subplot(122)
plot(taus, errors)
p2.set_xscale('log')
xlabel('Tau', fontsize=12)
ylabel('Error', fontsize=12)
jet()
```

In []:

```
figure(figsize=(20,6))
suptitle('Effect of Regularisation for 10-dimensional data with 10 samples', fontsize=12)
matrix, y=generate_data(10,10)
feats_train = RealFeatures(matrix.T)
labels_train= RegressionLabels(y[0])
weights, rr=generate_weights(taus, feats_train, labels_train)
errors=xval_results(taus)
p1=subplot(121)
plot_regularization(taus, weights)
p2=subplot(122)
plot(taus, errors)
p2.set_xscale('log')
xlabel('Tau', fontsize=12)
ylabel('Error', fontsize=12)
jet()
```

$$ \min \|X^T\beta - y\|^2 + \lambda\|\beta\|_1$$

In Shogun, following equivalent form is solved, where increasing $C$ selects more variables:

$$\min \|X^T\beta - y\|^2 \quad s.t. \|\beta\|_1 \leq C $$

One way to solve this regularized form is by using Least Angle Regression (LARS).

LARS is essentially forward stagewise made fast. LARS can be briefly described as follows.

- Start with an empty set.
- Select $x_j$ that is most correlated with residuals.
- Proceed in the direction of $x_j$ until another variable $x_k$ is equally correlated with residuals.
- Choose equiangular direction between $x_j$ and $x_k$.
- Proceed until third variable enters the active set, etc.

It should be noticed that instead of making tiny hops in the direction of one variable at a time, LARS makes optimally-sized leaps in optimal directions. These directions are chosen to make equal angles (equal correlations) with each of the variables currently in our set (equiangular).

In []:

```
#sample some data
X=rand(10)*1.5
for i in range(9):
x=random.standard_normal(10)*0.5
X=vstack((X, x))
y=ones(10)
feats_train=RealFeatures(X)
labels_train=RegressionLabels(y)
```

In []:

```
#Preprocess data
preproc=PruneVarSubMean()
preproc.init(feats_train)
feats_train.add_preprocessor(preproc)
feats_train.apply_preprocessor()
preprocessor=NormOne()
preprocessor.init(feats_train)
feats_train.add_preprocessor(preprocessor)
feats_train.apply_preprocessor()
print "(No. of attributes, No. of samples) of data:"
print feats_train.get_feature_matrix().shape
```

`get_path_size()`

.

In []:

```
#Train and generate weights
la=LeastAngleRegression()
la.set_labels(labels_train)
la.train(feats_train)
size=la.get_path_size()
print ("Size of path is %s" %size)
```

`get_w_for_var()`

method is used. The argument is the index of the variable which should be in the range [0, path_size).

In []:

```
#calculate weights
weights=[]
for i in range(size):
weights.append(la.get_w_for_var(i))
s = sum(abs(array(weights)), axis=1)
print ('Max. norm is %s' %s[-1])
figure(figsize(30,7))
#plot 1
ax=subplot(131)
title('Lasso path')
ax.plot(s, weights, linewidth=2)
ymin, ymax = ylim()
ax.vlines(s[1:-1], ymin, ymax, linestyle='dashed')
xlabel("Norm")
ylabel("weights")
#Restrict norm to half for early termination
la.set_max_l1_norm(s[-1]*0.5)
la.train(feats_train)
size=la.get_path_size()
weights=[]
for i in range(size):
weights.append(la.get_w_for_var(i))
s = sum(abs(array(weights)), axis=1)
#plot 2
ax2=subplot(132)
title('Lasso path with restricted norm')
ax2.plot(s, weights, linewidth=2)
ax2.vlines(s[1:-1], ymin, ymax, linestyle='dashed')
xlabel("Norm")
ylabel("weights")
print ('Restricted norm is %s' %(s[-1]))
```

In []:

```
feats=RealFeatures(CSVFile('../../../data/multiclass/categorical_dataset/fm_housing.dat'))
train_labels=RegressionLabels(CSVFile('../../../data/multiclass/categorical_dataset/housing_label.dat'))
mat = feats.get_feature_matrix()
crime_rate=mat[0]
feats_train=RealFeatures(crime_rate.reshape(1, len(mat[0])))
preproc=RescaleFeatures()
preproc.init(feats_train)
feats_train.add_preprocessor(preproc)
feats_train.apply_preprocessor(True)
# Store preprocessed feature matrix.
preproc_data=feats_train.get_feature_matrix()
size=500
x1=linspace(0, 1, size)
```

In []:

```
width=0.5
tau=0.5
kernel=GaussianKernel(feats_train, feats_train, width)
krr=KernelRidgeRegression(tau, kernel, train_labels)
krr.train(feats_train)
feats_test=RealFeatures(x1.reshape(1,len(x1)))
kernel.init(feats_train, feats_test)
out = krr.apply().get_labels()
```

In []:

```
#Visualization of regression
fig=figure(figsize(6,6))
#first plot with only one attribute
title("Regression with 1st attribute")
_=scatter(preproc_data[0:], train_labels.get_labels(), c=ones(560), cmap=gray(), s=20)
_=xlabel('Crime rate')
_=ylabel('Median value of homes')
_=plot(x1,out, linewidth=3)
```

We are given a labeled set of input-output pairs $\mathcal{D}=(x_i,y_i)^N_{i=1}\subseteq \mathcal{X} \times \mathcal{Y}$ where $x\in\mathcal{X}$ and $y\in \mathcal{Y}$ and the primary problem is as follows: $$\arg\min_{\mathbf{w},\mathbf{\xi}, b } ({\frac{1}{2} \|\mathbf{w}\|^2 +C \sum_{i=1}^n (\xi_i+ {\xi_i}^*)) }$$

For the constraints: $$ {\bf w}^T{\bf x}_i+b-c_i-\xi_i\leq 0, \, \forall i=1\dots N$$ $$ -{\bf w}^T{\bf x}_i-b-c_i^*-\xi_i^*\leq 0, \, \forall i=1\dots N $$ with $c_i=y_i+ \epsilon$ and $c_i^*=-y_i+ \epsilon$

The resulting dual optimaization problem is: $$ \max_{{\bf \alpha},{\bf \alpha}^*} -\frac{1}{2}\sum_{i,j=1}^N(\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*) {\bf x}_i^T {\bf x}_j-\sum_{i=1}^N(\alpha_i+\alpha_i^*)\epsilon - \sum_{i=1}^N(\alpha_i-\alpha_i^*)y_i\\$$ $$ \mbox{wrt}:$$ $${\bf \alpha},{\bf \alpha}^*\in{\bf R}^N\\ \mbox{s.t.}: 0\leq \alpha_i,\alpha_i^*\leq C,\, \forall i=1\dots N\\ \sum_{i=1}^N(\alpha_i-\alpha_i^*)y_i=0 $$

This class also support the $\nu$-SVR regression version of the problem, where $\nu$ replaces the $\epsilon$ parameter and represents an upper bound on the action of margin errors and a lower bound on the fraction of support vectors. The resulting problem generally takes a bit longer to solve. The details and comparison of these two versioins can be found in [1].

`svr_param`

argument is the $\epsilon$-tube for the $\epsilon$ version and is the $\nu$ parameter in other case.

In []:

```
# Use different kernels
gaussian_kernel=GaussianKernel(feats_train, feats_train, 0.1)
#Polynomial kernel of degree 2
poly_kernel=PolyKernel(feats_train, feats_train, 2, True)
linear_kernel=LinearKernel(feats_train, feats_train)
kernels=[linear_kernel, poly_kernel, gaussian_kernel]
```

In []:

```
svr_param=1
svr_C=10
svr=LibSVR(svr_C, svr_param, gaussian_kernel, train_labels, LIBSVR_EPSILON_SVR)
#Visualization of regression
x1=linspace(0, 1, size)
feats_test_=RealFeatures(x1.reshape(1,len(x1)))
def svr_regress(kernels):
fig=figure(figsize(8,8))
for i, kernel in enumerate(kernels):
svr.set_kernel(kernel)
svr.train()
out=svr.apply(feats_test_).get_labels()
#subplot(1,len(kernels), i)
#first plot with only one attribute
title("Support Vector Regression")
_=scatter(preproc_data[0:], train_labels.get_labels(), c=ones(560), cmap=gray(), s=20)
_=xlabel('Crime rate')
_=ylabel('Median value of homes')
_=plot(x1,out, linewidth=3)
ylim([0, 40])
p1 = Rectangle((0, 0), 1, 1, fc="r")
p2 = Rectangle((0, 0), 1, 1, fc="b")
p3 = Rectangle((0, 0), 1, 1, fc="g")
_=legend((p1, p2, p3), ["Gaussian Kernel", "Linear Kernel", "Polynomial Kernel"], loc=1)
svr_regress(kernels)
```

In []:

```
import time
gaussian_kernel=GaussianKernel(feats, feats, 13)
nus=[0.2, 0.4, 0.6, 0.8]
epsilons=[0.16, 0.09, 0.046, 0.0188]
svr_C=10
def compare_svr(nus, epsilons):
time_eps=[]
time_nus=[]
for i in range(len(epsilons)):
svr_param=1
svr=LibSVR(svr_C, epsilons[i], gaussian_kernel, train_labels, LIBSVR_EPSILON_SVR)
t_start=time.clock()
svr.train()
time_test=(time.clock() - t_start)
time_eps.append(time_test)
for i in range(len(nus)):
svr_param=1
svr=LibSVR(svr_C, nus[i], gaussian_kernel, train_labels, LIBSVR_NU_SVR)
t_start=time.clock()
svr.train()
time_test=(time.clock() - t_start)
time_nus.append(time_test)
print "-"*72
print "|", "%15s" % 'Nu' ,"|", "%15s" % 'Epsilon',"|","%15s" % 'Time (Nu)' ,"|", "%15s" % 'Time(Epsilon)' ,"|"
for i in range(len(nus)):
print "-"*72
print "|", "%15s" % nus[i] ,"|", "%15s" %epsilons[i],"|","%15s" %time_nus[i] ,"|", "%15s" %time_eps[i] ,"|"
print "-"*72
title_='SVR Performance on Boston Housing dataset'
print "%50s" %title_
compare_svr(nus, epsilons)
```