*Maximum Mean Discrepancy*. There are two types of tests available, a quadratic time test and a linear time test. Both come in various flavours.
Independence testing is currently based on the *Hilbert Schmidt Independence Criterion*.

In [1]:

```
%pylab inline
%matplotlib inline
# import all Shogun classes
from modshogun import *
```

To set the context, we here briefly describe statistical hypothesis testing. Informally, one defines a hypothesis on a certain domain and then uses a statistical test to check whether this hypothesis is true. Formally, the goal is to reject a so-called *null-hypothesis* $H_0$, which is the complement of an *alternative-hypothesis* $H_A$.

To distinguish the hypotheses, a test statistic is computed on sample data. Since sample data is finite, this corresponds to sampling the true distribution of the test statistic. There are two different distributions of the test statistic -- one for each hypothesis. The *null-distribution* corresponds to test statistic samples under the model that $H_0$ holds; the *alternative-distribution* corresponds to test statistic samples under the model that $H_A$ holds.

In practice, one tries to compute the quantile of the test statistic in the null-distribution. In case the test statistic is in a high quantile, i.e. it is unlikely that the null-distribution has generated the test statistic -- the null-hypothesis $H_0$ is rejected.

There are two different kinds of errors in hypothesis testing:

- A *type I error* is made when $H_0: p=q$ is wrongly rejected. That is, the test says that the samples are from different distributions when they are not.
- A *type II error* is made when $H_0: p=q$ is wrongly accepted. That is, the test says that the samples are from the same distribution when they are not.

A so-called *consistent* test achieves zero type II error for a fixed type I error in the limit of infinitely many samples.

To decide whether to reject $H_0$, one could set a threshold, say at the $95\%$ quantile of the null-distribution, and reject $H_0$ when the test statistic lies above that threshold. This means that, if $H_0$ holds, the chance of observing a test statistic this large is at most $5\%$. We call this number the *test level* $\alpha$ (in this case $\alpha=0.05$). It is an upper bound on the probability for a type I error. An alternative way is to compute the *p-value*, i.e. one minus the quantile of the test statistic in the null-distribution, and to compare it against a desired test level, say $\alpha=0.05$, by hand: $H_0$ is rejected if the p-value is smaller than $\alpha$. The advantage of the second method is that one not only gets a binary answer, but also an upper bound on the type I error.
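
Both decision rules can be sketched in plain NumPy once samples from the null-distribution are available. The helper names and the evenly spaced toy null samples below are assumptions made for illustration only; they are not part of Shogun.

```python
import numpy as np

def p_value_from_null(null_samples, statistic):
    # fraction of null samples that are at least as large as the statistic
    null_samples = np.asarray(null_samples, dtype=float)
    return np.mean(null_samples >= statistic)

def threshold_from_null(null_samples, alpha):
    # empirical (1 - alpha) quantile of the null samples
    return np.percentile(null_samples, 100.0 * (1.0 - alpha))

# hypothetical null samples: 1000 evenly spaced values in [0, 1)
toy_null_samples = np.arange(1000) / 1000.0
toy_alpha = 0.05
toy_statistic = 0.9895

toy_threshold = threshold_from_null(toy_null_samples, toy_alpha)
toy_p_value = p_value_from_null(toy_null_samples, toy_statistic)

# the two rules agree: reject if the statistic exceeds the threshold,
# or equivalently if the p-value is smaller than alpha
reject_by_threshold = toy_statistic > toy_threshold
reject_by_p_value = toy_p_value < toy_alpha
print("threshold=%.4f, p-value=%.3f" % (toy_threshold, toy_p_value))
```

With a finite number of null samples, the two rules can disagree for statistics that lie very close to the threshold, which is one reason to use enough null samples.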

In order to construct a two-sample test, the null-distribution of the test statistic has to be approximated. One way of doing this for any two-sample test is called *bootstrapping*, or the *permutation* test, where samples from both sources are mixed and permuted repeatedly and the test statistic is computed for each of these configurations. While this method works for every statistical hypothesis test, it might be very costly because the test statistic has to be re-computed many times. For many test statistics, there are more sophisticated methods of approximating the null distribution.
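
The permutation idea can be sketched independently of Shogun. The function `permutation_null` and the stand-in statistic (absolute difference of sample means, not the MMD) are hypothetical names chosen for this illustration.

```python
import numpy as np

def permutation_null(x, y, statistic, num_null_samples=200, seed=0):
    # merge both samples, shuffle, split again, and recompute the statistic
    rng = np.random.RandomState(seed)
    z = np.concatenate([x, y])
    m = len(x)
    null_samples = np.empty(num_null_samples)
    for i in range(num_null_samples):
        perm = rng.permutation(len(z))
        null_samples[i] = statistic(z[perm[:m]], z[perm[m:]])
    return null_samples

def abs_mean_difference(x, y):
    # simple stand-in test statistic for this sketch
    return abs(np.mean(x) - np.mean(y))

rng = np.random.RandomState(1)
perm_x = rng.normal(0.0, 1.0, size=100)
perm_y = rng.normal(1.0, 1.0, size=100)

perm_observed = abs_mean_difference(perm_x, perm_y)
perm_null = permutation_null(perm_x, perm_y, abs_mean_difference)
perm_p_value = np.mean(perm_null >= perm_observed)
print("observed=%.2f, p-value=%.3f" % (perm_observed, perm_p_value))
```

The cost is one evaluation of the statistic per null sample, which is what makes this approach expensive for statistics that are themselves costly to compute.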

Shogun implements statistical testing in the abstract class CHypothesisTest. All implemented methods will work with this interface at their most basic level. This class offers methods to

- compute the implemented test statistic,
- compute p-values for a given value of the test statistic,
- compute a test threshold for a given p-value,
- sample the null distribution, i.e. perform the permutation test or bootstrapping of the null-distribution, and
- perform a full two-sample test, returning either a p-value or a binary rejection decision. This method is most useful in practice. Note that, depending on the test statistic used, it might be faster to call this than to compute threshold and test statistic separately with the above methods.

There are special subclasses for testing two distributions against each other (CTwoSampleTest, CIndependenceTest), kernel two-sample testing (CKernelTwoSampleTest), and kernel independence testing (CKernelIndependenceTest), which however mostly differ in internals and constructors.

$\DeclareMathOperator{\mmd}{MMD}$
An important class of hypothesis tests are the *two-sample tests*.
In two-sample testing, one tries to find out whether two sets of samples come from different distributions. Given two probability distributions $p,q$ on some *arbitrary* domains $\mathcal{X}, \mathcal{Y}$ respectively, and i.i.d. samples $X=\{x_i\}_{i=1}^m\subseteq \mathcal{X}\sim p$ and $Y=\{y_i\}_{i=1}^n\subseteq \mathcal{Y}\sim q$, the two-sample test distinguishes the hypotheses

$$ H_0: p=q \qquad \text{and} \qquad H_A: p\neq q. $$

In order to solve this problem, it is desirable to have a criterion that takes a positive value if $p\neq q$, and zero if and only if $p=q$. The so-called *Maximum Mean Discrepancy (MMD)* has this property and allows one to distinguish any two probability distributions, if used in a *reproducing kernel Hilbert space (RKHS)*. It is the distance of the mean embeddings $\mu_p, \mu_q$ of the distributions $p,q$ in such an RKHS $\mathcal{F}$, which can also be expressed in terms of expectations of kernel functions, i.e.

$$ \mmd[\mathcal{F},p,q]^2=\|\mu_p-\mu_q\|_\mathcal{F}^2=\textbf{E}_{x,x'}\left[k(x,x')\right]-2\textbf{E}_{x,y}\left[k(x,y)\right]+\textbf{E}_{y,y'}\left[k(y,y')\right], $$

where $x,x'\sim p$ and $y,y'\sim q$ are independent.

Note that this formulation does not assume any form of the input data. We just need a kernel function whose feature space is an RKHS, see [2, Section 2] for details. This has the consequence that in Shogun, we can do tests on any type of data (CDenseFeatures, CSparseFeatures, CStringFeatures, etc), as long as we or you provide a positive definite kernel function under the interface of CKernel.

We here only describe how to use the MMD for two-sample testing. Shogun offers two types of test statistic based on the MMD, one with quadratic costs both in time and space, and one with linear time and constant space costs. Both come in different versions and with different methods for approximating the null-distribution in order to construct a two-sample test.

In order to illustrate kernel two-sample testing with Shogun, we use a couple of toy distributions. The first dataset we consider is the 1D Standard Gaussian

$p(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

with mean $\mu$ and variance $\sigma^2$, which is compared against the 1D Laplace distribution

$p(x)=\frac{1}{2b}\exp\left(-\frac{|x-\mu|}{b}\right)$

with the same mean $\mu$ and variance $2b^2$. In order to increase difficulty, we set $b=\sqrt{\frac{1}{2}}$, which means that $2b^2=\sigma^2=1$.

In [2]:

```
# use scipy for generating samples
from scipy.stats import norm, laplace

def sample_gaussian_vs_laplace(n=220, mu=0.0, sigma2=1, b=sqrt(0.5)):
    # sample from both distributions; scipy's scale is the standard deviation
    X=norm.rvs(size=n, loc=mu, scale=sqrt(sigma2))
    Y=laplace.rvs(size=n, loc=mu, scale=b)
    return X,Y
```

In [3]:

```
mu=0.0
sigma2=1
b=sqrt(0.5)
n=220
X,Y=sample_gaussian_vs_laplace(n, mu, sigma2, b)
# plot both densities and histograms
figure(figsize=(18,5))
suptitle("Gaussian vs. Laplace")
subplot(121)
Xs=linspace(-2, 2, 500)
plot(Xs, norm.pdf(Xs, loc=mu, scale=sqrt(sigma2)))
plot(Xs, laplace.pdf(Xs, loc=mu, scale=b))
title("Densities")
xlabel("$x$")
ylabel("$p(x)$")
_=legend([ 'Gaussian','Laplace'])
subplot(122)
hist(X, alpha=0.5)
xlim([-5,5])
ylim([0,100])
hist(Y,alpha=0.5)
xlim([-5,5])
ylim([0,100])
legend(["Gaussian", "Laplace"])
_=title('Histograms')
```

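Both distributions have the same mean and variance, so they cannot be told apart by their first two moments. Kernel two-sample tests instead take into account *all* (infinitely many) higher order moments.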

In [4]:

```
print "Gaussian vs. Laplace"
print "Sample means: %.2f vs %.2f" % (mean(X), mean(Y))
print "Sample variances: %.2f vs %.2f" % (var(X), var(Y))
```

We now describe the quadratic time MMD, as described in [1, Lemma 6], which is implemented in Shogun, and use it on the above toy problem. All methods in this section are implemented in CQuadraticTimeMMD, which accepts any type of features in Shogun.

An unbiased estimate for the MMD expression above can be obtained by estimating expected values with averaging over independent samples

$$ \mmd_u[\mathcal{F},X,Y]^2=\frac{1}{m(m-1)}\sum_{i=1}^m\sum_{j\neq i}^mk(x_i,x_j) + \frac{1}{n(n-1)}\sum_{i=1}^n\sum_{j\neq i}^nk(y_i,y_j)-\frac{2}{mn}\sum_{i=1}^m\sum_{j=1}^nk(x_i,y_j). $$

A biased estimate would be

$$ \mmd_b[\mathcal{F},X,Y]^2=\frac{1}{m^2}\sum_{i=1}^m\sum_{j=1}^mk(x_i,x_j) + \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^nk(y_i,y_j)-\frac{2}{mn}\sum_{i=1}^m\sum_{j=1}^nk(x_i,y_j). $$

Computing the test statistic using CQuadraticTimeMMD does exactly this, where it is possible to choose between the two above expressions. Note that some methods for approximating the null-distribution only work with one of the two types. Both statistics' computational costs are quadratic both in time and space. Note that the method returns $m\mmd_b[\mathcal{F},X,Y]^2$, since the null-distribution approximations are formulated for the statistic scaled by $m$. Here is how the test statistic itself is computed.

In [5]:

```
# turn data into Shogun representation (column vectors)
feat_p=RealFeatures(X.reshape(1,len(X)))
feat_q=RealFeatures(Y.reshape(1,len(Y)))
# choose kernel for testing. Here: Gaussian
kernel_width=1
kernel=GaussianKernel(10, kernel_width)
# create mmd instance of test-statistic
mmd=QuadraticTimeMMD(kernel, feat_p, feat_q)
# compute biased and unbiased test statistic (default is unbiased)
mmd.set_statistic_type(BIASED)
biased_statistic=mmd.compute_statistic()
mmd.set_statistic_type(UNBIASED)
unbiased_statistic=mmd.compute_statistic()
print "%d x MMD_b[X,Y]^2=%.2f" % (len(X), biased_statistic)
print "%d x MMD_u[X,Y]^2=%.2f" % (len(X), unbiased_statistic)
```
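
For readers who want to see the two estimators without Shogun, here is a minimal NumPy sketch. The function names are hypothetical, and the kernel parametrisation $k(x,y)=\exp(-\|x-y\|^2/\text{width})$ is an assumption of this sketch that may differ from the library's convention. Unlike the Shogun method, the values are not multiplied by $m$.

```python
import numpy as np

def gaussian_kernel_matrix(a, b, width=1.0):
    # a has shape (m, d), b has shape (n, d); assumed kernel: exp(-||x-y||^2 / width)
    sq_dists = (np.sum(a ** 2, axis=1)[:, None]
                + np.sum(b ** 2, axis=1)[None, :]
                - 2.0 * a.dot(b.T))
    return np.exp(-np.maximum(sq_dists, 0.0) / width)

def numpy_mmd2(x, y, width=1.0, biased=False):
    m, n = len(x), len(y)
    k_xx = gaussian_kernel_matrix(x, x, width)
    k_yy = gaussian_kernel_matrix(y, y, width)
    k_xy = gaussian_kernel_matrix(x, y, width)
    if biased:
        # all pairs, including the diagonal terms k(x_i, x_i)
        return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()
    # unbiased: leave out the diagonal of the within-sample terms
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * k_xy.mean()

# tiny worked example: identical samples
x_same = np.array([[0.0], [1.0]])
mmd2_biased_same = numpy_mmd2(x_same, x_same, biased=True)
mmd2_unbiased_same = numpy_mmd2(x_same, x_same, biased=False)

# clearly separated samples
y_far = np.array([[5.0], [6.0]])
mmd2_biased_far = numpy_mmd2(x_same, y_far, biased=True)
print("same: biased=%.3f unbiased=%.3f" % (mmd2_biased_same, mmd2_unbiased_same))
```

The example shows that the biased estimate is exactly zero for identical samples, while the unbiased estimate can become negative.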

In [6]:

```
# this is not necessary as bootstrapping is the default
mmd.set_null_approximation_method(PERMUTATION)
mmd.set_statistic_type(UNBIASED)
# to reduce runtime, should be larger in practice
mmd.set_num_null_samples(100)
# now show a couple of ways to compute the test
# compute p-value for computed test statistic
p_value=mmd.compute_p_value(unbiased_statistic)
print "P-value of MMD value %.2f is %.2f" % (unbiased_statistic, p_value)
# compute threshold for rejecting H_0 for a given test level
alpha=0.05
threshold=mmd.compute_threshold(alpha)
print "Threshold for rejecting H0 at test level %.2f is %.2f" % (alpha, threshold)
# performing the test by hand given the above results, note that those two are equivalent
if unbiased_statistic>threshold:
    print "H0 is rejected at test level %.2f" % alpha
if p_value<alpha:
    print "H0 is rejected at test level %.2f" % alpha
# or, compute the full two-sample test directly
# fixed test level, binary decision
binary_test_result=mmd.perform_test(alpha)
if binary_test_result:
    print "H0 is rejected at test level %.2f" % alpha
significance_test_result=mmd.perform_test()
print "P-value of MMD test is %.2f" % significance_test_result
if significance_test_result<alpha:
    print "H0 is rejected at test level %.2f" % alpha
```

In [7]:

```
# precompute kernel to be faster for null sampling
p_and_q=mmd.get_p_and_q()
kernel.init(p_and_q, p_and_q)
precomputed_kernel=CustomKernel(kernel)
mmd.set_kernel(precomputed_kernel)
# increase number of iterations since this should be faster now
mmd.set_num_null_samples(500)
p_value_boot=mmd.perform_test()
print "P-value of MMD test is %.2f" % p_value_boot
```

In [8]:

```
num_samples=500
# sample null distribution
mmd.set_num_null_samples(num_samples)
null_samples=mmd.sample_null()
# sample alternative distribution, generate new data for that
alt_samples=zeros(num_samples)
for i in range(num_samples):
    X=norm.rvs(size=n, loc=mu, scale=sqrt(sigma2))
    Y=laplace.rvs(size=n, loc=mu, scale=b)
    feat_p=RealFeatures(reshape(X, (1,len(X))))
    feat_q=RealFeatures(reshape(Y, (1,len(Y))))
    mmd=QuadraticTimeMMD(kernel, feat_p, feat_q)
    alt_samples[i]=mmd.compute_statistic()
```

We now visualise both distributions. $H_0:p=q$ is rejected if a sample from the alternative distribution is larger than the $(1-\alpha)$-quantile of the null distribution. See [1] for more details on their forms. From the visualisations, we can read off the test's type I and type II error:

- type I error is the area of the null distribution to the right of the threshold
- type II error is the area of the alternative distribution to the left of the threshold

In [9]:

```
def plot_alt_vs_null(alt_samples, null_samples, alpha):
    figure(figsize=(18,5))
    subplot(131)
    hist(null_samples, 50, color='blue')
    title('Null distribution')
    subplot(132)
    title('Alternative distribution')
    hist(alt_samples, 50, color='green')
    subplot(133)
    hist(null_samples, 50, color='blue')
    hist(alt_samples, 50, color='green', alpha=0.5)
    title('Null and alternative distribution')
    # find (1-alpha) element of null distribution
    null_samples_sorted=sort(null_samples)
    quantile_idx=int(len(null_samples)*(1-alpha))
    quantile=null_samples_sorted[quantile_idx]
    # ymin and ymax are given as fractions of the axes height
    axvline(x=quantile, ymin=0, ymax=1, color='red', label=str(int(round((1-alpha)*100))) + '% quantile of null')
    _=legend()
```

In [10]:

```
plot_alt_vs_null(alt_samples, null_samples, alpha)
```
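
The two areas can also be estimated numerically from samples of both distributions. The sketch below uses synthetic Gaussian samples as stand-ins for the null and alternative samples; the function name and the toy data are assumptions for illustration.

```python
import numpy as np

def empirical_errors(null_samples, alt_samples, alpha):
    # threshold is the empirical (1 - alpha) quantile of the null samples
    threshold = np.percentile(null_samples, 100.0 * (1.0 - alpha))
    # type I error: mass of the null distribution right of the threshold
    type_1 = np.mean(np.asarray(null_samples) > threshold)
    # type II error: mass of the alternative distribution left of the threshold
    type_2 = np.mean(np.asarray(alt_samples) <= threshold)
    return threshold, type_1, type_2

rng = np.random.RandomState(0)
null_samples_demo = rng.normal(0.0, 1.0, size=2000)
alt_samples_demo = rng.normal(3.0, 1.0, size=2000)

threshold_demo, type_1_demo, type_2_demo = empirical_errors(
    null_samples_demo, alt_samples_demo, 0.05)
print("threshold=%.2f, type I=%.3f, type II=%.3f"
      % (threshold_demo, type_1_demo, type_2_demo))
```

By construction the type I error estimate matches the chosen level on the null samples themselves, so it is mainly the type II error that carries information here.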

As already mentioned, bootstrapping the null distribution is expensive business. There exist a couple of methods that are more sophisticated and either allow very fast approximations without guarantees or reasonably fast approximations that are consistent. We present a selection from [2], which are implemented in Shogun.

The first one is a spectral method that is based on the Eigenspectrum of the kernel matrix of the joint samples. It is faster than bootstrapping while being a consistent test. Effectively, the null-distribution of the biased statistic is sampled, but in a more efficient way than the bootstrapping approach. The biased statistic converges as

$$ m\mmd^2_b \rightarrow \sum_{l=1}^\infty \lambda_l z_l^2 $$

where $z_l\sim \mathcal{N}(0,2)$ are i.i.d. normal samples and $\lambda_l$ are Eigenvalues of expression 2 in [2], which can be empirically estimated by $\hat\lambda_l=\frac{1}{m}\nu_l$ where $\nu_l$ are the Eigenvalues of the centred kernel matrix of the joint samples $X$ and $Y$. The distribution above can be easily sampled. Shogun's implementation has two parameters:

- Number of samples from null-distribution. The more, the more accurate. As a rule of thumb, use 250.
- Number of Eigenvalues of the Eigen-decomposition of the kernel matrix to use. The more, the better the results get. However, the Eigen-spectrum of the joint gram matrix usually decreases very fast. Plotting the Spectrum can help. See [2] for details.

If the kernel matrices are diagonally dominant, this method is likely to fail. For that and more details, see the original paper. Computational costs are much lower than for bootstrapping, which is the only consistent alternative. Since the Eigenvalues of the gram matrix have to be computed, costs are in $\mathcal{O}(m^3)$.

Below, we illustrate how to sample the null distribution and perform two-sample testing with the Spectrum approximation in the class CQuadraticTimeMMD. This method only works with the biased statistic.

In [11]:

```
# optional: plot spectrum of joint kernel matrix
from numpy.linalg import eigvalsh
# get joint feature object and compute kernel matrix and its spectrum
feats_p_q=mmd.get_p_and_q()
mmd.get_kernel().init(feats_p_q, feats_p_q)
K=mmd.get_kernel().get_kernel_matrix()
# K is symmetric, so eigvalsh applies; sort eigenvalues in descending order
w=eigvalsh(K)[::-1]
# visualise K and its spectrum (only up to threshold)
figure(figsize=(18,5))
subplot(121)
imshow(K, interpolation="nearest")
title("Kernel matrix K of joint data $X$ and $Y$")
subplot(122)
thresh=0.1
plot(w[:len(w[w>thresh])])
_=title("Eigenspectrum of K until component %d" % len(w[w>thresh]))
```

In [12]:

```
# threshold for eigenspectrum
thresh=0.1
# compute number of eigenvalues to use
num_eigen=len(w[w>thresh])
# finally, do the test, use biased statistic
mmd.set_statistic_type(BIASED)
#tell Shogun to use spectrum approximation
mmd.set_null_approximation_method(MMD2_SPECTRUM)
mmd.set_num_eigenvalues_spectrum(num_eigen)
mmd.set_num_samples_spectrum(num_samples)
# the usual test interface
p_value_spectrum=mmd.perform_test()
print "Spectrum: P-value of MMD test is %.2f" % p_value_spectrum
# compare with ground truth bootstrapping
mmd.set_null_approximation_method(PERMUTATION)
mmd.set_num_null_samples(num_samples)
p_value_boot=mmd.perform_test()
print "Bootstrapping: P-value of MMD test is %.2f" % p_value_boot
```
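
To make the sampling step of the spectrum approximation concrete, here is a NumPy sketch that follows the description above: centre the joint kernel matrix, keep the leading Eigenvalues, scale them by $\frac{1}{m}$, and draw weighted sums of squared $\mathcal{N}(0,2)$ variables. All names are hypothetical and the exact normalisation used inside Shogun may differ.

```python
import numpy as np

def spectrum_null_samples(K, m, num_samples=250, num_eigen=10, seed=0):
    # K is the kernel matrix of the joint samples, m the number of samples per distribution
    rng = np.random.RandomState(seed)
    size = K.shape[0]
    # centring matrix H = I - 11^T / size
    H = np.eye(size) - np.ones((size, size)) / float(size)
    # leading eigenvalues of the centred kernel matrix, in descending order
    nu = np.linalg.eigvalsh(H.dot(K).dot(H))[::-1][:num_eigen]
    lambdas = np.maximum(nu, 0.0) / float(m)
    # z_l ~ N(0, 2), i.e. standard deviation sqrt(2)
    z = rng.normal(0.0, np.sqrt(2.0), size=(num_samples, len(lambdas)))
    return (z ** 2).dot(lambdas)

# toy joint sample and Gaussian kernel matrix (assumed form exp(-(x - y)^2))
rng = np.random.RandomState(1)
m_demo = 50
Z_demo = rng.normal(size=(2 * m_demo, 1))
K_demo = np.exp(-(Z_demo - Z_demo.T) ** 2)

spectrum_samples = spectrum_null_samples(K_demo, m_demo)
spectrum_threshold = np.percentile(spectrum_samples, 95.0)
print("95%% threshold of sampled null: %.3f" % spectrum_threshold)
```

The cubic cost mentioned above comes from the Eigenvalue decomposition; drawing the samples afterwards is cheap.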

$\DeclareMathOperator{\var}{var}$ Another method for approximating the null-distribution is to match the first two moments of a Gamma distribution and then compute the quantiles of that distribution. This does not result in a consistent test, but usually gives good results while being very fast. However, there are distributions where the method fails. Therefore, the type I error should always be monitored. The method is described in [2]. It uses

$$ m\mmd_b(Z) \sim \frac{x^{\alpha-1}\exp(-\frac{x}{\beta})}{\beta^\alpha \Gamma(\alpha)} $$

where

$$ \alpha=\frac{(\textbf{E}(\text{MMD}_b(Z)))^2}{\var(\text{MMD}_b(Z))} \qquad \text{and} \qquad \beta=\frac{m \var(\text{MMD}_b(Z))}{\textbf{E}(\text{MMD}_b(Z))} $$

Here, $\alpha$ and $\beta$ denote the shape and scale of the Gamma distribution (not the test level). They are chosen such that its mean $\alpha\beta$ and variance $\alpha\beta^2$ match those of $m\,\text{MMD}_b(Z)$. Then, any threshold and p-value can be computed using the Gamma distribution in the above expression. Computational costs are in $\mathcal{O}(m^2)$. Note that the test is parameter free. It only works with the biased statistic.

In [13]:

```
# tell Shogun to use gamma approximation
mmd.set_null_approximation_method(MMD2_GAMMA)
# the usual test interface
p_value_gamma=mmd.perform_test()
print "Gamma: P-value of MMD test is %.2f" % p_value_gamma
# compare with ground truth bootstrapping
mmd.set_null_approximation_method(PERMUTATION)
p_value_boot=mmd.perform_test()
print "Bootstrapping: P-value of MMD test is %.2f" % p_value_boot
```
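
The moment matching itself is generic and can be sketched in a few lines. Below, `stat_mean` and `stat_var` stand for the mean and variance of the (scaled) statistic under $H_0$; the numbers used are made up for illustration, and the threshold is obtained by Monte Carlo sampling rather than by an exact quantile function.

```python
import numpy as np

def gamma_moment_match(stat_mean, stat_var):
    # a Gamma distribution with shape a and scale s has mean a*s and variance a*s^2
    shape = stat_mean ** 2 / stat_var
    scale = stat_var / stat_mean
    return shape, scale

# hypothetical first two moments of the statistic under H0
gamma_shape, gamma_scale = gamma_moment_match(2.0, 0.5)

# approximate the 95% quantile by sampling from the fitted Gamma distribution
rng = np.random.RandomState(0)
gamma_samples = rng.gamma(gamma_shape, gamma_scale, size=100000)
gamma_threshold = np.percentile(gamma_samples, 95.0)
print("shape=%.2f, scale=%.2f, threshold=%.2f"
      % (gamma_shape, gamma_scale, gamma_threshold))
```

Since only two moments are matched, the tail of the fitted distribution can deviate from the true null-distribution, which is why the type I error should be monitored.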

In [14]:

```
# type I error is false alarm, therefore sample data under H0
from numpy.random import permutation
num_trials=50
rejections_gamma=zeros(num_trials)
rejections_spectrum=zeros(num_trials)
rejections_bootstrap=zeros(num_trials)
num_samples=50
alpha=0.05
for i in range(num_trials):
    X=norm.rvs(size=n, loc=mu, scale=sqrt(sigma2))
    Y=laplace.rvs(size=n, loc=mu, scale=b)
    # simulate H0 by merging and shuffling both samples before splitting them again
    Z=permutation(hstack((X,Y)))
    X=Z[:len(X)]
    Y=Z[len(X):]
    feat_p=RealFeatures(reshape(X, (1,len(X))))
    feat_q=RealFeatures(reshape(Y, (1,len(Y))))
    # gamma
    mmd=QuadraticTimeMMD(kernel, feat_p, feat_q)
    mmd.set_null_approximation_method(MMD2_GAMMA)
    mmd.set_statistic_type(BIASED)
    rejections_gamma[i]=mmd.perform_test(alpha)
    # spectrum
    mmd=QuadraticTimeMMD(kernel, feat_p, feat_q)
    mmd.set_null_approximation_method(MMD2_SPECTRUM)
    mmd.set_num_eigenvalues_spectrum(num_eigen)
    mmd.set_num_samples_spectrum(num_samples)
    mmd.set_statistic_type(BIASED)
    rejections_spectrum[i]=mmd.perform_test(alpha)
    # bootstrap (precompute kernel)
    mmd=QuadraticTimeMMD(kernel, feat_p, feat_q)
    p_and_q=mmd.get_p_and_q()
    kernel.init(p_and_q, p_and_q)
    precomputed_kernel=CustomKernel(kernel)
    mmd.set_kernel(precomputed_kernel)
    mmd.set_null_approximation_method(PERMUTATION)
    mmd.set_num_null_samples(num_samples)
    mmd.set_statistic_type(BIASED)
    rejections_bootstrap[i]=mmd.perform_test(alpha)
```

In [15]:

```
convergence_gamma=cumsum(rejections_gamma)/(arange(num_trials)+1)
convergence_spectrum=cumsum(rejections_spectrum)/(arange(num_trials)+1)
convergence_bootstrap=cumsum(rejections_bootstrap)/(arange(num_trials)+1)
print "Average rejection rate of H0 for Gamma is %.2f" % mean(rejections_gamma)
print "Average rejection rate of H0 for Spectrum is %.2f" % mean(rejections_spectrum)
print "Average rejection rate of H0 for Bootstrapping is %.2f" % mean(rejections_bootstrap)
```
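
When reading these rejection rates, it helps to keep in mind how noisy an estimate from a small number of trials is. Treating each trial as an independent coin flip with rejection probability equal to the test level (an idealising assumption), the standard error of the estimated rate can be computed as follows; the function name is hypothetical.

```python
import math

def rejection_rate_std_error(rate, num_trials):
    # standard error of a binomial proportion
    return math.sqrt(rate * (1.0 - rate) / num_trials)

std_error_50 = rejection_rate_std_error(0.05, 50)
std_error_5000 = rejection_rate_std_error(0.05, 5000)
print("standard error with 50 trials: %.4f" % std_error_50)
print("standard error with 5000 trials: %.4f" % std_error_5000)
```

With 50 trials the standard error is of the same order as the level itself, so many more trials are needed to tell whether an approximation actually keeps the type I error at the desired level.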

So far, we basically had to precompute the kernel matrix for reasonable runtimes. This is not possible for more than a few thousand points. The linear time MMD statistic, implemented in CLinearTimeMMD, can help here, as it accepts data under the streaming interface CStreamingFeatures, which delivers data one-by-one.

And it can do more cool things, for example choose the best single (or combined) kernel for you. But we need a fancier dataset to show its power. We will use one of Shogun's streaming based data generators, CGaussianBlobsDataGenerator, for that. This dataset consists of two distributions which are a grid of Gaussians where in one of them, the Gaussians are stretched and rotated. This dataset is regarded as challenging for two-sample testing.

In [16]:

```
# parameters of dataset
m=20000
distance=10
stretch=5
num_blobs=3
angle=pi/4
# these are streaming features
gen_p=GaussianBlobsDataGenerator(num_blobs, distance, 1, 0)
gen_q=GaussianBlobsDataGenerator(num_blobs, distance, stretch, angle)
# stream some data and plot
num_plot=1000
features=gen_p.get_streamed_features(num_plot)
features=features.create_merged_copy(gen_q.get_streamed_features(num_plot))
data=features.get_feature_matrix()
figure(figsize=(18,5))
subplot(121)
grid(True)
plot(data[0][0:num_plot], data[1][0:num_plot], 'r.', label='$x$')
title('$X\sim p$')
subplot(122)
grid(True)
plot(data[0][num_plot:2*num_plot], data[1][num_plot:2*num_plot], 'b.', label='$y$', alpha=0.5)
_=title('$Y\sim q$')
```

We now describe the linear time MMD, as described in [1, Section 6], which is implemented in Shogun. A fast, unbiased estimate for the original MMD expression which still uses all available data can be obtained by dividing data into two parts and then computing

$$ \mmd_l^2[\mathcal{F},X,Y]=\frac{1}{m_2}\sum_{i=1}^{m_2} k(x_{2i},x_{2i+1})+k(y_{2i},y_{2i+1})-k(x_{2i},y_{2i+1})- k(x_{2i+1},y_{2i}) $$

where $ m_2=\lfloor\frac{m}{2} \rfloor$. While the above expression assumes that $m$ data are available from each distribution, the statistic in general works in an online setting where features are obtained one by one. Since only four points are considered at a time, it can be computed on data streams. In addition, the computational costs are linear in the number of samples that are considered from each distribution. These two properties make the linear time MMD well suited for large scale two-sample tests. In theory, any number of samples can be processed; time is the only limiting factor.
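
The statistic above can be sketched in NumPy for in-memory data as follows. The function names are hypothetical, the pairs are formed from consecutive rows using zero-based indexing, and the kernel form $\exp(-\|x-y\|^2/\text{width})$ is an assumption of this sketch.

```python
import numpy as np

def gaussian_kernel_pairs(a, b, width=1.0):
    # kernel values for matching rows of a and b
    return np.exp(-np.sum((a - b) ** 2, axis=1) / width)

def linear_time_mmd_terms(x, y, width=1.0):
    # split each sample into consecutive pairs (row 0 with 1, row 2 with 3, ...)
    m2 = min(len(x), len(y)) // 2
    x_even, x_odd = x[0:2 * m2:2], x[1:2 * m2:2]
    y_even, y_odd = y[0:2 * m2:2], y[1:2 * m2:2]
    return (gaussian_kernel_pairs(x_even, x_odd, width)
            + gaussian_kernel_pairs(y_even, y_odd, width)
            - gaussian_kernel_pairs(x_even, y_odd, width)
            - gaussian_kernel_pairs(x_odd, y_even, width))

def linear_time_mmd(x, y, width=1.0):
    # average of the per-pair terms; each point is touched exactly once
    return np.mean(linear_time_mmd_terms(x, y, width))

# tiny worked example with one pair per distribution
x_lin = np.array([[0.0], [0.0]])
y_lin = np.array([[1.0], [1.0]])
mmd_l_different = linear_time_mmd(x_lin, y_lin)
mmd_l_same = linear_time_mmd(x_lin, x_lin)
print("different: %.3f, same: %.3f" % (mmd_l_different, mmd_l_same))
```

Each term only needs two points from each distribution, which is why the same computation can run on a stream without storing a kernel matrix.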

We begin by illustrating how to pass data to CLinearTimeMMD. In order not to lose performance due to overhead, it is possible to specify a block size for the data stream.

In [17]:

```
block_size=100
# if features are already under the streaming interface, just pass them
mmd=LinearTimeMMD(kernel, gen_p, gen_q, m, block_size)
# compute an unbiased estimate in linear time
statistic=mmd.compute_statistic()
print "MMD_l[X,Y]^2=%.2f" % statistic
# note: due to the streaming nature, successive calls of compute_statistic use different data
# and produce different results. Data cannot be stored in memory
for _ in range(5):
    print "MMD_l[X,Y]^2=%.2f" % mmd.compute_statistic()
```

In [18]:

```
# data source
gen_p=GaussianBlobsDataGenerator(num_blobs, distance, 1, 0)
gen_q=GaussianBlobsDataGenerator(num_blobs, distance, stretch, angle)
# retrieve some points, store them as non-streaming data in memory
data_p=gen_p.get_streamed_features(100)
data_q=gen_q.get_streamed_features(data_p.get_num_vectors())
print "Number of data is %d" % data_p.get_num_vectors()
# cast data in memory as streaming features again (which now stream from the in-memory data)
streaming_p=StreamingRealFeatures(data_p)
streaming_q=StreamingRealFeatures(data_q)
# it is important to start the internal parser to avoid deadlocks
streaming_p.start_parser()
streaming_q.start_parser()
# example to create mmd (note that m can be maximum the number of data in memory)
mmd=LinearTimeMMD(GaussianKernel(10,1), streaming_p, streaming_q, data_p.get_num_vectors(), 1)
print "Linear time MMD statistic: %.2f" % mmd.compute_statistic()
```

As for any two-sample test in Shogun, bootstrapping can be used to approximate the null distribution. This results in a consistent, but slow test. The number of samples to take is the only parameter. Note that since CLinearTimeMMD operates on streaming features, *new* data is taken from the stream in every iteration.

Bootstrapping is not really necessary since there exists a fast and consistent estimate of the null-distribution. However, to ensure that any approximation is accurate, it should always be checked against bootstrapping at least once.

Since both the null- and the alternative distribution of the linear time MMD are (asymptotically) Gaussian with equal variance (and different mean), it is possible to approximate the null-distribution by using a linear time estimate for this variance. An unbiased, linear time estimator for

$$ \var[\mmd_l^2[\mathcal{F},X,Y]] $$

can simply be obtained by computing the empirical variance of the terms

$$ k(x_{2i},x_{2i+1})+k(y_{2i},y_{2i+1})-k(x_{2i},y_{2i+1})-k(x_{2i+1},y_{2i}) \qquad (1\leq i\leq m_2) $$

and dividing it by $m_2$, since the statistic is the average of these terms. A normal distribution with this variance and zero mean can then be used as an approximation for the null-distribution. This results in a consistent test and is very fast. However, note that it is an approximation and its accuracy depends on the underlying data distributions. It is a good idea to compare to the bootstrapping approach first to determine an appropriate number of samples to use. This number is usually in the tens of thousands.
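
As a sketch of this approximation, the following function takes the per-pair terms as input and returns the statistic, the standard deviation of the approximating normal distribution, and the resulting p-value. The function name and the toy inputs are assumptions; the variance of the averaged statistic is taken to be the sample variance of the terms divided by their number.

```python
import math
import numpy as np

def gaussian_null_p_value(h_terms):
    h_terms = np.asarray(h_terms, dtype=float)
    m2 = len(h_terms)
    statistic = h_terms.mean()
    # variance of the mean: unbiased sample variance of the terms divided by m2
    std_dev = math.sqrt(h_terms.var(ddof=1) / m2)
    # upper tail probability of N(0, std_dev^2) at the observed statistic
    p_value = 0.5 * math.erfc(statistic / (std_dev * math.sqrt(2.0)))
    return statistic, std_dev, p_value

# hypothetical terms that average to zero: no evidence against H0
_, _, p_value_zero = gaussian_null_p_value([1.0, -1.0, 1.0, -1.0])

# hypothetical terms with a clearly positive mean and small spread
_, _, p_value_shifted = gaussian_null_p_value([0.09, 0.11] * 50)
print("p-values: %.2f and %.2g" % (p_value_zero, p_value_shifted))
```

Statistic and variance are computed from the same terms, so both can be obtained in a single pass over the data.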

CLinearTimeMMD allows one to approximate the null distribution in the same pass as computing the statistic itself (in linear time). This should always be used in practice, since separate calls for computing statistic and p-value will operate on different data from the stream. Below, we compute the test on a large amount of data (it is impossible to perform the quadratic time MMD for this one as the kernel matrices cannot be stored in memory).

In [19]:

```
mmd=LinearTimeMMD(kernel, gen_p, gen_q, m, block_size)
print "m=%d samples from p and q" % m
print "Binary test result is: " + ("Rejection" if mmd.perform_test(alpha) else "No rejection")
print "P-value test result is %.2f" % mmd.perform_test()
```

$\DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max}$ Now which kernel do we actually use for our tests? So far, we just plugged in arbitrary ones. However, for kernel two-sample testing, it is possible to do something more clever.

Shogun's kernel selection methods for MMD based two-sample tests are all based around [3, 4]. For the CLinearTimeMMD, [3] describes a way of selecting the *optimal* kernel in the sense that the test's type II error is minimised. For the linear time MMD, this is the method of choice. It is done via maximising the MMD statistic divided by its standard deviation, and it is possible for single kernels and also for convex combinations of them. For the CQuadraticTimeMMD, the best method in the literature is choosing the kernel that maximises the MMD statistic [4]. For convex combinations of kernels, this can be achieved via an $L^2$ norm constraint. A detailed comparison of all methods on numerous datasets can be found in [5].

MMD kernel selection in Shogun always involves an implementation of the base class CMMDKernelSelection, which defines the interface for kernel selection. If combinations of kernels should be considered, there is a sub-class CMMDKernelSelectionComb. In addition, it involves setting up a number of baseline kernels $\mathcal{K}$ to choose from/combine in the form of a CCombinedKernel. All methods compute their results for a fixed set of these baseline kernels. We later give an example of how to use these classes after providing a list of available methods.

CMMDKernelSelectionMedian: Selects from a set of CGaussianKernel instances the one whose width parameter is closest to the median of the pairwise distances in the data. The median is computed on a certain number of points from each distribution that can be specified as a parameter. Since the median is a stable statistic, one does not have to compute all pairwise distances but rather just a few thousand. This method is a useful (and fast) heuristic that in many cases gives a good hint on where to start looking for Gaussian kernel widths. It is for example described in [1]. Note that it may fail badly in selecting a good kernel for certain problems.

CMMDKernelSelectionMax: Selects from a set of arbitrary baseline kernels a single one that maximises the used MMD statistic, or more specifically its estimate. $$ k^*=\argmax_{k\in\mathcal{K}} \hat \eta_k, $$ where $\hat\eta_k$ is an empirical MMD estimate for using a kernel $k$. This was first described in [4] and was empirically shown to perform better than the median heuristic above. However, it remains a heuristic that comes with no guarantees. Since MMD estimates can be computed in linear and quadratic time, this method works for both statistics. However, for the linear time statistic, there exists a better method.

CMMDKernelSelectionOpt: Selects the optimal single kernel from a set of baseline kernels. This is done via maximising the ratio of the linear MMD statistic and its standard deviation. $$ k^*=\argmax_{k\in\mathcal{K}} \frac{\hat \eta_k}{\hat\sigma_k+\lambda}, $$ where $\hat\eta_k$ is a linear time MMD estimate for using a kernel $k$ and $\hat\sigma_k$ is a linear time estimate of the standard deviation of $\hat\eta_k$, to which a small number $\lambda$ is added to prevent division by zero. These are estimated in a linear time way with the streaming framework that was described earlier. Therefore, this method is only available for CLinearTimeMMD. Optimal here means that the resulting test's type II error is minimised for a fixed type I error.

*Important:* For this method to work, the kernel needs to be selected on *different* data than the test is performed on. Otherwise, the method will produce wrong results.

CMMDKernelSelectionCombMaxL2: Selects a convex combination of kernels that maximises the MMD statistic. This is the multiple kernel analogue of CMMDKernelSelectionMax. This is done via solving the convex program $$ \boldsymbol{\beta}^*=\argmin_{\boldsymbol{\beta}} \{\boldsymbol{\beta}^T\boldsymbol{\beta} : \boldsymbol{\beta}^T\boldsymbol{\eta}=1, \boldsymbol{\beta}\succeq 0\}, $$ where $\boldsymbol{\beta}$ is a vector of the resulting kernel weights and $\boldsymbol{\eta}$ is a vector of which each component contains an MMD estimate for a baseline kernel. See [3] for details. Note that this method is unable to select a single kernel, even when this would be optimal. Again, when using the linear time MMD, there are better methods available.

CMMDKernelSelectionCombOpt: Selects a convex combination of kernels that maximises the MMD statistic divided by its standard deviation. This corresponds to *optimal* kernel selection in the same sense as in class CMMDKernelSelectionOpt and is its multiple kernel analogue. The convex program to solve is $$ \boldsymbol{\beta}^*=\argmin_{\boldsymbol{\beta}} \{\boldsymbol{\beta}^T(\hat Q+\lambda I)\boldsymbol{\beta} : \boldsymbol{\beta}^T\boldsymbol{\eta}=1, \boldsymbol{\beta}\succeq 0\}, $$ where again $\boldsymbol{\beta}$ is a vector of the resulting kernel weights and $\boldsymbol{\eta}$ is a vector of which each component contains an MMD estimate for a baseline kernel. The matrix $\hat Q$ is a linear time estimate of the covariance matrix of the vector $\boldsymbol{\eta}$, to whose diagonal a small number $\lambda$ is added to prevent division by zero. See [3] for details. In contrast to CMMDKernelSelectionCombMaxL2, this method is able to select a single kernel when this gives a lower type II error than a combination. In this sense, it contains CMMDKernelSelectionOpt.
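
The single kernel criteria (median distance, maximum MMD, and the ratio criterion) can be illustrated with a small NumPy sketch on in-memory toy data. All function names, the toy data, the candidate widths, and the kernel form $\exp(-\|x-y\|^2/\text{width})$ are assumptions made for this illustration and do not reproduce Shogun's implementation. The standard deviation of the per-pair terms is used for $\hat\sigma_k$; it differs from the standard deviation of the averaged statistic only by a factor that is the same for all kernels.

```python
import numpy as np

def median_pairwise_distance(z):
    # median Euclidean distance over all distinct pairs of rows of z
    diffs = z[:, None, :] - z[None, :, :]
    dists = np.sqrt(np.sum(diffs ** 2, axis=2))
    return np.median(dists[np.triu_indices(len(z), k=1)])

def linear_time_terms(x, y, width):
    # per-pair terms of the linear time MMD for a Gaussian kernel
    m2 = min(len(x), len(y)) // 2
    xe, xo = x[0:2 * m2:2], x[1:2 * m2:2]
    ye, yo = y[0:2 * m2:2], y[1:2 * m2:2]
    def k(a, b):
        return np.exp(-np.sum((a - b) ** 2, axis=1) / width)
    return k(xe, xo) + k(ye, yo) - k(xe, yo) - k(xo, ye)

def select_width(x, y, widths, criterion="opt", lam=1e-5):
    scores = []
    for width in widths:
        h = linear_time_terms(x, y, width)
        if criterion == "max":
            # maximise the MMD estimate itself
            scores.append(h.mean())
        else:
            # maximise the estimate divided by its (regularised) standard deviation
            scores.append(h.mean() / (h.std(ddof=1) + lam))
    scores = np.array(scores)
    return widths[int(np.argmax(scores))], scores

rng = np.random.RandomState(0)
x_sel = rng.normal(0.0, 1.0, size=(2000, 2))
y_sel = rng.normal(0.5, 1.0, size=(2000, 2))
candidate_widths = [2.0 ** e for e in range(-3, 4)]

median_dist = median_pairwise_distance(np.vstack([x_sel[:100], y_sel[:100]]))
width_max, scores_max = select_width(x_sel, y_sel, candidate_widths, criterion="max")
width_opt, scores_opt = select_width(x_sel, y_sel, candidate_widths, criterion="opt")
print("median distance %.2f, max picks %.3f, opt picks %.3f"
      % (median_dist, width_max, width_opt))
```

In an actual test, the data used for selecting the width would have to be kept separate from the data the test is then performed on.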

In order to use one of the above methods for kernel selection, one has to create a new instance of CCombinedKernel and append all desired baseline kernels to it. This combined kernel is then passed to the MMD class. Then, an object of any of the above kernel selection methods is created and the MMD instance is passed to it in the constructor. There are then multiple methods to call:

- *compute_measures* computes a vector of kernel selection criteria if a single kernel selection method is used. It returns a vector of selected kernel weights if a combined kernel selection method is used. For CMMDKernelSelectionMedian, the method throws an error.
- *select_kernel* returns the selected kernel of the method. For single kernels this will be one of the baseline kernel instances. For the combined kernel case, this will be the underlying CCombinedKernel instance where the subkernel weights are set to the weights that were selected by the method.

In order to utilise the selected kernel, it has to be passed to an MMD instance. We now give an example of how to select the optimal single and combined kernel for the Gaussian Blobs dataset.

In [20]:

```
sigmas=[2**x for x in linspace(-5,5, 10)]
print "Choosing kernel width from", ["{0:.2f}".format(sigma) for sigma in sigmas]
combined=CombinedKernel()
for i in range(len(sigmas)):
    combined.append_kernel(GaussianKernel(10, sigmas[i]))
# mmd instance using streaming features
block_size=1000
mmd=LinearTimeMMD(combined, gen_p, gen_q, m, block_size)
# optimal kernel choice is possible for linear time MMD
selection=MMDKernelSelectionOpt(mmd)
# select best kernel
best_kernel=selection.select_kernel()
best_kernel=GaussianKernel.obtain_from_generic(best_kernel)
print "Best single kernel has bandwidth %.2f" % best_kernel.get_width()
```

Now perform the two-sample test with that kernel.

In [21]:

```
alpha=0.05
mmd=LinearTimeMMD(best_kernel, gen_p, gen_q, m, block_size)
mmd.set_null_approximation_method(MMD1_GAUSSIAN)
p_value_best=mmd.perform_test()
print "Gaussian approximation: P-value of MMD test with optimal kernel is %.2f" % p_value_best
```

In [22]:

```
mmd=LinearTimeMMD(best_kernel, gen_p, gen_q, 5000, block_size)
num_samples=500
# sample null and alternative distribution, implicitly generate new data for that
null_samples=zeros(num_samples)
alt_samples=zeros(num_samples)
for i in range(num_samples):
    alt_samples[i]=mmd.compute_statistic()
    # tell MMD to merge data internally while streaming
    mmd.set_simulate_h0(True)
    null_samples[i]=mmd.compute_statistic()
    mmd.set_simulate_h0(False)
```

In [23]:

```
plot_alt_vs_null(alt_samples, null_samples, alpha)
```

- Two-sample tests on strings
- Two-sample tests on audio data (quite fun)
- Testing for independence with the Hilbert Schmidt Independence Criterion

[1]: Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A Kernel Two-Sample Test. Journal of Machine Learning Research, 13, 671–721.

[2]: Gretton, A., Fukumizu, K., Harchaoui, Z., & Sriperumbudur, B. K. (2009). A fast, consistent kernel two-sample test. In Advances in Neural Information Processing Systems (pp. 673–681).

[3]: Gretton, A., Sriperumbudur, B., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., & Fukumizu, K. (2012). Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems.

[4]: Sriperumbudur, B., Fukumizu, K., Gretton, A., Lanckriet, G. R. G., & Schölkopf, B. (2009). Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems.

[5]: Strathmann, H. (2012). Adaptive Large-Scale Kernel Two-Sample Testing. M.Sc. thesis, University College London.
