Abstract base class that provides an interface for performing a kernel two-sample test on streaming data using the Maximum Mean Discrepancy (MMD) as the test statistic. The MMD is the distance between two probability distributions \(p\) and \(q\) in an RKHS (see [1] for a formal description).
\[ \text{MMD}[\mathcal{F},p,q]^2=\textbf{E}_{x,x'}\left[ k(x,x')\right] - 2\textbf{E}_{x,y}\left[ k(x,y)\right] +\textbf{E}_{y,y'}\left[ k(y,y')\right]=\left\| \mu_p - \mu_q \right\|^2_\mathcal{F} \]
where \(x,x'\sim p\) and \(y,y'\sim q\). The data has to be provided as streaming features, which are processed in blocks for a given blocksize. The blocksize determines how many examples are processed at once. A method for getting a specified number of blocks of data is provided which can optionally merge and permute the data within the current burst. The exact computation of kernel functions for MMD computation is abstract and has to be defined by its subclasses, which should return a vector of function values. Please note that for streaming MMD, the number of data points from both the distributions has to be equal.
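As an illustration of how the expectation formula turns into a computation on one block, here is a minimal, library-independent sketch of a linear-time estimate of the squared MMD. It is not the Shogun implementation: the names `gaussian_kernel` and `linear_time_mmd2`, the scalar samples, and the fixed Gaussian bandwidth are all assumptions made for the example. Consecutive samples are paired and the quantity \(h = k(x,x') + k(y,y') - k(x,y') - k(x',y)\) is averaged over the pairs.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Gaussian kernel on scalars; the bandwidth `sigma` is an illustrative choice.
double gaussian_kernel(double a, double b, double sigma)
{
    const double d = a - b;
    return std::exp(-d * d / (2.0 * sigma * sigma));
}

// Linear-time estimate of MMD^2 on one block (hypothetical helper):
// consecutive samples are paired and
// h = k(x,x') + k(y,y') - k(x,y') - k(x',y) is averaged over all pairs.
double linear_time_mmd2(const std::vector<double>& x,
                        const std::vector<double>& y,
                        double sigma)
{
    assert(x.size() == y.size()); // equal sample counts from p and q
    assert(x.size() >= 2);        // at least one pair is needed
    const std::size_t pairs = x.size() / 2;
    double sum = 0.0;
    for (std::size_t i = 0; i < pairs; ++i)
    {
        const double x1 = x[2 * i], x2 = x[2 * i + 1];
        const double y1 = y[2 * i], y2 = y[2 * i + 1];
        sum += gaussian_kernel(x1, x2, sigma) + gaussian_kernel(y1, y2, sigma)
             - gaussian_kernel(x1, y2, sigma) - gaussian_kernel(x2, y1, sigma);
    }
    return sum / static_cast<double>(pairs);
}
```

Identical inputs give an estimate of zero, while well-separated samples give a value close to 2 for this kernel. Because the estimate is unbiased, it can also come out slightly negative on finite data.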
Along with the statistic comes a method to compute a p-value based on a Gaussian approximation of the null distribution, which is possible in linear time and constant space. Sampling from the null distribution is also possible (no permutations are performed; new examples are used instead). If unsure which one to use, sampling with 250 iterations is always valid (but slow). When the sample size is large (more than about 1000), the Gaussian approximation is accurate and much faster.
To select a method, call set_null_approximation_method() with one of:
MMD1_GAUSSIAN: Approximates the null distribution with a Gaussian. Only use with at least 1000 samples. If used, check that the type I error matches the desired level.
PERMUTATION: Samples the null distribution by permuting the available samples.
For kernel selection see CMMDKernelSelection.
[1]: Gretton, A., Borgwardt, K. M., Rasch, M. J., Schoelkopf, B., & Smola, A. (2012). A Kernel Two-Sample Test. Journal of Machine Learning Research, 13, 671-721.
Protected Member Functions  
virtual SGVector< float64_t >  compute_squared_mmd (CKernel *kernel, CList *data, index_t num_this_run)=0 
virtual TParameter *  migrate (DynArray< TParameter * > *param_base, const SGParamInfo *target) 
virtual void  one_to_one_migration_prepare (DynArray< TParameter * > *param_base, const SGParamInfo *target, TParameter *&replacement, TParameter *&to_migrate, char *old_name=NULL) 
virtual void  load_serializable_pre () throw (ShogunException) 
virtual void  load_serializable_post () throw (ShogunException) 
virtual void  save_serializable_pre () throw (ShogunException) 
virtual void  save_serializable_post () throw (ShogunException) 
CStreamingMMD  (  ) 
default constructor
CStreamingMMD  (  CKernel *  kernel, 
CStreamingFeatures *  p,  
CStreamingFeatures *  q,  
index_t  m,  
index_t  blocksize = 10000 

) 
constructor.
kernel  kernel to use 
p  streaming features p to use 
q  streaming features q to use 
m  number of samples from each distribution 
blocksize  number of examples that are processed at once when computing statistic/threshold. 
virtual 
destructor
Computes a p-value based on the current method for approximating the null distribution. The p-value is the 1-p quantile of the null distribution in which the given statistic lies.
The method for computing the p-value can be set via set_null_approximation_method(). Since the null distribution is normal, a Gaussian approximation is available.
statistic  statistic value to compute the p-value for 
Reimplemented from CTwoSampleTest.
Definition at line 119 of file StreamingMMD.cpp.
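To make the Gaussian approximation concrete, the following sketch computes a p-value as the upper tail of a zero-mean normal distribution. It is an assumption-laden illustration rather than the code in StreamingMMD.cpp: the function name `gaussian_p_value` is hypothetical, and it assumes the null standard deviation is the square root of the per-term variance divided by the number of averaged terms.

```cpp
#include <cassert>
#include <cmath>

// p-value of `statistic` under a zero-mean Gaussian null whose standard
// deviation is sqrt(variance / num_terms). Hypothetical helper, not Shogun API.
double gaussian_p_value(double statistic, double variance, double num_terms)
{
    assert(variance > 0.0 && num_terms > 0.0);
    const double std_dev = std::sqrt(variance / num_terms);
    // Upper tail 1 - Phi(z), written with erfc for accuracy in the far tail.
    return 0.5 * std::erfc(statistic / (std_dev * std::sqrt(2.0)));
}
```

A statistic of zero yields a p-value of 0.5, and larger statistics yield smaller p-values; the null hypothesis is rejected when the p-value falls below the chosen test level.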

protectedpure virtual 
abstract method that computes the squared MMD
kernel  the kernel to be used for computing MMD. This will be useful when multiple kernels are used 
data  the list of data on which kernels are computed. The order of data in the list is \(x,x',\cdots\sim p\) followed by \(y,y',\cdots\sim q\). It is assumed that the delete_data flag is set inside the list 
num_this_run  number of data points in current blocks 
virtual 
Computes the squared MMD for the current data. This is an unbiased estimate. This method relies on compute_statistic_and_variance, which has to be defined in the subclasses.
Note that the underlying streaming feature parser has to be started before this is called. Otherwise a deadlock occurs.
Implements CKernelTwoSampleTest.
Same as compute_statistic(), but with the possibility to evaluate multiple kernels at once.
multiple_kernels  if true, and underlying kernel is K_COMBINED, method will be executed on all subkernels on the same data 
Implements CKernelTwoSampleTest.
pure virtual 
Same as compute_statistic_and_variance, but computes a linear time estimate of the covariance of the multiple-kernel MMD. See [1] for details.
Implemented in CLinearTimeMMD.

pure virtual 
Abstract method that computes MMD and a linear time variance estimate. If multiple_kernels is set to true, each subkernel is evaluated on the same data.
statistic  return parameter for statistic, vector with entry for each kernel. May be allocated before but does not have to be 
variance  return parameter for variance, vector with entry for each kernel. May be allocated before but does not have to be 
multiple_kernels  optional flag, if set to true, it is assumed that the underlying kernel is of type K_COMBINED. Then, the MMD is computed on all subkernels separately rather than on the combination. This is used by kernel selection strategies that need to evaluate multiple kernels on the same data. Since the linear time MMD works on streaming data, one cannot simply compute the MMD, change the kernel and compute it again, because the data would be different for every kernel. 
Implemented in CLinearTimeMMD.
Computes a threshold based on the current method for approximating the null distribution. The threshold is the value that a statistic has to have in order to reject the null hypothesis.
The method for computing the threshold can be set via set_null_approximation_method(). Since the null distribution is normal, a Gaussian approximation is available.
alpha  test level to reject the null hypothesis 
Reimplemented from CTwoSampleTest.
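The relation between test level and threshold under a Gaussian null can be sketched as follows. This is a hedged illustration, not the library's code: `upper_tail` and `gaussian_threshold` are hypothetical names, the null is assumed to be zero-mean with a known standard deviation, and bisection is used only because the C++ standard library offers no inverse normal CDF.

```cpp
#include <cassert>
#include <cmath>

// Upper-tail probability of a zero-mean Gaussian with standard deviation std_dev.
double upper_tail(double t, double std_dev)
{
    return 0.5 * std::erfc(t / (std_dev * std::sqrt(2.0)));
}

// Statistic value whose upper-tail probability equals alpha (hypothetical helper).
double gaussian_threshold(double alpha, double std_dev)
{
    assert(alpha > 0.0 && alpha < 1.0 && std_dev > 0.0);
    double lo = -40.0 * std_dev; // tail probability is ~1 here
    double hi = 40.0 * std_dev;  // tail probability is ~0 here
    for (int i = 0; i < 200; ++i)
    {
        const double mid = 0.5 * (lo + hi);
        // upper_tail decreases in t: too much tail mass means t is too small.
        if (upper_tail(mid, std_dev) > alpha)
            lo = mid;
        else
            hi = mid;
    }
    return 0.5 * (lo + hi);
}
```

With a standard deviation of 1 and a level of 0.05 this returns roughly 1.645, the familiar one-sided normal quantile; a statistic above the threshold rejects the null hypothesis.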
virtual 
Computes a linear time estimate of the variance of the squared MMD, which may be used for an approximation of the null distribution. The value is the variance of the vector of which the MMD is the mean.
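One way to obtain the mean and the variance of such a vector in a single pass with constant memory is Welford's running update, sketched below. The struct and function names are hypothetical, and the unbiased (n - 1) normalisation is an assumption; the library may normalise differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct MeanAndVariance
{
    double mean;     // plays the role of the MMD estimate
    double variance; // sample variance of the individual terms
};

// Welford's single-pass update: each term is visited once and only three
// scalars are kept, which suits data that arrives as a stream.
MeanAndVariance running_mean_variance(const std::vector<double>& h)
{
    assert(h.size() >= 2);
    double mean = 0.0, m2 = 0.0;
    std::size_t n = 0;
    for (double value : h)
    {
        ++n;
        const double delta = value - mean;
        mean += delta / static_cast<double>(n);
        m2 += delta * (value - mean);
    }
    // Unbiased normalisation (assumption for this sketch).
    return {mean, m2 / static_cast<double>(n - 1)};
}
```

For the terms 1, 2, 3, 4 this yields a mean of 2.5 and a variance of 5/3.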
virtual 
Implements CKernelTwoSampleTest.
Reimplemented in CLinearTimeMMD.
virtual 
Not implemented for streaming MMD since it uses streaming features.
Reimplemented from CTwoSampleTest.
Definition at line 307 of file StreamingMMD.cpp.

pure virtualinherited 
returns the statistic type of this test statistic
Implemented in CQuadraticTimeMMD, CNOCCO, CHSIC, and CLinearTimeMMD.

virtual 
Getter for streaming features of p distribution.
Definition at line 314 of file StreamingMMD.cpp.

virtual 
Getter for streaming features of q distribution.
Definition at line 320 of file StreamingMMD.cpp.

virtual 
Performs the complete two-sample test on current data and returns a p-value.
If the null distribution is estimated with MMD1_GAUSSIAN, statistic and p-value are computed in the same loop, which is more efficient than first computing the statistic and then computing the p-value.
If the null distribution is sampled, the superclass method is called.
The method for computing the p-value can be set via set_null_approximation_method().
Reimplemented from CHypothesisTest.
Definition at line 165 of file StreamingMMD.cpp.

Mimics sampling the null distribution for MMD. However, samples are not permuted but constantly streamed and then merged. Usually, this is not necessary since there is the Gaussian approximation for the null distribution. However, in certain cases the approximation may fail, and sampling the null distribution might be numerically more stable. Overrides the superclass method that merges samples.
Reimplemented from CKernelTwoSampleTest.
Definition at line 194 of file StreamingMMD.cpp.

void set_blocksize  (  index_t  blocksize  ) 
Setter for the blocksize of examples to be processed at once
blocksize  new blocksize to use 
virtual 
Not implemented for streaming MMD since it uses streaming features.
Reimplemented from CTwoSampleTest.
void set_simulate_h0  (  bool  simulate_h0  ) 
simulate_h0  if true, samples from p and q will be mixed and permuted 
Definition at line 263 of file StreamingMMD.h.

Streams num_blocks data from each distribution with blocks of size num_this_run. If m_simulate_h0 is set, it merges the blocks together, shuffles and redistributes between the blocks.
num_blocks  number of blocks to be streamed from each distribution 
num_this_run  number of data points to be streamed for one block 
Definition at line 220 of file StreamingMMD.cpp.
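The merge, shuffle and redistribute step that simulates the null hypothesis can be pictured with the following standalone sketch. It does not use Shogun types: `Block` as a vector of scalars and the function `merge_shuffle_redistribute` are assumptions for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

using Block = std::vector<double>;

// Pools the samples of all blocks (from both distributions), shuffles the
// pool, and deals the samples back into blocks of the original sizes.
void merge_shuffle_redistribute(std::vector<Block>& blocks, std::mt19937& rng)
{
    std::vector<double> pooled;
    for (const Block& block : blocks)
        pooled.insert(pooled.end(), block.begin(), block.end());

    std::shuffle(pooled.begin(), pooled.end(), rng);

    auto source = pooled.cbegin();
    for (Block& block : blocks)
        for (double& value : block)
            value = *source++;
    assert(source == pooled.cend()); // every pooled sample was handed back
}
```

After the call each block has its original size, but its contents are a random mix of samples from both distributions, so a statistic computed on the blocks behaves as if p and q were equal.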

