# Ideas for Google Summer of Code 2011 Projects

Shogun is a software library for machine learning algorithms, with a special focus on large-scale applications. This summer we are looking to extend the library in four different ways: improving interfaces to other machine learning libraries (or integrating them when appropriate), improved i/o support, framework improvements and new machine learning algorithms. Below is a set of suggested projects.


## Interfacing or integrating other Machine Learning Libraries

• #### Interface the Open Kernel Library

Mentor: Cyril Allauzen
Difficulty: Medium
Requirements: C++ programming, machine learning

The OpenKernel library is an open-source software library for designing, combining, learning and using kernels. The task is to implement a C++ interface allowing the use in Shogun of kernels created or learned using OpenKernel. To this end you will work closely with the maintainers of the OpenKernel library and shogun to
1. Create a shared library interface for the kernels in the open kernel library
2. Following shogun's developer tutorial you will develop a class COpenKernel derived from CKernel to interface to the open kernel library kernels.
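The wrapper in step 2 is essentially the adapter pattern. A minimal Python sketch of the idea (the actual task is a C++ class; all names here are illustrative, not shogun's or OpenKernel's real API): an externally defined kernel function is exposed behind a common kernel base class.

```python
import math

class Kernel:
    """Stand-in for a CKernel-style base class (hypothetical API)."""
    def compute(self, x, y):
        raise NotImplementedError

class ExternalKernel(Kernel):
    """Adapter exposing an externally defined kernel function, analogous
    to a COpenKernel class wrapping kernels from the OpenKernel library."""
    def __init__(self, external_fn):
        self.external_fn = external_fn

    def compute(self, x, y):
        # delegate the kernel evaluation to the external implementation
        return self.external_fn(x, y)

def external_rbf(x, y, gamma=0.5):
    """Placeholder for a kernel created or learned in an external library."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

k = ExternalKernel(external_rbf)
print(k.compute([1.0, 0.0], [1.0, 0.0]))  # identical inputs give 1.0
```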
• #### Integrate updated versions of liblinear and svmocas.

Mentor: Soeren Sonnenburg
Difficulty: Medium
Requirements: C++ programming, machine learning

New versions of liblinear and libocas are available that now support multiclass classification. The task is to
1. Merge the newer versions into shogun
2. Integrate the new multi-class linear SVMs utilizing shogun's COFFIN framework implemented in CDotFeatures
3. Implement the Local Binary Pattern (LBP) features, useful in gender recognition based on face detectors
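The LBP descriptor in step 3 compares each pixel to its eight neighbours and packs the comparisons into one byte. A rough Python sketch of the basic 8-neighbour code (the actual task is a C++ feature class):

```python
def lbp_code(img, r, c):
    """8-neighbour Local Binary Pattern code for pixel (r, c):
    one bit per neighbour, set when the neighbour is >= the centre."""
    center = img[r][c]
    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                  (1, 1), (1, 0), (1, -1), (0, -1)]  # clockwise from top-left
    code = 0
    for bit, (dr, dc) in enumerate(neighbours):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code

flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
print(lbp_code(flat, 1, 1))  # flat region: all 8 bits set -> 255
```

A full LBP feature would collect histograms of these codes over image blocks; this sketch shows only the per-pixel code.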
• #### Integrate Vowpal Wabbit

Mentor: John Langford, Soeren Sonnenburg
Difficulty: Medium
Requirements: C++ programming, basic machine learning

Vowpal Wabbit (http://hunch.net/~vw/) is a fast SGD implementation optimized for low wall-clock time; it therefore tunes the combination of data loading and learning. The code is available from github:
git clone git://github.com/JohnLangford/vowpal_wabbit.git
Investigate how this can be integrated into shogun without sacrificing speed. To do this efficiently it might be necessary to first develop a framework for online features. The outlined plan would be
1. Create a StreamingFeatures C++ class that all of vw's algorithms use for loading examples.
2. Define a StreamingMethod base class and derive vw's algorithms from there.
3. Put miscellaneous vw functions in a util class
4. Extract the vw command-line interface and turn it into a separate tool that links to libshogun with vw's algorithms.
5. Then convert the online/streaming algorithms (liblinear,...) that are in shogun to this framework
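Steps 1 and 2 above could look roughly like the following Python sketch (the real implementation would be C++ classes; all names here are illustrative): examples are pulled one at a time from any line-oriented source, so the total number of examples never needs to be known.

```python
class StreamingFeatures:
    """Pulls labelled examples one at a time from any iterable of text
    lines (file, pipe, socket wrapper); None signals end of stream."""
    def __init__(self, source):
        self._lines = iter(source)

    def get_next_example(self):
        line = next(self._lines, None)
        if line is None or not line.split():
            return None
        parts = line.split()
        return float(parts[0]), [float(v) for v in parts[1:]]

class StreamingMethod:
    """Base class a VW-style online learner would derive from."""
    def train(self, features):
        raise NotImplementedError

stream = StreamingFeatures(["1 0.5 0.2", "-1 0.1 0.9"])
print(stream.get_next_example())  # (1.0, [0.5, 0.2])
```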
• #### Integrate C5.0

Mentor: Soeren Sonnenburg
Difficulty: Medium
Requirements: C++ programming, basic machine learning

C5.0 is among the most famous decision tree learners. It has only recently been released under the GPL, so it would be great to integrate it as a classifier in shogun. Since the source code of C5.0 is in C, your task is to encapsulate C5.0 in C++ classes as a shogun classifier using shogun features as inputs.
• #### Help merging the projects shogun and multiboost

Mentor: Francois-David Collin, Soeren Sonnenburg
Difficulty: Medium
Requirements: C++ programming, basic machine learning

The two projects shogun and multiboost have recently started a merge. Help the authors in various programming tasks to bring state-of-the-art boosting to shogun.
• #### Implement Olivier Chapelle's fast Newton-based SVM solver

Mentor: Soeren Sonnenburg
Difficulty: Easy
Requirements: C++ programming, basic octave or matlab, basic machine learning

Code is available in matlab from http://mloss.org/software/view/30/. The task is to translate this code to C++ and integrate it in shogun. This involves
1. Defining a new CLinearMachine derived class CNewtonSVM
2. Translating the few lines of matlab code
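For intuition, here is what Newton's method on the primal L2-loss SVM objective looks like for 1-dimensional data (a toy Python sketch under simplifying assumptions; the real CNewtonSVM works with full weight vectors and solves a linear system per iteration):

```python
def newton_l2svm_1d(xs, ys, C=1.0, iters=10):
    """Newton iterations on the 1-D primal L2-loss SVM objective
      f(w) = 0.5*w^2 + C * sum_{i: y_i*w*x_i < 1} (1 - y_i*w*x_i)^2
    Only margin violators contribute to gradient and Hessian."""
    w = 0.0
    for _ in range(iters):
        sv = [(x, y) for x, y in zip(xs, ys) if y * w * x < 1]
        grad = w - 2 * C * sum(y * x * (1 - y * w * x) for x, y in sv)
        hess = 1 + 2 * C * sum(x * x for x, _ in sv)  # always positive here
        w -= grad / hess
    return w

w = newton_l2svm_1d([1.0, 2.0, -1.0, -2.0], [1, 1, -1, -1])
print(w)  # converges to 0.8 on this toy data
```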
• #### Integrate SGD-QN

Mentor: Antoine Bordes
Difficulty: Easy
Requirements: C++ programming, basic machine learning

SGD-QN is a fast linear online SVM solver implemented in C that is missing in shogun. The task is to merge that solver into shogun. This should be easy, as other variants of SGD written by the same author are already in shogun. One would again do the merge utilizing shogun's COFFIN framework (implemented in CDotFeatures). Note that SGD-QN is a particularly fast online solver and won the PASCAL Large Scale Learning Challenge.
• #### Integrate OLaRank

Mentor: Antoine Bordes
Difficulty: Medium
Requirements: C++ programming, machine learning

OLaRank is a fast solver for learning SVMs with structured outputs. It is implemented in C/C++. The task is to integrate that solver in shogun.
• #### mloss.org and mldata.org integration

Mentor: Soeren Sonnenburg
Difficulty: Medium
Requirements: C++ programming, python, (optional django)

The machine learning data set repository mldata.org provides data and task downloads in a standardized format. Make shogun natively able to read these data formats and to interact with the mldata.org website.
Note that the optional part of this task aims to help the machine learning community portals, for which python and django skills are needed (the source code of both portal websites is also open source).

### Interfaces to other languages

• #### Implement modular Java interface

Mentor: Mikio Braun
Difficulty: Easy
Requirements: C++ programming, java, swig

Investigate the status of numerical java and implement swig typemaps to translate numerical java variables (matrices, vectors, sparse vectors, strings) into shogun representations. To begin with, develop swig typemaps for jblas. It might also be worthwhile to make things like JRuby, Jython etc. work with the Java bindings of shogun.
• #### Implement modular ruby interface.

Mentor: Mikio Braun
Difficulty: Easy
Requirements: C++ programming, ruby, swig

Investigate the status of numerical ruby and implement swig typemaps to translate numerical ruby variables (matrices, vectors, sparse vectors, strings) into shogun representations.

## Input/Output

• #### Reading and writing of shogun features / objects in .hdf5 (pytables/matlab/h5py) format

Mentor: Soeren Sonnenburg
Difficulty: Easy to Medium
Requirements: C++ programming

Shogun already has partial support for the hierarchical data format version 5 in CHDF5File. This enables loading and saving files in python, r, octave, matlab and more, sometimes providing better hdf5 support than any of the native implementations. The task would be to extend hdf5 support by implementing
1. variable-length data type support
2. compression support
3. slicing support
4. attribute support
Extensive information about hdf5 can be found in its reference manual.
• #### Reading and writing of matlab 7.0 files

Mentor: Soeren Sonnenburg
Difficulty: Easy to Medium
Requirements: C++ programming

Matlab 7 file format specifications are publicly available online. In addition, Octave has support for reading files up to version 7.2. The task is to develop a file i/o class based on these specifications that allows reading and writing .mat files directly. Note that one can borrow a lot of code from octave here (see the files src/ls-mat4.cc and load-save.cc).
• #### Reading and writing of octave 3.x files

Mentor: Soeren Sonnenburg
Difficulty: Easy
Requirements: C++ programming

Natively support loading and writing of octave files. One can make use of the already implemented functionality in octave (files ls-oct-ascii.{cc,h} and ls-oct-binary.{cc,h}), so this task involves a lot of code borrowing.

## Framework Improvements

• #### Build a well-designed and flexible cross-validation framework into shogun

Mentor: Soeren Sonnenburg
Difficulty: Difficult
Requirements: C++ programming, basic machine learning

This should include N-fold cross-validation, simple validation, repeated sampling etc. It should be possible to easily exchange the strategy by which examples are split up (e.g. in computational biology this could mean that genes in a particular split come from a certain genomic region). This is considered difficult only because you will need to deeply understand the internals of shogun objects. The involved tasks are
1. Understand shogun's parameter framework and extend it to flag parameters for use in model selection.
2. Implement a class CParameterSetting for one particular configuration the learning machine will be run with
3. Implement a class CSettingGenerator that, based on parameter inputs, generates one particular CParameterSetting
4. Implement various model selection schemes (train,validation,test split, n-fold cross validation, leave one out cross validation, nested cross validation,...)
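The splitting part of scheme 4 can be sketched quickly (Python for brevity; the framework itself would be C++). The shuffle below is exactly the point where an exchangeable splitting strategy, such as splitting by genomic region, would plug in:

```python
import random

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.
    Replacing the shuffle with a domain-specific grouping implements
    an alternative splitting strategy."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]      # k roughly equal folds
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

for train, val in kfold_indices(10, 5):
    print(len(train), len(val))  # 8 2, five times
```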
• #### Implement framework for online learning

Mentor: Soeren Sonnenburg
Difficulty: Easy
Requirements: C++ programming, machine learning

Currently shogun assumes that the number of data points is known beforehand. This may be troublesome for online learning algorithms. Implement 'online features' with calls to get_next_feature_vector() that always return another example or NULL. Implement it in a way such that data is either read from a file, pipe or socket. This is closely related to the Vowpal Wabbit integration above (http://hunch.net/~vw/), and one will be able to borrow code from there. In a next step, convert online algorithms such as sgd, larank and liblinear to work with those features.
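As a toy illustration of the interface (Python sketch; names hypothetical), an online learner only ever asks for the next example and stops when the stream reports its end, so it never needs the total count up front:

```python
def online_perceptron(next_example, dim, lr=1.0):
    """Consume (label, vector) pairs one at a time; next_example() returns
    None when the stream (file, pipe or socket) is exhausted."""
    w = [0.0] * dim
    while True:
        ex = next_example()
        if ex is None:
            return w
        y, x = ex
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:  # mistake: update
            for i in range(dim):
                w[i] += lr * y * x[i]

data = iter([(1, [1.0, 0.0]), (-1, [-1.0, 0.0]), (1, [0.8, 0.1])])
w = online_perceptron(lambda: next(data, None), 2)
print(w)  # [1.0, 0.0] after one mistake-driven update
```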
• #### Build a generic structured output learning framework

Mentor: Jonas Behr, Nico Goernitz
Difficulty: Difficult
Requirements: C++ programming, machine learning

In many cases machine learning problems result in a classification task where one seeks an algorithm that decides whether a given data point is of one or the other class. Among the most powerful and successful methods solving such problems is the famous support vector machine (SVM). Naturally, its input data consists of high-dimensional lists of features capturing properties of the underlying objects (e.g. weight, size, color, ...). Sometimes these objects are more complex and have a built-in structure like trees and sequences. However, the output is restricted to a binary decision and hence is not helpful when it comes to predicting e.g. the most likely sequence for speech recognition or genome annotation. To overcome this problem but retain the characteristics of the SVM, the structured-output support vector machine (SO-SVM) has been proposed (Tsochantaridis et al [3]). Another powerful class of structured output predictors is the conditional random field (CRF, Lafferty et al [4]). It shares important properties with the SO-SVM, but performance might differ depending on the application at hand.

Here, the student will implement a generic SO-SVM and CRF framework to enable handling structured output problems and to make it an ideal base for specific user-defined implementations. A crucial point is - of course - speed, and therefore the implementation covers only primal versions of the SO-SVM. However, this is not a big restriction because most applications already have a rich representation. As an application she/he will implement a hidden (semi-)Markov SO-SVM which will be used for transcript mapping or gene prediction. There is already an implementation available in Matlab/C [1]. Code for Viterbi decoding of semi-Markov models is already available in Shogun [2].

While the programming difficulty is only medium, a basic understanding of machine learning is mandatory. This renders the whole problem difficult.

Timetable
1. get used to shogun (especially how to read and write data) [2]
2. test HM-SVM toolbox with toy data [1]
3. get a basic grasp of what the SO-SVM algorithm looks like
4. implement the standard SO-SVM algorithm (a QP-solver is provided)
5. test the SO-SVM by implementing a multi-class SVM (here the joint feature map and argmax-function are especially easy, this is also the first application)
6. implement HM-SVM application (define specific loss- and argmax-functions)
7. implement a CRF and test on the same application
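For step 5, the multi-class case makes the two key ingredients concrete. A Python sketch (helper names are hypothetical): the joint feature map copies x into the block of the chosen class, and inference is an argmax over classes, loss-augmented during training:

```python
def joint_feature(x, y, n_classes):
    """Multiclass joint feature map psi(x, y): x is copied into block y
    of an (n_classes * dim)-dimensional vector, zeros elsewhere."""
    dim = len(x)
    psi = [0.0] * (n_classes * dim)
    psi[y * dim:(y + 1) * dim] = x
    return psi

def loss_augmented_argmax(w, x, y_true, n_classes):
    """argmax_y [ Delta(y_true, y) + <w, psi(x, y)> ] with 0/1 loss;
    this is the 'most violated constraint' search used in SO-SVM training."""
    def score(y):
        psi = joint_feature(x, y, n_classes)
        return (0 if y == y_true else 1) + sum(a * b for a, b in zip(w, psi))
    return max(range(n_classes), key=score)

w = [10.0, 10.0, 0.0, 0.0, 0.0, 0.0]      # strongly favours class 0
print(loss_augmented_argmax(w, [1.0, 1.0], 0, 3))  # 0
```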

[1] an implementation of the hidden Markov SO-SVM in Matlab/C is available at http://mloss.org/software/view/250/
[2] The Shogun framework as well as examples and plenty of documentation can be downloaded from http://www.shogun-toolbox.org
[3] I. Tsochantaridis, T. Hofmann, T. Joachims, Y. Altun: Support vector machine learning for interdependent and structured output spaces, ICML 2004
[4] J. Lafferty, A. McCallum, F. Pereira: Conditional random fields: Probabilistic models for segmenting and labeling sequence data, ICML 2001
• #### General Solver framework for convex problems.

Mentor: Gunnar Raetsch
Difficulty: Difficult (this might be too ambitious)
Requirements: C++ programming, optimization, machine learning

There exists cvx for matlab, which is useful but lacks some natural options such as limiting the number of iterations. Convexity can be checked via algebraic rules (e.g. sums of convex functions are convex) and convex atoms. The aim is to establish a framework with one solving strategy initially, rather than to get perfect solutions for all kinds of convex problems.

## New Machine Learning Algorithms

• #### Implementation of the Reduced Set Method for RBF kernel.

Mentor: Vojtech Franc
Difficulty: Medium
Requirements: C++ programming, matlab or octave, basic math/statistics

Gaussian kernels are among the most popular similarity measures used in practical applications of kernel methods. Their popularity is due to the flexibility of the decision rule, expressed as a linear combination of Gaussian kernels: the discriminative power of such a rule can be smoothly changed from linear to arbitrarily complex by tuning a single parameter (the kernel width). This flexibility is paid for by the high complexity of the decision rule, i.e. the number of Gaussian kernels appearing in the linear combination. High complexity of the decision rule has a detrimental impact on the decision time, one of the most important criteria in practice. The Reduced Set (RS) method is a technique which approximates a complex kernel-based decision rule by a simpler one using a prescribed number of kernels. The core of RS methods is to solve the so-called pre-image problem. In the case of Gaussian kernels, the pre-image problem leads to a highly non-convex smooth optimization. The goal of the project is to implement the existing techniques for its solution, namely:
1. B. Schölkopf, P. Knirsch, A. Smola and C. Burges. Fast approximation of support vector kernel expansions, and an interpretation of clustering as approximation in feature spaces. DAGM, 1998.
2. J.T. Kwok and I.W. Tsang. The pre-image problem in kernel methods. ICML, 2003.
As a starting point the student can use Matlab implementations of the RS methods which are available in STPRtoolbox.
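The fixed-point iteration of Schölkopf et al. for the Gaussian-kernel pre-image problem is short enough to sketch in Python (toy version under simplifying assumptions; the STPRtoolbox contains full Matlab implementations):

```python
import math

def rbf(u, v, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def preimage_fixed_point(alphas, xs, gamma, z0, iters=50):
    """Fixed-point iteration for the RBF pre-image problem
    (Schoelkopf et al., 1998):
        z <- sum_i a_i k(z, x_i) x_i / sum_i a_i k(z, x_i)
    i.e. z is moved to a kernel-weighted mean of the expansion points."""
    z = list(z0)
    for _ in range(iters):
        ws = [a * rbf(z, x, gamma) for a, x in zip(alphas, xs)]
        s = sum(ws)
        z = [sum(w * x[d] for w, x in zip(ws, xs)) / s
             for d in range(len(z))]
    return z

# single expansion point: the pre-image is that point itself
print(preimage_fixed_point([1.0], [[3.0]], 0.5, [0.0]))  # [3.0]
```

Note the iteration is only a local method for a non-convex problem; in practice one restarts from several initial points z0.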
• #### Implement Kernel Feature Analysis (KFA)

Difficulty: Medium
Requirements: C++ programming, machine learning

Dimension reduction is a prerequisite in many practical applications of machine learning. Kernel Feature Analysis (KFA) is an algorithm for performing dimension reduction using kernel functions. In contrast to other kernel-based techniques for dimension reduction, KFA determines a low-dimensional subspace using sparse components and thus is suitable for large-scale applications. KFA is described in the following two papers. The algorithms can be implemented as sketched in the work of Jiang et al., where further tricks for improving the run-time are discussed.
1. Sparse Kernel Feature Analysis. Smola et al. (1999)
2. Accelerated Kernel Feature Analysis. Jiang et al. (2006)
• #### ML estimation of parameters of Gaussian Mixture Models with carefully implemented EM algorithm.

Mentor: Vojtech Franc
Difficulty: Difficult
Requirements: C++ programming, matlab or octave, basic math/statistics

The Expectation-Maximization algorithm for estimating parameters of the Gaussian Mixture Model is a standard tool used in probability density modeling. Despite the simplicity of the EM algorithm itself, programming a robust implementation which would be applicable in practice is a tricky problem. The difficulty stems from numerical problems inherently involved in the algorithm, e.g. summation of extremely small numbers or inversion of ill-conditioned covariance matrices. The goal of the project is a robust implementation which uses several computational tricks and techniques, e.g. performing summation of small numbers in the logarithm domain, using the SVD decomposition to represent covariance matrices, removing "singular" Gaussian components etc. An initial Matlab implementation will be provided by the mentor.
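The log-domain summation trick mentioned above is the classic log-sum-exp identity, log Σ exp(v_i) = m + log Σ exp(v_i − m) with m = max v_i. A small Python sketch:

```python
import math

def logsumexp(log_vals):
    """Numerically stable log(sum(exp(v))), as needed when accumulating
    per-component Gaussian responsibilities in the E-step: for very
    negative log-likelihoods a naive exp() would underflow to 0."""
    m = max(log_vals)
    return m + math.log(sum(math.exp(v - m) for v in log_vals))

print(logsumexp([-1000.0, -1000.0]))  # approx -999.307, no underflow
```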
• #### Implement missing machine learning algorithms

Mentor: Christian Widmer
Difficulty: Easy to Medium
Requirements: C++ programming, basic machine learning, potentially basic skills in Octave/MATLAB/Python

Shogun provides a rich set of kernel implementations, including (but not limited to) a rich set of string kernels particularly relevant for applications in Computational Biology. Traditionally, the focus of Shogun has been on Support Vector Machines (SVMs); however, there exist numerous other machine learning methods that could also benefit from the set of kernels provided by Shogun. In particular, we have the following two in mind:
1. Gaussian Processes
2. Kernel Logistic Regression

In addition to that, a whole range of pre-processing methods are available that would greatly complement the features that are in shogun already. These include:
1. Kernel PCA (KPCA)
2. Stochastic Neighborhood Embedding (SNE)
3. Locally Linear Embedding (LLE)
4. ISOMAP
5. Multidimensional Scaling (MDS)

For this task, the first step is to scan mloss.org and similar sites for suitable implementations. Once these are identified, the student will have to port the respective code to the Shogun code base, making sure to respect existing interfaces (and add new ones, if necessary). Lastly, documentation, examples and Unit tests will have to be written for all of the integrated methods.

While this project does not involve the implementation of novel ML methods from scratch, it offers a great opportunity for a student to get familiar with a whole range of highly relevant machine learning methods in theory and practice. On top of that, the student will gather experience in contributing to a large scientific software project.
• #### Implement SVMbright, a GPL'ed version of SVMLight, and make it publicly available

Mentor: Soeren Sonnenburg
Difficulty: Difficult
Requirements: C++ programming, support vector machines, optimization theory

SVMlight is one of the first SVM toolboxes. While efficient, it is not open source. The task is to implement the algorithms described in Joachims 98 as free software.
• #### Implementation of latent SVMs, or integration via existing GPL code

Mentor: Alexander Binder
Requirements: C++ programming, machine learning
Difficulty: Difficult

SVMs with latent variables: the task consists of
1. checking what implementations licensed under the GPL exist in C/C++
2. to what kind of problems are they practically applicable
3. if necessary implement one latent SVM model
4. implement templates for minimization over user-defined latent variables (difficult!)
5. implement dynamic interfaces
6. implement static interfaces for important application cases
7. implement more than one solution strategy
8. implement heuristics for dealing with large sample sizes
9. check such heuristics for their efficiency on large scale data
Latent SVMs (http://www.cs.cornell.edu/~cnyu/latentssvm/, http://research.microsoft.com/pubs/72874/lsvm_amta.pdf and particularly http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.4938&rep=rep1&type=pdf) are an ongoing success story in object detection, where the latent variable is the unknown position of an object center and object parts. Felzenszwalb's approach received the "Lifetime Achievement" prize based on its impact in research (see http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/index.html) at the Pascal VOC2010 workshop. Latent parameters are equally appealing for discriminative methods in other application areas as well.

The challenge in the implementation lies in solving the non(!)-convex optimization problem in a way which is acceptable for practitioners. This includes allowing for templates for minimization over user-defined latent variables (which are desired to be consistent with the interfaces to other languages), termination after a maximal runtime, and heuristic frameworks which allow iterating on smaller chunks of data while keeping only the hardest samples.
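A common training scheme for latent SVMs alternates between imputing the latent variables and updating w. A toy Python sketch of that alternation, assuming a user-supplied joint feature map psi(x, h) (all names hypothetical, and the real methods use a proper solver rather than plain subgradient steps):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def latent_svm_train(xs, ys, h_candidates, psi, dim, n_iters=5, lr=0.1):
    """Alternating scheme typical for latent SVMs (illustrative sketch):
      1. impute latent variables: h_i = argmax_h <w, psi(x_i, h)>
      2. update w on the completed data (here: hinge-loss subgradient steps)
    psi(x, h) maps an example plus latent variable to a feature vector."""
    w = [0.0] * dim
    for _ in range(n_iters):
        # step 1: impute latent variables with the current w
        hs = [max(h_candidates, key=lambda h: dot(w, psi(x, h))) for x in xs]
        # step 2: subgradient steps on the completed (x, h, y) data
        for x, y, h in zip(xs, ys, hs):
            f = psi(x, h)
            if y * dot(w, f) < 1:  # margin violated
                w = [wi + lr * y * fi for wi, fi in zip(w, f)]
    return w

# toy problem: the latent variable selects which coordinate of x to use
psi = lambda x, h: [x[h]]
w = latent_svm_train([[2.0, 0.0], [0.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]],
                     [1, 1, -1, -1], [0, 1], psi, dim=1)
print(w[0] > 0)  # learns a positive weight on the selected coordinate
```

Because the imputation step depends on w, the overall objective is non-convex; this is exactly the difficulty the project description points out.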
• #### Implement a boosting-type learning method for features defined as sets of fixed-dimensional vectors with varying set size

Mentor: Alexander Binder
Difficulty: Medium
Requirements: C++ programming, machine learning

http://www.cs.cmu.edu/~rahuls/pub/cvpr2008-unified-rahuls.pdf describes a method generally usable (outside its original domain) for features which are sets of fixed-dimensional vectors with varying set size. This is a notable relaxation of the classical paradigm of fixed-dimensional features and could extend the usability of the Shogun toolbox in this direction.