# Shogun Planet

## May 03, 2013 10:57 PM

### Soeren Sonnenburg

#### Shogun Student Applications Statistics for Google Summer of Code 2013

Almost a month has passed since SHOGUN was accepted for Google Summer of Code 2013. The student application deadline was today, and Shogun received 57 proposals from 52 students. This is a notable increase compared to 2012 (48 applications from 38 students). What is interesting, though, is that it didn't look that good in the very beginning (see the figure below):

Compared to 2012, this curve is much flatter in the beginning but rises steeply towards the end. Why is that? We didn't change the way we engaged with students (though we tried to improve the instructions and added lots of entrance-tagged tasks to the GitHub issues). We still require patches to be submitted before an application is even considered, so it was similarly tough to get into GSoC 2013 with us as in the previous year.

What is interesting, though, is that various organizations complained about a slow uptake in the beginning. It turns out that Google reduced the limit on applications per student from 20 (last year) to 5 (in 2013). This might explain the shape of the curve: students are more cautious about applying, but once the deadline nears they apply up to the maximum of 5 to improve their chances. This goes hand in hand with the observation that the quality of newly submitted student applications tends to decrease towards the deadline.

So did this new limit hurt? On the contrary! In the end the quality of proposals increased a lot, and we were able to start scoring and ranking students well before the application deadline. We are happy to have many very strong candidates again this year. Let's hope we get enough slots to accommodate all of the excellent students, and then let the fun begin :)

## April 17, 2013 08:27 PM

### CALL FOR PARTICIPATION: Shogun Machine Learning Workshop, Berlin, Germany, July 12-14, 2013

Data Science and Big Data are omnipresent terms, documenting the need for automated tools to analyze the ever-growing wealth of data. To this end we invite practitioners, researchers and students to participate in the first Shogun machine learning workshop. While the workshop is centered around the development and use of the Shogun machine learning toolbox, it will also feature general machine learning subjects.

### General Information

The workshop will include:
• A general introduction to machine learning given by Gunnar Raetsch.
• Introductory talks on topics such as dimension reduction techniques, kernel statistical testing, Gaussian processes, and structured output learning.
• Contributed talks, a poster session, and poster spotlights.
• A discussion panel.
• Hands-on sessions on July 13-14.

Do not miss the chance to familiarize yourself with the Shogun machine learning toolbox for solving various data analysis tasks and to talk to its authors and contributors. The workshop program will cover basic to advanced topics in machine learning and how to approach them using Shogun, which makes it suitable for anyone, whether you are a senior researcher or practitioner with many years of experience, or a junior student eager to discover more. Interested?

A tentative schedule is available at http://shogun-toolbox.org/page/Events/workshop2013_program.

### Call for contributions

The organizing committee is seeking workshop contributions. The committee will select several submitted contributions for 15-minute talks and poster presentations. The accepted contributions will also be published on the workshop web site.

Amongst other topics, we encourage submissions that

• present applications or publications utilizing Shogun
• are highly relevant to practitioners in the field
• are of broad general interest
• are extensions to Shogun

### Submission Guidelines

Send an abstract of your talk/contribution to shogun-workshop2013@shogun-toolbox.org before June 1. Notifications will be given on June 7.

### Registration

Workshop registration is free of charge. However, only a limited number of seats are available: first come, first served! Register by filling out the registration form.

### Location and Timeline

The main workshop will take place at c-base Berlin (http://c-base.org/, https://en.wikipedia.org/wiki/C-base) on July 12. It is followed by two additional days of hands-on sessions held at TU Berlin on July 13-14.

### About the Shogun machine learning toolbox

Shogun is designed for unified large-scale learning for a broad range of feature types and learning settings, like classification, regression, or explorative data analysis. Further information is available at http://www.shogun-toolbox.org.

## April 09, 2013 10:28 AM

### Heiko Strathmann

#### Talk at the EBI in Cambridge

I gave a talk at the EMBL-European Bioinformatics Institute in Cambridge, where I visited the group of Oliver Stegle.

The topic was "Adaptive Large-Scale Kernel Two-Sample Testing". The slides can be found behind this link.

## April 09, 2013 10:14 AM

#### Shogun is in the GSoC 2013

Shogun got accepted in the Google Summer of Code 2013!

Check out our ideas page. This year, I will be a mentor rather than a student, and I am very excited about this.

I'll be offering two projects:

• Implement Gaussian process classification (joint with Oliver Stegle). This is an extension of last year's GSoC project and should be quite interesting while not being too complicated (link)
• Implement unbiased estimators of likelihoods of very large, sparse Gaussian distributions (joint with Erlend Aune and Daniel Simpson). This one is quite challenging since it involves many different topics. However, it should also be very interesting (link)

## April 09, 2013 12:01 AM

### mloss.org

#### GSoC 2013

GSoC has just announced the list of participating organisations. This is a great opportunity for students to get involved in projects that matter, and to learn about code development which is bigger than the standard "one semester" programming project that they are usually exposed to at university.

Some statistics:

• 177 of 417 projects were accepted, which is a success rate of 42%.
• 40 of the 177 accepted projects are participating for the first time, a 23% proportion of new blood.

These seem to be in the same ballpark as most other competitive schemes for obtaining funding. Perhaps there is some type of psychological "mean" which reviewers gravitate to when they are evaluating submissions. For example, consider that out of the 4258 students that applied for projects in 2012, 1212 students got accepted, a rate of 28%.

To the students out there, please get in touch with potential mentors before putting in your applications. You'd be surprised at how much it could improve your application!

## April 08, 2013 10:50 PM

### Fernando Jose Iglesias Garcia

#### Shogun is participating in GSoC 2013

It is difficult to come up with a better way to start off the week than with news as good as this. As usual, we have two main types of project ideas in Shogun:

• Accessibility improvements.
• Core machine learning tasks or framework improvements.

Check out the full ideas list for more detailed descriptions.

Given that Shogun has plenty of useful and interesting machine learning methods but, unfortunately, some of them are not very accessible to users unfamiliar with Shogun's code base, the accessibility improvement projects seem particularly important this year. After this summer, we expect to have more interactive demos showing off Shogun's capabilities.

Nonetheless, there are also some interesting ideas concerning the implementation of new machine learning algorithms: for example, extending Gaussian processes to support classification, more dimension reduction techniques (are you an expert in ball trees? Then we want you!), and some really challenging projects such as large-scale estimation of sparse Gaussian densities. Last but not least, there is a very nice idea about my favourite topic: structured learning! This idea aims at providing tools to target general structured output models with SO-SVMs, large-scale solvers and kernels.

## April 08, 2013 10:08 PM

### Soeren Sonnenburg

#### Shogun got accepted at Google Summer of Code 2013

We are happy to announce that the Shogun machine learning toolbox will participate in this year's Google Summer of Code :-)

SHOGUN is designed for unified large-scale learning for a broad range of feature types and learning settings, like classification, regression, or exploratory data analysis.

In case you are a talented student interested in implementing and learning about machine learning algorithms - we are looking for you!

We have collected a number of ideas [1] that we consider worthwhile implementing. And don't forget to apply [2]!

## March 19, 2013 05:20 PM

### Heiko Strathmann

#### Shogun 2.1 is out!

We released SHOGUN 2.1. See the announcement (link).

The release features my recent work on kernel selection for MMD-based kernel two-sample testing, together with a streaming-based implementation; see the blog entry. We also added a new unit-testing framework, which I am very excited about since we finally have a mechanism to detect code errors, and we got yet another interface language (Perl). Very cool stuff and lots of work/blood/sweat/fun with the other guys. Check it out!

Next up is a workshop on machine learning with SHOGUN on July 12 at c-base in Berlin. Stay tuned!

## March 18, 2013 12:14 AM

### mloss.org

#### Scientist vs Inventor

Mikio and I are writing a book chapter about "Open Science in Machine Learning", which will appear in a collection titled "Implementing Computational Reproducible Research". Among many things, we mentioned that machine learning is about inventing new methods for solving problems. Luis Ibanez from Kitware pounced on this statement, and proceeded to give a wonderful argument that we are confusing our roles as scientists with the pressure of being an inventor. The rest of this post is an exact reproduction of Luis' response to our statement.

“... machine learning is concerned with creating new learning methods to perform well on certain application problems.”.

The authors discuss the purpose of machine learning, but under the untold context of “research on machine learning”, and the current landscape of funding research. To clarify, the authors imply that novelty is the purpose of machine learning research. More explicitly, that “developing new methods” is the goal of research.

This is a view (not limited to machine learning) that is commonly widespread, and that in practice is confirmed by the requirements of publishing and pursuit of grant funding. I beg to differ with this view, in the sense that “novelty” is not part of the scientific process at all. Novelty is an artificial condition that has been imposed on scientific workers over the years, due to the need to evaluate performance for the purpose of managing scarce funding resources. The goal of scientific research is to attempt to understand the world by direct observation, crafting of hypothesis and evaluation of hypothesis via reproducible experiments.

The pursuit of novelty (real or apparent) is actually a distraction, and it is one of the major obstacles to the practice of reproducible research. By definition, repeating an experiment, implies, requires and demands to do something that is not new. This distracted overrating of novelty is one of the reasons why scientific workers, and their institutions have come to consider repeatability of experiments as a “waste of time”, since it takes resources away from doing “new things” that could be published or could lead to new streams of funding. This confusion with “novelty” is also behind the lack of interest in reproducing experiments that have been performed by third parties. Since, such actions are “just repeating” what someone else did, and are not adding anything “new”. All, statements that are detrimental to the true practice of the scientific method.

The confusion is evident when one look at calls for proposals for papers in journal, conferences, or for funding programs. All of them call for “novelty”, none of them (with a handful of exceptions) call for reproducibility. The net effect is that we have confused two very different professions: (a) scientific researcher, with (b) inventor. Scientific researchers should be committed to the application of the scientific method, and in it, there is no requirement for novelty. The main commitment is to craft reproducible experiments, since we are after the truth, not after the new. Inventors on the other hand are in the business of coming up with new devices, and are not committed to understanding the world around us.

Most conference, journals, and even funding agencies have confused their role of supporting the understanding the world around us, and have become surrogates for the Patent Office.

In order to make progress in the pursuit of reproducible research, we need to put “novelty” back in its rightful place of being a nice extra secondary or tertiary feature of scientific research, but not a requirement, nor a driving force at all.

## March 17, 2013 02:21 PM

### Soeren Sonnenburg

#### Shogun Toolbox Version 2.1.0 Released!

We have just released Shogun 2.1.0. This release contains over 800 commits since 2.0.0, with a load of bugfixes, new features and improvements (see the changelog for details) that make Shogun more efficient, robust and versatile. In particular, Christian Montanari developed a first alpha version of a Perl modular interface, Heiko Strathmann added linear-time MMD on streaming data, Viktor Gal wrote a new structured output solver, and Sergey Lisitsyn added support for Tapkee, a dimension reduction framework. Read more at http://www.shogun-toolbox.org

## February 06, 2013 12:00 AM

### mloss.org

#### Software Licensing

One of the tricky decisions software authors have to make is "What license should I use for my software?" A recent article in PLoS Computational Biology discusses the different possible avenues open to authors. It gives a balanced view of software licensing, carefully describing the various dimensions authors of software should consider before coming to a decision.

It recommends the following guidelines:

• For the widest possible distribution consider a permissive FOSS license such as the BSD/MIT, Apache, or ECL.
• For assuring that derivatives also benefit FOSS, choose a copyleft FOSS license like the GPL, LGPL, or MPL.
• To those on the fence, there are hybrid or multi-licensing which can achieve the benefits of both open source and proprietary software licenses.
• For protecting the confidentiality of your code, there is the proprietary license.

Naturally being an open source venue, I strongly encourage people to consider the first two options. We also discuss the distinction between FOSS licences in our position paper from 2007.

## January 23, 2013 09:34 AM

### Dear all, save the date - July 12, 2013!

We just received confirmation from the c-base crew [1], [2]: the first Shogun machine learning toolbox workshop will take place at c-base in Berlin this summer, on July 12! Anyone interested in participating in or helping to organize this event, please talk to us on IRC or post on the mailing list...

## January 02, 2013 01:25 PM

### mloss.org

#### Chemical compound and drug name recognition task.

CALL FOR PARTICIPATION: CHEMDNER task: Chemical compound and drug name recognition task.

TASK GOAL AND MOTIVATION Machine learning methods have been especially useful for the automatic recognition of entity mentions in text, a crucial step for further natural language processing tasks. This task aims to promote the development of open source software for indexing documents with compounds and recognizing compound mentions in text.

The goal of this task is to promote the implementation of systems that are able to detect mentions in text of chemical compounds and drugs. The recognition of chemical entities is also crucial for other subsequent text processing strategies, such as detection of drug-protein interactions, adverse effects of chemical compounds or the extraction of pathway and metabolic reaction relations. A range of different methods have been explored for the recognition of chemical compound mentions including machine learning based approaches, rule-based systems and different types of dictionary-lookup strategies.

As has been the case in previous BioCreative efforts (resulting in high impact papers in the field), we expect that successful participants will have the opportunity to publish their system descriptions in a journal article.

CHEMDNER DESCRIPTION CHEMDNER is one of the tracks of the BioCreative IV community challenge (http://www.biocreative.org).

We invite participants to submit results for the CHEMDNER task providing predictions for one or both of the following subtasks:

a) Given a set of documents, return for each of them a ranked list of chemical entities described within each of these documents [Chemical document indexing sub-task]

b) Provide for a given document the start and end indices corresponding to all the chemical entities mentioned in this document [Chemical entity mention recognition sub-task].
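Subtask (b) asks for start and end indices of chemical mentions, for which a dictionary-lookup strategy (one of the approach families mentioned above) makes a natural baseline. A minimal sketch, assuming a toy lexicon (the chemical names and helper function below are illustrative, not part of the task):

```python
import re

# Toy dictionary of chemical names; a real system would use a large
# curated lexicon.
CHEMICALS = ["aspirin", "acetylsalicylic acid", "ibuprofen", "glucose"]

def find_chemical_mentions(text):
    """Return (start, end, mention) triples for dictionary hits in `text`.

    Entries are sorted longest-first so that at any position the regex
    prefers 'acetylsalicylic acid' over a shorter overlapping entry.
    """
    pattern = "|".join(re.escape(c) for c in
                       sorted(CHEMICALS, key=len, reverse=True))
    return [(m.start(), m.end(), m.group(0))
            for m in re.finditer(pattern, text, flags=re.IGNORECASE)]

doc = "Aspirin (acetylsalicylic acid) lowers fever; glucose does not."
mentions = find_chemical_mentions(doc)
```

Such a baseline handles none of the tokenization, abbreviation, or IUPAC-name issues that make the task hard, which is exactly why machine learning approaches are encouraged.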

For these two subtasks the organizers will release training and test data collections. The task organizers will provide details on the annotation guidelines used, and define a list of criteria for relevant chemical compound entity types as well as for the selection of documents for annotation.

REGISTRATION Teams can participate in the CHEMDNER task by registering for track 2 of BioCreative IV. You can register additionally for other tracks too. To register your team go to the following page that provides more detailed instructions: http://www.biocreative.org/news/biocreative-iv/team/

MAILING LIST AND CONTACT INFORMATION You can post questions related to the CHEMDNER task to the BioCreative mailing list. To register for the BioCreative mailing list, please visit the following page: http://biocreative.sourceforge.net/mailing.html You can also send questions directly to the organizers via e-mail: mkrallinger@cnioes

WORKSHOP CHEMDNER is part of the BioCreative evaluation effort. The BioCreative Organizing Committee will host the BioCreative IV Challenge evaluation workshop (http://www.biocreative.org/events/biocreative-iv/CFP/) at NCBI, National Institutes of Health, Bethesda, Maryland, on October 7-9, 2013

CHEMDNER TASK ORGANIZERS
• Martin Krallinger, Spanish National Cancer Research Center (CNIO)
• Obdulia Rabal, University of Navarra, Spain
• Julen Oyarzabal, University of Navarra, Spain
• Alfonso Valencia, Spanish National Cancer Research Center (CNIO)

REFERENCES
• Vazquez, M., Krallinger, M., Leitner, F., & Valencia, A. (2011). Text Mining for Drugs and Chemical Compounds: Methods, Tools and Applications. Molecular Informatics, 30(6-7), 506-519.
• Corbett, P., Batchelor, C., & Teufel, S. (2007). Annotation of chemical named entities. BioNLP 2007: Biological, translational, and clinical language processing, 57-64.
• Klinger, R., Kolářik, C., Fluck, J., Hofmann-Apitius, M., & Friedrich, C. M. (2008). Detection of IUPAC and IUPAC-like chemical names. Bioinformatics, 24(13), i268-i276.
• Hettne, K. M., Stierum, R. H., Schuemie, M. J., Hendriksen, P. J., Schijvenaars, B. J., Mulligen, E. M. V., ... & Kors, J. A. (2009). A dictionary to identify small molecules and drugs in free text. Bioinformatics, 25(22), 2983-2991.
• Yeh, A., Morgan, A., Colosimo, M., & Hirschman, L. (2005). BioCreAtIvE task 1A: gene mention finding evaluation. BMC Bioinformatics, 6(Suppl 1), S2.
• Smith, L., Tanabe, L. K., Ando, R. J., Kuo, C. J., Chung, I. F., Hsu, C. N., ... & Wilbur, W. J. (2008). Overview of BioCreative II gene mention recognition. Genome Biology, 9(Suppl 2), S2.

## December 26, 2012 12:06 AM

### Heiko Strathmann

#### NIPS paper: Optimal kernel choice for large-scale two-sample tests

NIPS 2012 is already over. Unfortunately, I could not go due to lack of travel funding. However, as mentioned a few weeks ago, I contributed to one paper which is closely related to my Master's project with Arthur Gretton and Massi Pontil: Optimal kernel choice for large-scale two-sample tests. We recently set up a page for the paper where you can download my Matlab implementation of the paper's methods. Feel free to play around with it. I am currently finishing implementing most of the methods in the SHOGUN toolbox. We also have a poster which was presented at NIPS. See below for all links.
Update: I have completed the kernel selection framework for SHOGUN; it will be included in the next release. See the base class interface here, and examples of how to use it: a single kernel (link) and combined kernels (link). All methods mentioned in the paper are included. I also updated the Shogun tutorial (link).

At its core, the paper describes a method for selecting the best kernel for two-sample testing with the linear time MMD. Given a kernel $$k$$ and terms
$h_k((x_{i},y_{i}),(x_{j},y_{j}))=k(x_{i},x_{j})+k(y_{i},y_{j})-k(x_{i},y_{j})-k(x_{j},y_{i}),$
the linear time MMD is their empirical mean,

$\hat\eta_k=\frac{1}{m}\sum_{i=1}^{m}h_k((x_{2i-1},y_{2i-1}),(x_{2i},y_{2i})),$
which is a linear time estimate of the squared distance between the mean embeddings of the distributions that the $$x_i, y_i$$ come from. This quantity allows one to perform a two-sample test, i.e. to tell whether the underlying distributions are different.

Given a finite family of kernels $$\mathcal{K}$$, we select the optimal kernel by maximising the ratio of the MMD statistic to a linear time estimate of its standard deviation,
$k_*=\arg \sup_{k\in\mathcal{K}}\frac{\hat\eta_k}{\hat\sigma_k},$
where $$\hat\sigma_k^2$$ is a linear time, constant space estimate of the variance $$\sigma_k^2=\mathbb{E}[h_k^2] - (\mathbb{E}[h_k])^2$$. This gives a linear time and constant space empirical estimate of the ratio. We establish consistency of this empirical estimate as
$\left| \sup_{k\in\mathcal{K}}\hat\eta_k \hat\sigma_k^{-1} -\sup_{k\in\mathcal{K}}\eta_k\sigma_k^{-1}\right|=O_P(m^{-\frac{1}{3}}).$
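The statistic and the selection criterion can be sketched in a few lines of NumPy. This is only an illustration of the formulas above, not the SHOGUN or Matlab implementation: the Gaussian kernel, its bandwidth, and the sample data are placeholder assumptions.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)), applied row-wise."""
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2 * sigma ** 2))

def linear_time_mmd(x, y, sigma=1.0):
    """Linear-time MMD estimate and variance estimate of the h-terms.

    x, y: arrays of shape (2m, d); samples are consumed in disjoint
    pairs (x_{2i-1}, y_{2i-1}), (x_{2i}, y_{2i}) as in the formulas.
    """
    x1, x2 = x[0::2], x[1::2]
    y1, y2 = y[0::2], y[1::2]
    # h_k = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i)
    h = (gaussian_kernel(x1, x2, sigma) + gaussian_kernel(y1, y2, sigma)
         - gaussian_kernel(x1, y2, sigma) - gaussian_kernel(x2, y1, sigma))
    eta = h.mean()   # \hat\eta_k, the empirical mean of the h-terms
    var = h.var()    # \hat\sigma_k^2 = E[h^2] - (E[h])^2, also linear time
    return eta, var

rng = np.random.RandomState(0)
x = rng.randn(1000, 1)          # samples from p = N(0, 1)
y = rng.randn(1000, 1) + 0.5    # samples from q = N(0.5, 1)
eta, var = linear_time_mmd(x, y)
ratio = eta / np.sqrt(var + 1e-8)   # the criterion \hat\eta_k / \hat\sigma_k
```

Evaluating `ratio` for each kernel in a finite family and taking the maximiser implements the selection rule $$k_*$$ above; the small constant in the denominator guards against a zero variance estimate.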

In addition, we describe an MKL-style generalisation that selects the weights of a convex combination of a finite number of baseline kernels,
$\mathcal{K}:=\{k : k=\sum_{u=1}^d\beta_uk_u,\sum_{u=1}^d\beta_u\leq D,\beta_u\geq0, \forall u\in\{1,...,d\}\},$
by solving the quadratic program
$\min \{ \beta^T\hat{Q}\beta : \beta^T \hat{\eta}=1, \beta\succeq0\},$
where $$\hat{Q}$$ is the positive definite empirical covariance matrix of the $$h$$ terms of all pairs of kernels.

We then describe experiments that illustrate

• that our criterion outperforms existing methods on synthetic and real-life datasets which correspond to hard two-sample problems, and
• why multiple kernels can be an advantage in two-sample testing.

Supplementary page

## December 11, 2012 12:00 PM

### mloss.org

#### Paper "Ten Simple Rules for the Open Development of Scientific Software" by Prlic and Procter

PLOS Computational Biology has an interesting Editorial on 10 rules for open development of scientific software. The ten rules are:

1. Don't Reinvent the Wheel
2. Code Well
3. Be Your Own User
4. Be Transparent
5. Be Simple
6. Don't Be a Perfectionist
7. Nurture and Grow Your Community
8. Promote Your Project
9. Find Sponsors
10. Science Counts

## November 28, 2012 10:52 AM

### mloss.org

#### Best Practices for Scientific Computing

I've been following the progress of Software Carpentry for some years now, and have been very impressed by their message that software is the new telescope, and we should invest time and effort to build up skills to ensure that our software is the best quality possible. Otherwise, how can we be sure that our new discoveries are not due to some instrument error?

They wrote a nice short paper titled "Best Practices for Scientific Computing" that highlights practices that would improve the quality of the software, and hence improve research productivity. Here are the 10 recommendations (along with the sub-recommendations).

### 1. Write programs for people, not computers.

1.1 a program should not require its readers to hold more than a handful of facts in memory at once

1.2 names should be consistent, distinctive, and meaningful

1.3 code style and formatting should be consistent

1.4 all aspects of software development should be broken down into tasks roughly an hour long

### 2. Automate repetitive tasks.

2.1 rely on the computer to repeat tasks

2.2 save recent commands in a file for re-use

2.3 use a build tool to automate their scientific workflows

### 3. Use the computer to record history.

3.1 software tools should be used to track computational work automatically

### 4. Make incremental changes.

4.1 work in small steps with frequent feedback and course correction

### 5. Use version control.

5.1 use a version control system

5.2 everything that has been created manually should be put in version control

### 6. Don’t repeat yourself (or others).

6.1 every piece of data must have a single authoritative representation in the system

6.2 code should be modularized rather than copied and pasted

6.3 re-use code instead of rewriting it

### 7. Plan for mistakes.

7.1 add assertions to programs to check their operation

7.2 use an off-the-shelf unit testing library

7.3 turn bugs into test cases

7.4 use a symbolic debugger

### 8. Optimize software only after it works correctly.

8.1 use a profiler to identify bottlenecks

8.2 write code in the highest-level language possible

### 9. Document the design and purpose of code rather than its mechanics.

9.1 document interfaces and reasons, not implementations

9.2 refactor code instead of explaining how it works

9.3 embed the documentation for a piece of software in that software

### 10. Conduct code reviews.

10.1 use code review and pair programming when bringing someone new up to speed and when tackling particularly tricky design, coding, and debugging problems

10.2 use an issue tracking tool

## November 13, 2012 11:30 PM

### Soeren Sonnenburg

#### Shogun at PyData NYC 2012

Christian Widmer gave a talk about the shogun machine learning toolbox version 2.0 at PyData NYC 2012. His talk is finally available as a video online.

## November 01, 2012 11:58 PM

#### 0x5ECB1A5ED: Reversing Malware Protocols With Machine Learning

Post by Hugo Gascon: Last week, the 5th ACM Workshop on Artificial Intelligence and Security took place, co-located with the 19th Conference on Computer and Communications Security in Raleigh, NC. This workshop is one of the few focused exclusively on the application of machine learning and AI-based methods to privacy and computer security problems, so it was a great event to introduce our latest research, which I carried out together with Tammo Krueger, Nicole Krämer and Konrad Rieck. We had the opportunity to present our method for automatically building stateful models of network protocols using machine learning.

Reverse engineering network protocols has been a popular strategy for people wanting to develop open implementations of proprietary protocols. For security researchers, it is an effective way to understand how the malware used to control infected machines in a botnet communicates with its command-and-control server. Sometimes these messages are exchanged over communication channels built on top of known protocols like IRC, HTTP or P2P (some of them very quirky, such as encrypted blog posts or steganographic image uploads). If the infected machine is a mobile phone, it is not uncommon for malware to receive and send instructions over SMS messages.

Since this is a complex, time-consuming task, there have been many efforts to automate the reversing process. Some have focused on extracting the complete protocol specification, which is very effective if done through taint analysis but not possible relying only on network traces. Others have focused on understanding protocols just enough to simulate vulnerable services in honeypots (e.g. ScriptGen). Such honeypots can automatically infer parts of a protocol from network traffic, but they have not been designed to track more evolved attacks that require a longer sequence of stateful communication. Close to this line of research, we have designed a probabilistic approach that models both the message content and the state machine of a protocol relying only on network traffic. Our method, called PRISMA (Protocol Inference and State Machine Analysis), is able to learn a stateful model that can be used to simulate valid communication of an unknown protocol. To construct this model, it infers the message format, the state machine, and rules for propagating information between states using machine learning techniques.

PRISMA builds on specially tailored embedding and clustering techniques that allow for large-scale applications with millions of network messages. After preprocessing the network traces and embedding the messages, we define a similarity measure in order to find common structures in the data. We have explored part-based and position-based clustering through non-negative matrix factorization (NMF) to group individual messages into events, but other clustering algorithms can be chosen, as long as the procedure assigns a cluster label to each message.
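As a rough illustration of this clustering step (not the actual PRISMA implementation: the toy messages, token-count embedding, and plain multiplicative-update NMF below are all simplifying assumptions), one could cluster embedded messages like this:

```python
import numpy as np

def nmf(X, k, iters=200, seed=0):
    """Plain multiplicative-update NMF: X (n x d, nonnegative) ~ W H."""
    rng = np.random.RandomState(seed)
    n, d = X.shape
    W = rng.rand(n, k) + 1e-3
    H = rng.rand(k, d) + 1e-3
    for _ in range(iters):
        # Standard multiplicative updates for the Frobenius objective
        H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Toy "network messages" embedded as token-count vectors (illustrative
# data only; PRISMA's real embedding and similarity measure are richer).
messages = ["GET /index HTTP", "GET /img HTTP",
            "LOGIN user pass", "LOGIN admin pass"]
vocab = sorted({tok for m in messages for tok in m.split()})
X = np.array([[m.split().count(t) for t in vocab] for m in messages],
             dtype=float)

W, H = nmf(X, k=2)
labels = W.argmax(axis=1)  # one cluster (event) label per message
```

The key property, as the text notes, is only that each message ends up with a cluster label; any clustering procedure with that output could be substituted for NMF here.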

Messages which occur at specific events in the flow of communication often exhibit similar structural features, so to extract event information we exploit this structural dependency. Each extracted sequence of events can be seen as a path through the protocol's state machine. To infer an approximation of this state machine, we use Markov models, where transitions are linked to probabilities. Every state in the model is linked to a set of automatically generated templates of the messages associated with that state. The information flow between the different states during a communication session is also automatically inferred and characterized as a set of rules. Finally, we have developed a network-level module called LENS, which is able to load the inferred model and simulate both sides of a real communication session.
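The Markov-model part of this step can be sketched in a few lines: given event-label sequences (which in PRISMA would come from the clustering stage; the sessions below are hypothetical), transition probabilities are simply normalized bigram counts.

```python
from collections import Counter, defaultdict

def infer_state_machine(sessions):
    """Estimate Markov transition probabilities from event sequences.

    `sessions` is a list of event-label sequences, one per observed
    communication session.
    """
    counts = defaultdict(Counter)
    for events in sessions:
        # Count every observed transition src -> dst
        for src, dst in zip(events, events[1:]):
            counts[src][dst] += 1
    # Normalize counts into per-state transition probabilities
    return {src: {dst: n / sum(nexts.values())
                  for dst, n in nexts.items()}
            for src, nexts in counts.items()}

# Hypothetical sessions over three inferred events A, B, C.
sessions = [["A", "B", "C"], ["A", "B", "B", "C"], ["A", "C"]]
machine = infer_state_machine(sessions)
# machine["A"]["B"] == 2/3: from state A we observed B twice and C once.
```

This is only the transition structure; the full PRISMA model additionally attaches message templates and information-propagation rules to each state, as described above.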

After evaluating the system with real protocols, we have used our method on network traffic collected from malicious software. The following figure shows an example of the state machine inferred from traffic of the Koobface worm:

This method is specially interesting for honeypots, as simulating malware communication can help to deceive an attacker and obtain more information about its behavior. By inspecting the extracted state-machine and the associated templates and rules a malware analyst can also gain insights into the inner workings of a sample from the collected network traces alone. This also makes PRISMA a valuable method for dynamic analysis of malware beyond honeypot applications.
Attacks like call fraud and identity theft often involve sophisticated, stateful attack patterns which, on top of normal communication, try to harm systems on a higher semantic level than usual attack scenarios. To detect these kinds of threats via specially deployed honeypots, at least a minimal understanding of the inherent state machine of a specific service is needed to lure potential attackers and to keep up a communication for a sufficiently large number of steps. To this end we propose PRISMA, a method for protocol inspection and state machine analysis, which infers a functional state machine and message format of a protocol from network traffic alone. We apply our method to three real-life network traces ranging from 10,000 up to 2 million messages, of both binary and textual protocols. We show that PRISMA is capable of simulating complete and correct sessions based on the learned models. A use case on malware traffic reveals the different states of the execution, rendering PRISMA a valuable tool for malware analysis.

Tammo Krueger, Hugo Gascon, Nicole Krämer and Konrad Rieck
ACM Workshop on Security and Artificial Intelligence (AISEC) October 2012

## November 01, 2012 12:51 PM

#### DIMVA 2013 - Call for Papers.

Here comes some good news: DIMVA, the Conference on Detection of Intrusions and Malware & Vulnerability Assessment, is celebrating its 10th anniversary in Berlin, Germany! For the last ten years, DIMVA has served as a premier forum for advancing the state of the art in intrusion detection and malware analysis. If you are working in this area of computer security, consider submitting a paper:
• Paper Submission Due: February 10, 2013
• Acceptance Notification: March 27, 2013
• Conference: July 18-19, 2013
DIMVA solicits submission of high-quality, original scientific papers presenting novel research on malware analysis, intrusion detection, and related systems security topics. DIMVA encourages submissions from the following broad areas:
• Intrusion Detection
(Novel approaches and domains; Prevention and response; Data leakage and exfiltration; Evasion and other attacks; Result correlation; Potentials and limitations)

• Malware Detection
(Automated analyses; Behavioral models; Prevention and containment; Infiltration; Acquisition and monitoring; Forensics and recovery; Underground economy)

• Vulnerability Assessment
(Vulnerability detection; Vulnerability prevention; Fuzzing techniques; Classification and evaluation; Situational awareness)

## October 28, 2012 09:49 PM

### Heiko Strathmann

#### Nice blog entry about SHOGUN's GSoC 2012

Sören wrote a nice summarising blog post on the GSoC 2012. See here.

## October 28, 2012 01:55 AM

### Soeren Sonnenburg

#### Shogun at Google Summer of Code 2012

The summer finally came to an end (yes, in Berlin we still had 20°C at the end of October) and, unfortunately, so did GSoC with it. This has been the second time for SHOGUN to be in GSoC. For those unfamiliar with SHOGUN: it is a very versatile machine learning toolbox that enables unified large-scale learning for a broad range of feature types and learning settings, like classification, regression, or explorative data analysis. I again played the role of org admin and co-mentor this year and would like to take the opportunity to summarize the enhancements to the toolbox and my GSoC experience. In contrast to last year, we required code contributions already in the application phase of GSoC, i.e., a (small) patch was mandatory for an application to be considered. This reduced the number of applications we received (48 proposals from 38 students instead of 70 proposals from about 60 students last year), but it also increased their overall quality.

In the end we were very happy to get 8 very talented students and the opportunity to boost the project thanks to their hard and awesome work. Thanks to Google for sponsoring three more students than in the last GSoC. Still, we gave one slot back to the pool for the benefit of the Octave project (they used it very wisely and Octave will now have a just-in-time compiler, which will benefit us all!).

SHOGUN 2.0.0 is the new release of the toolbox, including of course all the new features that the students implemented in their projects. On the one hand, modules that were already in SHOGUN have been extended or improved. For example, Jacob Walker has implemented Gaussian Processes (GPs), improving the usability of SHOGUN for regression problems. Chiyuan Zhang contributed a framework for multiclass learning, including state-of-the-art methods in this area such as Error-Correcting Output Coding (ECOC) and ShareBoost, among others. In addition, Evgeniy Andreev has made very important improvements w.r.t. the accessibility of SHOGUN. Thanks to his work with SWIG director classes, it is now possible to prototype in Python and use that code with the same flexibility as if it had been written in the C++ core of the project. On the other hand, completely new frameworks and other functionality have been added to the project as well. This is the case for the multitask learning and domain adaptation algorithms written by Sergey Lisitsyn and the kernel two-sample/dependence tests by Heiko Strathmann. Viktor Gal has introduced latent SVMs to SHOGUN and, finally, two students have worked on the new structured output learning framework: Fernando Iglesias designed this framework, introducing structured output machines into SHOGUN, while Michal Uricar implemented several bundle methods to solve the optimization problem of the structured output SVM.

It has been very fun and interesting to see how the work done in different projects was put together very early, even during the GSoC period. To give just one example involving the generic structured output framework and the accessibility improvements: it is possible to use the SWIG directors to implement the application-specific parts of a structured learning problem in Python and then use the rest of the framework (written in C++) to solve the new problem.

Students! You all did a great job and I am more than amazed what you all have achieved. Thank you very much and I hope some of you will stick around.

Besides all these improvements, it has been particularly challenging for me as org admin to scale the project. While I could still be deeply involved in each and every part of the project last GSoC, this was no longer possible this year. Learning to trust that your mentors are doing their job is something that didn't come easy to me. Having roughly monthly all-hands meetings did help, and so did monitoring the happiness of the students. I am glad that it all worked out nicely this year too. Again, I would like to mention that SHOGUN improved a lot code-base- and code-quality-wise. Students gave very constructive feedback about our lack of proper Vector/Matrix/String/Sparse Matrix types. We now have all of these implemented, doing automagic memory garbage collection behind the scenes. We have started the transition to Eigen3 as our matrix library of choice, which made quite a number of algorithms much easier to implement. We generalized the label framework (CLabels) to be tractable not just for classification and regression but also for multitask and structured output learning.

Finally, we have had quite a number of infrastructure improvements. Thanks to GSoC money we have a dedicated server for running the buildbot/buildslaves and the website. The ML Group at TU Berlin sponsors virtual machines for building SHOGUN on Debian and Cygwin. Viktor Gal stepped up to provide buildslaves for Ubuntu and FreeBSD. Gunnar Raetsch's group is supporting Red Hat based build tests. We have Travis CI testing pull requests for breakage even before merges. Code quality is now monitored using LLVM's scan-build. Bernard Hernandez appeared and wrote a fancy new website for SHOGUN.

A more detailed description of the achievements of each of the students follows:

• Kernel Two-sample/Dependence test

Heiko Strathmann, mentored by Arthur Gretton, worked on a framework for kernel-based statistical hypothesis testing. Statistical tests to determine whether two random variables are equal/different or statistically independent are an important tool in data analysis. However, when data are high-dimensional or in non-numerical form (documents, graphs), classical methods fail. Heiko implemented recently developed kernel-based generalisations of classical tests which overcome these issues by representing distributions in high-dimensional so-called reproducing kernel Hilbert spaces. By doing so, theoretically any pair of distributions can be distinguished.

Implemented methods include two-sample testing with the Maximum Mean Discrepancy (MMD) and independence testing using the Hilbert Schmidt Independence Criterion (HSIC). Both methods come in different flavours regarding computational costs and test constructions. For two-sample testing with the MMD, a linear time streaming version is included that can handle arbitrary amounts of data. All methods are integrated into a newly written, flexible framework for statistical testing which will be extended in the future. A book-style tutorial with descriptions of the algorithms and instructions on how to use them is also included.

• Multitask learning algorithms

Like Heiko, Sergey Lisitsyn participated in the GSoC programme for the second time. This year he focused on implementing multitask learning algorithms. Multitask learning is a modern approach to machine learning that learns a problem together with other related problems at the same time, using a shared representation. This often leads to a better model for the main task, because it allows the learner to exploit the commonality among the tasks. During the summer Sergey ported several algorithms from the SLEP and MALSAR libraries, with further extensions and improvements: L12 group tree and L1q group multitask logistic regression and least squares regression, trace-norm multitask logistic regression, clustered multitask logistic regression, and basic group and group tree lasso logistic regression. All the implemented algorithms use the COFFIN framework for flexible and efficient learning, and some were implemented efficiently using the Eigen3 library.

• Latent SVMs: implementation and integration of existing GPL code

A generic latent SVM and, additionally, a latent structured output SVM have been implemented. This machine learning algorithm is widely used in computer vision, namely in object detection. Other useful application fields are motif finding in DNA sequences and noun phrase coreference, i.e. clustering nouns such that each cluster refers to a single object.

It is based on defining a general latent feature map Psi(x,y,h) depending on the input variable x, the output variable y and the latent variable h. Deriving a class from the base class allows the user to implement additional structural knowledge for efficient maximization over the latent variable, alternative ways of computation, or on-the-fly loading of the latent features Psi as a function of the input, output and latent variables.
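To make the interface idea concrete, here is a minimal, hypothetical Python sketch; the class and method names are ours for illustration, not SHOGUN's, and a real implementation would replace the exhaustive search with problem-specific structural knowledge:

```python
import numpy as np

class LatentModel:
    """Hypothetical sketch of a latent SVM interface: a subclass
    supplies the joint feature map Psi(x, y, h) and can override
    argmax_h with an efficient, problem-specific maximization."""
    def psi(self, x, y, h):
        raise NotImplementedError

    def argmax_h(self, w, x, y, candidates):
        # generic fallback: exhaustive search over latent candidates
        scores = [float(w @ self.psi(x, y, h)) for h in candidates]
        return candidates[int(np.argmax(scores))]

class WindowModel(LatentModel):
    """Toy example: the latent variable h is a position in sequence x
    (think of an object's location in an image for detection)."""
    def psi(self, x, y, h):
        return np.array([x[h], y])
```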

• Bundle method solver for structured output learning

We have implemented two generic solvers for supervised learning of structured output (SO) classifiers. First, we implemented the current state-of-the-art Bundle Method for Regularized Risk Minimization (BMRM) [Teo et al. 2010]. Second, we implemented a novel variant of the classical Bundle Method (BM) [Lemarechal 1978] which achieves the same precise solution as the BMRM but in time up to an order of magnitude shorter [Uricar et al. 2012].

The main practical benefits of the implemented solvers are their modularity and proven convergence guarantees. To train a particular SO classifier, it suffices to provide the solver with a routine evaluating the application-specific risk function. This feature is invaluable for designers, who can concentrate on tuning the classification model instead of spending time developing new optimization solvers. The convergence guarantees remove the uncertainty inherent in the use of online approximate solvers.

The implemented solvers have been integrated into the structured output framework of the SHOGUN toolbox and tested on real-life data.

• Built generic structured output learning framework

During GSoC 2012 Fernando implemented a generic framework for structured output (SO) problems. Structured output learning deals with problems where the prediction is an object with complex structure, e.g. a graph, a tree or a sequence. SHOGUN's SO framework is flexible and easy to extend [1]. Fernando implemented a naïve cutting plane algorithm for the SVM approach to SO [2]. In addition, he implemented a use case of the framework for labelled sequence learning, the so-called Hidden Markov SVM. The HM-SVM can be applied to problems in different fields, such as gene prediction in bioinformatics or speech-to-text in pattern recognition.

• Improving accessibility to shogun

During the latest Google Summer of Code, Evgeniy improved the Python modular interface. He added a new SWIG-based feature, director classes, enabling users to extend SHOGUN classes with Python code, and made SHOGUN Python 3 ready. Evgeniy also added Python protocols for the most used containers (vectors, matrices, features), which makes it possible to work with SHOGUN data structures just like with numpy arrays, with no copying at all. For example, one can now modify SHOGUN's RealFeatures in place.

• Implement Gaussian Processes and regression techniques

Jacob implemented Gaussian Process Regression in the Shogun Machine Learning Toolbox. He wrote a complete implementation of basic GPR as well as approximation methods such as the FITC (Fully Independent Training Conditional) method and the Laplace approximation method. Users can utilize this feature to analyze large-scale data in a variety of fields. Scientists can also build on this implementation to conduct research extending the capability and applicability of Gaussian Process Regression.

• Build generic multiclass learning framework

Chiyuan defined a new multiclass learning framework within SHOGUN. He reorganized the previous multiclass learning components and refactored the CMachine hierarchy in SHOGUN. Then he generalized the existing one-vs-one and one-vs-rest multiclass learning schemes to general ECOC encoding and decoding strategies. Besides this, several specific multiclass learning algorithms were added, including ShareBoost with feature selection ability, Conditional Probability Tree with online learning ability, and Relaxed Tree.

## October 16, 2012 01:08 PM

### Heiko Strathmann

#### Streaming Features for Linear Time MMD

I finally finished an important and very cool extension to my GSoC 2012 project: making the linear time MMD statistic work with streaming data. In particular, SHOGUN's streaming framework is now used.

By design, the linear time MMD statistic, given as
$\text{MMD}_l^2=\frac{2}{m}\sum_{i=1}^{m/2}h((x_{2i-1},y_{2i-1}),(x_{2i},y_{2i}))$
where
$h((x_{i},y_{i}),(x_{j},y_{j}))=k(x_{i},x_{j})+k(y_{i},y_{j})-k(x_{i},y_{j})-k(x_{j},y_{i})$
is very well suited for streaming data since only four examples have to be held in memory at once. Once the sum in the h-statistic is computed, the used data can be "forgotten". As I described in my M.Sc. thesis (link), this makes it possible to process infinite amounts of data and therefore results in possibly more accurate two-sample tests. This holds in particular in cases where the amount of data needed to solve the problem is larger than computer memory.
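For illustration, here is a self-contained toy version of the linear-time estimator with a Gaussian kernel; this is a standalone sketch following the definition above, not SHOGUN's implementation:

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two points."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def linear_time_mmd(X, Y, kernel=gaussian_kernel):
    """Linear-time MMD estimate: average of h-statistics over
    non-overlapping pairs, so only four points are used at a time."""
    n_pairs = len(X) // 2
    h_values = np.empty(n_pairs)
    for i in range(n_pairs):
        x1, x2 = X[2 * i], X[2 * i + 1]
        y1, y2 = Y[2 * i], Y[2 * i + 1]
        # after this pair is consumed, the data could be discarded
        h_values[i] = (kernel(x1, x2) + kernel(y1, y2)
                       - kernel(x1, y2) - kernel(x2, y1))
    return h_values.mean()
```

If X and Y come from the same distribution the estimate is close to zero; for well-separated distributions it is clearly positive.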

During the GSoC, I implemented the linear time MMD on top of SHOGUN's standard features interface, which made it necessary to hold all data in memory. With the latest modifications (link to patch), the class for the linear time MMD (class reference) now accepts streaming features (class reference) only. This makes it possible to process arbitrarily large amounts of data in a very comfortable way. In order not to suffer from overhead while streaming examples one by one, a block size may be specified: that many examples are processed at once, and the block size should be chosen as large as fits into memory.

Recall that the linear time MMD's distribution is (asymptotically) normal, and when the number of samples is large enough its variance can easily be estimated using the empirical variance of the individual h-statistics (the MMD being their mean). The new implementation in SHOGUN does this on the fly using D. Knuth's online variance algorithm [1] (implementation link). Therefore, a complete two-sample test is now possible in linear time and constant space.
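The online variance update itself is short; a sketch of the Knuth/Welford algorithm referenced above:

```python
class OnlineVariance:
    """Knuth/Welford online algorithm: running mean and variance
    in a single pass with constant memory."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # unbiased sample variance
        return self.m2 / (self.n - 1)
```

Each h-statistic is fed to `update` as it is computed, so the test's variance estimate never requires storing the stream.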

A nice illustration of the advantages of this approach can be found in the examples for the linear time MMD (link). A data generator for artificial data which implements SHOGUN's streaming interface is passed to the MMD class. It produces data from the underlying distribution on the fly.

[1] Donald E. Knuth (1998). The Art of Computer Programming, volume 2: Seminumerical Algorithms, 3rd edn., p. 232. Boston: Addison-Wesley.

## October 12, 2012 04:28 PM

### mloss.org

Under the rather self-deprecating title "I wanted to Predict Elections with Twitter and all I got was this Lousy Paper", Daniel Gayo-Avello takes us on a tour of how hard it is to do reproducible research, and how often authors take shortcuts. From the abstract:

"Predicting X from Twitter is a popular fad within the Twitter research subculture. It seems both appealing and relatively easy. Among such kind of studies, electoral prediction is maybe the most attractive, and at this moment there is a growing body of literature on such a topic. This is not only an interesting research problem but, above all, it is extremely difficult. However, most of the authors seem to be more interested in claiming positive results than in providing sound and reproducible methods."

It is an interesting survey of papers that use Twitter data.

He lists some flaws in current research on electoral prediction, but they are generally applicable to any machine learning paper (my comments in brackets):

1. It's not prediction at all! I have not found a single paper predicting a future result. (Bootstrap and cross-validation are not prediction either.)
2. Chance is not a valid baseline...
3. There is not a commonly accepted way of "counting votes" in Twitter
4. There is not a commonly accepted way of interpreting reality! (In supervised learning, we tend to ignore the fact that there is no ground truth in reality.)
5. Sentiment analysis is applied as a black-box... (As machine learning algorithms get more complex, more people will tend to use machine learning software as a black box)
6. All the tweets are assumed to be trustworthy. (I don't know if anybody is doing adversarial election prediction)
7. Demographics are neglected. (The biased sample problem)
8. Self-selection bias.

The window is closing on those who want to predict the upcoming US elections from X.

## September 11, 2012 01:31 PM

### Heiko Strathmann

#### GSoC 2012 is over

GSoC 2012 has been over for a few weeks now. It was a pretty cool summer for me. As last year, I learned a lot. This year, though, my project was a bit more research oriented, which was nice since it allowed me to connect my work for SHOGUN with the things I do at uni. I even mentioned it in my Master's dissertation (link), which was also about statistical hypothesis testing with the MMD. Working on the dissertation at the same time as on GSoC was sometimes exhausting. It eventually worked out fine since both were closely related. I would only suggest taking on other big commitments if they are connected to your GSoC project; when that condition is met, however, the rewards multiply thanks to the synergy.

The other students working on SHOGUN also did very cool projects. All of these are included in the SHOGUN 2.0 release (link). The project now also has a new website, so it's worth taking a closer look. Some of the other (really talented) guys might stay with SHOGUN as I did last year. This once more gives a major boost to development. Thanks to all those guys. I also owe thanks to Sören and Sergey, who organised most things and made this summer so rewarding.

In the near future I will try to add some extensions to the statistical testing framework that I thought of during the summer but did not have time to implement: streaming features for the linear time MMD, a framework for kernel selection which includes all the methods investigated in my Master's dissertation, and finally unit tests using SHOGUN's new framework for that. I will update the SHOGUN project page of my website (link). I might as well send some tweets from SHOGUN's new twitter account (link).

## August 30, 2012 01:42 PM

### mloss.org

#### John Hunter - the author of matplotlib - has died.

John Hunter, the main author of matplotlib, has died of cancer. For those interested, his close friend Fernando gives more details here. John was a long-term developer of matplotlib (continuing even while he was working in industry) and a father of three kids. You might consider donating to the John Hunter Memorial Fund.

We had John as an invited speaker at one of our NIPS machine learning open source software workshops. He gave quite an entertaining talk featuring a live demo. I recall that he started with a command prompt, typing everything (including fetching some live stock-exchange data) in Python at some insane speed. Videolectures recorded his lecture. I don't know about others, but I have basically plotted all my scientific results using matplotlib and Python for the last several years.

Rest in peace John - your contributions will be remembered.

## August 14, 2012 02:27 PM

### Heiko Strathmann

#### 11th GSoC weekly report: Done!

This will be my last weekly report for this year's Summer of Code! Last week, I did not write a report since I was very busy with experiments for a rebuttal for the NIPS submission (see 2nd GSoC weekly report). This week was more productive: I continued polishing the new framework for statistical tests, squeezed out some final bugs and made a few things more efficient.

I also created graphical examples for the linear and quadratic time MMD and for HSIC-based tests. These serve to illustrate how the methods work on simple datasets. They sample the underlying statistic's null and alternative distributions using all the different methods I implemented, and plot the distributions with test thresholds (as well as the data). For the MMD tests, the dataset contains samples from two multivariate Gaussian distributions with unit variance in every component and equal means in all but one component. The HSIC test uses data where dependence is induced via rotation (see last report). Below are screenshots of the output of the examples.

These images were also added to the shogun-tutorial. I added a part about independence testing and corrected some mistakes in there. All the methods I implemented are now covered by the tutorial. Another documentation-related thing I did was to update the doxygen-based source code documentation. In particular, I cleaned up the horrible mess in the CStatistics class and replaced all the ASCII art by LaTeX. Although there are still things to do, my project is now "done" in terms of GSoC :) It was a nice summer! I guess I will be extending it with some ideas that came up while working with kernel two-sample tests recently.

For the last week, I intend to get some unit testing done and start to focus on things needed for our upcoming 2.0 release (bug hunting, fixing warnings, implementing things that people request). I will also write an overall summary of the GSoC next month or so. Next month will be busy since I also have to finish my Master's project.

## July 30, 2012 04:50 PM

### Heiko Strathmann

#### 10th GSoC weekly report: Slowly getting ready

Step by step, my project is entering its final state :)
Last week, I added new data generation methods, which are used in a new example for independence tests with HSIC. It demonstrates that the HSIC-based test is able to capture dependence which is induced by rotating data that has zero correlation, one of the problems from the paper [1]. Here is a picture; the question is: are the two dimensions dependent? Or rather, is a test able to capture that? (The correlation is almost zero; the dependence is induced via rotation.)
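The construction behind such datasets can be sketched as follows: start with two independent, equal-variance, non-Gaussian coordinates (so the covariance matrix is a multiple of the identity) and rotate. The rotation leaves the correlation at zero but makes the coordinates dependent. A toy version, not the actual SHOGUN data generator:

```python
import numpy as np

def rotated_data(m, angle=np.pi / 4, seed=0):
    """Zero-correlation but dependent 2-d data: rotate independent,
    equal-variance, non-Gaussian (uniform) coordinates."""
    rng = np.random.default_rng(seed)
    Z = rng.uniform(-1, 1, (m, 2))  # independent coordinates
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])  # 2-d rotation matrix
    return Z @ R.T
```

Since the coordinates have equal variance, the rotated covariance is still diagonal, so correlation-based tests see nothing, while an HSIC test can detect the dependence (the rotated data would be independent only if the coordinates were Gaussian).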

I also realised that my current class structure had problems doing bootstrapping for HSIC, so I refactored a bit. Bootstrapping is now also available for HSIC, using the same code that does it for two-sample tests. I also removed some redundancy: independence and two-sample tests are very similar problems, and the implementations should share code where possible.

Another thing that was missing so far was the computation of test thresholds; until now, only p-values could be computed. Since different people have different tastes about this, I added both methods. Checking a test statistic against a threshold is straightforward and gives a binary answer; computing a p-value gives the position of the test statistic in the null distribution, which contains more information. To compute thresholds, one needs the inverse CDF of the null distribution. In the bootstrapping case, this is easy: one simply reports the sample that corresponds to the desired quantile. For the cases where a normal or gamma distribution was fitted, I imported some more routines from the nice ALGLIB toolbox.
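In the bootstrapping case, both quantities amount to a few lines; a numpy sketch (not SHOGUN's implementation), where `null_samples` are statistics sampled under the null hypothesis:

```python
import numpy as np

def bootstrap_threshold(null_samples, alpha=0.05):
    """Test threshold: the (1 - alpha)-quantile of the sampled null
    distribution, i.e. the inverse empirical CDF at 1 - alpha."""
    return np.quantile(null_samples, 1.0 - alpha)

def bootstrap_p_value(null_samples, statistic):
    """p-value: fraction of null samples at least as large as the
    observed statistic."""
    return np.mean(np.asarray(null_samples) >= statistic)
```

Rejecting when the statistic exceeds `bootstrap_threshold(...)` and rejecting when `bootstrap_p_value(...) < alpha` are two views of the same decision.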

For this week, I plan to continue with finishing touches: documentation, examples/tests, etc. Another idea I had is to make the linear time MMD test work with SHOGUN's streaming features, since infinite or streaming data is the main use case for it.

[1]: Gretton, A., Fukumizu, K., Teo, C., & Song, L. (2008). A kernel statistical test of independence. Advances in Neural Information Processing Systems

## July 23, 2012 03:17 PM

### Heiko Strathmann

#### 9th GSoC weekly report: Bugs again! and documentation

I spent a fair fraction of last week on something which is not really related to my project: making cross-validation possible for multi-class MKL (multiple kernel learning) machines using my framework from last year's GSoC. To this end, I added subset support to SHOGUN's combined features class, and then went after a bunch of bugs that prevented it from working. But it now does! So cross-validation should now be possible in a lot more situations. Thanks to Eric, who reported all the problems.

Apart from that, I worked on documentation for the new statistical testing framework. I added doxygen class descriptions, see for example CQuadraticTimeMMD. More importantly, I started writing a section for the SHOGUN tutorial, a book-like description of all algorithms, which we hope will grow in the future. You can find the $$\LaTeX$$ sources at github. We should/will add a live pdf download soon.

Another minor thing I implemented is a data generator class. I think it is nicer to illustrate new algorithms with data that is not fixed (i.e., loaded from a file). The nice thing is that it is available for examples in all interfaces; so far I had implemented this separately for C++ and Python, so this is more elegant now. I bet some of the other projects will need similar methods for their demos too, so please extend the class!

This week, I will add more data generation methods to the generator, in particular data that can be used to illustrate the recently implemented HSIC test for independence. Reference datasets are quite complicated, so this might take a while. Another thing we recently introduced is a new framework for unit tests, so I will write these for all the new methods I created recently.

## July 16, 2012 07:58 PM

### Heiko Strathmann

#### 8th GSoC weekly report: Examples, Bugs, and Kernel Choice

Last week was a mixed one. Next to new examples, tests, bugfixes, and helper methods, the biggest piece of implementation is an automatic kernel selection algorithm for the linear time MMD. This is one of the things I worked on during my Master's project at UCL.
It selects optimal kernel weights for kernels of the family
$\mathcal{K}:=\{k : k=\sum_{u=1}^d\beta_uk_u,\sum_{u=1}^d\beta_u\leq D,\beta_u\geq0, \forall u\in\{1,...,d\}\}$
by solving the convex program
$\min \{ \beta^T\hat{Q}\beta : \beta^T \hat{\eta}=1, \beta\succeq0\}$
where $$\hat{Q}$$ is a linear time estimate of the covariance of the MMD estimates and $$\hat{\eta}$$ is a linear time estimate of the MMD.
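SHOGUN solves this program with libqp; as a rough standalone sketch, the same problem can also be handed to a generic constrained solver. In the snippet below, `Q` and `eta` are placeholders for the linear-time estimates described above, not the actual estimators:

```python
import numpy as np
from scipy.optimize import minimize

def select_kernel_weights(Q, eta):
    """Solve  min beta^T Q beta  s.t.  beta^T eta = 1, beta >= 0
    with a generic SLSQP solver (SHOGUN itself uses libqp)."""
    d = len(eta)
    # start from a point that already satisfies beta^T eta = 1
    x0 = np.full(d, 1.0 / (d * eta.mean()))
    result = minimize(
        fun=lambda b: b @ Q @ b,
        x0=x0,
        jac=lambda b: 2 * Q @ b,
        bounds=[(0, None)] * d,  # beta >= 0
        constraints={"type": "eq", "fun": lambda b: b @ eta - 1},
        method="SLSQP",
    )
    return result.x
```

For example, with Q = diag(1, 2) and eta = (1, 1), the constrained minimiser is beta = (2/3, 1/3).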

I already described this a few weeks ago, when the method was developed. It is now integrated into SHOGUN. Efficient kernel selection, yeah :) It uses a convex solver called libqp, by Vojtech Franc, one of the mentors of this year's GSoC. Still, I need to think of a nice way of embedding it into SHOGUN's model selection framework, which isn't as straightforward as it first seems.

This week, bug-hunting continues with a bug that gives wrong results during cross-validation on multi-class machines. Afterwards, I will try to polish my code a bit, especially the documentation (and tutorial), and continue with more examples/demos for the new statistical testing framework.

## July 14, 2012 12:10 AM

### Fernando Jose Iglesias Garcia

#### A memory-leak-killer combination

Once you have finished coding a program, tested it, debugged it and checked that its behaviour is the expected one, it is quite normal to be invaded by a superhero-like feeling. You just feel smart. The mere thought that your program may not actually be correct hurts your pride, a lot. Well, C/C++ programmers (along with developers in other languages that do not use garbage collection) know that this can be the case: your program may be leaking memory, i.e. it has reserved and used some memory but not freed it properly.

During these last two years I have been using Valgrind [1] to help me find the sources of memory leaks. Valgrind has a tool suited for this purpose called memcheck [2]. Even though I have successfully used this tool several times, I must admit that there were some details in its output that I did not quite understand: for example, the difference between memory blocks directly and indirectly lost, or whether blocks that were still reachable were caused by errors in my code. I am sure that the answers to these questions are in the manual. But, to be honest, I have never read a manual from the first to the last page before starting to use a tool; in my honest opinion, it is just too time-consuming and full of unintelligible details for an inexperienced user. Today, I found this document [3] extremely useful for this task. The concepts are presented in a simple and concise way, together with pictures and examples that make the topic clear. A MUST read for anyone getting in touch with Valgrind memcheck.

Apart from this, in SHOGUN we use our own reference counting system to track the number of objects that hold a pointer to another object. In this way, when one object is deleted, the reference counts of all the objects it was holding are decremented by one, and those objects are in turn deleted if their counts reach zero. This system turns out to be very comfortable to use. And why am I talking about reference counting here? Because you must use it correctly if you want your SHOGUN program to be free of memory leaks. There is a very nice option which enables traces to let you know when a new SHOGUN object is created, when a reference count is increased or decreased, and when an object is deleted. This option, together with Valgrind, has enabled me to detect a bunch of memory leaks in my code in a matter of minutes, FTW! I am completely sure that this task would have taken me at least a couple of hours some time ago.
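The semantics of such reference counting can be sketched in a few lines. The names below are modelled loosely on SHOGUN's ref/unref macros, but the sketch itself is a hypothetical illustration, not SHOGUN code:

```python
class RefCounted:
    """Minimal sketch of reference-counting semantics: an object is
    'deleted' (here just flagged) when its count drops to zero."""
    def __init__(self):
        self.refcount = 0
        self.deleted = False

def sg_ref(obj):
    """Another object starts holding a pointer to obj."""
    obj.refcount += 1

def sg_unref(obj):
    """A holder releases obj; delete it when nobody holds it."""
    obj.refcount -= 1
    if obj.refcount <= 0:
        obj.deleted = True  # in C++ this would be `delete obj`
```

A leak in this scheme is simply an object whose count never reaches zero because some holder forgot to unref it, which is exactly what the debug traces make visible.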

This definitely made my day and I must thank Sergey Lisitsyn for the suggestion of enabling this debug output and Aleksander for his wonderful document about Valgrind memory leaks reports.

PS. Take a look at the code profiling tool provided by Valgrind, cachegrind [4], and its KDE graphical interface kcachegrind! They are simply awesome.

References

[1] Valgrind. http://valgrind.org/
[2] Valgrind manual page about the tool memcheck.
http://valgrind.org/docs/manual/mc-manual.html
[3] Understanding Valgrind memory leaks reports.
https://sigquit.wordpress.com/2010/02/04/understanding-valgrind-memory-leak-reports/
[4] Valgrind manual page about the tool cachegrind.
http://valgrind.org/docs/manual/cg-manual.html
[5] An interesting thread on Stack Overflow about the different types of memory leaks detected by Valgrind memcheck.
http://stackoverflow.com/questions/3840582/still-reachable-leak-detected-by-valgrind

## July 14, 2012 12:10 AM

#### Understanding Valgrind memory leak reports

I tried to write a tutorial focusing on the types of memory leaks detected internally by Valgrind and the leak reports it generates.

The PDF is available in the following URL:
http://es.gnu.org/~aleksander/valgrind/valgrind-memcheck.pdf

And the simple C tester to generate each type of memory leak (cases 1 to 9 in the Valgrind Manual) is available here:
http://es.gnu.org/~aleksander/valgrind/valgrind-memcheck.c

## July 09, 2012 03:25 PM

### Heiko Strathmann

#### 7th GSoC weekly report: Hilbert Schmidt Independence Criterion

Finally, I started on kernel based (in)dependence tests last week. These are tests that try to find out whether two random variables $$\textbf{x},\textbf{y}$$ are independent, i.e. whether their joint distribution factorises into the individual ones. The null hypothesis (that may be rejected) is $H_0:P_\textbf{x}P_\textbf{y}=P_{\textbf{x}\textbf{y}}$

These kinds of tests basically work like two-sample tests: Given one set of samples from each random variable
$Z=(X,Y)=\{(x_1,y_1),\ldots,(x_m,y_m)\}$
a test statistic is computed and then compared against the distribution of the statistic under the null hypothesis. If the statistic falls in the upper tail of that distribution, the null hypothesis is rejected since it is unlikely that the observed value was generated under it.

The class of independence tests I will implement for my project is based on the Hilbert Schmidt independence criterion (HSIC), which carries the above procedure over to a reproducing kernel Hilbert space (RKHS). The (biased version of the) HSIC statistic itself is simply given by
$\text{HSIC}_b(Z)=\frac{1}{m^2}\text{trace}(KHLH)$
where $$K,L$$ are kernel matrices of the input samples $$X,Y$$ in some RKHS and $$H=I-\frac{1}{m}\textbf{1}\textbf{1}^T$$ is a centring matrix.
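To make the statistic concrete, here is a small numpy sketch (my own, not SHOGUN's implementation) that computes $$\text{HSIC}_b$$ with Gaussian kernels:

```python
import numpy as np

def rbf(A, sigma=1.0):
    # Gaussian (RBF) kernel matrix of the rows of A
    sq = np.sum(A * A, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic_biased(X, Y, sigma=1.0):
    # HSIC_b(Z) = trace(K H L H) / m^2 with centring matrix H = I - (1/m) 1 1^T
    m = X.shape[0]
    K, L = rbf(X, sigma), rbf(Y, sigma)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / m ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
Y_dep = X + 0.1 * rng.normal(size=(100, 1))  # strongly dependent on X
Y_ind = rng.normal(size=(100, 1))            # independent of X
print(hsic_biased(X, Y_dep), hsic_biased(X, Y_ind))
```

With positive semi-definite kernels the biased statistic is non-negative, and it should come out much larger for the dependent pair than for the independent one.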

I integrated a general modular framework for independence tests into SHOGUN. The HSIC class is the first kernel-independence test that works. Interfaces are very similar to the two-sample test, however, they are not quite the same for various reasons. That's why there is another class for independence testing next to the one for two-sample testing.

As for the two-sample tests, the null-distribution may simply be approximated by bootstrapping, i.e. merging the samples and computing the statistic many times. This is now possible for any independence test. Another method to approximate the null-distribution for HSIC is fitting a Gamma distribution [1] as

$m\text{HSIC}_b(Z)\sim\frac{x^{\alpha-1}\exp(-\frac{x}{\beta})}{\beta^\alpha \Gamma(\alpha)}$ where
$\alpha=\frac{(\textbf{E}(\text{HSIC}_b(Z)))^2}{\text{var}(\text{HSIC}_b(Z))} \quad \text{and}\quad\beta=\frac{m\text{var}(\text{HSIC}_b(Z))}{\textbf{E}(\text{HSIC}_b(Z))}$
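The moment matching itself is a couple of lines; here is a sketch of my own (function name and demo numbers are mine, not the SHOGUN code) that fits $$\alpha,\beta$$ from a set of null samples of the statistic:

```python
import numpy as np

def fit_gamma_null(null_stats, m):
    # Moment-match: alpha = E^2/var, beta = m*var/E, so that
    # m * HSIC_b is approximately Gamma(alpha, beta) under H0
    mean = np.mean(null_stats)
    var = np.var(null_stats)
    return mean ** 2 / var, m * var / mean

# demo on synthetic "null" statistics whose true scaled law is Gamma(2, 0.5)
rng = np.random.default_rng(1)
m = 200
null_stats = rng.gamma(shape=2.0, scale=0.5, size=5000) / m
alpha, beta = fit_gamma_null(null_stats, m)
# a 95% test threshold for m * HSIC_b, via Monte Carlo from the fitted Gamma
threshold = np.quantile(rng.gamma(alpha, beta, size=100000), 0.95)
```

On the synthetic data the fit recovers roughly $$\alpha\approx 2$$, $$\beta\approx 0.5$$, which is the point of the exercise: two moments of the bootstrapped statistic are enough to get a usable threshold.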

It's also already implemented! There are already modular interfaces for the new classes and some simple tests. I will extend these during this week. Time passes fast: The mid-term evaluation is this week already. I pretty much enjoyed the first half :)

[1]: Gretton, A., Fukumizu, K., Teo, C., & Song, L. (2008). A kernel statistical test of independence.

## July 04, 2012 10:01 AM

### Fernando Jose Iglesias Garcia

#### Finally, a new project update

It is about time to say something again about how the project is going. First of all, as a matter of excuse (probably a bad one though), I want to explain why the blog has not been updated lately. There have been a couple of weeks in which I have been debugging the code for the primal formulation of the SO-SVM based on MOSEK using a simple multiclass classification example. Those two weeks were basically that: debugging, testing, debugging again, testing, … not interesting stuff to tell the truth, but very necessary in any case! However, last week I started to work on the – probably – most interesting part of the project, the Hidden Markov Support Vector Machine (HM-SVM) [1].

I like to think of HM-SVMs as a particular instance of the general Structured Output (SO) problem I have been talking about in the previous posts. In SO learning we are trying to find a function $f$ that takes an object from the space $\mathcal{X}$ and outputs an object from the space $\mathcal{Y}$. In HM-SVMs the objects of $\mathcal{Y}$ are sequences of discrete data (e.g. integers). In order to deal with the structure of the sequences properly, we extract features of the data in the space $\mathcal{Y}$ together with the data in the space $\mathcal{X}$. This is done via a function we denote as $\Psi(x, y)$, whose definition appears in the tutorial about HM-SVMs that Nico, my GSoC mentor, wrote [2, pages 4 and 5].

Another important piece of the SO-SVM is the argmax function. In HM-SVMs this function is computed using the well-known Viterbi algorithm. I had never studied this algorithm before, but I was happy to discover that its structure is similar to the forward-backward algorithm used to train an HMM from data, which I had already studied and implemented [3]. I found this tutorial [4] on the Viterbi algorithm very useful and easy to understand. My implementation of this algorithm in SHOGUN is based on the one in the HM-SVM toolbox [5] (in particular, in this file [6]).
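For illustration only, here is a compact log-space Viterbi in Python/numpy (my own sketch of the textbook algorithm; the SHOGUN version based on [6] is in C++ and more general):

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Most likely state sequence of an HMM, in log space to avoid underflow.
    log_pi: (S,) initial log-probs; log_A: (S,S) transitions; log_B: (S,O) emissions."""
    S, T = len(log_pi), len(obs)
    delta = np.zeros((T, S))            # best log-prob of a path ending in state s at time t
    psi = np.zeros((T, S), dtype=int)   # back-pointers: best predecessor of s at time t
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: come from i, move to j
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(S)] + log_B[:, obs[t]]
    path = [int(np.argmax(delta[-1]))]           # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# toy 2-state HMM: state 0 mostly emits symbol 0, state 1 mostly emits symbol 1
pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.1, 0.9]])
path = viterbi(np.log(pi), np.log(A), np.log(B), [0, 0, 1, 1])
print(path)  # [0, 0, 1, 1]
```

The forward pass is the same dynamic program as forward-backward, except that sums over predecessors are replaced by maximisations plus back-pointers.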

One more piece of code I have implemented in the toolbox for the HM-SVM is the loss function, commonly denoted as $\Delta$. In this case, we use the Hamming distance defined for strings. The inputs of the loss function are sequences of the same length, so the Hamming loss is simply the count of the number of elements that differ between the sequences. (Think of it as the minimum number of modifications one may make to one of the sequences to make it equal to the other; provided that the sequences are of the same length, the only modification allowed is changing the value of an element in one of the sequences.)
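In code this loss is essentially a one-liner; a quick sketch (function name is mine):

```python
def hamming_loss(y_true, y_pred):
    # Delta(y_true, y_pred): number of positions at which two
    # equal-length sequences differ
    if len(y_true) != len(y_pred):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(y_true, y_pred))

print(hamming_loss([0, 1, 1, 0], [0, 1, 0, 1]))  # 2
```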

These three functions, together with some data structures to hold the features and labels used by the HM-SVM, constitute what I have implemented in SHOGUN so far for this interesting discriminative model. The plan now is to go over the code with Nico to correct the possible mistakes I have introduced, and to generate some training and test data so I can start the debugging phase! My idea is to port this function [7] from the HM-SVM toolbox that generates random two-state data.

Comments and suggestions are very welcome!

You can find the code I have been talking about in github:
https://github.com/iglesias/shogun/tree/so/src/shogun/
(mainly in the files structure/HMSVMModel.[cpp|h], features/MatrixFeatures.[cpp|h], structure/HMSVMLabels.[cpp|h]).

References

[1] Altun, Y., Tsochantaridis, I., Hofmann, T. Hidden Markov Support Vector Machines.
http://cs.brown.edu/~th/papers/AltTsoHof-ICML2003.pdf
[2] SHOGUN Hidden Markov SO-SVM written by Nico Görnitz.
https://dl.dropbox.com/u/11020840/shogun/hmsvm.pdf
[3] The code of my HMM implementation for one of the homeworks of the Artificial Intelligence course that I took at KTH. https://github.com/iglesias/hmm-ai-ducks
[4] Viterbi algorithm tutorial.
http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/
html_dev/viterbi_algorithm/s1_pg2.html
[5] Entry in mloss.org for the HM-SVM toolbox written by Gunnar Raetsch and Georg Zeller.
http://mloss.org/software/tags/hmsvm/
[6] Viterbi algorithm implementation in the HM-SVM toolbox.
https://dl.dropbox.com/u/11020840/shogun/best_path.cpp
[7] Data generation for a two-state model in the HM-SVM toolbox.
https://dl.dropbox.com/u/11020840/shogun/simulate_data.m

## July 02, 2012 03:36 PM

### Heiko Strathmann

#### 6th GSoC weekly report: First modular examples and other stuff

Last week's changes were all rather subtle:

• I created some first modular examples in python,
• fixed this big bug in the model selection trees I talked about last week (nasty!),
• added some convenience methods for the two-sample-test constructors (there is now a new method in CFeatures to append feature objects)
• and corrected a bunch of bugs on the fly.

This week, I will do some more work on the examples and then start working on independence testing.

## June 25, 2012 05:26 PM

### Heiko Strathmann

#### 5th GSoC weekly report: Almost finished MMD tests

Last week, I have been working silently (no Internet connection) on finishing all MMD related tests. What is new is a distinction between biased/unbiased test statistics for the quadratic MMD, and the Gaussian approximation of the null-distribution for the linear time MMD - and more tests.

Since the linear time MMD, defined as
$\text{MMD}_l^2=\frac{1}{m}\sum_{i=1}^{m}h((x_{2i-1},y_{2i-1}),(x_{2i},y_{2i}))$
where
$h((x_{i},y_{i}),(x_{j},y_{j}))=k(x_{i},x_{j})+k(y_{i},y_{j})-k(x_{i},y_{j})-k(x_{j},y_{i})$
is normally distributed under both the null and the alternative distribution, one can easily compute test thresholds and p-values using the variance of the h-values of the above term as a proxy for the real variance (the null-distribution has zero mean). For large sample sizes, this is an accurate and very cheap way to perform the test: The statistic has to be computed only once, whereas bootstrapping would need a few hundred iterations. Another reason why the linear time MMD is well suited for large scale problems.
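A rough numpy sketch of this test (my own, with a Gaussian kernel and a one-sided normal p-value; not the SHOGUN implementation, and the pairing convention is just one common choice):

```python
import math
import numpy as np

def linear_time_mmd(X, Y, sigma=1.0):
    """Linear-time MMD^2 estimate plus a Gaussian-approximation p-value.
    X, Y: (m, d) arrays with m even; returns (statistic, p_value)."""
    def k(a, b):  # Gaussian kernel
        return math.exp(-float(np.sum((a - b) ** 2)) / (2.0 * sigma ** 2))
    n = len(X) // 2
    # one h-value per disjoint pair of samples -> a single pass over the data
    h = np.array([k(X[2*i], X[2*i+1]) + k(Y[2*i], Y[2*i+1])
                  - k(X[2*i], Y[2*i+1]) - k(X[2*i+1], Y[2*i])
                  for i in range(n)])
    stat = h.mean()
    se = h.std(ddof=1) / math.sqrt(n)   # standard error of the mean of the h-values
    p = 0.5 * math.erfc(stat / (se * math.sqrt(2.0)))  # one-sided; H0 mean is zero
    return stat, p

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(400, 1))
Y = rng.normal(2.0, 1.0, size=(400, 1))
stat, p = linear_time_mmd(X, Y)
```

For two clearly different distributions like these, the p-value should be essentially zero, while for samples from the same distribution it will be roughly uniform on [0, 1].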

I also started on integrating my code into the modular interfaces of SHOGUN and will produce some first python examples next week.

Jacob (Gaussian Process Regression project), who uses and extends parts of my model selection code from last year's GSoC, has found a serious problem in SHOGUN's parameter trees for model selection. I hope to fix it this week -- it is complicated.

When all mentioned things are done, I might start with dependence testing next week.

## June 20, 2012 02:28 PM

### mloss.org

"Much of machine learning (ML) research has lost its connection to problems of import to the larger world of science and society." So begins Kiri Wagstaff's position paper that will have a special plenary session on June 29 at ICML 2012. The paper then goes on to lament about the poor state of affairs in machine learning research. The paper is an interesting read, and it addresses an important question that any adolescent field faces: "How do I justify my existence?"

I'd like to take the glass-half-full view. Machine learning already matters!

Kiri herself uses examples that show that machine learning already has impact. In her introduction, she mentions the CALO project, which forms the basis of Siri on the iPhone 4S, which has revolutionised the way the general public perceives human computer interactions. She also mentions spam detection, which Gmail has generalized to sorting all email with Priority Inbox.

A quick look around the web reveals other success stories:

• The recent technology quarterly section of the Economist 2 June 2012 edition discusses the use of robots and how we would need to start legislating them. Ironically, in our human desire to appropriate blame in case of failure, we may have to block learning. Quoting the article: "This has implications for system design: it may, for instance, rule out the use of artificial neural networks, decision-making systems that learn from example rather than obeying predefined rules."

• Searching for the phrase "machine learning" in PLoS Computational Biology returns 250 hits, showing how machine learning has revolutionised biological research in the high throughput age.

• In high energy physics, particle accelerators use anomaly detection algorithms to only save data which may be interesting. The ultimate learning with data streams application.

At NIPS 2008, at the last talk of the Machine Learning in Computational Biology mini-symposium, I had the pleasure of being inspired by Thomas Lengauer's activities proposing anti-HIV therapy. I'd say that this "solves" challenge number 5 in Kiri's list. Remarkably (unfortunately?), their recommendation site remains just that, a recommendation site, and has yet to navigate the legislative nightmare of getting a website to prescribe drugs. In answer to a question, he said that Germany was one of the few places in the world where the legislation even allows doctors to use such drug recommendation sites. A scan of the titles cited by the review article reveals keywords which would fit comfortably in a machine learning venue:

- multiple linear regression
- simple linear model
- prediction-based classification
- artificial neural networks
- self organising feature maps
- non-parametric methods
- sparse models
- convex optimization

But doom and gloom persists. Why? My personal opinion is that like most successful technologies, machine learning fades into the background once it has impact. In that vein of thought, we can measure the impact of machine learning by the decline of ICML, JMLR and friends. Meanwhile, I'm going to go back to making machine learning disappear...

Please join in the discussion at http://mlimpact.com/.

## June 19, 2012 01:26 PM

### Fernando Jose Iglesias Garcia

#### Second Weekly Report GSoC 2012

Let’s have some fun, fun with optimization!

Basically, during the last days my work on the project has been strictly focused on the implementation of the optimization algorithm for the SO-SVM (Structured Output Support Vector Machine). This algorithm is presented and fully analysed in [1]. The paper explores the dual formulation of the optimization problem, whereas we are concerned with the primal version at this point in the project. The primal has been selected for its complexity and efficiency properties: the dual requires an explicit expansion of the weight vector, and its memory requirements are quadratic rather than linear as with the primal formulation. The primal formulation of the algorithm appears in [2], even though the main topic of that article goes beyond our current approach to SO learning.

The algorithm belongs to a class called cutting plane algorithms. These optimization methods are based on the idea of refining, iteration by iteration, the feasible set of solutions using the most violated constraints. There is a video lecture in [3]; around minute 6:20 the intuition behind how cutting plane methods work is nicely introduced, which I really like.

As mentioned in the previous post, a quadratic program (QP) appears in the algorithm. In particular, in every iteration of the algorithm we need to solve a QP that looks like

$\text{minimize}_{\vec{w}, \xi_i} \frac{1}{2} \vec{w}^TC\vec{w} + \sum_{i=1}^N \xi_i$
$s.t.~\forall i~\forall y \in \mathcal{Y} \setminus y_i : \vec{w} \cdot\delta\Psi_i(y) \ge \Delta(y_i, y) - \xi_i$

We are using MOSEK to solve this QP [4]. The QP shown above needs to be translated into a more general formulation like the one MOSEK expects, though. In order to do this, I have created an interface to MOSEK from Shogun inspired by the existing CPLEX interface in the toolbox.

Right now the algorithm is close to finished. However, nothing has been properly tested so far, and this makes me feel uncomfortable. Once the algorithm is finally ready, I will start to prepare the first and simplest use case of the framework, multiclass classification. Debugging and testing time will come then!

References
[1] Tsochantaridis, I., Hofmann, T., Joachims, T., Altun, Y. Support vector machine learning for interdependent and structured output spaces.
[2] Finley, T., Joachims, T. Training Structural SVMs when Exact Inference is Intractable.
[3] Video lecture by Thomas Finley on Training Structural SVMs when Exact Inference is Intractable. http://videolectures.net/icml08_finley_tssvm/
[4] MOSEK C API Quadratic Optimization. http://docs.mosek.com/6.0/capi/node007.html#329747704
[5] Nowozin, S., Lampert, C. H. Structured Learning and Prediction in Computer Vision. Section 6 is the most relevant to this part of the project.

## June 18, 2012 04:42 PM

### Heiko Strathmann

#### 4th GSoC weekly report: Tests, tests, tests...

Since all the threshold methods that I implemented so far are quite unintuitive and hard to verify, I spent last week writing tests for most of their parts. I created test cases based on fixed and random datasets (mainly taken from the papers where the methods were introduced), in which results are compared against MATLAB implementations of the same problem. For most cases, the difference is asserted to be smaller than $$10^{-15}$$.

Although it takes a lot of time and effort, I believe such automatic tests should come with every non-trivial method in SHOGUN. I remember quite a few cases where I spent an hour finding an error that would have been easily caught by a test.

I hope to finish all statistical tests based on the quadratic MMD this week. I still need to implement the biased version (which can be used along with the threshold methods) and a way to choose the type of statistic to use. I also plan to implement the linear time MMD and corresponding tests (Gaussian approximation of the null-distribution). I won't have Internet access this week since I am in the middle of nowhere in Poland -- so expect a larger patch at the end of the week.

## June 18, 2012 07:36 AM

### Fernando Jose Iglesias Garcia

#### Third Weekly Report GSoC 2012 – Clash of Assumptions

After a weekend of debugging I can proudly say that I am able to run toy examples using my implementation of the SO-SVM for the multiclass classification use case \o/

Just for the record, and for the fun of its discovery, I am going to write about the last bug I fixed in my code. As mentioned in the previous posts, part of the project deals with solving an optimization problem, a QP in particular. The fact is that I could run the code just fine: no more failing asserts popping up, no segmentation faults. However, something weird was going on with the solution. The weight vector $\vec{w}$ was zero for every problem instance. After double checking with MATLAB's quadprog function that the correct solution to the problem was indeed non-trivial, I added some lines here and there to ensure that the parameters given to MOSEK were the correct ones. All the matrices and vectors were OK. After some more lines to print out the bounds of the problem, I found it. The variables of the optimization problem were all fixed to zero!

At the beginning I just assumed that if no bound is given for a variable, then MOSEK would consider it free; i.e. it may take values within $(-\infty, \infty)$. However, this is false. If no bound is given, MOSEK in fact fixes the value of the variable to zero. I still don't get the point of including variables in the optimization vector whose values are fixed… but that's another story.

For the next time I will try to remember to ensure that the assumptions I make are correct or, even better, to RTFM

My next objective is to try out the application with bigger examples. Unfortunately, my MOSEK license file only allows me to solve problems with up to 300 variables or constraints, which currently prevents me from running tests with more than 17 two-dimensional training examples!

## June 11, 2012 02:58 PM

### Heiko Strathmann

#### 3rd GSoC weekly report: New threshold methods for quadratic MMD

I finally finished my exams last week and can now work full-time for GSoC and my Master project. Puh! :)

Last week, I implemented two methods to compute a threshold for the quadratic time MMD test:

1. A test based on the eigenspectrum of the kernel matrix of the joint samples. This is a nice and computationally efficient method from [1]. The basic idea is that for the case $$P=Q$$, the biased estimate of the null-distribution converges in distribution:
$m\text{MMD}^2_b \rightarrow \sum_{l=1}^\infty \lambda_l z_l^2$
where $$z_l \sim \mathcal{N}(0,2)$$ i.i.d. and $$\lambda_l$$ are eigenvalues whose empirical estimates $$\hat{\lambda}_l$$ are given by the eigenvalues of the centred kernel matrix $$\tilde{K}=HKH$$ where $$K_{ij}=k(x_i,x_j)$$. It's possible to sample the null-distribution using these estimates and to compute a p-value or threshold from the resulting samples.

2. A heuristic method, also from [1], that approximates the null-distribution with a gamma-distribution whose first two moments are matched, i.e.
$m\text{MMD}_b(Z) \sim \frac{x^{\alpha-1}\exp(-\frac{x}{\beta})}{\beta^\alpha \Gamma(\alpha)}$ where $\alpha=\frac{(\textbf{E}(\text{MMD}_b(Z)))^2}{\text{var}(\text{MMD}_b(Z))}\quad \text{and}\quad \beta=\frac{m\text{var}(\text{MMD}_b(Z))}{\textbf{E}(\text{MMD}_b(Z))}$

Both methods need some distribution functions ($$\Gamma$$, etc.), which I integrated from ALGLIB. I added some tests to ensure that the results equal those obtained with MATLAB. Along with that come some SGVector based wrappers for SHOGUN functions (sorting, eigenvalues, etc.).

Next week, I will do some fine-tuning on the implemented methods and then create tests which illustrate all methods.

[1]: Gretton, A., Fukumizu, K., & Harchaoui, Z. (2011). A fast, consistent kernel two-sample test.

## June 04, 2012 01:29 PM

### Heiko Strathmann

#### 2nd GSoC weekly report

Last week, I was again very busy with exams and doing experiments for a NIPS submission.

The latter is somewhat related to my GSoC project, and I will implement it once the other stuff is done:
We developed a method for selecting optimal coefficients of a non-negative combination of kernels for the linear time (i.e. large-scale) MMD two-sample test. The criterion that is optimised is the ratio of the linear MMD $$\eta_k$$ to its standard deviation $$\sigma_k$$, i.e.
$k_*=\arg \sup_{k\in\mathcal{K}} \eta_k \sigma_k^{-1}$. That is equivalent to solving the quadratic program
$\min \{ \beta^T\hat{Q}\beta : \beta^T \hat{\eta}=1, \beta\succeq0\}$
where the combination of kernels is given by
$\mathcal{K}:=\{k : k=\sum_{u=1}^d\beta_uk_u,\sum_{u=1}^d\beta_u\leq D,\beta_u\geq0, \forall u\in\{1,...,d\}\}$
$$\hat{Q}$$ is a linear time estimate of the covariance of the MMD estimates and $$\hat{\eta}$$ is a linear time estimate of the MMD using the above kernel combinations.
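When the non-negativity constraints happen to be inactive, the equality-constrained part of this QP even has a closed form, $$\beta^*=\hat{Q}^{-1}\hat{\eta}/(\hat{\eta}^T\hat{Q}^{-1}\hat{\eta})$$ (from the Lagrangian stationarity condition $$2\hat{Q}\beta=\lambda\hat{\eta}$$). A toy numpy sketch of mine, with made-up numbers; the general case needs a real QP solver:

```python
import numpy as np

def kernel_weights(Q, eta):
    """Closed-form minimiser of beta^T Q beta subject to beta^T eta = 1,
    valid only when the non-negativity constraints are inactive."""
    b = np.linalg.solve(Q, eta)      # Q^{-1} eta
    beta = b / (eta @ b)             # rescale so that beta^T eta = 1
    if np.any(beta < 0):
        raise ValueError("non-negativity constraints active: use a QP solver")
    return beta

Q = np.array([[2.0, 0.2], [0.2, 1.0]])   # toy covariance estimate (PSD)
eta = np.array([0.5, 1.0])               # toy per-kernel MMD estimates
beta = kernel_weights(Q, eta)
```

For these toy numbers the unconstrained solution is already non-negative, so the shortcut applies; whenever it is not, the clipped coordinates change the active set and the full QP has to be solved.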

Apart from that, I implemented a method to approximate the null-distribution of the quadratic time MMD, which is based on the eigenspectrum of the kernel matrix of the merged samples from the two distributions [1]. It still needs to be compared against the MATLAB implementation. It comes with some minor helper functions around matrix algebra.

This week, I will finally have my last exam and then continue on the advanced methods for computing test thresholds.

[1]: Gretton, A., Fukumizu, K., & Harchaoui, Z. (2011). A fast, consistent kernel two-sample test.

## June 04, 2012 11:10 AM

### Fernando Jose Iglesias Garcia

#### First Weekly Report GSoC 2012

Let’s do the weekly reports kick-off of this summer!

Although GSoC started officially a couple of days ago, on May 21st, I have been working on the project for about two weeks. Next, I am going to summarize what progress has been made during this time.

First of all, based on the code skeletons [1] that Nico wrote, I started with the design of the SO framework in Shogun. The design decisions taken during this phase are summarized in the attached class diagram [2]. In the diagram, the classes in light green existed in Shogun prior to this project, whereas the classes filled with light red are brand new. Among the new classes, CStructuredModel seeks to offer functionality to put together all the application-dependent parts of an SO problem instance. The CLossFunction class came in very handy, since I just needed to extend it with a few methods in order to support the functionality required by SO. The idea of this class is to provide a generic interface for well-defined loss functions (e.g. the hinge loss). Needless to say, the design shown in the diagram is very likely to evolve. For example, CStructuredModel currently uses function pointers for some of its members, and this will change to a more understandable interface with classes.

Initial SO class diagram.

In addition, classes (labels/CStructuredLabels and lib/CStructuredData) have been added to provide labels with structure (e.g. sequences, graphs). This is probably the feature that most distinguishes SO learning from the other strategies already present in Shogun.

Finally, I have been working on the optimization algorithm presented in [3]. This is still work in progress and the code is in CPrimalMosekSOSVM. The main difficulty I have found here is that, in order to solve the quadratic program (QP) that arises, we need to use a non-open-source tool, since libqp does not support all the required constraints (in particular, inequality constraints of the type $A \cdot x \leq b$ for the QP with box constraints). I have started to write some code in CPrimalMosekSOSVM that makes use of MOSEK to solve the QP. This piece of code is still rather rough and only exists in my local repository.

The current working plan is, in this order: finish the code in CPrimalMosekSOSVM mentioned above (I have set a deadline for this on Friday, June 1st), prepare the first use case with multiclass SVMs, and extend the design by creating a class for the $\arg \max$ computation and another one for the structured loss function $\Delta(y_{pred}, y_{true})$.

References
[1] Gist with main concepts of the framework written by Nico Görnitz. https://gist.github.com/2634487.
[2] Structured Output framework – Class Diagram.
[3] Tsochantaridis, I., Hofmann, T., Joachims, T., Altun, Y. Support vector machine learning for interdependent and structured output spaces.
[4] SO learning branch in git:  https://github.com/iglesias/shogun/tree/so.

## May 28, 2012 08:54 AM

### Heiko Strathmann

#### First GSoC weekly report

I am currently quite busy with my exams; however, the last three will be done soon. I still managed to do initial sketches for the statistical testing framework, along with helping to solve problems caused by the massive changes that are currently happening to SHOGUN's label and multi-class system.

Here you can find a UML diagram of the class structure so far. I implemented the first simple kernel two-sample tests -- the ones based on the linear and the quadratic time MMD metric. For computing a p-value, these two may approximate their null-distribution using a (brute-force) bootstrapping approach based on shuffling data of the two underlying distributions and then computing the statistic multiple times. The bootstrapping code will work for any two-sample based test.
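The bootstrapping idea is simple enough to sketch in a few lines (my own generic version in Python, not the SHOGUN class):

```python
import numpy as np

def bootstrap_null(statistic, X, Y, num_iterations=250, rng=None):
    # Shuffle the merged samples and recompute the statistic to draw
    # samples from its distribution under H0: P = Q
    rng = rng or np.random.default_rng(0)
    Z = np.concatenate([X, Y])
    m = len(X)
    null = np.empty(num_iterations)
    for i in range(num_iterations):
        perm = rng.permutation(len(Z))
        null[i] = statistic(Z[perm[:m]], Z[perm[m:]])
    return null

def bootstrap_p_value(statistic, X, Y, **kwargs):
    # p-value: fraction of null samples at least as large as the observed statistic
    null = bootstrap_null(statistic, X, Y, **kwargs)
    return float(np.mean(null >= statistic(X, Y)))

# demo with a trivial two-sample statistic: absolute difference of means
mean_diff = lambda A, B: abs(A.mean() - B.mean())
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=100)
Y = rng.normal(3.0, 1.0, size=100)
p = bootstrap_p_value(mean_diff, X, Y)
```

Because the shuffling never looks inside `statistic`, the same loop works unchanged for the linear or quadratic MMD, which is exactly why it makes sense as a framework-level utility.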

Next steps are: Advanced methods for estimating Null-distributions for the MMD tests.

I also worked with Arthur (mentor) on a version of the MMD that is related to my Master project: a convex combination of (arbitrary) kernels for the linear time MMD where the optimal weights are learned by solving a quadratic program. I might implement that in SHOGUN as well. (Who can help me with interfacing SHOGUN's QP solver?)

## May 22, 2012 09:57 PM

### Fernando Jose Iglesias Garcia

#### GSoC 2012 is already here!

Last month I was selected to develop a project for Shogun in the frame of GSoC. The project's name is Build a Generic Structured Output Learning Framework and it is mentored by Nico Görnitz. Here follows a short description of the project:

The aim is to implement tools for structured output (SO) problems. The data in these problems have complex structure (e.g. graphs, sequences) and traditional learning algorithms fail to find solutions efficiently. Structured output support vector machines and conditional random fields are methods for SO learning. They will be implemented to form Shogun's first module for SO learning. Finally, these methods will be applied to hidden Markov model-type problems such as gene prediction.

Feel free to visit my project proposal where, among some personal information, you will be able to find a thorough description of the project together with a tentative schedule and useful references on the topic.

This is going to be a fun summer of coding!

## May 22, 2012 02:25 PM

#### Hello Shogun!

Welcome to iglesiashogun.wordpress.com! In this blog you will find information about my wonderful experience as a software developer for the Open Source machine learning toolbox Shogun.

Happy coding!

## May 15, 2012 05:31 PM

#### AISEC 2012 - Call for Papers.

I am happy to spread the word about the 5th ACM Workshop on Artificial Intelligence and Security (AISEC). The workshop invites original research papers describing the use of machine learning in security and privacy problems, as well as position papers discussing the role of learning in security. The workshop will be held in conjunction with the renowned ACM Conference on Computer and Communications Security (CCS) in Chicago and accepted papers are published by ACM press.
• Paper Submission Due: July 16, 2012
• Acceptance Notification: August 13, 2012
• Workshop: October 19, 2012
AISEC is one of the few events that specifically target the challenging niche of machine learning and computer security. As a member of the PC, I am really looking forward to interesting contributions and innovative applications. More information is available at the workshop's website.

## April 26, 2012 10:09 PM

### Soeren Sonnenburg

#### GSoC2012 Accepted Students

Shogun has received an outstanding 9 slots in this year's Google Summer of Code. Thanks to Google and the hard work of our mentors in ranking and discussing with the applicants, we were able to accept 8 talented students. (We had to return one slot to the pool due to not having enough mentors for all tasks. We indicated that Octave should get the slot, so let's hope they got it and will spend it well :)

Each of the students will work on rather challenging machine learning topics.

The accepted topics and the students attacking them with the help of their mentors are

We are excited to have them all in the team! Happy hacking!

## April 24, 2012 11:24 AM

### mloss.org

#### Google Summer of Code 2012

The list of Google Summer of Code (GSoC) students for 2012 has been announced.

For young programmers, it is probably the easiest way to get your foot in the door by showing that you can contribute to something worthwhile. For open source projects, it is an injection of fresh blood. For academics looking for programmer types, it is a good way to differentiate between all the applicants with top marks from universities you do not personally know.

Among the mentoring organisations which may be of interest to the machine learning community:

A warm welcome to everyone!

## April 09, 2012 09:13 PM

### Soeren Sonnenburg

#### Shogun Student Applications Statistics for Google Summer of Code 2012

A few weeks have passed since SHOGUN has been accepted for Google Summer of Code 2012. Student application deadline was Easter Friday (April 6) and shogun received 48 proposals from 38 students. Some more detailed stats can be found in the figure below.

This is a drastic drop compared with last year (about 60 students submitted 70 proposals). However, this drop can easily be explained: To even apply we required a small patch, which is a big hurdle.
1. One has to get shogun to compile (possibly only easy under Debian; on Cygwin/Mac OS X you have to invest quite some time just to get all of its dependencies).
2. One has to become familiar with git, github and learn how to issue a pull request.
3. And finally, one has to understand enough of machine learning and shogun's source code to be able to fix a bug or implement some baseline machine learning method.
Nevertheless, about a dozen proposals didn't come with a patch (even though the instructions page states that this is required) - an easy reject. In the end the quality of proposals increased a lot and we have many very strong candidates this year. Now we will have to wait and see how many slots we receive before we can finally start the fun :-)

## March 18, 2012 02:56 AM

### Soeren Sonnenburg

#### Shogun got accepted at Google Summer of Code 2012

SHOGUN has been accepted for Google Summer of Code 2012.

SHOGUN is a machine learning toolbox, which is designed for unified large-scale learning for a broad range of feature types and learning settings. It offers a considerable number of machine learning models such as support vector machines for classification and regression, hidden Markov models, multiple kernel learning, linear discriminant analysis, linear programming machines, and perceptrons. Most of the specific algorithms are able to deal with several different data classes, including dense and sparse vectors and sequences using floating point or discrete data types. We have used this toolbox in several applications from computational biology, some of them coming with no less than 10 million training examples and others with 7 billion test examples. With more than a thousand installations worldwide, SHOGUN is already widely adopted in the machine learning community and beyond.

SHOGUN is implemented in C++ and interfaces to many important languages such as MATLAB, R, Octave, Python, Lua, Java, C# and Ruby, and has a stand-alone command line interface. The source code is freely available under the GNU General Public License, Version 3 at http://www.shogun-toolbox.org.

During Summer of Code 2012 we are looking to extend the library in three different ways:

1. Improving accessibility to shogun by improving i/o support (more file formats) and mloss.org/mldata.org integration.
2. Framework improvements (frameworks for regression, multiclass, structured output problems, quadratic programming solvers).
3. Integration of existing and new machine learning algorithms.
Check out our ideas list and if you are a talented student, consider applying!

## March 08, 2012 10:58 AM

### mloss.org

#### Open Access is very cheap

Stuart Shieber just posted very convincing evidence that publication does not really cost that much, at least in technically savvy fields. Read about it here...

"Adding it all up, a reasonable imputed estimate for JMLR’s total direct costs other than the volunteered labor (that is, tax accountant, web hosting, domain names, clerical work, etc.) is less than $10,000, covering the almost 1,000 articles the journal has published since its founding — about $10 per article. With regard to whose understanding of JMLR’s financing is better than whose, Yann LeCun I think comes out on top. How do I know all this about JMLR? Because (full disclosure alert) I am [ed. the publisher]."

This shows that Yann LeCun knew what he was talking about in the argument with Kent Anderson in the comments section of the Scholarly Kitchen.

mloss.org does not cost that much either. It was initially hosted by the Friedrich Miescher Laboratory in Tuebingen, and is now hosted by the Technical University Berlin. All coding was done by Soeren, Mikio and me during our free time. mldata.org cost a bit more because we paid a programmer and an intern for a few months at the start, and we also bought a more serious server for the heavier load. Luckily we have a PASCAL2 grant.

The real cost, as with JMLR, is the volunteer time needed. In fact, the mloss/mldata team is stretched pretty thin at the moment, and any help would be most welcome. Please contact us if you have a few free hours!

## March 08, 2012 10:30 AM

#### Positional and Sorted N-grams.

I haven't blogged for quite some time, as I have been quite busy with teaching and research. While most of this work is fun, I slightly miss the extensive amount of coding I did back as a PhD student. Whenever possible I take a break and spend some time hacking on some code.

In this post, I present two hacks extending the concept of n-grams that I recently added to my tool Sally. The tool maps a set of strings to a vector space, such that techniques from machine learning can be directly applied to string data. To this end, the strings are characterized by n-grams (substrings of either n bytes or words), where each n-gram is associated with one dimension of the vector space (see the vector space model or bag of words model). If an n-gram occurs in a string, the respective dimension in the vector corresponding to the string is set to 1; otherwise it is 0.
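
To make the mapping concrete, here is a minimal Python sketch of the binary bag-of-n-grams embedding (illustrative only; Sally itself is an efficient C implementation and the function names here are my own):

```python
def ngrams(s, n):
    """All byte n-grams (substrings of length n) of a string."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def embed(strings, n):
    """Map strings to binary vectors over the joint n-gram vocabulary:
    a dimension is 1 if the n-gram occurs in the string, 0 otherwise."""
    vocab = sorted(set().union(*(ngrams(s, n) for s in strings)))
    vectors = [[1 if g in ngrams(s, n) else 0 for g in vocab]
               for s in strings]
    return vectors, vocab
```

For the strings "abab" and "abcd" with n=2, the joint vocabulary is ["ab", "ba", "bc", "cd"] and the resulting vectors are [1, 1, 0, 0] and [1, 0, 1, 1].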

The first hack deals with the position of n-grams in the strings:
Positional n-grams: In some applications the considered strings are aligned and the positions of patterns in the strings play a central role. For instance, in transcription start site detection one tries to identify the start of genes using substrings and their position in the DNA. Positional n-grams address this setting. The n-grams are extracted along with their position in the strings, such that an n-gram matches between two strings only if it occurs at the same position. In particular, Sally maps the strings to a vector space whose dimensions are associated with both the n-grams and their positions. For example, the string "abab" contains the following positional 2-grams: "ab@1", "ba@2" and "ab@3".
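
The "abab" example can be reproduced in a few lines (a sketch of the idea, not Sally's actual code):

```python
def positional_ngrams(s, n):
    """Byte n-grams annotated with their (1-based) position, so an
    n-gram matches between two strings only at the same position."""
    return {"%s@%d" % (s[i:i + n], i + 1)
            for i in range(len(s) - n + 1)}

# "abab" contains the positional 2-grams "ab@1", "ba@2" and "ab@3"
```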
The second hack can add a little robustness to the analysis:
Sorted n-grams: Like any data, strings can be noisy. Words are sometimes swapped in text, and DNA bases may flip positions due to mutations. Such noise can be slightly compensated for by sorted n-grams. After the extraction of an n-gram, the symbols of the n-gram are sorted, thereby removing the local ordering of the data. For example, a 3-gram of the words "john was here" becomes "here john was". This sorting obviously destroys relevant information in the n-grams, yet if the local ordering is perturbed by noise anyway, sorted n-grams are an alternative for analyzing string data. One rather obscure application of this technique is the analysis of Yoda quotes, which are permuted by design.
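
A sketch of sorted word n-grams (again my own illustration, not Sally's implementation):

```python
def sorted_word_ngrams(s, n):
    """Word n-grams with the words inside each n-gram sorted, which
    discards local ordering and adds robustness against swaps."""
    words = s.split()
    return [" ".join(sorted(words[i:i + n]))
            for i in range(len(words) - n + 1)]

# both "john was here" and "was john here" yield the 3-gram "here john was"
```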
You can find out more about both features in the manual of Sally.

Last but not least, if you are using Sally, I would be interested to learn where and how Sally is applied to your data. As I have designed the tool mainly for my own needs, I am really looking forward to learning about different applications.

## March 02, 2012 09:34 AM

### mloss.org

#### Did the MathWorks Infringe the Competition Laws?

I have just read that the EU commission is investigating whether The MathWorks infringed EU competition laws, potentially in relation to its software Matlab and Simulink. An unnamed competitor made an appeal to the EU commission claiming that The MathWorks refused to provide a license for Matlab/Simulink to that competitor. This hinders making the competing product interoperable and makes it impossible for the competitor to perform (rightful!) reverse engineering in this case.

## February 28, 2012 10:34 AM

### mloss.org

#### Nature Editorial about Open Science

The case for open computer programs

Does open source software imply reproducible research?

There was a recent Nature editorial expounding the need for open source software in scientific endeavors. It argues that many modern scientific results depend on complex computations and hence source code is needed for scientific reproducibility. It is nice that a high profile journal has published articles promoting open source software, since it increases visibility. However, some more careful thought is required, as the message of the article is inaccurate in both directions.

Open source provides more benefits than just reproducibility

Actually, open source provides more than is necessary for reproducibility, since the license provides the ability to edit and extend the code, as well as preventing discriminatory practices. To be pedantic, for reproducibility alone, any software (even a compiled executable) would work.

We've said this before but the message is worth repeating. Open source provides:

1. reproducibility of scientific results and fair comparison of algorithms;
2. uncovering problems;
3. building on existing resources (rather than re-implementing them);
4. faster adoption of methods in different disciplines and in industry; and
5. collaborative emergence of standards.

See our paper for more details.

Having source code does not imply reproducibility

As the editorial observes in the final sentence of the abstract, "The vagaries of hardware, software and natural language will always ensure that exact reproducibility remains uncertain, ...". I've personally spent many frustrating hours trying to get somebody's research code to compile. In fact, one of the most common complaints by reviewers of JMLR's open source track is that they are unable to get submissions to work on their computer. The multitude of computing environments, numerical libraries and programming languages means that very often, the user of the software is in a different frame of mind compared to the authors of the source code. My advice to fledgling authors of machine learning open source software is to provide a "quickstart" tutorial in the README, because everybody is impatient, and nobody will look into fixing your bugs before they are convinced that your code will do something useful for them. And yes, fixing $PATH can be tricky if you don't know exactly how to do it.

I guess the bottom line is quite an obvious statement: Good open source software will give you reproducibility and a few other additional benefits.

## February 03, 2012 11:19 AM

### mloss.org

#### Tagging Project 'Published in JMLR'

I did some minor maintenance work today updating email addresses of certain projects and merging libmlpack and MLPACK.

More importantly, I added the jmlr mloss url to all mloss.org listed projects that have a corresponding jmlr publication. Since there seemed to be quite a backlog, it might be worthwhile to shed some light on how one gets flagged 'published in jmlr'.

Naturally, the condition is a corresponding accepted and already online jmlr mloss paper. In addition, you should remind your jmlr action editor to flag the project 'published in jmlr' by giving him the link to the abstract of your publication on the jmlr homepage. This is the field 'abs' under http://jmlr.csail.mit.edu/mloss/ . He will then add this cross-reference and your project will benefit from the increased visibility.

I hope that makes the process more transparent and reduces the backlog - oh, and by the way, we are counting 29 JMLR-MLOSS submissions - keep it going :-)

## January 11, 2012 05:28 PM

#### Open Position in the PROSEC Project.

I am happy to announce an open position for a PhD student in the Computer Security Group at the University of Göttingen. The position is part of the project PROSEC and will be concerned with developing techniques for detecting and blocking attacks against communication services and devices.

We are specifically looking for a highly motivated PhD student with a strong background in computer and network security, as well as good knowledge of data analysis and machine learning. The position has a running time of two years. The salary level is 13 TV-L. The application deadline is February 15.

A detailed description of the position and the requirements is available here. If you have questions regarding this position, feel free to send me an email. Please direct electronic applications to our office: office@informatik.uni-goettingen.de.

## December 16, 2011 05:04 AM

### mloss.org

#### Improving mloss.org

Meeting Cheng at NIPS, we had a discussion on how to improve the user experience of mloss.org. So I got my hands dirty and fixed a few minor issues on mloss.org:

A long-standing feature request was that software that once appeared in JMLR should continue to be tagged as published in JMLR and be highlighted. This is now the case.

In addition, I limited the automagic pulling of R-CRAN packages to happen only once per month. This should give manually updated software higher visibility again.

If you have suggestions for improvements and don't mind coding a little python, mloss.org's source code is now available on github. In particular, if you'd like to attempt to improve the R-CRAN slurper, its code is here.

## December 14, 2011 10:17 AM

#### Privacy Attacks on Smart Metering.

Today I'd like to share some of my recent research with Marek Jawurek and Martin Johns. Together we have been working on attacking and defending privacy in smart metering. A smart meter is essentially a digital meter that records a fine-grained trace of a household's electricity consumption. At first glance, this is a great thing and we will all profit from time-of-use billing in the very near future.

At a closer look however, smart metering also brings about severe privacy problems. Every activity that takes place in a household and makes use of electrical appliances is reflected in the recorded data. A large body of work has studied how activities, such as watching television, running the dishwasher or even slicing bread with an electric knife, can be identified with scary precision. There is consensus that consumption traces need to be carefully protected at the electricity suppliers.

Unfortunately, there is a seemingly effective defense: pseudonymization. That is, the household's identity is stored separately from the consumption data and only linked by a pseudonym. Information inferred from consumption traces cannot be attributed to a specific household and most privacy attacks fail. For example, an attacker may detect from the stored traces that somebody switched on his television. Without access to the mapping from pseudonyms to households, however, this information is of little use.

We have studied the security of such pseudonymization and developed two novel attacks that enable linking consumption traces back to households. Our attacks build on machine learning techniques, which allow us to cleverly model the consumption traces and automatically infer anomalous patterns of electricity usage. These patterns can be correlated with external observations, such as an employee taking a day off or a family preparing a Christmas dinner. Such indicative patterns occur in the consumption traces of many households. Just think about your own electricity usage and events where you significantly deviate from your normal habits.

As a simple example, the following figure shows two patterns automatically inferred from the consumption traces of a household. The blue line corresponds to the average consumption on weekends, while the black line gives the consumption on weekdays. The red line actually reflects a weekday, yet its profile clearly matches the weekend. Somebody might have taken a day off or is sick at home. With a corresponding external observation, we can easily link this data to a household and all previous privacy attacks can be applied again.
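
The intuition of matching the red line against the two average profiles can be sketched as a nearest-profile lookup (a toy example with made-up numbers, not the models used in our paper):

```python
def distance(a, b):
    """Euclidean distance between two daily consumption profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def closest_profile(day, profiles):
    """Label of the average profile that a day's trace resembles most."""
    return min(profiles, key=lambda label: distance(day, profiles[label]))

# hypothetical average profiles (4 coarse time slots per day)
profiles = {
    "weekday": [0.2, 0.3, 1.5, 0.4],   # little consumption while at work
    "weekend": [0.8, 1.2, 1.4, 1.0],   # higher consumption all day
}

# a trace recorded on a weekday whose shape matches the weekend profile
# suggests a day off -- an indicative pattern for re-identification
day_off = closest_profile([0.9, 1.1, 1.3, 1.0], profiles)   # "weekend"
```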

We have evaluated our attacks and corresponding defenses on the consumption traces of 53 households recorded over a period of 9 months. In several cases our attacks are successful and we conclude that pseudonymization alone is insufficient for protecting smart metering. More details are available in our paper that has been recently presented at the 27th Annual Computer Security Applications Conference (ACSAC).
Consumption traces collected by Smart Meters are highly privacy sensitive data. For this reason, current best practice is to store and process such data in pseudonymized form, separating identity information from the consumption traces. However, even the consumption traces alone may provide many valuable clues to an attacker, if combined with limited external indicators. Based on this observation, we identify two attack vectors using anomaly detection and behavior pattern matching, that allow effective de-pseudonymization. Using a practical evaluation with real-life consumption traces of 53 households, we verify the feasibility of our techniques and show that the attacks are robust against common countermeasures, such as resolution reduction or frequent re-pseudonymization.
Smart Metering De-Pseudonymization. Marek Jawurek, Martin Johns and Konrad Rieck. Proc. of 27th Annual Computer Security Applications Conference (ACSAC), December, 2011.

## December 06, 2011 11:33 AM

### mloss.org

#### Mendeley/PLoS API Binary Battle (winners)

The results of the Mendeley/PLoS API Binary Battle are out:

## Winner

Share your personal genome from 23andMe or deCODEme to find the latest relevant research and let scientists discover new genetic associations.

## 1st runner up

Continual reviews of papers, even after they are published.

## 2nd runner up

Something close to my heart, programmatic interface to data!

## December 05, 2011 05:00 PM

### mloss.org

#### What is a file?

What is a file?

Two bits of news appeared recently:

• The distribution of file sizes on the internet indicates that the human brain limits the amount of produced data. The article however observes that "it'll be interesting to see how machine intelligence might change this equation. It may be that machines can be designed to distort our relationship with information. If so, then a careful measure of file size distribution could reveal the first signs that intelligent machines are among us!"

• Paul Allen's Institute has been publishing its data in an open fashion. Ironically, the article is behind a paywall. However, the Allen Institute for Brain Science has a data portal.

I wondered about the distribution of data which is clearly machine generated and in some sense most easily digested by machines as well. It turns out that it is quite difficult to find out how big files are. In some sense, for the brain atlas, the amount of data (>1 petabyte of image data alone) is more than is easily transferable across the internet. Most human users of this data would use some sort of web-based visualization of the data, and hence the meaning of the word "file" isn't so obvious. In fact, there has been a recent trend to "hide" the concept of a file. One example is iPhones and iPads, where you do not have access to the file system, and hence do not really know whether you are transferring parts of a file or streaming bytes. Another example is Google's AppEngine, where users access data through a database. A third example is Amazon's Silk browser, which "renders" a web page in a more efficient fashion using Amazon's infrastructure rather than your local client.

If we take the extreme view that we use some sort of machine learning algorithm to filter the world's data for our consumption, this implies that all the world's data is in one "file", and we are just looking at parts of it. From this point of view, the paper about using file sizes to reveal machine intelligence is not going to work. In fact, thinking about file sizes in the first place is just plain misleading.

## October 24, 2011 03:13 PM

### Soeren Sonnenburg

#### Google Summer of Code Mentors Summit 2011

The Google mentors summit 2011 is over. It is hard to believe, but we had way more discussions about open science, open data, open source, and open access than at all machine learning open source software workshops combined. Well, and the number of participants of the science sessions was unexpectedly high too. One of the interesting suggestions was the science code manifesto, which suggests among other things that all code written specifically for a paper must be available to the reviewers and readers of the paper. I think that this can nicely be extended to data too and really should receive wide support!

Besides such high goals, the summit was a good occasion to get in touch with core developers of git, opencv, llvm/clang, orange, octave, python... and discuss concrete collaborations, issues and bugs that the shogun toolbox is triggering. So yes, I already fixed some, and more are to come. Thanks Google for sponsoring this - it's been a very nice event.

## October 18, 2011 01:37 PM

#### Adaptive Detection of Covert Communication in HTTP Requests.

It has been a while since my last post. I have taken up my new position at the University of Göttingen and a lot of administrative work needed to be done. Fortunately, the website of my research group is already online and half of my lecture slides are prepared. So much for the update - back to some security stuff.

Together with Guido Schwenk from TU Berlin, I have been working on detecting covert channels in HTTP traffic. It is well known that identifying covert communication in the general case is a very hard and usually fruitless task. The sheer amount of available channels in HTTP traffic renders a comprehensive detection impossible. Data may be hidden in inter-request times, in the ordering of visited links and so on. At the start of our work, we didn't have much hope.

Elaborate covert channels, however, suffer from low bandwidth and are thus less attractive for attackers. The effective hiding of data in communication requires a careful balance of data volume and time. This rings especially true for malicious software that transfers data from infected machines to the attacker, such as personal documents and files. To transport data in a reasonable time frame, say a day, most malware authors abstain from using advanced cloaking techniques and resort to "mild" covert channels – which is good news.

With these peculiarities in mind, we have addressed the problem and developed a system for detecting covert communication in HTTP requests. Our system is embedded in a web proxy, where it transparently monitors outgoing HTTP requests for suspicious activity. To this end, we model characteristics of normal HTTP communication for each user using machine learning techniques. For example, we learn individual profiles for the entropy, the length, the headers and the rate of outgoing requests. These profiles can be used to identify communication that deviates from the normal characteristics of a user, indicating suspicious activity. Empirically, we could demonstrate this ability and effectively detect the communication of malware, tunnels and backdoors in real HTTP traffic with high accuracy.
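
As a flavour of what such a profile might look like, here is a heavily simplified sketch (my own toy model; the actual features and detection method are described in the paper): a per-user profile of one request feature, flagging requests that deviate strongly from normality.

```python
import math

def entropy(data):
    """Byte entropy of a request (bits per byte); covert channels that
    carry compressed or encrypted data tend to push this value up."""
    counts = {}
    for b in data:
        counts[b] = counts.get(b, 0) + 1
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

class Profile:
    """Per-user profile of one request feature, e.g. length or entropy."""
    def __init__(self, samples):
        self.mean = sum(samples) / len(samples)
        var = sum((x - self.mean) ** 2 for x in samples) / len(samples)
        self.std = math.sqrt(var)

    def suspicious(self, value, k=3.0):
        """Flag values deviating more than k standard deviations."""
        return abs(value - self.mean) > k * max(self.std, 1e-9)
```

A profile trained on a user's typical request lengths would then flag, say, an unusually long outgoing request carrying exfiltrated data.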

Clearly, there is room for evasion. If an attacker has access to the learned profiles, he can start to craft communication that bypasses our system. However, we found the construction of such communication far from trivial. By learning profiles on features of volume and time, we effectively limit the available space for hiding communication. Only a few bytes can be hidden and sent out unnoticed.

This work has been recently presented at the 7th European Conference on Computer Network Defense. The abstract of the corresponding paper is shown below:
The infection of computer systems with malicious software is an enduring problem of computer security. Avoiding an infection in the first place is a hard task, as computer systems are often vulnerable to a multitude of attacks. However, to explore and control an infected system, an attacker needs to establish a communication channel with the victim. While such a channel can be easily established to an unprotected end host in the Internet, infiltrating a closed network usually requires passing an application-level gateway—in most cases a web proxy—which constitutes an ideal spot for detecting and blocking unusual outbound communication. This paper introduces Dumont, a system for detecting covert outbound HTTP communication passing through a web proxy. Dumont learns profiles of normal HTTP requests for each user of the proxy and adapts to individual web surfing characteristics. The profiles are inferred from a diverse set of features, covering the structure and content of outbound data, and allowing for automatically identifying tunnels and covert channels as deviations from normality. While this approach does not generally rule out sophisticated covert communication, it significantly improves on state-of-the-art methods and hardens networks against malware proliferation. This capability is demonstrated in an evaluation with 90 days of web traffic, where Dumont uncovers the communication of malware, tunnels and backdoors with few false alarms.
Adaptive Detection of Covert Communication in HTTP Requests. Guido Schwenk and Konrad Rieck. 7th European Conference on Computer Network Defense (EC2ND), September, 2011.

## September 27, 2011 04:43 PM

### mloss.org

#### Linus's Lessons on Software

Linus Torvalds talks about how to run a successful software project.

Two things people commonly get wrong:

“The first thing is thinking that you can throw things out there and ask people to help,”

“The other thing—and it's kind of related—that people seem to get wrong is to think that the code they write is what matters,”

The main points on how to run a successful project:

• It is not about the code, it is about the user
• A good workflow for the project is important, and tools may help to create a good workflow.
• For big projects, development happens in small core groups
• Let go, and don't try to control the people and the code

## September 17, 2011 12:52 PM

### mloss.org

#### Software Freedom Day

Happy software freedom day!

## September 07, 2011 09:21 PM

### Soeren Sonnenburg

#### Shogun at Google Summer of Code 2011

Google Summer of Code 2011 gave a big boost to the development of the shogun machine learning toolbox. In case you have never heard of shogun or machine learning: machine learning involves algorithms that do "intelligent" and even automatic data processing and is nowadays used everywhere, e.g. for face detection in your camera, to compress the speech in your mobile phone, to power the recommendations in your favourite online shop, or to predict the solubility of molecules in water and the location of genes in humans, to name just a few examples. Interested? Then you should give it a try. Some very simple examples stemming from a sub-branch of machine learning called supervised learning illustrate how objects represented by two-dimensional vectors can be classified as good or bad by learning a so-called support vector machine. I would suggest installing the python_modular interface of shogun and running the example interactive_svm_demo.py included in the source tarball. Two images illustrating the training of a support vector machine follow (click to enlarge):

Now back to Google Summer of Code: Google sponsored 5 talented students who were working hard on various subjects. As a result we now have a new core developer and various new features implemented in shogun: interfaces to new languages like Java, C#, Ruby and Lua written by Baozeng; a model selection framework written by Heiko Strathmann; many dimension reduction techniques written by Sergey Lisitsyn; Gaussian Mixture Model estimation written by Alesis Novik; and a full-fledged online learning framework developed by Shashwat Lal Das. All of this work has already been integrated in the newly released shogun 1.0.0. In case you want to know more about the students' projects, continue reading below; but before going into more detail I would like to summarize my experience with GSoC 2011.

### My Experience with Google Summer of Code

We were a first-time organization, i.e. taking part in GSoC for the first time. Having received many, many student applications, we were very happy to hear that we got 5 very talented students accepted, but we still had to reject about 60 students (only a 7% acceptance rate!). Doing this was an extremely tough decision for us. Each of us ended up scoring students, and even then we had many ties. So in the end we raised the bar by requiring contributions even before the actual GSoC started. This way we already got many improvements, like more complete i/o functions, nicely polished ROC and other evaluation routines, new machine learning algorithms like Gaussian naive Bayes and averaged perceptron, and many bugfixes.

The quality of the contributions and the independence of the students helped us come up with the selection of the final five.

I personally played the role of the administrator and (co-)mentor and scheduled regular (usually monthly) irc meetings with mentors and students. For other org admins or mentors wanting to get into GSoC, here are my lessons learned:

• Set up the infrastructure for your project before GSoC: We transitioned from svn to git (on github) just before GSoC started. While it was a bit tough to work with git in the beginning, it quickly paid off (patch reviewing and discussions on github were really much easier). We did not have proper regression tests running daily during most of GSoC, leaving a number of issues undetected for quite some time. Now that we have buildbots running, I keep wondering how we could survive for so long without them :-)
• Even though all of our students worked very independently, you want to mentor them very closely in the beginning so that they write code that you would like to see in your project, following your coding style and utilizing already existing helper routines. We did this and it simplified our lives later - we could mostly accept patches as is.
• Expect contributions from external parties: We had contributions to shogun's ruby and csharp interfaces/examples. Ensure that you have some spare manpower to review such additional patches.
• Expect critical code review by your students and be open to restructuring the code. As a long-term contributor you probably no longer realize whether your class design / code structure is hard to digest. Newcomers like GSoC students immediately will when they stumble upon inconsistencies. When they discover such issues, discuss with them how to resolve them and don't be afraid of doing even bigger changes in the early GSoC phase (not so big as to hinder the work of all of your students, though). We had quite some structural improvement in shogun due to several suggestions by our students. Overall the project improved drastically - not just w.r.t. additions.
• As a mentor, work with your student on the project. Yes, get your hands dirty too. This way you are much more of a help to the student when things get stuck, and it will be much easier for you to answer difficult questions.
• As a mentor, try to answer the questions your students have within a few hours. This keeps the students motivated and you excited that they are doing a great job.

### Dimension Reduction Techniques

Sergey Lisitsyn (Mentor: Christian Widmer)

Dimensionality reduction is the process of finding a low-dimensional representation of high-dimensional data while maintaining its core essence. As one of the most important practical techniques of applied machine learning, it is widely used for preprocessing real data. With a strong focus on memory requirements and speed, Sergey implemented the following dimension reduction techniques:

See below for some nice illustrations of dimension reduction/embedding techniques (click to enlarge).

### Cross-Validation Framework

Heiko Strathmann (Mentor: Soeren Sonnenburg)

Nearly every learning machine has parameters which have to be determined manually. Before Heiko started his project, one had to manually implement cross-validation using (nested) for-loops. In his highly involved project, Heiko extended shogun's core to register parameters and ultimately made cross-validation possible. He implemented different model selection schemes (train/validation/test splits, n-fold cross-validation, stratified cross-validation, etc.) and created some examples for illustration. Note that various performance measures are available to measure how "good" a model is. The figure below shows the area under the receiver operator characteristic curve as an example.

### Interfaces to the Java, C#, Lua and Ruby Programming Languages

Baozeng (Mentor: Mikio Braun and Soeren Sonnenburg)

Baozeng implemented swig typemaps that enable the transfer of objects native to the language one wants to interface to. In his project, he added support for Java, Ruby, C# and Lua. His knowledge about swig helped us to drastically simplify shogun's typemaps for existing languages like octave and python, resolving other corner-case type issues. The addition of these typemaps brings a high-performance and versatile machine learning toolbox to these languages. It should be noted that shogun objects trained in e.g. python can be serialized to disk and then loaded from any other language, say lua or java. We hope this helps users working in multiple-language environments. Note that the syntax is very similar across all languages used - compare for yourself: various examples for all languages (python, octave, java, lua, ruby, and csharp) are available.

### Largescale Learning Framework and Integration of Vowpal Wabbit

Shashwat Lal Das (Mentor: John Langford and Soeren Sonnenburg)

Shashwat introduced support for 'streaming' features into shogun. That is, instead of shogun's traditional way of requiring all data to be in memory, features can now be streamed from e.g. disk, enabling the use of very large data sets. He implemented support for dense and sparse vector-based input streams as well as strings, and converted existing online learning methods to use this framework. He was particularly careful and even made it possible to emulate streaming from in-memory features. He finally integrated (parts of) vowpal wabbit, which is a very fast large-scale online learning algorithm based on SGD.
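To illustrate the streaming idea (a conceptual sketch in plain Python, not Shogun's or vowpal wabbit's actual interface), here is an SGD learner that consumes examples one at a time and thus never needs the full data set in memory:

```python
import numpy as np

def sgd_stream(example_stream, dim, lr=0.1):
    """Train a linear classifier with stochastic gradient descent on a
    stream of (x, y) examples, y in {-1, +1}. Only one example is held
    in memory at a time, so the data set can be arbitrarily large."""
    w = np.zeros(dim)
    for x, y in example_stream:
        if y * np.dot(w, x) < 1.0:   # hinge loss is active: update
            w += lr * y * x
    return w

def stream_from_memory(X, y):
    """Emulate a disk stream from in-memory arrays."""
    for i in range(len(y)):
        yield X[i], y[i]
```

In practice the generator would read and parse one line of a file at a time instead of indexing into arrays.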

### Expectation Maximization Algorithms for Gaussian Mixture Models

Alesis Novik (Mentor: Vojtech Franc)

The Expectation-Maximization (EM) algorithm is well known in the machine learning community. The goal of this project was a robust implementation of EM for Gaussian Mixture Models. Several computational tricks were applied to address numerical and stability issues, such as:

• Representing covariance matrices by their SVD
• Performing operations in the log domain to avoid overflow/underflow
• Enforcing minimum variances to avoid singular Gaussians
• Merging/splitting of Gaussians
An illustrative example of estimating a one and two-dimensional Gaussian follows below.
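For illustration, here is a simplified one-dimensional sketch of the EM procedure showing two of the tricks above, log-domain computations and variance flooring (a toy version, not the actual Shogun implementation):

```python
import numpy as np

def em_gmm_1d(x, n_components=2, n_iter=50, min_var=1e-6):
    """EM for a one-dimensional Gaussian mixture. Responsibilities are
    computed in the log domain (log-sum-exp) to avoid underflow, and
    variances are floored at min_var to avoid singular Gaussians."""
    # deterministic initialization from data quantiles
    mu = np.quantile(x, np.linspace(0.25, 0.75, n_components))
    var = np.full(n_components, x.var())
    pi = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: log responsibilities, normalized with log-sum-exp
        log_p = (np.log(pi)
                 - 0.5 * np.log(2.0 * np.pi * var)
                 - 0.5 * (x[:, None] - mu) ** 2 / var)
        r = np.exp(log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True))
        # M-step: re-estimate weights, means and (floored) variances
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = np.maximum((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk, min_var)
    return pi, mu, var
```

The full implementation additionally handles multivariate data (via the SVD representation of covariances) and merging/splitting of components.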

### Final Remarks

All in all, this year's GSoC has given the SHOGUN project a great push forward, and we hope that this will translate into an increased user base and numerous external contributions. We also hope that by providing bindings for many languages, we can offer neutral ground for machine learning implementations and thereby bring together communities centered around different programming languages. All that's left to say is that, given the great experience of this year, we'd be more than happy to participate in GSoC 2012.

## September 03, 2011 01:21 AM

### Soeren Sonnenburg

#### Hello World

I have finally managed to set up a blog using handcrafted html and django. Having had the experience of writing mloss.org, largescale.ml.tu-berlin.de and mldata.org, my homepage here was done rather quickly (one weekend of hacking). If anyone is interested in website design, I can only recommend using django. Work your way through the awesome tutorial or the book, and get inspired by the sources of the many django-based websites out there. Finally, keep an eye on django code snippets, which often contain small but useful functions and solutions to problems you might have.

## August 29, 2011 03:04 PM

### mloss.org

#### Mendeley/PLoS API Binary Battle

It seems that many challenges are being organised these days, but a recent announcement by Mendeley and PLoS caught my interest because their motivations are different. They have teamed up to create the Mendeley/PLoS API Binary Battle.

First, both of them deal with scientific publications, something we academics all care about. Any winner of a challenge that improves how we handle the bread and butter of academic life is thus of interest. Many of us already use Mendeley to manage our reading and may also have published in PLoS, so unlike previous competitions, the resulting application may affect machine learners as a whole.

Second, the challenge is not about predicting better, but about designing the most creative use of the PLoS or Mendeley API. This is an opportunity for fledgling machine learners to define interesting learning tasks. To stimulate your creativity, here are some ideas that have already been suggested.

In my experience with the practical side of machine learning, the problem in solving real applications is usually not a lack of access to data, but rather that the data is in the wrong format or only available on CDs or the like. A data API therefore goes a long way towards defining how to interact with the objects and how to formulate meaningful machine learning tasks. I think data APIs are something worth supporting.

As usual, there is prize money.

## August 11, 2011 06:19 PM

#### Vulnerability Extrapolation: Blackhat Machine Learning.

The discovery of vulnerabilities in source code is a challenging and fascinating task – for blackhats as well as whitehats. Security bugs are often deeply buried within a code base and considerable expertise is required to bring these gems to the surface. In recent years, several tools for automatically identifying vulnerabilities have been developed. However, these tools rather aim for spotting "low-hanging fruit" vulnerabilities and suffer from the inherent inability of one program to completely analyze another program's code. Hence, the generic task of finding security bugs remains an enjoyable though time-consuming manual task. The fast-paced race against malware proliferation, however, calls for more efficient means to discover vulnerabilities in source code.

Together with Fabian Yamaguchi and Felix 'FX' Lindner from Recurity Labs, we have recently addressed this problem and tried to bridge the gap between classic manual analysis and automatic tools. In this work, we propose a method for assisted analysis of source code that can guide a security researcher during auditing and indicate potentially vulnerable code. To this end, we employ our favorite workhorse – machine learning – in an offensive security context. Our method maps the source code of a software project to a vector space and identifies dominant patterns of API usage geometrically as principal directions. By projecting code to these directions, cruft and noise in the code are removed and we obtain a representation that allows us to compare code according to API usage patterns.
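As a toy illustration of this idea (with made-up function names and API counts; the real method extracts these representations from embedded source code rather than hand-built tables), one can represent each function by its API call counts, extract the dominant usage patterns via SVD, and rank functions by cosine similarity to a known-vulnerable one:

```python
import numpy as np

# Each function is represented by counts of the API calls it makes
# (hypothetical counts over a small API vocabulary).
api = ["malloc", "free", "strcpy", "sprintf", "memcpy"]
functions = {
    "parse_header": [2, 1, 1, 0, 1],
    "read_packet":  [2, 1, 1, 0, 2],
    "print_stats":  [0, 0, 0, 3, 0],
}
X = np.array(list(functions.values()), dtype=float)
Xc = X - X.mean(axis=0)

# Principal directions of API usage via SVD (k strongest patterns),
# then project each function onto them.
k = 2
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Xc @ Vt[:k].T

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Given a known-vulnerable function, rank the others by similarity of
# their API usage patterns -- candidates for extrapolation.
query = P[list(functions).index("parse_header")]
scores = {name: cosine(query, P[i]) for i, name in enumerate(functions)
          if name != "parse_header"}
```

Here `read_packet`, which shares the memory-management usage pattern of the query, scores far higher than the unrelated `print_stats`.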

Vulnerabilities often emerge from the incorrect usage of an API. For example, the incorrect usage of memory management (malloc, free, ...) and string manipulation (strcpy, sprintf, ...) is a well-known source of security problems. This is where our approach kicks in. The vector space constructed by our method describes the source code in terms of API usage patterns. Given a known vulnerability, we can easily locate code snippets sharing API usage patterns with it. These snippets are likely to suffer from the same incorrect API usage and may contain further vulnerabilities. We refer to this process as vulnerability extrapolation. Instead of digging around in the entire code base, we can thus focus on potentially vulnerable functions and thereby accelerate the discovery of security flaws. In a case study with the popular library FFmpeg, Fabian was able to discover a zero-day vulnerability and engineer a working exploit while inspecting only a few of the thousands of functions in the library.

This work on extrapolation of vulnerabilities has been presented at Blackhat 2011 and WOOT 2011. The abstract of the corresponding paper is here:
Rigorous identification of vulnerabilities in program code is a key to implementing and operating secure systems. Unfortunately, only some types of vulnerabilities can be detected automatically. While techniques from software testing can accelerate the search for security flaws, in the general case discovery of vulnerabilities is a tedious process that requires significant expertise and time. In this paper, we propose a method for assisted discovery of vulnerabilities in source code. Our method proceeds by embedding code in a vector space and automatically determining API usage patterns using machine learning. Starting from a known vulnerability, these patterns can be exploited to guide the auditing of code and to identify potentially vulnerable code with similar characteristics – a process we refer to as vulnerability extrapolation. We empirically demonstrate the capabilities of our method in different experiments. In a case study with the library FFmpeg, we are able to narrow the search for interesting code from 6,778 to 20 functions and discover two security flaws, one being a known flaw and the other constituting a zero-day vulnerability.
Vulnerability Extrapolation: Assisted Discovery of Vulnerabilities using Machine Learning. Fabian Yamaguchi, Felix 'FX' Lindner and Konrad Rieck. 5th USENIX Workshop on Offensive Technologies (WOOT), August, 2011.

## July 28, 2011 11:01 AM

### mloss.org

#### Bias corrected

In 1839, Samuel George Morton published what was to be a series of works on human skulls. In an amazing series of highly detailed experiments, he objectively studied the difference in cranial capacities between human populations. In this pre-Darwinian era, his systematic approach of measuring large numbers of specimens was revolutionary. In the end he had measurements of up to 1000 skulls.

With the new century, the question he was asking (whether humans had a single origin or multiple origins) faded into obscurity, along with his work. That changed in 1978, when Stephen Jay Gould published a paper in Science claiming that scientists are inherently biased. His prime example was Morton's experiments, and he argued that Morton's results (caucasians having the biggest brains, Indians in the middle, and negroes the smallest) were flawed. This Science paper would probably have lapsed into obscurity as well, except that Gould is a wonderful communicator. His 1981 book "The Mismeasure of Man" was a bestseller, and people took notice of the fact that scientists were, after all, human.

Did Morton fudge his results?

Fast forward to June 2011. Because Morton made all his results publicly available and kept exquisitely detailed records of his experiments (the equivalent of open access and open source today), paleoanthropologist Jason E. Lewis, with a team of other scientists, took another look at the original work. It turns out that Gould never examined the skulls himself but only read the papers. Lewis went down to U. Penn and re-measured 308 of the 670 original skulls. The conclusion?

Morton did not bias his results.

This amazing saga shows how important it is to question the "truth", and furthermore how important it is to keep records and materials such that the "truth" can be reinvestigated. To quote the PLoS Biology paper: "Our results resolve this historical controversy, demonstrating that Morton did not manipulate data to support his preconceptions, contra Gould. In fact, the Morton case provides an example of how the scientific method can shield results from cultural biases."

References:

## July 14, 2011 01:18 PM

#### Computer Security and Göttingen?

Here comes some good news: I have the great fortune of taking up a Junior-Professorship for Computer Security at the University of Göttingen by the end of this year. Though I leave my colleagues in Berlin with a heavy heart, I am really looking forward to this move and to a wealth of new activities, ranging from lecturing on computer security and malware analysis to pursuing my favorite strains of security research.

As part of the Junior-Professorship, a research assistant position is open at the University of Göttingen. The position is embedded in my new group and offers the opportunity to pursue a doctoral degree. If you are skilled and motivated, with a strong background in computer security, this might be interesting for you: German and English job descriptions. If you have any questions, feel free to contact me.

## June 30, 2011 04:43 PM

### mloss.org

At ICML yesterday, I saw two interesting papers about crowdsourcing. "Adaptively Learning the Crowd Kernel" looks at learning similarities between n objects by choosing triplets (a, b, c) and asking human experts to say whether a is nearer to b or to c. The second paper, "Active Learning from Crowds", proposes a probabilistic model for actively choosing both examples and expert annotators. Unfortunately, neither paper seems to have its software available online.

Some weeks ago, there was an outbreak of a particularly menacing strain (O104) of E. coli in Europe. Now, E. coli is one of the most widely studied organisms in biological labs worldwide, and its genome was one of the first to be published, back in the last century (1997). Its relatively small genome (4.6 million base pairs, containing 4288 annotated protein-coding genes) means that it can be sequenced quite quickly. In fact, nine isolates have now been sequenced by five different teams on four different sequencing platforms, including the Ion Torrent, Illumina HiSeq, Roche's 454 GS Junior, and most recently the Illumina MiSeq. From the sequencing perspective, this is really the first time the different next-generation sequencing platforms can be compared. There will definitely be some improvements in bioinformatics pipelines once researchers better understand the read errors on the different platforms by comparing them.

All this data has been collected on github, giving an excellent crowd sourced dataset for machine learners. This rich dataset could be used to study evolution, and also to understand the mutations that caused virulence. This provides a great opportunity for the machine learning community to break out of the binary classification mold, and study some interesting new machine learning tasks.

## June 16, 2011 10:38 AM

#### Similarity Measures for Sequential Data.

Strings are one of the fundamental data representations in computer science and at the core of many problems in related disciplines. Whether you search web pages for malicious code, try to locate genes in DNA, or study trends in Twitter messages: in the very end, many research tasks boil down to analysing strings and their similarity. As a result, I have been fascinated by string analysis for several years.

Some time ago, I was asked to write a review article on strings and similarity measures for Wiley Interdisciplinary Reviews. The article has now been published and is available online (HTML, PDF). It covers the basic concepts of similarity measures for strings. I'll spare you the details; however, I'd like to share some high-level insights from this work here.

Fun and pain with the edit distance: One of the oldest and most popular techniques for assessing the similarity of two strings is the edit distance. Simply put, the edit distance compares two strings by determining the smallest series of operations that transforms one string into the other. That is, the edit distance is constructive and a great tool when you are not only comparing strings but transforming them. For example, the edit distance is the tool of choice when eliminating typos in text or seeking a "diff" between two sequences. However, this power comes at a price: the run-time of the edit distance is quadratic in the length of the strings, even with clever tricks such as the Four Russians Method. For larger strings, the edit distance thus poses more of a problem than a solution.
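For reference, here is the classic dynamic program computing the (Levenshtein) edit distance; its nested loops make the quadratic run-time discussed above explicit:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, keeping only one
    row of the table: O(len(a) * len(b)) time, O(len(b)) space."""
    m, n = len(a), len(b)
    # dp[j] holds the distance between the current prefix of a and b[:j]
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # delete from a
                        dp[j - 1] + 1,                      # insert into a
                        prev + (a[i - 1] != b[j - 1]))      # substitute
            prev = cur
    return dp[n]
```

For example, `edit_distance("kitten", "sitting")` yields 3: substitute k→s and e→i, then append g.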

The impossible vector space model: Another popular concept covered in the article is the vector space model, or bag-of-words model. In this model, strings are characterized by their contained substrings (e.g. words or n-grams) and mapped to a high-dimensional vector space, where each dimension reflects the frequency of a particular substring. This mapping allows for applying the full machinery of mathematics to string data. Surprisingly, the security community has often refrained from using this model. Even worse, a misconception has spread that the curse of dimensionality generally impedes the analysis of such high-dimensional vectors. This is nonsense: analysing data with millions of dimensions has become everyday business in many areas of research. For example, my tool Sally can be applied out-of-the-box for comparing strings using the vector space model, see the examples.
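A minimal sketch of the bag-of-n-grams mapping (plain Python for illustration; Sally implements this far more efficiently):

```python
from collections import Counter
import math

def ngram_vector(s, n=3):
    """Map a string to a sparse vector of character n-gram frequencies."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(u[g] * v[g] for g in u if g in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

u = ngram_vector("the quick brown fox")
v = ngram_vector("the quick brown dog")
w = ngram_vector("lorem ipsum dolor")
```

Strings sharing many n-grams (`u` and `v`) come out far more similar than unrelated ones (`u` and `w`), and the sparse representation keeps the millions of possible dimensions cheap.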

In conclusion, the wealth of information that can be stored in strings is reflected in the myriad of ways to compare them. There is no silver bullet and often you simply need to pick the method that works best in the application at hand. While the edit distance is a great tool, it may suffer from excessive run-time. The vector space model is efficient and easy to use (and there is nothing wrong with using millions of dimensions). However, the model is not constructive and there is no trivial way to retrieve a "diff" from this representation. Finally, if you are looking for more advanced ways to torture string data, I recommend an article by Sören Sonnenburg on string kernels.