Shogun - A Large Scale Machine Learning Toolbox

This is the official homepage of the SHOGUN machine learning toolbox.


The toolbox focuses on large scale kernel methods, especially Support Vector Machines (SVM) [1]. It provides a generic SVM object that interfaces to several different SVM implementations, among them the state-of-the-art OCAS [21], Liblinear [20], LibSVM [2], SVMLight [3], SVMLin [4] and GPDT [5]. Each of the SVMs can be combined with a variety of kernels. The toolbox not only provides efficient implementations of the most common kernels, such as the Linear, Polynomial, Gaussian and Sigmoid kernels, but also comes with a number of recent string kernels, e.g. the Locality Improved [6], Fisher [7], TOP [8], Spectrum [9] and Weighted Degree (with shifts) [10] [11] [12] kernels. For the latter, the efficient LINADD [12] optimizations are implemented. For linear SVMs, the COFFIN framework [22] [23] allows feature spaces to be computed on demand, even mixing sparse, dense and other data types.

Furthermore, SHOGUN offers the freedom of working with custom pre-computed kernels. One of its key features is the combined kernel, which is constructed as a weighted linear combination of a number of sub-kernels that need not all work on the same domain. An optimal sub-kernel weighting can be learned using Multiple Kernel Learning [13] [14] [18] [19].

Currently, SVM one-class, two-class and multiclass classification as well as regression problems can be handled. SHOGUN also implements a number of linear methods such as Linear Discriminant Analysis (LDA), the Linear Programming Machine (LPM) and (Kernel) Perceptrons, and features algorithms to train hidden Markov models. The input feature objects can be dense, sparse or strings, of type int/short/double/char, and can be converted into different feature types. Chains of preprocessors (e.g. subtracting the mean) can be attached to each feature object, allowing for on-the-fly pre-processing.
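The combined kernel idea described above can be illustrated with a small, self-contained sketch. This is plain Python, not SHOGUN's API: the function names, the Gaussian width convention exp(-||x-y||^2 / width), the toy data and the hand-picked weights are all assumptions made for illustration. It combines a Gaussian kernel on numeric vectors with a spectrum kernel on strings, so the two sub-kernels work on different domains.

```python
import math
from collections import Counter


def gaussian_kernel(x, y, width=1.0):
    """Gaussian kernel on two equal-length numeric vectors.

    Assumed convention: exp(-||x - y||^2 / width).
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / width)


def spectrum_kernel(s, t, k=2):
    """Spectrum kernel: inner product of the k-mer count vectors of two strings."""
    counts_s = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    counts_t = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return float(sum(c * counts_t[kmer] for kmer, c in counts_s.items()))


def combined_kernel(example_a, example_b, weights):
    """Weighted linear combination of sub-kernels over different domains.

    Each example is a (vector, sequence) pair. The weights are fixed by
    hand here; Multiple Kernel Learning would learn them from data.
    """
    vec_a, seq_a = example_a
    vec_b, seq_b = example_b
    return (weights[0] * gaussian_kernel(vec_a, vec_b)
            + weights[1] * spectrum_kernel(seq_a, seq_b))


def kernel_matrix(examples, weights):
    """Full kernel matrix, usable as a custom pre-computed kernel."""
    return [[combined_kernel(a, b, weights) for b in examples]
            for a in examples]


# Hypothetical toy data: a numeric vector paired with a DNA string.
examples = [
    ([0.0, 1.0], "ACGT"),
    ([0.0, 1.0], "ACGA"),
    ([2.0, 0.0], "TTTT"),
]
K = kernel_matrix(examples, weights=(0.5, 0.5))
for row in K:
    print(["%.4f" % value for value in row])
```

Because a non-negative weighted sum of valid kernels is again a valid kernel, the resulting matrix could be handed to any SVM that accepts a pre-computed kernel.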

SHOGUN is implemented in C++ and interfaces to Matlab(tm), R, Octave and Python and is proudly released as Machine Learning Open Source Software.

We are Participating in the Google Summer of Code 2012 Program

In case you are a talented student interested in a summer project, we are looking for you! For Summer of Code 2012 we aim at:
  1. Improving accessibility to shogun (interfaces, i/o support (more file formats)...)
  2. Framework improvements (frameworks for regression, multiclass, structured output, QP solvers).
  3. Integration of existing and new machine learning algorithms.
Check out our ideas list and instructions on how to apply. To get an idea what shogun is about check out the documentation and read our overview paper:
Soeren Sonnenburg, Gunnar Raetsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien,
Fabio de Bona, Alexander Binder, Christian Gehl, and Vojtech Franc.
The SHOGUN Machine Learning Toolbox. Journal of Machine Learning Research, 11:1799-1802, June 2010.

Screenshots

As everyone likes screenshots, we have produced one for each interface: SHOGUN with Octave, Matlab, Python and R. Click on the link for higher resolution images.

Octave Demo Matlab Demo Python Demo R Demo

Applications

We have successfully used this toolbox to tackle the following sequence analysis problems: Protein Super Family classification, Splice Site Prediction [10] [15] [16], Interpreting the SVM Classifier [13] [14], Splice Form Prediction [10], Alternative Splicing [11] and Promoter Prediction [17]. Some of them come with no less than 10 million training examples, others with 7 billion test examples. A graphical example is handwritten digit recognition.

Licensing Information

Except for SVMLight, which is (C) Torsten Joachims and follows a different licensing scheme (cf. LICENSE.SVMLight in the tar archive), SHOGUN is licensed under the GPL version 3 or any later version (cf. LICENSE).

Cite us

If you use SHOGUN in your research you are kindly asked to cite the following paper:

Soeren Sonnenburg, Gunnar Raetsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien,
Fabio de Bona, Alexander Binder, Christian Gehl, and Vojtech Franc.
The SHOGUN Machine Learning Toolbox. Journal of Machine Learning Research, 11:1799-1802, June 2010.

Download Releases

SHOGUN Version 1.1.0 (lib 11.0, data 0.3, param 0)

(updated 01.12.2011) Older Versions

This release contains several enhancements, cleanups and bugfixes:

  • Features:
    • New dimensionality reduction algorithms: Diffusion Maps, Kernel Locally Linear Embedding, Kernel Local Tangent Space Alignment, Linear Local Tangent Space Alignment, Neighborhood Preserving Embedding, Locality Preserving Projections.
    • Various performance improvements for dimensionality reduction methods (BLAS, alignment formulation of the LLE, ...).
    • Automatic k determination mode for the Locally Linear Embedding dimensionality reduction method, based on reconstruction error.
    • ARPACK and SUPERLU integration.
    • Introduce the concept of Converters that can embed (arbitrary) feature types into different feature types.
    • LibSVM is now pthread-parallelized.
    • Create modshogun.dll for csharp.
    • Various new c# examples (thanks Daniel Korn and Ori Cohen).
    • A dimensionality reduction examples application is introduced.
  • Bugfixes:
    • Octave_static and octave_modular examples fix.
    • Memory leak in custom kernel is now eliminated (thanks Madeleine Seeland for reporting).
    • Fix for linear machine set_w method (thanks Brian Cheung for reporting).
    • DotFeatures fix for assert bug.
    • FibonacciHeap memory leak fix.
    • Fix for Java modular interface typemapping bug.
    • Fix errors uncovered by LLVM / clang++.
    • Fix for configure on Darwin-x86_64 (thanks Peter Romov for patch).
    • Improve lua / ruby detection.
    • Fix configure / compilation under osx and cygwin for various interfaces.
  • Cleanup and API Changes:
    • Most of the inline functions have been (re)moved to the corresponding .cpp file
    • Libshogun is now compiled with SSE support for math (if available), but the interfaces are now compiled with the -O0 flag, which drastically reduces compilation time.

Documentation and Examples

We use Doxygen for both user and developer documentation which may be read online here. More than 600 documented examples for the interfaces python_modular, octave_modular, r_modular, static python, static matlab and octave, static r, static command line and C++ libshogun developer interface can be found in the online documentation. In addition, examples are shipped in the examples/(un)documented/[interface] directory in the source code (where interface is one of r, octave, matlab, python, python_modular, r_modular, octave_modular, cmdline, libshogun).

English

Chinese

Note that documentation for python-modular is most complete and also that python's help function will show the documentation when working interactively:

$ python
Python 2.4.4 (#2, Jan  3 2008, 13:36:28) 
[GCC 4.2.3 20071123 (prerelease) (Debian 4.2.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from shogun.Classifier import SVM
>>> help(SVM)

class SVM(CSVM)
 |  Method resolution order:
 |      SVM
 |      CSVM
 |      CKernelMachine
 |      Classifier
 |      SGObject
 |      __builtin__.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, kernel, alphas, support_vectors, b)
[...]

Publications and Presentations

We have presented shogun on numerous occasions and provide additional material below.

Bug-Reports, Mailinglist, Planet

In case you find bugs or have feature requests please use the github issue tracker. Check the buildbot for current issues.

Alternatively use the mailinglist (subscription required) if you have comments, problems or questions etc.

We have set up shogun planet for related blogs and blogs of developers.

IRC and Contact

You can chat with us via IRC. Fire up your IRC client and point it to connect to the IRC channel #shogun at irc.freenode.net. You can also connect via webchat #shogun directly in your browser. Note that we just recently started this channel (March 2011) and make chat logs available for your convenience.

In case you need to directly get in touch with us, feel free to contact

Developer Information

Want to contribute? We maintain SHOGUN's source code via git and are looking forward to your patches!

Class Design and Source Code

class list

Related Projects

project | created | last updated | main language | main focus
shogun | 1999 | 03-2010 | C++ | Large Scale Kernel Methods; String Features; SVMs
weka | 1997 | 01-2010 | java | General Purpose ML Package
kernlab | 04-2004 | 10-2009 | R | Kernel Based Classification/Dimensionality Reduction
dlib | 2006 | 03-2010 | C++ | Portability; Correctness
nieme | 09-2006 | 03-2009 | C++ | Linear Regression; Ranking; Classification
orange | 06-2004 | 03-2010 | python | Visual Data Analysis
java-ml | 08-2008 | 08-2009 | java | Feature Selection
pyML | 08-2004 | 01-2009 | C++; python | Kernel Methods
mlpy | 02-2008 | 11-2009 | python | Basic Algorithms
pybrain | 10-2008 | 11-2009 | python | Reinforcement Learning
torch3 | 01-2002 | 11-2004 | C++ | Kernel-based Classification

Feature matrix

The pdf document with the machine learning toolbox feature comparison that we originally submitted to JMLR can be found here. An up-to-date version of this matrix is located at Google Spreadsheet. Please notify us about possible corrections and changes.

The table below compares shogun with the popular machine learning toolboxes weka, kernlab, dlib, nieme, orange, java-ml, pyML, mlpy, pybrain, torch3 and scikit-learn. A '?' denotes unknown, a '-' denotes that the feature is missing. This table is available as a Google spreadsheet.

Legend: + supported, - missing, ? unknown or untested

feature | shogun | weka | kernlab | dlib | nieme | orange | java-ml | pyML | mlpy | pybrain | torch3 | scikit-learn

General Features
Graphical User Interface | - | + | - | + | + | + | - | - | - | + | + | -
One Class Classification | + | + | + | + | - | - | - | + | - | - | - | +
Classification | + | + | + | + | + | + | + | + | + | + | + | +
Multiclass classification | + | + | + | - | + | - | + | + | + | + | + | +
Regression | + | + | + | + | + | + | - | + | - | + | + | +
Structured Output Learning | + | - | - | - | + | - | - | - | - | - | - | -
Pre-Processing | + | + | + | + | + | + | + | + | + | - | + | +
Built-in Model Selection Strategies | + | + | + | + | - | + | + | + | - | - | - | +
Visualization | - | + | - | - | + | + | - | + | + | + | + | +
Test Framework | + | + | - | + | + | ? | + | - | - | - | - | +
Large Scale Learning | + | - | - | + | + | - | - | - | + | - | - | -
Semi-supervised Learning | - | - | - | - | - | - | - | - | - | - | - | -
Multitask Learning | + | - | - | - | - | - | - | - | - | - | - | -
Domain Adaptation | + | - | - | - | - | - | - | - | - | - | - | -
Serialization | + | + | + | + | + | + | + | + | + | + | - | +
Parallelized Code | + | + | - | + | - | - | - | - | - | - | - | +
Performance Measures (auROC etc) | + | + | - | + | + | + | + | + | + | + | + | +
Image Processing | - | - | - | + | - | - | - | - | - | - | - | -

Supported Operating Systems
Linux | + | + | + | + | + | + | + | + | + | + | + | +
Windows | + | + | + | + | + | + | + | - | + | + | + | +
Mac OSX | + | + | + | + | + | + | + | + | + | - | + | +
Other Unix | + | + | + | + | + | + | + | - | + | - | + | +

Language Bindings
Python | + | - | - | - | + | + | - | + | + | + | - | +
R | + | - | + | - | - | - | - | - | - | - | - | -
Matlab | + | - | - | - | - | - | - | - | - | - | - | -
Octave | + | - | - | - | - | - | - | - | - | - | - | -
C/C++ | + | - | - | + | + | - | - | - | - | - | + | -
Command Line | + | - | - | - | - | - | - | - | + | + | + | -
Java | + | + | - | - | + | - | + | - | - | - | - | -
C# | + | - | - | - | - | - | - | - | - | - | - | -
Lua | + | - | - | - | - | - | - | - | - | - | - | -
Ruby | + | - | - | - | - | - | - | - | - | - | - | -

SVM Solvers
SVMLight | + | + | - | - | - | - | - | - | - | - | - | -
LibSVM | + | + | + | + | + | + | + | + | - | + | - | +
SVM Ocas | + | - | - | + | - | - | - | - | - | - | - | -
LibLinear | + | + | - | - | - | - | - | - | - | - | - | +
BMRM | + | - | - | - | - | - | - | - | - | - | - | -
LaRank | + | - | - | - | - | - | - | - | - | - | - | -
SVMPegasos | - | + | - | + | + | - | - | - | - | - | - | -
SVM SGD | + | - | - | - | - | - | - | - | - | - | - | +
other | + | - | + | - | - | - | - | + | + | - | + | -

Regression
Kernel Ridge Regression | + | - | - | - | - | - | - | + | - | - | - | +
Support Vector Regression | + | + | + | - | - | - | - | + | - | - | + | +
Gaussian Processes | - | + | + | - | - | - | - | - | - | - | - | +
Relevance Vector Machine | - | + | + | + | - | - | - | - | - | - | - | -

Multiple Kernel Learning
MKL | + | - | - | - | - | - | - | - | - | - | - | -
q-norm MKL | + | - | - | - | - | - | - | - | - | - | - | -

Classifiers
Naive Bayes | + | + | - | - | - | + | - | - | - | + | + | +
Bayesian Networks | - | + | - | + | - | - | - | - | - | + | - | -
Multi Layer Perceptron | - | + | - | + | + | - | - | - | - | + | + | -
RBF Networks | - | + | - | + | - | - | - | - | - | + | - | -
Logistic Regression | + | + | ? | - | + | + | - | - | - | - | - | +
LASSO | - | - | ? | - | + | - | - | - | - | - | - | +
Decision Trees | - | + | - | - | - | + | + | - | - | - | - | -
k-NN | + | + | + | + | - | + | + | + | + | + | + | +

Linear Classifiers
Linear Programming Machine | + | - | - | - | - | - | - | - | - | - | - | -
LDA | + | - | - | - | - | - | - | - | + | - | - | +

Distributions
Markov Chains | + | - | - | - | - | - | + | - | - | - | - | -
Hidden Markov Models | + | - | - | - | - | - | - | - | - | - | + | +

Kernels
Linear | + | + | + | + | + | + | + | + | + | + | + | +
Gaussian | + | + | + | + | - | + | + | + | + | + | + | +
Polynomial | + | + | + | + | - | + | + | + | + | + | + | +
String Kernels | + | + | + | - | - | - | - | + | - | - | - | -
Sigmoid Kernel | + | + | - | + | - | + | - | - | - | - | - | +
Kernel Normalizer | + | ? | + | - | - | - | - | + | - | - | - | ?

Feature Selection
Forward | - | + | - | ? | - | + | + | + | + | - | - | +
Wrapper methods | - | + | - | ? | - | ? | + | + | + | - | - | -
Recursive Feature Selection | - | + | - | + | - | ? | + | + | + | - | - | +

Missing Features
Mean value imputation | - | + | - | - | - | + | + | - | + | - | - | -
EM-based/model based imputation | - | + | - | - | - | + | - | - | - | - | - | -

Clustering
Hierarchical Clustering | + | + | - | - | - | + | - | - | + | - | - | +
k-means | + | + | + | + | - | + | + | + | + | + | + | +

Optimization
BFGS | - | + | - | + | + | - | - | - | - | - | - | -
conjugate gradient | - | - | - | + | - | - | - | - | - | - | - | -
gradient descent | + | + | + | - | + | - | - | - | + | + | + | +
bindings to CPLEX | + | - | - | - | - | - | - | - | - | - | - | -
bindings to Mosek | - | - | - | - | - | - | - | - | - | - | - | -
bindings to other solver | + | - | + | - | - | + | - | + | - | - | - | +

Supported File Formats
Binary | + | + | - | - | - | - | - | - | - | + | - | +
Arff | - | + | - | - | - | - | + | - | - | - | - | -
HDF5 | + | - | + | - | - | - | - | - | - | - | - | -
CSV | - | + | + | - | - | + | + | + | + | - | + | +
libSVM/SVMLight format | + | + | - | + | + | - | - | + | - | + | - | +
Excel | - | - | + | - | - | + | - | - | - | - | - | -

Supported Data Types
Sparse Data Representation | + | + | - | + | + | + | + | + | + | + | - | +
Dense Matrices | + | + | + | + | - | + | + | + | + | + | + | +
Strings | + | + | + | + | - | - | - | - | - | - | + | +
Support for native (e.g. C) types (char, signed and unsigned int8, int16, int32, int64, float, double, long double) | + | - | - | + | - | - | - | - | + | - | - | +

Acknowledgements

The authors gratefully acknowledge the support of DFG grants MU 987/2-1, MU 987/6-1 and RA-1894/1-1, and of the PASCAL Network of Excellence.

References

[1] C. Cortes and V. N. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
[2] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[3] T. Joachims. Making large-scale SVM learning practical. In B. Schoelkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 169-184, Cambridge, MA, 1999. MIT Press.
[4] V. Sindhwani and S. S. Keerthi. Large Scale Semi-supervised Linear SVMs. SIGIR, 2006.
[5] L. Zanni, T. Serafini, and G. Zanghirati. Parallel Software for Training Large Scale Support Vector Machines on Multiprocessor Systems. JMLR, 7(Jul):1467-1492, 2006.
[6] A. Zien, G. Raetsch, S. Mika, B. Schoelkopf, T. Lengauer, and K.-R. Mueller. Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites. Bioinformatics, 16(9):799-807, September 2000.
[7] T. S. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, pages 487-493, 1999.
[8] K. Tsuda, M. Kawanabe, G. Raetsch, S. Sonnenburg, and K.-R. Mueller. A new discriminative kernel from probabilistic models. Neural Computation, 14:2397-2414, 2002.
[9] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauderdale, and T. E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing, pages 564-575, Kaua'i, Hawaii, 2002.
[10] G. Raetsch and S. Sonnenburg. Accurate Splice Site Prediction for Caenorhabditis Elegans, pages 277-298. MIT Press series on Computational Molecular Biology. MIT Press, 2004.
[11] G. Raetsch, S. Sonnenburg, and B. Schoelkopf. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21:i369-i377, June 2005.
[12] S. Sonnenburg, G. Raetsch, and B. Schoelkopf. Large scale genomic sequence SVM classifiers. In Proceedings of the 22nd International Machine Learning Conference. ACM Press, 2005.
[13] S. Sonnenburg, G. Raetsch, and C. Schaefer. Learning interpretable SVMs for biological sequence classification. In RECOMB 2005, LNBI 3500, pages 389-407. Springer-Verlag Berlin Heidelberg, 2005.
[14] G. Raetsch, S. Sonnenburg, and C. Schaefer. Learning Interpretable SVMs for Biological Sequence Classification. BMC Bioinformatics, Special Issue from NIPS workshop on New Problems and Methods in Computational Biology, Whistler, Canada, 18 December 2004, 7(Suppl. 1):S9, March 2006.
[15] S. Sonnenburg. New methods for splice site recognition. Master's thesis, Humboldt University, 2002. Supervised by K.-R. Mueller, H.-D. Burkhard, and G. Raetsch.
[16] S. Sonnenburg, G. Raetsch, A. Jagota, and K.-R. Mueller. New methods for splice-site recognition. In Proceedings of the International Conference on Artificial Neural Networks, 2002. Copyright by Springer.
[17] S. Sonnenburg, A. Zien, and G. Raetsch. ARTS: Accurate Recognition of Transcription Starts in Human. 2006. (accepted).
[18] S. Sonnenburg, G. Raetsch, C. Schaefer, and B. Schoelkopf. Large Scale Multiple Kernel Learning. Journal of Machine Learning Research, 2006. K. Bennett and E. P.-Hernandez, editors.
[19] M. Kloft, U. Brefeld, S. Sonnenburg, A. Zien, P. Laskov, and K.-R. Mueller. Efficient and Accurate Lp-Norm Multiple Kernel Learning. Advances in Neural Information Processing Systems 21, MIT Press, Cambridge, MA, 2009.
[20] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871-1874, 2008. Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear/
[21] V. Franc and S. Sonnenburg. Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization. Journal of Machine Learning Research, 10:2157-2192, 2009. Software available at http://jmlr.csail.mit.edu/papers/v10/franc09a.html
[22] S. Sonnenburg and V. Franc. COFFIN: A Computational Framework for Linear SVMs. Research Report, Center for Machine Perception, K13133 FEE Czech Technical University, 2009.
[23] S. Sonnenburg and V. Franc. COFFIN: A Computational Framework for Linear SVMs. Proceedings of the 27th International Machine Learning Conference, 2010.