Large scale document classification with the Shogun Machine Learning Toolbox

By Evangelos Anagnostopoulos (GitHub ID: van51)

This notebook is about Document classification in Shogun. After providing a semi-formal introduction to the Bag of Words model and its limitations, we illustrate the hashing trick. This is consolidated by performing experiments on the large-scale webspam data set.

Document Classification


Document classification consists of assigning documents to specific predefined categories. This usually works by transforming the documents into a vector space that is acceptable by common classifiers, like SVMs, and then learning a model on that new representation.

The most common and the most widely used representation of document collections is the Bag of Words model.The BoW representation considers each document as a collection of tokens. A token is defined as the minimum arbitrary sequence of characters that can be considered as atomic. Tokens depending on the choice of the user can be either whole words or n-grams, etc. A simple approach is to consider every possible token in your document collection as a different dimension in a feature space and then vectorize each document by assigning 1 to every dimension that corresponds to a token contained in that document and 0 to every other. This is the simpler approach and is known as the Boolean Vector Space Model. Another approach is, instead of using {1,0}, to use a weight for every term, like their frequencies, but this is out of our scope.

Let's now consider a document collection consisting of the following documents:
    Document 1 = "This is the first document"
    Document 2 = "Document classification: Introduction"
    Document 3 = "A third document about classification"

and suppose that we consider as tokens, every word that appears in the collection.
Then we would arrive to the following representation:

Document 1Document 2Document 3
[0] this100
[1] is100
[2] the100
[3] first100
[4] document111
[5] classification011
[6] introduction010
[7] a001
[8] third001
[9] about001
The above matrix is called the document-term matrix and is widely used in Information Retrieval applications. In our case since every document is now represented by a numerical vector, we can just pass this as a dataset to common classifiers, including kernel machines.

Limitations on large scale collections

Although the aformentioned procedure is pretty intuitive and straight-forward it has some drawbacks when considered on large collections of documents.

  • First of all, the dimensionality of the feature space (or else, the number of distinct tokens in the collection) is usually not known beforehand and requires a pass over the entire dataset to be calculated. This can make the creation of the document-term matrix trickier, especially in online learning scenarios.
  • Secondly, this dimensionality can grow to be very large; imagine considering as tokens the set of every possible word, or every possible 8-gram.
  • Thirdly, this approach requires an additional dictionary in order to work that maps every token to its appropriate dimension, for instance every appearance of 'this' above should always be mapped to dimension 0, and this mapping data structure will also grow to be very large.

The hashing response

A convenient approach that eliminates all the problems that the BoW representation introduced is to use a hash function to transform all the tokens into numerical entities and then restrict those values into a range [0, n-1] using the modulo function. The values in the [0, n-1] range will correspond to indices on a n-dimensional document-term matrix.
This way:
  • the dimensionality of the target feature space is now known beforehand, as it's simply the number of all possible outcomes of the hash functions, restricted to a specific range [0, n-1] by using the modulo function. No extra passes over the dataset are needed this way.
  • In addition, the dimension can now be restricted to a certain size that is convenient depending on the limitations of the available hardware
  • The dictionary is also not needed anymore since each token is directly mapped to the appropriate dimension through the outcome of the hash function.

The above approach will result in a hashed document-term matrix, similar to the one shown above, of a pre-defined dimension size (number of rows).
Each token will now be assigned to a random dimension, although always the same for the same token, that is determined from the hash function.
This will probably incure some collisions, but if the hash functions is good enough and our choice for the dimension size is sufficiently large it will not be a problem. Especially if we are considering large collections.

After the creation of the hashed document-term matrix, the intuition is to pass it as a dataset to a linear classifier that will take care of the rest!

On-the-fly Hashing

A naive way to implement the hashing procedure is to pre-hash every token of the collection and store it to the disk for later access. However, this approach is not very convenient since it will require space analogous to the size of the original dimension. It will also make the experimentation on different parameters, like the size of the hash function or the choice of tokens, a bit trickier since one would have to re-convert the entire collection based on the choice of those parameters.
The response to that is to read our collection as it is and compute the hash of every token only when it's required, on-the-fly.

On-the-fly Hashing with Shogun

We will now have a look at how the above idea is represented in the Shogun Toolbox. That is we will see how we can load our document collection in memory and consider a hashed document-term matrix with the hashing of every document (or token more specifically) happening on-the-fly, only when it's required to be computed. Altough it may sound a bit tricky, it's actually pretty straightforward and here is how.

First of all we import the required components from the shogun library.

In [1]:
%matplotlib inline
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
from shogun import StringCharFeatures, RAWBYTE, HashedDocDotFeatures, NGramTokenizer

The StringCharFeatures are nothing more than a collection of char vectors, where each one represents a single document.
RAWBYTE is simply an enum that specifies that every possible character can be found in the collection.
The HashedDocDotFeatures is where all the magic happens! This class is responsible for encapsulating the document collection and providing the calculation of the hashed representation of each document whenever it is required.
The NGramTokenizer is the tokenizer we will use on our collection. Its job is to parse a document and create the tokens. As its name suggests, it creates a sequence of ngrams. Another option would be the DelimiterTokenizer, which allows the user to specify the delimiter that will be used in order to create the tokens.

Suppose now that we have the following documents:

In [2]:
doc_1 = "this is the first document"
doc_2 = "document classification introduction"
doc_3 = "a third document about classification"
document_collection = [doc_1, doc_2, doc_3]

We will take some time off now to assign each document to a category, to ease our work later on. Since the two last documents refer to classification we will label them as "Relevant" and the first document as "Not relevant". Since we only have two categories, this makes it a binary problem and we will represent "Relevant" as 1 and "Not relevant" as -1.

In [3]:
from shogun import BinaryLabels
from numpy import array

labels = BinaryLabels(array([-1, 1, 1]))

We now create our document collection:

In [4]:
string_features = StringCharFeatures(document_collection, RAWBYTE)

Now we will create the object responsible for the hashed BoW representation. We are going to specify that we want a hash size of 8 bits, which will be translated to a dimension of size 2^8 = 256 (powers of 2 are considered to speed up computatins) and a tokenizer that creates 5-grams. We will also specify that we want to normalize each hashed vector with regards to the square root of the document length, which is a common approach:

In [5]:
hash_size = 8
tokenizer = NGramTokenizer(5)
normalize = True

hashed_feats = HashedDocDotFeatures(hash_size, string_features, tokenizer, normalize)

And that was it!
The hashed_feats object now has all the required parameters and knows how to communicate and provide the hashed representation only when it is needed by the various algorithms of Shogun.

So how do we proceed to actually learn a model that can classify documents?
We said before that the idea after we have taken care of the hashing is to use a linear classifier.
Shogun has many linear classifiers available, but for the moment we will consider SVMOcas and we will see how to use it:

In [6]:
from shogun import SVMOcas

C = 0.1
epsilon = 0.01
svm = SVMOcas(C, hashed_feats, labels)

We have now created our svm. The parameter C specifies the regularization constant. The best choice for this parameter will usually be selected after a model selection process.
Now, let's train our svm:

In [7]:

When the execution finishes, we will have learned our so desired linear model! Mind that for large collections the above call can take hours.
Now that we have a model ready, we would like to see how well it does. Let's see what it predicts for our training data:

In [8]:
predicted_labels = svm.apply()
print (predicted_labels.get_labels())
[ 1.  1.  1.]

We can see that it misclassified the first document. This has to do with the nature of our overly-simplified toy dataset which doesn't provide enough information. However, another option of the HashedDocDotFeatures class will allow us to extract some more information from the same dataset!

Quadratic Features

Generally, the quadratic features refer to the production of new features from our current dataset, by multiplying currently existing features.
For example, if we have a dataset consisting of the following example vector:
We can create the following extra quadratic features:
The creation of these features can be very useful, however it should be used with caution since it will also increase the training time and the size requirements.
The idea behind the quadratic features is very simple and can be easily implemented for numerical features. However in the case of text collections, where we have tokens instead of numbers, things are not so straightforward.

The approach we have selected in Shogun for the text collections does not compute all the quadratic features, but only a subset of them defined using a k-skip n-grams rule. N-grams in this context refer to a collection of up to n consecutive tokens and should not be confused with our choice for what consists a token, where it can be whole words or ngrams themselves. The k-skip in front allows us to skip up to k tokens when we group them in up-to n sized groups. Although this is a simple idea it can be a bit confusing when introduced through a definition, so let's have a look at an example: Suppose we have the following consecutive tokens:
Tokens : ["a", "b", "c", "d"]
and that we want to consider a 2-skip 2-grams approach. This means that we can combine up to 2 tokens (that is a single token or two tokens) while skipping up to 2 tokens between them (that means we can skip 0,1 and 2 tokens). Therefore, we obtain the following combinations:
["a" (0-skips, 1-gram), "ab" (0-skips, 2-gram), "ac" (1-skip 2-gram), "ad" (2-skips, 2-gram), "b" (0-skips, 1-gram), "bc" (0-skips, 2-gram), "bd" (1-skip, 2-gram), "c" (0-skip, 1-gram), "cd" (0-skip, 2-gram), "d" (0-skip, 1-gram)]

These newly created tokens are then treated like our regular ones and hashed to find their appropriate dimension.
As discussed above this idea may enhance our information on the dataset, but it also requires more time to be computed. So if used, the choice of k and n should be kept to small numbers.

Quadratic Features with HashedDocDotFeatures

In order to use the k-skip n-gram approach explained above, one only has to specify his choice for k and n to the HashedDocDotFeatures object he is creating.
So, by keeping our previous choice of parameters intact and introducing the k and n we obtaing the following piece of code :

In [9]:
k = 3 # number of tokens, up to which we allow it to skip
n = 3 # number of tokens, up to which we allow it to combine

hashed_feats_quad = HashedDocDotFeatures(hash_size, string_features, tokenizer, normalize, n, k)

If we do not specify these numbers, as we did not do before, then, (you maybe have guessed it!) they are set by default to the following values, n=1, k=0!
Let's see if there was any improvement in our predicting capabilities.
By keeping the same svm settings, we will train on the new HashedDocDotFeatures object we have just created. Although we have the same document collection underneath, we are now considering a different representation model because of the quadratic features.

In [10]:
predicted_labels = svm.apply()
[-1.  1.  1.]


Attention! Some clarifications now will follow on the Quadratic Features to clear any misunderstandings:

  • The term "ngram" can come up on two different levels.
    The first one is in the tokenization process of the documents, where we have a raw stream of characters and our job there is to split the document on sequences of characters that are considered atomic. For instance, we may decide that sequences of three characters are the basic units which make up a document and therefore we are considering 3-grams (for tokens).
    The second level, is when we are combining the tokens generated in the previous phase to create quadratic features, by following the aforementioned k-skip n-grams rule. Now we are not considering anymore the document as a sequence of characters, but rather as a sequence of tokens. And it is on those tokens that we are now applying the rule. Therefore, if we decide to apply a 0-skip 3-grams rule then we may combine up to 3 tokens (not characters). In conclusion, the way the quadratic features work is completely independent of the Tokenizer used on the documents. We could have chosen a DelimiterTokenizer and still consider ngrams (of tokens) on the quadratic level.
  • The "quadratic" in front of "quadratic features" also refers to the way the computing time increases when we select this approach! So it should only be used when it is necessary.

Experiments on real datasets

The example that was demonstrated before was very small and simple and was manipulated for demonstration purposes, to show the capabilities of the HashedDocDotFeatures class.


Now let's have a look at the results the HashedDocDotFeatures class together with SVMOcas produced when faced against the webspam dataset, which consists of 350000 documents labelled either as "Spam" or as "Not spam".
We demonstrate results when compared against SVMLight, using the Spectrum Kernel.
We consider the following settings:
number of bits in hash = 16, 8-grams for tokens, SVM epsilon = 0.01.

In [11]:
from pylab import *

# HashedDocDotFeatures results
hashed_training_examples = [5000, 10000, 15000, 20000, 25000, 30000, 50000, 100000]

# For C=1
hashed_C_1_sec = [2682.750000,5202.690000,8120.460000,10846.410000,13944.200000,17016.840000,30496.720000,66302.950000]
hashed_C_1_roc = [0.980730,0.986382,0.988894,0.990666,0.991602,0.991957,0.993680,0.995184]

# For C=0.1
hashed_C_01_sec = [1074.130000,2142.390000,3434.710000,4641.380000,5984.530000,7206.040000,12864.270000,28393.540000]
hashed_C_01_roc = [0.976560,0.982660,0.985251,0.987380,0.988368,0.989022,0.990950,0.993197]

# Spectrum kernel results
kernel_training_examples = [5000, 10000, 15000, 20000, 25000]

# For C=1
kernel_C_1_sec = [2912.410000,6543.220000,10840.550000,16108.360000,19899.610000]
kernel_C_1_roc = [0.971284,0.976628,0.979715,0.982084,0.984355]

# For C=0.1
kernel_C_01_sec = [1441.380000,3261.870000,5071.040000,7568.130000,10436.430000]
kernel_C_01_roc = [0.946308,0.955245,0.961576,0.965204,0.968264]

plot(hashed_training_examples, hashed_C_1_sec, 'b')
plot(kernel_training_examples, kernel_C_1_sec, 'r')
title("Time comparison for C=1")
xlabel("Number of examples")
ylabel("Time in seconds")
legend(["HashedDocDotFeatures", "Spectrum Kernel"], loc=2)

plot(hashed_training_examples, hashed_C_1_roc, 'b')
plot(kernel_training_examples, kernel_C_1_roc, 'r')
title("Area under ROC comparison for C=1")
xlabel("Number of examples")
_=legend(["HashedDocDotFeatures", "Spectrum Kernel"], loc=4)
In [12]:
plot(hashed_training_examples, hashed_C_01_sec, 'b')
plot(kernel_training_examples, kernel_C_01_sec, 'r')
title("Time comparison for C=0.1")
xlabel("Number of examples")
ylabel("Time in seconds")
legend(["HashedDocDotFeatures", "Spectrum Kernel"], loc=2)

plot(hashed_training_examples, hashed_C_01_roc, 'b')
plot(kernel_training_examples, kernel_C_01_roc, 'r')
title("Area under ROC comparison for C=0.1")
xlabel("Number of examples")
_=legend(["HashedDocDotFeatures", "Spectrum Kernel"], loc=4)

Language detection

Another experiment we conducted was to create a language detection application. For that purpose we created a dataset consisting of 10,000 documents for 5 different languages; English, Greek, German, Italian and Spanish.
The demo uses underneath it a svm model trained on 30,000 documents with Ngrams of size 4, bits of hash size = 18, quadratic features with options: 2-skips 3-grams and svm C = 0.1. The demo can be found here and you can test it on your own how well it does!

Extra stuff

The hashing trick has also been implemented in Shogun for the DenseFeatures and SparseFeatures classes, as HashedDenseFeatures and HashedSparseFeatures classes. These classes also support quadratic features and provide the option to maintain or drop the linear terms.
For online algorithms or for cases when the data do not fit in memory there exist similar classes with the prefix "Streaming" that support reading examples from the disk and providing them to some algorithms one at a time. The classes specifically are StreamingHashedDocDotFeatures, StreamingHashedDenseFeatures and StreamingHashedSparseFeatures. If one has mixed features, that are not just numerical or not just text, then he can use the CombinedDotFeatures class to combine objects of the aforementioned classes!
Another option is to use the Vw* algorithms and the VwExample class that require the input to be in vw format and are a bit trickier to use.
In addition, a HashedDocConverter class exists that allows the user to get the hashed BoW representation of a document collection as a CSparseFeatures object.


  1. Kilian Weinberger, Anirban Dasgupta, Josh Attenberg, John Langford, Alex Smola : Feature Hashing for Large Scale Multitask Learning
  2. Olivier Chapelle, Eren Manavoglu, Romer Rosales : Simple and scalable response prediction for display advertising
  3. S̈oren Sonnenburg, Gunnar R̈atsch, Konrad Rieck : Large Scale Learning with String Kernels Vowpal Wabbit
  4. A post in metaoptimize
  5. </ol>
    Special thanks to my mentors for this project on GSoC 2013 : Soeren Sonnenburg, Olivier Chapelle and Benoit Rostykus