# Shogun Planet

## March 01, 2016 11:15 AM

### Heiko Strathmann

#### Google Summer of Code 2016

Great news: Shogun just got accepted to the GSoC 2016. After our break year in 2015, we are extremely excited to continue our GSoC tradition starting in 2011 (when I first joined Shogun).

If you are a student and wish to spend the summer hacking Machine Learning, guided by a vibrant international community of academics, professionals, and NERDS, then pay us a visit. Oh, and you will receive a cheque over $5000 from Google. This year, we focus on framework improvements rather than solely adding new algorithms. Consequently, most projects have a heavy focus on packaging and software engineering questions. But there will be Machine Learning too. We are aiming high! Check our our ideas list and read how to get involved. ## December 24, 2015 08:48 PM ### Heiko Strathmann #### Adaptive Kernel Sequential Monte Carlo And when your name is Dino Sejdinovic, you can actually integrate out all steps in feature space….. Ingmar Schuster wrote a nice blog post about our recently arxived paper draft. It is about using the Kameleon Kernel Adaptive Metropolis-Hastings proposal as an MCMC rejuvenation step in a Sequential Monte Carlo context. ## July 21, 2015 11:30 AM ### Heiko Strathmann #### Kamiltonian Monte Carlo Together with Dino, Sam, Zoltan, and Arthur, I recently arxived a first draft published an article on a project that combines two topics — the combination of which I find rather exciting: kernel methods and Hamiltonian Monte Carlo. Updates. Here is the abstract. ###### We propose Kamiltonian Kernel Hamiltonian Monte Carlo (KMC), a gradient-free adaptive MCMC algorithm based on Hamiltonian Monte Carlo (HMC). On target densities where classical HMC is not an option due to intractable gradients, KMC adaptively learns the target’s gradient structure by fitting an exponential family model in a Reproducing Kernel Hilbert Space. Computational costs are reduced by two novel efficient approximations to this gradient. While being asymptotically exact, KMC mimics HMC in terms of sampling efficiency, and offers substantial mixing improvements over state-of-the-art gradient free samplers. We support our claims with experimental studies on both toy and real-world applications, including Approximate Bayesian Computation and exact-approximate MCMC. Motivation: HMC with intractable gradients. Many recent applications of MCMC focus on models where the target density function is intractable. A very simple example is the context of Pseudo-Marginal MCMC (PM-MCMC), for example in Maurizio’s paper on Bayesian Gaussian Processes classification. In such (simple) models the marginal likelihood$p(\mathbf{y}|\theta )$is unavailable in closed form, but only can be estimated. Performing Metropolis-Hastings style MCMC on$\hat{p}(\theta |\mathbf{y})$results in a Markov chain that (remarkably) converges to the true posterior. So far so good. But no gradients. Sometimes, people argue that for simple objects as latent Gaussian models, it is possible to side-step the intractable gradients by running an MCMC chain on the joint space$(\mathbf{f},\theta)$of latent variables and hyper-parameters, which makes gradients available (and also comes with a set of other problems such as high correlations between$\mathbf{f}$and$\theta, etc). While I don’t want to get into this here (we doubt existence of a one-fits-all solution), there is yet another case where gradients are unavailable. Approximate Bayesian Computation is based on the context where the likelihood itself is a black-box, i.e. it can only be simulated from. Imagine a physicist coming to you with three decades of intuition in the form of some Fortran code, which contains bits implemented by people who are not alive anymore — and he wants to do Bayesian inference on the code-parameters via MCMC… Here we have to give up on getting the exact answer, but rather simulate from a biased posterior. And of course, no gradients, no joint distribution. State-of-the-art methods on such targets are based on adaptive random walks, as no gradient information is available per-se. The Kameleon (KAMH) improves over other adaptive MCMC methods by constructing locally aligned proposal covariances. Wouldn’t it be cooler to harness the power of HMC? Kamiltonian Monte Carlo starts as a Random Walk Metropolis (RWM) and then smoothly transitions into HMC. It explores the target at least as fast as RWM (we proof that), but improves the mixing in areas where it has been before. We do this by learning the target gradient structure from the MCMC trajectory in an adaptive MCMC framework — using kernels. Every MCMC iteration, we update our gradient estimator with a decaying probabilitya_t$that ensures that we never stop updating, but update less and less, i.e. $$\sum_{t=1}^\infty \frac{1}{a_t}=\infty\qquad\text{and}\qquad\sum_{t=1} ^\infty \frac{1}{a_t^2}=0.$$ Christian Robert challenged our approach: using non-parametric density estimates for MCMC proposals directly is a bad idea: if certain parts of the space are not explored before adaptation effectively stopped, the sampler will almost never move there. For KMC (and for KAMH too) however, this is not a problem: rather than using density estimators as proposal directly, we use them for proposal construction. This way, these algorithms inherit ergodicity properties from random walks. I coded an example-demo here. Kernel exponential families as gradient surrogates. The density (gradient) estimator itself is an infinite dimensional exponential family model. More precisely, we model the un-normalised target log-density$\pi$as an RKHS function$f\in\mathcal{H}$, i.e. $$\text{const}\times\pi(x)\approx\exp\left(\langle f,k(x,\cdot)\rangle_{{\cal H}}-A(f)\right),$$ which in particular implies$\nabla f\approx\nabla\log\pi.$Surprisingly, and despite a complicated normaliser$A(f)$, such a model can be consistently fitted by directly minimising the expected$L^2$distance of model and true gradient, $$J(f)=\frac{1}{2}\int\pi(x)\left\Vert \nabla f(x)-\nabla\log \pi (x)\right\Vert _{2}^{2}dx.$$ The magic word here is score-matching. You could ask: “Why not use kernel density estimation?” The answer: “Because it breaks in more than a few dimensions.” In contrast, we are actually able to make the above gradient estimator work in ~100 dimensions on Laptops. Two approximations. The über-cool infinite exponential family estimator, like all kernel methods, doesn’t scale nicely in the number$n$of data used for estimation — and here neither in the in input space dimension$d$. Matrix inversion with costs$\mathcal{O}(t^3d^3)$becomes a blocker, in particular as$t$here is the growing number of MCMC iterations. KMC comes in two variants, which correspond to different approximations to the original kernel exponential family model. 1. KMC lite expresses the log-density in a smaller dimensional (yet growing) sub-space, via collapsing all$d$input space dimensions into one. It takes the dual form $$f(x)=\sum_{i=1}^{n}\alpha_{i}k(x_{i},x),$$ where the$x_i$are$n$random sub-samples (just like KAMH) from the Markov chain history. Downside: KMC lite has to be re-estimated whenever the$x_i$change. Advantage: The estimator’s tails vanish outside the$x_i$, i.e.$\lim_{\|x\|_{2}\to\infty}\|\nabla f(x)\|_{2}=0$, which translates into a geometric ergodicity result as we will see later. 2. KMC finite approximates the model as a finite dimensional exponential family in primal form, $$f(x)=\langle\theta,\phi_{x}\rangle_{{\cal H}_{m}}=\theta^{\top}\phi_{x},$$ where$x\in\mathbb{R}^{d}$is embedded into a finite dimensional feature space$\phi_{x}\in{\cal H}_{m}=\mathbb{R}^{m}.$While other choices are possible, we use the Randon Kitchen Sink framework: a$m$-dimensional data independent random Fourier basis. Advantage: KMC lite is an efficient on-line estimator that can be updated at constant costs in$t$— we can fit it on all of the MCMC trajectory. Disadvantage: Its do not decay and our proof for geometric ergodicity of KMC lite does not apply. Increasing dimensions. So far, we did not work out how the approximation errors propagate through the kernel exponential family estimator, but we plan to do that at some point. The paper contains an empirical study which shows that the gradients are good enough for HMC up to ~100 dimensions — Under a “Gaussian like smoothness” assumption. The below plots show the acceptance probability along KMC trajectories and quantify how “close” KMC proposals are to HMC proposals. From RWM to HMC. Using the well known and (ab)used Banana density as a target, we feed a non-adaptive version of KMC and friends with an increasing number of so-called “oracle” samples (iid from the target), and then quantify how well they mix. While this scenario is totally straw-man, it allows to compare the mixing behaviour of the algorithms after a long burn-in. The below plots show KMC transitioning from a random walk into something close to HMC as the number of “oracle” samples (x-axis) increases.ABC — Reduced simulations and no additional bias. While there is another example in the paper, I want to show the ABC one here, which I find most interesting. Recall in ABC, the likelihood is not available. People often use synthetic likelihoods that are for example Gaussian, which induces an additional bias (in addition to ABC itself) but might improve statistical efficiency. In an algorithm called Hamiltonian ABC, such a synthetic likelihood is combined with stochastic gradient HMC (SG-HMC) via randomized finite differences, called simultaneous perturbation stochastic approximation (SPSA), which works as follows. To evaluate the gradient at a position$\theta$in sampling space: 1. Generate a random SPSA mask$\Delta$(set of binary directions) and compute the perturbed$\theta+\Delta$and$\theta+\Delta$. 2. Interpolate linearly between them the perturbations. That is simulate from the ABC likelihood at both points, construct the synthetic likelihood, and use their difference. 3. The gradient is a (step-size dependent!) approximation to the unknown gradient of the (biased) synthetic likelihood model. 4. Perform a single stochastic HMC leapfrog step (adding friction as described in the SG-HMC paper) 5. Iterate for$L$leapfrog iterations. What I find slightly irritating is that this algorithm needs to simulate from the ABC likelihood in every leapfrog iteration — and then discards the gained information after a leapfrog step has been taken. How many leapfrog steps are common in HMC? Radford Neal suggests to start with$L\in[100,1000]$. Quite a few ABC simulations come with this! But there are more issues: 1. SG-HMC mixing. I found that stochastic gradient HMC mixes very poorly when the gradient noise large. Reducing noise costs even more ABC simulations. Wrongly estimated noise (non-stationary !?!) induces bias due to the “always accept” mentality. Step-size decreasing to account for that further hurts mixing. 2. Bias. The synthetic likelihood can fail spectacularly, if the true likelihood is skewed for example. KMC does not suffer from either of those problems: It keeps on improving its mixing as it sees more samples, while only requiring a single ABC simulation at each MCMC iterations — rather than HABC’s$2L$(times noise-reduction-repetitions). It targets the true (modulo ABC bias) posterior, while accumulating information of the target (and its gradients) in the form of the MCMC trajectory. But how do you choose the kernel parameter? Often, this question is a threat for any kernel-based algorithm. For example, for the KAMH algorithm, it is completely unclear (to us!) how select these parameters in a principled way. For KMC, however, we can simply use cross-validation on the score matching objective function above. In practice, we use a black box optimisation procedure (for example CMA or Bayesian optimisation) to on-line update the kernel hyper-parameters every few MCMC iterations. See an example where our Python code does this here. Just like the updates to the gradient estimator itself, this can be done with a decaying probability to ensure asymptotic correctness. ## March 30, 2015 03:24 PM ### mloss.org #### MLOSS workshop at ICML 2015: Open Ecosystems MLOSS workshops are returning to ICML this summer! Key dates: • Submission DL 28 April 2015 • Workshop date 10 July 2015 We (Gaël Varoquaux, Cheng Soon Ong and Antti Honkela) are organising another MLOSS workshop at ICML 2015 in Lille, France this July. The theme for this edition is "Open Ecosystems" whereby we wish to invoke discussion on benefits (or drawbacks?) of multiple tools in the same ecosystem. Our invited speakers (John Myles White and Matthew Rocklin) will share some of their experiences on Julia and Python, and we would be happy to hear from others either on the same or different ecosystems through contributed talks. Usual demonstrations of new great software are naturally also welcome! In addition to the talks, we have planned two more active sessions: • an open discussion with themes voted by workshop participants similar to MLOSS 2013; and • a hackathon for planning and starting to develop infrastructure for measuring software impact. If you have any comments or suggestions regarding these, please add a comment here or email the organisers! More details at the workshop website at http://mloss.org/workshop/icml15/ ## March 17, 2015 10:18 AM ### Heiko Strathmann #### Unbiased Big Bayes without sub-sampling bias: Paths of Partial Posteriors Together with Dino Sejdinovic and Mark Girolami, I recently arxived our second draft of a paper on the popular topic of how to scaleup Bayesian inference when facing large datasets. This project is yet another method to scale up Bayesian inference via sub-sampling. But unlike most other work on this field, we do not use sub-sampling within an MCMC algorithm to simulate from the posterior distribution of interest. Rather, we use sub-sampling to create a series of easier inference problems, whose results we then combine to get the answer to the question we were originally asking — in an estimation sense, without full posterior simulation. I’ll describe the high-level ideas now. Updates on the project: Let’s assume the following desiderata for large-scale Bayesian estimators: • Average computational costs sub-linear in the data size$N$. • Bounded variance (even for growing$N$) and can be controlled. • No sub-sampling bias is introduced. The method inherits the usual finite time bias from MCMC though. • Not limited to certain models or inference schemes. • Trade off redundancy in the data against computational savings. • Perform competitive in practice. Bayesian inference usually involves integrating a functional$\varphi:\Theta\rightarrow\mathbb{R}$over a posterior distribution $$\phi=\int d\theta\varphi(\theta)\underbrace{p(\theta|{\cal D})}_{\text{posterior}},$$ where $$p(\theta|{\cal D})\propto\underbrace{p({\cal D}|\theta)}_{\text{likelihood data}}\underbrace{p(\theta)}_{\text{prior}}.$$ For example, to compute the predictive posterior mean for a linear regression model. This is often done by constructing an MCMC algorithm to simulate$m$approximately iid samples from approximately$\theta^{(j)}\sim p(\theta|{\cal D})$(the second “approximately” here refers to the fact that MCMC is biased for any finite length of the chain, and the same is true for our method) and then performing Monte Carlo integration $$\phi\approx\frac{1}{m}\sum_{j=1}^{m}\varphi(\theta^{(j)}).$$ Constructing such Markov chains can quickly become infeasible as usually, all data needs to be touched in every iteration. Take for example the standard Metropolis-Hastings transition kernel to simulate from$\pi(\theta)\propto p(\theta|{\cal D})$, where at a current point$\theta^{(j)}$, a proposed new point$\theta’\sim q(\theta|\theta^{(j)})$is accepted with probability $$\min\left(\frac{\pi(\theta’)}{\pi(\theta^{(j)})}\times\frac{q(\theta^{(j)}|\theta’)}{q(\theta’|\theta^{(j)})},1\right).$$ Evaluating$\pi$requires to iterate through the whole dataset. A natural question therefore is: is it possible to only use subsets of$\mathcal{D}$? Existing work. There has been a number of papers that tried to use sub-sampling within the transition kernel of MCMC: All these methods have in common that they either introduce bias (in addition to the bias caused by MCMC), have no convergence guarantees, or mix badly. This comes from the fact that it is extremely difficult to modify the Markov transition kernel and maintain its asymptotic correctness. In our paper, we therefore take a different perspective. If the goal is to solve an estimation problem as the one above, we should not make our life hard trying to simulate. We construct a method that directly estimates posterior expectations — without simulating from the full posterior, and without introducing sub-sampling bias. The idea is very simple: 1. Construct a series of partial posterior distributions$\tilde{\pi}_{l}:=p({\theta|\cal D}_{l})\propto p({\cal D}_{l}|\theta)p(\theta)$from subsets of sizes $$\vert\mathcal{D}_1\vert=a,\vert\mathcal{D}_2\vert=2^{1}a,\vert\mathcal{D}_3\vert=2^{2}a,\dots,\vert\mathcal{D}_L\vert=2^{L-1}a$$ of the data. This gives $$p(\theta)=\tilde{\pi}_{0}\rightarrow\tilde{\pi}_{1}\rightarrow\tilde{\pi}_{2}\rightarrow\dots\rightarrow\tilde{\pi}_{L}=\pi_{N}=p({\theta|\cal D}).$$ 2. Compute posterior expectations$\phi_{t}:=\hat{\mathbb{E}}_{\tilde{\pi}_{t}}\{\varphi(\theta)\}$for each of the partial posteriors. This can be done for example MCMC, or other inference methods. This gives a sequence of real valued partial posterior expectations — that converges to the true posterior. 3. Use the debiasing framework, by Glynn & Rhee. This is a way to unbiasedly estimate the limit of a sequence without evaluating all elements. Instead, one randomly truncates the sequence at term$T$(which should be “early” in some sense,$T$is a random variable), and then computes $$\phi^*=\sum_{t=1}^T \frac{\phi_{t}-\phi_{t-1}}{\mathbb{P}\left[T\geq t\right]}.$$ This is now an unbiased estimator for the full posterior expectation. The intuition is that one can rewrite any sequence limit as a telescopic sum $$\lim_{t\to\infty} \phi_t=\sum_{t=1}^\infty (\phi_{t}-\phi_{t-1})$$ and then randomly truncate the infinite sum and divide through truncation probabilities to leave the expectation untouched, i.e. $$\mathbb{E}[\phi^*]=\phi,$$where$\phi$is the sequence limit. 4. Repeat$R$times for reducing the estimator variance as$1/R$. In the paper, we choose the random truncation time as $$\mathbb{P}(T=t)\propto 2^{-\alpha t}.$$ From here, given a certain choice of$\alpha$, it is straight-forward to show the average computational cost of this estimator is$\mathcal{O} (N^{1-\alpha}),$which is sub-linear in$N$. We also show that variance is finite and can be bounded even when$N\to\infty$. This statement is modulo certain conditions on the$\phi_t$. In addition, the choice of the$\alpha$-parameter of the random truncation time$T$is crucial for performance. See the paper for details how to set it. Starting from$p(mu)$, we take steps on a single posterior path by subsequently doubling the data-size, eventually reaching the full posterior, which is proportional to$p(mathbf{mu})prod_{i=1}^{100}p(mathbf{x}_i|mathbf{mu})$. Shown are 2D contour plots of the partial posterior densities. " data-medium-file="http://i1.wp.com/herrstrathmann.de/wp-content/uploads/2015/03/concentration_gaussian_2d_website.jpg?fit=300%2C223" data-large-file="http://i1.wp.com/herrstrathmann.de/wp-content/uploads/2015/03/concentration_gaussian_2d_website.jpg?fit=656%2C488" /> Illustration of debiasing for the posterior mean of a 2D Gaussian with unknown mean$\mu$and fixed covariance$\Sigma$. Data is$\mathcal{D}=\{\mathbf{x}_i\sim\mathcal{N}(\mathbf{x}_i|\mu=\mathbf{2},\Sigma)\}_{i=1}^{100}$with$\Sigma=[(-1,3)^\top,(3,1)^\top]$, prior is$p(\mathbf{\mu})=\mathcal{N}(\mathbf{\mu}|\mathbf{0}, I)$. We aim to compute the posterior mean$\int \mu p(\mu|\mathcal{D})d\mu$. Debiasing computes multiple posterior paths (coloured solid lines), which are randomly truncated (solid line stops), and then plugged into the debiasing estimator to estimate the posterior mean of$\mu_1$and$\mu_2$(coloured round dots, dotted lines connect path end-point to estimate). The procedure is averaged$R=1000$times (gray dots), after which the empirical mean matches the full posterior mean. A kernel density estimate of the gray dots is shown in the background. " data-medium-file="http://i2.wp.com/herrstrathmann.de/wp-content/uploads/2015/03/debiasing_illustration_website.jpg?fit=235%2C300" data-large-file="http://i2.wp.com/herrstrathmann.de/wp-content/uploads/2015/03/debiasing_illustration_website.jpg?fit=656%2C839" /> Experimental results are very encouraging. On certain models, we are able to estimate model parameters accurately before plain MCMC even has finished a single iteration. Take for example a log-normal model $$p(\mu,\sigma|{\cal D})\propto\prod_{i=1}^{N}\log{\cal N}(x_{i}|\mu,\sigma^{2})\times\text{flat prior}$$ and functional $$\varphi((\mu,\sigma))=\sigma.$$ We chose the true parameter$\sigma=\sqrt{2}$and ran our method in$N=2^{26}$data. The pictures below contain some more details. Debiasing estimates converge to true posterior statistic$sigma=sqrt{2}$quickly. " data-medium-file="http://i1.wp.com/herrstrathmann.de/wp-content/uploads/2015/03/debiasing_PartialPosteriorStatisticStan_log_Gaussian_website.jpg?fit=300%2C150" data-large-file="http://i1.wp.com/herrstrathmann.de/wp-content/uploads/2015/03/debiasing_PartialPosteriorStatisticStan_log_Gaussian_website.jpg?fit=656%2C329" /> Remarkably, the largest partial posterior size was only$max_{r} left{n_{T_r}right}=2048$, leading to a maximum single replication cost of$sum_{t=1}^{max_r {T_r}}n_t= 4088$, and a median$n_{T_r}$of only$16$. After$R=300$replications, the number of data touched in total is$sum_{r=1}^{R} sum_{t=1}^{T_r} n_t=27,264$. Taking into account the$600$MCMC iterations to estimate each partial posterior expectation, this sums up to$16,358,400$likelihood evaluations, which is less than a quarter of a single full MCMC burn-in iteration ($N=2^{26}$), and less than$1/(4cdot 500)approx 0.0005$times the number of likelihood evaluations required to complete the burn-in of full MCMC. " data-medium-file="http://i1.wp.com/herrstrathmann.de/wp-content/uploads/2015/03/debiasing_PartialPosteriorStatisticStan_log_Gaussian_used_data_website.jpg?fit=300%2C112" data-large-file="http://i1.wp.com/herrstrathmann.de/wp-content/uploads/2015/03/debiasing_PartialPosteriorStatisticStan_log_Gaussian_used_data_website.jpg?fit=656%2C245" /> Convergence of partial posteriors’ mean of$sigma$as a function of sub-sample size for Bardenet’s log-Gaussian model. We randomly sub-sample$n_t$data from the dataset and run MCMC to estimate$sigma$, whose true value is$sigma=sqrt{2}$. Error bars are 95% over 150 trials. " data-medium-file="http://i1.wp.com/herrstrathmann.de/wp-content/uploads/2015/03/convergence_stan_LogGaussian_website.jpg?fit=300%2C247" data-large-file="http://i1.wp.com/herrstrathmann.de/wp-content/uploads/2015/03/convergence_stan_LogGaussian_website.jpg?fit=656%2C539" /> Comparison to stochastic variational inference is something we only thought of quite late in the project. As we state in the paper, as mentioned in my talks, and as pointed out in feedback from the community, we cannot do uniformly better than MCMC. Therefore, comparing to full MCMC is tricky — even though there exist trade-offs of data redundancy and model complexity that we can exploit as showed in the above example. Comparing to other “limited data” schemes really showcases the strength (and generality) of our method. We chose to compare against “Gaussian Processes for Big Data“, which puts together a decade of engineering of variational lower bounds, the inducing features framework for GPs, and stochastic variational inference to solve Gaussian Process regression on roughly a million data points. In contrast, our scheme is very general and not specifically designed for GPs. However, we are able to perform similar to this state of the art method. On predicting airtime delays, we reach a slightly lower RMSE at comparable computing effort. The paper contains details on our GP, which is slightly different (but comparable) to the one used in the GP for Big Data paper. Another neat detail of the GP example is that the likelihood does not factorise, which is in contrast to most other Bayesian sub-sampling schemes. Debiasing for the airtime delays dataset. With an order of magnitude less average computational cost than GP-SVI, we are able to reproduce a comparable RMSE on$10^5$randomly chosen test covariates. In particular, as in the previous GP example, 95$%$error bars are computed over 20 repetitions. " data-medium-file="http://i0.wp.com/herrstrathmann.de/wp-content/uploads/2015/03/debiasing_gp_airlines_website.jpg?fit=300%2C201" data-large-file="http://i0.wp.com/herrstrathmann.de/wp-content/uploads/2015/03/debiasing_gp_airlines_website.jpg?fit=656%2C440" /> Convergence of RMSE as a function of mini-batch size using random Fourier features. " data-medium-file="http://i2.wp.com/herrstrathmann.de/wp-content/uploads/2015/03/convergence_gp_airlines_website.jpg?fit=300%2C240" data-large-file="http://i2.wp.com/herrstrathmann.de/wp-content/uploads/2015/03/convergence_gp_airlines_website.jpg?fit=656%2C525" /> Results reported in “Gaussian Processes for Big Data” " data-medium-file="http://i0.wp.com/herrstrathmann.de/wp-content/uploads/2015/03/gp_svi.png?fit=300%2C300" data-large-file="http://i0.wp.com/herrstrathmann.de/wp-content/uploads/2015/03/gp_svi.png?fit=483%2C482" /> In summary, we arrive at • Computational costs sub-linear in$N$• No introduced sub-sampling bias (in addition to MCMC, if used) • Finite, and controllable variance • A general scheme that is not limited to MCMC nor to factorising likelihoods • A scheme that trades off redundancy for computational savings • A practical scheme that performs competitive to state-of-the-art methods and that has no problems with transition kernel design. • See the paper for a list of weaknesses and open problems. ## January 30, 2015 07:30 PM ### Sergey Lisitsyn #### Prepositions and Arguments I had some crazy idea how to make C++ code more readable using prepositions. It would take some discipline to use and probably serious industrial people would consider me mad. ## January 28, 2015 10:02 PM ### Kevin Hughes #### Mailcheck.js in Production Mailcheck.js in Production I was recently tasked with adding Mailcheck.js to some of our production pages and I want to describe a bit of the process I went through because I did some things a bit differently and had some fun along the way. Lets start with a PSA – do not simply drop Mailcheck onto your website as is! In my opinion / findings the default algorithm is way too greedy – aka it will mostly suggest all emails should be ____@gmail.com. It is worth taking the time to tweak mailcheck for your particular userbase, on one wants to see a correction for their proper email address! The first thing I did was dumped a ton of emails from our database to create a dataset to work with. I could have used Node to write some scripts to test out the Mailcheck behaviour but Python is just so much more convient for doing numerical analysis. Plus its what our data team uses so I could leverage some of their knowledge and code. So now for the fun part – I ended up using PyV8 (a python wrapper for calling out to Google’s V8 javascript engine). With this setup I was able to slice and dice through our production emails using python and pandas calling the exact javascript mailcheck algorithm and collecting my results. After tweaking the algorithm I could take the settings and new js code and put it in production. Check out this wacky franken script that got the job done (pandas not included): import PyV8 def init_mailcheck(): global ctxt ctxt = PyV8.JSContext() ctxt.enter() ctxt.eval(open("mailcheck.js").read()) def run_sift3Distance(s1,s2): script = "Mailcheck.mailcheck.sift3Distance('%s','%s')" %(s1,s2) return ctxt.eval(script) def run_splitEmail(email): script = "Mailcheck.mailcheck.splitEmail('%s')" %(email) return ctxt.eval(script) def run_mailcheck(email): script = """ Mailcheck.mailcheck.run({ email: "%s", }) """ % (email) result = ctxt.eval(script) if result: try: result = result.address + '@' + result.domain except(AttributeError): pass return result if __name__=="__main__": init_mailcheck() print run_mailcheck("kevinhughes27@gmil.com") # >>> @kevinhughes27@gmail.com  ## January 27, 2015 12:44 AM ### Heiko Strathmann #### Shogun 4.0 and GSoC 2014 follow up No, this is not about Fernando’s and mine honeymoon … The Shogun team just released version 4.0 of their community driven Machine Learning toolbox. This release most of all features the work of our 8 Google Summer of Code 2014 students, so this blog post is dedicated to them — you guys rock. This also brings an end to yet another very active year of Shogun: we organised a second workshop in Berlin, and I presented Shogun in to the public in London, New York and Berlin. For the 4th time, Shogun participated in Google’s wonderful program which more than anything boosted the team’s size and motivation. What else makes people spend sleepless nights hunting bugs for the sake of Machine Learning for everyone? This year was the first time that I organised our participation. This ranged from writing the application last second, harrassing potential mentors until they say ‘yes’, to making up overly ambitious projects to scare students away, and ending up mentoring too many students on my own. Jokes aside, this was a very challenging (in particular time-wise) but also a very rewarding experience that definitely sharpened my project organisation skills. As in the previous year, I tried to fuse my scientific life and Shogun’s GSoC participation — kernel methods and variational learning is something I touch on a daily basis at Gatsby. Many mentors were approached after having met them at scientific Machine Learning conferences, and being exposed to ML on for some years now, it is also easier to help students implement and write about popular ML algorithms. Here is a list. (Note that all projects come with really nice IPython notebooks — something that we continued to insist on from last year.) Fundamental ML algorithms by Parijat Mazumdar (parijat). Mentor: Fernando Shogun needs more standard ML algorithms. Parijat implemented some of those: random forests, kernel density estimation and more. Parijat’s code quality is amazing and together with Fernando’s superb mentoring skills (his first time mentoring), this project is likely to have been very sustainable. Notebook random forest, notebook KDE. Kernel testing and feature selection by Rahul De (lambday). Mentor: Dino Sejdinovic, Heiko Previous year’s student lambday continued to rock. First, he massively extended my 2012 project on kernel hypothesis testing to Big Data land. Dino, who was one of the invited speakers in the Shogun workshop last summer, and I are actually working on a journal article where we will use this code. Second, he extended the framework to perform feature selection via dependence measures. Third, he initiated and guided development of a framework for unifying Shogun’s linear algebra operations. This for example can be used to change existing algorithms from CPU to GPU with a compile switch — useful also for our deep learning project. Variational Inference for Gaussian Processes by Wu Lin (yorkerlin). Mentor:Heiko, Emtiyaz Khan In our third GSoC project on GPs, Wu took a couple of state-of-the-art approximate variational inference methods developed by Emiyaz, and put them into Shogun’s framework. The result of this very involved and technical project is that we now have large-scale classification using GPs. Emtiyaz also was a speaker at ourworkshop. Notebook Shogun missionary by Saurabh. Mentor:Heiko The idea of this project was to showcase Shogun’s abilities — sometimes we definitely need to work on. Saurabh wrote a couple of Notebooks that are essentially ML tutorials using Shogun. If you want to know about ML basics, regression, classification, model-selection, SVMs, multiclass, multiple kernel learning this was for you. He also extended our web-demo framework to for example include model-selection for GPs. OpenCV integration by Kislay. Mentor: Kevin Kislay, after writing a very cool notebook on PCA for his application, wrote data-structures to bridge between Shogun and OpenCV. The project was supervised by Kevin, who is also one of our former GSoC students This makes it possible to use the too libraries together in a neat way. Deep learning by Khaled Nasr. Mentor: Theofanis, Sergey The hype is on! After NIPS, Facebook, GoogleDeepmind, Shogun now also joined Khaled did a very good job in coding up the standard ones, and was involved in generalising Shogun’s linear algebra on the fly with lambday. This is a project that is likely to have a second part. Check his superb notebooks on deep belief neural networks, convolutional networks,autoencoders, restricted Boltzman machines SO Learning with Approximate Inference by Jiaolong. Mentor: hushell, Thoralf This was another project that was (co-)mentored by a former GSoC student. With the help his mentors, Jialong implemented various approximate inference methods for structured output (SO) models. Check out his notebook. Large-Scale Multi-Label Classification by Abinash. Mentor: Thoralf Another project involving our structured output expert Thoralf as mentor. Abinash implemented large-scale multilabel learning — beating scikit learn‘s implementation both in runtime and accuracy. The last experiment is described in this notebook. Finally, we sent two of our delegates (Thoralf and Fernando) to the 10th year jubilee mentor summit in California in late October. Really cool: I got lucky and won Google’s lottery on some extra places, so I could also join. The summit once again was overly colourful, bursting with creative minds who have the most diverse set of opinions and approaches, but who are all united by their excitement about open-source. The beauty of this community to me really lies in the people who do work purely driven by their interest on *the thing itself*, independent of competitive and in particular commercial interests — sometimes almost to an extend that is beyond any form of compromise. A wonderful illustration of this was when at the mentor summit, during the reception in the Tech museum in San Jose: Google’s speaker and head of finance Patrick Pichette (disclaimer: not sure, don’t quote me on this) who is the boss of Chris DiBona, who himself organises the GSoC, searched to inspire the audience to “think BIG” and to “change the lifes of GSoC students”. Guest speaker Linus Torvalds 10 minutes later then contemplates that he could not be a GSoC mentor as he would scare people away and that the best way to get involved in open-source is to “start small” — a sentence after which P.P. left the room. Funny enough: in GSoC, this community is then hugged by a super capitalistic American internet company — and gladly lets it happen: we all love GSoC and Shogun certainly would not be where it is without it. I also want to mention the day Google rented a whole theme park for us nerds — which made Fernando try a roller-coaster for the first time after being pushed by MLPack maintainer Ryan and myself. After being horrified at first, he even started to talk about C++ the second or third time. As you would expect from attending such geeky meetings, Thoralf, Fernado, and I also spent quite some time hacking Shogun, discussing ideas until late night (of course getting emotional about them ). I managed to take a picture of Fernando falling asleep while hacking Shogun’s modular interfaces. Some of those ideas are collected on our wiki. • Improve usability • Making Shoun more modular and slim • Improving Shogun’s efficiency Some of those ideas are also part of our theme for our GSoC 2015 application and our planned Hackathon. We have come to a point where we seriously need to focus on application and stability rather than adding more and more cutting-edge algorithms — Shogun’s almost 15 year old framework needs a face lift. GSoC students will see that this years project ideas will focus on cleaning up the toolbox and implement ML applications. Meet the Shogun/MLPack crew, as nerdy as it gets ## January 25, 2015 09:56 PM ### Kevin Hughes #### pyIPCA a python library for Incremental PCA I extracted some of the useful code and nifty examples from the background of my Thesis as a python library for your enjoyment. PCA or Principal Component Analysis is a pretty common data analysis technique, incremental PCA lets you perform the same type of analysis but uses the input data one sample at a time rather than all at once. The code fully conforms to the scikit-learn api and you should be able to easily use it anywhere you are currently using one of the sklearn.decomposition classes. In fact this library is sort of on the waiting list for sklearn: https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets IPCA on 2D point cloud shaped like an ellipse Check it out if you’re interested and holla at sklearn if you want this feature! https://github.com/kevinhughes27/pyIPCA ## January 25, 2015 09:55 PM #### Gmail_ToDo!! Do you use your inbox as a running ToDo list? Do you send yourself emails so you remember to do certain things? Do you spend most of your day in a terminal? If you answered yes to all 3 of those questions then you might be interested in this nifty little ruby gem I just released. Its called gmail_todo and its made for quickly emailing yourself a ToDo note from the command line, think of the precious seconds you’ll save! Now when you remember something you need to do and you are in a terminal rather than alt-tabbing (or god forbid reaching for the mouse) you can quickly type ‘todo “get milk”‘ and voila an email will appear in your inbox. As an added bonus my gem prepends “[ToDo]” to your subject for easy filtering! Check it out on Github and RubyGems https://github.com/kevinhughes27/gmail_todo http://rubygems.org/gems/gmail_todo ## January 25, 2015 09:55 PM #### Tax Receipts for Shopify! A few days ago I launched a free app for Shopify that automates the process of sending your customers tax receipts for their donations to your non-profit Shopify site. Its a pretty cool little app – I think it is a really good example of how to build a simple piece of automation using webhooks. I fully open sourced the code here: https://github.com/kevinhughes27/shopify_tax_receipts so have a look! ## December 06, 2014 06:47 PM ### Kevin Hughes #### Apipie example recording with Minitest I’ve been using the excellent apipie gem on a new project I’m working on and I wanted to share my solution for example recording with Minitest. The gem documentation has an example of how to specify which functional tests should be shown as an Api example if you happen to be using Rspec but nothing for Minitest – what to do? If you read the source of the example recorder it’s actually pretty smart – it loads the apipie_examples.json file of the previously recorded examples and only adds new ones or updates existing examples. With this knowledge I came up with an approach – I added a special string #apidoc to the test name of the tests I want included in my Api docs. Then I made this Rake task to run only my Api controller tests and filter which tests using the string: task :api_examples do Dir["#{Rails.root}/test/controllers/api/*controller_test.rb"].each do |controller| pipe = IO.popen("APIPIE_RECORD=examples bundle exec ruby -Itest #{controller} -n /#apidoc/") while (line = pipe.gets); print line; end end end  Also make sure you have config.show_all_examples = 1 in your apipie initializer file. That’s how I did and its been working nicely for me, let me know in the comments if you have a better solution! ## November 20, 2014 02:37 AM ### Kevin Hughes #### Some fun with D3.js I’ve been intrigued by D3.js and the impressive visualizations I’ve seen people make for quite some time and I finally got to check it out for myself. I wanted to write this blog post sharing my experiences and revelations from the perspective of someone who at one point spent a fair amount of time with Matlab/Octave and my personal fave matplotlib. The first major point I want to make is that D3.js is not a plotting library – it should not be compared directly to Matlab/Octave or matplotlib because they serve different purposes. D3.js is a visualization library, sure this visualization may be a graph but you shouldn’t be playing with D3.js until you already understand your data and simply want to show it off. While exploring and figuring out your data matplotlib is going to be way faster and more effective (note there are some wrappers building standard plotting functions ontop of D3.js like NVD3 if javascript is really your thing). Once you’ve figured out your data and its time to show-off D3.js really shines and here is the main reason why in my opinion. Its due to a fundamental shift in a way of thinking that makes D3.js so powerful: rather than calling plot with an array of x and y values I am binding an array of javascript objects to my plot and then telling it which attributes to use for different things. Essentially you are pulling more data into the plot and you can write code at each level – let me show you the snippet that made me realize how powerful this shift in thinking is: var chart = d3.select(&quot;.chart&quot;); var tip = d3.tip() .attr('class', 'd3-tip') .offset([-10, 0]) .html(function(d) { return &quot;&lt;span&gt;&quot; + d.text + &quot;&lt;/span&gt;&quot;; }); chart.call(tip); var bar = chart.selectAll(&quot;g&quot;) .data(data) .enter().append(&quot;g&quot;) .attr(&quot;transform&quot;, function(d, i) { return &quot;translate(&quot; + i * barWidth + &quot;,0)&quot;; }); bar.append(&quot;rect&quot;) .attr(&quot;y&quot;, function(d) { return y(d.value); }) .attr(&quot;height&quot;, function(d) { return height - y(d.value); }) .attr(&quot;width&quot;, barWidth - 1) .on('mouseover', tip.show) .on('mouseout', tip.hide);  The important parts to note are this: I grab the value for my y axis of the rect from d.value where d is one of the objects in my data array that I am plotting. So far this is pretty standard but look at the tip function that gets called on mouseover – it shows a span which contains d.text! I’ve attached another piece of data here called text that can be used to display more information! This was the wow moment for me because this kind of thing isn’t possible in matplotlib and showing a simple tooltip is really on the beginning of what you can do with all this added context. Also something about plots with a :hover effect is kind of awesome! If you haven’t played with D3 yet and have some cool data I definetly recommend it, I used it to make some pretty sweet visualizations/tools for our Parity Ultimate Frisbee League here in Ottawa, check’em out: ## October 30, 2014 11:07 AM ### mloss.org #### A third of the top 100 papers are about software How many of the papers that are in the top 100 most cited about software? 21, with an additional 12 papers which are not specifically about software itself, but about methods or statistics that were implemented later in software. When you take a step back and think about the myriad areas of research and the stratospheric numbers of citations the top 100 get, it is quite remarkable that one fifth of the papers are actually about software. I mean really about software, not software as an afterthought. Some examples: To put in perspective how rarified the air is in the top 100 citations, the if we combined all citations received by all JMLR papers in the last five years (according to SCImago), this one gigantic paper would not even make it into the top 100. Yes, yes, citations do not directly measure the quality of the paper, and there are size of community effects and all that. To be frank, being highly cited seems to be mostly luck. In the spirit of open science, here is a bar plot showing these numbers, and here is my annotated table which I updated from the original table. For a more mainstream view of the data, look at the Nature article. ## September 07, 2014 01:16 PM ### Heiko Strathmann #### Shogun in NYC In late August, I was invited to NYC to present Shogun at an open-source Machine Learning software workshop (link), organised by John Langford. Seeing Shogun being recognised as a major player along with big ML/stats libraries like TheanoStanTorchLibLinearVowpalWabbit, etc really got my excited. I talked to most of the other project’s developers and a few very interesting possible collaborations came up. For example, Shogun’s unique way to automagically generate interfaces to most computing languages via swig. It has been in our pipeline of ideas for a long time to pull this functionality out of Shogun, and offer it to other projects in a modular way. Sergey has put together (link) a simple prototype here and will continue to work on this soon. Another thing is that we would like to integrate some (fixed) models from Stan into Shogun, for example to complement our collection of variational inference methods for Gaussian Processes with a full blown MCMC based approach. While talking to Gunnar Raetsch, we had the idea to host a hackfest where we bring together all Shogun core-developers for a week, working on more sophisticated projects that are not suitable for Google Summer of Code. A generic framework for parallelising/distributing algorithms in Shogun would be a first idea, extending ideas from lambday’s GSoC 2013 project. This would again also be useful for other ML libraries and I in fact talked to John Langford about using ideas from VowpalWabbit here. We made a list of ideas (link) for such a hack and are currently trying to get funding for it. The meeting was video-taped and I will put links for this soon. ## August 19, 2014 12:36 AM ### Heiko Strathmann #### Shogun workshop 2014 Another super nice event that happened in Berlin in July was the second Shogun workshop that me and the Shogun team organised. We had a hands-on session and a main workshop day full of talks, cool people, and lots of Machine Learning. It is a great pleasure working with the Shogun team, and I am very happy that we were able to push this year’s workshop through. I talked a bit about what Shogun is and what it tries to be, and gave some directions for future work. If you missed the event, we got full video coverage! Workshop program Videos ## August 18, 2014 11:29 PM ### Heiko Strathmann #### Shogun at EuroPython 2014 It’s already a while ago, in July, I presented Shogun at the EuroPython in Berlin. This was a super nice conference full of interesting people and projects. My talk was recorded and can be found [here], slides are [here]. Another super interesting project was presented by Thomas Wiecki: Probabilistic programming with PyMC3 [video]. This is a very interesting project, a bit similar to Stan, which allows to plug together probabilistic models and then do HMC for inference, powered by automatic differentiation (which I am currently super excited about). Unlike Stan, this happens all in Python which makes things super comfortable (example notebook). As I am working on kernel-based samplers (Kameleon MCMC & soon more), this might be a good vehicle to make them public, exploiting the nice auto-diff tool (Theano) that PyMC uses. ## July 28, 2014 12:39 PM ### mloss.org #### Open Machine Learning Workshop Just in case there are people who follow this blog but not John Langford's, there is going to be an open machine learning workshop on 22 August 2014 in MSR, New York, organised by Alekh Agarwal, Alina Beygelzimer, and John Langford. As it says on John's blog: If you are interested, please email msrnycrsvp at microsoft.com and say “I want to come” so we can get a count of attendees for refreshments. ## July 22, 2014 12:00 AM ### mloss.org #### Machine Learning Distro What would you include a linux distribution to customise it for machine learning researchers and developers? Which are the tools that would cover the needs of 90% of PhD students who aim to do a PhD related to machine learning? How would you customise a mainstream linux distribution to (by default) include packages that would allow the user to quickly be able to do machine learning on their laptop? There are several communities which have their own custom distribution: • Scientific Linux which is based on Red Hat Enterprise Linux is focused making it easy for system administrators of larger organisations. The two big users are FermiLab and CERN who each have their own custom "spin". Because of its experimental physics roots, it does not have a large collection of pre-installed scientific software, but makes it easy for users to install their own. • Bio-Linux is at the other end of the spectrum. Based on Ubuntu, it aims to provide a easy to use bioinformatics workstation by including more than 500 bioinformatics programs, including graphical menus to them and sample data for testing them. It is targeted at the end user, with simple instructions for running it Live from DVD or USB, to install it, and to dual boot it. • Fedora Scientific is the latest entrant, providing a nice list of numerical tools, visualisation packages and also LaTeX packages. Its documentation lists packages for C, C++, Octave, Python, R and Java. Version control is also not forgotten. A recent summary of Fedora Scientific was written as part of Open Source Week. It would seem that Fedora Scientific would satisfy the majority of machine learning researchers, since it provides packages for most things already. Some additional tools that may be useful include: • tools for managing experiments and collecting results, to make our papers replicable • GPU packages for CUDA and OpenCL • Something for managing papers for reading, similar to Mendeley • Something for keeping track of ideas and to do lists, similar to Evernote There's definitely tons of stuff that I've forgotten! Perhaps a good way to start is to have the list of package names useful for the machine learning researcher in some popular package managers such as yum, apt-get, dpkg. Please post your favourite packages in the comments. ## June 25, 2014 09:02 PM ### Heiko Strathmann #### ICML paper: Kernel Adaptive Metropolis Hastings$\DeclareMathOperator*{\argmin}{arg\,min}\def\Mz{M_{\mathbf{z},y}}In this year’s ICML in Bejing, Arthur Gretton, presented our paper on Kernel Adaptive Metropolis Hastings [link, poster]. His talk slides are based on Dino Sejdinovic ones [link]. Pitty that I did not have travel funding. The Kameleon is furious! The idea of the paper is quite neat: It is basically doing Kernel-PCA to adapt a random walk Metropolis-Hastings sampler’s proposal based on the Markov chain’s history. Or, more fancy: Performing a random walk on an empirically learned non-linear manifold induced by an empirical Gaussian measure in a Reproducing Kernel Hilbert Space (RKHS). Sounds scary, but it’s not: the proposal distribution ends up being a Gaussian aligned to the history of the Markov chain. We consider the problem of sampling from an intractable highly-nonlinear posterior distribution using MCMC. In such cases, the thing to do ™ usually is Hamiltonian Monte Carlo (HMC) (with all its super fancy extensions on Riemannian manifolds etc). The latter really is the only way to sample from complicated densities by using the geometry of the underlying space. However, in order to do so, one needs access to target likelihood and gradient. We argue that there exists a class of posterior distributions for that HMC is not available, while the distribution itself is still highly non-linear. You get them as easy as attempting binary classification with a Gaussian Process. Machine Learning doesn’t get more applied. Well, a bit: we sample the posterior distribution over the hyper-parameters using Pseudo-Marginal MCMC. The latter makes both likelihood and gradient intractable and the former in many cases has an interesting shape. See this plot which is the posterior of parameters of an Automatic Relevance Determination Kernel on the UCI Glass dataset. This is not an exotic problem at all, but it has all the properties we are after: non-linearity and higher order information unavailable. We cannot do HMC, so we are left with a random walk, which doesn’t work nicely here due to the non-linear shape. The idea of our algorithm is to do a random walk in an infinite dimensional space. At least almost. • We run some method to get a rough sketch of the distribution we are after. • We then embed these points into a RKHS; some infinite dimensional mean and covariance operators are going on here, but I will spare you the details here. • In the RKHS, the chain samples are Gaussian, which corresponds to a manifold in the input space, which in some sense aligns with what we have seen of the target density yet. • We draw a samplef$from that Gaussian – random walk in the RKHS • This sample is in an infinite dimensional space, we need to map it back to the original space, just like in Kernel PCA • It turns out that is hard (nice word for impossible). So let’s just take a gradient step along some cost function that minimises distances in the way we want: $\argmin_{x\in\mathcal{X}}\left\Vert k\left(\cdot,x\right)-f\right\Vert _{\mathcal{H}}^{2}$ • Cool thing: This happens in input space and gives us a point whose embedding is in some sense close to the RKHS sample. • Wait, everything is Gaussian! Let’s be mathy and integrate the RKHS sample (and the gradient step) out. Et voilà, we get a Gaussian proposal distribution in input space$q_{\mathbf{z}}(\cdot|x_{t})=\mathcal{N}(x_{t},\gamma^{2}I+\nu^{2}M_{\mathbf{z},x_{t}}HM_{\mathbf{z},x_{t}}^{\top}),$where$\Mz=2\eta\left[\nabla_{x}k(x,z_{1})|_{x=y},\ldots,\nabla_{x}k(x,z_{n})|_{x=y}\right]is based on kernel gradients (which are all easy to get). This proposal aligns with the target density. It is clear (to me) that using this for sampling is better than just an isotropic one. Here are some pictures of bananas and flowers. The paper puts this idea in the form of an adaptive MCMC algorithm, which learns the manifold structure on the fly. One needs to be careful about certain details, but that’s all in the paper. Compared to existing linear adaptive samplers, which adapt to the global covariance structure of the target, our version adapts to the local covariance structure. This can all also be mathematically formalised. For example, for the Gaussian kernel, the above proposal covariance becomes $\left[\text{cov}[q_{\mathbf{z}(\cdot|y)}]\right]_{ij} = \gamma^{2}\delta_{ij} + \frac{4\nu^{2}}{\sigma^{4}}\sum_{a=1}^{n}\left[k(y,z_{a})\right]^{2}(z_{a,i}-y_{i})(z_{a,j}-y_{j}) +\mathcal{O}(n^{-1}),$ where the previous pointsz_a$influence the covariance, weighted by their similarity$k(y,z_a)$to current point$y$. Pretty cool! We also have a few “we beat all competing methods” plots, but I find those retarded and will spare them here – though they are needed for publication Dino and me have also written a pretty Python implementation (link) of the above, where the GP Pseudo Marginal sampling is done with Shogun’s ability to importance sample marginal likelihoods of non-conjugate GP models (link). Pretty cool! ## June 03, 2014 12:01 AM ### mloss.org #### Google Summer of Code 2014 GSoC 2014 is between 19 May and 18 August this year. The students should now be just sinking their teeth into the code, and hopefully having a lot of fun while gaining invaluable experience. This amazing program is in its 10th year now, and it is worth repeating how it benefits everyone: • students - You learn how to write code in a team, and work on projects that are long term. Suddenly, all the software engineering lectures make sense! Having GSoC in your CV really differentiates you from all the other job candidates out there. Best of all, you actually have something to show your future employer that cannot be made up. • mentors - You get help for your favourite feature in a project that you care about. For many, it is a good introduction to project management and supervision. • organisation - You recruit new users and, if you are lucky, new core contributors. GSoC experience also tends to push projects to be more beginner friendly, and to make it easier for new developers to get involved. I was curious about how many machine learning projects were in GSoC this year and wrote a small ipython notebook to try to find out. Looking at the organisations with the most students, I noticed that the Technical University Vienna has come together and joined as a mentoring organisation. This is an interesting development, as it allows different smaller projects (the titles seem disparate) to come together and benefit from a more sustainable open source project. On to machine learning... Using a bunch of heuristics, I tried to identify machine learning projects from the organisation name and project titles. I found more than 20 projects with variations of "learn" in them. This obviously misses out projects from R some of which are clearly machine learning related, but I could not find a rule to capture them. I am pretty sure I am missing others too. I played around with some topic modelling, but this is hampered by the fact that I could not figure out a way to scrape the project descriptions from the dynamically generated list of project titles on the GSoC page. Please update the source with your suggestions! ## April 28, 2014 11:32 AM ### Soumyajit De Rahul #### Amazing blog by Sergey Hi there. Our fellow Shoguneer and C++ guru Sergey Lisitsyn has started a new blog. Be sure to check this out – contains classic stuffs. Statutory Warning: You may feel inferior regarding your C++ coding abilities :P ## April 28, 2014 11:30 AM #### Getting a minimal shogun java_modular interface program running I used this code just to see whether shogun works or not. I configured shogun with --interfaces=java_modular option and did a make install in /usr/local. It has to have jblas installed in /usr/share/java/jblas.jar 1. First I checked if jblas works fine. I tried this example - import org.jblas.*; public class jblas_test { public static void main(String[] args) { double[][] data = new double[][] {{ 1, 2, 3, 4, 5}, { 6, 7, 8, 9, 10}, {11, 12, 13, 14, 15}}; DoubleMatrix matrix = new DoubleMatrix(data); DoubleMatrix vector = new DoubleMatrix(new double[]{3, 3, 3, 3,3}); DoubleMatrix result = matrix.mmul(vector); System.out.println(result.rows+"x"+result.columns+": "+result); System.out.println("Jblas working fine"); } }  2. Then I compiled and ran it with -  [rahul@cfdvs4-2 jblas]$ javac -cp ".:/usr/share/java/jblas.jar" jblas_test.java

[rahul@cfdvs4-2 jblas]$java -cp ".:/usr/share/java/jblas.jar" jblas_test 3x1: [45.000000; 120.000000; 195.000000] Jblas working fine  3. Next step was to get a minimal shogun example run. I wrote a simple code - import org.shogun.*; public class helloworld { static { try{ System.load("/usr/local/lib/jni/libmodshogun.so"); System.out.println("Loaded modshogun API"); } catch(UnsatisfiedLinkError e) { System.out.println("Couldn't load modshogun"); System.out.println(e.getMessage()); } } public static void main(String[] args) { modshogun.init_shogun_with_defaults(); System.out.println("shogun works"); modshogun.exit_shogun(); } }  4. Then I compiled and ran it with - [rahul@cfdvs4-2 test]$ javac -cp ".:/usr/local/share/java/shogun.jar" helloworld.java
[rahul@cfdvs4-2 test]$java -cp ".:/usr/local/share/java/shogun.jar" -Djava.library.path=".:/usr/local/lib/jni/libmodshogun.so" helloworld Loaded modshogun API shogun works  5. Now was the time to run some actual shogun code. I need shogun for string kernel classification, so I tried out classifier_domainadaptationsvm_modular.java . Just changed System.loadLibrary("modshogun"); to System.load("/usr/local/lib/jni/libmodshogun.so"); at line #9. And also added this line after line 51. System.out.println(out.rows+"x"+out.columns+": "+out);  Then compiled and ran - [rahul@cfdvs4-2 test]$ javac -cp ".:/usr/share/java/jblas.jar:/usr/local/share/java/shogun.jar" classifier_domainadaptationsvm_modular.java
[rahul@cfdvs4-2 test]$java -cp ".:/usr/share/java/jblas.jar:/usr/local/share/java/shogun.jar" -Djava.library.path=".:/usr/local/lib/jni/libmodshogun.so" classifier_domainadaptationsvm_modular 1x10: [-1.000000, 1.000000, -1.000000, -1.000000, -1.000000, -1.000000, 1.000000, 1.000000, 1.000000, -1.000000]  Feels terrific :D ## April 28, 2014 11:30 AM #### GSoC’13 – Implement estimators of large-scale sparse Gaussian densities I have submitted the proposal (it can be found here). I’m having a much clearer view of the tasks that has to be done. I’ll be working on two other entrance issues before the official coding day begins, namely the graph coloring issue and the elliptic curve functions issue. In my understanding, the whole work can be summarized as the following - 1. use greedy graph coloring of the precision matrix (power of precision matrix?) for finding a set of probing vectors, $\{v_{j}\}$ (task for the project includes integrating an existing library. I checked out a few libraries and seems like Joe Culberson’s Graph Coloring code is a really good candidate. Its written in C and provides two greedy methods for vertex coloring. More tests for our specific need has to be done) 2. for each probing vector, we need to compute $v^{T}log(Q)v$, $Q$ being the precision matrix, $log(Q)$ is the matrix-logarithm 3. for computing $log(Q) v$ in the above expression, Cauchy’s integral formula (coming from complex analysis) for matrix function is used, which can be discretized, giving a rational approximation of $log(Q) v$, which involves solving $N$ different systems of linear equation. Solving these systems involves invoking a preconditioned CG solver (in the project, we’ll be integrating this from an existing library) 4. the systems have complex coefficients (coming from conformal mapping needed for the quadrature rule of the above integration) which is given by an existing algorithm in Hale et. al. The precision of this approximation is defined by a theorem, which depends on the extremal eigenvalues. We can calculate the number $N$ for desired accuracy using this theorem. The algorithm (for finding complex integration weights and shifts) needs Jacobi elliptic curve functions (in Driscoll’s SC-toolbox ellipkkp and ellipjc) and extremal eigenvalues. (task for the project is to integrate Krylstat’s implementation of this) 5. combining everything for the expression of log-determinant involves writing a class which, which will combine all the subtasks. I had a discussion with Heiko on this. While initially we’ll use this to compute this on one computer, later plans include replacing this so that a OpenMPI program can run the subtasks on a cluster with low communication cost and speeded up execution. Will discuss about more details later on The main paper describes several other techniques for handling special cases (like, when the matrix is ill-conditioned, etc). I’ll read up about these in more detail and update this page later. ## April 28, 2014 11:28 AM #### On designing a framework for estimators of large-scale sparse Gaussian Let me start this with a happy note. I am really glad for my proposal for “Estimators of large-scale sparse Gaussian” has been accepted in Google summer of code, 2013. I’m really looking forward to an awesome summer. I promised myself to write more often regarding this, but I’ve been really busy with a task that was needed for this project. I hope I can include some of that experience in this post too. Some theoretical background about the project - The aim of this project is to estimate log-determinant (up to an arbitrary precision) of a very large sparse precision matrix (inverse of covariance matrix) that arises in the log-likelihood expression of a multivariate Gaussian distribution. A direct method (like the one that I already added there in Shogun, CStatistics::log_det()) relies on Cholesky factorization of the matrix, which, in practice, may not be that sparse and often cannot be even stored in the memory and hence are not feasible in most of the practical scenarios . The idea of this project borrows a concept from complex analysis (Cauchy’s integral formula for matrix functions) which represents a matrix function using a contour integral in the complex plane. Rational approximation of this integral for matrix function (logarithm of a matrix in this case) times a vector leads to a shifted family of linear systems with complex shifts, weights and constant, which can be estimated up to an arbitrary precision. The task for estimating log-determinant here then becomes to sample the trace of the log of matrix using a set of vectors (called probing vectors, generated using greedy graph coloring), and in that expression, fit the log-matrix times vector using the rational approximation formula, There has been quite a few changes in the design and structure of the framework than I have initially had in mind. Heiko suggested some really cool ideas regarding the way it should work. ## April 28, 2014 11:27 AM #### GSoC weekly report – week 0 This is my first post after being selected for GSoC this year, been really really occupied with stuffs so far. So I’ll just quickly list down the things that I’ve done so far and things I am planning to do next. We have designed the framework for implementing the log-determinant project. The background of this project can be found here (I’m really happy to see my name listed there :) thanks to Heiko! :) ). For this project, we need to work with complex numbers, which was not supported by Shogun yet. The first task was to incorporate std::complex<T> with shogun and that required • adding the datatype as a primitive type • checking switch over all ptypes • adding support for the template classes that will use this • adding support for the parameter framework • and finally check serialization with this. I added a new type for std::complex<double> as complex64_t and all required support. We also needed Jacobi elliptic functions that we will be needing for the rational appromixation of the matrix logarithm. This part is also completed. The next step was to come up with the basic structure of the framework. So far, the class diagram looks like the following - Not all the classes are shown, and a few mistakes are there (please pardon me). I’m going to implement the base classes one by one, and for the sake of simplicity, we’ll start with a direct computation of matrix log (thanks to Eigen3) instead of approximation. So my main focus for this week would be to code up basic stuffs, focus on the framework development rather than numerical issues, fixing errors+unit-test+documentation+leak-check all the newly added classes thoroughly. Once the basic framework seems promising, the rest of the stuffs can be added iteratively one by one. See you next weeek! Adios! ## April 28, 2014 11:26 AM #### GSoC weekly report – week 1 As we planned, to keep the focus more on a working framework initially, I started with a direct logarithm of dense matrices that Eigen3 unsupported module provides (available for Eigen3.1.0 and later) instead of the rational approximation. This week, I have implemented the framework for dense matrix linear operators using Gaussian trace samples ($\mathcal{N}(0,I)$) and that shows that for a small $2\times 2$ matrix, say, $C=\begin{bmatrix} 2&1 \\ 1&3 \end{bmatrix}$ we can approximate the $log(\left|C\right|)$ by estimating $E(s^{T}log(C)s)$ up to a certain precision with a large number of estimates. For example, $log(\left|C\right|)=1.609438$, and for $1,000,000$ log-det estimates with Gaussian samples, we get an expected estimate $E(s^{T}log(C)s)=1.609546$, $s \sim\mathcal{N}(0,I)$. The framework has been developed in a modular way, with all independent base classes and implementations first and then the dependent ones. I added a unit-test for (almost) every component that has been added to ensure that they behave as we might expect them to do. Currently the framework has - • CLinearOperator<T> base class, CDenseMatrixOperator<T> implementation of this, a unit-test. • CJobResult base class, CScalarResult<T> and CVectorResult<T> implementation of this, a unit-test for scalar. • CJobResultAggregator base class, CStoreScalarAggregator<T> implementation of this and a unit-test. • CIndependentJob base class, CDenseExactLogJob implementation of this and a unit-test. • CIndependentComputationEngine base class, CSerialComputationEngine implementation, and a unit-test • COperatorFunction<T> base class, CDenseMatrixExactLog implementation of this, a unit-test • CTraceSampler base, CNormalSampler implementation, a unit-test • CLogDetEstimator class and a unit-test I have a small program ready (very much similar to the CLogDetEstimator unit-test) with this which I’ll add to the libshogun examples. This shows how the framework works - • The sample method of CLogDetEstimator computes a number of log-det estimates of a linear operator function • In sample, it first computes the stuff that are required for the rest of the computation job (call init on COperatorFunction and CTraceSampler). For example, CDenseMatrixExactLog implementation computes the log of the dense matrix and sets that as the linear operator in the operator function in its init. CLogRationalApproximation will compute complex shifts, weights etc using Jacobi elliptic functions in init, etc. CNormalSampler initializes the number of samples it should generate per log-det estimate (which is 1 in this case). CProbingSampler implementation will use graph coloring of the sparse linear operator and set the number of colors as the number of estimates, all in its init. • Then for the required number of log-det estimates, sample uses the submit_job method of COperatorFunction to create a number of jobs per sample, and keeps the JobResultAggregators with itself. submit_job internally creates a number of jobs (based on the implementation) with one aggregator, submits the jobs to the computation engine, which may or may not start computing those job immediately (based on implementation) and passes the aggregator to sample. • In serial implementation of the computation engine, it starts computing the jobs immediately as soon as they are submitted (call compute method on the job) and blocks until the computation is done. For CDenseExactLogJob, the computation task is to compute $log(C)s$ first (apply linear operator on vector, result is a vector), and then $s^{T}log(C)s$ (vector-vector dot product). The vector is then safely discarded. • sample then waits for all computation jobs to be completed with the engine’s wait_for_all method. Serial implementation returns immediately in this case since all jobs are already computed. • sample calls finalize on all the job result aggregators which shapes the final aggregation of the job results into CScalarResults. It then computes an average of the estimates and returns. Plan for next week To implement the CLogRationalApproximation class for dense matrix that computes weights, shifts in init (requires the eigenvalues, currently thought of using Eigen3 for that), and then CLogRationalApproximationIndividual, which creates num_shifts jobs for each trace sample and moves the shifts into the operator. A CLogRationalApproximationIndividualJob will use Eigen3’s complex solver for direct solving. I planned to keep CLinearSolver as a base class for all the solvers, which I’ll try to implement next week, along with its DirectLinearSolver implementation. CStoreVectorAggregator has to be implemented as well. Ah, its been too long for a report! I just realized it! :) That’s all folks! See you next week! ## April 28, 2014 11:26 AM #### making shogun with cmake [Reposted from Shogun's mailing list] default is libshogun. run, within a build directory inside shogun/, cmake .. if testing is required, run cmake -DENABLE_TESTING=ON .. and then make and run ctest –output-on-failure, this will run the tests as well as unit-tests. If valgrind check is needed, run cmake -DENABLE_TESTING=ON -DBUILD_DASHBOARD_REPORTS=ON .. and run ctest –output-on-failure. this will run valgrind on all the tests. If only unit-testing is required, then use ctest -R unit. If verbose is required, then run it with ctest -V (or -VV) -R unit. alternatively, just use -DENABLE_TESTING=ON and run valgrind on any suspected test valgrind “-q” “–tool=memcheck” “–leak-check=yes” “–show-reachable=yes” “–workaround-gcc296-bugs=yes” “–num-callers=50″ ./tests/unit/shogun-unit-test “–gtest_filter=SparseMatrixOperator.*” To build it for another interface, refer to Viktor’s mail which is - this will build you libshogun, without interfaces. if you want modular interfaces then you have to turn it on, either with cmake cmd argument or with cmake gui. e.g. if you want python modular interface: cmake -DPythonModular=ON .. the defines for the various interfaces: . .. python modular: -DPythonModular=ON . .. ruby modular: -DRubyModular=ON . .. octave modular: -DOctaveModular=ON . .. csharp modular: -DCSharpModular=ON . .. python modular: -DJavaModular=ON . .. lua modular: -DLuaModular=ON there are some other funky defines like: . .. -DENABLE_TESTING=ON turns on unit testing and all the example (compilation) and run for the interfaces you’ve chosen to compile. NOTE: for running the tests, after a successful ‘make’ instead of the old ‘make tests’ use ctest: ctest –output-on-failure . .. -DBUNDLE_EIGEN=ON your system does not ship the right eigen version (3.1.2 or later)? just use this define, and cmake will automagically download/compile and bundle eigen 3.1.4 into shogun . .. -DBUNDLE_JSON=ON same as the previous case just with libjson-c and now something more…FANCY! cmake comes with a nice cpack utility that basically can generate all kinds of packages, not only .tar.gz and .tar.bz2 but like debian’s .deb or redhat’s rpm, osx’ mpkg etc. so hopefully with this we soon be able to ship various nightly built binary packages for you. of course this work is still under progress…. if you want it sooner, come and help us, so we can get it for you sooner. anyways, enjoy! and of course if you run into some problems while trying to cmake shogun, just write here on the mailing list or come to the irc channel and ask somebody! cheers, viktor Hi Klye your best bet is to use cmake gui, i.e. ccmake on UNIX systems. there you can browse through the possible options, and i’ve written a small description for all the possible options. for the 2 things you were looking for: –disable-svm-light in cmake is -DUSE_SVMLIGHT=OFF –enable-svm-light in cmake is -DUSE_SVMLIGHT=ON –enable-swig-directors is -DUSE_SWIG_DIRECTORS=ON we are in progress with updating the ./src/INSTALL file where we’d describe things in more detail than my little email. cheers, viktor ## April 28, 2014 11:25 AM #### Threading using C++11: Part 1 Recently I bumped into a great introductory video series on concurrent programming using C++11 by Bo Qian on youtube. In this I am just trying to note down important parts and playing with stuffs. Some of the examples may be directly used as it is in the video series. Basic threading environment in C++11 The following example introduces with the basic constructs. /** * filename: test1.cpp * a simple example of using threads in c++11 * to compile, use * g++ -lpthread -std=c++0x test1.cpp * on linux environment */ #include <iostream> #include <thread> using namespace std; // a stupid thread function void thread_function(int n) { for (int i=0; i<n; ++i) // each thread has a unique id cout << "thread with id " << this_thread::get_id() << " says hello " << i << endl; cout << "address of n in t1 thread is " << &n << endl; } // a stupid thread function that takes a reference void thread_function_ref(int& n) { for (int i=0; i<n; ++i) cout << "thread with id " << this_thread::get_id() << " says hello " << i << endl; cout << "address of n in t2 (reference) thread is " << &n << endl; } // a stupid functor class functor { int state; public: functor():state(0){}; void operator()(int n) { for (int i=0; i<n; ++i) cout << "thread with id " << this_thread::get_id() << " says hello "<< i << endl; } }; // a stupid main function int main() { int n = 3; // thread creation - // ---------------- // thread using normal function - takes a callable object as the first // argument and a number of other arguments which are to be passed as // arguments to the callable object (function, lambda expressions, functor) // note that the arguments are always passed by value! a way to pass the // arguments by reference is shown in the next example thread t1(thread_function, n); // using std::ref we can pass the argument (n in this case) as a reference // just passing n won't do the trick! thread t2(thread_function, std::ref(n)); // to verify, we print the address of n in the main thread and in the child // threads of the thread functions. let's see! cout << "address of n in main thread is " << &n << endl; // this is cool! using this we can share memory between threads. we can // also completely hand over a memory to another thread using std::move // but more on that later // joining and stuffs - // -------------------- // we may decide to join the thread, in which the creator thread will wait // until the execution of the child thread is finished. here, however we // decide not to do that right away // //t1.join(); // more funky ways of thread creation - // ----------------------------------- // thread using lambda function - lambda functions are cool! check more // about these on cppreference.com thread t3([n]() { // inherits variable n from current scope and copies it to the thread // stack at different address for (int i=0; i<n; ++i) cout << "thread with id " << this_thread::get_id() << " says hello " << i << endl; cout << "address of n in t3 thread is " << &n << endl; }); thread t4([&n]() { // inherits variable n from current scope and uses the same address for (int i=0; i<n; ++i) cout << "thread with id " << this_thread::get_id() << " says hello " << i << endl; cout << "address of n in t4 thread is " << &n << endl; }); thread t5([](int n) { // doesn't inherit anything - we pass it by value - same as t1 for (int i=0; i<n; ++i) cout << "thread with id " << this_thread::get_id() << " says hello " << i << endl; cout << "address of n in t5 thread is " << &n << endl; }, n); // similarly we can pass it by reference as t2 - but its getting boring // thread using functor - // ---------------------- // creating an object first functor f1; thread t6(f1, n); // using anonymous object (note the extra paranthesis) thread t7((functor()), n); // some more boring stuffs - // ------------------------- // something to do for the main thread for (int i=0; i<n; ++i) cout << "thread with id " << this_thread::get_id() << " says hello " << i << endl; // join/detatch all threads t1.detach(); if (t1.joinable()) t1.join(); t2.join(); t3.join(); t4.join(); t5.join(); t6.join(); t7.join(); return 0; }  This code works. When I compiled and ran on my system (Fedora-19-x86_64) it ran, except it printed out something that’s pretty tough to read! [lambday@lambday.iitb.ac.in thread_test]$ ./a.out
0
thread with id 0139845777069824 says hello
thread with id 139845768677120 says hello 1
thread with id 139845777069824 says hello 2
thread with id  says hello 0
thread with id 139845760284416 says hello 1
thread with id 139845760284416 says hello 2

thread with id 139845751891712 says hello 1
thread with id 139845751891712 says hello 2
0
thread with id 139845768677120 says hello 1
thread with id 139845768677120 says hello 2
139845785462528 says hello 1
139845743499008 says hello 0
thread with id 139845743499008 says hello 1
thread with id 139845743499008 says hello 2
thread with id 139845785470848 says hello 0
thread with id 139845785470848 says hello 1
thread with id 139845785470848 says hello 2
thread with id 139845735106304 says hello 0
thread with id 139845735106304 says hello 1
thread with id 139845735106304 says hello 2


This is because all the threads are running for the same resource stdout and causing data race condition. This must be handled using mutex which provides lock/unlock mechanisms for shared resources. The following example shows how.

Handling data race condition using mutex

/**
* filename: test2.cpp
* a simple example of using a shared resource, such as cout,
* among multiple threads securely using mutex
*/

#include <iostream>
#include <mutex>

using namespace std;

class printer
{
ostream& os;
mutex& mu;
public:
// making default constructor, copy constructor and
// assignment operator implicitly disabled for apparently
// no reason
printer()=delete;
printer(const printer&)=delete;
const printer& operator=(const printer&)=delete;

// the constructor that we will be using
printer(mutex& m):os(cout), mu(m){}

// the function that we will be using
// I tried using string& s as a parameter but didn't work!
void shared_print(string s, int i)
{
// before using the resource that is os, we first need to
// lock the mutex, which could have normally be done using
// mutex::lock and mutex::unlock methods. however, if in
// between lock and unlock some exception happens, then
// the mutex is never going to be unlocked again until the
// kept on waiting till death
//
// lock_guard provides a Resource Acquisition Is Initialization
// (RAII) sort of thing, which calls the unlock in its destructor
// therefore, whenever the control goes outside of this block and
// guard gets killed, the mutex will be unlocked!
// sweet!
lock_guard<mutex> guard(mu);
os << s << i << endl;
}
};

{
for (int i=0; i<n; ++i)
// tried using std::ref which gave weird errors
// that it has been deleted!! any clue??
pr.shared_print(string("t1():"),i);
}

int main()
{
// mutex is the main tool for employing mutual exclusion from
// shared resources
mutex mu;

printer pr(mu);
int n = 4;

// to print stuffs. in fact everyone wishes to print stuffs
// using cout should use this! otherwise cout will be exposed
// to vulnerabilities and somebody can do nasty stuffs.
//
// its really importantn that we never ever give the access
// of the ostream from printer to anybody!

// its a good practice to surround the following within
// a try-catch block, so that if the following part of the
// code throws an exception, the child thread still gets joined
//
try {
for (int i=0; i<n; ++i)
pr.shared_print(string("main():"),i);
} catch(...) {
t1.join();
throw;
}

t1.join();

return 0;
}


This gives a pretty output

[lambday@lambday.iitb.ac.in thread_test]$./a.out main():0 t1():0 t1():1 t1():2 t1():3 main():1 main():2 main():3  Number of threads to be created is often a good question. It makes sense to create as many threads as there are cpus available. Otherwise a lot of time would be wasted in context switching which is a bad thing for program performance. std::thread::hardware_concurrency() function provides a way to do that. In the next example, a different kind of lock guard is shown at action, which provides a deferred locking mechanism. /** * filename: test3.cpp * this shows usage of unique_lock */ #include <iostream> #include <thread> #include <mutex> using namespace std; // similar to what we have seen before, locks the data // and then writes to it void thread_function(int& data, mutex& mu, void (*f)(int&)) { // we shouldn't create a lock_guard here because there is // no point in locking the data between the iterations of // the loop - the other thread would then have to wait until // this function returns from one thread unique_lock<mutex> guard(mu, defer_lock); for (int i=0; i<10; ++i) { // here we can create the lock as lock_guard and before the // next iteration the mutex will be unlocked. but to avoid // the overhead of creating and destroying a lock_guard obj // each time, we can use unique lock's lock/unlock mechanism // under defered lock //lock_guard<mutex> guard(mu); // guard.lock(); f(data); cout << this_thread::get_id() << ": says data is " << data << endl; guard.unlock(); } } int main() { mutex mu; // data that the threads try to write into // must be synchronized to avoid data race condition // using mutex int data = 0; // creating a thread that tries to increase the data thread t1(thread_function, std::ref(data), std::ref(mu), [](int& n){n++;}); // another thread which tries to decrease the data thread t2(thread_function, std::ref(data), std::ref(mu), [](int& n){n--;}); t1.join(); t2.join(); return 0; }  Output is shown below. after the execution is over, the value of data is same as before, as one might expect. [lambday@lambday.iitb.ac.in thread_test]$ ./a.out
140657387300608: says data is 1
140657387300608: says data is 2
140657387300608: says data is 3
140657387300608: says data is 4
140657387300608: says data is 5
140657387300608: says data is 6
140657378907904: says data is 5
140657378907904: says data is 4
140657378907904: says data is 3
140657378907904: says data is 2
140657378907904: says data is 1
140657378907904: says data is 0
140657378907904: says data is -1
140657378907904: says data is -2
140657378907904: says data is -3
140657378907904: says data is -4
140657387300608: says data is -3
140657387300608: says data is -2
140657387300608: says data is -1
140657387300608: says data is 0


Some more pointers are coming up next! Wrapping up this one cause its getting too big.

## April 28, 2014 10:47 AM

Tests to be added – one for the mean/variance for trace sampler and another with a huge matrix but not diagonal but having subdiagonal entries.

Add a test for distance two coloring on the sparse matrix operator itself and then trace samples compared with distance one coloring on power of the sparse matrix.

very important to have things in a very polished/documented/tested way.  missing features are not that important. rather have the existing things working well. there can be many examples given, eigen solvers, linear solvers, log det esitmators, computation framework examples. Use matplotlib for plots, iypthon notebook –pylab inline.

## April 28, 2014 10:44 AM

#### Adding complex data type for shogun

Related files – under shogun/lib/

• common.h
• DataType.h
• DataType.cpp

first thing – add typedef std::complex<float64_t> complex64_t, then the new SG primitive type.

Todo list:

• serialization
• modular typemaps
• check all switch over ptype
We have this shogun type object, TSGDataType, contains primitive type, container type(vector,matrix), struct type (string,sparse).  TSGDataType is used by shogun parameter framework to represent things, for example for serialization, model-selection, equals method. one has to register parameters with the SG_ADD macro which calls m_parameters->add of the current object. In there, we need new methods for complex, should be a few, in the same style as the existing ones. Then we have to check serialization, which writes types to files. Then modular maps which map the type the the type of the language (eg python complex).

## April 24, 2014 11:14 PM

### Heiko Strathmann

#### Machine Learning Summer School in Reykjavik

I am visiting the Machine Learning Summer School in Reykjavik, Iceland. Having our poster on Kameleon MCMC, a Kernel Adaptive Metropolis-Hastings sampler, with me (arXiv link: to appear in ICML 2014, poster link). Already met loads of interesting new and known people. Iceland is a pretty cool place, looks a bit like this.

## April 18, 2014 12:10 AM

### Sergey Lisitsyn

#### Opaque Pointers Revisited

Opaque pointer (aka d-pointer or pimpl) is a great C++ design pattern useful for prolongated binary interface compatibility, properly hidden implementation and faster compilation. However, it has inherent performance drawback, which could get pretty critical if you care about efficiency. In this post I propose an approach that makes d-pointers less binary compatible but swipes away its inefficiency.

## April 08, 2014 06:53 PM

### Sergey Lisitsyn

#### Actions With RAII

You may know RAII as a cool idiom that makes it pretty easy to handle resources finalization automatically. When used properly, it reduces LoC, helps to avoid bugs and gives more safety for free. This makes RAII an important part of modern C++.

## March 30, 2014 11:16 AM

### mloss.org

#### Reproducibility is not simple

There has been a flurry of articles recently outlining 10 simple rules for X, where X has something to do with data science, computational research and reproducibility. Some examples are:

## Best practices

These articles provide a great resource to get started on the long road to doing "proper science". Some common suggestions which are relevant to practical machine learning include:

##### Use version control

Start now. No, not after your next paper, do it right away! Learn one of the modern distributed version control systems, git or mercurial currently being the most popular, and get an account on github or bitbucket to start sharing. Even if you don't share your code, it is a convenient offsite backup. Github is the most popular for open source projects, but bitbucket has the advantage of free private accounts. If you have an email address from an educational institution, you get the premium features for free too.

Distributed version control systems can be conceptually daunting, but it is well worth the trouble to understand the concepts instead of just robotically type in commands. There are numerous tutorials out there, and here are some which I personally found entertaining, git foundations and hginit. For those who don't like the command line, have a look at GUIs such as sourcetree, tortoisegit, tortoisehg, and gitk. If you work with other people, it is worth learning the fork and pull request model, and use the gitflow convention.

##### Open source your code and scripts

Publish everything. Even the two lines of Matlab that you used to plot your results. The readers of your NIPS and ICML papers are technical people, and it is often much simpler for them to look at your Matlab plot command than to parse the paragraph that describes the x and y axes, the meaning of the colours and line types, and the specifics of the displayed error bars. Tools such as ipython notebooks and knitr are examples of easy to implement literate programming frameworks that allow you to make your supplement a live document.

It is often useful to try to conceptually split your computational code into "programs" and "scripts". There is no hard and fast rule for where to draw the line, but one useful way to think about it is to contrast code that can be reused (something to be installed), and code that runs an experiment (something that describes your protocol). An example of the former is your fancy new low memory logistic regression training and testing code. An example of the latter is code to generate your plots. Make both types of code open, document and test them well.

##### Make your data a resource

Your result is also data. When open data is mentioned, most people immediately conjure images of the inputs to prediction machines. But intermediate stages of your workflow are often left out of making things available. For example, if in addition to providing the two lines of code for plotting, you also provided your multidimensional array containing your results, your paper now becomes a resource for future benchmarking efforts. If you made your precomputed kernel matrices available, other people can easily try out new kernel methods without having to go through the effort of computing the kernel.

Efforts such as mldata.org and mlcomp.org provide useful resources to host machine learning oriented datasets. If you do create a dataset, it is useful to get an identifier for it so that people can give you credit.

## Challenges to open science

While the articles call these rules "simple", they are by no means easy to implement. While easy to state, there are many practical hurdles to making every step of your research reproducible .

##### Social coding

Unlike publishing a paper, where you do all your work before publication, publishing a piece of software often means that you have to support it in future. It is remarkably difficult to keep software available in the long term, since most junior researchers move around a lot and often leave academia altogether. It is also challenging to find contributors that can help out in stressful periods, and to keep software up to date and useful. Open source software suffers from the tragedy of the commons, and it quickly becomes difficult to maintain.

While it is generally good for science that everything is open and mistakes are found and corrected, the current incentive structure in academia does not reward support for ongoing projects. Funding is focused on novel ideas, publications are used as metrics for promotion and tenure, and software gets left out.

##### The secret branch

When developing a new idea, it is often tempting to do so without making it open to public scrutiny. This is similar to the idea of a development branch, but you may wish to keep it secret until publication. The same argument applies for data and results, where there may be a moratorium. I am currently unaware of any tools that allow easy conversion between public and private branches. Github allows forks of repositories, which you may be able to make private.

Once a researcher gets fully involved in an application area, it is inevitable that he starts working on the latest data generated by his collaborators. This could be the real time stream from Twitter or the latest double blind drug study. Such datasets are often embargoed from being made publicly available due to concerns about privacy. In the area of biomedical research there are efforts to allow bona fide researchers access to data, such as dbGaP. It seamlessly provides a resource for public and private data. Instead of a hurdle, a convenient mechanism to facilitate the transition from private to open science would encourage many new participants.

What is the right access control model for open science?

##### Data is valuable

It is a natural human tendency to protect a scarce resource which gives them a competitive advantage. For researchers, these resources include source code and data. While it is understandable that authors of software or architects of datasets would like to be the first to benefit from their investment, it often happens that these resources are not made publicly available even after publication.

## March 28, 2014 06:59 AM

### Sergey Lisitsyn

#### Any Struggles

In C++ you can’t just forget about types. Each variable has its own type that is not going to be changed by any means. What if you really need something heterogeneous? There is a known idiom called Any that enables you to erase the type and recall it later.

## March 25, 2014 11:33 PM

### Heiko Strathmann

#### I Like Intractable Likelihoods

Last week, I went to the i-like workshop at Oxford university. Pretty cool! All of Britain’s statisticians were there and I met many of them for the first time. Check out my two posters (Russian Roulette, Kernel Adaptive Metropolis Hastings). Talks were amazing – as in last NIPS, the main trend is on estimating likelihoods (well, that’s the name of the program), either using some other random process such as importance sampling a latent model’s marginal likelihood (aka Pseudo-Marginal MCMC), or directly sub-sampling likelihoods or gradients.

These things are important in Machine Learning too, and it is very nice to see the field growing together (even-though there was a talk by a Statistician spending lots of time on re-inventing belief propagation and Junction tree ideas – always such a pitty if this happens simply because communities do not talk to each other enough). Three talks that I really found interesting:

Remi Bardenet talked about sub-sampling approaches to speed up MCMC. This is quite related to the Austerity in MCMC land paper by Welling & Co, with the difference that his tests do not suffer from small number of points in the hypothesis test to decide accept/reject.

Chris Sherlock talked about optimal rates and scaling for Pseudo-Marginal MCMC. There finally are some nice heuristics how to scale PM estimates in a way that the number of iid samples per computation time is optimal. Interestingly, the acceptance rate and the variance of the likelihood estimate can be tweaked separately.

Jim Griffin gave a very interesting talk on adaptive MCMC on discrete, in particular binary, state-spaces – he used them for feature selection (in ML language). His algorithm automatically learns global mutations rates for each of the positions. However, it doesn’t take any correlations between the features into account. This might be a very interesting application for our fancy Kameleon sampler (arxiv, code), thinking about this!

Finally, I presented two posters, the one on Playing Russian Roulette with Intractable Likelihoods that I already presented in Reykjavik, and (with Dino) a new poster (link) on the Kernel Adaptive Metropolis Hastings Kameleon that I mentioned above. The corresponding paper is hopefully published very soon. Talking to other scientists about my own work is just great!