Probabilistic inference no longer lies at the fringe. How we connect our observed data to the assumptions made by our statistical models (the task of inference) was a central part of this year's Neural Information Processing Systems (NIPS) conference. Zoubin Ghahramani's opening keynote set the stage, and the pageant of inference included forecasting, compression, decision making, personalised modelling, and automating the scientific method. Josh Tenenbaum inspired us to think about the central role of probabilistic inference in building intelligent systems and in our understanding of brains, minds and machines.

I was fortunate to be one of the organisers of the NIPS 2015 Workshop on Advances in Approximate Bayesian Inference, where we debated the past, present and future of approximate inference. This post is my brief reflection on a year in inference and the advances we have made since the last workshop; *my review of that workshop is here: "Variational Inference: Tricks of the Trade".*

# Seeking a Deeper Understanding

Stochastic approximation remains the driving force for scaling inference. At the workshop we explored stochastic approximation schemes for expectation propagation (EP), building on work published earlier this year. Three important explorations of EP at the workshop were:

- *Stochastic EP for Large Scale Gaussian Process Classification*,
- *Training Deep Gaussian Processes using Stochastic Expectation Propagation and Probabilistic Backpropagation*, and
- *Black-box α-divergence Minimization*.

EP remains one of the best inference methods for GP classification, and these approaches strengthen that position and bring EP back into the conversation on the availability and limitations of different approximate inference methods.

Stochastic approximation schemes were established for variational free energy methods over the past few years, and this thinking has matured further. The main tricks driving variational inference remain Monte Carlo gradient estimators, of either the pathwise-derivative or the score-function kind, which allow us to apply the variational principle for inference in both discrete and continuous models. As a field, we are now interrogating these estimators further and attempting to tame their variance in different ways, including methods using local expectation gradients or delta-method-derived control variates.

- At the workshop, we explored one new approach to this problem through the unification of variational inference and reinforcement learning. This will allow for variance reduction and other approximation tools from reinforcement learning, such as value and advantage functions, to be applied within the variational inference setting.
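
To make the two gradient estimators mentioned above concrete, here is a minimal sketch that estimates the gradient of E[f(z)] with respect to the mean of a Gaussian q(z) = N(mu, sigma²). The test function f(z) = z², the parameter values, and the baseline f(mu) used as a simple control variate are illustrative assumptions, not choices taken from any of the workshop papers.

```python
import random
import statistics


def f(z):
    # Illustrative test function (an assumption for this sketch).
    return z * z


def df(z):
    # Derivative of f, needed only by the pathwise estimator.
    return 2.0 * z


def score_function_grad(mu, sigma, rng, baseline=0.0):
    # Score-function estimator: (f(z) - b) * d/dmu log N(z; mu, sigma^2).
    z = rng.gauss(mu, sigma)
    return (f(z) - baseline) * (z - mu) / sigma ** 2


def pathwise_grad(mu, sigma, rng):
    # Pathwise estimator: differentiate through z = mu + sigma * eps.
    eps = rng.gauss(0.0, 1.0)
    return df(mu + sigma * eps)


def summarise(estimator, n, **kwargs):
    # Mean and sample variance of n single-sample gradient estimates.
    samples = [estimator(**kwargs) for _ in range(n)]
    return statistics.fmean(samples), statistics.variance(samples)


rng = random.Random(0)
mu, sigma, n = 1.5, 1.0, 20000

# The exact gradient is d/dmu (mu^2 + sigma^2) = 2 * mu.
sf_mean, sf_var = summarise(score_function_grad, n, mu=mu, sigma=sigma, rng=rng)
cv_mean, cv_var = summarise(score_function_grad, n, mu=mu, sigma=sigma, rng=rng,
                            baseline=f(mu))
pw_mean, pw_var = summarise(pathwise_grad, n, mu=mu, sigma=sigma, rng=rng)

print("score function:", sf_mean, sf_var)
print("score function + baseline:", cv_mean, cv_var)
print("pathwise:", pw_mean, pw_var)
```

All three estimators target the same gradient; in this toy setting the baseline lowers the variance of the score-function estimator, and the pathwise estimator has lower variance still, but it requires f to be differentiable and z to be reparameterisable.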

# Inference Models and Amortisation

The use of inference models and amortised inference schemes has become much more widespread. We now have a variety of ways in which to deploy such inference networks in our algorithms: in expectation propagation, we can learn to pass messages using kernels; for posterior predictive distributions, we can distill this computation as done in Bayesian Dark Knowledge; in variational inference we can develop sequential extensions, such as that used in DRAW.

- At the workshop we saw *inference networks for graphical models* used to learn proposal distributions for use in importance sampling or sequential Monte Carlo.

# Automated Inference

The simplicity of the gradient estimators, and the deeper understanding we now have of them, make it easy to promote further automation and enhance existing software systems. This generality allows us to equip any automatic differentiation system with the ability to do variational inference.

- We had an in-depth panel discussion that explored this, highlighting new tools such as Stan and python-autograd that make variational inference even more accessible and easy to use.
- There are also many ways to formalise and simplify the mixing of the two types of gradient estimators that we have been using, such as the approach developed in stochastic computation graphs.
- These tools allow variational inference to be implemented in about 5 lines of code. Variational inference is on a path that will soon allow it to be a default method for statistical inference due to these far-sighted efforts.
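
As a rough illustration of what such tools automate, the sketch below fits a Gaussian approximation q(z) = N(m, s²) by stochastic gradient ascent on the variational bound with pathwise gradients. The toy model (z ~ N(0, 1), x | z ~ N(z, 1)), the function name `fit_gaussian_posterior`, and the step-size settings are assumptions made for this example. The gradient of the log joint is written by hand here so that the code runs with the standard library alone; an automatic differentiation system would supply that line, which is where the brevity comes from.

```python
import math
import random


def fit_gaussian_posterior(x, steps=2000, n_samples=32, lr=0.02, seed=0):
    """Fit q(z) = N(m, s^2) to the posterior of a toy model (hypothetical helper).

    Assumed model: z ~ N(0, 1), x | z ~ N(z, 1).
    """
    rng = random.Random(seed)
    m, log_s = 0.0, 0.0
    for _ in range(steps):
        s = math.exp(log_s)
        grad_m, grad_log_s = 0.0, 0.0
        for _ in range(n_samples):
            eps = rng.gauss(0.0, 1.0)
            z = m + s * eps                 # reparameterised sample from q
            dlogp_dz = x - 2.0 * z          # d/dz [log p(z) + log p(x | z)]
            grad_m += dlogp_dz              # chain rule: dz/dm = 1
            grad_log_s += dlogp_dz * s * eps  # chain rule: dz/dlog_s = s * eps
        grad_m /= n_samples
        # The Gaussian entropy is log s + const, so it adds 1 to d/dlog_s.
        grad_log_s = grad_log_s / n_samples + 1.0
        m += lr * grad_m
        log_s += lr * grad_log_s
    return m, math.exp(log_s)


x_obs = 2.0
m_hat, s_hat = fit_gaussian_posterior(x_obs)
print("approximate posterior mean and std:", m_hat, s_hat)
```

For this conjugate model the exact posterior is N(x/2, 1/2), so the fitted mean and standard deviation can be checked against x/2 and sqrt(1/2).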

# Posteriors and Learning Objectives

The question most often asked in variational inference is how we can obtain better posterior inferences. Two paths have emerged: one looking at modified objective functions and the other at developing richer classes of approximate posterior distributions.

**Learning objectives.**

- At the workshop we explored *Perturbation Theory for Variational Inference*, which allows us to obtain tighter bounds by adding higher-order correction terms.
- In a closely related workshop (on black-box inference), we looked at importance-weighted autoencoders that tightened the variational bound based on the importance sampling estimator of the marginal likelihood.
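
The importance-weighted bound can be sketched on a toy problem where the marginal likelihood is known exactly. The model (z ~ N(0, 1), x | z ~ N(z, 1), so that x ~ N(0, 2)), the deliberately poor proposal q(z) equal to the prior, and the helper names below are assumptions for illustration only; a real importance-weighted autoencoder would learn both the model and q.

```python
import math
import random


def log_normal(x, mean, std):
    # Log density of N(mean, std^2) evaluated at x.
    return (-0.5 * math.log(2.0 * math.pi) - math.log(std)
            - 0.5 * ((x - mean) / std) ** 2)


def log_mean_exp(values):
    # Numerically stable log of the average of exp(values).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def iwae_bound(x, k, n_rep, rng):
    # Monte Carlo estimate of E[log (1/k) * sum_i p(x, z_i) / q(z_i)].
    total = 0.0
    for _ in range(n_rep):
        log_w = []
        for _ in range(k):
            z = rng.gauss(0.0, 1.0)  # sample from the proposal q(z) = N(0, 1)
            log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
            log_q = log_normal(z, 0.0, 1.0)
            log_w.append(log_joint - log_q)
        total += log_mean_exp(log_w)
    return total / n_rep


rng = random.Random(0)
x_obs = 2.0
log_px = log_normal(x_obs, 0.0, math.sqrt(2.0))  # exact log marginal likelihood
elbo_k1 = iwae_bound(x_obs, 1, 2000, rng)        # k = 1 is the standard bound
iwae_k50 = iwae_bound(x_obs, 50, 2000, rng)

print("k=1 bound:", elbo_k1)
print("k=50 bound:", iwae_k50)
print("log p(x):", log_px)
```

With k = 1 the estimator is the usual variational bound; increasing k moves the bound towards log p(x) without changing the proposal.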

**Richer posterior distributions.** There has been much development of richer posteriors this year.

- The workshop showed that more robust inference can be obtained by improving covariance estimates obtained from mean-field inference, using robust variational inference.
- Earlier in the year, combining MCMC and variational inference (VI-HMC) was proposed as one way of achieving this. Normalising flows were proposed as another approach. At this workshop, we obtained a new tool for constructing rich posterior distributions by combining the auxiliary variational bound from VI-HMC with normalising flows, giving us *hierarchical variational models*.

# More Theory Needed

Our constant request is for more theory that can guide our efforts and provide more insight. And there are many efforts in this regard.

- At the workshop we saw discussion of the *convergence of stochastic variational inference* and the development of approximate non-parametric Bayesian inference schemes that provide guarantees from the outset.
- Manfred Opper gave an invited talk that explored approximate inference in undirected models using random matrix theory.

As always, more theoretical work is needed, as well as new ways of communicating theoretical insights to the wider approximate inference community.

# Summary

There was so much more that was discussed: better ways of doing optimisation (multicanonical methods, incremental variational inference, and inference in sequential/temporal models), testing and evaluation, and the types of guarantees that might be needed when using approximate inference in production and scientific settings. Approximate inference is clearly a healthy and active research area. There are many directions going forward, and these will include a continued search for rich posterior approximations, better optimisation algorithms, and more general purpose software, as well as new directions for greater interpretability, tighter integration with other systems such as differentiable rendering engines, more diverse applications in compression and decision making, and greater cross-pollination with other approximate inference approaches. I can't wait to see what one more year will bring.

My interpretation of the "Hierarchical Variational Models" paper by Ranganath et al. is a bit different. For me, the main take-away from that paper was a variational lower bound on the marginal entropy of z in q(z,y | x), as might be used in an inference model to approximate a posterior over z, aided by some "auxiliary" latent variables y. The auxiliary latent variables y don't participate in the generative model p(x) = \sum_z p(x | z)p(z) which we'll train using samples from q (i.e. standard SGVB business). This is an important distinction from the work by Salimans et al. The marginal entropy lower bound can be used to minimize KL(q(z | x) || p(z)), as required for maximizing the VFE bound on log likelihood.

That is, we can lower bound H(z | x) = H(z,y | x) - H(y | x,z) by combining exact evaluation of H(z,y | x) with an upper bound on H(y | x,z), where we construct the upper bound using a (stochastic) variational lower bound on log(q(y | x,z)). This derivation of a variational lower bound on marginal entropy takes more or less the same form as the derivation of a variational lower bound on mutual information, as provided in your paper on "Variational Information Maximisation...". The only difference is that the MI lower bound drops an expected log(q(z | x,y)) term.

Intuitively, the mutual information between the terminal state of a trajectory and preceding steps in the trajectory is the marginal entropy of the terminal state, minus the expected entropy of the final step in the trajectory (i.e. entropy not explained by the rest of the trajectory). So, a lower bound on marginal entropy of the end point of a stochastic trajectory (e.g. z in q(z,y | x)) can be converted to a lower bound on I(z; y) by just subtracting off an expected log(q(z | x,y)) term.
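
The two bounds described in this comment can be written out explicitly. This is a sketch, not the notation of either paper: r(y | x, z) is an assumed auxiliary variational distribution over y (any tractable density), and all expectations are taken under q(z, y | x).

```latex
\begin{align}
H_q(z \mid x) &= H_q(z, y \mid x) - H_q(y \mid x, z) \\
H_q(y \mid x, z) &= -\mathbb{E}_{q(z, y \mid x)}\left[\log q(y \mid x, z)\right]
  \le -\mathbb{E}_{q(z, y \mid x)}\left[\log r(y \mid x, z)\right] \\
H_q(z \mid x) &\ge \mathbb{E}_{q(z, y \mid x)}\left[-\log q(y \mid x) - \log q(z \mid x, y) + \log r(y \mid x, z)\right] \\
I_q(z; y \mid x) &= H_q(z \mid x) + \mathbb{E}_{q(z, y \mid x)}\left[\log q(z \mid x, y)\right]
  \ge \mathbb{E}_{q(z, y \mid x)}\left[-\log q(y \mid x) + \log r(y \mid x, z)\right]
\end{align}
```

The inequality in the second line holds because the gap is an expected KL divergence between q(y | x, z) and r(y | x, z). Written this way, the entropy bound and the mutual information bound differ only in the -log q(z | x, y) term.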