I recently received some queries on our paper: S. Mohamed, K. Heller and Z. Ghahramani. Bayesian and L1 Approaches for Sparse Unsupervised Learning. International Conference on Machine Learning (ICML), June 2012 .
The questions were very good and I thought it would be useful to post these for future reference. The paper looked at Bayesian and optimisation approaches for learning sparse models. For Bayesian models, we advocated the use of spike-and-slab sparse models and specified an adapted latent Gaussian model with an additional set of discrete latent variables to specify when a latent dimension is sparse or not. This allows exact zeroes in the Bayesian sampling and improved performance in many settings. Three questions that came up were:
Rationale for the latent variable sampling scheme
There are many approaches to dealing with the spike z, and slab v, in MCMC:
- We could do naive sampling: sample Z conditioned on V, and then V conditioned on Z.
- We could sample Z integrating out V (which is what we show in eqs 5 and 6).
- Other joint sampling, such as sampling V first (integrating Z).
All three have been shown in other contexts and when a collapsed or uncollapsed method is best depends on the complexity of the model and the data being used, but none form a standard approach. Approach 3 would involve summing over all combinations of the discrete latent variables, and hence we did not follow this route. We tried both approaches 1 and 2 (both collapsed and uncollapsed versions of the sampler), and we found the collapsed version quickly informs us of the state of the slab overall, and gave better performance (and makes intuitive sense since we account for the variation and dependency on the slab components). For other variables we wanted better performance than standard MH, and went with the slice sampler since our experience is that slice sampling generates smarter proposals and is straightforward to implement.
What does recovering the true sparsity mean for MCMC? Do you report a single sample or an average of samples?
I think in this case I ran the sampler to convergence and used a set of samples (thinned to provide independent samples) and for each I computed the sparsity of the reconstruction (this is how I got the error bars). An average of the samples will not be sparse. If we are computing predictive probabilities, we make use of each sparse sample in the computation and this is not a problem. For interpretation, this is more difficult. We can always make use of the most likely sample in interpretation (but this is not appealing, since we could have achieved this using a MAP estimate). It would be better to make use of a small set of samples, making interpretations of each sample and comparing them to each other to gain insight into the variation in your interpretation. As a further thought, for interpretation I think it might make sense to also infer the latent means and covariances of the latent variables (rather than interpret the latent variables directly) - but this would be a model that is a bit more involved, by replacing the prior on the slab and not just using independent Bernoullis, but using a more complex structure - maybe a set of auto-regressive Bernoullis, or a further latent-Gaussian model.
Scalability for high-dimensional data (~105 dimensions)
Even for high-dimensional data, the key bottleneck is in the sampling of the latent variables (z and v). The dimenisonality of these latent variables is typically much, much smaller than the dimensionality of the data, so even in these settings MCMC can be a good candidate. We tried experiments to show that in both cases of n>p and p>n, a sparse Bayesian approach is applicable and can produce good results, though we have not experimented with data sets of the scale you are faced with. That said, mcmc will struggle with mixing in high-dimensional cases, due to difficulty exploring the combinatorial space of the sparse latent vectors. This aspect is something we are looking into. One approach is to make use of approximate Bayesian inference methods (variational methods or EP). These approaches can be parallelisable in some settings e.g., by splitting the data across processors and maintaining a shared posterior over the latent variables, like that we described for latent-Gaussian models . If we give up on fully Bayesian inference then we could instead compute a MAP estimate of the latent variables (and would be easier for interpretation) and compute the posterior only on the model parameters (theta) (potentially exploiting ideas such as submodularity, ). This is an important question and one we continue to work on.