Thanks for the post. I really like this short trick-explanation format.

I believe "V[z⊤ A z] = 2 Tr(A A⊤) − 2 Tr(A)^2" should be "V[z⊤ A z] = 2 Tr(A A⊤) − 2 Tr(A^2)"

Very nice article, it makes for a nice read and explains some basic concepts. I have a question though. Somewhere you say that the vanilla gradient is hard to compute because

"This gradient is often difficult to compute because the integral is typically unknown and the parameters θ, with respect to which we are computing the gradient, are of the distribution p(z;θ)."

I recognize the first difficulty. Indeed, if your function f() is complicated, like a neural network, computing the integral analytically is hard.

However, what exactly do you mean by the second difficulty? Do you mean that because the variable θ, with respect to which you take the gradient, also appears in the pdf, the gradient computation becomes much harder, since the overall composite function is more complicated?

Or is it because you now need to take the expectation over a multi-dimensional random variable, whereas after the reparameterization trick the expectation is taken over the one-dimensional ε? Would that imply that you need far fewer samples to compute a good enough expectation, and in turn that the estimator has much lower variance (or smaller bias)?
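
To make the question concrete, here is how I would write the two gradients. This is only a sketch in my own notation: the map g and the base distribution p(ε) are my assumptions, not taken from the post, and I assume the gradient and the integral can be interchanged.

```latex
% theta sits inside the density, so the gradient acts on p(z; theta)
\nabla_\theta \, \mathbb{E}_{p(z;\theta)}[f(z)]
  = \nabla_\theta \int p(z;\theta)\, f(z)\, dz
  = \int \nabla_\theta p(z;\theta)\, f(z)\, dz

% Reparameterized form, assuming z = g(\epsilon;\theta) with \epsilon \sim p(\epsilon)
\nabla_\theta \, \mathbb{E}_{p(\epsilon)}\big[f(g(\epsilon;\theta))\big]
  = \mathbb{E}_{p(\epsilon)}\big[\nabla_\theta f(g(\epsilon;\theta))\big]
```

In the first form the integrand is no longer a density times a function, so it is not obvious to me how to estimate it by sampling z. In the second form the expectation is over a distribution that does not depend on θ. Is that the difficulty you are referring to?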
