Derivation of the Expected Improvement Acquisition Function
I regularly use the Expected Improvement (EI) as an acquisition function in Bayesian optimization (BO), but I’ve never fully understood the derivation of its closed form solution. This blog post aims to derive EI step by step.
Background
Assume we have an optimization problem x∗=argmaxxf(x)
In BO with EI as an acquisition function, we want to choose the point that is expected to improve the most upon the existing best observed point y^∗=f(x^∗)≥f(xi)∀i∈(1,…,t) . t is the number of observations thus far.
More formally, we can create an improvement function I(x) that tells us how much better the posterior of our probabilistic model at x is than our best observed point y^∗=f(x^∗). If the posterior is less than the best observed point, we make I(x) zero.
I(x)=max(yθ(x)−f(x^∗),0)
yθ is most commonly the posterior of a Gaussian process, which, at a specific input x, is a normal distribution:
yθ(x)∼N(μθ(x),σθ2(x))
For EI, we want the expectation of the improvement:
EI(x)=Ey[I(x)]
If you’ve ever used EI or read about it in a paper, you might have been confused how to get the following closed form of EI:
The upper limit of the expectation integral can actually stop at y^∗ since we consider I(x) to be 0 when yθ(x)>y^∗:
EI(x)=∫−∞y^∗I(x)p(yθ∣x)dy
Substituting in the definition I(x):
∫−∞y^∗(yθ(x)−y^∗)p(yθ∣x)dy
It’s not obvious how to solve this integral since it is in terms of a parametrized random variable yθ. To find a closed form for the integral, we can use the reparameterization trick, which essentially replaces a parametrized random variable (in our case yθ) with a deterministic function multiplied by a non-parametrized random variable. As a side note, the reparametrization trick is most famously what makes variational autoencoders possible, but it also has applications across Bayesian optimization.
Our reparameterization is as follows: yθ=μθ(x)+σθ(x)ϵ where ϵ∼N(0,1). Therefore:
That above step just uses the definition of the CDF, recalling that ϵ is normally distributed. If you replace Z∗, you’ll see that the first term is already the same as our desired form. For the second term, we need to substitute in the equation for the PDF of the normal distribution and simplify. We then get:
As Martin Krasser points out, the two terms we get in this closed form represent the classic exploration-exploitation tradeoff. The first term is the exploitation term that primarily takes into account the improvement of the mean over our best observed value (though Z∗ is weighted by σθ(x) to reduce this exploitation under high uncertainty). The second term focuses on exploration by allocating more weight to predictions with large uncertainty.
Also, note that, in practice, we would implement EI using an if statement to set EI(x) to zero if there is no predicted improvement.