For two probability measures $\mathbb{P}, \mathbb{Q}$ supported on $\mathbb{R}^d$ and with densities $p, q$ with respect to the Lebesgue measure, the Kullback-Leibler divergence between them is defined as
$$\mathrm{kl}(\mathbb{P}\Vert \mathbb{Q}) = \mathbb{E}_{X \sim \mathbb{P}}\left[ \ln \left(\frac{p(X)}{q(X)}\right)\right] = \int_{\mathbb{R}^d} p(x)\ln(p(x)) - p(x)\ln(q(x))\,\mathrm{d}x.$$
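This definition lends itself directly to a Monte Carlo sanity check: draw samples from $\mathbb{P}$ and average $\ln p - \ln q$. Here is a minimal sketch with two arbitrary univariate Gaussians standing in for $\mathbb{P}$ and $\mathbb{Q}$ (the parameters, sample size, and use of `scipy.stats.norm` are illustrative choices, not part of the note):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Two arbitrary 1-d Gaussians playing the roles of P and Q.
p = norm(loc=0.0, scale=1.0)
q = norm(loc=1.0, scale=2.0)

# Monte Carlo estimate of E_{X~P}[ln p(X) - ln q(X)].
x = p.rvs(size=100_000, random_state=rng)
kl_mc = np.mean(p.logpdf(x) - q.logpdf(x))

# Known closed form in 1 d: ln(s2/s1) + (s1^2 + (m1-m2)^2) / (2 s2^2) - 1/2.
kl_exact = np.log(2.0 / 1.0) + (1.0**2 + (0.0 - 1.0) ** 2) / (2 * 2.0**2) - 0.5
print(kl_mc, kl_exact)  # the two numbers should be close (≈ 0.443)
```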
Reminders on the $\mathrm{kl}$ divergence.
If $f$ is a density function, the « relative entropy of $p$ with respect to $f$ » is the quantity defined as
$$H_p(f) = -\int p(x) \ln f(x)\,\mathrm{d}x.$$
Information theory à la Shannon tells us that this is the mean cost of « encoding » random variables drawn from $p$ using the density $f$. This cost is minimized for $f = p$ and the minimal cost is $H_p(p)$, the entropy of $p$: that's Shannon's theorem. The Kullback-Leibler divergence is thus the difference $H_p(q) - H_p(p)$; in other words, it quantifies what is lost when encoding $p$ with $q$, or how much information on $p$ is not contained in $q$.
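The decomposition $\mathrm{kl}(\mathbb{P}\Vert\mathbb{Q}) = H_p(q) - H_p(p)$ can also be checked numerically; here is a small sketch for 1-d Gaussians where the integrals are computed with `scipy.integrate.quad` (again, the specific densities are arbitrary illustration choices):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p = norm(loc=0.0, scale=1.0)
q = norm(loc=1.0, scale=2.0)

# Cross term H_p(q) = -∫ p ln q and entropy H_p(p) = -∫ p ln p.
H_pq = -quad(lambda x: p.pdf(x) * q.logpdf(x), -np.inf, np.inf)[0]
H_pp = -quad(lambda x: p.pdf(x) * p.logpdf(x), -np.inf, np.inf)[0]

# Direct integral of p ln(p/q): should equal H_pq - H_pp.
kl_direct = quad(lambda x: p.pdf(x) * (p.logpdf(x) - q.logpdf(x)), -np.inf, np.inf)[0]
print(H_pq - H_pp, kl_direct)  # both ≈ 0.443
```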
In dimension $d$, the Gaussian distribution $\mathscr{N}(\mu, \Sigma)$ with mean $\mu$ and covariance $\Sigma$ (a $d\times d$ positive-definite, hence nonsingular, matrix) has density
$$g_{\mu, \Sigma}(x) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}}\exp\left\lbrace - \frac{\langle x- \mu, \Sigma^{-1}(x-\mu)\rangle }{2}\right\rbrace$$
where $|\Sigma|$ is the determinant of the matrix $\Sigma$.
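For what it's worth, evaluating this density literally matches what `scipy.stats.multivariate_normal` computes; a quick sketch with an arbitrary mean and covariance (the helper name `gaussian_density` is made up for the example):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
d = 3

# Arbitrary mean and positive-definite covariance, for illustration only.
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)

def gaussian_density(x, mu, Sigma):
    """Evaluate g_{mu,Sigma}(x) exactly as written above."""
    d = mu.shape[0]
    diff = x - mu
    quad_form = diff @ np.linalg.solve(Sigma, diff)   # <x-mu, Sigma^{-1}(x-mu)>
    norm_const = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad_form) / norm_const

x = rng.normal(size=d)
print(gaussian_density(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should agree
```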
The point of this note is the following formula: no one remembers it and I always have to google it myself.
$$\mathrm{kl}(\mathscr{N}(\mu_1, \Sigma_1)\Vert \mathscr{N}(\mu_2, \Sigma_2)) = \frac{1}{2}\ln |\Sigma_2\Sigma_1^{-1}| - \frac{d}{2} + \frac{1}{2}\mathrm{trace}(\Sigma_1 \Sigma_2^{-1}) + \frac{1}{2}\langle \mu_2 - \mu_1, \Sigma_2^{-1}(\mu_2-\mu_1)\rangle.$$
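In code the formula is a few lines of linear algebra. Here is a minimal numpy sketch (the function name `kl_gaussians` and the random test parameters are my own choices), checked against a Monte Carlo estimate of $\mathbb{E}[\ln p(X)/q(X)]$ with $X \sim \mathscr{N}(\mu_1, \Sigma_1)$:

```python
import numpy as np
from scipy.stats import multivariate_normal

def kl_gaussians(mu1, Sigma1, mu2, Sigma2):
    """kl(N(mu1, Sigma1) || N(mu2, Sigma2)) via the closed form above."""
    d = mu1.shape[0]
    Sigma2_inv = np.linalg.inv(Sigma2)
    diff = mu2 - mu1
    return 0.5 * (
        np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1))  # ln |Sigma2 Sigma1^{-1}|
        - d
        + np.trace(Sigma1 @ Sigma2_inv)
        + diff @ Sigma2_inv @ diff
    )

rng = np.random.default_rng(2)
d = 4
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
A, B = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Sigma1, Sigma2 = A @ A.T + np.eye(d), B @ B.T + np.eye(d)

# Monte Carlo check: average of ln p(X) - ln q(X) with X ~ N(mu1, Sigma1).
P = multivariate_normal(mean=mu1, cov=Sigma1)
Q = multivariate_normal(mean=mu2, cov=Sigma2)
X = P.rvs(size=200_000, random_state=rng)
print(kl_gaussians(mu1, Sigma1, mu2, Sigma2))
print(np.mean(P.logpdf(X) - Q.logpdf(X)))  # should be close
```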
We'll write $p = g_{\mu_1, \Sigma_1}$ and $q = g_{\mu_2, \Sigma_2}$, so that
$$\mathrm{kl}(\mathscr{N}(\mu_1, \Sigma_1)\Vert \mathscr{N}(\mu_2, \Sigma_2)) = \mathbb{E}[\ln(p(X)/q(X))]$$
where $X \sim \mathscr{N}(\mu_1, \Sigma_1)$. From the definitions, $\ln p(x)/q(x)$ is equal to
$$\frac{\ln |\Sigma_2| - \ln |\Sigma_1|}{2} - \frac{1}{2}\langle x-\mu_1, \Sigma_1^{-1}(x-\mu_1)\rangle + \frac{1}{2}\langle x-\mu_2, \Sigma_2^{-1}(x-\mu_2)\rangle .$$
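This pointwise identity is easy to verify: evaluate both sides at a random point, using `scipy`'s `logpdf` for the left-hand side (parameters below are again arbitrary illustration choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
d = 3
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
A, B = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Sigma1, Sigma2 = A @ A.T + np.eye(d), B @ B.T + np.eye(d)

x = rng.normal(size=d)

# Left-hand side: ln p(x) - ln q(x).
lhs = (multivariate_normal(mu1, Sigma1).logpdf(x)
       - multivariate_normal(mu2, Sigma2).logpdf(x))

# Right-hand side: the three terms displayed above.
d1, d2 = x - mu1, x - mu2
rhs = (0.5 * (np.log(np.linalg.det(Sigma2)) - np.log(np.linalg.det(Sigma1)))
       - 0.5 * d1 @ np.linalg.solve(Sigma1, d1)
       + 0.5 * d2 @ np.linalg.solve(Sigma2, d2))
print(lhs, rhs)  # identical up to rounding
```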
We recall that for any vector $x\in\mathbb{R}^d$ and matrix $M$, we can write $\langle x, Mx\rangle = \mathrm{trace}(xx^\top M)$; moreover, we recall that
expectations can be swapped with linear maps, i.e. $\mathbb{E}[\ell(X)] = \ell(\mathbb{E}[X])$ for any linear map $\ell$ (such as the trace),
and that if $X \sim \mathscr{N}(\mu_1, \Sigma_1)$ then $\mathbb{E}[(X-\mu_1)(X-\mu_1)^\top] = \Sigma_1$ (both facts are checked numerically in the sketch below).
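A quick numerical sketch of the two facts just recalled, with arbitrary test data:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3

# <x, Mx> = trace(x x^T M) for an arbitrary vector and matrix.
x = rng.normal(size=d)
M = rng.normal(size=(d, d))
print(x @ M @ x, np.trace(np.outer(x, x) @ M))  # equal up to rounding

# Empirical second moment of X - mu1 for X ~ N(mu1, Sigma1) approaches Sigma1.
mu1 = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma1 = A @ A.T + np.eye(d)
X = rng.multivariate_normal(mu1, Sigma1, size=200_000)
print(np.round((X - mu1).T @ (X - mu1) / X.shape[0], 2))
print(np.round(Sigma1, 2))  # close to the empirical matrix above
```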
Consequently,
$$\begin{aligned}\mathbb{E}[\langle X-\mu_1, \Sigma_1^{-1}(X-\mu_1)\rangle] &= \mathbb{E}[\mathrm{trace}((X-\mu_1)(X-\mu_1)^\top \Sigma_1^{-1})] \\
&= \mathrm{trace}(\mathbb{E}[(X-\mu_1)(X-\mu_1)^\top] \Sigma_1^{-1})\\
&= \mathrm{trace}(\Sigma_1 \Sigma_1^{-1}) \\
&= d.\end{aligned}$$
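In other words, the average of this quadratic form is just the dimension; a quick Monte Carlo check (parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 5
mu1 = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma1 = A @ A.T + np.eye(d)

# Sample average of <X - mu1, Sigma1^{-1}(X - mu1)> over many draws.
X = rng.multivariate_normal(mu1, Sigma1, size=200_000)
D = X - mu1
quad_forms = np.einsum("ij,ij->i", D, np.linalg.solve(Sigma1, D.T).T)
print(quad_forms.mean())  # ≈ d = 5
```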
For the remaining quadratic term in the expression of $\ln p(x)/q(x)$ above, since $X-\mu_1$ is centered the cross terms vanish when expanding $(X-\mu_2)(X-\mu_2)^\top = \big((X-\mu_1) + (\mu_1-\mu_2)\big)\big((X-\mu_1) + (\mu_1-\mu_2)\big)^\top$, so we note that $\mathbb{E}[(X-\mu_2)(X-\mu_2)^\top] = \Sigma_1 + (\mu_2 - \mu_1)(\mu_2-\mu_1)^\top$, and hence
$$\begin{aligned}\mathbb{E}[\langle X-\mu_2, \Sigma_2^{-1}(X-\mu_2)\rangle] &= \mathrm{trace}(\Sigma_1\Sigma_2^{-1}) + \langle \mu_2 - \mu_1, \Sigma_2^{-1}(\mu_2-\mu_1)\rangle.\end{aligned}$$
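The same Monte Carlo check works here, comparing the sample average of this quadratic form to $\mathrm{trace}(\Sigma_1\Sigma_2^{-1}) + \langle \mu_2-\mu_1, \Sigma_2^{-1}(\mu_2-\mu_1)\rangle$ (again with arbitrary test parameters):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 4
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
A, B = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Sigma1, Sigma2 = A @ A.T + np.eye(d), B @ B.T + np.eye(d)

# Sample average of <X - mu2, Sigma2^{-1}(X - mu2)> with X ~ N(mu1, Sigma1).
X = rng.multivariate_normal(mu1, Sigma1, size=200_000)
D = X - mu2
mc = np.einsum("ij,ij->i", D, np.linalg.solve(Sigma2, D.T).T).mean()

diff = mu2 - mu1
exact = np.trace(Sigma1 @ np.linalg.inv(Sigma2)) + diff @ np.linalg.solve(Sigma2, diff)
print(mc, exact)  # should be close
```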
Gathering everything into the expression of $\mathbb{E}[\ln p(X)/q(X)]$ above, we get exactly the formula announced at the beginning.