Needs:
Cross Entropy of Probability Distributions
Discrete Entropy
Similarity Functions
Needed by:
Differential Relative Entropy
Mutual Information
Tree Distribution Approximators

Relative Entropy


Consider two distributions on the same finite set. The entropy of the first distribution relative to the second distribution is the cross entropy of the first relative to the second, minus the entropy of the second. We call this the relative entropy of the first distribution relative to the second distribution. People also call the relative entropy the Kullback-Leibler divergence or KL divergence.


Let $A$ be a non-empty finite set. Let $p: A \to \R $ and $q: A \to \R $ be distributions. Let $H(q, p)$ denote the cross entropy of $p$ relative to $q$ and let $H(q)$ denote the entropy of $q$. The entropy of $p$ relative to $q$ is $$H(q, p) - H(q) = \sum_{a \in A} q(a) \log \frac{q(a)}{p(a)},$$ with the usual conventions that $0 \log 0 = 0$ and that a term with $q(a) > 0$ and $p(a) = 0$ makes the sum $+\infty$. Herein, we denote the entropy of $p$ relative to $q$ by $d(q, p)$.
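The definition is easy to check numerically. Here is a minimal sketch in Python, assuming distributions on $A$ are represented as lists of probabilities; the function names are ours, chosen to mirror the notation above:

```python
import math

def entropy(q):
    # H(q) = -sum over a of q(a) log q(a), with the convention 0 log 0 = 0
    return -sum(qa * math.log(qa) for qa in q if qa > 0)

def cross_entropy(q, p):
    # H(q, p) = -sum over a of q(a) log p(a);
    # finite only if p(a) > 0 wherever q(a) > 0
    return -sum(qa * math.log(pa) for qa, pa in zip(q, p) if qa > 0)

def relative_entropy(q, p):
    # d(q, p) = H(q, p) - H(q) = sum over a of q(a) log(q(a) / p(a))
    return cross_entropy(q, p) - entropy(q)

q = [0.5, 0.5]
p = [0.9, 0.1]
print(relative_entropy(q, p))
```

For these two distributions the sum works out to $\tfrac12\log\tfrac{0.5}{0.9} + \tfrac12\log\tfrac{0.5}{0.1} = \log\tfrac{5}{3}$.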

A similarity function

The relative entropy is a similarity function between distributions.

Let $q$ and $p$ be distributions on the same set. Then $d(q, p) \geq 0$ with equality if and only if $p = q$.

So, $d$ has a few of the properties of a metric. However, $d$ is not a metric; for example, it is not symmetric.

There exist distributions $p: A \to \R $ and $q: A \to \R $ (with $A$ a non-empty finite set) such that $$d(q, p) \neq d(p, q).$$

Optimization perspective

A solution of the problem of finding a distribution $p: A \to \R $ to

\[ \text{minimize} \quad d(q, p), \]

is $p^\star = q$.
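This minimizer can be observed by brute force. A small Python sketch, restricted (for illustration only) to two-point distributions $p = (t/100, (100-t)/100)$ on a grid:

```python
import math

def d(q, p):
    # relative entropy d(q, p) = sum over a of q(a) log(q(a) / p(a))
    return sum(qa * math.log(qa / pa) for qa, pa in zip(q, p) if qa > 0)

q = (0.3, 0.7)
# sweep candidate distributions p = (t/100, (100 - t)/100) for t = 1, ..., 99
best_t = min(range(1, 100), key=lambda t: d(q, (t / 100, (100 - t) / 100)))
print(best_t)  # 30, i.e., the minimizing grid point is p = q
```

Since $d(q, p) \geq 0$ with equality exactly when $p = q$, the grid point $p = q$ is the unique minimizer among the candidates.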

Copyright © 2023 The Bourbaki Authors — All rights reserved — Version 13a6779cc