\(\DeclarePairedDelimiterX{\Set}[2]{\{}{\}}{#1 \nonscript\;\delimsize\vert\nonscript\; #2}\) \( \DeclarePairedDelimiter{\set}{\{}{\}}\) \( \DeclarePairedDelimiter{\parens}{\left(}{\right)}\) \(\DeclarePairedDelimiterX{\innerproduct}[1]{\langle}{\rangle}{#1}\) \(\newcommand{\ip}[1]{\innerproduct{#1}}\) \(\newcommand{\bmat}[1]{\left[\hspace{2.0pt}\begin{matrix}#1\end{matrix}\hspace{2.0pt}\right]}\) \(\newcommand{\barray}[1]{\left[\hspace{2.0pt}\begin{matrix}#1\end{matrix}\hspace{2.0pt}\right]}\) \(\newcommand{\mat}[1]{\begin{matrix}#1\end{matrix}}\) \(\newcommand{\pmat}[1]{\begin{pmatrix}#1\end{pmatrix}}\) \(\newcommand{\mathword}[1]{\mathop{\textup{#1}}}\)
Needs:
Tree Density Approximators
Needed by:
Tree Approximators of a Normal
Links:
Sheet PDF
Graph PDF

Optimal Tree Density Approximators

Why

Which is the optimal tree to use for tree density approximation?

Definition

We want to choose the tree whose corresponding approximator for the given density achieves minimum relative entropy with the given density among all tree density approximators. We call such a density an optimal tree approximator of the given density. We call a tree according to which an optimal tree approximator factors and optimal approximator tree.

Result

Let $g: \R ^n \to [0, 1]$ be a density. A tree $T$ on $\set{1, \dots , n}$ is an optimal approximator tree if and only if it is a maximal spanning tree of the differential mutual information graph of $q$.

First, denote the optimal approximator of $g$ for tree $T$ by $f^*_T$. Recall

\[ f^*_T = f_1 \prod_{i \neq 1} f_{i \mid \pa{i}} \]

Second, recall $d(g, f) = H(g, f) - H(g)$. Since $H(g)$ does not depend on $f$, $f$ is a minimizer of $d(g, f)$ if and only if it is a minimizer of $H(g, f)$.

Third, express the cross entropy of $f^*_T$ relative to $g$ as

\[ \begin{aligned} H(q, p^*_T) &= h(q_1) - \sum_{j \neq i} ( \int_{\R ^d} g(x) \log g_{i \mid \pa_i}(x_i, x_{\pa_i}) dx ) \\ &= H(q_1) - \sum_{i \neq 1} \sum_{a \in A} q(a) (\log q_{i,\pa_i}(a_i, a_{\pa_i}) - \log q_{\pa{i}}(a_{\pa_i})) \\ &= H(q_1) - \sum_{i \neq 1} \sum_{a \in A} q(a) (\log q_{i,\pa_i}(a_i, a_{\pa_i}) - \log q_{\pa_i}(a_{\pa_i}) - \log q_i(a_i) + \log q_{i}(a_i) ) \\ &= \sum_{i = 1}^{n} H(q_i) - \sum_{i \neq 1} I(q_i, q_{\pa_i}) \\ &= \sum_{i = 1}^{n} H(q_i) - \sum_{\set{i,j} \in T} I(q_i, q_{j}) \\ \end{aligned} \]

where $\pa_i$ denotes the parent of vertex $i$ in $T$ ($i = 2, \dots , n$). $H(g_i)$ does not depend on the choice of tree. Choosing a tree to minimize the second term in the final expression above is equivalent to choosing a maximal spanning tree from the weighted graph with differential mutual information edge weights; namely, the mutual information graph of $g$.

Copyright © 2023 The Bourbaki Authors — All rights reserved — Version 13a6779cc About Show the old page view