Which tree is optimal for tree distribution approximation?
We want to choose a tree whose corresponding approximator for the given distribution minimizes the relative entropy with the given distribution among all tree distribution approximators. We call such a distribution an optimal tree approximator of the given distribution. We call a tree according to which an optimal tree approximator factors and optimal approximator tree.
Let $A_1, \dots , A_n$ be finite nonempty sets. Define $A = \prod_{i = 1}^{n} A_i$. Let $q: A \to [0, 1]$ a distribution. A tree $T$ on $\set{1, \dots , n}$ is an optimal approximator tree if and only if it is a maximal spanning tree of the mutual information graph of $q$.
First, denote the optimal tree distribution
approximator of $q$ for tree $T$ by $p^*_T$.
Express
\[
p^*_T = q_1 \prod_{i \neq 1} q_{i \mid \pa_i}
\]
Second, express $d(q, p) = H(q, p) - H(q)$. Since $H(q)$ does not depend on $T$, $p^*_T$ is a minimizer (w.r.t. $T$) of $d(q, p^*_T)$ if and only if it is a minimizer of $H(q, p^*_T)$.
Third, express the cross entropy of $p^*_T$
relative to $q$ as
\[
\begin{aligned}
H(q, p^*_T) &= H(q_1) - \sum_{i \neq 1} \sum_{a \in A}
q(a)\log q_{i \mid pa{i}}(a_i, a_{\pa_i}) \\
&= H(q_1) - \sum_{i \neq 1} \sum_{a \in A} q(a)(\log
q_{i,\pa{i}}(a_i, a_{\pa_i}) - \log q_{\pa{i}}(a_{\pa{i}})) \\
&= H(q_1) - \sum_{i \neq 1} \sum_{a \in A} q(a)(\log
q_{i,\pa_i}(a_i, a_{\pa_i}) - \log q_{\pa_i}(a_{\pa_i}) - \log
q_i(a_i) + \log q_{i}(a_i)) \\
&= \sum_{i = 1}^{n} H(q_i) - \sum_{i \neq 1} I(q_i,
q_{\pa_i}) \\
&= \sum_{i = 1}^{n} H(q_i) - \sum_{\set{i,j} \in T} I(q_i,
q_{j}) \\
\end{aligned}
\]
The foregoing statement says that to we should first select a maximum spanning tree of the mutual information graph of the distribution we are approximating. Then, we should pick the best approximator to $q$ which factors according to that tree.