What if we use the principle of maximum likelihood to select the maximum likelihood normal density which factors according to a tree?
A maximum likelihood tree normal of a dataset in $\R ^d$ is a multivariate normal density that factors according to a tree and maximizes the likelihood of the dataset.
Let $D = (x^1, \dots , x^n)$ be a dataset in $\R ^d$. A normal density is a maximum likelihood tree normal of $D$ if and only if it is an optimal tree approximator of the empirical normal density of $D$.
First, let $f: \R ^d \to \R $ be a normal density.
First, express the log likelihood of $f$ on
a record $x^k$ by
\[
\log f(x^k) = - \frac{1}{2} (x^k - \mu )^{\tp} \inv{\Sigma }
(x^k - \mu ) - \frac{1}{2} \log \det \Sigma -
\frac{d}{2}\log 2\pi .
\]
Second, use the trace to rewrite the quadratic form
\[ -\frac{1}{2}\tr{(\Sigma ^{-1} (x^k - \mu ) (x^k - \mu )^{\tp})}. \]
Third, use these two, and the linearity of trace to express the average negative log likelihood by
\[ \begin{aligned} - \frac{1}{n} \sum_{k = 1}^{n} \log f(x^k) &= \frac{1}{2} \tr{(\Sigma ^{-1}(\frac{1}{n}\sum_{i = 1}^{n} (x^k - \mu )(x^k - \mu )^{\tp}))} + \frac{1}{2}\log\det\Sigma + \frac{d}{2}\log 2\pi . \end{aligned} \]
Fourth, use matrix calculus (or the derivation in Proposition 1 of Multivariate Normal Maximum Likelihood) to see that, for a minimizer of the negative average log likelihood, the mean must be $\frac{1}{n} \sum_{i = 1}^{n} x^k$.
Fifth, recognize the empirical covariance matrix $\frac{1}{n}\sum_{k = 1}^{n} (x^k - \mu )(x^k - \mu )^{\tp}$; denote it by $S$.
Sixth, change variables with $P = \Sigma ^{-1}$ and express
\[ \log{(\det{(\Sigma )})} = \log{(\det{(\inv{P})})} = \log{(\inv{(\det P)})} = -\log{(\det{(P)})} \]
Seventh, write the likelihood in simplified form (using circulant property of trace)
\[ \frac{1}{2} \tr{(SP)} - \frac{1}{2} \log \det P - \frac{d}{2}\log 2\pi \]
Eighth, drop the constant and prefactors:
\[ \tr{(SP)} - \log \det P \]
Ninth, if $g$ is a normal with then the tree density approximation objective is the same equivalent to
\[ d(g, f) = h(g, f) - h(f) \sim h(g, f) = - \int_{\R ^d} g\log f. \]
TODO: Extra, the let $g$ be normal and $f$ be normal. The tree normal approximation problem
\[ d(g, f) \sim h(g, f) = - \int_{R^d} g \log f. \]
The log of $f$ is\[ - \frac{1}{2} \tr{( \Sigma _f \int_{\R ^d} (x - \mu _f)(x - \mu _f)^{\tp} dx )} - \frac{1}{2} \log \det \Sigma _f^{-1} - \frac{d}{2} \log 2\pi \]
Since the set of optimal solutions for both optimization is contained in the set of normal densities which match on the mean we can assume that $\mu _f = \mu _g$. So the approximation objective is equivalent to
\[ \tr{P_f \Sigma _g} - \log \det P_f, \]
which is exactly the maximum likelihood objective.Thus, a solution is a maximum likelihood tree normal of a dataset if and only if it is an optimal tree approximator of the empirical normal density of the dataset.