\(\DeclarePairedDelimiterX{\Set}[2]{\{}{\}}{#1 \nonscript\;\delimsize\vert\nonscript\; #2}\) \( \DeclarePairedDelimiter{\set}{\{}{\}}\) \( \DeclarePairedDelimiter{\parens}{\left(}{\right)}\) \(\DeclarePairedDelimiterX{\innerproduct}[1]{\langle}{\rangle}{#1}\) \(\newcommand{\ip}[1]{\innerproduct{#1}}\) \(\newcommand{\bmat}[1]{\left[\hspace{2.0pt}\begin{matrix}#1\end{matrix}\hspace{2.0pt}\right]}\) \(\newcommand{\barray}[1]{\left[\hspace{2.0pt}\begin{matrix}#1\end{matrix}\hspace{2.0pt}\right]}\) \(\newcommand{\mat}[1]{\begin{matrix}#1\end{matrix}}\) \(\newcommand{\pmat}[1]{\begin{pmatrix}#1\end{pmatrix}}\) \(\newcommand{\mathword}[1]{\mathop{\textup{#1}}}\)
Needs:
Maximum Likelihood Multivariate Normals
Matrix Trace
Tree Approximators of a Normal
Needed by:
None.
Links:
Sheet PDF
Graph PDF

Maximum Likelihood Tree Normals

Why

What if we use the principle of maximum likelihood to select the maximum likelihood normal density which factors according to a tree?

Definition

A maximum likelihood tree normal of a dataset in $\R ^d$ is a multivariate normal density that factors according to a tree and maximizes the likelihood of the dataset.

Results

Let $D = (x^1, \dots , x^n)$ be a dataset in $\R ^d$. A normal density is a maximum likelihood tree normal of $D$ if and only if it is an optimal tree approximator of the empirical normal density of $D$.

First, let $f: \R ^d \to \R $ be a normal density.

First, express the log likelihood of $f$ on a record $x^k$ by

\[ \log f(x^k) = - \frac{1}{2} (x^k - \mu )^{\tp} \inv{\Sigma } (x^k - \mu ) - \frac{1}{2} \log \det \Sigma - \frac{d}{2}\log 2\pi . \]

Second, use the trace to rewrite the quadratic form

\[ -\frac{1}{2}\tr{(\Sigma ^{-1} (x^k - \mu ) (x^k - \mu )^{\tp})}. \]

Third, use these two, and the linearity of trace to express the average negative log likelihood by

\[ \begin{aligned} - \frac{1}{n} \sum_{k = 1}^{n} \log f(x^k) &= \frac{1}{2} \tr{(\Sigma ^{-1}(\frac{1}{n}\sum_{i = 1}^{n} (x^k - \mu )(x^k - \mu )^{\tp}))} + \frac{1}{2}\log\det\Sigma + \frac{d}{2}\log 2\pi . \end{aligned} \]

Fourth, use matrix calculus (or the derivation in Proposition 1 of Multivariate Normal Maximum Likelihood) to see that, for a minimizer of the negative average log likelihood, the mean must be $\frac{1}{n} \sum_{i = 1}^{n} x^k$.

Fifth, recognize the empirical covariance matrix $\frac{1}{n}\sum_{k = 1}^{n} (x^k - \mu )(x^k - \mu )^{\tp}$; denote it by $S$.

Sixth, change variables with $P = \Sigma ^{-1}$ and express

\[ \log{(\det{(\Sigma )})} = \log{(\det{(\inv{P})})} = \log{(\inv{(\det P)})} = -\log{(\det{(P)})} \]

Seventh, write the likelihood in simplified form (using circulant property of trace)

\[ \frac{1}{2} \tr{(SP)} - \frac{1}{2} \log \det P - \frac{d}{2}\log 2\pi \]

Eighth, drop the constant and prefactors:

\[ \tr{(SP)} - \log \det P \]

Ninth, if $g$ is a normal with then the tree density approximation objective is the same equivalent to

\[ d(g, f) = h(g, f) - h(f) \sim h(g, f) = - \int_{\R ^d} g\log f. \]

TODO: Extra, the let $g$ be normal and $f$ be normal. The tree normal approximation problem

\[ d(g, f) \sim h(g, f) = - \int_{R^d} g \log f. \]

The log of $f$ is

\[ - \frac{1}{2} \tr{( \Sigma _f \int_{\R ^d} (x - \mu _f)(x - \mu _f)^{\tp} dx )} - \frac{1}{2} \log \det \Sigma _f^{-1} - \frac{d}{2} \log 2\pi \]

Since the set of optimal solutions for both optimization is contained in the set of normal densities which match on the mean we can assume that $\mu _f = \mu _g$. So the approximation objective is equivalent to

\[ \tr{P_f \Sigma _g} - \log \det P_f, \]

which is exactly the maximum likelihood objective.

Thus, a solution is a maximum likelihood tree normal of a dataset if and only if it is an optimal tree approximator of the empirical normal density of the dataset.

Copyright © 2023 The Bourbaki Authors — All rights reserved — Version 13a6779cc About Show the old page view