\(\DeclarePairedDelimiterX{\Set}[2]{\{}{\}}{#1 \nonscript\;\delimsize\vert\nonscript\; #2}\) \( \DeclarePairedDelimiter{\set}{\{}{\}}\) \( \DeclarePairedDelimiter{\parens}{\left(}{\right)}\) \(\DeclarePairedDelimiterX{\innerproduct}[1]{\langle}{\rangle}{#1}\) \(\newcommand{\ip}[1]{\innerproduct{#1}}\) \(\newcommand{\bmat}[1]{\left[\hspace{2.0pt}\begin{matrix}#1\end{matrix}\hspace{2.0pt}\right]}\) \(\newcommand{\barray}[1]{\left[\hspace{2.0pt}\begin{matrix}#1\end{matrix}\hspace{2.0pt}\right]}\) \(\newcommand{\mat}[1]{\begin{matrix}#1\end{matrix}}\) \(\newcommand{\pmat}[1]{\begin{pmatrix}#1\end{pmatrix}}\) \(\newcommand{\mathword}[1]{\mathop{\textup{#1}}}\)

Supervised Probabilistic Data Models


We want to discuss an inductor’s performance on consistent (but possibly incomplete) datasets.

We take two steps. First, we put a measure on the set of training sets and consider only high-measure subsets. Second, we consider predictors that perform well to within some tolerance.


Let $(X, \mathcal{X}, \mu)$ be a probability space and $(Y, \mathcal{Y})$ a measurable space. Let $f: X \to Y$ be measurable. We call the pair $((X, \mathcal{X}, \mu), f)$ a (supervised) probabilistic data model.

We interpret $\mu $ as the data-generating distribution or underlying distribution and $f$ as the correct labeling function. Many authors refer to a supervised probabilistic data model as the statistical learning (theory) framework.
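For example, take $X = [0, 1]$ with the Borel $\sigma$-algebra $\mathcal{X}$ and the uniform (Lebesgue) measure $\mu$, take $Y = \set{0, 1}$ with $\mathcal{Y} = 2^Y$, and take $f = \mathbf{1}_{[1/2, 1]}$, the indicator of the upper half of the interval. Then $((X, \mathcal{X}, \mu), f)$ is a supervised probabilistic data model: inputs are uniform on $[0, 1]$, and the correct label records whether an input is at least $1/2$. We use this instance as a running example.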

Probable datasets

For $n \in \mathbb{N}$, we put a measure on the set of size-$n$ datasets by using the product probability space $(X^n, \mathcal{X}^n, \mu^n)$. We interpret this as a model for a training set of independent and identically distributed inputs.
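In the running example, the event that every input of a size-$n$ dataset lies in $[0, 1/2)$, so that no input receives the label $1$, has measure

\[ \mu^n([0, 1/2)^n) = \mu([0, 1/2))^n = 2^{-n}. \]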

For $\delta \in (0, 1)$, a measurable $\mathcal{S} \subset X^n$ is $(1-\delta)$-representative if $\mu^n(\mathcal{S}) \geq 1 - \delta$. If $\mathcal{S}$ is $(1-\delta)$-representative for small $\delta$, we think of $\mathcal{S}$ as a set of “probable” or “reasonable” datasets. We call $\delta$ the confidence parameter.
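Continuing the example, the complementary set $\mathcal{S} = X^n \setminus [0, 1/2)^n$ of datasets in which at least one input receives the label $1$ satisfies $\mu^n(\mathcal{S}) = 1 - 2^{-n}$, so $\mathcal{S}$ is $(1-\delta)$-representative with confidence parameter $\delta = 2^{-n}$.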

Prediction error

The error of (measurable) $h: X \to Y$ (under $\mu $ and $f$) is

\[ \underset{\mu , f}{\mathword{error}}(h) = \mu (\Set*{x \in X}{h(x) \neq f(x)}). \]

We interpret this as the probability that the predictor mislabels a point. The accuracy of $h$ is $1 - \mathword{error}_{\mu , f}(h)$.
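In the running example, the hypothesis $h = \mathbf{1}_{[1/4, 1]}$ disagrees with $f = \mathbf{1}_{[1/2, 1]}$ exactly on $[1/4, 1/2)$, so

\[ \underset{\mu , f}{\mathword{error}}(h) = \mu([1/4, 1/2)) = 1/4, \]

and the accuracy of $h$ is $3/4$.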

Since $(g, h) \mapsto \mu(\Set*{x \in X}{g(x) \neq h(x)})$ is a pseudometric on the measurable functions from $X$ to $Y$ (and a metric once we identify functions that agree $\mu$-almost everywhere), we can talk about the error as the “distance” from the correct labeling function. Thus, for $\epsilon \in (0, 1)$, we say that a (measurable) $h: X \to Y$ $\epsilon$-approximates the correct labeling function $f$ if $\mathword{error}_{\mu , f}(h) \leq \epsilon$. Roughly speaking, if $\epsilon \ll 1$, the error of the hypothesis is “fairly small.” We call $\epsilon$ the accuracy parameter, since the accuracy of such a predictor is at least $1 - \epsilon$.
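To see the triangle inequality for this pseudometric, note that for measurable $g, h, k: X \to Y$,

\[ \Set*{x \in X}{g(x) \neq k(x)} \subset \Set*{x \in X}{g(x) \neq h(x)} \cup \Set*{x \in X}{h(x) \neq k(x)}, \]

since $g(x) = h(x)$ and $h(x) = k(x)$ together imply $g(x) = k(x)$; subadditivity of $\mu$ then bounds the distance from $g$ to $k$ by the sum of the distances through $h$. Two functions are at distance zero precisely when they agree $\mu$-almost everywhere, which is why the map is only a pseudometric on functions and a metric on almost-everywhere equivalence classes.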

Other terminology

A hypothesis class is a subset of the set of measurable functions from $X$ to $Y$. Other names for the error of a classifier include the generalization error, the risk, the true error, and the true loss.

Copyright © 2023 The Bourbaki Authors — All rights reserved — Version 13a6779cc