\(\DeclarePairedDelimiterX{\Set}[2]{\{}{\}}{#1 \nonscript\;\delimsize\vert\nonscript\; #2}\) \( \DeclarePairedDelimiter{\set}{\{}{\}}\) \( \DeclarePairedDelimiter{\parens}{\left(}{\right)}\) \(\DeclarePairedDelimiterX{\innerproduct}[1]{\langle}{\rangle}{#1}\) \(\newcommand{\ip}[1]{\innerproduct{#1}}\) \(\newcommand{\bmat}[1]{\left[\hspace{2.0pt}\begin{matrix}#1\end{matrix}\hspace{2.0pt}\right]}\) \(\newcommand{\barray}[1]{\left[\hspace{2.0pt}\begin{matrix}#1\end{matrix}\hspace{2.0pt}\right]}\) \(\newcommand{\mat}[1]{\begin{matrix}#1\end{matrix}}\) \(\newcommand{\pmat}[1]{\begin{pmatrix}#1\end{pmatrix}}\) \(\newcommand{\mathword}[1]{\mathop{\textup{#1}}}\)

Supervised Probabilistic Data Models


We want to discuss an inductor’s performance on consistent (but possibly incomplete) datasets.

We take two steps. First, we put a measure on the set of training sets and consider only high-measure subsets. Second, we consider predictors that perform well to within some tolerance.


Let $(X, \mathcal{X}, \mu)$ be a probability space and $(Y, \mathcal{Y})$ a measurable space. Let $f: X \to Y$ be measurable. We call the pair $((X, \mathcal{X}, \mu), f)$ a (supervised) probabilistic data model.

We interpret $\mu $ as the data-generating distribution or underlying distribution and $f$ as the correct labeling function. Many authors refer to a supervised probabilistic data model as the statistical learning (theory) framework.
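For example, take $X = [0, 1]$ with the Borel $\sigma$-algebra $\mathcal{X}$ and the uniform (Lebesgue) measure $\mu$, take $Y = \set{0, 1}$ with $\mathcal{Y} = 2^Y$, and take $f = \mathbf{1}_{[1/2, 1]}$, the indicator of the upper half of the interval. Then $((X, \mathcal{X}, \mu), f)$ is a supervised probabilistic data model: inputs are uniform on $[0, 1]$, and the correct label records whether an input is at least $1/2$. We use this instance as a running example.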

Probable datasets

For $n \in \mathbb{N}$, we put a measure on the set of size-$n$ datasets by using the product probability space $(X^n, \mathcal{X}^n, \mu^n)$. We interpret this as a model for a training set of independent and identically distributed inputs.
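In the running example, the event that every input of a size-$n$ dataset lies in $[0, 1/2)$, so that no input receives the label $1$, has measure

\[ \mu^n([0, 1/2)^n) = \mu([0, 1/2))^n = 2^{-n}. \]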

For $\delta \in (0, 1)$, a measurable $\mathcal{S} \subset X^n$ is $(1-\delta)$-representative if $\mu^n(\mathcal{S}) \geq 1 - \delta$. If $\mathcal{S}$ is $(1-\delta)$-representative for small $\delta$, we think of $\mathcal{S}$ as a set of “probable” or “reasonable” datasets. We call $\delta$ the confidence parameter.
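Continuing the example, the complementary set $\mathcal{S} = X^n \setminus [0, 1/2)^n$ of datasets in which at least one input receives the label $1$ satisfies $\mu^n(\mathcal{S}) = 1 - 2^{-n}$, so $\mathcal{S}$ is $(1-\delta)$-representative with confidence parameter $\delta = 2^{-n}$.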

Prediction error

The error of (measurable) $h: X \to Y$ (under $\mu $ and $f$) is

\[ \underset{\mu , f}{\mathword{error}}(h) = \mu (\Set*{x \in X}{h(x) \neq f(x)}). \]

We interpret this as the probability that the predictor mislabels a point. The accuracy of $h$ is $1 - \mathword{error}_{\mu , f}(h)$.
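In the running example, the hypothesis $h = \mathbf{1}_{[1/4, 1]}$ disagrees with $f = \mathbf{1}_{[1/2, 1]}$ exactly on $[1/4, 1/2)$, so

\[ \underset{\mu , f}{\mathword{error}}(h) = \mu([1/4, 1/2)) = 1/4, \]

and the accuracy of $h$ is $3/4$.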

Since $(g, h) \mapsto \mu(\Set*{x \in X}{g(x) \neq h(x)})$ is a pseudometric on the measurable functions from $X$ to $Y$ (and a metric once we identify functions that agree $\mu$-almost everywhere), we can talk about the error as the “distance” from the correct labeling function. Thus, for $\epsilon \in (0, 1)$, we say that a (measurable) $h: X \to Y$ $\epsilon$-approximates the correct labeling function $f$ if $\mathword{error}_{\mu , f}(h) \leq \epsilon$. Roughly speaking, if $\epsilon \ll 1$, the error of the hypothesis is “fairly small.” We call $\epsilon$ the accuracy parameter, since the accuracy of such a predictor is at least $1 - \epsilon$.
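To see the triangle inequality for this pseudometric, note that for measurable $g, h, k: X \to Y$,

\[ \Set*{x \in X}{g(x) \neq k(x)} \subset \Set*{x \in X}{g(x) \neq h(x)} \cup \Set*{x \in X}{h(x) \neq k(x)}, \]

since $g(x) = h(x)$ and $h(x) = k(x)$ together imply $g(x) = k(x)$; subadditivity of $\mu$ then bounds the distance from $g$ to $k$ by the sum of the distances through $h$. Two functions are at distance zero precisely when they agree $\mu$-almost everywhere, which is why the map is only a pseudometric on functions and a metric on almost-everywhere equivalence classes.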

Other terminology

A hypothesis class is a subset of the set of measurable functions from $X$ to $Y$. Other names for the error of a classifier include the generalization error, the risk, the true error, and the true loss.

Copyright © 2023 The Bourbaki Authors — All rights reserved — Version 13a6779cc