Needs:
Supervised Probabilistic Data Models
Needed by:
None.

Empirical Error Minimizers

Why

A natural and (conceptually) simple inductor selects a predictor which achieves zero error on the training dataset.

Definition

Let $(X, \mathcal{X}, \mu)$ and $f: X \to Y$ be a supervised probabilistic data model. For a dataset $(x_1, y_1), \dots, (x_n, y_n)$ in $X \times Y$, the empirical error of a predictor $h: X \to Y$ is

\[ \frac{1}{n} \left\lvert \{\, i \in \{1, \dots, n\} \mid h(x_i) \neq y_i \,\} \right\rvert. \]

An empirical error minimizer is a predictor whose empirical error is minimal. Since we assume that the dataset is consistent (i.e., $y_i = f(x_i)$ for each $i$), the correct labeling function $f$ is always an empirical error minimizer, and achieves empirical error zero. If the dataset is incomplete, however, there may be other empirical error minimizers.
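To make the definition concrete, here is a minimal Python sketch; the names empirical_error and dataset are illustrative and not part of the definition above.

def empirical_error(h, dataset):
    """Fraction of labeled examples (x, y) in the dataset with h(x) != y."""
    return sum(1 for x, y in dataset if h(x) != y) / len(dataset)

# Example: a threshold predictor on X = [0, 1] with Y = {0, 1}.
dataset = [(0.1, 0), (0.4, 0), (0.7, 1), (0.9, 1)]
h = lambda x: 1 if x > 0.5 else 0
print(empirical_error(h, dataset))  # 0.0: h is an empirical error minimizer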

A family of inductors $(A^n)_n$ is an empirical error minimizer if, for each $n \in \mathbb{N}$ and each dataset $D = ((x_i, y_i))_{i = 1}^{n}$, the predictor $A^n(D)$ is an empirical error minimizer for $D$.
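When the hypothesis class is finite, an empirical error minimizer can be selected by direct enumeration. A sketch continuing the code above; the finiteness of the class is our simplifying assumption, not part of the definition.

def erm(hypotheses, dataset):
    """Return a hypothesis of minimal empirical error (ties broken arbitrarily)."""
    return min(hypotheses, key=lambda h: empirical_error(h, dataset))

Applied to every $n$ and every dataset of size $n$, erm defines a family of inductors $(A^n)_n$ in the sense above.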

Overfitting and inductive bias

Since a dataset may be incomplete, an empirical error minimizer may be a “poor” predictor. For example, let $A \subset X \subset \mathbb{R}^2$ and $Y = \{0, 1\}$. Define $f: X \to Y$ by $f(x) = 1$ if $x \in A$ and $f(x) = 0$ otherwise. Suppose $A \in \mathcal{X}$, $\mu(A) = 1/2$, and $\mu$ assigns measure zero to finite sets.

For any training dataset $(x_1, y_1), \dots, (x_n, y_n)$ in $X \times Y$, the hypothesis $h: X \to Y$ defined by $h(x) = y_i$ if $x = x_i$ and $h(x) = 0$ otherwise achieves zero empirical error, yet has “actual” error $\mu(A) = 1/2$, since it misclassifies all of $A$ except the finitely many training points. Such a predictor is said to be overfit, or to exhibit overfitting: it fits the training dataset “too well.”
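This memorizing predictor is easy to simulate. A sketch assuming $X = [0, 1]^2$ with $\mu$ uniform and $A = \{x : x_1 < 1/2\}$, so that finite sets indeed have measure zero:

import random

random.seed(0)

def f(x):      # correct labeling: the indicator of A = {x : x[0] < 1/2}
    return 1 if x[0] < 0.5 else 0

def sample():  # mu: uniform on the unit square
    return (random.random(), random.random())

train = [(x, f(x)) for x in (sample() for _ in range(20))]
memo = dict(train)

def h(x):      # the memorizer: y_i on training points, 0 elsewhere
    return memo.get(x, 0)

print(sum(h(x) != y for x, y in train) / len(train))  # 0.0 on the training set
test = [sample() for _ in range(10_000)]
print(sum(h(x) != f(x) for x in test) / len(test))    # ~ 0.5, i.e. mu(A)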

By constraining the hypothesis class one may “guard against” overfitting. In this context, the constraint on the hypothesis class is called an inductive bias.
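For instance, on the synthetic example above one can constrain the class to indicators of half-planes $\{x : x_1 < t\}$; this choice matches the sketch’s $A$ and is our assumption, not a general recipe. Continuing the previous code:

def half_plane(t):  # indicator of {x : x[0] < t}
    return lambda x: 1 if x[0] < t else 0

# The inductive bias: minimize empirical error over this one-parameter
# class only, rather than over all functions X -> Y.
candidates = [half_plane(t / 100) for t in range(101)]
best = min(candidates, key=lambda g: sum(g(x) != y for x, y in train))
print(sum(best(x) != f(x) for x in test) / len(test))  # small: the ERM generalizes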

Other terminology

Other terminology for the empirical error includes empirical risk. For this reason, the learning paradigm of selecting a predictor $h$ to minimize the empirical risk is called empirical risk minimization, or ERM.
