We want to talk about an algorithm which can learn a hypothesis class, regardless of the underlying distribution.
Let $(X, \mathcal{X} )$ and $(Y, \mathcal{Y} )$ be measurable input and output spaces.
A hypothesis class $\mathcal{H} $ of measurable
functions from $X$ to $Y$ is
probably approximately correct
learnable (or PAC
learnable) if
(a) there exists an inductor $\mathcal{A} : \bigcup_{m \in \N} (X \times Y)^m \to \mathcal{H} $ so that (b) for every $\epsilon , \delta \in (0, 1)$ (c) there exists $m_0 \in \N $ so that (d) for every probability measure $\mu $ on $(X, \mathcal{X} )$ and every measurable labeling function $f: X \to Y$, and (e) for all $m \geq m_0$,
\[
\mu ^m\left[
\mu \left[
\mathcal{A} \big((x_i, f(x_i))_{i = 1}^{m}\big)(\xi ) \neq f(\xi )
\right]
\leq \epsilon \right] \geq 1-\delta .
\]
Here the outer probability $\mu ^m$ is taken over the i.i.d. sample $(x_1, \ldots , x_m)$ and the inner probability $\mu $ is taken over a fresh test point $\xi $. We interpret this as follows: “no matter the underlying distribution and correct labeling function, if someone specifies an accuracy and a confidence, we can tell them the number of samples they need so that the inductor outputs a hypothesis which is probably approximately correct.”
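As a quick illustration (a standard fact, stated here without proof, and assuming in addition that $\mathcal{H} $ is finite and contains the labeling function $f$), the inductor that returns any hypothesis consistent with the sample witnesses PAC learnability, and one may take
\[
m_0(\epsilon , \delta ) = \left\lceil \frac{1}{\epsilon } \log \frac{|\mathcal{H} |}{\delta } \right\rceil ,
\]
since any $h \in \mathcal{H} $ with error greater than $\epsilon $ remains consistent with $m$ independent samples with probability at most $(1-\epsilon )^m \leq e^{-\epsilon m}$, and a union bound over $\mathcal{H} $ bounds the failure probability by $|\mathcal{H} | e^{-\epsilon m} \leq \delta $.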
Some authors require that the hypothesis class be realizable with respect to the underlying distribution and labeling function; this is natural because, if the hypothesis class includes the correct labeling function $f$, then it is realizable. In that case they refer to the definition above as the agnostic PAC model. We emphasize that their definition builds in the notion of realizability, while the definition above does not.

We emphasize again that this definition contains two “approximation parameters.” The accuracy parameter $\epsilon $ corresponds to the “approximately correct” piece and the confidence parameter $\delta $ corresponds to the “probably” piece.
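In the notation above, the realizability assumption can be written as the existence of a hypothesis in $\mathcal{H} $ with zero error under $\mu $:
\[
\exists \, h^{*} \in \mathcal{H} \quad \text{such that} \quad \mu \left[ h^{*}(\xi ) \neq f(\xi ) \right] = 0 .
\]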
Note that the existence of such an $m_0$ for each $\epsilon $ and $\delta $ is equivalent to requiring that there exist a function $m_0: (0, 1)^2 \to \N $ so that for all $m \geq m_0(\epsilon ,\delta )$ the condition in the equation above holds. There may be many such functions, so we define $\tilde{m}: (0, 1)^2 \to \N $ by letting $\tilde{m}(\epsilon ,\delta )$ be the smallest integer $m_0$ for which the condition holds for every $m \geq m_0$. We call $\tilde{m}$ the sample complexity. It is a function of $\epsilon $ and $\delta $, and it also depends on the hypothesis class $\mathcal{H} $ and the inductor $\mathcal{A} $.
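In symbols, the prose definition above amounts to
\[
\tilde{m}(\epsilon ,\delta )
= \min \left\{ m_0 \in \N \;:\;
\text{for every } \mu \text{ and } f \text{ and every } m \geq m_0, \;
\mu ^m\left[
\mu \left[
\mathcal{A} \big((x_i, f(x_i))_{i = 1}^{m}\big)(\xi ) \neq f(\xi )
\right]
\leq \epsilon \right] \geq 1-\delta
\right\} .
\]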