We model a real-valued output as corrupted by small random errors. Thus, we can talk about a dataset which is “close” to being consistent with a linear predictor.
Let $(\Omega , \mathcal{A} , \mathbfsf{P} )$ be a probability space. Let $x \in \R ^d$ and $e: \Omega \to \R ^n$. For $A \in \R ^{n \times d}$, define $y: \Omega \to \R ^n$ by $y = Ax + e$. We call $(x, A, e)$ a probabilistic errors linear model. We call $y$ the response vector, $A$ the model matrix and $e$ the error vector.
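As an illustration (the regressors $t_1, \dots, t_n$ and the coefficients $\beta_0, \beta_1$ here are hypothetical notation, not part of the definition above), simple linear regression fits this framework:
\[
y_i = \beta_0 + \beta_1 t_i + e_i, \qquad
A = \bmat{1 & t_1 \\ \vdots & \vdots \\ 1 & t_n}, \qquad
x = \bmat{\beta_0 \\ \beta_1},
\]
so that $y = Ax + e$ with $d = 2$.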
The most basic distributional assumptions for a probabilistic errors linear model pertain to the expectation and variance. Since $\E (y) = Ax + \E (e)$ and $\var(y) = \var(e)$, these assumptions can be given for $e$ or for $y$.
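To spell out the step, using only that $A$ and $x$ are non-random,
\[
\E (y) = \E (Ax + e) = Ax + \E (e), \qquad \var(y) = \var(Ax + e) = \var(e),
\]
since adding the constant vector $Ax$ shifts the mean but leaves the covariance unchanged.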
If $\E (e) = 0$ and $\var(e) = \sigma ^2I$ (equivalently, $\E (y) = Ax$ and $\var(y) = \sigma ^2I$) then we call $(x, A, e)$ a classical linear model with moment assumptions. Notice that the components of $e$ are assumed uncorrelated. We have $d + 1$ unknowns: the $d$ entries of $x$ and the scalar parameter $\sigma ^2$.
In this case $\E (y_i) = \transpose{a^i}x$, where $\transpose{a^i}$ denotes the $i$th row of $A$, so $x$ is called the mean parameter vector and $\sigma ^2$ is called the model variance. The model variance measures the variability inherent in the observations. Neither the mean nor the variance of the error depends on the model matrix $A$ or on the parameter vector $x$.
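Componentwise, the moment assumptions read
\[
\E (y_i) = \transpose{a^i}x, \qquad \var(y_i) = \sigma ^2, \qquad \operatorname{cov}(y_i, y_j) = 0 \quad (i \neq j),
\]
which makes explicit that every observation has the same variance and that distinct observations are uncorrelated.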
Consider the two-sample problem, in which we have two populations with (unknown) mean responses $\alpha _1, \alpha _2 \in \R $. We observe responses from each population with (perhaps unknown) common variance $\sigma ^2$, and we assume the errors are uncorrelated.
We define $y^1 = \alpha _1\mathbf{1} + e^1$ and $y^2 = \alpha _2\mathbf{1} + e^2$, where $\mathbf{1}$ denotes a vector of ones of the appropriate length. Stacking these, we obtain
\[
y = \bmat{y^1 \\ y^2} = \bmat{\alpha _1\mathbf{1} \\
\alpha _2\mathbf{1} } + \bmat{e^1 \\ e^2}.
\]
This is a probabilistic errors linear model $y = Ax + e$ with
\[
A = \transpose{\bmat{\bmat{1\\0} & \cdots & \bmat{1\\0} & \bmat{0\\1} & \cdots & \bmat{0\\1}}},
\qquad x = \bmat{\alpha _1 \\ \alpha _2}.
\]
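For concreteness, if (hypothetically) the first population contributes $n_1 = 2$ observations and the second $n_2 = 3$, then
\[
A = \bmat{1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1}, \qquad
Ax = \bmat{\alpha_1 \\ \alpha_1 \\ \alpha_2 \\ \alpha_2 \\ \alpha_2},
\]
so each entry of $\E (y) = Ax$ is the mean response of the population from which that observation was drawn.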