\(\DeclarePairedDelimiterX{\Set}[2]{\{}{\}}{#1 \nonscript\;\delimsize\vert\nonscript\; #2}\) \( \DeclarePairedDelimiter{\set}{\{}{\}}\) \( \DeclarePairedDelimiter{\parens}{\left(}{\right)}\) \(\DeclarePairedDelimiterX{\innerproduct}[1]{\langle}{\rangle}{#1}\) \(\newcommand{\ip}[1]{\innerproduct{#1}}\) \(\newcommand{\bmat}[1]{\left[\hspace{2.0pt}\begin{matrix}#1\end{matrix}\hspace{2.0pt}\right]}\) \(\newcommand{\barray}[1]{\left[\hspace{2.0pt}\begin{matrix}#1\end{matrix}\hspace{2.0pt}\right]}\) \(\newcommand{\mat}[1]{\begin{matrix}#1\end{matrix}}\) \(\newcommand{\pmat}[1]{\begin{pmatrix}#1\end{pmatrix}}\) \(\newcommand{\mathword}[1]{\mathop{\textup{#1}}}\)

Feature Maps


Linear predictors are simple, and we know how to select their parameters. Their main downside is that the relationship between inputs and outputs may not be linear.


A feature map (or regression function) for inputs $A$ is a mapping $\phi : A \to \R ^d$. In this setting, we call $a \in A$ the raw input record and we call $\phi (a)$ an embedding, feature embedding, or feature vector. We call the components of a feature vector the features. We call $\phi (A)$ the regression range.

A feature map is faithful if, whenever records $a_i$ and $a_j$ are in some sense “similar” in the set $A$, the embeddings $\phi (a_i)$ and $\phi (a_j)$ are close in the vector space $\R ^d$.

Since it is common for raw input records $a \in A$ to consist of many fields, it is usual to have several feature maps $\phi _i$ which operate component-wise on the fields of $a$. These are sometimes called basis functions, by analogy with real function approximators (see Real Function Approximators). We concatenate these field feature maps and commonly add a constant feature $1$. Since $\R ^d$ is a vector space, it is common to refer to it in this case as the feature space.
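
The concatenation of field feature maps can be sketched in a few lines of Python. The record layout (fields `area` and `kind`), the log scaling, and the one-hot encoding are assumptions chosen only for illustration; they are not prescribed by the text.

```python
import math

# Hypothetical raw input record: a dict with one numeric and one categorical field.

def phi_area(record):
    """Field feature map for the assumed 'area' field: log-scaled area."""
    return [math.log(record["area"])]

def phi_kind(record):
    """Field feature map for the assumed 'kind' field: one-hot encoding."""
    kinds = ["house", "flat", "studio"]
    return [1.0 if record["kind"] == k else 0.0 for k in kinds]

def phi(record):
    """Concatenate the constant feature 1 with the field feature maps."""
    return [1.0] + phi_area(record) + phi_kind(record)

record = {"area": 100.0, "kind": "flat"}
features = phi(record)  # a feature vector in R^5
```

Here $d = 5$: one constant feature, one feature from the numeric field, and three from the categorical field.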

Given a dataset $a = (a^1, \dots , a^n)$ in $A$ and a feature map $\phi : A \to \R ^d$, the embedded dataset of $a$ with respect to $\phi $ is the dataset $(\phi (a^1), \dots , \phi (a^n))$ in $\R ^d$.
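
As a minimal sketch of this definition, the following Python applies a feature map to every record of a dataset. The feature map $a \mapsto (1, a, a^2)$ on real inputs and the helper name `embed_dataset` are hypothetical choices for the example.

```python
def phi(a):
    """Hypothetical feature map from real inputs into R^3: (1, a, a^2)."""
    return (1.0, a, a * a)

def embed_dataset(dataset, feature_map):
    """Return the embedded dataset (phi(a^1), ..., phi(a^n))."""
    return [feature_map(a) for a in dataset]

dataset = [0.0, 1.0, 2.0]
embedded = embed_dataset(dataset, phi)
```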

Featurized consistency: a route around $X \neq \R ^d$

Recall that a dataset is parametrically consistent with the family $\set{h_{\theta }: X \to Y}_{\theta }$ if there exists $\theta ^{\star}$ so that the dataset is consistent with $h_{\theta ^{\star}}$. We saw how to pick $\theta $ if we use a linear model with a squared loss (see Least Squares Linear Regressors).

Let $\mathcal{G} = \set{g_{\theta }: \R ^d \to \R }_{\theta }$. A dataset is featurized parametrically consistent with respect to the family $\mathcal{G} $ and the feature map $\phi : X \to \R ^d$ if it is parametrically consistent with respect to $\mathcal{G} \circ \phi = \Set*{g \circ \phi }{g \in \mathcal{G} }$.
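
As a worked instance, suppose (as an assumption for illustration) that $\mathcal{G}$ is the family of linear predictors $g_{\theta }(z) = \theta ^\top z$ with $\theta \in \R ^d$. Writing $\phi (x)_i$ for the $i$th component of $\phi (x)$, the members of $\mathcal{G} \circ \phi $ are

$$
(g_{\theta } \circ \phi )(x) = \theta ^\top \phi (x) = \sum_{i = 1}^{d} \theta _i \, \phi (x)_i .
$$

If we further assume that a dataset $((x^1, y^1), \dots , (x^n, y^n))$ is consistent with a predictor when the predictor reproduces every output exactly, then featurized parametric consistency amounts to the existence of $\theta ^{\star} \in \R ^d$ with

$$
(\theta ^{\star})^\top \phi (x^k) = y^k \quad \text{for } k = 1, \dots , n .
$$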

The interpretation is that we have transformed the problem of selecting a predictor on an arbitrary space $X$ to the problem of selecting a predictor on the space $\R ^d$. In so doing, we can continue to use simple predictors, such as those that are linear and minimize the squared error on the dataset.[1]
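
The following Python sketch illustrates this transformation on a toy problem: the outputs depend quadratically on a real input $x$, yet a linear least squares predictor on the embedded dataset recovers them. The feature map $x \mapsto (1, x, x^2)$, the data, and the helper names are assumptions made for the example, and the sketch solves the normal equations directly, which presumes the embedded dataset has linearly independent columns.

```python
def phi(x):
    """Hypothetical feature map on real inputs: (1, x, x^2)."""
    return [1.0, x, x * x]

def solve(A, b):
    """Solve the square system A t = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    t = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * t[j] for j in range(i + 1, n))
        t[i] = (M[i][n] - s) / M[i][i]
    return t

def least_squares(xs, ys, feature_map):
    """Minimize the sum of squared errors of theta . phi(x) via the normal equations."""
    Z = [feature_map(x) for x in xs]  # embedded dataset
    d = len(Z[0])
    # Normal equations: (Z^T Z) theta = Z^T y
    gram = [[sum(z[i] * z[j] for z in Z) for j in range(d)] for i in range(d)]
    rhs = [sum(z[i] * y for z, y in zip(Z, ys)) for i in range(d)]
    return solve(gram, rhs)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1.0 + 2.0 * x - 3.0 * x * x for x in xs]  # not linear in x
theta = least_squares(xs, ys, phi)

def predict(x):
    """Featurized linear predictor: theta . phi(x)."""
    return sum(t * f for t, f in zip(theta, phi(x)))
```

Because the toy outputs are exactly quadratic in $x$, the recovered parameters are close to $(1, 2, -3)$ up to floating-point error, so this dataset is featurized parametrically consistent with respect to the linear family and this feature map.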

In other words, we have “shifted emphasis” from the model function $h: X \to \R $ to the regression function from $\R ^d \to \R $. If we know the features and the input $x$, then we know the regression vector $\phi (x)$. The regression range is the set $\Set*{\phi (x)}{x \in X}$. In this case linearity pertains to the parameters $\theta \in \R ^d$ instead of the inputs (or experimental conditions) $x \in X$.
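
A small numerical check can make the last point concrete. In this sketch the feature map $x \mapsto (1, x, x^2)$ and the parameter values are hypothetical; the predictor $h(\theta , x) = \theta ^\top \phi (x)$ is additive in $\theta $ but not in $x$.

```python
def phi(x):
    """Hypothetical feature map on real inputs: (1, x, x^2)."""
    return (1.0, x, x * x)

def h(theta, x):
    """Featurized predictor: the inner product of theta and phi(x)."""
    return sum(t * f for t, f in zip(theta, phi(x)))

theta_a = (1.0, 2.0, 3.0)
theta_b = (0.5, -1.0, 4.0)
theta_sum = tuple(a + b for a, b in zip(theta_a, theta_b))

x = 2.0
# Additivity in the parameters holds for a fixed input x.
linear_in_theta = h(theta_sum, x) == h(theta_a, x) + h(theta_b, x)
# Additivity in the inputs fails for fixed parameters.
linear_in_x = h(theta_a, 1.0 + 2.0) == h(theta_a, 1.0) + h(theta_a, 2.0)
```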

  1. Future editions are likely to modify this section.
Copyright © 2023 The Bourbaki Authors. All rights reserved. Version 13a6779cc