Loss functions¶

Loss functions quantify how well or how badly a model reproduces the values in the training set.

The appropriate loss function depends on the type of problem and the algorithm we use.

Let us denote by $\hat y$ the prediction of the model and by $y$ the true value.

Gaussian noise¶

Let's assume that the relationship between the features $X$ and the label $Y$ is given by

$$ Y = f(X) +\epsilon$$

where $f$ is the model whose parameters we want to fix and $\epsilon$ is some random noise with zero mean and variance $\sigma^2$.

The likelihood to measure $y$ for feature values $x$ is given by

$$L\sim \exp\left(-\frac{(y-f(x))^2}{2\sigma^2}\right) $$

If we have a set of examples $(x^{(i)}, y^{(i)})$ the likelihood becomes

$$ L\sim \prod_i \exp\left(-\frac{(y^{(i)}-f(x^{(i)}))^2}{2\sigma^2}\right) $$

Likelihood Derivation¶

Assuming that the noise $\epsilon$ follows a Gaussian distribution with zero mean and variance $\sigma^2$, the probability density function is given by

$$ p(\epsilon) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{\epsilon^2}{2 \sigma^2} \right) $$

Since we have $Y = f(X) + \epsilon$, we substitute $\epsilon = y - f(x)$:

$$ p(y|x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{(y - f(x))^2}{2 \sigma^2} \right) $$

The likelihood of observing $y$ for a given $x$ is proportional to

$$ L \sim \exp \left( -\frac{(y - f(x))^2}{2 \sigma^2} \right) $$

For multiple observations $\{x^{(i)}, y^{(i)}\}$, the joint likelihood is given by

$$ L \sim \prod_i \exp \left( -\frac{(y^{(i)} - f(x^{(i)}))^2}{2 \sigma^2} \right) $$

We now want to fix the parameters in $f$ such that we maximize the likelihood that our data was generated by the model.

It is more convenient to work with the log of the likelihood. Maximizing the likelihood is equivalent to minimising the negative log-likelihood.

$$ NLL = - \log(L) = \frac{1}{2\sigma^2} \sum_i \left(y^{(i)}-f(x^{(i)})\right)^2 + \text{const} $$

where the constant does not depend on the parameters of $f$.

Assuming Gaussian noise for the difference between the model and the data leads to the least-squares rule. Since the factor $1/(2\sigma^2)$ does not change the position of the minimum, we can use the squared error loss

$$ J(f) = \sum_i \left(y^{(i)}-f(x^{(i)})\right)^2 $$

to train our machine learning algorithm.
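As a quick numerical check of the argument above, the following sketch compares the squared error loss with the full Gaussian negative log-likelihood for two candidate sets of predictions. The function names, the toy data and the value $\sigma = 2$ are assumptions made for illustration only.

```python
import math

def squared_error(y_true, y_pred):
    """Sum of squared residuals J(f)."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred))

def gaussian_nll(y_true, y_pred, sigma):
    """Negative log-likelihood for independent Gaussian noise,
    including the normalisation constant dropped in the text."""
    n = len(y_true)
    const = 0.5 * n * math.log(2 * math.pi * sigma ** 2)
    return const + squared_error(y_true, y_pred) / (2 * sigma ** 2)

# Toy data (assumed for illustration) and two candidate predictions.
y = [1.0, 2.0, 3.0]
pred_a = [1.1, 1.9, 3.2]
pred_b = [0.5, 2.5, 3.5]
sigma = 2.0

# The difference in NLL equals the difference in J divided by 2*sigma^2,
# so both criteria rank the two candidates in the same order.
delta_nll = gaussian_nll(y, pred_b, sigma) - gaussian_nll(y, pred_a, sigma)
delta_j = squared_error(y, pred_b) - squared_error(y, pred_a)
print(delta_nll, delta_j / (2 * sigma ** 2))
```

The value of $\sigma$ rescales the loss but does not change which parameters minimise it, which is why it can be dropped from $J(f)$.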

Two class model¶

If we have two classes, we call one the positive class ($c=1$) and the other the negative class ($c=0$). If the probability of belonging to class 1 is

$$p(c=1) = p$$

we also have

$$ p(c=0)=1-p $$

The likelihood for a single measurement is $p$ if the outcome is in the positive class and $1-p$ if the outcome is in the negative class. For a set of measurements with outcomes $y_i$ the likelihood is given by

$$ L = \prod\limits_{y_i=1} p \prod\limits_{y_i=0} (1-p) $$

So the negative log-likelihood is:

$$ NLL = - \sum\limits_{y_i=1} \log(p) - \sum\limits_{y_i=0} \log(1-p) $$

Given that $y_i=0$ or $y_i=1$ we can rewrite it as

$$ NLL = - \sum_i \left( y_i \log(p) + (1-y_i)\log(1-p) \right) $$

So if we have a model for the probability, $\hat y_i = p(x_i)$, we can maximize the likelihood of the training data by minimizing

$$ J= - \sum_i \left[ y_i \log \left(\hat y_i\right) +(1-y_i)\log\left(1 - \hat y_i\right) \right] $$

This loss is called the cross-entropy.
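The cross-entropy can be evaluated directly from its definition. The sketch below is a minimal illustration; the function name, the clipping constant `eps` (used only to avoid `log(0)`) and the example probabilities are assumptions, not part of the lecture.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Negative log-likelihood of 0/1 labels given predicted P(c=1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0); eps is an assumption
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

labels = [1, 0, 1, 1]
mostly_right = [0.9, 0.1, 0.8, 0.95]   # high probability on the true class
mostly_wrong = [0.1, 0.9, 0.2, 0.05]   # high probability on the wrong class

print(binary_cross_entropy(labels, mostly_right))
print(binary_cross_entropy(labels, mostly_wrong))
```

Confident wrong predictions are penalised much more strongly than confident correct ones are rewarded, because $-\log(p)$ grows without bound as $p \to 0$.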

Regularisation¶

Regularisation is a technique used to prevent overfitting by discouraging overly complex models in Machine Learning. It adds a penalty to the loss function to reduce the magnitude of model parameters.

Overfitting and Regularisation¶

Overfitting occurs when a model learns the noise in the training data to the detriment of its performance on new data. Regularisation helps to mitigate overfitting by adding a complexity penalty to the loss function.

Overfitting vs Underfitting

Regularised Loss Function¶

We modify the loss function to include a penalty term:

\[ J_{\text{pen}}(X, y, \vec{w}) = J(X, y, \vec{w}) + \lambda \cdot \text{Penalty}(\vec{w}) \]

$ J(X, y, \vec{w}) $: Original loss function (e.g., Mean Squared Error).
$ \lambda $: Regularisation parameter controlling the strength of the penalty.
$ \text{Penalty}(\vec{w}) $: Function penalising large weights.

Error vs Regularisation

Small values of $ \lambda $ mean weak regularisation, large values of $ \lambda $ mean strong regularisation.

Types of Regularisation¶

L1 Regularisation (Lasso):
- Penalty: $ \lambda \sum_{i} |w_i| $
- Encourages sparsity (many weights become zero).
L2 Regularisation (Ridge):
- Penalty: $ \lambda \sum_{i} w_i^2 $
- Encourages smaller weights but doesn't force them to zero.
Elastic Net:
- Combination of L1 and L2 penalties.
- Penalty: $ \lambda_1 \sum_{i} |w_i| + \lambda_2 \sum_{i} w_i^2 $
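The three penalties listed above can be written as small functions. This is a sketch with assumed function names and an arbitrary example weight vector.

```python
def l1_penalty(w, lam):
    """Lasso penalty: lam * sum_i |w_i|."""
    return lam * sum(abs(wi) for wi in w)

def l2_penalty(w, lam):
    """Ridge penalty: lam * sum_i w_i^2."""
    return lam * sum(wi ** 2 for wi in w)

def elastic_net_penalty(w, lam1, lam2):
    """Elastic net: L1 and L2 penalties with separate strengths."""
    return l1_penalty(w, lam1) + l2_penalty(w, lam2)

w = [0.5, -2.0, 0.0, 3.0]  # example weights, chosen arbitrarily
print(l1_penalty(w, 1.0), l2_penalty(w, 1.0), elastic_net_penalty(w, 1.0, 0.1))
```

The L2 penalty is dominated by the largest weights (here $3^2 = 9$ out of $13.25$), while the L1 penalty grows linearly with each weight.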

1D Regression Example¶

We now look at a one-dimensional example. Suppose we have the relationship:

\[ y = 7 - 8x - \frac{1}{2} x^2 + \frac{1}{2} x^3 + \epsilon \]

$ \epsilon $: Gaussian noise with mean 0 and unit variance.

1D Data Scatter Plot

Polynomial Fitting¶

We fit the data using polynomials of different orders $ k $:

\[ p_w(x) = \sum_{i=0}^{k} w_i x^i \]

The loss function to minimise is the sum of squared errors, which is the Mean Squared Error (MSE) up to a constant factor $1/N$:

\[ J(x, y, \vec{w}) = \sum_{i} \left( p_w(x^{(i)}) - y^{(i)} \right)^2 \]
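A minimal sketch of such a fit with NumPy is given here. The number of points, the range $[-3, 3]$ and the random seed are assumptions, since the notes do not state how the data were sampled; `fit_polynomial` and `squared_error` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def true_function(x):
    return 7 - 8 * x - 0.5 * x**2 + 0.5 * x**3

# Assumed sampling: 50 evenly spaced points on [-3, 3], unit-variance noise.
x = np.linspace(-3, 3, 50)
y = true_function(x) + rng.normal(0.0, 1.0, size=x.shape)

def fit_polynomial(x, y, order):
    """Least-squares weights w_0..w_k minimising sum_i (p_w(x_i) - y_i)^2."""
    design = np.vander(x, order + 1, increasing=True)  # columns x^0 ... x^k
    w, *_ = np.linalg.lstsq(design, y, rcond=None)
    return w

def squared_error(x, y, w):
    """Loss J(x, y, w) for a given weight vector."""
    design = np.vander(x, len(w), increasing=True)
    return float(np.sum((design @ w - y) ** 2))

# Training loss for the three orders discussed in the following sections.
losses = {k: squared_error(x, y, fit_polynomial(x, y, k)) for k in (2, 3, 10)}
print(losses)
```

The training loss can only decrease as the order grows, because each lower-order polynomial is a special case of the higher-order one. A low training loss is therefore not evidence of a good model on its own.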

Polynomial Fits of Different Orders

Third Order Polynomial¶

The third-order polynomial fits the data well and recovers coefficients close to the true values:

Third Order Coefficients

Term	True Coefficient	Estimated Coefficient
$ w_0 $	7	Approximate value
$ w_1 $	-8	Approximate value
$ w_2 $	-0.5	Approximate value
$ w_3 $	0.5	Approximate value

Second Order Polynomial¶

The second-order polynomial provides a reasonable fit but misses the $ x^3 $ term:

Second Order Coefficients

This results in higher bias but potentially lower variance.

Tenth Order Polynomial¶

The tenth-order polynomial overfits the data:

Tenth Order Coefficients

Coefficients have very large magnitudes.
Model captures noise in the data.
Large cancellations between coefficients indicate overfitting.

Applying Regularisation¶

We modify the loss function to include the regularisation term:

\[ J_{\text{pen}}(x, y, \vec{w}, \lambda) = J(x, y, \vec{w}) + \lambda \sum_{i=0}^{k} w_i^2 \]

$ \lambda $: Regularisation strength parameter.

This is known as Ridge Regression (L2 Regularisation).
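The penalised loss above is still quadratic in the weights, so it can be minimised with linear algebra. The sketch below appends $\sqrt{\lambda}$ times the identity matrix to the design matrix and solves an ordinary least-squares problem, which is equivalent to solving $(A^T A + \lambda I)\vec{w} = A^T y$ but numerically better behaved. The data sampling, the seed and the two values of $\lambda$ are assumptions for illustration. As in the formula above, $w_0$ is penalised too.

```python
import numpy as np

def fit_ridge_polynomial(x, y, order, lam):
    """Minimise sum_i (p_w(x_i) - y_i)^2 + lam * sum_j w_j^2."""
    design = np.vander(x, order + 1, increasing=True)
    n_weights = order + 1
    # Least squares on the augmented system minimises
    # ||design @ w - y||^2 + lam * ||w||^2.
    augmented_design = np.vstack([design, np.sqrt(lam) * np.eye(n_weights)])
    augmented_target = np.concatenate([y, np.zeros(n_weights)])
    w, *_ = np.linalg.lstsq(augmented_design, augmented_target, rcond=None)
    return w

# Assumed sampling: 30 points on [-3, 3] with unit-variance Gaussian noise.
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 30)
y = 7 - 8 * x - 0.5 * x**2 + 0.5 * x**3 + rng.normal(0.0, 1.0, size=x.shape)

w_weak = fit_ridge_polynomial(x, y, order=10, lam=1e-6)
w_strong = fit_ridge_polynomial(x, y, order=10, lam=10.0)
print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
```

Increasing $\lambda$ shrinks the norm of the weight vector, which is the behaviour described in the next section.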

Effect of Regularisation¶

With regularisation, the tenth-order polynomial coefficients have smaller magnitudes:

Regularised Tenth Order Coefficients

The model generalises better and is less sensitive to noise.

Regularised 10th Order Polynomial Fit

Bias and Variance¶

In machine learning, the terms bias and variance help us understand the types of errors that can arise during model training and prediction.

Bias measures how well the model approximates the true relationship between features and the label. It is defined as the difference between the expected prediction of the model and the actual value we aim to predict.

High bias indicates an overly simplistic model that does not capture the complexity of the data well, leading to underfitting.

Variance, on the other hand, measures the sensitivity of the model to variations in the training data. A model with high variance pays too much attention to the training data and may not generalize well to unseen data, leading to overfitting.

The Bias-Variance Tradeoff describes the challenge of finding a balance between bias and variance to minimize the overall error of the model.

Model error can be decomposed as:

$$ \text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} $$

Bias is the error introduced by approximating a real-world problem, which may be complex, with a simplified model. Variance is the error introduced by excessive sensitivity to small fluctuations in the training data. The irreducible error is noise that cannot be eliminated, regardless of the model chosen.

When tuning model complexity:

A simple model typically has high bias and low variance. It fails to capture the underlying patterns, resulting in high training and testing error (underfitting).
A complex model typically has low bias but high variance. It captures noise from the training data, leading to low training error but high testing error (overfitting).

The goal of model training is to find the model complexity that minimises the total error. Bias and variance generally cannot both be minimised at the same time, so this means trading one off against the other.
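The decomposition can be estimated numerically by fitting the same model to many independently generated training sets. The sketch below does this for the cubic relationship of the 1D regression example. The number of data sets, the fixed grid on $[-3, 3]$ and the seed are assumptions, and the averages are taken over the training points only.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def true_function(x):
    return 7 - 8 * x - 0.5 * x**2 + 0.5 * x**3

def bias_variance(order, n_datasets=200, n_points=30, seed=0):
    """Estimate squared bias and variance of a polynomial fit, averaged over x."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-3, 3, n_points)  # fixed design; the range is an assumption
    predictions = np.empty((n_datasets, n_points))
    for d in range(n_datasets):
        # Each data set shares the same x values but has fresh noise.
        y = true_function(x) + rng.normal(0.0, 1.0, size=n_points)
        coefficients = P.polyfit(x, y, order)
        predictions[d] = P.polyval(x, coefficients)
    mean_prediction = predictions.mean(axis=0)
    bias_squared = float(np.mean((mean_prediction - true_function(x)) ** 2))
    variance = float(np.mean(predictions.var(axis=0)))
    return bias_squared, variance

results = {k: bias_variance(k) for k in (1, 3, 10)}
for order, (b2, var) in results.items():
    print(order, b2, var)
```

Under these assumptions the first-order fit shows a large squared bias and a small variance, while the tenth-order fit shows almost no bias and the largest variance of the three.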

Bias-Variance Tradeoff¶

Regularisation introduces bias into the model but reduces variance:

High Bias: Oversimplified model, underfitting.
High Variance: Complex model, overfitting.

Regularisation helps find a balance between bias and variance.

Bias-Variance Tradeoff

Choosing Regularisation Parameter¶

The regularisation parameter $ \lambda $ controls the strength of the penalty:

Small $ \lambda $: Weak regularisation, model may overfit.
Large $ \lambda $: Strong regularisation, model may underfit.

We can use techniques like cross-validation to select the optimal $ \lambda $.

Training/Validation/Test Sets¶

In practice, we split our data into:

Training Set: Used to fit the model.
Validation Set: Used to tune hyperparameters like $ \lambda $.
Test Set: Used to assess the model's performance on unseen data.

Data Splits

Cross-Validation¶

When data is limited, we use cross-validation to make efficient use of it.

K-Fold Cross-Validation:

Split the data into $ k $ equal parts (folds).
For each fold:
- Use it as the validation set.
- Use the remaining $ k-1 $ folds as the training set.
Average the performance across all folds.
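The steps above can be sketched in plain Python. The helper names `k_fold_indices` and `cross_validate` and the callback `fit_and_score` are hypothetical. In practice the indices are usually shuffled before splitting, which is omitted here to keep the output easy to follow.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_validate(n_samples, k, fit_and_score):
    """Average a validation score over all folds.

    fit_and_score(train, validation) is a user-supplied callback that fits a
    model on the training indices and returns its score on the validation ones.
    """
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print(validation, train)
```

To choose $ \lambda $, one would call `cross_validate` once per candidate value and keep the value with the best average score.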

K-Fold Cross-Validation

Conclusion¶

Regularisation is essential for building models that generalise well to new data. By penalising large coefficients, we prevent the model from becoming too complex and overfitting the training data.

Key takeaways:

Regularisation balances bias and variance.
Cross-validation helps in selecting optimal hyperparameters.
Understanding the data and the problem is crucial for choosing the right regularisation technique.

Support Vector Machines¶

Support Vector Machines are a powerful and versatile Machine Learning tool used for both classification and regression tasks. They are particularly well-suited for problems where the data is high-dimensional and the number of features exceeds the number of samples.

Key Features of SVM¶

Versatility: Applicable to both classification and regression problems.
Flexibility: Can model linear and non-linear relationships using kernel functions.
Robustness: Effective in high-dimensional spaces and when the number of dimensions exceeds the number of samples.

Binary Classification with SVM¶

We focus on a binary classification problem where the labels are $ y = +1 $ for the positive class and $ y = -1 $ for the negative class.

Objective¶

The goal of an SVM is to find the hyperplane that best separates the two classes by maximizing the margin, which is the minimal distance between the data points and the decision boundary.

SVM Decision Boundary Illustration

Types of Margin Classification¶

Hard Margin Classification¶

Definition: No data points are allowed within the margin; all points must be correctly classified without errors.
Characteristics:
- Only works if the data is linearly separable.
- Sensitive to outliers; a single outlier on the wrong side of the boundary makes the dataset non-separable.

Hard Margin Classification

Soft Margin Classification¶

Definition: Allows some data points to be within the margin or even misclassified.
Characteristics:
- Introduces a trade-off between maximizing the margin and minimizing classification errors.
- Less sensitive to outliers compared to hard margin classification.

Soft Margin Classification

Mathematical Formulation¶

Linear SVM¶

For a linear model, the decision function is:

\[ z = w_0 + \vec{x} \cdot \vec{w} \]

$ \vec{x} $: Input feature vector.
$ \vec{w} $: Weight vector.
$ w_0 $: Bias term.

The signed distance $ d $ from a point $ \vec{x} $ to the decision boundary $ z = 0 $ is proportional to $ z $, and its sign tells us on which side of the boundary the point lies:

\[ d = \frac{z}{\lVert \vec{w} \rVert} \]

$ \lVert \vec{w} \rVert = \sqrt{\sum_{i=1}^{n} w_i^2} $: Euclidean norm of the weight vector.
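As a small worked example of the distance formula, the sketch below evaluates $z$ and $d$ for a two-dimensional boundary. The weights are chosen so that $\lVert \vec{w} \rVert = 5$; the function names and numbers are illustrative assumptions.

```python
import math

def decision_value(x, w, w0):
    """z = w0 + x . w"""
    return w0 + sum(xi * wi for xi, wi in zip(x, w))

def signed_distance(x, w, w0):
    """d = z / ||w||; positive on the side that w points to."""
    norm = math.sqrt(sum(wi ** 2 for wi in w))
    return decision_value(x, w, w0) / norm

w, w0 = [3.0, 4.0], -5.0  # boundary: 3*x1 + 4*x2 - 5 = 0, with ||w|| = 5

print(signed_distance([3.0, 4.0], w, w0))   # z = 20, so d = 4
print(signed_distance([0.0, 0.0], w, w0))   # z = -5, so d = -1
print(signed_distance([1.0, 0.5], w, w0))   # z = 0, the point is on the boundary
```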

Decision Function Visualization

Margin Maximization¶

The SVM optimization problem aims to:

Maximize the margin $ \frac{2}{\lVert \vec{w} \rVert} $.
Minimize the classification error.

These two goals are in conflict. The soft margin formulation below balances them with a single parameter $ C $.

Hard Margin Optimization Problem¶

\[ \begin{aligned} & \text{Minimize} && \frac{1}{2} \lVert \vec{w} \rVert^2 \\ & \text{Subject to} && y^{(i)} (w_0 + \vec{x}^{(i)} \cdot \vec{w}) \geq 1 \quad \forall i \end{aligned} \]

Soft Margin Optimization Problem¶

Introduces slack variables $ \xi^{(i)} $ to allow margin violations:

\[ \begin{aligned} & \text{Minimize} && \frac{1}{2} \lVert \vec{w} \rVert^2 + C \sum_{i} \xi^{(i)} \\ & \text{Subject to} && y^{(i)} (w_0 + \vec{x}^{(i)} \cdot \vec{w}) \geq 1 - \xi^{(i)}, \quad \xi^{(i)} \geq 0 \quad \forall i \end{aligned} \]

$ C $: Regularization parameter controlling the trade-off between margin width and classification error.

Support vector machine¶

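The soft margin problem can be written as an unconstrained loss, because at the optimum each slack variable takes the value $\xi^{(i)} = \max(0, 1 - y^{(i)} z^{(i)})$. The loss for the SVM therefore also uses the hinge function, but offset such that we penalise values up to 1:

$$ J(w) = \frac{1}{2}\vec w\cdot \vec w + C \sum_i h_1\left( y_i\, p(x_i,w)\right) $$

where $p(x_i,w)$ is the model prediction $\vec x_i\cdot \vec w + w_0$ and $h_1$ is the shifted hinge function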

$$ h_1(x) = \max(0, 1- x).$$

$C$ is a model parameter controlling the trade-off between the width of the margin and the amount of margin violation.
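The loss can be evaluated directly for a given weight vector. In the sketch below the four data points, the weights and the function names are assumptions chosen so that the arithmetic is easy to follow.

```python
def hinge(x):
    """Shifted hinge function h_1(x) = max(0, 1 - x)."""
    return max(0.0, 1.0 - x)

def svm_loss(X, y, w, w0, C):
    """J(w) = 0.5 * w.w + C * sum_i h_1(y_i * (x_i . w + w0))."""
    margin_term = 0.5 * sum(wi ** 2 for wi in w)
    violations = 0.0
    for xi, yi in zip(X, y):
        z = w0 + sum(a * b for a, b in zip(xi, w))
        violations += hinge(yi * z)
    return margin_term + C * violations

# Labels are +1 / -1. With w = (1, 0) and w0 = 0 the decision value is z = x1.
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [+1, +1, -1, -1]
w, w0 = [1.0, 0.0], 0.0

# y*z = 2, 0.5, 2, -0.5, so the hinge terms are 0, 0.5, 0, 1.5.
print(svm_loss(X, y, w, w0, C=1.0))   # 0.5 + 1 * 2.0
print(svm_loss(X, y, w, w0, C=10.0))  # 0.5 + 10 * 2.0
```

The second point is correctly classified but lies inside the margin and still contributes to the loss. The fourth point is misclassified and contributes more. A larger $C$ makes both kinds of violation more expensive relative to the margin term.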

Example¶

Here we use the iris dataset again, with the features rescaled to have zero mean and unit standard deviation.

No margin violation¶

Moderate margin violation¶

More margin violation¶

Non separable example¶

Here we use the cancer data set we used for previous lectures and exercises.

Less margin violation¶

Moderate margin violation¶

More margin violation¶

Training an SVM¶

Adding data to the training set only affects the model if the additional point falls inside the margin or on the wrong side of it.

The model is completely defined by the data samples on the margin boundary or inside the margin. This is where the name comes from: these data samples are the "support" vectors.

Note: Unlike in the logistic regression case, there is no probabilistic interpretation for an SVM.

ROC curve¶

Let us consider the cancer data sample again.

Quality metrics¶

The performance of a binary classifier can be described by the confusion matrix

	true value is positive	true value is negative
predicted positive	true positive TP	false positive FP
predicted negative	false negative FN	true negative TN

From this matrix we can define several metrics to quantify the quality of the classification.

$\mbox{true positive rate}=\frac{TP}{TP+FN}$

and

$\mbox{false positive rate}=\frac{FP}{FP+TN}$
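Both rates follow directly from the four counts in the confusion matrix. The sketch below uses labels $1$ (positive) and $0$ (negative); the function names and the example predictions are assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for 0/1 labels and 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def rates(y_true, y_pred):
    """Return (true positive rate, false positive rate)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), fp / (fp + tn)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

print(confusion_counts(y_true, y_pred))  # TP=3, FP=1, FN=1, TN=3
print(rates(y_true, y_pred))             # TPR = 3/4, FPR = 1/4
```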

We can see how well the prediction works by plotting the true value as a function of $z$ for each data point in the training sample.

- The points with $z>0$ are assigned to the $y=1$ class; they correspond to $p>\frac12$.
- Those with $z<0$ are assigned to the $y=0$ class; they correspond to $p<\frac12$.

The different categories (TP, FP, TN, FN) can be visualised on this plot.

If we are more worried about false negatives than about false positives, we can move the decision boundary to the left.

Of course, this means more false positives.

If we are more worried about false positives than about false negatives, we can move the decision boundary to the right.

Of course, this means more false negatives.

The curve describing this trade-off is the ROC curve (Receiver Operating Characteristic). It is the collection of (FP rate, TP rate) values for all values of the decision boundary.

Move the threshold to the left:

- more true positives
- more false positives

Move the threshold to the right:

- fewer true positives
- fewer false positives
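The ROC curve can be traced by sweeping the threshold over all observed values of $z$ and recording the two rates at each step. The sketch below uses six made-up data points. The function name and the convention that points with $z$ greater than or equal to the threshold are predicted positive are assumptions.

```python
def roc_points(y_true, z):
    """(FP rate, TP rate) for every threshold on the decision value z."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    # Start above every score (nothing predicted positive), then lower the
    # threshold through each distinct score, i.e. move it to the left.
    thresholds = [float("inf")] + sorted(set(z), reverse=True)
    points = []
    for threshold in thresholds:
        tp = sum(1 for t, s in zip(y_true, z) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, z) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Made-up labels and decision values, sorted by z for readability.
y_true = [0, 0, 1, 0, 1, 1]
z = [-2.0, -1.0, -0.5, 0.3, 0.8, 2.5]

points = roc_points(y_true, z)
for fpr, tpr in points:
    print(round(fpr, 3), round(tpr, 3))
```

The curve starts at $(0, 0)$ and ends at $(1, 1)$, and neither rate ever decreases as the threshold moves to the left.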

Next lecture we'll cover Non-linear models