Loss functions quantify how well or badly a model reproduces the values of the training set.
The appropriate loss function depends on the type of problem and the algorithm we use.
Let's denote with $\hat y$ the prediction of the model and $y$ the true value.
Let's assume that the relationship between the features $X$ and the label $Y$ is given by
$$ Y = f(X) +\epsilon$$
where $f$ is the model whose parameters we want to fix and $\epsilon$ is some random noise with zero mean and variance $\sigma^2$.
Assuming that the noise $\epsilon$ follows a Gaussian distribution, the probability density function is given by
$$ p(\epsilon) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{\epsilon^2}{2 \sigma^2} \right) $$Since we have $Y = f(X) + \epsilon$, we substitute $\epsilon = y - f(x)$:
$$ p(y|x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{(y - f(x))^2}{2 \sigma^2} \right) $$The likelihood of observing $y$ for a given $x$ is proportional to
$$ L \sim \exp \left( -\frac{(y - f(x))^2}{2 \sigma^2} \right) $$For multiple observations $\{x^{(i)}, y^{(i)}\}$, the joint likelihood is given by
$$ L \sim \prod_i \exp \left( -\frac{(y^{(i)} - f(x^{(i)}))^2}{2 \sigma^2} \right) $$We now want to fix the parameters in $f$ such that we maximize the likelihood that our data was generated by the model.
It is more convenient to work with the log of the likelihood. Maximizing the likelihood is equivalent to minimising the negative log-likelihood.
$$ NLL = - \log(L) = \frac{1}{2\sigma^2} \sum_i \left(y^{(i)}-f(x^{(i)})\right)^2 + \text{const} $$
where the constant does not depend on the parameters of $f$. Assuming Gaussian noise for the difference between the model and the data therefore leads to the least-squares rule. We can use the squared error loss
$$ J(f) = \sum_i \left(y^{(i)}-f(x^{(i)})\right)^2 $$to train our machine learning algorithm.
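
As a minimal sketch of the relation above, the snippet below computes the squared error loss and the full Gaussian negative log-likelihood for a small data set. The function names, the toy data, and the value of `sigma` are illustrative assumptions; the point is only that both quantities rank two sets of predictions in the same order.

```python
import math

def squared_error(y_true, y_pred):
    """Sum of squared residuals J(f)."""
    return sum((y - f) ** 2 for y, f in zip(y_true, y_pred))

def gaussian_nll(y_true, y_pred, sigma):
    """Negative log-likelihood for Gaussian noise, including the constant term."""
    n = len(y_true)
    const = 0.5 * n * math.log(2 * math.pi * sigma ** 2)
    return squared_error(y_true, y_pred) / (2 * sigma ** 2) + const

# Toy data (assumed for illustration)
y_true = [1.0, 2.0, 3.0]
pred_a = [1.1, 1.9, 3.2]   # close to the data
pred_b = [0.0, 2.0, 4.0]   # further from the data

j_a = squared_error(y_true, pred_a)
j_b = squared_error(y_true, pred_b)
nll_a = gaussian_nll(y_true, pred_a, sigma=1.0)
nll_b = gaussian_nll(y_true, pred_b, sigma=1.0)

print(f"J:   {j_a:.3f} vs {j_b:.3f}")
print(f"NLL: {nll_a:.3f} vs {nll_b:.3f}")
```

The difference between the two NLL values equals the difference between the two squared errors divided by $2\sigma^2$, because the constant term cancels.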
If we have two classes, we call one the positive class ($c=1$) and the other the negative class ($c=0$). If the probability to belong to class 1 is
$$p(c=1) = p$$
we also have
$$ p(c=0)=1-p $$
The likelihood for a single measurement is $p$ if the outcome is in the positive class and $1-p$ if the outcome is in the negative class. For a set of measurements with outcomes $y_i$ the likelihood is given by
$$ L = \prod\limits_{y_i=1} p \prod\limits_{y_i=0} (1-p) $$So the negative log-likelihood is:
$$ NLL = - \sum\limits_{y_i=1} \log(p) - \sum\limits_{y_i=0} \log(1-p) $$
Given that $y_i=0$ or $y_i=1$ we can rewrite it as
$$ NLL = - \sum_i \left( y_i \log(p) + (1-y_i)\log(1-p) \right) $$
So if we have a model for the probability, $\hat y_i = p(x_i)$, we can maximize the likelihood of the training data by minimizing
$$ J= - \sum_i \left[ y_i \log \left(\hat y_i\right) +(1-y_i)\log\left(1 - \hat y_i\right) \right] $$
This loss is called the cross entropy.
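
The cross entropy can be sketched in a few lines of plain Python. The function name, the clipping constant `eps` (used only to avoid taking the logarithm of zero), and the example probabilities are assumptions made for illustration.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Sum over examples of -[y log(p) + (1 - y) log(1 - p)]."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

labels = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]   # high probability for the true class
mostly_wrong = [0.1, 0.9, 0.2, 0.8]   # high probability for the wrong class

loss_good = binary_cross_entropy(labels, mostly_right)
loss_bad = binary_cross_entropy(labels, mostly_wrong)
print(f"good predictions: {loss_good:.3f}, bad predictions: {loss_bad:.3f}")
```

Predictions that put high probability on the wrong class are penalised much more strongly than predictions that are merely uncertain.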
Regularisation is a technique used to prevent overfitting by discouraging overly complex models in machine learning. It adds a penalty to the loss function to reduce the magnitude of the model parameters.
Overfitting occurs when a model learns the noise in the training data to the detriment of its performance on new data.

We modify the loss function to include a penalty term:
\[ J_{\text{pen}}(X, y, \vec{w}) = J(X, y, \vec{w}) + \lambda \cdot \text{Penalty}(\vec{w}) \]

Small values of \( \lambda \) mean weak regularisation, large values of \( \lambda \) mean strong regularisation.
We now look at a one-dimensional example. Suppose we have the relationship:
\[ y = 7 - 8x - \frac{1}{2} x^2 + \frac{1}{2} x^3 + \epsilon \]

We fit the data using polynomials of different orders \( k \):
\[ p_w(x) = \sum_{i=0}^{k} w_i x^i \]
The loss function to minimise is the sum of squared errors, which equals the Mean Squared Error (MSE) up to a factor of \( 1/N \) for \( N \) data points:
\[ J(x, y, \vec{w}) = \sum_{i} \left( p_w(x^{(i)}) - y^{(i)} \right)^2 \]

The third-order polynomial fits the data well and recovers coefficients close to the true values:

| Term | True Coefficient | Estimated Coefficient |
|---|---|---|
| \( w_0 \) | 7 | Approximate value |
| \( w_1 \) | -8 | Approximate value |
| \( w_2 \) | -0.5 | Approximate value |
| \( w_3 \) | 0.5 | Approximate value |
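
The recovery of the coefficients can be reproduced with a short sketch. The data below are synthetic: the sampling range, the sample size, the noise level, and the random seed are all assumptions, so the estimates will not match those of the original fit.

```python
import numpy as np

rng = np.random.default_rng(0)                 # fixed seed (assumed)
true_w = np.array([7.0, -8.0, -0.5, 0.5])      # w_0 .. w_3 of the relationship above

x = rng.uniform(-5.0, 5.0, size=200)           # assumed range and sample size
noise = rng.normal(0.0, 0.5, size=x.size)      # assumed noise level
y = np.polynomial.polynomial.polyval(x, true_w) + noise

# Least-squares fit of a third-order polynomial; the coefficients are
# returned in increasing order of power: [w_0, w_1, w_2, w_3].
w_hat = np.polynomial.polynomial.polyfit(x, y, deg=3)

for i, (w_t, w_e) in enumerate(zip(true_w, w_hat)):
    print(f"w_{i}: true = {w_t:+.2f}, estimated = {w_e:+.3f}")
```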

The second-order polynomial provides a reasonable fit but misses the \( x^3 \) term.
This results in higher bias but potentially lower variance.
The tenth-order polynomial overfits the data: it has enough freedom to follow the noise in the training points.

We modify the loss function to include the regularisation term:
\[ J_{\text{pen}}(x, y, \vec{w}, \lambda) = J(x, y, \vec{w}) + \lambda \sum_{i=0}^{k} w_i^2 \]
This is known as Ridge Regression (L2 Regularisation).
With regularisation, the tenth-order polynomial coefficients have smaller magnitudes:
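
A minimal sketch of this shrinkage, assuming synthetic data drawn from the relationship above: the closed-form ridge solution $\vec w = (X^T X + \lambda I)^{-1} X^T \vec y$ is evaluated for a weak and a strong penalty. The helper name `ridge_polyfit`, the sampling range $[-1, 1]$ (chosen to keep the problem well conditioned), the noise level, and the seed are assumptions. As in the penalised loss above, the intercept $w_0$ is penalised too.

```python
import numpy as np

def ridge_polyfit(x, y, degree, lam):
    """Minimise sum_i (p_w(x_i) - y_i)^2 + lam * sum_j w_j^2 in closed form."""
    X = np.vander(x, degree + 1, increasing=True)   # columns 1, x, ..., x^degree
    A = X.T @ X + lam * np.eye(degree + 1)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(1)                      # fixed seed (assumed)
x = rng.uniform(-1.0, 1.0, size=30)                 # assumed range and sample size
y = 7 - 8 * x - 0.5 * x**2 + 0.5 * x**3 + rng.normal(0.0, 0.5, size=x.size)

w_weak = ridge_polyfit(x, y, degree=10, lam=1e-3)   # weak regularisation
w_strong = ridge_polyfit(x, y, degree=10, lam=10.0) # strong regularisation

print("||w|| with lam = 1e-3:", np.linalg.norm(w_weak))
print("||w|| with lam = 10  :", np.linalg.norm(w_strong))
```

The norm of the coefficient vector decreases as \( \lambda \) grows, which is the effect the penalty term is designed to produce.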

In machine learning, the terms bias and variance help us understand the types of errors that can arise during model training and prediction.
Bias measures how well the model approximates the true relationship between features and the label. It is defined as the difference between the expected prediction of the model and the actual value we aim to predict.
High bias indicates an overly simplistic model that does not capture the complexity of the data well, leading to underfitting.
Variance, on the other hand, measures the sensitivity of the model to variations in the training data. A model with high variance pays too much attention to the training data and may not generalize well to unseen data, leading to overfitting.
The Bias-Variance Tradeoff describes the challenge of finding a balance between bias and variance to minimize the overall error of the model.
Model error can be decomposed as:
$$ \text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} $$Bias is the error introduced by approximating a real-world problem, which may be complex, with a simplified model. Variance is the error introduced by excessive sensitivity to small fluctuations in the training data. The irreducible error is noise that cannot be eliminated, regardless of the model chosen.
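
For the squared error, the decomposition can be written out explicitly. The notation here is an assumption introduced for this sketch: $f$ is the true relationship, $\hat f_{\mathcal{D}}$ is the model fitted on a randomly drawn training set $\mathcal{D}$, and $\epsilon$ is noise with zero mean and variance $\sigma^2$, independent of $\mathcal{D}$. For a fixed point $x$ with $y = f(x) + \epsilon$:

```latex
\begin{aligned}
\mathbb{E}_{\mathcal{D},\epsilon}\!\left[\left(y - \hat f_{\mathcal{D}}(x)\right)^2\right]
&= \underbrace{\left(\mathbb{E}_{\mathcal{D}}\!\left[\hat f_{\mathcal{D}}(x)\right] - f(x)\right)^2}_{\text{Bias}^2}
 + \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[\left(\hat f_{\mathcal{D}}(x)
   - \mathbb{E}_{\mathcal{D}}\!\left[\hat f_{\mathcal{D}}(x)\right]\right)^2\right]}_{\text{Variance}}
 + \underbrace{\sigma^2}_{\text{Irreducible Error}}
\end{aligned}
```

The cross terms vanish because $\epsilon$ has zero mean and is independent of the training set.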
When tuning model complexity, increasing the complexity typically lowers the bias but raises the variance, while decreasing the complexity does the opposite.
The goal of model training is to find the balance between bias and variance that makes the total error as small as possible.
Regularisation introduces bias into the model but reduces variance, which helps find this balance.

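The regularisation parameter \( \lambda \) controls the strength of the penalty: if \( \lambda \) is too small the model can still overfit, and if it is too large the model underfits.
We can use techniques like cross-validation to select the optimal \( \lambda \).
In practice, we split our data into a training set used to fit the parameters, a validation set used to choose hyperparameters such as \( \lambda \), and a test set used only for the final evaluation.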

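When data is limited, we use cross-validation to make efficient use of it.
K-Fold Cross-Validation: the data is split into \( K \) folds; each fold is used once for validation while the model is trained on the remaining \( K-1 \) folds, and the \( K \) validation scores are averaged.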
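
The index bookkeeping behind K-fold cross-validation can be sketched as follows. The helper `k_fold_indices` is a hypothetical name, and the folds are taken as contiguous blocks for readability; in practice the data would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size.

    Returns a list of (train_indices, validation_indices) pairs.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        splits.append((train, val))
        start += size
    return splits

splits = k_fold_indices(n_samples=10, k=3)
for fold, (train, val) in enumerate(splits):
    print(f"fold {fold}: validate on {val}, train on {train}")
```

Each sample appears in exactly one validation fold, so every data point is used both for training and for validation across the \( K \) rounds.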

Regularisation is essential for building models that generalise well to new data. By penalising large coefficients, we prevent the model from becoming too complex and overfitting the training data.
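Key takeaways: regularisation adds a penalty \( \lambda \cdot \text{Penalty}(\vec{w}) \) to the loss, it trades a small increase in bias for a reduction in variance, and \( \lambda \) can be chosen by cross-validation.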
Support Vector Machines are a powerful and versatile Machine Learning tool used for both classification and regression tasks. They are particularly well-suited for problems where the data is high-dimensional and the number of features exceeds the number of samples.
We focus on a binary classification problem where the labels are \( y = +1 \) for positive class and \( y = -1 \) for negative class.
The goal of an SVM is to find the hyperplane that best separates the two classes by maximizing the margin, which is the minimal distance between the data points and the decision boundary.



For a linear model, the decision function is:
\[ z = w_0 + \vec{x} \cdot \vec{w} \]
The signed distance \( d \) from a point \( \vec{x} \) to the decision boundary \( z = 0 \) is proportional to \( z \); its sign tells us on which side of the boundary the point lies:
\[ d = \frac{z}{\lVert \vec{w} \rVert} \]
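
A short numerical check of this formula, using made-up weights chosen so that \( \lVert \vec{w} \rVert = 5 \). The function names and the example values are assumptions.

```python
import math

def decision_value(x, w, w0):
    """z = w0 + x . w for a linear model."""
    return w0 + sum(xi * wi for xi, wi in zip(x, w))

def signed_distance(x, w, w0):
    """Signed distance d = z / ||w|| from x to the hyperplane z = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return decision_value(x, w, w0) / norm_w

w, w0 = [3.0, 4.0], -5.0                       # example weights, ||w|| = 5
d_pos = signed_distance([3.0, 4.0], w, w0)     # z = 20, so d = 4
d_neg = signed_distance([0.0, 0.0], w, w0)     # z = -5, so d = -1
print(d_pos, d_neg)
```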

The SVM optimization problem aims to make the margin as wide as possible while keeping margin violations to a minimum.
These two goals are in conflict and are balanced using optimization techniques.
If no margin violations are allowed (hard margin), the problem reads:
\[ \begin{aligned} & \text{Minimize} && \frac{1}{2} \lVert \vec{w} \rVert^2 \\ & \text{Subject to} && y^{(i)} (w_0 + \vec{x}^{(i)} \cdot \vec{w}) \geq 1 \quad \forall i \end{aligned} \]
The soft-margin version introduces slack variables \( \xi^{(i)} \) to allow margin violations:
\[ \begin{aligned} & \text{Minimize} && \frac{1}{2} \lVert \vec{w} \rVert^2 + C \sum_{i} \xi^{(i)} \\ & \text{Subject to} && y^{(i)} (w_0 + \vec{x}^{(i)} \cdot \vec{w}) \geq 1 - \xi^{(i)}, \quad \xi^{(i)} \geq 0 \quad \forall i \end{aligned} \]
The loss for the SVM also uses the hinge function, but offset such that we penalise values up to 1:
$$ J(w) = \frac{1}{2}\vec w\cdot \vec w + C \sum_i h_1\left( y_i\, p(x_i,w)\right) $$
where $p(x_i,w)$ is the model prediction $\vec x_i\cdot \vec w + w_0$ and $h_1$ is the shifted hinge function
$$ h_1(x) = \max(0, 1- x).$$
$C$ is a model parameter controlling the trade-off between the width of the margin and the amount of margin violation.
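
The loss above can be evaluated directly for a small example. This is a sketch with made-up two-dimensional data and hand-picked weights, not a training procedure; the function names and values are assumptions.

```python
def hinge(x):
    """Shifted hinge function h_1(x) = max(0, 1 - x)."""
    return max(0.0, 1.0 - x)

def svm_objective(X, y, w, w0, C):
    """J(w) = 0.5 * w.w + C * sum_i h_1(y_i * (x_i.w + w0))."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    violations = 0.0
    for xi, yi in zip(X, y):
        z = w0 + sum(a * b for a, b in zip(xi, w))
        violations += hinge(yi * z)
    return margin_term + C * violations

# Toy data with labels in {+1, -1} (assumed for illustration)
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]]
y = [+1, +1, -1]
w, w0 = [1.0, 0.0], 0.0

J_small_C = svm_objective(X, y, w, w0, C=1.0)    # 0.5 + 1 * 0.5
J_large_C = svm_objective(X, y, w, w0, C=10.0)   # 0.5 + 10 * 0.5
print(J_small_C, J_large_C)
```

Only the second point, which lies inside the margin ($y z = 0.5 < 1$), contributes to the sum; a larger $C$ makes that violation more expensive.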
Here we use the iris dataset again, with the features rescaled so that they have zero mean and unit standard deviation.
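
The rescaling step can be sketched for a single feature column as follows. The numbers are made up and are not taken from the iris data; the population standard deviation is used here, which is an assumption about the convention.

```python
import statistics

def standardise(column):
    """Rescale a list of values to zero mean and unit standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)      # population standard deviation
    return [(v - mean) / std for v in column]

# Made-up measurements of a single feature
feature = [4.9, 5.1, 5.8, 6.3, 7.0]
scaled = standardise(feature)
print([round(v, 3) for v in scaled])
```

Rescaling matters for an SVM because the margin is measured as a distance in feature space, so features with large numerical ranges would otherwise dominate.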



Here we use the cancer data set we used for previous lectures and exercises.



Adding data to the training set only affects the model if the additional point falls into the margin.
The model is completely defined by the data samples at the boundary or inside the margin. This is where the name comes from: these data samples are the "support" vectors.
Note: Unlike in the logistic regression case, there is no probabilistic interpretation for an SVM.
Let us consider the cancer data sample again

| | true value is positive | true value is negative |
|---|---|---|
| predicted positive | true positive TP | false positive FP |
| predicted negative | false negative FN | true negative TN |

From these counts we define
$\mbox{true positive rate}=\frac{TP}{TP+FN}$
and
$\mbox{false positive rate}=\frac{FP}{FP+TN}$
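
Both rates follow directly from the four counts of the table. The sketch below assumes labels coded as 1 (positive) and 0 (negative) and uses made-up predictions.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def rates(tp, fp, tn, fn):
    """Return (true positive rate, false positive rate)."""
    return tp / (tp + fn), fp / (fp + tn)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
tpr, fpr = rates(tp, fp, tn, fn)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```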
We can see how well the prediction works by plotting the true value as a function of $z$ for each data point in the training sample:

The different categories (TP, FP, TN, FN) can be visualised on this plot:

If we are more worried about false negatives than about false positives, we can move the decision boundary to the left:

Of course, this means more false positives.
If we are more worried about false positives than about false negatives, we can move the decision boundary to the right:

Of course, this means more false negatives.
The curve describing this trade-off is the ROC curve (Receiver Operating Characteristic). It is the collection of (FP rate, TP rate) values obtained as the decision threshold on $z$ is swept over all possible values.
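
The construction of the curve can be sketched by sweeping the threshold over the observed values of $z$. The labels (coded 0 and 1), the scores, and the function name are assumptions for illustration; a point is predicted positive when its score is at or above the threshold.

```python
def roc_points(y_true, scores):
    """Return (FP rate, TP rate) pairs, one per threshold, from high to low."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    # Start above every score (nothing predicted positive), then lower the threshold.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for thr in thresholds:
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

y_true = [0, 0, 1, 0, 1, 1]
z = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]   # made-up decision values

curve = roc_points(y_true, z)
for fpr, tpr in curve:
    print(f"FP rate = {fpr:.2f}, TP rate = {tpr:.2f}")
```

Lowering the threshold (moving it to the left) can only increase both rates, so the curve runs from (0, 0) to (1, 1).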

Move the threshold to the left:

Move the threshold to the right:
