Loss functions

Loss functions quantify how well or badly a model reproduces the values of the training set.

The appropriate loss function depends on the type of problem and the algorithm we use.

Let's denote with $\hat y$ the prediction of the model and $y$ the true value.

Gaussian noise

Let's assume that the relationship between the features $X$ and the label $Y$ is given by

$$ Y = f(X) +\epsilon$$

where $f$ is the model whose parameters we want to fix and $\epsilon$ is some random noise with zero mean and variance $\sigma^2$.

The likelihood of measuring $y$ for feature values $x$ is given by

$$L\sim \exp\left(-\frac{(y-f(x))^2}{2\sigma^2}\right) $$

If we have a set of independent examples $x^{(i)}$ with labels $y^{(i)}$, the likelihood becomes

$$ L\sim \prod_i \exp\left(-\frac{(y^{(i)}-f(x^{(i)}))^2}{2\sigma^2}\right) $$

Likelihood Derivation

Assuming that the noise $\epsilon$ follows a Gaussian distribution with zero mean and variance $\sigma^2$, the probability density function is given by

$$ p(\epsilon) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{\epsilon^2}{2 \sigma^2} \right) $$

Since we have $Y = f(X) + \epsilon$, we substitute $\epsilon = y - f(x)$:

$$ p(y|x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{(y - f(x))^2}{2 \sigma^2} \right) $$

The likelihood of observing $y$ for a given $x$ is proportional to

$$ L \sim \exp \left( -\frac{(y - f(x))^2}{2 \sigma^2} \right) $$

For multiple observations $\{x^{(i)}, y^{(i)}\}$, the joint likelihood is given by

$$ L \sim \prod_i \exp \left( -\frac{(y^{(i)} - f(x^{(i)}))^2}{2 \sigma^2} \right) $$

We now want to fix the parameters in $f$ such that we maximize the likelihood that our data was generated by the model.

It is more convenient to work with the log of the likelihood. Maximizing the likelihood is equivalent to minimising the negative log-likelihood, which is fixed up to an additive constant that does not depend on the parameters of $f$:

$$ NLL = - \log(L) = \frac{1}{2\sigma^2} \sum_i \left(y^{(i)}-f(x^{(i)})\right)^2 + \text{const} $$

Assuming Gaussian noise for the difference between the model and the data leads to the least-squares rule. Since the factor $\frac{1}{2\sigma^2}$ does not change the location of the minimum, we can use the squared-error loss

$$ J(f) = \sum_i \left(y^{(i)}-f(x^{(i)})\right)^2 $$

to train our machine learning algorithm.
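The equivalence between the squared-error loss and the Gaussian negative log-likelihood can be checked numerically. The sketch below is illustrative only: the toy data, the two candidate models $f(x) = a x$ and the helper names `squared_error` and `gaussian_nll` are assumptions, not part of the lecture material.

```python
import math

def squared_error(ys, preds):
    """Sum of squared residuals J(f) over the data set."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds))

def gaussian_nll(ys, preds, sigma):
    """Gaussian negative log-likelihood, normalisation constant included."""
    n = len(ys)
    const = 0.5 * n * math.log(2 * math.pi * sigma ** 2)
    return const + squared_error(ys, preds) / (2 * sigma ** 2)

# Toy data, made up for illustration.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.9]

# Two candidate models of the form f(x) = a * x.
preds_a = [1.0 * x for x in xs]
preds_b = [0.5 * x for x in xs]

j_a, j_b = squared_error(ys, preds_a), squared_error(ys, preds_b)
nll_a, nll_b = gaussian_nll(ys, preds_a, 1.0), gaussian_nll(ys, preds_b, 1.0)

print("squared error:", j_a, j_b)
print("Gaussian NLL :", nll_a, nll_b)
```

For a fixed $\sigma$ the two quantities differ only by a constant offset and a positive scale factor, so whichever model has the smaller squared error also has the smaller negative log-likelihood.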

Two class model

If we have two classes, we call one the positive class ($c=1$) and the other the negative class ($c=0$). If the probability of belonging to class 1 is

$$p(c=1) = p$$

we also have

$$ p(c=0)=1-p $$

The likelihood for a single measurement is $p$ if the outcome is in the positive class and $1-p$ if the outcome is in the negative class. For a set of independent measurements with outcomes $y_i$ the likelihood is given by

$$ L = \prod\limits_{y_i=1} p \prod\limits_{y_i=0} (1-p) $$

So the negative log-likelihood is:

$$ NLL = - \sum\limits_{y_i=1} \log(p) - \sum\limits_{y_i=0} \log(1-p) $$

Given that $y_i=0$ or $y_i=1$ we can rewrite it as

$$ NLL = - \sum_i \left( y_i \log(p) + (1-y_i)\log(1-p) \right) $$

So if we have a model for the probability, $\hat y_i = p(x_i)$, we can maximize the likelihood of the training data by minimizing

$$ J= - \sum_i \left[ y_i \log \left(\hat y_i\right) +(1-y_i)\log\left(1 - \hat y_i\right) \right] $$

This loss is called the cross entropy.
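A minimal sketch of this loss in plain Python follows. The function name `binary_cross_entropy`, the clipping constant `eps` and the example labels and probabilities are assumptions chosen for illustration. The clipping is a common numerical safeguard against $\log(0)$ and is not part of the formula itself.

```python
import math

def binary_cross_entropy(ys, probs, eps=1e-12):
    """Sum over examples of -[y log(p) + (1 - y) log(1 - p)]."""
    total = 0.0
    for y, p in zip(ys, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

labels = [1, 0, 1, 1]

# Predicted probabilities of the positive class from two hypothetical models.
confident = [0.9, 0.1, 0.8, 0.7]   # mostly agrees with the labels
uninformed = [0.5, 0.5, 0.5, 0.5]  # always predicts p = 1/2

loss_confident = binary_cross_entropy(labels, confident)
loss_uninformed = binary_cross_entropy(labels, uninformed)

print(loss_confident, loss_uninformed)
```

The model that assigns high probability to the correct class gets the lower loss, while the uninformed model pays $\log 2$ per example.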

Bias and Variance

In machine learning, the terms bias and variance help us understand the types of errors that can arise during model training and prediction.

Bias measures how well the model approximates the true relationship between features and the label. It is defined as the difference between the expected prediction of the model and the actual value we aim to predict.

High bias indicates an overly simplistic model that does not capture the complexity of the data well, leading to underfitting.

Variance, on the other hand, measures the sensitivity of the model to variations in the training data. A model with high variance pays too much attention to the training data and may not generalize well to unseen data, leading to overfitting.

The Bias-Variance Tradeoff describes the challenge of finding a balance between bias and variance to minimize the overall error of the model.

Model error can be decomposed as:

$$ \text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} $$

Bias is the error introduced by approximating a real-world problem, which may be complex, with a simplified model. Variance is the error introduced by excessive sensitivity to small fluctuations in the training data. The irreducible error is noise that cannot be eliminated, regardless of the model chosen.

When tuning model complexity:

  • A simple model typically has high bias and low variance. It fails to capture the underlying patterns, resulting in high training and testing error (underfitting).
  • A complex model typically has low bias but high variance. It captures noise from the training data, leading to low training error but high testing error (overfitting).

The goal of model training is to find the model complexity at which the sum of squared bias and variance, and hence the total error, is smallest.
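For the squared-error loss, the decomposition can be written out explicitly. The following is a sketch under assumptions not stated in the text: the data follow $y = f(x) + \epsilon$ with $E[\epsilon]=0$ and $\mathrm{Var}(\epsilon)=\sigma^2$, where $f$ here stands for the true relationship rather than the fitted model; $\hat f$ denotes the model fitted on a randomly drawn training set; and the expectation is taken over training sets and over the noise at a fixed test point $x$.

$$ E\left[\left(y-\hat f(x)\right)^2\right] = \left(E[\hat f(x)] - f(x)\right)^2 + E\left[\left(\hat f(x) - E[\hat f(x)]\right)^2\right] + \sigma^2 $$

The three terms are the squared bias, the variance and the irreducible error, respectively. The cross terms vanish because the test noise has zero mean and is independent of $\hat f$.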

Support vector machine

The loss for the SVM also uses the hinge function, but offset such that we penalise values up to 1:

$$ J(w) = \frac{1}{2}\vec w\cdot \vec w + C \sum_i h_1( y_i\, p(x_i,w)) $$

where $p(x_i,w)$ is the model prediction $\vec x_i\cdot \vec w + w_0$, the labels are encoded as $y_i\in\{-1,+1\}$, and $h_1$ is the shifted hinge function

$$ h_1(x) = \max(0, 1- x).$$

$C$ is a model parameter controlling the trade-off between the width of the margin and the amount of margin violation.
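The objective can be evaluated directly for a given weight vector. The sketch below assumes labels encoded as $-1$ and $+1$ (the usual convention for the hinge loss) and uses made-up data. The names `hinge` and `svm_objective` are illustrative, not taken from a library.

```python
def hinge(x):
    """Shifted hinge function h_1(x) = max(0, 1 - x)."""
    return max(0.0, 1.0 - x)

def svm_objective(w, w0, xs, ys, C):
    """J(w) = 1/2 w.w + C * sum_i h_1(y_i * (x_i.w + w0)); labels in {-1, +1}."""
    margin_term = 0.5 * sum(wj * wj for wj in w)
    violations = 0.0
    for x, y in zip(xs, ys):
        score = sum(wj * xj for wj, xj in zip(w, x)) + w0
        violations += hinge(y * score)
    return margin_term + C * violations

# Three toy points in two dimensions, made up for illustration.
xs = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0)]
ys = [1, 1, -1]
w, w0 = (1.0, 0.0), 0.0

loss = svm_objective(w, w0, xs, ys, C=1.0)
print(loss)
```

Only the second point lies inside the margin ($y_i\,p(x_i,w) = 0.5 < 1$), so it is the only one that contributes to the sum. A larger $C$ makes such violations more expensive relative to the margin term.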

ROC curve

Let us consider the cancer data sample again

Quality metrics

  • The performance of a binary classifier can be described by the confusion matrix:

| | true value is positive | true value is negative |
|---|---|---|
| predicted positive | true positive (TP) | false positive (FP) |
| predicted negative | false negative (FN) | true negative (TN) |

  • From this matrix we can define several metrics to quantify the quality of the classification.

$\mbox{true positive rate}=\frac{TP}{TP+FN}$

and

$\mbox{false positive rate}=\frac{FP}{FP+TN}$
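As a small illustration, the four confusion-matrix entries and the two rates can be computed from lists of true and predicted labels. The helper names `confusion_counts` and `rates` and the example labels are assumptions for this sketch. It also assumes that both classes occur in the data, so that neither denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for labels encoded as 0 and 1."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def rates(tp, fp, fn, tn):
    """True positive rate and false positive rate."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr

# Made-up labels for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
tpr, fpr = rates(tp, fp, fn, tn)
print(tp, fp, fn, tn, tpr, fpr)
```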

We can see how well the prediction works by plotting the true value as a function of $z$ for each data point in the training sample:

  • The points with $z>0$ are assigned to the $y=1$ class
    • they correspond to $p>\frac12$
  • those with $z<0$ to the $y=0$ class
    • they correspond to $p<\frac12$

The different categories (TP, FP, TN, FN) can be visualised on this plot:

If we are more worried about false negatives than about false positives, we can move the decision boundary to the left:

Of course, this means more false positives.

If we are more worried about false positives than about false negatives, we can move the decision boundary to the right:

Of course, this means more false negatives.

The curve describing this trade-off is the ROC curve (Receiver Operating Characteristic). It is the collection of (FP rate, TP rate) values for all values of the decision boundary.

Move the threshold to the left:

  • more true positives
  • more false positives

Move the threshold to the right:

  • fewer true positives
  • fewer false positives
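The threshold sweep described above can be sketched in a few lines. This is an illustrative implementation, not the one used to produce the lecture plots: the function name `roc_points`, the toy scores and the convention that a point with $z$ equal to the threshold counts as positive are all assumptions.

```python
def roc_points(y_true, scores):
    """Return (FP rate, TP rate) pairs, one per candidate threshold on z."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    # Start above every score (nothing predicted positive), then move the
    # threshold to the left through each distinct score value.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for thr in thresholds:
        tp = sum(1 for t, z in zip(y_true, scores) if z >= thr and t == 1)
        fp = sum(1 for t, z in zip(y_true, scores) if z >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Made-up labels and scores z for illustration.
y_true = [0, 0, 1, 0, 1, 1]
scores = [-2.0, -0.5, -0.1, 0.3, 1.2, 2.5]

curve = roc_points(y_true, scores)
for fpr, tpr in curve:
    print(round(fpr, 3), round(tpr, 3))
```

The curve starts at (0, 0) when the threshold is to the right of every point and ends at (1, 1) when it is to the left of every point. Along the way neither rate ever decreases.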

Next lecture we'll cover Non-linear models