The same place as all the formative assignments, online for 24 hours and done remotely. It should not take 24 hours to complete! The extra time is a buffer in case of any problems with the server.
The assignment will be an ML task to analyse a dataset using one or more of the techniques that we have covered in the lectures, such as:
Remember:
One would think that the more features one has to describe the samples in a dataset, the better one would be able to perform a classification task. Unfortunately, as the number of features increases, so does the difficulty of fitting a multi-dimensional model.
This is generally referred to as the curse of dimensionality and we will see a few surprising effects that explain why more features can make life difficult.
We can ask the question "In the hypercube $-1\leq x_i\leq 1$, what fraction of the points lie no further than 1 from the centre?"
This is equivalent to asking what the ratio is between the volume of the unit "ball" and the volume of the smallest "cube" enclosing it.

In high dimensions most points are in "corners" rather than in the "centre".
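The ratio can be evaluated directly from the standard closed form for the volume of a unit ball in $d$ dimensions, $\pi^{d/2}/\Gamma(d/2+1)$, divided by the cube volume $2^d$. The following is a minimal sketch; the function name `ball_to_cube_ratio` is chosen here for illustration.

```python
import math

def ball_to_cube_ratio(d: int) -> float:
    """Volume of the unit d-ball divided by the volume of the cube [-1, 1]^d."""
    ball_volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    cube_volume = 2.0 ** d
    return ball_volume / cube_volume

# The fraction of the cube that lies inside the ball shrinks quickly with d.
for d in (1, 2, 3, 5, 10, 20):
    print(f"d={d:2d}  ratio={ball_to_cube_ratio(d):.3e}")
```

For $d=2$ this gives $\pi/4$ and for $d=3$ it gives $\pi/6$; by $d=10$ well under 1% of the cube lies inside the ball.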
Looking at the unit cube, we can calculate the average distance between any two points $x$ and $y$, where the distance is
$$ d=\sqrt{\sum_i (x_i-y_i)^2} $$

The average distance increases with the dimension.
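One way to check this numerically is a small Monte Carlo estimate: draw pairs of points uniformly in the unit cube and average their Euclidean distance. This is a sketch only; the function name, sample size and seed are arbitrary choices made for illustration.

```python
import math
import random

def mean_pair_distance(d: int, n_pairs: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the mean distance between two uniform points in [0, 1]^d."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        x = [rng.random() for _ in range(d)]
        y = [rng.random() for _ in range(d)]
        total += math.dist(x, y)
    return total / n_pairs

for d in (1, 2, 10, 50):
    print(f"d={d:2d}  mean distance ~ {mean_pair_distance(d):.3f}")
```

In one dimension the exact answer is $1/3$, which gives a quick sanity check on the estimate.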
We can also plot the distribution of distances:

The likelihood of small distances drops as the dimension increases.
One interesting question to ask is how close to the edges the points are. To quantify this, we calculate the thickness $t$ of the outer layer of the unit cube that contains half the points, assuming the points are uniformly randomly distributed.
The volume inside is given by
$$ V_i = (1-2t)^d \qquad V_i=\frac12 \Rightarrow t = \frac{1-2^{-1/d}}{2}$$

In 35 dimensions half of the points are in an outer layer only about 0.01 thick.
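The formula for $t$ is easy to evaluate for a range of dimensions. A minimal sketch (the function name is chosen here for illustration):

```python
def half_volume_shell_thickness(d: int) -> float:
    """Thickness t of the outer layer of the unit cube holding half of its volume."""
    return (1.0 - 2.0 ** (-1.0 / d)) / 2.0

for d in (1, 2, 10, 35, 100):
    t = half_volume_shell_thickness(d)
    # Check: the inner cube of side (1 - 2t) should hold the other half of the volume.
    print(f"d={d:3d}  t={t:.4f}  inner volume={(1 - 2 * t) ** d:.3f}")
```

In one dimension the layer is 0.25 thick on each side, as expected for the outer half of a unit interval.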
If we have many features, odds are that many are correlated. If there are strong relationships between features, we might not need all of them.
With principal component analysis (PCA) we want to extract the most relevant/independent combinations of features.
It is important to realise that PCA only looks at the features, without looking at the labels: it is an example of unsupervised learning.
Correlated features:

Uncorrelated features:

The idea of PCA is to project the (standardised) data onto a subspace with fewer dimensions.

Standardisation involves transforming the data so that each feature has a mean of 0 and a standard deviation of 1.
PCA is sensitive to the scale of the variables. Features with larger scales can dominate the principal components, skewing the results.
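Standardisation can be sketched in a few lines of NumPy. The toy data below (heights in metres, weights in grams) is hypothetical and only there to show two features on very different scales; the function name `standardise` is likewise an illustrative choice.

```python
import numpy as np

def standardise(X: np.ndarray) -> np.ndarray:
    """Shift and rescale each column to mean 0 and standard deviation 1."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0)
    return (X - mean) / std

# Hypothetical toy data: column 0 is height in metres, column 1 is weight in grams.
X = np.array([[1.60, 55_000.0],
              [1.75, 72_000.0],
              [1.82, 80_000.0],
              [1.68, 64_000.0]])
Z = standardise(X)
print(Z.mean(axis=0))  # approximately 0 for each feature
print(Z.std(axis=0))   # approximately 1 for each feature
```

A constant feature would have zero standard deviation and would need special handling, which this sketch omits.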
Illustrative Example:
If we project onto the first component we get variance 1:

If we project onto the second component we also get variance 1:

But projecting onto a different direction gives a different variance, here larger than 1:

And here smaller than one:

Performing PCA gives a new basis in feature space that includes the directions of largest and smallest variance.
There is no guarantee that the most relevant features for a given classification task are going to have the largest variance.
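The illustration can be made quantitative under a simple assumption that is not stated in the example itself: two standardised features $x_1$, $x_2$ with correlation $\rho=\langle x_1 x_2\rangle$. Projecting onto the unit direction $(\cos\theta,\sin\theta)$ gives the variance
$$ \langle (x_1\cos\theta + x_2\sin\theta)^2\rangle = \cos^2\theta + \sin^2\theta + 2\rho\,\sin\theta\cos\theta = 1+\rho\,\sin 2\theta $$
Along either original axis ($\theta=0$ or $\theta=\pi/2$) the variance is 1, while the extreme values $1\pm\rho$ are reached at $\theta=\pm\pi/4$. Under this assumption these two diagonal directions are the principal axes.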
If there is a strong linear relationship between features it will correspond to a component with a small variance, so dropping it will not lead to a large loss of variance but will reduce the dimensionality of the model.
The first step is to normalise and centre the features:
$$ x_i \rightarrow a x_i +b $$
such that
$$ \langle x_i\rangle = 0 \;,\qquad \langle x_i^2\rangle = 1$$
The covariance matrix of the data is then given, up to an overall factor of $1/n_d$, by
$$ \sigma = X^T X $$
where $X$ is the $n_d\times n_f$ data matrix of the $n_d$ training samples with $n_f$ features. The covariance matrix is an $n_f\times n_f$ matrix.
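As a sketch of these steps, the snippet below builds $X^T X$ for a small synthetic dataset. The sizes, the random seed and the way the third feature is made to track the first are all assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_d, n_f = 500, 3                                  # hypothetical numbers of samples and features
X_raw = rng.normal(size=(n_d, n_f))
X_raw[:, 2] = X_raw[:, 0] + 0.1 * X_raw[:, 2]      # make feature 2 strongly correlated with feature 0

# Centre and normalise each feature: mean 0, mean square 1.
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

sigma = X.T @ X                                    # n_f x n_f, unnormalised covariance matrix
print(sigma.shape)
print(sigma / n_d)                                 # dividing by n_d gives the correlation matrix
```

With standardised features the diagonal of `sigma / n_d` is 1, and the off-diagonal entries are the correlations between features.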
When we only consider the first $k$ principal axes of a dataset we lose some of the variance of the dataset.
Assuming the eigenvalues $\epsilon_j^2$ of $\sigma$ are ordered by decreasing size, we have
$$\sigma_k\equiv {\rm{Tr}}(X_k^T X_k) = \sum\limits_{j=1}^k \epsilon_j^2$$
$\sigma_k$ is the variance that the reduced dataset retains from the original; it is often referred to as the explained variance.
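The relation between the trace and the leading eigenvalues can be checked numerically. The sketch below assumes that $X_k$ means the data projected onto the first $k$ eigenvectors of $X^T X$; the synthetic dataset (four features built from two underlying variables plus a little noise) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_d = 400
base = rng.normal(size=(n_d, 2))                   # two underlying independent variables
X_raw = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + 0.05 * rng.normal(size=n_d),      # nearly a copy of feature 0
    base[:, 1] - base[:, 0] + 0.05 * rng.normal(size=n_d),
])
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# eigh returns eigenvalues in ascending order, so reorder them largest first.
eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2
X_k = X @ eigenvectors[:, :k]                      # data projected onto the first k principal axes
sigma_k = np.trace(X_k.T @ X_k)                    # retained (explained) variance
explained_fraction = np.cumsum(eigenvalues) / eigenvalues.sum()

print(sigma_k, eigenvalues[:k].sum())              # the two numbers agree
print(explained_fraction)                          # cumulative fraction of variance kept
```

Because the four features here are built from only two underlying variables, the first two components retain almost all of the variance.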
We consider a dataset of handwritten digits, compressed to an 8x8 image:

These images live in a 64-dimensional space, but this is clearly far larger than the true dimension of the data.
PCA should help us limit our features to things that are likely to be relevant.
Performing PCA we can see how many eigenvectors are needed to reproduce a given fraction of the dataset variance via a cumulative scree plot:

We can keep 50% of the dataset variance with fewer than 10 features.
The eight most relevant eigenvectors are:

The least relevant eigenvectors are:

If we reduce the data to be 2-dimensional or 3-dimensional we can get a visualisation of the data.


The $k$-nearest neighbors method is an instance-based learning algorithm.
Key Idea: Similar data points are likely to have similar target values.
Advantages:
Disadvantages:
Common Distance Metrics:
Impact on KNN:
The parameter $k$ can be used to control overfitting.
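To make the role of $k$ concrete, here is a minimal pure-Python sketch of a $k$-nearest-neighbour classifier using Euclidean distance and a majority vote. The function name and the toy data (two clusters, with one deliberately odd point labelled "b" sitting inside cluster "a") are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: cluster "a" near the origin, cluster "b" near (1, 1),
# plus one "b" point at (0.15, 0.15) that sits inside cluster "a".
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.15, 0.15),
          (1.0, 1.0), (1.1, 0.9), (0.9, 1.2)]
labels = ["a", "a", "a", "b", "b", "b", "b"]

query = (0.16, 0.16)
print(knn_predict(points, labels, query, k=1))  # follows the single odd neighbour
print(knn_predict(points, labels, query, k=5))  # majority of the surrounding cluster
```

With $k=1$ the prediction follows the single closest point, however unusual it is; a larger $k$ averages over more neighbours and is less sensitive to individual points.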
We can use the iris dataset:





We can use the 8x8 digits picture example after applying PCA to reduce it to 2 dimensions:



