Feedback

  • Several people used standard MSE as the score metric for the regression.
  • Many people forgot to format the best alpha from the regression to one digit.
  • It seems some people (again) copied from the solution notebook; this is a formative assignment!


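# Function to categorize decision values into true positives (TP), false positives (FP),
# true negatives (TN), and false negatives (FN) based on model predictions
# NOTE: the original `def` line is missing; the function name below is a placeholder.
def categorize_decision_values(model, X, y):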
    # Initialize empty lists for each category
    tp = []  # True positive
    tn = []  # True negative
    fp = []  # False positive
    fn = []  # False negative
    
    # Get the decision function values for the dataset
    res_lr = model.decision_function(X)
    
    # Iterate over each decision value and actual label
    for ii in range(len(res_lr)):
        if res_lr[ii] > 0:  # Positive decision threshold
            if y[ii] == 1:  # True label is positive
                tp.append(res_lr[ii])
            else:  # True label is negative
                fp.append(res_lr[ii])
        else:  # Negative decision threshold
            if y[ii] == 1:  # True label is positive
                fn.append(res_lr[ii])
            else:  # True label is negative
                tn.append(res_lr[ii])
    
    # Return the categorized decision values
    return tp, fp, tn, fn
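
The same categorization can also be written with boolean masks instead of an explicit loop. The following is a minimal sketch, not the assignment's reference solution: the helper name `split_decision_values`, the synthetic dataset, and the `LogisticRegression` classifier are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def split_decision_values(model, X, y):
    """Hypothetical helper: split decision values into TP, FP, TN, FN arrays."""
    scores = model.decision_function(X)
    y = np.asarray(y)
    predicted_positive = scores > 0   # same threshold as in the loop version
    actual_positive = y == 1
    tp = scores[predicted_positive & actual_positive]
    fp = scores[predicted_positive & ~actual_positive]
    tn = scores[~predicted_positive & ~actual_positive]
    fn = scores[~predicted_positive & actual_positive]
    return tp, fp, tn, fn


# Synthetic binary classification data (an assumption, not the course dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

tp, fp, tn, fn = split_decision_values(clf, X_demo, y_demo)
print(len(tp), len(fp), len(tn), len(fn))
```

The four arrays returned here hold the same values as the four lists in the loop version, which makes them convenient for plotting histograms of the decision values per category.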


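import numpy as np
from sklearn.linear_model import Ridge

# Defining a range of alpha values to test (50 log-spaced values from 1e-4 to 1e4)
alphas = np.logspace(-4, 4)

# Initializing lists to store the scores and the validation predictions for each alpha
scores = []
y_ridge_alpha = []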

# Looping through each alpha value
for al in alphas:
    # Creating a Ridge regression model with the current alpha
    rr = Ridge(alpha=al)
    
    # Fitting the model on the training data
    rr.fit(X_train, y_train)
    
    # Predicting on the validation set
    y_ridge_alpha.append(rr.predict(xval_res))
    
    # Calculating the model's R^2 score on the test set
    score = rr.score(Xtest_res, y_test)
    
    # Storing the alpha value and its corresponding score
    scores.append([al, score])
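
Once the scores are collected, the best alpha is the one with the highest score. The sketch below is self-contained and uses assumptions throughout: synthetic data from `make_regression`, a held-out validation split (rather than the test set used in the loop above), and one significant digit as one possible reading of the "one digit" formatting mentioned in the feedback.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption, not the assignment data)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

alphas = np.logspace(-4, 4)

# R^2 on the validation split for each alpha
val_scores = np.array(
    [Ridge(alpha=al).fit(X_tr, y_tr).score(X_val, y_val) for al in alphas]
)

# Pick the alpha with the highest validation R^2
best_alpha = alphas[np.argmax(val_scores)]

# Format to one significant digit (assumed meaning of "one digit")
print(f"best alpha: {best_alpha:.1g}")
```

Note that `Ridge.score` returns the coefficient of determination $R^2$, so larger is better; with an error metric such as MSE the selection would use `np.argmin` instead.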

Variance Bias Trade-off

What is Bias and Variance?

Bias: The error introduced by approximating a real-world problem, which may be complex, by a much simpler model.

  • High bias can cause an algorithm to miss relevant relations between features and target outputs (underfitting).

Variance: The error introduced by the model's sensitivity to fluctuations in the training set.

  • High variance can cause overfitting, where the model captures noise in the data.

Let's suppose the relationship between $X$ and $Y$ is described by

$$ Y = \sum_i w_i^\star x^i + \epsilon$$

where $w_i^\star$ are the true parameters and $\epsilon$ is some noise.

We will try to model this with

$$ y = p(x) = \sum_i w_i x^i$$

where now the $w_i$ will be fitted to data.

We define

$$ \bar w_i = \langle w_i\rangle $$

as the expectation value of the parameter $w_i$ when fitted to multiple independent samples drawn from the true distribution.

We want to calculate the expected squared deviation of the fitted coefficients from the true coefficients:

$$ \langle (w_i-w_i^\star)^2\rangle$$

$$ \begin{eqnarray} \langle (w_i-w_i^\star)^2\rangle & = & \langle (w_i-\bar w_i +\bar w_i -w_i^\star)^2\rangle \\ &=& \langle (w_i-\bar w_i)^2\rangle + \langle (\bar w_i-w_i^\star)^2\rangle +2 \langle (w_i-\bar w_i)(\bar w_i -w_i^\star)\rangle \end{eqnarray}$$

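The third term vanishes because $\bar w_i - w_i^\star$ is a constant and $\langle w_i-\bar w_i\rangle = 0$ by the definition of $\bar w_i$: $$ \langle (w_i-\bar w_i)(\bar w_i -w_i^\star)\rangle = \langle (w_i-\bar w_i)\rangle (\bar w_i -w_i^\star) =0 $$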

So we have

$$ \langle (w_i-w_i^\star)^2\rangle = \langle (w_i-\bar w_i)^2\rangle + \langle (\bar w_i-w_i^\star)^2\rangle$$

The first term is the variance and the second is the squared bias. Since $\bar w_i - w_i^\star$ is a constant, the second term is simply $(\bar w_i-w_i^\star)^2$.
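
The decomposition can be checked numerically by fitting the same model to many independent samples. The sketch below is an illustration under assumptions: the true coefficients `w_star`, the noise level, and the sample sizes are arbitrary choices, and the expectation values are replaced by averages over the repeated fits.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)

# Assumed true coefficients w_i^* (lowest degree first) and noise level
w_star = np.array([1.0, 0.0, 0.2])
n_repeats, n_samples, noise_sigma = 2000, 20, 0.1

# Fit a polynomial of the same degree to many independent samples
fitted = np.empty((n_repeats, w_star.size))
for r in range(n_repeats):
    x = rng.uniform(0.0, 1.0, n_samples)
    y = P.polyval(x, w_star) + rng.normal(0.0, noise_sigma, n_samples)
    fitted[r] = P.polyfit(x, y, deg=w_star.size - 1)

w_bar = fitted.mean(axis=0)                          # <w_i>
total_error = np.mean((fitted - w_star) ** 2, axis=0)  # <(w_i - w_i^*)^2>
variance = np.mean((fitted - w_bar) ** 2, axis=0)      # <(w_i - w_bar_i)^2>
bias_sq = (w_bar - w_star) ** 2                        # (w_bar_i - w_i^*)^2

print("total error    :", total_error)
print("variance + bias:", variance + bias_sq)
```

Because the cross term vanishes identically when $\bar w_i$ is the average over the same set of fits, the two printed rows agree up to floating-point rounding.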

Example

To illustrate the variance-bias trade-off, we will fit different models to data for which the true relationship between the input $x$ and the outcome is

$$Y(x) = 1+\frac15 x^2 + \epsilon \qquad \mbox{for}\qquad 0\leq x\leq 1\;, \quad 0 \; \mbox{otherwise}$$

where $\epsilon$ is Gaussian noise. We will use the two models

$$ m_1(x) = a +bx$$

and

$$m_2(x)= a+bx +cx^2 +dx^3.$$

Using $m_1$, a model with too few parameters, we get:

For small datasets the variance dominates, but as the number of training samples grows the bias dominates. Since the model is not capable of describing the truth, the error does not vanish even though the variance part of the error drops proportionally to $1/\sqrt{N}$.
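
The behaviour described for $m_1$ can be reproduced with a small simulation. This is a sketch under stated assumptions, not the code behind the original figures: the noise standard deviation (0.1), the number of repeated fits, and the choice to measure bias and variance on the model prediction over a grid in $[0, 1]$ are all illustrative. The function name `bias_variance` is hypothetical.

```python
import numpy as np


def true_function(x):
    # Noise-free part of Y(x) on 0 <= x <= 1
    return 1.0 + 0.2 * x**2


def bias_variance(degree, n_samples, n_repeats=500, noise_sigma=0.1, seed=0):
    """Estimate squared bias and variance of a polynomial fit of given degree."""
    rng = np.random.default_rng(seed)
    x_grid = np.linspace(0.0, 1.0, 50)
    predictions = np.empty((n_repeats, x_grid.size))
    for r in range(n_repeats):
        # Draw a fresh training set and fit the polynomial model
        x = rng.uniform(0.0, 1.0, n_samples)
        y = true_function(x) + rng.normal(0.0, noise_sigma, n_samples)
        coeffs = np.polyfit(x, y, degree)
        predictions[r] = np.polyval(coeffs, x_grid)
    mean_prediction = predictions.mean(axis=0)
    bias_sq = np.mean((mean_prediction - true_function(x_grid)) ** 2)
    variance = np.mean(predictions.var(axis=0))
    return bias_sq, variance


# degree=1 corresponds to m_1(x) = a + b x
for n in (10, 1000):
    b2, var = bias_variance(degree=1, n_samples=n)
    print(f"N={n:5d}  bias^2={b2:.2e}  variance={var:.2e}")
```

Calling `bias_variance(degree=3, n_samples=n)` gives the corresponding estimate for $m_2$, which contains the true quadratic form as a special case.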

For the second model, where we have enough freedom to describe the truth exactly, we get: