Notes for Prof. Hung-Yi Lee's ML Lecture 1, Regression

July 08, 2021

Notations

$x^n$: vector of input # n. The superscript indicates the # of object.

$x^n_i$: feature # i of input # n. The subscript indicates the # of feature/component.

$\hat{y}^n$: the real output # n.

$(x^n,\hat{y^n})$: a data pair.

$f^* = arg \min\limits_{f}L(f)$: find best function f over loss function L (badness).

or $w^* = arg \min\limits_{w}L(w)$: find the best vector of parameters, where $w$ defines a function $f$.

Gradient Descent

For a loss function $L(w)$:

(Randomly) pick initial $w^0$.
$w^1 = w^0 - \eta \nabla L \vert_{w=w^0}$, where $\eta$ is the learning rate.
repeat step 2 to get $w^2$ … $w^T$ after $T$ iterations.

Finally, we can get a local minimum (possibly the global minum).

On evaluating functions

When evaluating functions. what we really care about is the loss over testing set but not the training set.

Overfitting

If we can truely find the best function, a more complicated model / larger function set yields lower error on training data, but not always yields lower error on testing data. Such phenomenon is overfitting. Therefore, it’s important to select the suitable model.

Regularization

In the example of linear regression, we can add a term $\lambda \sum \vert w \vert ^2$. It let the functions with smaller $w$’s be better, i.e. it prefer smoother functions. It also let noises have less influence over the loss function.

Higher $\lambda$ let the loss function consider the error on training data less, so it increases training error. We prefer smoother functions, but don’t make it be too smooth, i.e. becoming constants.

We don’t have to perform regulatization on biases because it’s not related to smoothness.