
Notes for Prof. Hung-Yi Lee's ML Lecture 13: Unsupervised Learning- Linear Methods

August 22, 2021

Unsupervised Learning

Unsupervised learning covers the scenarios where we only have either the input or the output of the function. Examples of the former (化繁為簡, "turning the complex into the simple") are dimension reduction and clustering, while an example of the latter (無中生有, "generating something from nothing") is generation.

Clustering

“How many clusters” is still an open question; we have to decide it ourselves.

K-Means

Cluster $X = \{ x^1, \dots, x^N \}$ into $K$ clusters, each represented by its mean.

Steps:

(Figure: k-means steps)

When initializing the cluster centers, we randomly choose them from the data set to avoid the degenerate case where some clusters end up with no data points.
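
A minimal NumPy sketch of the K-means loop (the variable names, the convergence check, and the toy data are my own choices, not from the lecture):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: X has shape (N, d); returns cluster centers and assignments."""
    rng = np.random.default_rng(seed)
    # Initialize centers by sampling K distinct data points (avoids empty clusters at the start).
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Usage: cluster 200 random 2-D points into 3 groups.
X = np.random.default_rng(1).normal(size=(200, 2))
centers, labels = kmeans(X, K=3)
```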

My discussion

Q: What if some of the randomly chosen initial points actually belong to the same cluster?

A: Intuitively, it’s possible. However, since we don’t really know how many clusters the data set should be separated into, a better approach may be to set $K$ slightly higher in the beginning and then manually merge clusters later if some of them turn out to be close to each other.

Hierarchical Agglomerative Clustering

Step 1: build a tree

Find the closest 2 examples in the data set, make them a cluster, and then use their mean to represent this cluster as a new data point. Repeat this process until reaching the root, i.e. until the whole data set is organized as a tree.

Step 2: pick a threshold

Choose a level in the tree and separate the tree into clusters at that level.

(Figure: HAC)
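
A rough sketch of the two steps using SciPy's hierarchical clustering utilities (the average-linkage method and the threshold value are illustrative choices; the lecture builds the tree with cluster means):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(50, 2))  # toy data

# Step 1: build the tree by repeatedly merging the closest clusters.
Z = linkage(X, method="average")

# Step 2: pick a threshold (a level in the tree) and cut the tree into clusters.
labels = fcluster(Z, t=2.0, criterion="distance")
print(np.unique(labels))
```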

More of My Discussion on Clustering

Aren’t there any clustering methods based on the low-density separation assumption or the smoothness assumption? An unsupervised SVM for the former? DBSCAN for the latter?

Distributed Representation / Dimension Reduction

Transform the high-dimensional examples into low-dimensional representations by using their attributes or characteristics.

(Figure: dimension reduction)

Feature Selection

Manually choose the dimensions of the features that emphasize the differences between the data points.

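As a trivial sketch, feature selection just keeps a hand-picked subset of the feature columns (the column indices below are arbitrary examples):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 5))  # 100 examples, 5 features

# Keep only the features we judge to be informative, e.g. columns 0 and 3.
selected = [0, 3]
X_reduced = X[:, selected]
print(X_reduced.shape)  # (100, 2)
```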

Principal Component Analysis

Maximizing Variance after Projection

With PCA, we project all the data points $x$ onto $W$ to get a set of lower-dimensional $z$, and we want to find the $W$ that makes the variance of $z$ as large as possible.

(Figure: PCA variance maximization)

Definitions

We have

\[z = Wx\]

where

\[W = \begin{bmatrix} (w^1)^T \\ (w^2)^T \\ \vdots \\ (w^n)^T \\ \end{bmatrix}\]

and

\[z_n = w^n \cdot x\] \[w^n = \arg \max \limits_{w^n} Var(z_n), \quad \text{s.t. } \Vert w^n \Vert_2 = 1,\ w^n \perp w^m \ (m < n)\] \[Var(z_n) = \frac{1}{N}\sum_{z_n}(z_n - \overline{z}_n)^2\]

where $Var(z_1)$ is the largest, $Var(z_2)$ the 2nd largest, and so on down to $Var(z_n)$, the $n$-th largest. (Without the constraint $\Vert w^n \Vert_2 = 1$, the maximization would be unbounded.)

Solution

$w^1$ to $w^n$ are the eigenvectors of the covariance matrix $Cov(x)$, with eigenvalues from the largest to the $n$-th largest, where

\[Cov(x) = \frac{1}{N} \sum (x - \overline{x})(x - \overline{x})^T\]

Discussion

After transformation by $W$, the covariance matrix $Cov(z)$ is a diagonal matrix, i.e. the transformation has a decorrelation effect on the data points.

(Figure: PCA decorrelation)
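
A minimal NumPy sketch of this solution and of the decorrelation effect (the correlated toy data are my own; the eigendecomposition uses `np.linalg.eigh`):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # correlated toy data, shape (N, d)
x_bar = X.mean(axis=0)
Xc = X - x_bar

# Cov(x) = (1/N) * sum (x - x_bar)(x - x_bar)^T
cov_x = Xc.T @ Xc / len(Xc)

# Eigenvectors of Cov(x), sorted by eigenvalue from largest to smallest, give w^1, w^2, ...
eigvals, eigvecs = np.linalg.eigh(cov_x)
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order].T          # rows are w^1, w^2, ...

Z = Xc @ W.T                     # z = W (x - x_bar)

# Cov(z) is (numerically) diagonal: the transformation decorrelates the data.
cov_z = Z.T @ Z / len(Z)
print(np.round(cov_z, 6))
```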

Minimizing Reconstruction Error

We can also see PCA as finding a set of components that minimizes the reconstruction error between $x - \overline{x}$ and the reconstruction $\hat{x}$, i.e.

\[x - \overline{x} \approx c_1 u^1 + c_2 u^2 + \dots + c_k u^k = \hat{x}\]

and the reconstruction error

\[\vert \vert (x - \overline{x}) - \hat{x} \vert \vert_2\]

and by singular value decomposition, the $k$ eigenvectors of the covariance matrix $Cov(x)$ with the largest eigenvalues are the solution, i.e. the set of vectors from PCA

\[\{ w^1, w^2, \dots, w^k \}\]

is exactly the set of components

\[\{ u^1, u^2, \dots, u^k \}\]

that minimizes the reconstruction error.
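
A self-contained NumPy sketch of this reconstruction view, under the same kind of toy setup as above (the value of $k$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
Xc = X - X.mean(axis=0)                                  # x - x_bar

# Top-k principal components from the covariance matrix (as in the sketch above).
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))
W = eigvecs[:, np.argsort(eigvals)[::-1]].T
k = 2

C = Xc @ W[:k].T                               # coefficients c_1, ..., c_k for each example
X_hat = C @ W[:k]                              # x_hat = c_1 u^1 + ... + c_k u^k
errors = np.linalg.norm(Xc - X_hat, axis=1)    # ||(x - x_bar) - x_hat||_2 per example
print(errors.mean())
```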

We can also view PCA as a neural network with one hidden layer and a linear activation function, i.e. a linear autoencoder that minimizes the reconstruction error.

(Figure: PCA as a one-hidden-layer linear autoencoder)

Since it’s a neural network, we can use gradient descent to find the $w$s. However, the $w$s may be different from the ones we get from PCA, because here they are not constrained to be orthonormal as they are in PCA.

Also, it becomes a deep autoencoder if we make the network deep.
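
A hedged PyTorch sketch of this view: a one-hidden-layer linear autoencoder trained by gradient descent (the optimizer, learning rate, and toy data are my own choices):

```python
import torch

torch.manual_seed(0)
X = torch.randn(500, 5) @ torch.randn(5, 5)   # correlated toy data
X = X - X.mean(dim=0)                         # center, as in PCA

d, k = X.shape[1], 2
encoder = torch.nn.Linear(d, k, bias=False)   # z = W x
decoder = torch.nn.Linear(k, d, bias=False)   # x_hat from z
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)

for step in range(2000):
    optimizer.zero_grad()
    x_hat = decoder(encoder(X))
    loss = ((X - x_hat) ** 2).mean()          # reconstruction error
    loss.backward()
    optimizer.step()

# The learned weights span roughly the same subspace as the top-k principal components,
# but they are not constrained to be orthonormal like the PCA solution.
print(encoder.weight.detach())
```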

Discussion

PCA keeps the distances from the high-dimensional to the low-dimensional space. (My comment: not really. The linear transformation has a non-trivial kernel (null space), so some distances can shrink, even to zero, after the transformation. With orthonormal components the projection never increases a distance, but the relative ordering of distances is not necessarily preserved.)

Weakness of PCA: PCA is unsupervised and doesn’t consider the labels, so in some cases it cannot find the proper projection direction to separate the data points. Linear discriminant analysis (LDA), a dimension reduction method that considers the labels, can perform better in such scenarios.

(Figure: PCA vs. LDA)

Also, since PCA is linear, it cannot provide good results when the distribution of data is highly nonlinear.

(Figure: PCA on nonlinear data)

Also, when examining the selected components of image data, we may see that the components don’t look like parts of the drawing. The reason is that the components $w$ and the projections $z$ can have negative values. To get components that look like parts of the drawing, we can use non-negative matrix factorization (NMF), for example.

(Figure: MNIST PCA eigen-digits)

(Figure: MNIST NMF eigen-digits)
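
A hedged scikit-learn sketch of the comparison, using the small 8×8 digits data set as a stand-in for MNIST (the number of components and the NMF settings are arbitrary):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, NMF

X = load_digits().data            # 8x8 digit images flattened to 64 features, all non-negative

pca = PCA(n_components=16).fit(X)
nmf = NMF(n_components=16, init="nndsvda", max_iter=500).fit(X)

# PCA components can be negative; NMF components are non-negative and tend to look like strokes.
print(pca.components_.min())      # typically < 0
print(nmf.components_.min())      # >= 0
```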

Matrix Factorization

We assume that every variable and every target can be described by a latent vector of $n$ dimensions, and we then use the data to solve for the latent vectors of all the variables and targets. This technique can be applied to recommendation systems. For example, the data of variables (otakus) and targets (characters)

(Figure: matrix factorization data)

and the latent vectors

(Figure: matrix factorization latent vectors)

Using SVD to Minimize the Reconstruction Error

(Figure: matrix factorization via SVD)
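
If the matrix is fully observed, a truncated SVD gives the best rank-$k$ approximation in Frobenius norm; a minimal NumPy sketch with a toy matrix (the values and $k$ are made up):

```python
import numpy as np

R = np.array([[5.0, 3.0, 2.0, 1.0],
              [4.0, 3.0, 1.0, 1.0],
              [1.0, 1.0, 4.0, 5.0],
              [1.0, 2.0, 5.0, 4.0]])   # toy rating matrix (rows: otakus, columns: characters)

k = 2
U, S, Vt = np.linalg.svd(R, full_matrices=False)
R_hat = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # best rank-k reconstruction

print(np.linalg.norm(R - R_hat))                # remaining reconstruction error
```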

Using Gradient Descent to Minimize the Reconstruction Error

In some cases, the data may be missing for some pairs of variables and targets. We can use gradient descent to minimize the reconstruction error over only the observed pairs and thus find the latent vectors. Moreover, with gradient descent we can even add biases to all the latent vectors.

(Figure: matrix factorization via gradient descent)
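
A minimal NumPy sketch of stochastic gradient descent on the observed entries only, with biases for the latent vectors (the toy matrix, learning rate, and number of epochs are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5.0, 3.0, np.nan, 1.0],
              [4.0, np.nan, np.nan, 1.0],
              [1.0, 1.0, np.nan, 5.0],
              [1.0, np.nan, 5.0, 4.0]])          # np.nan marks missing ratings
n_rows, n_cols, k = R.shape[0], R.shape[1], 2

P = rng.normal(scale=0.1, size=(n_rows, k))      # latent vectors of the "otakus"
Q = rng.normal(scale=0.1, size=(n_cols, k))      # latent vectors of the "characters"
b = np.zeros(n_rows)                             # row biases
c = np.zeros(n_cols)                             # column biases

observed = [(i, j) for i in range(n_rows) for j in range(n_cols) if not np.isnan(R[i, j])]
lr = 0.05
for epoch in range(2000):
    for i, j in observed:
        e = R[i, j] - (P[i] @ Q[j] + b[i] + c[j])   # error on one observed entry
        p_old = P[i].copy()
        P[i] += lr * e * Q[j]                       # gradient step on the latent vectors
        Q[j] += lr * e * p_old
        b[i] += lr * e                              # gradient step on the biases
        c[j] += lr * e

R_hat = P @ Q.T + b[:, None] + c[None, :]           # predictions, including the missing entries
print(np.round(R_hat, 2))
```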

More Approaches

Netflix Recommendation System

CCA can be used for cases with multiple data sources, e.g. having both voice data and images of mouth shapes for speech-related tasks.

ICA can be used for source separation; it finds components that are independent but not necessarily orthogonal.
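
A hedged scikit-learn sketch of source separation with FastICA on synthetic mixed signals (the signals and the mixing matrix are toy examples):

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.cos(3 * t))]   # two independent source signals
A = np.array([[1.0, 0.5], [0.5, 2.0]])             # mixing matrix
X = S @ A.T                                        # observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_estimated = ica.fit_transform(X)                 # recovered sources (up to scale and permutation)
```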


References

More about PCA: Bishop, Chapter 12

Proof that minimizing the reconstruction error is equivalent to maximizing the variance of $z$: see Bishop, Section 12.1.2

Singular Value Decomposition

YouTube: ML Lecture 13: Unsupervised Learning - Linear Methods

Course website