Notes for Prof. Hung-Yi Lee's ML Lecture 15: Neighbor Embedding
Manifold Learning
For some datasets, it may be meaningless to compute the Euclidean distance directly, especially when the distances between examples are not small. In such cases, if we want to perform dimension reduction, it has to be non-linear. Such dimension reduction may benefit further clustering or supervised learning.
Locally Linear Embedding
「在天願做比翼鳥 在地願為連理枝」 ("In the sky, may we be two birds flying wing to wing; on the earth, may we be two branches growing as one.")
In the original space of $x$, we find $w_{ij}$ for all examples $x^i$, where $w_{ij}$ represents the relation between $x^i$ and its neighboring points $x^j$. The $w_{ij}$ minimize
\[\sum_{i} \vert \vert x^i - \sum_j w_{ij} x^j \vert \vert _2\]then we find the dimension reduction results $z^i$ and $z^j$ by minimizing
\[\sum_{i} \vert \vert z^i - \sum_j w_{ij} z^j \vert \vert _2\]so that we can keep $w_{ij}$ unchanged in the new space.
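Below is a minimal NumPy sketch of this two-step procedure (my own illustration, not code from the lecture): step 1 solves for the $w_{ij}$ over each point's k nearest neighbors, and step 2 finds the $z$s that keep those $w_{ij}$ fixed. The neighbor count, the regularization term, and the function name are assumptions made for illustration.

```python
import numpy as np

def lle(X, n_neighbors=10, n_components=2, reg=1e-3):
    """Minimal LLE sketch: step 1 finds w_ij in x-space, step 2 finds z keeping w_ij fixed."""
    n = X.shape[0]
    # Step 1: for each x^i, pick its k nearest neighbors and solve for the w_ij
    # minimizing ||x^i - sum_j w_ij x^j|| (with sum_j w_ij = 1).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :n_neighbors]
    W = np.zeros((n, n))
    for i in range(n):
        idx = neighbors[i]
        G = (X[idx] - X[i]) @ (X[idx] - X[i]).T       # local Gram matrix
        G += reg * np.trace(G) * np.eye(n_neighbors)  # small regularizer for stability (assumed)
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, idx] = w / w.sum()
    # Step 2: find z minimizing sum_i ||z^i - sum_j w_ij z^j|| with the same w_ij,
    # i.e. the bottom eigenvectors of (I - W)^T (I - W).
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    _, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, 1:n_components + 1]             # skip the constant eigenvector

# Usage: Z = lle(X) for a data matrix X of shape (n_examples, n_features)
```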
Besides, choosing the number of neighbors is important. With too many neighbors, some of them may be too far away to keep the locality, so their $w_{ij}$s become meaningless. (I guess) With too few neighbors, the $w_{ij}$s cannot properly describe the relationships between examples.
One advantage of LLE is that, even if we don't have the $x$s, we can still find the $z$s as long as we have the relationships $w_{ij}$ between the $x$s.
Since LLE only cares about the neighbors of an example, when adding new data, it should be reasonable to compute the $z$s of only the new examples without re-computing the $z$s of the whole data set.
Laplacian Eigenmaps
We use the distance defined on a graph to approximate the distance on the manifold.
First, we build the graph by connecting neighboring points and calculate the similarity $w_{i,j}$ between them;
then in the new space of $z$, we find $z$s to minimize
\[S = \frac{1}{2} \sum_{i,j} w_{i,j} \vert \vert z^i - z^j \vert \vert _2\]then $z^1$ and $z^2$ will be close to each other if $x^1$ and $x^2$ are close in a high-density region. To avoid the trivial solution that all $z = 0$, we add a further constraint:
\[\text{If the dim of } z \text{ is } M \text{, then } \mathrm{span} \{ z^1, z^2, \dots, z^N \} = \mathbb{R}^M\]The solution is given by the eigenvectors of the graph Laplacian (those with the smallest non-zero eigenvalues).
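As a concrete sketch of this procedure (my own NumPy illustration, not the lecture's code, assuming a k-nearest-neighbor graph with Gaussian similarities):

```python
import numpy as np

def laplacian_eigenmaps(X, n_neighbors=10, n_components=2, sigma=1.0):
    """Minimal sketch: build a similarity graph, then embed with eigenvectors of the graph Laplacian."""
    n = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Connect each point to its k nearest neighbors, with Gaussian similarity (assumed choices)
    W = np.zeros((n, n))
    nn = np.argsort(d, axis=1)[:, 1:n_neighbors + 1]
    for i in range(n):
        W[i, nn[i]] = np.exp(-d[i, nn[i]] ** 2 / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                 # make the graph undirected
    D = np.diag(W.sum(axis=1))
    L = D - W                              # graph Laplacian
    _, eigvecs = np.linalg.eigh(L)
    # Minimizing S = 1/2 * sum_ij w_ij ||z^i - z^j|| under the span constraint
    # picks the eigenvectors with the smallest non-zero eigenvalues.
    return eigvecs[:, 1:n_components + 1]  # skip the constant eigenvector (eigenvalue 0)
```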
Also, spectral clustering is performing Laplacian eigenmaps and then clustering on the resulting $z$s.
When adding new examples, it should be OK to compute the $z$s for only the new examples, so Laplacian eigenmaps can be used in training / testing scenarios.
The Problem of Aggregating Clusters
Both LLE and Laplacian eigenmaps have the problem that dissimilar data may also aggregate together, because only the relationships between similar examples are constrained. t-SNE can fix this problem.
t-Distributed Stochastic Neighbor Embedding (t-SNE)
Compute the similarity $S(x^i, x^j)$ between all pairs of $x$, then calculate
\[P(x^j \vert x^i) = \frac{S(x^i, x^j)}{\sum_{k \neq i} S(x^i, x^k)}\]then compute the similarity $S'(z^i, z^j)$ between all pairs of $z$ and calculate
\[Q(z^j \vert z^i) = \frac{S'(z^i, z^j)}{\sum_{k \neq i} S'(z^i, z^k)}\]then find a set of $z$ to minimize the KL divergence
\[L = \sum_{i} KL\big(P(\cdot \vert x^i) \,\Vert\, Q(\cdot \vert z^i)\big) = \sum_{i,j} P(x^j \vert x^i) \log \frac{P(x^j \vert x^i)}{Q(z^j \vert z^i)}\]where $S(x^i,x^j) = \exp(- \vert \vert x^i - x^j \vert \vert _2)$. For $S'(z^i,z^j)$, in SNE
\[S'(z^i,z^j) = \exp(- \vert \vert z^i - z^j \vert \vert _2)\]in t-SNE,
\[S'(z^i,z^j) = \frac{1}{1 + \vert \vert z^i - z^j \vert \vert _2}\]
With the t-distribution, pairs that are already far apart in the space of $x$ are pushed even farther apart in the space of $z$, while nearby pairs stay close.
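To make the formulas concrete, here is a small NumPy sketch (my own, not from the lecture) that just evaluates the objective $L$ for a given pair of $X$ and $Z$; a real t-SNE implementation would then minimize $L$ over $Z$ by gradient descent.

```python
import numpy as np

def kl_objective(X, Z):
    """Sketch of the objective above: P from x-space Gaussian similarities,
    Q from z-space t-distribution similarities, summed KL divergence."""
    def cond_prob(S):
        np.fill_diagonal(S, 0.0)                        # exclude k = i from the sum
        return S / S.sum(axis=1, keepdims=True)
    dx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dz = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    P = cond_prob(np.exp(-dx))                          # S(x^i, x^j) = exp(-||x^i - x^j||)
    Q = cond_prob(1.0 / (1.0 + dz))                     # S'(z^i, z^j) = 1 / (1 + ||z^i - z^j||)
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))  # L = sum_ij P log(P / Q)
```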
t-SNE can provide very good visualization. In practice, we often first use simple PCA to reduce the dimensionality and then apply t-SNE to the PCA output, which reduces the computational cost.
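With scikit-learn, this pipeline might look like the following minimal sketch; the random data matrix and the choice of 50 PCA components are placeholders, not values from the lecture.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.RandomState(0).randn(500, 784)         # placeholder data matrix

X_pca = PCA(n_components=50).fit_transform(X)        # rough linear reduction first
X_2d = TSNE(n_components=2).fit_transform(X_pca)     # then t-SNE for 2-D visualization
```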
With t-SNE, we have to calculate the KL divergence over the whole data set, so we have to re-compute the whole transformation once we want to include more data. Therefore, t-SNE is not suitable for training / testing scenarios.
My question about adding new data with t-SNE: why not compute only the newly introduced probabilities? It seems feasible, so the computing cost of adding new data should not be too problematic. However, by doing so, the transformations of the old examples affect those of the new examples significantly, so we should still re-compute the transformation over the whole old and new data if the amount of new data is not relatively small compared to the old.
References
YouTube: ML Lecture 15: Unsupervised Learning - Neighbor Embedding