
Notes for Prof. Hung-Yi Lee's ML Lecture: Support Vector Machine

December 08, 2021

My Intuitive Explanation

The overall computation is equivalent to:

  1. transform the features to a high-dimensional (often infinite-dimensional) space.
  2. find a hyperplane that separates the 2 classes by minimizing the hinge loss.
  3. use the hyperplane to classify new data.

with the tricks:

  1. By the kernel method, when classifying new data, we calculate the kernel function between the new data and the training data in the original space, so we don’t have to perform computations in the high-dimensional space.
  2. By the kernel method and the hinge loss, we only have to evaluate the kernel function between the new data and a few training data points near the classification boundary. These training data points are the support vectors (see the sketch after this list).
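
A minimal numerical sketch of this intuition, on my own toy values (the RBF kernel, the coefficients, and the points below are assumptions for illustration, not from the lecture): once training has produced the dual coefficients $a_n$, the labels $t_n$, the support vectors, and the bias $b$, classifying a new point only needs kernel evaluations against the support vectors, all in the original space.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # k(x, z) = exp(-gamma * ||x - z||^2), evaluated in the original feature space
    return np.exp(-gamma * np.sum((x - z) ** 2))

def decision_function(x_new, support_vectors, a, t, b, gamma=1.0):
    # y(x) = sum_n a_n * t_n * k(x, x_n) + b, summed over the support vectors only
    return sum(a_n * t_n * rbf_kernel(x_new, x_n, gamma)
               for a_n, t_n, x_n in zip(a, t, support_vectors)) + b

# Toy values: two support vectors in 2-D with their dual coefficients and labels
support_vectors = np.array([[1.0, 1.0], [-1.0, -1.0]])
a = np.array([0.5, 0.5])   # a_n > 0 for support vectors
t = np.array([1, -1])      # labels t_n in {+1, -1}
b = 0.0

x_new = np.array([0.8, 0.9])
print(np.sign(decision_function(x_new, support_vectors, a, t, b)))  # -> 1.0
```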

Traditional Formulation

I refer to Bishop Chap.7.1 for this part.

For Separable Class Distributions

Defining the Problem

For a binary classification problem, we have the function:

$$y(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + b, \qquad t_n = \begin{cases} 1 & \text{when } y(\mathbf{x}_n) > 0 \\ -1 & \text{when } y(\mathbf{x}_n) < 0 \end{cases}$$

We assume the data to be perfectly separable, then

$$t_n y(\mathbf{x}_n) > 0$$

To find the plane with the maximum margin, we try to find $\mathbf{w}$ and $b$ by

$$\underset{\mathbf{w},\, b}{\arg\max} \left\{ \frac{1}{\|\mathbf{w}\|} \min_n \big[ t_n (\mathbf{w}^T \phi(\mathbf{x}_n) + b) \big] \right\}$$

[Figure: separable margin]

Derivations

Since rescaling $\mathbf{w} \to \kappa\mathbf{w}$ and $b \to \kappa b$ does not change the decision boundary, for the points that are closest to the boundary we can set

$$t_n(\mathbf{w}^T\phi(\mathbf{x}_n)+b) = 1$$

then for all points,

$$t_n(\mathbf{w}^T\phi(\mathbf{x}_n)+b) \geq 1, \qquad n = 1, \dots, N$$

Then the problem is equivalent to

$$\underset{\mathbf{w},\, b}{\arg\min}\ \frac{1}{2}\|\mathbf{w}\|^2$$

with constraints

$$t_n(\mathbf{w}^T\phi(\mathbf{x}_n)+b) \geq 1, \qquad n = 1, \dots, N$$

Then, by the method of Lagrange multipliers, we get the Lagrangian

$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N} a_n \big\{ t_n(\mathbf{w}^T\phi(\mathbf{x}_n)+b) - 1 \big\}, \qquad a_n \geq 0, \quad \mathbf{a} = (a_1, \dots, a_N)^T$$

By setting the gradients of $L$ with respect to $\mathbf{w}$ and $b$ to zero, we get

$$\mathbf{w} = \sum_{n=1}^{N} a_n t_n \phi(\mathbf{x}_n), \qquad 0 = \sum_{n=1}^{N} a_n t_n$$

then we get the dual representation:

$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m)$$

subject to

$$a_n \geq 0, \qquad \sum_{n=1}^{N} a_n t_n = 0$$

where $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T \phi(\mathbf{x}')$ is the kernel function. Also, from the Lagrange multipliers, the KKT conditions are

$$a_n \geq 0, \qquad t_n y(\mathbf{x}_n) - 1 \geq 0, \qquad a_n \big\{ t_n y(\mathbf{x}_n) - 1 \big\} = 0$$

Now it becomes a quadratic programming problem that we can solve with established approaches to find the $a_n$'s.
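
As a rough sketch of what that looks like numerically (a toy, assumed example: linear kernel, four points, and scipy's general-purpose SLSQP solver instead of a dedicated QP routine), we can maximize $\tilde{L}(\mathbf{a})$ under the constraints above and then read off the support vectors, $\mathbf{w}$, and $b$ from the KKT conditions:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (assumed for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
N = len(t)

K = X @ X.T                                   # linear kernel: k(x_n, x_m) = x_n^T x_m
Q = (t[:, None] * t[None, :]) * K             # Q_nm = t_n t_m k(x_n, x_m)

def neg_dual(a):
    # negative of L~(a) = sum_n a_n - 1/2 sum_{n,m} a_n a_m t_n t_m k(x_n, x_m)
    return -(a.sum() - 0.5 * a @ Q @ a)

res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=[(0.0, None)] * N,                            # a_n >= 0
               constraints={"type": "eq", "fun": lambda a: a @ t})  # sum_n a_n t_n = 0

a = res.x
sv = a > 1e-6                                 # support vectors: a_n > 0
w = (a[sv] * t[sv]) @ X[sv]                   # w = sum_n a_n t_n phi(x_n)
b = np.mean(t[sv] - X[sv] @ w)                # from t_s y(x_s) = 1 on the margin
print("support vectors:", np.where(sv)[0], " w =", w, " b =", round(b, 3))
```

In practice, dedicated solvers such as SMO (used by LIBSVM and scikit-learn) are used instead of a general-purpose optimizer.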

Inference

$$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$$

and the new data point $\mathbf{x}$ is assigned to class $+1$ when $y(\mathbf{x}) > 0$ and to class $-1$ when $y(\mathbf{x}) < 0$.

Discussions on the Support Vectors

By the KKT conditions, when $a_n = 0$, $\mathbf{x}_n$ lies behind the correct margin and is not a support vector; when $a_n > 0$, $\mathbf{x}_n$ lies on the margin and is a support vector.

Alternative Formulation

We can express the maximum-margin classifier in terms of the minimization of an error function, with a simple quadratic regularizer, in the form

$$\sum_{n=1}^{N} E_\infty\big(y(\mathbf{x}_n) t_n - 1\big) + \lambda \|\mathbf{w}\|^2$$

where $E_\infty(z)$ is a function that is zero if $z \geq 0$ and $\infty$ otherwise. Note that as long as the regularization parameter satisfies $\lambda > 0$, its precise value plays no role.

For Overlapping Class Distributions

Defining the problem

For a binary classification problem, we have the function:

$$y(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + b, \qquad t_n = \begin{cases} 1 & \text{when } y(\mathbf{x}_n) > 0 \\ -1 & \text{when } y(\mathbf{x}_n) < 0 \end{cases}$$

To have a “soft margin”, i.e. to allow data points to be on the wrong side, we introduce slack variables $\xi_n \geq 0$, $n = 1, \dots, N$, one for each training data point, with $\xi_n = 0$ for data points on or inside the correct margin boundary and $\xi_n = |t_n - y(\mathbf{x}_n)|$ otherwise.

To find the plane with the maximum margin, we try to find $\mathbf{w}$ and $b$ by

$$\underset{\mathbf{w},\, b}{\arg\max} \left\{ \frac{1}{\|\mathbf{w}\|} \min_n \big[ t_n (\mathbf{w}^T \phi(\mathbf{x}_n) + b) \big] \right\}$$

[Figure: overlapping margin]

Derivation

By rescaling $\mathbf{w} \to \kappa\mathbf{w}$ and $b \to \kappa b$, for the points that are on the correct margin boundary we can set

$$t_n(\mathbf{w}^T\phi(\mathbf{x}_n)+b) = 1$$

then for all points,

$$t_n(\mathbf{w}^T\phi(\mathbf{x}_n)+b) \geq 1 - \xi_n, \qquad n = 1, \dots, N$$

then the problem is equivalent to

$$\underset{\mathbf{w},\, b}{\arg\min}\left( C\sum_{n=1}^{N} \xi_n + \frac{1}{2}\|\mathbf{w}\|^2 \right)$$

with constraints

$$t_n(\mathbf{w}^T\phi(\mathbf{x}_n)+b) \geq 1 - \xi_n, \qquad n = 1, \dots, N$$

where $C > 0$. When $C \to \infty$, it becomes the SVM for separable data.
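
As a quick, assumed illustration of the role of $C$ (a sketch with scikit-learn's SVC on toy Gaussian blobs, not data from the lecture): a larger $C$ penalizes the slack more heavily and approaches the hard-margin behavior, while a smaller $C$ tolerates more margin violations and typically leaves more support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs in 2-D (assumed toy data)
X = np.vstack([rng.normal(loc=-1.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=+1.0, scale=1.0, size=(50, 2))])
t = np.array([-1] * 50 + [1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, t)
    # Smaller C -> more slack allowed -> usually more support vectors
    print(f"C={C:>6}: {len(clf.support_)} support vectors")
```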

Then, by the method of Lagrange multipliers,

$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{n=1}^{N}\xi_n - \sum_{n=1}^{N} a_n \big\{ t_n y(\mathbf{x}_n) - 1 + \xi_n \big\} - \sum_{n=1}^{N} \mu_n \xi_n$$

and we get the KKT conditions:

$$a_n \geq 0, \qquad \mu_n \geq 0, \qquad t_n y(\mathbf{x}_n) - 1 + \xi_n \geq 0, \qquad \xi_n \geq 0,$$

$$a_n\big(t_n y(\mathbf{x}_n) - 1 + \xi_n\big) = 0, \qquad \mu_n \xi_n = 0.$$

Setting the derivatives of $L$ with respect to $\mathbf{w}$, $b$, and $\xi_n$ to zero,

$$\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{n=1}^{N} a_n t_n \phi(\mathbf{x}_n), \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{n=1}^{N} a_n t_n = 0, \qquad \frac{\partial L}{\partial \xi_n} = 0 \;\Rightarrow\; a_n = C - \mu_n$$

then

$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m)$$

This $\tilde{L}(\mathbf{a})$ is the same as for the separable-data SVM, and it is subject to

$$0 \leq a_n \leq C \quad \text{(the box constraint)}, \qquad \sum_{n=1}^{N} a_n t_n = 0$$

so it again becomes a quadratic programming problem that can be solved with established approaches.
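
A minimal sketch of the soft-margin dual on assumed toy data (linear kernel, scipy's SLSQP again): compared with the separable-data sketch earlier, the only change to the optimization problem is the box constraint $0 \leq a_n \leq C$.

```python
import numpy as np
from scipy.optimize import minimize

# Toy overlapping data (assumed): one point of each class sits on the wrong side
X = np.array([[2.0, 2.0], [-0.5, -0.5], [-2.0, -2.0], [0.5, 0.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])
N, C = len(t), 1.0

Q = (t[:, None] * t[None, :]) * (X @ X.T)           # linear kernel
neg_dual = lambda a: -(a.sum() - 0.5 * a @ Q @ a)   # same dual objective as before

res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=[(0.0, C)] * N,               # box constraint: 0 <= a_n <= C
               constraints={"type": "eq", "fun": lambda a: a @ t})
print(np.round(res.x, 3))   # points that end up with a_n = C violate the margin
```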

Inference

Identical to the separable data SVM,

$$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$$

and the new data point $\mathbf{x}$ is assigned to class $+1$ when $y(\mathbf{x}) > 0$ and to class $-1$ when $y(\mathbf{x}) < 0$.

Discussions on Support Vectors

For the training data points,

| Condition | Location | Support vector? |
| --- | --- | --- |
| $a_n = 0$ | inside the correct margin | No |
| $0 < a_n < C$ | on the margin | Yes |
| $a_n = C$ | outside the correct margin | Yes |

Alternative Formulation

$$\sum_{n=1}^{N} \max\big(1 - y(\mathbf{x}_n) t_n,\ 0\big) + \lambda \|\mathbf{w}\|^2$$

where the first term is the hinge loss.
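
A rough sketch of minimizing this objective directly by (sub)gradient descent on a toy problem (the data, learning rate, and $\lambda$ are my own assumptions; real solvers such as liblinear or SMO are far more careful):

```python
import numpy as np

def train_linear_svm(X, t, lam=0.01, lr=0.01, epochs=200):
    # Minimize sum_n max(1 - y(x_n) t_n, 0) + lam * ||w||^2 with y(x) = w.x + b
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(epochs):
        y = X @ w + b
        violated = (1 - t * y) > 0     # points that contribute to the hinge loss
        # Subgradient of the hinge term plus the gradient of the L2 regularizer
        grad_w = -(t[violated][:, None] * X[violated]).sum(axis=0) + 2 * lam * w
        grad_b = -t[violated].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage (assumed data)
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -2.0], [-1.0, -2.5]])
t = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, t)
print(np.sign(X @ w + b))   # expected: [ 1.  1. -1. -1.]
```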

Binary Classifier

Multiclass SVM

Although the application of SVMs to multiclass classification problems remains an open issue, in practice the one-versus-the-rest approach is the most widely used in spite of its ad-hoc formulation and its practical limitations.
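
A minimal sketch of one-versus-the-rest with scikit-learn (the Iris dataset here is just an assumed example): one binary SVM is trained per class against all remaining classes, and a new point goes to the class whose classifier gives the largest score.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# One binary SVM per class, each separating that class from all the others
X, y = load_iris(return_X_y=True)
clf = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
print(clf.predict(X[:5]))   # the first five iris samples all belong to class 0
```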

Prof. Lee’s Formulation by the 3 Steps of ML

[Figure: SVM formulated by the 3 steps of ML]

References

YouTube: ML Lecture 20: Support Vector Machine (SVM)

Course website

Pattern Recognition and Machine Learning, Christopher M. Bishop, 2006