Notes for Prof. Hung-Yi Lee's ML Lecture: Support Vector Machine
My Intuitive Explanation
The overall computation is equivalent to:
- transform the features to a high-dimensional (often infinite-dimensional) space.
- find a hyperplane that separates the 2 classes by minimizing the hinge loss.
- use the plane to classify new data.
with the tricks:
- By the kernel method, when classifying new data, we calculate the kernel function between the new data point and all the training data in the original space, so we don't have to perform computations in the high-dimensional space.
- By the kernel method and the hinge loss, we only have to evaluate the kernel function between the new data point and a few training data points near the classification boundary. These training data points are the support vectors. (A minimal sketch follows below.)
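Below is a minimal sketch of this intuition using scikit-learn's `SVC` (the library choice and the toy data are assumptions of these notes, not part of the lecture): the RBF kernel corresponds to an infinite-dimensional feature map, yet training and prediction only evaluate the kernel in the original space, and prediction depends only on the support vectors.

```python
# A minimal sketch of the intuition above, using scikit-learn's SVC
# (an assumed implementation choice; the toy data is illustrative).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.6, size=(50, 2)),
               rng.normal(+1.0, 0.6, size=(50, 2))])
t = np.array([-1] * 50 + [+1] * 50)

# The RBF kernel corresponds to an infinite-dimensional feature map phi(x),
# but fitting and predicting only ever evaluate k(x, x') in the original space.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, t)

# Prediction depends only on the support vectors, not on the whole training set.
print("number of support vectors:", clf.support_vectors_.shape[0])
print("prediction:", clf.predict([[0.2, -0.3]]))
```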
Traditional Formulation
I refer to Bishop Chap.7.1 for this part.
For Separable Class Distributions
Defining the Problem
For a binary classification problem, we have the function:
$$y(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + b, \qquad t_n = \begin{cases} 1 & \text{when } y(\mathbf{x}_n) > 0 \\ -1 & \text{when } y(\mathbf{x}_n) < 0 \end{cases}$$

We assume the data to be perfectly separable, then

$$t_n\, y(\mathbf{x}_n) > 0$$

To find the plane with maximum margin, we try to find $\mathbf{w}, b$ by

$$\underset{\mathbf{w},\, b}{\arg\max} \left\{ \frac{1}{\lVert \mathbf{w} \rVert} \min_n \left[ t_n \left( \mathbf{w}^T \phi(\mathbf{x}_n) + b \right) \right] \right\}$$
Derivations
By rescaling $\mathbf{w} \to \kappa \mathbf{w}$ and $b \to \kappa b$, for the points that are closest to the boundary, we can set

$$t_n \left( \mathbf{w}^T \phi(\mathbf{x}_n) + b \right) = 1$$

then for all points,

$$t_n \left( \mathbf{w}^T \phi(\mathbf{x}_n) + b \right) \ge 1, \quad n = 1, \dots, N$$

Then the problem is equivalent to

$$\underset{\mathbf{w},\, b}{\arg\min}\ \frac{1}{2} \lVert \mathbf{w} \rVert^2$$

with the constraints

$$t_n \left( \mathbf{w}^T \phi(\mathbf{x}_n) + b \right) \ge 1, \quad n = 1, \dots, N$$

Then by the method of Lagrange multipliers, we get the Lagrangian

$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2} \lVert \mathbf{w} \rVert^2 - \sum_{n=1}^{N} a_n \left\{ t_n \left( \mathbf{w}^T \phi(\mathbf{x}_n) + b \right) - 1 \right\}, \qquad a_n \ge 0, \quad \mathbf{a} = (a_1, \dots, a_N)^T$$

Setting the gradients of $L$ with respect to $\mathbf{w}$ and $b$ to zero gives

$$\mathbf{w} = \sum_{n=1}^{N} a_n t_n \phi(\mathbf{x}_n), \qquad 0 = \sum_{n=1}^{N} a_n t_n$$

Substituting these back, we get the dual representation:

$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m)$$

which is subject to

$$a_n \ge 0, \qquad \sum_{n=1}^{N} a_n t_n = 0$$

where $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T \phi(\mathbf{x}')$ is the kernel function. Also, from the Lagrange multipliers, the KKT conditions are

$$a_n \ge 0, \qquad t_n y(\mathbf{x}_n) - 1 \ge 0, \qquad a_n \left\{ t_n y(\mathbf{x}_n) - 1 \right\} = 0$$

Now it becomes a quadratic programming problem that we can solve by established approaches to find the $a_n$'s.
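As a rough illustration only (a toy setup assumed here, not the solver used in practice, where dedicated QP or SMO solvers are employed), the dual can be handed to a generic constrained optimizer:

```python
# A minimal sketch of solving the separable-case dual QP with a generic
# optimizer (scipy is an assumed choice; the toy data is illustrative).
import numpy as np
from scipy.optimize import minimize

# Toy separable data.
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -2.0], [-1.5, -2.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])
N = len(t)

K = X @ X.T                                # linear kernel k(x_n, x_m) = x_n^T x_m
Q = (t[:, None] * t[None, :]) * K          # Q_{nm} = t_n t_m k(x_n, x_m)

def neg_dual(a):
    # Minimize -L~(a) = -sum(a) + 0.5 a^T Q a.
    return -a.sum() + 0.5 * a @ Q @ a

res = minimize(neg_dual, x0=np.zeros(N), method="SLSQP",
               bounds=[(0.0, None)] * N,                              # a_n >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ t}])  # sum_n a_n t_n = 0

a = res.x
sv = a > 1e-6                        # support vectors have a_n > 0
w = (a[sv] * t[sv]) @ X[sv]          # w = sum_n a_n t_n phi(x_n), phi = identity here
b = np.mean(t[sv] - X[sv] @ w)       # b from t_n y(x_n) = 1 on the support vectors
print("a =", np.round(a, 3), " w =", w, " b =", round(b, 3))
```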
Inference
$$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$$

and a new point $\mathbf{x}$ is assigned to class $1$ when $y(\mathbf{x}) > 0$ and to class $-1$ when $y(\mathbf{x}) < 0$.

Discussions on the Support Vectors
By the KKT conditions, when $a_n = 0$, $\mathbf{x}_n$ lies beyond the margin on the correct side and is not a support vector; when $a_n > 0$, $\mathbf{x}_n$ lies exactly on the margin and is a support vector.
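Continuing the dual-QP sketch above (the names `X`, `t`, `a`, `b`, and `sv` come from that sketch and are illustrative), inference only touches the support vectors:

```python
# Continues the dual QP sketch above: X, t, a, b, sv are defined there.
import numpy as np

def predict(x_new):
    # y(x) = sum over support vectors of a_n t_n k(x, x_n) + b.
    k = X[sv] @ x_new                        # linear kernel values k(x_n, x_new)
    return np.sign((a[sv] * t[sv]) @ k + b)

print(predict(np.array([1.0, 0.5])))         # expected +1 for the toy data above
```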
Alternative Formulation
We can express the maximum-margin classifier in terms of the minimization of an error function, with a simple quadratic regularizer, in the form
$$\sum_{n=1}^{N} E_\infty \left( y(\mathbf{x}_n) t_n - 1 \right) + \lambda \lVert \mathbf{w} \rVert^2$$

where $E_\infty(z)$ is a function that is $0$ if $z \ge 0$ and $\infty$ otherwise. Note that as long as the regularization parameter satisfies $\lambda > 0$, its precise value plays no role.
For Overlapping Class Distributions
Defining the Problem
For a binary classification problem, we have the function:

$$y(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + b, \qquad t_n = \begin{cases} 1 & \text{when } y(\mathbf{x}_n) > 0 \\ -1 & \text{when } y(\mathbf{x}_n) < 0 \end{cases}$$

To have a "soft margin", i.e. to allow data points to fall on the wrong side of the margin, we introduce a slack variable $\xi_n \ge 0$, $n = 1, \dots, N$, for each training data point, where $\xi_n = 0$ for data points on or inside the correct margin boundary and $\xi_n = |t_n - y(\mathbf{x}_n)|$ otherwise.
To find the plane with maximum margin, we try to find w,b by
$$\underset{\mathbf{w},\, b}{\arg\max} \left\{ \frac{1}{\lVert \mathbf{w} \rVert} \min_n \left[ t_n \left( \mathbf{w}^T \phi(\mathbf{x}_n) + b \right) \right] \right\}$$
Derivation
By rescaling $\mathbf{w} \to \kappa \mathbf{w}$ and $b \to \kappa b$, for the points that are on the correct margin boundary, we can set

$$t_n \left( \mathbf{w}^T \phi(\mathbf{x}_n) + b \right) = 1$$

then for all points,

$$t_n \left( \mathbf{w}^T \phi(\mathbf{x}_n) + b \right) \ge 1 - \xi_n, \quad n = 1, \dots, N$$

Then the problem is equivalent to

$$\underset{\mathbf{w},\, b}{\arg\min} \left( C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \lVert \mathbf{w} \rVert^2 \right)$$

with the constraints

$$t_n \left( \mathbf{w}^T \phi(\mathbf{x}_n) + b \right) \ge 1 - \xi_n, \quad n = 1, \dots, N$$

where $C > 0$. When $C \to \infty$, it recovers the SVM for separable data.

Then by the method of Lagrange multipliers,

$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2} \lVert \mathbf{w} \rVert^2 + C \sum_{n=1}^{N} \xi_n - \sum_{n=1}^{N} a_n \left\{ t_n y(\mathbf{x}_n) - 1 + \xi_n \right\} - \sum_{n=1}^{N} \mu_n \xi_n$$

and we get the KKT conditions:

$$a_n \ge 0, \quad \mu_n \ge 0, \quad t_n y(\mathbf{x}_n) - 1 + \xi_n \ge 0, \quad \xi_n \ge 0, \quad a_n \left( t_n y(\mathbf{x}_n) - 1 + \xi_n \right) = 0, \quad \mu_n \xi_n = 0$$

To optimize $\mathbf{w}$, $b$, $\xi_n$,

$$\frac{\partial L}{\partial \mathbf{w}} = 0 \Rightarrow \mathbf{w} = \sum_{n=1}^{N} a_n t_n \phi(\mathbf{x}_n), \qquad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{n=1}^{N} a_n t_n = 0, \qquad \frac{\partial L}{\partial \xi_n} = 0 \Rightarrow a_n = C - \mu_n$$

then

$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m)$$

This $\tilde{L}$ is the same as for the separable-data SVM. It is subject to

$$0 \le a_n \le C \ \text{(the box constraints)}, \qquad \sum_{n=1}^{N} a_n t_n = 0$$

so it again becomes a quadratic programming problem that can be solved by established approaches.
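In code, relative to the dual-QP sketch for the separable case, only the bounds on each $a_n$ change (a sketch continuing that example; the value of `C` is illustrative):

```python
# Continues the dual QP sketch above; the soft-margin dual differs only in
# the box constraint 0 <= a_n <= C (C = 1.0 is an illustrative choice).
C = 1.0
res = minimize(neg_dual, x0=np.zeros(N), method="SLSQP",
               bounds=[(0.0, C)] * N,                                 # 0 <= a_n <= C
               constraints=[{"type": "eq", "fun": lambda a: a @ t}])  # sum_n a_n t_n = 0
a = res.x
print("on the margin (0 < a_n < C):", np.flatnonzero((a > 1e-6) & (a < C - 1e-6)))
print("inside the margin or misclassified (a_n = C):", np.flatnonzero(a > C - 1e-6))
```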
Inference
Identical to the separable data SVM,
$$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$$

and a new point $\mathbf{x}$ is assigned to class $1$ when $y(\mathbf{x}) > 0$ and to class $-1$ when $y(\mathbf{x}) < 0$.

Discussions on Support Vectors
For the training data points,
| Condition | Location | Support vector? |
| --- | --- | --- |
| $a_n = 0$ | on the correct side, beyond the margin | No |
| $0 < a_n < C$ | exactly on the margin | Yes |
| $a_n = C$ | inside the margin, or on the wrong side | Yes |
Alternative Formulation
$$\sum_{n=1}^{N} \max \left( 1 - y(\mathbf{x}_n) t_n,\ 0 \right) + \lambda \lVert \mathbf{w} \rVert^2$$

where the first term is the hinge loss.
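A minimal sketch of minimizing this hinge-loss objective directly, assuming a linear model $y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$ and toy data (both illustrative assumptions), by subgradient descent:

```python
# A toy subgradient-descent sketch for sum_n max(1 - y(x_n) t_n, 0) + lambda ||w||^2
# with a linear model; illustrative only, not the lecture's implementation.
import numpy as np

def fit_hinge(X, t, lam=0.01, lr=0.01, epochs=500):
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(epochs):
        margin = 1.0 - t * (X @ w + b)
        active = margin > 0                  # points with non-zero hinge loss
        # Subgradient of the objective with respect to w and b.
        grad_w = -(t[active, None] * X[active]).sum(axis=0) + 2.0 * lam * w
        grad_b = -t[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -2.0], [-1.5, -2.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])
w, b = fit_hinge(X, t)
print(np.sign(X @ w + b))                    # should match t for this toy set
```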
Binary Classifier
Multiclass SVM
Although the application of SVMs to multiclass classification problems remains an open issue, in practice the one-versus-the-rest approach is the most widely used in spite of its ad-hoc formulation and its practical limitations.
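A minimal sketch of the one-versus-the-rest approach with scikit-learn (the library and the iris dataset are illustrative assumptions): one binary SVM is trained per class, each separating that class from all the others.

```python
# One-versus-the-rest multiclass SVM: one binary SVC per class.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)
print(clf.predict(X[:5]), y[:5])
```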
Prof. Lee’s Formulation by the 3 Steps of ML
References
ML Lecture 20: Support Vector Machine (SVM), Hung-Yi Lee, YouTube
Pattern Recognition and Machine Learning, Christopher M. Bishop, Springer, 2006