Support Vector Machines

kittycat
5 min read · Oct 23, 2021

Support Vector Machine (SVM) is a supervised machine learning algorithm and is very popular among data scientists.

The main idea behind SVM is to draw a boundary between two or more classes in the best possible way. Once that boundary is drawn to separate the classes, you can use it to classify future data. The goal is to find the widest possible margin that separates the two groups; the decision boundary running down the center of the margin is called the hyperplane.

The hyperplane is the line separating the two groups of points. We use the term “hyperplane” instead of “line” because in SVM we typically deal with more than two dimensions, and using the word “hyperplane” more accurately conveys the idea of a plane in a multidimensional space.

Support Vectors

A key term in SVM is support vectors. Support vectors are the training points that lie on the edges of the margin, closest to the decision boundary.

In the simplest case, there are two support vectors, one for each class.

Now let’s work on an example to see how SVM works and how to implement it using Scikit-learn.

Let’s take a look at the following synthetic sample and plot a two-dimensional scatter plot:
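The original data-generation code is not shown here, so as a minimal sketch, assume a blobs-style synthetic dataset relabeled into two classes (the dataset and variable names below are assumptions, not the article's exact code):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Hypothetical stand-in for the synthetic sample: four blobs relabeled into
# two classes, which gives a two-class problem that is not linearly separable.
X, y = make_blobs(centers=4, random_state=8)
y = y % 2

plt.scatter(X[:, 0], X[:, 1], c=y, cmap="bwr", edgecolors="k")
plt.xlabel("feature0")
plt.ylabel("feature1")
plt.show()
```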

Two-class classification dataset in which classes are not linearly separable.

Let's try to fit the data with LinearSVC:
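A sketch of that fit, continuing from the X and y arrays in the snippet above:

```python
from sklearn.svm import LinearSVC

# Fit a linear SVM to the two-dimensional data.
linear_svm = LinearSVC(max_iter=10000).fit(X, y)

print("Coefficients:", linear_svm.coef_)   # one weight per input feature
print("Intercept:", linear_svm.intercept_)
```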

Decision boundary found by linear SVM

A linear model for classification can only separate points using a line, and will not be able to do a very good job on this dataset.

Now let's expand the set of input features by adding a third feature, feature1 ** 2 (the square of the second feature), and plot a three-dimensional scatter plot. Instead of representing each data point as a two-dimensional point, (feature0, feature1), we now represent it as a three-dimensional point, (feature0, feature1, feature1 ** 2).
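A sketch of that expansion and the three-dimensional plot, again assuming the X and y arrays from the earlier snippets:

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (needed on older matplotlib)

# Append the square of the second feature as a third column.
X_new = np.hstack([X, X[:, 1:] ** 2])

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X_new[:, 0], X_new[:, 1], X_new[:, 2], c=y, cmap="bwr", edgecolors="k")
ax.set_xlabel("feature0")
ax.set_ylabel("feature1")
ax.set_zlabel("feature1 ** 2")
plt.show()
```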

Expansion of the dataset by creating a third feature from feature 1

Now let's fit a LinearSVC model to the new three-dimensional data:
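A sketch of the fit on the expanded data (variable names carried over from the snippets above):

```python
from sklearn.svm import LinearSVC

# Fit a linear SVM to the expanded three-dimensional data.
linear_svm_3d = LinearSVC(max_iter=10000).fit(X_new, y)

# In three dimensions the learned boundary is a plane:
# coef[0]*feature0 + coef[1]*feature1 + coef[2]*(feature1 ** 2) + intercept = 0
print("Coefficients:", linear_svm_3d.coef_)
print("Intercept:", linear_svm_3d.intercept_)
```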

Decision boundary found by linear SVM on the new three-dimensional dataset

As a function of the original features, the linear SVM model is no longer actually linear. The decision boundary is not a line but more of an ellipse, which you can see by projecting it back onto the original two features:
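One way to sketch that projection is to evaluate the plane equation learned by the three-dimensional model over a grid of the original two features and draw its zero level set (names again carried over from the snippets above):

```python
# Evaluate w0*f0 + w1*f1 + w2*f1**2 + b on a grid of the original two features;
# its zero level set is the ellipse-like boundary in the original feature space.
xx = np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200)
yy = np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200)
XX, YY = np.meshgrid(xx, yy)
coef = linear_svm_3d.coef_.ravel()
ZZ = coef[0] * XX + coef[1] * YY + coef[2] * YY ** 2 + linear_svm_3d.intercept_

plt.contour(XX, YY, ZZ, levels=[0], colors="k")
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="bwr", edgecolors="k")
plt.xlabel("feature0")
plt.ylabel("feature1")
plt.show()
```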

Kernel trick

Adding nonlinear features to the representation of our data can make linear models much more powerful. However, often we don’t know which features to add, and adding many features (like all possible interactions in a 100-dimensional feature space) might make computation very expensive.

Luckily, there is a clever mathematical trick that allows us to learn a classifier in a higher-dimensional space without actually computing the new, possibly very large representation. This is known as the kernel trick, and it works by directly computing the distance (more precisely, the scalar products) of the data points for the expanded feature representation, without ever actually computing the expansion.
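As a toy illustration (not from the original article), here is a check that a degree-2 polynomial kernel on two 2-D points gives the same value as a dot product in an explicitly expanded 6-dimensional feature space, which is exactly the computation the kernel trick avoids:

```python
import numpy as np

# Explicit 6-dimensional expansion corresponding to the degree-2 polynomial kernel.
def expand(p):
    x1, x2 = p
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

kernel_value = (x @ z + 1) ** 2          # computed directly in 2 dimensions
explicit_value = expand(x) @ expand(z)   # computed via the explicit expansion

print(kernel_value, explicit_value)      # both print 25.0
```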

There are two ways to map your data into a higher-dimensional space that are commonly used with support vector machines:

The first is the polynomial kernel, which computes all possible polynomials of the original features up to a certain degree (like feature1 ** 2 * feature2 ** 5).

The second is the radial basis function (RBF) kernel, also known as the Gaussian kernel. The Gaussian kernel is a bit harder to explain, as it corresponds to an infinite-dimensional feature space. One way to think about it is that it considers all possible polynomials of all degrees, but the importance of the features decreases for higher degrees.
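To tie this back to the earlier example, a kernelized SVM can be fit directly to the original two-dimensional data without any explicit feature expansion. The C and gamma values below are illustrative choices, not tuned values from the article:

```python
from sklearn.svm import SVC

# RBF-kernel SVM on the original two-dimensional data; the kernel trick lets it
# learn a nonlinear boundary without ever building the expanded representation.
svm = SVC(kernel="rbf", C=10, gamma=0.1).fit(X, y)

# The support vectors are the training points that define the margin.
print("Support vectors per class:", svm.n_support_)
```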

Conclusion

Kernelized support vector machines are powerful models and perform well on a variety of datasets. SVMs allow for complex decision boundaries, even if the data has only a few features. They work well on low-dimensional and high-dimensional data (i.e., few and many features), but don’t scale very well with the number of samples. Running an SVM on data with up to 10,000 samples might work well, but working with datasets of size 100,000 or more can become challenging in terms of runtime and memory usage.

Another downside of SVMs is that they require careful preprocessing of the data and tuning of the parameters. This is why, these days, most people instead use tree-based models such as random forests or gradient boosting (which require little or no preprocessing) in many applications. Furthermore, SVM models are hard to inspect; it can be difficult to understand why a particular prediction was made, and it might be tricky to explain the model to a nonexpert.
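As one sketch of what that preprocessing typically looks like in practice, feature scaling is usually bundled with the SVM in a pipeline (the parameter values are again illustrative):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Rescale every feature to [0, 1] before fitting; without this step, features
# with large numeric ranges dominate the kernel distances.
scaled_svm = make_pipeline(MinMaxScaler(), SVC(kernel="rbf", C=10, gamma=0.1))
scaled_svm.fit(X, y)
print("Training accuracy:", scaled_svm.score(X, y))
```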

Still, it might be worth trying SVMs, particularly if all of your features represent measurements in similar units (e.g., all are pixel intensities) and they are on similar scales.
