
Setup:

When I was introduced to the idea of a support vector machine (SVM), we started with a binary linear classification problem written as: $$\hat{w} := \text{argmax}_{w \in \mathbb{R}^d}\{\frac{\text{margin}(w)}{||w||}\}$$

where $\text{margin}(w) := \min_{i = 1, \dots, n}\{y_i \cdot w^T x_i\}$ and the labels are $y_i \in \{-1, 1\}$.

This makes sense, as we intuitively want to maximize the minimum distance between the decision boundary and any point in our training set $D := \{(x_i, y_i)\}_{i = 1}^n$.
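
To make this concrete for myself, here is a tiny numerical sketch of $\text{margin}(w)$ and of the ratio $\frac{\text{margin}(w)}{||w||}$ (the toy data and names are my own):

```python
import numpy as np

# Toy linearly separable data: two points per class in R^2.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])

def functional_margin(w, X, y):
    # margin(w) = min_i y_i * w^T x_i
    return np.min(y * (X @ w))

def geometric_margin(w, X, y):
    # margin(w) / ||w||: the smallest distance of any training point to the
    # hyperplane w^T x = 0, assuming margin(w) > 0 (i.e. w separates the data).
    return functional_margin(w, X, y) / np.linalg.norm(w)

w = np.array([1.0, 1.0])               # some separating direction
print(functional_margin(w, X, y))      # 3.0
print(geometric_margin(w, X, y))       # ~2.12
print(geometric_margin(5.0 * w, X, y)) # ~2.12: rescaling w changes margin(w) but not the ratio
```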

Naturally, scaling the vector $w$ by some positive scalar $\alpha$ does not change the fraction, i.e.: $$\forall \alpha > 0.\ \frac{\text{margin}(\alpha w)}{||\alpha w||} = \frac{\text{margin}(w)}{||w||}$$

Armed with this knowledge, we set $\alpha := \frac{1}{\text{margin}(w)}$, which in turn simplifies the fraction to $\frac{1}{||\alpha w||}$.
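
Spelling this step out (assuming $\text{margin}(w) > 0$, i.e. $w$ separates the data, so that $\alpha > 0$):

$$\text{margin}(\alpha w) = \min_{i = 1, \dots, n}\{y_i \cdot (\alpha w)^T x_i\} = \alpha \cdot \text{margin}(w) = 1, \quad \text{and therefore} \quad \frac{\text{margin}(\alpha w)}{||\alpha w||} = \frac{1}{||\alpha w||}$$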

Now, my confusion starts:

After setting $\tilde{w} := \alpha \cdot w$, we define the hard-margin SVM as the problem:

$$ \min_{\tilde{w}} ||\tilde{w}||, \text{ s.t.} \\ y_i \cdot \tilde{w}^T x_i \geq 1 \text{ for } i = 1, \dots, n $$
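
For concreteness, this is how I would write the problem down numerically on toy separable data (cvxpy and the variable names are my own choice, not from the lecture):

```python
import numpy as np
import cvxpy as cp  # my choice of solver; any QP solver should do

# Same toy data as above; no bias term, matching the formulation here.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
# Minimizing ||w||^2 is equivalent to minimizing ||w||.
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                     [cp.multiply(y, X @ w) >= 1])  # y_i * w^T x_i >= 1 for all i
problem.solve()

w_hat = w.value
print(np.min(y * (X @ w_hat)))      # smallest y_i * w^T x_i is 1 (up to solver tolerance)
print(1.0 / np.linalg.norm(w_hat))  # the geometric margin achieved by the solution
```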

Problem:

I don't get the constraint of the problem, i.e. $ y_i \cdot \tilde{w}^T x_i \geq 1 \text{ for } i = 1, \dots, n$. I see that the condition comes from the fact that the margin is the minimum over all training points and that, by our choice of $\alpha$, we get $\text{margin}(\tilde{w}) = \text{margin}\left(\frac{w}{\text{margin}(w)}\right) = 1$. Nonetheless, if I were asked to explain the condition in words, I would have no idea what to say.

Question: What does the constraint of the problem formulation achieve? Why do we need it and what is the intuition behind it?

N.B.: I have checked online, but the resources I found use a different way of motivating/deriving SVMs (they take two parallel hyperplanes and use the hyperplane midway between them as the decision boundary), so they don't answer my question.


1 Answer


A perfect Linear Classifier on Linearly Separable Data satisfies:

$$ \begin{cases} \boldsymbol{w}^{T} \boldsymbol{x}_{i} + b > 0 & \text{if}\; {y}_{i} = 1 \\ \boldsymbol{w}^{T} \boldsymbol{x}_{i} + b < 0 & \text{if}\; {y}_{i} = -1 \end{cases} \Rightarrow {y}_{i} \left( \boldsymbol{w}^{T} \boldsymbol{x}_{i} + b \right) > 0 $$

The Hard SVM requires ${y}_{i} \left( \boldsymbol{w}^{T} \boldsymbol{x}_{i} + b \right) \geq 1$.
The specific number $1$ is arbitrary: one can always rescale $\boldsymbol{w}$ and $b$ so that the right-hand side is any positive number, while the classifier stays the same.
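
Concretely: given linearly separable data, set $\gamma := \min_{i} {y}_{i} \left( \boldsymbol{w}^{T} \boldsymbol{x}_{i} + b \right) > 0$. Then $\boldsymbol{w}' := \boldsymbol{w} / \gamma$ and $b' := b / \gamma$ define the same decision boundary and satisfy

$$ {y}_{i} \left( \boldsymbol{w}'^{T} \boldsymbol{x}_{i} + b' \right) = \frac{ {y}_{i} \left( \boldsymbol{w}^{T} \boldsymbol{x}_{i} + b \right) }{\gamma} \geq 1 \quad \text{for all } i. $$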

Given that, it only remains to use the geometric argument to explain why maximizing $\frac{1}{ {\left\| \boldsymbol{w} \right\|}_{2} }$, i.e. minimizing ${\left\| \boldsymbol{w} \right\|}_{2}^{2}$, yields the maximum margin, and you have derived the Hard SVM.
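
A quick numerical illustration of the scaling argument (the toy data and numbers are arbitrary):

```python
import numpy as np

# Rescaling (w, b) by a positive constant leaves the predicted labels unchanged,
# but lets us make min_i y_i (w^T x_i + b) equal to exactly 1.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.5

scores = X @ w + b
print(np.sign(scores))              # predictions of the original classifier

gamma = np.min(y * scores)          # smallest functional margin; > 0 on separable data
w_s, b_s = w / gamma, b / gamma     # rescaled classifier
print(np.sign(X @ w_s + b_s))       # identical predictions
print(np.min(y * (X @ w_s + b_s)))  # now exactly 1.0
```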

