
Setup:

When I was introduced to the idea of a support vector machine (SVM), we started with a binary linear classification problem written as: $$\hat{w} := \text{argmax}_{w \in \mathbb{R}^d}\{\frac{\text{margin}(w)}{||w||}\}$$

where $\text{margin}(w) := \min_{i = 1, \dots, n}\{y_i \cdot w^T x_i\}$ and the labels are $y_i \in \{-1, 1\}$.

This makes sense, as we intuitively want to maximize the minimum distance between the decision boundary and any point in our training set $D := \{(x_i, y_i)\}_{i = 1}^n$.
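
To make this concrete for myself, here is a tiny numerical sketch of $\text{margin}(w)$ and of the ratio $\frac{\text{margin}(w)}{||w||}$ (the toy data and names are my own):

```python
import numpy as np

# Toy linearly separable data: two points per class in R^2.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])

def functional_margin(w, X, y):
    # margin(w) = min_i y_i * w^T x_i
    return np.min(y * (X @ w))

def geometric_margin(w, X, y):
    # margin(w) / ||w||: the smallest distance of any training point to the
    # hyperplane w^T x = 0, assuming margin(w) > 0 (i.e. w separates the data).
    return functional_margin(w, X, y) / np.linalg.norm(w)

w = np.array([1.0, 1.0])               # some separating direction
print(functional_margin(w, X, y))      # 3.0
print(geometric_margin(w, X, y))       # ~2.12
print(geometric_margin(5.0 * w, X, y)) # ~2.12: rescaling w changes margin(w) but not the ratio
```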

Naturally, scaling the vector $w$ by some positive scalar $\alpha$ does not change the fraction, i.e.: $$\forall \alpha > 0.\ \frac{\text{margin}(\alpha w)}{||\alpha w||} = \frac{\text{margin}(w)}{||w||}$$

Armed with this knowledge, we set $\alpha := \frac{1}{\text{margin}(w)}$, which in turn simplifies the fraction to $\frac{1}{||\alpha w||}$.
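
Spelling this step out (assuming $\text{margin}(w) > 0$, i.e. $w$ separates the data, so that $\alpha > 0$):

$$\text{margin}(\alpha w) = \min_{i = 1, \dots, n}\{y_i \cdot (\alpha w)^T x_i\} = \alpha \cdot \text{margin}(w) = 1, \quad \text{and therefore} \quad \frac{\text{margin}(\alpha w)}{||\alpha w||} = \frac{1}{||\alpha w||}$$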

Now, my confusion starts:

After setting $\tilde{w} := \alpha \cdot w$, we define the hard-margin SVM as the problem:

$$ \min_{\tilde{w}} ||\tilde{w}||, \text{ s.t.} \\ y_i \cdot \tilde{w}^T x_i \geq 1 \text{ for } i = 1, \dots, n $$
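
For concreteness, this is how I would write the problem down numerically on toy separable data (cvxpy and the variable names are my own choice, not from the lecture):

```python
import numpy as np
import cvxpy as cp  # my choice of solver; any QP solver should do

# Same toy data as above; no bias term, matching the formulation here.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
# Minimizing ||w||^2 is equivalent to minimizing ||w||.
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                     [cp.multiply(y, X @ w) >= 1])  # y_i * w^T x_i >= 1 for all i
problem.solve()

w_hat = w.value
print(np.min(y * (X @ w_hat)))      # smallest y_i * w^T x_i is 1 (up to solver tolerance)
print(1.0 / np.linalg.norm(w_hat))  # the geometric margin achieved by the solution
```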

Problem:

I don't get the constraint of the problem, i.e. $ y_i \cdot \tilde{w}^T x_i \geq 1 \text{ for } i = 1, \dots, n$. I see that the condition comes from the fact that the margin is the minimum over all training points and that, by our choice of $\alpha$, we get $\text{margin}(\tilde{w}) = \text{margin}\left(\frac{w}{\text{margin}(w)}\right) = 1$. Nonetheless, if I were asked to explain the condition in words, I would have no idea what to say.

Question: What does the constraint of the problem formulation achieve? Why do we need it and what is the intuition behind it?

N.B.: I have checked online, but the resources I found use a different way of motivating/deriving SVMs (they take two parallel hyperplanes and use the hyperplane midway between them as the decision boundary), so they don't answer my question.


1 Answer


A perfect Linear Classifier on Linearly Separable Data satisfies:

$$ \begin{cases} \boldsymbol{w}^{T} \boldsymbol{x}_{i} + b > 0 & \text{if}\; {y}_{i} = 1 \\ \boldsymbol{w}^{T} \boldsymbol{x}_{i} + b < 0 & \text{if}\; {y}_{i} = -1 \end{cases} \Rightarrow {y}_{i} \left( \boldsymbol{w}^{T} \boldsymbol{x}_{i} + b \right) > 0 $$

The Hard SVM requires ${y}_{i} \left( \boldsymbol{w}^{T} \boldsymbol{x}_{i} + b \right) \geq 1$.
The specific number $1$ is arbitrary: one can always rescale $\boldsymbol{w}$ and $b$ so that the right-hand side is any positive number, while the classifier stays the same.
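
Concretely: given linearly separable data, set $\gamma := \min_{i} {y}_{i} \left( \boldsymbol{w}^{T} \boldsymbol{x}_{i} + b \right) > 0$. Then $\boldsymbol{w}' := \boldsymbol{w} / \gamma$ and $b' := b / \gamma$ define the same decision boundary and satisfy

$$ {y}_{i} \left( \boldsymbol{w}'^{T} \boldsymbol{x}_{i} + b' \right) = \frac{ {y}_{i} \left( \boldsymbol{w}^{T} \boldsymbol{x}_{i} + b \right) }{\gamma} \geq 1 \quad \text{for all } i. $$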

Given that, it only remains to use the geometric argument to explain why maximizing $\frac{1}{ {\left\| \boldsymbol{w} \right\|}_{2} }$, i.e. minimizing ${\left\| \boldsymbol{w} \right\|}_{2}^{2}$, yields the maximum margin, and you have derived the Hard SVM.
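
A quick numerical illustration of the scaling argument (the toy data and numbers are arbitrary):

```python
import numpy as np

# Rescaling (w, b) by a positive constant leaves the predicted labels unchanged,
# but lets us make min_i y_i (w^T x_i + b) equal to exactly 1.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.5

scores = X @ w + b
print(np.sign(scores))              # predictions of the original classifier

gamma = np.min(y * scores)          # smallest functional margin; > 0 on separable data
w_s, b_s = w / gamma, b / gamma     # rescaled classifier
print(np.sign(X @ w_s + b_s))       # identical predictions
print(np.min(y * (X @ w_s + b_s)))  # now exactly 1.0
```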

