
I am interested in one result in the first version of the paper titled "On the Margin Theory of Feedforward Neural Networks" by Colin Wei, Jason D. Lee, Qiang Liu and Tengyu Ma.

In Equation B.1 of the appendix (page 17), the authors introduce the following optimization problem (which relates to the maximum margin of an $\ell_1$-SVM classifier): $$\begin{align} \min_{\alpha\in\mathcal L^1(\mathbb S^{d-1})} &\|\alpha\|_1 = \int_{u\in\mathbb S^{d-1}} |\alpha(u)|du \\ \text{subject to }\, &y_i\langle \alpha,\varphi(x_i)\rangle \ge 1 \quad \forall i \in [\![ 1;n ]\!] \end{align} \tag1$$ Here, $\mathbb S^{d-1}$ is the unit sphere in $\mathbb R^d$, and $\mathcal L^1(\mathbb S^{d-1})$ is the set of real-valued functions on $\mathbb S^{d-1}$ whose absolute value is Lebesgue integrable.
For all $i \in [\![ 1;n ]\!]$, the pair $(x_i,y_i)\in\mathbb R^d \times \{-1,1\}$ is a datapoint together with its label, while $\varphi$ is a "lifting function" mapping any datapoint $x_i$ to the infinite-dimensional space $\mathcal L^\infty(\mathbb S^{d-1})$, so that $\varphi(x_i)\in\mathcal L^\infty(\mathbb S^{d-1})$.
Lastly, $\langle\cdot,\cdot\rangle$ denotes the inner product $\langle f,g\rangle =\int_{u\in\mathbb S^{d-1}} f(u)g(u)du$.

In Claim B.3 of the paper (page 17), the authors claim that the dual of $(1)$ has the following form: $$ \begin{align} \max_{\lambda\in\mathbb R^n} &\sum_{i=1}^n \lambda_i \\ \text{subject to } &\left\lvert\sum_{i=1}^n \lambda_i y_i \phi(u^Tx_i)\right\rvert\le 1\quad \forall u\in\mathbb S^{d-1}\\ &\lambda_i \ge 0 \quad \forall i \in [\![ 1;n ]\!] \end{align} \tag2 $$ where $\phi$ is such that $\varphi(x)[u] = \phi(u^Tx)$ for all $u \in \mathbb S^{d-1}$ and all $x \in \mathbb R^d$.
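
As a sanity check (this is my own sketch, not something from the paper), one can discretize the sphere into a finite grid of directions, pick a concrete lifting ($\phi=\mathrm{ReLU}$ below, purely as an example; the paper's specific choice is not restated here), generate toy separable data, and solve the two resulting finite linear programs with `scipy.optimize.linprog`. The linear-programming dual of the discretized $(1)$ is exactly the discretized $(2)$, so the two optimal values should coincide up to discretization and solver tolerance:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
phi = lambda t: np.maximum(t, 0.0)   # assumed lifting (ReLU); any other choice of phi works here

# toy, linearly separable data in R^2 with labels in {-1, +1}
n, m = 20, 400                       # n data points, m grid directions on the circle
X = rng.normal(size=(n, 2))
y = np.sign(X[:, 0])

theta = np.linspace(0.0, 2 * np.pi, m, endpoint=False)
U = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # grid on S^1 (d = 2)
w = np.full(m, 2 * np.pi / m)        # quadrature weights, so sums approximate integrals

Phi = phi(X @ U.T)                   # Phi[i, j] = phi(u_j^T x_i)

# discretized primal (1): min sum_j w_j |alpha_j|  s.t.  y_i sum_j w_j alpha_j Phi[i, j] >= 1.
# Split alpha = a_plus - a_minus (both >= 0) to turn the absolute value into a linear objective.
A = (y[:, None] * Phi) * w[None, :]  # A[i, j] = y_i w_j phi(u_j^T x_i)
primal = linprog(c=np.concatenate([w, w]),
                 A_ub=-np.hstack([A, -A]),   # -(A a_plus - A a_minus) <= -1
                 b_ub=-np.ones(n))           # variables are >= 0 by default

# discretized dual (2): max sum_i lambda_i  s.t.  |sum_i lambda_i y_i phi(u_j^T x_i)| <= 1,  lambda >= 0
M = Phi.T * y[None, :]               # M[j, i] = y_i phi(u_j^T x_i)
dual = linprog(c=-np.ones(n),                # maximize sum(lambda) = minimize -sum(lambda)
               A_ub=np.vstack([M, -M]),
               b_ub=np.ones(2 * m))

print("discretized primal optimum:", primal.fun)
print("discretized dual optimum:  ", -dual.fun)   # should agree up to solver tolerance
```

Note that the quadrature weights $w_j$ cancel when minimizing over each $\alpha_j$, which is why the discretized dual constraint is $\left\lvert\sum_i \lambda_i y_i \phi(u_j^Tx_i)\right\rvert\le 1$ with no weight in it, exactly as in $(2)$.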

The details of how to prove that $(2)$ is the dual of $(1)$ are, however, not given, and I fail to prove it myself, being quite unfamiliar with Lagrangian optimization on function spaces. My question is thus: how can one prove that $\mathbf{(2)}$ is the dual of $\mathbf{(1)}$?

I attempted the proof by proceeding as in the more common "subset of $\mathbb R^d$" case, i.e. by minimizing the Lagrangian, which according to my calculations has the following expression: $$L(\alpha,\lambda) = \|\alpha\|_1 + \sum_{i=1}^n\lambda_i\left(1-y_i\left\langle\alpha,\varphi(x_i)\right\rangle\right) $$ However, I can't get much further than this, as I don't know how to find the minimum of $L(\alpha,\lambda)$ over $\alpha\in\mathcal L^1(\mathbb S^{d-1})$.
Even though the paper suggests taking $L^*(\lambda) = \sum_i \lambda_i$, I don't know how to prove that this is indeed the infimum, and I have no idea where the second constraint with the absolute value in $(2)$ comes from. I'm afraid my approach is totally wrong.
Any help with this problem will be much appreciated.


1 Answer

So, you want to find $L^*(\lambda) = \inf_\alpha L(\alpha,\lambda)$.

short version: There are two relevant cases. First, if $$ \left\lvert\sum_{i=1}^n \lambda_i y_i \phi(u^Tx_i)\right\rvert> 1 \quad \forall u\in A $$ holds for some measurable set $A\subset\mathbb S^{d-1}$ with nonzero measure, then one can show that $L^*(\lambda)=-\infty$.

In the second case, where $$ \left\lvert\sum_{i=1}^n \lambda_i y_i \phi(u^Tx_i)\right\rvert\le 1 \quad \forall u\in \mathbb S^{d-1} $$ holds, one can show that $\alpha=0$ is a minimizer of $L(\cdot,\lambda)$.

elaborations: We define the function $w(u):=\sum_{i=1}^n \lambda_i y_i \phi(u^Tx_i)$. Then one can show that $$ L(\alpha,\lambda) =\sum_{i=1}^n \lambda_i + \|\alpha\|_1 - \langle \alpha ,w\rangle $$ holds. In the first case, we can without loss of generality assume that $w\geq 1 + \delta$ on a measurable set $B$ with nonzero measure, for some $\delta>0$ (such $B$ and $\delta$ exist because $\{|w|>1\}=\bigcup_{m\geq1}\{|w|\geq 1+\tfrac1m\}$ has nonzero measure, so at least one of the sets on the right does too; for $w\leq -1-\delta$ the proof works similarly with the sign of $\alpha_k$ below flipped). We choose $\alpha_k\in \mathcal L^1$ such that $\alpha_k(u)=k$ on $B$ and $0$ otherwise. Then we have $$ L(\alpha_k,\lambda) =\sum_{i=1}^n \lambda_i + \int_{\Bbb S^{d-1}} |\alpha_k(u)| - \alpha_k(u) w(u)\,\mathrm du =\sum_{i=1}^n \lambda_i + \int_B |\alpha_k(u)| - \alpha_k(u) w(u)\,\mathrm du \leq \sum_{i=1}^n \lambda_i + \int_B k - k (1+\delta)\,\mathrm du = \sum_{i=1}^n \lambda_i - k\,\delta\,\mu(B), $$ where $\mu(B)>0$ denotes the measure of $B$. This converges to $-\infty$ as $k\to\infty$. Thus we have $L^*(\lambda)=-\infty$, which means that these $\lambda$ are not feasible in the dual problem.
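
For completeness, the identity for $L(\alpha,\lambda)$ used at the start of this elaboration follows by plugging the definition of $\langle\alpha,\varphi(x_i)\rangle$ into the Lagrangian from the question and exchanging the finite sum with the integral: $$\begin{align} L(\alpha,\lambda) &= \|\alpha\|_1 + \sum_{i=1}^n\lambda_i\left(1-y_i\langle\alpha,\varphi(x_i)\rangle\right) \\ &= \sum_{i=1}^n\lambda_i + \|\alpha\|_1 - \int_{\mathbb S^{d-1}} \alpha(u)\,\underbrace{\sum_{i=1}^n \lambda_i y_i \,\phi(u^Tx_i)}_{=\,w(u)}\,\mathrm du \\ &= \sum_{i=1}^n\lambda_i + \|\alpha\|_1 - \langle \alpha, w\rangle. \end{align}$$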

In the second case, we have $|w(u)|\leq1$ for all $u$, and therefore $|\alpha(u)|-\alpha(u) w(u)\geq |\alpha(u)|\,(1-|w(u)|)\geq0$ for all $u$. This implies $$ L(\alpha,\lambda) =\sum_{i=1}^n \lambda_i + \int_{\Bbb S^{d-1}} |\alpha(u)| - \alpha(u) w(u)\,\mathrm du \geq \sum_{i=1}^n \lambda_i = L(0,\lambda) $$ and therefore $L^*(\lambda)= \sum_{i=1}^n \lambda_i$.
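
A compact way to see both cases at once is the elementary scalar fact $$ \inf_{a\in\mathbb R}\big(|a| - a\,t\big) = \begin{cases} 0 & \text{if } |t|\le 1,\\ -\infty & \text{if } |t|>1, \end{cases} $$ applied pointwise with $t=w(u)$; the argument with the set $B$ above is what turns the pointwise value $-\infty$, attained on a set of nonzero measure, into $L^*(\lambda)=-\infty$ for the integral.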

remarks regarding context: Claim B.3 in the paper is used to show Lemma B.1. However, I think Lemma B.1 is false (if I am understanding the notation correctly): If $\operatorname{supp}(\alpha^*)$ is finite, then $\alpha^*$ would only be nonzero on finitely many points, and $\langle \alpha^*,\varphi(x_i)\rangle=0$ would hold for all $i$ (since $\langle\cdot,\cdot\rangle$ represents an integral). Thus, the optimal value would always be $0$ and $\alpha^*=0$ would always be the best possible solution to (3.6). However, this should not be true for all data $x_i,y_i,\varphi$.
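
To spell the measure-theoretic point out: if $\operatorname{supp}(\alpha^*)=\{u_1,\dots,u_k\}$ is a finite set, then it has Lebesgue (surface) measure zero, so $$ \|\alpha^*\|_1=\int_{\operatorname{supp}(\alpha^*)}|\alpha^*(u)|\,\mathrm du=0 \quad\text{and}\quad \langle\alpha^*,\varphi(x_i)\rangle=\int_{\operatorname{supp}(\alpha^*)}\alpha^*(u)\,\phi(u^Tx_i)\,\mathrm du=0 \;\;\forall i, $$ i.e. $\alpha^*$ is the zero element of $\mathcal L^1(\mathbb S^{d-1})$.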

  • Thank you so much for this clear and detailed answer! I guess the first expression of the Lagrangian you give should read as $$L(\alpha,\lambda) =\sum_{i=1}^n \lambda_i + \|\alpha\|_1 \mathbf- \langle \alpha ,w\rangle $$ right? Anyway, regarding your remark, why would imposing a finite support on $\alpha^*$ imply that the solution has to be zero everywhere? There are no continuity or smoothness assumptions on $\alpha$, so as long as it is $\mathcal L^1$ with $1$-norm less than one it should be fine, isn't that correct? Commented Oct 30, 2021 at 3:34
  • @StratosFair thank you for correcting the sign. I edited the question. Commented Nov 1, 2021 at 9:02
  • @StratosFair regarding the remark: I (hopefully) improved my explanation. If you calculate the $1$-norm of a function with finite support, you would find that the $1$-norm is always $0$. In measure theory, we say "$\alpha^*=0$ holds almost everywhere", because it is only nonzero on a set with Lebesgue measure zero. Commented Nov 1, 2021 at 9:15
  • If they use a different $1$-norm for sparse functions, then the claim is maybe correct. I only saw section 1.2, where the $1$-norm was defined using an integral. If in doubt, maybe you can try contacting the authors. Commented Nov 1, 2021 at 9:43
  • I do not have a reference ready for Lagrangian optimization in infinite dimensional spaces. In general, one should be careful: not everything that works in finite dimensions will work in infinite dimensions. Also note that the problems are convex in this context, so maybe a book on convex optimization could be helpful. Commented Nov 1, 2021 at 9:45
