I am interested in a result from the first version of the paper titled "On the Margin Theory of Feedforward Neural Networks" by Colin Wei, Jason D. Lee, Qiang Liu and Tengyu Ma.
In Equation B.1 of the appendix (page 17), the authors introduce the following optimization problem (which relates to the maximum margin of an $\ell_1$-SVM classifier):
$$\begin{align}
\min_{\alpha\in\mathcal L^1(\mathbb S^{d-1})} &\|\alpha\|_1 = \int_{u\in\mathbb S^{d-1}} |\alpha(u)|\,du \\
\text{subject to }\, &y_i\langle \alpha,\varphi(x_i)\rangle \ge 1 \;\forall i \in [\![ 1;n ]\!] \end{align} \tag1$$
Here, $\mathbb S^{d-1}$ is the unit sphere in $\mathbb R^d$, and $\mathcal L^1(\mathbb S^{d-1})$ is the space of real-valued functions on $\mathbb S^{d-1}$ whose absolute value is Lebesgue-integrable.
For all $i \in [\![ 1;n ]\!]$, the pair $(x_i,y_i)\in\mathbb R^d \times \{-1,1\}$ represents a datapoint and its associated label, while $\varphi$ is a "lifting function" mapping any datapoint $x_i$ to the infinite-dimensional space $\mathcal L^\infty(\mathbb S^{d-1})$, so that $\varphi(x_i)\in\mathcal L^\infty(\mathbb S^{d-1})$.
Lastly, $\langle\cdot,\cdot\rangle$ denotes the inner product $\langle f,g\rangle =\int_{u\in\mathbb S^{d-1}} f(u)g(u)\,du$.
In Claim B.3 of the paper (page 17), the authors claim that the dual of equation $(1)$ has the following form: $$ \begin{align} \max_{\lambda\in\mathbb R^n} &\sum_{i=1}^n \lambda_i \\ \text{subject to } &\left\lvert\sum_{i=1}^n \lambda_i y_i \phi(u^Tx_i)\right\rvert\le 1\; \forall u\in\mathbb S^{d-1}\\ &\lambda_i \ge 0 \; \forall i \in [\![ 1;n ]\!] \end{align} \tag2 $$ where $\phi$ is such that $\varphi(x)[u] = \phi(u^Tx)$ for all $u \in \mathbb S^{d-1}$ and $x \in \mathbb R^d$.
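If I read it correctly, the constraint with the absolute value simply states that the function $u\mapsto\sum_{i=1}^n \lambda_i y_i \varphi(x_i)[u]$ lies in the unit ball of $\mathcal L^\infty(\mathbb S^{d-1})$, i.e.
$$\left\|\sum_{i=1}^n \lambda_i y_i \varphi(x_i)\right\|_\infty \le 1.$$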
However, the details of how to prove that $(2)$ is the dual of $(1)$ are not given, and I have failed to prove it myself, being quite unfamiliar with Lagrangian optimization on function spaces. My question is thus: how can one prove that $\mathbf{(2)}$ is the dual of $\mathbf{(1)}$?
I attempted the proof by proceeding as in the more common "subset of $\mathbb R^d$" case, i.e. by minimizing the Lagrangian, which according to my calculations has the following expression:
$$L(\alpha,\lambda) = \|\alpha\|_1 + \sum_{i=1}^n\lambda_i\left(1-y_i\left\langle\alpha,\varphi(x_i)\right\rangle\right) $$
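Grouping the terms involving $\alpha$ (assuming my algebra is correct), this can be rewritten as
$$L(\alpha,\lambda) = \sum_{i=1}^n \lambda_i + \|\alpha\|_1 - \left\langle \alpha,\, \sum_{i=1}^n \lambda_i y_i \varphi(x_i)\right\rangle,$$
so that minimizing over $\alpha$ amounts to computing $\inf_{\alpha\in\mathcal L^1(\mathbb S^{d-1})}\left(\|\alpha\|_1 - \langle\alpha, g\rangle\right)$ with $g = \sum_{i=1}^n \lambda_i y_i \varphi(x_i)$.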
However, I can't get much further than this, as I don't know how to compute this infimum over $\alpha\in\mathcal L^1(\mathbb S^{d-1})$.
Even though the paper suggests taking $L^*(\lambda) = \sum_i \lambda_i$, I don't know how to prove that this is indeed the minimum, nor do I see rigorously where the second constraint with the absolute value in $(2)$ comes from (my best guess is sketched below). I'm afraid that my approach is totally wrong.
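For what it's worth, in the finite-dimensional analogue I believe Hölder's inequality $c^\top x \le \|c\|_\infty\|x\|_1$ gives, for any $c\in\mathbb R^d$,
$$\inf_{x\in\mathbb R^d}\left(\|x\|_1 - c^\top x\right) = \begin{cases} 0 & \text{if } \|c\|_\infty \le 1,\\ -\infty & \text{otherwise,}\end{cases}$$
which would explain the constraint in $(2)$ if an analogous statement holds in $\mathcal L^1(\mathbb S^{d-1})$, but I don't know how to establish this rigorously in a function space.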
Any help with this problem will be much appreciated.