This document briefly explains why the target mean minimizes MSE and why the target median minimizes MAE.
Suppose we have a dataset $$ \{(x_i,y_i)\}_{i=1}^N $$
Basically, we are given pairs: features $x_i$ and corresponding target value $y_i \in \mathbb{R}$.
We will denote the vector of targets as $y \in \mathbb{R}^N$, such that $y_i$ is the target for object $x_i$. Similarly, $\hat y \in \mathbb{R}^N$ denotes the predictions for the objects: $\hat y_i$ for object $x_i$.
Let's start with MSE loss. It is defined as follows:
$$ MSE(y, \hat y) = \frac{1}{N} \sum_{i=1}^N (\hat y_i - y_i)^2 $$
Now, the question is: if the predictions for all the objects were the same and equal to $\alpha$, that is $\hat y_i = \alpha$, what value of $\alpha$ would minimize MSE?
$$ \min_{\alpha} f(\alpha), \quad f(\alpha) = \frac{1}{N} \sum_{i=1}^N (\alpha - y_i)^2 $$
The function $f(\alpha)$ that we want to minimize is smooth with respect to $\alpha$. A necessary condition for $\alpha^*$ to be a local optimum is
$$ \frac{d f}{d \alpha}\bigg|_{\alpha=\alpha^*} = 0\, . $$
Let's find the points that satisfy this condition:
$$ \frac{d f}{d \alpha}\bigg|_{\alpha=\alpha^*} = \frac{2}{N} \sum_{i=1}^N (\alpha^* - y_i) = 0 $$
$$ \frac{2}{N} \sum_{i=1}^N \alpha^* - \frac{2}{N} \sum_{i=1}^N y_i = 0 $$
$$ \alpha^* - \frac{1}{N} \sum_{i=1}^N y_i = 0 $$
And finally:
$$ \alpha^* = \frac{1}{N} \sum_{i=1}^N y_i $$
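Since the second derivative $\frac{d^2 f}{d \alpha^2} = 2$ is positive at $\alpha^*$, what we found is a local minimum. In fact, the second derivative is positive for every $\alpha$, so $f$ is convex and $\alpha^*$ is the global minimum.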
So, that is how we find that the optimal constant for the MSE metric is the target mean value.
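As a quick numerical sanity check of this result, the sketch below compares the closed-form answer with a brute-force search over a grid of constants. The target values, the grid range and step, and the helper name `mse_constant` are illustrative assumptions, not part of the derivation.

```python
from statistics import mean


def mse_constant(alpha, y):
    """MSE when every prediction equals the constant alpha."""
    return sum((alpha - y_i) ** 2 for y_i in y) / len(y)


# Illustrative targets (an assumption for this sketch).
y = [-0.5, 0.0, 1.0, 3.0, 3.4]

# Closed-form minimizer from the derivation: the target mean.
alpha_star = mean(y)

# Brute-force search over candidate constants from -2.00 to 5.00 in steps of 0.01.
grid = [i / 100 for i in range(-200, 501)]
best_on_grid = min(grid, key=lambda a: mse_constant(a, y))

print("mean of targets:    ", alpha_star)
print("best constant found:", best_on_grid)
```

Up to the grid resolution, the best constant found by the search should coincide with the mean of the targets.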
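Similarly to the way we found the optimal constant for MSE loss, we can find it for MAE, which is defined as
$$ MAE(y, \hat y) = \frac{1}{N} \sum_{i=1}^N |\hat y_i - y_i| $$
With a constant prediction $\hat y_i = \alpha$, the function to minimize becomes
$$ f(\alpha) = \frac{1}{N} \sum_{i=1}^N |\alpha - y_i| $$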
Recall that $\frac{d |x|}{dx} = sign(x)$ for $x \neq 0$, where $sign$ stands for the signum function. Thus
$$ \frac{d f}{d \alpha}\bigg|_{\alpha=\alpha^*} = \frac{1}{N} \sum_{i=1}^N sign(\alpha^* - y_i) = 0 $$
So we need to find such $\alpha^*$ that
$$ g(\alpha^*) = \frac{1}{N} \sum_{i=1}^N sign(\alpha^* - y_i) = 0 $$
Note that $g(\alpha)$ is a piecewise-constant, non-decreasing function of $\alpha$. $g(\alpha)=-1$ for all values of $\alpha$ less than $\min_i y_i$, and $g(\alpha)=1$ for $\alpha > \max_i y_i$. Assuming the $y_i$ are distinct, the function "jumps" by $\frac{2}{N}$ at every point $y_i$. For example, for $y = [-0.5, 0, 1, 3, 3.4]$ the jump size is $\frac{2}{5} = 0.4$, so $g$ takes the values $-1, -0.6, -0.2, 0.2, 0.6, 1$ on the successive intervals between the points.
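A minimal Python sketch can tabulate this step function for the example targets. It assumes $g$ is normalized by $\frac{1}{N}$, so that it ranges from $-1$ to $1$; the helper names `sign` and `g` and the probe points are illustrative choices.

```python
def sign(x):
    """Signum function: -1 for negative x, 0 for zero, 1 for positive x."""
    return (x > 0) - (x < 0)


def g(alpha, y):
    """Average of sign(alpha - y_i) over all targets (normalized by N)."""
    return sum(sign(alpha - y_i) for y_i in y) / len(y)


y = [-0.5, 0, 1, 3, 3.4]

# One probe point inside each interval, plus the middle target itself (alpha = 1).
for alpha in [-1, -0.25, 0.5, 1, 2, 3.2, 4]:
    print(f"alpha = {alpha:>5}: g = {g(alpha, y):+.1f}")
```

The printed values rise in equal steps from $-1$ to $1$, and $g$ is zero at the middle target $\alpha = 1$, where the term $sign(0) = 0$ leaves two $+1$ terms and two $-1$ terms.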
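Basically, there are $N$ jumps of the same size, starting from $-1$ and ending at $1$, so it takes about $\frac{N}{2}$ jumps to reach zero. That happens exactly at the median of the target vector: $g(median(y))=0$. We should be careful and separate two cases, assuming distinct target values. When $N$ is odd, $g$ crosses zero at the middle point, where it equals zero because $sign(0)=0$. When $N$ is even, $g$ is zero everywhere strictly between the two middle points, so any value there, including their average (the conventional median), is optimal. The intuition is the same in both cases.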
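The odd and even cases can be checked numerically with a short sketch. The target lists, the grid, and the helper name `mae_constant` are assumptions made for illustration; `statistics.median` returns the average of the two middle values when the number of points is even.

```python
from statistics import median


def mae_constant(alpha, y):
    """MAE when every prediction equals the constant alpha."""
    return sum(abs(alpha - y_i) for y_i in y) / len(y)


# Illustrative targets: one list with an odd and one with an even number of points.
y_odd = [-0.5, 0.0, 1.0, 3.0, 3.4]
y_even = [-0.5, 0.0, 1.0, 3.0]

# Candidate constants from -2.00 to 5.00 in steps of 0.01.
grid = [i / 100 for i in range(-200, 501)]

for y in (y_odd, y_even):
    m = median(y)
    best_mae = min(mae_constant(a, y) for a in grid)
    print(f"N = {len(y)}: median = {m}, "
          f"MAE at median = {mae_constant(m, y):.4f}, "
          f"best MAE on grid = {best_mae:.4f}")

# In the even case, both middle points give the same MAE as the median.
print(mae_constant(0.0, y_even), mae_constant(1.0, y_even))
```

In both cases no grid point should beat the median, and in the even case every constant between the two middle targets attains the same MAE.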