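According to Bayes' Theorem, the posterior probability that a sample $x$ belongs to class $C_l$ is given by

$$\Pr(y = C_l \mid x) = \frac{\Pr(y = C_l)\,\Pr(x \mid y = C_l)}{\sum_{j=1}^{K} \Pr(y = C_j)\,\Pr(x \mid y = C_j)}, \qquad (1)$$

where $\Pr(y = C_l)$ is the prior probability of membership in class $C_l$, $\Pr(x \mid y = C_l)$ is the conditional probability of the predictors $x$ given that the data come from class $C_l$, and $K$ is the number of classes.

The rule that minimizes the total probability of misclassification says to classify $x$ into class $C_k$ if

$$\Pr(y = C_k)\,\Pr(x \mid y = C_k) \qquad (2)$$

has the largest value across all of the classes.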
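As a small illustration of the classification rule, the sketch below computes posterior probabilities from priors and class-conditional densities and picks the class with the largest product. The function name `posterior` and all the numbers are made up for this example.

```python
import numpy as np

def posterior(priors, likelihoods):
    """Posterior class probabilities from Bayes' theorem.

    priors:      shape (K,), prior probability of each class
    likelihoods: shape (K,), density of the observed x under each class
    """
    joint = np.asarray(priors, dtype=float) * np.asarray(likelihoods, dtype=float)
    # The denominator is the same for every class, so it only rescales
    return joint / joint.sum()

# Hypothetical values for a three-class problem
priors = [0.5, 0.3, 0.2]
likelihoods = [0.10, 0.40, 0.30]

post = posterior(priors, likelihoods)
# The argmax of the posterior equals the argmax of prior * likelihood
predicted_class = int(np.argmax(post))
```

Because the denominator does not depend on the class, comparing the unnormalized products is enough to classify; the normalization only matters when the probabilities themselves are needed.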
There are different types of discriminant analysis, and they usually differ in the assumptions made about the conditional distribution $\Pr(x \mid y = C_l)$.
Linear Discriminant Analysis (LDA)
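If we assume that $\Pr(x \mid y = C_l)$ in Eq. (1) follows a multivariate Gaussian distribution $N(\mu_l, \Sigma)$ with a class-specific mean vector $\mu_l$ and a common covariance matrix $\Sigma$, we have that the $\log$ of Eq. (2), after dropping terms that are the same for every class and referred to here as the discriminant function, is given by

$$\delta_l(x) = x^T \Sigma^{-1} \mu_l - \frac{1}{2} \mu_l^T \Sigma^{-1} \mu_l + \log \pi_l,$$

where $\pi_l = \Pr(y = C_l)$ is the prior probability of class $C_l$. This is a linear function in $x$ that defines the separating class boundaries, hence the name LDA.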
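The linear discriminant function can be sketched in a few lines of NumPy. This is only an illustration under the Gaussian, common-covariance assumption described above; the function name `lda_discriminants` and the toy parameters are assumptions made for the example.

```python
import numpy as np

def lda_discriminants(x, means, cov, priors):
    """Linear discriminant scores, one per class.

    x:      shape (p,), a single observation
    means:  shape (K, p), class-specific mean vectors
    cov:    shape (p, p), common covariance matrix
    priors: shape (K,), prior class probabilities
    """
    # Solve cov @ A = means.T instead of forming the inverse explicitly;
    # column l of A is Sigma^{-1} mu_l
    A = np.linalg.solve(cov, means.T)
    linear = x @ A                                  # x^T Sigma^{-1} mu_l
    constant = -0.5 * np.sum(means.T * A, axis=0)   # -1/2 mu_l^T Sigma^{-1} mu_l
    return linear + constant + np.log(priors)

# Hypothetical two-class, two-predictor setup
means = np.array([[0.0, 0.0], [2.0, 2.0]])
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
priors = np.array([0.5, 0.5])

scores = lda_discriminants(np.array([1.8, 1.5]), means, cov, priors)
label = int(np.argmax(scores))  # index of the class with the largest score
```

The difference between two scores equals the log ratio of the corresponding Gaussian densities times priors, which is why the quadratic term in $x$ cancels and the boundary is linear.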
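In practice [1], we estimate the prior probability $\pi_l$, the class-specific mean $\mu_l$ and the covariance matrix $\Sigma$ by $\hat{\pi}_l$, $\hat{\mu}_l$ and $\hat{\Sigma}$, respectively, where:

$\hat{\pi}_l = N_l / N$, where $N_l$ is the number of class $C_l$ observations and $N$ is the total number of observations;

$\hat{\mu}_l = \frac{1}{N_l} \sum_{i : y_i = C_l} x_i$;

$\hat{\Sigma} = \frac{1}{N - K} \sum_{l=1}^{K} \sum_{i : y_i = C_l} (x_i - \hat{\mu}_l)(x_i - \hat{\mu}_l)^T$, where $K$ is the number of classes.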
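The plug-in estimates can be computed directly from a data matrix and a label vector. The sketch below assumes the standard estimates from [1] (class proportions, class means, and a pooled covariance with divisor $N - K$); the function name `fit_lda` and the simulated data are hypothetical.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate priors, class means and the pooled covariance.

    X: shape (N, p), predictors
    y: shape (N,), class labels
    """
    classes = np.unique(y)
    N, p = X.shape
    K = len(classes)
    priors = np.empty(K)
    means = np.empty((K, p))
    scatter = np.zeros((p, p))
    for j, c in enumerate(classes):
        Xc = X[y == c]
        priors[j] = Xc.shape[0] / N          # N_l / N
        means[j] = Xc.mean(axis=0)           # class-specific mean
        centered = Xc - means[j]
        scatter += centered.T @ centered     # within-class scatter
    cov = scatter / (N - K)                  # pooled covariance estimate
    return classes, priors, means, cov

# Simulated data, for illustration only
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 2)),
               rng.normal(2.0, 1.0, size=(60, 2))])
y = np.array([0] * 40 + [1] * 60)

classes, priors, means, cov = fit_lda(X, y)
```

The pooled covariance is a weighted average of the per-class sample covariances, with weights proportional to each class's degrees of freedom.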
Quadratic Discriminant Analysis (QDA)
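If instead we assume that $\Pr(x \mid y = C_l)$ in Eq. (1) follows a multivariate Gaussian $N(\mu_l, \Sigma_l)$ with class-specific mean vector $\mu_l$ and covariance matrix $\Sigma_l$, we have a quadratic discriminant function

$$\delta_l(x) = -\frac{1}{2} \log |\Sigma_l| - \frac{1}{2} (x - \mu_l)^T \Sigma_l^{-1} (x - \mu_l) + \log \pi_l,$$

where $\pi_l$ is the prior probability of class $C_l$, and the decision boundary between each pair of classes $C_k$ and $C_l$ is now described by a quadratic equation.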
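To see the quadratic boundary at work, the sketch below evaluates the quadratic discriminant function for two classes that share the same mean but have different covariances, a case a linear boundary cannot separate. The function name `qda_discriminants` and the toy parameters are assumptions for the example.

```python
import numpy as np

def qda_discriminants(x, means, covs, priors):
    """Quadratic discriminant scores, one per class.

    x:      shape (p,), a single observation
    means:  shape (K, p), class-specific mean vectors
    covs:   shape (K, p, p), class-specific covariance matrices
    priors: shape (K,), prior class probabilities
    """
    scores = []
    for mu, cov, prior in zip(means, covs, priors):
        diff = x - mu
        _, logdet = np.linalg.slogdet(cov)         # log |Sigma_l|, numerically stable
        maha = diff @ np.linalg.solve(cov, diff)   # squared Mahalanobis distance
        scores.append(-0.5 * logdet - 0.5 * maha + np.log(prior))
    return np.array(scores)

# Hypothetical setup: same mean, tight class 0 and diffuse class 1
means = np.array([[0.0, 0.0], [0.0, 0.0]])
covs = np.array([np.eye(2), 9.0 * np.eye(2)])
priors = np.array([0.5, 0.5])

near = int(np.argmax(qda_discriminants(np.array([0.2, 0.1]), means, covs, priors)))
far = int(np.argmax(qda_discriminants(np.array([4.0, 4.0]), means, covs, priors)))
```

Points close to the shared mean go to the tight class and points far away go to the diffuse one, so the boundary here is a circle rather than a line.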
Notice that we pay a price for this increased flexibility when compared to LDA. We now have to estimate one covariance matrix for each class, which means a significant increase in the number of parameters to be estimated: with $p$ predictors and $K$ classes, QDA has $K p (p+1)/2$ covariance parameters against $p (p+1)/2$ for LDA. This implies that the number of predictors needs to be less than the number of cases within each class to ensure that the class-specific covariance matrix is not singular. In addition, if the majority of the predictors in the data are indicators for discrete categories, QDA will only be able to model these as linear functions, thus limiting the effectiveness of the model [2].
Regularized Discriminant Analysis
Friedman ([1], [3]) proposed a compromise between LDA and QDA, which allows one to shrink the separate covariances of QDA toward a common covariance as in LDA. The regularized covariance matrices have the form

$$\hat{\Sigma}_l(\alpha) = \alpha \hat{\Sigma}_l + (1 - \alpha) \hat{\Sigma}, \qquad (3)$$

where $\hat{\Sigma}$ is the common covariance matrix as used in LDA and $\hat{\Sigma}_l$ is the class-specific covariance matrix as used in QDA. $\alpha$ is a number between $0$ and $1$ that can be chosen based on the performance of the model on validation data, or by cross-validation.
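One practical consequence of this shrinkage is that a singular class-specific covariance can become invertible once it is blended with the common one. The sketch below illustrates this with made-up matrices; the function name `shrink_toward_pooled` is hypothetical.

```python
import numpy as np

def shrink_toward_pooled(cov_l, cov_pooled, alpha):
    """Convex combination of a class covariance and the common covariance."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * cov_l + (1.0 - alpha) * cov_pooled

# Hypothetical matrices: the class covariance is rank deficient (singular)
cov_l = np.array([[1.0, 1.0], [1.0, 1.0]])
cov_pooled = np.array([[1.0, 0.2], [0.2, 1.0]])

blended = shrink_toward_pooled(cov_l, cov_pooled, alpha=0.5)
```

Setting `alpha=1.0` recovers the QDA covariance and `alpha=0.0` recovers the LDA covariance; in practice `alpha` would be picked by comparing validation or cross-validation performance over a grid of values.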
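It is also possible to allow the common covariance $\hat{\Sigma}$ to be shrunk toward the spherical covariance

$$\hat{\Sigma}(\gamma) = \gamma \hat{\Sigma} + (1 - \gamma) \hat{\sigma}^2 I,$$

where $I$ is the identity matrix. The equation above means that, when $\gamma = 0$, the predictors are assumed independent and with common variance $\hat{\sigma}^2$. Replacing $\hat{\Sigma}$ in Eq. (3) by $\hat{\Sigma}(\gamma)$ leads to a more general family of covariances $\hat{\Sigma}_l(\alpha, \gamma)$ indexed by a pair of parameters $(\alpha, \gamma)$ that again can be chosen based on the model performance on validation data.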
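The two-parameter family can be sketched as a single function. The text does not say how the common variance of the spherical target is obtained, so the code below assumes the average of the diagonal of the common covariance (its trace divided by the number of predictors) unless a value is supplied; that choice, the function name `rda_covariance`, and the matrices are assumptions for illustration.

```python
import numpy as np

def rda_covariance(cov_l, cov_pooled, alpha, gamma, sigma2=None):
    """Regularized covariance indexed by (alpha, gamma).

    gamma shrinks the common covariance toward sigma2 * I,
    alpha shrinks the class covariance toward that result.
    """
    p = cov_pooled.shape[0]
    if sigma2 is None:
        # Assumption: use the average variance as the spherical scale
        sigma2 = np.trace(cov_pooled) / p
    pooled_gamma = gamma * cov_pooled + (1.0 - gamma) * sigma2 * np.eye(p)
    return alpha * cov_l + (1.0 - alpha) * pooled_gamma

# Hypothetical covariance estimates
cov_l = np.array([[2.0, 0.8], [0.8, 1.0]])
cov_pooled = np.array([[1.5, 0.2], [0.2, 0.5]])

qda_like = rda_covariance(cov_l, cov_pooled, alpha=1.0, gamma=1.0)
lda_like = rda_covariance(cov_l, cov_pooled, alpha=0.0, gamma=1.0)
spherical = rda_covariance(cov_l, cov_pooled, alpha=0.0, gamma=0.0)
```

The three corner cases recover the QDA covariance, the LDA covariance, and a spherical covariance respectively; intermediate values of the pair interpolate between them.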
References:
[1] Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
[2] Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.
[3] Friedman, J. H. (1989). Regularized discriminant analysis. Journal of the American Statistical Association, 84(405), 165-175.