Multiclass Matthews Correlation Coefficient
or: How I Learned to Stop Worrying and Love the Confusion Matrix
Have you ever taken a glance at the code in an open source ML library like Mead-Baseline or scikit-learn for the Matthews Correlation Coefficient (also called the phi coefficient) and seen something like this?
```python
import numpy as np

def mcc(X, Y):
    cm = ConfusionMatrix(X, Y)
    samples = np.sum(cm)
    correct = np.trace(cm)
    y = np.sum(cm, axis=1, dtype=np.float64)
    x = np.sum(cm, axis=0, dtype=np.float64)
    cov_x_y = correct * samples - np.dot(x, y)
    cov_y_y = samples * samples - np.dot(y, y)
    cov_x_x = samples * samples - np.dot(x, x)
    denom = np.sqrt(cov_x_x * cov_y_y)
    denom = denom if denom != 0.0 else 1.0
    return cov_x_y / denom
```
You ask yourself, "What is going on?" Luckily, in the comments there is a link to Wikipedia, and you think "oh good, this will clear things up". But then you see a formula that looks like the following for the multiclass calculation:

$$\mathrm{MCC} = \frac{\sum_{k}\sum_{l}\sum_{m} C_{kk}C_{lm} - C_{kl}C_{mk}}{\sqrt{\sum_{k}\left(\sum_{l} C_{kl}\right)\left(\sum_{k' \neq k}\sum_{l'} C_{k'l'}\right)}\sqrt{\sum_{k}\left(\sum_{l} C_{lk}\right)\left(\sum_{k' \neq k}\sum_{l'} C_{l'k'}\right)}}$$

This looks nothing like the code you found, and if anything, looking at Wikipedia has probably made things more confusing. This post will start with a quick introduction to the confusion matrix and then work through the math until it matches the code.
| | | Prediction | |
|---|---|---|---|
| | | **1** | **0** |
| **Gold** | **1** | TP | FN |
| | **0** | FP | TN |
Throughout this post we will use a confusion matrix that looks like the one above. For a given example in the dataset we add one to the matrix cell whose row index is the class of the true (gold) label and whose column index is the class of the predicted label. When the prediction matches the gold label the count lands on the diagonal, while examples where the model is wrong appear off of it.
This confusion matrix can be extended to the multiclass setting by adding a row and column for each new class.
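As a concrete sketch (this helper is mine, not from any particular library), building such a matrix in numpy might look like:

```python
import numpy as np

def confusion_matrix(golds, preds, n_classes):
    """Rows are indexed by the gold label, columns by the predicted label."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for gold, pred in zip(golds, preds):
        cm[gold, pred] += 1
    return cm

# Correct predictions land on the diagonal; mistakes land off of it.
print(confusion_matrix([0, 1, 1, 2], [0, 1, 2, 2], n_classes=3))
```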
The Metric
Definitions of terms
Simplifying the calculation
Simplifying $\operatorname{cov}(X, Y)$
Simplifying $\operatorname{cov}(X, X)$ and $\operatorname{cov}(Y, Y)$
From a math perspective there isn't a lot to do with the first term; we'll come back to it when we convert everything to code.
Simplification of $\operatorname{cov}(X, X)$
Simplification of $\operatorname{cov}(Y, Y)$
- $R_K$ is the multiclass extension of the discretized Pearson correlation coefficient, and this is the definition of it:

$$R_K = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{cov}(X, X)\operatorname{cov}(Y, Y)}}$$

- $X$ and $Y$ are matrices of one-hot vectors. Each row in the matrix represents an example in the dataset and each column represents a class in the problem. The row for example $n$ has a 1 in column $k$ if the label for that example is $k$ and is zero otherwise. $X$ represents the labels created by the model while $Y$ holds the gold labels.
- We define the covariance between two matrices to be a linear combination of the covariance between the columns of the matrices:

$$\operatorname{cov}(X, Y) = \sum_{k=1}^{K} w_k \operatorname{cov}(X_{:,k}, Y_{:,k})$$
- $w_k$ is the weight on some class $k$ in our linear combination. We decide to use a uniform prior on the importance of each class:

$$w_k = \frac{1}{K}$$

- This is the definition of covariance:

$$\operatorname{cov}(X_{:,k}, Y_{:,k}) = \frac{1}{N}\sum_{n=1}^{N}\left(X_{nk} - \bar{X}_k\right)\left(Y_{nk} - \bar{Y}_k\right)$$
- $\bar{X}_k$ is the mean of column $k$ of $X$. Because $X$ is one-hot, column $k$ of a row is one only if that example was labeled with class $k$, so we can read this value from the confusion matrix: it is the sum of all elements in column $k$ of the confusion matrix, because that counts class $k$ in $X$ regardless of the class in $Y$. The denominator of the mean is the total number of samples, because the mean is actually over all rows in $X$ even if we read the values from the confusion matrix:

$$\bar{X}_k = \frac{\sum_{i} C_{ik}}{N}$$
- We can do the same transformation into a confusion matrix based calculation for $\bar{Y}_k$, except that the sum is along a row, because we care about the value in $Y$ regardless of the value in $X$:

$$\bar{Y}_k = \frac{\sum_{j} C_{kj}}{N}$$

- Again, this is our definition of covariance between two matrices:

$$\operatorname{cov}(X, Y) = \sum_{k=1}^{K} w_k \operatorname{cov}(X_{:,k}, Y_{:,k})$$
- Here we substitute in the definition of covariance between vectors:

$$\operatorname{cov}(X, Y) = \sum_{k=1}^{K} w_k \frac{1}{N}\sum_{n=1}^{N}\left(X_{nk} - \bar{X}_k\right)\left(Y_{nk} - \bar{Y}_k\right)$$

- $w_k$ is a constant across each value in the summation. It is a multiplication, which distributes over addition, so we can pull it out of the summation. Note this only works because we defined $w_k = \frac{1}{K}$; if $w_k$ were different for each class we would not be able to factor it out of the summation over $k$:

$$\operatorname{cov}(X, Y) = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{N}\sum_{n=1}^{N}\left(X_{nk} - \bar{X}_k\right)\left(Y_{nk} - \bar{Y}_k\right)$$

- We FOIL (first, outer, inner, last) the covariance calculation:

$$\operatorname{cov}(X, Y) = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{N}\sum_{n=1}^{N}\left(X_{nk}Y_{nk} - X_{nk}\bar{Y}_k - \bar{X}_k Y_{nk} + \bar{X}_k\bar{Y}_k\right)$$

- By thinking of all the subtraction in this equation as adding a negative term, each term is combined with addition. Addition is commutative, so we can rearrange the summations to apply them to each term individually:

$$\operatorname{cov}(X, Y) = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{N}\left(\sum_{n=1}^{N}X_{nk}Y_{nk} - \sum_{n=1}^{N}X_{nk}\bar{Y}_k - \sum_{n=1}^{N}\bar{X}_k Y_{nk} + \sum_{n=1}^{N}\bar{X}_k\bar{Y}_k\right)$$

These four pieces are the `first`, `outer`, `inner`, and `last` terms of the FOIL expansion; we will simplify them one at a time, starting with the `last` term.
- We can see that the variable $n$ is not used inside the `last` summation; this means the value calculated in the summation over $n$ is repeated and summed $N$ times, which is the definition of multiplication by $N$:

$$\sum_{n=1}^{N}\bar{X}_k\bar{Y}_k = N\bar{X}_k\bar{Y}_k$$
- We substitute in our definitions of $\bar{X}_k$ and $\bar{Y}_k$ in terms of the confusion matrix:

$$N\bar{X}_k\bar{Y}_k = N\,\frac{\sum_{i} C_{ik}}{N}\,\frac{\sum_{j} C_{kj}}{N}$$

- Division is multiplication by the reciprocal, and multiplication distributes over the addition in the summation, so we can pull the division by $N$ out:

$$= \frac{N\left(\sum_{i} C_{ik}\right)\left(\sum_{j} C_{kj}\right)}{N^2}$$

- The $N$ in the numerator cancels with one of the $N$s in the denominator:

$$= \frac{\left(\sum_{i} C_{ik}\right)\left(\sum_{j} C_{kj}\right)}{N}$$
- Similar to above, we can factor $\bar{Y}_k$ in the `outer` term out of the summation over $n$ because $\bar{Y}_k$ doesn't depend on $n$:

$$\sum_{n=1}^{N}X_{nk}\bar{Y}_k = \bar{Y}_k\sum_{n=1}^{N}X_{nk}$$

- $\sum_{n} X_{nk}$ is the number of examples in $X$ that belong to class $k$. This can be read from the confusion matrix as described before:

$$\sum_{n=1}^{N}X_{nk} = \sum_{i} C_{ik}$$

- We substitute the confusion matrix definition for $\sum_{n} X_{nk}$ as well as for $\bar{Y}_k$:

$$\bar{Y}_k\sum_{n=1}^{N}X_{nk} = \frac{\sum_{j} C_{kj}}{N}\sum_{i} C_{ik}$$

- As established before, we can factor the multiplication by $\frac{1}{N}$ out because multiplication distributes over addition:

$$= \frac{\left(\sum_{i} C_{ik}\right)\left(\sum_{j} C_{kj}\right)}{N}$$

- The `inner` term $\sum_{n}\bar{X}_k Y_{nk}$ can be expanded similarly:

$$\sum_{n=1}^{N}\bar{X}_k Y_{nk} = \frac{\left(\sum_{i} C_{ik}\right)\left(\sum_{j} C_{kj}\right)}{N}$$
- We substitute our simplified expressions into the FOILed calculation of $\operatorname{cov}(X, Y)$:

$$\operatorname{cov}(X, Y) = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{N}\left(\sum_{n=1}^{N}X_{nk}Y_{nk} - \frac{\left(\sum_{i} C_{ik}\right)\left(\sum_{j} C_{kj}\right)}{N} - \frac{\left(\sum_{i} C_{ik}\right)\left(\sum_{j} C_{kj}\right)}{N} + \frac{\left(\sum_{i} C_{ik}\right)\left(\sum_{j} C_{kj}\right)}{N}\right)$$

- Here we can see that the `outer` and `last` sections cancel out:

$$\operatorname{cov}(X, Y) = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{N}\left(\sum_{n=1}^{N}X_{nk}Y_{nk} - \frac{\left(\sum_{i} C_{ik}\right)\left(\sum_{j} C_{kj}\right)}{N}\right)$$
- We substitute our definition of $\sum_{n} X_{nk}Y_{nk}$: the product $X_{nk}Y_{nk}$ is one only when example $n$ is both predicted as class $k$ and gold-labeled as class $k$, so the sum is the diagonal entry $C_{kk}$ of the confusion matrix:

$$\operatorname{cov}(X, Y) = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{N}\left(C_{kk} - \frac{\left(\sum_{i} C_{ik}\right)\left(\sum_{j} C_{kj}\right)}{N}\right)$$

- Any number divided by itself is $1$ because $\frac{a}{a}$ is the same as $a \cdot \frac{1}{a}$, $\frac{1}{a}$ is the multiplicative inverse (or reciprocal) of $a$, and the result of a number times its multiplicative inverse is $1$. We can multiply one side of this equation by $\frac{N}{N}$ because $1$ is the multiplicative identity and any multiplication by $1$ results in the same number:

$$\operatorname{cov}(X, Y) = \frac{1}{K}\sum_{k=1}^{K}\frac{N}{N^2}\left(C_{kk} - \frac{\left(\sum_{i} C_{ik}\right)\left(\sum_{j} C_{kj}\right)}{N}\right)$$

- The $N$ in the numerator distributes over the subtraction. (You actually need to recast the subtraction as addition of the negation in order to do it.)

$$\operatorname{cov}(X, Y) = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{N^2}\left(N\,C_{kk} - \frac{N\left(\sum_{i} C_{ik}\right)\left(\sum_{j} C_{kj}\right)}{N}\right)$$

- The $N$ in the denominator of the second term of the numerator cancels:

$$\operatorname{cov}(X, Y) = \frac{1}{K}\sum_{k=1}^{K}\frac{N\,C_{kk} - \left(\sum_{i} C_{ik}\right)\left(\sum_{j} C_{kj}\right)}{N^2}$$
- $\operatorname{cov}(X, X)$ can similarly be simplified, but we don't have $Y$, so all the calculations based on the confusion matrix operate on columns like in $\bar{X}_k$:

$$\operatorname{cov}(X, X) = \frac{1}{K}\sum_{k=1}^{K}\frac{N\sum_{i} C_{ik} - \left(\sum_{i} C_{ik}\right)^2}{N^2}$$

- In the same vein, $\operatorname{cov}(Y, Y)$ only has confusion matrix operations on rows:

$$\operatorname{cov}(Y, Y) = \frac{1}{K}\sum_{k=1}^{K}\frac{N\sum_{j} C_{kj} - \left(\sum_{j} C_{kj}\right)^2}{N^2}$$
- Let's plug this back into the $R_K$ definition:

$$R_K = \frac{\frac{1}{K}\sum_{k}\frac{N\,C_{kk} - \left(\sum_{i} C_{ik}\right)\left(\sum_{j} C_{kj}\right)}{N^2}}{\sqrt{\frac{1}{K}\sum_{k}\frac{N\sum_{i} C_{ik} - \left(\sum_{i} C_{ik}\right)^2}{N^2}}\sqrt{\frac{1}{K}\sum_{k}\frac{N\sum_{j} C_{kj} - \left(\sum_{j} C_{kj}\right)^2}{N^2}}}$$

- By the rule of square roots $\sqrt{a}\sqrt{b} = \sqrt{ab}$ we can combine the two terms in the denominator:

$$R_K = \frac{\frac{1}{KN^2}\sum_{k}\left(N\,C_{kk} - \left(\sum_{i} C_{ik}\right)\left(\sum_{j} C_{kj}\right)\right)}{\sqrt{\frac{1}{K^2N^4}\sum_{k}\left(N\sum_{i} C_{ik} - \left(\sum_{i} C_{ik}\right)^2\right)\sum_{k}\left(N\sum_{j} C_{kj} - \left(\sum_{j} C_{kj}\right)^2\right)}}$$

- By the rule of square roots $\sqrt{ab} = \sqrt{a}\sqrt{b}$ we can split the denominator into two separate terms of a fraction:

$$R_K = \frac{\frac{1}{KN^2}\sum_{k}\left(N\,C_{kk} - \left(\sum_{i} C_{ik}\right)\left(\sum_{j} C_{kj}\right)\right)}{\sqrt{\frac{1}{K^2N^4}}\sqrt{\sum_{k}\left(N\sum_{i} C_{ik} - \left(\sum_{i} C_{ik}\right)^2\right)\sum_{k}\left(N\sum_{j} C_{kj} - \left(\sum_{j} C_{kj}\right)^2\right)}}$$

- $\sqrt{\frac{1}{K^2N^4}} = \frac{1}{KN^2}$, so the square and the square root in the denominator cancel.
- Division is multiplication by the reciprocal, so we can rewrite the equation:

$$R_K = \frac{KN^2}{KN^2}\cdot\frac{\sum_{k}\left(N\,C_{kk} - \left(\sum_{i} C_{ik}\right)\left(\sum_{j} C_{kj}\right)\right)}{\sqrt{\sum_{k}\left(N\sum_{i} C_{ik} - \left(\sum_{i} C_{ik}\right)^2\right)\sum_{k}\left(N\sum_{j} C_{kj} - \left(\sum_{j} C_{kj}\right)^2\right)}}$$

- The $\frac{1}{KN^2}$'s cancel, leaving us with a final simplified $R_K$:

$$R_K = \frac{\sum_{k}\left(N\,C_{kk} - \left(\sum_{i} C_{ik}\right)\left(\sum_{j} C_{kj}\right)\right)}{\sqrt{\sum_{k}\left(N\sum_{i} C_{ik} - \left(\sum_{i} C_{ik}\right)^2\right)\sum_{k}\left(N\sum_{j} C_{kj} - \left(\sum_{j} C_{kj}\right)^2\right)}}$$
Converting to numpy
Let's translate these into numpy code, starting with the simple terms.
- $N$ is the number of examples in the dataset. Because each example adds one to the confusion matrix, $N$ is just the sum of all the values in the confusion matrix. This is an easy translation to numpy: `np.sum(C)`
- $\sum_{i} C_{ik}$ is the sum of the elements in column $k$, summing over the row index $i$. Calculating this sum for each column and stacking the results together produces a vector that holds each column sum. Numpy is vectorized, so we can compute all of these sums and build the vector with a single call: `np.sum(C, axis=0)`
- $\sum_{j} C_{kj}$ is the same, except the sum runs along a row instead of a column, producing the vector of row sums: `np.sum(C, axis=1)`
- $\sum_{k}\left(\sum_{i} C_{ik}\right)\left(\sum_{j} C_{kj}\right)$ generates the sum of products between the sum of some column $k$ and the sum of some row $k$. When we have pre-computed the row and column sums and stored them in two vectors, this operation is the dot product (the sum of the products of the paired elements of two vectors, $x \cdot y = \sum_{k} x_k y_k$):
  - $\sum_{k}\left(\sum_{i} C_{ik}\right)\left(\sum_{j} C_{kj}\right)$ from $\operatorname{cov}(X, Y)$ translates to `np.dot(np.sum(C, axis=0), np.sum(C, axis=1))`
  - $\sum_{k}\left(\sum_{i} C_{ik}\right)^2$ from $\operatorname{cov}(X, X)$ translates to `np.dot(np.sum(C, axis=0), np.sum(C, axis=0))`
  - $\sum_{k}\left(\sum_{j} C_{kj}\right)^2$ from $\operatorname{cov}(Y, Y)$ translates to `np.dot(np.sum(C, axis=1), np.sum(C, axis=1))`
- Transforming $\sum_{k} C_{kk}$, the sum of the diagonal of the confusion matrix, gives `np.sum(np.diagonal(C))`, or `np.trace(C)`, in code.
- When the inputs match (both are $X$, for example), the one-hot entries satisfy $X_{nk}X_{nk} = X_{nk}$, so $\sum_{k}\sum_{n} X_{nk}X_{nk}$ is just the total number of examples: `np.sum(C)`
We now have code snippets that we can use to calculate all the terms in our simplified $R_K$.
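Assembling those snippets (a sketch of my own that mirrors the library code from the top of the post, but takes the confusion matrix directly) gives:

```python
import numpy as np

def mcc_from_confusion_matrix(C):
    samples = np.sum(C)                      # N
    correct = np.trace(C)                    # sum_k C_kk
    x = np.sum(C, axis=0, dtype=np.float64)  # column sums (predictions)
    y = np.sum(C, axis=1, dtype=np.float64)  # row sums (gold labels)
    # The three simplified covariance terms; the shared 1 / (K * N^2)
    # factor is dropped because it cancels between numerator and denominator.
    cov_x_y = correct * samples - np.dot(x, y)
    cov_x_x = samples * samples - np.dot(x, x)
    cov_y_y = samples * samples - np.dot(y, y)
    denom = np.sqrt(cov_x_x * cov_y_y)
    # Guard against dividing by zero, like the library code does.
    return cov_x_y / (denom if denom != 0.0 else 1.0)
```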
An Alternate Look at Simplifying $\operatorname{cov}(X, Y)$
We simplified $\operatorname{cov}(X, Y)$ by expanding each term fully above, but there is a shorter path to the same result.
- Recall our definition for $\bar{X}_k$:

$$\bar{X}_k = \frac{1}{N}\sum_{n=1}^{N} X_{nk}$$

- Multiply each side by $N$:

$$N\bar{X}_k = \sum_{n=1}^{N} X_{nk}$$

- Substitute this new value into the `outer` term.
- Nothing left in the summand depends on $n$, and because multiplication distributes over addition we can pull $\bar{Y}_k$ out of the summation:

$$\sum_{n=1}^{N} X_{nk}\bar{Y}_k = \bar{Y}_k\sum_{n=1}^{N} X_{nk} = N\bar{X}_k\bar{Y}_k$$

- We can do the same substitution for $\bar{Y}_k$, turning the `inner` term into $N\bar{X}_k\bar{Y}_k$ as well.
- Substitute these terms for the `outer` and `inner` terms, as well as our first transformation of the `last` term, into the definition:

$$\operatorname{cov}(X, Y) = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{N}\left(\sum_{n=1}^{N}X_{nk}Y_{nk} - N\bar{X}_k\bar{Y}_k - N\bar{X}_k\bar{Y}_k + N\bar{X}_k\bar{Y}_k\right)$$

- Cancel terms:

$$\operatorname{cov}(X, Y) = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{N}\left(\sum_{n=1}^{N}X_{nk}Y_{nk} - N\bar{X}_k\bar{Y}_k\right)$$
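As a quick numerical sanity check (my own snippet, not part of the original derivation), we can confirm that the covariance definition over one-hot matrices and the simplified confusion matrix form agree:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 4
golds = rng.integers(0, K, size=N)
preds = rng.integers(0, K, size=N)

# One-hot matrices: X from the predictions, Y from the gold labels.
X = np.eye(K)[preds]
Y = np.eye(K)[golds]

def matrix_cov(A, B):
    # Uniformly weighted (w_k = 1/K) average of the per-column covariances.
    return np.mean(np.sum((A - A.mean(axis=0)) * (B - B.mean(axis=0)), axis=0) / len(A))

r_k = matrix_cov(X, Y) / np.sqrt(matrix_cov(X, X) * matrix_cov(Y, Y))

# The simplified confusion matrix version.
C = np.zeros((K, K))
for g, p in zip(golds, preds):
    C[g, p] += 1
x, y, n = C.sum(axis=0), C.sum(axis=1), C.sum()
mcc = (np.trace(C) * n - x @ y) / np.sqrt((n * n - x @ x) * (n * n - y @ y))

assert np.isclose(r_k, mcc)
```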
Reduction to the Matthews Correlation Coefficient for $K = 2$
The Wikipedia page for MCC gives multiple ways to calculate it by reading from the confusion matrix, as well as the original equations Matthews used in his paper. That presentation skips, however, the original formula Matthews gave, in which the true label distribution and the predicted label distribution are cast as random variables and the correlation between the two is measured. This formulation allows us to see that the binary MCC is equivalent to our $R_K$ when $K = 2$.
- This is the correlation based definition of MCC from his original paper, with the variables renamed to fit our scheme: the predicted labels are now $x$ and the observations are now $y$.

$$\mathrm{MCC} = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{cov}(x, x)\operatorname{cov}(y, y)}}$$

- $\operatorname{cov}(x, y)$ expands to $\frac{1}{N}\sum_{n=1}^{N}(x_n - \bar{x})(y_n - \bar{y})$; the $\operatorname{cov}(x, x)$ and $\operatorname{cov}(y, y)$ terms can be similarly expanded. These are the covariance definitions from earlier.
- $\operatorname{cov}$ is symmetric, so $\operatorname{cov}(x, y) = \operatorname{cov}(y, x)$. We can see this because all interactions between the variables in $\operatorname{cov}$ are multiplications, and multiplication is commutative, so we can swap the order.
We can see that this calculation for $\mathrm{MCC}$ is exactly the calculation $R_K$ performs on a single column. This next step is a little odd, and I don't have a great way to demonstrate it mathematically, but we should be able to see that these results will be the same.
- Imagine $y$ is some vector of ones and zeros that represents the labels for the examples in a dataset.
- $Y$ is a one-hot representation of $y$. We see that column one is the exact same as the original vector and column zero is the flipped version (all zeros are ones and all ones are zeros).
- For some binary vectors $x$ and $y$ we say the Matthews Correlation Coefficient is $\mathrm{MCC}(x, y)$.
- We saw above that the calculation $R_K$ does for a single column is the same as the one $\mathrm{MCC}$ does, and we saw that column one of $Y$ is the same as $y$ (and likewise column one of $X$ is the same as $x$). This means the result for column one will also be $\mathrm{MCC}(x, y)$.
- For column zero of $Y$ we saw that it is the same as $y$ but flipped. Flipping both vectors leaves every covariance unchanged ($\operatorname{cov}(1 - x, 1 - y) = \operatorname{cov}(x, y)$), so all the sums and means work out the same way. This result is also $\mathrm{MCC}(x, y)$.
- Earlier we defined $w_k$ to be one over the number of classes we have, which is $\frac{1}{2}$ in this case.
- We know that $R_K$ is the linear combination of the correlation calculated for each column, weighted by $w_k$, as expressed here:

$$R_2 = \sum_{k=0}^{1} w_k\,\mathrm{MCC}(X_{:,k}, Y_{:,k})$$

- We factor the $w_k = \frac{1}{2}$ out of each term:

$$R_2 = \frac{1}{2}\left(\mathrm{MCC}(X_{:,0}, Y_{:,0}) + \mathrm{MCC}(X_{:,1}, Y_{:,1})\right)$$

- Multiplication is repeated addition, and we showed above that both columns produce the same value:

$$R_2 = \frac{1}{2}\cdot 2\cdot\mathrm{MCC}(X_{:,1}, Y_{:,1})$$

- We substitute $\mathrm{MCC}(x, y)$ in for $\mathrm{MCC}(X_{:,1}, Y_{:,1})$:

$$R_2 = \frac{1}{2}\cdot 2\cdot\mathrm{MCC}(x, y)$$

- We cancel to show that $R_2$ is equal to $\mathrm{MCC}(x, y)$:

$$R_2 = \mathrm{MCC}(x, y)$$
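To convince ourselves numerically (again my own sketch, with made-up labels), we can compare $R_K$ computed on a 2x2 confusion matrix against the familiar binary formula:

```python
import numpy as np

golds = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
preds = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# 2x2 confusion matrix: rows are gold labels, columns are predictions.
C = np.zeros((2, 2))
for g, p in zip(golds, preds):
    C[g, p] += 1
tn, fp, fn, tp = C[0, 0], C[0, 1], C[1, 0], C[1, 1]

# Classic binary MCC read off the confusion matrix.
binary = (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# Multiclass R_K on the same matrix.
x, y, n = C.sum(axis=0), C.sum(axis=1), C.sum()
r_k = (np.trace(C) * n - x @ y) / np.sqrt((n * n - x @ x) * (n * n - y @ y))

assert np.isclose(binary, r_k)
```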
Hopefully this explanation helps you understand why the rather opaque code found in many open source libraries actually does calculate the multiclass Matthews Correlation Coefficient.