Multiclass Matthews Correlation Coefficient

or: How I learned to stop worrying and love the Rk

Have you ever taken a glance at the code in an open source ML library like Mead-Baseline or scikit-learn for the Matthews Correlation Coefficient (also called Rk) and seen something dense and confusing like the following?

    import numpy as np

    def mcc(X, Y):
        # Build the N x N confusion matrix from the two label sequences.
        cm = ConfusionMatrix(X, Y)
        samples = np.sum(cm)    # total number of examples
        correct = np.trace(cm)  # sum of the diagonal, the number of correct predictions
        y = np.sum(cm, axis=1, dtype=np.float64)  # marginal counts along axis 1 (row sums)
        x = np.sum(cm, axis=0, dtype=np.float64)  # marginal counts along axis 0 (column sums)
        cov_x_y = correct * samples - np.dot(x, y)
        cov_y_y = samples * samples - np.dot(y, y)
        cov_x_x = samples * samples - np.dot(x, x)

        denom = np.sqrt(cov_x_x * cov_y_y)
        denom = denom if denom != 0.0 else 1.0
        return cov_x_y / denom

You ask yourself "What is going on?" Luckily, in the comments there is a link to Wikipedia and you think "oh good, this will clear things up". But then you see a formula that looks like the following for the multiclass calculation:

$$\mathrm{MCC} = \frac{\sum_k \sum_l \sum_m C_{kk}C_{lm} - C_{kl}C_{mk}}{\sqrt{\sum_k \left(\sum_l C_{kl}\right)\left(\sum_{k'|k' \neq k} \sum_{l'} C_{k'l'}\right)}\sqrt{\sum_k \left(\sum_l C_{lk}\right)\left(\sum_{k'|k' \neq k} \sum_{l'} C_{l'k'}\right)}}$$

This looks nothing like the code you found, and if anything, looking at Wikipedia has probably made things more confusing. This post will start with a quick introduction to the Rk metric and then hopefully help you understand how the above code works and why it produces the correct result.

                 Prediction
                  1     0
    Gold    1    TP    FN
            0    FP    TN

Throughout this post we will use a confusion matrix that looks like the above. For a given example in the dataset we add one to the cell whose row index is the class of the true label and whose column index is the predicted label. When the prediction matches the gold label the count lands on the diagonal, while examples where the model is wrong appear off of it.

This confusion matrix can be extended to the multiclass setting by adding a row and column for each new class.
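
To make the construction concrete, here is a minimal sketch of building such a confusion matrix with numpy (the helper name build_confusion_matrix and the example labels are my own, purely for illustration):

    import numpy as np

    def build_confusion_matrix(gold, pred, num_classes):
        # Rows index the gold label, columns index the predicted label,
        # matching the table above.
        cm = np.zeros((num_classes, num_classes), dtype=np.int64)
        for g, p in zip(gold, pred):
            cm[g, p] += 1
        return cm

    gold = [1, 0, 1, 1, 0]
    pred = [1, 0, 0, 1, 1]
    print(build_confusion_matrix(gold, pred, 2))
    # [[1 1]
    #  [1 2]]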

The Rk Metric

Definitions of terms

$$R_K = \frac{\mathrm{cov}_k(X, Y)}{\sqrt{\mathrm{cov}_k(X, X)\,\mathrm{cov}_k(Y, Y)}}$$

$$\mathrm{cov}_k(X, Y) = \sum_k^N w_k\,\mathrm{cov}(X_k, Y_k)$$

$$w_k = \frac{1}{N}$$

$$\mathrm{cov}(X, Y) = \sum_s^S (X_s - \bar{X})(Y_s - \bar{Y})$$

$$\bar{X}_k = \frac{1}{S}\sum_s^S X_{sk} = \frac{1}{S}\sum_l^N C_{lk}$$

$$\bar{Y}_k = \frac{1}{S}\sum_s^S Y_{sk} = \frac{1}{S}\sum_l^N C_{kl}$$

Simplifying the Rk calculation

$$\begin{aligned}
\mathrm{cov}_k(X, Y) &= \sum_k^N w_k\,\mathrm{cov}(X_k, Y_k) \\
\mathrm{cov}_k(X, Y) &= \sum_k^N w_k \sum_s^S (X_{sk} - \bar{X}_k)(Y_{sk} - \bar{Y}_k) \\
\mathrm{cov}_k(X, Y) &= w_k \sum_s^S \sum_k^N (X_{sk} - \bar{X}_k)(Y_{sk} - \bar{Y}_k) \\
\mathrm{cov}_k(X, Y) &= w_k \sum_s^S \sum_k^N \left( X_{sk}Y_{sk} - X_{sk}\bar{Y}_k - Y_{sk}\bar{X}_k + \bar{X}_k\bar{Y}_k \right) \\
\mathrm{cov}_k(X, Y) &= w_k \left( \sum_s^S \sum_k^N X_{sk}Y_{sk} - \sum_s^S \sum_k^N X_{sk}\bar{Y}_k - \sum_s^S \sum_k^N Y_{sk}\bar{X}_k + \sum_s^S \sum_k^N \bar{X}_k\bar{Y}_k \right)
\end{aligned}$$

Simplifying $\sum_s^S \sum_k^N \bar{X}_k\bar{Y}_k$

$$\begin{aligned}
\sum_s^S \sum_k^N \bar{X}_k\bar{Y}_k &= S \sum_k^N \bar{X}_k\bar{Y}_k \\
\sum_s^S \sum_k^N \bar{X}_k\bar{Y}_k &= S \sum_k^N \frac{1}{S}\sum_l^N C_{lk}\,\frac{1}{S}\sum_l^N C_{kl} \\
\sum_s^S \sum_k^N \bar{X}_k\bar{Y}_k &= \frac{S}{S^2} \sum_k^N \sum_l^N C_{lk} \sum_l^N C_{kl} \\
\sum_s^S \sum_k^N \bar{X}_k\bar{Y}_k &= \frac{1}{S} \sum_k^N \sum_l^N C_{lk} \sum_l^N C_{kl}
\end{aligned}$$

Simplifying $\sum_s^S \sum_k^N X_{sk}\bar{Y}_k$ and $\sum_s^S \sum_k^N Y_{sk}\bar{X}_k$

$$\begin{aligned}
\sum_s^S \sum_k^N X_{sk}\bar{Y}_k &= \sum_k^N \bar{Y}_k \sum_s^S X_{sk} \\
\sum_s^S X_{sk} &= \sum_l^N C_{lk} \\
\sum_s^S \sum_k^N X_{sk}\bar{Y}_k &= \sum_k^N \sum_l^N C_{lk}\,\frac{1}{S}\sum_l^N C_{kl} \\
\sum_s^S \sum_k^N X_{sk}\bar{Y}_k &= \frac{1}{S} \sum_k^N \sum_l^N C_{lk} \sum_l^N C_{kl} \\
\sum_s^S \sum_k^N Y_{sk}\bar{X}_k &= \frac{1}{S} \sum_k^N \sum_l^N C_{kl} \sum_l^N C_{lk}
\end{aligned}$$

From a math perspective there isn't a lot to do with the first term, $\sum_s^S \sum_k^N X_{sk}Y_{sk}$; we'll come back to it when we convert the math to code.

Simplification of covk

$$\begin{aligned}
\mathrm{cov}_k(X, Y) &= w_k \left( \sum_s^S \sum_k^N X_{sk}Y_{sk} - \frac{1}{S}\sum_k^N \sum_l^N C_{lk} \sum_l^N C_{kl} - \frac{1}{S}\sum_k^N \sum_l^N C_{kl} \sum_l^N C_{lk} + \frac{1}{S}\sum_k^N \sum_l^N C_{lk} \sum_l^N C_{kl} \right) \\
\mathrm{cov}_k(X, Y) &= w_k \left( \sum_s^S \sum_k^N X_{sk}Y_{sk} - \frac{1}{S}\sum_k^N \sum_l^N C_{kl} \sum_l^N C_{lk} \right) \\
\mathrm{cov}_k(X, Y) &= \frac{\sum_s^S \sum_k^N X_{sk}Y_{sk} - \frac{1}{S}\sum_k^N \sum_l^N C_{kl} \sum_l^N C_{lk}}{N} \\
\mathrm{cov}_k(X, Y) &= \frac{\sum_s^S \sum_k^N X_{sk}Y_{sk} - \frac{1}{S}\sum_k^N \sum_l^N C_{kl} \sum_l^N C_{lk}}{N} \cdot \frac{S}{S} \\
\mathrm{cov}_k(X, Y) &= \frac{S\left(\sum_s^S \sum_k^N X_{sk}Y_{sk}\right) - \frac{S}{S}\left(\sum_k^N \sum_l^N C_{kl} \sum_l^N C_{lk}\right)}{NS} \\
\mathrm{cov}_k(X, Y) &= \frac{S\sum_s^S \sum_k^N X_{sk}Y_{sk} - \sum_k^N \sum_l^N C_{kl} \sum_l^N C_{lk}}{NS} \\
\mathrm{cov}_k(X, X) &= \frac{S\sum_s^S \sum_k^N X_{sk}X_{sk} - \sum_k^N \sum_l^N C_{lk} \sum_l^N C_{lk}}{NS} \\
\mathrm{cov}_k(Y, Y) &= \frac{S\sum_s^S \sum_k^N Y_{sk}Y_{sk} - \sum_k^N \sum_l^N C_{kl} \sum_l^N C_{kl}}{NS}
\end{aligned}$$

Simplification of Rk

$$\begin{aligned}
R_K &= \frac{\frac{S\sum_s^S\sum_k^N X_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{lk}}{NS}}{\sqrt{\frac{S\sum_s^S\sum_k^N X_{sk}X_{sk} - \sum_k^N\sum_l^N C_{lk}\sum_l^N C_{lk}}{NS}}\sqrt{\frac{S\sum_s^S\sum_k^N Y_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{kl}}{NS}}} \\
R_K &= \frac{\frac{S\sum_s^S\sum_k^N X_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{lk}}{NS}}{\sqrt{\frac{\left(S\sum_s^S\sum_k^N X_{sk}X_{sk} - \sum_k^N\sum_l^N C_{lk}\sum_l^N C_{lk}\right)\left(S\sum_s^S\sum_k^N Y_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{kl}\right)}{(NS)^2}}} \\
R_K &= \frac{\frac{S\sum_s^S\sum_k^N X_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{lk}}{NS}}{\frac{\sqrt{\left(S\sum_s^S\sum_k^N X_{sk}X_{sk} - \sum_k^N\sum_l^N C_{lk}\sum_l^N C_{lk}\right)\left(S\sum_s^S\sum_k^N Y_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{kl}\right)}}{\sqrt{(NS)^2}}} \\
R_K &= \frac{\frac{S\sum_s^S\sum_k^N X_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{lk}}{NS}}{\frac{\sqrt{\left(S\sum_s^S\sum_k^N X_{sk}X_{sk} - \sum_k^N\sum_l^N C_{lk}\sum_l^N C_{lk}\right)\left(S\sum_s^S\sum_k^N Y_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{kl}\right)}}{NS}} \\
R_K &= \frac{S\sum_s^S\sum_k^N X_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{lk}}{NS} \cdot \frac{NS}{\sqrt{\left(S\sum_s^S\sum_k^N X_{sk}X_{sk} - \sum_k^N\sum_l^N C_{lk}\sum_l^N C_{lk}\right)\left(S\sum_s^S\sum_k^N Y_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{kl}\right)}} \\
R_K &= \frac{S\sum_s^S\sum_k^N X_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{lk}}{\sqrt{\left(S\sum_s^S\sum_k^N X_{sk}X_{sk} - \sum_k^N\sum_l^N C_{lk}\sum_l^N C_{lk}\right)\left(S\sum_s^S\sum_k^N Y_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{kl}\right)}}
\end{aligned}$$
  • Rk is the multiclass extension of the Pearson correlation coefficient applied to discrete (one-hot) labels, and this is its definition.
    X and Y are one-hot matrices in $\mathbb{R}^{S \times N}$. Each row in a matrix represents an example in the dataset and each column represents a class in the problem. The row for example s has a 1 in column k if the label for that example is k and is zero otherwise. X represents the labels predicted by the model while Y holds the gold labels.
  • We define the covariance between two matrices to be a linear combination of the covariance between the columns of the matrices.
  • wk is the weight on some class k in our linear combination. We decide to use a uniform prior on the importance of each class.
  • This is the definition of covariance.
  • $\bar{X}_k$ is the mean of the kth column of X. Because X is a one-hot matrix where column k of row s is one if that example was labeled with k, we can read this value from the confusion matrix: it is the sum of all elements in column k of the confusion matrix, because that is the class in X regardless of the class in Y. The denominator in the mean is the total number of samples because the mean is actually over all S rows in X, even if we read the values from the confusion matrix.
  • We can do the same transformation into a confusion matrix based calculation of Y except that the sum is along a row because we care about the value in Y regardless of the value in X.
  • Again this is our definition of covariance between two matrices.
  • Here we substitute in the definition of covariance between vectors.
  • wk is a constant across each value in the summation. It is a multiplication, which distributes over addition, so we can pull it out of the summation. Note this only works because we defined $w_k = \frac{1}{N}$; if wk were different for each class k we would not be able to factor it out of the summation over k.
  • We FOIL (expand) the product inside the covariance calculation.
  • By thinking of every subtraction in this equation as adding a negated term, all the terms are combined with addition. Addition is commutative, so we can rearrange the summations to apply them to each term individually.
  • We can see that the variable s is not used inside the summation over k, which means the value calculated by that summation is repeated and summed S times. This is the definition of multiplication by S.
  • We substitute in our definitions of X¯k and Y¯k in terms of the confusion matrix
  • Division is multiplication by the reciprocal, and multiplication distributes over the addition of the summation, so we can pull the division by S out.
  • The S in the numerator cancels with one of the Ss in the denominator.
  • Similarly to above, we can factor $\bar{Y}_k$ out of the summation over s because $\bar{Y}_k$ doesn't depend on s.
  • sSXsk is the number of examples in X that belong to class k. This can be read from the confusion matrix as described before.
  • We substitute the confusion matrix definition for $\sum_s^S X_{sk}$ as well as for $\bar{Y}_k$.
  • As established before, we can factor the multiplication by $\frac{1}{S}$ out of the summation because multiplication distributes over addition.
  • $\sum_s^S\sum_k^N Y_{sk}\bar{X}_k$ can be expanded similarly.
  • We substitute our simplified expressions into the foiled calculation of covk
  • Here we can see that the second term (subtracted) and the fourth term (added) are identical, so they cancel out.
  • We substitute our definition of wk
  • Any number divided by itself is 1: $\frac{x}{x} = x \cdot \frac{1}{x}$, where $\frac{1}{x}$ is the multiplicative inverse (or reciprocal) of x, and a number times its multiplicative inverse is 1. We can multiply one side of the equation by 1 (here $\frac{S}{S}$) because 1 is the multiplicative identity, so the value is unchanged.
  • The S in the numerator distributes over the subtraction. (You actually need to recast the subtraction as addition of the negation in order to do it.)
  • The S in the denominator of the second term of the numerator cancels.
  • covk(X,X) can be simplified in the same way, but since there is no $\bar{Y}_k$ all of the confusion matrix calculations operate on columns, i.e. on $\sum_l^N C_{lk}$.
  • In the same vein, covk(Y,Y) only has confusion matrix operations on $\sum_l^N C_{kl}$.
  • Let's plug this back into the Rk definition.
  • By the rule of square roots $\sqrt{a}\sqrt{b} = \sqrt{ab}$ we can combine the two terms in the denominator.
  • By the rule of square roots $\sqrt{\frac{a}{b}} = \frac{\sqrt{a}}{\sqrt{b}}$ we can split the denominator into two separate terms of a fraction.
  • $\sqrt{a^2} = a$, so the square and square root in the denominator cancel.
  • Division is multiplication by the reciprocal so we can rewrite the equation.
  • The NS's cancel, leaving us with the final simplified Rk. (A numeric check of this simplification appears right after this list.)
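
If you want to convince yourself numerically that the simplification is correct, here is a small sketch (my own check, not library code; the label vectors are made up) that computes Rk once from the one-hot matrices using the covariance definitions, and once from the confusion matrix using the simplified formula:

    import numpy as np

    gold = np.array([0, 2, 1, 1, 0, 2, 2, 1])
    pred = np.array([0, 2, 1, 0, 0, 1, 2, 1])
    N = 3
    S = len(gold)

    Y = np.eye(N)[gold]   # one-hot gold labels, shape (S, N)
    X = np.eye(N)[pred]   # one-hot predictions, shape (S, N)

    def cov_k(A, B):
        # cov_k as defined above: a uniform (w_k = 1/N) combination of
        # per-column covariances, without the 1/S normalization (it cancels in the ratio).
        return sum((A[:, k] - A[:, k].mean()) @ (B[:, k] - B[:, k].mean())
                   for k in range(N)) / N

    rk_onehot = cov_k(X, Y) / np.sqrt(cov_k(X, X) * cov_k(Y, Y))

    # The same value from the confusion matrix, using the simplified formula.
    C = np.zeros((N, N))
    for g, p in zip(gold, pred):
        C[g, p] += 1
    row = C.sum(axis=1)   # sum_l C_kl
    col = C.sum(axis=0)   # sum_l C_lk
    numerator = S * np.trace(C) - row @ col
    rk_cm = numerator / np.sqrt((S * S - col @ col) * (S * S - row @ row))

    print(np.isclose(rk_onehot, rk_cm))  # True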

$$R_K = \frac{S\sum_s^S\sum_k^N X_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{lk}}{\sqrt{\left(S\sum_s^S\sum_k^N X_{sk}X_{sk} - \sum_k^N\sum_l^N C_{lk}\sum_l^N C_{lk}\right)\left(S\sum_s^S\sum_k^N Y_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{kl}\right)}}$$

This still looks fairly complicated, and we have an annoying disconnect: the majority of the terms are calculated from the confusion matrix C, but a few, $\sum_s^S X_{sk}Y_{sk}$, $\sum_s^S X_{sk}X_{sk}$, and $\sum_s^S Y_{sk}Y_{sk}$, are calculated from the one-hot representations of the predicted and gold labels. Calculating these terms using the one-hot representation would be a waste of memory, and the creation of the one-hots would be a waste of time. We want to calculate these terms from the confusion matrix too. In the next section we will begin converting the terms in Rk into code, and we will see how properties of these non-confusion-matrix terms let us convert them into simple operations that act on the confusion matrix.

Converting to numpy

Let's start translating these into numpy code. We'll start with the simple ones.

  • S is the number of examples in the dataset. Because each example adds one to the confusion matrix, S is just the sum of all the values in the confusion matrix. This is an easy translation to numpy: np.sum(C)
  • $\sum_l^N C_{kl}$ is the sum of the elements in the kth row. Calculating this sum for each row and stacking the results produces a vector that holds the sum of each row respectively. Numpy is vectorized, so we can convert these sums and the creation of the vector into a single call: np.sum(C, axis=1). (Because C's rows are the gold labels, this vector matches y in the code at the top.)
  • $\sum_l^N C_{lk}$ is the same except the sum runs down the kth column instead of along the kth row. This is similarly converted to numpy as np.sum(C, axis=0), matching x in the code at the top.
  • Terms of the form $\sum_k^N \sum_l^N C_{kl} \sum_l^N C_{lk}$ are sums of products between the sum of some row k and the sum of some column k (or two row sums, or two column sums). Once we have pre-computed the row and column sums and stored them in two vectors, this operation is the dot product (the sum of the products of the elements of two vectors, $\sum_i x_i y_i$). The three instances translate as follows (and are collected into a sketch after this list):
    • $\sum_k^N \sum_l^N C_{kl} \sum_l^N C_{lk}$ from cov(X,Y) translates to np.dot(np.sum(C, axis=1), np.sum(C, axis=0))
    • $\sum_k^N \sum_l^N C_{kl} \sum_l^N C_{kl}$ from cov(Y,Y) translates to np.dot(np.sum(C, axis=1), np.sum(C, axis=1))
    • $\sum_k^N \sum_l^N C_{lk} \sum_l^N C_{lk}$ from cov(X,X) translates to np.dot(np.sum(C, axis=0), np.sum(C, axis=0))
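
Putting those translations side by side, here is a small sketch (with a made-up confusion matrix C, purely for illustration):

    import numpy as np

    # A hypothetical 3-class confusion matrix (rows = gold, columns = predicted).
    C = np.array([[4, 1, 0],
                  [2, 3, 1],
                  [0, 1, 5]], dtype=np.float64)

    S = np.sum(C)                 # number of examples
    row_sums = np.sum(C, axis=1)  # sum_l C_kl, one entry per gold class (y in the code above)
    col_sums = np.sum(C, axis=0)  # sum_l C_lk, one entry per predicted class (x in the code above)

    xy_term = np.dot(row_sums, col_sums)  # sum_k (sum_l C_kl)(sum_l C_lk), used in cov(X, Y)
    yy_term = np.dot(row_sums, row_sums)  # sum_k (sum_l C_kl)^2, used in cov(Y, Y)
    xx_term = np.dot(col_sums, col_sums)  # sum_k (sum_l C_lk)^2, used in cov(X, X)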

Transforming $\sum_s^S\sum_k^N X_{sk}Y_{sk}$ is a little trickier and depends on understanding what these matrices represent. X and Y are one-hot matrices in $\mathbb{R}^{S \times N}$ where $X_{sk}$ is 1 if example s was classified as class k. Because these are one-hot matrices, only a single element in a given row is 1. When example s is classified correctly the 1 is in the same position in X and Y, so the sum over k is one; all the other products are 0·0 = 0. On the other hand, when the predicted label and the gold label differ, each product is 0·0 = 0 for classes that are neither label, 1·0 = 0 for the predicted class, and 0·1 = 0 for the gold class, so the summation over k is 0. This applies to each example independently. In summary, the value is 1 for correct examples and 0 for incorrect ones, so the sum over s is the number of correct examples, which can be read from the confusion matrix: the counts on the diagonal are the number of examples for each class where the predicted label matches the gold label. This sum can be expressed as $\sum_k^N C_{kk}$ and as np.sum(np.diagonal(C)) or np.trace(C) in code.

When the two inputs match (both are X or both are Y), the elements at a given s, k always match. Because these are one-hots, the multiplication is 1 at the position of the label and 0 everywhere else, so the summation over k is always one, and with the sum over s the result is always the number of examples, S. This can be pulled from the confusion matrix with np.sum(C).
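
Here is a small sketch to check both facts numerically (the label vectors are arbitrary examples of my own):

    import numpy as np

    gold = np.array([0, 2, 1, 1, 0, 2])
    pred = np.array([0, 2, 1, 0, 0, 1])
    N, S = 3, len(gold)

    Y = np.eye(N)[gold]  # one-hot gold labels
    X = np.eye(N)[pred]  # one-hot predictions

    C = np.zeros((N, N))
    for g, p in zip(gold, pred):
        C[g, p] += 1

    print(np.sum(X * Y) == np.trace(C))  # True: both count the correct examples
    print(np.sum(X * X) == np.sum(C))    # True: both are just S
    print(np.sum(Y * Y) == np.sum(C))    # True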

We now have code snippets that we can use to calculate all the terms in our simplified Rk. The actual code uses some local variables to avoid repeating calculations, but it should now be possible to see the equivalence between the code at the beginning, the code we outlined here, and the math.
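
As a final sanity check, here is a minimal sketch (my own assembly of the snippets above, not code from any particular library) that builds the confusion matrix with scikit-learn, computes the simplified Rk from it, and compares the result against sklearn.metrics.matthews_corrcoef:

    import numpy as np
    from sklearn.metrics import confusion_matrix, matthews_corrcoef

    def rk_from_confusion_matrix(C):
        # The simplified Rk, computed entirely from the confusion matrix.
        S = np.sum(C)
        correct = np.trace(C)
        gold_counts = np.sum(C, axis=1, dtype=np.float64)  # sum_l C_kl
        pred_counts = np.sum(C, axis=0, dtype=np.float64)  # sum_l C_lk
        numerator = S * correct - np.dot(pred_counts, gold_counts)
        denominator = np.sqrt(
            (S * S - np.dot(pred_counts, pred_counts))
            * (S * S - np.dot(gold_counts, gold_counts))
        )
        return numerator / denominator if denominator != 0.0 else 0.0

    gold = [0, 2, 1, 1, 0, 2, 2, 1]
    pred = [0, 2, 1, 0, 0, 1, 2, 1]
    C = confusion_matrix(gold, pred)  # rows are gold labels, columns are predictions
    print(np.isclose(rk_from_confusion_matrix(C), matthews_corrcoef(gold, pred)))  # True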

An Alternate Look at simplifying cov

We simplified $\sum_s^S\sum_k^N X_{sk}\bar{Y}_k$ by converting it into operations on the confusion matrix, which let us cancel out terms in the foiled Rk calculation. There is another way to manipulate the terms to get them to cancel.

$$\begin{aligned}
\bar{X}_k &= \frac{1}{S}\sum_s^S X_{sk} \\
S\bar{X}_k &= \sum_s^S X_{sk} \\
\sum_s^S \sum_k^N X_{sk}\bar{Y}_k &= \sum_k^N S\bar{X}_k\bar{Y}_k \\
\sum_s^S \sum_k^N X_{sk}\bar{Y}_k &= S\sum_k^N \bar{X}_k\bar{Y}_k \\
\sum_s^S \sum_k^N Y_{sk}\bar{X}_k &= S\sum_k^N \bar{Y}_k\bar{X}_k \\
\mathrm{cov}_k(X, Y) &= w_k\left(\sum_s^S\sum_k^N X_{sk}Y_{sk} - S\sum_k^N \bar{X}_k\bar{Y}_k - S\sum_k^N \bar{Y}_k\bar{X}_k + S\sum_k^N \bar{X}_k\bar{Y}_k\right) \\
\mathrm{cov}_k(X, Y) &= w_k\left(\sum_s^S\sum_k^N X_{sk}Y_{sk} - S\sum_k^N \bar{Y}_k\bar{X}_k\right)
\end{aligned}$$
  • Recall our definition for X¯k
  • Multiply each side by S
  • Substitute this new value into the term.
  • S doesn't depend on the summation index, and because multiplication distributes over addition we can pull S out of the summation.
  • We can do the same substitution for $\sum_s^S\sum_k^N Y_{sk}\bar{X}_k$.
  • Substitute these terms, as well as our first transformation of the last term, into the foiled covk expression.
  • Cancel terms. (A small numeric check of this identity follows below.)
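
Here is a tiny numeric check of that substitution (arbitrary one-hot matrices of my own):

    import numpy as np

    gold = np.array([0, 2, 1, 1, 0, 2])
    pred = np.array([0, 2, 1, 0, 0, 1])
    N, S = 3, len(gold)
    X = np.eye(N)[pred]      # one-hot predictions
    Y = np.eye(N)[gold]      # one-hot gold labels
    X_bar = X.mean(axis=0)   # column means of X
    Y_bar = Y.mean(axis=0)   # column means of Y

    lhs = np.sum(X * Y_bar)           # sum_s sum_k X_sk * Ybar_k
    rhs = S * np.sum(X_bar * Y_bar)   # S * sum_k Xbar_k * Ybar_k
    print(np.isclose(lhs, rhs))       # True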

Reduction to Matthews Correlation Coefficient for N=2

The Wikipedia page for MCC gives multiple ways to calculate it by reading values from the confusion matrix, as well as the original equations Matthews used in his paper. That presentation skips, however, the original correlation-based formula Matthews presented, in which the true label distribution and the predicted label distribution are cast as random variables and the correlation between the two is measured. This formulation lets us see that MCC is equivalent to the Rk metric.

$$\begin{aligned}
\mathrm{MCC}(X, Y) &= \frac{\sum_s^S (Y_s - \bar{Y})(X_s - \bar{X})}{\sqrt{\sum_s^S (X_s - \bar{X})^2}\sqrt{\sum_s^S (Y_s - \bar{Y})^2}} \\
\mathrm{MCC}(X, Y) &= \frac{\sum_s^S (Y_s - \bar{Y})(X_s - \bar{X})}{\sqrt{\sum_s^S (X_s - \bar{X})(X_s - \bar{X})}\sqrt{\sum_s^S (Y_s - \bar{Y})(Y_s - \bar{Y})}} \\
\mathrm{MCC}(X, Y) &= \frac{\mathrm{cov}(Y, X)}{\sqrt{\mathrm{cov}(X, X)\,\mathrm{cov}(Y, Y)}} \\
\mathrm{MCC}(X, Y) &= \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{cov}(X, X)\,\mathrm{cov}(Y, Y)}}
\end{aligned}$$
  • This is the correlation-based definition of MCC from his original paper, with variables renamed to fit our scheme: Matthews' predicted labels ($P_n$) are now X and the observations (previously $S_n$) are now Y.
  • $\sum_s^S (X_s - \bar{X})^2$ expands to $\sum_s^S (X_s - \bar{X})(X_s - \bar{X})$. The Y term can be similarly expanded.
  • These are the covariance definitions from earlier
  • cov is symmetric, so $\mathrm{cov}(Y, X) = \mathrm{cov}(X, Y)$. We can see this because all interactions between the variables in $\sum_s^S (Y_s - \bar{Y})(X_s - \bar{X})$ are multiplications, and multiplication is commutative, so we can swap the order.

We can see above that this calculation for MCC is almost exactly the same as for Rk! The only difference is that MCC is calculated with the covariance between vectors of 0s and 1s of length S, while Rk uses the covk function to calculate covariance between two one-hot matrices in $\mathbb{R}^{S \times 2}$.

This next step is a little odd and I don't have a great way to demonstrate it mathematically, but we should be able to see that these results will be the same.

$$x = \begin{bmatrix}1 \\ 0 \\ 1 \\ 1 \\ 0\end{bmatrix} \qquad X = \begin{bmatrix}0 & 1 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 1 & 0\end{bmatrix}$$
  • Imagine $x$ is some vector of ones and zeros that represents the labels for examples in a dataset.
  • $X$ is the one-hot representation of $x$. We see that column 1 is exactly the same as the original vector and column 0 is the flipped version (all zeros become ones and all ones become zeros).
$$\begin{aligned}
\mathrm{MCC}(x, y) &= \alpha \\
R_K(X_1, Y_1) &= \alpha \\
R_K(X_0, Y_0) &= \alpha \\
w_k &= \frac{1}{2} \\
R_K &= w_k\alpha + w_k\alpha \\
R_K &= w_k(\alpha + \alpha) \\
R_K &= w_k \cdot 2\alpha \\
R_K &= \frac{2\alpha}{2} \\
R_K &= \alpha
\end{aligned}$$
  • For some binary label vectors x and y we say the Matthews Correlation Coefficient is α.
  • We saw above that the calculation for a single column in Rk is the same as for MCC, and we saw how column one of X is the same as x (and likewise for Y). This means the result will also be α.
  • For column zero of X we saw that it is the same as x but flipped. This means that all the means, sums, and the like calculated by Rk come out the same, so this result is also α.
  • Earlier we defined wk to be one over the number of classes, which in this case is 2.
  • We know that Rk is the linear combination of the Rk calculated for each column weighted by wk as expressed here.
  • We factor the wk out of each term.
  • Multiplication is repeated addition.
  • We substitute 12 in for wk.
  • We cancel to show that Rk(X, Y) is equal to MCC(x, y). (The sketch below checks this numerically for a binary example.)
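
And a quick numeric check of this reduction (my own sketch; the binary labels are arbitrary and sklearn.metrics.matthews_corrcoef serves as the reference binary MCC):

    import numpy as np
    from sklearn.metrics import confusion_matrix, matthews_corrcoef

    gold = [1, 0, 1, 1, 0, 0, 1, 0]
    pred = [1, 0, 0, 1, 0, 1, 1, 1]

    # Rk computed from the 2x2 confusion matrix with the simplified formula.
    C = confusion_matrix(gold, pred).astype(np.float64)
    S = C.sum()
    gold_counts = C.sum(axis=1)
    pred_counts = C.sum(axis=0)
    rk = (S * np.trace(C) - pred_counts @ gold_counts) / np.sqrt(
        (S * S - pred_counts @ pred_counts) * (S * S - gold_counts @ gold_counts)
    )

    print(np.isclose(rk, matthews_corrcoef(gold, pred)))  # True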

Hopefully this explanation helps you understand why the rather opaque code found in many open source libraries actually does calculate the Rk metric, and why some people call it Rk while others call it MCC.