Multiclass Matthews Correlation Coefficient

or: How I learned to stop worrying and love the Rk

Have you ever taken a glance at the code in an open source ML library like Mead-Baseline or scikit-learn for the Matthews Correlation Coefficient (also called Rk) and seen something dense and confusing like the following?

    import numpy as np

    def mcc(X, Y):
        # Build the N x N confusion matrix from the two label sequences.
        cm = ConfusionMatrix(X, Y)
        samples = np.sum(cm)    # total number of examples
        correct = np.trace(cm)  # sum of the diagonal, the number of correct predictions
        y = np.sum(cm, axis=1, dtype=np.float64)  # marginal counts along axis 1 (row sums)
        x = np.sum(cm, axis=0, dtype=np.float64)  # marginal counts along axis 0 (column sums)
        cov_x_y = correct * samples - np.dot(x, y)
        cov_y_y = samples * samples - np.dot(y, y)
        cov_x_x = samples * samples - np.dot(x, x)

        denom = np.sqrt(cov_x_x * cov_y_y)
        denom = denom if denom != 0.0 else 1.0
        return cov_x_y / denom

You ask yourself "What is going on?" Luckily, in the comments there is a link to Wikipedia and you think "oh good, this will clear things up". But then you see a formula that looks like the following for the multiclass calculation:

$$\mathrm{MCC} = \frac{\sum_k \sum_l \sum_m C_{kk}C_{lm} - C_{kl}C_{mk}}{\sqrt{\sum_k \left(\sum_l C_{kl}\right)\left(\sum_{k'|k' \neq k} \sum_{l'} C_{k'l'}\right)}\sqrt{\sum_k \left(\sum_l C_{lk}\right)\left(\sum_{k'|k' \neq k} \sum_{l'} C_{l'k'}\right)}}$$

This looks nothing like the code you found, and if anything, looking at Wikipedia has probably made things more confusing. This post will start with a quick introduction to the Rk metric and then hopefully help you understand how the above code works and why it produces the correct result.

                 Prediction
                  1     0
    Gold    1    TP    FN
            0    FP    TN

Throughout this post we will use a confusion matrix that looks like the above. For a given example in the dataset we add one to the cell whose row index is the class of the true label and whose column index is the predicted label. When the prediction matches the gold label the count lands on the diagonal, while examples where the model is wrong appear off of it.

This confusion matrix can be extended to the multiclass setting by adding a row and column for each new class.
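
To make the construction concrete, here is a minimal sketch of building such a confusion matrix with numpy (the helper name build_confusion_matrix and the example labels are my own, purely for illustration):

    import numpy as np

    def build_confusion_matrix(gold, pred, num_classes):
        # Rows index the gold label, columns index the predicted label,
        # matching the table above.
        cm = np.zeros((num_classes, num_classes), dtype=np.int64)
        for g, p in zip(gold, pred):
            cm[g, p] += 1
        return cm

    gold = [1, 0, 1, 1, 0]
    pred = [1, 0, 0, 1, 1]
    print(build_confusion_matrix(gold, pred, 2))
    # [[1 1]
    #  [1 2]]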

The Rk Metric

Definitions of terms

$$R_K = \frac{\mathrm{cov}_k(X, Y)}{\sqrt{\mathrm{cov}_k(X, X)\,\mathrm{cov}_k(Y, Y)}}$$

$$\mathrm{cov}_k(X, Y) = \sum_k^N w_k\,\mathrm{cov}(X_k, Y_k)$$

$$w_k = \frac{1}{N}$$

$$\mathrm{cov}(X, Y) = \sum_s^S (X_s - \bar{X})(Y_s - \bar{Y})$$

$$\bar{X}_k = \frac{1}{S}\sum_s^S X_{sk} = \frac{1}{S}\sum_l^N C_{lk}$$

$$\bar{Y}_k = \frac{1}{S}\sum_s^S Y_{sk} = \frac{1}{S}\sum_l^N C_{kl}$$

Simplifying the Rk calculation

$$\begin{aligned}
\mathrm{cov}_k(X, Y) &= \sum_k^N w_k\,\mathrm{cov}(X_k, Y_k) \\
\mathrm{cov}_k(X, Y) &= \sum_k^N w_k \sum_s^S (X_{sk} - \bar{X}_k)(Y_{sk} - \bar{Y}_k) \\
\mathrm{cov}_k(X, Y) &= w_k \sum_s^S \sum_k^N (X_{sk} - \bar{X}_k)(Y_{sk} - \bar{Y}_k) \\
\mathrm{cov}_k(X, Y) &= w_k \sum_s^S \sum_k^N \left( X_{sk}Y_{sk} - X_{sk}\bar{Y}_k - Y_{sk}\bar{X}_k + \bar{X}_k\bar{Y}_k \right) \\
\mathrm{cov}_k(X, Y) &= w_k \left( \sum_s^S \sum_k^N X_{sk}Y_{sk} - \sum_s^S \sum_k^N X_{sk}\bar{Y}_k - \sum_s^S \sum_k^N Y_{sk}\bar{X}_k + \sum_s^S \sum_k^N \bar{X}_k\bar{Y}_k \right)
\end{aligned}$$

Simplifying $\sum_s^S \sum_k^N \bar{X}_k\bar{Y}_k$

$$\begin{aligned}
\sum_s^S \sum_k^N \bar{X}_k\bar{Y}_k &= S \sum_k^N \bar{X}_k\bar{Y}_k \\
\sum_s^S \sum_k^N \bar{X}_k\bar{Y}_k &= S \sum_k^N \frac{1}{S}\sum_l^N C_{lk}\,\frac{1}{S}\sum_l^N C_{kl} \\
\sum_s^S \sum_k^N \bar{X}_k\bar{Y}_k &= \frac{S}{S^2} \sum_k^N \sum_l^N C_{lk} \sum_l^N C_{kl} \\
\sum_s^S \sum_k^N \bar{X}_k\bar{Y}_k &= \frac{1}{S} \sum_k^N \sum_l^N C_{lk} \sum_l^N C_{kl}
\end{aligned}$$

Simplifying $\sum_s^S \sum_k^N X_{sk}\bar{Y}_k$ and $\sum_s^S \sum_k^N Y_{sk}\bar{X}_k$

$$\begin{aligned}
\sum_s^S \sum_k^N X_{sk}\bar{Y}_k &= \sum_k^N \bar{Y}_k \sum_s^S X_{sk} \\
\sum_s^S X_{sk} &= \sum_l^N C_{lk} \\
\sum_s^S \sum_k^N X_{sk}\bar{Y}_k &= \sum_k^N \sum_l^N C_{lk}\,\frac{1}{S}\sum_l^N C_{kl} \\
\sum_s^S \sum_k^N X_{sk}\bar{Y}_k &= \frac{1}{S} \sum_k^N \sum_l^N C_{lk} \sum_l^N C_{kl} \\
\sum_s^S \sum_k^N Y_{sk}\bar{X}_k &= \frac{1}{S} \sum_k^N \sum_l^N C_{kl} \sum_l^N C_{lk}
\end{aligned}$$

From a math perspective there isn't a lot to do with the first term, $\sum_s^S \sum_k^N X_{sk}Y_{sk}$; we'll come back to it when we convert the math to code.

Simplification of covk

$$\begin{aligned}
\mathrm{cov}_k(X, Y) &= w_k \left( \sum_s^S \sum_k^N X_{sk}Y_{sk} - \frac{1}{S}\sum_k^N \sum_l^N C_{lk} \sum_l^N C_{kl} - \frac{1}{S}\sum_k^N \sum_l^N C_{kl} \sum_l^N C_{lk} + \frac{1}{S}\sum_k^N \sum_l^N C_{lk} \sum_l^N C_{kl} \right) \\
\mathrm{cov}_k(X, Y) &= w_k \left( \sum_s^S \sum_k^N X_{sk}Y_{sk} - \frac{1}{S}\sum_k^N \sum_l^N C_{kl} \sum_l^N C_{lk} \right) \\
\mathrm{cov}_k(X, Y) &= \frac{\sum_s^S \sum_k^N X_{sk}Y_{sk} - \frac{1}{S}\sum_k^N \sum_l^N C_{kl} \sum_l^N C_{lk}}{N} \\
\mathrm{cov}_k(X, Y) &= \frac{\sum_s^S \sum_k^N X_{sk}Y_{sk} - \frac{1}{S}\sum_k^N \sum_l^N C_{kl} \sum_l^N C_{lk}}{N} \cdot \frac{S}{S} \\
\mathrm{cov}_k(X, Y) &= \frac{S\left(\sum_s^S \sum_k^N X_{sk}Y_{sk}\right) - \frac{S}{S}\left(\sum_k^N \sum_l^N C_{kl} \sum_l^N C_{lk}\right)}{NS} \\
\mathrm{cov}_k(X, Y) &= \frac{S\sum_s^S \sum_k^N X_{sk}Y_{sk} - \sum_k^N \sum_l^N C_{kl} \sum_l^N C_{lk}}{NS} \\
\mathrm{cov}_k(X, X) &= \frac{S\sum_s^S \sum_k^N X_{sk}X_{sk} - \sum_k^N \sum_l^N C_{lk} \sum_l^N C_{lk}}{NS} \\
\mathrm{cov}_k(Y, Y) &= \frac{S\sum_s^S \sum_k^N Y_{sk}Y_{sk} - \sum_k^N \sum_l^N C_{kl} \sum_l^N C_{kl}}{NS}
\end{aligned}$$

Simplification of Rk

$$\begin{aligned}
R_K &= \frac{\frac{S\sum_s^S\sum_k^N X_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{lk}}{NS}}{\sqrt{\frac{S\sum_s^S\sum_k^N X_{sk}X_{sk} - \sum_k^N\sum_l^N C_{lk}\sum_l^N C_{lk}}{NS}}\sqrt{\frac{S\sum_s^S\sum_k^N Y_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{kl}}{NS}}} \\
R_K &= \frac{\frac{S\sum_s^S\sum_k^N X_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{lk}}{NS}}{\sqrt{\frac{\left(S\sum_s^S\sum_k^N X_{sk}X_{sk} - \sum_k^N\sum_l^N C_{lk}\sum_l^N C_{lk}\right)\left(S\sum_s^S\sum_k^N Y_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{kl}\right)}{(NS)^2}}} \\
R_K &= \frac{\frac{S\sum_s^S\sum_k^N X_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{lk}}{NS}}{\frac{\sqrt{\left(S\sum_s^S\sum_k^N X_{sk}X_{sk} - \sum_k^N\sum_l^N C_{lk}\sum_l^N C_{lk}\right)\left(S\sum_s^S\sum_k^N Y_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{kl}\right)}}{\sqrt{(NS)^2}}} \\
R_K &= \frac{\frac{S\sum_s^S\sum_k^N X_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{lk}}{NS}}{\frac{\sqrt{\left(S\sum_s^S\sum_k^N X_{sk}X_{sk} - \sum_k^N\sum_l^N C_{lk}\sum_l^N C_{lk}\right)\left(S\sum_s^S\sum_k^N Y_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{kl}\right)}}{NS}} \\
R_K &= \frac{S\sum_s^S\sum_k^N X_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{lk}}{NS} \cdot \frac{NS}{\sqrt{\left(S\sum_s^S\sum_k^N X_{sk}X_{sk} - \sum_k^N\sum_l^N C_{lk}\sum_l^N C_{lk}\right)\left(S\sum_s^S\sum_k^N Y_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{kl}\right)}} \\
R_K &= \frac{S\sum_s^S\sum_k^N X_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{lk}}{\sqrt{\left(S\sum_s^S\sum_k^N X_{sk}X_{sk} - \sum_k^N\sum_l^N C_{lk}\sum_l^N C_{lk}\right)\left(S\sum_s^S\sum_k^N Y_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{kl}\right)}}
\end{aligned}$$
  • Rk is the multiclass extension of the Pearson correlation coefficient applied to discrete (one-hot) labels, and this is its definition.
    X and Y are one-hot matrices in $\mathbb{R}^{S \times N}$. Each row in a matrix represents an example in the dataset and each column represents a class in the problem. The row for example s has a 1 in column k if the label for that example is k and is zero otherwise. X represents the labels predicted by the model while Y holds the gold labels.
  • We define the covariance between two matrices to be a linear combination of the covariance between the columns of the matrices.
  • wk is the weight on some class k in our linear combination. We decide to use a uniform prior on the importance of each class.
  • This is the definition of covariance.
  • $\bar{X}_k$ is the mean of the kth column of X. Because X is a one-hot matrix where column k of row s is one if that example was labeled with k, we can read this value from the confusion matrix: it is the sum of all elements in column k of the confusion matrix, because that is the class in X regardless of the class in Y. The denominator in the mean is the total number of samples because the mean is actually over all S rows in X, even if we read the values from the confusion matrix.
  • We can do the same transformation into a confusion matrix based calculation of Y except that the sum is along a row because we care about the value in Y regardless of the value in X.
  • Again this is our definition of covariance between two matrices.
  • Here we substitute in the definition of covariance between vectors.
  • wk is a constant across each value in the summation. It is a multiplication, which distributes over addition, so we can pull it out of the summation. Note this only works because we defined $w_k = \frac{1}{N}$; if wk were different for each class k we would not be able to factor it out of the summation over k.
  • We FOIL (expand) the product inside the covariance calculation.
  • By thinking of every subtraction in this equation as adding a negated term, all the terms are combined with addition. Addition is commutative, so we can rearrange the summations to apply them to each term individually.
  • We can see that the variable s is not used inside the summation over k, which means the value calculated by that summation is repeated and summed S times. This is the definition of multiplication by S.
  • We substitute in our definitions of X¯k and Y¯k in terms of the confusion matrix
  • Division is multiplication by the reciprocal, and multiplication distributes over the addition of the summation, so we can pull the division by S out.
  • The S in the numerator cancels with one of the Ss in the denominator.
  • Similarly to above, we can factor $\bar{Y}_k$ out of the summation over s because $\bar{Y}_k$ doesn't depend on s.
  • sSXsk is the number of examples in X that belong to class k. This can be read from the confusion matrix as described before.
  • We substitute the confusion matrix definition for $\sum_s^S X_{sk}$ as well as for $\bar{Y}_k$.
  • As established before, we can factor the multiplication by $\frac{1}{S}$ out of the summation because multiplication distributes over addition.
  • $\sum_s^S\sum_k^N Y_{sk}\bar{X}_k$ can be expanded similarly.
  • We substitute our simplified expressions into the foiled calculation of covk
  • Here we can see that the second term (subtracted) and the fourth term (added) are identical, so they cancel out.
  • We substitute our definition of wk
  • Any number divided by itself is 1: $\frac{x}{x} = x \cdot \frac{1}{x}$, where $\frac{1}{x}$ is the multiplicative inverse (or reciprocal) of x, and a number times its multiplicative inverse is 1. We can multiply one side of the equation by 1 (here $\frac{S}{S}$) because 1 is the multiplicative identity, so the value is unchanged.
  • The S in the numerator distributes over the subtraction. (You actually need to recast the subtraction as addition of the negation in order to do it.)
  • The S in the denominator of the second term of the numerator cancels.
  • covk(X,X) can be simplified in the same way, but since there is no $\bar{Y}_k$ all of the confusion matrix calculations operate on columns, i.e. on $\sum_l^N C_{lk}$.
  • In the same vein, covk(Y,Y) only has confusion matrix operations on $\sum_l^N C_{kl}$.
  • Let's plug this back into the Rk definition.
  • By the rule of square roots $\sqrt{a}\sqrt{b} = \sqrt{ab}$ we can combine the two terms in the denominator.
  • By the rule of square roots $\sqrt{\frac{a}{b}} = \frac{\sqrt{a}}{\sqrt{b}}$ we can split the denominator into two separate terms of a fraction.
  • $\sqrt{a^2} = a$, so the square and square root in the denominator cancel.
  • Division is multiplication by the reciprocal so we can rewrite the equation.
  • The NS's cancel, leaving us with the final simplified Rk. (A numeric check of this simplification appears right after this list.)
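
If you want to convince yourself numerically that the simplification is correct, here is a small sketch (my own check, not library code; the label vectors are made up) that computes Rk once from the one-hot matrices using the covariance definitions, and once from the confusion matrix using the simplified formula:

    import numpy as np

    gold = np.array([0, 2, 1, 1, 0, 2, 2, 1])
    pred = np.array([0, 2, 1, 0, 0, 1, 2, 1])
    N = 3
    S = len(gold)

    Y = np.eye(N)[gold]   # one-hot gold labels, shape (S, N)
    X = np.eye(N)[pred]   # one-hot predictions, shape (S, N)

    def cov_k(A, B):
        # cov_k as defined above: a uniform (w_k = 1/N) combination of
        # per-column covariances, without the 1/S normalization (it cancels in the ratio).
        return sum((A[:, k] - A[:, k].mean()) @ (B[:, k] - B[:, k].mean())
                   for k in range(N)) / N

    rk_onehot = cov_k(X, Y) / np.sqrt(cov_k(X, X) * cov_k(Y, Y))

    # The same value from the confusion matrix, using the simplified formula.
    C = np.zeros((N, N))
    for g, p in zip(gold, pred):
        C[g, p] += 1
    row = C.sum(axis=1)   # sum_l C_kl
    col = C.sum(axis=0)   # sum_l C_lk
    numerator = S * np.trace(C) - row @ col
    rk_cm = numerator / np.sqrt((S * S - col @ col) * (S * S - row @ row))

    print(np.isclose(rk_onehot, rk_cm))  # True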

$$R_K = \frac{S\sum_s^S\sum_k^N X_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{lk}}{\sqrt{\left(S\sum_s^S\sum_k^N X_{sk}X_{sk} - \sum_k^N\sum_l^N C_{lk}\sum_l^N C_{lk}\right)\left(S\sum_s^S\sum_k^N Y_{sk}Y_{sk} - \sum_k^N\sum_l^N C_{kl}\sum_l^N C_{kl}\right)}}$$

This still looks fairly complicated, and we have an annoying disconnect: the majority of the terms are calculated from the confusion matrix C, but a few, $\sum_s^S X_{sk}Y_{sk}$, $\sum_s^S X_{sk}X_{sk}$, and $\sum_s^S Y_{sk}Y_{sk}$, are calculated from the one-hot representations of the predicted and gold labels. Calculating these terms using the one-hot representation would be a waste of memory, and the creation of the one-hots would be a waste of time. We want to calculate these terms from the confusion matrix too. In the next section we will begin converting the terms in Rk into code, and we will see how properties of these non-confusion-matrix terms let us convert them into simple operations that act on the confusion matrix.

Converting to numpy

Let's start translating these into numpy code. We'll start with the simple ones.

  • S is the number of examples in the dataset. Because each example adds one to the confusion matrix, S is just the sum of all the values in the confusion matrix. This is an easy translation to numpy: np.sum(C)
  • $\sum_l^N C_{kl}$ is the sum of the elements in the kth row. Calculating this sum for each row and stacking the results produces a vector that holds the sum of each row respectively. Numpy is vectorized, so we can convert these sums and the creation of the vector into a single call: np.sum(C, axis=1). (Because C's rows are the gold labels, this vector matches y in the code at the top.)
  • $\sum_l^N C_{lk}$ is the same except the sum runs down the kth column instead of along the kth row. This is similarly converted to numpy as np.sum(C, axis=0), matching x in the code at the top.
  • Terms of the form $\sum_k^N \sum_l^N C_{kl} \sum_l^N C_{lk}$ are sums of products between the sum of some row k and the sum of some column k (or two row sums, or two column sums). Once we have pre-computed the row and column sums and stored them in two vectors, this operation is the dot product (the sum of the products of the elements of two vectors, $\sum_i x_i y_i$). The three instances translate as follows (and are collected into a sketch after this list):
    • $\sum_k^N \sum_l^N C_{kl} \sum_l^N C_{lk}$ from cov(X,Y) translates to np.dot(np.sum(C, axis=1), np.sum(C, axis=0))
    • $\sum_k^N \sum_l^N C_{kl} \sum_l^N C_{kl}$ from cov(Y,Y) translates to np.dot(np.sum(C, axis=1), np.sum(C, axis=1))
    • $\sum_k^N \sum_l^N C_{lk} \sum_l^N C_{lk}$ from cov(X,X) translates to np.dot(np.sum(C, axis=0), np.sum(C, axis=0))
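
Putting those translations side by side, here is a small sketch (with a made-up confusion matrix C, purely for illustration):

    import numpy as np

    # A hypothetical 3-class confusion matrix (rows = gold, columns = predicted).
    C = np.array([[4, 1, 0],
                  [2, 3, 1],
                  [0, 1, 5]], dtype=np.float64)

    S = np.sum(C)                 # number of examples
    row_sums = np.sum(C, axis=1)  # sum_l C_kl, one entry per gold class (y in the code above)
    col_sums = np.sum(C, axis=0)  # sum_l C_lk, one entry per predicted class (x in the code above)

    xy_term = np.dot(row_sums, col_sums)  # sum_k (sum_l C_kl)(sum_l C_lk), used in cov(X, Y)
    yy_term = np.dot(row_sums, row_sums)  # sum_k (sum_l C_kl)^2, used in cov(Y, Y)
    xx_term = np.dot(col_sums, col_sums)  # sum_k (sum_l C_lk)^2, used in cov(X, X)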

Transforming $\sum_s^S\sum_k^N X_{sk}Y_{sk}$ is a little trickier and depends on understanding what these matrices represent. X and Y are one-hot matrices in $\mathbb{R}^{S \times N}$ where $X_{sk}$ is 1 if example s was classified as class k. Because these are one-hot matrices, only a single element in a given row is 1. When example s is classified correctly the 1 is in the same position in X and Y, so the sum over k is one; all the other products are 0·0 = 0. On the other hand, when the predicted label and the gold label differ, each product is 0·0 = 0 for classes that are neither label, 1·0 = 0 for the predicted class, and 0·1 = 0 for the gold class, so the summation over k is 0. This applies to each example independently. In summary, the value is 1 for correct examples and 0 for incorrect ones, so the sum over s is the number of correct examples, which can be read from the confusion matrix: the counts on the diagonal are the number of examples for each class where the predicted label matches the gold label. This sum can be expressed as $\sum_k^N C_{kk}$ and as np.sum(np.diagonal(C)) or np.trace(C) in code.

When the two inputs match (both are X or both are Y), the elements at a given s, k always match. Because these are one-hots, the multiplication is 1 at the position of the label and 0 everywhere else, so the summation over k is always one, and with the sum over s the result is always the number of examples, S. This can be pulled from the confusion matrix with np.sum(C).
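
Here is a small sketch to check both facts numerically (the label vectors are arbitrary examples of my own):

    import numpy as np

    gold = np.array([0, 2, 1, 1, 0, 2])
    pred = np.array([0, 2, 1, 0, 0, 1])
    N, S = 3, len(gold)

    Y = np.eye(N)[gold]  # one-hot gold labels
    X = np.eye(N)[pred]  # one-hot predictions

    C = np.zeros((N, N))
    for g, p in zip(gold, pred):
        C[g, p] += 1

    print(np.sum(X * Y) == np.trace(C))  # True: both count the correct examples
    print(np.sum(X * X) == np.sum(C))    # True: both are just S
    print(np.sum(Y * Y) == np.sum(C))    # True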

We now have code snippets that we can use to calculate all the terms in our simplified Rk. The actual code uses some local variables to avoid repeating calculations, but it should now be possible to see the equivalence between the code at the beginning, the code we outlined here, and the math.
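
As a final sanity check, here is a minimal sketch (my own assembly of the snippets above, not code from any particular library) that builds the confusion matrix with scikit-learn, computes the simplified Rk from it, and compares the result against sklearn.metrics.matthews_corrcoef:

    import numpy as np
    from sklearn.metrics import confusion_matrix, matthews_corrcoef

    def rk_from_confusion_matrix(C):
        # The simplified Rk, computed entirely from the confusion matrix.
        S = np.sum(C)
        correct = np.trace(C)
        gold_counts = np.sum(C, axis=1, dtype=np.float64)  # sum_l C_kl
        pred_counts = np.sum(C, axis=0, dtype=np.float64)  # sum_l C_lk
        numerator = S * correct - np.dot(pred_counts, gold_counts)
        denominator = np.sqrt(
            (S * S - np.dot(pred_counts, pred_counts))
            * (S * S - np.dot(gold_counts, gold_counts))
        )
        return numerator / denominator if denominator != 0.0 else 0.0

    gold = [0, 2, 1, 1, 0, 2, 2, 1]
    pred = [0, 2, 1, 0, 0, 1, 2, 1]
    C = confusion_matrix(gold, pred)  # rows are gold labels, columns are predictions
    print(np.isclose(rk_from_confusion_matrix(C), matthews_corrcoef(gold, pred)))  # True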

An Alternate Look at simplifying cov

We simplified $\sum_s^S\sum_k^N X_{sk}\bar{Y}_k$ by converting it into operations on the confusion matrix, which let us cancel out terms in the foiled Rk calculation. There is another way to manipulate the terms to get them to cancel.

$$\begin{aligned}
\bar{X}_k &= \frac{1}{S}\sum_s^S X_{sk} \\
S\bar{X}_k &= \sum_s^S X_{sk} \\
\sum_s^S \sum_k^N X_{sk}\bar{Y}_k &= \sum_k^N S\bar{X}_k\bar{Y}_k \\
\sum_s^S \sum_k^N X_{sk}\bar{Y}_k &= S\sum_k^N \bar{X}_k\bar{Y}_k \\
\sum_s^S \sum_k^N Y_{sk}\bar{X}_k &= S\sum_k^N \bar{Y}_k\bar{X}_k \\
\mathrm{cov}_k(X, Y) &= w_k\left(\sum_s^S\sum_k^N X_{sk}Y_{sk} - S\sum_k^N \bar{X}_k\bar{Y}_k - S\sum_k^N \bar{Y}_k\bar{X}_k + S\sum_k^N \bar{X}_k\bar{Y}_k\right) \\
\mathrm{cov}_k(X, Y) &= w_k\left(\sum_s^S\sum_k^N X_{sk}Y_{sk} - S\sum_k^N \bar{Y}_k\bar{X}_k\right)
\end{aligned}$$
  • Recall our definition for X¯k
  • Multiply each side by S
  • Substitute this new value into the term.
  • S doesn't depend on the summation index, and because multiplication distributes over addition we can pull S out of the summation.
  • We can do the same substitution for $\sum_s^S\sum_k^N Y_{sk}\bar{X}_k$.
  • Substitute these terms, as well as our first transformation of the last term, into the foiled covk expression.
  • Cancel terms. (A small numeric check of this identity follows below.)
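
Here is a tiny numeric check of that substitution (arbitrary one-hot matrices of my own):

    import numpy as np

    gold = np.array([0, 2, 1, 1, 0, 2])
    pred = np.array([0, 2, 1, 0, 0, 1])
    N, S = 3, len(gold)
    X = np.eye(N)[pred]      # one-hot predictions
    Y = np.eye(N)[gold]      # one-hot gold labels
    X_bar = X.mean(axis=0)   # column means of X
    Y_bar = Y.mean(axis=0)   # column means of Y

    lhs = np.sum(X * Y_bar)           # sum_s sum_k X_sk * Ybar_k
    rhs = S * np.sum(X_bar * Y_bar)   # S * sum_k Xbar_k * Ybar_k
    print(np.isclose(lhs, rhs))       # True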

Reduction to Matthews Correlation Coefficient for N=2

The Wikipedia page for MCC gives multiple ways to calculate it by reading values from the confusion matrix, as well as the original equations Matthews used in his paper. That presentation skips, however, the original correlation-based formula Matthews presented, in which the true label distribution and the predicted label distribution are cast as random variables and the correlation between the two is measured. This formulation lets us see that MCC is equivalent to the Rk metric.

$$\begin{aligned}
\mathrm{MCC}(X, Y) &= \frac{\sum_s^S (Y_s - \bar{Y})(X_s - \bar{X})}{\sqrt{\sum_s^S (X_s - \bar{X})^2}\sqrt{\sum_s^S (Y_s - \bar{Y})^2}} \\
\mathrm{MCC}(X, Y) &= \frac{\sum_s^S (Y_s - \bar{Y})(X_s - \bar{X})}{\sqrt{\sum_s^S (X_s - \bar{X})(X_s - \bar{X})}\sqrt{\sum_s^S (Y_s - \bar{Y})(Y_s - \bar{Y})}} \\
\mathrm{MCC}(X, Y) &= \frac{\mathrm{cov}(Y, X)}{\sqrt{\mathrm{cov}(X, X)\,\mathrm{cov}(Y, Y)}} \\
\mathrm{MCC}(X, Y) &= \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{cov}(X, X)\,\mathrm{cov}(Y, Y)}}
\end{aligned}$$
  • This is the correlation-based definition of MCC from his original paper, with variables renamed to fit our scheme: Matthews' predicted labels ($P_n$) are now X and the observations (previously $S_n$) are now Y.
  • $\sum_s^S (X_s - \bar{X})^2$ expands to $\sum_s^S (X_s - \bar{X})(X_s - \bar{X})$. The Y term can be similarly expanded.
  • These are the covariance definitions from earlier
  • cov is symmetric, so $\mathrm{cov}(Y, X) = \mathrm{cov}(X, Y)$. We can see this because all interactions between the variables in $\sum_s^S (Y_s - \bar{Y})(X_s - \bar{X})$ are multiplications, and multiplication is commutative, so we can swap the order.

We can see above that this calculation for MCC is almost exactly the same as for Rk! The only difference is that MCC is calculated with the covariance between vectors of 0s and 1s of length S, while Rk uses the covk function to calculate covariance between two one-hot matrices in $\mathbb{R}^{S \times 2}$.

This next step is a little odd and I don't have a great way to demonstrate it mathematically, but we should be able to see that these results will be the same.

$$x = \begin{bmatrix}1 \\ 0 \\ 1 \\ 1 \\ 0\end{bmatrix} \qquad X = \begin{bmatrix}0 & 1 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 1 & 0\end{bmatrix}$$
  • Imagine $x$ is some vector of ones and zeros that represents the labels for examples in a dataset.
  • $X$ is the one-hot representation of $x$. We see that column 1 is exactly the same as the original vector and column 0 is the flipped version (all zeros become ones and all ones become zeros).
$$\begin{aligned}
\mathrm{MCC}(x, y) &= \alpha \\
R_K(X_1, Y_1) &= \alpha \\
R_K(X_0, Y_0) &= \alpha \\
w_k &= \frac{1}{2} \\
R_K &= w_k\alpha + w_k\alpha \\
R_K &= w_k(\alpha + \alpha) \\
R_K &= w_k \cdot 2\alpha \\
R_K &= \frac{2\alpha}{2} \\
R_K &= \alpha
\end{aligned}$$
  • For some binary label vectors x and y we say the Matthews Correlation Coefficient is α.
  • We saw above that the calculation for a single column in Rk is the same as for MCC, and we saw how column one of X is the same as x (and likewise for Y). This means the result will also be α.
  • For column zero of X we saw that it is the same as x but flipped. This means that all the means, sums, and the like calculated by Rk come out the same, so this result is also α.
  • Earlier we defined wk to be one over the number of classes, which in this case is 2.
  • We know that Rk is the linear combination of the Rk calculated for each column weighted by wk as expressed here.
  • We factor the wk out of each term.
  • Multiplication is repeated addition.
  • We substitute 12 in for wk.
  • We cancel to show that Rk(X, Y) is equal to MCC(x, y). (The sketch below checks this numerically for a binary example.)
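
And a quick numeric check of this reduction (my own sketch; the binary labels are arbitrary and sklearn.metrics.matthews_corrcoef serves as the reference binary MCC):

    import numpy as np
    from sklearn.metrics import confusion_matrix, matthews_corrcoef

    gold = [1, 0, 1, 1, 0, 0, 1, 0]
    pred = [1, 0, 0, 1, 0, 1, 1, 1]

    # Rk computed from the 2x2 confusion matrix with the simplified formula.
    C = confusion_matrix(gold, pred).astype(np.float64)
    S = C.sum()
    gold_counts = C.sum(axis=1)
    pred_counts = C.sum(axis=0)
    rk = (S * np.trace(C) - pred_counts @ gold_counts) / np.sqrt(
        (S * S - pred_counts @ pred_counts) * (S * S - gold_counts @ gold_counts)
    )

    print(np.isclose(rk, matthews_corrcoef(gold, pred)))  # True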

Hopefully this explanation helps you understand why the rather opaque code found in many open source libraries actually does calculate the Rk metric, and why some people call it Rk while others call it MCC.