\[ \newcommand{\softmax}{\mathop{\rm softmax}\nolimits} \newcommand{\qed}{\tag*{$\Box$}} \]

Entropy, KL Divergence and the Uniform Distribution

Commonly in Machine Learning, the notion of Entropy is used to describe how data behaves, to train models, and the like. When applied to a Uniform distribution, these equations have special forms that make them easier to compute. Some of these are derived below.

Definition of Entropy

\[ \begin{align} \cssId{show-line-1}H[P] &= \mathbb{E}[\log\frac{1}{p(x)}] \\ &=\cssId{show-line-2}{\mathbb{E}[\log 1 - \log p(x)]} \\ &=\cssId{show-line-3}{\mathbb{E}[0 - \log p(x)]} \\ &=\cssId{show-line-4}{\mathbb{E}[- \log p(x)]} \\ &=\cssId{show-line-5}{\sum_{x\in\mathcal{X}}p(x) \cdot (- \log p(x))} \\ \cssId{show-line-6} H[P] &= -\sum_{x\in\mathcal{X}}p(x) \log p(x) \qed \\ \end{align} \]
  • This is the original definition of entropy, which is related to compression. The best compression for a symbol with a probability \(p(x)\) is a code of length \(\log \frac{1}{p(x)}\) (Shannon and Weaver 1949). Therefore the amount of information can be viewed as the expectation of the length when perfectly compressed.
  • Apply the Rule of Logs, \(\log \frac{A}{B} = \log A - \log B\)
  • \(\log 1 = 0\)
  • \(0\) is the additive identity.
  • The expectation for a discrete random variable is the sum, over all possible values, of the probability of each value times the function applied to that value.
  • Multiplication distributes over addition, so we can pull the \(-1\) out of the summation. We have now arrived at the equation that people in ML generally use for entropy.
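The final form of the definition is easy to check numerically. Below is a minimal sketch in Python using only the standard library; the three-symbol distribution is an arbitrary example, not one from the text.

```python
import math

def entropy(probs):
    """H[P] = -sum over x of p(x) * log p(x), in nats (natural log).

    Terms with p(x) = 0 are skipped, following the convention
    that 0 * log 0 = 0.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)

# An arbitrary example distribution over three symbols.
p = [0.5, 0.25, 0.25]
print(entropy(p))  # 1.5 * ln 2, about 1.0397 nats
```

Using the natural log gives entropy in nats; swapping in `math.log2` would give bits, the unit usually associated with the compression interpretation.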

The Entropy of a Uniform Distribution

\[ \begin{align} \cssId{show-line-7}H[P] &= -\sum_{x\in\mathcal{X}} p(x) \log p(x) \\ \cssId{show-line-8}u(x) &= \frac{1}{|\mathcal{X}|} \\ \cssId{show-line-9}U &= \forall_{x \in \mathcal{X}}\text{ } p(x) = \frac{1}{|\mathcal{X}|} \end{align} \]
\[ \begin{align} \cssId{show-line-10}H[U] &= -\sum_{x\in\mathcal{X}} u(x) \log u(x) \\ \cssId{show-line-11}H[U] &= -\sum_{x\in\mathcal{X}} \frac{1}{|\mathcal{X}|} \log \frac{1}{|\mathcal{X}|} \\ &= \cssId{show-line-12}{ -\sum_{x\in\mathcal{X}} \frac{1}{|\mathcal{X}|} (\log 1 - \log |\mathcal{X}|) }\\ &= \cssId{show-line-13}{ -\sum_{x\in\mathcal{X}} \left( \frac{1}{|\mathcal{X}|}\log 1 - \frac{1}{|\mathcal{X}|}\log |\mathcal{X}| \right) }\\ &= \cssId{show-line-14}{ -\sum_{x\in\mathcal{X}} \left( \frac{1}{|\mathcal{X}|} \cdot 0 - \frac{1}{|\mathcal{X}|}\log |\mathcal{X}| \right) }\\ &= \cssId{show-line-15}{ -\sum_{x\in\mathcal{X}} -\frac{1}{|\mathcal{X}|}\log |\mathcal{X}| }\\ &= \cssId{show-line-16}{ \sum_{x\in\mathcal{X}} \frac{1}{|\mathcal{X}|}\log |\mathcal{X}| }\\ &= \cssId{show-line-17}{ \frac{1}{|\mathcal{X}|}\sum_{x\in\mathcal{X}} \log |\mathcal{X}| }\\ &= \cssId{show-line-18}{ \frac{1}{|\mathcal{X}|} \cdot |\mathcal{X}| \cdot \log |\mathcal{X}| }\\ \cssId{show-line-19}H[U] &= \log |\mathcal{X}| \qed \\ \end{align} \]
  • This is the definition of entropy we found above.
  • In a uniform distribution, the probability for a given symbol is \(1\) divided by the number of possible symbols.
  • For a uniform distribution, all symbols have the same probability.
  • The entropy of a uniform distribution.
  • We plug in the values for \(u(x)\).
  • Apply the rule of logs, \(\log \frac{A}{B} = \log A - \log B\).
  • Distribute \(\frac{1}{|\mathcal{X}|}\).
  • \(\log 1 = 0\)
  • \(0\) multiplied by anything is \(0\).
  • Multiplication distributes over addition so we can pull the \(-1\) out of the summation.
  • Multiplication distributes over addition so we can pull the \(\frac{1}{|\mathcal{X}|}\) out of the summation.
  • As the value in the summation, \(\log |\mathcal{X}|\), does not depend on \(x\), the summation simply adds it \(|\mathcal{X}|\) times. Multiplication is repeated addition, thus the summation is the same as multiplying by \(|\mathcal{X}|\).
  • Multiplication by the reciprocal always yields \(1\), and \(1\) is the multiplicative identity. Therefore we are left with the entropy of the uniform distribution as the log of the vocabulary size, \(|\mathcal{X}|\).
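We can sanity-check the result \(H[U] = \log |\mathcal{X}|\) with a few lines of Python; the vocabulary size below is an arbitrary choice.

```python
import math

def entropy(probs):
    """H[P] = -sum over x of p(x) * log p(x), in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

n = 7                     # arbitrary vocabulary size |X|
uniform = [1 / n] * n     # u(x) = 1 / |X| for every symbol
print(entropy(uniform))   # matches log |X| = log 7, about 1.9459
print(math.log(n))
```

Because the uniform distribution maximizes entropy over a fixed support, \(\log |\mathcal{X}|\) is also the upper bound on the entropy of any distribution over \(\mathcal{X}\).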

KL Divergence from a Uniform Distribution

\[ \begin{align} \cssId{show-line-20}KL(P || Q) &= \sum_{x\in\mathcal{X}} p(x) \log\frac{p(x)}{q(x)} \\ \cssId{show-line-21}H[P] &= -\sum_{x\in\mathcal{X}} p(x) \log p(x) \\ \cssId{show-line-22}H[U] &= \log |\mathcal{X}| \\ \cssId{show-line-23}u(x) &= \frac{1}{|\mathcal{X}|} \\ \cssId{show-line-24}\sum_{x\in\mathcal{X}} p(x) &= 1 \end{align} \]
\[ \begin{align} \cssId{show-line-25}KL(P || U) &= \sum_{x\in\mathcal{X}} p(x) \log\frac{p(x)}{u(x)} \\ &= \cssId{show-line-26}{\sum_{x\in\mathcal{X}} p(x) (\log p(x) - \log u(x))} \\ &= \cssId{show-line-27}{\sum_{x\in\mathcal{X}} \left( p(x)\log p(x) - p(x)\log u(x) \right)} \\ &= \cssId{show-line-28}{\sum_{x\in\mathcal{X}} p(x)\log p(x) - \sum_{x\in\mathcal{X}} p(x)\log u(x)} \\ &= \cssId{show-line-29}{-H[P] - \sum_{x\in\mathcal{X}} p(x)\log u(x)} \\ &= \cssId{show-line-30}{-H[P] - \sum_{x\in\mathcal{X}} p(x)\log \frac{1}{|\mathcal{X}|}} \\ &= \cssId{show-line-31}{-H[P] - \sum_{x\in\mathcal{X}} p(x)(\log 1 - \log |\mathcal{X}|)} \\ &= \cssId{show-line-32}{-H[P] - \sum_{x\in\mathcal{X}} \left( p(x)\log 1 - p(x)\log |\mathcal{X}| \right)} \\ &= \cssId{show-line-33}{-H[P] - \sum_{x\in\mathcal{X}} \left( p(x) \cdot 0 - p(x)\log |\mathcal{X}| \right)} \\ &= \cssId{show-line-34}{-H[P] - \sum_{x\in\mathcal{X}} - p(x)\log |\mathcal{X}|} \\ &= \cssId{show-line-35}{-H[P] + \sum_{x\in\mathcal{X}} p(x)\log |\mathcal{X}|} \\ &= \cssId{show-line-36}{-H[P] + \log |\mathcal{X}| \sum_{x\in\mathcal{X}} p(x)} \\ &= \cssId{show-line-37}{-H[P] + \log |\mathcal{X}| \cdot 1} \\ &= \cssId{show-line-38}{-H[P] + \log |\mathcal{X}|} \\ &= \cssId{show-line-39}{\log |\mathcal{X}| -H[P]} \\ \cssId{show-line-40}KL(P || U) &= H[U] - H[P] \qed \\ \end{align} \]
  • The definition of KL Divergence.
  • The definition of entropy we established above.
  • The entropy of the uniform distribution we established above.
  • The probability of some symbol in a uniform distribution.
  • For a probability distribution to be valid, it must sum to \(1\).
  • Our KL divergence between some distribution \(P\) and the uniform distribution.
  • Apply the rule of logs, \(\log \frac{A}{B} = \log A - \log B\).
  • Distribute multiplication of \(p(x)\).
  • Linearity of the Sum.
  • Based on our definition above, the first term is the negative entropy of \(P\).
  • Plug in the probability given by the uniform distribution.
  • Apply the rule of logs, \(\log \frac{A}{B} = \log A - \log B\).
  • Distribute \(p(x)\).
  • \(\log 1 = 0\).
  • \(0\) multiplied by anything is zero and zero is the additive identity.
  • Multiplication distributes over addition, thus we can pull the \(-1\) out of the summation.
  • \(\log |\mathcal{X}|\) does not depend on the value of \(x\), thus we pull it out of the summation, again because multiplication distributes over addition.
  • Based on the definition of a valid probability distribution, the final term is equal to \(1\).
  • \(1\) is the multiplicative identity.
  • Addition is commutative, so we can rearrange the terms.
  • Above we found that \(\log |\mathcal{X}|\) is the entropy of the uniform distribution. Therefore, we have found that the KL Divergence between some distribution \(P\) and the uniform distribution is given by the entropy of the uniform distribution minus the entropy of \(P\).
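The identity \(KL(P \| U) = H[U] - H[P]\) can also be checked numerically. This is a minimal sketch in standard-library Python; the distribution `p` is an arbitrary example.

```python
import math

def entropy(probs):
    """H[P] = -sum over x of p(x) * log p(x), in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def kl(p, q):
    """KL(P || Q) = sum over x of p(x) * log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]          # arbitrary example distribution P
u = [1 / len(p)] * len(p)      # uniform distribution over the same support

# Both sides of KL(P || U) = H[U] - H[P] = log|X| - H[P].
print(kl(p, u))
print(math.log(len(p)) - entropy(p))
```

One practical consequence: maximizing \(H[P]\) and minimizing \(KL(P \| U)\) are the same objective up to the constant \(\log |\mathcal{X}|\), which is why entropy bonuses and uniform-KL penalties are interchangeable as regularizers.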

Now we see that quantities like Entropy and KL Divergence have special forms when applied to a Uniform distribution.