Theorem.
There is a unique (up to a scalar multiple) function $H(p_1, \ldots, p_n)$ over finite probability distributions that satisfies the following:
- Continuity: For any fixed $n$, $H(p_1, \ldots, p_n)$ is continuous in the $p_i$.
- Increasing in the size of the uniform distribution: $H\!\left(\tfrac{1}{n}, \ldots, \tfrac{1}{n}\right)$ is increasing as a function of $n$.
- Decomposability: $H$ is such that $H(p_1, p_2, p_3, \ldots, p_n) = H(p_1 + p_2, p_3, \ldots, p_n) + (p_1 + p_2)\, H\!\left(\tfrac{p_1}{p_1 + p_2}, \tfrac{p_2}{p_1 + p_2}\right)$.

The unique function satisfying the above conditions is equivalent to our definition of entropy.
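As an illustrative check of the decomposability condition (assuming the usual $H(p_1, \ldots, p_n) = -\sum_i p_i \log_2 p_i$, measured in bits), take the distribution $\left(\tfrac12, \tfrac13, \tfrac16\right)$ and split it into a first choice with probabilities $\left(\tfrac12, \tfrac12\right)$ followed by a second choice with probabilities $\left(\tfrac23, \tfrac13\right)$:

$$H\!\left(\tfrac12, \tfrac13, \tfrac16\right) = H\!\left(\tfrac12, \tfrac12\right) + \tfrac12\, H\!\left(\tfrac23, \tfrac13\right) \approx 1 + \tfrac12 \cdot 0.918 \approx 1.459 \text{ bits},$$

and evaluating the left-hand side directly gives the same $\approx 1.459$ bits.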
The last condition seems arbitrary at first glance but is actually quite well-motivated. Suppose we sent the random variables $X$ and $Y$ as part of one message to someone else. Intuitively, there should be no loss of information if instead we were to send $X$ first and then $Y$. We want our measure of information to obey this intuition.
To be more specific, once we accept that information should be a function of probability, we want the following to be true:

$$I\big(p(x, y)\big) = I\big(p(x)\big) + I\big(p(y \mid x)\big).$$
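As a quick illustration, if we take $I(p) = -\log_2 p$ (measuring in bits), then for two independent fair coin flips each flip carries $-\log_2 \tfrac12 = 1$ bit and the pair carries $-\log_2 \tfrac14 = 2$ bits, the same total whether the flips are sent together or one at a time.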
The decomposability condition in the above theorem is essentially a generalization of this intuition, but for entropy instead of information. We want entropy to be such that if we decompose a random variable into a sequence of random variables, the entropies add: $H(X, Y) = H(X) + H(Y \mid X)$.
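Below is a minimal numerical sketch of this additivity, assuming the usual Shannon entropy $H(p) = -\sum_i p_i \log_2 p_i$ and a small made-up joint distribution `p_xy` (NumPy is used only for convenience):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector; zero entries are skipped."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# A made-up joint distribution p(x, y): rows index x, columns index y.
p_xy = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])

# Entropy of sending (X, Y) as a single message.
H_joint = entropy(p_xy.ravel())

# Entropy of sending X first...
p_x = p_xy.sum(axis=1)
H_x = entropy(p_x)

# ...and then Y, where each conditional entropy H(Y | X = x) is weighted by p(x).
H_y_given_x = sum(p_x[i] * entropy(p_xy[i] / p_x[i]) for i in range(len(p_x)))

print(H_joint)             # prints ~2.423
print(H_x + H_y_given_x)   # prints ~2.423 as well: H(X, Y) = H(X) + H(Y | X)
```

Any other joint distribution should give the same agreement between the two printed values; that agreement is exactly the additivity the decomposability condition encodes.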