Suppose we are given some probability ensemble $(\Omega, P)$, where $\Omega$ is some sample space and $P$ is some probability measure. For simplicity, we only concern ourselves with discrete probability distributions.
Definition.
Let $E \subseteq \Omega$ be an event. We define the Shannon information of $E$ as follows:

$$I(E) = \log_2 \frac{1}{P(E)} = -\log_2 P(E).$$
The intuition here is that the more “surprising,” or “improbable,” the event, the more “information” we gain from observing it. A helpful example: suppose I ask someone whether their birthday is today. If they say “no,” I have not gained much information. However, if they say “yes,” then I have gained a tremendous amount of information. Of course, the probability that they say “yes” as opposed to “no” is much smaller, but it is precisely for this reason that the answer provides more information.
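To make the birthday example concrete, here is a minimal sketch that evaluates the definition above, assuming (for illustration) a uniform 1-in-365 chance that any given day is the person's birthday:

```python
import math

def shannon_information(p: float) -> float:
    """Shannon information (in bits) of an event with probability p."""
    return -math.log2(p)

# Hypothetical birthday scenario: assume a uniform 1/365 chance
# that today is the person's birthday.
p_yes = 1 / 365
p_no = 364 / 365

print(f"I('yes') = {shannon_information(p_yes):.3f} bits")  # ~8.512 bits
print(f"I('no')  = {shannon_information(p_no):.4f} bits")   # ~0.0040 bits
```

Under this assumption, the improbable “yes” carries roughly 8.5 bits of information, while the expected “no” carries only a few thousandths of a bit.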