If entropy is seen as the number of binary (yes/no) questions required to reach the answer, KL divergence can be described as the extra number of questions you end up asking if you assume the wrong distribution.

For example, let A, B, C, D actually occur with probabilities p = 1/4, 1/4, 1/4, 1/4.

The number of questions actually required is just 2. (1st question: is it in (A, B) or (C, D)? 2nd question: if (A, B), is it A?)
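As a quick check, this matches the standard entropy formula (in bits):

$$
H(p) = -\sum_i p_i \log_2 p_i = -4 \cdot \tfrac{1}{4}\log_2\tfrac{1}{4} = 2.
$$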

Say you wrongly assume the distribution to be q = 1/2, 1/4, 1/8, 1/8. How many extra questions you would end up asking on average is what KL divergence is about.
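For reference, this "extra questions" quantity is exactly the KL divergence, written in bits as

$$
D_{KL}(p \,\|\, q) = \sum_i p_i \log_2 \frac{p_i}{q_i},
$$

which is the cross-entropy (the average number of questions when the question scheme is designed for q but the answers actually follow p) minus the true entropy H(p). The hand calculation below builds up exactly these two averages.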

So, if the distribution of A, B, C, D were 1/2, 1/4, 1/8, 1/8, this is how you would calculate the entropy, i.e. the average number of binary questions.

  1. Is it A?

Half the time you will receive a yes and stop after 1 question. Contribution to the average: 1/2 x 1 = 0.5 questions.

  2. Is it B?

You will receive a yes half the time, but you only ask this when the first question was answered no, which itself happens with probability 1/2. Resolving B takes 2 questions, so the contribution is 1/2 x 1/2 x 2 = 0.5 questions.

  3. Is it C?

You have to ask this question only if question 1 was answered no (probability 1/2) and question 2 was also answered no (probability 1/2); its answer then settles C vs. D after 3 questions. Contribution: 1/2 x 1/2 x 3 = 0.75 questions.

Total average number of questions = 0.5 + 0.5 + 0.75 = 1.75 questions.
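This 1.75 is just the entropy of q. A minimal sketch to verify the hand count (variable names are my own; it assumes the question scheme above, which is optimal for q):

```python
import math

# Assumed distribution q over A, B, C, D
q = [1/2, 1/4, 1/8, 1/8]

# Entropy of q in bits: the average number of binary questions
# when the scheme is tailored to q and q is also the true distribution.
H_q = -sum(p * math.log2(p) for p in q)
print(H_q)  # 1.75
```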

But note that this is NOT the actual distribution, which is p = 1/4, 1/4, 1/4, 1/4. You are wrongly assuming the distribution to be q = 1/2, 1/4, 1/8, 1/8.

So, let us calculate how many binary questions you would end up asking on average if you keep this (wrong) question scheme while the answers actually follow p.

  1. Is it A?

You will now receive a yes only one-fourth of the time (as opposed to 1/2 in the previous case). Contribution: 1/4 x 1 = 0.25 questions.

  2. Is it B?