George Rebane
This little monograph is the most recent in the series to explain Bayesian reasoning to the non-technical (but otherwise intelligent) reader, and will serve as a reference for future posts. Some previous efforts in this direction can be found here and here. In what follows we repeat the exercise graphically, without resorting to a rigorous development using equations.
Understanding Bayesian reasoning is important because of its place in all of our lives. Today it is marbled into virtually every facet of what we do, what is done to us, how we handle uncertainty (e.g. risk), and how we attempt to reason. Bayesian methods underlie and enable countless technologies, ranging from treating the water that comes out of your faucet and manufacturing your car to investing your 401K and detecting dreaded diseases. In the last thirty years Bayesian reasoning has become ubiquitous in all areas save demagoguery, journalism, law, and politics. The major effort of protagonists in those arenas is to hide and/or deny Bayesian reasoning to their audiences.
The Bayesian updating process then continues by using the updated truth value as the prior probability for incorporating the next piece of new evidence. In this manner we always have the latest measure of reliability about our knowledge that has incorporated all the previously available evidence. So let’s see how this is really accomplished by understanding the simple figure below.
In the figure the big red rectangle represents all the plausible cases H (say, Sam robbed the bank) in which our hypothesis is TRUE, and all the cases ¬H (Sam didn't rob the bank) in which our hypothesis is FALSE. The minus sign with the little hangy down part is the logical symbol for NOT. So our universe of interest can be visualized as comprising the H cases (the blue part) and the 'NOT H' cases (the green part). Let the areas of the rectangles represent the relative number of cases, H and ¬H, comprising each type of outcome. So if we were to pick a random outcome from the big red rectangle, the chance of picking H is simply the proportion of the blue area to the total area, i.e. the blue plus green areas. It is clear that if the boundary labeled C moves to the right, then the probability of H increases while the probability of ¬H decreases. The fractional blue area is then our prior or current probability of H being TRUE.
Now let's introduce some new evidence E (Sam's fingerprints found at the teller window), the cases for which are represented by the crosshatched yellow rectangle. That E overlaps both H and ¬H indicates the relative number of cases in which E and H, and E and ¬H, correspond or occur together. We indicate the number of these potential or plausible coincident occurrences as (H,E) and (¬H,E) respectively, and note that these sum to E, the total area of the yellow rectangle. Sam's lawyer will argue that Sam could have visited the teller window as a customer before the robbery, thereby giving rise to a case from (¬H,E). The prosecutor will argue for (H,E), pointing out that Sam's fingerprints were the top ones on the counter, after which no customers visited, as the bank then became a sealed crime scene.
With the arrival of E, the crucial point to understand is that our universe now shrinks to the crosshatched area in the yellow rectangle. The cases we should, nay, must now focus on are only those that involve the crosshatched (H,E) and (¬H,E), since with the certitude of E it is only these cases that comprise our new universe of possibilities. And as with our prior (red rectangle) universe, in our new evidence-dominated universe the posterior probability of H is simply the fraction of E (the yellow rectangle) that (H,E) covers. And surprisingly that is all there is to understanding the fundamentals of Bayesian inference and reasoning. This new posterior probability of H then becomes the prior probability when the next piece of evidence is processed. Ta-daaa!
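This shrunken-universe calculation can be sketched in a couple of lines of code; the areas below are invented numbers purely for illustration:

```python
# Made-up areas (arbitrary units) for the two crosshatched regions.
H_and_E = 3.0     # cases where both H and E occur
notH_and_E = 1.0  # cases where both ¬H and E occur

E = H_and_E + notH_and_E   # the whole yellow rectangle
posterior = H_and_E / E    # fraction of E covered by (H,E)
print(posterior)           # → 0.75
```

Whatever the actual areas, the posterior is just the share of the evidence rectangle claimed by the cases where H is TRUE.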
So absent other evidence, it is up to the jury to decide the relative sizes of (H,E) and (¬H,E). We can visualize this in the figure by having them move the A and B boundaries to the right or left. If, in their deliberations about Sam's fingerprints during the robbery, they move the A boundary to the left, increasing the (H,E) cases, then the probability of Sam's guilt increases; if to the right, decreasing the (H,E) cases, then Sam's guilt is less likely. When they discuss the possibility of Sam having been a customer, the B boundary moves to the right, increasing the (¬H,E) cases, as the plausibility of those cases grows; conversely, B moves to the left, decreasing (¬H,E), as that argument begins to lose its plausibility in the jury's collective mind.
Through such deliberations (reflected by moving the boundaries) we see that new evidence can be used either to increase the posterior probability (of Sam's guilt) or to decrease it. This demonstrates the ability of Bayes to do what techies call non-monotonic reasoning. (Monotonic reasoning methods can use new evidence only to ratchet the posterior probability in one direction, either always up or always down.)
Now that we have the 'Bayesian picture' in our mind, we can introduce the notion of the likelihood ratio L to actually calculate the posterior probability of H from its prior probability. As we saw above, once our universe shrinks down to only that in which the evidence E holds, it is the relative size of the two crosshatched areas, not necessarily their specific values or actual areas, that determines the desired posterior or Bayes probability. L is the proper quantitative proxy for moving the E boundaries.
Using this understanding, we let L equal the ratio of the crosshatched-to-solid-color area fractions, that is, (H,E)/H to (¬H,E)/¬H. From the figure these are simply the chance that we obtain evidence E when H is TRUE vs the chance of getting the same evidence when H is FALSE. This calculated ratio L = [(H,E)/H]/[(¬H,E)/¬H] is a number that can be obtained either subjectively (the jury decides Sam's fingerprints are L = 10 times as likely to be there if he were the robber than if he were a customer), or objectively through analysis of prior occurrences of such evidence, or even through computer modeling of what may happen in cases not yet encountered. So how does coming up with L help us?
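As a minimal sketch, the same ratio can be computed directly from the areas in the figure; the numbers below are hypothetical, chosen only to make the arithmetic visible:

```python
# Hypothetical areas (arbitrary units), named after the figure's labels.
H = 4.0             # blue area: cases where the hypothesis is TRUE
notH = 6.0          # green area: cases where the hypothesis is FALSE
H_and_E = 2.0       # crosshatched overlap of E with H
notH_and_E = 0.375  # crosshatched overlap of E with ¬H

# L = [(H,E)/H] / [(¬H,E)/¬H]: how much more probable the evidence is
# when H is TRUE than when H is FALSE.
L = (H_and_E / H) / (notH_and_E / notH)
print(L)  # → 8.0
```

Here the evidence is eight times as probable under H as under ¬H, so it strongly supports the hypothesis.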
Well, since you've come this far and now have a firm grip on an intuitive (graphic) understanding of Bayes reasoning, I can fess up that I lied and ask you to forgive me for introducing just one little equation to show how useful L is for updating knowledge by incorporating new evidence with existing knowledge. Let P(H) be the prior probability, or using the areas in the figure P(H) = H/(H + ¬H), i.e. the fractional blue area. The posterior probability P(H|E), read 'probability of H given E is TRUE', is calculated from

P(H|E) = L·P(H) / [L·P(H) + 1 − P(H)]
From the above definition of the likelihood ratio L, we can confirm that if L > 1 (evidence supports hypothesis) then P(H|E) > P(H); and if L < 1 (evidence degrades hypothesis) then P(H|E) < P(H), thereby also demonstrating Bayes' non-monotonicity. When L = 1, that is, when the area fractions (H,E)/H and (¬H,E)/¬H are equal, then P(H|E) = P(H), which says that the evidence did not serve to resolve or change the truth value of our hypothesis. You can confirm all this easily with a calculator.
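Those who prefer a few lines of code to a calculator can confirm the three cases with a sketch like this (the prior of 0.5 and the sample values of L are arbitrary choices):

```python
def posterior(prior, L):
    """Bayes update in likelihood-ratio form: L*P(H) / (L*P(H) + 1 - P(H))."""
    return L * prior / (L * prior + (1 - prior))

p = 0.5
assert posterior(p, 10) > p    # L > 1: evidence supports the hypothesis
assert posterior(p, 0.2) < p   # L < 1: evidence degrades the hypothesis
assert posterior(p, 1) == p    # L = 1: evidence resolves nothing
```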
So if initially in Sam's case we were totally uncertain about Sam's guilt, our prior probability would have been P(H) = 0.5 = P(¬H), i.e. he is as equally guilty as not guilty. Then when the fingerprint evidence E was introduced with arguments that made us subsequently set L = 10, we recalculated Sam's probability of guilt from the above Bayes formula to be

P(H|E) = (10 × 0.5) / (10 × 0.5 + 0.5) = 0.91
and conclude that finding Sam’s fingerprints on the teller’s counter has gone a long way to establish Sam’s guilt. What would we conclude were we to assess the fingerprint evidence to have been three times as probable from Sam’s visit as a customer vs his having robbed the bank, i.e. if we arrived at L = 1/3 = 0.33?
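To check your answer, the same likelihood-ratio update can be run for both values of L (a sketch, reusing the formula above):

```python
def posterior(prior, L):
    # Bayes update in likelihood-ratio form: L*P(H) / (L*P(H) + 1 - P(H)).
    return L * prior / (L * prior + (1 - prior))

print(posterior(0.5, 10))    # ≈ 0.91: guilt strongly supported
print(posterior(0.5, 1/3))   # ≈ 0.25: guilt now less likely than not
```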