Wednesday, September 03, 2008

Why Our Brains Do Not Intuitively Grasp Probabilities: Scientific American

OK, so my intentions for this blog were to primarily focus on software development, but so far I've only posted articles on probabilities - the Monty Hall Problem and the Birthday Paradox. Probability theory, however, has many direct applications in computer science. The Birthday Paradox for example, can be used to showcase the basic principles of hashing algorithms. The Monty Hall problem is a great way to understand the concept of an expected value - a value you'd expect to see after running a process over and over. Expected values are used to define the average case complexities of algorithms

Anyway, this Scientific American article offers some speculations about why most of us don't intuitively understand probabilities, and perhaps sheds some light on why problems like those mentioned above can be so bewildering. The article's argument boils down to the idea that we live in the "now", that our brains are hardwired to easily handle only a small number of event occurrences. When we come up against large sets of values, we have a tendency to focus in on its subsets or singular values, which obscures the connections that exist between the set's members.

A very interesting (and short) read...

Monday, August 18, 2008

Happy Birthday to You (and to you too, probably)...

The other day at a family get-together, someone was talking about how weird it was that my brother-in-law and I share the same birthday (and interestingly enough, my father and sister have the same birthday as well). I promised my niece that I'd share the explanation as to why this isn't so weird. This problem is also know as "The Birthday Paradox", though it's not much of a paradox at all.

Right off the bat, we realize this is a probability problem. We're essentially asking "what is the probability that any two people have the same birthday" (i.e. month and day)? Asking about "any two people" is probably a bit too broad since we share the planet with about 6 & 1/2 billion others. How about something simpler, like "if a family has x members, what's the probability that any two relatives have the same birthday?"

It's probably obvious to you that the larger the family, the more likely that two people therein have the same birthday. What's not so obvious is the surprisingly small family size required so that there's a greater likelihood than not (i.e. greater than %50) that two relatives share the same birthday. That number is 23.




Before blurting out the answer as to why this is the magic number, let's start off with three probability terms and definitions: sample spaces, events, and outcomes. A sample space is the set of all possible results for some action. For example, flipping a coin has a sample space containing two items, namely (H)eads and (T)ails. Tossing a six-sided die has a sample space containing each face of the die {1, 2, 3, 4, 5, 6}.





An event is a subset of values in a sample space. For example, using the sample space of die roll values above, the event of even-numbered values is the subset containing die rolls 2, 4 and 6.





Finally, an outcome is an individual value, where favorable outcomes are in the desired event (i.e. the event we wish to examine). Undesirable outcomes are those values that fall outside of the desired event. Using the same example above, each of the die rolls 2, 4, and 6 are considered favorable outcomes in the event of even die rolls, and values 1, 3, and 5 are unfavorable.

Notice that a sample space is also the sum of both favorable and unfavorable outcomes. One reason why we make this distinction between outcome types, rather than just saying "OK, we have a sample space with all these values in it", is because knowing which outcomes are favorable (in our desired event) and unfavorable (those outside our desired event) helps us to better define an event and its compliment.

Now for some notation (there's a point to all this, I promise)! A sample space is typically denoted with the letter S. An event is usually written with the letter E. And an event's compliment, is written as E' (pronounced "E prime").

Using set theory, you can define things like

E' = S - E

That is, the compliment of an event E is equal to the sample space S minus the event E. And this should make sense. You're saying that E' contains all the values of S except for those values found in E. Using our die roll example above, say our desired event E contains any even numbered die value. We know that S contains all six die values, so

S - E
is equivalent to

{1, 2, 3, 4, 5, 6} less {2, 4, 6}

That leaves us with values {1, 3, 5}, and this is E' or the compliment ( the set of odd numbered die rolls).




OK, one last blurb about sample spaces and events. The examples above are pretty simple cases; a single coin toss, a single throw of one die. But what about if we have two coins, or two dice, or maybe 15 dice? The sample space containing all possible outcomes for these events is much larger than before. In the case of two coins tossed, there are four possible outcomes:

{(Heads Heads), (Heads Tails), (Tails Heads), (Tails Tails)}

The list is even longer for two dice:

{(1 1), (1 2) , (1 3) , ... , (1 6), (2 1), (2 2) , ... , (6 5), (6 6)}

I'll spare the discussion on combinatorics, Fundamental Counting Principle, and independence, but know that there's this rule of thumb known as the Multiplication Principle that allows you to determine a sample space by multiplying the number of outcomes of one action times the number of outcomes of another. Using two coins, the first action is the first coin toss, of which there are two outcomes. The second action is the second coin toss, which also has two outcomes.
Multiplying the number of outcomes together gives us:

2 x 2 = 4 outcomes

which reconciles with our set above of two tossed coins.

The same logic applies to two die rolls. Both dice have six outcomes, therefore

6 x 6 = 36 outcomes

(Even though I'm not going into the details of independent events, it's important to note that the outcomes of each action has no effect on the outcome of the other, e.g. the first coin's outcome has no effect on the second coin's outcome.)

Finally, the probability of an event boils down to a fraction represented by the size of the event divided by the size of the sample space, or

p(E) = |E| / |S|

And the probability is always between the values 0 and 1, where 0 is interpreted as the event never occurring, and 1 interpreted as always occurring. So

0 <= p(E) <= 1

Using our sample space above of two die rolls, the probability of obtaining "doubles" is

p(E) = |E| / |S| --> 6/36 --> 1/6

Why? The event E where doubles is found contains these six members:

{(1 1), (2 2), (3 3), (4 4), (5 5), (6 6)}

and the sample space S, contains all 36 possibilities found in two die roles. How about the probability of not getting doubles? This is compliment of our event, or E'.

Above, I pointed out that E' can be interpreted as S - E, which we can extend to express the sizes of S, E, and E'. So

|E'| = |S| - |E|

We can extend this even further to express the probabilities of E, S, and E', giving us

p(E') = p(S) - P(E)

where the probability of S, or p(S) = 1, since we'll always have some outcome occurring in S.

Since we know that the probability of getting doubles is 1/6, then the probability of not getting doubles is

p(E') = p(S) - P(E) --> 1 - 1/6 = 5/6

OK. Now we can get back to the birthday problem. To build our sample space, know that a person has a possible 365 dates on which to have a birthday (let's forget about leap year, and assume that all days in a year are equally likely). Any two people have a possible 365 x 365 dates. Any three people have 365 x 365 x 365 dates, and so on. In short, if you have x members in a family, there are a possible 365x birthdays in the sample space.

Let's say that E is the event of two of our x family members having the same birthday, and E' the event of any two not having a birthday match. We can determine E by first solving for E'. The outcomes in E' are determined by counting the number of possible dates for each of the x family members. Basically, pick one person and say they have a possible 365dates, then the next person can then only have a possible 364 dates, the third a possible 363 dates, and so on, all the way down to the xth family member. This should make sense since each successive person cannot match another's birthday. Our equation for p(E') looks like

p(E') = [365 · 364 · 363 · ... · (365 - x + 1)] / 365x

Solving for p(E) is easy, we just use our formula above for an event's compliment!

p(E) = 1 - ( [365 · 364 · 363 · ... · (365 - x + 1)] / 365x )

Calculating this by hand, you'll find that p(E) exceeds 50% (0.5073 actually) when your family size is 23, and p(E) reaches near certainty around 60. The following table shows just how quickly this probability rises!

So in the end, it can be rather startling to see just how common it is to come across someone with the same b-day as you, at least in the case of same month and day. But knowing this also takes away some of that special allure or "wow, that's random!" feeling. My father told me a story about how years ago, he met a girl on a Long Island beach, and not only did they have the same birthday, but the same birth year, and the same hospital. Now that's random!

Enjoy =)