Wednesday, September 03, 2008

Why Our Brains Do Not Intuitively Grasp Probabilities: Scientific American

OK, so my intentions for this blog were to primarily focus on software development, but so far I've only posted articles on probabilities - the Monty Hall Problem and the Birthday Paradox. Probability theory, however, has many direct applications in computer science. The Birthday Paradox for example, can be used to showcase the basic principles of hashing algorithms. The Monty Hall problem is a great way to understand the concept of an expected value - a value you'd expect to see after running a process over and over. Expected values are used to define the average case complexities of algorithms

Anyway, this Scientific American article offers some speculations about why most of us don't intuitively understand probabilities, and perhaps sheds some light on why problems like those mentioned above can be so bewildering. The article's argument boils down to the idea that we live in the "now", that our brains are hardwired to easily handle only a small number of event occurrences. When we come up against large sets of values, we have a tendency to focus in on its subsets or singular values, which obscures the connections that exist between the set's members.

A very interesting (and short) read...

Monday, August 18, 2008

Happy Birthday to You (and to you too, probably)...

The other day at a family get-together, someone was talking about how weird it was that my brother-in-law and I share the same birthday (and interestingly enough, my father and sister have the same birthday as well). I promised my niece that I'd share the explanation as to why this isn't so weird. This problem is also know as "The Birthday Paradox", though it's not much of a paradox at all.

Right off the bat, we realize this is a probability problem. We're essentially asking "what is the probability that any two people have the same birthday" (i.e. month and day)? Asking about "any two people" is probably a bit too broad since we share the planet with about 6 & 1/2 billion others. How about something simpler, like "if a family has x members, what's the probability that any two relatives have the same birthday?"

It's probably obvious to you that the larger the family, the more likely that two people therein have the same birthday. What's not so obvious is the surprisingly small family size required so that there's a greater likelihood than not (i.e. greater than %50) that two relatives share the same birthday. That number is 23.




Before blurting out the answer as to why this is the magic number, let's start off with three probability terms and definitions: sample spaces, events, and outcomes. A sample space is the set of all possible results for some action. For example, flipping a coin has a sample space containing two items, namely (H)eads and (T)ails. Tossing a six-sided die has a sample space containing each face of the die {1, 2, 3, 4, 5, 6}.





An event is a subset of values in a sample space. For example, using the sample space of die roll values above, the event of even-numbered values is the subset containing die rolls 2, 4 and 6.





Finally, an outcome is an individual value, where favorable outcomes are in the desired event (i.e. the event we wish to examine). Undesirable outcomes are those values that fall outside of the desired event. Using the same example above, each of the die rolls 2, 4, and 6 are considered favorable outcomes in the event of even die rolls, and values 1, 3, and 5 are unfavorable.

Notice that a sample space is also the sum of both favorable and unfavorable outcomes. One reason why we make this distinction between outcome types, rather than just saying "OK, we have a sample space with all these values in it", is because knowing which outcomes are favorable (in our desired event) and unfavorable (those outside our desired event) helps us to better define an event and its compliment.

Now for some notation (there's a point to all this, I promise)! A sample space is typically denoted with the letter S. An event is usually written with the letter E. And an event's compliment, is written as E' (pronounced "E prime").

Using set theory, you can define things like

E' = S - E

That is, the compliment of an event E is equal to the sample space S minus the event E. And this should make sense. You're saying that E' contains all the values of S except for those values found in E. Using our die roll example above, say our desired event E contains any even numbered die value. We know that S contains all six die values, so

S - E
is equivalent to

{1, 2, 3, 4, 5, 6} less {2, 4, 6}

That leaves us with values {1, 3, 5}, and this is E' or the compliment ( the set of odd numbered die rolls).




OK, one last blurb about sample spaces and events. The examples above are pretty simple cases; a single coin toss, a single throw of one die. But what about if we have two coins, or two dice, or maybe 15 dice? The sample space containing all possible outcomes for these events is much larger than before. In the case of two coins tossed, there are four possible outcomes:

{(Heads Heads), (Heads Tails), (Tails Heads), (Tails Tails)}

The list is even longer for two dice:

{(1 1), (1 2) , (1 3) , ... , (1 6), (2 1), (2 2) , ... , (6 5), (6 6)}

I'll spare the discussion on combinatorics, Fundamental Counting Principle, and independence, but know that there's this rule of thumb known as the Multiplication Principle that allows you to determine a sample space by multiplying the number of outcomes of one action times the number of outcomes of another. Using two coins, the first action is the first coin toss, of which there are two outcomes. The second action is the second coin toss, which also has two outcomes.
Multiplying the number of outcomes together gives us:

2 x 2 = 4 outcomes

which reconciles with our set above of two tossed coins.

The same logic applies to two die rolls. Both dice have six outcomes, therefore

6 x 6 = 36 outcomes

(Even though I'm not going into the details of independent events, it's important to note that the outcomes of each action has no effect on the outcome of the other, e.g. the first coin's outcome has no effect on the second coin's outcome.)

Finally, the probability of an event boils down to a fraction represented by the size of the event divided by the size of the sample space, or

p(E) = |E| / |S|

And the probability is always between the values 0 and 1, where 0 is interpreted as the event never occurring, and 1 interpreted as always occurring. So

0 <= p(E) <= 1

Using our sample space above of two die rolls, the probability of obtaining "doubles" is

p(E) = |E| / |S| --> 6/36 --> 1/6

Why? The event E where doubles is found contains these six members:

{(1 1), (2 2), (3 3), (4 4), (5 5), (6 6)}

and the sample space S, contains all 36 possibilities found in two die roles. How about the probability of not getting doubles? This is compliment of our event, or E'.

Above, I pointed out that E' can be interpreted as S - E, which we can extend to express the sizes of S, E, and E'. So

|E'| = |S| - |E|

We can extend this even further to express the probabilities of E, S, and E', giving us

p(E') = p(S) - P(E)

where the probability of S, or p(S) = 1, since we'll always have some outcome occurring in S.

Since we know that the probability of getting doubles is 1/6, then the probability of not getting doubles is

p(E') = p(S) - P(E) --> 1 - 1/6 = 5/6

OK. Now we can get back to the birthday problem. To build our sample space, know that a person has a possible 365 dates on which to have a birthday (let's forget about leap year, and assume that all days in a year are equally likely). Any two people have a possible 365 x 365 dates. Any three people have 365 x 365 x 365 dates, and so on. In short, if you have x members in a family, there are a possible 365x birthdays in the sample space.

Let's say that E is the event of two of our x family members having the same birthday, and E' the event of any two not having a birthday match. We can determine E by first solving for E'. The outcomes in E' are determined by counting the number of possible dates for each of the x family members. Basically, pick one person and say they have a possible 365dates, then the next person can then only have a possible 364 dates, the third a possible 363 dates, and so on, all the way down to the xth family member. This should make sense since each successive person cannot match another's birthday. Our equation for p(E') looks like

p(E') = [365 · 364 · 363 · ... · (365 - x + 1)] / 365x

Solving for p(E) is easy, we just use our formula above for an event's compliment!

p(E) = 1 - ( [365 · 364 · 363 · ... · (365 - x + 1)] / 365x )

Calculating this by hand, you'll find that p(E) exceeds 50% (0.5073 actually) when your family size is 23, and p(E) reaches near certainty around 60. The following table shows just how quickly this probability rises!

So in the end, it can be rather startling to see just how common it is to come across someone with the same b-day as you, at least in the case of same month and day. But knowing this also takes away some of that special allure or "wow, that's random!" feeling. My father told me a story about how years ago, he met a girl on a Long Island beach, and not only did they have the same birthday, but the same birth year, and the same hospital. Now that's random!

Enjoy =)


Sunday, July 06, 2008

Monty Hall Problem - Explained

I recently took a class in "Combinatorics and Probability Theory" where we were assigned the famous Monty Hall problem as part of a homework exercise. The prof figured that since we were near the end of the semester and having already assigned us a few paradoxical probability problems that we'd have a decent chance at coming up with the correct answer without too much trouble.

However, when the answer was given, many of the students were still a bit confused as to why it worked or made sense. I too nearly scratched a hole in my head trying to figure this out, and it wasn't until I framed the problem a little differently that the solution began to appear. After sharing my discovery (which, by the way, I'm not claiming to be the only one who's thought of this...) with my classmates, a few lightbulbs started appearing over a few of their heads. Hopefully, this twist of the problem can help you see why the not-so-obvious solution is the correct one.

Let's start with the formal problem:

Monty Hall, the gameshow host on "Let's Make a Deal", shows you three closed doors. Before any of doors are opened, he tells you that behind two of these doors you'll find a goat. But behind one is a prize worth taking home. The point of the game is to pick the door hiding the prize - you pick it, you win it.


Monty Hall Problem 3 doors 1 prize 2 goats
After you choose a door, he opens one of the two doors that you didn't choose and reveals a goat. Keep in mind that Monty will always reveal a goat. You are then given the option of switching to the other closed door that you didn't choose or staying with your original choice.

Should you switch, and why or why not?

Right off the bat, most of us (including me at first) think switching makes no difference because we believe there's a 50/50 chance of either our 1st door choice or the other closed door having a prize behind it - but that's incorrect!

Let's change the game a little bit, say, there's a 100 doors, behind which are 99 goats and one prize. You choose one, and Monty opens up 98 doors revealing 98 goats (remember that Monty only shows goats and gives you one closed door to switch to). Again we're left with a similar choice: stay or switch to the only closed door left. Knowing that almost all of the goats have been revealed, and the odds are that we probably picked a goat from the start (a 99-in-100 chance to be exact), does it still seem like there's a 50/50 chance that we'll find the prize by switching?

Say the game has 1,000,000 doors and after we pick one, Monty reveals 999,998 goats. Does it seem that the winner is more or less likely to be behind the door he didn't open? In this case, there's a 999,999-in-1,000,000 chance that we probably picked a goat to begin with.

How about 1,000,000,000,000,000 doors? See where this is going? Now imagine these doors being group into sets: those that hide goats, and those that don't (which is always just one door).


Monty Hall Problem divided into winning and losing sets
Since we're more likely to pick a door at random from the set of losing doors (because there's so many more of them), along with Monty revealing the remaining goats in the set, it should seem that switching moves us to the set containing the winning door.

This same logic applies to our original 3-door case. To start, we've a 2/3 chance of randomly picking a losing door and only a 1/3 chance of picking the winner. Since we're more likely to pick a loser, Monty showing us the remaining losers pretty much tells us where the winning door is, 2/3 of the time! Therefore, you've a 2/3 chance of winning the game by switching.

Conversely, you've a 1/3 chance of losing by switching which corresponds to the 1/3 chance you have of picking the winner from the start. If you do, the set of doors remaining is the losing set, and Monty will open all the doors but one (it doesn't matter which ones).

And there you have it. Hopefully this has shed some light and perhaps settled a few arguments out there about why switching gives you a probabilistically better chance of winning. If you want to see this in action, check out this interactive feature from the New York Times online where you can play the game repetitively. Here you can see your score of wins and losses over time, and how you really do in fact win by switching about 2/3 (or 66.67%) of the time.

Have fun!