Happy Birthday!

For today’s post, I thought I’d return to statistics. Remember the Monty Hall problem I talked about last year? If not, do a search on “Monty” on the Blog page. That was an example of statistics defying common sense. Today’s post is another one of those.

This post is about the probability of any two people in a group of people in a room having the same birthday. But let’s simplify this scenario with an equivalent one. Suppose we have a random number generator that generates a number between 1 and 365, including 1 and 365, kind of like a 365-faced die. Let’s say we “roll” this die twice. What is the probability that the two numbers generated are the same, the successful event?

As is often the case in statistics, it is easier to look at the probability of the unsuccessful events. You can then subtract that from 1 to get the probability of the successful event since

Probability of Success + Probability of Failure = 1 or

Probability of Success = 1 – Probability of Failure

since one or the other must happen. Remember that a certainty in probability is “1”, absolutely no chance is “0” and other probabilities are between those two numbers. For example, the probability of flipping a heads is 0.5. Please see my posts on probability for a review if needed.

So if we roll this 365-faced die twice, the first roll sets the number, the chance of the second roll matching that number is 1/365, and the chance of not matching it is 364/365. The probability of a match is

\[\frac{1}{365} \approx 0.0027\]

Very small! I wouldn’t bet on that happening. This is equivalent to the probability that two random people have the same birthday. Please bear with me here, but an equivalent expression that takes into account that we are rolling the die twice (or have two people in the room) is:

\[1 - \left(\frac{364}{365}\right)^{\frac{2 \times 1}{2}} = 1 - \frac{364}{365} = \frac{1}{365} \approx 0.0027\]

That expression in the exponent, (2×1)/2, is how to calculate the number of pairs that have a chance of being a success. Since we are just rolling the die twice (or there are just two people in a room), we only have 1 pair. If we roll the die 3 times, there are (3×2)/2 or 3 pairs of numbers to compare. Note that this exponent is generated by multiplying the number of rolls by one less than that number, then dividing by 2. So for three rolls (3 people in a room), the chance of two numbers being the same is

\[1 - \left(\frac{364}{365}\right)^{3} \approx 0.0082\]

Well that appears to have increased the odds a bit. Let’s roll the die 10 times or have 10 people in a room. There are (10×9)/2 or 45 pairs that have a chance of being the same. So the probability in this case is

\[1 - \left(\frac{364}{365}\right)^{45} \approx 0.1161\]

That means that in a group of 10 people, you have slightly better than an 11% chance that any two people have the same birthday. That really increased the chances with just a few more people! You can keep doing this for any number of rolls (people) using the formula

\[P = 1 - \left(\frac{364}{365}\right)^{\frac{n(n-1)}{2}}\]

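where n is the number of rolls or number of people in a room. (Strictly speaking, this treats each pair as an independent comparison, so for larger groups it is a close approximation rather than the exact value.) If you let n = 23, you will find that the chance of any two people having the same birthday is

\[1 - \left(\frac{364}{365}\right)^{253} \approx 0.5005\]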

You have better than a 50% chance that in a room of 23 people, two of them will have the same birthday! Mathematically, this is so because you have 253 pairs to compare, or 253 opportunities of a success. What a surprise!
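As a rough check of the numbers in this post, here is a small Python sketch. It evaluates the pair-counting estimate described above next to the exact calculation, which multiplies the chances that each new person misses every earlier birthday. The function names are made up for illustration, and both versions assume 365 equally likely birthdays.

```python
def approx_shared_birthday(n, days=365):
    """Pair-counting estimate: 1 - (364/365) ** (number of pairs)."""
    pairs = n * (n - 1) // 2
    return 1 - ((days - 1) / days) ** pairs


def exact_shared_birthday(n, days=365):
    """Exact value: 1 - P(all n birthdays are different)."""
    p_all_different = 1.0
    for k in range(n):
        # Person k+1 must avoid the k birthdays already taken.
        p_all_different *= (days - k) / days
    return 1 - p_all_different


for n in (2, 3, 10, 23):
    print(n, round(approx_shared_birthday(n), 4), round(exact_shared_birthday(n), 4))
```

Both versions cross the 50% mark at 23 people. The exact figure is a little higher (about 50.7%) than the pair-counting one, so the surprise in this post holds either way.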

Statistics and Data, Confidence Interval

Before I get to the core of today’s post, I would like to show the details of calculating the standard deviation. From my last post, we have two sets of data: one from Nice David and the other from Evil David. Let’s look at Nice David’s data, which are the last five test scores of his students: 83, 83, 85, 87, and 90.

Now in my last post, I gave you the formula for calculating the standard deviation:

\[\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}}\]

This says (much more elegantly than English) that I need to subtract the mean of the data from each data point, square each difference, add these up, divide by the number of data points, then take the square root of the result.

The mean of this data is 85.6. Taking the first data point of 83 and subtracting 85.6 gives -2.6. Squaring this gives 6.76. I do this to each data point. After squaring each difference, I add these up and I get 35.2. Dividing this by the number of data points (5), I get 7.04. Then finally taking the square root, I get the standard deviation of 2.65. Doing the same thing to Evil David’s data gives a standard deviation of 12.09. By the way, if you know how to use Excel, these calculations are very easy to do when you have lots of data.
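The same steps can be sketched in a few lines of Python. This is just an illustration of the arithmetic described above; the function name is my own, and it divides by n (the number of data points), exactly as in this post.

```python
import math


def population_std(data):
    """Standard deviation computed step by step, dividing by n."""
    mean = sum(data) / len(data)
    squared_diffs = [(x - mean) ** 2 for x in data]  # e.g. (83 - 85.6)**2
    variance = sum(squared_diffs) / len(data)
    return math.sqrt(variance)


nice_david = [83, 83, 85, 87, 90]
evil_david = [71, 73, 86, 99, 99]

print(round(population_std(nice_david), 2))
print(round(population_std(evil_david), 2))
```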

Now this post is about one of the ways you can use the standard deviation to make a decision as to which tutor you should use.

Now there is a lot of development that I will be leaving out here and I will also be making several assumptions to simplify the presentation, but the final result is still valid.

I will be assuming that the data from each tutor is normally distributed, which means we can make certain statements about the standard deviations. This is not an extreme assumption as this is usually assumed in statistics.

For data that is normally distributed, an interval of the mean minus one standard deviation to the mean plus one standard deviation contains 68% of the data. So for Nice David’s data, the mean minus one standard deviation is 85.6 – 2.65 = 82.95. The mean plus one standard deviation is 85.6 + 2.65 = 88.25. Now your test score will be another data point in Nice David’s data. Though you do not know what your test score will be, based on Nice David’s historical data, you can be 68% confident that your test score will be between 82.95 and 88.25.

The interval of all numbers between 82.95 and 88.25 is called a confidence interval. In this case, it is a 68% confidence interval. Let’s calculate Evil David’s 68% confidence interval.

Evil David’s standard deviation is 12.09, so his interval is 85.6 – 12.09 = 73.51 to 85.6 + 12.09 = 97.69. So with Evil David, you can be 68% confident that your test score will be between 73.51 and 97.69.

Now most calculations in statistics center around the 95% confidence interval. For the normally distributed data that we are assuming here, that is an interval of about two standard deviations on either side of the mean. So for Nice David, the 95% interval is 85.6 – 2×2.65 = 80.3 to 85.6 + 2×2.65 = 90.9. So with Nice David as your tutor, you can be 95% confident that your test score will be between 80.3 and 90.9.

What about Evil David? His 95% interval is 85.6 – 2×12.09 = 61.42 to 85.6 + 2×12.09 = 109.78. Now it’s not possible to get over 100, so you can be 95% confident that with Evil David as your tutor, your test score will be between 61.42 and 100.
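Here is a minimal Python sketch of the interval arithmetic used in this post. The helper name and the cap at 100 are my own choices for illustration, and the sketch simply takes the mean and standard deviations quoted above as given.

```python
def interval(mean, std_dev, k, max_score=100):
    """Return (low, high) for the mean plus or minus k standard deviations.

    The upper end is capped at max_score because a test score
    cannot exceed 100.
    """
    low = mean - k * std_dev
    high = min(mean + k * std_dev, max_score)
    return low, high


for name, std_dev in (("Nice David", 2.65), ("Evil David", 12.09)):
    for k in (1, 2):  # 1 -> roughly 68%, 2 -> roughly 95%
        low, high = interval(85.6, std_dev, k)
        print(name, k, round(low, 2), round(high, 2))
```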

Who do you choose? If 65 is a passing score on your test, you would be risking a failing grade with Evil David. Not so much with Nice David. Though you do have a chance of getting a very high score with Evil David (if you like lotteries), your test score is 95% guaranteed to be a passing one with Nice David. Being the unbiased person that I am, I would go with the tutor with more consistent results!

Statistics and Data, Dispersion

Before I begin the topic of dispersion, I want to illustrate the power of the maths language. Unlike most words in English, maths words (notation) build upon other maths words, making the maths language very efficient at talking about maths. For example, in my last post, the mean was defined as

\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]

This is so much more elegant and succinct than “the mean of a set of data is the sum of all the data points divided by the number of data points”. The maths definition is much shorter because the symbols build upon prior things you have learned, specifically what Σ means and the concepts of addition and division. This power of maths notation allows us to conceptualise and design very complex things like the search algorithms used by Google and the sending of spacecraft to Mars. Now before I get more excited, let’s go on to today’s topic.

Suppose you have to choose between two maths tutors: myself (David the Maths Tutor) or my competitor, Evil David the Maths Tutor. They both publish the last 5 test scores from their students. My published scores are 83, 83, 85, 87, and 90. Evil David’s published scores are 71, 73, 86, 99, and 99. Which one do you choose?

If you’ve been paying attention, you would think that maybe you should find the mean of both sets of data. If you do, you will find that they both have the same mean of 85.6. You may be attracted to Evil David because of the high scores, but then notice that there are lower scores as well. You also notice that Nice David (me) has more consistent results. Wouldn’t it be nice if we could measure this quality of data, that is, have some measure of how spread out the data is? Well there is, otherwise I wouldn’t be writing this post.

Just like the mean, there are several measures for spread or dispersion of the data. The range is the difference between the largest and the smallest data point. This is not used too much as it is only affected by two of the data points; the spread of the data between these two points does not affect the measurement. What about the average difference between each data point and the data’s mean? The problem with that is that some of the differences would be negative and some positive, which when added together would give a smaller number than desired. But this is the right idea. If you squared these differences and took the average of these squared differences, you would have all positive numbers. This measurement is called the variance and the formal formula for this is

\[\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}\]

So you first calculate the mean of the data, then subtract that mean from each of the data points and square that difference, sum all these squared values together, then divide by the number of data points. Again, saying this in the language of math is so much more elegant. Now Nice David’s data variance is 7.04 and Evil David’s variance is 146.24.

Now the problem with the variance is that if the data has units like meters, the variance has units of meters squared since we are squaring the differences. It would be nice to have a dispersion measurement with the same units as the data. If you noticed, the symbol for variance is σ². σ is the lower case version of Σ so it is also called “sigma”. So you may hear the variance called “sigma-squared”. If we take the square root of the variance, the units will now be the same as the data and this is in fact done. The square root of the variance is called the standard deviation. So the formula for the standard deviation is the same as the variance except that you take the square root as the final step:

\[\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}}\]

The standard deviation of Nice David’s data is 2.65. Evil David’s standard deviation is 12.09. In my next post, I will show you how to use these numbers to make an informed decision. (Hint: It doesn’t look good for Evil David).
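If you would like to check these dispersion numbers yourself, here is a small Python sketch using the standard library’s `statistics` module. The `spread` helper is a made-up name for illustration. Note that `pvariance` and `pstdev` divide by n, as this post does; the similarly named `variance` and `stdev` divide by n − 1 and would give slightly larger values.

```python
import statistics

nice_david = [83, 83, 85, 87, 90]
evil_david = [71, 73, 86, 99, 99]


def spread(data):
    """Collect the three dispersion measures discussed in this post."""
    return {
        "range": max(data) - min(data),
        "variance": statistics.pvariance(data),  # divides by n
        "std_dev": statistics.pstdev(data),      # square root of the above
    }


print(spread(nice_david))
print(spread(evil_david))
```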

Statistics and Data, A Mean Post

I thought I’d start a series on data and how data is consolidated in statistics. My previous posts on statistics were about probability. Probability will enter into the discussion later.

So having a set of data to analyse is what the bulk of a statistics course is about. One of the first things taught is Measures of Central Tendency. These are ways of consolidating all the data into one number. You are familiar with one of these: average or mean. The term “mean” is the mathematician’s term for “average”. As you know, the average is the sum of all the given data divided by the number of data points you have. So for example, the average of 4, 5, and 6 is (4 + 5 + 6)/3 = 15/3 = 5. Now there are other measures of central tendency as well: median and mode. I won’t cover these as they are not as important as the mean in most statistical operations.

I would like to introduce some notation. The formulas used in statistics usually involve summing things. So to indicate a sum, the Greek letter sigma, Σ, is used. Sigma is the Greek version of the English “S” which is appropriate as it is the first letter in “Sum”. Also, the letter n is typically used to indicate the number of data points and xᵢ is used to represent the data. The letter i is called a subscript. The subscript i represents the generic data point. You can replace the i with a number to represent a particular data point. In our case, x₁ = 4, x₂ = 5, and x₃ = 6. So the formula would look like:

\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]

where the numerator notation means “add up all the xᵢ’s, changing the i from 1 to n”. Notice the notation for the mean, a bar over the x. This is pronounced “x bar”. This notation for the mean will be used frequently in my next posts.

It is usually understood from the context that we are summing over all the data, so you may just see Σx in the numerator without the i or n.

So 4, 5, and 6 have a mean of 5, but so do 1, 5 and 9. The second set of data is spread out more and it would be nice to have a measure of this as well to use with the mean. That will be the subject of my next post.
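For readers who like to see the notation as code, here is a minimal Python sketch of the mean (the function name is just for illustration). `sum(data)` plays the role of Σ and `len(data)` plays the role of n.

```python
def mean(data):
    """x-bar: add up all the data points and divide by how many there are."""
    return sum(data) / len(data)


print(mean([4, 5, 6]))
print(mean([1, 5, 9]))
```

Both data sets give the same mean, even though the second one is far more spread out.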

Probability, Part 7, Counting Many Things

So we are working on finding the probability of getting a hand with four aces in a 5-card poker game. To do this, we have to count the total number of possible poker hands. This turns out to be possible using the combination formula:

\[C(n, r) = \frac{n!}{r!(n-r)!}\]

And in this case,

\[C(52, 5) = \frac{52!}{5!(52-5)!} = \frac{52!}{5!\,47!}\]

But I left my last post not calculating this, and cautioning you to not calculate this before doing some simplification. This is because 52! and 47! are HUGE numbers and because of the limitations of many calculators, you will get inaccurate results.

Well how do we simplify \[\frac{52!}{5!(47)!}\]?

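Well notice that 52! = 52×51×50×49×48×47!. In other words, you can stop partway through expanding a factorial, because the remaining factors are just the factorial of the number where you stopped. So now you can cancel the 47! in the numerator and the denominator,

\[\frac{52!}{5!\,47!} = \frac{52 \times 51 \times 50 \times 49 \times 48 \times 47!}{5!\,47!} = \frac{52 \times 51 \times 50 \times 49 \times 48}{5 \times 4 \times 3 \times 2 \times 1} = 2{,}598{,}960\]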

That’s a lot of hands! Well that number is the denominator in our generic probability formula:

Probability = \[\frac{\text{Number of favorable outcomes}}{\text{Total number of possible outcomes}}\]

So what is the numerator? Well if you want all four aces in your hand, that leaves just one more card. There are 48 other cards that can be the fifth card in your hand. So there are 48 possible ways to have four aces in your hand of five cards. So the probability of having four aces is

\[\frac{48}{2{,}598{,}960} = \frac{1}{54{,}145} \approx 0.0000185\]

A very small probability! You may need to brush up on your bluffing skills.
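As a quick check of the counting in this post, here is a short Python sketch. It uses `math.comb` (available in Python 3.8 and later) for the number of hands and `Fraction` to keep the probability exact. The variable names are my own.

```python
import math
from fractions import Fraction

total_hands = math.comb(52, 5)   # number of 5-card hands, order ignored
four_ace_hands = 48              # four aces plus any one of the other 48 cards

p_four_aces = Fraction(four_ace_hands, total_hands)

print(total_hands)
print(p_four_aces, float(p_four_aces))
```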

Probability, Part 6, The Problem with Probability

In my last few posts, I’ve talked about probability and how to calculate a basic probability:

Probability = \[\frac{\text{Number of favorable outcomes}}{\text{Total number of possible outcomes}}\]

This formula is simple if you know the number of favourable outcomes and the number of possible outcomes. This works well for questions like: what is the probability of rolling a 7 with a pair of dice? To calculate the number of total outcomes, note that there are 6 possible ways a single die can land, and for each of these, the other die can have 6 possible values. So the total is 6 × 6 or 36. This illustrates the multiplication rule for counting things:

If there are m ways for one thing to occur and n ways for a second thing to occur, then there are m × n ways to do both.

Manually counting the ways to get a 7 where the first number is from die 1 and the second from die 2 gives:

1 + 6, 2 + 5, 3 + 4, 4 + 3, 5 + 2, and 6 + 1. Six ways

So the probability of rolling a 7 is 6/36 or 1/6.
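The manual count above can be sketched in Python by listing every (die 1, die 2) pair and keeping those that add up to 7. This is only an illustrative check; the variable names are mine.

```python
from fractions import Fraction
from itertools import product

# Multiplication rule: 6 faces for die 1 times 6 faces for die 2.
rolls = list(product(range(1, 7), repeat=2))

# Keep only the pairs whose faces add up to 7.
sevens = [roll for roll in rolls if sum(roll) == 7]

p_seven = Fraction(len(sevens), len(rolls))
print(len(rolls), sevens, p_seven)
```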

Now what if I asked what is the probability of getting four aces in a 5-card poker hand? How do you even begin to count the number of possible poker hands? There are two ways to count large possibilities like this: combinations and permutations.

A combination is a selection of objects where you are not concerned with order. For example, in the card example, a hand of 2, 3, 4, 5, and 6 of hearts would be the same as a 6, 5, 4, 3, and 2 of hearts, and you would only want to count these two possibilities as one, along with any other arrangement of these five cards. A permutation is where order does count, and these two card arrangements would be counted as two permutations.

In our card example, order doesn’t count, so we want the number of combinations of taking 52 cards, 5 at a time. Fortunately, there is a formula and notation used to simplify this. Before I present this, there is another math operation that needs to be explained: factorials.

You may have a calculator with a “!” symbol or “x!” on one of the keys. This is a factorial operation. A factorial is successively multiplying an integer by one less for each factor. For example, 5! = 5 × 4 × 3 × 2 ×1 = 120. Factorials get large very quickly. For example, 30! is 265252859812191058636308480000000. To make the formulas using factorials consistent, a special definition 0! = 1 is made.
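If you have Python handy, here is a small sketch that reproduces these factorials with the standard `math.factorial` function. Python integers keep every digit, so the last line is included only to show what is lost when a value is squeezed into a floating-point number, which is roughly what a calculator does.

```python
import math

five = math.factorial(5)
zero = math.factorial(0)      # defined as 1
thirty = math.factorial(30)   # exact integer, all 33 digits kept

print(five, zero)
print(thirty)
print(float(thirty))          # a float keeps only about 15 to 17 significant digits
```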

So the notation for the number of r combinations of n objects is \[{}_{n}C_{r}\] or more commonly C(n, r). So in our case, we want to calculate C(52, 5), which is the number of possible 5-card combinations out of a deck of 52 cards. The general formula for combinations is

\[C(n, r) = \frac{n!}{r!(n-r)!}\]

In our poker hand example, the number of possible poker hands is

\[C(52, 5) = \frac{52!}{5!(52-5)!} = \frac{52!}{5!\,47!}\]

Now before you go off and calculate this, remember how large factorials can get? Many calculators cannot keep the number of digits necessary to accurately store very large numbers and the accuracy of the calculation will be poor. So when dealing with combination and permutation formulas, it is always best to simplify before calculating the answer. See if you can see where we can simplify the expression on the right side. I will continue this example in my next post.

Probability, Part 5, The Monty Hall Problem

In the 1960’s (so I’m told), a new game show on TV appeared called Let’s Make A Deal. The show was hosted by Monty Hall. In this show, a contestant was shown three doors on the stage and was asked to choose one. One door had a great prize like a car. The other two had less desirable prizes like a goat. Well right away, if you’ve been diligent reading my posts, you know that the contestant has a one-third chance of winning.

However, Monty would then open one of the doors not picked and expose a less desirable prize. Monty would then ask the contestant if she would like to switch. The question is, should she? Does it make a difference or is the probability still one-third either way? Well let’s see the two probability trees for each strategy: stay or switch.

First the “stay” strategy. Let’s assume that the car is behind door 1 so we can label the branches appropriately. The second level branches have probability 0 or 1, because once you choose a door, the strategy of staying or switching determines exactly what you will end up with:

So if you look at all the scenarios (branches) that end up with a car and add those probabilities together, you get 1/3 as expected. Please see my previous post regarding probability trees if needed.

Now let’s create a tree where we switch doors after Monty shows what’s behind one of the “loser” doors:

Now add up the probabilities that end up with a car and you get 2/3! You double your chances of winning if you switch. You see, Monty adds more information to the problem by exposing one of the loser doors and you take advantage of this by switching doors. Because you have 2/3 chance of initially picking a goat, if you do pick a goat and Monty exposes the other door with a goat, you will have no choice but to end up with the car if you switch.
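The two trees can also be walked in code. The Python sketch below adds up the branch probabilities that end with the car for each strategy. It keeps the post’s assumption that the car is behind door 1, and it adds one assumption of my own: when Monty has two goat doors to choose from, he opens either with equal chance. The function name is hypothetical.

```python
from fractions import Fraction

DOORS = (1, 2, 3)
CAR = 1  # assumption carried over from the post: the car is behind door 1


def win_probability(switch):
    """Add up the branch probabilities that end with the car."""
    total = Fraction(0)
    for first_pick in DOORS:
        # Monty may only open a door that is neither your pick nor the car.
        openable = [d for d in DOORS if d != first_pick and d != CAR]
        for opened in openable:
            # Assumed: Monty chooses evenly among the doors he may open.
            branch = Fraction(1, len(DOORS)) * Fraction(1, len(openable))
            if switch:
                # Switching means taking the one door still closed.
                final = next(d for d in DOORS if d not in (first_pick, opened))
            else:
                final = first_pick
            if final == CAR:
                total += branch
    return total


print(win_probability(switch=False), win_probability(switch=True))
```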

Probability, Part 4

I would like to introduce probability trees. These help compute more complex probabilities and combine the addition and multiplication rules we covered earlier. Let’s look at a probability tree for two tosses of a coin:

To create the tree, you start with a branch for the first set of possible events for the first trial, in this case heads or tails, then add other branches for all the possibilities for the second trial and so on. You also include the probabilities on each branch segment.

Travelling along a branch depicts a joint probability. For example, what is the probability of tossing two heads? Travelling along the branches for two heads, you hit two probabilities, each 0.5. As we saw before, this is a joint probability so we multiply these together to get 0.25. So along a branch, you multiply the probabilities.

What if I asked what is the probability of getting two heads or two tails?  The only two branch paths that satisfy this requirement are the top one and the bottom one, each with a calculated branch probability of 0.25. These are then added, since this is an OR probability so the addition rule applies:

Adding these probabilities gives the result of 0.50.
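Here is a minimal Python sketch of the coin-toss tree, with each path stored under a label such as "HH". The names are made up for illustration. Probabilities are multiplied along a path and then added across the paths that satisfy the question.

```python
from itertools import product

P = {"H": 0.5, "T": 0.5}  # probability on each branch segment


def path_probability(path):
    """Multiply the probabilities met while travelling along one path."""
    prob = 1.0
    for toss in path:
        prob *= P[toss]
    return prob


# All four paths through the two-toss tree: HH, HT, TH, TT.
paths = {"".join(p): path_probability(p) for p in product("HT", repeat=2)}

# "Two heads OR two tails": add the two matching path probabilities.
p_same = paths["HH"] + paths["TT"]
print(paths, p_same)
```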

Now let’s look at the marble experiment: picking in succession two marbles in a bag with 10 red and 10 blue marbles. Now we can build the probability tree but to know what the probabilities are, we need to know if the first marble is replaced or not. From the last post, you saw that this affects the probabilities of the second pick. I’ll leave it as an exercise for you to build the tree for the “with replacement” case. It will be similar to the coin toss tree.

Without replacement, the tree will look like this:

I’ve kept the probabilities as fractions to make it clearer where they came from. See my last post if needed. Notice that the last column of probabilities adds up to 1, as it should since all possible branches have been included.

So what is the probability of picking two blue marbles? This can be read directly from the tree as 0.237. Now let me ask, what has the greater probability: picking two marbles of the same colour or two of different colours? If you add the probabilities of picking two blue or two red, you get 0.474, not quite 50% as you might expect. To get the probability of getting mixed marbles, you can either add the two tree probabilities of 0.263 or subtract the 0.474 probability we just calculated from 1 as this is the only other possibility as the two events are mutually exclusive. Both ways will give the result 0.526. So which possibility would you choose if you were a betting person?
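The without-replacement tree can be rebuilt in Python with exact fractions. This is a sketch under the post’s setup of 10 red and 10 blue marbles; the function name and the "RB"-style path labels are my own.

```python
from fractions import Fraction


def two_pick_tree(red=10, blue=10):
    """Return the probability of each two-pick path, without replacement."""
    total = red + blue
    counts = {"R": red, "B": blue}
    tree = {}
    for first in "RB":
        p_first = Fraction(counts[first], total)
        for second in "RB":
            # One marble of the first colour is already out of the bag.
            remaining = counts[second] - (1 if second == first else 0)
            p_second = Fraction(remaining, total - 1)
            tree[first + second] = p_first * p_second
    return tree


tree = two_pick_tree()
p_same = tree["RR"] + tree["BB"]
p_mixed = tree["RB"] + tree["BR"]
print(tree)
print(float(p_same), float(p_mixed))
```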

In my next post, I will use probability trees to show the surprising result of the Monty Hall experiment.

Probability, Part 3

So what is the probability of tossing 3 heads in a row flipping a coin? Well another probability rule is called the joint probability rule. For independent events (that is one event does not affect the probability of the other), the rule is

P(A and B) = P(A) × P(B)

The result of flipping a coin does not affect the next flip of the coin, so these would be independent events and we can use this rule. The probability of flipping a heads is 0.5, so the probability of flipping three heads in a row is

P(flipping 3 heads) = 0.5 × 0.5 × 0.5 = 0.125 or 12.5%
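As a small illustration, the Python sketch below computes this joint probability two ways: by multiplying, as the rule says, and by listing all eight equally likely outcomes of three flips and counting the one that is all heads. The function name is hypothetical.

```python
from itertools import product


def prob_all_heads(flips, p_heads=0.5):
    """Joint probability of independent flips all landing heads."""
    return p_heads ** flips


# Cross-check by brute force: every possible sequence of three flips.
outcomes = list(product("HT", repeat=3))
all_heads = [o for o in outcomes if o == ("H", "H", "H")]

print(prob_all_heads(3), len(all_heads) / len(outcomes))
```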

Now let’s look at another experiment. Suppose you have a bag of 20 marbles, 10 blue ones and 10 red ones. The probability of picking a red marble is 10/20 or 0.5. If you replace the marble, shake the bag and redo the experiment, the probability of picking a red marble is still the same, that is the two experiments would be independent. So the probability of picking two red marbles in a row this way is 0.5 × 0.5 = 0.25. But what if you did not replace the marble? Before the second pick, the bag now has 19 marbles, 9 red and 10 blue so the probabilities of the second pick are affected by the first pick. This means that the two events are dependent.

If two events are dependent, say A depends on B, the way to show this is P(A|B). This is read as “the probability that A occurs given that B has occurred”.

For dependent events, the joint probability rule is modified slightly:

P(A and B) = P(A|B) × P(B)

So you still just multiply the probabilities, but you must adjust the probability of A if B occurs.

Now back to the marble experiment without replacing the marble. What is the probability of picking two red marbles in a row?

Well for the first pick, we already know that the probability is 0.5. But for the second pick, the probability is 9/19 because there are only 9 red marbles now and a total of 19 marbles. So the probability of picking two red marbles without replacement is

P(2 red marbles) = P(second red marble|first marble is red) × P(first red marble) = 9/19 × 0.5 = 0.237. So the probability of picking two red marbles is slightly lower without replacement than it is with replacement.
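If you would rather see this by experiment, here is a rough Python simulation of the marble bag. It draws two marbles without replacement many times and reports how often both are red. The function name, the number of trials, and the fixed seed are my own choices for illustration; being a simulation, it will only land near the calculated value, not exactly on it.

```python
import random


def simulate_two_reds(trials=100_000, seed=1):
    """Estimate P(two reds) when picking two marbles without replacement."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    bag = ["red"] * 10 + ["blue"] * 10
    hits = 0
    for _ in range(trials):
        first, second = rng.sample(bag, 2)  # two distinct marbles from the bag
        if first == "red" and second == "red":
            hits += 1
    return hits / trials


estimate = simulate_two_reds()
calculated = 9 / 19 * 0.5
print(round(estimate, 3), round(calculated, 3))
```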

This sets us up to do much more complex probabilities. In my next post, I’ll discuss probability trees.

Probability, Part 2

So we are discussing probability and so far, I’ve just used some simple examples where I used the rule:

Probability = \[\frac{\text{Number of favorable outcomes}}{\text{Total number of possible outcomes}}\]

Now I would like to be able to show more complex examples, but first a definition:

Event – a collection of one or more outcomes of an experiment.

An experiment is flipping a coin, rolling a die, picking a card, etc. The outcome is what happens, that is, the result of the experiment. To save writing, we use the following notation:

P(A) is the probability of event A. Event A can be flipping heads, rolling a 2, picking the ace of hearts. Other letters can be used to represent other events.

So for example, what is the probability of rolling a die and getting a 1 or a 6? Well one way is to note that the number of favorable outcomes is 2 and the number of possibilities is 6, so

P(rolling a 1 or a 6) = 2/6 = 1/3

Another way is by using the probability addition rule:

P(A or B) = P(A) + P(B)

Event A can be rolling a 1 and event B can be rolling a 6. We know that the probability of rolling any single number is 1/6 so,

P(A or B) = 1/6 + 1/6 = 2/6 = 1/3

The addition rule only works for mutually exclusive events. Rolling a 1 means that rolling a 6 is impossible and vice versa.

What if I asked what is the probability of not rolling a 1 or a 6? Well there is something called the complement rule that is useful:

P(~A) = 1 – P(A), where ~A means not A

If A is rolling a 1 or a 6, then ~A is rolling any other number. But since we now know what the probability of throwing a 1 or a 6 is, we can use this to find the probability of not throwing a 1 or a 6:

P(not throwing a 1 or a 6) = 1 – P(rolling a 1 or a 6) = 1 – 1/3 = 2/3
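The addition and complement rules from this post can be checked with a short Python sketch. Events are written as sets of die faces, and `p` is a made-up helper that counts favorable faces over the six possible ones.

```python
from fractions import Fraction

FACES = {1, 2, 3, 4, 5, 6}


def p(event):
    """Probability of an event given as a set of die faces."""
    return Fraction(len(event & FACES), len(FACES))


A, B = {1}, {6}  # rolling a 1, rolling a 6 (mutually exclusive)

p_a_or_b = p(A) + p(B)   # addition rule
p_neither = 1 - p(A | B)  # complement rule; A | B is the set union

print(p_a_or_b, p_neither)
```

Counting the faces directly gives the same answers, which is a handy sanity check on both rules.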

In my next post, I’ll explore questions like what is the probability of tossing 3 heads in a row.