Sections within this chapter are
Almost intuitively, you know how to answer the question "If I throw a 6-sided die just once, what is the probability that it will come up as a 6?". Also intuitively, you knew that you had to know that there are only six ways that that die could come up. You counted six discrete values, the value of 6 being just one of those six ways. The probability? One of six or 1/6 or 16.66%. Simple enough. Your knowing that, though, started with counting the options.
Or suppose that you had the task of getting point A to point B and you have the following options to achieve this:
How many ways do you have to go from A to B? Again, simple enough, you just count them: 1 + 1 + 3 = 5 ways. Again, simple counting via addition.
Let's throw into that a little more complexity. After you get to point B, you have another decision to make to take you to Point C, your ultimate destination. From B to C, you have the choice of transferring to one train or two busses: 1 + 2 = 3. OK, how many choices do you have to go from A to C? You multiply A-B's options (5) to B-C's options (3) to get 15. Fifteen options. You could draw and count that as in the following:
The number of steps, the number of tasks (T) [2, from A to B then from B to C], of course, could be many more than two, with each task T_{m} being associated with some number of ways (W). A set of tuples if you like: T_{1}:W_{1}, T_{1}:W_{2}, ⋯, T_{2}:W_{3}. We counted the ways for each task, but then we multiply the counts to find the total number of options as in (and broadening to many more tasks)
And, for all of that, all that we know so far is the only number of options.
The Discrete Mathematics specialists call the addition portion of this "The Rule of Sum" and the multiplication portion of this as "The Rule of Product". Good start, but the counting of the ways in each task is not necessarily always that straightforward.
Consider ⋯ Suppose you have six distinct numbers (1-6) in a bag and you intend to take them out of the bag one at a time and place them in order, left to right. How many ways are there to do that? Six possible options as the first pick, right? It could have been any of the six numbers. With one out, for our second choice five remain, then four, and so on until the last. And then, to find the total number of possibilities, we multiply; the number of options for the first, then the number of options for the second, and so on. This is choosing without replacement. Mathematically, this essentially just the factorial operation, 6! (=6*5*4*3*2) being 720 possible ordered sets. More generally, for the ordering of N different things, N!. Again, just counting of the possible set that could be produced from those six unique numbers. But in this case counting without replacement AND ordering of the choices matters.
Suppose, though, that after each choice, we put the chosen number back into the bag; each next step of choosing would always find six unique numbers in the bag. That is exactly the same as our 6-sided die as mentioned above. Each choice of a number, each new throw of the die, would have six possible values every time. Two throws: 6 * 6 ways. Three throws: 6 * 6 * 6. No replacement means straight up multiplication of the options. Perhaps the number order matters in some games, wherein we throw a 1 followed by a 2 followed by a 3 and have that matter; that being the case we need to count the different ways that this can occur. But, of course, in many games order does not matter. Typically with dice, it doesn't; a {1,2,3} for three dice is the same as {3,2,1}, and {1,3,2}, {2,1,3}, {2,3,1}, and {3,1,2}. When order matters, there are six different ways to represent these three values. When not, when order doesn't matter, all six of these are just one.
And that brings us to our next subjects Combinations and Permutations and the differences between them. And we are still only just counting options.
Let's say that from whatever set of elements we have – stated formally as the size of the universe in a set – we are going to choose a number of elements from there to create smaller subsets. For example, if from those six unique numbers in a bag (six being the size of the universal set), we only care to take out three (the smaller subset size). From that, let's start by caring about the order in which we chose that subset. Later, we intend to ignore their order, that all we care about is what is in the set, not their order in that same set.
When creating subsets, we also have the options of removing chosen elements from the universal set or of putting those elements back into the universal set. In our six-numbers-in-a-bag, with no replacement, we are guaranteed that the three numbers chosen for the subset are unique. With replacement, we could see in the subset the same number multiple times.
So, for what we have looked at so far, we have:
Altering any one of them is going to result in a different count of the size of the subsets being produced. And we need to figure out what those counts are going to be.
To see why these are important and how to count the sizes, let&apol;s start by assuming no replacement (items chosen are removed). Once an element is chosen, the universal set has one fewer elements for the next choice (and so on). As we noted above, we are then talking about the counting operation called factorial.
In addition to that, rather than removing all of the items from the universal set, let's choose elements into a smaller-sized subset. And into that subset, we are going to care about their order. For example, from our bag of six unique numbers, let's choose only three of them and maintain them in the order chosen. For those three choices, the first element's choice could be any one of the six possibilities, the next in order will be from five possible choices, and then four, and then we stop, having gotten our three elements for the new subset. That is 6*5*4 (=120) possible ways.
Clear enough, right? For each choice, all we are doing is keeping track of the remaining number of elements in the universal set and assuming we are removing the chosen item. [If we allowed the chosen element to be used again – with replacement – that count would have been 6*6*6 (=216) possible ways.] With each new choice, the universal set is getting smaller, at least for what we are considering here and now.
That (i.e.,6*5*4) sort of looks like a factorial, except that a complete factorial operation would take our initial 6 all the way down to 1, 6*5*4*3*2*1 = 6!
We stopped, though, after our first three choices. That's exactly like
More formally this is
where N is the size of the universal set and R the size of the subset.
Here, again, the order of the members of the subset matters.
This is called a Permutation, where N is the number of initial possible choices – the universal set size – and R is the number to be chosen, the subset size. A Permutation assumes order is maintained AND no replacment.
Remember, we are just counting, counting the number of possible unique subsets. You are welcome to memorize it this way, as _{N}P_{R}, but I prefer thinking in terms of stopping the factorial operation of N after having processed R terms. It really is just 6*5*4 ways as opposed to a full factorial. That is, after all, what you are doing with that bag of numbers; 6 items, then 5, then 4. And, again, this assumes that the size of the universal set gets smaller with each choice. No replacement and so factorial – decreasing universal set – gets used in multiplying the count of options.
Permutations, again, also assume order in the counting.
Suppose, though, that order did not matter. In our 3-item subset of unique numbers we might have gotten a {1,2,3}, but you know that that is the same as five other cases, all told:
When order matters, this is six different elements of the resulting subset. When order does not matter, all of these would appear the same, and so all six of these are really just one element of the subset. This is true for all other sets of three unique numbers as well. If we are counting unique subsets, our little example of 6 unique numbers and choosing 3 would need to decrease the count we found via Permutation by this factor of 6.
That then brings us to the notion of Combinations. A combination is selection of some given elements in which order does not matter. In our numbers-in-a-bag example, we want to count the number of ways that we can pull those three numbers from a bag of six while ignoring the order in which we pulled them. Rather than caring which is first, then second, then third, instead the order does not matter and so the subsets of one set of three particular numbers are the same.
For example, suppose you want to randomly pull the names of twelve jurors from a bag of names of fifteen possible jurors; twelve primaries and three backups. Every juror in the twelve is not more important than any other. How many ways are there to do that? If order mattered, that would be the permutation formula or 15!/(15-12)! (=15*14*13) different ways. With order not mattering, we need to remove the count of the ways that twelve different names can be ordered from that count.
So, how do we determine that overcount factor?
From our 3-number case above, we saw that three unique numbers taken three at a time – and keeping their order – was 6 which is also 3!. When order does not matter, we want this to be 1.
Let's take a look at another, this time 4 numbers taken four at a time with order maintained. With order being relevant, we get possible subsets of
The count? 4! = 4 * 3 * 2 * 1 = 24. N things taken N at a time is N!. But notice that what we are really talking about here is the size of our target subset R so, more generally R! where R is the number of elements in the subset. 24 is the count where order does not matter, just 1 when it does not. We need to factor out that R!.
So, our membership count becomes, for the case whereour Combination formula which is
which is also
Again, we are dividing out the count of which R different items can be ordered.
The trick is not so much to determine whether we are dealing with a permutation or a combination. Your author is commonly confused on the difference. It is instead determining what to consider in counting. Without replacement, we are typically dealing with factorial. Order not mattering; we need to factor out the overcounting seen in ordering. The target subset is smaller than the unversal set; we need to adjust for the difference in size. So let's revisit these and a bit more via a slightly different thought process.
Permuation and Combination both account for the lack of replacement. Suppose, that we have a case where we are replacing, say as it the case with a set of dice, allowing the same items to be found multiple times in a target set.
For example, rather starting over, how many ways can we have four digits between zero and nine inclusive. Obviously enough it is
Each of the four digits could take on the digits 0 through 9 of which there are four possible locations for a digit. Straight up multiplication; no factorial. If this were a ten-digit number the result would be 10^{10} disinct numbers.
Next, let's add to that 10-digit number the requirement that each of the digits must be a unique digit. We can not include, for example, 1122334455 in the result set because it has digit duplication. In effect, to guarantee uniqueness, from a ten-digit universal set, we are removing one-by-one all ten digits. We are dealing with a straightforward factorial, here 10!, not 10^{10}, for the number of possible values in the set. It's straight up factorial on the unversal set size since the target set is the same size as the universal set; we took out every one of the 10 unique digits. Said differently, you could think of this as a Permutation problem if you like, but the divisor is 0! which is defined to be 1.
Next, still using our 10 unique digit univeral set, we return our interest to a target set of only four digits as above, but here let's again add the requirement with no duplicates, in effect, no replacement. The universal set has 10 different digits, but we are only interested in a target set of any four unique digits. As with the ten-digit case, the first digit could have been any of the ten, the second digit any of the remaining nine, the third any of the remaining eight, and then any of the remaining seven, with the count being 5,040 (10*9*8*7) as opposed to our 10,000 above. As we have seen earlier, this is no different than 10!/(10-4)!, which just happens to be our Permutation case.
Let's take that previous case – the Permutation case, which assumes ordering - and say that order is not relevant, meaning that the unique digits in the target set's element {9,8,7,6} are to be treated the same as the unique digits in the element {6,7,8,9}. We can use our 5,040 count from the Permutation, but we need to factor out what we now consider duplicates. We need to factor out our count of four things taken four at a time, 4!, making that 5040/4! = 210, which just happens to also be our Combination case.
All that we have been talking about here is counting of the elements in various subsets. You know already that not everything is a Permutation or Combination problem. Perhaps it is just straight up simple multiplication, division, and addition. Some – if not all – just take some mental gymnastics, or perhaps even some programming.
With that in mind, let's look at something completely different, but at the same time only a subtle variation.
This next example will take some programming. In this one, let's go back to allowing duplicated digits – allowing the reuse of an already used digit from the source set of digits – and this time say that order is irrelevant. It's a problem not too different than our Combination case above – which assumes no replacement, assumes uniqueness – but here we include the fact that {9, 9, 8, 7} is the same element as {8, 7, 9, 9}. We are allowing reuse. How many different elements are there in the resulting set?
To solve that, ' observe again that when order did matter, the total number of elements with four digits would be 10,000 (=10*10*10*10). But in this problem, as it relates to counting elements in the target set, order does not matter. The set [1,1,2,3] is the same as [3,1,2,1], but still there are four digits. To find this count, we have some mental gymnastics in front of us. In fact, so much so that I required a program to help me out.
So, first, the results of that program:
So, double checking our results,
We have our counts. The key point of all of this? It's not all just permutations and combinations. Somethings just plain take mental effort and/or a program to help you do the counting. Take your time, think it through.
With Combinations, we are talking about doing a set of mathematical operations on a computer to achieve the following:
Each of these three factorials, if done exactly this way, is a loop of multiplies and adds with ever decreasing values, all the way down to the value 1. (Or you might learn that it can be done via recursion, adding still more processing.) There are three of those here. If the starting values are small, do we care? Not really. Now picture N as very large – say millions – and R as being small. If done just like this but only very occasionally, do we care? Again, not really. Now, instead, picture thousands of these being done per second, perhaps concurrently on each of your processors. One multiply alone can be done in a matter of nanoseconds, millions of them in milliseconds, but thousands of those are consuming the compute capacity of at least an entire processor. You don't normally want to do that.
The point? You want this to be done much faster. So, how? Remember what the following really is
It might look like two complete factorial operations, and it can be implemented that way. But if R is 4 for example, it is nothing more than the following:
Far faster, and therefore consuming less compute capacity. Bottom line? Yours is often not the only work being doing on a computing system and everything you do is consuming time on a processor, perhaps via multiple threads on multiple processors, capacity not available for other work. If you are using a computer resource for a time, everyone else must wait behind you to use that same co putting resource. When programming, please be kind to those others, consuming only the capacity that your work needs. The sooner you get done, the sooner that a processor is available for their use.
Occasionally, when counting the contents of one or more sets together, we might make the mistake of counting the same items more than once. When we do, we need to first know it and then to back out those counts.
As a case in point, let's talk about a universal set holding the numbers 1-10. From that we are going to pull one number and put it into two other sets based on some function. In one set – call it E – the function includes all the even values we find. In another set – call it T – set a function includes identifying all of the numbers divisible by three. Possible contents would be
Suppose that we want to find the size of the union of E and T. How many unique values are there? The simple but wrong answer is just to add the sizes as in ⋯
Fine, except that these sets overlap. Both contain the number 6. Using Venn Diagrams, it would look like
We double counted the 6 since it is both E and T. Said differently, it is the value 6 which makes up the intersection of sets E and T as in
and its size is
We need to remove this double count so, when the sets overlap like this,
Please be careful, this sort of thing shows up in database queries all of the time.
Starting easy, what is the probability that I throw a 1 with a single six-sided die?
Let's expand on this by using two die and their sum.
What is the probability of throwing two fair dice with their sum being five or less? Let's start by finding the size of the set where this is true. Perhaps stating the obvious, neither die can have the value of six or five, this being true because the other die must be at least one. So, let's take a look at what remains.
The size of this set is 9. That is our positive outcomes set. What, then, is the size of the universal set, all the possible outcomes from throwing two dice? 36 (6 * 6).
Our positive outcomes set size (9) divided by the universal set size (36) is
OK, let's try a slightly more tricky variation. What is the probability of a pair of dice coming up as five or less OR a multiple of three. We know the count for the first half (9), since we just did it, but we also need to count the second half. That set looks like
The size of this latter set is 11. So, the probability of both – OR – is
Right? Nope. We are double counting. The cases of {1,2} and {2,1} are in both sets. Those two comprise an intersection. Their count needs to be subtracted. So, instead, the total count of the combined two sets needs to be two less and so is 18.
Or, alternatively, go ahead and add the probabilities creating the 55.5% and then subtract off the probability of the two elements {1,2} and {2,1}. Together they are 2 elements out of the universal set of 36 so
So subtracting percentages, 55.5% - 5.5% = 50%. Either way, whether we represented the size of our set as counts or as percentages, we needed to subtract off the intersection of the sets, the double counting.
The definition of Conditional Probability says something like "given that we know something about an event E, what is the probability of some event F which has some dependency on E?" In Wikipedia, that reads "In probability theory, conditional probability is a measure of the probability of an event occurring, given that another event (by assumption, presumption, assertion or evidence) has already occurred."
So, by example, using our sets from a previous problem using two dice, we might picture Event F as the set of the sum of those two dice which are multiples of 3 and Event E as the set of the sum of those same two dice which are 5 or less. These are overlapping sets, sets with an intersection. Same throws of the two dice, just different ways of looking at them.
We can calculate the conditional probability of events F given the events in E. The elements of set E are the starting point, not so much the elements of the universal set which are outside E or F. In short, given the intersection between sets E and F and full knowledge of the probability of the events in set E, we can find the conditional probability of F given E as in
For our little example, our numerator – the probability of the intersection of Sets E and F – is
Let's switch gears. All this can be easily extended into multiple sets as in thinking of our universal set as something like a jig saw puzzle with another set overlaying that as in the following figure:
We are arbitrarily defining Sets E as partitions of the universal set; we are making the sum of the four sets E the same as the universal set. The sum of the probabilities of the sets E adds up to 1 as a result. Set F is just some arbitrary set within that universal set as well. The probability of finding an element in F here is, of course, some fractional value.
So, what is the probability of elements in any (and all) of the sets E given F? It is first the count of
That happens to correspond to the size of Set F relative to the entire universal set (i.e., all of the Sets E). You can also see that in the following equations where we switch back over to probability. The probability of finding elements in set F is the sum of all of those sets found by interesting parts of the set E with F.
Let's next rewrite our equation from earlier
The probability of any one of the Sets E is the count of that one set divided by the size of this universal set. In the previous equation, each one is our P_{r}(E_{i}).
Which leads us to Bayes' Theorem.
Let's built up to Bayes' Theorem via some sort of a proof.
From Conditional Probability in the previous section we had
We knew that given the probability of a Set E and the probability of the intersection of Set F and E, we can find the probability of F given E.
Those names are just arbitrary, so we can reverse them and simply say
The thing common to both of those is the probability of the set acting as the intersection of Sets E and F. Isolating on that by multiplying both sides by the denominator we get
which we can rewrite as Bayes' Theorem which is
In short, where we know the conditional probability of set F given information about set E, we can also then know the reverse, the conditional probability of set E given information about set F.
OK, so what?
I am going to paraphrase an example from Wikipedia on Bayes' Theorem.
Suppose we have a drug test for cannabis with success rates. Separately, suppose via other testing of a very large population we know with some precision the rate of that drug's use; cannabis was used in these tests.
So, as to probabilities we will be using:
Those are all known probabilities, accepted as fact.
The question being raised is just how successful this test in is determining – say – if we randomly choose individuals from this population,
These two questions are represented as
These are both conditional probabilities. Again, these are the two values that we want to determine. Please keep them in mind.
In what is to follow, what we are really doing here is working backward. We know a lot about both these tests and the overall population. But we do not know the answers to those two questions. We do know from the probabilities given above
which is 90%, meaning that when tested on a known cannabis user, the test successfully finds cannabis at that rate. We also know from above that
which is 80%, meaning that when the test is on a person known to not have used cannabis recently the test successfully came back with a negative test. Both of these are effectively positive tests.
So, now to Bayes' Theorem. Bayes' Theorem says that given these latter two probilities, we can determine what we expect for our desired two probabiliites as in the following:
All that we need to add to these formula are
Doing that math on these latter two equations,
So, now we have everything we need to complete the two Bayes' Theorem formula above. We get
What Wikipedia says about the first of these is, "In other words, even if someone tests positive, the probability that they are a cannabis user is only 19% – this is because in this group, only 5% of people are users, and most positives are false positives coming from the remaining 95%."
For the latter, this high percentage also means that the test is only misidentifying cannabis users as non-users in 0.66% of the cases. Certainly preferable.
There is another way to look at this, something producing the same results. Suppose that we chose 10000 people from our universal set and ran this test. Ideally, we would find counts as in the following, all of which are based on the test and user probabilities given previously:
For the probability of a positive test when the people tested were indeed cannabis users, so P_{r} ( User | Tests Positive ), we divide as in the following and get the expected percentage:
For the probability of a negative test when the people tested were indeed not recent users, so P_{r} ( Not User | Tests Negative ), the division is as follows with results as expected:
The approach provides a nice mental picture of just what we are dealing with in terms of conditional probabilities. In the first, we wanted to find the probability of a cannabis user given that we knew the test was positive. In the second, we wanted to find the probability of a non-user given that the test was negative.
As another example, we will use a tree structure to help us understand it.
Suppose that we have a set of three factories, each measured as producing some number of widgets in a month, and each having their own defect rates as in the following table. We want to know from that our overall success (S) and defect (D) rates and then, using Bayes' Theorem, the probability of a factory given its defects.
As a Step 1 in solving this, we build a tree structure of probabilities of success and defect probabilities given each individual factory.
As a Step 2, we have multiplied each leg of the tree to find relative success and defect rates. As one example, the probability of successfully manufacturing a part in Factory A with knowledge of all factories is 58.8%. Adding up all the probabilities of success and separately the defect probabilities we get 98.125% and 1.875%, respectively. You will find the sum of these to be 100%. Comparing that overall defect rate to the set of three defects in the table, this value of a little less than 2% after factoring the weighted production of the other factories is rather as expected.
As a Step 3, we reintroduce a variation on Bayes' Theorem and realize that we have all the needed pieces from Steps 1 and 2. We are going to rewrite Bayes' Theorem
for our purposes as
The denominator of this – the entire sum – is our weighed defect rate of 1.875% from Step 2. The numerator is each individual factory's defect rate weighted by their production. The result – one for each factory – is the probability of a factory given a defect. Given a defect is found, the results are for each factory:
OK, so how is this useful? This can be used a number of ways, but suppose that you want to invest in decreasing defects in one plant over another. We could invest big in Factory C, since it had the greatest failure rate at 3%, skipping all of this math. But it also has the lowest rate of production. By doing the math, we can see that even completely solving all the defects problems in Factory C, we would only take our defect rate down roughly 16%. On the other hand, if we were considering increasing the production rate at Factory C (or B) a lot, we might want to take another look; fix problems while re-tooling to increase production. Even so, if we want to invest in just one plant, investing in decreasing defects in Factory A would give us the biggest bang for the buck.