BZ’s Note: If I’m considered a "Guest Columnist," then today’s Playing the Numbers Game is written by a Guest-guest Columnist, our very own Horn Brain. He came up with a great idea for a column last week and I just let him run with it. What follows is an only lightly-edited version of what he came up with – an in-depth look at the BCS in a manner I’ve never seen before. Enjoy, kids. --BZ--
So, let’s start with a little background to explain why this is so incredibly geeky:
I’m an aerospace engineering undergrad, so I want to build freakin’ spaceships. I’m also an aerospace engineering undergrad at the University of Texas, so I’m crazy about football. The only logical next step is to crunch statistics like there’s no tomorrow and write about it on the Internet. We cool? Cool.
Now that the pleasantries have been dispensed with, let’s talk about the uncomfortable dis-pleasantry that is the BCS. We all know how it works: Teams get a certain number of points from each poll, whether human or computer, then the points are added up and divided by the maximum possible to give a percentage score, which is averaged into the final BCS average, which you use to rank the teams. The Harris and USA Today polls each count for one-third of a team’s final BCS average, while the computers have the extremes thrown out, and then are all averaged together as the other third. I understand that I just said "we all know how it works," and then proceeded to explain it, but that’s just me stroking everyone’s ego, while simultaneously letting anyone who doesn’t know, know. Cute, right? Yeah, I lied about the pleasantries being dispensed with.
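The arithmetic described above can be sketched in a few lines of Python. This is a toy illustration with made-up function names, not the official BCS code; the 25-points-for-#1 scoring and the drop-the-extremes rule are the ones described in this piece.

```python
def computer_pct(scores):
    """scores: the six computers' points for one team (25 for #1 down to
    1 for #25, 0 if unranked). The single highest and lowest scores are
    thrown out; the rest are summed and divided by the max possible."""
    trimmed = sorted(scores)[1:-1]          # drop the high and low
    return sum(trimmed) / (25 * len(trimmed))

def bcs_average(harris_pct, usatoday_pct, comp_pct):
    """Harris, USA Today, and the computer composite each count 1/3."""
    return (harris_pct + usatoday_pct + comp_pct) / 3
```

A unanimous #1 across all six computers works out to a computer percentage of 1.0, which then gets averaged with the two human percentages.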
What we don’t all know is the nitty-gritty of what’s really going on, quantitatively, inside the aforementioned formulaic festivities. We all look at Hawaii and say "Oh, computer no likes. Humans likes. Computer dumb," or, "The humans have succumbed mightily to the hype on this Hawaii team." We can see that discrepancies occur, but what I’ve done for you all is write a big, ugly Excel spreadsheet that crunches over 100,000 data entries into a few numbers that tell you how often those discrepancies occur and how big they are. The standard deviations I’ve collected basically tell you how far off you can expect a poll to rank a team from where it actually ends up in the BCS Rankings.
If you already understand what the standard deviation of a set is and what it measures, good, skip to the next section. If not, read ahead to bone up. If you neither know nor care, go ahead and skip to the next section, but be aware that you’ll be taking my word at face value.
Say I start the Horn Brain poll of unlimited vagary and speculation, publish my poll for all eternity, then I crunch all my own numbers and come up with a standard deviation of 7.5. That means that you could expect my ranking to be within 7.5 places of a team’s BCS ranking most of the time. The smaller my standard deviation, the closer you would expect my poll to be to the BCS. If I was smart about it and published my numbers immediately after the BCS came out every week by scrawling "Horn Brain" over the BCS in "BCS Standings" with a burnt-orange crayon, then my standard deviation would be zero, because my rankings were the BCS rankings. Instead, if the BCS ranked Texas #10 and OU #5 and I ranked Texas #11 and OU #4, my standard deviation for these two rankings would be 1. If I ranked Texas #12 and OU #3, my standard deviation from the BCS Rankings would be 2. If I ranked Texas #11 and OU #3, my standard deviation would be in between 1 and 2 (actually about 1.58). This is a ridiculously small sample size, but you can get the idea. Clear? Sorry to bore you, on to the madness!
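The two-team example above can be checked in a few lines. This is a toy sketch assuming, as the numbers in this piece do, a root-mean-square measure of how far a ballot sits from the BCS; the function name is mine, not the spreadsheet's.

```python
import math

def deviation_from_bcs(my_ranks, bcs_ranks):
    """Root-mean-square difference between my rankings and the BCS's:
    the 'standard deviation' used throughout this column."""
    diffs = [(m - b) ** 2 for m, b in zip(my_ranks, bcs_ranks)]
    return math.sqrt(sum(diffs) / len(diffs))

# BCS has Texas #10, OU #5:
print(deviation_from_bcs([11, 4], [10, 5]))   # 1.0
print(deviation_from_bcs([12, 3], [10, 5]))   # 2.0
print(round(deviation_from_bcs([11, 3], [10, 5]), 2))  # 1.58
```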
Computer Rankings vs. Computer Rankings, or: "WTF Billingsley?":
At first, I was interested in just comparing the computers against each other by taking the standard deviation of each computer poll against the computer average. Here’s what I got:
Note: A&H, RB, CM, KM, JS, and PW are Anderson & Hester, Richard Billingsley, the Colley Matrix, Ken Massey, Jeff Sagarin, and Peter Wolfe, respectively.
First, let me explain a few things:
- I started in 2005 because that’s the farthest back I could get data for the same polls in the same weighting scheme that is currently used. I’m not going to go back and manually enter data just because some genius at BCS-topia, or wherever they’re stationed, decided in 2001 to scan a document every week instead of type up a table that could be used efficiently. Plus it’s not statistically sound to use numbers that were averaged differently.
- It’s not fair to average 2007 equally with the other years, since it only has 4 sets of data (weeks of published polls), whereas the others have 8, so I made it worth ½ as much to the average as the other years. That’s why it says "Average (wtd)."
- Delta is the poll’s standard deviation minus the average for that year. A&H’s delta for 2005 is 2.18 – 2.63 ≈ -0.45. Negative numbers mean you’re closer than the average pollster to the average ranking; positive means you’re further.
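In code, the delta bookkeeping is trivial (`deltas` is my own illustrative name, not anything from the actual spreadsheet):

```python
def deltas(poll_sds, year_avg):
    """Each poll's standard deviation minus that year's average.
    Negative = closer than the average pollster; positive = further."""
    return {poll: sd - year_avg for poll, sd in poll_sds.items()}

# A&H's 2005 entry from the table:
print(round(deltas({'A&H': 2.18}, 2.63)['A&H'], 2))  # -0.45
```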
So, what do all these numbers tell us? Wow. Billingsley’s rankings disagree so much with the other computers, that he is the only one with an above average standard deviation. Let me say this another way – without Billingsley, some computers would be above the average standard deviation and some below, thus giving us the average. But Billingsley’s rankings are so crazy different from the other computers that with Billingsley included in the computation of the average, everyone else is below the average, thanks to Billingsley’s monstrous standard deviation. Look at it in a picture:
That is Billingsley’s towering purple rod, ladies and gentlemen. Try to refrain from snickering, as this is for purely scientific purposes. Notice that 0 is average here. Let’s be fair, though, and reserve judgment until we see how the computers compare to the overall BCS rankings. Right now, all we can really note is that, since the other polls generally agree with one another, there are only three possibilities:
- Billingsley is an idiot
- Everyone else is an idiot
- All six polls are idiotic, and Billingsley disagrees with the other idiots, while still maintaining a superior idiocy himself.
Now, remember when I said that the BCS throws out the highest and lowest computer averages? Here I’ve compiled the average number of times, and the average percentage of times out of 25 (that being the maximum number of times that an individual computer’s rankings can be thrown out per weekly poll), that a ranking was thrown out (read: not used at all) in determining the computer average.
First let me explain a few things:
- I’m only counting the number of times that a poll was the unique maximum or minimum ranking. For example, if a team is ranked 1, 2, 3, 4, 5, and 5, only the poll that ranked that team #1 would have a throwout counted against it. The 5’s are the minima, but since someone agrees with them, I’m not going to call them out on it.
- Once again, I’ve weighted 2007 one half as much as the other two years.
- "Throwout percentage" means a poll’s average number of throwouts, divided by 25 (the number of possible throwouts), times 100%.
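The unique-extreme rule can be sketched like so (poll abbreviations as above; the function is my own illustration, not the BCS's procedure verbatim):

```python
def throwouts(rankings):
    """rankings: one team's ranks, keyed by poll. A poll gets a throwout
    counted against it only if it is the UNIQUE maximum or unique
    minimum; tied extremes are spared."""
    counts = {poll: 0 for poll in rankings}
    vals = list(rankings.values())
    for extreme in (max(vals), min(vals)):
        holders = [p for p, r in rankings.items() if r == extreme]
        if len(holders) == 1:
            counts[holders[0]] += 1
    return counts

# The 1, 2, 3, 4, 5, 5 example: only the poll that said #1 is charged.
print(throwouts({'A&H': 1, 'RB': 2, 'CM': 3, 'KM': 4, 'JS': 5, 'PW': 5}))
```

The two 5s agree with each other, so neither gets a throwout counted; the lone 1 does.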
Yes. Go ahead and look again. That’s 55%. As in, Billingsley has, for the past three years running, averaged being thrown out of the poll more times than he has been counted in it. He’s at almost 14 out of 25 times. Once again, there are only the three possibilities. Occam’s razor does not smile upon Billingsley at the moment.
Computers and Humans vs. The BCS, or: "I Don’t Think You Understand Why We Do This, Billingsley"
Now let’s move on to compare the computers and the humans to the BCS average:
Same rules as before, no special explaining to do, here.
Well, here we have a problem. It’s fine to compare the computers to the computers in this context, and it’s fine to compare the humans to the humans, but let’s remember that the humans each count for a full third of the BCS poll, and they also tend to agree with one another quite a bit (monkey-see, monkey-do, says the computer; touché), while the computers each count for only around 1/6 of 1/3. So, in the spirit of evening the field, I’ve multiplied the computer deviations by 1/3 and the humans by 2/3, and rerun the averages. I know this is quite ad hoc, but my reasoning is that the computers generally agree with each other, so I’ll count them as one poll, and the humans generally agree with each other, so I’ll count them as a collective poll. If you come up with a more logical way to weight the numbers, argue it to me and I’ll adjust them if you sway me. Here are the results:
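That reweighting step, with placeholder numbers rather than the real table values, looks something like this (one plausible reading of the ad hoc scheme, in my own code):

```python
def reweight(computer_devs, human_devs):
    """Scale each computer's deviation by 1/3 and each human poll's by
    2/3, then average the scaled deviations together."""
    scaled = [d / 3 for d in computer_devs] + [2 * d / 3 for d in human_devs]
    return sum(scaled) / len(scaled)

# Placeholder deviations, not the actual 2005-2007 figures:
print(reweight([3.0], [1.5]))  # 1.0
```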
Same old, same old.
So, now we see that Billingsley agrees more with the overall BCS than he does with the computers. This means that he must be more in line with the human polls, since he does so much better when they’re factored in. Wait, that’s good, right? Better than good, that’s amazing! Billingsley has created a program that thinks (about college football, at least) like a group of humans! Eureka! Oh, wait, why is it that we wanted to include computer rankings in the first place, way, way back when all this BCS stuff began? It was because people are biased towards big names, and are prone to moving teams around based more on when they lose than to whom. It appears that all Billingsley has done is introduce bias to his computer poll. He says so himself:
"In the first week of the season if Florida St. beats #107 No. Illinois, and Ball St. beats #58 Memphis, I don't want Ball St. ranked ahead of Florida St. just because they both have 1-0 records. That's not logical. We ALL KNOW Ball St. is not in the same league with Florida St., at least not at this juncture. Let them EARN IT first. Let them prove it over due course of time, then my poll will respond accordingly. That's what I mean by Season Progression. All of my teams start out with a rank, #1-#117, because they ARE NOT ALL EQUAL. We KNOW THAT from past experience, so why not use that experience to begin with. Some would say starting all teams equal, or all at 0, is the only FAIR thing to do. I say it's the most UNFAIR thing you can do, and besides its just plain illogical."
Quote from mgoblog, emphasis mine, ALL CAPS not mine.
In response to the emphasized statement, "Um, no. Your computer should not. That’s why it exists, bud. I’m through with you."
So, basically, you can see, through all of this, that the computers are pretty consistent (with one enormous throbbing pillar of an exception) amongst themselves, but they generally have trouble agreeing with the humans. The humans get a lot of help just from being worth so darned much in the formula, but you can see that they also compare nicely with the better of the computer formulas once you account for that.
There's more after the break. Click through if you dare.
And Now For Something Completely Different, or: "I Pander to BZ and His Flex Playoff System":
Now let’s shift our attention to something different. Remember those long rants BZ, Red Blooded and I used to share over playoff issues? One of BZ’s and my points in the argument was that the uncertainty of a team’s ranking increases as you move down the ballot (i.e. the higher a team’s ranking, the more consensus there is among the voters that this is the correct ranking for that team), which effectively means that the farther you move down in the rankings, the more claim the team ranked directly below a given spot has to that spot. Therefore, when you allow more teams into a tournament, you’re really just increasing the probability that you have some teams in that tournament that don’t belong, while leaving some teams out that do belong. For instance, if a playoff tournament is constructed based solely on the end-of-season rankings, we can be more certain that we included the actual top 4 teams in the country in a 4-team tournament than we can be that we included the actual top 16 teams in a 16-team tournament. By increasing the number of teams, you’re also (probably) increasing the number of teams who have a legitimate claim to be in the tournament but aren’t, thus, perhaps, actually increasing the controversy surrounding the identity of the national champion.
Well, I’d really rather not make a big fight about this again, and for the record, I’d like to say that I understand Red Blooded’s assertion that the issue is not a team’s claim to a given ranking, but rather to the national championship when it’s all done with. The only real disagreement we have (on this issue) is my championing of choosing based on claim to the NC at the end of the regular season vs. Red Blooded’s desire to eliminate any controversy surrounding the identity of the national champion in the end. That said, it still interests me to what degree the uncertainty rises, and whether or not it is predictable and regular.
Let’s look at some numbers then, shall we?
And now, we have a really neat graph:
So, here we can clearly see that the standard deviation (which is what you would report as the uncertainty of the ranking, if this were my physics lab) of a ranking increases as you move down the ballot, until you get to about #16, at which point it starts dropping off again. This is basically what you would expect, since a good team is generally a consistently good team, and the lower ranked teams tend to be harder to pin down on your ballot (Remember Texas playing like a Top-10 team against OU, right after looking like a Middle-50 team against KSU?). This creates a disagreement among voters, which leads to a higher standard deviation.
The reason for the dropoff after 16? Well, since the computers report an unranked team of any caliber as 0 out of 25 possible points, there’s a sort of wall at #25, where you can only screw up by guessing higher, which lowers the chances of your ranking being very far off from the actual ranking. Say I think Kentucky is #100 and you think they’re #26, while the BCS ranks them #25. Since we would both report a score of 0 against the BCS’s score of 1, the standard deviation of Kentucky’s ranking would be 1, even though it should clearly be higher.
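A quick sketch of that clipping (the 26-minus-rank points scheme is the computers' scoring described in the text; the function name is mine):

```python
def points(rank):
    """The computers' scoring: 25 points for #1 down to 1 for #25,
    and 0 for anything worse -- the 'wall' at #25."""
    return max(0, 26 - rank)

# Kentucky at BCS #25 vs. my #100 and your #26:
print(points(25), points(26), points(100))  # 1 0 0
```

Rank #26 and rank #100 collapse to the same score, so the apparent deviation near the bottom of the rankings shrinks artificially.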
The coolest thing here is my little equation. Given a team’s rank, I can give you a good estimate of the uncertainty in their ranking. If the rank is x, then the uncertainty of the ranking (y) is given by y = -0.0133x^2 + 0.4341x + 0.4683 ± ~0.346. The black line on the chart is the graph of that formula, and you can see that it fits quite well. Notice that the average (from the table) is ~3.18. If you want all the teams with better-than-average standard deviations, you just plug in 3.18 for y, go back to high school algebra, and use your old friend the quadratic formula to crank out that you want teams ranked better than ~8.41, which means the top 8 teams.
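If you want to check that algebra, here's the quadratic-formula step, using the rounded coefficients from the fit (so the root lands a hair off the ~8.41 quoted above):

```python
import math

# y = -0.0133x^2 + 0.4341x + 0.4683, solved for y = 3.18
# (the average standard deviation from the table).
a, b, c = -0.0133, 0.4341, 0.4683 - 3.18
disc = b * b - 4 * a * c
x = (-b + math.sqrt(disc)) / (2 * a)   # the smaller of the two roots
print(round(x, 1))  # 8.4
```

The smaller root is the one we care about: it's where uncertainty first climbs past the average as you move down the ballot.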
If that’s not a reason to keep the maximum tournament size to 8 teams, I don’t know what is. Any bigger, and you’re less certain about the last team in than you are about the average team. "Why do this to yourself?" is all I’m saying. Not to mention, since this seems to be a quadratic function, the fewer teams you include, the steeper the slope of the graph, which means you cut out more uncertainty with each team you drop. This is why it makes sense to me to think about giving teams chances based on their claim to the top resume at the end of the regular season, because it’s naturally exclusive [And the number of teams with a claim to that top resume in a given year, while almost always very small, isn’t fixed but variable, which is why the Flex System is so practical. –BZ]. I just wanted to support it with some quantifiable data.
One More Thing:
Finally, I just want to say thanks to anyone who actually reads this and tries to tear me a new one over some assumption I made somewhere. The best part about BON is our community, which PB and each of us individually do such a great job of maintaining at a level of rationality and level-headedness seldom found on the Internet. Keep up the great work!