This entry marks the first of what will be a weekly column from esteemed BONer billyzane. With a focus on looking at CFB statistics in general, and those related to Texas in particular, BZ's column will offer a thoughtful weekly reflection on the game we love. Not all entries will be as in depth as this opening foray, but I urge each of you to print and take the time to read this one carefully. There's some great data to mine through. --PB--
I believe in statistics. I think you should too. There are problems with them, to be sure. They can be horribly confusing and, if misused, can be misleading. And they certainly can never tell the whole story themselves. But these problems are not inherent to statistics, only to the application of them. Every week, this column will go beyond the numbers that you’ll find on ESPN and on many blogs to find something that needs explaining, and then explain it.
That's Playing the Numbers Game.
[Insert hokey theme music of your choosing here - I'm partial to anything that starts with a rolling piano.]
It should be noted that I’m not a statistician. I haven’t taken a real math class since Calculus in high school. But I totally dominated that AP test (think Aaron Lewis’ domination of that other AP in last year’s OU game), and then majored in sociology (among other things) at UT, which means I took at least two whole classes having to do with statistics. Yeah, those classes were mostly filled with athletes and Tri-Delts, but it was a real major, I swear. No, seriously. Whatever, jerks. Anyway, let’s see, what else? Oh yeah, I used to know how to do a linear regression. So in conclusion, I’m practically over-qualified. Regardless, not every week will be as numbers-intensive as this week. I promise not to geek out too much....after this week. This week I totally geeked out.
How the hell do you do a linear regression? Wikipedia says that this equation has something to do with it. Don't worry, this won't be on the test.
This week, I decided to take a look at what statistical performances are the greatest predictors of success (i.e. winning) in Texas games since 2001. Why 2001? Because I consider that the beginning of the modern Mack Brown Era. That year signaled that Texas was no longer rebuilding from the Mackovic Era, but had in fact been rebuilt. Also because data entry for this exercise was a pain in the ass and took forever. That too.
And what do I mean by predictors of success? It’s fairly simple. Which performances by winning teams most closely correlate with winning football games? We’re going to look at this in the realm of performances relative to the teams that lost the games. So, for instance, scoring more points than your opponent correlates 100% with winning football games. If you win that statistical category within a single game, you will always win that game. But that’s obvious. What about other statistical categories? That’s what this column is about.
As a side note, I totally cribbed this idea and the basic methodology from SMQ. He did something similar to this for every conference in 2006. Here’s his post for the Big XII. I have included more potential predictors of success and have tried to go deeper in the analysis to give more meaning to the numbers. Also, as I said, I have only included Texas games in my data – all of them – from 2001 to 2006.
Methodology
You know who wouldn't skip the methodology section? Law and Order
You’re more than welcome to skip this if it doesn’t interest you. Skip down to the next section. But keep in mind that examining methodology is the only way you can really tell if a statistic is properly constructed to determine what it claims to determine. With that said, here’s what I did.
I took 20 statistical categories that could reasonably be construed as predictors of success – the more often you do better than your opponent in that category, the more often you win. For every Texas game between 2001 and 2006, I went through and gave a "win" to a category if the winner of the game also "won" that category, and a "loss" to that category if the winner of the game did not "win" that category. For instance, take the 2005 Ohio St game. Texas won the game and finished with more total offensive yards than tOSU. Thus, for that game, the category "total offense" gets a "win" – that is, it correlated with winning the game. In that same game, however, tOSU finished with a better red zone scoring percentage. Thus, that category gets a "loss" because it did not correlate with winning the game. If the teams tied within a certain category, I ignored it.
I then got the "winning percentage" for each category for each season by adding up the "wins and losses" for each category. I then averaged those over the course of the 6 years. I also took some standard deviations to see how reliable those averages are, but I’ll get to that in a bit. Please remember that these categories are mostly from an offensive point of view, but that they are the exact same if you look at them from the defensive side. So when I say that "Net rushing yards" has a winning percentage of 85%, I mean that the winning team has more net rushing yards than the losing team 85% of the time. I could also say that the winning team gives up fewer rushing yards on defense than the losing team 85% of the time. It’s the same statistic.
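For the programmers in the audience, the tallying described above could be sketched in Python roughly like this. It's a minimal sketch, not necessarily how the numbers in this column were produced: the box scores, category names, and function names are all made up for illustration, and it assumes every category is one where the bigger number is better (for something like penalty yards you'd flip the comparison).

```python
from collections import defaultdict

# Hypothetical box scores, invented for illustration (not real Texas games).
# Each stat is stored as (game winner's value, game loser's value).
games = [
    {"season": 2005, "stats": {"total_offense": (450, 380), "red_zone_pct": (0.60, 0.75)}},
    {"season": 2005, "stats": {"total_offense": (300, 350), "red_zone_pct": (1.00, 0.50)}},
    {"season": 2006, "stats": {"total_offense": (500, 250), "red_zone_pct": (0.80, 0.80)}},
    {"season": 2006, "stats": {"total_offense": (410, 300), "red_zone_pct": (0.90, 0.50)}},
]


def category_records(games):
    """Tally [wins, losses] for each category in each season; ties are ignored."""
    records = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for game in games:
        for category, (winner_value, loser_value) in game["stats"].items():
            if winner_value == loser_value:
                continue  # tied category: skip it, as described above
            record = records[category][game["season"]]
            if winner_value > loser_value:
                record[0] += 1  # the game's winner also "won" the category
            else:
                record[1] += 1  # the game's winner "lost" the category
    return records


def win_percentages(records):
    """Per-season win percentage for each category, plus the average of the seasons."""
    result = {}
    for category, seasons in records.items():
        by_season = {s: 100.0 * w / (w + l) for s, (w, l) in seasons.items()}
        average = sum(by_season.values()) / len(by_season)
        result[category] = {"by_season": by_season, "average": average}
    return result


percentages = win_percentages(category_records(games))
for category, numbers in percentages.items():
    print(category, numbers["by_season"], numbers["average"])
```

Note that the average here is the average of the season percentages, not the percentage across all games lumped together, which matches the "averaged those over the 6 years" step.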
The Data
WIN PERCENTAGES FOR EACH CATEGORY BY SEASON AND AVERAGED
WHAT THIS DATA MEANS AND WHAT IT DOESN’T MEAN
• These winning percentages are NOT the percentage of time Texas wins games when it wins the category. Saying that Texas is 60-0 when it out-rushes its opponent is not valuable because we don’t know what happens when Texas’ opponent out-rushes them. If Texas is also 60-0 when getting out-rushed, then rushing yards just don’t have very much to do with winning games.
• These winning percentages are instead a combination of these two statistics. It’s the percentage of time that the winning team led the losing team in this category. So if Texas out-rushes its opponents 60 times and wins every time, and Texas’ opponents out-rush them 60 times and the opponents win every one of those games, then the winner of the rushing battle had a 100% winning percentage in the games. That is what this number means.
• Remember also that these are correlations, not causations. A high percentage in one category does not necessarily mean that doing well in that category causes the team to win. It merely means that when you do well in that category, you also usually win. There may be a separate, third variable that causes both results at the same time.
• Causation is very difficult to prove, given 2 problems here: 1) small sample size, and 2) regressions are hard and generally require software, and I can’t be bothered for something that wouldn’t be conclusive with our small sample size. Instead, we’re going to take our correlations and apply them to what we already know about football to try to come up with likely causations.
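To make the difference between those first two bullets concrete, here's a small Python sketch. The ten games are hypothetical (Texas wins all ten, out-rushing the opponent in six and getting out-rushed in four), and the function names are just labels chosen for this example.

```python
# Each hypothetical game: (did Texas win the game?, did Texas out-rush its opponent?)
games = [(True, True)] * 6 + [(True, False)] * 4


def texas_record_when_leading(games):
    """Percentage of games Texas wins when it out-rushes its opponent."""
    led = [won for won, out_rushed in games if out_rushed]
    return 100.0 * sum(led) / len(led)


def winner_led_percentage(games):
    """Percentage of games in which the game's winner also won the rushing battle."""
    # The winner led the category if Texas won and out-rushed,
    # or if Texas lost and got out-rushed.
    matches = sum(1 for won, out_rushed in games if won == out_rushed)
    return 100.0 * matches / len(games)


print(texas_record_when_leading(games))
print(winner_led_percentage(games))
```

With this made-up schedule, Texas is perfect when it out-rushes its opponent, but the winner of the game led in rushing only 60% of the time. The second number is the kind reported in this column.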
Analysis
THE BEST PREDICTORS OF SUCCESS FROM 2001-2006
ARE GAMES WON IN THE TRENCHES?
We should not be surprised that having more total offense than your opponent has such a high correlation with winning games. More offensive yards than your opponent will very often lead to more offensive points than your opponent, which is the definition of how to win.
The only other categories with an 80% or higher win percentage are Net Rush Offense and Sacks. This does seem to reinforce the typical coachspeak that games are won in the trenches. Further, in the last 6 years of Texas games, a team that has had more sacks and more rushing yards than its opponent has won 89.36% of the time. We have to wonder, however, how much of this is related to the fact that teams that are already winning rush more often and teams that are already losing pass more often (and thus get sacked more often).
Well, if you look at the correlation of winning to having more passing yards than your opponent, it’s relatively low, at only 67.53%, ranked number 14 out of 20 on our list. Yards per completion has the same correlation. This could tell us one of two things. First, it could indicate that yes, losing teams pass more and that’s why passing yards doesn’t correlate with winning very well. But it could also potentially tell us that passing just really isn’t that important to winning. Games are won in the trenches and teams that run better usually win. Which is it?
The answer, I think, lies in the correlation of yards per pass attempt to winning games, which is 77.92%, ranked number 6 on our list, a full 10 percentage points better than both other passing statistics. Why do yards per pass attempt correlate so much better to winning than yards per pass completion? I think it’s because teams that are losing throw more often and thus throw many more incompletions, which lowers the yards per attempt statistic for the losing team, but not the yards per completion statistic. Thus, losing teams will often have more yards per completion than the winning team, but fewer yards per attempt.
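A quick worked example of that effect, in Python. The two stat lines are invented for illustration (they are not from any game in the data): a winning team that throws 20 times and a losing team that throws 45 times while playing catch-up.

```python
def yards_per_completion(line):
    """Passing yards divided by completed passes."""
    return line["yards"] / line["completions"]


def yards_per_attempt(line):
    """Passing yards divided by all pass attempts, incompletions included."""
    return line["yards"] / line["attempts"]


# Hypothetical passing lines: the winner goes 15 of 20, the loser 20 of 45.
winner = {"yards": 210, "completions": 15, "attempts": 20}
loser = {"yards": 300, "completions": 20, "attempts": 45}

for name, line in (("winner", winner), ("loser", loser)):
    print(name, line["yards"],
          round(yards_per_completion(line), 2),
          round(yards_per_attempt(line), 2))
```

In this made-up game the losing team "wins" both passing yards and yards per completion, but all those extra incompletions drag its yards per attempt well below the winner's.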
This seems to indicate that the low correlation of passing yards to winning games has more to do with the fact that losing teams pass more often to catch up (and thus get sacked more often) than the idea that games are won in the trenches. Further supporting this is the correlation of yards per carry to winning, which is only #8 on our list at 76.62%, about 9% less than net rush offense. That is, running efficiently doesn’t have nearly as much to do with winning as running often. Teams that are already winning certainly run often to control the clock, but they don’t necessarily run efficiently.
Thus, it doesn’t seem that dominating the trenches wins football games, but rather that winning football games creates statistics that imply the winning team was dominating the trenches. And who knows, perhaps why the winning team got ahead in the game to begin with had a lot to do with passing the football well.
OTHER INTERESTING ODDITIES
As I expected, being the home team had very little to do with who won in games Texas plays, coming in at 54%, good for only #18 of 20 on our list. This probably has something to do with the fact that most of Texas’ toughest games (OU, Bowls, Big XII Championship game) are played at neutral sites and thus don’t figure into this analysis. However, I think it’s mostly the fact that Texas is so good, year-in and year-out. They’re good enough to be impervious to the dangers of playing on the road.
Is getting off to a fast start important? The short answer is, yeah, but not THAT important. Scoring first only has a 70.13% correlation with winning, #13 on the list. However, leading after the first quarter has a much more substantial 77.92% correlation, #7 on the list.
Penalty Flags? The more the merrier!
Perhaps most surprisingly, having fewer penalty yards than your opponent actually has a negative correlation with winning football games. Teams that have fewer penalty yards than their opponents actually win only 46.75% of their games. This isn’t an anomaly either: SMQ came up with a 43.5% correlation for the entire Big XII in 2006. This seems to imply that being a "disciplined" team that doesn’t "kill itself" with penalties has absolutely nothing to do with winning football games. Obviously, there are game situations in which a penalty can hurt you a lot. But overall, avoiding penalties isn’t a big deal. In fact, there seems to be a very loose correlation between being penalized and winning games. Why? I would guess aggressiveness, particularly on defense, but that’s just a guess. Ideas?
Feel free to talk about anything else interesting in the comments.
YEAR BY YEAR RESULTS
To come up with these numbers, I took the average of all 20 winning percentages for each year. What really stands out to me in these numbers is how large the win percentages are for the years 2001 and 2005 (both above 76%) compared to the other 4 years. What do these two years have in common? The two Texas teams in 2001 and 2005 were arguably the most dominant of the Mack Brown Era (an argument can be made for 2004 over 2001, but I think that argument is a losing one).
So what does this tell us? I think it says two things. First, it reinforces how dominant these two teams were. Dominant teams should not only win, but also win most of the statistical categories while winning the game. A high "Average Category Win Percentage" shows a team doing just that (and alternatively, when they lose, they lose most of the statistical categories also). Second, this tells us that these categories are reliable. If these 20 categories really are predictors of success, we would expect the most dominant teams to win the highest percentage of the categories. Which is exactly what happened.
Reliability of This Data
A NOTE ON THE STANDARD DEVIATIONS
[Feel free to ignore if you don’t care. Skip to The Bottom Line, below.]
Hey Everyone, Math "Humor"! OMG! LOL! WTF?
Standard Deviations basically say how much deviation there was among the numbers that were averaged to get the "Average Category Win Percentage." If we average 10 numbers, which are all 50%, the average is of course 50%, and the standard deviation is 0% (every number we averaged equals the average so there was no deviation from the mean). If we average 10 numbers, 5 of which are 0% and 5 of which are 100%, our average is still 50%, but the standard deviation is 50% because every number we averaged was 50% deviant from the mean (0 and 100 are both 50 away from the mean).
The smaller the standard deviation, then, the closer to the mean the numbers are – which makes the average itself more reliable. See what I mean? If every observation we have is 50%, then the average of those observations (50%) is very reliable – a standard deviation of 0%. But if we have 10 observations, 5 of which are 0% and 5 of which are 100%, our mean of 50% (while the same as the other mean) isn’t very reliable (a Std. Dev. of 50%) because the observations we got varied so wildly from each other (and thus, by definition, from the mean).
The Bottom Line: the lower the standard deviation, the more reliable the average win percentage is for that category.
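If you'd rather see the two toy examples above as code, here's a short Python sketch using the standard library's `statistics` module. The variable and function names are just labels picked for this illustration.

```python
import statistics

# The two toy examples from above: ten observations each, both averaging 50%.
all_the_same = [50] * 10
all_over_the_place = [0] * 5 + [100] * 5


def describe(values):
    """Return (mean, population standard deviation) for a list of percentages."""
    return statistics.mean(values), statistics.pstdev(values)


steady_mean, steady_spread = describe(all_the_same)
wild_mean, wild_spread = describe(all_over_the_place)

print(steady_mean, steady_spread)
print(wild_mean, wild_spread)
```

`statistics.pstdev` is the population version, which gives the 0% and 50% figures from the examples. `statistics.stdev`, the sample version, would come out a bit higher than 50 for the second list. Which of the two sits behind the tables in this column isn't stated, so treat that choice as an assumption.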
The standard deviations in the year-by-year results don’t matter too much for us; they’re there for you to look at if you care. They are important, though, for the Win Percentages by Category that we used in the analysis section.
RELIABILITY OF THESE WIN PERCENTAGES BY CATEGORY
The first thing that jumps out at me is how high the standard deviation for Net Rush Offense is. So I looked at the data. From 2001-2006, it has had anywhere from a 100% correlation with winning games (2004) to a 61.54% correlation (2002). This seems to me to back up my earlier conclusion that good Net Rush Offense statistics are the result of winning, not the cause of it.
The 2002 team went 11-2, but was hardly dominant in amassing those wins, having to come back to beat some teams like Kansas State and Nebraska (both those teams outgained Texas on the ground despite Texas winning the game, as did North Texas, Texas A&M, and LSU). In the games Texas had to come back to win, they had more passing yards than their opponents and fewer rushing yards, which reinforces my earlier conclusion. Furthermore, the two most dominant teams of this sample, 2001 and 2005 – teams that regularly got out to huge leads and then held on – both had Net Rush Offense categories that correlated with winning 92.31% of the time and Yards per carry categories that correlated at much lower percentages (76.92% in 2001 and 84.62% in 2005). I’m convinced.
Moving along, as you can see, penalty yards has a low standard deviation of 8.49%, meaning that the average of 46.75% is pretty reliable. This just flies in the face of everything we’ve been told by coaches and pundits.
And once again, predictably, Total Offense has the lowest standard deviation (by a lot) because it’s most obviously directly related to scoring more points than your opponent. That number is pretty reliable.
Anything else you guys can think of?
--BZ--