For the past two years, I have attempted to systematically predict First Round NCAA Tournament upsets using a dataset of match-ups from 2004 onward. Last year, I improved the model by adding opponent Four Factors data, and the model correctly identified Marquette (11 seed), VCU (11 seed), and Richmond (12 seed) as the three teams most likely to pull off upsets. So what does this model say about this year’s March Madness bracket? Read on to find out.

**What is the Model?**

While the performance was no doubt lucky (the odds of all three of those teams pulling upsets according to my model were roughly 1 in 5), I think the model does a good job of identifying factors that are important in the one-game setting of the NCAA tournament. The backbone of the model is turnovers and rebounding. Specifically, I use turnover rates and rebounding rates rather than raw totals, to eliminate the potential bias of pace.

Turnovers seem to be very important in March: if you take care of the ball and take it away from your opponent, you are in effect creating extra opportunities to score. Rebounding, too, is important because better defensive rebounding minimizes the opponent’s opportunities to score and better offensive rebounding maximizes a team’s scoring chances.
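To show why rates matter more than raw totals, here is a minimal Python sketch of pace-adjusted turnover and rebounding rates. It uses common textbook definitions; the post does not spell out its exact formulas, so the possession estimate, the 0.475 free-throw weight, and all function names are assumptions for illustration.

```python
def possessions(fga, orb, turnovers, fta, ft_weight=0.475):
    """Estimate possessions from box-score totals.

    ft_weight is an assumed constant; different sources use slightly
    different values.
    """
    return fga - orb + turnovers + ft_weight * fta


def turnover_rate(turnovers, poss):
    """Turnovers per 100 possessions, so pace does not distort the number."""
    return 100.0 * turnovers / poss


def off_reb_rate(orb, opp_drb):
    """Share of available offensive rebounds that the offense collected."""
    return 100.0 * orb / (orb + opp_drb)


# Two hypothetical teams commit the same 13 turnovers per game,
# but one plays at a much slower pace than the other.
slow_team = turnover_rate(13, 60)
fast_team = turnover_rate(13, 75)
print(round(slow_team, 1), round(fast_team, 1))
```

The raw turnover counts are identical, but the slower team gives the ball away on a larger share of its possessions, which is the pace bias that rates remove.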

In addition to these factors, the model also includes measures of Strength of Schedule. This in effect proxies for seed (the two are highly correlated), but also includes valuable information about when teams are mis-seeded (see Wisconsin in 2010, for instance).

Finally, the key to the model is the match-ups. A team with a solid upset profile might be derailed by facing a juggernaut who also takes care of the ball and forces a lot of turnovers.

**The Probit Model**

**This model had 256 observations and a Pseudo R^2 of 0.38. The result variable is coded 1 for a win, and 0 for a loss, so coefficients that are negative mean that the variable is negatively associated with odds of winning. As you can see, a higher turnover percentage predicts lower odds of winning a first round game. The SOS coefficients look larger because they are scaled from 0 to 1, not 0 to 100 as the other variables are.**
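For readers unfamiliar with probit models: the predicted win probability is the standard normal CDF applied to a linear index, P(win) = Φ(b0 + b1·x1 + … + bk·xk). The sketch below illustrates that calculation in Python. The coefficient values and variable names are invented for illustration and are not the fitted values from this model.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_win_probability(features, coefs, intercept):
    """P(win) = Phi(intercept + sum of coefficient * feature value)."""
    index = intercept + sum(coefs[name] * features[name] for name in coefs)
    return normal_cdf(index)


# Hypothetical numbers only: a negative coefficient on turnover percentage
# means a team that turns the ball over more gets a lower win probability.
coefs = {"to_pct": -0.10}
intercept = 1.5

careful_team = probit_win_probability({"to_pct": 17.0}, coefs, intercept)
sloppy_team = probit_win_probability({"to_pct": 23.0}, coefs, intercept)
print(careful_team > sloppy_team)
```

Because the index passes through a nonlinear CDF, a coefficient does not translate into a fixed change in probability; its sign gives the direction of the effect, which is how the coefficients are read here.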

In out of sample testing, which was conducted by removing a year from the dataset, running the model, and using that model to estimate probabilities for the games in that year, the model proved to be fairly conservative. It identified 21 “underdog” teams seeded 11-14 that had greater than 50 percent odds of winning in the last nine years. 15 of those teams went on to win, and six did not. Thus the model is good at avoiding false positives.

What it is not as good at, however, is avoiding false negatives. Over that same timespan, 15 other teams pulled off upsets where the model predicted the better seed to win. 40 minutes under the bright lights is a small sample, and no predictive system will be perfect. I am happier with a model that is conservative and predicts fewer upsets more accurately than a model that predicts too many upsets out of sample.
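The leave-one-year-out procedure described above can be sketched roughly as follows. This is not the original code (which was written in STATA): `fit` and `predict` stand in for the real probit estimation, and the base-rate model and toy games exist only so that the example runs.

```python
def leave_one_year_out(games, fit, predict):
    """Pair each game with a probability from a model fit without its season."""
    results = []
    for held_out in sorted({g["year"] for g in games}):
        train = [g for g in games if g["year"] != held_out]
        test = [g for g in games if g["year"] == held_out]
        model = fit(train)
        results.extend((g, predict(model, g)) for g in test)
    return results


def upset_pick_record(results, threshold=0.5):
    """Count (hits, misses) among underdogs given better than even odds."""
    picks = [g for g, p in results if p > threshold]
    hits = sum(g["won"] for g in picks)
    return hits, len(picks) - hits


# Placeholder model: predict the underdog win rate of the training seasons.
def fit_base_rate(train):
    return sum(g["won"] for g in train) / len(train)


def predict_base_rate(model, game):
    return model


toy_games = [
    {"year": 2010, "won": 1}, {"year": 2010, "won": 1},
    {"year": 2011, "won": 1}, {"year": 2011, "won": 0},
    {"year": 2012, "won": 0}, {"year": 2012, "won": 1},
]
results = leave_one_year_out(toy_games, fit_base_rate, predict_base_rate)
hits, misses = upset_pick_record(results)
print(hits, misses)
```

Applied to the figures above, 15 hits and 6 misses means about 71 percent of the predicted upsets came through. If the 15 missed upsets come from the same pool of 11-14 seeds, the model flagged 15 of 30 actual upsets, or half of them.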

**2012 Upset Predictions**

If you look at the Vegas odds for this year’s first round, you’ll notice something striking: the odds for the 13 and 14 seeds are a lot shorter this year than in most years. The 3 and 4 seeds are not as heavily favored: Georgetown, Michigan, Florida State, and Indiana are all favored by six points or fewer, whereas last year 3 and 4 seeds were favored by an average of 11 points.

The 2012 crop of 13 and 14 seeds are especially tough. That is borne out in the model predictions below:

According to this model, every 14 seed has at least a 30 percent chance of pulling an upset. That is ridiculously high compared to previous years.

As you can see, the model likes VCU, NC State, Ohio, and Cal (provided they beat USF in the First Four game) to pull upsets. New Mexico-Long Beach State is essentially a coin flip in this model, and Florida State looks vulnerable because of their terrible turnover rate and bizarre inability to rebound defensively.

VCU will be a very trendy upset pick this year because of their run last season, but this model adores the Rams because of their ability to force turnovers (best in the nation) and take care of the ball (27th best in the nation). A note of caution, however: VCU is a big outlier. Since this model is looking for average effects, it may not do as well out of sample predicting teams far away from the average, like VCU. The Rams fit the profile of a tournament David, but between their popularity and their extreme profile, understand that the expected value of picking them might be lower than this model implies.

I will take a 1/3 chance of Harvard beating Vanderbilt.

Can you post results from the other seeds? (at least 10 and 9?)

Those aren’t really upsets if they happen

I have a better model. Just look at the Vegas line. It is easier and more importantly, more accurate than yours! Why? Because as economists know, no one knows more than all of us combined.

Hey Bob,

I agree the Vegas line is the best predictor in the aggregate, but in individual games, that is not always the case. My model might not beat the spread for every possible matchup, but we are looking solely for wins and losses when we fill out brackets. If this model looks at things from that perspective, it might be useful in addition to Vegas.

By the way, the Vegas lines for the games I have tabbed as upsets are fairly low…

Impressive results last year! Out of curiosity, for the 15 upsets that the model predicted were less than 50% likely, do you see that they were tending above 40%, 30%, or was there a wider distribution? I ask because the answer has interesting implications for pools that reward upsets. Thanks!

I believe if the 40/30% ratio distribution is achieved then the 1, 4, 7,8,9, 14 & 15 picks above on the underdogs will prevail. There’s no doubt that the inverse negative coefficients would influence my decision because if you take into account the conference in which each of these individual teams played in AND discount at some rate the ranked teams they may or may not have played, it is clear that they are favored underdogs. I’ve applied this theory over the past several years with great results.

I understand that turnovers correlate to winning, pretty obvious assumption there. And the correlation might be strong, great. But much, much, much more importantly (and not answered here), is the question of whether past turnover performance is predictive of future turnover performance. Because if it isn’t then your model is more explanatory than it is predictive (you see this problem with a lot of shitty nfl team ranking systems *cough* footballoutsiders *cough*)

Hey Brian,

Very good point. It is hard to look at this directly with my dataset, but I am using past turnover performance of previous NCAA teams to correlate with their NCAA Tournament performance and the link is strong, even in out of sample testing. I think that makes this model predictive rather than descriptive. Does that make sense?

Yes, I think it is predictive. A team like UConn that has a lot of big men yet often gets out-rebounded is likely to continue to get out-rebounded, especially in games where you get down to the final 64 teams. I mean, all of these teams are good. Not in the James Worthy, Sam Perkins, Michael Jordan good, but they are all fairly strong teams that have played at least 25 or so games.

Over the course of 25 games, a team that rebounds well and takes care of the ball is likely to continue to do so provided the competition is roughly equal. At this level the competition is better. So that is where the model’s strength works because it seems to factor in strength of schedule. Now let’s hope Ohio smokes Michigan.

I read this a while ago and really enjoyed the model / thought behind it. Turnover % is a very underrated stat for me, as is seen by VCU, and I think you may actually be underselling your model on those outliers (IE VCU) as this game is going on. The turnover margin is the thing creating the disparity in this game, great job, keep it up.

Great work predicting the VCU/WSU game. Fun stuff. I had VCU losing to WSU until I saw this site. The key: weighing the schedule strength with turnovers and rebounding. WSU is good, but they don’t quite match up well enough with VCU. Congratulations (and thanks).

PCA does work for football. I did it when I was a student at Caltech just for grins, and I verified that it was statistically predictive as well as explanatory. I did this rigorously (in the 15 minutes of spare time that I was allotted per term.) I remember being very surprised at the predictive effect of turnovers, as my original assumption was that it would be random.

I should have said that it worked for professional football. In college, I imagine there is a lot more variability in individual talent, which gives you more of those dreaded “outliers.”

For 2 years in a row you make me look like a genius… When are you graduating? That’s going to be a sad day for my NCAA bracket… Can we talk fantasy football? I’m looking hard for the edge and I think you’re the guy to deliver it… Thanks a lot. I owe you.

– mike

John, here are your model’s 2-year results, broken into ranges that I thought made some sense. 60%+ win probability, with a repeat correct call on VCU, is now a perfect 4 for 4 (100%). In the 45-59% win probability range, the model is 2 for 5 (40%). In the 11-44% win probability range, 4 for 17 (24%). And less than 10% win probability, 0 for 7, or confirmation there are no upsets here.

John,

What source are you using for the team stats? I tried duplicating these results, but kept getting different numbers.

Purely from Kenpom.com. I’m happy to share my dataset.

Yeah, that would be great if you’re willing to share the dataset.

What format would you like it in? If you have access to STATA, I’m happy to give you my code as well. I’m pretty confident you’ll be able to replicate it.

It would probably be easiest for me just in a .csv format. I don’t have access to STATA right now.

If only VCU could shooot. Oh well. They were fun for one round. Had them beating IU. Apparently Crean can coach after all.

Would you mind sending me the csv also?

What about later rounds? Any forecasts/predictions for the Sweet 16?

John,

If you wouldn’t mind, could you send me your dataset in .csv format? I would like to try and replicate it as well. Thanks for your time.

I am really interested in this. Do you have the formulae and results in xls or something that I can have a look at?

It would be cool to see what the model would say about the 15 seeds that pulled off upsets this year. I would also like to have a look at the dataset if you wouldn’t mind sending something to me.

What are your picks for this year?!?!

Will you be doing this again for 2013 bracket? I would love to see it.

Do you still have the data set for this? It would be fun to try to replicate it on STATA.