For the past two years, I have worked on a model that predicts NCAA Tournament upsets using the gospel of tempo-free basketball stats, the Four Factors. Over that time, the model has gone a perfect six for six in predicting 11 through 14 seeds that became Cinderellas.
While I am fairly sure that this perfect record will come to an end sooner rather than later, I still think the model has value. I ran an improved model on this year’s bracket to try to help you gain an edge on your co-workers in your office pool. Without further ado, the predicted probabilities:
As you can see, the model is predicting only two upsets this year, both 11 seeds over 6 seeds. The model likes Minnesota over UCLA, and whoever emerges from the Middle Tennessee State-Saint Mary’s First Four clash over Memphis.
Minnesota may be slightly overrated because of their extremely tough schedule, but the Golden Gophers are a very good rebounding team playing a very weak six seed in UCLA. MTSU has a great turnover margin, Saint Mary’s cleans up on the glass, and Memphis has played a very weak schedule for a six seed.
In building this model, I’ve used a dataset of every 3-14, 4-13, 5-12, and 6-11 matchup from the last ten NCAA Tournaments. Last year’s post has details of the specific model inputs. The only addition this year is a measure of teams’ consistency.
Note that I am not predicting that there will be only two upsets by Friday night. Rather, the Upset Model is intended to be conservative: over the last 10 years, using out-of-sample testing, the model has predicted 25 double-digit seeds to pull upsets, and 22 of them succeeded, yielding a false positive rate of under four percent.
Over that same time period, there have been a total of 40 double-digit seed upsets. Clearly, some of the teams that my model does not make outright favorites to pull upsets will, in fact, win. That is part of the beauty of March.
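As a rough check on those figures, here is a minimal sketch of the arithmetic. It assumes the ten-year dataset holds 160 games (four seed pairings, four regions, ten tournaments) and that all 40 upsets fall inside that dataset; neither number is stated outright above, and the function name is purely illustrative.

```python
def upset_model_metrics(predicted, correct, total_upsets, total_games):
    """Summarize upset picks as precision, recall, and false positive rate."""
    false_positives = predicted - correct      # picked an upset, favorite won
    non_upsets = total_games - total_upsets    # games the favorite actually won
    return {
        "precision": correct / predicted,
        "recall": correct / total_upsets,
        "false_positive_rate": false_positives / non_upsets,
    }

# ASSUMPTION: 4 seed pairings x 4 regions x 10 tournaments = 160 games,
# and all 40 upsets mentioned in the post belong to those games.
TOTAL_GAMES = 4 * 4 * 10

metrics = upset_model_metrics(predicted=25, correct=22,
                              total_upsets=40, total_games=TOTAL_GAMES)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Under those assumptions, the three misses are measured against 120 non-upsets, which gives 2.5% and is consistent with "under four percent." The same counts mean 88% of the model's upset calls were right, while it flagged 55% of the upsets that actually happened.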
A note for the Harvard fans reading this: one of the weaknesses of the model is that it underrates moderately low probability outcomes. The Crimson certainly are not the favorite in their matchup with New Mexico, but they likely have a far greater chance than 4.6% of pulling the upset.
I’ve been waiting for this post with anticipation for a day now. Awesome work. Are you going to make a post with the survival method?
And if it weren’t for Butler’s coaching I’d pick Bucknell; Muscala cleans the glass and is a real presence inside. Also, are Davidson’s odds particularly high historically for a 14 seed? I saw they are +3.5, which seems like quite a low spread for a 14 seed.
Wondering if these picks (last two years) were true upsets against the point spread (or a power rating system such as Pomeroy)? E.g., Minnesota is the distinct point spread favorite this year despite being the worse seed. The only teams in your table that differ significantly from the betting odds are a few that your model says are being overvalued by the market:
Oregon
Belmont
Davidson
New Mexico St.
Iona
Thanks for this. You shared your probit coefficients last year. Will you be sharing them again this year to include the new team consistency variable?
This is great. How exactly can I use this chart to place bets??
I’m with wubr2000. Making a couple bucks in Vegas could be fun. I wonder if they take parlay bets? One could parlay the two teams that are predicted to upset their opponents, or do various combinations among several teams with the best chance of pulling an upset. My only caveat would be I live in Minnesota and can attest to how bad the Gophers are right now. The guards are not confident, team leadership is lacking, and they are a poorly coached team.
Doug, the probabilities John lists pretty closely match those offered in Vegas (except for the extreme underdogs, which he says are not reliable). So there would be very few bets to make even if you had the utmost confidence in the model. I have nothing against looking at this kind of analysis, but I think in order to evaluate whether it has any true predictive merit, you’d have to grade it against the spread or the money line. That is, you’d want to know if the model adds any predictive ability that is not already reflected in the betting public’s perceptions.
Well, Minnesota opening as a 3 point favorite doesn’t help anyone’s ML or parlay bets much…
In previous years the 15 seeds haven’t been listed. Is this (relatively speaking) a stronger batch of 15 seeds compared to the last two years?
Will you be posting your final team rankings going into the tournament using your survival method as you did last year? That was a great tool to use to assist in picks and I would love to see it again this year!
PLM:
Look at the model picks from the past two years and the money lines on those games. There were very big differences in the past.
I am not sure why this year hews closer to the Vegas lines. Maybe the market now realizes the value of NCAA Tournament-specific prediction models?
John, will you be coming out with a full bracket like you did last year?
Will there be full rankings released?
John, could you please send me step by step instructions so I could try to replicate your results? Thank you for your time.
Compared to implied win probabilities of Vegas moneylines, Minn, St Marys, Miss, and Pacific offer value.
Team      Harvard  Vegas
Minn      70%      60%
St Marys  53%      36%
Cal       40%      43%
Bucknell  37%      39%
Miss      31%      30%
Oregon    29%      47%
Belmont   23%      36%
Davidson  20%      41%
Pacific   14%      13%
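The comparison in the table above can be sketched in a few lines. The percentages are copied from the table; the `implied_probability` helper is a hypothetical name for the standard conversion from an American moneyline to an implied win probability, and it does not strip out the bookmaker's margin. The -150 line at the end is an illustrative value, not a quoted price.

```python
def implied_probability(moneyline):
    """Convert an American moneyline to an implied win probability (vig included)."""
    if moneyline < 0:
        # Favorite: risk |line| to win 100.
        return -moneyline / (-moneyline + 100)
    # Underdog: risk 100 to win the line.
    return 100 / (moneyline + 100)

# Percentages copied from the table above: (model, Vegas-implied).
table = {
    "Minn": (70, 60), "St Marys": (53, 36), "Cal": (40, 43),
    "Bucknell": (37, 39), "Miss": (31, 30), "Oregon": (29, 47),
    "Belmont": (23, 36), "Davidson": (20, 41), "Pacific": (14, 13),
}

# A team "offers value" here when the model rates it above the market.
value_teams = [team for team, (model, vegas) in table.items() if model > vegas]
print(value_teams)

# Hypothetical example line, not taken from the table.
print(f"{implied_probability(-150):.0%}")
```

This flags Minn, St Marys, Miss, and Pacific, the same four teams named above. Note that the gaps for Miss and Pacific are a single percentage point, which is well within the noise you would expect from the bookmaker's margin alone.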
John,
I know this is a few months ahead of schedule, but for this year’s March Madness Tournament, will you be able to release a full bracket a few days before the tourney starts, as you did in past years? I think you did release one last year, but it came out a little late for the pool I was in, and I was unable to use it to help me make some picks. If not, it is understandable; I just wanted to throw this out there ahead of time. I look forward to checking out your picks and your upsets whenever they become available. I appreciate all the hard work and research you put in to make this a valuable tool!