March Madness for (Statistically Inclined) Dummies

Author note: This article originally appeared on the old HSAC blog around this time last year, back when I was a wee freshman (I am now a no-less-wee sophomore). But because that blog has been lost to the ages, and because this piece is team-independent, I figured some people might be interested in it. If you’d like to follow its advice very easily, CBS Sportsline now has an autofill option called “Historical Random” that does essentially what I am suggesting.

By David Roher

March Madness is probably the only time when a large number of people have a rooting interest in probability in sports. By following a statistical framework and mixing in your own judgment about upsets (or just random guessing), you’ll have a much better chance of filling in a pretty good bracket.

From 1985 to 2008, there have been 1536 teams in the tourney, or 96 teams of each seed. Below, based on all this data, I’ve compiled what each round looks like on average in terms of the number of teams left of each seed. The more closely your bracket resembles this framework, the more the historical performance suggests that you’ll have success. Once you know how many teams of each seed should advance, pick which ones they are. For example, if by a certain round there are expected to be 2.5 #7 seeds left, you might want to rule out one 7-seeded team for sure, include two of them for sure, and then base whether or not you would pick the remaining one on the context.
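To make the arithmetic behind those counts concrete, here is a minimal Python sketch. It assumes the 1985 to 2008 span is inclusive (24 tournaments) with 64 teams and four of each seed per year; the function name `advance_rate` is just an illustrative label.

```python
# Sketch: relate "average teams left per round" to raw counts.
# Assumes 24 tournaments (1985-2008 inclusive), 64 teams each, 4 per seed.
FIRST_YEAR, LAST_YEAR = 1985, 2008
TEAMS_PER_TOURNAMENT = 64
SEEDS = 16

tournaments = LAST_YEAR - FIRST_YEAR + 1            # 24
total_teams = tournaments * TEAMS_PER_TOURNAMENT    # 1536
teams_per_seed = total_teams // SEEDS               # 96


def advance_rate(avg_left: float) -> float:
    """Fraction of a seed's teams that reached a round, given the average left."""
    teams_per_year = teams_per_seed / tournaments   # 4 teams of each seed per year
    return avg_left / teams_per_year


# The 2.5 average used for #7 seeds in the example above:
print(total_teams, teams_per_seed)      # 1536 96
print(round(2.5 * tournaments))         # 60 of the 96 teams
print(advance_rate(2.5))                # 0.625
```

In other words, an average of 2.5 teams left means roughly 60 of the 96 teams of that seed got that far, or 62.5%.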

How would you know which specific teams to pick? Use your own thoughts, the experts, or just random guessing. The last one is especially fun if you want to annoy people who follow college basketball religiously.

Keep in mind that for a small group, picking the favorites is still the way to go, with little to no randomization. And obviously, with so much going on in the tourney, this approach comes nowhere near guaranteeing that you’ll have the best bracket among your friends even 50% of the time. And if you’re trying to get the best bracket in the country, screw any advice and just pick random upsets.

If you do follow this framework, please tell me how it worked out for you. There are some more advanced tips after the data:

_______________________________________________________

(Round of 64 has four of each seed)

Round of 32:
#1 seeds: 4.00
#2 seeds: 3.83
#3: 3.38
#4: 3.17
#5: 2.71
#6: 2.75 (not a typo, 6 seeds have done better than 5 seeds)
#7: 2.50
#8: 1.83
#9: 2.17 (again, 9 seeds have done better than 8 seeds)
#10: 1.50
#11: 1.25
#12: 1.29
#13: .833
#14: .625
#15: .167
#16: 0

Sweet Sixteen:
#1: 3.50
#2: 2.50
#3: 2.00
#4: 1.71
#5: 1.46
#6: 1.50
#7: .750
#8: .375 (Low because always has to beat the #1 seed to get here)
#9: .125 (Ditto)
#10: .750
#11: .458
#12: .667
#13: .167
#14: .0416
#15: 0
#16: 0

Elite Eight:
#1: 2.88
#2: 1.83
#3: .958
#4: .583
#5: .208
#6: .500
#7: .250
#8: .250
#9: .0416
#10: .292
#11: .167
#12: .0416
#13: 0
#14: 0
#15: 0
#16: 0

Final Four:
#1: 1.75
#2: .875
#3: .500
#4: .375
#5: .167
#6: .125
#7: 0
#8: .125
#9: 0
#10: 0
#11: .0832
#12: 0
#13: 0
#14: 0
#15: 0
#16: 0

Championship Game: Didn’t find data
Champion: Didn’t find data
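
A quick consistency check you can run on any of the tables above: exactly 16 teams play in the Sweet Sixteen every year, so the per-seed averages for that round should add up to 16, give or take rounding. Here is a minimal Python sketch; the function names and the tolerance are arbitrary choices for illustration.

```python
# Sketch: the averages for a round should add up to the number of teams in it.
# Values copied from the Sweet Sixteen table above, seeds 1 through 16.
sweet_sixteen = [3.50, 2.50, 2.00, 1.71, 1.46, 1.50, .750, .375,
                 .125, .750, .458, .667, .167, .0416, 0, 0]


def round_total(averages):
    """Sum of the per-seed averages for one round."""
    return sum(averages)


def is_consistent(averages, teams_in_round, tolerance=0.05):
    """True if the averages sum to the round size, allowing for rounding error."""
    return abs(round_total(averages) - teams_in_round) <= tolerance


print(round(round_total(sweet_sixteen), 2))   # 16.0
print(is_consistent(sweet_sixteen, 16))       # True
```

The same check works for the other rounds with 32, 8, and 4 as the expected totals.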

_______________________________________________________

More complicated stuff:
–       There are a lot of ways to work in the decimal part of the average besides just rounding to the nearest whole number. One interesting way is to go to http://www.random.org and use the number generator on the right with a min of zero and a max of 99. If the decimal part of the average, read as a two-digit number, is greater than the number generated, round up. Otherwise, round down. Here’s an example: you don’t know whether to send one or two #4 seeds to the Sweet Sixteen. The average is 1.71. If the random number is less than 71, pick two teams. If it’s 71 or greater, pick only one team. Because the generator runs from 0 to 99, exactly 71 of the 100 possible numbers send the second team through, which matches the .71. This is a really easy way to give your bracket a number of upsets that tracks the historical averages.

–       If you fill in your bracket from the best seeds and work your way down, take into account the seeding of each actual matchup versus the expected one. For example, let’s say in the first round you had a 13 seed upset a 4 seed and now that 13 seed is facing a 5 seed in the round of 32. The 5 seed’s chances of winning just got better. Even if this appears to screw up your closeness to the historical averages, it really doesn’t, as 5 seeds historically beat 13 seeds more often than they beat 4 seeds. This isn’t true for every seed combo, though – the difference between 7 and 10 seeds seems to be negligible once you get past the first round.

–       If you fill in your bracket from the worst seeds and work your way up, keep in mind that the data just refer to the average number of teams of a seed left in each round, not a given team’s chance of getting to that round if it has already pulled off an upset. For instance, if you have a 9 seed knock off a 1 seed in the Round of 32, its chances of also winning in the Sweet Sixteen aren’t nearly as bad as the .0416 figure (the average number of #9 seeds in the Elite Eight) would suggest.

–       If you want to figure in actual knowledge of college basketball in a more prominent way, you could forget this framework altogether and assign probabilities to each game. Then use the random number generator in the same way as the first tip – for example, if a team has a 60% chance of winning, have them advance if 60 is greater than the number generated and have the other team advance otherwise. This is a particularly good option if you’re filling out multiple brackets, as each of your brackets would essentially be a separate random simulation of the tournament based on as much of what you know as possible. Since you probably just want at least one to work out, the random factor would insure you against a wrong pick in one bracket because you might have had it right in another.
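
If you’d rather not click through random.org by hand, the tips above can be sketched in a few lines of Python. This is only a rough illustration under some assumptions: Python’s built-in `random` module stands in for the random.org generator, the function names are made up for this example, and the 60% favorite is a placeholder rather than a real team’s odds. The conditional figure for #9 seeds is just the ratio of two table averages, and it rests on a handful of teams, so treat it as a ballpark.

```python
import random

# Sketch of the random-number tips above. Function names are illustrative,
# and Python's random module stands in for the random.org generator.


def round_randomly(average, rng):
    """Round up with probability equal to the decimal part of the average."""
    whole = int(average)
    hundredths = round((average - whole) * 100)   # 1.71 -> 71
    draw = rng.randint(0, 99)                     # same range as the tip: 0..99
    return whole + 1 if draw < hundredths else whole


def conditional_rate(avg_this_round, avg_next_round):
    """Rough share of a seed's teams that advance again, given they got this far."""
    return avg_next_round / avg_this_round


def pick_winner(team_a, team_b, pct_a, rng):
    """Advance team_a if its win percentage beats a 0..99 draw, else team_b."""
    return team_a if rng.randint(0, 99) < pct_a else team_b


rng = random.Random(2009)  # fixed seed only so the sketch is repeatable

# How many #4 seeds to send to the Sweet Sixteen (average 1.71): 1 or 2.
print(round_randomly(1.71, rng))

# #9 seeds: .125 per year reach the Sweet Sixteen, .0416 the Elite Eight.
print(round(conditional_rate(.125, .0416), 2))    # 0.33

# One simulated game with a made-up 60% favorite; repeat for each bracket.
print(pick_winner("Favorite", "Underdog", 60, rng))
```

So a #9 seed that has already reached the Sweet Sixteen has historically gone on about a third of the time, which is a lot better than .0416 looks on its own.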

11 Comments

      • Depending upon what your definition of “better” is, Dave, you’re absolutely right. I’d never seen Poologic before, that’s some good stuff (one complaint). But 2 things:

        – I suggest using experts’ probabilities in the “more complicated stuff.”

        – More importantly: what is your definition of better? It seems to me like Poologic’s strategy works great for contests with a low number of brackets, because it’s great at maximizing expected value. But that’s not what I’m trying to do. I’m trying to maximize the probability that someone will win a large competition. How would you suppose I go about that when all 50 people in my group are using Poologic’s strategies? The best thing to do is pick a “wrong” bracket, picking things that probably won’t happen. The people who pick the best brackets in the country in fact usually pick terrible brackets.

        In our HSAC group, everyone’s going to be using this stuff. If I go along with them, I’ll score more points on average, but my chances of winning will be extremely low. If Duke or Kansas wins, so many brackets will have them that it will come down to picking a large number of other matchups correctly, even in a medium-sized group. There’s a lot of game theory involved.

        *…waits for everyone in the HSAC pool to respond by picking Kentucky, so I can pick Duke and win easily*

        • I agree that the bracket with the highest probability of winning a pool is highly dependent on the size and IQ of the pool. My point is mainly that trying to match historical seed quotas isn’t particularly compelling to me given how much good info there is on the specific teams out there. E.g. I’m not going to pick a certain number of 5/12 upsets, I’m going to let the 5/12 matchups themselves dictate my picks.

          • Yeah, that makes a lot of sense. I still think it’s nice to have a team-independent model, but I can’t argue with the quality of the Pomeroy/Sagarin stuff. Still, though, if I’m in a group with 100 Dave Ks intent on really looking at the 5-12 matchups, I’ll feel compelled to do something different.

          • I love the site, Tom – been playing around with it since Dave posted the link. The ROI definitely solves the problem provided you can come up with a decent estimation of who people pick. What would you suggest when you suspect there is a significant percentage of people using Poologic in a pool?

  • Maybe you could use that knowledge in the ROI calculator. This year:

    1. Increase the estimate of opponents picking Duke and the other teams that are high in the ROI Calculator. And decrease the estimate for other teams like Kansas to compensate.

    2. Decrease the value of an EPM sheet by the percentage of Poologic users.

    Of course, as the value of an EPM sheet approaches 1 the market becomes efficient and you don’t have an edge. Might want to go back to just having fun filling out the bracket!

    • Fascinating stuff. That last suggestion is probably the way to go, anyway. My main regret is not knowing about your site earlier — that’s probably a compliment to your work, in that the people who find out about it know they better keep it a secret.

      This type of game-theoretic decision making is really interesting to me. It would be fun to create a model to predict who other people would predict, even going as far as including demographic and educational information.

  • Come to think of it, the value of an EPM sheet might approach 1 faster than that. One opponent using Poologic might make it go half the way to 1, since you will be expected to split the pot with that player.

    So the market probably gets efficient fast.
