Survival of the Fittest: A New Model for NCAA Tournament Prediction

(Editor’s note: You can see the model’s 2012 1-68 rankings here and full bracket here)

Every year, millions of Americans tune in to watch the NCAA Men’s Basketball Tournament, colloquially known as March Madness. And every year, millions participate in bracket pools, where they attempt to outwit their peers by picking more of the 63 tournament games correctly. Quite a few very good models for predicting NCAA tournament results are available, but every publicly available approach, including Las Vegas casino futures markets, leaves significant room for improvement. Other prognosticators lean on traits like experience, confidence, and performance under duress that are much harder to quantify. Of course, individual basketball games can often be determined by chance events, so no system will be perfect, but network analysis provides promising tools for quantifying some “intangible” traits.

What all of these prediction systems have in common is that they treat NCAA tournament games as similar to regular season games, despite ample theoretical and anecdotal evidence that the two are quite different. NCAA tournament games carry far more pressure and attention and, for quite a few schools, are played in far bigger arenas in front of far bigger crowds than any regular season game they will play. This evidence suggests that rating systems that predict the NCAA tournament as if it were a collection of regular season games can be improved.

2006 NCAA Tournament Network

The purpose of this paper is to create a model that will rank NCAA tournament teams—and thus provide a basis for predicting the NCAA tournament—that uses network characteristics, principally a measure of degree centrality. The network characteristics in the model will serve to quantify traits that specifically apply to the tournament and the other teams in it, which form a network as they play each other over the course of the season. Ideally, this model will perform better in pseudo-out of sample testing than other available prediction systems. To my knowledge, this is the first attempt of its kind.

Research Design:

For each NCAA tournament from 2004 until 2011, I took schedule data from the entire season for every team in the field. I created an adjacency matrix in which the a_ij entry = 1 if Team_i beat Team_j, and 0 otherwise (that is, if Team_i lost to Team_j or the two did not play). This matrix describes the directed network of the NCAA tournament teams from that season, where the nodes are teams and the links are games played between two teams, each pointing from the winner to the loser. Thus a team’s out-degree represents the number of other NCAA tournament teams it defeated during the season, and its in-degree represents the number of other NCAA tournament teams it was defeated by during the season. The 2006 network is shown in the diagram above.
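The construction above can be sketched in a few lines of Python. This is a minimal illustration and not the author's code: the function names, the `(winner, loser)` input format, and the toy results are all assumptions, and the sketch ignores the extra weight the final model gives to road and neutral site wins.

```python
def build_win_matrix(teams, results):
    """Return a where a[i][j] = 1 if teams[i] beat teams[j], else 0."""
    index = {team: k for k, team in enumerate(teams)}
    n = len(teams)
    a = [[0] * n for _ in range(n)]
    for winner, loser in results:
        # Games involving a team outside the tournament field are not
        # part of the network, so they are skipped.
        if winner in index and loser in index:
            a[index[winner]][index[loser]] = 1
    return a


def out_degree(a, i):
    """Number of tournament teams that team i defeated (row sum)."""
    return sum(a[i])


def in_degree(a, i):
    """Number of tournament teams that defeated team i (column sum)."""
    return sum(row[i] for row in a)


# Toy season: "X" is a hypothetical non-tournament team.
teams = ["A", "B", "C", "D"]
results = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "X")]
a = build_win_matrix(teams, results)
```

Here `out_degree(a, 0)` counts Team A's wins over the field and `in_degree(a, 0)` counts its losses to the field.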

The NCAA tournament is a different arena, with much brighter lights and perhaps different determinants of success. Analyzing the NCAA tournament network of games played between NCAA teams can provide insights into how teams within the Tournament are affected by their history of interactions with top-tier teams. More subtle psychological factors, like confidence and performance under pressure, can be measured via a network approach.

Imagine this hypothetical: Team A has played a very tough schedule, facing thirteen teams that are in the NCAA tournament field and beating seven of them. Team B has played an easier schedule, only having played three Tournament teams, but it has defeated all of them. Through the vagaries of the season, and through games against non-tournament opponents, Team A and Team B have very similar Pythag ratings and consistency metrics. But Team A and Team B inhabit very different parts of the NCAA tournament network. General statistical models might predict Team A and Team B to have equal tournament success, but we might hypothesize that Team A will do better because they have confidence from having played and defeated quite a few NCAA tournament teams.

Any model needs to have some statistics to control for team strength. I believe Ken Pomeroy’s excellent Pythagorean rankings, which can be found at KenPom.com, are the best measures of team strength. As such, their constituent parts, Adjusted Offensive Rating and Adjusted Defensive Rating, form the backbone of my control variables in the NCAA tournament prediction model.

Another important factor to take into account is Consistency. Consistency is a measure of the variance in a team’s final scoring margin from game to game. For example, winning by five points counts as +5 and losing by twenty as -20. Under the assumption that the teams being analyzed are winning teams (fair for the NCAA tournament bracket), this variance of performance can be used as a measure of a team’s in-season strength: there are no “consistently bad” teams in the NCAA tournament.
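As a rough illustration, Consistency could be computed from a list of final margins as follows. The article does not say exactly which statistic is used, so the choice of population variance here is an assumption, as are the function name and the sample margins.

```python
from statistics import pvariance


def consistency(margins):
    """Variance of a team's final scoring margins (assumed definition).

    A larger value means the team's results were more spread out,
    i.e. the team was less consistent.
    """
    return pvariance(margins)


# Hypothetical four-game sample: win by 5, lose by 20, win by 10, win by 5.
sample_margins = [5, -20, 10, 5]
sample_consistency = consistency(sample_margins)
```

A team that wins every game by the same margin would score zero on this measure.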

Strength of Schedule (SOS) also needs to be included in any predictive model. SOS is an index of the strength of opponents played by the team in-season. Additionally, because variables like seed and strength of schedule are correlated with other predictors, I included (and tested) interaction terms between variables.

Another “intangible” variable often cited by experts as important for NCAA tournament success is experience. To quantify experience, specifically experience in the NCAA tournament, I created a dataset of minutes played at an individual level for every tournament team from 2003 until 2011, and aggregated them at the team level to create Returning Minutes Percentage for each team.

Returning Minutes % for Team I was then multiplied by the number of NCAA tournament games Team I had played in the previous year. I chose games rather than wins because there should be some credit given to teams who simply make the NCAA tournament and get the experience of playing in it, even if they do not win a game. Sensibly, this makes the experience term proportional to both past NCAA tournament success and the percentage of players returning who contributed to that success.
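The experience term described above could be computed along these lines. This is a sketch under assumptions: the article does not spell out the denominator, so Returning Minutes Percentage is taken here to be the share of the previous season's total minutes played by returning players, and all names and numbers are illustrative.

```python
def returning_minutes_pct(prev_minutes, returning_players):
    """Share of last season's minutes played by players who return."""
    total = sum(prev_minutes.values())
    if total == 0:
        return 0.0
    returning = sum(
        minutes
        for player, minutes in prev_minutes.items()
        if player in returning_players
    )
    return returning / total


def experience(prev_minutes, returning_players, prev_tournament_games):
    """Returning Minutes % times tournament games played the year before."""
    pct = returning_minutes_pct(prev_minutes, returning_players)
    return pct * prev_tournament_games


# Hypothetical roster: two of four players return, the team played
# three tournament games last year.
prev_minutes = {"p1": 400, "p2": 300, "p3": 200, "p4": 100}
team_experience = experience(prev_minutes, {"p1", "p3"}, 3)
```

A team that missed the previous tournament gets an experience value of zero regardless of how many minutes return.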

Survival Analysis Model:

A major challenge in predicting the NCAA tournament is specifying an appropriate model. The best way to judge NCAA tournament success is how many games a team wins in the tournament. The goal is to win six games and be champion; the closer a team gets to that goal, the better they have performed. Thus tournament wins should be the dependent variable in any predictive model.

But NCAA tournament wins is an ordinal discrete variable that takes on values in {0, 1, ..., 6}. This type of dependent variable violates the assumptions of normality and continuity that underlie ordinary least squares (OLS) regression, making OLS an inappropriate model specification. Additionally, the vast majority of observations would be teams that lost early, potentially drowning out the signal of the teams that survive. In statistical terms, the distribution of the dependent variable is heavily skewed rather than normal.

To solve these problems, I borrow a concept from sociology known as time-to-event analysis (henceforth called survival analysis). Survival analysis is the name for a class of models that deal with time-to-event data and generally attempt to measure time to failure in a system. In this case, we will treat the NCAA tournament as the system, use each round as a time step, and treat losing, and thus falling out of the tournament, as “failure.”
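To make the framing concrete, here is one way a team's tournament win total could be recoded as a survival record. The coding is an assumption rather than something the article specifies: duration is the round in which the team's run ended, and the champion, who never fails, is treated as censored after round six.

```python
def to_survival_record(wins):
    """Convert tournament wins (0-6) to a (duration, event) pair.

    duration: the round in which the team's run ended.
    event:    1 if the team lost ("failed"), 0 if it was never
              eliminated (the champion, treated as censored).
    """
    if not 0 <= wins <= 6:
        raise ValueError("wins must be between 0 and 6")
    if wins == 6:
        return 6, 0
    # A team with k wins lost its (k + 1)-th game.
    return wins + 1, 1


# Records for every possible win total, 0 through 6.
records = [to_survival_record(w) for w in range(7)]
```

Under this coding the runner-up and the champion both reach round six, but only the runner-up registers a failure.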

I chose the Cox Proportional Hazards model, as it is the most general: it is semi-parametric, built around a baseline hazard function for the population whose form is left unspecified. The survival analysis model solves the major concern of not being able to capture the signal of teams that succeed, as it recognizes the additional length of time to failure for teams that win multiple games, and the difficulty of attaining that success, and thus generates coefficients that reflect that success. This model, which to my knowledge has never before been applied to the NCAA tournament, is superior to the others I considered because it best addresses the particular and unique challenges involved in estimating NCAA tournament success.
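For reference, the standard form of the Cox model is written out below. The notation is mine, not the article's: h is the hazard of elimination for team i in round t, h_0 is the baseline hazard, x_i is the team's covariate vector, and beta is the coefficient vector.

```latex
h(t \mid x_i) = h_0(t)\,\exp\!\left(\beta^{\top} x_i\right)
\qquad\Longrightarrow\qquad
\frac{h(t \mid x_i)}{h(t \mid x_j)} = \exp\!\left(\beta^{\top}(x_i - x_j)\right)
```

Because the baseline hazard cancels in the ratio, two teams can be compared through their covariates alone, which is why the baseline never needs to be specified.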

Analysis:

I) Model Fit:

The Cox proportional hazards model was fit in STATA, and the final model fit is summarized in Table 1. The choice of which measure of centrality to use was determined by model fit. The best fit was found using simple out-degree centrality, modified to give increased weight to road and neutral site wins. Out-degree was the only measure of centrality that was significant on its own and significant when interacted with the Experience term.

Table 1: The specified Cox Proportional Hazard Model:

As the table shows, all coefficients were significant at the 10 percent level or better. The coefficients of the Cox Proportional Hazards model can be interpreted as increasing or decreasing the risk of failure, holding all else constant. For instance, increasing Offensive Rating by one unit, which corresponds to increasing a team’s scoring by one point per hundred possessions, decreases the hazard of failure by 6.6 percent, relative to the baseline.
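The link between a Cox coefficient and a percentage change in hazard is exp(beta) - 1. The sketch below shows that arithmetic. The coefficient is not taken from Table 1; it is back-calculated from the 6.6 percent figure quoted above, and the function name is illustrative.

```python
import math


def pct_change_in_hazard(beta, delta=1.0):
    """Percent change in hazard when a covariate rises by `delta` units."""
    return (math.exp(beta * delta) - 1.0) * 100.0


# Coefficient implied by a 6.6 percent drop in hazard per unit
# (back-calculated for illustration, not read from the fitted model).
beta_offense = math.log(1.0 - 0.066)

one_point = pct_change_in_hazard(beta_offense)        # about -6.6
two_points = pct_change_in_hazard(beta_offense, 2.0)  # about -12.8
```

Note that effects compound multiplicatively: two extra points per hundred possessions cut the hazard by somewhat less than twice 6.6 percent.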

The network portion of the model is the out-degree term, and the interaction with Experience. A likelihood ratio test was used in STATA to determine whether this model, with the interaction term of Experience and out-degree, was significantly better than the nested model without the interaction. This test was significant at the 5 percent level (p=0.003), and confirmed that the interaction term is a significant predictor of success.
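A test comparing a model with and without one extra term can be sketched with the standard library alone. With a single added parameter, twice the gain in log-likelihood is compared to a chi-square distribution with one degree of freedom, whose upper tail equals erfc(sqrt(x / 2)). The log-likelihood values below are hypothetical, chosen only so that the result lands near the reported p-value; they are not the fitted model's values.

```python
import math


def likelihood_ratio_test(loglik_nested, loglik_full):
    """Return (statistic, p_value) for one added parameter (df = 1)."""
    stat = 2.0 * (loglik_full - loglik_nested)
    # Upper tail of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods for the nested and full models.
stat, p_value = likelihood_ratio_test(-250.0, -245.6)
```

A small p-value says the interaction term improves fit by more than chance alone would explain.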

Interestingly, the coefficient on the interaction term is positive, implying that there might be decreasing returns to experience and in-season centrality. The coefficients on the individual terms are negative, showing that, holding all else constant, more in-season wins over other NCAA tournament teams and more NCAA tournament experience each decrease the hazard of NCAA failure. When the two are increased together, however, the combined effect is still net positive for NCAA tournament success, but it is smaller than the sum of the two individual effects.

Fifty percent of the NCAA tournament teams have zero NCAA tournament experience from the previous year. For those teams, increasing out-degree by winning more games against NCAA tournament teams is strictly positive in increasing NCAA success expectancy. For teams that do have NCAA experience, however, increasing these factors in concert yields diminishing returns to the probability of survival.
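The diminishing returns argument can be written out explicitly. The symbols are my own shorthand: D is out-degree, E is the experience term, and the betas are their coefficients, with only the signs taken from the discussion above.

```latex
\log\frac{h(t \mid x)}{h_0(t)} = \beta_D D + \beta_E E + \beta_{DE}\, D E + \dots,
\qquad \beta_D < 0,\quad \beta_E < 0,\quad \beta_{DE} > 0

\frac{\partial}{\partial D}\,\log\frac{h(t \mid x)}{h_0(t)} = \beta_D + \beta_{DE}\, E
```

For a team with E = 0, the marginal effect of one more win over a tournament team is the full beta_D. For a team with E > 0, the positive interaction coefficient offsets part of that reduction in hazard, so each additional win helps less.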

Regardless of the diminishing returns finding, it is clear that network centrality is an important predictor of NCAA tournament success. The interaction with Experience is based on strong theoretical grounds, and is significant in the model. Even when controlling for statistical team strength, this network component is a strong predictor of NCAA tournament success.

The control variables also deserve some mention. Increasing Offensive Rating and decreasing Defensive Rating (becoming more efficient on defense) both lower the hazard of failure. The natural log transformation of Strength of Schedule is the best fit, illustrating decreasing marginal returns to schedule strength. Interestingly, Consistency is significant at the five percent level and positively associated with risk of failure. Because Consistency is measured as the variance of a team’s scoring margin, a higher value means a less consistent team, so this supports the hypothesis that more consistent teams perform better in the NCAA tournament.

II) Out of Sample Testing:

The ultimate judge of model fit, however, is how it performs in out of sample testing. The goal of this prediction model should be to predict the NCAA tournament as correctly as possible. This prediction needs to be judged on a relative basis—relative, that is, to other available prediction systems. To test the model out of sample, I simply removed one year’s worth of teams from the dataset, estimated the model, then estimated a ranking based on the covariate values for each team and the model coefficients.
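The hold-one-year-out procedure can be sketched as a simple generator. The row format (a dict with a `year` key) and the function name are assumptions made for illustration; the model fitting step itself is left out.

```python
def leave_one_year_out(rows):
    """Yield (held_out_year, training_rows, test_rows) for each season."""
    years = sorted({row["year"] for row in rows})
    for year in years:
        train = [row for row in rows if row["year"] != year]
        test = [row for row in rows if row["year"] == year]
        yield year, train, test


# Toy dataset with hypothetical teams.
rows = [
    {"year": 2007, "team": "A"},
    {"year": 2008, "team": "B"},
    {"year": 2008, "team": "C"},
]
splits = list(leave_one_year_out(rows))
```

Each split would be used by fitting the model on the training rows and then ranking only the teams in the held-out year.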

This ranking is known as the Prognostic Index, and represents the log of an individual team’s hazard of elimination relative to a baseline hazard function that the Cox model leaves unspecified; the lower a team’s hazard, the better its chance of surviving all six games and thus winning the tournament. Exponentiating the index (raising the mathematical constant e to that power) gives each team’s relative risk compared to the baseline hazard.
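In a Cox model the index for a team is the linear predictor, the sum of each coefficient times the team's covariate value, so a lower value means a lower estimated hazard of elimination. The sketch below uses made-up coefficients and covariates purely for illustration; none of the numbers come from the fitted model.

```python
import math

# Hypothetical coefficients, chosen only to illustrate the arithmetic.
COEFS = {"off_rating": -0.07, "def_rating": 0.06, "out_degree": -0.10}


def prognostic_index(covariates, coefs=COEFS):
    """Linear predictor: sum of coefficient times covariate value."""
    return sum(coefs[name] * covariates[name] for name in coefs)


def relative_hazard(covariates, coefs=COEFS):
    """exp(index): hazard relative to the unspecified baseline."""
    return math.exp(prognostic_index(covariates, coefs))


# Hypothetical teams.
teams = {
    "Team A": {"off_rating": 118.0, "def_rating": 92.0, "out_degree": 7},
    "Team B": {"off_rating": 112.0, "def_rating": 95.0, "out_degree": 3},
}

# Lowest hazard first, i.e. strongest predicted team first.
ranking = sorted(teams, key=lambda name: prognostic_index(teams[name]))
```

Only the ordering matters for filling out a bracket, so the unknown baseline hazard never needs to be estimated.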

This interpretation is nice in that it allows us to easily determine whom the Prognostic Index determines as the favorites in the tournament. I have used the Index to fill out a blank tournament bracket for each of the last five years.[1] To fill out the bracket and predict each game, I am constrained to using a very simple decision rule: if Team A is more highly ranked in the Prognostic Index than its opponent, Team B, I predict Team A to advance. This limitation is frustrating when two very evenly ranked teams meet each other, but in fact mirrors the real process of filling out a bracket. Even if you believe the true odds of either team winning the game are 50-50, a coin flip, you must pick one team to advance.
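The decision rule is simple enough to express directly. In this sketch, `rank` maps each team to its position in the index (1 is best) and `field` lists teams in bracket order so that adjacent entries meet in the first round; both structures and the team names are assumptions for illustration, and the field size must be a power of two.

```python
def pick_winner(team_a, team_b, rank):
    """Advance whichever team is ranked higher (smaller rank number)."""
    return team_a if rank[team_a] < rank[team_b] else team_b


def fill_bracket(field, rank):
    """Return the predicted winners of each round, first round first."""
    rounds = []
    current = list(field)
    while len(current) > 1:
        current = [
            pick_winner(current[k], current[k + 1], rank)
            for k in range(0, len(current), 2)
        ]
        rounds.append(current)
    return rounds


# Hypothetical four-team bracket: A plays B, C plays D.
field = ["A", "B", "C", "D"]
rank = {"A": 2, "B": 3, "C": 4, "D": 1}
predicted = fill_bracket(field, rank)
```

The rule is deterministic, so two nearly tied teams are treated exactly like a heavy favorite against an underdog.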

To compare my model’s out of sample performance, I chose two other prediction systems that I believe are among the best currently available. These are Ken Pomeroy’s Pythagorean Expectation model, which has been extensively discussed above, and TeamRankings’ model, which combines its own unique power rankings with public picking trends data to create a bracket that maximizes the chances of winning a bracket pool, and thus winning prize money or prestige. To score the brackets, I used the scoring system shared by ESPN.com and Yahoo! Sports. In this system, points for correct picks double in each round of the tournament. Thus first round correct selections are worth one point, and picking the national champion is worth 32 points. This may not be the best system, especially because it undervalues correctly picking upsets in the first two rounds, but it is the most common system, and so it is used to score the brackets here. The results of the out of sample testing are summarized in the table below:
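The doubling scoring system can be sketched as follows. The data layout is an assumption: `picks` and `actual` are lists with one set per round, each holding the teams predicted (or observed) to win a game in that round.

```python
ROUND_POINTS = [1, 2, 4, 8, 16, 32]   # points per correct pick, by round
GAMES_PER_ROUND = [32, 16, 8, 4, 2, 1]

# Best possible score for a 63-game bracket.
MAX_SCORE = sum(p * g for p, g in zip(ROUND_POINTS, GAMES_PER_ROUND))


def score_bracket(picks, actual):
    """Sum points for every team correctly picked to win in each round."""
    total = 0
    for points, picked, winners in zip(ROUND_POINTS, picks, actual):
        total += points * len(set(picked) & set(winners))
    return total


# Hypothetical two-round example: one correct first round pick.
example_score = score_bracket(
    [{"A", "D"}, {"D"}],
    [{"A", "C"}, {"A"}],
)
```

Every round is worth the same 32 points in total, which is why a wrong champion costs as much as missing the whole first round.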

Table 2: Out of Sample Performance, 2007-2011

As the table shows, the Network Model does better in out of sample prediction than either the TeamRankings model or the Pomeroy model. It predicts slightly more games correctly and earns significantly more points, illustrating that its strength is in predicting which teams will make the later rounds of the tournament (the Final Four and Championship Game). Indeed, the Network model predicted three of the five National Champions correctly.

One thing to note is the performance of all three models in the 2011 NCAA tournament. This tournament saw lower seeded teams (a 3 seed, a 4 seed, an 8 seed, and an 11 seed) make the Final Four. All three of the models performed well below their levels for the other four years in the sample, predicting fewer games correctly and garnering fewer points. It remains to be seen if 2011 was simply an anomaly, or an inflection point in NCAA tournament outcomes.

Conclusion:

Predicting NCAA tournament success is a subtler problem than simply identifying the “best team” from the regular season. The tournament’s single-elimination format makes results far more random than most other playoff formats, which use multiple-game series. This means that ranking systems that do a very good job of classifying team strength over the course of the regular season may not be the best tools for predicting postseason success.

This analysis shows that network analysis is a valuable tool for quantifying traits that lead specifically to NCAA tournament success. The position of a team in the network, as measured by degree centrality and when interacted with the level of a team’s previous NCAA tournament experience, is a significant positive predictor of NCAA tournament wins, controlling for team strength.

The strength of the model in the out of sample testing lends added credence to the importance of network analysis in predicting the tournament. What this model ultimately does is magnify the importance of games played against teams of the quality that a team will face in the tournament. The assumption that NCAA tournament games should be predicted in the same manner as regular season games does not seem to hold. This finding, which to my knowledge has no antecedent, should lead to a new interest in improving NCAA tournament prediction models, perhaps even leading to a whole new method for quantifying traits that predict NCAA tournament success.

[1] Obviously when I do this for multiple years, I use different models that do not have the data for that year included. For instance, the 2009 bracket will be filled out using a model that does not have the 2009 data, the 2008 bracket with a model that does not have the 2008 data, etc.

20 Comments

David Pinto (@StatsGuru) says:

March 14, 2012 at 8:28 am

So what does this model tell us about 2012?

David Pinto (@StatsGuru) says:

March 14, 2012 at 8:29 am

Sorry, see now that it was in the previous post.

John says:

March 14, 2012 at 11:37 am

Are you going to post the rankings in your Prognostic Index?

hhohw says:

March 14, 2012 at 3:28 pm

I look forward to further work on this topic, but is this truly out of sample for your model? I mean, had your model performed horribly out of sample, my guess is that you wouldn’t have published the article. I.e., this effect: http://marginalrevolution.com/marginalrevolution/2005/09/why_most_publis.html. I like the work you’ve done to be sure, but I just don’t think it’s quite fair to compare test results from it vs. those from models that actually existed prior to the test period.

- uoduckfan33 says:
  
  March 14, 2012 at 4:40 pm
  
  I really like the idea of incorporating the “intangibles” that aren’t so “intangible” any more. Though hhohw’s point is a common controversy….almost like looking at meta-studies and making claims about overarching p-values when so many studies were left unreported. Type II error is not often considered in collective study analysis. That article basically shows that with a type II error rate of 40%, only 75% of reportedly “true” studies are actually true (as in the alternative accurately reported as being true). But even a more acceptable type-II error rate of 5% still means that just 82.6% of reported studies are actually true.
  
  It will be interesting to see how this method does in 2012 versus Pomeroy and “TeamRankings.”
  
  - uoduckfan33 says:
    
    March 14, 2012 at 4:43 pm
    
    …that was assuming the 800/200 false/true alternative hypothesis ratio.
    
- John Ezekowitz says:
  
  March 14, 2012 at 6:43 pm
  
  That is a fair point, and I do think “past-the-post” significance is wrong and effect sizes are often overstated. I would, however, disagree that the testing I did was not out of sample. When the model “trains” itself on data that is missing a year, and then predicts that year as if it were the week before that Tournament, that is truly out of sample.
  
  You might argue that it is unfair to compare to prior systems, but if I had published this last year, I would have looked silly given the 2011 Tournament. I still would have published it, however. Maybe I’m just missing your claim?
  
uoduckfan33 says:

March 14, 2012 at 4:42 pm

I think it was HSAC that did an article last season about how 8 and 9 seeds get screwed relative to 10-12 seeds. Based on Pomeroy’s rankings and your results, I think Memphis really got hit hard with that 8 seed. A team that perhaps deserved a 5-7 seed now has to face Michigan St. in the second round.

tennismetrics says:

March 15, 2012 at 12:00 am

I noticed your bracket had Syracuse in the Elite 8. Does this take into account their loss of Fab Melo for the tournament? Would the way to adjust for his absence be removing his minutes played from the experience factor and his numbers that contribute to the offensive and defensive ratings?

If you do adjust for that loss, assuming you haven’t already, doesn’t it significantly affect Syracuse’s predicted success in the tournament?

Alex says:

March 15, 2012 at 12:09 am

I love your work, just wondering if you could update your analysis with tonight’s results of the Cal/USF play-in game. Do you predict Temple would beat USF? or Ohio?

don says:

March 15, 2012 at 1:49 pm

Based upon these interpretations the ’12 Final Four will consist of Kentucky vs Marquette and Ohio State vs Temple.

JoeyG says:

March 17, 2012 at 2:22 am

This model takes into account the players’ tournament experience, but does it take into account a coach’s tournament experience? Coaches like Tom Izzo and Billy Donovan have continued tournament success despite having different players on a year-in, year-out basis.

siftin thru nonsense says:

March 18, 2012 at 12:09 pm

Where are the results? How does this compare with LMRC (Bayesian)?? Let’s see the results? Did you remove them?

propanol31@aol.com says:

April 4, 2012 at 8:52 pm

Hey, Thanks for your work! I used your exact bracket for my pool which consisted of 40 participants. I finished in 1st place. In the past I was always near the bottom of the pool no matter how much research I did. Your model sounded like as good a way as any to pick teams so I just copied it. Please do this again next year so perhaps I can win again! 🙂
