By: Jake Levene and Zach Diamandis
Around this time every year, a familiar but always exciting debate reignites: who should be the NBA's MVP? There are familiar faces, such as LeBron James, who has finished in the top 10 of MVP voting every year since his second season in 2004-05, and exciting newcomers, like former Sixth Man of the Year winner James Harden. Since we are at the end of the season, it is fairly easy to determine a shortlist of candidates; however, we wanted to investigate, via regression analysis, how well MVP voting can be modeled. The hypothesis that motivated this question was the following: while quantitative performance statistics have some influence on the selection of the MVP, the most important factor in determining the winner in any given year is "narrative." Narrative is naturally hard to define, but by it we mean the qualitative, hard-to-model factors that often influence voters. A player making a big leap or receiving a lot of media attention can certainly sway opinion on the vote. So, because we are using only quantitative measures, the worse our regression performs, the stronger this hypothesis appears.
In order to test this hypothesis, we ran a series of regressions and judged them by the number of MVP winners they correctly predicted over the last 20 seasons. For each season, we used the top 5 MVP vote getters, ranked by MVP score: the weighted sum of a player's first-, second-, third-, fourth-, and fifth-place votes (a first-place vote counts as 10 points, a second-place vote as 7, a third-place vote as 5, a fourth-place vote as 3, and a fifth-place vote as 1). Our best model correctly predicted the MVP in 16 of the last 20 seasons.
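As a minimal sketch of the scoring rule and the evaluation criterion described above, the Python below computes an MVP score from ballot counts and counts how many seasons a set of predicted scores gets right. The function names, player labels, and all numbers are hypothetical placeholders, not our data or our original R code.

```python
# Points per ballot slot as described in the text: 1st=10, 2nd=7, 3rd=5, 4th=3, 5th=1.
VOTE_POINTS = (10, 7, 5, 3, 1)


def mvp_score(vote_counts):
    """Return the MVP score given counts of 1st- through 5th-place votes."""
    if len(vote_counts) != len(VOTE_POINTS):
        raise ValueError("expected five vote counts (1st through 5th place)")
    return sum(points * count for points, count in zip(VOTE_POINTS, vote_counts))


def predicted_winner(predicted_scores):
    """Return the player with the highest predicted MVP score in one season."""
    return max(predicted_scores, key=predicted_scores.get)


def seasons_correct(predictions, actual_winners):
    """Count the seasons in which the top predicted score belongs to the actual MVP."""
    return sum(
        predicted_winner(scores) == actual_winners[season]
        for season, scores in predictions.items()
    )


# Hypothetical ballot: 60 first-, 30 second-, 10 third-, 1 fourth-place vote.
example_score = mvp_score((60, 30, 10, 1, 0))

# Hypothetical predicted scores for two made-up seasons.
predictions = {
    "Season A": {"Player X": 900.0, "Player Y": 750.0},
    "Season B": {"Player Z": 640.0, "Player W": 610.0},
}
actual_winners = {"Season A": "Player X", "Season B": "Player W"}
hits = seasons_correct(predictions, actual_winners)
print(example_score, f"{hits} of {len(predictions)} seasons correct")
```

Judging a model by winners picked rather than by fit alone means a model can have sizable residuals and still score well, as long as it ranks the eventual MVP first within each season.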
Model Building Process:
To arrive at this final model, we tried a number of different approaches. Initially, we ran a rather large regression with 17 different predictors, which included several counting, shooting, and team-level statistics. We also included advanced individual performance metrics such as win shares, value over replacement player (VORP), and box plus/minus (BPM). This initial model suffered from collinearity, so we performed a backward stepwise regression, which resulted in the regression below. Another strategy we tried was to run several regressions using R's "allpossregs" command and select a model based on its Mallows' Cp. Our best model using this strategy predicts 14 of 20 winners correctly. Judged by predictive ability, adjusted R-squared, and residual standard error, this model is quite similar but slightly inferior to the model we ultimately used.
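To illustrate the all-possible-regressions idea, here is a rough pure-Python sketch that fits every subset of a few predictors by ordinary least squares and computes Mallows' Cp for each, using Cp = SSE_p / s^2 - n + 2p, where p counts the intercept and s^2 is the residual variance of the full model. It is not the R routine we used: the helper names, the three predictors, and every number in the toy data are assumptions made up for the example.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def sse(columns, y):
    """Residual sum of squares from an OLS fit of y on an intercept plus the columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bj * rj for bj, rj in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def mallows_cp(data, y):
    """Map every predictor subset to its Mallows' Cp."""
    names = sorted(data)
    n = len(y)
    full_p = len(names) + 1  # all predictors plus the intercept
    s2 = sse([data[v] for v in names], y) / (n - full_p)
    table = {}
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            p = size + 1
            table[subset] = sse([data[v] for v in subset], y) / s2 - n + 2 * p
    return table


# Made-up toy data: eight hypothetical player-seasons.
data = {
    "team_wins": [67, 60, 55, 52, 47, 58, 62, 50],
    "points": [30.1, 27.3, 25.4, 29.0, 31.6, 23.8, 26.5, 28.2],
    "assists": [6.7, 8.1, 5.0, 7.4, 10.4, 4.2, 11.0, 3.9],
}
score = [1200, 950, 610, 640, 700, 500, 980, 400]

cp_table = mallows_cp(data, score)
for subset, cp in sorted(cp_table.items(), key=lambda item: item[1]):
    print(subset or ("intercept only",), round(cp, 2))
```

By construction the full model's Cp equals its parameter count, so the statistic is only informative for the smaller subsets; a common rule of thumb prefers a small model whose Cp is low and close to p.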
Discussion of Final Model and Predictions:
Several conclusions from our final model merit comment. One salient point is the importance of team wins as a predictor of MVP score. Team wins proved to be by far the most important predictor, and that is undoubtedly part of the reason that our model did not correctly predict Russell Westbrook's 2017 MVP win. The Oklahoma City Thunder won 47 games that year, an unusually low total for an MVP's team. Also, many experts regarded Westbrook's performance last season as quantitatively inferior to James Harden's from a strictly basketball perspective. This suggests that narrative exerts influence, as Westbrook's triple-double average won the day despite Harden's more robust statistical performance. On the other hand, the model's largest predicted MVP score belonged to Stephen Curry's historic 2016 MVP campaign, in which he became the first unanimous MVP ever, putting together one of the greatest individual seasons in history with shooting splits of 50-45-90 (reached by only two other players ever) and breaking the records for most 3-pointers made in a season and most team wins.
Also worth a brief discussion are the four seasons for which the model predicted the wrong MVP winner. These were Russell Westbrook's aforementioned 2017 campaign, Dirk Nowitzki's 2007 victory, the second of Steve Nash's back-to-back wins in 2006, and Allen Iverson's 2001 victory. A plot of the Cook's distances of all points is below, and these four are, of course, the four most obviously unusual points:
We initially hypothesized that Westbrook would be the biggest outlier because of the controversy surrounding his victory, which stemmed largely from his team's win total, unprecedentedly low for an MVP winner. Although the model did predict James Harden as 2017's winner over Westbrook, by all accounts Westbrook was not an outlier. Neither Westbrook's 2017 season nor Nowitzki's 2007 campaign was found to be overly influential; however, our model flagged Nash's 2006 season as a potentially overly influential point. This is likely because he averaged only 19 points per game, a rather low total for an MVP winner, and was fairly unremarkable in all other categories besides assists. Nash's defense is notoriously underrated and difficult to capture in simple statistics, and that season is one in which "narrative" clearly played a role. Nash was the leader of the so-called "7 seconds or less" Suns teams that revolutionized today's NBA, but was not known as a defensive stalwart. Shaquille O'Neal, who won only one MVP, has repeatedly claimed he deserved to win both of Nash's MVPs. Our model does not support that claim: O'Neal did not finish inside the top 5 in MVP voting in 2006, and he has a substantially lower predicted MVP score than Nash in Nash's 2005 winning season. As a final note, the predicted winners for these four seasons would have been James Harden, Steve Nash, Chauncey Billups, and Shaquille O'Neal, respectively.
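For readers unfamiliar with the influence diagnostic used above, the following is a simplified sketch of Cook's distance for a regression with a single predictor, where the leverage has a closed form. Our actual model had several predictors, so this is only an illustration; the function name and the toy numbers (a low-win, high-score season placed first) are invented for the example.

```python
def cooks_distances(x, y):
    """Cook's distance of each point in a one-predictor OLS fit (intercept + slope)."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage of each point in simple linear regression.
    leverages = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2
    return [e ** 2 / (p * mse) * h / (1 - h) ** 2
            for e, h in zip(residuals, leverages)]


# Hypothetical data: team wins and MVP score for seven made-up seasons.
team_wins = [47, 52, 55, 58, 60, 62, 67]
mvp_scores = [900, 560, 640, 720, 790, 850, 1000]

distances = cooks_distances(team_wins, mvp_scores)
most_influential = max(range(len(distances)), key=distances.__getitem__)
print(most_influential, [round(d, 3) for d in distances])
```

A point earns a large Cook's distance only when it combines a large residual with high leverage, which is why a surprising winner is not automatically an influential observation.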
Sources of Error:
The first and most obvious source of error is incomplete data. While Basketball-Reference provides full summary statistics, the NBA has stated multiple times that there are no publicly available defensive statistics that fully and satisfactorily capture players' defensive skill. Thus, our model likely underweights defense when assessing MVP candidates. Further, we did not add every available statistic to the model as a regressor, so it is possible that a different selection of statistics would lead to different results. Finally, as mentioned earlier, we posit that narrative is highly influential, which is supported by a relatively low R-squared value. To improve this model, we could account for narrative by using a variable like positive news coverage or Google searches combining the player's name and "MVP".
A second source of error is that the voting process behind the MVP score is not consistent from year to year. Some years have a different number of voters than others, so MVP scores from different years are not exactly comparable. There are also two lockout-shortened seasons in the data; we addressed this by scaling the statistics for those years in proportion to the number of games missed. Although some of our significant predictors probably scaled fairly accurately, as most individual statistics tend not to change all that much after 66 and 50 games (the totals for the two lockout years), it is possible that games played and team wins would have turned out differently because of injuries or other sudden changes.
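As a rough sketch of the lockout adjustment, and assuming it amounts to simple proportional scaling up to a standard 82-game schedule, a season total such as team wins could be prorated as follows. The function name and the example win totals are illustrative only.

```python
FULL_SEASON_GAMES = 82  # length of a standard NBA regular season


def scale_to_full_season(value, season_games, full_season_games=FULL_SEASON_GAMES):
    """Prorate a season total (e.g. team wins) from a shortened season to a full one."""
    if season_games <= 0:
        raise ValueError("season_games must be positive")
    return value * full_season_games / season_games


# Hypothetical totals from the 66-game and 50-game lockout seasons.
wins_66 = scale_to_full_season(50, 66)
wins_50 = scale_to_full_season(37, 50)
print(round(wins_66, 1), round(wins_50, 1))
```

Per-game averages such as points per game need no adjustment under this approach; only totals that grow with the number of games played are scaled.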
Prediction for This Year:
As a final note, one additional test of our model is how well it predicts this year's MVP voting, so we used the relevant statistics to make MVP score predictions for some of the players we expect to be among the highest vote getters this year. The model predicts that James Harden will have the highest MVP score and win the award, with a score of 1198. In descending order of predicted score, the players who follow him are LeBron James, Kevin Durant, Russell Westbrook, DeMar DeRozan, Anthony Davis, and Giannis Antetokounmpo. Many NBA media outlets and personalities are predicting a win for Harden, so this bodes well for our model; a Harden win would bring the model's success rate to 17/21, or 81%. A future area of interest could be applying our model midseason to see how it fares with incomplete data and how robust it is to swings in team and player performance.
Editor's Note: This analysis was initially conducted as a final project for the class Statistics 109, "Intro To Statistical Modelling," and has been adapted for this blog. If you have any questions about this article, please feel free to reach out to Jake and Zach at jlevene@college.harvard.edu and z_diamandis@college.harvard.edu.