By Matt Goldberg, Adam Gilfix, Steven Rachesky, Nathaniel Ver Steeg
“Defense wins championships” and “the best offense is a good defense” are two of the most common adages in sports. The timelessness of these expressions reveals the long-held belief that a strong defense is a crucial element of a successful franchise. In a 2013 New York Times article discussing his predictions for Super Bowl XLVII, Nate Silver supported this with some numbers, noting that the 20 all-time best defenses went 14-6 in Super Bowl appearances, while the 20 all-time best offenses managed only a mediocre 10-10. A year before, a Freakonomics blog post suggested that the adages may be a bit of an exaggeration, but even if defense isn’t the most crucial part of a team, it is still an important one, and this certainly holds true in the NFL.
Of course, central to a good defense is a good defensive coordinator, and an all-important trait of a good defensive coordinator is the ability to preempt the offense’s play-calling. In particular, a savvy defensive coordinator who can predict whether an offense will pass or run will provide immense value to his team by calling defensive plays that best stifle the offense’s plans. Therefore, in November and December of 2015, as part of our final project for Harvard’s Data Science course, we attempted to predict offensive play-calling from a defensive coordinator’s perspective.
To do so, we first scraped data from Pro Football Reference (PFR) using Matt Goldberg’s pfr Python package for getting data from PFR’s site. Specifically, we collected play-by-play data from every game played in the 2002 through 2014 NFL seasons. After filtering the data to include only runs and passes during which no penalties were accepted, we had a raw dataset of roughly 450,000 plays and around 100 raw features describing each play, including the amount of time left in the game, field position, whether the play was a run or a pass (the outcome of interest), and specific details about the play such as how many yards it gained, which players were involved in the play, and the result of the play (e.g., whether it was a touchdown, interception, tackle, etc.).
Then, we cleaned the data and generated new features. New features introduced in this stage include the proportion of plays that were passes in the previous season, current season, and within the current game; the scoring margin; seconds elapsed in the half; and timeouts remaining in the half for each team. We also added indicators for whether the team was in field goal range, whether the team was in the red zone, whether there were fewer than 3 minutes remaining in the half, and whether the team’s previous play from scrimmage was a pass. We intended for these newly engineered features to give the model more situational insight for each play, hopefully encoding some of our domain expertise (at least, we like to think of ourselves as NFL “experts”) into the set of features with which our model has to work.
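To make the feature-engineering step concrete, here is a minimal sketch of how a few of these situational features could be computed for a single play. It is an illustration, not our project code: the `Play` record, its field names, and the 35-yard cutoff for field goal range are all assumptions made for the example (the 20-yard red zone and the 180-second threshold follow the usual definitions).

```python
from dataclasses import dataclass


@dataclass
class Play:
    """Hypothetical play record; field names are illustrative only."""
    offense: str            # team with the ball
    is_pass: bool           # True for a pass, False for a run
    off_score: int          # offense's score before the snap
    def_score: int          # defense's score before the snap
    yards_to_goal: int      # distance to the opponent's end zone
    secs_left_in_half: int  # seconds remaining in the current half


def situational_features(play: Play, earlier: list[Play]) -> dict:
    """Build features for `play` from the earlier plays of the same game."""
    # Only the offense's own earlier plays inform its in-game tendencies.
    own = [p for p in earlier if p.offense == play.offense]
    return {
        "score_margin": play.off_score - play.def_score,
        "in_red_zone": play.yards_to_goal <= 20,
        "in_fg_range": play.yards_to_goal <= 35,  # assumed cutoff
        "under_3_min": play.secs_left_in_half < 180,
        # None marks the offense's first play of the game.
        "prev_play_pass": own[-1].is_pass if own else None,
        "game_pass_share": sum(p.is_pass for p in own) / len(own) if own else None,
    }


game = [
    Play("NE", True, 0, 0, 75, 1800),
    Play("NE", False, 0, 0, 30, 1700),
    Play("SEA", True, 0, 0, 15, 170),
    Play("NE", True, 0, 7, 50, 100),
]
print(situational_features(game[3], game[:3]))
```

Note that the in-game pass share uses only plays that happened before the snap being predicted, so the feature never leaks the outcome it is meant to help predict.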
After the data-cleaning phase, we inspected the data to get a sense of the story it tells. In this exploratory step, we also generated a few graphics that present the data in an interesting and accessible way. For example, for each team, we plotted the proportion of offensive plays that were passes for each year in the dataset. Thus, you can see below how each team’s offensive tendency to prefer the pass has changed over time, with greater values representing more pass-happy offenses:
There are a few interesting charts that are worth a second look. Washington’s chart demonstrates a noticeable preference for the run during Robert Griffin III’s ROY-winning season in 2012. Similarly, the Seahawks have been especially devoted to the run since Russell Wilson’s rookie year in 2012, passing no more than about 50% of the time in each of the years since his arrival. Another historic run-heavy team was the 2004 Pittsburgh Steelers, who ran an astonishing 63% of the time; that team featured Rookie of the Year QB Ben Roethlisberger and veteran RB Jerome Bettis, who ran for 13 touchdowns. It is also noteworthy that during Denver’s crazy “Tebow-mania” season of 2011, passing became far rarer for the Broncos’ offense than it had been in previous seasons. On the other hand, Tom Brady’s offense in New England has been consistently pass-heavy in every year since 2002.
In addition to the running and passing tendencies of each team, we also looked into rushing direction and passing location by team and league-wide, just to get a sense of the distribution of play calls in terms of both run/pass (purple and blue, respectively) and direction/location. Below are the distributions for two teams with notably extreme distributions, as well as the league-wide distributions of runs and passes.
These graphics gave us a sense of the percentages of runs and passes in the dataset, while also giving potential insight into what features may be indicative of certain play calls.
After exploring the data, we took to modeling. In particular, we tested several classifiers using the features that survived feature selection; classifiers we tested include non-regularized and regularized logistic regression, random forests, and gradient boosting. In the end, the most accurate model was built using gradient boosting and, when trained on a random subset of the 2014 season, achieved an accuracy of 70.3% on the remainder of the 2014 season; this accuracy hovered slightly above 70% for every year in the dataset. This is certainly an improvement over our baseline, which was to simply predict pass for every play in the dataset; since about 57% of plays were passes in 2014, the model’s accuracy exceeded the baseline by roughly 13 percentage points. Using gradient boosting also enabled us to investigate which features were most and least important:
As you can see, the top three most important features are time elapsed in the half, the percentage of plays that were passes for the offense’s previous season, and the offense’s scoring margin. Other notable important features include yards to go for a first down, current season and in-game passing tendencies, and field position.
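For readers who want to try the modeling and feature-importance steps described above, here is a minimal sketch, assuming scikit-learn and NumPy are available. It is not our project code: the data are synthetic, the three feature names are stand-ins for features mentioned in this post, and the rule that generates the run/pass labels is invented purely so the example has something to learn. The numbers it prints say nothing about real NFL plays.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_plays = 2000

# Synthetic stand-ins for three features (names are illustrative).
secs_elapsed = rng.uniform(0, 1800, n_plays)  # seconds elapsed in the half
score_margin = rng.integers(-21, 22, n_plays)  # offense score minus defense score
yards_to_go = rng.integers(1, 16, n_plays)     # yards needed for a first down
feature_names = ["secs_elapsed", "score_margin", "yards_to_go"]
X = np.column_stack([secs_elapsed, score_margin, yards_to_go])

# Invented labeling rule: passes are more likely late in the half,
# when trailing, and with more yards to go. 1 = pass, 0 = run.
logit = 0.3 + 0.15 * (yards_to_go - 5) - 0.05 * score_margin + 0.0005 * (secs_elapsed - 900)
y = (rng.random(n_plays) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Baseline: always predict the most frequent class in the training data.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
model_acc = model.score(X_test, y_test)
importances = dict(zip(feature_names, model.feature_importances_))

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"model accuracy:    {model_acc:.3f}")
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

The majority-class baseline plays the same role as our "always predict pass" baseline: the gap between the two accuracies, measured in percentage points, is what the model adds. The importances reported by scikit-learn's gradient boosting are normalized to sum to 1, so they are best read as relative rankings rather than absolute effects.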
While models like the one we built here may not be accurate enough to rely on blindly, their output could certainly help inform a defensive coordinator’s play-calling decisions. With a more objective and statistical understanding of the likelihood that an opposing offense is about to run or pass, defensive coordinators can preempt offensive plays and render them ineffective. When the model output gives a defensive coordinator high confidence in a prediction, it allows the coach to be much more aggressive in his play calling, whether by stacking the box before a run or by calling the correct coverage or pass rush for a pass. Therefore, don’t be surprised to see model-based predictions such as these coming soon to an NFL Microsoft Surface near you.
In case readers did not click through earlier, or are interested in exploring our process and the story of the project in more depth, here is our GitHub website containing our work, with links to our code and even a YouTube video.