At the end of the December match between Arsenal and West Bromwich Albion at the Hawthorns, West Brom’s left-back Kieran Gibbs put in a cross that struck Calum Chambers on the hand, and Mike Dean awarded a penalty to West Brom. West Brom converted the penalty to secure a 1-1 draw against heavily favored Arsenal. This was the latest in a series of high-profile incidents, stretching back many years, in which Mike Dean made controversial decisions against Arsenal. In a September 2015 London derby against Chelsea at Stamford Bridge, Mike Dean controversially sent off two Arsenal players. In the wake of this, Arsenal fans filed a petition to the Football Association to ban Mike Dean from refereeing Arsenal matches. The theory that Mike Dean is anti-Arsenal gained further traction when a video emerged of Dean appearing to celebrate a Tottenham goal while refereeing a match against Aston Villa at White Hart Lane in November 2015.
All this raises the question: is Mike Dean biased against Arsenal? And by extension, are there any other team-referee combinations that show large discrepancies in results? Fans of Premier League teams often casually single out a referee for bias based on anecdotal evidence. For example, the party line is that Howard Webb was Manchester United’s 12th man before his retirement in 2014, while Michael Oliver was a guaranteed three points dropped.
To analyze whether referees are biased, I picked every combination of Premier League team and referee such that the referee took charge of at least 15 of the team's matches between the 2005/6 and 2016/17 seasons. For each combination, I calculated the pre-match win/draw/loss probabilities from the odds provided by Bet365 for each match where the team had that specific referee. I then simulated each combination's sequence of matches 10,000 times using the pre-match odds to create a bootstrapped probability distribution for the number of points the team would be expected to earn from those matches, and compared this to the number of points the club actually earned. Finally, I computed where the actual points total fell on the simulated distribution: the probability of earning less than or equal to that number of points (to test whether a referee is biased against the team) and the probability of earning greater than or equal to it (to test whether a referee is biased in its favor).
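The simulation described above can be sketched roughly as follows. This is a minimal illustration, not the code used for this post: it assumes decimal odds, assumes the bookmaker's margin is removed by simple normalization, and the function names (`odds_to_probs`, `simulate_points`, `tail_probabilities`) are hypothetical.

```python
import numpy as np

def odds_to_probs(win_odds, draw_odds, loss_odds):
    """Convert decimal odds (from the team's perspective) to probabilities.

    Assumption: the bookmaker margin is removed by normalizing the
    inverse odds so that they sum to 1.
    """
    inverse = np.array([1.0 / win_odds, 1.0 / draw_odds, 1.0 / loss_odds])
    return inverse / inverse.sum()

def simulate_points(match_probs, n_sims=10_000, seed=0):
    """Simulate the team's total points over a sequence of matches.

    match_probs has shape (n_matches, 3): (win, draw, loss) per match.
    Returns an array of n_sims simulated points totals.
    """
    rng = np.random.default_rng(seed)
    points = np.array([3, 1, 0])  # points for a win, draw, loss
    totals = np.zeros(n_sims, dtype=int)
    for probs in np.asarray(match_probs, dtype=float):
        outcomes = rng.choice(3, size=n_sims, p=probs)
        totals += points[outcomes]
    return totals

def tail_probabilities(totals, actual_points):
    """Share of simulations with points <= actual and with points >= actual."""
    p_low = np.mean(totals <= actual_points)
    p_high = np.mean(totals >= actual_points)
    return p_low, p_high
```

For one team-referee combination, you would build one row of probabilities per match with `odds_to_probs`, pass the rows to `simulate_points`, and compare the team's real points total against the result with `tail_probabilities`.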
For example, Howard Webb refereed 37 Manchester United matches between 2005/6 and his retirement in 2014. Based on pre-match betting odds, Manchester United were expected to earn 70.6 points in those 37 matches, but actually earned 73. Manchester United earned less than or equal to 73 points in 64.9% of our simulations, and greater than or equal to 73 in 40% (N.B. these numbers add up to more than 100% because the probability that Manchester United earn exactly 73 points is counted twice). From this, we fail to reject the null hypothesis: we find no evidence that Manchester United performed any differently under Webb than pre-match betting odds would predict.
We found 220 combinations that met our 15-match minimum. Using our raw calculated probabilities, we found 13 team-referee combinations at the 10% level and 7 at the 5% level where the team performed significantly worse than expected:
We also found 18 combinations at the 10% level and 9 at the 5% level where the team performed better than expected:
However, there is an inherent flaw with this analysis. Here, we are testing 220 different null hypotheses (one for each combination). If we test 220 hypotheses, we would expect about 5% (or 11) to be significant at the 5% level just by random chance. The fact that fewer than 5% of the combinations are significant at the 5% level is itself an interesting result: we are seeing no more significant combinations than chance alone would produce.
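The arithmetic behind this can be checked with a short sketch. It rests on a strong simplifying assumption that the post does not make: that the 220 tests are independent and that every null hypothesis is true, so the count of significant results follows a binomial distribution. The helper `binom_cdf` is written here for illustration only.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

n_tests = 220   # team-referee combinations tested
alpha = 0.05    # significance level

# Expected number of false positives if every null hypothesis is true.
expected_false_positives = n_tests * alpha

# Chance of seeing 7 or fewer significant results purely by luck,
# under the independence assumption stated above.
prob_at_most_7 = binom_cdf(7, n_tests, alpha)

print(expected_false_positives, prob_at_most_7)
```

In practice the tests are not fully independent, since the same referee and the same matches appear in more than one combination, so this should be read as a rough sanity check only.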
Fortunately, there is a way to account for false discoveries when testing many different hypotheses: the Benjamini–Hochberg procedure. The Benjamini–Hochberg procedure controls the false discovery rate, which is the expected proportion of rejected null hypotheses that are actually Type I errors (false positives). Since we have many different hypotheses, we run this procedure to calculate adjusted p-values from the full set of raw p-values. We then plot the original p-values (calculated above) against the B-H adjusted p-values. First, we do this for the p-values corresponding to teams underperforming expectations.
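For readers who want to see the adjustment itself, here is a minimal sketch of Benjamini–Hochberg adjusted p-values. It is not the code used for this post, and the function name `benjamini_hochberg` is hypothetical. Each sorted p-value is scaled by (number of tests / rank), and then the adjusted values are forced to be non-decreasing in rank.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                       # indices of p-values, smallest first
    scaled = p[order] * m / np.arange(1, m + 1)  # p_(i) * m / i
    # Working back from the largest p-value, take a running minimum so the
    # adjusted values never decrease as the raw p-values increase.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

example = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(example)
```

The same adjustment is available in statsmodels as `multipletests(pvals, method="fdr_bh")` from `statsmodels.stats.multitest`, which is probably the safer choice for real analysis.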
We find that after adjusting the p-values, the lowest becomes .99. We then repeat the procedure for the p-values corresponding to teams outperforming expectations.
Here we find similar, though less drastic, results. Our minimum adjusted p-value is .905, so none is even close to being significant at the 5% level.
These results are telling. Based on pre-match betting odds, no team-referee combination shows particularly alarming signs of bias. This contradicts the findings of the September 2015 study conducted by the blog Discovering Statistics, which focused specifically on Mike Dean and Arsenal. That study runs many different rigorous statistical tests (both Bayesian and frequentist) that all show pretty damning evidence that Mike Dean could very well be biased against Arsenal.
Therefore, something in our assumptions or methodology must have led us to a different conclusion. One potential explanation is that pre-match betting odds already take referee assignments into account, with bookmakers adjusting their odds to reflect certain worrisome combinations (like Mike Dean and Arsenal). One way of testing this is to compare pre-match betting odds for Arsenal matches refereed by Mike Dean with those for comparable matches (by team quality and home field) that he did not referee. Another is to see whether pre-match betting odds shift when referee assignments are announced (usually in the week before a match). Both ideas have potential for a future blog post, and I look forward to delving into these questions further.
If you have any thoughts on this, please feel free to leave them in the comments below.
If you have any questions about this article, please feel free to reach out to Andrew at firstname.lastname@example.org or on Twitter at @andrew_puopolo.