By Jack Schroeder
While most people only hear about the sport once every four years, College Curling has grown rapidly over the past decade, now encompassing over forty schools across the country. The quality of play in college curling is admittedly lower than in professional leagues due to the number of newer curlers, but with Nationals coming up this weekend, I decided to create an Elo rating system for the sport.
The Times of London put together an infographic explaining the rules of curling back in 2010. There are two major differences between college and Olympic curling: college games are eight ends long (instead of ten) and are played without an umpire. The general sense of sportsmanship around curling is referred to as the “spirit of curling” and involves calling one’s own infractions and “broomstacking” (socializing with one’s opponents after each game), among other things.
Currently, College Curling uses a points-based system to determine team quality. Four points are earned for a win, and two for a loss. Games for points occur at bonspiels (tournaments) and through head-to-head scrimmages. Points can also be earned in other ways, such as host points for organizing a bonspiel, or extra “victory points” for winning a bonspiel. Additionally, if a school is located in an “emerging region,” where weekly travel to bonspiels would be prohibitively expensive, it can earn points through non-college bonspiels or league play at its home club. While this allows schools located in nontraditional environments to compete, it can also distort their quality relative to schools that compete in leagues but cannot earn points for those games. These points are accumulated over the regular season (traditionally October through February), and then the top 16 teams qualify for Nationals.
Using points alone to judge team quality is problematic because they are not given at constant rates. While a traditional curling team fields four players at a bonspiel or scrimmage, schools are allowed to compete with three-player teams. However, as an incentive to field full teams, three-player teams only earn 3/4 of the possible points (3 for a win, 1.5 for a loss). Moreover, schools only earn points for their top two teams at a bonspiel. If a school sends three teams that each go 2-1, it receives only 20 points, and the other 10 points are left on the table. This is to discourage teams from oversaturating bonspiels with their own teams, since successful teams from the same school can end up “playing” one another in the later rounds. Finally, in the event a team plays more or fewer games than another at a bonspiel, overall points gained can be “normalized” as if each team played the same number of games.
In addition, College Curling has the unique dynamic of split teams. Most bonspiels permit schools to pool together players into a unified team that can compete. However, the schools only gain points on a fractional basis, determined by the share of the split team their players made up. Combined with the two-team rule, there is a heavy incentive to stack talent on full squads to maximize points gained, since a split team essentially has to win the bonspiel for its points to count toward qualification. Although the sample is small, split teams as a whole tended to do worse than full squads this season.
I think the points system manages these complications decently well. It incentivizes the growth of the sport through the “emerging region” rule and by freeing up bonspiel slots through the two-team rule. Split teams are also a fun part of the sport and are in line with the spirit of curling. However, it is tough to gauge team quality through points given the problems outlined above, which is where Elo comes in.
For those unfamiliar with Elo, it is a statistical measure of quality originally developed for chess. It has since been applied in areas as varied as the NBA and eSports. Traditionally, the average Elo rating is 1500, with a standard deviation of 200. It is a closed system, so points gained by one team are lost by another, and a constant called the k-factor determines how volatile the system is. A small k-factor yields rankings that barely ever change; a large one yields rankings that swing wildly. There are many methods of determining the correct k-factor, and some models vary it based on difference in quality, margin of victory, etc.
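The mechanics described above boil down to two short formulas: a logistic expected score and a zero-sum update. A minimal sketch in Python (my model was actually built in R, and these function names are my own, not from any library):

```python
def expected_score(r_a, r_b):
    """Probability that team A beats team B under the Elo logistic curve."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a, r_b, score_a, k=30):
    """Update both ratings after one game; score_a is 1 if A won, 0 if A lost.
    Elo is a closed system, so B loses exactly what A gains."""
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta
```

With equal 1500 ratings, each team is expected to win half the time, so an upset is impossible and a win moves each rating by exactly k/2.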
To start calculating Elo ratings in College Curling, I had to gather results from each bonspiel and head-to-head scrimmage in the regular season. Although head-to-head results are posted online, only the points accrued from each bonspiel are made publicly available. As a result, I had to reach out individually to each bonspiel organizer and ask for the completed draw. The organizers of the UW-Green Bay/St. Norbert bonspiel never got back to me, so those results are not incorporated into the ratings. While it was a small tournament, it would have been useful to get a larger sample of games from the Midwest teams, given that they played almost 200 fewer games than their Northeast counterparts (331 to 517). The head-to-head scrimmage data was easily accessible, though, and all of those results were included.
After gathering results, I had to set initial Elo ratings. While it appeared game results were impossible to come by (since only points gained at each tournament were published, not the complete draw), the final standings from previous years were accessible through Archive.org. I used the 2017-18 final standings to calculate initial ratings. Since a rule change this year essentially doubles points gained through bonspiels and scrimmages, I doubled each team’s points. To translate points into Elo ratings, I then z-scored the standings, which measures how many standard deviations each school finished from the mean. I made Elo ratings out of the z-scores by assuming a mean of 1500 and a standard deviation of 200. I felt using only last year’s results would be the most efficient way to set initial ratings, but in the future, it may be worthwhile to explore using multiple years to set initial ratings (discounted by time, since even curlers must graduate).
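That points-to-ratings conversion can be sketched as follows (Python again; the point totals below are hypothetical placeholders, not the actual 2017-18 standings):

```python
import statistics

def initial_elo(points, mean_elo=1500, sd_elo=200):
    """Map last season's point totals to initial Elo ratings via z-scores:
    each school's rating is 1500 plus 200 per standard deviation from the mean."""
    mu = statistics.mean(points)
    sigma = statistics.pstdev(points)  # population SD; sample SD is also defensible
    return [mean_elo + sd_elo * (p - mu) / sigma for p in points]

# Hypothetical doubled point totals, best to worst (illustrative only)
points = [120, 96, 64, 40, 0]
ratings = initial_elo(points)
```

By construction, the ratings average exactly 1500 and preserve the ordering of the standings; a school with zero points lands as far below 1500 as its z-score dictates.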
I had to decide how to give initial ratings to new schools. Team USA’s success at the 2018 Winter Olympics encouraged seven schools to start competing this year, and they needed initial ratings as well. I z-scored these schools as if they had earned 0 points last season, which yielded an initial Elo of 1230. Split teams also posed a challenge. I thought about averaging both teams’ Elo ratings together to create a combined rating. However, since split teams performed worse than full squads, and there is an incentive to keep talent on full squads, I settled on treating each split team like an expansion school, entering a bonspiel with a 1230 rating.
I then created an Elo model in R that updated ratings after each game. I had to determine the best k-factor for the system. The sample was too small to run an optimization, so I first ran the model with a k-factor of 20 (the optimal value for club soccer and the NBA). However, this factor was slightly biased against expansion teams like North Dakota State and Syracuse, who qualified for Nationals and won a majority of their games. I tried out alternatives before settling on a k-factor of 30, which gave Syracuse (who had the highest win percentage of any Northeast school) enough room to improve and left Harvard with an average rating of 1500. Any k-factor higher than 30 tended to overweight the most recent bonspiel. Only three teams moved more than a standard deviation with this k-factor: North Dakota State, Syracuse (both positive), and RPI (who were negatively impacted by the two-team rule). These ratings were highly significant in predicting win percentage this season (p<0.001).
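Replaying a season with a fixed k-factor amounts to a single chronological pass over the games. A minimal sketch with k = 30 (Python here, whereas the real model was in R; the teams, starting ratings, and results below are hypothetical):

```python
def run_season(initial_ratings, games, k=30):
    """Replay games in chronological order, updating Elo after each one.
    `games` is a list of (winner, loser) team-name tuples."""
    ratings = dict(initial_ratings)
    for winner, loser in games:
        # Expected score for the winner under the Elo logistic curve
        e_w = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        delta = k * (1 - e_w)
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings

# Hypothetical mini-season (names and ratings illustrative only)
start = {"SUNY Poly": 1600, "Harvard": 1500, "Syracuse": 1230}
final = run_season(start, [("Syracuse", "Harvard"), ("SUNY Poly", "Syracuse")])
```

Because every update is zero-sum, the league's total rating is conserved no matter how many games are played, which is what keeps the mean anchored near 1500.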
Plotting Elo ratings versus win percentage allows me to see which teams the model over- or underweights. Schools outside the trendline’s confidence interval are outliers. Some, like North Dakota State, are outliers because they are new to the system and had an initial rating that was too low. Others, like Unity, played too few games to get an accurate sample. On the whole, newer teams and low-sample teams tended to be underweighted. Teams overweighted by the model were those that either started with an inaccurately high initial rating (like Maine) or, through a combination of the two-team rule and heightened competition, were not as successful this year but are benefiting from the k-factor not being higher (like RPI).
The distribution of initial and current ratings is similar, once expansion schools are filtered out of the initial distribution. This implies that the spread of ratings has remained relatively equal even as new schools have entered the league.
Each school can also be ranked by Elo and traditional points. For some schools, these ranks are relatively equal. SUNY Poly, for instance, is the strongest school by both metrics. But some schools are undervalued points-wise, particularly midwestern teams like St. Norbert and UW-Green Bay. Indeed, UW-Green Bay (along with Penn and Villanova) would qualify for Nationals under an Elo-based qualification system. Even so, the points system is still better for determining qualification: it helps spread the game (through the benefits for “emerging region” schools), avoids Elo-based incentives (like refusing to scrimmage in order to maintain a high rating), and rewards participation (schools would likely let only their strongest curlers compete in an Elo-only system). That is on top of the model’s two main flaws: it is still relatively slow to rank new schools highly, and it does not take into account which team plays from each school. While the model allows one to better judge relative quality, points are still useful administratively.
The model can also analyze the pools at Nationals. Pools A and D are the toughest, with average Elo Ratings of 1640. This makes sense, given that Pool A has the strongest northeastern school (SUNY Poly) and Pool D has the strongest midwestern school (UW-Stevens Point). Pool B is the easiest, with an average Elo of only 1565. That said, given North Dakota State is the only school underweighted outside the confidence interval, Pool B will not be a cakewalk. Pool C has the largest gap between the second and third-ranked schools, with St. Norbert rated 140 points higher than Nebraska. Since only the top two teams advance into the championship bracket, this may be indicative of a relatively stable pool.
I intend to maintain this model next season through an R Shiny page on the HSAC website (updated after each bonspiel). In order to do so, and to improve the model, I think the following changes need to be implemented in the College Curling reporting process. First, scores should be reported for every game along with LSFE (last stone in first end). Scores in every game would allow margin of victory to factor into the ratings (such as through a mobile k-factor), whereas currently many bonspiels do not report final scores. In addition, given the advantage of LSFE, initial hammer could be incorporated into the model similar to home-field advantage in other sports. Moreover, team quality varies widely based on who plays, which the model should take into account. While it would be impractical to report each position for each game, scrimmages and bonspiel roster forms already require teams to identify their skips. Naming each bonspiel team by skip (ex. Harvard (Schroeder)) would sharpen the model and eliminate the confusion that surrounds numbered teams (ex. Harvard 2) at bonspiels (and in reporting points). Finally, and most importantly, completed draws should be published at the completion of each bonspiel. This not only allows teams to check whether points have been accurately recorded, but it also opens up game results to casual observers. Curling has made great strides this year in participation, quality, and outside interest. It would be unthinkable for results from a midseason college basketball tournament to go unpublished. If we want people to take curling seriously, we need to follow suit.