Sorting Strokes: Classifying Tennis Players Based on Stats and Style

By Johnattan Ontiveros

Motivation

One of the greatest sports rivalries of the past 17 years has been between Roger Federer and Rafael Nadal. Since the 2004 Miami Masters, when a 17-year-old Nadal shocked the tennis world by upsetting the seemingly unstoppable Federer in straight sets, fans have witnessed some of the best matches ever played between the pair.

At first glance, the two would seem close to 50-50 competitors in any given match, with Nadal having won 24 of their 40 meetings overall. However, once the court surface is taken into account, the odds in any given match shift greatly.

Nadal’s overall winning record against Federer comes from his dominance on clay courts (14-2). Nadal’s heavy left-handed topspin, amplified on clay where the ball bounces higher, is the perfect attack against Federer’s one-handed backhand. Grass courts, on the other hand, are Federer’s domain for the opposite reason: the ball bounces much lower. The interaction between the two players’ styles therefore needs to be taken into account in win probabilities, along with pure skill level.

So how can tennis playing styles be quantified, and how big an effect do they have on predicted win outcomes? Starting with quantifying play styles, there are four speculative style groups: aggressive baseliners, serve-and-volleyers, counter-punchers, and all-court players. Aggressive baseliners have strong and consistent groundstrokes but weaker net play; players such as Novak Djokovic fit this description rather well. The all-court player is the “jack of all trades” who uses a variety of shots to defeat his opponents, best reflected by Federer’s play.

Methodology

To sort ATP players into these categories, I used K-means clustering to classify tennis players based on their average ace percentage, serve speed, points played at net, net points won, unforced error rate, and their forehand and backhand strokes. The K-means model groups players by the similarity of these statistics and determines the optimal grouping of players. For example, if we were using just serve speed and forehand to classify players, the K-means clustering might determine that there are two play styles: right-handed players and left-handed players. This would indicate that handedness was a good variable to split players by, but that serve speed did not differ enough to justify a split. The study group was the ATP Men’s Top 100 in 2017, using over a decade’s worth of Grand Slam point-by-point data collected by Jeff Sackmann of Tennis Abstract. The K-means clustering yielded four classes as the optimal number of play styles to use.
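The clustering step can be sketched in a few lines. This is only a minimal illustration of standard K-means (Lloyd's algorithm) on standardized features, not the code used for this article. The six rows of data, the two feature names, and the helper names (standardize, kmeans, inertia) are all made up for the example, and the article does not say how the number of clusters was chosen, so comparing within-cluster spread across values of k is just one common approach.

```python
import math
import random
import statistics


def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 so a constant column does not divide by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


def kmeans(points, k, max_iterations=100, seed=0):
    """Lloyd's algorithm. Returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each player joins the nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [statistics.fmean(c) for c in zip(*members)]
    return labels, centroids


def inertia(points, labels, centroids):
    """Within-cluster sum of squared distances (lower means tighter)."""
    return sum(
        math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels)
    )


# Hypothetical players: (ace percentage, percentage of points played at net).
rows = [
    [15.0, 20.0], [14.0, 22.0], [16.0, 21.0],  # serve- and net-heavy
    [5.0, 6.0], [6.0, 5.0], [4.0, 7.0],        # baseline-oriented
]
points = standardize(rows)
labels, centroids = kmeans(points, k=2)
print(labels)

# Compare cluster tightness for several candidate values of k.
for k in (1, 2, 3):
    k_labels, k_centroids = kmeans(points, k)
    print(k, round(inertia(points, k_labels, k_centroids), 3))
```

Standardizing first matters because K-means relies on Euclidean distance, so a feature measured on a larger scale (such as serve speed) would otherwise dominate the grouping.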

The labels used for the classes are not perfect, but they will be used for ease of reference. A player like Dominic Thiem could straddle two categories, as he is a big server who also does not make many errors. Furthermore, the grouping of Federer, Djokovic, and Andy Murray into the same class is a sign that the clustering algorithm could be improved further. From a qualitative perspective, these three arguably do not share a play style, but the separation of Federer’s and Nadal’s play styles is a reassuring result.

Analysis

Now that each player in the ATP Top 100 is classified, we can see how large an effect playing style has on win outcomes. I fit a logistic regression using the Elo rating difference between players, to control for each player’s base skill level, together with the classified styles of each player. The data consist of all Masters 1000 matches from 2004 to 2017. To verify that the model is well calibrated, I made the calibration plot below, which indicates that the model performs well. For example, it shows that 70% favorites according to Elo difference (the red line) win the match about 70% of the time (black dots).
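For readers unfamiliar with calibration checks, the idea can be sketched as follows: group the model's predicted win probabilities into bins, then compare each bin's average prediction with the share of matches actually won. The function name calibration_table and the ten predictions and outcomes below are invented purely for illustration; they are not the article's data or results.

```python
from collections import defaultdict


def calibration_table(predicted, outcomes, n_bins=10):
    """Bin predictions into equal-width bins and return, per bin,
    (mean predicted probability, observed win rate, match count)."""
    bins = defaultdict(list)
    for p, won in zip(predicted, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 joins the top bin
        bins[index].append((p, won))
    table = []
    for index in sorted(bins):
        pairs = bins[index]
        mean_pred = sum(p for p, _ in pairs) / len(pairs)
        win_rate = sum(won for _, won in pairs) / len(pairs)
        table.append((mean_pred, win_rate, len(pairs)))
    return table


# Made-up predictions and results (1 = the player won), illustration only.
predicted = [0.72, 0.68, 0.71, 0.74, 0.70, 0.31, 0.29, 0.33, 0.30, 0.27]
outcomes = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]

table = calibration_table(predicted, outcomes, n_bins=5)
for mean_pred, win_rate, count in table:
    print(f"predicted {mean_pred:.2f}  observed {win_rate:.2f}  n={count}")
```

In a well-calibrated model the two columns track each other closely across bins, which is what points lying along the diagonal of a calibration plot express.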

The finalized model yielded five significant interactions between play style and win probability. The model estimates the probability of a player winning a match given both the Elo (skill) difference and the play styles. From the table below we can see that, holding all else constant, when a left-handed counter-puncher plays a big server, the counter-puncher’s odds of winning are multiplied by 1.95. Conversely, when an all-court player with an Elo advantage plays a big server, the all-court player’s predicted odds of winning the match are multiplied by 0.4, a decrease.
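Because a multiplicative change in odds is easy to confuse with a change in probability, here is a small worked sketch of the conversion. It assumes a hypothetical baseline of 50% (two players the model would otherwise rate as even) and applies the 1.95 odds multiplier quoted above; the helper names are made up, and the result is not meant to reproduce the fitted model's predictions, which also depend on its other coefficients.

```python
import math


def probability_to_odds(p):
    return p / (1.0 - p)


def odds_to_probability(odds):
    return odds / (1.0 + odds)


def apply_odds_ratio(p, odds_ratio):
    """Multiply the odds implied by probability p by odds_ratio."""
    return odds_to_probability(probability_to_odds(p) * odds_ratio)


# Hypothetical baseline: an even matchup before style is considered.
baseline = 0.5
adjusted = apply_odds_ratio(baseline, 1.95)
print(f"{baseline:.0%} -> {adjusted:.1%}")

# Equivalent view in logistic-regression terms: an odds ratio of 1.95
# corresponds to adding log(1.95) on the log-odds scale.
log_odds = math.log(probability_to_odds(baseline)) + math.log(1.95)
via_logit = 1.0 / (1.0 + math.exp(-log_odds))
print(f"{via_logit:.1%}")
```

The same multiplier moves the probability by different amounts depending on the starting point, which is why effects in a logistic regression are usually reported on the odds scale.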

Using the Nadal-Federer case as a reference, we can compare their two play styles at the average Elo difference across all matchups among the ATP Top 100 in 2017: 94 points. With an Elo advantage of 94, a big server (Federer’s style) has a 61% probability of winning against a left-handed counter-puncher (Nadal’s style). That is considerably better than a 50/50 coin toss, which isn’t too surprising since the big server has the higher Elo.

However, if the counter-puncher has the 94-point Elo advantage over the big server, they have a 74% chance of winning. This shows that, controlling for skill level using Elo, a counter-puncher sees a boost of 13 percentage points in win probability (74% versus 61%) compared with a big server holding the same Elo advantage.

Conclusion

There are four classifiable tennis play styles using publicly available tennis data. Perhaps with more granular data (which has recently become more publicly available), the model would identify more than four playing types. According to the K-means clustering, these four styles are left-handed counter-punchers, right-handed counter-punchers, all-court players, and big servers. Left-handed counter-punchers have the largest significant advantage over big servers, with their odds of winning multiplied by 1.95. Big servers, however, have a significant advantage over right-handed counter-punchers of 1.4 times the odds.

However, it is important to keep in mind that the results of this project are far from perfect. The model does not take court surface into account, and the data do not contain equal amounts of information for each surface, player, or style type. The advantage of counter-punchers could be further accentuated on clay, whereas big servers would benefit on grass. Lastly, I did my best to separate skill level and play style from each other, but skill level inevitably leaks into the stats. Federer could appear to be an all-court player visually, but his remarkably impressive serving stats make him look more like a big server on paper. This became evident when Federer, Djokovic, and Murray were put in the same category.

In conclusion, this project provides a useful way of classifying different styles of tennis players based on statistics. In some instances it confirms what fans believe, and in others it changes their perceptions of certain players. After all, data emphasizes and confirms what we see but can also illuminate what is not being seen.

If you have any further questions, contact Johnattan Ontiveros '22 at jhontiveros@college.harvard.edu
