By Carlos Pena-Lobel
With the NBA Finals involving some of the best teams of all time about to start, the French Open has been more of an afterthought. However, last year I wrote some Python code to simulate tennis matches, and so I want to use it to analyze some tennis while we still have a major underway.
First, a brief discussion of the simulation I wrote. The simulation takes in a number of sets and randomly assigns service (equivalent to the racket spin), though it can be modified to let either player serve first. It also has the option to end sets in a tiebreak at 6-6, or to continue in the vein of the Isner-Mahut 5th set. Although it should go without saying, the simulation follows all the rules of traditional scoring and serving.
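To make the scoring rules concrete, here is a minimal sketch of how a game, tiebreak, set and match loop might look. It is not the original code: the function names, and the simplification that each player wins a service point with one fixed probability `p_serve[i]`, are assumptions made only for illustration.

```python
import random

def play_game(server, p_serve, rng):
    """Play one standard game; return the index (0 or 1) of the winner."""
    points = [0, 0]
    while True:
        winner = server if rng.random() < p_serve[server] else 1 - server
        points[winner] += 1
        # A game needs at least 4 points and a 2-point margin (deuce/advantage).
        if points[winner] >= 4 and points[winner] - points[1 - winner] >= 2:
            return winner

def play_tiebreak(first_server, p_serve, rng):
    """First to 7 points, win by 2; serve order is A, B, B, A, A, B, B, ..."""
    points, played = [0, 0], 0
    while True:
        first_serves = ((played + 1) // 2) % 2 == 0
        server = first_server if first_serves else 1 - first_server
        winner = server if rng.random() < p_serve[server] else 1 - server
        points[winner] += 1
        played += 1
        if points[winner] >= 7 and points[winner] - points[1 - winner] >= 2:
            return winner

def play_set(first_server, p_serve, rng, tiebreak=True):
    """Return (set winner, player who serves first in the next set)."""
    games, server = [0, 0], first_server
    while True:
        if tiebreak and games == [6, 6]:
            winner = play_tiebreak(server, p_serve, rng)
            # Whoever received first in the tiebreak serves first next set.
            return winner, 1 - server
        winner = play_game(server, p_serve, rng)
        games[winner] += 1
        server = 1 - server
        if games[winner] >= 6 and games[winner] - games[1 - winner] >= 2:
            return winner, server

def play_match(p_serve, best_of=5, final_set_tiebreak=False,
               first_server=None, rng=None):
    """Return the index of the match winner."""
    if rng is None:
        rng = random.Random()
    # Random first server stands in for the racket spin unless one is given.
    server = rng.randrange(2) if first_server is None else first_server
    sets, needed = [0, 0], best_of // 2 + 1
    while max(sets) < needed:
        is_final_set = sum(sets) == best_of - 1
        use_tiebreak = final_set_tiebreak or not is_final_set
        winner, server = play_set(server, p_serve, rng, tiebreak=use_tiebreak)
        sets[winner] += 1
    return sets.index(needed)
```

With `final_set_tiebreak=False` the deciding set is played out until someone leads by two games, which is the Isner-Mahut case; setting `best_of=3` or `best_of=5` is what allows the men's and women's formats to be swapped.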
Adding these options lets me see how variance would change if men played 3 sets, or how the top women would do if they had 5 sets to play. It also lets me look at the effects of serving first in a match or set, potentially on the scale of a whole tournament.
Moreover, I have written 2 other functions to go from simulating individual matches to tournaments. The first simulates a single-elimination tournament of size 2^n (meaning it can deal with 2, 4, 8, 16, 32, 64, 128… teams/players). The second simulates the draw of an open. Whereas in the United States we typically assume that the 1 seed would play the 128 seed, 2 v 127, and so on, this is not close to being true in tennis. Nor is it completely random. Instead, following the rules established on page 25, the function randomly creates a valid seeding of the players.
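As a rough illustration of the first of these functions, a single-elimination bracket can be played out by repeatedly pairing neighbours in the draw. This is only a sketch: `simulate_bracket`, `win_shares` and the `match_winner` callback are hypothetical names, and it does not attempt to reproduce the seeding rules for building a valid draw.

```python
from collections import Counter

def simulate_bracket(draw, match_winner):
    """Play a single-elimination bracket and return the champion.

    draw: entrants in bracket order; adjacent entries meet in round one.
    match_winner: callable taking two entrants and returning the winner.
    """
    size = len(draw)
    if size < 2 or size & (size - 1):
        raise ValueError("draw size must be a power of two (2, 4, 8, ...)")
    field = list(draw)
    while len(field) > 1:
        # Winners of neighbouring matches become neighbours in the next round.
        field = [match_winner(field[i], field[i + 1])
                 for i in range(0, len(field), 2)]
    return field[0]

def win_shares(draw, match_winner, trials):
    """Share of simulated tournaments won by each entrant who won at least once."""
    counts = Counter(simulate_bracket(draw, match_winner) for _ in range(trials))
    return {player: wins / trials for player, wins in counts.items()}
```

In practice `match_winner` would wrap a full match simulation; for a 128-player draw the loop runs 7 rounds, which is where the "7 matches to win the tournament" figure comes from.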
Now to “play the actual points” in each match, the simulation relies on an input vector of 5 stats for each player. First, it takes each player’s percentage of serves in play. Second, it takes a player’s 1st serve and 2nd serve win percentages, as well as their 1st and 2nd serve return percentages. As you can probably guess, it first checks whether the first serve is in, and if so simulates the point based on both players’ 1st serve percentages. Otherwise, it simulates a second serve, and if that is in, it uses the 2nd serve percentages. Finally, if that is out, the player double faults and the opponent wins the point. If either serve is “in play,” I use Bill James’s log5 formula to calculate the chance of each player winning the point. Since this is the biggest assumption, I will go into a little more detail about what log5 is and what it assumes.
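The point logic described above might be sketched roughly as follows. This is an illustration rather than the original code: the `PlayerStats` fields and function names are made up here, and two readings of the text are assumed, namely that the single serves-in-play percentage applies to both the first and second serve, and that log5 is fed the server's win percentage against the returner's return percentage.

```python
import random
from dataclasses import dataclass

@dataclass
class PlayerStats:
    serve_in: float           # share of serves in play (assumed for both serves)
    first_serve_win: float    # points won when the 1st serve is in
    second_serve_win: float   # points won when the 2nd serve is in
    first_return_win: float   # points won returning a 1st serve
    second_return_win: float  # points won returning a 2nd serve

def log5(p_a, p_b):
    """Bill James's log5 estimate that A beats B, given each side's win rate."""
    num = p_a * (1 - p_b)
    return num / (num + p_b * (1 - p_a))

def server_wins_point(server, returner, rng):
    """Simulate one point; return True if the server wins it."""
    if rng.random() < server.serve_in:
        # First serve lands: combine the two first-serve percentages.
        p = log5(server.first_serve_win, returner.first_return_win)
    elif rng.random() < server.serve_in:
        # First serve missed, second serve lands.
        p = log5(server.second_serve_win, returner.second_return_win)
    else:
        return False  # double fault: the returner wins the point
    return rng.random() < p
```

Note that `log5` as written divides by zero when both inputs are 0 or both are 1, so real stats would need to stay strictly between those extremes.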
If Team A has a win percentage of 60%, this assumes they will win 60% of their games against an average team. However, if they are playing Team B, a team that is below average, we would expect Team A to win more than 60% of their games. Bill James first outlined this in 1981 to calculate matchup probabilities in baseball, and it has since been expanded to many other types of matchups in other sports, and even to player-versus-player matchups such as batter/pitcher. Finally, the reason that we include the “5” is that teams were initially being compared to an average team with a 50% winning percentage; modifying this would change what counts as the average team. Since tennis points are a zero-sum game (like most other sports matchups), serving and return percentages should approximate a bimodal distribution centered around .5, with serve percentages clustered above it and return percentages below it. Below is a graph of how log5 takes in 2 input probabilities and outputs an overall probability.
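For reference, the standard log5 formula can be written out explicitly, where p_A and p_B are the two sides' winning percentages against an average (50%) opponent:

```latex
P(A \text{ beats } B) = \frac{p_A \,(1 - p_B)}{p_A \,(1 - p_B) + p_B \,(1 - p_A)}
```

As a quick check, a 60% team facing a 40% team gets 0.6(0.6) / (0.6(0.6) + 0.4(0.4)) = 0.36 / 0.52, or about 69%, while facing a 50% team the formula simply returns 60%.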
Thus the idea is to apply log5 to tennis serving. Yes, this is inherently a bad assumption for tennis, but the question is how bad? The problem, as can be seen on the graph above, comes from treating Team A as the server and Team B as the returner. Team B’s percentage will rarely ever be above 50%, which means that the winning percentage log5 produces for Team A will be higher than Team A’s true winning percentage. Averaged over all servers and returners, this violates the definition of the average win % for Team A. Thus there is a problem here. However, using log5 does account for the quality of both server and returner, so while it is likely underestimating the number of breakpoints overall, it probably isn’t biasing the result. Moreover, since this is only the first use of this new idea, I will certainly be revisiting this as I improve it over the course of the next few majors and posts.
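The size of this effect is easy to see with a small numeric check. The two percentages below are hypothetical round numbers, not scraped stats: a server who wins 65% of service points, facing a returner who wins a field-average 35% of return points.

```python
def log5(p_a, p_b):
    """Bill James's log5 estimate that A beats B, given each side's win rate."""
    num = p_a * (1 - p_b)
    return num / (num + p_b * (1 - p_a))

server_win = 0.65      # hypothetical: server wins 65% of service points
average_return = 0.35  # hypothetical: an average returner wins 35% on return

# Against a 50% opponent, log5 hands back the server's own percentage.
vs_half = log5(server_win, 0.5)

# Against an average returner, who sits well below 50%, log5 pushes the
# server's chance above 65% even though the matchup is an average one.
vs_average = log5(server_win, average_return)

print(f"vs 50% opponent:     {vs_half:.3f}")
print(f"vs average returner: {vs_average:.3f}")
```

Under these assumed inputs the server's chance against a perfectly average returner comes out around 78% rather than 65%, which is the direction of error that would make breaks of serve too rare.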
First, I scraped data for each player in the tournament from the player’s ATP page from last year on clay; an example of Djokovic’s page is here. If the player had no stats, I went back as far as 5 years on clay until they did have stats. If they still had no stats or had empty stats, I set them to the 33rd percentile. Second, I scraped the seeding and actual draw from the Roland Garros website. From that point I simulated 100,000 tournaments, with the results below for players winning more than 1.5% of tournaments. Quickly: 68 of the 128 players won in at least 1 simulation, and David Goffin was the highest ranked player to not win a single simulation.
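The fallback for players without stats could be sketched like this. It assumes the 33rd percentile is taken over the players who do have a value for that stat, which the text does not spell out; `fill_missing` and the player names in the usage note are hypothetical.

```python
from statistics import quantiles

def fill_missing(stat_by_player, percentile=33):
    """Replace missing values (None) for one stat with a percentile of the rest.

    Needs at least two observed values to compute the percentile.
    """
    observed = [v for v in stat_by_player.values() if v is not None]
    # quantiles(..., n=100) returns the 99 percentile cut points, so index
    # percentile - 1 is the requested percentile.
    cut = quantiles(observed, n=100, method="inclusive")[percentile - 1]
    return {name: (cut if value is None else value)
            for name, value in stat_by_player.items()}
```

For example, `fill_missing({"Player A": 0.70, "Player B": 0.60, "Player C": None})` leaves the first two values alone and gives Player C a value one third of the way from 0.60 to 0.70.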
As you can see, the top 3 players are exactly who Vegas chose, with Verdasco rounding out our top 4 instead of Wawrinka, who won in ~.05% of our simulations. The likely reason is that Wawrinka has a terrible first serve percentage on clay, and a model is only as good as its underlying stats.
Next, I simulated 100,000 tournaments with a randomized valid draw; the results are listed below. As you can see, the players most likely to win with a random draw are very similar to before, as the draw doesn’t mean as much when the winner has to win 7 matches to win the tournament.
In this case Simone Bolelli does even better than before, even though in reality he lost in the first round. This can likely be explained by his not playing much on clay in 2015: one of his only performances was as runner-up in doubles at Monte Carlo, which is likely inflating his stats. Also, once again David Goffin is the highest ranked player not to win a simulation, and this time 70 of 128 players win a simulation.
However, the goal of running the simulation without the actual draw was to compare it to the results from the actual draw, which lets us see which players had the easiest and the toughest draws. As clearly shown, Djokovic had a rather tough draw, with a likely matchup against Nadal pending in the semis. In fact, it seems that Nadal and Djokovic were the 2 players most affected by the draw. Yet Nadal had some early-round cakewalks and was likely going to have to beat both Murray and Djokovic anyway, so the draw he got in the earlier rounds seems to have actually helped him. Djokovic, on the other hand, had the possibility of facing only the winner of Nadal/Murray, but was drawn into playing both, not to mention an early-round gauntlet of Roberto Bautista Agut and Ferrer or Berdych. However, the most surprising omission from this list is Andy Murray. He should be very thankful for a draw in which he avoided both Djokovic and Nadal, and yet he is only slightly positive.
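The comparison itself amounts to subtracting each player's title share under random draws from their share under the actual draw. A minimal sketch, with a hypothetical function name and made-up counts in the usage note:

```python
def draw_effect(actual_wins, random_wins, n_actual, n_random):
    """Title share under the actual draw minus title share under random draws.

    actual_wins / random_wins: dicts mapping player -> simulated titles.
    Positive values suggest the actual draw helped; negative, that it hurt.
    """
    players = set(actual_wins) | set(random_wins)
    return {
        player: actual_wins.get(player, 0) / n_actual
                - random_wins.get(player, 0) / n_random
        for player in players
    }
```

With invented counts such as `draw_effect({"X": 300}, {"X": 400}, 1000, 1000)`, player X comes out at about -0.10, meaning the actual draw cost them roughly ten percentage points of title chances.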
Yet despite all of the stats, there is nothing to account for injury or age, as we lost Federer before the tournament was even drawn, and two of our top competitors, Nadal and Tsonga, during the 3rd round. Thus the matchup between our 2 likely winners never came to fruition, much to the chagrin of tennis fans everywhere.
Lastly (since this post is a week too late), who do we predict to win from this point? Running 10,000 trials of the final 4, we get that Djokovic is far and away the favorite, winning 61% of the time. Murray is 2nd, winning 31% of the matchups, with Thiem and Wawrinka both winning ~4% of the time. Basically, look to see Djokovic win his first French Open title and complete his career Grand Slam.