By David Roher
Update: An astute commenter below pointed out a key mathematical error that I made. The corrected values on the table are in italics.
It seems that every game in a short series gets classified as either a “must-win” game or a “swing” game. The latter usually comes whenever a series is tied, be it 1-1, 2-2, or even 0-0, while the former comes whenever one team must win to still have a good shot at winning the series. (The deciding game 5 or 7, of course, fits both of these descriptions and thus sends sportswriters around the country into fits of joy.)
Tonight’s upcoming Game 3 of the World Series features the Yankees and Phillies tied 1 game apiece, making it a “swing” game. There’s no doubt that this game will be of much importance in deciding the eventual winner, but how much more so than any other game? It’s the World Series. Every game is important, and eventually we just run out of different ways to say how important each one is.
That’s not to say it couldn’t be more important, though. Game 3 might indeed be the pivot point of the series. It could also be that the concept is overblown. I’ll try to figure out the truth mathematically, after the jump.
This is the way I’m going to frame the definition of the swing game: there is a series that has already ended between two teams. You know nothing at all about this series except its maximum length (e.g. best of 7) and that the two teams playing each other each had a 50% chance of winning any given game in the series. Some guy in an inflatable sumo wrestler costume (it’s Halloween, after all) comes up to you and makes you guess the winner of the series. He’ll give you one additional piece of information: you give him a game number, and he’ll tell you who won it (or that no one did, if it wasn’t played). What game result should you ask for to maximize your chances of picking the right team?
From there, the easiest way to think of the importance of a single game is by finding the average probability that the winning team from that game has of winning the entire series. The problem with this definition is that the answer is obvious – it has to be Game 7, since the winner of that game is guaranteed to win it all. That’s not necessarily what we’re looking for here: if the series gets to the final game, then obviously that game will be the most important. But words like “swing” and “pivotal” imply that we’re more interested in the path that the series takes. In the hypothetical example above, if we guess a game that might not have happened, we run the risk of not getting any information out of it.
A way to fix this is by multiplying that increase in probability by the probability that the game will occur in the first place. Then we’ll add half the probability that the game does NOT occur to model the fact that we still have a 50% chance of guessing correctly with no additional information. We can figure out the probability of a series lasting at least x games pretty easily through binomial distributions. Here are those values for a 7-game series with equal teams:
At least 1-4 games: 100%
At least 5 games: 87.5% (chance that a team has won either 1, 2, or 3 games out of 4)
At least 6 games: 62.5% (team has won either 2 or 3 games through 5)
7 games: 31.25% (team has won exactly 3 games out of 6)
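These values can be checked with a short Python sketch. It assumes, as above, that every game is an independent coin flip; the function name `prob_game_played` and its parameters are illustrative and not part of the original calculation.

```python
from math import comb

def prob_game_played(game, wins_needed=4, p=0.5):
    """Probability that game number `game` is played in a first-to-`wins_needed` series."""
    played = game - 1  # games completed before this one
    # The series is still alive if neither team has reached `wins_needed` wins,
    # so Team A's win total k must satisfy both k < wins_needed and
    # (played - k) < wins_needed.
    lowest = max(0, played - (wins_needed - 1))
    highest = min(played, wins_needed - 1)
    return sum(
        comb(played, k) * p**k * (1 - p) ** (played - k)
        for k in range(lowest, highest + 1)
    )

for g in range(1, 8):
    print(g, prob_game_played(g))
```

Games 1 through 4 come out to 1.0, and games 5, 6, and 7 come out to 0.875, 0.625, and 0.3125, matching the list above.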
We can use these flat values for games 1 and 7, since the series is guaranteed to be tied going into those games. Games 2 and 6 present a problem: in game 6, for instance, the winning team can either have been up or down 3-2. If they were up, their chance would be 100% after the game. If they were down, it would only be 50%. This is easily correctable, though – just take the average. Games 3, 4, and 5 present this problem and one more – there are 2 configurations each of the series situation. Game 3 can either be 1-1 or 2-0, for example. We take another average here, but it will be weighted for the probability of each situation within the game.
Now that we have those numbers, we can figure out how important a single game is. I have the complete results split by situation below, but here’s the main result:
Games 1, 2, 3, 4, 5, 6, and 7: 65.625% chance that we’ll get the winner of the series by basing our answer on the outcome.
Games 1-4 being equal make some sense. But the idea that all 7 games come out to be the same is really weird when considering the disparate math. For example, in Game 4, we get to the result by finding the weighted average of the series win probability for a team winning after 3-0 (12.5%), 2-1 (37.5%), 1-2 (37.5%), 0-3 (12.5%). Game 7 is modeled by taking the probability that it occurs and adding half the probability that it does not occur…and each result is exactly .65625.
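One way to double-check that result is brute force rather than weighted averages. The sketch below is not the method used for the table: it enumerates all 128 equally likely win/loss sequences for a best-of-7 between equal teams, stops each one when a team reaches 4 wins, and scores the sumo-wrestler guess for a chosen game. The function name `p_correct` is made up for this illustration.

```python
from fractions import Fraction
from itertools import product

def p_correct(game, wins_needed=4):
    """Chance of naming the series winner from the result of game number `game`."""
    n = 2 * wins_needed - 1  # maximum series length
    total = Fraction(0)
    for outcome in product("AB", repeat=n):
        wins = {"A": 0, "B": 0}
        played = 0
        series_winner = None
        for w in outcome:
            wins[w] += 1
            played += 1
            if wins[w] == wins_needed:
                series_winner = w
                break  # later entries in `outcome` are never actually played
        if game <= played:
            # The game happened: guess its winner.
            total += 1 if outcome[game - 1] == series_winner else 0
        else:
            # The game was never played: the guess is a coin flip.
            total += Fraction(1, 2)
    return total / 2**n

for g in range(1, 8):
    print(g, p_correct(g), float(p_correct(g)))
```

Every game comes out to 21/32, which is 0.65625. The enumeration also hints at why: if the unplayed games are imagined as being played out anyway, the series winner is simply whoever takes at least 4 of the 7, and no game number is special in that count.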
Even that would make sense if they were all the same, I suppose. But I can’t figure out for the life of me why Game 5 would be different. (Update: It’s not, I just messed up the math before. See Matt Agard’s comments for details). I’m interested in trying this for a best-of-5 series, trying to generalize for all lengths, and also answering the “swing” question in a different way (including more knowledge of what’s going on in the series, for example).
Situation | Winner’s P | P of Game | P of Situation | P Correct |
G1, Total | 0.65625 | 1 | 1 | 0.65625 |
G2, 1-0 | 0.8125 | 1 | 0.5 | 0.40625 |
G2, 0-1 | 0.5 | 1 | 0.5 | 0.25 |
G2, Total | 0.65625 | 1 | 1 | 0.65625 |
G3, 2-0 | 0.9375 | 1 | 0.25 | 0.234375 |
G3, 1-1 | 0.6875 | 1 | 0.5 | 0.34375 |
G3, 0-2 | 0.3125 | 1 | 0.25 | 0.078125 |
G3, Total | 0.65625 | 1 | 1 | 0.65625 |
G4, 3-0 | 1 | 1 | 0.125 | 0.125 |
G4, 2-1 | 0.875 | 1 | 0.375 | 0.328125 |
G4, 1-2 | 0.5 | 1 | 0.375 | 0.1875 |
G4, 0-3 | 0.125 | 1 | 0.125 | 0.015625 |
G4, Total | 0.65625 | 1 | 1 | 0.65625 |
G5, 3-1 | 1 | 0.875 | 2/7 | |
G5, 2-2 | 0.75 | 0.875 | 3/7 | |
G5, 1-3 | 0.25 | 0.875 | 2/7 | |
G5, Total | 0.678571 | 0.875 | 1 | 0.65625 |
G6, 3-2 | 1 | 0.625 | 0.5 | 0.5 |
G6, 2-3 | 0.5 | 0.625 | 0.5 | 0.34375 |
G6, Total | 0.75 | 0.625 | 1 | 0.65625 |
G7, Total | 1 | 0.3125 | 1 | 0.65625 |
There are a lot of things I purposely ignored here, but one main thing is that if a team won game 1, it’d be better to adjust their prob. of winning a single game to more than 50%, and that might change the results a bit. But in the results here, the games are purposely independent of each other.
David, really interesting post. Looking forward to a good discussion from the group, especially from our stats guys.
Using your numbers, it looks like the Yankees will have an 87.5% chance if they win game 4, but only a 50% chance if they lose game 4.
I wonder how would this look if we used the Pythagorean Formula rather than assuming equal teams.
One last thought…it would be really interesting to look at these numbers using actual history, not binomial distribution. What is the role of momentum?
Has anyone here PLAYED baseball? If so, you know statistics are MEANINGLESS, because there is one very unpredictable and illogical element influencing outcome: Human Behavior.
It stands to reason that the winner of a series is the team that is able to manufacture enough runs to win 4 games out of 7. The statistics you should probably map out, to give this analysis any credibility, is the likelihood that Team A can score MORE than Team B’s average runs per game (you can do a home/road split, but it doesn’t matter much in the Bigs). If the Phillies average runs per game= (820/162) or ~5 runs a game, then it stands to reason that on any given day, they’re going to score five runs (you can even throw in standard deviations to give a wider range of values that would sufficiently ‘outscore’ the Phils). The Yanks RPG average is 5.68. That’s pretty close, but it’s safe to say the more these two teams play the closer the scores would be (there’s been only 1 game with a +4 result all series if I recall correctly). The one area where statistics MAY prove to be useful is in the indication of how many runs teams score in the late innings (7/8/9). The importance of this figure is derived from the fact that taking a lead late in the game reduces the opportunity for the opponent to score often and win, simply because the lead is taken so late in the game, there aren’t a sufficient number of outs within which to make up ground and surpass the opponent. No shock that teams that are prolific late inning scorers are usually in the playoffs and have success there as well (case in point, NY Yankees 2009).
If the exercise is “predict the Series winner with only the knowledge of the outcome of one game”, then it’s rather silly, because we’ve all seen teams race out front to lose in the end, or lose three in a row and come back and reel off 4 wins. The easy way to present this is to compare each game’s winner vs. the series winner, and one game will stand out more than the others in terms of determining a winner. I suspect however, that it won’t be a material advantage, and that it can safely be said that the winner of game 7 will absolutely be the series winner, and all other games are a crapshoot. The Yankees came out winners in 1996 after falling 0-2 and came out losers in 2001 down 0-2 and in 2004 up 3-0. That’s why you play the games, and that’s why the wisest man in baseball history said “It ain’t over ’til it’s over”. Any mathematical predictive model will find the human element simply too complex and inconsistent to allocate one game as being the swing or pivotal game. In Oct/Nov, the games are ALL PIVOTAL!
Thanks for sharing, nonetheless.
Hmmm
I somehow think you are missing the real concept behind the swing game rhetoric.
Wouldn’t that be that the winner of game X takes enough “momentum” with him to have a higher win probability for the remaining games?
So the “correct” way to check this would be to take the real games and play your game with each of them and calculate these numbers. Then you may see that for instance Game 4 has a much higher probability than your model predicts and therefore is a real “swing game”. I am not familiar enough with baseball history to conclude whether you can obtain a reasonable sample size from historic 7-game series, but I believe pure theoretical math is not going to solve the problem.
Kulko, I agree with you completely about using real results to find the answer. The reason I started out with this is because it’s a good jumping-off point and provides a frame of reference. By comparing the real results to the mathematical ones, we can not only see where things are different but also then figure out why they’re different.
Also, the results would likely be difficult to compile and calculate. I do plan on eventually running them, however. Here’s what I think we’ll find.
– One team will almost always be favored to win the series, which means that a team that won any given game will have been more likely to win the entire series. So the numbers will go up for all the games (besides 7, which can’t go up any more).
– Also due to one team’s being more likely to win, the probabilities of game 7 being played will almost certainly go down from 31.25%.
– However, due to variations in starting pitcher matchups making it perhaps more likely for a team to win at least one game, game 5 will probably be played more often. My guess is that game 6, being in the middle of 5 and 7, will stay roughly the same.
– Given that I think game 5 will be played more often, my guess is that we’ll see that game also be the swing game by real results as well, but by a larger margin than it is now.
What I did above was purposefully abstract – I believe that real results will give a more interesting answer than mine. Just an interesting side note, though – I was watching Game 5 on FOX and they showed the number of teams that have gone on to win the series after taking a 3-1 lead. According to binomial distribution, it should have been 87.5%. The real results? 87.8. That’s just one instance, but I think it illustrates why going to the math first can be interesting.
Thanks for reading!
The reason you’ve found game five to have a greater value than all the other games is you’ve miscalculated the probabilities of the various game states. Let me demonstrate:
Team A playing Team B
1/16 of the time Team A has beaten Team B before the 5th game
1/4 of the time Team A leads B 3-1 before fifth game
3/8 of the time it is tied 2-2 before the 5th game.
1/4 of the time A is behind 3-1 to B
1/16th of the time B has beaten A before the fifth game.
Therefore your appropriate P of Sit is not (.25, .5, .25) but instead (2/7, 3/7, 2/7)
Oh, and as a follow up, that correction shows all games to be equal value on expectation.
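For anyone who wants to verify the correction, here is a minimal Python sketch. It assumes equal teams and independent games, and the names `game5_weights` and `winner_p` are invented for this example. Because Game 5 itself is a coin flip, the Game 5 winner's record going in has the same distribution as Team A's record.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def game5_weights():
    """Distribution of Team A's record after four games, given that Game 5 is played."""
    # Count the 16 equally likely four-game sequences by Team A's win total.
    counts = Counter(seq.count("A") for seq in product("AB", repeat=4))
    # Drop the sweeps (4-0 and 0-4): those series never reach Game 5.
    alive = {a: n for a, n in counts.items() if 1 <= a <= 3}
    total = sum(alive.values())  # 14 surviving sequences
    return {f"{a}-{4 - a}": Fraction(n, total) for a, n in alive.items()}

weights = game5_weights()

# Series win probability for the Game 5 winner, by record entering the game
# (the 1, 0.75, and 0.25 values from the table in the post).
winner_p = {"3-1": Fraction(1), "2-2": Fraction(3, 4), "1-3": Fraction(1, 4)}
avg_winner_p = sum(weights[s] * winner_p[s] for s in weights)

# Game 5 is played 7/8 of the time; otherwise the guess is a coin flip.
p_correct = Fraction(7, 8) * avg_winner_p + Fraction(1, 8) * Fraction(1, 2)
print(weights)
print(avg_winner_p, p_correct)
```

The weights come out to 2/7, 3/7, and 2/7, the weighted winner's probability is 19/28, and the final value is 21/32 (0.65625), the same as every other game.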
Absolutely right Matt, that’s where the error is. I’ll update it ASAP.
Great piece of work.
While you certainly answer the question you set out to answer – the one with the guy in the inflatable sumo wrestler outfit – I wonder if there’s a bigger question.
Surely the notion of a “swing” game can be quantified by looking at the difference in probability of winning the whole series, based on the result of one match. So at 1-0, if you win, you have a 0.8125 chance of winning the series, but if you lose, it’s 0.5. So this can “swing” by 0.3125, depending on the result of this match.
At 2-0, the difference between winning and losing is 0.25.
Then it’s possible to truly say at what score the game may be considered a “swing” game (obviously 3-3 is the ultimate “swing” game, as you say).
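This version of the question is easy to sketch in Python. The code below assumes equal teams and independent games, as in the post; `series_win_prob` and `swing` are hypothetical names chosen for the illustration. A state (a, b) means the team in question has a wins and its opponent has b.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def series_win_prob(a, b, wins_needed=4, p=0.5):
    """Probability that a team with `a` wins beats an opponent with `b` wins."""
    if a == wins_needed:
        return 1.0
    if b == wins_needed:
        return 0.0
    # Win the next game with probability p, lose it with probability 1 - p.
    return (p * series_win_prob(a + 1, b, wins_needed, p)
            + (1 - p) * series_win_prob(a, b + 1, wins_needed, p))

def swing(a, b, wins_needed=4, p=0.5):
    """Gap in series win probability between winning and losing the next game."""
    return (series_win_prob(a + 1, b, wins_needed, p)
            - series_win_prob(a, b + 1, wins_needed, p))

print(swing(1, 0))  # 0.3125
print(swing(2, 0))  # 0.25
print(swing(3, 3))  # 1.0

# Every score at which another game is still to be played, largest swing first.
states = [(a, b) for a in range(4) for b in range(4)]
for a, b in sorted(states, key=lambda s: swing(*s), reverse=True):
    print(f"{a}-{b}: {swing(a, b):.4f}")
```

Under these assumptions the tied scores late in the series carry the largest swings (1.0 at 3-3, 0.5 at 2-2), while a 3-0 or 0-3 score carries the smallest (0.125).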