By Shuvom Sadhuka
Every year, the MLB showcases its best players in the All-Star game. The game is never without controversy, as certain players who seem to deserve the honor are left off the league rosters (“snubs”), whereas others having not-so-stellar seasons find a way into the game. This is in part due to how the teams are selected: fans vote for players they think deserve the honor, and the player with the most votes at each position earns a starting spot on that league’s team. The players then vote for their peers to fill out the reserve rosters.
As one may expect, fans often “stuff the ballot box” for home-team players, such as in the notorious 2015 All-Star game, when the American League starting roster featured four players from the Kansas City Royals (at one point in balloting, eight of nine players were Royals).
The overall impact may seem marginal, especially given that a player who misses out on the starting lineup still has a shot at the reserve lineup, which is decided by peers. Nonetheless, it’s worth investigating the structure of the All-Star game starting rosters. In particular:
- Are there significant differences between players who were All-Star starters and the rest of the league? On which statistics do they differ most?
- Which players who weren’t named All-Star starters most deserved to be starters (“snubs”)?
- Which players who were named All-Star starters least deserved to be starters (“fairies”)?
To conduct this analysis, I used pybaseball, an open-source Python library for baseball statistics, to pull statistics on the first half of each season for every position player in the MLB, starting from 2008 (the farthest back pybaseball’s database goes). Players who played fewer than 45 games or recorded fewer than 135 at-bats were removed. I then marked all players as All-Star starters or not based on Baseball Reference data. An example entry looks like this:
While 2008 is an arbitrary cutoff enforced by the pybaseball package, it doesn’t necessarily compromise our analysis, for two reasons: (1) online voting has dramatically changed fan voting, so 2008 seems like a reasonable starting point, and (2) the criteria fans use to judge players have also changed; some fans nowadays care about WAR, for example, which wasn’t the case in the 1950s. Nonetheless, we should be mindful of the limitations of our data.
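The filtering and labeling steps described above can be sketched roughly as follows. This is only an illustration, not the code used for the article: the column names (`Name`, `Year`, `G`, `AB`), the helper names, and the four rows are hypothetical stand-ins for a real first-half pull (for instance from pybaseball's `batting_stats_range`), and the starter set stands in for a list compiled from Baseball Reference.

```python
import pandas as pd

MIN_GAMES = 45      # thresholds quoted in the text
MIN_AT_BATS = 135


def filter_qualified(stats: pd.DataFrame) -> pd.DataFrame:
    """Drop players with fewer than 45 games or fewer than 135 at-bats."""
    qualified = (stats["G"] >= MIN_GAMES) & (stats["AB"] >= MIN_AT_BATS)
    return stats.loc[qualified].reset_index(drop=True)


def label_starters(stats: pd.DataFrame, starters: set) -> pd.DataFrame:
    """Add a 0/1 `starter` column from a set of (name, year) pairs."""
    labeled = stats.copy()
    keys = zip(labeled["Name"], labeled["Year"])
    labeled["starter"] = [int(key in starters) for key in keys]
    return labeled


# Hypothetical rows standing in for a first-half stats pull.
first_half = pd.DataFrame(
    {
        "Name": ["Player A", "Player B", "Player C", "Player D"],
        "Year": [2015, 2015, 2015, 2015],
        "G": [80, 44, 60, 85],
        "AB": [310, 150, 120, 330],
    }
)
starters = {("Player A", 2015)}  # hypothetical starter list

dataset = label_starters(filter_qualified(first_half), starters)
print(dataset)
```

Here Player B is dropped for playing too few games and Player C for too few at-bats, leaving one starter and one non-starter.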
To start, let’s look at how All-Star starters and the rest of the league differ across various statistics (all stats are converted to rates to control for All Stars having more plate appearances, so HR is actually HR/Plate Appearance):
Figure 1: Percentage Difference (where 1.0 represents 100% difference) between All-Star Starters and Rest of League across various statistics. Note that All-Star starters were intentionally walked at a far greater rate than others.
We see a large difference in intentional walks — All-Star starters are intentionally walked nearly 150% more than average — which makes sense, as pitchers intentionally walk batters when they believe they have better chances against the next batter (indicating the current batter is stronger). Interestingly, All-Star starters had fewer sacrifice hits than the rest of the league, perhaps because they got their RBIs while also getting on base (as opposed to sacrifice hits). All-Star starters also struck out at a lower rate, as expected.
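To make the rate conversion and the percentage difference concrete, here is a minimal sketch. The column names (`PA`, `HR`, `IBB`, `starter`), the function name, and the four rows are invented for illustration, and using the rest-of-league mean as the baseline is an assumption that matches the Figure 1 caption rather than a confirmed detail of the original analysis.

```python
import pandas as pd


def percent_difference(stats, columns, pa_col="PA", label_col="starter"):
    """Relative gap in per-PA rates: (starter mean - rest mean) / rest mean."""
    rates = stats[columns].div(stats[pa_col], axis=0)  # counts -> per-PA rates
    is_starter = stats[label_col] == 1
    starter_mean = rates[is_starter].mean()
    rest_mean = rates[~is_starter].mean()
    return (starter_mean - rest_mean) / rest_mean


# Hypothetical players: two starters, two non-starters.
toy = pd.DataFrame(
    {
        "PA": [400, 300, 200, 400],
        "HR": [20, 15, 5, 10],
        "IBB": [8, 6, 2, 4],
        "starter": [1, 1, 0, 0],
    }
)

diff = percent_difference(toy, ["HR", "IBB"])
print(diff)
```

In this toy data the starters hit home runs and draw intentional walks at twice the rate of everyone else, so both values come out to about 1.0, which is a 100% difference on the scale used in Figure 1.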
We can formalize these ideas in a table using a two-sample t-test. The t-test assesses how plausible it is that two groups share the same underlying mean for a given statistic. The p-value indicates how often we would expect to see a difference at least as large as the one we observed if both groups really did come from the same model.
For example, if we get a p-value of 0.01 for the difference in home runs between All-Star starters and everyone else, then if All-Star starters and other players both had the same probability of hitting home runs, we’d expect to observe a difference at least as large as the one we see between the two groups about 1% of the time.
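As a rough illustration of the test, the sketch below runs SciPy's `ttest_ind` on simulated home-run rates. The numbers are synthetic, not the article's data, and passing `equal_var=False` (Welch's variant, which does not assume equal variances) is my assumption; the text only says "two-sample t-test".

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic HR/PA rates: a small group of "starters" and a larger "rest".
starters_hr_rate = rng.normal(loc=0.045, scale=0.012, size=200)
rest_hr_rate = rng.normal(loc=0.028, scale=0.012, size=2000)

# Welch's two-sample t-test on the two groups of rates.
result = stats.ttest_ind(starters_hr_rate, rest_hr_rate, equal_var=False)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3g}")
```

Because the simulated groups have clearly different means, the p-value is tiny, which mirrors the pattern the article reports for the power-hitting statistics.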
Here are some p-values from our two-sample t-test (we convert all stats like home runs to rates like home runs per plate appearance to control for All Stars having more plate appearances):
These are some ridiculously low p-values, but this is exactly what we would expect: All-Star starters are supposed to be the best of the best, so we would expect them to hit home runs, drive in runs, and get on base at a significantly higher rate. It’s worth noting, though, that voters don’t seem to care as much about speed (stolen bases had a p-value of 0.39) or age (perhaps countering the notion that the All-Star game features well-past-their-prime players who find a way in because of generous fans).
Figure 2: Age Distribution of All-Star starters (red) vs rest of MLB (blue)
To find the snubs and fairies, we need a method to predict All-Star starters. Since being an “All-Star” is a subjective notion, any metric we use, such as WAR, VORP, or HRs, is subject to our own biases. Our goal is instead to see who the fans left out by their own metrics. To do so, we train three machine learning models — Logistic Regression, k-Nearest Neighbors (kNN), and Random Forest — to predict whether a player is an All-Star starter or not. In each case, we train the model on 80% of the data, make predictions on the remaining 20%, and repeat this five times to ensure every point in the dataset also gets a prediction (except in Logistic Regression, in which case the model is trained on all the data).
Note that this means our “snubs” and “fairies” list relies on the imprecision of the model: if our model were perfect and classified each player as a starter or not correctly, then there would be no snubs or fairies! If we visualize each player as a 0 or 1 in this space defined by various statistics (each axis is a different statistic, like HRs, RBIs, ABs, etc.), then what we’re looking for is essentially 0s surrounded by a lot of 1s (snubs) and 1s surrounded by a lot of 0s (fairies).
Figure 3: We want to find a model to distinguish the All-Star starters (orange) from the rest (grey). Here we plot an example with two statistics: HR/PA and Batting Average. As we add more and more statistics, our model will try to distinguish between the two clusters of data (this plot will become multi-dimensional in our model).
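The procedure above (train on 80%, predict the held-out 20%, repeat five times) reads like five-fold cross-validation, so one way to sketch it is with scikit-learn's `cross_val_predict`. Everything here is illustrative: the data are synthetic, the hyperparameters and feature scaling are my choices, and for simplicity all three models get out-of-fold predictions, whereas the article trains logistic regression on all the data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def out_of_fold_predictions(X, y, seed=0):
    """Give every player a prediction from a model that never saw him."""
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    models = {
        "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
        "rf": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    return {name: cross_val_predict(m, X, y, cv=folds) for name, m in models.items()}


def unanimous_disagreements(y, predictions):
    """Snubs: all models say starter, label says no. Fairies: the reverse."""
    stacked = np.column_stack(list(predictions.values()))
    all_say_starter = stacked.all(axis=1)
    none_say_starter = ~stacked.any(axis=1)
    snubs = np.flatnonzero(all_say_starter & (y == 0))
    fairies = np.flatnonzero(none_say_starter & (y == 1))
    return snubs, fairies


# Synthetic two-feature data (HR/PA, AVG): 300 non-starters, 40 starters.
rng = np.random.default_rng(0)
X = np.vstack(
    [
        rng.normal([0.025, 0.255], [0.010, 0.025], size=(300, 2)),
        rng.normal([0.045, 0.300], [0.010, 0.025], size=(40, 2)),
    ]
)
y = np.concatenate([np.zeros(300, dtype=int), np.ones(40, dtype=int)])

predictions = out_of_fold_predictions(X, y)
snubs, fairies = unanimous_disagreements(y, predictions)
print(len(snubs), "snubs,", len(fairies), "fairies")
```

Requiring all three models to agree is a deliberately conservative filter: a player only counts as a snub or fairy when three quite different classifiers independently contradict the fan vote.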
Let’s start by looking at the players that all three models agree were snubs (there are a lot: 20+!):
No Love: Players Classified as Snubs by KNN, RF, and Logistic Regression
|Player (Team)||Year||HR||RBI||AVG|
|Paul Goldschmidt (Arizona Diamondbacks)||2018||18||48||.274|
|Melky Cabrera (San Francisco Giants)||2012||7||39||.354|
|Martin Prado (Atlanta Braves)||2010||7||36||.355|
|Carl Crawford (Tampa Bay Rays)||2010||7||40||.316|
|Jose Altuve (Houston Astros)||2015||7||35||.302|
We can also look at this from another perspective; whereas kNN and Random Forest are pure classification algorithms, logistic regression outputs a probability that a player is an All-Star starter. If this probability is greater than 0.5, we say he’s an All-Star starter.
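A minimal sketch of this probability-based ranking is below. It follows the text in fitting logistic regression on all the data and using a 0.5 cutoff, but the data are synthetic, the function name is hypothetical, and the feature scaling step is my own addition rather than something the article describes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def rank_by_starter_probability(X, y, threshold=0.5):
    """Fit on all data, then rank the players the model disagrees with."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X, y)
    proba = model.predict_proba(X)[:, 1]  # P(starter) for each player
    predicted_starter = proba > threshold
    snubs = np.flatnonzero(predicted_starter & (y == 0))
    fairies = np.flatnonzero(~predicted_starter & (y == 1))
    snubs = snubs[np.argsort(-proba[snubs])]      # most starter-like first
    fairies = fairies[np.argsort(proba[fairies])]  # least starter-like first
    return proba, snubs, fairies


# Synthetic two-feature data (HR/PA, AVG): 300 non-starters, 40 starters.
rng = np.random.default_rng(1)
X = np.vstack(
    [
        rng.normal([0.025, 0.255], [0.010, 0.025], size=(300, 2)),
        rng.normal([0.045, 0.300], [0.010, 0.025], size=(40, 2)),
    ]
)
y = np.concatenate([np.zeros(300, dtype=int), np.ones(40, dtype=int)])

proba, snubs, fairies = rank_by_starter_probability(X, y)
print("top snub probabilities:", np.round(proba[snubs][:5], 3))
print("lowest fairy probabilities:", np.round(proba[fairies][:5], 3))
```

Sorting by probability gives a ranking within each group, which is what the snub and fairy tables built on logistic regression rely on.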
No Love: Snubs with Highest Logistic Regression Probability to Be an All-Star Starter
|Player (Team)||Year||HR||RBI||AVG||All-Star Starter Probability||Lost Out To (which player was the starter?)|
|Victor Martinez (Detroit Tigers)||2014||21||55||.328||89.3%||Nelson Cruz (Baltimore Orioles)|
|Justin Morneau (Minnesota Twins)||2009||20||67||.320||85.9%||Mark Teixeira (New York Yankees)|
|Miguel Cabrera (Detroit Tigers)||2011||17||67||.324||85.8%||Adrian Gonzalez (Boston Red Sox)|
|Miguel Cabrera (Detroit Tigers)||2012||18||56||.323||85.1%||Prince Fielder (Detroit Tigers)|
|Joey Votto (Cincinnati Reds)||2017||24||61||.312||83.4%||Ryan Zimmerman (Washington Nationals)|
Some of these snubs are truly head-scratching. Rafael Devers’ exclusion from the All-Star roster, both starting and reserve, for example, was hotly contested in MLB circles this year. Others, however, just took a backseat to other outstanding players at their position. First basemen are generally renowned for their batting abilities and don’t have many defensive responsibilities, so it’s no surprise that four of the five logistic regression snubs were first basemen, who all took a backseat to other first basemen having excellent seasons.
Moreover, all three of our models agreed that the Tigers, Diamondbacks, and Blue Jays had the most snubs, so get on it Detroit, Arizona and Toronto! Support your players!
And who were the biggest fairies? Well, our models agree on a lot of them — 71 to be precise! Here’s a list of 5:
Free Pass: Players Classified as ‘Fairies’ by KNN, RF, and Logistic Regression
|Player (Team)||Year||HR||RBI||AVG|
|Jackie Bradley Jr. (Boston Red Sox)||2016||13||53||.294|
|Chase Utley (Philadelphia Phillies)||2014||6||40||.286|
|Rafael Furcal (St. Louis Cardinals)||2012||5||32||.274|
|Alcides Escobar (Kansas City Royals)||2015||2||28||.277|
|Joe Mauer (Minnesota Twins)||2010||3||34||.310|
Free Pass: Fairies with Lowest Logistic Regression Probability to Be an All-Star Starter
|Player (Team)||Year||HR||RBI||AVG||All-Star Starter Probability|
|Scott Rolen (Cincinnati Reds)||2016||4||32||.256||0.8%|
|Derek Jeter (New York Yankees)||2014||2||21||.268||1.1%|
|Dan Uggla (Atlanta Braves)||2012||11||43||.229||1.1%|
|Yadier Molina (St. Louis Cardinals)||2015||5||25||.278||1.3%|
|Salvador Perez (Kansas City Royals)||2010||13||34||.263||1.8%|
And which teams had the most fairies? The St. Louis Cardinals, with a “whopping” 6 over the past 11 years, so congrats Cardinals fans? And sorry Royals fans, even your brave efforts in 2015 weren’t quite enough…
Nevertheless, there are a few limitations of our models worth discussing.
- Our models don’t account for positional variation in the MLB. For example, it could be that a few years had exceptionally strong shortstops (e.g. Derek Jeter and Nomar Garciaparra) and only one would count as an All-Star starter, which biases our models.
- We also don’t account for year-to-year variation in the MLB. It could be the case that offensive production dipped across the MLB in certain years, which could mean that players who were outstanding by that year’s standards but worse by overall standards made it into the starting roster.
- Our models don’t differentiate between positions or account for defense. Some positions require greater defensive prowess (catcher) and hence less offense, but our models don’t differentiate between a catcher and designated hitter.
There are various methods to possibly correct for these limitations. For example, we could try to adjust for the quality of the season; if home runs were up across the MLB in 2015, for example, we would want to make each home run count less. We could try standardizing each statistic by year, but this also runs into problems: do we standardize with respect to the entire league for that year? Do we standardize only with respect to players who meet a certain baseline (i.e. do we want to standardize home runs and include a bunch of pinch hitters with lots of at-bats)? The general fear here is overfitting; if we adjust for year, position, park, etc., we may begin to model the noise in our dataset, although admittedly adding one of these may help our model.
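One of the corrections mentioned above, standardizing each statistic by year, might look like the sketch below. It takes the simplest option of standardizing against every player in the frame for that year, which is exactly the choice the paragraph questions. The column names (`Year`, `HR_rate`), the function name, and the rows are hypothetical.

```python
import pandas as pd


def standardize_by_year(stats, columns, year_col="Year"):
    """Add z-score columns computed within each season."""
    grouped = stats.groupby(year_col)[columns]
    # transform() broadcasts each season's mean/std back onto its rows
    z = (stats[columns] - grouped.transform("mean")) / grouped.transform("std")
    return stats.assign(**{f"{col}_z": z[col] for col in columns})


# Hypothetical HR/PA rates: 2015 is a higher-offense season than 2014.
toy = pd.DataFrame(
    {
        "Year": [2014, 2014, 2014, 2015, 2015, 2015],
        "HR_rate": [0.02, 0.03, 0.04, 0.04, 0.05, 0.06],
    }
)

standardized = standardize_by_year(toy, ["HR_rate"])
print(standardized)
```

In this toy data a 0.04 home-run rate is one standard deviation above the mean in 2014 but one below it in 2015, which is the kind of season-quality adjustment described above. Restricting the frame to qualified players before calling the function would implement the "certain baseline" alternative.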
It’s my belief, however, that having two exceptionally strong shortstops (or multiple exceptionally strong players in a year) still warrants labeling one as a snub — good hitters shouldn’t be penalized just because they happen to play the same position and same year as a slightly stronger player.
Nonetheless, our models showed that All-Star starters, for the most part, are indeed very good players, and while a few players get snubbed every year, fans on the whole select players with strong seasons.
If you have any questions for Shuvom about this article, please feel free to reach out to him at firstname.lastname@example.org