The following post is just part of a larger project on predicting pitcher injuries. Read the full paper here: Predicting Pitcher Injuries
By Kurt Bullard, Jake Meagher, and Declan Garvey
Over the past five years, over 23 percent of pitchers have been placed on the Disabled List (DL). We set out to create a logistic regression model that could help predict whether a pitcher would get injured the following season, based on traditional and advanced statistics from both the previous year and the pitcher’s entire career.
Data Collection:
We looked at pitcher data from 2010 to 2014 to match the respective injury data that we found from 2011 to 2015, since we were concerned with how last year’s usage and performance might affect next year’s injury risk. We wanted to look at pitcher-specific injuries, ones that came as a result of pitching stress put on certain parts of the body, as opposed to injuries that were “fluky.” For that reason, we only considered injuries involving the arm, shoulder, back, and side. Other injuries, we assumed, were not inherently related to pitching stress (e.g. gastrointestinal). In total, there were 3330 pitcher-seasons that met these initial criteria.
Our independent variables came from three different sources:
Baseball Reference: Strikeouts, Age, Dummy Variable for AL/NL, Games, Games Started, Dummy Variable for Starting Pitcher (Games Started > 0), Complete Games, Innings Pitched, Hits, Runs, BB, FIP, Batters Faced, Strike Percentage, and Career Batters Faced
Baseball Info Solutions Data (from Fangraphs): Percentage of Pitch Thrown and Average Velocity for the Following Pitches: Fastball, Cutter, Slider, and Curveball
Tommy John Database: A dummy variable signaling whether or not a pitcher had had Tommy John Surgery before
We did end up trimming some of the data that we did not find representative of a decent MLB pitcher. For one, we did not include three pitcher-seasons in which the pitcher made an appearance but did not record an out. With zero innings pitched, FIP (which divides by innings pitched) is undefined, which makes those seasons unusable. Also, if a pitcher doesn’t record an out the entire year, he’s more likely than not a mediocre talent who should not influence the model.
In addition, we did not include pitchers who recorded fewer than 10 innings in a season. We realize that this may be a bit problematic, in that some pitchers may have thrown fewer than 10 innings because they got injured early in the season rather than because they lacked talent; seven percent of these pitchers ended up getting injured. That being said, the overall injury rate for pitchers hovered around 23%, so most of the pitchers who pitched fewer than 10 innings were simply underutilized rather than injured. In addition, there was very little Baseball Info Solutions data for pitchers with few appearances, so including them would have distorted the estimated impact of pitch speed and selection. In the end, we were left with 2749 pitcher-seasons.
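The two trimming rules above can be sketched in a few lines of Python. This is only an illustration: the analysis itself was run in R, and the record layout here (a hypothetical "outs" field counting outs recorded in the season) is an assumption, not the structure of the original data set.

```python
def innings_pitched(outs):
    """Convert outs recorded to innings (three outs per inning)."""
    return outs / 3


def keep_season(season, min_innings=10):
    """Return True if a pitcher-season passes both trimming rules.

    `season` is a dict with a hypothetical "outs" key (an assumed layout).
    """
    outs = season["outs"]
    # Rule 1: drop seasons where the pitcher never recorded an out.
    if outs == 0:
        return False
    # Rule 2: drop seasons with fewer than `min_innings` innings.
    return innings_pitched(outs) >= min_innings


# Toy records, invented for illustration only.
seasons = [
    {"pitcher": "A", "outs": 0},
    {"pitcher": "B", "outs": 29},
    {"pitcher": "C", "outs": 30},
    {"pitcher": "D", "outs": 540},
]
kept = [s["pitcher"] for s in seasons if keep_season(s)]
```

The zero-out rule is redundant once the 10-inning cutoff is applied, but it is kept as a separate step to mirror the order in which the trimming is described.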
Transformations:
We first transformed a few variables from totals into rates for fear of collinearity with games played or batters faced. We put strikeouts, wild pitches, runs, hits, and walks over batters faced to eliminate collinearity.
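As a rough Python sketch of that conversion (the analysis was done in R, and the field names below are hypothetical, chosen only to resemble the labels in the variable key), each raw count is divided by batters faced:

```python
def per_batter_faced(season, count_fields=("so", "wp", "r", "h", "bb")):
    """Return a copy of one pitcher-season with per-batter-faced rates added.

    Field names are assumptions for illustration: "so" strikeouts,
    "wp" wild pitches, "r" runs, "h" hits, "bb" walks, "bf" batters faced.
    """
    bf = season["bf"]
    if bf <= 0:
        raise ValueError("batters faced must be positive")
    rates = dict(season)
    for field in count_fields:
        # e.g. "bb" becomes "bbbf": walks per batter faced
        rates[field + "bf"] = season[field] / bf
    return rates


# Toy season, invented for illustration only.
example = {"so": 150, "wp": 4, "r": 70, "h": 160, "bb": 50, "bf": 800}
rates = per_batter_faced(example)
```

Dividing by batters faced keeps a high-workload pitcher from looking extreme on every counting statistic at once, which is the collinearity concern described above.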
In addition to the rate transformations above, we also decided to transform a handful of other variables after examining their distributions:

Square root of batters faced

Square root of games played

Square root of wild pitches per batters faced

Square root of Curveball Percentage

Square root of Slider Percentage

Log of Cutter Percentage
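The transformations listed above can be sketched in Python as follows. The field names are hypothetical and the percentages are assumed to be stored as fractions between 0 and 1; neither detail is stated in the post. The 0.01 offset on the cutter term follows the variable key, which lists that variable as the log of (cutter percentage + 0.01), presumably so the log stays defined for pitchers who never throw a cutter.

```python
import math


def transform_skewed(season):
    """Apply the square-root and log transformations to one pitcher-season.

    Keys are assumed names: "bf" batters faced, "g" games, "wpbf" wild
    pitches per batter faced, and "*_pct" pitch-usage fractions.
    """
    out = dict(season)
    for field in ("bf", "g", "wpbf", "cb_pct", "sl_pct"):
        out[field] = math.sqrt(season[field])
    # Offset taken from the variable key; avoids log(0) for non-cutter pitchers.
    out["ct_pct"] = math.log(season["ct_pct"] + 0.01)
    return out


# Toy season, invented for illustration only.
raw = {"bf": 400, "g": 36, "wpbf": 0.0025,
       "cb_pct": 0.16, "sl_pct": 0.25, "ct_pct": 0.0}
transformed = transform_skewed(raw)
```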
The distributions for the remainder of the variables appeared to be approximately normal, so we found no need to transform them. Below is a key of the variables that we were considering, post-transformation:
Variable Label in R    Actual Variable (with transformations)
age                    Age
al                     Indicator for American League
g                      Square root of games played
sp                     Indicator for Starting Pitcher
cg                     Complete games
hbf                    Hits per batters faced
bbbf                   Walks per batters faced
bf                     Square root of batters faced
wpbf                   Square root of wild pitches per batters faced
strpct                 Strike percentage
totalbf                Total batters faced (career)
tj                     Indicator for Tommy John
pit                    Total pitches thrown (career)
fb.                    Fastball percentage
fbv.                   Average fastball velocity
sl.                    Square root of slider percentage
slv                    Average slider velocity
ct.                    Log of (cutter percentage + 0.01)
ctv                    Average cutter velocity
cb.                    Square root of curveball percentage
cbv                    Average curveball velocity
ch.                    Changeup percentage
chv                    Average changeup velocity
Results:
After running several stepwise regressions and then cross-validating them against each other, we ended up with the following model:
The log-odds of injury risk are negatively correlated with (from most to least significant, where significant relationships are bolded):

Complete Games with the Log of Batters Faced

Age

Strikeouts

Fastball Percentage

Tommy John With Fastball Velocity

Walks per batters faced and Tommy John

Strikeouts per batters faced

Walks per batters faced

Cutter Indicator

Fastball Velocity
The log-odds of injury risk are positively correlated with (from most to least significant, where significant relationships are bolded):

Total Batters Faced

Complete Games

Log of Batters Faced

Fastball Velocity with Fastball Percentage

Strikeouts per Batters Faced and Age

Tommy John

Fastball Percentage and Age

Fastball Percentage and Cutter Indicator

Strikeouts Per Batters Faced and Fastball Velocity
The most significant variables in this model make a lot of sense. In terms of positive correlations, complete games, total batters faced, and the log of the previous season’s batters faced all make sense because the model assigns higher risk to pitchers who throw a lot of pitches in individual games, seasons, and careers. The interaction of fastball velocity and fastball percentage also makes sense, since a faster fastball is more dangerous when you throw it more often. Lastly, having had Tommy John surgery points to a previous injury, which is a good indicator of getting injured once again.
The significant negative correlations also make sense in terms of the structure of the data set. Fastball percentage is negatively correlated with injury risk, since the fewer fastballs one throws, the more arm-straining off-speed pitches one throws. The best pitchers are the ones who record the most strikeouts and complete games, meaning they probably have sound mechanics, which would explain why strikeouts and the interaction of complete games with the log of batters faced are significant. The interaction of Tommy John and fastball velocity is significant, which could serve as an indicator of the level of healing from the last injury; perhaps pitchers who recover more fully throw faster upon return. Lastly, age is also negative in the regression, but that is most likely due to sample bias and is a slight shortcoming of our model: the pitchers who lasted to 32-plus were usually the really good ones who in all likelihood never experienced any devastating injury; meanwhile, a lot of really young pitchers get injured and never come back from it.
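To make the log-odds language concrete, here is a small Python sketch of how a logistic model turns a weighted sum of variables, including an interaction term, into an injury probability. The coefficients and intercept below are invented for illustration and are not the fitted values from our model; only their signs follow the lists above (Tommy John positive, fastball velocity negative, their interaction negative).

```python
import math


def sigmoid(z):
    """Map log-odds z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predicted_risk(intercept, coefs, x):
    """Logistic prediction: probability = sigmoid(intercept + sum of coef * value)."""
    z = intercept + sum(coefs[name] * x[name] for name in coefs)
    return sigmoid(z)


def features(tj, fbv):
    """Build main effects plus the Tommy John x fastball velocity interaction."""
    return {"tj": tj, "fbv": fbv, "tj_x_fbv": tj * fbv}


# Hypothetical coefficients: made-up magnitudes, NOT the fitted model.
coefs = {"tj": 0.9, "fbv": -0.02, "tj_x_fbv": -0.005}
intercept = 0.5  # also made up

risk_no_tj = predicted_risk(intercept, coefs, features(0, 92.0))
risk_tj = predicted_risk(intercept, coefs, features(1, 92.0))
```

With these made-up numbers, a prior Tommy John surgery raises the predicted risk at 92 MPH, while the negative interaction term means that increase shrinks as fastball velocity rises, which is the pattern the recovery interpretation above describes.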
Predictions for 2016:
Last year, 598 pitchers threw at least 10 innings. We had to remove 19 pitchers because they did not have Baseball Info Solutions data, which left 579 pitchers from last season who threw 10 or more innings and had the relevant velocity and pitch selection data. The following is the histogram of the injury risk predictions:
We predicted the average risk of a pitcher getting injured in 2016 to be 23.2%, which makes sense, since the five-year mean was in fact 23.4%. We would expect future years to hover around this value, since there hasn’t been a secular change in pitcher usage or philosophy.
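The comparison above amounts to checking the mean predicted risk against the historical injury rate. A minimal Python sketch of that check follows; the helper names are hypothetical and the short list of risks is a toy example, not our actual predictions.

```python
def mean_predicted_risk(risks):
    """Average of the per-pitcher predicted injury probabilities."""
    if not risks:
        raise ValueError("need at least one prediction")
    return sum(risks) / len(risks)


def calibration_gap(risks, historical_rate):
    """Mean predicted risk minus the observed historical injury rate."""
    return mean_predicted_risk(risks) - historical_rate


# Toy predictions, invented for illustration only.
toy_risks = [0.10, 0.20, 0.30]
toy_gap = calibration_gap(toy_risks, historical_rate=0.234)
```

Plugging in the figures from the post, a mean prediction of 23.2% against a five-year mean of 23.4% gives a gap of about 0.2 percentage points.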
The range of the model’s predictions went from a peak of 80.3% to a low of 2.8%. We present the top 10 most likely and least likely pitchers to get injured next season, respectively.
Top Ten
Bottom Ten
The pitcher most likely to get injured is Cleveland starter Josh Tomlin. He is just coming off of Tommy John surgery and throws only 53% fastballs, which was in the lowest quartile last season. He has already faced 1,675 batters in his career and threw two complete games last year. He will also turn 31 this season, which doesn’t bode well since he strikes out approximately one of every four batters.
High up on the list is Aroldis Chapman, whom the Dodgers tried to trade for recently. The Reds’ reliever strikes out two out of every five batters and throws his fastball at an average of 99.5 MPH, which is the highest in the league.
Towards the bottom of the list is R.A. Dickey, which also makes sense, since he is a knuckleball pitcher, and the knuckleball tends to put less stress on the arm than cutters, curveballs, and high-velocity fastballs do.
Here are some other relevant pitcher injury risks:
We believe that the logistic regression predicting pitcher injuries is a useful model given that it was constructed using only publicly available baseball statistics. However, it’s definitely not an end-all and be-all model, as Dodgers’ pitcher Brandon McCarthy pointed out to us. There is more information that we would want in order to make a better model but that we can’t have: eating habits and training regimen, amongst other things for which there is no data. In addition, pitcher mechanics are a large component of injury risk, since those who have worse fundamentals tend to get injured at a higher rate. Using this model in conjunction with qualitative analysis of one’s pitching motion would perhaps be even more helpful. Nonetheless, our model is a useful start in identifying pitcher injury risk so that both pitchers themselves and team management can adjust pitch selection and workload as a means of injury prevention.
Great start on the issue. I see one factor that has an overall bearing on the propensity for injury among the segment called ‘Starter,’ which can be defined as any pitcher who throws a minimum of 162 IP in a season, the standard qualification for the ERA title. This factor can be described as ‘predictive analysis’: the availability of statistical measures that can be used to make pre-game, as well as in-game, tactical decisions. Starters are pitching fewer innings as managers see a benefit in utilizing a reliever in specific game situations (based on statistical analysis of matchups, etc.). In theory this should reduce the injury rate for starters over time, as indicated by your data. The game today is far more competitive than it has ever been. This is a trend for all athletic pursuits, professional major leagues or otherwise, for all the reasons that are very well documented (availability of measurable stats used in predictive strategy and tactical moves, the science of nutrition and exercise regimes, proven diagnostic procedures, and surgical techniques, to name a few). With knowledge comes risk aversion. Managers would rather utilize personnel and tactics based on statistical evidence (knowledge) than risk a hunch based on anecdotal experience when they have data in hand.
An interesting hypothesis would be: is there a direct correlation between reduced innings pitched over a season (minimum 162 IP) and injury rate among starters?
The other interesting factor in your data is the correlations between the different pitch types. In other words, is there a relationship between a pitcher’s repertoire and injury rate? In less competitive eras, when player evaluation was less systematic, pitching prospects were selected on fastball mechanics. The one thing you cannot teach, as the saying goes, is velocity. Teach a fireballer to locate his fastball and to throw an off-speed pitch (more often than not, a curveball) and you have your next starter. In today’s game pitchers are taught to throw a number of ‘out’ pitches that have the potential to do harm with even more unnatural stress on the body.