Predicting Pitcher Injuries

The following post is just part of a larger project on predicting pitcher injuries. Read the full paper here: Predicting Pitcher Injuries

By Kurt Bullard, Jake Meagher, and Declan Garvey

Over the past five years, more than 23 percent of pitchers have been placed on the Disabled List (DL). We set out to build a logistic regression model that could help predict whether a pitcher would get injured the following season, based on traditional and advanced statistics from both the previous year and the pitcher’s entire career.

Data Collection:

We paired pitcher data from 2010-2014 with the injury data that we found for 2011-2015, since we were concerned with how one year’s usage and performance might affect the next year’s injury risk. We wanted to look at pitcher-specific injuries: ones that resulted from pitching stress on particular parts of the body, as opposed to injuries that were “fluky.” For that reason, we only considered injuries involving the arm, shoulder, back, and side. Other injuries (e.g., gastrointestinal), we assumed, were not inherently related to pitching stress. In total, 3,330 pitcher-seasons met these initial criteria.
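The pairing described above (season-t statistics labeled with a season-t+1 pitching-related injury) can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the field names `pitcher`, `season`, and `body_part` and the toy rows are hypothetical.

```python
# Body parts treated as pitching-related, per the text above.
PITCHING_BODY_PARTS = {"arm", "shoulder", "back", "side"}

def label_pitcher_seasons(stat_rows, dl_rows):
    """Attach a next-season injury flag to each pitcher-season.

    stat_rows: dicts with hypothetical keys 'pitcher' and 'season'.
    dl_rows:   dicts with hypothetical keys 'pitcher', 'season', 'body_part'.
    """
    # Keep only DL stints whose body part counts as pitching-related.
    injured = {
        (row["pitcher"], row["season"])
        for row in dl_rows
        if row["body_part"].lower() in PITCHING_BODY_PARTS
    }
    labeled = []
    for row in stat_rows:
        # Statistics from season t are matched to injuries in season t + 1.
        key = (row["pitcher"], row["season"] + 1)
        labeled.append({**row, "injured_next": int(key in injured)})
    return labeled

# Toy data for illustration only.
stats = [
    {"pitcher": "A", "season": 2010},
    {"pitcher": "B", "season": 2010},
    {"pitcher": "A", "season": 2011},
]
dl = [
    {"pitcher": "A", "season": 2011, "body_part": "Shoulder"},
    {"pitcher": "B", "season": 2011, "body_part": "gastrointestinal"},
]
labeled = label_pitcher_seasons(stats, dl)
```

In this toy example only pitcher A's 2010 season is flagged, because B's 2011 DL stint is not pitching-related and A has no 2012 injury on record.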

Our independent variables came from three different sources:

Baseball Reference: Strikeouts, Age, Dummy Variable for AL/NL, Games, Games Started, Dummy Variable for Starting Pitcher (Games Started > 0), Complete Games, Innings Pitched, Hits, Runs, BB, FIP, Batters Faced, Strike Percentage, and Career Batters Faced

Baseball Info Solutions Data (from Fangraphs): Percentage of Pitch Thrown and Average Velocity for the Following Pitches: Fastball, Cutter, Slider, and Curveball

Tommy John Database: A dummy variable signaling whether or not a pitcher had had Tommy John Surgery before

We did end up trimming some of the data that we did not find representative of a decent MLB pitcher. For one, we did not include three pitcher-seasons in which the pitcher made an appearance but did not record an out. With zero innings pitched, FIP cannot be computed, which makes those seasons unusable. Also, a pitcher who doesn’t record an out the entire year is more likely than not a mediocre talent who should not influence the model.

In addition, we did not include pitchers who recorded fewer than 10 innings in a season. We realize that this may be a bit problematic: some pitchers may have been injured fewer than 10 innings into the season and stopped playing for that reason rather than for lack of talent. Seven percent of these pitchers ended up getting injured. That being said, the overall injury rate for pitchers hovered around 23%, so most of the pitchers who pitched fewer than 10 innings were simply under-utilized rather than injured. Moreover, there was very little Baseball Info Solutions data for pitchers with few appearances, so including them would have distorted the estimated impact of pitch speed and selection. In the end, we were left with 2,749 pitcher-seasons.
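The innings cutoff can be expressed as a simple filter. The sketch below assumes innings are stored in the usual box-score notation, where "9.2" means 9 innings plus 2 outs; the function names are hypothetical, and the 10-inning threshold also excludes the zero-out seasons mentioned above.

```python
MIN_OUTS = 30  # 10 innings x 3 outs per inning

def ip_to_outs(ip_text):
    """Convert box-score innings notation ('9.2' = 9 innings + 2 outs) to outs."""
    whole, _, fraction = ip_text.partition(".")
    return int(whole) * 3 + (int(fraction) if fraction else 0)

def keep_season(ip_text):
    """Return True if the pitcher-season meets the 10-inning minimum."""
    return ip_to_outs(ip_text) >= MIN_OUTS

# Toy innings totals for illustration only.
seasons = ["0.0", "9.2", "10.0", "182.1"]
kept = [ip for ip in seasons if keep_season(ip)]
```

Counting outs rather than comparing the decimal string avoids treating "9.2" as 9.2 innings, which it is not.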

Transformations:

We first transformed a few variables from totals into rates for fear of collinearity with games played or batters faced. Specifically, we divided strikeouts, wild pitches, runs, hits, and walks by batters faced to reduce that collinearity.

In addition to the above rate transformations, we also decided to transform a handful of other variables after examining their distributions:
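As a small illustration of the rate transformation, the sketch below divides each counting stat by batters faced. The input keys (`so`, `wp`, `r`, `h`, `bb`, `bf`) are hypothetical, and the output names simply append `bf` in the style of labels such as `hbf` and `bbbf` from the variable key.

```python
def per_batter_faced(row, counts=("so", "wp", "r", "h", "bb")):
    """Convert season totals into per-batter-faced rates."""
    batters_faced = row["bf"]
    if batters_faced <= 0:
        raise ValueError("batters faced must be positive")
    return {f"{name}bf": row[name] / batters_faced for name in counts}

# Toy season line for illustration only.
season = {"bf": 200, "so": 50, "wp": 2, "r": 20, "h": 40, "bb": 10}
rates = per_batter_faced(season)
```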

  • Square root of batters faced

  • Square root of games played

  • Square root of wild pitches per batters faced

  • Square root of Curveball Percentage

  • Square root of Slider Percentage

  • Log of Cutter Percentage
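The transformations listed above might look like this in code. This is a sketch with hypothetical input keys. The 0.01 offset inside the log comes from the variable key ("Log of (cutter percentage + 0.01)") and keeps the log finite for pitchers who never throw a cutter.

```python
import math

CUTTER_OFFSET = 0.01  # from the variable key: log(cutter percentage + 0.01)

def transform(row):
    """Apply the square-root and log transformations to one pitcher-season."""
    return {
        "bf": math.sqrt(row["bf"]),            # batters faced
        "g": math.sqrt(row["g"]),              # games played
        "wpbf": math.sqrt(row["wp_per_bf"]),   # wild pitches per batter faced
        "cb": math.sqrt(row["cb_pct"]),        # curveball percentage
        "sl": math.sqrt(row["sl_pct"]),        # slider percentage
        "ct": math.log(row["ct_pct"] + CUTTER_OFFSET),  # cutter percentage
    }

# Toy values for illustration only; this pitcher throws no cutters.
example = transform(
    {"bf": 400, "g": 25, "wp_per_bf": 0.01, "cb_pct": 0.16, "sl_pct": 0.09, "ct_pct": 0.0}
)
```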

The distributions for the remainder of the variables appeared to be approximately normal, so we found no need to transform them. Below is a key of the variables that we were considering, post-transformations.

Results:

After running several stepwise regressions and then cross-validating them against each other, we ended up with the following model:

| Variable Label in R | Actual Variable (with transformations) |
| --- | --- |
| age | Age |
| al | Indicator for American League |
| g | Square root of games played |
| sp | Indicator for Starting Pitcher |
| cg | Complete games |
| hbf | Hits per batters faced |
| bbbf | Walks per batters faced |
| bf | Square root of batters faced |
| wpbf | Square root of wild pitches per batters faced |
| strpct | Strike percentage |
| totalbf | Total batters faced (career) |
| tj | Indicator for Tommy John |
| pit | Total pitches thrown (career) |
| fb. | Fastball percentage |
| fbv. | Average fastball velocity |
| sl. | Square root of slider percentage |
| slv | Average slider velocity |
| ct. | Log of (cutter percentage + 0.01) |
| ctv | Average cutter velocity |
| cb. | Square root of curveball percentage |
| cbv | Average curveball velocity |
| ch. | Changeup percentage |
| chv | Average changeup velocity |
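The post does not say how the candidate stepwise models were cross-validated against each other, so the following is only one plausible sketch: k-fold cross-validation scored by log loss, with the model-fitting step left as a caller-supplied function. All names here are hypothetical, and the base-rate "model" is just a stand-in to make the example runnable.

```python
import math
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle row indices and deal them into k roughly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def log_loss(y_true, p_pred, eps=1e-12):
    """Mean negative log-likelihood of binary outcomes under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def cross_validated_loss(fit, X, y, k=5):
    """fit(X_train, y_train) must return a function mapping one row to a probability."""
    losses = []
    for held_out in k_fold_indices(len(y), k):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        predictions = [predict(X[i]) for i in held_out]
        losses.append(log_loss([y[i] for i in held_out], predictions))
    return sum(losses) / k

def fit_base_rate(X_train, y_train):
    """Stand-in model: always predict the training-set injury rate."""
    rate = sum(y_train) / len(y_train)
    return lambda row: rate

# Toy data for illustration only: 20 pitcher-seasons, 5 injuries.
X = [[float(i)] for i in range(20)]
y = [1, 0, 0, 0] * 5
baseline_loss = cross_validated_loss(fit_base_rate, X, y, k=5)
```

Each candidate model would be passed in as its own `fit` function, and the one with the lowest cross-validated loss would be preferred; a candidate that cannot beat the base-rate baseline is adding nothing.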

The log-odds of injury risk are negatively correlated with (from most to least significant, where significant relationships are bolded):

  • Complete Games with the Log of Batters Faced

  • Age

  • Strikeouts

  • Fastball Percentage

  • Tommy John With Fastball Velocity

  • Walks per batters faced and Tommy John

  • Strikeouts per batters faced

  • Walks per batters faced

  • Cutter Indicator

  • Fastball Velocity

The log-odds of injury risk are positively correlated with (from most to least significant, where significant relationships are bolded):

  • Total Batters Faced

  • Complete Games

  • Log of Batters Faced

  • Fastball Velocity with Fastball Percentage

  • Strikeouts per Batters Faced and Age

  • Tommy John

  • Fastball Percentage and Age

  • Fastball Percentage and Cutter Indicator

  • Strikeouts Per Batters Faced and Fastball Velocity

The most significant variables in this model make a lot of sense. Among the positive correlations, complete games, total batters faced, and the log of the previous season’s batters faced all make sense because they penalize a pitcher for throwing a lot of pitches in individual games, seasons, and careers. The interaction of fastball velocity and fastball percentage also makes sense, since a faster fastball is more dangerous the more often it is thrown. Lastly, having had Tommy John surgery points to a previous injury, which is a good indicator of getting injured again.

The significant negative correlations also make sense given the structure of the data set. Fastball percentage is negatively correlated with injury risk because the fewer fastballs a pitcher throws, the more arm-straining offspeed pitches he throws. The best pitchers are the ones who record the most strikeouts and complete games, meaning they probably have sound mechanics, which explains why strikeouts and the interaction of complete games with the log of batters faced are significant. The interaction of Tommy John and fastball velocity is significant, which could serve as an indicator of how well a pitcher has healed from the last injury; perhaps pitchers who recover more fully throw faster upon return. Lastly, age is also negative in the regression, but that is most likely due to sample bias and is a slight shortcoming of our model: the pitchers who lasted to 32-plus were usually the really good ones who in all likelihood never experienced a devastating injury, while a lot of really young pitchers get injured and never come back from it.

Predictions for 2016:

Last year, 598 pitchers threw more than 10 innings. We had to remove 19 pitchers because they did not have Baseball Info Solutions data, which left 579 pitchers from last season who threw 10 or more innings and had the relevant velocity and pitch selection data. The following is the histogram of the injury risk predictions:

We predicted the average risk of a pitcher getting injured in 2016 to be 23.2%, which makes sense, since the five-year mean was in fact 23.4%. We would expect future years to hover around this value, since there hasn’t been a secular change in pitcher usage or philosophy.
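For readers less familiar with logistic regression, the model outputs log-odds, which are converted to a probability with the logistic function; the average predicted risk is then just the mean of those probabilities. The sketch below uses placeholder coefficients chosen only to match the signs reported above (Tommy John positive, age negative). They are not the fitted values from the paper.

```python
import math

def injury_probability(intercept, coefs, features):
    """Logistic model: probability = 1 / (1 + exp(-log_odds))."""
    log_odds = intercept + sum(coefs[name] * features[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-log_odds))

def mean_predicted_risk(probabilities):
    """Average predicted injury risk across a set of pitchers."""
    return sum(probabilities) / len(probabilities)

# Placeholder numbers for illustration only; NOT the fitted coefficients.
intercept = -1.0
coefs = {"tj": 0.5, "age": -0.02}
pitchers = [
    {"tj": 1, "age": 30},  # prior Tommy John surgery
    {"tj": 0, "age": 25},
]
risks = [injury_probability(intercept, coefs, p) for p in pitchers]
average_risk = mean_predicted_risk(risks)
```

Comparing `average_risk` with the historical injury rate, as done above with 23.2% versus 23.4%, is a quick sanity check that the predictions are not systematically too high or too low.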

The range of the model’s predictions went from a peak of 80.3% to a low of 2.8%. We present the top 10 most likely and least likely pitchers to get injured next season, respectively.

Top Ten

Bottom Ten

The pitcher most likely to get injured is Cleveland starter Josh Tomlin. He is just coming off Tommy John surgery and throws only 53% fastballs, which was in the lowest quartile last season. He has already faced 1,675 batters in his career and threw two complete games last year. He will also turn 31 this season, which doesn’t bode well since he strikes out approximately one of every four batters.

High up on the list is Aroldis Chapman, whom the Dodgers recently tried to trade for. The Reds’ reliever strikes out two of every five batters and throws his fastball at an average of 99.5 MPH, which is the highest in the league.

Towards the bottom of the list is R.A. Dickey, which also makes sense, since he is a knuckleball pitcher, and the knuckleball tends to put less stress on the arm than cutters, curveballs, and high-velocity fastballs do.

Here are some other relevant pitcher injury risks:

We believe that the logistic regression predicting pitcher injuries is a useful model given that it was constructed using only publicly available baseball statistics. However, it’s definitely not a be-all and end-all model, as Dodgers pitcher Brandon McCarthy pointed out to us. A better model would need information that we can’t have, such as eating habits and training regimen, among other things for which there is no data. In addition, pitcher mechanics are a large component of injury risk: those who have worse fundamentals tend to get injured at a higher rate. Using this model in conjunction with qualitative analysis of a pitcher’s motion would perhaps be even more helpful. Nonetheless, our model is a useful start in identifying pitcher injury risk so that both pitchers themselves and team management can adjust pitch selection and workload as a means of injury prevention.


3 Comments

  • Great start on the issue. I see one factor that has an overall bearing on the propensity for injury among the segment called ‘Starter,’ which can be defined as any pitcher who throws a minimum of 162 IP in a season, the standard qualification for the ERA title. This factor can be described as ‘predictive analysis’: the availability of statistical measures that can be used to make pre-game as well as in-game tactical decisions. Starters are pitching fewer innings as managers see a benefit in using a reliever in specific game situations (based on statistical analysis of matchups, etc.). In theory this should reduce the injury rate for starters over time, as indicated by your data. The game today is far more competitive than it has ever been. This is a trend for all athletic pursuits, professional major leagues or otherwise, for reasons that are very well documented (availability of measurable stats used in predictive strategy and tactical moves, the science of nutrition and exercise regimes, proven diagnostic procedures, and surgical techniques, to name a few). With knowledge comes risk aversion. Managers would rather choose personnel and tactics based on statistical evidence (knowledge) than risk a hunch based on anecdotal experience when they have data in hand.
    An interesting hypothesis would be: is there a direct correlation between reduced innings pitched over a season (minimum 162 IP) and injury rate among starters?
    The other interesting factor in your data is the correlations between the different pitch types. In other words, is there a relationship between a pitcher’s repertoire and injury rate? In less competitive eras, when player evaluation was less systematic, pitching prospects were selected on fastball mechanics. The one thing you cannot teach, as the saying goes, is velocity. Teach a fireballer to locate his fastball and to throw an offspeed pitch (more often than not, a curveball) and you have your next starter. In today’s game, pitchers are taught to throw a number of ‘out’ pitches that have the potential to do harm, with even more unnatural stress on the body.
