Analysis of Every 2016 MLB Home Run
Recently, Major League Baseball has seen a transformation in the way the game is played. While the games of 10-15 years ago were high-scoring, steroid-fueled shootouts, today’s games are much lower scoring. The median MLB team this year averaged 4.46 runs per game, down from 4.76 in 2007. While .3 runs per game may seem trivial, this difference holds over the largest dataset in all of professional sports: an MLB season. An average score .3 runs (roughly 6 percent) lower over 162 games is a sign that the style of gameplay is changing. With scoring consistently declining over the last 10 years, a player’s ability to hit Home Runs (HR’s) is at a premium. The purpose of this project is to observe what sort of factors go into a Home Run, and which types of HR’s are most valuable to an MLB team.
With the “Moneyball” revolution in Major League Baseball, the emphasis on using analytical tools to solve on-field problems is unlike anything seen previously in professional sports. As the demand for baseball data has grown since the Oakland A’s first began using analytics as a tool to win games, so too has the amount of data available. Two years ago, Major League Baseball made Statcast data available to the public for the first time. Statcast data can be used to find insights such as the angle a fielder takes to the ball, how fast a runner runs while trying to steal a base, and even data about HR’s that was previously impossible to observe. This project will attempt to use Statcast data to mine insights about every HR hit in 2016.
This project will address the following objectives:
- How can MLB teams maximize offensive output, even as offensive output around the league is declining?
- Which sort of Home Runs are most valuable to teams?
- What are the different types of Home Runs?
- Are players likely to hit the same type of Home Run consistently?
Overall, there has not been too much research published in the area of Home Run tracking. One of the more interesting pieces in the field is “The Physics of Baseball”, by Alan M. Nathan. This paper is probably the most comparable to mine, as Nathan looks for a correlation between Exit Speed and Distance Traveled of each baseball. However, limiting the dataset to Home Runs and introducing more variables will add totally new aspects to the study.
Another study that used Statcast information to try to draw baseball insights is the article “Need To Get To First? Don’t Use Your Head”, by Anthony Castrovince. The study uses Statcast data to show that when given the option between running or diving into first base, players should opt to stay on their feet. However, this study doesn’t use the data to draw any insights from home run data.
My favorite study in this area was done by Rob Arthur at FiveThirtyEight. Arthur looked at all batted balls over the course of the 2015 season, and saw how much of each ball’s exit velocity was due to the hitter and how much was due to the pitcher. Arthur ended up finding that the exit velocity was five parts hitter and one part pitcher. For this reason, I am going to focus my time and efforts around studying what makes hitters hit balls harder, and ignore the pitching side of the equation for now.
The data set that I am using includes data about every MLB HR hit in the 2016 season. Figure A shows each variable in the data set, as well as meaningful descriptive statistics if the variable is numerical.
Using the data above, I will create two different types of models. The first model will be a Multiple Linear Regression and the second will be a Clustering Analysis. I will first use my Multiple Linear Regression model to see which factors are more or less influential on the distance of a Home Run. Next, I will perform a clustering analysis in order to break down every Home Run from 2016 into a predetermined number of different categories. My hypothesis is that four clusters will be a good number. However, if four clusters don’t show the sort of separation that I’m hoping for, allowing for more or fewer clusters is certainly a possibility. Finally, I will see if there is a strong trend towards players always hitting the same type of HR. It will be interesting to see if there are some players that always hit the same type of HR’s (e.g. line drive, high shot, always same distance, etc.).
While the conclusions derived from the clustering analysis were interesting, the more relevant statistical study for the sake of this assignment is the building of a predictive model using the 2016 MLB HR data set. In order to create a test for my model, I will start by partitioning the data. The partition will include every variable (more will be eliminated later in the process), will use the standard 12345 seed, and will use a 60/40 Training/Validation split.
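The partition described above could be sketched as follows. This is only an illustration using Python's standard library, not the tool used for the project: the `partition` helper is a hypothetical name, the row IDs are stand-ins, and the row count of 5,610 is an assumption taken from the sum of the four cluster sizes reported in the clustering section.

```python
import random

def partition(rows, train_fraction=0.6, seed=12345):
    """Shuffle a copy of rows and split it into training and validation sets."""
    rng = random.Random(seed)            # fixed seed makes the split reproducible
    shuffled = list(rows)                # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in row IDs (assumption): one ID per 2016 home run.
home_run_ids = list(range(5610))
train, valid = partition(home_run_ids)
print(len(train), len(valid))
```

Because the seed is fixed, calling `partition` again on the same rows returns the same split, which is what makes the validation results repeatable.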
Luckily, there are not any missing values in this data. While there are technically some outliers (based on the assumption that any data point more than 3 SD’s from the mean is an outlier), there is no need to remove them. Since every Home Run was hit under a uniform set of rules and in realistic conditions, any HR that was particularly long, high, low, or short should be kept in the data set. Because of these factors, a basic partition was the only pre-processing necessary.
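As a rough illustration of the 3 SD rule mentioned above, the sketch below flags values that sit more than three standard deviations from the mean. The `flag_outliers` helper and the distances are made up for illustration, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def flag_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` SDs from the mean."""
    mu = mean(values)
    sigma = pstdev(values)               # population SD (an assumption)
    if sigma == 0:
        return []                        # all values identical: nothing to flag
    return [v for v in values if abs(v - mu) > threshold * sigma]

# Made-up HR distances in feet, for illustration only.
distances = [400] * 20 + [405] * 20 + [395] * 20 + [495]
flagged = flag_outliers(distances)
print(flagged)
```

Flagging is kept separate from removal here, which matches the decision above to identify outliers but leave them in the data set.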
Model #1 – Multiple Linear Regression
The first model I will create to try to project Home Run distance will be Multiple Linear Regression. This regression will use Exit Velocity, Vertical Angle, Horizontal Angle and Apex in order to try to predict Home Run distance.
After the first iteration of the model, results were solid. Figures B and C show the output of the model and the error metrics for the model, respectively.
Figure B Figure C
While the results of this model were strong, the Horizontal Angle variable ran as insignificant, with a P-Value of .268 well above the .05 threshold. For this reason, I re-ran the regression with the Horizontal Angle variable removed and every other aspect of the model kept constant. Figures D and E show the output and error metrics for the second iteration of the model, respectively.
Figure D Figure E
While the outputs of these two models are very similar, the first model had a slightly higher RSq. For this reason, and since there is no inherent value in reducing the number of variables in this case, I will stick with the first iteration of the model as the superior model. Holding the other inputs constant, this model says that there is a strong positive relationship between exit velocity and distance, as every additional MPH of exit velocity adds roughly 3.3 feet of distance to a HR on average. There is a negative relationship between elevation angle and distance, as every additional degree of elevation takes roughly 4.67 feet off of the average HR. There is a slightly negative relationship between horizontal angle and distance, as a change of 1 degree in horizontal angle takes away roughly .014 feet on the average Home Run. This is logical, since a lower horizontal angle means that the ball was closer to an opposite-field HR, and opposite-field HR’s are generally not hit as hard as other HR’s. However, this coefficient was not statistically significant (P-Value of .268), so it should be read with caution. Finally, a higher apex was good for Home Run distance, as an increase in the apex of 1 foot led to a 1.2 foot longer HR on average.
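To make the coefficient interpretation concrete, the sketch below applies the reported coefficients to a hypothetical change in inputs. The intercept is not reported in the text, so only differences in predicted distance can be computed, not absolute distances. The variable names and the scenario are assumptions for illustration.

```python
# Coefficients as reported in the text (feet of distance per unit of input).
COEFFICIENTS = {
    "exit_velocity_mph": 3.3,
    "vertical_angle_deg": -4.67,
    "horizontal_angle_deg": -0.014,
    "apex_ft": 1.2,
}

def predicted_change(deltas):
    """Predicted change in HR distance (ft) for the given changes in inputs.

    Inputs that are not listed in `deltas` are treated as held constant.
    """
    return sum(COEFFICIENTS[name] * delta for name, delta in deltas.items())

# Hypothetical scenario: 2 MPH more exit velocity and 1 degree less elevation.
change = predicted_change({"exit_velocity_mph": 2.0, "vertical_angle_deg": -1.0})
print(round(change, 2))  # 11.27
```

In this scenario the model would predict a HR roughly 11 feet longer: 6.6 feet from the added exit velocity plus 4.67 feet from the flatter trajectory.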
Model #2 - Clustering Analysis
In the first part of my analysis, I looked into what made Home Runs travel longer or shorter distances. For the second part of my analysis, I will look into what the different “types” of Home Runs are, and look into whether players are more likely to always hit the same type of HR. To begin this process, I ran a clustering analysis of every Home Run of the 2016 season using the following inputs: Distance, Exit Velocity (MPH), Vertical Angle, Horizontal Angle, and Apex. My hypothesis is that if I set the number of clusters to 4, the following clusters will appear:
- Long Bomb HR’s (Long distance, medium elevation angle, high exit velocity)
- Line Drive HR’s (High exit velocity, medium distance, low apex, and low launch angle)
- Fly-Ball HR’s (High elevation angle, high apex, lower distance)
- “Bloops” (Shortest distance, lowest speed, slightly lower apex)
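The clustering setup described above could be sketched roughly as follows. The text does not name the algorithm or software used, so this assumes a plain k-means (Lloyd's algorithm) on z-score-normalized inputs. The prototype values, noise level, seed, and all function names are invented for illustration and are not drawn from the 2016 data.

```python
import random
from statistics import mean, pstdev

def normalize(rows):
    """Z-score each column so all inputs are on a comparable scale."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]   # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, seed=12345, max_iter=100):
    """Plain Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                # start from k random points
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:                   # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                            # keep old center if cluster is empty
                centers[j] = [mean(col) for col in zip(*members)]
    return centers, labels

# Made-up prototypes (assumption), columns:
# distance (ft), exit velocity (MPH), vertical angle, horizontal angle, apex (ft)
prototypes = [
    (440, 110, 27, 90, 100),   # hypothetical "Long Bomb"
    (400, 108, 20, 90, 60),    # hypothetical "Line Drive"
    (385, 100, 35, 90, 130),   # hypothetical "Fly-Ball"
    (360, 96, 28, 60, 85),     # hypothetical "Bloop"
]
noise = random.Random(0)
rows = [[v + noise.gauss(0, 2) for v in proto]
        for proto in prototypes for _ in range(25)]

points = normalize(rows)
centers, labels = k_means(points, k=4)
print(sorted(labels.count(j) for j in range(4)))
```

Normalizing first matters because distance is measured in hundreds of feet while the angles are in tens of degrees. Without it, distance would dominate the cluster assignments. Note that k-means can settle on different groupings depending on the starting centers, so results from a sketch like this are not guaranteed to recover the hypothesized categories.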
After running the analysis, the results seemed to follow my hypothesis. Figure F is a screenshot of the original coordinates of each cluster center, while Figure G contains a screenshot of the normalized coordinates. Figure G is formatted to show trends within the data since all fields are comparable when normalized, with green showing a higher relative value and red showing a lower relative value.
Figure F Figure G
Overall, the results of this clustering analysis ended up being pretty much in line with what I expected. Cluster 1 is the “Long-Bomb” HR’s, with the highest distance and highest exit velocity. Cluster 2 is the “Fly-Ball” HR’s, with lower exit velocity but the highest launch angle and apex. Cluster 3 is home to the “Line-Drive” HR’s, which have nearly the same average exit velocity as Cluster 1 but the lowest launch angle and apex. Cluster 4 appears to be the unimpressive “Bloops” category, which consists of the shortest, slowest Home Runs. Cluster 4 also has the lowest horizontal angle, meaning that Cluster 4 Home Runs were primarily hit to the opposite field (right field for a right-handed batter or left field for a left-handed batter). It is well accepted that opposite-field Home Runs are generally not hit as hard as other Home Runs. It is also worth noting that Cluster 4 was by far the smallest cluster, with 949 observations within the data set. For context, Cluster 1 had 1,639 observations, Cluster 2 had 1,438 observations, and Cluster 3 had 1,584 observations.
In-Depth Interpretation of Clustering Analysis
In order to better interpret the results above, I decided to see which players were most likely to hit each type of HR, as well as whether players were more likely to trend towards having most of their HR’s fall into the same category. In order to gather these insights, I removed every player from the data set that didn’t have at least 10 HR’s. This eliminated the issue of trying to draw insights from players that only hit a few HR’s all year, and decreased the size of the player pool from 602 to 214. After observing the frequency of each cluster type for each player, the following insights were drawn:
- Highest proportion of Cluster 1 “Long-Bomb” HR’s: Colby Rasmus. 11 of Rasmus’ 15 HR’s fell into the “Long-Bomb” category. If I were an MLB General Manager, I would use this insight to try to sign or trade for Colby Rasmus. The fact that 11 of his 15 Home Runs were long bombs means that he had many “no-doubt” HR’s. It also means that his stats weren’t inflated by many balls that weren’t hit hard but ended up in the seats. This insight is supported by the fact that Colby Rasmus’ average Home Run ball would be a HR in 27.4 of the 30 MLB parks, nearly half a standard deviation above league average.
- Highest proportion of Cluster 2 “Fly-Ball” HR’s: David Ross and Aaron Hill (Tie). Both players had 6 of their 10 HR’s fall into the “Fly-Ball” category. If I were a GM or coach on a team with one of these players, I would tell them to work on bringing down the average vertical angle of the balls they hit. These players are likely missing opportunities at Home Runs by getting too much air under the ball, which takes away from the distance it can travel. If they could bring the vertical angles of the balls they hit down a bit, they would have a better chance at hitting more HR’s.
- Highest proportion of Cluster 3 “Line-Drive” HR’s: Joe Mauer. 7 of Mauer’s 11 Home Runs in 2016 fell into Cluster 3. This is logical, since Mauer generally tries to keep his vertical angle on batted balls low. Mauer usually tries to hit well-placed singles and doubles, rather than trying to hit HR’s. In any case, if I were the GM or coach on a team with Mauer, I would try to get him to put more air under the balls he hits if I were looking for more Home Run output from my team. Mauer’s current skill set is useful, but it would be important to know how to alter it if he was looking for different results.
- Highest proportion of Cluster 4 “Bloop” HR’s: Didi Gregorius. An astounding 16 of Didi Gregorius’ 20 HR’s in 2016 fell into Cluster 4. This is likely due to the fact that Gregorius plays his home games at Yankee Stadium, which is notorious for allowing softly-hit baseballs to end up as Home Runs. Not only did Gregorius have 80 percent of his HR output in 2016 fall into the “Bloop” category, but he didn’t have any HR’s in the “Long-Bomb” cluster. If I were the GM of a team with Didi Gregorius, I would try to trade him based on these numbers. While 20 HR’s from a Shortstop (typically a position which doesn’t hit many HR’s) seems impressive, these numbers are artificially inflated by a large number of softly-hit HR’s. This would indicate that Gregorius lacks real power capabilities, and rather was the beneficiary of good luck and a forgiving ballpark. Furthermore, if I were only looking to build a team that hit the most HR’s, I would trade Gregorius for Rasmus even though Gregorius hit 5 more HR’s than Rasmus in 2016. A closer look at both players’ numbers would show that Rasmus is actually the more powerful hitter, as the vast majority of his Home Runs were hit harder than any HR that Gregorius hit in 2016.
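The per-player breakdown behind the list above could be computed along these lines. This is a sketch under assumptions: the `(player, cluster)` input layout and the `cluster_shares` helper are hypothetical, and the sample players are invented. Player A simply mirrors an "11 of 15 in Cluster 1" profile.

```python
from collections import Counter, defaultdict

def cluster_shares(home_runs, min_hr=10):
    """Map each qualifying player to {cluster: share of his HRs in that cluster}."""
    counts = defaultdict(Counter)
    for player, cluster in home_runs:
        counts[player][cluster] += 1

    shares = {}
    for player, by_cluster in counts.items():
        total = sum(by_cluster.values())
        if total >= min_hr:                       # drop players with fewer than 10 HRs
            shares[player] = {c: n / total for c, n in by_cluster.items()}
    return shares

# Invented sample (assumption): one (player, cluster) pair per home run.
home_runs = (
    [("Player A", 1)] * 11 + [("Player A", 3)] * 4    # 15 HRs, qualifies
    + [("Player B", 2)] * 3                            # 3 HRs, filtered out
)
shares = cluster_shares(home_runs)
print(shares)
```

Sorting the resulting shares by a given cluster would then surface the player with the highest proportion of that HR type.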
In order to see whether players were likely to hit the same type of HR rather than a more even distribution, I found the percentage of each player’s HR’s made up by his most frequent HR type and his least frequent HR type. I then subtracted the proportion of the least frequent HR type from the proportion of the most frequent HR type in order to find a metric that I called the “HR Type Spread”. This “spread” statistic can basically be interpreted as saying “of all a player’s Home Runs, the most common type made up X percentage points more of his HR’s than his least common type.” After compiling data on the “HR Type Spreads” for every player with at least 10 HR’s, it was found that the mean HR Type Spread was 32.6% while the median HR Type Spread was 31.5%. For the sake of this argument, I believe the median is a more accurate depiction of average since it is less sensitive to outliers.
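A minimal sketch of the “HR Type Spread” calculation follows. It assumes that a type a player never hit counts as a 0% share, which the text does not state explicitly. The example uses the Gregorius figures given earlier (16 of 20 HR’s in Cluster 4, none in Cluster 1). How his other 4 HR’s split between Clusters 2 and 3 is not given, so the 2/2 split is a placeholder that does not affect the result.

```python
def hr_type_spread(cluster_counts, n_types=4):
    """Share of the most common HR type minus share of the least common type."""
    total = sum(cluster_counts.values())
    # A type the player never hit is treated as a 0% share (assumption).
    shares = [cluster_counts.get(c, 0) / total for c in range(1, n_types + 1)]
    return max(shares) - min(shares)

# 16 of 20 in Cluster 4 and none in Cluster 1 come from the text;
# the 2/2 split across Clusters 2 and 3 is a placeholder.
gregorius = {2: 2, 3: 2, 4: 16}
spread = hr_type_spread(gregorius)
print(f"{spread:.1%}")  # 80.0%
```

A spread of 80 points is far above the 31.5% median, which is why this profile stands out as an exception rather than the norm.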
In terms of what this means, the typical MLB player’s most common HR type accounts for a share of his HR’s that is 31.5 percentage points larger than the share of his least common HR type. I had expected this number to be a little higher. A spread of this size suggests that, on average, players aren’t drastically more likely to hit their most common type of HR than they are to hit their least common type. There are some exceptions to this rule, most notably stars like Kris Bryant, Nolan Arenado and Brian Dozier, who rarely ever hit “Bloop” HR’s. However, on average the distribution of HR types by player is more uniform than I had anticipated.
Overall, I was very satisfied with the results my two models produced. My first model gave insight as to what sort of factors to look for when trying to hit a baseball farther. To summarize briefly: players looking to maximize the distance traveled by their HR’s should focus on hitting the ball faster and at a lower trajectory in general.
My second model went in-depth with looking at different types of Home Runs, what type of players are more likely to hit each type of HR, and whether players were likely to hit the same type of Home Run with a large amount of frequency. To briefly summarize the results of this analysis:
- There are four types of HR’s (Long-Bomb, Fly-Ball, Line-Drive, Bloop)
- The average MLB player has a proportion of their most common HR type that is 31.5 percentage points greater than the proportion of their least common HR type
- Certain players would make great or poor trade targets based on their results in this study.
- Most common HR is Long-Bomb, rarest is Bloop
Performance Measures of Model #1
Based on the performance metrics shown in section 3.4, I feel confident that the model performed well. While .8 is usually the cutoff RSq for what is considered to be a very predictive model, my model came in just under that threshold at roughly .78 RSq.
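For reference, RSq can be computed from actual and predicted distances as sketched below. An adjusted RSq, which penalizes extra predictors, is a common alternative when comparing models with different numbers of variables. The helper names and all sample numbers are made up for illustration and are not taken from the model output.

```python
def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_predictors):
    """RSq penalized for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Made-up distances in feet, for illustration only.
actual = [400, 420, 380, 440]
predicted = [405, 415, 385, 430]
r2 = r_squared(actual, predicted)
print(round(r2, 4))  # 0.9125

# Hypothetical: an RSq of .78 with 4 predictors and 1,000 validation rows.
print(round(adjusted_r_squared(0.78, n_obs=1000, n_predictors=4), 4))
```

With thousands of observations and only three or four predictors, the adjustment is tiny, so plain and adjusted RSq would tell nearly the same story here.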
Performance Measures of Model #2
While Multiple Linear Regression and Clustering Analysis inherently generate different error metrics, I will analyze the clustering analysis using a similar methodology to RSq in terms of measuring the “fit” of the model. For this reason, the primary error metric I would use to judge my clustering analysis is the normalized average distance from the center of each cluster. On average, the normalized distance from each cluster center was lower than most of the examples we observed in class. For this reason, I feel confident that both my models pass the criteria of performance measures. For reference, Figure H below contains the average normalized distance from the center of each cluster.
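The average distance to the cluster center could be computed as in the sketch below. It assumes Euclidean distance on already-normalized coordinates. The helper name and the tiny two-dimensional sample are invented for illustration.

```python
from collections import defaultdict
from math import dist
from statistics import mean

def avg_distance_to_center(points, labels, centers):
    """Average Euclidean distance from each point to its own cluster center."""
    by_cluster = defaultdict(list)
    for point, label in zip(points, labels):
        by_cluster[label].append(dist(point, centers[label]))
    return {label: mean(ds) for label, ds in by_cluster.items()}

# Invented, already-normalized sample with two clusters.
points = [(0, 0), (0, 2), (10, 10), (10, 14)]
labels = [0, 0, 1, 1]
centers = [(0, 1), (10, 12)]
print(avg_distance_to_center(points, labels, centers))  # {0: 1.0, 1: 2.0}
```

Lower values mean tighter clusters. Because the inputs are normalized, the distances are in standard-deviation units rather than feet or degrees, which is what makes them comparable across data sets.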
Overall, my project turned out in a way that I was satisfied with. While the pre-processing and post-processing weren’t too painstaking with such a clean data set, I was able to draw concrete conclusions from the data that I gathered and the tests that I ran. I believe that my ability to draw specific recommendations (e.g. trade for Colby Rasmus, trade away Didi Gregorius) is a differentiating factor with my project and also a testament to the amount of work that I put into the project and the understanding that I have of the subject matter. I believe that my models are both viable based on conventional error metrics, and that both of the models could provide insights that are helpful to MLB front offices. Specifically, my clustering model would be well suited to identifying more trade targets for a team trying to add power, as it would allow them to add players who hit “Long-Bomb” HR’s while avoiding players whose HR totals are artificially inflated by softly-hit balls that end up in the seats.