The Growing Disparity between MLB Teams

Jordan Bean
Jul 20, 2018


After defining the problem, we did some exploratory analysis on new players entering the league and whether there have been any noteworthy shifts in variables such as time in the minors, age, and draft round.

Now, we’re going to explore the shifts in team performance. For the sake of this article, team performance will be analyzed using final standings for each year. Most columns are self-explanatory, and the “games_back” column indicates the number of games a team finished behind its division winner in that year. A value of 0 indicates that the team won its division.

(Full code for this post is here)

The goal

The goal of this analysis will be to look for changes in metrics that will give us insight into the way the nature of competition may have changed over time. The analysis will be a mix of descriptive statistics with the columns that we have above as well as creating our own variables and segmentations of the data.

For the sake of this analysis, we’ll be thinking of “nature of competition” as the disparity between the stronger and the weaker teams. I think this is an important and insightful way to look at the data because if all teams have a winning percentage of ~.500, then the talent is approximately equal across the league and more games are competitive (and therefore more enjoyable to watch).

As the gap grows between the better and worse teams, the concentration of talent creates an environment where the on-field product suffers even if the league-wide talent is unchanged. It’s similar to what’s happening in the NBA, where limited top talent is grouping in a small subset of teams (except that it hasn’t had as adverse an effect in that league).

We’ll start by looking at some visuals that depict how the distribution of win percentage looks across the data. We’ll then continue by looking at the games_back variable, and finally the run differential.

Exploring win percentage data

The distribution of win percentages is unsurprisingly bimodal on either side of the .500 mark, indicating that the majority of teams finish just above or below .500 in a given year.

Examining the year-by-year distribution doesn’t show any significant outliers. The low end of the win percentage range for 2017 is higher than in previous years and the 25th percentile is comparable. That said, the median is at its lowest in two decades, even though the magnitude of the difference is small.

In tandem with the median winning percentage being lower in 2017, the percentage of teams that finished below .500 (i.e. had more losses than wins) was at a near multi-decade high. We would expect this figure to be around 50% given the win percentage distribution that we looked at earlier, but the 2017 value was abnormally high.
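As a rough illustration of how that share can be computed, here is a minimal sketch. The rows are made-up placeholder values rather than real standings, the `share_below_500` helper is hypothetical, and counting a team that finishes exactly at .500 as "not below" is my assumption.

```python
from collections import defaultdict

# Hypothetical (year, wins, losses) rows for illustration only; not real standings.
standings = [
    (2016, 95, 67),
    (2016, 81, 81),
    (2016, 70, 92),
    (2016, 78, 84),
    (2017, 104, 58),
    (2017, 80, 82),
    (2017, 75, 87),
    (2017, 64, 98),
]

def share_below_500(rows):
    """Return {year: fraction of teams with more losses than wins}."""
    totals = defaultdict(int)
    losing = defaultdict(int)
    for year, wins, losses in rows:
        totals[year] += 1
        if wins < losses:  # an exact .500 team is not counted as below .500
            losing[year] += 1
    return {year: losing[year] / totals[year] for year in totals}

below_500 = share_below_500(standings)
print(below_500)
```

With real standings, a value well above 0.5 for a given year would correspond to the kind of spike described for 2017.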

The interquartile range for win percentage (the 75th percentile of values minus the 25th percentile of values) does not show meaningfully different behavior recently. There’s been a slight uptick over the past few years, but it’s still in line with historical values. A larger value here would mean there’s more of a difference between the mid-tier teams (i.e. those that are neither league leaders nor finishing in last place).

Finally, the difference between the max and minimum winning percentage by year — essentially showing the difference between the best and worst teams — doesn’t show anything that would cause concern.
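The two spread measures discussed here, the interquartile range and the max-minus-min range, can be sketched per year as below. The win percentages are made-up placeholders and `spread_by_year` is a hypothetical helper. The "inclusive" quantile method is also my choice, since the post doesn't say which percentile definition was used.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical (year, win_pct) rows for illustration only.
win_pcts = [
    (2016, 0.40), (2016, 0.45), (2016, 0.50), (2016, 0.55), (2016, 0.60),
    (2017, 0.35), (2017, 0.45), (2017, 0.50), (2017, 0.55), (2017, 0.65),
]

def spread_by_year(rows):
    """Return {year: {"iqr": Q3 - Q1, "range": max - min}} for win percentage."""
    by_year = defaultdict(list)
    for year, pct in rows:
        by_year[year].append(pct)
    result = {}
    for year, values in by_year.items():
        # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
        q1, _median, q3 = quantiles(values, n=4, method="inclusive")
        result[year] = {"iqr": q3 - q1, "range": max(values) - min(values)}
    return result

spread = spread_by_year(win_pcts)
print(spread)
```

In this toy data the range widens from one year to the next while the interquartile range stays the same, which is why it helps to look at both: the range reacts to the single best and worst teams, while the interquartile range describes the middle of the pack.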

How competitive are the final standings?

Moving on to the number of games back, we see that the number has been running above its historical trend for the last two years. This is noteworthy because the average number of games back is a proxy for the competitiveness of the division. While it masks the potential for competitive intensity between the top 2 or 3 teams in the division, it does give a reading on the overall quality of competition across the season.
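A minimal sketch of the yearly average, assuming one row per team-season with a year and the `games_back` value from the standings. The numbers are placeholders, `mean_games_back` is a hypothetical helper, and including the division winners (with their 0 values) in the average is my assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (year, games_back) rows; 0.0 marks a division winner.
games_back_rows = [
    (2015, 0.0), (2015, 2.0), (2015, 10.0), (2015, 12.0),
    (2017, 0.0), (2017, 11.0), (2017, 21.0), (2017, 28.0),
]

def mean_games_back(rows):
    """Return {year: average games behind the division winner}."""
    by_year = defaultdict(list)
    for year, games_back in rows:
        by_year[year].append(games_back)
    return {year: mean(values) for year, values in by_year.items()}

avg_games_back = mean_games_back(games_back_rows)
print(avg_games_back)
```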

After finding the 75th percentile for number of games behind the division leader, which turned out to be 22.5 games, we can look at the percentage of teams in a given year that are further back than that figure. Our baseline is that roughly 25% of teams should finish at least that many games back. We see, once again, that 2016 was slightly above that baseline, while 2017 far exceeded it with a value of 37%.
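The threshold comparison can be sketched as follows. I'm assuming the 75th percentile is taken over all team-seasons pooled together and that "further back" means strictly greater than the threshold. The data is made up, so the threshold here will not be 22.5, and `share_far_back` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical (year, games_back) rows for illustration only.
games_back_rows = [
    (2016, 0.0), (2016, 5.0), (2016, 10.0), (2016, 12.0),
    (2017, 0.0), (2017, 15.0), (2017, 30.0), (2017, 35.0),
]

def share_far_back(rows):
    """Return (threshold, {year: share of teams further back than threshold})."""
    all_values = [games_back for _year, games_back in rows]
    # Third quartile cut point = 75th percentile of the pooled values.
    threshold = quantiles(all_values, n=4, method="inclusive")[2]
    totals = defaultdict(int)
    far_back = defaultdict(int)
    for year, games_back in rows:
        totals[year] += 1
        if games_back > threshold:
            far_back[year] += 1
    return threshold, {year: far_back[year] / totals[year] for year in totals}

threshold, far_back_share = share_far_back(games_back_rows)
print(threshold, far_back_share)
```

A year whose share sits well above 0.25 has more teams buried in the standings than the pooled history would suggest.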

How competitive are the games?

The last trend we’ll look at that relates to quality of play and competitiveness is the run differential. This is simply the absolute value of the difference between the two teams’ scores in a game. So, for example, if the Yankees beat the Red Sox 5–2, the run differential is 3.

A higher run differential indicates that games are, on average, less competitive, while a lower one indicates higher competitive intensity. I’ll reiterate that this doesn’t tell us whether any single game or even set of games is competitive or not. Instead, it looks holistically at the season: how did it trend relative to other years?
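Here is a minimal sketch of the calculation, using the 5–2 example from above plus two made-up scores. The `run_differential` and `season_average` helpers and the game tuples are hypothetical, and averaging over all games in a season is my reading of "on average".

```python
from collections import defaultdict
from statistics import mean

def run_differential(score_a, score_b):
    """Absolute difference between the two teams' final scores."""
    return abs(score_a - score_b)

# Hypothetical (year, score_a, score_b) rows; the first mirrors the 5-2 example.
games = [
    (2017, 5, 2),
    (2017, 1, 9),
    (2017, 4, 3),
]

def season_average(rows):
    """Return {year: mean run differential across that season's games}."""
    by_year = defaultdict(list)
    for year, score_a, score_b in rows:
        by_year[year].append(run_differential(score_a, score_b))
    return {year: mean(values) for year, values in by_year.items()}

avg_run_diff = season_average(games)
print(run_differential(5, 2), avg_run_diff)
```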

As we can see from the chart, the run differential has trended upward over the past 5 years, and for the last two has trended above its historical norm.

Conclusion: Let’s bring it all together

There have been a lot of charts and graphs looking at trended time series data in this post and the previous one. While none can individually give us conviction that anything has changed, collectively the different measures point consistently in the same direction.

We can, as a result of looking at the data, be more certain that the change in MLB attendance is not a fluke, but rather a confluence of factors that are collectively influencing the quality of play on the field and the number of people that are fans of the game.

In the prior post in this series we saw that the type of player entering the league is changing. Now, using the charts and analysis above, we can say that it looks like the inter-team competition may be shifting. While many of the statistics that we looked at were volatile, when we isolate the behavior to the last few years, they may show unfavorable trends.

Since this is an evolving topic, and certainly has many other correlated or influencing variables, I’d love to hear if you have any thoughts on other variables to explore further.

Some other related ideas that I’m going to (eventually) pursue include how MLB attendance trends against other sports, which variables could gauge amateur interest (e.g. Little League, Babe Ruth, AAU), and, maybe most importantly, whether MLB is losing fans or if they’re shifting their medium of consumption (i.e. from attendance at games to watching on a mobile device).
