What’s the deal with declining MLB attendance? Part 1
One of my favorite parts about working with data is that it brings clarity to uncertainty. It changes a conversation from speculation to action. It also can be used to debunk myths or give concrete evidence to commonly held beliefs. Ultimately, data can answer questions.
The question that prompted this mini-series of articles came from a WSJ article that was commenting on a “sharp drop in attendance” during the early part of the 2018 Major League Baseball season.
The line in the article that stuck out to me most was: “The simplest answer, and the one [MLB Commissioner] Manfred would prefer, is the weather.” This seems like an awfully simple answer to a much more complex question.
Indeed, the article goes on to give a counterargument to Manfred’s claim:
“Through this time last year, Blue Jays attendance is down 29% in Toronto at the Rogers Centre, a stadium with a retractable roof. It’s down 3% at Seattle’s Safeco Field, even with the Mariners sporting one of baseball’s best records. Crowds are also down 10.9% in Oakland, 6.7% in San Francisco and 4.2% in Tampa Bay, markets where weather is almost never a factor.”
The question
Intuitively, weather should affect attendance at games, especially at outdoor stadiums in climates like Boston, New York, and Detroit during the spring months. But does it? If so, by how much, and what other factors could be contributing to a change in attendance? Can we make reasonable predictions for attendance per game using weather and related factors (month, team)?
I’ve been working on learning Python, and this seemed like a good time to bring together disparate concepts into a cohesive project. The full code can be found here and I’d welcome any feedback that people have on improvements!
The process
So, I set out to answer the above questions. Through a series of posts, I’ll walk through each step of the process to reach an answer. At a high level, they will be:
1. Find the relevant data
2. Clean the data and put it into a usable format
3. Perform exploratory data analysis
4. Feature engineering and selection for model building
5. Model building and interpretation
6. Statistical inference and significance testing
7. Conclusion
The ultimate goal of the project is to determine what effect, if any, weather has on attendance at Major League Baseball games, and to use statistical and modeling techniques in Python to come to a data-backed conclusion.
Step 1: Find the relevant data
The process starts with having data to work with. Working backward from the solution, there are two overarching datasets that will need to be found, complemented by a series of other supporting data points.
1. Granular data on Major League Baseball games with accompanying attendance figures.
2. Weather data for each team’s city across the relevant time frame.
Baseball data
Major League Baseball has historically been at the forefront of data collection, given the seemingly infinite parts of a game that can be captured.
This made finding the relevant data fairly straightforward, and the decision became how much data to incorporate. For the sake of this analysis, I decided to use all game data from 1990 onward. The data for each season was found at www.retrosheet.org, saved as .txt files, and ultimately concatenated into a single data frame.
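The concatenation step can be sketched roughly as follows. This is a minimal sketch, not the code from the project: it assumes pandas, assumes the season files sit in one folder as comma-separated .txt files with no header row, and uses a hypothetical helper name, load_game_logs, along with an added source_file column that is purely illustrative.

```python
from pathlib import Path

import pandas as pd


def load_game_logs(folder):
    """Read every season file in `folder` and stack them into one DataFrame."""
    frames = []
    for path in sorted(Path(folder).glob("*.txt")):
        # Assumption: the files are comma-separated with no header row,
        # so columns are identified by position (0, 1, 2, ...).
        season = pd.read_csv(path, header=None, dtype=str)
        # Illustrative extra column: remember which file each row came from.
        season["source_file"] = path.name
        frames.append(season)
    if not frames:
        raise FileNotFoundError(f"no .txt files found in {folder}")
    # ignore_index gives the combined frame a fresh 0..n-1 row index.
    return pd.concat(frames, ignore_index=True)
```

Reading everything as strings first keeps the load step from guessing types differently across seasons; numeric and date conversions can then happen once, during cleaning.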
The files contained about 160 columns, covering everything from umpires to players to box scores. I settled on 32 columns that I thought would be most relevant for this analysis, a sampling of which is below.
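Narrowing a wide, headerless frame down to named columns might look something like this. It is only a sketch: the handful of columns shown is a hypothetical subset rather than the 32 actually used, the names are invented for illustration, and the zero-based positions are my assumption about the Retrosheet game log layout, so they should be checked against Retrosheet's field guide.

```python
import pandas as pd

# Hypothetical subset: zero-based column positions mapped to names
# chosen for this sketch. The positions are an assumption about the
# game log layout and should be verified before use.
COLUMNS = {
    0: "date",
    3: "visiting_team",
    6: "home_team",
    9: "visiting_score",
    10: "home_score",
    12: "day_night",
    16: "park_id",
    17: "attendance",
}


def select_columns(raw):
    """Keep only the columns of interest and give them readable names."""
    subset = raw[list(COLUMNS)].rename(columns=COLUMNS)
    # Dates are assumed to be stored as yyyymmdd strings.
    subset["date"] = pd.to_datetime(subset["date"].astype(str), format="%Y%m%d")
    # Missing or malformed attendance values become NaN instead of raising.
    subset["attendance"] = pd.to_numeric(subset["attendance"], errors="coerce")
    return subset
```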
Weather data
Surprisingly, the weather data was more difficult to find. The ideal dataset would have been a daily average temperature in each team’s city for every relevant year, which would then be joined on the game date and location.
I found a number of government sources that had pretty detailed information, but the timeline of the data capture was inconsistent, and most sites limited data pulls to 10 years. That gave me the potential option to do 3 data pulls per city and merge all the files together. Time-consuming, but an option.
I found a few suggestions of other sites, one of them being Weather Underground, but they recently limited their data pulls to paid customers for the timeline that I was looking at.
I actually found the perfect full dataset that I was looking for on Kaggle, but it was limited to only the 2016 season. The author had scraped Baseball Reference, and each game had the temperature at game time. Bingo. I tried to trace back his code on GitHub to see if I could replicate it for the other years.
Unfortunately, I couldn’t, and my web scraping skills aren’t nearly on par with his, so this was another option I kept in my back pocket as a follow-on challenge after completing the analysis.
After some more trial and error on sites, I decided to use data from the National Weather Service. From it, I was able to pull the monthly average temperature for each MLB city in the U.S., and I used the closest available city as an approximation for the Canadian teams (Montreal, Toronto).
The final data contained the average temperature for each month for each city, and a classification variable for whether the stadium was indoor (1) or outdoor (0).
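Because the weather table is monthly rather than daily, it joins to the games on city and month instead of on the exact date. Here is a minimal sketch of that join, assuming pandas and hypothetical column names (date and city on the games, and city, month, avg_temp, and indoor on the weather table); the real merge may well look different.

```python
import pandas as pd


def add_monthly_temperature(games, weather):
    """Attach the monthly average temperature and roof flag to each game.

    Assumed (hypothetical) columns:
      games:   date, city, plus any game-level fields
      weather: city, month (1-12), avg_temp, indoor (1 = indoor, 0 = outdoor)
    """
    games = games.copy()  # leave the caller's frame untouched
    games["month"] = pd.to_datetime(games["date"]).dt.month
    return games.merge(
        weather,
        on=["city", "month"],
        how="left",               # keep every game, even without a weather match
        validate="many_to_one",   # fail loudly if a city-month appears twice
    )
```

A left join keeps games that have no matching weather row, so gaps show up as missing values to inspect rather than as silently dropped games.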
Coming up next
Okay, so now we have the data. Coming up next will be the data cleaning, merging, and exploratory data analysis.