MLB GameDay XML Parser for Python

ADDENDUM (OH SHIT I FORGOT): You need one non-standard module for this, and it’s BeautifulSoup. Try ‘easy_install bs4’ (or ‘pip install beautifulsoup4’) or get it from here:

I’m not posting this on GitHub or anything because this was just an exercise in the process of teaching myself Python, and it could very well be buggy as shit and I don’t want to support it at all. BUT:

I wrote a Python module that has a few functions that are useful in grabbing and parsing the gameday data that MLB makes available via XML files. You can get it here:

There are 10 functions in total, but of particular interest to Dude McRandom will probably be the WriteGames function, which will take every pitch from a list of games and write the data (game info, inning info, at bat info, and pitch f/x info) to a CSV file, one row per pitch. I’m going to list the functions and what they do, but after all that will be a concrete example, so you might want to just scroll down a bunch. 

mlbgid module functions:


This returns a list of the field names for the data that WriteGames() writes to the CSV file. The csv.DictWriter needs it (more on that later).


This takes an open CSV file and returns a DictWriter object that will translate the Python dictionary of values extracted from the XML to a row of CSV data
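If you haven't met csv.DictWriter before, here is roughly how the field names and the writer fit together. This is a minimal sketch, not the module's actual code: FIELD_NAMES is a made-up handful of columns (the real list is much longer), make_writer is a hypothetical name, and the game ID is just an example in the format described below.

```python
import csv
import io

# Hypothetical subset of columns; the module's real field list is longer.
FIELD_NAMES = ["gameid", "inning", "batter", "pitcher", "pitch_type", "start_speed"]

def make_writer(csvfile, fieldnames=FIELD_NAMES):
    """Return a DictWriter that maps dictionaries onto CSV rows."""
    # restval fills in columns the dictionary lacks;
    # extrasaction="ignore" drops keys that are not in fieldnames.
    return csv.DictWriter(csvfile, fieldnames=fieldnames,
                          restval="", extrasaction="ignore")

buf = io.StringIO()          # stands in for an open CSV file
writer = make_writer(buf)
writer.writeheader()         # column names as the first row
writer.writerow({"gameid": "2013_04_01_phimlb_atlmlb_1",
                 "inning": "1", "pitch_type": "FF"})
print(buf.getvalue())
```

Any column missing from the dictionary just comes out blank, which is handy when not every pitch has every measurement.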


This takes a string, formatted as an MLB game ID, and returns an XML element object that contains all of the game data. MLB game IDs are formatted like so: YYYY_MM_DD_AWAYTEAMmlb_HOMETEAMmlb_gamenumber, where gamenumber is usually 1, but would be 2 if it’s the second game of a doubleheader.
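To make that ID format concrete, here is a small sketch that takes a game ID apart and puts one back together. Neither helper is part of the module; split_game_id and build_game_id are hypothetical names, and the sample ID is only an illustration of the pattern.

```python
from collections import namedtuple

GameId = namedtuple("GameId", "year month day away home game_number")

def _strip_league(team):
    """Drop the trailing 'mlb' from a team code, if it is there."""
    return team[:-3] if team.endswith("mlb") else team

def split_game_id(gameid):
    """Split 'YYYY_MM_DD_AWAYmlb_HOMEmlb_N' into its six parts."""
    year, month, day, away, home, number = gameid.split("_")
    return GameId(int(year), int(month), int(day),
                  _strip_league(away), _strip_league(home), int(number))

def build_game_id(year, month, day, away, home, game_number=1):
    """Assemble a game ID from its parts (team codes without 'mlb')."""
    return "%04d_%02d_%02d_%smlb_%smlb_%d" % (
        year, month, day, away, home, game_number)

parts = split_game_id("2013_04_01_phimlb_atlmlb_1")
print(parts)
print(build_game_id(2013, 4, 1, "phi", "atl", game_number=2))
```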


TodaysGames()

This returns a list of the MLB game IDs for every game played today.


This takes a date string and returns a list of MLB game IDs consisting of all games that occurred on the date represented by the string. The string MUST be formatted as YYYY-MM-DD
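Since the date format is strict, it can be worth checking the string before handing it over. A minimal sketch using only the standard library; parse_game_date is a hypothetical helper, not something the module provides.

```python
from datetime import date, datetime, timedelta

def parse_game_date(datestring):
    """Return a date for a 'YYYY-MM-DD' string, or raise ValueError."""
    parsed = datetime.strptime(datestring, "%Y-%m-%d").date()
    # strptime tolerates '2013-4-1', so insist on the zero-padded form.
    if parsed.isoformat() != datestring:
        raise ValueError("expected YYYY-MM-DD, got %r" % datestring)
    return parsed

# date.isoformat() already produces the right format, so building
# the string for yesterday's games is a one-liner.
yesterday = (date.today() - timedelta(days=1)).isoformat()
print(parse_game_date("2013-04-01"), yesterday)
```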

ParsePitch(pitch, GameProperties, InnProperties, ABProperties, istop, writer)

This takes an XML element for a pitch, a dictionary of game attributes, a dictionary of inning attributes, a dictionary of at bat attributes, a boolean variable indicating whether it is the top of an inning or not, and a CSV DictWriter. It combines the dictionaries into one set of attributes and writes it to the open CSV file handled by the DictWriter. This is the business end of building the CSV file. It returns nothing of value.
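The "combine the dictionaries and write a row" step looks something like the sketch below. It is an illustration under assumptions, not the module's code: merge_properties, the "half" column, and all of the sample values are hypothetical.

```python
import csv
import io

def merge_properties(game_props, inning_props, atbat_props, pitch_props, istop):
    """Flatten the per-level dictionaries into one row dictionary."""
    row = {}
    # Later dictionaries win if two levels share a key.
    for props in (game_props, inning_props, atbat_props, pitch_props):
        row.update(props)
    row["half"] = "top" if istop else "bottom"
    return row

game = {"gameid": "2013_04_01_phimlb_atlmlb_1"}
inning = {"inning": "1"}
atbat = {"batter": "123", "pitcher": "456"}
pitch = {"pitch_type": "FF", "start_speed": "91.2"}

row = merge_properties(game, inning, atbat, pitch, istop=True)

buf = io.StringIO()                      # stands in for the open CSV file
writer = csv.DictWriter(buf, fieldnames=sorted(row))
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```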


This takes an XML element for an atbat, and returns a dictionary of attributes for the atbat.


This takes an XML element for an inning, and returns a dictionary of attributes for the inning.
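Pulling an element's attributes into a dictionary is about as simple as XML parsing gets. The sketch below uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra installs, and the sample XML is a cut-down, made-up fragment rather than a real GameDay file. element_properties is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Made-up fragment; real files carry many more attributes.
SAMPLE_XML = """<inning num="1" away_team="phi" home_team="atl">
  <top>
    <atbat num="1" batter="123" pitcher="456" event="Strikeout">
      <pitch des="Called Strike" type="S" start_speed="91.2"/>
      <pitch des="Swinging Strike" type="S" start_speed="84.0"/>
    </atbat>
  </top>
</inning>"""

def element_properties(element):
    """Return a plain dictionary of an element's attributes."""
    return dict(element.attrib)

inning = ET.fromstring(SAMPLE_XML)
inning_props = element_properties(inning)

for half in inning:                          # <top>, then <bottom> if present
    for atbat in half.findall("atbat"):
        atbat_props = element_properties(atbat)
        pitches = [element_properties(p) for p in atbat.findall("pitch")]
        print(half.tag, atbat_props["event"], len(pitches), "pitches")
```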

ParseGame(game, gameid, writer)

This takes an XML element for a game, a string of the MLB game ID, and a CSV DictWriter, and writes the game’s pitches to the CSV file handled by the DictWriter

WriteGames(gameids, filename)

This takes a list of MLB game IDs and a string representing the CSV filename that you want to write to, and writes the pitch data for every game in the list to that CSV file. It returns the file object, so add a .close() on the end if you want to close the file.


The most useful implementation I can think of, for the moment, is to write a script that will grab the day’s games and write their pitch data to a csv file. With this module, you would do it like this:

import mlbgid


That’s it. The first line imports the module. The second line calls the WriteGames() function. The first argument for WriteGames is mlbgid.TodaysGames(), which is a list of MLB game IDs for games that happened today. The second argument is a file name that refers to ‘games.csv’ in your root. The file does not need to be preexisting. ‘.close()’ is appended to close the games.csv file. 
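If you would rather keep the close explicit instead of chaining it, the same idea can be wrapped in a small function. This is a sketch that assumes only what is described above (TodaysGames() returns a list of game IDs, WriteGames() returns the open file); write_todays_games is a hypothetical helper and the module is passed in as an argument so the function can be tried out without a network connection.

```python
def write_todays_games(mlbgid, filename="games.csv"):
    """Write today's pitch data to filename and return the number of game IDs."""
    gameids = mlbgid.TodaysGames()
    csvfile = mlbgid.WriteGames(gameids, filename)
    csvfile.close()          # WriteGames hands back the open file object
    return len(gameids)
```

With the real module you would call it as write_todays_games(mlbgid) after import mlbgid.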


  • The MLB XML files use home_team_runs and away_team_runs to represent the score after the play in question. I thought that was fucking stupid. One might want to figure out things like “list all pitches thrown while the score was 2-1” or some such, so I jiggered the code a bit such that away_team_runs and home_team_runs represent the score at the beginning of the at bat in question. The only time this gets tricky is if a run scores in the middle of an at bat (e.g. on a wild pitch). If that occurs, then those fields will continue to represent the score at the beginning of the at bat, until the next batter comes up. I could probably fix this, but, reasons.
  • WriteGames() writes a header row to the CSV file. This is just for ease of interpretation. If you use it to write Spring Training or exhibition games, though, not all pitch f/x data is recorded for those. So you’ll have the full header row but not all of the data will be filled in for those games. Just FYI. I’m not sure exactly what happens if you use WriteGames() on a day when there are both spring games with no pitch f/x data and regular season games with pitch f/x data. Try it and find out for yourself! Fucker. 
  • WriteGames() will write little notifications for each game it completes. If it comes across a game where the consolidated XML file is not available, the notification will say “[Game] not found.” This should only happen in 3 instances: 1) World Baseball Classic games, 2) 2012 regular season games where the Marlins were the away or home team and MLB stupidly included a blank entry for ‘flomlb’ (don’t worry, it also included a full entry for ‘miamlb’ and that data WILL be written), 3) games that did not occur on that day because of a rain/snow/whatever postponement; the future makeup game will be collected when you run WriteGames() on that date.
  • Certain pitches thrown in 2010 and earlier will probably have some blank data because pitch f/x collection was sort of spotty.
  • That’s all I can think of for now. Like I said this was literally a babby’s first Python exercise so I can’t guarantee the code is all that rigorous, but I’ve tested it a bunch and it seems to work fine. 
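The score shuffle from the first note above (turning "score after the play" into "score at the start of the at bat") boils down to carrying the previous at bat's values forward. A minimal sketch of that idea, assuming each at bat is a dictionary holding the after-the-play scores as strings; scores_at_start is a hypothetical name and this is not the module's actual code.

```python
def scores_at_start(atbats):
    """Return (away_runs, home_runs) as of the start of each at bat.

    Each at bat carries the score AFTER its play, so the score going
    into an at bat is whatever the previous at bat ended with.
    """
    away, home = 0, 0                      # every game starts 0-0
    starts = []
    for atbat in atbats:
        starts.append((away, home))
        away = int(atbat.get("away_team_runs", away))
        home = int(atbat.get("home_team_runs", home))
    return starts

atbats = [
    {"event": "Groundout", "away_team_runs": "0", "home_team_runs": "0"},
    {"event": "Home Run",  "away_team_runs": "1", "home_team_runs": "0"},
    {"event": "Strikeout", "away_team_runs": "1", "home_team_runs": "0"},
]
print(scores_at_start(atbats))
```

Note that the homer shows up in the score for the batter after it, not for the at bat in which it was hit, and a run that scores mid-at-bat behaves the same way.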

Questions --> @phylan but like I said I’m not exactly supporting this shit.

  1. chasingutley posted this