Monday, May 21, 2007

Lies, Damned Lies, and Statistics

I've always loved numbers and statistics. Especially when you apply them to football. There is something gratifying about being able to distill a football game down into a nice little table of data. Maybe I'm weird like that, but I know there are others out there who feel the same way.

One problem I've had over the years is that there just isn't a lot of good statistical data out there for college football. It is either high-level stuff like total yards and touchdowns, or meaningless splits that break the yards down into ridiculous categories like turf type (My team wins more games on Bermuda!). After years of searching for something better, I decided it was time to create something myself.

Over the last month, I have been working on a computer program that collects play-by-play data for games, parses out the details, and then saves that information into a database. I can then slice and dice the numbers to my heart's content to find out all kinds of interesting statistics about games, teams, and entire conferences. It is still in the early phases, but I thought I would share some of the results of my work with everyone.

I'm working on generating some reports from the data and I'll be refining those over the summer. I'm looking for feedback on what types of statistics and queries you would like to see in the reports, so after you view them drop me a comment. I'll also be posting around on some message boards looking for feedback. I'll post blog entries as I load more teams, but you can always find the latest updates on the right-hand side of the website under the new Bruin Roar Football Statistics section.

For now, I've loaded all the games for UCLA from 2005 and 2006. I've also loaded data for USC from 2006 as a comparison. My plan is to load last season's games for every team we play in 2007 into the database. If I get a positive response from readers, then I'll continue to post new reports for that week's game during the season. I think it will definitely be a big resource for you armchair analysts out there.

Technical Details

For those of you interested in how the program works, please read on. If you find such computer-speak boring then you may want to check out now.

The program is written primarily in Java. I use the Apache Commons HTTP Client to retrieve the web pages with the play-by-play data. I then screen-scrape each page and pull out just the play descriptions. Parsing the details of the play data isn't technically difficult, but it did take the most effort.

There are lots of subtle differences in the way plays are described, so finding every possible combination, and reliably extracting that information, has proven to be challenging. I also have to validate the results, as sometimes the original data is just flat out wrong. I try to fix what I can, but it is hard to catch everything, as there are over 2,000 plays in a season for a typical team. The good news is that the data is probably 95% correct, so a few misclassified plays one way or the other won't impact the overall numbers much.
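
To give a feel for the kind of pattern matching involved, here is a minimal sketch that handles a single play type. It is not the actual parser: the play phrasings ("rush for 5 yards", "rush for a loss of 2 yards", "rush for no gain") and the PlayParser and Play names are assumptions made up for the example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: parses one assumed phrasing of a rushing play. */
final class PlayParser {

    /** Hypothetical holder for the parsed details of a play. */
    static final class Play {
        final String player;
        final String type;
        final int yards;

        Play(String player, String type, int yards) {
            this.player = player;
            this.type = type;
            this.yards = yards;
        }
    }

    // Assumed wording; real play-by-play text varies far more than this.
    private static final Pattern RUSH = Pattern.compile(
        "^(?<player>.+?) rush(?:es)? for "
            + "(?:(?<loss>a loss of )?(?<yards>\\d+) yards?|(?<none>no gain))",
        Pattern.CASE_INSENSITIVE);

    /** Returns the parsed play, or empty if the text is not a recognized rush. */
    static Optional<Play> parse(String description) {
        Matcher m = RUSH.matcher(description.trim());
        if (!m.find()) {
            return Optional.empty();
        }
        int yards = 0;
        if (m.group("none") == null) {
            yards = Integer.parseInt(m.group("yards"));
            if (m.group("loss") != null) {
                yards = -yards; // a loss is stored as negative yardage
            }
        }
        return Optional.of(new Play(m.group("player"), "RUSH", yards));
    }
}
```

Returning an empty Optional for anything unrecognized makes it easy to log the descriptions that fall through and add patterns for them one at a time.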

Once I have the data parsed, I store it in a pretty simple object structure in memory. I use Velocity to write the object data out in different file formats. The main format is a set of SQL statements for inserting the information into a local MySQL database. I put everything into a single denormalized table, just to make the report generation as quick and simple as possible.
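
As a rough illustration of the SQL output step, the sketch below turns one in-memory play into an INSERT statement. Note that it uses plain string formatting rather than a Velocity template, and the table name (plays), the column names, and the PlayRow class are all hypothetical.

```java
import java.util.Locale;

/** Illustrative only: renders a play as a SQL INSERT for a single flat table. */
final class PlaySqlWriter {

    /** Hypothetical row in the denormalized table. */
    static final class PlayRow {
        final String team;
        final int season;
        final String playType;
        final int yards;

        PlayRow(String team, int season, String playType, int yards) {
            this.team = team;
            this.season = season;
            this.playType = playType;
            this.yards = yards;
        }
    }

    /** Wraps a value in single quotes, escaping backslashes and quotes. */
    static String quote(String value) {
        return "'" + value.replace("\\", "\\\\").replace("'", "''") + "'";
    }

    /** Builds one INSERT statement; table and column names are assumptions. */
    static String toInsert(PlayRow row) {
        return String.format(Locale.ROOT,
            "INSERT INTO plays (team, season, play_type, yards) VALUES (%s, %d, %s, %d);",
            quote(row.team), row.season, quote(row.playType), row.yards);
    }
}
```

Hand-escaping like this is only reasonable for a script file that gets loaded locally. Code that talks to the database directly would be better off with a JDBC PreparedStatement.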

To create the reports, I use a local instance of Tomcat running some JSP pages. I have another Java program that loops through all the teams and games, passes those as parameters to the JSP pages, and then saves the generated HTML. Finally, I run a script that FTPs all the HTML documents to the web server and, voilà, you have the reports.
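
The fetch-and-save step could look something like the sketch below. Everything specific in it is an assumption for illustration: the ReportSaver class, the "team" and "game" parameter names, and whatever JSP address gets passed in.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Illustrative only: requests a rendered report page and saves it as static HTML. */
final class ReportSaver {

    /** Appends URL-encoded team and game parameters (names are hypothetical). */
    static String reportUrl(String jspUrl, String team, String game) {
        return jspUrl
            + "?team=" + URLEncoder.encode(team, StandardCharsets.UTF_8)
            + "&game=" + URLEncoder.encode(game, StandardCharsets.UTF_8);
    }

    /** Downloads the generated page and writes it to the target file. */
    static void save(String url, Path target) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

A driver program would call reportUrl and save once per team and game, leaving a folder of plain HTML files ready to upload.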

To run the whole thing for one team, for one season, takes less than 5 minutes. There are still a few manual processes in there, but I'm trying to automate the entire thing. I'm still tracking down bugs and refining the program, but I'm pretty happy with the way it works.

Enjoy!

3 comments:

Anonymous said...

Your counterpart is a Trojan.

Here is his website:
http://www.trojanfootballanalysis.com/index.html

--Bill

Anonymous said...

how do u arrive at time of possession?

CPBruinFan said...

I get the time of possession information from the same source as the rest of the data. The Play by Play page has information for each drive including the time of possession. I don't count time consumed on kick-offs, punts, or field goals.
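
Adding up the drive times is simple arithmetic. Here is a minimal sketch, assuming each drive's time of possession arrives as a "m:ss" string; the TimeOfPossession class and its method names are made up for the example, and the kicking plays are assumed to be filtered out beforehand.

```java
import java.util.List;

/** Illustrative only: totals per-drive possession times given as "m:ss". */
final class TimeOfPossession {

    /** Converts a clock string such as "3:42" into seconds. */
    static int toSeconds(String clock) {
        String[] parts = clock.trim().split(":");
        return Integer.parseInt(parts[0]) * 60 + Integer.parseInt(parts[1]);
    }

    /** Sums the drive times and formats the total back as "m:ss". */
    static String total(List<String> driveTimes) {
        int seconds = 0;
        for (String time : driveTimes) {
            seconds += toSeconds(time);
        }
        return String.format("%d:%02d", seconds / 60, seconds % 60);
    }
}
```

For example, drives of 3:42, 1:05, and 4:30 add up to 557 seconds, which total() reports as 9:17.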

The numbers in the reports may differ from the official numbers, but they should be pretty close. Everything is still in the early development phases and I still need to validate the drive information. Hopefully, I'll have everything ironed out before the start of the season.