Adventures in Open Data ETL

Earlier this month, I had the pleasure to speak at the Code for America Summit about work that Accela is doing to help the City of Evanston, IL publish it’s restaurant inspection data in LIVES format.

The benefits of publishing data in standard formats is becoming clearer, but many governments still have questions about how it is done. In the next few posts, I want to walk through the mechanics of taking data from Accela Automation and publishing it to an open data portal, and then using that open data to create a LIVES-compliant data feed.

We’re going to work backwards by first examining how to take open data published to CivicData.com and converting it to LIVES format. In the next post, we’ll detail the process of publishing data from an Accela Automation instance to CivicData.com

Show me the data

The first step to creating standard data is to first make data available in an open format. CivicData.com is Accela’s open data platform that can be used by any Accela customer, any Code for America Brigade and anyone else that wants to publish and work with open data.

Working with the City of Evanston, we published their restaurant inspection data to CivicData.com. This data is set up to synch regularly (about once an hour) to ensure that the latest data is available for users. Because CivicData.com is built on CKAN, we have access to CKAN’s APIs to retrieve this data.

By making simple REST calls to the CKAN API, we can get Evanston’s data in JSON format. The beautiful thing about JSON is it’s malleability, allowing us to restructure it to match the LIVES standard. In addition, we used the City of Evanston’s public geocodeing service to add coordinates to each record in the dataset.

With all of the pieces in place, we can bring everything together using the power of the UNIX toolset.

Simple but powerful UNIX tools

Because JSON is widely supported, we have lots of choices for the language or platform we may use to parse it. Since the process we built for the City of Evanston is meant to be repeatable and serve multiple cities, we wanted to set up a dedicated server to host data feeds. Automating the process of retrieving Evanston’s data from CivicData.com made a lot of sense, so we envisioned the process we created to do the conversion being kicked off by a simple cron job.

One of my favorite utilities for parsing JSON is jq – when installed on UNIX like systems it can be used in conjunction with a host of other simple but powerful tools to convert data. So with all of this in mind, a simple shell script became the most obvious answer for doing the required data conversion – simple, easy & straightforward.

For now, the data coming from Evanston’s Accela Automation instance does not have coordinates associated with each record, so we also needed to do some additional work to geocode each record before converting to the LIVES format. To keep the primary shell script uncluttered, we decided to use a simple PHP script to do the geocoding out of band. Our shell script simply invokes this PHP script to do the geocoding once the needed data has been fetched from CivicData.com.

Again, because JSON is widely supported, we could have chosen any number of languages to write our geocoder script in. PHP is ubiquitous, pretty easy to use for small tasks and has good JSON support – so it made sense for our particular use case. Our master shell script is kicked off daily via cron (note – we could do this more frequently, but for now a daily interval seems to make sense).

Thats it! Our complete ETL script is here.

This example helps to show the power of open data. When we have data published in an open format and we employ the simple yet powerful tools we have available to manipulate and process data, we can create powerful solutions to benefit our customers.

Look for more ETL stories in future posts.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s