A reusable data processing workflow

Correction (September 2, 2014 8:55pm EDT): We originally stated that the script should combine data from multiple American Community Survey population estimates. This methodology is not valid. This post and the accompanying source code have been updated accordingly. Thanks to census expert Ryan Pitts for catching the mistake. This is why we open source our code!

The NPR Visuals team was recently tasked with analyzing data from the Pentagon’s program to disperse surplus military gear to law enforcement agencies around the country through the Law Enforcement Support Office (LESO), also known as the “1033” program. The project offers a useful case study in creating data processing pipelines for data analysis and reporting.

The source code for the processing scripts discussed in this post is available on GitHub. The processed data is available in a folder on Google Drive.

Automate everything

There is one rule for data processing: Automate everything.

Data processing is fraught with peril. Your initial transformations and data analysis will always have errors and never be as sophisticated as your final analysis. Do you want to hand-categorize a dataset, only to get updated data from your source? Do you want to laboriously add calculations to a spreadsheet, only to find out you misunderstood some crucial aspect of the data? Do you want to arrive at a conclusion and forget how you got there?

No you don’t! Don’t do things by hand, don’t do one-off transformations, don’t make it hard to get back to where you started.

Create processing scripts managed under version control that can be refined and repeated. Whatever extra effort it takes to set up and develop processing scripts, you will be rewarded the second or third or fiftieth time you need to run them.

It might be tempting to change the source data in some way, perhaps to add categories or calculations. Don’t. If you need to add data or make calculations, your scripts should do that for you.

The top-level build script from our recent project shows this clearly, even if you don’t write code:

#!/bin/bash

echo 'IMPORT DATA'
echo '-----------'
./import.sh

echo 'CREATE SUMMARY FILES'
echo '--------------------'
./summarize.sh

echo 'EXPORT PROCESSED DATA'
echo '---------------------'
./export.sh

We separate the process into three scripts: one for importing the data, one for creating summarized versions of the data (useful for charting and analysis) and one that exports full versions of the cleaned data.

How we processed the LESO data

The data, provided by the Defense Logistics Agency’s Law Enforcement Support Office, describes every distribution of military equipment to local law enforcement agencies through the “1033” program since 2006. The data does not specify the agency receiving the equipment, only the county the agency operates in. Every row represents a single instance of a single type of equipment going to a law enforcement agency. The fields in the source data are:

  • State
  • County
  • National Stock Number (NSN): A standardized categorization system for equipment
  • Quantity
  • Units: A description of the unit to use for the item (e.g. “each” or “square feet”)
  • Acquisition cost: The per-unit cost of the item when purchased by the military
  • Ship date: When the item was shipped to a law enforcement agency

Import

Import script source

The process starts with a single Excel file and builds a relational database around it. The Excel file is cleaned, converted to CSV and imported into a PostgreSQL database. Then additional data is loaded that helps categorize and contextualize the primary dataset.
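
In broad strokes, the import looks something like the sketch below. This is a simplified illustration, not the project’s actual import.sh: the file paths and table layout are assumptions, and the real data table also carries federal supply category and class columns (used in the queries below) that are omitted here.

# Convert the source spreadsheet to CSV (paths are hypothetical)
in2csv data/leso.xlsx > build/leso.csv

# Create a table matching the source columns and load the CSV into it
psql leso -c "CREATE TABLE data (
  state char(2),
  county varchar,
  nsn varchar,
  item_name varchar,
  quantity integer,
  ui varchar,
  acquisition_cost numeric,
  ship_date date
);"

psql leso -c "COPY data FROM '`pwd`/build/leso.csv' WITH CSV HEADER;"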

We also import a list of all agencies using csvkit, a command-line toolkit for working with CSV data (a rough sketch of these steps follows the list):

  • Use csvkit’s in2csv command to extract each sheet
  • Use csvkit’s csvstack command to combine the sheets and add a grouping column
  • Use csvkit’s csvcut command to remove a pointless “row number” column
  • Import the final output into the Postgres database
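
Roughly, those steps look like the following sketch. The sheet names, file paths and the position of the row-number column are hypothetical; the real commands are in import.sh.

# Extract each sheet of the agencies spreadsheet as CSV
in2csv --sheet "Sheet1" data/agencies.xlsx > build/agencies_1.csv
in2csv --sheet "Sheet2" data/agencies.xlsx > build/agencies_2.csv

# Stack the sheets, recording which sheet each row came from in a grouping column
csvstack --groups sheet1,sheet2 --group-name group \
  build/agencies_1.csv build/agencies_2.csv > build/agencies_stacked.csv

# Drop the pointless row-number column (assumed to be the first column)
csvcut --not-columns 1 build/agencies_stacked.csv > build/agencies.csv

# Load the result into Postgres (assumes an agencies table has been created)
psql leso -c "COPY agencies FROM '`pwd`/build/agencies.csv' WITH CSV HEADER;"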

Summarizing

Summarize script source

Once the data is loaded, we can start playing around with it by running queries. As the queries become well-defined, we add them to a script that exports CSV files summarizing the data. These files are easy to drop into Google spreadsheets or send directly to reporters using Excel.

We won’t go into the gory details of every summary query. Here’s a simple query that demonstrates the basic idea:

echo "Generate category distribution"
psql leso -c "COPY (
  select c.full_name, c.code as federal_supply_class,
    sum(d.quantity * d.acquisition_cost) as total_cost
  from data as d
  join codes as c on d.federal_supply_class = c.code
  group by c.full_name, c.code
  order by c.full_name
) to '`pwd`/build/category_distribution.csv' WITH CSV HEADER;"

This builds a table that calculates the total acquisition cost for each federal supply class:

full_name | federal_supply_class | total_cost
Trucks and Truck Tractors, Wheeled | 2320 | $405,592,549.59
Aircraft, Rotary Wing | 1520 | $281,736,199.00
Combat, Assault, and Tactical Vehicles, Wheeled | 2355 | $244,017,665.00
Night Vision Equipment, Emitted and Reflected Radiation | 5855 | $124,204,563.34
Aircraft, Fixed Wing | 1510 | $58,689,263.00
Guns, through 30 mm | 1005 | $34,445,427.45
...

Notice how we use SQL joins to pull in additional data (specifically, the full name field) and aggregate functions to handle calculations. By using a little SQL, we can avoid manipulating the underlying data.

The usefulness of our approach was evident early on in our analysis. At first, we calculated the total cost as sum(acquisition_cost), not accounting for the quantity of items. Because we have a processing script managed with version control, it was easy to catch the problem, fix it and regenerate the tables.
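
In terms of the query above, the fix was a one-line change to the aggregate, roughly:

-- Before: sums per-unit costs, ignoring how many units each row represents
sum(d.acquisition_cost) as total_cost

-- After: multiply per-unit cost by quantity before summing
sum(d.quantity * d.acquisition_cost) as total_cost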

Exporting

Export script source

Not everybody uses PostgreSQL (or wants to). So our final step is to export the cleaned and processed data for public consumption. This big old query merges in useful categorical information and county FIPS codes, and pre-calculates the total cost of each equipment order:

psql leso -c "COPY (
  select d.state,
    d.county,
    f.fips,
    d.nsn,
    d.item_name,
    d.quantity,
    d.ui,
    d.acquisition_cost,
    d.quantity * d.acquisition_cost as total_cost,
    d.ship_date,
    d.federal_supply_category,
    sc.name as federal_supply_category_name,
    d.federal_supply_class,
    c.full_name as federal_supply_class_name
  from data as d
  join fips as f on d.state = f.state and d.county = f.county
  join codes as c on d.federal_supply_class = c.code
  join codes as sc on d.federal_supply_category = sc.code
) to '`pwd`/export/states/all_states.csv' WITH CSV HEADER;"

Because we’ve cleanly imported the data, we can re-run this export whenever we need. If we want to revisit the story with a year’s worth of additional data next summer, it won’t be a problem.

A few additional tips and tricks

Make your scripts chatty: Always print to the console at each step of import and processing scripts (e.g. echo "Merging with census data"). This makes it easy to track down problems as they crop up and get a sense of which parts of the script are running slowly.

Use mappings to combine datasets: As demonstrated above, we make extensive use of lookup files that map values in one dataset to values in another (for example, state and county names to FIPS codes), and we use SQL joins to combine them. Mappings and joins can be hard to understand at first, but once you get the hang of them, they are easy to implement and they keep your data clean and simple.
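
For example, the FIPS mapping can be nothing more than a small CSV loaded into its own table; the file name and sample rows below are illustrative:

# data/fips.csv (illustrative contents):
#   state,county,fips
#   MI,WAYNE,26163
#   OH,CUYAHOGA,39035

psql leso -c "CREATE TABLE fips (state char(2), county varchar, fips varchar);"
psql leso -c "COPY fips FROM '`pwd`/data/fips.csv' WITH CSV HEADER;"

Any query can then attach FIPS codes with a join on state and county, as the export query above does, without touching the source data.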

Work on a subset of the data: When dealing with huge datasets that could take many hours to process, use a representative sample of the data to test your data processing workflow. For example, use 6 months of data from a multi-year dataset, or pick random samples from the data in a way that ensures the sample data adequately represents the whole.
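
One low-tech way to do this in PostgreSQL is to build a sample table and point your summary queries at it while you iterate. The table name, date range and sample rate here are arbitrary:

# Six months of a multi-year dataset...
psql leso -c "CREATE TABLE data_sample AS
  select * from data
  where ship_date >= '2013-01-01' and ship_date < '2013-07-01';"

# ...or a random one percent of rows
psql leso -c "DROP TABLE IF EXISTS data_sample;
  CREATE TABLE data_sample AS
  select * from data
  where random() < 0.01;"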

 
