A reusable data processing workflow

Correction (September 2, 2014 8:55pm EDT): We originally stated that the script should combine data from multiple American Community Survey population estimates. This methodology is not valid. This post and the accompanying source code have been updated accordingly. Thanks to census expert Ryan Pitts for catching the mistake. This is why we open source our code!

The NPR Visuals team was recently tasked with analyzing data from the Pentagon’s program to disperse surplus military gear to law enforcement agencies around the country through the Law Enforcement Support Office (LESO), also known as the “1033” program. The project offers a useful case study in creating data processing pipelines for data analysis and reporting.

The source code for the processing scripts discussed in this post is available on GitHub. The processed data is available in a folder on Google Drive.

Automate everything

There is one rule for data processing: Automate everything.

Data processing is fraught with peril. Your initial transformations and data analysis will always have errors and never be as sophisticated as your final analysis. Do you want to hand-categorize a dataset, only to get updated data from your source? Do you want to laboriously add calculations to a spreadsheet, only to find out you misunderstood some crucial aspect of the data? Do you want to arrive at a conclusion and forget how you got there?

No you don’t! Don’t do things by hand, don’t do one-off transformations, don’t make it hard to get back to where you started.

Create processing scripts managed under version control that can be refined and repeated. Whatever extra effort it takes to set up and develop processing scripts, you will be rewarded the second or third or fiftieth time you need to run them.

It might be tempting to change the source data in some way, perhaps to add categories or calculations. Don’t. If you need to add data or make calculations, your scripts should do that for you.

The top-level build script from our recent project shows this clearly, even if you don’t write code:

#!/bin/bash

echo 'IMPORT DATA'
echo '-----------'
./import.sh

echo 'CREATE SUMMARY FILES'
echo '--------------------'
./summarize.sh

echo 'EXPORT PROCESSED DATA'
echo '---------------------'
./export.sh

We separate the process into three scripts: one for importing the data, one for creating summarized versions of the data (useful for charting and analysis) and one that exports full versions of the cleaned data.

How we processed the LESO data

The data, provided by the Defense Logistics Agency’s Law Enforcement Support Office, describes every distribution of military equipment to local law enforcement agencies through the “1033” program since 2006. The data does not specify the agency receiving the equipment, only the county the agency operates in. Every row represents a single instance of a single type of equipment going to a law enforcement agency. The fields in the source data are:

  • State
  • County
  • National Stock Number (NSN): A standardized categorization system for equipment
  • Quantity
  • Units: A description of the unit to use for the item (e.g. “each” or “square feet”)
  • Acquisition cost: The per-unit cost of the item when purchased by the military
  • Ship date: When the item was shipped to a law enforcement agency

Import

Import script source

The process starts with a single Excel file and builds a relational database around it. The Excel file is cleaned, converted to CSV and imported into a PostgreSQL database. Then additional data is loaded that helps categorize and contextualize the primary dataset.
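
In broad strokes, the import looks something like the sketch below. This is a simplified illustration, not the project’s actual import.sh: the file paths and table layout are assumptions, and the real data table also carries federal supply category and class columns (used in the queries below) that are omitted here.

# Convert the source spreadsheet to CSV (paths are hypothetical)
in2csv data/leso.xlsx > build/leso.csv

# Create a table matching the source columns and load the CSV into it
psql leso -c "CREATE TABLE data (
  state char(2),
  county varchar,
  nsn varchar,
  item_name varchar,
  quantity integer,
  ui varchar,
  acquisition_cost numeric,
  ship_date date
);"

psql leso -c "COPY data FROM '`pwd`/build/leso.csv' WITH CSV HEADER;"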

We also import a list of all agencies using csvkit, a command-line toolkit for working with CSV data (a rough sketch of these steps follows the list):

  • Use csvkit’s in2csv command to extract each sheet
  • Use csvkit’s csvstack command to combine the sheets and add a grouping column
  • Use csvkit’s csvcut command to remove a pointless “row number” column
  • Import the final output into the Postgres database
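
Roughly, those steps look like the following sketch. The sheet names, file paths and the position of the row-number column are hypothetical; the real commands are in import.sh.

# Extract each sheet of the agencies spreadsheet as CSV
in2csv --sheet "Sheet1" data/agencies.xlsx > build/agencies_1.csv
in2csv --sheet "Sheet2" data/agencies.xlsx > build/agencies_2.csv

# Stack the sheets, recording which sheet each row came from in a grouping column
csvstack --groups sheet1,sheet2 --group-name group \
  build/agencies_1.csv build/agencies_2.csv > build/agencies_stacked.csv

# Drop the pointless row-number column (assumed to be the first column)
csvcut --not-columns 1 build/agencies_stacked.csv > build/agencies.csv

# Load the result into Postgres (assumes an agencies table has been created)
psql leso -c "COPY agencies FROM '`pwd`/build/agencies.csv' WITH CSV HEADER;"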

Summarizing

Summarize script source

Once the data is loaded, we can start playing around with it by running queries. As the queries become well-defined, we add them to a script that exports CSV files summarizing the data. These files are easy to drop into Google spreadsheets or send directly to reporters using Excel.

We won’t go into the gory details of every summary query. Here’s a simple query that demonstrates the basic idea:

echo "Generate category distribution"
psql leso -c "COPY (
  select c.full_name, c.code as federal_supply_class,
    sum(d.quantity * d.acquisition_cost) as total_cost
  from data as d
  join codes as c on d.federal_supply_class = c.code
  group by c.full_name, c.code
  order by c.full_name
) to '`pwd`/build/category_distribution.csv' WITH CSV HEADER;"

This builds a table that calculates the total acquisition cost for each federal supply class:

full_name | federal_supply_class | total_cost
Trucks and Truck Tractors, Wheeled | 2320 | $405,592,549.59
Aircraft, Rotary Wing | 1520 | $281,736,199.00
Combat, Assault, and Tactical Vehicles, Wheeled | 2355 | $244,017,665.00
Night Vision Equipment, Emitted and Reflected Radiation | 5855 | $124,204,563.34
Aircraft, Fixed Wing | 1510 | $58,689,263.00
Guns, through 30 mm | 1005 | $34,445,427.45
...

Notice how we use SQL joins to pull in additional data (specifically, the full name field) and aggregate functions to handle calculations. By using a little SQL, we can avoid manipulating the underlying data.

The usefulness of our approach was evident early on in our analysis. At first, we calculated the total cost as sum(acquisition_cost), not accounting for the quantity of items. Because we have a processing script managed with version control, it was easy to catch the problem, fix it and regenerate the tables.
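
In terms of the query above, the fix was a one-line change to the aggregate, roughly:

-- Before: sums per-unit costs, ignoring how many units each row represents
sum(d.acquisition_cost) as total_cost

-- After: multiply per-unit cost by quantity before summing
sum(d.quantity * d.acquisition_cost) as total_cost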

Exporting

Export script source

Not everybody uses PostgreSQL (or wants to). So our final step is to export the cleaned and processed data for public consumption. This big old query merges in useful categorical information and county FIPS codes, and pre-calculates the total cost of each equipment order:

psql leso -c "COPY (
  select d.state,
    d.county,
    f.fips,
    d.nsn,
    d.item_name,
    d.quantity,
    d.ui,
    d.acquisition_cost,
    d.quantity * d.acquisition_cost as total_cost,
    d.ship_date,
    d.federal_supply_category,
    sc.name as federal_supply_category_name,
    d.federal_supply_class,
    c.full_name as federal_supply_class_name
  from data as d
  join fips as f on d.state = f.state and d.county = f.county
  join codes as c on d.federal_supply_class = c.code
  join codes as sc on d.federal_supply_category = sc.code
) to '`pwd`/export/states/all_states.csv' WITH CSV HEADER;"

Because we’ve cleanly imported the data, we can re-run this export whenever we need. If we want to revisit the story with a year’s worth of additional data next summer, it won’t be a problem.

A few additional tips and tricks

Make your scripts chatty: Always print to the console at each step of import and processing scripts (e.g. echo "Merging with census data"). This makes it easy to track down problems as they crop up and get a sense of which parts of the script are running slowly.

Use mappings to combine datasets: As demonstrated above, we make extensive use of lookup files that map values in one dataset to values in another (for example, state and county names to FIPS codes), and we use SQL joins to combine them. Mappings and joins can be hard to understand at first, but once you get the hang of them, they are easy to implement and they keep your data clean and simple.
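
For example, the FIPS mapping can be nothing more than a small CSV loaded into its own table; the file name and sample rows below are illustrative:

# data/fips.csv (illustrative contents):
#   state,county,fips
#   MI,WAYNE,26163
#   OH,CUYAHOGA,39035

psql leso -c "CREATE TABLE fips (state char(2), county varchar, fips varchar);"
psql leso -c "COPY fips FROM '`pwd`/data/fips.csv' WITH CSV HEADER;"

Any query can then attach FIPS codes with a join on state and county, as the export query above does, without touching the source data.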

Work on a subset of the data: When dealing with huge datasets that could take many hours to process, use a representative sample of the data to test your data processing workflow. For example, use 6 months of data from a multi-year dataset, or pick random samples from the data in a way that ensures the sample data adequately represents the whole.
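
One low-tech way to do this in PostgreSQL is to build a sample table and point your summary queries at it while you iterate. The table name, date range and sample rate here are arbitrary:

# Six months of a multi-year dataset...
psql leso -c "CREATE TABLE data_sample AS
  select * from data
  where ship_date >= '2013-01-01' and ship_date < '2013-07-01';"

# ...or a random one percent of rows
psql leso -c "DROP TABLE IF EXISTS data_sample;
  CREATE TABLE data_sample AS
  select * from data
  where random() < 0.01;"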

 
