Ethics and Compliance Applications of Open Data

Although my skills as a data analyst are nascent, it occurred to me very early on that the emerging trend toward Open Data would be transformative in ways we couldn’t anticipate. This isn’t a novel idea, as a look at any of the following TED talks will show:

The Year Open Data Went Worldwide
How Open Data is Changing International Aid
Demand a More Open Source Government

What hadn’t occurred to me, however, was how revelatory even cursory looks at data might be for an ethicist. As part of my ongoing project to learn the R programming language (R is a freely available environment for statistics and data analysis), I decided on an exploratory mission to find out something ethically interesting with the tool. Armed with only my intermediate (at best) knowledge of statistics and my introductory level of expertise with R, I wanted to see if I could find out something I didn’t already know.

After spending a few minutes looking for relevant datasets, I found the Wage and Hour Enforcement database at the U.S. Department of Labor. It seemed to me that we might be able to learn something about the way businesses are treating their workers. This dataset includes all enforcement actions, both successful and unsuccessful, since 2007.
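The R commands later in this post refer to a data frame called whd without showing how it was loaded. A minimal sketch of one way to do that, assuming the database has been downloaded as a CSV file: the helper name load_whd and the file name in the usage comment are hypothetical, and the only column names assumed are the two used in this post.

```r
# Read a CSV export of the enforcement data and check that the two
# columns used in this post are present. The function name is hypothetical.
load_whd <- function(path) {
  whd <- read.csv(path, stringsAsFactors = FALSE)
  needed <- c("trade_nm", "naics_code_description")
  missing_cols <- setdiff(needed, names(whd))
  if (length(missing_cols) > 0) {
    stop("missing expected columns: ", paste(missing_cols, collapse = ", "))
  }
  whd
}

# Usage (the file name is an assumption, not the official one):
# whd <- load_whd("whd_enforcement.csv")
```

Checking for the columns up front means a renamed or truncated download fails immediately rather than producing an empty table later on.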

Though there is a near-infinity of ways you could analyse these data, I decided to look at two questions: which are the worst industries to work in from a wage and hour perspective, and which companies are worst to work at from the same perspective. I expected that these enforcement actions would be relatively normally distributed and proportional to worker population. They turned out to be neither. A few R commands (which I’ll preserve for anyone interested in learning R) and I was able to see that some industries and some companies are far worse than others.

The dataset has a separate record for every enforcement action and every record has an industry code. To see which industries were the worst, I just had to count the number of times each industry was mentioned. There are over 1500 industry classifications, so I sorted the list by number of appearances and took the top 15:

# Count how many enforcement records carry each industry description
ncaisclasscount <- as.data.frame(table(whd$naics_code_description))
# Sort by count, highest first, and keep the fifteen most frequent industries
sortedncaisclass <- ncaisclasscount[order(-ncaisclasscount$Freq),]
topfifteen <- sortedncaisclass[1:15,]

For ease of interpretation, I then put it into a horizontal bar chart:

barplot(topfifteen$Freq, names.arg=topfifteen$Var1, las = 2, cex.lab=.1, horiz=TRUE)

which looks like this (click through for PDF version that you can zoom into: because of the size of the labels, it wasn’t possible to capture this in a graphic that fit in the blog format, ditto below)

So it turns out that restaurants are a terrible business if you’re an employee (if you use Wage and Hour enforcements as a proxy for bad behaviour by employers). They hold the top 2 places which, combined, account for about 5 times as many enforcement actions as the next industry.
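A comparison like "the top two combined versus the next industry" can be read straight off a sorted frequency table. The sketch below uses toy counts invented purely for illustration (they are not the real figures from the dataset), with the same Var1 and Freq column names that table() produces.

```r
# Toy counts, invented for illustration only (not the real WHD figures).
counts <- data.frame(
  Var1 = c("Industry C", "Industry A", "Industry D", "Industry B"),
  Freq = c(100, 300, 50, 200)
)

# Sort by count, highest first, as in the commands above.
sorted <- counts[order(-counts$Freq), ]

# Combined count of the top two categories, compared with the third.
top_two <- sum(sorted$Freq[1:2])
ratio_to_third <- top_two / sorted$Freq[3]

# Share of all records accounted for by the top two categories.
share_of_total <- top_two / sum(sorted$Freq)

print(ratio_to_third)
print(round(share_of_total, 2))
```

Reporting the share of the total alongside the ratio is a hedge against a long tail: a large ratio to the third-placed category says little if the top two still make up only a small part of all records.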

How about individual companies, then? Is the revelation that restaurants are not particularly good employers borne out in the company data?  For this, I essentially repeated the previous process, only I counted employer frequencies rather than industries.

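# Count how many enforcement records each employer (trade name) has
df2 <- as.data.frame(table(whd$trade_nm))
# Keep only employers with more than 10 records
big2 <- subset(df2, Freq > 10)
# Sort by count, highest first, and plot the ten most frequent employers
sorted2 <- big2[order(-big2$Freq), ]
topten <- sorted2[1:10, ]
barplot(topten$Freq, names.arg=topten$Var1)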

which resulted in (click here for PDF version):

So the industry data is definitely borne out by the company data, but there are some surprises. Subway looks to be a really terrible company to work for, followed by McDonald’s (there were some data quality issues I didn’t take the time to correct, but the combined McDonald’s bars would equal about 500 enforcements). To really get an idea of how relatively bad each company was, you’d have to combine these data with how many employees are employed by each company, but this, at least, gives you a high-level view.
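Both follow-ups mentioned here, merging spelling variants of an employer name and scaling counts by workforce size, can be sketched in a few lines. Everything in the example is hypothetical: the employer names, the record counts and the headcount table are invented, and the clean-up rule (trim whitespace, upper-case) is only one simple assumption about what the data quality issues look like.

```r
# Invented records: three spelling variants of one employer, plus another.
records <- data.frame(
  trade_nm = c("Employer A", "EMPLOYER A ", "employer a",
               "Employer B", "Employer B"),
  stringsAsFactors = FALSE
)

# Collapse variants by trimming whitespace and upper-casing the name.
records$clean_nm <- toupper(trimws(records$trade_nm))

# Count records per cleaned name.
counts <- as.data.frame(table(records$clean_nm), stringsAsFactors = FALSE)
names(counts) <- c("clean_nm", "Freq")

# Hypothetical headcount table; real figures would come from another source.
headcount <- data.frame(
  clean_nm = c("EMPLOYER A", "EMPLOYER B"),
  employees = c(3000, 500),
  stringsAsFactors = FALSE
)

# Join on the cleaned name and express counts per 1,000 employees.
rates <- merge(counts, headcount, by = "clean_nm")
rates$per_1000 <- rates$Freq / rates$employees * 1000
rates <- rates[order(-rates$per_1000), ]
print(rates)
```

In this toy example the employer with fewer records ends up with the higher rate, which is exactly why raw counts only give a high-level view.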

I can’t emphasise enough how cursory and incomplete this look at the data is. The point is to demonstrate how useful open data can be for pointing out practical issues in ethics. This could be the starting point for a lot more analysis, like investigating why the restaurant industry has so many enforcements and whether anything could be done about it.
