Earlier this year, ProPublica updated the Dollars for Docs database of payments from pharmaceutical companies to doctors. I wrote a scraping framework to aggregate this data from the pharmaceutical companies’ disclosures, which come as HTML, XML and PDF tables, and to publish it inside the Dollars for Docs site.
I wrote a longer account of my travails at the ProPublica Nerd Blog. In short, I built HTML scrapers for complex websites that don’t comply with web standards, and I learned the inner workings of PDFs, to create a dataset of about two million rows. Over the course of a few months, I worked with my colleagues Charlie Ornstein, Tracy Weber, Jen LaFleur and Joe Kokenge to ensure this huge amount of data was absolutely correct.
The updated project has received a huge amount of media attention and high readership. I can’t take credit, however, for the beautiful design update; that was my colleague Sisi Wei’s handiwork.
The PDF-oriented portions of the project have been incorporated into Tabula, an open-source project for extracting tabular data from PDFs. Tabula is maintained by Mike Tigas, Manuel Aristarán and me.