During an election, not all political groups have to disclose their political ad spending. This makes it hard to know which groups are trying to influence political discourse. But TV stations in the 50 largest media markets—where a large portion of political ad money is spent—do have to disclose political ads aired on their channels. The trouble is, these disclosure forms are filed with the FCC as scanned images, occasionally with handwritten corrections. There’s no way to mechanically extract the information contained in the filings (OCR is too inaccurate to depend on), and transcribing them ourselves would take too long.
Instead of transcribing them all ourselves, ProPublica created the Free the Files project to ask our readers to transcribe them with us. Our readers rose to the challenge, thanks to our social team’s hard work, and transcribed almost half of the more than forty thousand disclosures filed during the 2012 election.
I, along with my colleague Al Shaw, built a Ruby on Rails application to display the results and to serve as the interface for the actual transcription. The transcription interface was designed using a technique Al created called Casino-Driven Design.
The data itself was compiled into a database. Once enough participants had entered identical data for a filing, we algorithmically confirmed it as correct, then crunched it and published it on the Free the Files site. Every completed, or “freed,” filing was transcribed at least twice; many were transcribed three or more times because participants disagreed about one or more data points.
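The confirmation step can be sketched roughly like this. This is a minimal illustration of agreement-based confirmation, not ProPublica’s actual code: the method name, the agreement threshold of two, and the sample values are all assumptions.

```ruby
# Hypothetical sketch: confirm a data point once enough participants agree.
# `transcriptions` is the list of values participants entered for one field.
def confirmed_value(transcriptions, threshold: 2)
  # Count how many participants entered each distinct value.
  counts = transcriptions.tally
  value, count = counts.max_by { |_, c| c }
  # Confirm only when the most common answer meets the agreement threshold.
  count >= threshold ? value : nil
end

# Two of three participants agree, so the value is confirmed:
confirmed_value(["Advertiser A", "Advertiser A", "Advertiser B"])
# => "Advertiser A"

# No two participants agree yet, so the filing needs another transcription:
confirmed_value(["$5,000", "$5000"])
# => nil
```

A disagreement simply returns the filing to the pool for another pass, which is why some filings ended up transcribed three or more times.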
My colleagues and I had to make a journalistic decision about which data points to ask participants to transcribe. We chose to request relatively few, hoping that this would encourage more people to participate. For instance, we excluded the days and programs during which ads were scheduled to run. We were successful in balancing ease of transcription with completeness: we handled over 100,000 individual transcriptions by over 1,000 users and got useful information.
Since some TV markets were more politically in play than others, I wrote an algorithm to prioritize swing-state markets (like Ohio’s) over less politically interesting ones (e.g., Los Angeles) when participants were “randomly” assigned filings to transcribe. I also worked on the script that scraped the FCC’s disclosure site and on the DocumentCloud integration that uploaded the documents along with the metadata from our readers’ transcriptions.
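The prioritization boils down to weighted random selection. Here is a minimal sketch under assumed names and made-up weights; the real algorithm and its weighting were ours to tune and are not reproduced here:

```ruby
# Hypothetical weights: swing-state markets are far more likely to be picked.
MARKET_WEIGHTS = {
  "Columbus, OH"  => 10,
  "Cleveland, OH" => 10,
  "Los Angeles"   => 1,
}

# Pick a market at random, biased by weight. Passing an explicit Random
# instance keeps the sketch testable.
def pick_market(weights = MARKET_WEIGHTS, rng = Random.new)
  total = weights.values.sum
  roll  = rng.rand(total)  # integer in 0...total
  weights.each do |market, weight|
    # Walk the weight ranges until the roll falls inside one.
    return market if (roll -= weight) < 0
  end
end
```

With these weights, an Ohio market is drawn about twenty times as often as Los Angeles, so swing-state filings get “freed” first while less contested markets still surface occasionally.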
In addition to the public-facing aspects of the app, I also wrote an extensive admin dashboard to view transcription statistics, ban troublesome users (who never showed up, thankfully), and sort out duplicate contracts uploaded to the FCC by TV stations.
The application received huge amounts of traffic during the weeks leading up to the November 2012 election and scaled well.