What do you do with several gigabytes of some politician’s emails? What do you do when, after you start asking questions, a major politician’s church suddenly takes its website off the internet, but you manage to rescue a copy of the pastor’s blog?
I built a search engine and document analysis pipeline called “Stevedore” to handle document-search cases like these. The tool is based on Apache Tika, so it can handle almost any document that contains text. The search engine uses templates to make search forms and result lists easy to customize, so that they fit both a reporter’s technological familiarity and how well structured (or not) the documents are.
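As a rough illustration of the template idea, here is a minimal Python sketch. It is not Stevedore’s code, which isn’t public; the field names, the `RESULT_TEMPLATES` table, and the `search` and `render_results` helpers are all hypothetical, and it assumes the text has already been extracted from each document (the step Tika would handle).

```python
from string import Template

# Hypothetical per-collection result templates; the field names are
# illustrative and not taken from Stevedore.
RESULT_TEMPLATES = {
    "email": Template("$date | From: $sender | $subject"),
    "blog": Template("$date | $title"),
}


def search(documents, query):
    """Return documents whose extracted text contains the query (case-insensitive)."""
    needle = query.lower()
    return [doc for doc in documents if needle in doc["text"].lower()]


def render_results(documents, collection):
    """Format each hit with the template chosen for its collection."""
    template = RESULT_TEMPLATES[collection]
    # safe_substitute leaves a placeholder untouched when a field is missing,
    # which suits loosely structured documents.
    return [template.safe_substitute(doc) for doc in documents]


# Made-up sample data standing in for text already extracted from emails.
emails = [
    {"date": "2015-01-05", "sender": "a@example.com",
     "subject": "Budget meeting", "text": "Agenda for the budget meeting."},
    {"date": "2015-01-06", "sender": "b@example.com",
     "subject": "Lunch", "text": "Lunch on Friday?"},
]

hits = search(emails, "budget")
lines = render_results(hits, "email")
for line in lines:
    print(line)
```

Swapping the template for a collection changes how hits are listed without touching the search code, which is the kind of per-project customization described above.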
In the first case, reporters used Stevedore to search Marco Rubio’s emails for this story. In the second, I managed to rescue a copy of Scott Walker’s church’s website, and a reporter used Stevedore to search his pastor’s blog for this story.
Stevedore isn’t open-source as of June 2015, but I hope it will be before long.