I devised a machine-learning system to help ICIJ journalists search a massive set of leaked documents.
Computers have made it simple for whistleblowers to leak hundreds of thousands of documents. But the technology that helps reporters make sense of that many documents hasn’t quite caught up.
In two collaborations with the International Consortium of Investigative Journalists, I built experimental software to help solve that problem.
[The system I built](https://qz.com/1786896/ai-for-investigations-sorting-through-the-luanda-leaks/), using the Universal Sentence Encoder, allows reporters to search the documents and find what they didn’t know they were looking for. It helps locate “kinds of things,” like water bills or board meeting minutes, which helped untangle the web of Isabel dos Santos’s wealth and enablers.
A predecessor system used doc2vec to achieve a similar goal for the Mauritius Leaks project.
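The core idea behind searching for "kinds of things" can be sketched in a few lines: turn every document into a vector, then rank documents by how close they sit to an example of the kind you want. What follows is a minimal illustration, not the production system. It swaps the Universal Sentence Encoder for a toy word-count embedding so that it runs with only the Python standard library, and the sample documents and function names (`embed`, `cosine`, `most_similar`) are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder: a bag-of-words count vector.

    A real system would call a learned model here (an assumption of this
    sketch is that any text-to-vector function can be dropped in).
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(example, documents, top_k=3):
    """Rank documents by similarity to an example of the kind being sought."""
    query = embed(example)
    scored = [(cosine(query, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical documents, invented for illustration.
documents = [
    "Water bill: account number, meter reading, amount due for water usage",
    "Minutes of the board meeting: directors present, resolutions approved",
    "Invoice for consulting services rendered, payment due in 30 days",
    "Water utility statement: meter reading and amount due this month",
]

results = most_similar(
    "water bill with meter reading and amount due", documents, top_k=2
)
for score, doc in results:
    print(f"{score:.2f}  {doc}")
```

With a learned encoder in place of `embed`, the same ranking step can surface documents that share a meaning or a form even when they share few exact words, which is what a keyword search misses.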
Dos Santos, an allegedly corrupt Angolan billionaire and daughter of a former dictator, responded to ICIJ’s findings with an incredulous tweet: “715 thousand documents read? Who believes that?”
She’s right. Humans didn’t have to read all 715,000 documents, in part because my algorithm did.