Making sense of a deluge of data is more crucial than ever.
Part of my role at Fairfax Media is to use my software and data management background, working with news editorial staff to create data-driven journalism.
Terms such as data-driven and database journalism can be used to imply that “normal” forms of journalism are not data driven. This is not true and is often a deliberate misuse of the terms to attack traditional journalists.
Even a casual observation of my journalist colleagues will reveal mountains of A5 notebooks, emails, spreadsheets, address books and voice recordings going back years. There are literally shelves full of data surrounding us as I write this.
The difference that data-driven journalism implies is one of scope and scale not attitude to facts. All good journalism is driven by facts.
The terms data-driven and database journalism are similar but subtly different. Data-driven journalism is a process of finding facts in data sets - it is about analysing existing data sources for new patterns. Database journalism implies creating a new repository of data, usually independent of the source, where different types of content are collected and maintained.
As the availability of data increases it is only getting harder for individuals to keep their eye on where the news might be. Social media in particular has dramatically increased the noise but the occasional breaking news gem can make it worth the pain.
Investigative journalists have to deal with the output of other entities' information systems. The deeper story may be where two seemingly disparate data sets have something in common and/or where the outliers themselves are the story. The data being handled ranges from structured to unstructured.
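To make that concrete, here is a minimal sketch of the idea in Python. The company names and figures are entirely hypothetical; the point is only to show the two patterns mentioned above: looking for overlap between unrelated data sets, and flagging outliers.

```python
import statistics

# Hypothetical data sets: political donations and contract awards,
# compiled from two unrelated sources but sharing company names.
donations = {"Acme Pty Ltd": 50_000, "Widget Co": 1_200, "Initech": 800}
contracts = {"Acme Pty Ltd": 2_400_000, "Globex": 90_000, "Widget Co": 15_000}

# Where two seemingly disparate data sets overlap, there may be a story.
common = sorted(set(donations) & set(contracts))
print(common)

# The outliers themselves can be the story: flag contract values
# far above the median.
median = statistics.median(contracts.values())
outliers = {name: v for name, v in contracts.items() if v > 10 * median}
print(outliers)
```

Real investigations are messier, of course: names rarely match exactly across sources, so a good deal of cleaning and matching work precedes any query like this.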
A classic technique used by big business and government is to swamp the news agency with so much information that it overwhelms the investigative teams' ability to absorb it in a thorough and timely fashion. The hope is that the cost and effort involved in discovering useful facts will deter journalists from even beginning to try. Information technology is the tool by which we can counter this lack of transparency.
Bringing systematic knowledge management approaches to journalism is a fascinating journey for several reasons.
First, the work cycles differ. Managing data is a never-ending job. There is always more data to collect, keep current and clean. Automating the ingestion of data can be time-consuming and expensive. The news publishing cycle, on the other hand, is more immediate and deadline driven. No story has the luxury of taking too long to be written and published. There is a tension between taking the time to get the data into a solid, stable, queryable format, and discovering and deriving facts for stories from it.
Second, the variety of source data formats is bewildering. Unfortunately, supporting open data formats is a fairly low government priority, so web-scraping techniques need to be employed. Data comes in as images of tables embedded in PDFs that need optical character recognition, or even in the form of printed documents that need to be keyed in. When commissioning an investigation on a topic, it is not always known in advance what data is available, in what format, and in what state it will be. A lot of work goes into getting the data to a point where we can see what we have to work with.
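When the data is at least published as an HTML page, the scraping itself can be quite simple. Here is a minimal sketch using only Python's standard library; the table structure is hypothetical, and real pages usually demand more defensive parsing than this.

```python
from html.parser import HTMLParser

# A minimal scraper that pulls rows of cell text out of HTML tables.
class TableScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []       # completed rows
        self._row = []       # cells of the row being parsed
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row.append(data.strip())

# Hypothetical fragment standing in for a fetched government web page.
html = ("<table><tr><th>Name</th><th>Amount</th></tr>"
        "<tr><td>Acme</td><td>50000</td></tr></table>")
scraper = TableScraper()
scraper.feed(html)
print(scraper.rows)  # [['Name', 'Amount'], ['Acme', '50000']]
```

Tables locked inside scanned PDFs are a different matter entirely: those require OCR, and often a human checking every digit afterwards.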
Finally, the outputs expected from the process differ. As a data manager, I call an implementation successful when the investigators can answer their questions themselves in a timely fashion while maintaining the integrity of the underlying data. For the investigations teams, success is measured in the number of stories produced, the insights brought to light and a bit of old-fashioned "keeping the bastards honest". This means that a data collection and processing project that produces a beautifully clean database rich in facts may not actually produce any story at all. Data-driven projects can be a bit of a gamble for the newsroom.
Ultimately the outputs of our collaboration are stories (data-driven journalism) and a new database of knowledge (database journalism) for exploration by journalists in the future.
There is a lot of data out there. In the past few years newsrooms around the world have adopted technologies to stay on top of the torrent. Finding the story in the data is only half of the process - data visualisation techniques are tapping into the information to produce new expressions in storytelling to convey meaning and create richer engagement.
At the end of the day, IT professionals and journalists have a lot to learn from, and to offer, each other.
Our latest piece of database journalism on political interests can be found here.