Data Journalism

An Introduction

Petr Kočí [@tocit]
Marcel Šulek [@veproza]

Vilnius, December 7 - 8, 2013

Let's play with data!

We will try to make a map similar to this or this one.

But first, let's create a sample data set:

Our Data Toolbox

Please download and install

How could data be useful to journalists?

Philip Meyer (his blog)

One of the earliest examples of computer assisted reporting was in 1967, after riots in Detroit, when Meyer, on temporary assignment with the Detroit Free Press, used survey research, analyzed on a mainframe computer, to show that people who had attended college were equally likely to have rioted as were high school dropouts. -- Wikipedia

Stanislav Gross

1960: Computer-Assisted Reporting

2000: Database Journalism

2005: Data-driven Journalism

2010: Data Journalism, "Fact-based journalism"

Future: Newsgames, Drone Journalism, Machine Learning ...

No matter what we call it, it's still Journalism.

We just explore new technological possibilities to get the information, understand it and tell the story.

Sometimes we need to scrape the source's website to get the information we need; sometimes it's better and easier just to make a phone call.

It's important to choose the right tool for the job.

Figure out which method is most effective.

Why is Data Journalism Important Now?

  • Flood of publicly available data from institutions and corporations (before - after)
  • Flood of data generated by each one of us

  • More online tools, freely available and easier to learn
  • Public demand: we are overloaded with information; we don't need more of it, but better. Please help us make sense of it all
  • Crisis of the traditional business model of newspapers

Challenges of Data Journalism

  • Unreliable, broken and dirty data full of errors
  • Data in closed proprietary formats
  • Frequent changes in methodology

  • Time consuming
  • Uncertain results
  • Cost efficiency
  • Overhyped

Examples from around the World

More examples from NYT

More examples from the Guardian

More examples from around the world

Some of our projects

All the cars towed away in Prague

Where did five billion crowns go?

Which presidential candidate is right for you?

Quick and dirty charts

GDP predictions

Debts of All Municipalities

Who Is Selling Rotten Food

Presidential Timeline

Secret Service Investigating Prime Minister's Wife

The Map of Power

Public Transportation in Prague

Armed and Dangerous

When Are You Going to Die?

Crime Rates According to Cell Phone Data

Why People Go to Hospital

How Active Was Your MP?

What is data?

From Latin data, nominative plural of datum (“that is given”)

We are most interested in data that is

  1. structured
  2. computer-readable
  3. usually tabular (XLS, CSV)
  4. or tree-like (XML, JSON)
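
To make the last two points concrete, here is a minimal sketch of the same records in tabular form (CSV) and in tree-like form (JSON). It uses only Python's standard library. The districts, years and counts are invented for illustration, and `csv_to_tree` is a hypothetical helper name, not something from our toolbox.

```python
import csv
import io
import json

# Hypothetical sample records, invented for illustration only.
CSV_TEXT = """town,year,towed_cars
Praha 1,2012,120
Praha 2,2012,85
Praha 1,2013,140
"""


def csv_to_tree(text):
    """Group flat CSV rows into a nested {town: {year: count}} structure."""
    tree = {}
    for row in csv.DictReader(io.StringIO(text)):
        # Every CSV cell arrives as a string; convert the count explicitly.
        tree.setdefault(row["town"], {})[row["year"]] = int(row["towed_cars"])
    return tree


tree = csv_to_tree(CSV_TEXT)

# The same information, now serialized as a tree.
json_text = json.dumps(tree, indent=2, sort_keys=True)
print(json_text)
```

The table repeats the town on every row, while the tree stores it once and nests the years underneath. Which shape is more convenient depends on the tool that will read it.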

Data != Truth

We need to acquire it, clean it, refine it, analyze it and try to extract meaning (stories) from it.

Important questions to ask

  • Where did the data come from? (e.g.)
  • What is the motivation of the person or institution that is collecting the data?
  • What methods were used to collect the data, and have the methods changed over time?
  • Is the data complete? What is missing and why?
  • What does the data mean and how can it be useful to our readers?


Obtaining data from websites programmatically


Not only tables, but also text, pictures ...
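
As a rough idea of what scraping a table can look like, the sketch below pulls the cells out of an HTML table using only Python's standard library. The HTML fragment and its numbers are made up, and `TableParser` and `scrape_table` are hypothetical names. A real scraper would first download the page from the source's website, and dedicated scraping tools are usually more comfortable than this.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this first.
SAMPLE_HTML = """
<table>
  <tr><th>District</th><th>Towed cars</th></tr>
  <tr><td>Praha 1</td><td>120</td></tr>
  <tr><td>Praha 2</td><td>85</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row being built, or None outside <tr>
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table cells as a list of rows (lists of strings)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

The result is a plain list of rows that can be written to CSV. Note that every value is still text at this point, including the numbers.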

Cleaning and aggregating data
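
A minimal sketch of what cleaning and aggregating can involve, assuming a small messy export with stray spaces, inconsistent capitalisation, a thousands separator and a missing value. The data is invented and the helper names `clean_number` and `aggregate` are hypothetical. Spreadsheet or data-cleaning tools do the same job interactively.

```python
import csv
import io
from collections import defaultdict

# Hypothetical messy export, invented for illustration only.
RAW = """district,towed_cars
 Praha 1 ,120
praha 1,"1 040"
Praha 2,n/a
Praha 2,85
"""


def clean_number(value):
    """Return an int, or None when the cell holds no usable number."""
    # Drop ordinary and non-breaking spaces used as thousands separators.
    digits = value.replace(" ", "").replace("\u00a0", "")
    return int(digits) if digits.isdigit() else None


def aggregate(text):
    """Sum towed cars per district; also return the rows that were skipped."""
    totals = defaultdict(int)
    skipped = []
    for row in csv.DictReader(io.StringIO(text)):
        # Normalise whitespace and capitalisation so variants are merged.
        district = " ".join(row["district"].split()).title()
        count = clean_number(row["towed_cars"])
        if count is None:
            skipped.append(row)  # keep track of what was dropped
            continue
        totals[district] += count
    return dict(totals), skipped


totals, skipped = aggregate(RAW)
print(totals)
print(len(skipped), "row(s) skipped")
```

Keeping the skipped rows, rather than silently discarding them, makes it possible to answer the question "What is missing and why?" later on.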

Extracting Data From PDFs

Visualising Data

Continue on your own

Thank you, and let's stay in touch!