an introduction to data journalism

One take on what data journalism may or may not be… a lecture presented to journalism students at the University of Lincoln, UK, February 2014.

1

Let’s start with an easy(?!) question - what is journalism?

One way of answering that question is to list some of the functions, or attributes, associated with it – informing, educating, holding to account, watchdog function, campaigning, contextualising.

2

Sensemaking seems to me to be an important part of it… In part contextualisation, in part identifying the bits that make the difference, the bits that make it important, the bits that make it news that people need to know.

3

Second question: what is data? National statistics, sports results, polls, financial figures, health data, school league tables, etc etc.

Is a book data? Or a speech? What if I split a speech up into separate words, count the occurrence of each unique word and then display the result as a “tag cloud”, or word frequency diagram?
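As a rough sketch of the word-counting idea, the following Python splits a piece of text into words and counts each unique word. The sample sentence is made up for illustration, and the rule for what counts as a word (runs of letters and apostrophes, ignoring case) is an assumption; a real tag cloud tool would probably also drop common words such as "the".

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each unique word appears in a piece of text."""
    # Lower-case the text, then pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# A made-up "speech", used purely for illustration.
speech = "The data tells a story. The story needs data, and the data needs context."

counts = word_frequencies(speech)

# List the words from most to least frequent: the raw material for a tag cloud.
for word, count in counts.most_common():
    print(word, count)
```

The counts are the data; the tag cloud is just one way of displaying them.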

4

One way of thinking about data is that it is a particular sort of source, or a source that can respond to a particular style of questioning in a particular way.

Another take on this is that many “data sources” are experts on a particular topic, experts that know a lot of a very particular class of facts.

5

So what is data journalism?

One way is to think of it as a process, as exemplified by Paul Bradshaw’s inverted pyramid of data journalism. I see it more as a conversation in which data is one of the conversants. The conversational view also allows us to think about process, but more important, for me, is that in a conversation, it gets personal…

6

The inverted pyramid gives us one way of considering the data journalistic process, or at least identifying some of the steps involved in a data investigation.

But there are many other ways of conceptualising the process – for example, finding stories and telling stories…

7

When it comes to finding stories, do we:

a)  want to find stories in a dataset we are provided with, or
b)  use data to help draw out a story lead we have already been tipped off to?

8

One of the ways I like to work with data is to have a conversation with it – asking questions of it and then further questions based on the responses I get.

9

Sometimes it looks at first as if we have data in a form where we might be able to do something with it – then we realise it needs cleaning and reshaping.

For example, in this case we have percentage signs contaminating numbers, data organised in separate sections (so how do we get a “well behaved” view over data from all the wards?), and different sorts of data: votes polled per candidate versus the size of the electorate in a particular ward, for example.

Walkthrough: http://blog.ouseful.info/2013/05/03/a-wrangling-example-with-openrefine-making-ready-data/

10

One of the first datasets I played with was MPs’ expenses data. Here are a couple of ways I started to chat with it – imagine talking to someone who knows about *all* the expenses claims put in by every MP over a parliamentary session… (The charts were created using an online interactive tool developed by IBM called Many Eyes.)

The bar chart is ordered, for a particular expenses area, by total amount for each individual MP.

The block histogram shows how many MPs made a total claim in a particular expenses area of a particular binned value. (A ‘bin’ is a range.)

A scatterplot is another very powerful sort of chart – we can plot two sorts of value against each other to see if there are any groups, or trends.

Some scatterplot tools allow you to size or colour nodes according to further dimensions. Colouring nodes by group (if sensible groups exist) can also help you see whether particular groups are clustered or group together in particular areas of the chart.

Maps can be used to pull out different sorts of relationships – for example, plotting markers in the centre of each MP’s ward coloured by the total value of expenses claims in a particular area, we can easily see whether or not an MP is claiming an amount significantly different to MPs in neighbouring wards. In this case – travel expenses – we might expect (at first glance at least) a homophilitic effect – folk a similar distance away from Westminster should presumably make similar sorts of travel claim? At second glance, we might then start to refine our questioning – does ward size (in terms of geographical area) or rurality have an effect? Does an MP travel to and from home more than neighbours (or perhaps claim more in terms of accommodation in London)?

13

Sometimes we need to provide quite a lot of explanation when it comes to making sense of even a simple data visualisation – “what am I supposed to be looking at?”

14

Contextualisation can take many forms – Trinity Mirror Group have a data unit that produces partially packaged data stories and lines for regional titles, who can then add local colour, knowledge, interpretation and spin to the resulting story.

15

For many readers – it may be that data ONLY makes sense when appropriately contextualised.

In passing, it’s also worth noting that the data you don’t collect sometimes affects the interpretation of the data you do…

For example: http://www.open.edu/openlearn/science-maths-technology/mathematics-and-statistics/statistics/diary-data-sleuth-when-the-data-you-dont-collect-affects-the-data-you-do

16

In passing, it’s worth mentioning that one thing statistics does is help provide context.

Is this number a big number in the greater scheme of things? Is this thing likely to happen by chance or is there a meaningful causal relationship between this thing and another thing?

The chart in the corner is a reminder about how surprising probabilities can be. The chart shows the probability (y-axis) that at least two people share a birthday (the number of people is given on the x-axis). The chart shows that if there are 23 or more people in a room, there is more than a 50/50 chance that two of them will share a birthday (that is, share the same birth day and month, though not necessarily same birth year).

How many people are in the room? If it’s 23 or more – I bet that at least two people share a birthday (at least in terms of day and month).
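The 23-person figure can be checked with a few lines of Python. This is a minimal sketch that assumes 365 equally likely birthdays (ignoring leap years and seasonal patterns in births); the function name is made up for illustration. It works out the chance that everyone has a different birthday and subtracts that from 1.

```python
def shared_birthday_probability(people, days=365):
    """Probability that at least two of `people` share a birthday,
    assuming every day of the year is equally likely."""
    p_all_different = 1.0
    for already_in_room in range(people):
        # Each new arrival must avoid the birthdays already taken.
        p_all_different *= (days - already_in_room) / days
    return 1.0 - p_all_different


# Find the smallest group size where a shared birthday is more likely than not.
group_size = 1
while shared_birthday_probability(group_size) <= 0.5:
    group_size += 1

print(group_size, round(shared_birthday_probability(group_size), 3))
```

Under these assumptions the probability first passes one half at 23 people, at just over 50%.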

17

The other way of using data is to tell stories. But what does that even mean…?

18

A common source of stories based on data is polls, either polls that are commissioned by a publisher with a view to generating a story, or commissioned by a lobbying group or PR firm to promote not only stories around a particular issue, but stories that follow a line favourable to the organisation that commissioned the poll (or detrimental to positions that whoever commissioned the poll is campaigning against).

When presented with a press release written around a PR company commissioned poll, look to the raw data to see where the numbers that appear in the press release quotes actually come from.

In the above example, I could for example claim that 96% of people (creative reading of the numbers) did not appear to disagree with the idea that press behaviour should be independently regulated (creative reading of the question; the repeated negatives also serve to further confuse the clarity of what is, or isn’t actually being claimed…).

And when reading raw results, or quoting from them, take care which numbers you quote. Sometimes the presentation of the results can lead to you misreading them or the way they add up.

Sometimes, two or more polls may be commissioned around the same topic and appear to give contradictory results. For an example of this, see: http://www.open.edu/openlearn/science-maths-technology/mathematics-and-statistics/statistics/two-can-play-game-when-polls-collide

19

Many polling organisations publish press releases featuring “highlight” results from a poll. The more reputable ones also publish copies of the poll or survey questions and the results that were returned.

YouGov polls often split results down by political persuasion or newspaper preference, as well as demographically segmenting responses by gender, age or region.

The majority of polling organisations publish the data via PDFs rather than “as data”, for example, in the form of spreadsheet datatables. Tools such as Tabula (URL) are making it increasingly easy to extract the data contained within PDFs into actual datatables. Your local techie should also be able to “scrape” the data from a PDF document and put it into a data form.

For examples of how to scrape data as well as images from PDF documents, see:

-  scraping data tables from PDFs:
-  extracting images from PDFs:

Even if you feel as if you can’t do this yourself, you should make yourself aware of what is possible and achievable by people who have the skills to perform these tasks.

20

Stephen Few has written several excellent books about creating data visualisations and data dashboards, although you shouldn’t necessarily believe everything he says!

This quote gets across the idea that just as we use emphasis and tone in written communication, we also can, and should, make use of emphasis and tone in charts.

Many newspapers are starting to make use of charts that show several datapoints (for example, several bars in a bar chart) but highlight one or two of them that are the focus of a particular storyline, the other points or bars being used to provide context.

In chart design, “less is more” often works (this reflects a principle attributed to data visualisation guru Edward Tufte of using “least ink” when creating charts).

21

This video - showing part of a lecture by science fiction writer Kurt Vonnegut – shows how simple lines can tell archetypal stories. Note how the narration sets the scene - the axes are explained then the line is constructed. When the x-axis represents time, remember that someone riding the line as it was constructed does not necessarily know what the future holds. When you see a line chart with time as an x-axis, remember that it shows a trace of a story that unfolded over time.

Another powerful example of this can be found on YouTube – search for house price rollercoaster to find an animation where house price values over time are visualised as an animated roller coaster ride…

22

This second clip shows Hans Rosling, the Swedish health statistician made famous by his “data performances”, narrating an animated data visualisation rendered using a dynamic bubble chart technique that he popularised via his Gapminder website. Note how the first 30 seconds of the clip are spent explaining the set up of the chart – what the axes mean, what the bubbles represent. When you see a rich data driven interactive on a website, how much coaching and contextualisation is provided to help the user/reader make sense of it?

If you turn the sound off on the Rosling clip, how much sense do the moving bubbles make in terms of the story they tell without the benefit of Rosling’s narration? Can you tell where to focus your attention to pull out a meaningful storyline? Are there many possible storylines that can be pulled out? What tricks does Rosling use to focus your attention on – and illustrate – the story he is telling? Is there any sleight of hand in terms of not commenting on what some of the other bubbles are doing (is he using, or could he potentially use, misdirection to focus your attention away from possible stories he does not want you to pull out of the data?)

For more examples of Rosling’s compelling performances, see the recent OU/BBC Two co-production “Don’t Panic – The Truth About Population Change” available on the Gapminder website: http://www.gapminder.org/videos/dont-panic-the-facts-about-population/

23

Few suggests that graphical communication requires stylistic devices that emphasise particular aspects of a graphic. Hans Rosling achieves this by pointing to items of interest and reinforcing the emphasis with both his narration and the use of overlays on the graphic itself.

So how can we go about drawing emphasis within a static graphic or chart, such as one might find in a print publication?

24

To show one way of emphasising particular elements of a graphic, let’s produce a quick chart of our own.

The first thing we need is some data – I’m going to use some data from the Winter Olympics, a grab of the medal table from the back end of the first week of the 2014 games. The question I want to explore is the extent to which the countries leading the medal table, as measured by the number of gold medals awarded, keep their positions in a ranking in which the table is ordered according to the total number of medals awarded.

The data I’m going to use comes from a Wikipedia page. The medal table is contained within an HTML table. To get the data out of the page we are going to screenscrape the HTML table that contains the data. There are a variety of tools for doing this, from browser extensions to scraper applications such as import.io, to environments such as Scraperwiki that provide a range of developer tools configured to support screenscraping based data collection.

But the tool I’m going to use is…

25

…Google (spread)sheets, and in particular a formula that will import a particular HTML table – in this case, the 2nd table in the page – from a specified URL, in this case the URL of the Wikipedia page containing the medal table.

The formula?

=importhtml("URL", "table", tableNumber)

On entering the formula, the spreadsheet will pull the data in from the Wikipedia page and make it available as spreadsheet data.

We can now use the spreadsheet to create charts within the sheet itself. If the data in the Wikipedia page is updated, the data in the spreadsheet will be updated whenever the spreadsheet is refreshed.
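For comparison, here is a rough Python sketch of what a formula like importhtml does behind the scenes: read through the HTML, collect the cells of each table, and hand back the one you asked for by number. It is an illustration rather than how Google Sheets actually works. The page content and medal counts are made up, the names TableGrabber and import_html_table are hypothetical, and the sketch does not cope with tables nested inside other tables. To stay self-contained it parses a short HTML string instead of fetching a live page.

```python
from html.parser import HTMLParser


class TableGrabber(HTMLParser):
    """Collect the cell text of every <table> found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, table_number):
    """Return the rows of the table_number-th table (counting from 1)."""
    parser = TableGrabber()
    parser.feed(html)
    return parser.tables[table_number - 1]


# Made-up page: the first table is page furniture, the second holds the data.
SAMPLE_PAGE = """
<table><tr><td>navigation box</td></tr></table>
<table>
<tr><th>Nation</th><th>Gold</th><th>Total</th></tr>
<tr><td>Country A</td><td>7</td><td>12</td></tr>
<tr><td>Country B</td><td>5</td><td>16</td></tr>
</table>
"""

medal_table = import_html_table(SAMPLE_PAGE, 2)
for row in medal_table:
    print(row)
```

As with the spreadsheet formula, the table number matters: pages often contain layout or navigation tables before the one that holds the data you want.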

26

Whilst we could generate charts within the spreadsheet, I’m actually going to use an online tool called datawrapper (available at datawrapper.de).

Datawrapper charts are starting to make an appearance in many online news reports, such as those published by the Guardian and Trinity Mirror’s ampp3d, so being familiar with this tool - and what you can do with it – could be a useful skill to have.

To get the data in to datawrapper you can upload a CSV file, or paste a copy of the data in to the upload area. I’ve taken the latter approach, highlighting and copying the table from the spreadsheet and then pasting it in to datawrapper.

27

Having uploaded the data, we can configure several properties for each column. In many cases datawrapper should be able to detect what sort of content is contained within each column (for example, whether it is a number or a text field).

If necessary, we can apply a limited amount of processing to the contents of a specified column. We can also choose to hide one or more columns from the displayed view. In this case, I am going to hide the Rank, Silver and Bronze columns.

28

We now get to choose the chart type – I’m going to go for a horizontal bar chart and select the default datawrapper style.

29

Different chart types have different configuration options. I’m going to choose to automatically sort the bars based on the selected value – notice the buttons in the chart that allow us to select whether to display the Gold medal count or the Total medal count.

30

Now we get to add some emphasis – remember emphasis? This is an example of how to show emphasis in a chart…

In this case, I’m going to emphasise the top 2 positions in the Gold medal ranking – the “point” of the piece is to explore the extent to which these positions hold, or don’t hold, when we rank the table by total medal count.

At this point, we can also give the chart a title, and add some provenance information describing and pointing to the source of the data.

31

Here’s an example of the final chart, with the ranking (automatically) sorted according to total medal count. Note how the order and positioning of the two highlighted countries has changed.

The difference is further exemplified when switching between the Gold and Total counts by the use of animation – the highlighted bars draw the eye and allow you to better see how their relative positions change across each of the two ranking schemes.

32

Having created the chart, you can now save it to your datawrapper account. An embed code for the chart is provided so that you can embed the chart within your own web page.

33

Bar charts are a very effective way of displaying particular sorts of information, such as counts. But what other ways are there of displaying data?

34

Datawrapper provides a variety of chart types, including:

-  horizontal and vertical (column) bar charts,
-  grouped bars that collate different bars according to groups (for example, election on election percentage of the vote for different political parties),
-  stacked column charts (for example, for a selection of countries we could display a column showing the total number of medals constructed by stacking the individual gold, silver and bronze medal counts for those countries),
-  line charts, which are widely used for plotting some value on the vertical y-axis against time on the horizontal x-axis,
-  pie charts, to show proportions of a whole, and variants thereof, such as the donut chart (a pie chart with the middle cut out),
-  simple data tables (never underestimate the power of a table – they can be really useful for showing specific values, and can be very powerful when allowing the user to sort the table either by ascending or descending values in particular columns),
-  maps, which as we shall see, can draw out very powerful relationships across data elements.

35

We’ve also seen some other “basic” charts that can be useful for displaying the distribution of data elements:

-  the block histogram shows a count on the y-axis of data elements falling within particular ranges of values on the x-axis
-  the scatterplot allows us to plot two values against each other, for example height versus weight. These charts can provide us with clues about possible correlations or relationships between the two values. Some scatterplot tools further allow us to colour each point according to group membership so that we can look to see whether numbers are clustered or grouped according to group membership.

36

Visualising data is a powerful way of asking questions of data – what data points you choose to display and how you display them represent the framing of the question. What the data looks like is the response, but a response that often takes careful reading. The data source has drawn you the answer – you need to turn it into words that you can use to formulate further questions to check your understanding of the answer first provided. (Each question (each chart) typically leads to another… or more than one other…)

Asking questions that have a graphical answer is one way of querying a data source – but are there other approaches?

Let’s explore that a little more – what do we mean by asking questions of data?

37

A database that most of us use every day is the Google web search engine. We put in a key term or phrase and Google finds the web pages it deems most relevant to the query you have made, ranked according to a variety of criteria (and it could well be that who you actually are affects the ranking).

Sometimes we may know what websites we actually want to search over. Google Custom Search Engines provide one way of defining your own search engine that just searches over part of the web that you are interested in.

One of the custom search engines I have developed searches over websites that act as wire services for press releases: https://www.google.com/cse/publicurl?cx=016419300868826941330:wvfrmcn2oxc

This allows us to track down the source of many a news item and explore the extent to which a given news story has just churned a press release.

See also: http://blog.ouseful.info/2014/02/06/polling-the-news/ This post also describes how to create a bookmarklet that allows you to highlight a quote in a news report and search for press releases that contain that quote.

38

Here’s an example of the search engine in action – I’ve used a bookmarklet that takes a highlighted quote from a news story and passes it to the custom search engine, allowing me to easily see the source of the quote, and the story itself.

I’ve also started defining another related custom search engine that allows us to search news sites and polling companies for stories about, and sources of, polls and surveys:

https://www.google.com/cse/publicurl?cx=016419300868826941330:ewbi9skvnmq

39

Custom search engines are a powerful tool for helping us develop focussed web search tools that limit results to a particular part of the web we are interested in, either by location or topic.

We can also use (advanced) search limits in ‘everyday’ web queries using the major web search engines.

For example, the query shown on this slide searches for the word underspend appearing in Excel spreadsheets (filetype:xls) that can be found on UK government websites (or more specifically, websites hosted on the gov.uk domain (site:gov.uk)).

Another query limit combination I have found useful is:

confidential filetype:ppt

This can turn up presentations that have been delivered at closed corporate events but that have leaked on to the web…

40

Even if you don’t consider yourself a geek or database expert, writing advanced search queries using search limits is but a small step away from writing queries over databases themselves.

One of the most widely used languages for querying databases is SQL. The above slide shows a simple, made up SQL query that could have a similar effect to the simpler search engine query made over a very simple search engine database.

The idea is that we select those webPages where the text content of the webpage contains the word underspend anywhere – the % signs denote wildcard characters so the underspend word can appear preceded or followed by any number of arbitrary characters. We also want the query to be limited to pages that have a particular filetype and domain.
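To make that concrete, here is a small runnable sketch using Python’s built-in sqlite3 module. It is not the query from the slide: the webPages table, its column names (url, domain, filetype, content) and the example rows are all made up for illustration. It builds a tiny in-memory database and then runs a query of the kind described, with % wildcards around the search word and limits on filetype and domain.

```python
import sqlite3

# A tiny, made-up "search engine database" held in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE webPages (url TEXT, domain TEXT, filetype TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO webPages VALUES (?, ?, ?, ?)",
    [
        ("http://example.gov.uk/budget.xls", "example.gov.uk", "xls",
         "Forecast underspend for the year"),
        ("http://example.gov.uk/minutes.pdf", "example.gov.uk", "pdf",
         "The underspend was noted"),
        ("http://example.com/sales.xls", "example.com", "xls",
         "Quarterly underspend report"),
        ("http://example.gov.uk/staff.xls", "example.gov.uk", "xls",
         "Staff numbers by grade"),
    ],
)

# % is a wildcard: any characters may come before or after 'underspend',
# and any characters may come before 'gov.uk' in the domain.
query = """
SELECT url
FROM webPages
WHERE content LIKE '%underspend%'
  AND filetype = 'xls'
  AND domain LIKE '%gov.uk'
"""

matches = [row[0] for row in conn.execute(query)]
print(matches)
```

Only the first row passes all three tests: the PDF has the wrong filetype, the example.com spreadsheet has the wrong domain, and the staff spreadsheet does not mention underspend.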

Far more complicated queries can be written over far more complex databases. What’s important is that you develop an idea of what sorts of database structure and query are possible, not necessarily that you can run and query such databases yourself.

For more examples, see:

-  Asking Questions of Data – Garment Factories Data Expedition – http://schoolofdata.org/2013/05/24/asking-questions-of-data-garment-factories-data-expedition/
-  Asking Questions of Data – Some Simple One-Liners – http://schoolofdata.org/2013/05/13/asking-questions-of-data-some-simple-one-liners/

41

One of the simplest, but often one of the most useful, things we can do is to count things. You just need to be creative in what you count!

One of the nice features about working with database query languages such as SQL is that we can write queries that count the number of responses and allow us to rank results on that basis. For example, in a database of public spending transactions with different companies, we could count the number of transactions with a particular company, sum the value of transactions carried out with a particular company, or find the companies with which the largest total amounts were spent.
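A minimal sketch of this sort of counting query, again using Python’s sqlite3 module. The spending table, the company names and the amounts are invented for illustration. The query counts the transactions with each company, sums their value, and ranks the companies by total spend.

```python
import sqlite3

# A made-up table of spending transactions held in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spending (company TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO spending VALUES (?, ?)",
    [
        ("Acme Books", 1200.0),
        ("Acme Books", 300.0),
        ("Brightline Telecom", 2500.0),
        ("Acme Books", 450.0),
        ("Brightline Telecom", 150.0),
        ("Corner Cafe", 80.0),
    ],
)

# GROUP BY gathers the rows for each company; COUNT and SUM then summarise
# each group, and ORDER BY ranks the groups by total spend, largest first.
query = """
SELECT company, COUNT(*) AS transactions, SUM(amount) AS total
FROM spending
GROUP BY company
ORDER BY total DESC
"""

league_table = conn.execute(query).fetchall()
for company, transactions, total in league_table:
    print(company, transactions, total)
```

Note that the company with the most transactions (Acme Books, with three) is not the one with the largest total spend: what you choose to count changes the ranking.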

42

As has already been mentioned, a key part of the journalistic exercise is putting things into context.

When working with data, interpreting what the data says often depends on understanding the context and more importantly, the caveats, that arise by virtue of asking a particular question of a particular dataset that has been collected in a particular way under particular conditions.

That said, given a particular data set, are there any obvious questions we can ask of it?

43

When results are ranked, as for example in the case of league tables, there are often easy picking stories to be had around top three/bottom three positions. In national rankings, local news stories can be identified if your local schools or council appear in either of those extremes.

For contextualisation purposes, it often makes sense to look at distributions. Many summary statistics report on the mean value, but looking at measures of variation, or spread, about a mean, as well as the position of a median value, can often change the context of a story.

If the lecture room has 20 students in it on an income of £6,000 maintenance loan per year, the total income is £120,000 and their average mean income is £6,000. If an academic in the room is on £40,000, the total income for the room is £160,000. The average mean income is now just a little over £7,500. If we define a poverty level as a mean income below £10,000, the members of the room are, on average, in poverty. If a senior academic such as a professor on an income over £65,000 wanders into the room, the total income goes to over £225,000. With 22 people now in the room, the average mean income is now over £10,000: the room is out of poverty. The median average income, however, is still at £6,000.
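The lecture room arithmetic can be reproduced with Python’s statistics module. This sketch assumes the professor earns exactly £65,000 (the text only says “over £65,000”), so the figures it gives for the last step are lower bounds.

```python
from statistics import mean, median

# 20 students, each on a £6,000 maintenance loan.
room = [6000] * 20
print(sum(room), mean(room), median(room))

# An academic on £40,000 is also in the room.
room.append(40000)
print(sum(room), round(mean(room)), median(room))

# A professor wanders in (assumed here to be on exactly £65,000).
room.append(65000)
print(sum(room), round(mean(room)), median(room))
```

The mean climbs from £6,000 to about £7,619 and then to about £10,227, while the median stays at £6,000 throughout, because the middle of the ordered list is still occupied by students.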

As well as top, bottom, mean and median, we should also look to outliers. If Bill Gates or Mark Zuckerberg walks into a bar, the average net worth of people in that bar is likely to go up to a level of previously unimagined wealth.

Here are several reasons why you should pay attention to outliers:

-  they may be ‘dirty’ or incorrect data points that need to be corrected and that may well raise questions about data quality;
-  the outlier may truly be an outlier, a remarkable point and a story in its own right;
-  the outlier may skew other measures, such as mean values or other summary statistics. In such cases, it may make sense to use other measures or to rerun the summary statistic without including the outlier values to get a better feel for how the

44

This rather dense graphic is a view over local council spending data in my local area as relates to spend on libraries. The separate charts show the accumulated spend over a period of time with different suppliers. The intention of the display was to provide at a glance a view of accumulated spend with different companies across different directorates and spending areas to see whether any companies had a significant spend compared to other companies.

The table at the bottom shows the top of a league table of companies with the largest accumulated spend by directorate and expense type.

At first glance, the spend on phone lines with different suppliers seems to outweigh the spend on books. How can that be? Are the librarians spending their time calling premium rate phone lines?

If we guess at 20 libraries and a 6 month spend period, then assume that the phone lines correspond to broadband data bills, do the monthly payments per library still seem outrageous? These assumptions are testable via questions to the relevant authorities, of course, but demonstrate the care we need to take when trying to understand why a number that may appear to be large is that large.

See also: Local Council Spending Data – Time Series Charts http://blog.ouseful.info/2013/11/06/local-council-spending-data-time-series-charts/

45

As well as looking for outliers, we should also look for similarities between things we expect to be different and differences between things we expect to be the same, or at least, similar.

46

Looking again at some of my local council’s spending data, I noticed a search on “music” pulled back what appeared to be a shift in responsibility between directorates for spend on school music service provision.

An obvious question that follows is: if the service did change hands (something we can check), was there a resulting difference in the way that the directorates were spending? Could we, for example, identify whether any projects got dropped (or at least, renamed out of scope!)?

This forensic approach can also be used to track the consequences of a shift in control of a service, if we know it has happened. When a service changes hands, we can keep a note of the fact and then a year on look for evidence of whether treatment of the service has changed, at least in consequences for spending.

See also: What Role, If Any, Does Spending Data Have to Play in Local Council Budget Consultations? http://blog.ouseful.info/2013/11/03/what-role-if-any-does-spending-data-have-to-play-in-local-council-budget-consultations/

47

When asking questions of data, one question can often lead to another.

For example, a query over my local council spending data about amounts spent with the local newspaper, the Isle of Wight Country Press, identified a variety of expense types associated with those spending transactions. One such expense type was Advertising & Publicity. This led to me now steering the conversation I was having with this expert (data) source on council spending and taking it on to a slightly different tack: so who else have you been spending advertising and publicity budgets with?

48

If you are in the position of paying for energy supply bills – electricity and gas – you’ll probably be familiar with the idea that payments are set so you tend to overpay on a monthly basis. After collecting the interest on your overpayments, the utility companies may eventually get round to sending you a small repayment to cover the excess (ex- of any interest, of course…).

Is the same true at the council level?

One thing I noticed in the payments my local council made to supplier Southern Electric was that there appeared to be more than a few “negative payments”. So where were these coming from? The chart shown in this slide has positive payments made by date (not ordered on an evenly spaced timeline) in black, and the magnitude of negative payments shown in red. Where a red triangle sits over a black dot, this shows that a positive and negative payment of the same amount were made on the same day. Why’s that?

Some days show several negative payments – again, what’s happening? There’s not necessarily anything suspicious going on, but what story does this chart appear to tell us, particularly in terms of the similarities in amount of certain positive and negative spends?

49

Just by the by, this chart refines the question I’m asking of the spend with Southern Electric, asking for more information about positive and negative payments made on the gas and electricity accounts separately.

50

As well as similarities and differences, data can tell us tales about trends…

51

Regular releases from the ONS – the Office for National Statistics – provide bread and butter news stories on a regular basis according to a known schedule.

For example, monthly job seeker figures get a monthly write-up in OnTheWight, the hyperlocal news blog local to me. The report makes a comparison between the current figures and figures from the previous month and from the same month of the previous year. The aim is to let us see how the numbers have changed month on month, and year on year.

I started to explore a simple script that would take data directly from the ONS and produce assets that could be reused in a news story – for example, to produce a table showing the change in figures over recent months.

I also started to explore ways in which we could automate the production of prose from the data [code: https://gist.github.com/psychemedia/7536017]. For example, the following phrase was generated automatically from monthly figures:

The total number of people claiming Job Seeker's Allowance (JSA) on the Isle of Wight in October was 2781, up 94 from 2687 in September, 2013, and down 377 from 3158 in October, 2012.

The words up and down were selected based on a simple if-then rule that compared figures to see which was the greater. The numbers and dates are pulled in from the data. The other words are canned phrases.
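The approach can be sketched in a few lines of Python. This is an illustration written for these notes rather than the code in the gist linked above; the function names are made up, and the figures are the ones quoted in the example sentence.

```python
def change_phrase(current, previous):
    """Pick 'up', 'down' or 'unchanged' with a simple if-then rule."""
    if current > previous:
        return f"up {current - previous}"
    if current < previous:
        return f"down {previous - current}"
    return "unchanged"


def jsa_sentence(area, month, figure, last_month, last_month_figure,
                 last_year, last_year_figure):
    """Slot the numbers and dates into a canned sentence."""
    return (
        "The total number of people claiming Job Seeker's Allowance (JSA) "
        f"{area} in {month} was {figure}, "
        f"{change_phrase(figure, last_month_figure)} from {last_month_figure} "
        f"in {last_month}, "
        f"and {change_phrase(figure, last_year_figure)} from {last_year_figure} "
        f"in {last_year}."
    )


sentence = jsa_sentence(
    "on the Isle of Wight", "October", 2781,
    "September, 2013", 2687,
    "October, 2012", 3158,
)
print(sentence)
```

Everything outside the curly brackets is canned text; only the direction words, the differences and the values pulled from the data change from month to month.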

The automated production of text from data is something that has received attention from several companies, particularly in the area of baseball reports and financial reporting. See for example: http://blog.ouseful.info/2013/05/22/notes-on-narrative-science-and-automated-insight/

Being able to define sentences and natural language constructions that can be used

52

If we plot a line chart with some quantity against a time axis, we can often see increasing or decreasing trends over time. If we are looking for constant rates of increase in some value, it often makes sense to use a log (logarithmic) scale to display that value on the y-axis: on a log scale, a constant percentage rate of increase appears as a straight line. Periodic trends can also be seen as ‘waves’ appearing in the line over time, but other displays can draw out periodicity or seasonality in a more visually compelling way.
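To see why a log scale suits constant rates of increase, here is a small sketch using a made-up series that grows by 10% per period (the starting value and growth rate are illustrative assumptions, not real data). The raw step between periods keeps getting bigger, but the step between the logged values is always the same, which is what produces a straight line on a log axis.

```python
import math

# Illustrative assumptions: start at 100 and grow by 10% each period
start, growth = 100.0, 0.10
series = [start * (1 + growth) ** t for t in range(6)]

# Differences between consecutive raw values grow over time...
raw_steps = [b - a for a, b in zip(series, series[1:])]

# ...but differences between consecutive logged values stay constant
log_steps = [math.log10(b) - math.log10(a) for a, b in zip(series, series[1:])]

for t, (raw, logged) in enumerate(zip(raw_steps, log_steps), start=1):
    print(f"period {t}: raw step {raw:.2f}, log10 step {logged:.4f}")
```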

For example, in these charts – of jobless figures on the Isle of Wight once again – we have months ordered along the horizontal x-axis and the number of job allowance claimants on the vertical y-axis. The separate coloured lines represent different years. On the left, we use a legend to identify the lines; on the right is an example of labelling the lines directly.

The lines show strong seasonality in behaviour. Because the island is a tourist destination, job seeker figures tend to fall over the summer months. Putting lines for several years on the same axis allows us to compare annual cycles over time.
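Charts like these need the data regrouped so that each year becomes its own line with one slot per month. A minimal sketch of that regrouping is shown below; the function name lines_by_year and the record layout are my own assumptions, and the only data used are the three claimant figures quoted earlier in these notes, so most monthly slots are simply empty.

```python
from collections import defaultdict

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# (year, month number, claimants): the three figures quoted in these notes
records = [(2012, 10, 3158), (2013, 9, 2687), (2013, 10, 2781)]


def lines_by_year(records):
    """Return {year: [value for Jan..Dec]}, with None where no figure is held."""
    series = defaultdict(lambda: [None] * 12)
    for year, month, value in records:
        series[year][month - 1] = value
    return dict(series)


# Each year is one line on the chart; months give the x-axis positions
for year, values in sorted(lines_by_year(records).items()):
    points = [(MONTHS[i], v) for i, v in enumerate(values) if v is not None]
    print(year, points)
```

Grouping the same records by month first, and then ordering the years within each month, would give the blocked layout used for looking at year-on-year change within a given month.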

53

Another trend we can try to pull out is change over years for each given month. Here, the horizontal x-axis blocks out the months, as before, but within each month we have an ordered range of years. The line within each block thus represents the year-on-year change in numbers within a given month.

The step change within each month suggests that the way the figures were calculated changed significantly several years ago.

Further reading: a good guide to statistics as used by government, including a description of the way that “seasonal adjustments” are handled, is provided by the House of Commons Library’s Statistical Literacy Guide http://www.parliament.uk/business/publications/research/briefing-papers/SN04944/statistical-literacy-guide

54

As well as the patterns we can see over time by plotting data against a time axis, we can also look for patterns in space…

55

In part because they are so recognisable to the majority of people as an idea as well as an artefact, maps are widely used in many publications.

I have already mentioned how the use of a map to compare travel claims by MPs based on their constituency locations provided a way of making a particular sort of comparison between MPs (in particular, a comparison based on geographical location).

But we can take the idea of a map more generally, as a spatial distribution of points that are related in some way, with strong relations represented as spatial proximity.

Things that are close together on the page are taken to be close together in some sort of space, a space which may be conceptual or social, not just (or not even) geographic.

56

Take this map, for example, a map of Twitter users commonly followed by a sample of followers of @UL_journalism.

The map has been laid out so that Twitter users who are heavily interlinked are grouped closely together (for the most part, at least). A network statistic has been used in an attempt to colour clusters of nodes with high interconnection. The coloured regions thus represent a first attempt at identifying different groupings of Twitter user. You will note how the spatial layout algorithm and the grouping/colouring algorithm complement each other well – they both seem to tell a similar story, where the story is that certain groups of individuals are somehow alike.

About the technique: http://schoolofdata.org/2014/02/14/mapping-social-positioning-on-twitter/
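The first step in building a map like this, finding the accounts that are commonly followed by a sample of followers, can be sketched roughly as below. This is only an illustration of the counting idea, not the code used to produce the map: the follower and account names are hypothetical placeholders, and the threshold of two is an arbitrary assumption.

```python
from collections import Counter

# Hypothetical sample: each sampled follower mapped to the accounts they follow
friends_of = {
    "follower_a": {"account_1", "account_2", "account_3"},
    "follower_b": {"account_1", "account_3"},
    "follower_c": {"account_1", "account_4"},
}


def commonly_followed(friends_of, min_followers=2):
    """Count how many sampled followers follow each account,
    keeping only accounts followed by at least min_followers of them."""
    counts = Counter(
        account for friends in friends_of.values() for account in friends
    )
    return {account: n for account, n in counts.items() if n >= min_followers}


common = commonly_followed(friends_of)
print(common)
```

The accounts that survive the threshold become the nodes of the map; the links between them are then what the layout and colouring algorithms work on.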

Let’s have a closer look at some of the regions…

57

This area seems to be Twitter accounts that relate in large part to the University of Lincoln and its related organisations and activities.

58

This area of the map contains accounts associated with Lincoln more generally. Such a map may be useful for identifying companies that are used by students and as such may be useful leads for advertising agents looking to sell adverts appearing in university magazines or poster areas.

59

This area of the map actually conflates several different groupings, at least on my reading of it. In fact, it may make sense to try to find clusters within this group on its own and then recolour accordingly.

So what groups can I see? Bottom left there looks to be Lincoln local media outlets. Moving counter-clockwise between the 6 and 3 o’clock positions we see photography-related users moving up into celebrities. As we move further up towards the 12 o’clock position, we see news sites, both “popular” and more industry related (@journalismnews, for example).

That there does not appear to be a strong independent cluster of journalists and industry-related sites suggests that, among the sampled followers of @UL_journalism at least, there isn’t necessarily a very strong notion of following these industry lights…

60

One thing to mention about data mapping and visualisation techniques is that they often tell us things we already (think we) know; in that sense, they are not news. But they may also tell us things we know in new, visually appealing ways. And by making use of such ‘confirmatory’ visualisations and displays we can build confidence within an audience that they know how to interpret these sorts of representation.

61

As the audience becomes comfortable reading the charts and making sense of data, when there is something new or surprising in the data, the surprise manifests itself in the reading of the data or chart.

For journalists working with data, developing a sense of familiarity with how to interpret and read data when it is just confirming what you already know helps to refine your senses for spotting things that are odd, noteworthy, or newsworthy.

Taking a little bit of time each day to:

- read charts as if they were stories;
- look behind the data to find original sources, such as polls or data-containing news releases, and then compare the original release with the way it is reported, paying particular attention to the points that are highlighted, and how the data is contextualised;

will help you develop some of the skills you will need if you want to be able to identify, develop and treat some of the stories that your specialist source, data, can provide you with, if only you ask…

62

And finally, a couple of handy books and resources on data journalism if you’re interested in reading more generally around the subject…

63
