What is Really Going On?

Large amounts of news are produced every day by news media, which interpret and highlight the events

around us. Each news agency's coverage of events outside its own country is likely colored by a variety of

factors, such as the preferences/inclinations of their audience, a limited amount of resources (e.g. number of

pages, staff) and the relations between countries. Our project sought to compare the coverage of events by news

media around the world with the coverage of the New York Times World section (NYT) and in the likely case

where there are differences/biases, investigate how and why the coverage in the NYT differs from the rest of the

world.

Using countries as the unit of analysis on data from January to April 2015, we found that there was a

correlation between the number of articles written in the NYT about a country and the number of articles

written by world media about the country. However, this correlation could not be explained by factors such as

the country's demographics (e.g. population, Gross Domestic Product, labor force) or the country's

relationships with the rest of the world (e.g. foreign direct investment inflow, official development assistance).

While we were unable to determine why the NYT's coverage differs from the rest of the world, we have

developed visualizations1 that let users see the differences for themselves, perhaps helping them to figure out what is really

going on in both the world media and in the New York Times.

COMPONENTS OF THE PROJECT

This section outlines the main parts of the project in terms of data capture, analysis, and visualization, as

well as my contributions where applicable. Figure 1 shows an overview of how the parts are related to each

other.

1 Link to visualizations: http://128.40.150.34/~ucfntka. This can be accessed via the UCL network and requires the data service

gdeltnytDataServer to be running on Node.


Figure 1: Overview of project components and relationships

Data Capture

We used three main sources of data for the project:

A. The Global Database of Events, Language, and Tone (GDELT)2. The database contains event data

collected from the world's news media. Each event is a record in the database, and each record

contains information such as the number of times it is mentioned in news articles, the location where the event

took place and the parties involved in the event (The GDELT Project, n.d.). We used data from January

to April 2015 for the project, amounting to 18 million records in the database. I contributed by

downloading the CSV files and writing MySQL scripts to import them into the database. To decrease the

query time needed when the user interacts with the visualization, I also wrote scripts to create new tables

2 GDELT link: http://gdeltproject.org/data.html#rawdatafiles


where the records were grouped by country, by date, and by both country and date. These reduced query

time considerably.

B. The New York Times (NYT) APIs3. We used the NYT Article Search API to retrieve data on articles

published in the New York Times World section from January to April 2015. We also used the NYT

Community API to retrieve comments written about these articles. I contributed by writing the Python

scripts to retrieve the data from these APIs and organize the data by both country and date. I also created

MySQL scripts to import the results to the database.

C. The World Bank API4. We used the World Bank indicators API to retrieve information on about 20

indicators such as population, labor force and Gross Domestic Product for over 200 countries around the

world. The most recent information, typically between 2012 and 2014, was retrieved for each indicator.

These indicators would be used for the regression analysis later. I contributed by writing the Python

script to retrieve the data.
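
The "most recent information" rule described for the World Bank indicators can be illustrated with a minimal Python sketch. It assumes each indicator has already been fetched as a list of yearly observations carrying `date` and `value` fields, similar to the records the World Bank API returns. The function name, the sample figures and the default year window are hypothetical and are not taken from the project's script.

```python
from __future__ import annotations


def most_recent_value(observations: list[dict],
                      first_year: int = 2012,
                      last_year: int = 2014) -> tuple[int, float] | None:
    """Return (year, value) for the latest non-missing observation in the window."""
    usable = [
        (int(obs["date"]), float(obs["value"]))
        for obs in observations
        if obs.get("value") is not None
        and first_year <= int(obs["date"]) <= last_year
    ]
    # max() on (year, value) tuples compares the year first, so the latest year wins.
    return max(usable) if usable else None


# Hypothetical observations for one country and one indicator.
sample = [
    {"date": "2014", "value": None},        # not yet published
    {"date": "2013", "value": 27_500_000},
    {"date": "2012", "value": 27_100_000},
]
latest = most_recent_value(sample)
print(latest)
```

Returning the year alongside the value keeps track of which countries rely on older figures, which may matter when indicators from different years are combined in one regression.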

We also created API services5 for the project, which we used in our visualizations. I contributed by writing

and documenting all the API services in Node.js. One important way to improve the API services would be to

design more flexible API endpoints from the beginning, so that it is easier to retrieve more data or

retrieve different subsets of data when needed for analysis/visualization.
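
One way to combine the pre-grouped tables described for GDELT with a more flexible endpoint is sketched below. This is an illustration under stated assumptions, not the project's code: Python's built-in `sqlite3` stands in for MySQL and for the Node.js services, and the table names (`events`, `mentions_by_country_day`) and column names (`country`, `day`, `num_articles`) are hypothetical.

```python
import sqlite3

ALLOWED_GROUPS = {"country", "day"}  # whitelist, since column names cannot be bound as parameters


def build_summary(conn):
    """Pre-aggregate raw event rows into one row per (country, day)."""
    conn.execute(
        "CREATE TABLE mentions_by_country_day AS "
        "SELECT country, day, SUM(num_articles) AS articles "
        "FROM events GROUP BY country, day"
    )


def coverage(conn, group_by, start=None, end=None, country=None):
    """Return article totals grouped by the requested columns, with optional filters."""
    if not group_by or any(col not in ALLOWED_GROUPS for col in group_by):
        raise ValueError(f"group_by must be drawn from {sorted(ALLOWED_GROUPS)}")
    filters, params = [], []
    if start is not None:
        filters.append("day >= ?")
        params.append(start)
    if end is not None:
        filters.append("day <= ?")
        params.append(end)
    if country is not None:
        filters.append("country = ?")
        params.append(country)
    columns = ", ".join(group_by)
    sql = f"SELECT {columns}, SUM(articles) FROM mentions_by_country_day"
    if filters:
        sql += " WHERE " + " AND ".join(filters)
    sql += f" GROUP BY {columns} ORDER BY {columns}"
    return conn.execute(sql, params).fetchall()


# Hypothetical raw rows: several events can share a country and a day.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (country TEXT, day TEXT, num_articles INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("NPL", "2015-04-24", 3),
        ("NPL", "2015-04-25", 40),
        ("NPL", "2015-04-25", 25),
        ("CHN", "2015-04-25", 30),
    ],
)
build_summary(conn)
by_country = coverage(conn, ["country"], start="2015-04-25")
by_day = coverage(conn, ["day"], country="NPL")
print(by_country, by_day)
```

A single parameterized function over the finest summary table can serve per-country, per-day and combined views, so new subsets of the data do not each need a dedicated endpoint.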

Data Analysis

The analysis had two components:

A. Regression analysis. To investigate what influenced how often the New York Times covered a particular

country, we conducted regression analysis using GDELT and World Bank data as independent variables.

3 New York Times API link: http://developer.nytimes.com/docs

4 World Bank Data API link: http://data.worldbank.org/node/9

5 Documentation for these services can be found at http://128.40.150.34:8886/ when the gdeltnytDataServer is running on Node


As the New York Times data was based on articles in its World section where the focus was on events

outside the United States, we removed US-related data in the analysis. We tested multiple models using

countries' demographics and their relations to the rest of the world and found that almost all models

explained little to none of the variation in how much a country was mentioned in the New York Times.

Instead, the number of times a country was mentioned in the GDELT database was far better at

explaining variation in how much a country was mentioned in the New York Times. The models tested

and their cross-validated R-squared values are summarized in Table 1. I contributed by writing the

Python scripts to integrate the NYT, GDELT and World Bank data and to run the regression analyses

using the scikit-learn package (Pedregosa et al., 2011). Given more time, other important factors could

be tested, such as those relating to education levels in the country, whether the country was

English-speaking, and a country's specific relationship with the United States (e.g. trade relations, diplomatic

relations). A better approach to the analysis may be to use events as the unit of analysis and investigate

how likely the New York Times is to report a particular event in the GDELT database, using factors

such as country of origin, significance/impact of event and number of news sources covering the event.

We did not take this approach mainly because it would take significant effort to link individual events

in GDELT to NYT articles.


Table 1: Summary of regression results

B. Sentiment analysis. Although not visualized or mentioned on the website, we conducted sentiment

analysis on readers' comments on New York Times articles as well.
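
The cross-validated R-squared comparison from the regression analysis (component A) can be illustrated with a small, dependency-free Python sketch. It is not the project's scikit-learn code: it fits a one-predictor least-squares line by hand, uses interleaved folds rather than shuffled ones, and runs on synthetic data for 20 hypothetical countries. With scikit-learn, the equivalent step would typically use `cross_val_score` with `scoring="r2"`.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(actual, predicted):
    """Share of variation in `actual` explained by `predicted` (can be negative)."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def cross_validated_r2(xs, ys, k=5):
    """Average held-out R-squared over k folds; fold f holds out indices i with i % k == f."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        predicted = [intercept + slope * xs[i] for i in test]
        scores.append(r_squared([ys[i] for i in test], predicted))
    return sum(scores) / k


# Synthetic, deterministic data (assumption: not the project's figures).
gdelt_mentions = [1000 * (i + 1) for i in range(20)]
nyt_mentions = [0.01 * g + (2 if i % 2 == 0 else -2) for i, g in enumerate(gdelt_mentions)]
weak_indicator = [(7 * i) % 20 for i in range(20)]  # only loosely related to NYT mentions

strong = cross_validated_r2(gdelt_mentions, nyt_mentions)
weak = cross_validated_r2(weak_indicator, nyt_mentions)
print(f"GDELT mentions: {strong:.3f}, weak indicator: {weak:.3f}")
```

Scoring each model on held-out countries, rather than on the countries it was fitted to, is what keeps a model with many indicators from looking better than it is.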

Data Visualization

The website for the project was set up with Bootstrap, a front-end (HTML, CSS and JavaScript) framework for developing responsive,

mobile-first projects (Bootstrap, n.d.). It houses the project's visualizations and storyline. I created two maps in

d3.js for the project (with design input from group mates), which are housed in iframes on the website:

A. News Coverage Map. This is an interactive choropleth map that allows users to investigate how news

coverage of countries varied over time. Figure 2 depicts the map. Users can use a date slider to change

the period they wish to investigate, and the map updates accordingly. Using the checkboxes below

the slider, users can choose to look at only GDELT coverage, only NYT coverage, or both together.

Figure 2: News Coverage Map for January 2015


For GDELT coverage, the map uses GDELT data to calculate the number of articles written per

day over the time period for each country (A) based on the dates on the slider. These figures are

compared against the number of articles written per day over the base period, January to April 2015 (B),

using the following formula:

Difference in GDELT news coverage on country = (A - B) / B x 100%

The results are mapped using a diverging color scheme from ColorBrewer. Blue is used for countries

with higher news coverage than usual over the time period chosen, and red is used for countries with

lower news coverage than usual. The darker the color, the larger the deviation from usual. For example,

Nepal was covered much more than usual by news media after the earthquake in the last week of April,

and shows up as dark blue if those dates are selected on the slider. Users can also hover over a country to

see more information on the GDELT coverage for the particular country.
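
The coverage formula above can be worked through in a short Python sketch. The article counts are hypothetical, and the class boundaries used for the diverging color scheme (a 50-point step, three classes per side) are illustrative assumptions rather than the thresholds used on the map.

```python
def coverage_difference(articles_in_period, days_in_period,
                        articles_in_base, days_in_base):
    """Percentage difference between per-day coverage in the period (A) and the base (B)."""
    a = articles_in_period / days_in_period   # A: articles per day in the selected period
    b = articles_in_base / days_in_base       # B: articles per day in the base period
    return (a - b) / b * 100


def diverging_class(difference_pct, step=50.0, classes_per_side=3):
    """Signed color class: positive = more coverage than usual, larger magnitude = darker."""
    if difference_pct == 0:
        return 0
    magnitude = min(classes_per_side, int(abs(difference_pct) // step) + 1)
    return magnitude if difference_pct > 0 else -magnitude


# Hypothetical counts: a 7-day window against the 120-day base period (January to April 2015).
surge = coverage_difference(2100, 7, 12000, 120)   # 300 per day against 100 per day
lull = coverage_difference(350, 7, 12000, 120)     # 50 per day against 100 per day
print(surge, diverging_class(surge))
print(lull, diverging_class(lull))
```

Dividing by the number of days in each period is what makes a one-week selection comparable with the four-month base period.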

The New York Times coverage is represented by yellow spotlights on countries that have been

mentioned by the NYT World section at least once. This method was chosen to allow users to look at

both the NYT coverage and GDELT coverage at the same time. The New York Times has a limited


amount of space for stories every day (about 30 articles per day on average), and this visualization shows

where there are discrepancies between NYT coverage and the rest of the world's media coverage for

particular time periods.

One future improvement could be to adjust the size (or some other characteristic) of the spotlight

for the NYT coverage based on the number of times the country is mentioned using a formula similar to

the GDELT coverage. It would also be useful to add information on NYT coverage when users hover

over or click on the spotlight. An especially useful piece of contextual information to add would be to

show the top news headlines in the country from GDELT and NYT in a sidebar when users click on the

country, so that users can see if the content reported (if any) was similar as well.

B. Correlation Map. This is a choropleth map that sought to compare how often countries were covered by

the world media (represented by GDELT) versus the New York Times. Figure 3 depicts the map. This

map correlates the number of times a country was mentioned by the global media (from GDELT data)

with the number of times a country was mentioned in the New York Times. Both numbers were

standardized by dividing them by the maximum number of times any country was mentioned in the

dataset from January to April 2015. A darker green means a higher correlation between world media

coverage and NYT coverage. Users can hover over a country to see the correlation figure.


Figure 3: Correlation map

There are many ways of improving the visualization further in future. I could have tested alternative

ways to standardize the datasets to see if the data could be visualized better. Visualizing the overall

correlations also tells users nothing about the day-to-day variations within countries, which may be more

pertinent when thinking about media bias. The visualization could be improved by showing a chart

similar to Figure 4 when users click on individual countries, so they get a better understanding of the

day-to-day variations, and where coverage actually differs between NYT and the world media in general.

Here, the number of mentions per day from each dataset is normalized by dividing the number of

mentions by the maximum number of times the particular country was mentioned over the entire time

period. This creates a ratio between 0 and 1, with the red line representing NYT and the blue line

representing media around the world (GDELT). While we did visualize this for one country on the

website, it would be more interesting for users to investigate individual countries themselves.
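
The per-country normalization and correlation described above can be sketched in Python as follows. The daily counts are hypothetical, and the use of the Pearson coefficient is an assumption, since the text does not say which correlation measure was computed.

```python
from math import sqrt


def normalize_by_max(counts):
    """Scale daily counts to the 0-1 range by dividing by the largest daily count."""
    peak = max(counts)
    return [c / peak for c in counts] if peak else [0.0 for _ in counts]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical daily mention counts for one country over five days.
gdelt_daily = [120, 300, 90, 600, 150]
nyt_daily = [1, 2, 0, 3, 2]

gdelt_ratio = normalize_by_max(gdelt_daily)
nyt_ratio = normalize_by_max(nyt_daily)
correlation = pearson(gdelt_ratio, nyt_ratio)
print(round(correlation, 2))
```

Dividing a series by a constant does not change its Pearson correlation with another series, so under this assumption the normalization affects how the two lines are drawn on a shared 0 to 1 axis but not the correlation figure itself.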


Figure 4: Correlation chart for China (correlation of 0.41)

FUTURE WORK

Other than the improvements mentioned, future work could expand the scope of the project. We focused on

the New York Times for this project as information was easily accessible through its API. One future direction

could be to look at how different news outlets cover world events. This could also be compared against trends in

social media to understand what people pay attention to. It would also be useful to find objective sources of

events in the world and compare them against the GDELT database. News is our lens to the rest of the world,

and we would be better off if we understood how and why our lenses are biased.


References (includes data sources and tools that were mentioned in the text)

Bootstrap. (n.d.). Bootstrap. Available from: http://getbootstrap.com/. Accessed 20th May 2015.

ColorBrewer2. (n.d.). ColorBrewer2. Available from: http://colorbrewer2.org/. Accessed 20th May 2015.

D3. (2013). Overview. Available from: http://d3js.org/. Accessed 20th May 2015.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,

Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.

(2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 12, p. 2825-2830.

The GDELT Project. (n.d.). Intro. Available from: http://gdeltproject.org/. Accessed 20th May 2015.

The New York Times. (2014). APIs. Available from: http://developer.nytimes.com/docs. Accessed 20th May

2015.

The World Bank. (n.d.). Data. Available from: http://data.worldbank.org/node/9. Accessed 20th May 2015.
