what is really going on?


Upload: thoughtful-practice

Post on 08-Nov-2015

DESCRIPTION

This report investigates how news coverage in the New York Times differs from the rest of the world.

TRANSCRIPT

  • What is Really Going On?


    What is Really Going On?

    Large amounts of news are produced every day by news media, who interpret and highlight the events around us. Each news agency's coverage of events outside its own country is likely colored by a variety of factors, such as the preferences and inclinations of its audience, limited resources (e.g. number of pages, staff) and the relations between countries. Our project sought to compare the coverage of events by news media around the world with the coverage in the New York Times' World section (NYT) and, in the likely case that there are differences or biases, to investigate how and why the coverage in the NYT differs from the rest of the world.

    Using countries as the unit of analysis on data from January to April 2015, we found a correlation between the number of articles written in the NYT about a country and the number of articles written by world media about that country. However, this correlation could not be explained by factors such as the country's demographics (e.g. population, Gross Domestic Product, labor force) or the country's relationships with the rest of the world (e.g. foreign direct investment inflow, official development assistance). While we were unable to determine why the NYT's coverage differs from the rest of the world, we have developed visualizations (footnote 1) that let users see the differences for themselves, perhaps helping them to figure out what is really going on in both the world media and the New York Times.

    COMPONENTS OF THE PROJECT

    This section outlines the main parts of the project in terms of data capture, analysis, and visualization, as well as my contributions where applicable. Figure 1 shows an overview of how the parts are related to each other.

    1 Link to visualizations: http://128.40.150.34/~ucfntka. This can be accessed via the UCL network and requires the data service "gdeltnytDataServer" to be running on Node.


    Figure 1: Overview of project components and relationships

    Data Capture

    We used three main sources of data for the project:

    A. The Global Database of Events, Language, and Tone (GDELT) (footnote 2). The database contains event data collected from the world's news media. Each event is a record in the database, and each record contains information such as the number of times the event is mentioned in news articles, the location where the event took place and the parties involved (The GDELT Project, n.d.). We used data from January to April 2015, amounting to 18 million records. I contributed by downloading the CSV files and writing MySQL scripts to import them into the database. To decrease the query time when the user interacts with the visualization, I also wrote scripts to create new tables in which the records were grouped by country, by date, and by both country and date. These tables reduced query time considerably.

    2 GDELT link: http://gdeltproject.org/data.html#rawdatafiles

    B. The New York Times (NYT) APIs (footnote 3). We used the NYT Article Search API to retrieve data on articles published in the New York Times World section from January to April 2015. We also used the NYT Community API to retrieve comments written about these articles. I contributed by writing the Python scripts that retrieve the data from these APIs and organize it by both country and date. I also wrote MySQL scripts to import the results into the database.

    C. The World Bank API (footnote 4). We used the World Bank Indicators API to retrieve about 20 indicators, such as population, labor force and Gross Domestic Product, for over 200 countries around the world. The most recent value, typically from between 2012 and 2014, was retrieved for each indicator. These indicators were later used in the regression analysis. I contributed by writing the Python script to retrieve the data.
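    The pre-aggregated tables described in item A can be illustrated with a minimal sketch. It is not the project's actual script: it uses Python's built-in SQLite instead of MySQL so that it runs standalone, and the table names, column names and sample rows are assumptions made for illustration only.

```python
import sqlite3

# Hypothetical miniature of the event table; the real GDELT export has many
# more columns, and the project used MySQL rather than SQLite.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_date TEXT, country_code TEXT, num_articles INTEGER)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        # Made-up rows; country codes are placeholders.
        ("2015-04-25", "NP", 40),
        ("2015-04-25", "NP", 60),
        ("2015-04-25", "CH", 10),
        ("2015-04-26", "NP", 90),
    ],
)

# Pre-aggregate once so the visualization can read a small summary table
# instead of scanning every event record on each request.
conn.execute(
    """
    CREATE TABLE events_by_country_date AS
    SELECT country_code, event_date,
           COUNT(*) AS num_events,
           SUM(num_articles) AS num_articles
    FROM events
    GROUP BY country_code, event_date
    """
)
conn.execute(
    "CREATE INDEX idx_country_date "
    "ON events_by_country_date (country_code, event_date)"
)

rows = conn.execute(
    "SELECT country_code, event_date, num_events, num_articles "
    "FROM events_by_country_date ORDER BY country_code, event_date"
).fetchall()
for row in rows:
    print(row)
```

    The same GROUP BY pattern, applied to the country alone or to the date alone, would give the other two summary tables mentioned above.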

    We also created API services (footnote 5) for the project, which we used in our visualizations. I contributed by writing and documenting all the API services in Node.js. One important improvement would be to design more flexible API endpoints from the beginning, so that it is easier to retrieve more data, or different subsets of data, when needed for analysis or visualization.

    Data Analysis

    The analysis had two components:

    A. Regression analysis. To investigate what influenced how often the New York Times covered a particular country, we conducted regression analysis using GDELT and World Bank data as independent variables. As the New York Times data was based on articles in its World section, where the focus is on events outside the United States, we removed US-related data from the analysis. We tested multiple models using countries' demographics and their relations to the rest of the world and found that almost all of them explained little to none of the variation in how much a country was mentioned in the New York Times. Instead, the number of times a country was mentioned in the GDELT database was far better at explaining this variation. The models tested and their cross-validated R-squared values are summarized in Table 1. I contributed by writing the Python scripts to integrate the NYT, GDELT and World Bank data and to run the regression analyses using the scikit-learn package (Pedregosa et al., 2011).

    Given more time, other important factors could be tested, such as education levels in the country, whether the country is English-speaking, and the country's specific relationship with the United States (e.g. trade relations, diplomatic relations). A better approach may be to use events as the unit of analysis and investigate how likely the New York Times is to report a particular event in the GDELT database, using factors such as country of origin, significance or impact of the event and the number of news sources covering it. We did not take this approach mainly because it would take significant effort to link individual events in GDELT to NYT articles.

    3 New York Times API link: http://developer.nytimes.com/docs
    4 World Bank Data API link: http://data.worldbank.org/node/9
    5 Documentation for these services can be found at http://128.40.150.34:8886/ when the gdeltnytDataServer is running on Node


    Table 1: Summary of regression results

    B. Sentiment analysis. Although it is not visualized or mentioned on the website, we also conducted sentiment analysis on readers' comments on New York Times articles.
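    To make the idea of a cross-validated R-squared from item A concrete, here is a small sketch. It is an illustration under stated assumptions, not the project's analysis: the project used scikit-learn, whereas this version uses only the Python standard library and a single predictor per model; the data are synthetic and deliberately constructed so that NYT mentions follow GDELT mentions and ignore population; all function and variable names are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def r_squared(ys, predictions):
    """Coefficient of determination; it can be negative on held-out data."""
    y_bar = mean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def cross_validated_r2(xs, ys, folds=5):
    """Average held-out R-squared over `folds` interleaved folds."""
    scores = []
    for k in range(folds):
        test_idx = set(range(k, len(xs), folds))
        pairs = list(enumerate(zip(xs, ys)))
        train = [(x, y) for i, (x, y) in pairs if i not in test_idx]
        test = [(x, y) for i, (x, y) in pairs if i in test_idx]
        slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
        predictions = [slope * x + intercept for x, _ in test]
        scores.append(r_squared([y for _, y in test], predictions))
    return mean(scores)


# Synthetic stand-in data (NOT the project's results): NYT mentions are built
# to track GDELT mentions and to be unrelated to population.
rng = random.Random(0)
gdelt_mentions = [rng.uniform(100, 10_000) for _ in range(40)]
population = [rng.uniform(1e6, 1e8) for _ in range(40)]
nyt_mentions = [0.01 * g + rng.gauss(0, 5) for g in gdelt_mentions]

r2_gdelt = cross_validated_r2(gdelt_mentions, nyt_mentions)
r2_population = cross_validated_r2(population, nyt_mentions)
print(f"GDELT mentions as predictor: {r2_gdelt:.2f}")
print(f"Population as predictor:     {r2_population:.2f}")
```

    Scoring each model on countries it was not fitted on is what makes the R-squared values comparable across models; a predictor that carries no signal typically scores near zero or below.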

    Data Visualization

    The website for the project was set up with Bootstrap, a front-end framework (HTML, CSS and JavaScript) for developing responsive, mobile-first projects (Bootstrap, n.d.). It houses the project's visualizations and storyline. I created two maps in d3.js for the project (with design input from group mates), which are embedded in iframes on the website:

    A. News Coverage Map. This is an interactive choropleth map that allows users to investigate how news coverage of countries varied over time. Figure 2 depicts the map. Users can use a date slider to change the period they wish to investigate, and the map updates accordingly. Using the checkboxes below the slider, users can choose to look at only GDELT coverage, only NYT coverage, or both together.

    Figure 2: News Coverage Map for January 2015


    For GDELT coverage, the map uses GDELT data to calculate, for each country, the number of articles written per day over the period selected on the slider (A). This figure is compared against the number of articles written per day over the base period, January to April 2015 (B), using the following formula:

    Difference in GDELT news coverage on country = (A - B) / B x 100%

    The results are mapped using a diverging color scheme from ColorBrewer. Blue is used for countries with higher news coverage than usual over the time period chosen, and red is used for countries with lower news coverage than usual. The darker the color, the larger the deviation from usual. For example, Nepal was covered much more than usual by news media after the earthquake in the last week of April, and shows up in the darkest "higher than usual" shade when those dates are selected on the slider. Users can also hover over a country to see more information on the GDELT coverage for that country.

    The New York Times coverage is represented by yellow spotlights on countries that have been mentioned by the NYT World section at least once. This method was chosen to allow users to look at both the NYT coverage and GDELT coverage at the same time. The New York Times has a limited amount of space for stories every day (about 30 articles per day on average), and this visualization shows where there are discrepancies between NYT coverage and the rest of the world's media coverage for particular time periods.

    One future improvement could be to adjust the size (or some other characteristic) of the spotlight for the NYT coverage based on the number of times the country is mentioned, using a formula similar to the one for GDELT coverage. It would also be useful to show information on NYT coverage when users hover over or click on the spotlight. An especially useful piece of contextual information would be the top news headlines about the country from GDELT and the NYT, shown in a sidebar when users click on the country, so that users can see whether the content reported (if any) was similar as well.
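    As a worked example of the coverage-difference formula used for the GDELT layer, the sketch below computes the percentage deviation from the base period. The function names and the article counts are hypothetical; only the formula itself, (A - B) / B x 100%, comes from the description above.

```python
def articles_per_day(total_articles, num_days):
    """Average number of articles per day over a period."""
    return total_articles / num_days


def coverage_difference_pct(selected_total, selected_days, base_total, base_days):
    """Percentage deviation of the selected period (A) from the base period (B)."""
    a = articles_per_day(selected_total, selected_days)
    b = articles_per_day(base_total, base_days)
    return (a - b) / b * 100


# Made-up counts: 1,200 articles over the 120 days from January to April 2015
# give a baseline of 10 articles per day.
busy_week = coverage_difference_pct(140, 7, 1200, 120)   # 20 per day vs 10
quiet_week = coverage_difference_pct(35, 7, 1200, 120)   # 5 per day vs 10
print(busy_week, quiet_week)  # 100.0 -50.0
```

    A positive value means the country was covered more than usual in the selected period, a negative value less than usual, which is what the two ends of the diverging color scheme encode.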

    B. Correlation Map. This is a choropleth map that compares how often countries were covered by the world media (represented by GDELT) versus the New York Times. Figure 3 depicts the map. The map correlates the number of times a country was mentioned by the global media (from GDELT data) with the number of times it was mentioned in the New York Times. Both numbers were standardized by dividing them by the maximum number of times any country was mentioned in the dataset from January to April 2015. A darker green means a higher correlation between world media coverage and NYT coverage. Users can hover over a country to see the correlation figure.


    Figure 3: Correlation map

    There are many ways of improving the visualization further in future. I could have tested alternative ways to standardize the datasets to see if the data could be visualized better. Visualizing the overall correlations also tells users nothing about the day-to-day variations within countries, which may be more pertinent when thinking about media bias. The visualization could be improved by showing a chart similar to Figure 4 when users click on individual countries, so that they get a better understanding of the day-to-day variations and of where coverage actually differs between the NYT and the world media in general. Here, the number of mentions per day in each dataset is normalized by dividing it by the maximum number of daily mentions of that country over the entire time period. This creates a ratio between 0 and 1, with the red line representing the NYT and the blue line representing media around the world (GDELT). While we did visualize this for one country on the website, it would be more interesting for users to investigate individual countries themselves.
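    The per-country normalization and correlation described above can be sketched as follows. This is an assumption-laden illustration: the report does not state which correlation coefficient was used, so Pearson's r is assumed here, and the daily counts and function names are invented for the example.

```python
from math import sqrt


def normalize_by_max(daily_mentions):
    """Scale a daily series to the range 0..1 by dividing by its peak."""
    peak = max(daily_mentions)
    if peak == 0:
        return [0.0 for _ in daily_mentions]
    return [m / peak for m in daily_mentions]


def pearson(xs, ys):
    """Pearson correlation coefficient (assumed measure, see lead-in)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up daily mention counts for one country over five days.
gdelt_daily = [200, 400, 800, 400, 200]
nyt_daily = [1, 2, 4, 1, 0]

gdelt_line = normalize_by_max(gdelt_daily)  # blue line in a Figure 4 style chart
nyt_line = normalize_by_max(nyt_daily)      # red line in a Figure 4 style chart

r_raw = pearson(gdelt_daily, nyt_daily)
r_normalized = pearson(gdelt_line, nyt_line)
print(gdelt_line, nyt_line)
print(round(r_raw, 3), round(r_normalized, 3))
```

    Dividing by the maximum puts both series on a common 0 to 1 scale for plotting, but it does not change a Pearson correlation, which is unaffected by rescaling either series by a positive constant. Testing alternative standardizations would therefore matter for how the chart reads rather than for the correlation figure itself.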


    Figure 4: Correlation chart for China (correlation of 0.41)

    FUTURE WORK

    Other than the improvements mentioned, future work could expand the scope of the project. We focused on the New York Times for this project as information was easily accessible through its API. One future direction could be to look at how different news outlets cover world events. This could also be compared against trends in social media to understand what people pay attention to. It would also be useful to find objective sources of events in the world and compare them against the GDELT database. News is our lens to the rest of the world, and we would be better off if we understood how and why our lenses are biased.


    References (includes data sources and tools that were mentioned in the text)

    Bootstrap. (n.d.). Bootstrap. Available from: http://getbootstrap.com/. Accessed 20th May 2015.

    ColorBrewer2. (n.d.). ColorBrewer2. Available from: http://colorbrewer2.org/. Accessed 20th May 2015.

    D3 (2013). Overview. Available from: http://d3js.org/. Accessed 20th May 2015.

    Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 12, p. 2825-2830.

    The GDELT Project. (n.d.). Intro. Available from: http://gdeltproject.org/. Accessed 20th May 2015.

    The New York Times. (2014). APIs. Available from: http://developer.nytimes.com/docs. Accessed 20th May 2015.

    The World Bank. (n.d.). Data. Available from: http://data.worldbank.org/node/9. Accessed 20th May 2015.