connecting political data to media data

Post on 07-Jul-2015


DESCRIPTION

Presentation given at ASCoR Spring Colloquium ‘Big Data at the University of Amsterdam’ on February 18, 2014

TRANSCRIPT

Connecting political data to media data

Laura Hollink

VU University Amsterdam, Web & Media group

ASCoR Spring Colloquium ‘Big Data at the University of Amsterdam’, February 18, 2014

Laura Hollink, Damir Juric, Geert-Jan Houben

Martijn Kleppe, Max Kemman, Henri Beunders

Johan Oomen, Jaap Blom

Funded by Clarin-NL

Questions we want to answer

• Which events have attracted a lot of media attention?

• What are the differences between different media? E.g. in different newspapers, or newspapers vs. radio bulletins?

• Has the coverage changed over time?

• How are the events visualized (photos, layout of newspaper, etc.)?

Transcriptions of all 9,294 meetings of the Dutch parliament between 1945 and 1995, consisting of 1,208,903 speeches.

Archives of hundreds of newspapers, with tens of millions of articles published between 1618 and 1995. (We only use 1945-1995.)

Roughly 1.8 million news bulletins between 1937 and 1984. (We only use 1945-1995.)

PoliMedia methods

Step 1: Translate the Dutch parliamentary debates to the standard structured web format RDF

[Figure: RDF graph of the debate nl.proc.sgd.d.194519460000002 (rdf:type Debate, dc:date 1945-11-20, dc:language Dutch, dc:title "Handelingen Verenigde Vergadering..."). The graph shows the debate's parts (PartOfDebate, DebateContext, Speech) connected by hasPart, hasSubsequentPartOfDebate and hasSubsequentSpeech; a speech linked by sem:hasActor to Speaker_00064, with party Party_kvp (KVP, Katholieke Volkspartij) and role member_of_parliament; the politician Joannes Antonius James Barge (dc:source http://resolver.politicalmashup.nl/nl.m.00064); and a coveredIn link from the speech to the newspaper article http://resolver.kb.nl/resolve?urn=ddd:011198136:mpeg21:a0525:ocr.]

XML by War in Parliament Project

Modeling the debates as events

• An event has a date, a location, actors, and possibly sub-events.

• We build on the Simple Event Model (SEM).

• links to the original sources

• reusing existing vocabularies

[Figure: metadata of the debate nl.proc.sgd.d.194519460000002 (rdf:type Debate). Properties shown: dc:id, dc:source (twice), dc:publisher, dc:language, dc:date, dc:title. Values shown: nl.proc.sgd.d.19720000002; http://resolver.politicalmashup.nl/nl.proc.sgd.d.194519460000002; http://resolver.kb.nl/resolve?urn=sgd:mpeg21:19451946:0000002:pdf; http://statengeneraaldigitaal.nl/; Dutch; 1945-11-20; "Handelingen Verenigde Vergadering...".]

• the part-of structure and chronological order of the debates.

[Figure: the debate nl.proc.sgd.d.194519460000002 (dc:title "Handelingen Verenigde Vergadering...") has the part nl.proc.sgd.d.194519460000002.1 (rdf:type PartOfDebate), which is followed via hasSubsequentPartOfDebate by nl.proc.sgd.d.194519460000002.2. The part in turn has the parts nl.proc.sgd.d.194519460000002.1.1 (rdf:type DebateContext, hasText "De voorzitter opent de vergadering…") and nl.proc.sgd.d.194519460000002.1.2 (rdf:type Speech, hasSpokenText "Mijnheer de Voorzitter, de Commissie van …"); the speech is followed via hasSubsequentSpeech by nl.proc.sgd.d.194519460000002.1.3.]

• the different roles and parties that a speaker can have in his/her career.

[Figure: the speech nl.proc.sgd.d.194519460000002.1.2 (rdf:type Speech, hasSpokenText "Mijnheer de Voorzitter, de Commissie van …") is linked via sem:hasActor to Speaker_00064 and via coveredIn to http://resolver.kb.nl/resolve?urn=ddd:011198136:mpeg21:a0525:ocr. Speaker_00064 carries the properties hasSpeaker, hasParty (Party_kvp) and hasRole (member_of_parliament). Party_kvp has rdf:type Party, hasAcronym KVP and hasFullName Katholieke Volkspartij. The politician node (rdf:type Politician) has foaf:firstName, foaf:lastName and rdfs:label properties giving the name Joannes Antonius James Barge.]
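The part-of structure and the chronological order described above could be generated along the following lines. This is only a minimal sketch, not the project's actual conversion code: the base URI `http://example.org/polimedia/` and the helper names are hypothetical, and only the property names hasPart and hasSubsequentSpeech are taken from the slides.

```python
# Hypothetical base URI; the slides do not state the real namespace.
BASE = "http://example.org/polimedia/"


def uri(local_name):
    """Wrap a local name in angle brackets, as N-Triples expects."""
    return f"<{BASE}{local_name}>"


def debate_part_triples(part_id, speech_ids):
    """Return (subject, predicate, object) triples for one part of a debate.

    Every speech becomes a hasPart child of the part, and consecutive
    speeches are chained with hasSubsequentSpeech to keep their order.
    """
    triples = [(uri(part_id), uri("hasPart"), uri(s)) for s in speech_ids]
    for current, following in zip(speech_ids, speech_ids[1:]):
        triples.append((uri(current), uri("hasSubsequentSpeech"), uri(following)))
    return triples


def to_ntriples(triples):
    """Serialize triples as one N-Triples statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


part = "nl.proc.sgd.d.194519460000002.1"
speeches = [part + ".2", part + ".3"]
triples = debate_part_triples(part, speeches)
print(to_ntriples(triples))
```

Chaining consecutive items with an explicit property keeps the order queryable without relying on the numbering inside the identifiers.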

Step 2: Linking speeches in the debate to the newspaper articles that cover them

We created a linking method to deal with our two challenges:

1. How to link documents that are so different in nature?

2. Can we use the structure of the debates: people, chronological order of speeches, introductions to each new topic, etc.?

[Figure: linking pipeline. Debates → detect topics in speeches and detect Named Entities in speeches → create queries → queries → search newspaper archive → candidate articles → rank candidate articles → links between speeches and articles. Inputs derived from the debates: topics, Named Entities, name of speaker, date of debate.]


Intuition 1: The name of the speaker should appear in the article and the article should be published within a week of the debate
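Intuition 1 can be read as a filter on candidate articles. The sketch below is an illustration under assumptions, not the project's implementation: the function name is hypothetical, "within a week" is taken to mean at most seven days before or after the debate, the name match is a plain case-insensitive substring test, and the article snippets are made up.

```python
from datetime import date


def is_candidate(article_text, article_date, speaker_name, debate_date, max_days=7):
    """Return True if the article names the speaker and appeared near the debate."""
    names_speaker = speaker_name.lower() in article_text.lower()
    # Assumption: "within a week" means at most max_days before or after the debate.
    close_in_time = abs((article_date - debate_date).days) <= max_days
    return names_speaker and close_in_time


debate_day = date(1945, 11, 20)
# Made-up article snippets, for illustration only.
near = is_candidate("De heer Barge sprak gisteren ...", date(1945, 11, 21), "Barge", debate_day)
far = is_candidate("De heer Barge sprak ...", date(1946, 1, 5), "Barge", debate_day)
print(near, far)
```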


Intuition 2: The more the article and the speech overlap in terms of topics and named entities, the more they are related.
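Intuition 2 could be approximated by scoring each candidate article on how many of the speech's terms it contains. The slides do not specify the similarity measure, so the fraction-of-terms score below is an assumed stand-in; the function names, the terms, and the article texts are likewise hypothetical.

```python
def overlap_score(article_text, speech_terms):
    """Fraction of the speech's terms (topics and named entities) found in the article."""
    if not speech_terms:
        return 0.0
    text = article_text.lower()
    hits = sum(1 for term in speech_terms if term.lower() in text)
    return hits / len(speech_terms)


def rank_candidates(candidates, speech_terms):
    """Sort (article_id, text) pairs by descending overlap with the speech."""
    return sorted(
        candidates,
        key=lambda candidate: overlap_score(candidate[1], speech_terms),
        reverse=True,
    )


# Made-up terms and articles, for illustration only.
terms = ["KVP", "Indonesië", "begroting"]
candidates = [
    ("a1", "De begroting werd besproken."),
    ("a2", "De KVP sprak over Indonesië en de begroting."),
]
ranked = rank_candidates(candidates, terms)
print([article_id for article_id, _ in ranked])
```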

Evaluation: what do we use to rank the candidate articles?

• Experiment on 150 <newspaper article, speech in debate> pairs, 2 raters, K = 0.5

• Compare text of candidate articles to:

  • Setting 1: Named Entities in speech

  • Setting 2: Named Entities + Topics in speech

  • Setting 3: Named Entities + Topics in speech and larger part-of-debate

Score                                Setting 1   Setting 2   Setting 3
I don’t know                         0.14        0.15        0.08
0 - unrelated                        0.38        0.23        0.12
1 - related                          0.29        0.36        0.36
2 - explicit mention of the debate   0.19        0.26        0.44
1 + 2                                0.48        0.62        0.80
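The combined "1 + 2" row is the sum of the "related" and "explicit mention" rows. As a quick check of that arithmetic, the figures can be recomputed from the per-score fractions; the dictionary keys below are shorthand labels chosen for this sketch.

```python
# Fractions per score, copied from the evaluation results.
scores = {
    "Setting 1": {"unknown": 0.14, "unrelated": 0.38, "related": 0.29, "explicit": 0.19},
    "Setting 2": {"unknown": 0.15, "unrelated": 0.23, "related": 0.36, "explicit": 0.26},
    "Setting 3": {"unknown": 0.08, "unrelated": 0.12, "related": 0.36, "explicit": 0.44},
}


def related_or_explicit(distribution):
    """Share of pairs rated 1 (related) or 2 (explicit mention of the debate)."""
    return round(distribution["related"] + distribution["explicit"], 2)


combined = {name: related_or_explicit(dist) for name, dist in scores.items()}
print(combined)
```

Each setting's four fractions add up to 1, and the share of related or explicitly mentioning articles rises from 0.48 to 0.62 to 0.80 across the three settings.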

Results

• An open data set of Dutch parliamentary debates,

• with almost 3 million links between 450,000 speeches and the URLs of 1.5 million newspaper articles and radio bulletins at the National Library,

• accessible through a Web demonstrator and through a SPARQL endpoint.

Demo

SPARQL endpoint

• A service to query a knowledge base using the SPARQL query language.

“All speeches with more than 60 associated news items.”

SELECT ?speech ?no_news_items
WHERE {
  {
    SELECT ?speech (COUNT(?news) AS ?no_news_items)
    WHERE {
      ?speech <http://purl.org/linkedpolitics/nl/polivoc#coveredAt> ?news .
    }
    GROUP BY ?speech
  }
  FILTER (?no_news_items > 60)
}
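To make the logic of this query concrete, the same grouping, counting, and filtering can be mimicked in plain Python over a list of (speech, news item) pairs. This is only an illustration of what the query computes, not a way to reach the endpoint; the function name and the sample identifiers are hypothetical.

```python
from collections import Counter


def speeches_with_many_news_items(links, threshold=60):
    """Mimic GROUP BY ?speech, COUNT(?news) and FILTER (count > threshold).

    `links` is an iterable of (speech, news_item) pairs, assumed distinct,
    one pair per link between a speech and a news item.
    """
    counts = Counter(speech for speech, _news in links)
    return {speech: n for speech, n in counts.items() if n > threshold}


# Made-up links: speech_a has 61 news items, speech_b has exactly 60.
links = [("speech_a", f"news_{i}") for i in range(61)]
links += [("speech_b", f"news_{i}") for i in range(60)]
result = speeches_with_many_news_items(links)
print(result)
```

Because the filter is a strict "more than 60", a speech with exactly 60 associated news items is not returned.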

Reflection: to what extent can we answer these questions?

• Which events have attracted a lot of media attention?

• What are the differences between different media? E.g. in different newspapers, or newspapers vs. radio bulletins?

• Has the coverage changed over time?

• How are the events visualized (photos, layout of newspaper, etc.)?

Future work

• More types of links

• From just “coveredIn” to “quotedIn”, “coveredIn”, “backgroundOf”, “talksAbout”

• More types of media

• More types of (political) events.

Project ‘Talk of Europe / Traveling Clarin Campus’, 2014-2015

Funded by CLARIN-ERIC

From left to right: Max Kemman, Marnix van Berchum, Laura Hollink, Astrid van Aggelen, Steven Krauwer, Henri Beunders. (Unfortunately, Martijn Kleppe and Johan Oomen were not present to join the group pic.)

Plans of ‘ToE/TTC’

1. Publish proceedings of the EU parliamentary debates in RDF

  • hosted by DANS

2. Organize 3 workshops/hackathons/‘Traveling Clarin Campuses’ in which we invite international partners to work with the data.

3. In collaboration with international partners:

  • enrich with annotations, e.g. topics, structured data about people, parties, etc.

  • link to national datasets, e.g. media or national parliaments
