![Page 1: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/1.jpg)
Headline Analysis
John Qiu
William McKeehan
Joshua Chavarria
![Page 2: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/2.jpg)
Test Questions
1. What is graph clustering?
2. What is one of the graph clustering algorithms that was implemented in our
headline analysis?
3. What is the name of the API used to collect our data?
![Page 3: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/3.jpg)
John Qiu
• Born in China, came to America at age 2 - Grew up in Franklin, TN
• BBA in Economics, Minor in Math - May 2014
• MS in Business Analytics - Dec 2016
• Work at Oak Ridge National Lab - Health Data Sciences Institute
• Focus on Natural Language Processing
![Page 4: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/4.jpg)
William McKeehan
www.mckeehan.info
![Page 5: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/5.jpg)
Joshua Chavarria
• Computer Science Major
• Hometown: Los Angeles, CA
• Interests:
• Gaming
• Soccer
• Guitar
• Traveling
![Page 6: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/6.jpg)
Introduction
● With headline analysis we are
clustering keywords in headlines
from a variety of sources in order
to compare them.
● Our hypothesis is that sources
with different perspectives are
going to have different
associations within their headlines
● (For example, CNN is more likely
to have Trump in a headline with
Russia, whereas Fox might have
Trump mentioned with Business.)
![Page 7: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/7.jpg)
Motivation
• We believe that by looking at the associations within the
headlines of the sources, we can identify the different narratives
of each source.
• Goal: Compare a subset of news sources in order to show that
sources with differing perspectives would have different
associations within their headlines
![Page 8: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/8.jpg)
Outline
• Approach
• Overview
• Algorithms
• Applications
• Implementation
• Open Issues
![Page 9: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/9.jpg)
Approach
1) Gather headlines from news sources
2) Extract entities
3) Note relationships between co-occurrences
4) Use clustering algorithms to aggregate
the relationships and compare sources
![Page 10: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/10.jpg)
Overview of Cluster Analysis
• Cluster Analysis is not an algorithm, but rather a group of algorithms
• Any nonuniform data contains underlying structure due to the
heterogeneity of the data. The process of identifying this structure in
terms of grouping the data elements is called clustering
• Graph clustering is the process of finding sets of related vertices in a
graph and grouping them into “clusters”.
• This is a common technique amongst various fields, such as
statistical data analysis, data mining, and pattern recognition.
![Page 11: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/11.jpg)
Overview of Cluster Analysis: Visual Example
![Page 12: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/12.jpg)
Overview of Cluster Analysis
• Given a data set, the goal of clustering is
to divide the data set into clusters such
that the elements assigned to a particular
cluster are similar or connected.
• Desirable Cluster Properties in Graphs:
• At least one path connecting each pair
of vertices within a cluster.
• If vertex u can’t reach vertex v, they
should not be in the same cluster.
• A subset of vertices forms a good
cluster if the induced subgraph is
dense.
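The properties above can be checked directly on a small graph. The following is a minimal sketch, assuming an unweighted undirected graph given as a list of unique edges with no self-loops; the function names and the toy edge list are illustrative and not taken from the project.

```python
from collections import deque

def induced_edges(edges, cluster):
    """Return the edges whose endpoints both lie inside the cluster."""
    members = set(cluster)
    return [(u, v) for u, v in edges if u in members and v in members]

def is_connected(edges, cluster):
    """Breadth-first search restricted to the cluster's induced subgraph."""
    members = set(cluster)
    if not members:
        return True
    neighbors = {v: set() for v in members}
    for u, v in induced_edges(edges, members):
        neighbors[u].add(v)
        neighbors[v].add(u)
    start = next(iter(members))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen == members

def density(edges, cluster):
    """Fraction of possible vertex pairs in the cluster joined by an edge."""
    n = len(set(cluster))
    if n < 2:
        return 0.0
    return len(induced_edges(edges, cluster)) / (n * (n - 1) / 2)

# Hypothetical toy graph: a triangle a-b-c, a tail c-d, and a separate edge e-f.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("e", "f")]
print(is_connected(edges, ["a", "b", "c"]), density(edges, ["a", "b", "c"]))
print(is_connected(edges, ["a", "e"]), density(edges, ["a", "e"]))
```

A candidate cluster such as the triangle passes both checks, while a set such as {a, e} fails the reachability property and would not be a sensible cluster.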
![Page 13: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/13.jpg)
Graph Clustering Algorithms: Intro
In a graph setting, clustering means partitioning the graph so that edge weights within a
group are large and edge weights across groups are small
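One way to make this criterion concrete is to sum the edge weights that fall inside clusters and the weights that cross between clusters for a given partition. This is only a sketch: the function name, the words, and the weights below are made-up illustrations, not results from the headline data.

```python
def weight_split(weighted_edges, assignment):
    """Sum edge weights inside clusters and across clusters.

    weighted_edges: iterable of (u, v, weight) for an undirected graph.
    assignment: dict mapping each vertex to a cluster label.
    """
    within = across = 0.0
    for u, v, w in weighted_edges:
        if assignment[u] == assignment[v]:
            within += w
        else:
            across += w
    return within, across

# Hypothetical weighted co-occurrence edges.
weighted_edges = [
    ("trump", "russia", 5),
    ("russia", "putin", 4),
    ("trump", "putin", 3),
    ("putin", "health", 1),
    ("health", "bill", 6),
]
good = {"trump": 0, "russia": 0, "putin": 0, "health": 1, "bill": 1}
bad = {"trump": 0, "russia": 1, "putin": 0, "health": 1, "bill": 0}
print(weight_split(weighted_edges, good))
print(weight_split(weighted_edges, bad))
```

The first partition keeps almost all of the weight inside clusters, while the second cuts most of it, which is the sense in which the first is the better clustering.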
![Page 14: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/14.jpg)
Algorithms: Hierarchical Clustering
• A global clustering algorithm that creates a
hierarchical decomposition of sets of objects
using a similarity matrix.
• Two Methods
• Agglomerative Approach (Bottom-Up)
• Divisive Approach (Top-Down)
• Advantages:
• Easy to implement and more robust to
noise.
• Disadvantages:
• Computationally demanding for large
data sets.
• Hard to identify clusters from the dendrogram
![Page 15: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/15.jpg)
Agglomerative Hierarchical Clustering Pseudo Code:
Using Cosine Similarity as Similarity Measure:
• Initialize all vertices as individual clusters
• Using the Adjacency Matrix, calculate pairwise similarity between all vertices
• Merge the two most similar clusters, where the similarity between two clusters is either:
• That of their most similar pair of vertices (Single-linkage clustering) or
• That of their least similar pair of vertices (Complete-linkage clustering)
• Update the Adjacency Matrix to reflect the merged cluster
• Repeat until all vertices are in a single cluster
Complexity:
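The pseudo code can be sketched in plain Python as follows. This is an illustrative, unoptimized version and not the project's implementation. It assumes the adjacency matrix includes self-loops (ones on the diagonal) so that directly connected vertices count as sharing a neighbor, it stops at a requested number of clusters instead of building the full hierarchy, and it recomputes cluster similarities from the original pairwise values rather than updating a matrix in place.

```python
import math

def cosine(a, b):
    """Cosine similarity between two adjacency-matrix rows."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def agglomerate(adjacency, num_clusters, linkage="single"):
    """Merge clusters bottom-up until num_clusters remain."""
    n = len(adjacency)
    sim = [[cosine(adjacency[i], adjacency[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]  # every vertex starts as its own cluster
    # Single linkage scores two clusters by their most similar pair of vertices,
    # complete linkage by their least similar pair.
    pick = max if linkage == "single" else min
    while len(clusters) > max(num_clusters, 1):
        best, best_pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = pick(sim[i][j] for i in clusters[a] for j in clusters[b])
                if score > best:
                    best, best_pair = score, (a, b)
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters

# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
# The diagonal is set to 1 (self-loops), which is an assumption of this sketch.
adjacency = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
print(agglomerate(adjacency, 2, linkage="single"))
print(agglomerate(adjacency, 2, linkage="complete"))
```

On this toy graph both linkage rules recover the two triangles; on noisier graphs they can disagree, since single linkage tends to chain clusters together through one strong pair.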
![Page 16: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/16.jpg)
Applications
• Clustering is often used to automatically generate feature representation for
data corresponding to a defined similarity measure.
• Specific uses include:
• Dimensionality reduction
• Multi-objective optimization
• Outlier/Anomaly detection
• Segmentation
• Applications:
• Recommendation systems - classifying users based on preferences
• Image Segmentation - classifying sections of images based on similar
pixels
![Page 17: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/17.jpg)
Implementation: Data Collection
• Collect/compare headlines
• EventRegistry.org
• Free
• Over 100,000 news publishers
• API
• Python Library
![Page 18: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/18.jpg)
Bad Data Examples
• “DA seeks to revoke bond for accused drunk driver”
• “Levant Mediterranean dishes up small plates with big
flavor”
• “Manalapan (2) at Colts Neck (19) - Girls Lacrosse”
• “Checheche Catholic priest in sex scandal - Nehanda
Radio”
![Page 19: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/19.jpg)
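Headline Data Summary

| Native Media | Source | # Articles | Political Affiliation |
|---|---|---|---|
| Agency | Reuters | 426 | NA |
| Agency | Associated Press | 688 | NA |
| Cable | Fox News | 184 | Trump |
| Cable | MSNBC | 60 | Clinton |
| Internet | Breitbart | 81 | Trump |
| Internet | The Huffington Post | 254 | Clinton |
| Network | ABC News | 134 | Both |
| Network | CBS News | 78 | Both |
| Network | NBC News | 96 | Both |
| Newspaper | New York Times | 306 | Clinton |
| Radio | NPR.org | 158 | Clinton |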
![Page 20: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/20.jpg)
Descriptive Statistics
![Page 21: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/21.jpg)
How Can We use Clustering to Analyze our Headlines
And Compare Sources?
We will be working with weighted undirected graphs to represent our data in two ways:
Word Level Representation:
Clustering on a single source’s word co-occurrence graph yields an abstraction of
related content that can be compared between sources.
Document Level Representation:
Compute similarity measures between the representations of all documents to
reveal similar headlines.
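A word-level graph of this kind can be sketched as follows: each distinct word is a vertex, and the weight of an undirected edge is the number of headlines in which both words appear. The tokenizer, the function name, and the example headlines are assumptions made for illustration and are not the project's data or code.

```python
from collections import Counter
from itertools import combinations
import re

def cooccurrence_graph(headlines):
    """Map each unordered word pair to the number of headlines containing both."""
    weights = Counter()
    for headline in headlines:
        # Lowercase, keep alphanumeric tokens, and count each word once per headline.
        words = sorted(set(re.findall(r"[a-z0-9]+", headline.lower())))
        # Sorting makes every pair appear as (smaller, larger), so edges are undirected.
        for u, v in combinations(words, 2):
            weights[(u, v)] += 1
    return weights

# Hypothetical headlines.
headlines = [
    "Trump meets Putin",
    "Putin and Trump discuss Russia",
    "Senate debates health bill",
]
graph = cooccurrence_graph(headlines)
print(graph[("putin", "trump")])
print(graph[("health", "trump")])
```

Building one such graph per source and clustering each one gives per-source word groupings that can then be compared.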
![Page 22: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/22.jpg)
How do Computers See/Read/Get Information
From Text?
1) Learn to Count Words
2) Learn which Words to count
3) Learn to produce representations of words
![Page 23: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/23.jpg)
1) Term Document Vector/Matrix (Salton 1968)
Definition: A document D from a corpus with n unique
terms can be represented by a Term Document
Vector D = [d1,...,dn] of length n, where di is the count of the i-th term in D
Pros:
• Quick to generate/normalize.
• Simple to interpret
• Introduced similarity measure to text data -
Euclidean Distance and Centroid clustering
(Salton 1975)
Cons:
• Huge dimensionality, but really sparse
• No language structure - word order is lost
• Not how words work
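As a small illustration of the definition, the sketch below builds term document vectors for a few made-up headlines and compares them with Euclidean distance. The `min_df` parameter (drop terms that appear in fewer than `min_df` documents) is an assumed way of shrinking the vocabulary, and all names and example texts here are hypothetical.

```python
from collections import Counter
import math
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocab(docs, min_df=1):
    """Sorted list of terms that occur in at least min_df documents."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # count each term once per document
    return sorted(term for term, count in df.items() if count >= min_df)

def term_vector(doc, vocab):
    """Vector of term counts for one document, ordered like vocab."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocab]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical headlines.
docs = ["Trump meets Putin", "Putin meets Trump again", "Health bill vote delayed"]
vocab = build_vocab(docs)
vectors = [term_vector(d, vocab) for d in docs]
print(vocab)
print(euclidean(vectors[0], vectors[1]), euclidean(vectors[0], vectors[2]))
print(build_vocab(docs, min_df=2))
```

The first two headlines differ by a single word and end up close together, while the third shares no terms with them and is farther away. Note that word order plays no role, which is the "no language structure" drawback listed above.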
![Page 24: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/24.jpg)
Reuters
num articles: 426
orig vocab size 1587
mindf2 vocab size 607
vocab size 607
clust finished in 0.463397979736
words related to trump
right
rutte
fillon
![Page 25: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/25.jpg)
Associated Press
---- Associated Press -------------------------------------
num articles: 688
orig vocab size 1687
mindf2 vocab size 860
vocab size 860
clust finished in 0.377697944641
words related to trump
conservative
![Page 26: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/26.jpg)
---- Fox News -------------------------------------
num articles: 184
orig vocab size 938
mindf2 vocab size 277
vocab size 277
clust finished in 0.170491933823
words related to trump
to
2016
struggle
starts
but
own
![Page 27: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/27.jpg)
---- MSNBC -------------------------------------
num articles: 60
orig vocab size 274
mindf2 vocab size 64
vocab size 64
clust finished in 0.00706195831299
words related to trump
up
![Page 28: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/28.jpg)
Breitbart
num articles: 81
orig vocab size 559
mindf2 vocab size 137
vocab size 137
clust finished in 0.0235621929169
words related to trump
gorsuch
for
or
clinton
![Page 29: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/29.jpg)
The Huffington Post
num articles: 254
orig vocab size 1239
mindf2 vocab size 362
vocab size 362
clust finished in 0.130997180939
words related to trump
election
didn
nomination
now
moonlight
![Page 30: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/30.jpg)
ABC News
num articles: 134
orig vocab size 633
mindf2 vocab size 177
vocab size 177
clust finished in 0.0359399318695
words related to trump
lawmakers
aca
bill
listening
her
blueprint
himself
prosecutor
![Page 31: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/31.jpg)
CBS News
num articles: 78
orig vocab size 457
mindf2 vocab size 87
vocab size 87
clust finished in 0.0120129585266
words related to trump
putin
health
russia
![Page 32: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/32.jpg)
NBC News
num articles: 96
orig vocab size 513
mindf2 vocab size 176
vocab size 176
clust finished in 0.0229661464691
words related to trump
flynn
![Page 33: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/33.jpg)
The New York Times
num articles: 306
orig vocab size 1262
mindf2 vocab size 345
vocab size 345
clust finished in 0.0602300167084
words related to trump
independence
let
pen
post
france
america
nears
ties
looks
foreign
pennsylvania
being
stories
at
![Page 34: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/34.jpg)
Results
Total
num articles: 2465
orig vocab size 4636
mindf2 vocab size 2332
vocab size 2332
clust finished in 11.3553888798
words related to trump
governing
negotiate
feeling
that
camp
bad
citizens
gay
backing
demands
beijing
sparks
homes
partner
hike
![Page 35: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/35.jpg)
2) Better Representations from Labeled Datasets
Part of Speech Tagging:
Brown Corpus (1960s): 1,000,000 words tagged with part of speech
Lemmatization - mapping words to a root form:
E.g. [Franch, French] -> French
![Page 36: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/36.jpg)
Open Issues
• Parameter selection
• Scalability
• Evaluation
• Fake News
![Page 37: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/37.jpg)
Issue - Parameter selection
• How do you
determine the
parameter values to
give as input to the
clustering algorithm?
![Page 38: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/38.jpg)
Issue - Scalability
• How do the runtime and
memory consumption of the
algorithm behave for massive
input graphs?
![Page 39: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/39.jpg)
Issue - Evaluation
• How do we decide which clustering is the best?
![Page 40: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/40.jpg)
Issue - Fake News
![Page 41: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/41.jpg)
![Page 42: Headline Analysis - UTKweb.eecs.utk.edu/.../headline-analysis.pdf · Headline Data Summary. Descriptive Statistics. How Can We use Clustering to Analyze our Headlines And Compare](https://reader033.vdocuments.mx/reader033/viewer/2022051910/5fffc7ccd0ac98780008bff4/html5/thumbnails/42.jpg)
Discussion