Understanding Crowds’ Migration
on the Web
Yong Wang, Komal Pal, Aleksandar Kuzmanovic
Northwestern University
http://networks.cs.northwestern.edu
Yong Wang, Komal PalUnderstanding Crowd’s Migration on the Web
A User-Driven Web Network
Node: #unique visitors to a website.
Edge: #common visitors between the two endpoints.
Fig: Target graph (nodes such as MSN and CNN; node and edge labels: 5.8M, 6.1M, 14.3M, 4.3M, 19M, 2.3M, 1.3M, 4.7M, 2M)
Motivation
Study the Web from the point of view of its users
– Evaluate properties of network
  • Analyze user movement among websites
  • Determine properties of the user-driven Web network
  • Compare to Online Social Networks and “classical” Web networks
– Mine data to serve
  • Online advertisers
  • Search engines
Our Contributions
Generate the user-driven Web network
Study the user-driven Web
Apply the user-driven Web
Outline
Generate the user-driven Web network
Study the user-driven Web
Apply the user-driven Web
Information Reconstruction
Fact
– Plethora of information made publicly available on a daily basis
  • E.g., Google Trends, AdPlanner, Analytics, ALEXA, etc.
Problem
– The publicly available information snippets are not comprehensive
Approach
– Combine multiple data sources and develop methods to reconstruct globally meaningful information
Generating a User-Driven Web
Fig: a parent node with its child/edge nodes
Crawling
Breadth First Search for 15 days
3 seeds – nytimes.com, sina.com.cn, timesofindia.com
– US-centric network: ~297K nodes and 2M edges
– China-centric network: ~290K nodes and 2.7M edges
– India-centric network: ~297K nodes and 2.8M edges
Captured information:
  • Unique #users – Google AdPlanner
  • Shared users – Google Trends
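The crawl described above can be sketched as a plain breadth-first search. This is only an illustration: `fetch_related` is a hypothetical stand-in for whatever lookup returns the sites that share visitors with a given site (the slides name Google Trends as the source), and `max_nodes` is an assumed stopping rule used here in place of the 15-day time limit.

```python
from collections import deque

def bfs_crawl(seed, fetch_related, max_nodes=1000):
    """Breadth-first crawl starting from one seed website.

    fetch_related(site) is a hypothetical callback returning a dict
    {related_site: relative_weight}; it is not a real API.
    """
    visited = {seed}
    queue = deque([seed])
    edges = {}  # (parent, child) -> relative weight reported for that parent
    while queue:
        parent = queue.popleft()
        for child, weight in fetch_related(parent).items():
            edges[(parent, child)] = weight
            # Only enqueue unseen sites, and stop growing at max_nodes.
            if child not in visited and len(visited) < max_nodes:
                visited.add(child)
                queue.append(child)
    return visited, edges

# Toy data standing in for real query results.
toy = {
    "nytimes.com": {"cnn.com": 100, "msn.com": 40},
    "cnn.com": {"nytimes.com": 100, "msn.com": 60},
    "msn.com": {},
}
nodes, edges = bfs_crawl("nytimes.com", lambda site: toy.get(site, {}))
```

Each parent's weights are stored exactly as reported, so they still sit on that parent's own 100-based scale.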
Problems without Normalization
Network without Normalization (Problems!!!)
Fig: Sub-graph A (nodes A, C, D, E; edge weights 100, 20, 10)
Fig: Sub-graph B (nodes B, C, F, G; edge weights 100, 50, 25)
Fig: Merged graphs A&B without normalization (edge weights unchanged: 100, 20, 10 and 100, 50, 25)
The weight of the first child of each parent is always set to 100, so weights under different parents are on different scales
Ideal Normalized Network
Fig: Normalized graph – Target scenario (sub-graph A: nodes A, C, D, E with edge weights 100, 20, 10; sub-graph B: nodes B, C, F, G with edge weights 10, 5, 2.5)
Weights scaled w.r.t. weight(AD)
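A worked version of the rescaling in the figure, as a minimal sketch. It assumes that a joint query of the two parents puts their first children on one common scale. The joint scores (100 for A's first child, 10 for B's) and the assignment of weights to specific children are illustrative assumptions chosen to reproduce the numbers in the figure.

```python
def rescale(weights, ref_joint, own_joint):
    """Rescale one parent's child weights onto the reference parent's scale.

    ref_joint and own_joint are the scores of the two parents' first
    children when both parents are queried together (assumed inputs).
    """
    return {child: w * own_joint / ref_joint for child, w in weights.items()}

sub_graph_a = {"D": 100, "C": 20, "E": 10}   # reference scale, weight(AD) = 100
sub_graph_b = {"C": 100, "F": 50, "G": 25}   # B's own 100-based scale

normalized_b = rescale(sub_graph_b, ref_joint=100, own_joint=10)
print(normalized_b)  # {'C': 10.0, 'F': 5.0, 'G': 2.5}
```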
Normalization Process
Parent nodes
Relationship between Website 2 and child nodes of Website 1
Normalization Process
Phase 1: Select a starting point (a node with maximum in-degree, say C)
– Select a parent (A) of C, and a child (D) of A
– Normalize all other parent nodes to the weight of AD (by querying the parent nodes together with A)
  • Normalized nodes: nodes all of whose edges are normalized
Fig: nodes A, B, C, D, F, G (legend: normalized node; child of a normalized node)
Normalization Process
Phase 2: Back link from a child of a normalized node to its parent
– The weight of the forward link must be equal to the weight of the backward link
Fig: nodes A, B, C, D, E (legend: normalized node; child of a normalized node)
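One possible reading of Phase 2, sketched under assumptions: if the link from a normalized parent to its child already has a normalized weight, and the child's own query reports a raw weight for the link back to that parent, then the ratio of the two gives a scale factor for all of the child's raw weights. The function name and the sample numbers are hypothetical.

```python
def normalize_via_back_link(forward_weight, raw_child_weights, parent):
    """Scale a child's raw edge weights so its back link equals the forward link.

    forward_weight: normalized weight of the link parent -> child (known).
    raw_child_weights: the child's own unnormalized weights, including
    the back link to `parent`.
    """
    raw_back = raw_child_weights[parent]
    if raw_back == 0:
        raise ValueError("back link has zero weight; cannot derive a scale")
    return {
        node: w * forward_weight / raw_back
        for node, w in raw_child_weights.items()
    }

# Hypothetical numbers: A -> D is 20 on the normalized scale, while D's own
# query reports D -> A as 50 and D -> E as 100 on D's private scale.
d_edges = normalize_via_back_link(20, {"A": 50, "E": 100}, parent="A")
print(d_edges)  # {'A': 20.0, 'E': 40.0}
```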
Normalization Process
Phase 3: A child of a normalized node (D) shares a child (C) with a normalized node (A)
– We can normalize D (by querying it together with node A)
– Note: the shared child (green) could itself be either a normalized node or a child of a normalized node
Fig: nodes A, B, C, D, E (legend: either normalized node or a child of a normalized node; normalized node; child of a normalized node)
Normalization Process
Phase 4: A node (D) shares a child (C) with a normalized node (A)
– We can normalize D (by querying it together with node A)
– Note: Node D (black) is initially neither a normalized node nor a child of a normalized node
Fig: nodes A, B, C, D, E (legend: neither normalized node nor a child of a normalized node; either normalized node or a child of a normalized node; normalized node; child of a normalized node)
Normalization Process
Validation
– Popularity ranking of our normalized network compared to Google AdPlanner
– The two ranking results match in 91.66% of cases
Adding absolute traffic
– Google AdPlanner for #unique users
Unifying two scale systems
– Top 10 children are sufficient
– Relative weight -> Absolute weight
Outline
Generate the user-driven Web network
Study the user-driven Web
Apply the user-driven Web
Weighted Degree Distribution
– The sum of link weights for each node
– Log-normal distribution
  • OSN and WWW follow a power-law distribution
    – Small-traffic sites filtered by Google Trends
– Seed-free properties with distinctions
  • Extreme values
Fig: weighted degree distributions for the US, India, and China networks (annotations: minimum degree nodes; maximum degree nodes; high peak => strong connectedness)
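The weighted degree defined above (the sum of link weights for each node) can be computed as in this minimal sketch. The site names and weights are made up for illustration, and the edges are treated as undirected, which is an assumption.

```python
from collections import defaultdict

def weighted_degrees(edges):
    """Sum the weights of all links incident to each node (undirected view)."""
    degree = defaultdict(float)
    for (u, v), w in edges.items():
        degree[u] += w
        degree[v] += w
    return dict(degree)

# Toy shared-visitor counts in millions; values are illustrative only.
toy_edges = {
    ("a.com", "b.com"): 4.5,
    ("a.com", "c.com"): 2.0,
    ("b.com", "c.com"): 1.5,
}
degrees = weighted_degrees(toy_edges)
print(degrees)  # {'a.com': 6.5, 'b.com': 6.0, 'c.com': 3.5}
```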
Average Path Length and Diameter
– User-Driven Web has properties closer to Online Social Networks than to WWW
  • The human component makes the network more connected
– Larger average path length for the Chinese network
  • Because high-degree clusters in the core are loosely connected with low-degree clusters at edges
  • For the other 2 networks, high-degree clusters in the core are well connected to the nodes at the edges
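For reference, average path length and diameter can be computed with repeated breadth-first searches, as in the sketch below. It assumes unweighted hop counts on an undirected graph; the slides do not say whether the reported values use hops or weights, so treat this as an illustration of the metrics rather than the authors' procedure.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_stats(adj):
    """Return (average shortest-path length, diameter) over reachable pairs."""
    total, pairs, diameter = 0, 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    if pairs == 0:
        return 0.0, 0
    return total / pairs, diameter

# Toy undirected graph: a path a - b - c - d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
avg_len, diam = path_stats(adj)
print(avg_len, diam)
```

Running one BFS per node costs O(V * (V + E)), so on a graph with hundreds of thousands of nodes a sampled set of source nodes would be a more practical choice.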
Clustering Coefficient
– High clustering coefficients
  • 4 orders of magnitude higher than the corresponding random graphs
– Clustering coefficients uniform for the three networks
  • China:
    – High-degree and low-degree nodes are separately clustered and loosely connected
  • US:
    – High-degree nodes are clustered in the core while low-degree nodes are not well clustered
  • India:
    – A smaller difference between high- and low-degree node clusters
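As a reminder of what is being measured, the sketch below computes the standard unweighted local clustering coefficient and its average over all nodes. The slides do not state which definition was used, so this choice is an assumption, and the toy graph is illustrative.

```python
def local_clustering(adj, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = list(adj[node])
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in adj[neighbors[i]]
    )
    return 2 * links / (k * (k - 1))

def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)

# Toy undirected graph: triangle a-b-c plus a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
avg_cc = average_clustering(adj)
print(avg_cc)
```

Here a and b score 1, c scores 1/3 (one connected pair out of three), and d scores 0, so the average is 7/12.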
Network Properties
User-Driven Web is closer to Online Social Networks than to WWW in all properties
– The human component prevails
Seed-free properties
– Independent from the starting crawling point
Scale-free properties
– Independent from the network scale
Outline
Generate the user-driven Web network
Study the user-driven Web
Apply the user-driven Web
Online Advertising
Fig: user-driven Web graph with nodes such as MSN and CNN (node and edge labels: 5.8M, 6.1M, 14.3M, 4.3M, 19M, 2.3M, 1.3M, 1700, 2M)
Website Selector
Problem: Find the best selection of websites (ad hosts) that provides maximum visibility at minimum cost
Target users
– Independent advertisers
– Ad commissioners
Alternative approaches:
– Greedy
  • Choose the websites in descending order of their popularity
– Sub-optimal
  • Linear optimization without shared user information
Modeling
Inputs
– CPI model – random normal distribution
– User-driven web
– Budget
Output
– List of potential ad hosts providing maximum visibility within budget constraints
Optimization Problem
Maximize:
$\sum_i u_i x_i - \sum_j \sum_{k \neq j} s_{jk} x_j x_k$
subject to linear constraint:
$\sum_i c_i x_i \le B$
where
$x_i$ – website (node) $i$
$u_i$ – unique #users on node $x_i$
$s_{jk}$ – #shared users between $x_j$ and $x_k$
$c_i$ – CPI for node $x_i$
$B$ – budget constraint
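To make the objective concrete, here is a small sketch that evaluates it on a toy instance and compares an exhaustive search with the greedy baseline. Several things are assumptions: each x_i is treated as a 0/1 indicator of whether website i is selected, the double sum is taken literally over ordered pairs j ≠ k (so each shared-user count is subtracted twice), and brute force is used only because the instance is tiny; the slides do not say which solver the authors used. All numbers are made up.

```python
from itertools import combinations

def visibility(selected, users, shared):
    """Objective from the slide: sum of u_i minus s_jk over ordered pairs j != k."""
    gain = sum(users[i] for i in selected)
    overlap = sum(
        shared.get(frozenset((j, k)), 0)
        for j in selected
        for k in selected
        if j != k
    )
    return gain - overlap

def best_selection(users, shared, cost, budget):
    """Exhaustive search over all subsets; feasible only for tiny inputs."""
    best, best_value = (), 0
    sites = list(users)
    for r in range(1, len(sites) + 1):
        for subset in combinations(sites, r):
            if sum(cost[i] for i in subset) <= budget:
                value = visibility(subset, users, shared)
                if value > best_value:
                    best, best_value = subset, value
    return set(best), best_value

def greedy_selection(users, cost, budget):
    """Baseline: take sites in descending popularity while the budget allows."""
    chosen, spent = set(), 0
    for site in sorted(users, key=users.get, reverse=True):
        if spent + cost[site] <= budget:
            chosen.add(site)
            spent += cost[site]
    return chosen

# Toy instance: a and b are the two most popular sites but share many users.
users = {"a": 10, "b": 9, "c": 5}
shared = {frozenset(("a", "b")): 4}
cost = {"a": 5, "b": 5, "c": 5}
budget = 10

optimal, optimal_value = best_selection(users, shared, cost, budget)
greedy = greedy_selection(users, cost, budget)
greedy_value = visibility(greedy, users, shared)
print(optimal, optimal_value)
print(greedy, greedy_value)
```

In this toy case the greedy baseline picks the two most popular sites, a and b, and scores 11, while the exhaustive search skips b in favor of c and scores 15, because a and b overlap heavily.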
Performance Results
Greedy approach used as a baseline
Sub-optimal approach lacks shared-user information
– And hence doesn’t perform well in improving ad visibility
Website Selector improves performance by 22-25%
Eliminating High-Volume Websites
5% of the top 1,000 websites eliminated (volume >= 1M)
Several high-volume nodes are ignored because of a significant number of shared users
Fig: example with nodes such as MSN and CNN (node and edge labels: 2.9M, 11M, 23M, 1.2M, 0.7M; CPI ~$42, ~$49, ~$53), with one node crossed out (✗)
Conclusions
Generated user-driven web
– Used publicly available information
– Designed methods to fuse pieces into a global network
Studied user-driven web and its properties
– Scale- and seed-free network properties
– User-driven web different from “classical Web” but similar to Online Social Networks
Designed website selector
– Incorporates idea of “shared visitors” between websites
– Increases visibility of ads by 22-25%, increases revenue
– Tailored for ad commissioners
Thank You
http://networks.cs.northwestern.edu