Mosaic: Quantifying Privacy Leakage in Mobile Networks
Aleksandar Kuzmanovic
EECS DepartmentNorthwestern University
http://networks.cs.northwestern.edu
Buy 1 Get 1 Free
Mosaic– Joint work with
• Ning Xia (Northwestern University)• Han Hee Song (Narus Inc.)• Yong Liao (Narus Inc.)• Marios Iliofotou (Narus Inc.)• Antonio Nucci (Narus Inc.)• Zhi-Li Zhang (University of Minnesota)
Synthoid– Joint work with
• Marcel Flores (Northwestern University)
2
3
Scenario
IP1
ISP A … IPK IP1
CSP B … IPK
Different services
Dynamic IP,CSP/ISP
Different devices
Different “footprints”
4
Problem
Other research
work
We are here!Tessellation Mosaic
IP1
ISP A
… IPK
IP1
CSP B
… IPK
Input packet traces
How much private information can be obtained and expanded about end users by monitoring network traffic?
5
Motivation
IP1
ISP A
… IPK
IP1
CSP B
… IPK
I will know everything about everyone!
Agencies
Bad guys
Mobile Traffic:• Relevant: more personal information• Challenging: frequent IP changes
6
• How to track users when they hop over different IPs?
Challenges
With Traffic Markers, it is possible to
connect the users’ true identities to their sessions.
Sessions:Flows(5-tuple) are grouped into sessions
timeIP3
timeIP2
timeIP1Traffic Markers:Identifiers in the traffic that can be used to differentiate users
7
Datasets
• 3h-Dataset: main dataset for most experiments• 9h-Dataset: for quantifying privacy leakage• Ground Truth Dataset: for evaluation of session
attribution• RADIUS: provide session owners
Dataset Source Description
3h-Dataset CSP-A Complete payload
9h-Dataset CSP-A Only HTTP headers
Ground Truth Dataset CSP-B Payload & RADUIS info.
Methodology Overview
TessellationIP1
ISP A
… IPK
IP1
CSP B
… IPK
Mosaic construction
Network data analysis
Public web crawling
Mapping fromsessions to users
Traffic attribution
Viatraffic markers
Via activity fingerprinting
Combine information from both network data and OSN profiles to infer the user mosaic.
8
9
Traffic Attribution via Traffic Markers
Traffic Markers:
• Identifiers in the traffic to differentiate users• Key/value pairs from HTTP header• User IDs, device IDs or sessions IDs
Domain Keywords Category Source
osn1.com c_user=<OSN1_ID> OSN User ID Cookies
osn2.com oauth_token=<OSN2_ID>-## OSN User ID HTTP header
admob.com X-Admob-ISU Advertising HTTP header
pandora.com user_id User ID Cookies
google.com sid Session ID Cookies
How can we select and evaluate traffic markers from network data?
10
Traffic Attribution via Traffic Markers
OSN IDs as Anchors:• The most popular user identifiers among all services• Linked to user public profiles
OSN IDs can be used as anchors, but their coverage on sessions is too small
OSN Source Session Coverage
OSN1 ID HTTP URL and cookies 1.3%
OSN2 ID HTTP header 1.0%
Top 2 OSN providers from North America
Only 2.3% sessions contain OSN IDs
Block Generation: Group Sessions into Blocks
11
Traffic Attribution via Traffic Markers
99K session blocks generated from the 12M sessions
Session interval δ • Depends on the CSP• δ=60 seconds in our
study
Other sessions?OSN ID
≥δ
IP 1
Block• Session group on the
same IP from the same user
• Traffic markers shared by the same block
timeIP 1IP
timeIP
Block
Culling the Traffic Markers: OSN IDs are not enough
12
Traffic Attribution via Traffic Markers
• Uniqueness: Can the traffic marker differentiate between users?• Persistency: How long does a traffic marker remain the same?
Traffic markers
mobclix.com
#uid
OS
N1 ID
pandora.com#user_id
mobclix.com
#u
mydas.m
obi#m
ac-id
google.com#sid
craigslist.org#cl_b
Persistency ~= 1The value of Google.com#sid remains the same for the same user nearly all the observation duration
Uniqueness = 1No two users will share the same google.com#sid value
We pick 625 traffic markers with uniqueness = 1, persistency > 0.9
Traffic Attribution: Connecting the Dots
13
Traffic Attribution via Traffic Markers
IP 1
IP 2
IP 3
Same OSN ID
Same traffic marker
Traffic markers are the key in attributing sessions to the same user over different IP addresses
Tessellation User Ti
( )
14
Traffic Attribution via Activity Fingerprinting
• What if a session block has no traffic markers?
Assumption (Activity Fingerprinting):• Users can be identified from the DNS names of their
favorite services
DNS names:• Extracted 54,000 distinct DNS names• Classified into 21 classes
Serviceclasses
Serviceproviders
Search bing, google, yahoo
Chat skype, mtalk.googl.com
Dating plentyoffish, date
E-commerce amazon, ebay
Email google, hotmail, yahoo
News msnbc, ew, cnn
Picture Flickr, picasa
… …
Activity Fingerprinting:• Favorite (top-k) DNS names as the
user’s “fingerprint”
15
Traffic Attribution via Activity Fingerprinting
Y-axis:closer to 1, more distinct
the fingerprint is
Mobile users can be identified by the DNS names from their preferred services
Normalized DNS names
• Fi : Top k DNS names from user as “activity fingerprint”• : Uniqueness of the fingerprint
X-axis:normalized by the total number of DNS names
16
Traffic Attribution Evaluation
Ri
RADIUS user(Ground Truth)
Session
Ri Rj
Ti
Correct (Not complete)
Tj
Not correct
Coverage = -----------------------
identified sessions/users
total sessions/users
= -------------------
correctly identified sessions/users
total identified sessions/users
Accuracy on Covered Set
TiTessellation user
(Correct?)
17
Traffic Attribution Evaluation
• Evaluation Results
Session
User
Co
vera
ge
78.60%
69.00%
49.80%
43.20%
2.40%
15.70%
OSN ID extraction Via traffic markers Via activity fingerprinting
Session
User
Accu
racy o
n C
ove
red
Se
t
92.50%
96.40%
94.50%
99.30%
100.00%
100.00%
OSN ID extraction Via traffic markers Via activity fingerprinting
• Mosaic of Real-World User Alice
Example MOSAIC with 12 information classes(tesserae):• Information (Education, affiliation and etc.) from OSN profiles• Information (Locations, devices and etc.) from user sessions
18
Construction of User Mosaic
Least gain
Most gain
Sub-classes:Residence, coordinates, city, state, and etc.
• Leakage from OSN profiles vs. from Network Data
19
Information from OSN profiles and network data complement and corroborate each other
Analysis on network data provides real-time activities and locations
OSN profiles provide static user information (education, interests)
Quantifying Privacy Leakage
Information from both sides can corroborate to
each other
• Traffic markers (OSN IDs and etc.) should be limited and encrypted
Preventing User Privacy Leakage20
Protect traffic markers
Restrict3rd parties
• Third party applications/developers should be strongly regulated
Beyond Traffic Encryption
Trackers and Information Aggregators
21
Current Approaches (in the Advertising Domain)
Block or disrupt ad interaction– May disrupt regular site operation
Privacy preserving infrastructures– Requires participation of ad networks
Do Not Track, Opt-out mechanisms– Requires trust in ad networks
22
Synthoid
Endpoint User Profile Control
– User explicitly defines who he wants to be online, i.e., his online profile
– Synthoid imprints this profile into all possible trackers and information aggregators
23
Ad Network Synthoid
24
Synthoid Performance
Volume of Synthoid traffic varied 1% -- 100%
25
It is possible to completely alter the user profile with small amount of artificial traffic
• Prevalence in the use of OSNs leaves users’ true identities available in the network• Significant portions of flows can be attributed to
users, even without any direct identity leaks
• Ubiquitous encryption is not likely to take place, plus it does not solve the problem
• Our endpoint user profile control lets users explicitly define their online profiles• Works for all possible trackers at once• Requires no changes or consent from trackers
• Currently developing Synthoid for search, mobile, information aggregators, price discriminators, etc.
26
Conclusions
Thanks!
Questions?
27
http://networks.cs.northwestern.edu
28