twarfing: gathering intelligence from twitter data
DESCRIPTION
A Presentation on the use of Semantic Web technology in analyzing Twitter data for attacks against users. This is meant to illustrate how Semantic Web technology can be useful in real-world applications.TRANSCRIPT
Twarfing: Gathering Intelligence from Twitter Data
Morton SwimmerSenior Threat Researcher
Trend Micro, Inc.
Thursday, January 21, 2010
Overview
• What is Twitter?
• Malicious activity on Twitter
• URL shortening services
• The Twarfing architecture
• Examples
• Conclusions
Thursday, January 21, 2010
Thanks
• Ben April, Trend, for providing the massive hardware
• Based in part on a Virus Bulletin Presentation with Costin Riau, Chief of R&D, Kaspersky Labs
Thursday, January 21, 2010
What is Twitter
• Founded by Jack Dorsey, Biz Stone and Evan Williams back in 2006
• Publish/Subscribe Communications system
• SMS
• Web site
• WS API and RSS
• Application
Thursday, January 21, 2010
Twitter Internals
• 140 chars max to be SMS compatible
• SMS has a 160 char restriction
• But Twitter needed to add the user name
• Just text and metadata
Thursday, January 21, 2010
Users not necessarily human!
• Devices
• From buoys to power meters
• Search for Twitter on instructables.com
Thursday, January 21, 2010
Users not necessarily human!
• Not surprising that malware would use it, but
• Not good for Command and Control
• Easily blocked after detection
• … and twitter has been trigger happy with blocking
Thursday, January 21, 2010
August 2008: Malware on Twitter
Thursday, January 21, 2010
April 2009 – Twitter gets hit by XSS worm
• Multiple variants of JS.Twettir.a-h identified
• Thousands of spam messages containing the word "Mikeyy“ filled the timeline
• Proof of concept – no malicious intent
• Later, the author (Mikey Mooney) got a job at exqSoft Solutions, a web security company
• September 2009 - Yet another attack
• Phishing for the login details
Thursday, January 21, 2010
2009 Trending topics start being exploited
Thursday, January 21, 2010
June 2009 – Koobface spreading in Twitter
• Originally only targeted Facebook and MySpace users
• Now spreads through more social networks:
• Facebook, MySpace, Hi5, Bebo, Tagged, Netlog and most recently… Twitter
Thursday, January 21, 2010
Twitter architecture, historically
• Multiple Ruby on Rails servers
• Mongrel HTTP servers
• Centrally MySQL backed
Thursday, January 21, 2010
Twitter architecture, now• Details super-secret, but
this is what we think
• Front end
• Ruby-based front end
• Mongrel HTTP servers
• Back end
• Starling for queuing/messaging
• Scala-based code
• MySQL
• denormalized data whenever possible
• Only for backup and persistance
• Lots of caching using memcached
Thursday, January 21, 2010
Stats (June 2009)• 25M users
• I measured 475K different users posting over a one week period in August 2009
• 300 tweets/sec
• MySQL handles 2400 reqs per second
• API traffic == 10x website traffic!
• Indicates that far more people are using applications
• TweetDeck, Twitteriffic, Digsby, Twhirl
• Many are Adobe Air based (!)
• One key to Twitter's success!
Thursday, January 21, 2010
<id><user><text><created_at><source><truncated><in_reply_to_status_id><in_reply_to_user_id><favorited><in_reply_to_screen_name>
Twitter metadata
user information
Tweet text
in each tweet!
Thursday, January 21, 2010
Tweet text• Retweets: RT passes the
note along
• now also in metadata
• Location: L tells friends where I am
• Hashtags: #
• topics
• show group associations
• just for tagging
• User reference: @ for public discussion
• URLs
Thursday, January 21, 2010
Long URLs, short URLs
• URL shortening services have grown up around Twitter
• longurl.org counts 208 different ones
• Malicious URLs are one potential threat
• obscure the true URL
• May become malicious later
• RickRolling, but maliciously
• Benefits:
• ‘bit.ly’ and others blocks malicious URLs
Thursday, January 21, 2010
Google SB API
• August 2009
• Twitter began filtering malicious URLs
• Mikko Hyppönen:
• seemed to indicate Google SB API!
• After more testing, we discovered it used some additional filtering
Thursday, January 21, 2010
<id><name><screen_name><location><description><profile_image_url><url><protected><followers_count><profile_background_color><profile_text_color><profile_link_color><profile_sidebar_fill_color><profile_sidebar_border_color><friends_count><created_at>
<favourites_count><utc_offset><time_zone><profile_background_image_url><profile_background_tile><statuses_count><notifications><following><verified>
User data
Thursday, January 21, 2010
Twitter to RDF
• Goal
• Avoid inventing own vocabulary
• Makes it easier to mix-in data later
• eg. Facebook, Tumblr, ...
• Most Twitter data fit readily into the SIOC ontology
• The rest used DublinCore
• Proprietary ontology for internal data
Thursday, January 21, 2010
Using existing vocabs
• DublinCore, http://dublincore.org/
• SIOC, http://sioc-project.org/
• describe information from online communities
• FOAF, http://www.foaf-project.org/
• machine-readable pages describing people
• GeoOWL
Thursday, January 21, 2010
Example
<http://twitter.com/float3r/status/3492845110> a sioc:Post ; dc:created "Sun, 23 Aug 2009 15:10:31 +0000" ; sioc:has_creator <http://twitter.com/float3r> ; sioc:content "RT @THErealDVORAK http://www.usdoj.gov/ndic/pubs31/31379/31379p.pdf"@no ;.<http://twitter.com/float3r> a sioc:User ; sioc:id "251294"^^xsd:integer ; rdfs:label "float3r" ; sioc:avatar "http://a1.twimg.com/profile_images/53008680/Picture_022_normal.jpg"^^xsd:anyURI ;.
float3r: RT @THErealDVORAK http://www.usdoj.gov/ndic/pubs31/31379/31379p.pdfSun, 23 Aug 2009 15:10:31 +0000
Thursday, January 21, 2010
Example, continued
float3r: RT @THErealDVORAK http://www.usdoj.gov/ndic/pubs31/31379/31379p.pdfSun, 23 Aug 2009 15:10:31 +0000
Thursday, January 21, 2010
WhiteTwarf/RedTwarf
White Twarf
Thursday, January 21, 2010
Whitetwarf
• Receives a subset of the tweets via twitter search
• Stores metadata from twitter
• Processes text part for internal metadata
• User references, hashtags, Informal tags, URLs
• Creates canonical text representations
• Export to an RDF store for analysis
Thursday, January 21, 2010
WhiteTwarf architecture v.2
Twitter Stream tweet processing
couchDB
tweets users
Domain Reputation
URLs
URL processing
RDF Converter
RDF Converter
4Store
RDF Graph
Thursday, January 21, 2010
Tweet Processing
• Tweet fetch process
• Fetch a limited number of tweets as JSON
• Extract metadata and text
• Separate user metadata from tweet metadata
• Store in the CouchDB database
• Tweet processing
• Extract Tags, URLS, user references
• Text signatures
• Meant to remove small differences in text
• Normalization and whitespace removal
• UTF-8 tricks expansion/removal
• CouchDB views
Thursday, January 21, 2010
CouchDB
• Scalable document and key-value store
• APIs REST and JSON based
• Embedded Javascript and MapReduce for querying
• http://couchdb.apache.org/
Thursday, January 21, 2010
4Store
• Scalable RDF Store
• Developed at Garlik in C and is GPLed
• Has REST and native APIs
• Queries via SPARQL
• http://4store.org/
Thursday, January 21, 2010
RDF Stores in general
• Store RDF triples (often as quads or named graphs)
• Have recently become much more scalable
• Still have limitations, though
• eg. have locking issues on writes
• Allegrograph 4 (Franz, Inc) may resolve some of these for us
Thursday, January 21, 2010
URL redirector processing
• For every URL extracted by the view we follow the link
• In most cases we get a 30x response (redirect)
• These are also followed after storing
• Testing showed: usually faster to use shortener APIs
• So we are testing code that uses API instead for known shorteners
• We also capture other HTTP metadata
• Basically we are looking for malicious links
Thursday, January 21, 2010
Detection malicious activity
• Data exported to an RDF Store
• This is a graph database
• Allows for complex queries
• Does have some performance issues and is not real time
• Simple Attack scenario
• User is observed to post to a malicious domain
• We want to see what else he has posted
Thursday, January 21, 2010
Correlation with domain reputations
mal
http://unk.com/what.exe
tweet/1234
malicious
mal.com
http://mal.com/evil.exe
tweet/5678
tw:posts
tw:posts
tw:hasURL
tw:hasURL
drs:hasFQDN
drs:hasRating
Note that the examples in this
presentation are all radically simplified
for clarity.
Thursday, January 21, 2010
Matching this in SPARQL
?m
?u2
?t1
malicious
?f1
?u1
?t2
tw:posts
tw:posts
tw:hasURL
tw:hasURL
drs:hasFQDN
drs:hasRating
Thursday, January 21, 2010
Matching this in SPARQL
?m
?u2
?t1
malicious
?f1
?u1
?t2
posts
posts
tw:hasURL
tw:hasURL
drs:hasFQDN
drs:hasRating
SELECT ?m ?u2WHERE {?m tw:posts ?t1.?t1 tw:hasURL ?u1.?u1 drs:hasFQDN ?f1.?f1 drs:hasRating drs:Malicious.?m tw:posts ?t2.?t2 tw:hasURL ?u2.}
Thursday, January 21, 2010
Advantages of using RDF
• Avoid multiple explicite joins
• An intuitive visualization available
• Mixing in new data as easy as adding it to the RDF store
Thursday, January 21, 2010
Another attack scenario
• Observed: User modified URL on retweet to be malicious
@iceman: This link is cool http://cool.com/ice.html
@notniceman: RT: @iceman: This link is cool http://c00l.com/ice.exe
iceman
notniceman
tweet/1001
tweet/1005
http://cool.com/ice.html
http://c001.com/ice.exe
thislinkiscool
tw:posts
tw:posts
tw:hasTextSignature
tw:hasURL
tw:hasURL
Thursday, January 21, 2010
Matching it
?m1
?m2
?t1
?t2
?u1
?u2
?ts
tw:posts
tw:posts
tw:hasTextSignature
tw:hasURL
tw:hasURL
≠
Thursday, January 21, 2010
A few stats
• New architecture since October
• Since the new version started, we logged 3.6 million users
• and keep about 12 million tweets around
• Which is about 300 million triples
• Once out of experimental phase we need to increase this number dramatically
Thursday, January 21, 2010
RedTwarf
• Goal: look for new attack patterns
• Based on WhiteTwarf data
• Using Data and Text-mining techniques to detect rules
• Probably will be Lucene based
Thursday, January 21, 2010
Conclusions
• Twarf is still an experimental side project
• Twitter is becoming a popular attack vector
• Goal: protecting you, our customers
• Identifying the future development directions of Twitter threats
I would like to thank you for your support with 140 characters or less and guess what? I just did it! #swnyc
Thursday, January 21, 2010
Thursday, January 21, 2010