Social Network Analytics on Cray Urika-XA Mike Hinchey, [email protected] Technical Solutions Architect Cray Inc, Analytics Products Group April, 2015

Page 1: Social Network Analytics on Cray Urika-XAcug.org/proceedings/cug2015_proceedings/includes/files/pap158... · Functional API, immutability, stateless Immutable dataset abstraction,

Social Network Analytics on Cray Urika-XA

Mike Hinchey, [email protected]
Technical Solutions Architect

Cray Inc, Analytics Products Group
April, 2015

Agenda

1. Introduce platform
   • Urika-XA
2. Technology and architecture for analytics
   • Apache Spark
3. Use case analysis and results
   • Social Network Analysis
4. Conclusions

Urika-XA Hardware

Extreme Analytics

• 48 Analytic Nodes
• 96 CPUs, 1536 cores
• 6 TB total RAM
• 38 TB total local SSD (for HDFS)
• 48 TB total local HDD
• 120 TB Sonexion 900 Lustre Storage
• FDR InfiniBand Fabric Network
• Standard 42U Rack
• Dual rack configuration also available

Urika-XA Software

Extreme Analytics

• Cloudera Hadoop Distribution
  • and Management UI
• HDFS
  • on the 38 TB local SSD
• YARN
  • manages jobs on 48 nodes
• Hadoop MapReduce
• Apache Spark

Urika-GD

Graph Discovery

• 4 TB RAM
• 128 XMT compute processors
• 128 hardware threads per processor
• Lustre file system
• RDF: W3C Resource Description Framework
• SPARQL: W3C graph query language

Goals for this Project

1. Business Use Case
   • Demonstrate analytics
   • On a broadly accessible use case
   • Showing valuable insights
2. Technology and Architecture
   • Bring together various technologies and techniques
   • Demonstrate architecture of an end-to-end solution
   • Cray R&D also uses this for performance tests

Business Use Case

Collect data from social media

Discover communities of users with interest in a particular topic (consumer electronics, sports)

Identify users according to role: key influencers, rebroadcasters, connectors

Process Overview

Technology Overview

• Apache Spark applications
  • Data load and transform
  • Community detection
  • Analytics
  • Query for visualization
• Web app, JavaScript
  • Query from the Spark app
  • Charts and graphs to visualize data and results
  • This presentation

Technology and Architecture

Bring together various technologies and techniques

Demonstrate an end-to-end solution

Lambda Architecture

Principles for an analytics system that includes Batch and Real-time pipelines

Based on functional programming (lambdas)

To achieve consistency, reliability, etc

Source data is immutable, append-only

Business/analytics code duplicated for Batch and Real-time use cases

Lambda Architecture

Batch layer for completeness and accuracy: typically Hadoop/MapReduce

Speed layer for real-time, minimal latency, may sacrifice accuracy

[Diagram: Data stream, Batch layer, Real-time stream, Serving layer, Presentation]

Kappa Architecture

Rethinking the Lambda Architecture: maintaining multiple frameworks and duplicated code is too difficult

Rethinking the traditional database: it is based on a transaction log, but only internally

Use streams everywhere; the transaction/event log is the foundation of all data

Avoid the traditional batch pipeline (where possible, with respect to legacy software)

Avoid inconsistent caches of data, such as memcached or caches within apps

Kappa Architecture

Batch is a slow-lane stream, and allows for re-processing of historical data

Real-time is a fast-lane stream using the same framework, so code is shared

[Diagram: Data stream, Batch stream, Real-time, Serving layer, Presentation]
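
The code-sharing point can be illustrated with a minimal sketch in plain Python (not Spark). All names here (`count_hashtags`, `windows`, `event_log`) are hypothetical and the event records are made up; the sketch only shows one piece of business logic serving both the slow lane and the fast lane of the same log.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_hashtags(events: Iterable[dict]) -> Counter:
    """Business logic written once: count hashtags in one window of events."""
    counts = Counter()
    for event in events:
        counts.update(event.get("hashtags", []))
    return counts


def windows(log: List[dict], size: int) -> Iterator[List[dict]]:
    """Cut the append-only event log into consecutive windows of `size` events."""
    for start in range(0, len(log), size):
        yield log[start:start + size]


# Hypothetical append-only event log (illustrative data only).
event_log = [
    {"user": "a", "hashtags": ["spark"]},
    {"user": "b", "hashtags": ["spark", "cray"]},
    {"user": "c", "hashtags": ["cray"]},
    {"user": "a", "hashtags": ["spark"]},
]

# Slow lane: replay the whole log as one large window.
slow_lane = [count_hashtags(w) for w in windows(event_log, size=len(event_log))]

# Fast lane: the same function applied to small windows.
fast_lane = [count_hashtags(w) for w in windows(event_log, size=2)]

# Adding up the fast-lane results reproduces the slow-lane result.
fast_total = sum(fast_lane, Counter())
```

Because both lanes call the same function, a change to the counting logic needs to be made only once, which is the duplication problem the Kappa approach is meant to avoid.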

Apache Spark

Functional API, immutability, stateless

Immutable dataset abstraction, transparently distributed

High-level API: map, reduce, filter, group by, join, union, left outer join

Graph Algorithms: PageRank, SVD++, connected components, shortest paths

Machine Learning: k-means, linear regression, logistic regression, naive Bayes

Streaming: real-time, periodic
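
For readers new to the functional style, the operations named above have close analogues in plain Python. The sketch below is not Spark code and is not distributed; the sample records and the `followers` lookup are invented for illustration.

```python
from functools import reduce
from itertools import groupby

# A tuple of (user, hashtag) records, treated as immutable input.
tweets = (
    ("alice", "spark"),
    ("bob", "cray"),
    ("alice", "cray"),
    ("carol", "spark"),
)
followers = {"alice": 10, "bob": 5}  # hypothetical lookup table

# map: project each record to its hashtag.
tags = tuple(map(lambda t: t[1], tweets))

# filter: keep only the records about "spark".
spark_tweets = tuple(filter(lambda t: t[1] == "spark", tweets))

# reduce: fold the records into a single count.
total = reduce(lambda acc, _: acc + 1, tweets, 0)

# group by: hashtags per user (groupby needs sorted input).
by_user = {
    user: tuple(tag for _, tag in group)
    for user, group in groupby(sorted(tweets), key=lambda t: t[0])
}

# left outer join: every tweet is kept; a missing follower count becomes None.
joined = tuple((user, tag, followers.get(user)) for user, tag in tweets)
```

Every step builds a new value and leaves `tweets` unchanged, which is the property that lets a framework like Spark distribute and re-run such steps safely.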

Why XA?

Considering the principles of Lambda and Kappa Architectures,

And the capabilities of Spark,

What is the value of Urika-XA?

• Pre-configured Hadoop/YARN cluster
• Minimize time to value for a project
• Hardware architecture built for both batch and real-time

Why XA?

Hardware Architecture built for both batch size and real-time speed

• Lots of memory and CPU
  • Perform numerous transformations and joins in memory
• HDFS on fast, local SSD
  • Bigger joins, and temporary files are fast
• Shared file system, Lustre
  • Parallel and fast, for input and output data
  • Reliable without 3x data duplication

SNA - ETL

Source data is stored in immutable files

Start the ETL (extract, transform, load) process from some start date (to re-process old data)

The Spark Streaming window specifies how much data goes into each micro-batch
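
The windowing idea can be sketched without Spark. The function below is a hypothetical helper, not the Spark Streaming API: it takes timestamped records, skips anything before a chosen starting point, and groups the rest into fixed-length micro-batches. The timestamps and the 10-second window are illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def micro_batches(records, start, window):
    """Group (timestamp, payload) records at or after `start` into fixed windows.

    Returns a dict mapping each window's start time to its list of payloads.
    """
    batches = defaultdict(list)
    for timestamp, payload in records:
        if timestamp < start:
            continue  # older than the chosen starting point; not processed in this run
        index = (timestamp - start) // window  # timedelta // timedelta yields an int
        batches[start + index * window].append(payload)
    return dict(batches)


# Illustrative records; a real run would read them from the archived files.
records = [
    (datetime(2014, 12, 31, 23, 59, 59), "old"),
    (datetime(2015, 1, 1, 0, 0, 1), "t1"),
    (datetime(2015, 1, 1, 0, 0, 4), "t2"),
    (datetime(2015, 1, 1, 0, 0, 12), "t3"),
]
batches = micro_batches(records, start=datetime(2015, 1, 1), window=timedelta(seconds=10))
```

Moving `start` earlier re-processes older data with the same code, and changing `window` changes how much data lands in each micro-batch.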

SNA - Real-time

Fast-lane window is seconds: for real-time alerts, complex events

Aggregations, metrics

Complex Event Processing (CEP), such as spotting trending hashtags
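
One simple way to spot a trending hashtag is to compare its count in the current window with its count in the previous window. The rule below, with its `min_count` and `growth` thresholds, is an assumption made for illustration; the slides do not say which rule the project used.

```python
from collections import Counter


def trending(previous, current, min_count=3, growth=2.0):
    """Return hashtags whose count in `current` is at least `min_count`
    and at least `growth` times their count in `previous`.

    Hashtags unseen in the previous window are compared against a count of 1.
    """
    prev_counts = Counter(previous)
    cur_counts = Counter(current)
    return sorted(
        tag
        for tag, n in cur_counts.items()
        if n >= min_count and n >= growth * max(prev_counts[tag], 1)
    )


# Illustrative hashtag streams for two consecutive windows.
previous_window = ["nba", "nba", "music"]
current_window = ["nba", "nba", "nba", "nba", "music", "ubuntu", "ubuntu", "ubuntu"]
hot = trending(previous_window, current_window)
```

Here "nba" doubles from 2 to 4 and "ubuntu" appears 3 times from nothing, so both are flagged, while "music" stays below the minimum count.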

Community Detection

Label Propagation (LP) is a graph-based Community Detection Algorithm (CDA)

LP is not implemented as a streaming algorithm, so it is executed periodically, on one day of collected data

More data produces better results, more meaningful communities

This is done in a second stream: not real-time, longer window, higher latency
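
To show what label propagation does, here is a minimal single-machine sketch in plain Python. It is not the GraphX implementation: it assumes synchronous updates, breaks ties by taking the smallest label, and stops after `max_steps` rounds or when labels stop changing. Those choices, and the sample edges, are assumptions for illustration.

```python
from collections import Counter


def label_propagation(edges, max_steps=5):
    """Assign each node the most frequent label among its neighbors, repeatedly.

    `edges` is an iterable of undirected (a, b) pairs. Every node starts with its
    own id as label; ties are broken by choosing the smallest label.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    labels = {node: node for node in neighbors}
    for _ in range(max_steps):
        new_labels = {}
        for node, nbrs in neighbors.items():
            counts = Counter(labels[n] for n in nbrs)
            best = max(counts.values())
            new_labels[node] = min(label for label, c in counts.items() if c == best)
        if new_labels == labels:
            break  # converged: no label changed in this round
        labels = new_labels
    return labels


# Two separate triangles of users who "know" each other.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]
communities = label_propagation(edges)
```

Each triangle settles on a single shared label, so the result maps every user to exactly one community. Synchronous label propagation can oscillate on some graphs, which is one reason to cap the number of rounds.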

Visualization

This presentation is a web app that loads data output from the Spark job

• d3.js: render charts and graphs in SVG
• crossfilter.js: manipulate data across multiple dimensions
• dc.js: reusable charts

SNA Analytics Pipeline

SNA Analytics Pipeline

[Diagram: Social Network Analytics pipeline with four stages: ETL, Algorithms, Analysis, Visualization]

ETL (extract, transform, load): Spark Streaming, Scala

Algorithms: GraphX Label Propagation, Machine Learning

Analysis: Spark, Scala, SQL

Visualization: JavaScript, D3, SVG

Source Data - Twitter.com

Tweet download is based on search terms (related to consumer electronics, sports, life sciences, etc)

Streaming download since April 2014

Data archived in files to allow reprocessing

Source Data - Twitter.com

[Chart: tweets per hour, by hour of day (0 to 24)]

[Chart: tweets per day, 09/29 through 02/01]

The full Twitter firehose is about 600M tweets/day. During the displayed timeframe, we collected 674,106,415 tweets, about 0.91% of the firehose.
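
The 0.91% figure can be sanity-checked with a little arithmetic. The slides do not state the exact number of collection days, so the sketch below backs it out from the stated numbers; the 123-day value used at the end is an assumption chosen to match the chart's span of roughly four months.

```python
collected = 674_106_415          # tweets collected (from the slide)
firehose_per_day = 600_000_000   # approximate full firehose (from the slide)
stated_share = 0.0091            # "about 0.91%" (from the slide)

# Number of days implied by the three stated figures.
implied_days = collected / (stated_share * firehose_per_day)


def share_of_firehose(days: int) -> float:
    """Fraction of the firehose collected over `days` days."""
    return collected / (firehose_per_day * days)


# Assumed collection period of 123 days (not stated in the slides).
share_percent = round(share_of_firehose(123) * 100, 2)
```

The implied period is a little over 123 days, and an assumed 123-day period gives a share that rounds to the 0.91% quoted on the slide.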

Source Data - Storage

[Chart: files per day, 09/29 through 02/01]

JSON saved to files, gzipped

2,290 files, 317 GB
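
Reading such an archive needs only the Python standard library. The sketch below assumes one JSON object per line, which the slides do not confirm; `read_tweets` and the sample file are hypothetical. Lines that fail to parse are counted rather than raising, so errors in the source data can be reported.

```python
import gzip
import json
import os
import tempfile


def read_tweets(path):
    """Read a gzipped file with one JSON object per line (assumed layout).

    Returns (tweets, errors): the parsed objects and the number of bad lines.
    """
    tweets, errors = [], 0
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                tweets.append(json.loads(line))
            except json.JSONDecodeError:
                errors += 1  # count malformed records instead of failing
    return tweets, errors


# Write a tiny sample archive, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.json.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as out:
        out.write('{"user": "a", "text": "hello #spark"}\n')
        out.write("not json\n")
        out.write('{"user": "b", "text": "hi"}\n')
    tweets, errors = read_tweets(sample_path)
```

Because the archived files are never modified, the same reader can be pointed at any subset of them to re-process old data.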

SNA - Counts and Aggregations

Aggregations are done for both

• periodic, per window
• running total since processing began

Tweets

Users

Unique hashtags
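
The two kinds of aggregation can be sketched side by side. This is plain Python with a hypothetical `aggregate` function and invented records, not the project's Spark code: for each window it reports the per-window counts and the running totals of tweets, users, and unique hashtags.

```python
def aggregate(windows):
    """For each window, return per-window counts and running totals.

    Each window is a list of dicts with "user" and "hashtags" keys (assumed shape).
    """
    running_tweets = 0
    seen_users, seen_tags = set(), set()
    rows = []
    for window in windows:
        users = {tweet["user"] for tweet in window}
        tags = {tag for tweet in window for tag in tweet["hashtags"]}
        running_tweets += len(window)
        seen_users |= users
        seen_tags |= tags
        rows.append({
            "window": {"tweets": len(window), "users": len(users), "hashtags": len(tags)},
            "total": {
                "tweets": running_tweets,
                "users": len(seen_users),
                "hashtags": len(seen_tags),
            },
        })
    return rows


# Two illustrative windows.
sample_windows = [
    [{"user": "a", "hashtags": ["x"]}, {"user": "b", "hashtags": ["x", "y"]}],
    [{"user": "a", "hashtags": ["z"]}],
]
rows = aggregate(sample_windows)
```

Note that the running totals for users and hashtags are not sums of the per-window counts, because the same user or hashtag can appear in several windows; the sketch keeps sets to count each one once.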

More Counts and Aggregations

Hashtags matched to topics

Top hashtags

Top hashtags per user

Errors in source data

NSFW: censor out some tweets based on keywords

Build the network for CDA

Label propagation (LP) is a community detection algorithm (CDA), built into Spark GraphX

Input is a network: a list of relationships between entities

We'll look at users that mention other users in tweets

Further restrict to where users have mentioned each other

• If user A mentioned user B
• and B mentioned A
• then infer that A knows B
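
The mutual-mention rule above can be sketched in a few lines of plain Python. The function name and the sample mentions are hypothetical; the output is a set of undirected "knows" pairs that could serve as the input network.

```python
def mutual_mentions(mentions):
    """Keep only pairs where each user mentioned the other.

    `mentions` is an iterable of (author, mentioned_user) pairs.
    Returns a set of sorted (a, b) tuples, one per mutual relationship.
    """
    directed = {(a, b) for a, b in mentions if a != b}  # ignore self-mentions
    return {tuple(sorted((a, b))) for a, b in directed if (b, a) in directed}


# Illustrative mentions: A and B mention each other, as do C and D.
mentions = [
    ("A", "B"), ("B", "A"),
    ("A", "C"),              # one-way only, so not a "knows" relationship
    ("C", "D"), ("D", "C"),
    ("E", "E"),              # self-mention, ignored
]
knows = mutual_mentions(mentions)
```

Requiring mentions in both directions discards one-way mentions, for example many users mentioning a celebrity who never replies.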

Communities

LP assigns each user to exactly one community

[Graph: User nodes linked to Community nodes by "member" edges; three example communities are shown, with IDs 99212017, 996217376 and 999453985]

Community metrics

Count the users in each community

Density is the proportion of user pairs in a community that know each other

Filter out tiny and huge communities as not interesting
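
These metrics can be sketched as follows, taking density to be internal "knows" edges divided by the number of possible user pairs in the community. The `min_size` and `max_size` thresholds are assumptions; the slides do not give the cut-offs used. Edges are assumed to be unique undirected pairs.

```python
from collections import defaultdict


def community_metrics(membership, edges, min_size=3, max_size=1000):
    """Compute size and density per community, dropping tiny and huge ones.

    `membership` maps user -> community id; `edges` are unique undirected pairs.
    """
    members = defaultdict(set)
    for user, community in membership.items():
        members[community].add(user)

    internal = defaultdict(int)
    for a, b in edges:
        if a != b and a in membership and membership[a] == membership.get(b):
            internal[membership[a]] += 1  # edge with both ends in the same community

    metrics = {}
    for community, users in members.items():
        n = len(users)
        if n < min_size or n > max_size:
            continue  # filtered out as not interesting
        possible = n * (n - 1) // 2
        density = internal[community] / possible if possible else 0.0
        metrics[community] = {"size": n, "density": density}
    return metrics


# Illustrative data: a triangle, a 4-user chain, and a singleton.
membership = {1: 1, 2: 1, 3: 1, 4: 4, 5: 4, 6: 4, 7: 4, 8: 8}
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (6, 7)]
metrics = community_metrics(membership, edges)
```

The triangle has density 1.0 because every pair knows each other, the chain has 3 of 6 possible pairs, and the singleton is dropped by the size filter.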

Community Characterization

Find ways to describe communities

Most popular topics among users

[Graph: Community nodes linked to Topic nodes by "references" edges; topics shown: Sports, Finance, Consumer Electronics]

Community Characterization

Most popular hashtags among users

[Graph: Community nodes linked to Hashtag nodes by "references" edges; hashtags shown: hardwork, GenerationsLegacy, RageBoy, Bellarke, autism, NBA, watch, Music, cover, MaxScherzer, quote, google, Ubuntu, vaccines, money, startpharma, Business, fandomscollide, stream]
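
Characterizing a community by its most popular hashtags amounts to a grouped count. The sketch below is plain Python with hypothetical names and invented data; ties are broken alphabetically, which is an arbitrary choice made here for reproducibility.

```python
from collections import Counter, defaultdict


def top_hashtags(membership, user_hashtags, k=2):
    """Return the `k` most used hashtags per community.

    `membership` maps user -> community id; `user_hashtags` maps user -> list of
    hashtags used. Users without a community are ignored.
    """
    counts = defaultdict(Counter)
    for user, tags in user_hashtags.items():
        community = membership.get(user)
        if community is not None:
            counts[community].update(tags)

    return {
        community: [
            tag
            for tag, _ in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        ]
        for community, counter in counts.items()
    }


# Illustrative data only.
membership = {"a": 1, "b": 1, "c": 2}
user_hashtags = {
    "a": ["NBA", "NBA", "Music"],
    "b": ["NBA", "watch"],
    "c": ["Ubuntu", "google", "Ubuntu"],
    "d": ["stream"],  # no community assigned, so ignored
}
top = top_hashtags(membership, user_hashtags)
```

The same grouping works for topics instead of hashtags once each hashtag has been matched to a topic.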

User Roles

• Identify user roles within a community
  • key influencers: retweeted by others
  • rebroadcasters: retweet a lot
• Identify user roles between communities
  • connectors
    • relationships with people in different groups
    • and the strength to each community is balanced
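
The role definitions above can be turned into simple counts. The sketch below is one possible reading, not the project's method: influencers and rebroadcasters are ranked by retweet counts, and a connector is a user with ties to at least two communities whose weakest tie count is at least `balance` times the strongest. The `balance` threshold and all sample data are assumptions.

```python
from collections import Counter, defaultdict


def retweet_roles(retweets, top=1):
    """`retweets` holds (retweeter, original_author) pairs.

    Returns (key influencers, rebroadcasters), each a list of `top` users.
    """
    retweeted = Counter(author for _, author in retweets)
    retweeting = Counter(user for user, _ in retweets)
    influencers = [user for user, _ in retweeted.most_common(top)]
    rebroadcasters = [user for user, _ in retweeting.most_common(top)]
    return influencers, rebroadcasters


def connectors(edges, membership, balance=0.75):
    """Users tied to two or more communities with roughly equal strength."""
    ties = defaultdict(Counter)
    for a, b in edges:
        if a in membership and b in membership:
            ties[a][membership[b]] += 1  # count a's ties by b's community
            ties[b][membership[a]] += 1
    found = []
    for user, per_community in ties.items():
        strengths = per_community.values()
        if len(per_community) >= 2 and min(strengths) / max(strengths) >= balance:
            found.append(user)
    return sorted(found)


# Illustrative data only.
retweets = [("r", "star"), ("r", "star"), ("r", "other"), ("s", "star")]
influencers, rebroadcasters = retweet_roles(retweets)

membership = {"a": 1, "b": 1, "k": 1, "x": 2, "y": 2, "z": 2}
edges = [("k", "a"), ("k", "b"), ("k", "x"), ("k", "y"),
         ("a", "b"), ("x", "y"), ("x", "z"), ("y", "z")]
bridges = connectors(edges, membership)
```

In the sample, "k" has two ties into each community and is reported as a connector, while "x" and "y" each have one tie to community 1 and two to community 2, which falls below the assumed balance threshold.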

Results

Communities and Popular Hashtags

User Roles within a Community

Connector Role across Communities

Conclusions

Conclusions

Analytics needs a variety of techniques: Graph, Machine Learning, Iterative, Streaming

Spark: functional, high-level, transparently distributed

Urika-XA: pre-configured cluster, 6 TB memory

References

Apache Spark: http://spark.apache.org/

Twitter Data: https://dev.twitter.com/streaming/public

Lambda Architecture: http://lambda-architecture.net/

Kreps, Jay, "Questioning the Lambda Architecture", 7/2/2014, http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html

Kleppmann, Martin, "Turning the database inside out with Apache Samza", 9/21/2014, https://youtu.be/fU9hR3kiOK0

DC, dimensional charting: http://dc-js.github.io/dc.js/

Questions?

Or contact me later...

Cray Analytics, Urika-XA: http://cray.com/analytics

Mike Hinchey, [email protected]