Eventbrite Data Platform Talk for SFDM

Data Platform Vipul Sharma – [email protected]

Upload: vipul-sharma

Post on 15-Jan-2015

DESCRIPTION

Slides for Eventbrite's data platform talk at SF data mining meetup.

TRANSCRIPT

Page 1: Eventbrite Data Platform Talk for SFDM

Data Platform

Vipul Sharma – [email protected]

Page 2:

A social event ticketing and discovery platform

Page 3:

$1B total sales

68M tickets sold

1.4M events hosted

0.5M organizers served

23M attendees served

12 countries

Page 4:

Event Lifecycle

Page 5:

Frictionless is the mantra!

Page 6:

Data Platform and Discovery

Page 14:

Analytics

• Ad hoc queries by analysts

Page 15:

Fraud and Spam

Page 16:

Data Platform

Page 18:

Hadoop Cluster

• 30 persistent EC2 High-Memory instances
• 30 TB disk with replication factor of 2, ext3 formatted
• CDH3
• Fair Scheduler
• HBase

Page 19:

Infrastructure

• Search
  – Solr
  – Incremental updates, moving towards event-driven
• Recommendation/Graph
  – Hadoop
  – Native Java MapReduce
  – Bash for workflow
• Social
  – Cassandra
  – Denormalized view
• Persistence
  – MySQL
  – HDFS
  – HBase
  – MongoDB (moving to Cassandra)

Page 20:

Infrastructure

• Stream
  – RabbitMQ
  – Internal fire hose
  – Storm
• Offline
  – MapReduce
  – Streaming
  – Hive
  – Hue

Page 21:

Discovery

Social, Interest, Local

Page 23:

Categorization - Prism

Tech

Music

Conference

Sports

Page 24:

Prism - Features

• Supervised learning
• Logistic regression using MLE
• Pairwise classification into 20 categories
• High precision, lower recall
• Use MapReduce for feature extraction
• Use for clustering as well

Page 25:

Prism – Training Data

• Binary classification for each category
• Training data needed for positive and negative
  – Conference and not Conference
  – Sports and not Sports
• Samasource and Crowdflower
• Stem words to create initial set
• Positive, negative, negative with stem words

Page 26:

Prism - Features

• Convert Event and Organizer data into a feature vector
  – Event details, Organizer details, Ticket details
• Boolean representation of predefined attributes
  – Words – tf-idf, dictionaries
  – Phrases
  – Domains
  – Rules – regular expressions
  – Functions – business logic, e.g. ticket price between $10-$20
  – Compounds – boolean combination of features, & and || rules
      – <COMPOUND1>:techcrunch & disrupt & techcrunch.com
      – <COMPOUND2>:COMPOUND2 && after && party
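A compound can be read as a boolean AND over simpler features. The following is a minimal sketch, assuming features are plain lowercase tokens and that a compound may reference a compound defined earlier. The rule table, the name AFTER_PARTY, and the function name are hypothetical and are not taken from Prism itself.

```python
# Hypothetical rule table: each compound fires when ALL of its parts are present.
# A part is either a raw token or the name of a compound defined earlier.
COMPOUNDS = {
    "COMPOUND1": ["techcrunch", "disrupt", "techcrunch.com"],
    "AFTER_PARTY": ["COMPOUND1", "after", "party"],  # assumed name, for illustration
}


def evaluate_compounds(tokens, compounds=COMPOUNDS):
    """Return the set of compound names that fire for a set of tokens."""
    active = set(tokens)
    fired = set()
    # Dicts keep insertion order, so a compound can only build on earlier ones.
    for name, parts in compounds.items():
        if all(part in active for part in parts):
            active.add(name)
            fired.add(name)
    return fired


tokens = {"techcrunch", "disrupt", "techcrunch.com", "after", "party"}
fired = evaluate_compounds(tokens)
```

Each fired compound would then become one more boolean entry in the event's feature vector.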

Page 27:

Prism - Features

• Each feature is represented in various contexts
  – Event Title, Event Description, Organizer Title, Organizer Description
• Each feature has meta info – Termclass
  – <LANG_EN>, <CONF_LANG_EN>, <ADULT_LANG_EN>
  – <SPORTS_LANG_EN>:<EVENT_TITLE>ball
• Feature vector is represented as a sparse vector

+1 391158:1 401814:1 410526:1 411489:1 411606:2 413910:1 427659:1 438369:1 449735:1 449736:2 455478:1 456741:1 463188:1

693|||||warrior spirit's 3rd annual fundraising auction|||||1:<DESC>again,1:<NAME>annual,1:<DESC>annual,2:<DESC>approaching,2:<NAME>auction,4:<DESC>auction,2:<DESC>auctions,2:<DESC>bring
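The first sample line above (a label followed by index:value pairs) has the same shape as the sparse text format used by tools such as LIBSVM. A minimal sketch of reading such a line into a label and a sparse dictionary follows; the function name is hypothetical, and treating the values as floats is an assumption.

```python
def parse_sparse_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {idx: val})."""
    fields = line.split()
    label = int(fields[0])  # int() accepts a leading sign, e.g. "+1"
    features = {}
    for field in fields[1:]:
        index, value = field.split(":")
        features[int(index)] = float(value)
    return label, features


label, features = parse_sparse_line("+1 391158:1 401814:1 411606:2")
```

Only the non-zero entries are stored, which is what keeps a vector over hundreds of thousands of feature indices small.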

Page 28:

Prism - Training

• Binary classifier
• Multiclass is less accurate
• Each event gets classified into 20 categories
• MapReduce for creating sparse matrix
• MapReduce for batch classification
  – Distributed cache for feature set and models
• We can use the same sparse matrix for clustering
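The one-binary-classifier-per-category setup can be illustrated with a small logistic regression fitted by gradient ascent on the log-likelihood. This is only a sketch under stated assumptions: the toy training data, the feature indices, the learning rate, and the threshold are invented for illustration, and the real system trained on MapReduce-generated sparse matrices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, features):
    """Linear score of a sparse feature dict; 'bias' holds the intercept."""
    return weights["bias"] + sum(weights.get(i, 0.0) * v for i, v in features.items())


def train_binary(examples, epochs=200, lr=0.5):
    """Fit one binary logistic regression by stochastic gradient ascent (MLE).

    examples: list of (label, {feature_index: value}) with label in {0, 1}.
    """
    weights = {"bias": 0.0}
    for _ in range(epochs):
        for label, features in examples:
            error = label - sigmoid(score(weights, features))
            weights["bias"] += lr * error
            for i, v in features.items():
                weights[i] = weights.get(i, 0.0) + lr * error * v
    return weights


def classify(models, features, threshold=0.5):
    """Run every per-category model; raising the threshold trades recall for precision."""
    return sorted(c for c, w in models.items()
                  if sigmoid(score(w, features)) >= threshold)


# Toy data (assumed): feature 1 = "conference" in the title, feature 2 = "ball".
models = {
    "Conference": train_binary([(1, {1: 1.0}), (0, {2: 1.0})]),
    "Sports": train_binary([(0, {1: 1.0}), (1, {2: 1.0})]),
}
labels = classify(models, {2: 1.0})
```

Because each category has its own model, an event can receive zero, one, or several categories.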

Page 29:

Attendee

• What are your interests? – Prism
• Who are your friends? – Explicit and implicit
• What are the interests of your friends? – Prism
• Which of your friends share your interests? – IBG
• Location of users and events
  – Location of purchased events
  – Facebook location
  – Our database
  – Other signals – IP, mobile app, etc.

Page 30:

You will like to attend this event

Page 31:

Recommendation Engines

Item Hierarchy (You bought a camera, so you need batteries – Amazon)

Collaborative Filtering – User-User Similarity (People who bought a camera also bought batteries – Amazon)

Collaborative Filtering – Item-Item Similarity (You like Godfather, so you will like Scarface – Netflix)

Social Graph Based (Your friends like Lady Gaga, so you will like Lady Gaga; PYMK – Facebook, LinkedIn)

Interest Graph Based (Your friends who like rock music, like you, are attending an Eric Clapton event – Eventbrite)
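The interest-graph idea (friends who share one of your interests and are attending an event) can be sketched as a simple scoring pass. The data layout, function name, and count-based score below are assumptions made for illustration; the actual system ranks edges using feature vectors.

```python
def recommend_events(user, friends, interests, attending):
    """Rank events by how many interest-sharing friends are attending them."""
    scores = {}
    my_interests = interests.get(user, set())
    for friend in friends.get(user, ()):
        # Only friends with at least one shared interest contribute.
        if my_interests & interests.get(friend, set()):
            for event in attending.get(friend, ()):
                scores[event] = scores.get(event, 0) + 1
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda e: (-scores[e], e))


friends = {"u1": ["u2", "u3", "u4"]}
interests = {"u1": {"rock"}, "u2": {"rock", "jazz"}, "u3": {"rock"}, "u4": {"sports"}}
attending = {"u2": ["clapton"], "u3": ["clapton", "festival"], "u4": ["game"]}
ranked = recommend_events("u1", friends, interests, attending)
```

Here the friend who only likes sports contributes nothing, which is the difference from a purely social-graph recommendation.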

Page 32:

Why Interest?

Events are Social

Events are Interest

Dense Graph is Irrelevant

Interests are Changing

Page 33:

How do we know your Interest?

• We ask you
• Based on your activity
  – Events attended
  – Events browsed (in future)
• Facebook interests
  – User interest has to match event category
  – Static
• Prism

Page 34:

Model Based vs Clustering

Building Social Graph is Clustering Step

Social Graph Recommendation is a Ranking Problem

Item-Item vs User-User

Page 35:

Implicit Social Graph

[Figure: graph linking users U1–U5 and events E1–E4]

Page 36:

Mixed Social Graph

[Figure: graph linking users U1–U5, events E1–E3, and the external sources FB and LI]

Page 37:

23M * 260 * 260 = 1.5 Trillion Edges

6 Billion edges ranked

Each node is a feature vector representing a User

Each edge is a feature vector representing a Relationship
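The edge count on this slide is easy to verify. In the short check below, the slide does not say what the factor of 260 stands for, so the variable name `fan_out` is only a guess.

```python
users = 23_000_000
fan_out = 260  # assumption: the slide's unexplained factor, applied twice
candidate_edges = users * fan_out * fan_out
ranked_edges = 6_000_000_000

print(f"candidate edges: {candidate_edges:,}")
print(f"share actually ranked: {ranked_edges / candidate_edges:.2%}")
```

The product is about 1.55 trillion, so the 6 billion ranked edges are well under one percent of the candidate space.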

Page 38:

Feature Generation

• Mixed features
• A series of MapReduce jobs
• Output on HDFS in flat files; input to subsequent jobs
• Orders = Event Attendees
  – MAP: eid: uid
  – REDUCE: eid:[uid]
• Attendees Social Graph
  – Input: eid:[uid]
  – MAP: uid_i:[uid]
  – REDUCE: uid:[neighbors]
• Interest-based features, user-specific features, graph mining, etc.
• Upload feature values to HBase
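The two chained jobs (orders to attendees per event, then attendees to neighbors per user) can be simulated in memory to show the data flow. This is a sketch only: the function names are hypothetical, and a real job would stream records through Hadoop mappers and reducers rather than Python dictionaries.

```python
from collections import defaultdict
from itertools import permutations


def attendees_by_event(orders):
    """Job 1: MAP emits (eid, uid); REDUCE groups into eid -> [uid]."""
    grouped = defaultdict(set)
    for eid, uid in orders:
        grouped[eid].add(uid)
    return {eid: sorted(uids) for eid, uids in grouped.items()}


def neighbors_by_user(event_attendees):
    """Job 2: MAP emits (uid, co-attendee); REDUCE groups into uid -> [neighbors]."""
    grouped = defaultdict(set)
    for uids in event_attendees.values():
        # Every ordered pair of distinct attendees of the same event is an edge.
        for uid, other in permutations(uids, 2):
            grouped[uid].add(other)
    return {uid: sorted(others) for uid, others in grouped.items()}


orders = [("E1", "U1"), ("E1", "U2"), ("E2", "U2"), ("E2", "U3")]
graph = neighbors_by_user(attendees_by_event(orders))
```

The output of the first function is exactly the input of the second, mirroring how each job's flat files on HDFS feed the next job.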

Page 39:

HBase

• Why HBase?
  – Processing 6B edges requires looking up features for each node and each edge
  – At 1,000 lookups/sec: 6B / 1000 / 86400 ≈ 70 days!!
  – At 1M lookups/sec: ≈ 1.7 hrs
  – Processing 1.3 TB of data with MapReduce
• Collect data from multiple MapReduce jobs
• Stores entire social graph
• Features for each node and edge
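The back-of-the-envelope estimate above can be reproduced in a few lines. The sketch assumes the slide's figures are lookups per second (1,000 for the slow case and 1M for the fast case), which is an interpretation of the arithmetic rather than something the slide states.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_HOUR = 3_600

lookups = 6_000_000_000  # one feature lookup per ranked edge


def duration_seconds(total_lookups, lookups_per_second):
    """Time needed to perform all lookups at a constant rate."""
    return total_lookups / lookups_per_second


slow_days = duration_seconds(lookups, 1_000) / SECONDS_PER_DAY
fast_hours = duration_seconds(lookups, 1_000_000) / SECONDS_PER_HOUR

print(f"1,000 lookups/sec: {slow_days:.1f} days")
print(f"1M lookups/sec:    {fast_hours:.2f} hours")
```

A thousandfold increase in lookup throughput is what turns a multi-month batch into one that fits in an afternoon.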

Page 40:

Data Model

Rowkey    U           UU
uid1      f1 f2 f3    uid2:f4 uid2:f5 uid3:f4

U:

rowid     neighbors   events   featureX
2718282   101         3        0.3678795

UU:

rowid     314159:n   314159:e   314159:fx   161803:n   161803:e   161803:fx
2718282   31         1          0.3183      83         2          0.618
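The row layout above can be mimicked with nested dictionaries to show how per-user and per-edge features share one row. This is a plain-Python sketch, not HBase client code; the helper name is hypothetical, and reading the qualifiers as `<neighbor uid>:<feature name>` is an assumption based on the sample row.

```python
# One row per user, keyed by uid. Family "U" holds the user's own features;
# family "UU" holds edge features, one qualifier per (neighbor, feature) pair.
row_2718282 = {
    "U": {"neighbors": 101, "events": 3, "featureX": 0.3678795},
    "UU": {
        "314159:n": 31, "314159:e": 1, "314159:fx": 0.3183,
        "161803:n": 83, "161803:e": 2, "161803:fx": 0.618,
    },
}


def edge_features(row, neighbor_uid):
    """Collect the edge features stored for one neighbor, keyed by feature name."""
    prefix = f"{neighbor_uid}:"
    return {
        qualifier[len(prefix):]: value
        for qualifier, value in row["UU"].items()
        if qualifier.startswith(prefix)
    }


edge = edge_features(row_2718282, 314159)
```

With this layout a single row read returns a user's node features together with the features of all of that user's edges.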

Page 41:

[Figure: graph linking users U1, U2, and U3]

Page 42:

HBase

Page 43:

Hadoop Tips & Tricks

• Joins
  – Distributed cache
  – Hive map-side joins
• Hive
  – Nice set of statistical functions
  – Lots of Hive queries
• HBase
  – Lots of memory
  – WAL
  – LZO
  – Proper configs
  – Avoid hot region servers
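The join tips above rely on one idea: when one side of a join is small, ship it to every mapper and join in memory, so no shuffle is needed. A minimal sketch follows, with hypothetical names and toy data; in Hadoop the small table would arrive via the distributed cache, and Hive applies the same strategy in its map-side joins.

```python
def map_side_join(records, lookup):
    """Inner-join a stream of (key, value) records against an in-memory dict."""
    for key, value in records:
        if key in lookup:  # records with no match on the small side are dropped
            yield key, value, lookup[key]


event_categories = {"E1": "Music", "E2": "Tech"}     # small side, fits in memory
orders = [("E1", "U1"), ("E2", "U2"), ("E9", "U3")]  # large side, streamed
joined = list(map_side_join(orders, event_categories))
```

Each record is handled independently, so the join runs entirely in the map phase.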

Page 44:

Hadoop Tips & Tricks

• Combiners did not work
• Shuffle and Merge

Page 45:

More Innovation

• Rethink everything
• Add social to search
• Add time series features
• Real-time updates using firehose and Storm
• Various sorts of data

Page 46:

Developers! Developers! Developers!

• Interested in scaling, messaging, data, machine learning, mobile, services

• We will continue to push the boundaries of hard problems

[email protected], [email protected]

Page 47:

Storm at Eventbrite

Tuesday, August 21, 2012 at Eventbrite HQ

How we are using Storm for real-time processing of our data

Andrew Whang [email protected]

http://www.eventbrite.com/event/4010290888

Page 48:

Questions?