yahoo's next generation user profile platform


Upload: hadoop-summit

Post on 19-Jan-2017


TRANSCRIPT

Page 1: Yahoo's Next Generation User Profile Platform


Yahoo’s Next Generation User Profile Platform

Kai Liu, Lu Niu

Yahoo Inc.

Page 2: Yahoo's Next Generation User Profile Platform

Agenda

- What is User Profile
- Architecture Evolution
- Schema Design
- Optimization
- Future Work

Page 3: Yahoo's Next Generation User Profile Platform

Agenda

- What is User Profile
  - Definition
  - Use Cases
  - Logical View
  - User ID Type
- Architecture Evolution
- Schema Design
- Optimization
- Future Work

Page 4: Yahoo's Next Generation User Profile Platform

What is User Profile

A User Profile is a visual display of personal data associated with a specific user. (Wikipedia)

Page 5: Yahoo's Next Generation User Profile Platform


Use Cases

Page 6: Yahoo's Next Generation User Profile Platform


Logical View

Page 7: Yahoo's Next Generation User Profile Platform

User ID Type

- Desktop
  - BID: for anonymous users
  - SID: for registered users
- Mobile
  - IDFA: for iOS devices
  - GPSAID: for Android devices

Page 8: Yahoo's Next Generation User Profile Platform

Agenda

- What is User Profile
- Architecture Evolution
  - Old architecture
  - Problems
  - New architecture
- Schema Design
- Optimization
- Future Work

Page 9: Yahoo's Next Generation User Profile Platform

Classic Architecture of Data System

- Data Preparation (ETL)
- Computation (Hadoop)
- Deep Storage (HDFS)

Page 10: Yahoo's Next Generation User Profile Platform

Old Architecture

[Diagram. Recoverable labels: Batch Data (daily, hourly, minutely); ETL; HDFS (incremental); Aggregation; HDFS (full); Hive; Ad Serving; Modeling; Insights. Two stages are each labeled "1 day".]

Page 11: Yahoo's Next Generation User Profile Platform

Problems

- Aggregation is very expensive
  - HDFS follows a write-once-read-many approach.
  - Only ~30% of users actually get updates every day.
- Impossible to support multiple update frequencies
- No capability to process event streams

Page 12: Yahoo's Next Generation User Profile Platform

New Architecture Components

- Spark
  - Fast
  - Consistent stack (batch/streaming)
- HBase
  - Random read/write capabilities
  - Flexible schema
- Hive
  - Large-scale ad-hoc query engine
  - SQL-like interface

Page 13: Yahoo's Next Generation User Profile Platform

New Architecture

[Diagram. Recoverable labels: Batch Data; Stream Data; HDFS; Kafka; Spark Batch; Spark Streaming; HBase; Hive; Ad Serving; Modeling; Insights. Latency labels: "10 mins - 1 day" and "1 - 10 secs".]

Page 14: Yahoo's Next Generation User Profile Platform

How problems get solved

- Incremental updates avoid a full data load.
- Multiple Spark jobs with different frequencies run concurrently.
- Spark Streaming handles event stream processing.
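
The first point can be sketched in plain Python rather than Spark and HBase. This is only an illustration under stated assumptions: a dict stands in for the profile store, and `full_rebuild` and `apply_incremental` are hypothetical helpers, not part of the platform. The sample delta touches 3 of 10 users, echoing the ~30% daily update rate mentioned earlier.

```python
def full_rebuild(profiles, delta):
    """Old approach: rewrite every profile, changed or not."""
    rebuilt = {}
    for user_id, profile in profiles.items():
        merged = dict(profile)
        merged.update(delta.get(user_id, {}))
        rebuilt[user_id] = merged
    for user_id, update in delta.items():
        rebuilt.setdefault(user_id, dict(update))
    return rebuilt, len(rebuilt)  # (new store, rows written)


def apply_incremental(profiles, delta):
    """New approach: write only the rows that actually changed."""
    for user_id, update in delta.items():
        profiles.setdefault(user_id, {}).update(update)
    return profiles, len(delta)  # (same store, rows written)


# Hypothetical data: 10 users, 3 of them updated today.
store = {f"u{i}": {"age": 20 + i} for i in range(10)}
delta = {"u1": {"age": 31}, "u4": {"gender": "f"}, "u7": {"age": 40}}

_, full_writes = full_rebuild(store, delta)
store, incr_writes = apply_incremental(store, delta)
print(full_writes, incr_writes)
```

The full rebuild writes every row, while the incremental path writes only the rows named in the delta.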

Page 15: Yahoo's Next Generation User Profile Platform

Agenda

- What is User Profile
- Architecture Evolution
- Schema Design
  - Understand the data
  - Table design
- Optimization
- Future Work

Page 16: Yahoo's Next Generation User Profile Platform

Understand the data

Split user profile into multiple HBase tables.

Table        Data Type           Update Pattern                  Use Cases*
Properties   K/V pairs           Overwrite                       (1)(2)(3)
Events       Time series         Append only                     (3)
Segments     List of K/V pairs   Read-modify-write               (1)(3)
Features     Hybrid              Overwrite + read-modify-write   (1)(2)

* (1) Ad serving; (2) User modeling; (3) Audience insights
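
The three basic update patterns can be contrasted with a small in-memory sketch. Everything here is illustrative: the dicts stand in for HBase tables, and the function names and the numeric segment score are assumptions, not details from the talk.

```python
from collections import defaultdict

properties = {}               # overwrite
events = defaultdict(list)    # append only
segments = defaultdict(dict)  # read-modify-write


def overwrite_property(user, name, value):
    # Overwrite: the new value replaces the old one, no read needed.
    properties.setdefault(user, {})[name] = value


def append_event(user, ts, payload):
    # Append only: existing records are never touched.
    events[user].append((ts, payload))


def update_segment(user, segment, score_delta):
    # Read-modify-write: the new value depends on the stored one.
    current = segments[user].get(segment, 0)         # read
    segments[user][segment] = current + score_delta  # modify + write


user = "0_284386766"
overwrite_property(user, "age", 30)
overwrite_property(user, "age", 31)
append_event(user, 1463848639, "view")
append_event(user, 1463935039, "click")
update_segment(user, "type1", 1)
update_segment(user, "type1", 2)
```

Only the last pattern needs a read before the write, which is why it is the expensive one.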

Page 17: Yahoo's Next Generation User Profile Platform


HBase Data Model
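
The slide itself is an image, so the following is only a general sketch of HBase's logical data model, not a reconstruction of the figure: a table behaves like a nested map of row key, column family, column qualifier, and timestamp to value. The `put` and `get` helpers are hypothetical and keep everything in memory.

```python
table = {}  # row key -> family -> qualifier -> {timestamp: value}


def put(row, family, qualifier, value, ts):
    cell = table.setdefault(row, {}).setdefault(family, {}).setdefault(qualifier, {})
    cell[ts] = value  # each write adds a version keyed by timestamp


def get(row, family, qualifier):
    versions = table[row][family][qualifier]
    return versions[max(versions)]  # newest version wins by default


put("0_284386766", "Properties", "age", 30, ts=100)
put("0_284386766", "Properties", "age", 31, ts=200)
put("1_1877933007", "Properties", "gender", "f", ts=150)

latest_age = get("0_284386766", "Properties", "age")
row_order = sorted(table)  # HBase stores rows sorted by row key
```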

Page 18: Yahoo's Next Generation User Profile Platform

Table Design - Properties

Row key: id_type + user_id
Column family: Properties
Columns: c: age, c: gender, c: device1, c: device2, ...
Cell values: val 1, val 2, val 3

Example row keys:
0_284386766
1_1877933007
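
A row key of the form id_type + user_id could be built and parsed as below. This is a sketch: the underscore separator follows the example keys, but the numeric codes assigned to each ID type are assumptions, since the slides do not say which type maps to which prefix.

```python
from enum import IntEnum


class IdType(IntEnum):
    # Hypothetical codes; the actual mapping is not given in the slides.
    BID = 0
    SID = 1
    IDFA = 2
    GPSAID = 3


def properties_row_key(id_type, user_id):
    """Build a row key such as '0_284386766'."""
    return f"{int(id_type)}_{user_id}"


def parse_row_key(key):
    """Split a row key back into (IdType, user_id string)."""
    prefix, user_id = key.split("_", 1)
    return IdType(int(prefix)), user_id


key = properties_row_key(IdType.BID, 284386766)
parsed = parse_row_key("1_1877933007")
```

Prefixing the ID type keeps all users of one type contiguous in the sorted key space.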

Page 19: Yahoo's Next Generation User Profile Platform

Table Design - Events

Row key: id_type + user_id + event_type + timestamp
Column family: Events
Column: c: event (value)

Example row keys:
0_284386766_1463848639
0_284386766_1463935039

Rows are sorted by timestamp.
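
Because rows are stored in key order, a time-range query for one user becomes a contiguous range scan. The sketch below imitates that with a sorted Python list and `bisect`. The key layout follows the stated composition (id_type + user_id + event_type + timestamp); the separator, the numeric event type code, and the 10-digit timestamp padding are assumptions.

```python
from bisect import bisect_left


def event_row_key(id_type, user_id, event_type, ts):
    # Fixed-width timestamp so lexicographic order matches time order.
    return f"{id_type}_{user_id}_{event_type}_{ts:010d}"


def scan_events(sorted_keys, id_type, user_id, event_type, start_ts, stop_ts):
    """Return keys with start_ts <= timestamp < stop_ts for one user and event type."""
    start = event_row_key(id_type, user_id, event_type, start_ts)
    stop = event_row_key(id_type, user_id, event_type, stop_ts)
    return sorted_keys[bisect_left(sorted_keys, start):bisect_left(sorted_keys, stop)]


keys = sorted([
    event_row_key(0, 284386766, 1, 1463848639),
    event_row_key(0, 284386766, 1, 1463935039),
    event_row_key(0, 284386766, 1, 1464021439),
    event_row_key(1, 1877933007, 1, 1463848639),
])

# 1463788800 is 2016-05-21 00:00 UTC; 1463961600 is 2016-05-23 00:00 UTC.
hits = scan_events(keys, 0, 284386766, 1, 1463788800, 1463961600)
```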

Page 20: Yahoo's Next Generation User Profile Platform

Table Design - Segments

Row key: id_type + user_id
Column family: Segments
Columns: c: type1, c: type2, c: type3, ... (each holds a value)

Example row keys:
0_284386766
1_1877933007

* Different segment types are stored in different columns to avoid atomic operations.
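
One reading of the note about atomic operations: if all segments lived in a single cell, two writers updating different segment types would each have to read, modify, and write the whole cell, and without an atomic operation one update could be lost. The deterministic simulation below illustrates that interpretation; it is an assumption about the intent, not a statement from the talk.

```python
# Layout A: every segment type packed into one cell.
row_a = {"segments": {"type1": "old"}}
snapshot_w1 = dict(row_a["segments"])  # writer 1 reads the cell
snapshot_w2 = dict(row_a["segments"])  # writer 2 reads the same state
snapshot_w1["type1"] = "new1"
snapshot_w2["type2"] = "new2"
row_a["segments"] = snapshot_w1        # writer 1 writes back
row_a["segments"] = snapshot_w2        # writer 2 overwrites writer 1's change

# Layout B: one column per segment type, so writers never share a cell.
row_b = {"type1": "old"}
row_b["type1"] = "new1"                # writer 1 touches only its column
row_b["type2"] = "new2"                # writer 2 touches only its column
```

In layout A the update to `type1` is lost; in layout B both updates survive without any coordination.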

Page 21: Yahoo's Next Generation User Profile Platform

Different Access Patterns

Features
- Query: "Get age, gender of user A"
- Write pattern: write only; keep multiple versions
- Rollback: set TIMERANGE to fetch the last version in the application layer

Events
- Query: "Get events of user A from 05/21/2016 to 05/22/2016"
- Write pattern: append only; use TTL to auto-remove records
- Rollback: filter out bad records in the application layer; delete based on timestamp if necessary
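
The rollback idea for features can be sketched as a versioned cell read with an upper time bound, so that a bad write is skipped without deleting anything. The helpers are hypothetical and in-memory; the exclusive upper bound mirrors how HBase time ranges treat their maximum timestamp.

```python
def put_version(cell, ts, value):
    cell[ts] = value  # keep every version, keyed by timestamp


def read_latest(cell, max_ts=None):
    """Return the newest value with timestamp < max_ts (or newest overall)."""
    candidates = [ts for ts in cell if max_ts is None or ts < max_ts]
    return cell[max(candidates)] if candidates else None


age = {}
put_version(age, 100, "29")
put_version(age, 200, "30")
put_version(age, 300, "bad-value")  # a faulty job wrote this version

current = read_latest(age)                  # sees the bad version
rolled_back = read_latest(age, max_ts=300)  # skips it via the time range
```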

Page 22: Yahoo's Next Generation User Profile Platform

Agenda

- What is User Profile
- Architecture Evolution
- Schema Design
- Optimization
  - Pre-split tables
  - Pre-aggregation in Spark
  - Lazy aggregation for inactive users
  - Sequential read on Hive
- Future Work

Page 23: Yahoo's Next Generation User Profile Platform


Pre-Split Tables

Page 24: Yahoo's Next Generation User Profile Platform

Pre-Split Tables

- Data skew: user data is not evenly distributed across different ID types
- Pre-split tables based on data distribution

{SPLITS => ["\x00\x00\x00\x01\x50", "\x00\x00\x00\x01\xA0", "\x00\x00\x00\x02\x00", "\x00\x00\x00\x02\x40", "\x00\x00\x00\x02\x80", "\x00\x00\x00\x02\xC0", "\x00\x00\x00\x03\x00", "\x00\x00\x00\x04\x00"]}
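
One way to derive split points from the data distribution is to take evenly spaced quantiles of a sample of row keys, so that skewed ID types receive more regions. This is a sketch, not the authors' method: `split_points` and `make_key` are hypothetical, and the key layout (4-byte big-endian ID type followed by one byte) is an assumption suggested by the split values above.

```python
import struct


def split_points(sample_keys, num_regions):
    """Pick num_regions - 1 boundaries at evenly spaced quantiles of the sample."""
    keys = sorted(sample_keys)
    points = []
    for i in range(1, num_regions):
        candidate = keys[i * len(keys) // num_regions]
        if not points or candidate != points[-1]:  # skip duplicate boundaries
            points.append(candidate)
    return points


def make_key(id_type, bucket):
    # Assumed layout: 4-byte big-endian id type + 1 byte of the user id.
    return struct.pack(">IB", id_type, bucket)


# Skewed sample: id type 1 has three times as many keys as id type 2.
sample = [make_key(1, b) for b in range(0, 240, 20)]
sample += [make_key(2, b) for b in (0, 64, 128, 192)]

points = split_points(sample, 4)
```

With this sample, two of the three boundaries fall inside ID type 1, where most of the keys are.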

Page 25: Yahoo's Next Generation User Profile Platform

Pre-Aggregate events in Spark

- 1 billion native ads events per day on 0.1 billion users
- Group by (user id, time interval)
- Reduce the writes by 10X
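
The grouping step can be sketched without Spark: events are bucketed by (user id, time interval) and each bucket becomes a single write. The one-hour interval and the count-only aggregate are assumptions made for illustration.

```python
from collections import Counter


def pre_aggregate(events, interval_secs):
    """Collapse (user_id, timestamp) events into one count per (user_id, bucket)."""
    counts = Counter()
    for user_id, ts in events:
        bucket = ts - ts % interval_secs  # start of the time interval
        counts[(user_id, bucket)] += 1
    return counts


# Hypothetical input: 2 users with 10 events each inside the same hour.
events = [("u1", 1463848639 + i) for i in range(10)]
events += [("u2", 1463848639 + i) for i in range(10)]

aggregated = pre_aggregate(events, interval_secs=3600)
reduction = len(events) / len(aggregated)
```

Here 20 raw events collapse into 2 rows, a 10X reduction in writes for this sample.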

Page 26: Yahoo's Next Generation User Profile Platform

Pre-Aggregate features in Spark

- 5 billion app activities per day on 0.5 billion devices
- 1 billion search keywords per day on 0.06 billion devices
- Aggregate on user id for both features. One Spark job instead of two.
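
Combining both feature sources in one pass means each user gets a single merged record, and so a single write, instead of one per source. The sketch below shows the idea in plain Python; the field names `apps` and `keywords` and the list-valued features are hypothetical.

```python
from collections import defaultdict


def aggregate_features(app_activities, search_keywords):
    """Merge two (user_id, value) feeds into one record per user."""
    merged = defaultdict(lambda: {"apps": [], "keywords": []})
    for user_id, app in app_activities:
        merged[user_id]["apps"].append(app)
    for user_id, keyword in search_keywords:
        merged[user_id]["keywords"].append(keyword)
    return dict(merged)


apps = [("d1", "mail"), ("d1", "news"), ("d2", "weather")]
searches = [("d1", "flights"), ("d3", "pizza")]

records = aggregate_features(apps, searches)
writes = len(records)  # one write per user, not one per user per feed
```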

Page 27: Yahoo's Next Generation User Profile Platform

Lazy aggregation for inactive users

- Problem: read-modify-write is expensive
- Facts:
  - A large portion of the users might not be accessed frequently
  - Update jobs are not evenly distributed over time
- Solution: lazy aggregation for inactive users

Page 28: Yahoo's Next Generation User Profile Platform

Lazy aggregation for inactive users

- Maintain a set of users as active users
- Active users
  - Read-modify-write
- Inactive users
  - Append updates only
- Merging updates:
  - Batch job
  - Upon request

[Diagram. Recoverable labels: Spark; HBase; Active Users (r-m-w); Inactive Users (w, with appended update1 and update2, later merged by r-m-w).]
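
The scheme can be sketched as a small in-memory store. This is an illustration under assumptions: `LazyProfileStore` is a hypothetical class, profiles are simple counters, and how the active set is chosen is left out because the slides do not describe it.

```python
class LazyProfileStore:
    def __init__(self, active_users):
        self.active = set(active_users)
        self.merged = {}   # user -> aggregated counters
        self.pending = {}  # user -> list of unmerged updates

    def update(self, user, delta):
        if user in self.active:
            self._merge(user, [delta])  # read-modify-write immediately
        else:
            self.pending.setdefault(user, []).append(delta)  # append only

    def _merge(self, user, deltas):
        profile = self.merged.setdefault(user, {})
        for delta in deltas:
            for key, amount in delta.items():
                profile[key] = profile.get(key, 0) + amount

    def get(self, user):
        # Merge upon request: fold in any pending updates before reading.
        self._merge(user, self.pending.pop(user, []))
        return self.merged[user]

    def merge_all(self):
        # Batch job: fold in pending updates for every inactive user.
        for user in list(self.pending):
            self._merge(user, self.pending.pop(user))


store = LazyProfileStore(active_users={"a"})
store.update("a", {"clicks": 1})
store.update("b", {"clicks": 1})
store.update("b", {"clicks": 2})
pending_before = len(store.pending["b"])
profile_b = store.get("b")
```

Inactive users cost one cheap append per update, and the read-modify-write is paid once, either at read time or in the batch merge.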

Page 29: Yahoo's Next Generation User Profile Platform

Sequential read on Hive

- HBase to Hive
  - Sync data to Hive using HBase snapshots without impacting Region Servers.
  - Hive accesses the data using HBaseStorageHandler.
- Move sequential reads to Hive
  - User modeling
  - Audience insights

Page 30: Yahoo's Next Generation User Profile Platform

Agenda

- What is User Profile
- Architecture Evolution
- Schema Design
- Optimization
- Future Work

Page 31: Yahoo's Next Generation User Profile Platform

Future Work

- Explore Impala/Presto for better query performance.
- Expose API for incremental modeling capability.

Page 32: Yahoo's Next Generation User Profile Platform


Questions?

Page 33: Yahoo's Next Generation User Profile Platform


Appendix

Page 34: Yahoo's Next Generation User Profile Platform

More optimization

- Use as few column families as possible
- Turn off autoflush
- Throttle writes if necessary
- Compress data before sending to HBase
- Use Kryo for serialization
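
Two of these points, buffering writes instead of flushing each one and compressing payloads before sending, can be sketched together. This is not HBase client code: `BufferedWriter` and its `sink` callback are hypothetical, and JSON stands in for Kryo, which is a JVM serialization library.

```python
import json
import zlib


class BufferedWriter:
    def __init__(self, sink, flush_size):
        self.sink = sink              # callable that receives a list of (key, bytes)
        self.flush_size = flush_size  # number of buffered puts that triggers a flush
        self.buffer = []

    def put(self, key, record):
        # Serialize and compress on the client before anything is sent.
        payload = zlib.compress(json.dumps(record).encode("utf-8"))
        self.buffer.append((key, payload))
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)  # one round trip for the whole batch
            self.buffer = []


batches = []
writer = BufferedWriter(sink=batches.append, flush_size=2)
writer.put("0_284386766", {"age": 30})
writer.put("1_1877933007", {"gender": "f"})
writer.put("0_284386767", {"age": 41})
writer.flush()  # send whatever is left

first_key, first_payload = batches[0][0]
decoded = json.loads(zlib.decompress(first_payload).decode("utf-8"))
```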