AWS Best Practices and Architecture Design Patterns for Big Data :: 양승도 :: AWS Summit...

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

양승도 | Solutions Architect

May 17, 2016

AWS Best Practices and Architecture Design Patterns for Big Data

Agenda

§ Data growth & the evolution of analytics

§ Reference architecture

§ Which technology should I use?

§ Why?

§ How?

§ Customer story (MangoPlate)

§ Design patterns

Explosive growth of data

Volume

Velocity

Variety

The evolution of big data

Batch: reports

Real-time: alerts

Prediction: forecasts

Amazon Glacier, S3, DynamoDB, RDS, EMR, Amazon Redshift, Data Pipeline, Amazon Kinesis, CloudSearch, Kinesis-enabled app, Lambda, ML, SQS, ElastiCache, DynamoDB Streams

A flood of tools

Is there a cool reference architecture? Which tools should I use?

Why? How?

Architecture principles

Decouple the “data bus”
• Data → Store → Process → Answers

Use the right tool for the job
• Data structure, latency, throughput, access patterns

Use Lambda architecture ideas
• Immutable (append-only) log, batch/speed/serving layer

Leverage AWS managed services
• No/low admin

Be cost conscious
• Big data ≠ Big cost

Simplify big data processing…

Collect → Store → Analyze → Consume / Visualize

Time to answer (latency)

Throughput

Cost

Collect

Types of data

Transactions
• Database reads & writes (OLTP)
• Cache

Search
• Logs
• Streams

Files
• Log files (/var/log)
• Log collectors & frameworks

Streams
• Log records
• Sensors & IoT data

[Diagram: the Collect and Store stages. Sources: applications (mobile apps on iOS and Android, web apps), logging (Logstash), IoT. Data types: transactional data, search data, file data, stream data. Stores: database (SQL: Amazon RDS; NoSQL: Amazon DynamoDB; cache: Amazon ElastiCache), search (Amazon ES), file storage (Amazon S3, Amazon Glacier), stream storage.]

Store

Stream storage

[Diagram: the same Collect and Store view with stream storage filled in: Apache Kafka, Amazon Kinesis, Amazon DynamoDB. The other stores are unchanged: Amazon RDS (SQL), Amazon DynamoDB (NoSQL), Amazon ElastiCache (cache), Amazon ES (search), Amazon S3 and Amazon Glacier (file storage).]

Stream storage options

AWS managed services
• Amazon Kinesis → streams
• Amazon DynamoDB Streams → table + streams
• Amazon SQS → queue
• Amazon SNS → pub/sub

Do-It-Yourself
• Apache Kafka → stream

Stream storage

• Decouple producers and consumers
• Persistent buffer
• Collect multiple streams
• Preserve message ordering
• Streaming MapReduce
• Parallel consumption

[Diagram: records 1 to 4 of each color flow into Shard 1 / Partition 1 and Shard 2 / Partition 2 of a DynamoDB stream, Kinesis stream, or Kafka topic. Consumer 1 reports Count of Red = 4 and Count of Violet = 4; Consumer 2 reports Count of Blue = 4 and Count of Green = 4.]
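The shard/partition picture above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `shard_for`, `route`, and `consume` are hypothetical helpers, and the modulo-of-MD5 routing is a simplification (real services assign keys to shards in their own way). It shows that records with the same key land on one shard in arrival order, and that one consumer per shard can count in parallel.

```python
import hashlib
from collections import Counter, defaultdict


def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index with a stable hash (simplified)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def route(records, shard_count):
    """Append each (key, payload) record to its shard, preserving arrival order."""
    shards = defaultdict(list)
    for key, payload in records:
        shards[shard_for(key, shard_count)].append((key, payload))
    return shards


def consume(shard_records):
    """One consumer per shard: count records per key (the 'reduce' step)."""
    return Counter(key for key, _ in shard_records)


# Four records (1..4) for each color, interleaved as they would arrive.
colors = ("red", "violet", "blue", "green")
records = [(color, seq) for seq in range(1, 5) for color in colors]

shards = route(records, shard_count=2)
counts = {shard_id: consume(recs) for shard_id, recs in shards.items()}
```

Which colors share a shard depends on the hash, so the split may differ from the slide; the per-color totals and the per-key ordering do not.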

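Queues & Pub/Sub?

• Decouple producers and consumers/subscribers
• Persistent buffer
• Collect multiple streams
• No message ordering
• No parallel consumption for Amazon SQS
• Amazon SNS can deliver to multiple queues or AWS Lambda functions
• No streaming MapReduce

[Diagram: producers send to an Amazon SQS queue read by consumers; producers publish to an Amazon SNS topic, which delivers to an Amazon SQS queue, an AWS Lambda function, and a subscriber.]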

Which stream storage should I use?

Attribute | Amazon Kinesis | DynamoDB Streams | Amazon SQS / Amazon SNS | Kafka
Managed | Yes | Yes | Yes | No
Ordering | Yes | Yes | No | Yes
Delivery | at-least-once | exactly-once | at-least-once | at-least-once
Lifetime | 7 days | 24 hours | 14 days | Configurable
Replication | 3 AZ | 3 AZ | 3 AZ | Configurable
Throughput | No limit | No limit | No limit | ~ Nodes
Parallel clients | Yes | Yes | No (SQS) | Yes
MapReduce | Yes | Yes | No | Yes
Record size | 1 MB | 400 KB | 256 KB | Configurable
Cost | Low | Higher (table cost) | Low-Medium | Low (+admin)
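One way to use the comparison above is to encode a few of its rows and filter by requirements. The sketch below is illustrative only: the dictionary copies values from the table as presented in this 2016 talk (current service limits may differ), the combined SQS/SNS column is represented by its SQS values, and `candidates` is a hypothetical helper.

```python
# Values transcribed from the comparison table in this talk (2016).
# None means "configurable" rather than a fixed limit.
STREAM_STORES = {
    "Amazon Kinesis": {
        "managed": True, "ordering": True, "parallel_clients": True, "max_record_kb": 1024,
    },
    "DynamoDB Streams": {
        "managed": True, "ordering": True, "parallel_clients": True, "max_record_kb": 400,
    },
    "Amazon SQS": {
        "managed": True, "ordering": False, "parallel_clients": False, "max_record_kb": 256,
    },
    "Apache Kafka": {
        "managed": False, "ordering": True, "parallel_clients": True, "max_record_kb": None,
    },
}


def candidates(**requirements):
    """Return the stores whose attributes match every given requirement."""
    return sorted(
        name
        for name, attrs in STREAM_STORES.items()
        if all(attrs[key] == value for key, value in requirements.items())
    )


ordered_and_managed = candidates(managed=True, ordering=True)
unordered = candidates(ordering=False)
```

With these inputs, `ordered_and_managed` contains Amazon Kinesis and DynamoDB Streams, and `unordered` contains only Amazon SQS.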

File storage

[Diagram: the Collect and Store view again, with file data going to file storage (Amazon S3, Amazon Glacier). Also shown: Amazon RDS (SQL), Amazon DynamoDB (NoSQL), Amazon ElastiCache (cache), Amazon ES (search), and stream storage (Apache Kafka, Amazon Kinesis, Amazon DynamoDB). Sources: mobile apps (iOS, Android), web apps, logging (Logstash), IoT applications.]

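Why is Amazon S3 good for big data?

• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• No need to run compute clusters for storage (unlike HDFS)
• Can run Hadoop clusters on Amazon EC2 Spot Instances
• Multiple kinds of clusters (Spark, Hive, Presto) can use the same data at the same time
• Unlimited number of objects
• Designed for 99.999999999% durability
• High availability: survives an AZ failure
• Tiered storage (Standard, IA, Amazon Glacier) through lifecycle policies
• Security: SSL, client/server-side encryption at rest
• Low cost
• Very high bandwidth: no aggregate throughput limit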

Using S3 together with HDFS and Amazon Glacier…

• Use HDFS for very frequently accessed (hot) data

• Use Amazon S3 Standard for frequently accessed data

• Use Amazon S3 Standard – IA for infrequently accessed data

• Archive rarely accessed (cold) data in Amazon Glacier
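The tiering above (S3 Standard, then Standard – IA, then Amazon Glacier) is usually expressed as an S3 lifecycle rule. The sketch below only builds the rule as a Python dictionary, in the shape that boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`; it does not call AWS. The `tiering_rule` helper, the `logs/` prefix, and the 30/90-day thresholds are assumptions for illustration, not values from the talk.

```python
def tiering_rule(prefix, ia_after_days=30, glacier_after_days=90):
    """Build a lifecycle configuration that moves objects to colder tiers over time.

    The day thresholds are illustrative assumptions; pick them from real access patterns.
    """
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # Infrequently accessed data: S3 Standard - IA
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    # Cold data: archive to Amazon Glacier
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = tiering_rule("logs/")
```

Hot data on HDFS is outside the scope of a lifecycle rule; it lives on the cluster and is typically copied from or to S3 as needed.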

Database + search tier

[Diagram: the Collect and Store view again, highlighting the database + search tier: Amazon RDS (SQL), Amazon DynamoDB (NoSQL), Amazon ElastiCache (cache), Amazon ES (search). Also shown: file storage (Amazon S3, Amazon Glacier) and stream storage (Apache Kafka, Amazon Kinesis, Amazon DynamoDB). Sources: mobile apps (iOS, Android), web apps, logging (Logstash), IoT applications.]

Database + search tier anti-pattern

Best practice: use the right tool for the job

Data tier (database + search tier)
• Search: Amazon Elasticsearch Service, Amazon CloudSearch
• Cache: Redis, Memcached
• SQL: Amazon Aurora, MySQL, MariaDB, PostgreSQL, Oracle, SQL Server
• NoSQL: Cassandra, Amazon DynamoDB, HBase, MongoDB

Concrete examples

Data structure and access patterns

Access pattern | What to use?
Put/Get (key, value) | Cache, NoSQL
Simple relationships → 1:N, M:N | NoSQL
Cross-table joins, transactions, SQL | SQL
Faceting, search | Search

Data structure | What to use?
Fixed schema | SQL, NoSQL
Schema-free (JSON) | NoSQL, Search
(Key, value) | Cache, NoSQL
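The two lookup tables above can be combined: a store type is a reasonable candidate when it fits both the access pattern and the data structure. The sketch below is a minimal illustration; the dictionary keys and the `suggest` helper are hypothetical names, and the values are copied from the tables.

```python
# Rows of the "access pattern" table.
ACCESS_PATTERN_TO_STORE = {
    "put_get_key_value": ["Cache", "NoSQL"],
    "simple_relationships": ["NoSQL"],
    "joins_transactions_sql": ["SQL"],
    "faceting_search": ["Search"],
}

# Rows of the "data structure" table.
DATA_STRUCTURE_TO_STORE = {
    "fixed_schema": ["SQL", "NoSQL"],
    "schema_free_json": ["NoSQL", "Search"],
    "key_value": ["Cache", "NoSQL"],
}


def suggest(access_pattern, data_structure):
    """Return store types that fit both the access pattern and the data structure."""
    by_access = ACCESS_PATTERN_TO_STORE[access_pattern]
    by_structure = set(DATA_STRUCTURE_TO_STORE[data_structure])
    return [store for store in by_access if store in by_structure]


kv_choice = suggest("put_get_key_value", "key_value")
search_choice = suggest("faceting_search", "schema_free_json")
conflict = suggest("joins_transactions_sql", "schema_free_json")
```

An empty result, as in `conflict`, suggests the workload may need more than one store, which is the point of the "right tool for the job" practice.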

Data / access temperature?

Attribute | Hot | Warm | Cold
Data volume | MB–GB | GB–TB | PB
Item size | B–KB | KB–MB | KB–TB
Latency | ms | ms, sec | min, hrs
Durability | Low–High | High | Very High
Request rate | Very High | High | Low
Cost/GB | $$-$ | $-¢¢ | ¢

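Data / access characteristics: Hot, Warm, Cold

[Chart: Cache, NoSQL, SQL, Search, and Glacier placed along hot-to-cold axes. Request rate: high → low. Cost/GB: high → low. Latency: low → high. Data volume: low → high. A second axis shows structure from low to high.]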

Which data store should I use?

Attribute | Amazon ElastiCache | Amazon DynamoDB | Amazon Aurora | Amazon Elasticsearch | Amazon EMR (HDFS) | Amazon S3 | Amazon Glacier
Average latency | ms | ms | ms, sec | ms, sec | sec, min, hrs | ms, sec, min (~ size) | hrs
Data volume | GB | GB–TBs (no limit) | GB–TB (64 TB max) | GB–TB | GB–PB (~nodes) | MB–PB (no limit) | GB–PB (no limit)
Item size | B–KB | KB (400 KB max) | KB (64 KB) | KB (1 MB max) | MB–GB | KB–GB (5 TB max) | GB (40 TB max)
Request rate | High – Very High | Very High (no limit) | High | High | Low – Very High | Low – Very High (no limit) | Very Low
Storage cost (GB/month) | $$ | ¢¢ | ¢¢ | ¢¢ | ¢ | ¢ | ¢/10
Durability | Low – Moderate | Very High | Very High | High | High | Very High | Very High

Columns run from hot data (left) through warm data to cold data (right).

Analyze

[Diagram: Collect → Store → Analyze. The Store stage is as before: Amazon ElastiCache (cache), Amazon DynamoDB (NoSQL), Amazon RDS (SQL), Amazon ES (search), stream storage (Apache Kafka, Amazon Kinesis, Amazon DynamoDB), and file storage (Amazon S3, Amazon Glacier), labeled from hot to warm to cold. The Analyze stage adds stream processing (Amazon Kinesis, AWS Lambda, streaming on Amazon Elastic MapReduce), batch (Pig on Amazon Elastic MapReduce), interactive (Amazon Redshift, Impala), and ML (Amazon ML).]

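Process / analyze

Data analysis is the process of inspecting, cleaning, transforming, and modeling data in order to discover useful information, suggest conclusions, and support decision making.

Examples
• Interactive dashboards → interactive analytics
• Daily/weekly/monthly reports → batch analytics
• Payment/fraud alerts, 1-minute metrics → real-time analytics
• Sentiment analysis, prediction models → machine learning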

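Interactive analytics

Large volumes of (warm/cold) data
Takes seconds to get an answer

Example: self-service dashboards

Batch analytics

Large volumes of (warm/cold) data
Takes minutes or hours to get an answer

Example: generating daily, weekly, and monthly reports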

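Real-time analytics

Small amounts of hot data and questions
Takes a short time to get an answer (milliseconds or seconds)

Real-time (event)
• Real-time response to events in data streams
• Example: payment/fraud alerts

Near real-time (micro-batch)
• Near real-time operation on small batches of events in data streams
• Example: 1-minute metrics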
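The "1-minute metrics" micro-batch example can be sketched as a tumbling-window count. This is a minimal, framework-free illustration: `one_minute_counts` is a hypothetical helper, and the event names and timestamps are made up. A real deployment would use a stream processor such as Spark Streaming or the KCL rather than an in-memory loop.

```python
from collections import defaultdict


def one_minute_counts(events, window_seconds=60):
    """Group (epoch_seconds, metric_name) events into fixed windows and count them."""
    windows = defaultdict(lambda: defaultdict(int))
    for timestamp, name in events:
        # Align each event to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        windows[window_start][name] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


# Hypothetical events: (seconds since start, metric name).
events = [
    (0, "payment"),
    (59, "payment"),
    (60, "payment"),
    (61, "fraud_alert"),
    (125, "payment"),
]

result = one_minute_counts(events)
```

Here the events fall into three windows starting at 0, 60, and 120 seconds.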

Prediction through machine learning

Machine learning (ML) gives computers the ability to learn without being explicitly programmed

Machine learning algorithms:

Supervised learning ← “teach” program
- Classification ← Is this transaction fraud? (Yes/No)
- Regression ← Customer life-time value?

Unsupervised learning ← let it learn by itself
- Clustering ← Market segmentation

Machine learning
• Mahout, Spark ML, Amazon ML

Interactive analytics
• Amazon Redshift, Presto, Impala, Spark

Batch analytics
• MapReduce, Hive, Pig, Spark

Stream processing
• Micro-batch: Spark Streaming, KCL, Hive, Pig
• Real-time: Storm, AWS Lambda, KCL

Analytics tools and frameworks

[Diagram: the Analyze stage. Stream processing: Amazon Kinesis, AWS Lambda, streaming on Amazon Elastic MapReduce. Batch: Pig on Amazon Elastic MapReduce. Interactive: Amazon Redshift, Impala. ML: Amazon Machine Learning.]

Which data processing technology should I use?

Attribute | Spark Streaming | Apache Storm | Amazon Kinesis Client Library | AWS Lambda | Amazon EMR (Hive, Pig)
Scale / throughput | ~ Nodes | ~ Nodes | ~ Nodes | Automatic | ~ Nodes
Batch or real-time | Real-time | Real-time | Real-time | Real-time | Batch
Manageability | Yes (Amazon EMR) | Do it yourself | Amazon EC2 + Auto Scaling | AWS managed | Yes (Amazon EMR)
Fault tolerance | Single AZ | Configurable | Multi-AZ | Multi-AZ | Single AZ
Programming languages | Java, Python, Scala | Any language via Thrift | Java, via MultiLangDaemon (.NET, Python, Ruby, Node.js) | Node.js, Java, Python | Hive, Pig, streaming languages

Which data processing technology should I use?

Attribute | Amazon Redshift | Impala | Presto | Spark | Hive
Query latency | Low | Low | Low | Low | Medium (Tez) – High (MapReduce)
Durability | High | High | High | High | High
Data volume | 1.6 PB max | ~Nodes | ~Nodes | ~Nodes | ~Nodes
Managed | Yes | Yes (EMR) | Yes (EMR) | Yes (EMR) | Yes (EMR)
Storage | Native | HDFS / S3A* | HDFS / S3 | HDFS / S3 | HDFS / S3
SQL compatibility | High | Medium | High | Low (SparkSQL) | Medium (HQL)

What about ETL?

[Diagram: ETL sits between Store and Analyze.]

https://aws.amazon.com/big-data/partner-solutions/

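Consume / visualize

Consume
• Predictions
• Analysis and visualization
• IDE
• Applications & APIs

[Diagram: Store → ETL → Analyze → Consume. The Consume stage shows analysis & visualization (Amazon QuickSight), notebooks, IDE, predictions, and apps & APIs, used by business users and by data scientists and developers.]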

Reference architecture

Collect → Store → Analyze → Consume

[Diagram: the full reference architecture.
Collect: applications (mobile apps on iOS and Android, web apps), logging (Logstash), IoT; data types are transactional, search, file, and stream data.
Store: Amazon ElastiCache (cache), Amazon DynamoDB (NoSQL), Amazon RDS (SQL), Amazon ES (search), stream storage (Apache Kafka, Amazon Kinesis, Amazon DynamoDB), file storage (Amazon S3, Amazon Glacier), labeled from hot to warm to cold.
ETL between Store and Analyze.
Analyze: stream processing (Amazon Kinesis, AWS Lambda, streaming on Amazon Elastic MapReduce), batch (Pig on Amazon Elastic MapReduce), interactive (Amazon Redshift, Impala), ML (Amazon ML), labeled from fast to slow.
Consume: analysis & visualization (Amazon QuickSight), notebooks, IDE, predictions, apps & APIs.]


유호석 | CTO

May 17, 2016

Customer Story: MangoPlate

MangoPlate's Redshift-based analytics and recommendation system

About MangoPlate

A service that helps people find great places to eat, quickly and easily

MangoPlate's growth (as of now)

Cumulative downloads: 2.5 million+

MAU: 1.8 million+

Monthly page views: 20 million+

Recommendation and analytics work at MangoPlate

• Recommendation engine
- Restaurant-to-restaurant similarity
- User-to-user similarity
- User-to-restaurant similarity

• Fraud detection
- Fake review/user identification
- User/review/picture scoring
- Restaurant rating

• User behavior analysis
- Web/app user mapping
- Retention queries
- User segmentation/testing

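Pain points as the service grew

• As the service grew, computations took longer and longer on the existing system

• More sophisticated recommendation and rating algorithms made the analysis queries more complex

• The data we wanted to analyze was scattered everywhere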

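Which solution should we use?

• Our team's situation
- Too few developers to write separate analysis scripts
- 70% of the algorithms depend on queries
- All logs needed for analysis are collected and stored in S3

• We decided to adopt Redshift

A fast, powerful, petabyte-scale, SQL-based data warehouse service that runs in the cloud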

Advantages of Redshift

• Scales easily up to petabytes

• Fast computation

• Low price
- With dc1.large, you can start at about 200,000 KRW per month

• Supports standard SQL and a wide range of analytics functions

Adopting AWS and Redshift step by step

2015: Domestic cloud service
• VM instances
• MySQL
• Redis

+ AWS Tokyo Region
• S3
• SNS
• SQS
• Redshift

2016: AWS Seoul Region
• EC2
• RDS
• ElastiCache
• VPC
• Route53
• S3
• SNS
• SQS
• Redshift

Migration consulting & technical support

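MangoPlate's architecture: all the data that needs analysis in one place

[Diagram: Collection → Store → Analytics → Visualization & Consume. Components shown: AWS EC2, AWS RDS, AWS S3, AWS Redshift. SQL DB records are written to raw files in AWS S3 and loaded into AWS Redshift with "copy table".]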
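The "copy table" step from S3 into Redshift is typically a `COPY` statement. The sketch below only builds the SQL text; it does not connect to a cluster. Everything specific is an assumption for illustration: the `build_copy_statement` helper, the `user_logs` table, the bucket path, the IAM role ARN, and the choice of JSON input are hypothetical, since the talk does not give these details.

```python
def build_copy_statement(table, s3_path, iam_role_arn):
    """Return a Redshift COPY statement that loads JSON files from S3 into a table."""
    # Guard against accidental injection through the table name.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS JSON 'auto';"
    )


# Hypothetical names, used only to show the shape of the statement.
sql = build_copy_statement(
    table="user_logs",
    s3_path="s3://example-bucket/logs/2016/05/17/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
```

The resulting string would be run through any SQL client connected to the cluster; loading per-day prefixes like this keeps each load small and repeatable.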

What got better? Faster analysis

• Algorithm queries
- Restaurant similarity: 600 s → 80 s (7.5×)
- Restaurant/user recommendation: 720 s → 80 s (9×)

• Retention queries
- Base table: 1200 s → 60 s (20×)
- Main: 2400 s → 200 s (12×)

What got better? Simpler analysis queries

• Applied analytic functions (window functions)
- median, dense_rank
- ntile, stddev_samp/stddev_pop

• Easy analysis of log tables with JSON functions
- json_extract_path_text
- json_extract_array_element_text
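For readers unfamiliar with the window functions named above, the following Python sketch mimics what `dense_rank` and `ntile` compute. It is an illustration of the semantics only, not how Redshift implements them, and the rating values are made up.

```python
def dense_rank(values, descending=True):
    """Rank values so that ties share a rank and no rank numbers are skipped."""
    distinct = sorted(set(values), reverse=descending)
    rank_of = {value: index + 1 for index, value in enumerate(distinct)}
    return [rank_of[value] for value in values]


def ntile(ordered_rows, buckets):
    """Assign already-ordered rows to `buckets` groups; earlier groups get the remainder."""
    base, extra = divmod(len(ordered_rows), buckets)
    assignment = []
    for bucket in range(buckets):
        size = base + (1 if bucket < extra else 0)
        assignment.extend([bucket + 1] * size)
    return assignment


# Hypothetical restaurant ratings.
ratings = [4.5, 4.8, 4.5, 3.9, 4.8]

ranks = dense_rank(ratings)
halves = ntile(sorted(ratings, reverse=True), 2)
```

In SQL the same results come from a single `OVER (ORDER BY ...)` clause, which is why these functions shorten queries that would otherwise need self-joins or application code.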

Managed Service by

Thank you!
https://www.mangoplate.com/career

Design patterns

§ Interactive analytics: interactive dashboards

§ Batch analytics: daily/weekly/monthly reports

§ Real-time analytics: payment/fraud alerts, 1-minute metrics

§ Machine learning: sentiment analysis, prediction models

Decoupled storage in multi-stage processing

[Diagram: Store → Process → Store → Process]

Decouple the “data bus”

Multiple processing applications (or connectors) can read from and write to multiple data stores

[Diagram: store/process. Components shown: Amazon Kinesis, AWS Lambda, Amazon Kinesis S3 Connector, Amazon S3, Amazon DynamoDB.]

Processing frameworks (KCL, Storm, Hive, Spark, etc.) can read from multiple data stores

[Diagram: store/process. The same components, with Hive, Spark, and Storm added as processing frameworks.]
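As a concrete instance of one processing application reading from the data bus, here is a minimal sketch of an AWS Lambda handler for a Kinesis event source. Kinesis delivers each record's payload base64-encoded under `Records[].kinesis.data`; the JSON message format, the `type` field, and the `fake_event` helper are assumptions for illustration. Writing results to Amazon S3 or Amazon DynamoDB is left out so the example stays runnable without AWS.

```python
import base64
import json
from collections import Counter


def handler(event, context=None):
    """Decode Kinesis records from a Lambda event and count messages by type."""
    counts = Counter()
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        message = json.loads(payload)
        # "type" is a hypothetical field in the producer's JSON messages.
        counts[message.get("type", "unknown")] += 1
    return dict(counts)


def fake_event(messages):
    """Build a local test event in the shape Lambda receives from Kinesis."""
    return {
        "Records": [
            {"kinesis": {"data": base64.b64encode(json.dumps(m).encode("utf-8")).decode("ascii")}}
            for m in messages
        ]
    }


summary = handler(fake_event([{"type": "click"}, {"type": "click"}, {"type": "view"}, {}]))
```

Because the stream is decoupled from its consumers, a second application (for example a connector that archives raw records to S3) can read the same stream without changing this handler.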

Data temperature vs. processing latency

[Chart: x-axis is data temperature (hot → cold); y-axis is processing latency (low → high), leading to answers. Data stores: Amazon Kinesis, Apache Kafka, Amazon DynamoDB, Amazon S3, Amazon EMR (HDFS). Processing: Spark Streaming, Apache Storm, AWS Lambda, KCL, Amazon Redshift, Spark, Impala, Presto, Hive, plus native access from applications; Hive sits at the batch end.]

Real-time analytics

[Diagram: store/process.
Producer → stream storage: Apache Kafka, Amazon Kinesis, DynamoDB Streams.
Process: KCL, AWS Lambda, Spark Streaming, Apache Storm, with Amazon ML for real-time prediction.
Outputs: Amazon SNS (notifications, alerts); Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, Amazon ES (app state, KPI).]

Interactive & batch analytics

[Diagram: store/process.
Producer → Amazon S3.
Batch: Amazon EMR (Hive, Pig, Spark).
Interactive: Amazon Redshift, Amazon EMR (Presto, Impala, Spark).
Amazon ML: batch prediction and real-time prediction.
Results go to Consume.]

Lambda architecture

[Diagram: store/process.
DATA → Amazon Kinesis.
Batch layer: Amazon Kinesis S3 Connector → Amazon S3 → Amazon Redshift, Amazon EMR (Presto, Hive, Pig, Spark) → ANSWER.
Speed layer: KCL, AWS Lambda, Spark Streaming, Storm, with Amazon ML → ANSWER.
Serving layer: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon ES → ANSWER, consumed by applications.]
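The three layers can be illustrated with a small in-memory sketch. It is a toy model under stated assumptions: the class name and methods are hypothetical, a Python list stands in for the immutable log (Amazon Kinesis and Amazon S3 in the diagram), and counting events per key stands in for real batch and stream jobs.

```python
from collections import Counter


class LambdaArchitectureSketch:
    """Toy model: append-only log, batch view, speed view, and a merged serving query."""

    def __init__(self):
        self.log = []                # immutable, append-only master dataset
        self.batch_view = Counter()  # result of the last batch run
        self.batch_offset = 0        # number of log entries the batch view covers

    def append(self, key):
        """Record a new event; existing entries are never modified."""
        self.log.append(key)

    def run_batch(self):
        """Batch layer: recompute the view from the full log."""
        self.batch_view = Counter(self.log)
        self.batch_offset = len(self.log)

    def speed_view(self):
        """Speed layer: cover only events the batch layer has not processed yet."""
        return Counter(self.log[self.batch_offset:])

    def query(self, key):
        """Serving layer: merge the batch and speed views to answer a query."""
        return self.batch_view[key] + self.speed_view()[key]


arch = LambdaArchitectureSketch()
for event in ("a", "b", "a"):
    arch.append(event)
before_batch = arch.query("a")   # answered entirely by the speed layer

arch.run_batch()
arch.append("a")
after_batch = arch.query("a")    # batch view plus one recent event
```

Because the batch view is always recomputed from the full log, a bug in the speed layer is corrected by the next batch run; that is the main reason for keeping the log immutable.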

Summary

Decouple the “data bus”
• Data → Store → Process → Answers

Use the right tool for the job
• Data structure, latency, throughput, access patterns

Use Lambda architecture ideas
• Immutable (append-only) log, batch/speed/serving layer

Leverage AWS managed services
• No/low admin

Be cost conscious
• Big data ≠ Big cost

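We look forward to your feedback!

https://www.awssummit.co.kr

Visit the mobile page and rate this session now to receive a souvenir after the event.

Share your thoughts on the event on social media with the #AWSSummit hashtag.

Presentation slides and recorded videos will be shared soon on the official AWS Korea social channels.

Thank you!
aws.amazon.com/big-data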
