Apache Falcon: 22 Sept 2014 for Hadoop User Group France (@Criteo)



DESCRIPTION

Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo) By Cedric Carbone (@carbone) and JB Onofre (@jbonofre) #HUGFR

TRANSCRIPT

Page 1: Apache Falcon : 22 Sept 2014 for Hadoop User Group France (@Criteo)

© Talend 2014 1

HUG France - 22 Sept 2014 - @Criteo

Data Management platform for Hadoop

This material is made available under the terms of the Creative Commons Attribution - NonCommercial - NoDerivs 2.0 France license. - http://creativecommons.org/licenses/by-nc-nd/2.0/fr/

Cédric Carbone, Talend CTO, @carbone
Jean-Baptiste Onofré, Falcon Committer, @jbonofre

Page 2:

Overview

• Falcon is a Data Management solution for Hadoop

• Falcon in production at InMobi since 2012

• InMobi donated Falcon to the ASF in April 2013

• Falcon is currently in the Apache Incubator

• Falcon is bundled by default with HDP

• Falcon leverages many Apache components: Oozie, Ambari, ActiveMQ, HCatalog, Sqoop…

• Committer/PPMC/IPMC:
- #8 InMobi
- #5 Hortonworks
- #1 Talend

Page 3:

Why Falcon?

Page 4:

Why Falcon?

Page 5:

What is Falcon?

• Data Motion: import, export, CDC

• Policy-based Lifecycle Management: retention, replication, archival, anonymization of PII data

• Process Orchestration and Scheduling: late data handling, reprocessing, dependency checking, etc.; multi-cluster management to support local/global aggregations, rollups, etc.

• Data Governance: lineage, audit, SLA

Page 6:

Falcon - The Solution!

• Introduces a higher layer of abstraction, the Data Set: decouples a data location and its properties from workflows. Understanding the lifetime of a feed allows implicit validation of the processing rules.

• Provides the key services for data-processing apps: common data services become simple directives, with no need to define them verbosely in each job; process owners keep their processing specific to their application logic; Falcon sits in the execution path and intercepts to handle out-of-band (OOB) data, retries, etc.

• Promotes polyglot programming: does no heavy lifting itself but delegates to tools within the Hadoop ecosystem.

Page 7:

Falcon Basic Concepts : Data Pipelines

• Cluster: represents a Hadoop cluster

• Feed: Defines a “dataset”

• Process: Consumes feeds, invokes processing logic & produces feeds

• Entity Actions: submit, list, dependency, schedule, suspend, resume, status, definition, delete, update
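Each of these entity actions maps onto Falcon's REST API (the falcon CLI wraps the same calls). Below is a minimal sketch of the URL scheme, assuming a Falcon server on its default port 15000; the entity names are only illustrative:

```python
# Hedged sketch: builds the Falcon REST URLs for the entity actions
# listed above. Assumes a server at localhost:15000 (default port).
BASE = "http://localhost:15000/api/entities"

def entity_url(action, entity_type, name=None):
    """Build the REST URL for an entity action (submit, schedule, status, ...)."""
    url = f"{BASE}/{action}/{entity_type}"
    if name is not None:  # actions like schedule/status address one named entity
        url += f"/{name}"
    return url

# submit takes the entity XML as the request body, so no name in the path:
submit = entity_url("submit", "feed")
# schedule/suspend/resume/status/delete address an already-submitted entity:
schedule = entity_url("schedule", "process", "process-test")
```

The same action vocabulary is what the CLI exposes (e.g. submitting a definition file, then scheduling the entity by name).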

Page 8:

Falcon Entity Relationships

Page 9:

Cluster Entity

<?xml version="1.0"?>
<cluster colo="talend-datacenter" description="" name="prod-cluster">
  <interfaces>
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0"/>
    <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0"/>
    <interface type="execute" endpoint="rm:8050" version="2.2.0"/>
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0"/>
    <interface type="registry" endpoint="thrift://hms:9083" version="0.12.0"/>
    <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/prod-cluster/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/apps/falcon/prod-cluster/working"/>
  </locations>
</cluster>

- readonly: needed by DistCp for replications
- write: writing to HDFS
- execute: used to submit processes as MapReduce jobs
- workflow: submit Oozie jobs
- registry: Hive metastore, to register/deregister partitions and get events on partition availability
- messaging: used for alerts
- locations: HDFS directories used by the Falcon server

Page 10:

Feed Entity

<?xml version="1.0"?>
<feed description="" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <late-arrival cut-off="hours(6)"/>
  <groups>churnAnalysisFeeds</groups>
  <clusters>
    <cluster name="cluster-primary" type="source">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="cluster-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>

- frequency: feed run frequency in minutes/hours/days/months
- late-arrival: late-arrival cut-off
- groups: feeds can belong to multiple groups
- clusters: one or more source & target clusters for retention & replication
- locations: global location across clusters, HDFS paths or Hive tables
- ACL: access permissions
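These fields can also be read programmatically. A minimal sketch that pulls the frequency and the per-cluster retention out of a trimmed-down copy of the testFeed definition, using only the Python standard library:

```python
# Minimal sketch: parse a (shortened) Falcon feed entity and extract
# the frequency and per-cluster retention limits.
import xml.etree.ElementTree as ET

NS = "{uri:falcon:feed:0.1}"  # feed entity XML namespace

FEED_XML = """<?xml version="1.0"?>
<feed name="testFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="cluster-primary" type="source">
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="cluster-secondary" type="target">
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>
</feed>"""

root = ET.fromstring(FEED_XML)
frequency = root.findtext(f"{NS}frequency")  # "hours(1)"
# map each cluster name to its retention limit:
retention = {c.get("name"): c.find(f"{NS}retention").get("limit")
             for c in root.iter(f"{NS}cluster")}
```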

Page 11:

Process Entity

<?xml version="1.0"?>
<process name="process-test" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="cluster-primary">
      <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>days(1)</frequency>
  <inputs>
    <input end="today(0,0)" start="today(0,0)" feed="feed-clicks-raw" name="input"/>
  </inputs>
  <outputs>
    <output instance="now(0,2)" feed="feed-clicks-clean" name="output"/>
  </outputs>
  <workflow engine="pig" path="/apps/clickstream/clean-script.pig"/>
  <retry policy="periodic" delay="minutes(10)" attempts="3"/>
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="input" workflow-path="/apps/clickstream/late"/>
  </late-process>
</process>

- parallel/order/frequency: how frequently the process runs, how many instances can run in parallel, and in what order
- clusters/validity: which cluster the process should run on, and when
- inputs/outputs: input & output feeds for the process
- workflow: the processing logic
- retry: retry policy on failure
- late-process: handling late input feeds
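Feeds and processes express every duration (frequency, retention limit, late-arrival cut-off, retry delay) in the same unit(n) notation. A small sketch of a parser for that notation (months are calendar-dependent and left out of the sketch):

```python
# Minimal sketch: turn Falcon duration expressions like "minutes(10)",
# "hours(1)" or "days(2)" into datetime.timedelta values.
import re
from datetime import timedelta

def parse_duration(expr):
    """Parse e.g. 'minutes(10)' into timedelta(minutes=10)."""
    m = re.fullmatch(r"(minutes|hours|days)\((\d+)\)", expr.strip())
    if not m:
        raise ValueError(f"unsupported duration: {expr!r}")
    return timedelta(**{m.group(1): int(m.group(2))})

parse_duration("minutes(10)")  # retry delay from the process entity above
parse_duration("days(2)")      # retention limit from the feed entity
```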

Page 12:

High Level Architecture

Page 13:

Demo #1 : my first feed

• Start Falcon on a single cluster

• Submission of one simple feed

• Check the generated job in Oozie

Page 14:

Replication & Retention

• Sophisticated retention policies expressed in one place
• Simplified data retention for audit, compliance, or data re-processing

Page 15:

Demo #2: process notifications

Motivation: be notified about process activity “outside” of the cluster

Falcon manages the workflow and sends a JMS notification; a Camel route reacts to the notification

Result: an action (a Camel route) is triggered whenever Falcon sends a notification (eviction, late arrival, process execution, …)

Mixes Big Data/Hadoop and ESB technologies

Page 16:

Demo #2: workflow

Falcon → ActiveMQ → Camel

2014-03-19 11:25:43,273 | INFO | LCON.my-process] | process-listener | rg.apache.camel.util.CamelLogger 176 | 74 - org.apache.camel.camel-core - 2.13.0.SNAPSHOT | Exchange[ExchangePattern: InOnly, BodyType: java.util.HashMap, Body: {brokerUrl=tcp://localhost:61616, timeStamp=2014-03-19T10:24Z, status=SUCCEEDED, logFile=hdfs://localhost:8020/falcon/staging/falcon/workflows/process/my-process/logs/instancePaths-2013-11-15-06-05.csv, feedNames=output, runId=0, entityType=process, nominalTime=2013-11-15T06:05Z, brokerTTL=4320, workflowUser=null, entityName=my-process, feedInstancePaths=hdfs://localhost:8020/data/output, operation=GENERATE, logDir=null, workflowId=0000026-140319105443372-oozie-jbon-W, cluster=local, brokerImplClass=org.apache.activemq.ActiveMQConnectionFactory, topicName=FALCON.my-process}]
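A consumer on the other side of ActiveMQ only needs a few fields from that map-shaped body. A minimal sketch, assuming the body arrives as the flat {key=value, …} string shown in the log (the sample below is shortened and illustrative):

```python
# Minimal sketch: extract fields such as status and entityName from a
# Falcon JMS message body rendered as "{k1=v1, k2=v2, ...}".
def parse_body(body):
    """Split '{k1=v1, k2=v2, ...}' into a dict (values contain no ', ')."""
    pairs = body.strip("{}").split(", ")
    return dict(pair.split("=", 1) for pair in pairs)

msg = parse_body("{status=SUCCEEDED, entityType=process, "
                 "entityName=my-process, operation=GENERATE, runId=0}")
if msg["status"] == "SUCCEEDED":
    pass  # e.g. trigger the downstream action the Camel route implements
```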

Page 18:

Data Pipeline Tracing

Page 19:

Topologies

• STANDALONE: single data center, single Falcon server; Hadoop jobs and relevant processing involve only one cluster

• DISTRIBUTED: multiple data centers, one Falcon server per DC; multiple instances of Hadoop clusters and workflow schedulers

Page 20:

Multi-cluster failover

Motivation: replicate a subset of data from one cluster to another, and guarantee different eviction policies depending on the data subset

Falcon manages the workflow on the primary cluster, and the replication to the failover cluster

Result: supports business continuity without requiring full data reprocessing

Page 21:

Roadmap

• Improvements to the late-arrival messages in FALCON.ENTITY.TOPIC (ActiveMQ)

• Feed-implicit processes (no process entity needed), providing “native” CDC

• Straightforward MapReduce usage (without Pig, Oozie workflow XML, …)

• More data acquisition

• Monitoring/management/designer dashboard

• TLP!

Page 22:

Questions?

Data Management platform for Hadoop

This material is made available under the terms of the Creative Commons Attribution - NonCommercial - NoDerivs 2.0 France license. - http://creativecommons.org/licenses/by-nc-nd/2.0/fr/

Cédric Carbone, Talend CTO, @carbone
Jean-Baptiste Onofré, Falcon Committer, @jbonofre