Apache Falcon - Hadoop User Group France, 22 Sept 2014


DESCRIPTION

Apache Falcon slides from the 22 Sept 2014 Hadoop meetup @Criteo, by @carbone & @jbonofre: Data Management Platform for Hadoop.

TRANSCRIPT

© Talend 2014 1

HUG France - 22 Sept 2014 - @Criteo

Data Management platform for Hadoop

This material is made available under the terms of the Creative Commons Attribution - NonCommercial - NoDerivs 2.0 France license: http://creativecommons.org/licenses/by-nc-nd/2.0/fr/

Cédric Carbone, Talend CTO, @carbone
Jean-Baptiste Onofré, Falcon Committer, @jbonofre

© Talend 2014 2

Overview

• Falcon is a Data Management solution for Hadoop

• Falcon in production at InMobi since 2012

• InMobi donated Falcon to the ASF in April 2013

• Falcon is in the Apache Incubator

• Falcon is embedded by default in HDP

• Falcon leverages many Apache components: Oozie, Ambari, ActiveMQ, HCat, Sqoop…

• Committers / PPMC / IPMC members:
- InMobi: 8
- Hortonworks: 5
- Talend: 1

© Talend 2014 3

Why Falcon?

© Talend 2014 4

Why Falcon?

© Talend 2014 5

What is Falcon?

• Data Motion: import, export, CDC

• Policy-based Lifecycle Management: retention, replication, archival, anonymization of PII data

• Process orchestration and scheduling: late data handling, reprocessing, dependency checking, etc.; multi-cluster management to support local/global aggregations, rollups, etc.

• Data Governance: lineage, audit, SLA

© Talend 2014 6

Falcon - The Solution!

• Introduces a higher layer of abstraction: the Data Set
- Decouples a data location and its properties from workflows
- Understanding the lifetime of a feed allows implicit validation of the processing rules

• Provides the key services for data processing apps
- Common data services are simple directives; no need to define them verbosely in each job
- Allows process owners to keep their processing specific to their application logic
- Sits in the execution path and intercepts to handle out-of-band data, retries, etc.

• Promotes polyglot programming
- Does no heavy lifting itself but delegates to tools within the Hadoop ecosystem

© Talend 2014 7

Falcon Basic Concepts : Data Pipelines

• Cluster: Represents the Hadoop cluster

• Feed: Defines a “dataset”

• Process: Consumes feeds, invokes processing logic & produces feeds

• Entity Actions: submit, list, dependency, schedule, suspend, resume, status, definition, delete, update

© Talend 2014 8

Falcon Entity Relationships

© Talend 2014 9

Cluster Entity

<?xml version="1.0"?>
<cluster colo="talend-datacenter" description="" name="prod-cluster" xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0" />
    <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0" />
    <interface type="execute" endpoint="rm:8050" version="2.2.0" />
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" />
    <interface type="registry" endpoint="thrift://hms:9083" version="0.12.0" />
    <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" />
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/prod-cluster/staging" />
    <location name="temp" path="/tmp" />
    <location name="working" path="/apps/falcon/prod-cluster/working" />
  </locations>
</cluster>

- readonly interface: needed by distcp for replications
- write interface: writing to HDFS
- execute interface: used to submit processes as MapReduce jobs
- workflow interface: used to submit Oozie jobs
- registry interface: Hive metastore, to register/deregister partitions and get events on partition availability
- messaging interface: used for alerts
- locations: HDFS directories used by the Falcon server

© Talend 2014 10

Feed Entity

<?xml version="1.0"?>
<feed description="" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <late-arrival cut-off="hours(6)"/>
  <groups>churnAnalysisFeeds</groups>
  <clusters>
    <cluster name="cluster-primary" type="source">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="cluster-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>

- frequency: feed run frequency in minutes/hours/days/months
- late-arrival: late arrival cut-off
- groups: feeds can belong to multiple groups
- clusters: one or more source & target clusters for retention & replication
- locations: global location across clusters - HDFS paths or Hive tables
- ACL: access permissions

© Talend 2014 11

Process Entity

<process name="process-test" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="cluster-primary">
      <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z" />
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>days(1)</frequency>
  <inputs>
    <input end="today(0,0)" start="today(0,0)" feed="feed-clicks-raw" name="input" />
  </inputs>
  <outputs>
    <output instance="now(0,2)" feed="feed-clicks-clean" name="output" />
  </outputs>
  <workflow engine="pig" path="/apps/clickstream/clean-script.pig" />
  <retry policy="periodic" delay="minutes(10)" attempts="3"/>
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="input" workflow-path="/apps/clickstream/late" />
  </late-process>
</process>

- clusters: which cluster the process should run on and when
- parallel / order / frequency: how frequently the process runs, how many instances can run in parallel, and in what order
- inputs / outputs: input & output feeds for the process
- workflow: the processing logic
- retry: retry policy on failure
- late-process: handling of late input feeds

© Talend 2014 12

High Level Architecture

© Talend 2014 13

Demo #1 : my first feed

• Start Falcon on a single cluster

• Submit one simple feed (a minimal sketch follows below)

• Check the generated job in Oozie
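As a rough idea of what the demo's "simple feed" could look like, here is a minimal single-cluster sketch; the feed name, cluster name, dates and paths are hypothetical and only mirror the structure of the Feed Entity slide:

<?xml version="1.0"?>
<feed description="my first feed" name="my-first-feed" xmlns="uri:falcon:feed:0.1">
  <!-- One instance per hour -->
  <frequency>hours(1)</frequency>
  <clusters>
    <!-- Single cluster, acting as the source; keep one week of data -->
    <cluster name="local" type="source">
      <validity start="2014-09-22T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <!-- Hourly directories on HDFS -->
    <location type="data" path="/data/my-first-feed/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>

Once submitted, Falcon generates the corresponding Oozie coordinator, which is what the demo inspects in the Oozie console.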

© Talend 2014 14

Replication & Retention

• Sophisticated retention policies expressed in one place (see the snippet below)

• Simplify data retention for audit, compliance, or data re-processing
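For illustration, a sketch of the <clusters> section of a feed like the one shown earlier; cluster names, dates and limits are made up. Listing a target cluster is what triggers replication, and each cluster carries its own retention limit, so both policies live in the feed definition:

  <clusters>
    <!-- Source cluster: evict raw data after one week -->
    <cluster name="primary" type="source">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
    <!-- Target cluster: Falcon replicates each instance here and keeps it three months for audit/compliance -->
    <cluster name="backup" type="target">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="months(3)" action="delete"/>
    </cluster>
  </clusters>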

© Talend 2014 15

Demo #2: processes notification

• Motivation: be notified about process activity "outside" of the cluster

• Falcon manages the workflow and sends a JMS notification; a Camel route reacts to the notification

• Result: an action (a Camel route) is triggered when Falcon sends a notification (eviction, late arrival, process execution, …)

• Mixes Big Data/Hadoop and ESB technologies

© Talend 2014 16

Demo #2: workflow

Falcon -> ActiveMQ -> Camel

2014-03-19 11:25:43,273 | INFO | LCON.my-process] | process-listener | rg.apache.camel.util.CamelLogger 176 | 74 - org.apache.camel.camel-core - 2.13.0.SNAPSHOT | Exchange[ExchangePattern: InOnly, BodyType: java.util.HashMap, Body: {brokerUrl=tcp://localhost:61616, timeStamp=2014-03-19T10:24Z, status=SUCCEEDED, logFile=hdfs://localhost:8020/falcon/staging/falcon/workflows/process/my-process/logs/instancePaths-2013-11-15-06-05.csv, feedNames=output, runId=0, entityType=process, nominalTime=2013-11-15T06:05Z, brokerTTL=4320, workflowUser=null, entityName=my-process, feedInstancePaths=hdfs://localhost:8020/data/output, operation=GENERATE, logDir=null, workflowId=0000026-140319105443372-oozie-jbon-W, cluster=local, brokerImplClass=org.apache.activemq.ActiveMQConnectionFactory, topicName=FALCON.my-process}]
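A minimal sketch of what such a Camel route could look like (Spring XML DSL), assuming the broker URL and topic name shown in the log above; the route id, filter and log endpoint are illustrative, not the exact route used in the demo:

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
                           http://camel.apache.org/schema/spring http://camel.apache.org/schema/spring/camel-spring.xsd">

  <!-- JMS connection to the ActiveMQ broker Falcon publishes to -->
  <bean id="activemq" class="org.apache.activemq.camel.component.ActiveMQComponent">
    <property name="brokerURL" value="tcp://localhost:61616"/>
  </bean>

  <camelContext xmlns="http://camel.apache.org/schema/spring">
    <route id="process-listener">
      <!-- Falcon publishes one JMS message per entity instance on this topic -->
      <from uri="activemq:topic:FALCON.my-process"/>
      <!-- The message body is a map (see the log above); react only to successful runs -->
      <filter>
        <simple>${body[status]} == 'SUCCEEDED'</simple>
        <!-- Trigger any action here: call a REST service, send a mail, start an ESB flow, ... -->
        <to uri="log:falcon-notification?showAll=true"/>
      </filter>
    </route>
  </camelContext>
</beans>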

© Talend 2014 18

Data Pipeline Tracing

© Talend 2014 19

Topologies

• STANDALONE
  - Single data center
  - Single Falcon server
  - Hadoop jobs and relevant processing involve only one cluster

• DISTRIBUTED
  - Multiple data centers
  - One Falcon server per DC
  - Multiple instances of Hadoop clusters and workflow schedulers

© Talend 2014 20

Multi cluster failover

• Motivation: replicate a subset of data from one cluster to another, with different eviction policies depending on the data subset (sketched below)

• Falcon manages the workflow on the primary cluster and replication to the failover cluster

• Result: supports business continuity without requiring full data reprocessing
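A hypothetical sketch of such a feed, assuming the subset to replicate is selected with Falcon feed partitions and a partition expression on the source cluster; names, dates, paths and the partition expression are illustrative only, not taken from the demo:

<?xml version="1.0"?>
<feed description="weblogs replicated for failover" name="weblogs-failover" xmlns="uri:falcon:feed:0.1">
  <!-- The feed is partitioned by colo (data center) -->
  <partitions>
    <partition name="colo"/>
  </partitions>
  <frequency>hours(1)</frequency>
  <clusters>
    <!-- Primary cluster: the partition expression selects the subset to replicate; short retention -->
    <cluster name="primary" type="source" partition="${cluster.colo}">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <!-- Failover cluster: the replicated copy is kept much longer, ready for business continuity -->
    <cluster name="failover" type="target">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(30)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>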

© Talend 2014 21

Roadmap

• Improve the late-arrival messages on FALCON.ENTITY.TOPIC (ActiveMQ)

• Implicit feed processes (no explicit process definition needed), providing "native" CDC

• Straightforward MapReduce usage (without Pig, Oozie workflow XML, …)

• More data acquisition options

• Monitoring / management / designer dashboard

• Becoming a TLP (Apache Top-Level Project)!

© Talend 2014 22

Questions?

Data Management platform for Hadoop

This material is made available under the terms of the Creative Commons Attribution - NonCommercial - NoDerivs 2.0 France license: http://creativecommons.org/licenses/by-nc-nd/2.0/fr/

Cédric Carbone, Talend CTO, @carbone
Jean-Baptiste Onofré, Falcon Committer, @jbonofre