Apache Falcon - Hadoop User Group France, 22 Sept 2014
Apache Falcon slides from the 22 Sept 2014 Hadoop meetup @Criteo, by @carbone & @jbonofre. Data Management Platform for Hadoop.
© Talend 2014 1
HUG France - 22 Sept 2014 - @Criteo
Data Management platform for Hadoop
This material is made available under the terms of the Creative Commons Attribution - NonCommercial - NoDerivs 2.0 France license - http://creativecommons.org/licenses/by-nc-nd/2.0/fr/
Cédric Carbone, Talend CTO - @carbone
Jean-Baptiste Onofré, Falcon Committer - @jbonofre
Overview
• Falcon is a Data Management solution for Hadoop
• Falcon has been in production at InMobi since 2012
• InMobi contributed Falcon to the ASF in April 2013
• Falcon is in the Apache Incubator
• Falcon ships by default with HDP
• Falcon leverages many Apache components: Oozie, Ambari, ActiveMQ, HCatalog, Sqoop…
• Committer/PPMC/IPMC members:
- 8 from InMobi
- 5 from Hortonworks
- 1 from Talend
What is Falcon?
• Data Motion: import, export, CDC
• Policy-based Lifecycle Management: retention, replication, archival, anonymization of PII data
• Process orchestration and scheduling: late data handling, reprocessing, dependency checking, etc.; multi-cluster management to support local/global aggregations, rollups, etc.
• Data Governance: lineage, audit, SLA
Falcon - The Solution!
• Introduces a higher layer of abstraction: the Data Set
- Decouples a data location and its properties from workflows
- Understanding the lifetime of a feed allows implicit validation of the processing rules
• Provides the key services for data processing apps
- Common data services are simple directives; no need to define them verbosely in each job
- Allows process owners to keep their processing specific to their application logic
- Sits in the execution path and intercepts to handle out-of-band (OOB) data, retries, etc.
• Promotes polyglot programming
- Does not do any heavy lifting, but delegates to tools within the Hadoop ecosystem
Falcon Basic Concepts: Data Pipelines
• Cluster: Represents the Hadoop cluster
• Feed: Defines a “dataset”
• Process: Consumes feeds, invokes processing logic & produces feeds
• Entity Actions: submit, list, dependency, schedule, suspend, resume, status, definition, delete, update
Cluster Entity
<?xml version="1.0"?>
<cluster colo="talend-datacenter" description="" name="prod-cluster">
  <interfaces>
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0"/>
    <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0"/>
    <interface type="execute" endpoint="rm:8050" version="2.2.0"/>
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0"/>
    <interface type="registry" endpoint="thrift://hms:9083" version="0.12.0"/>
    <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/prod-cluster/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/apps/falcon/prod-cluster/working"/>
  </locations>
</cluster>
Needed by distcp for replications
Writing to HDFS
Used to submit processes as MR
Submit Oozie jobs
Hive metastore to register/deregister partitions and get events on partition availability
Used for alerts
HDFS directories used by Falcon server
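The interfaces block above tells Falcon which endpoint to use for each service. As an illustration (not part of Falcon itself), such a cluster definition can be inspected with Python's standard XML parser:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster entity shown above (illustrative only;
# the real entity carries more interfaces and attributes).
CLUSTER_XML = """\
<cluster colo="talend-datacenter" description="" name="prod-cluster">
  <interfaces>
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0"/>
    <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0"/>
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0"/>
  </interfaces>
</cluster>"""

def interface_endpoints(xml_text):
    """Return a {interface type: endpoint} map for a Falcon cluster entity."""
    root = ET.fromstring(xml_text)
    return {i.get("type"): i.get("endpoint") for i in root.find("interfaces")}

print(interface_endpoints(CLUSTER_XML)["write"])  # hdfs://nn:8020
```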
Feed Entity
<?xml version="1.0"?>
<feed description="" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <late-arrival cut-off="hours(6)"/>
  <groups>churnAnalysisFeeds</groups>
  <clusters>
    <cluster name="cluster-primary" type="source">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="cluster-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
Feed run frequency in minutes/hours/days/months
Late arrival cutoff
Global location across clusters: HDFS paths or Hive tables
Feeds can belong to multiple groups
One or more source & target clusters for retention & replication
Access permissions
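The ${YEAR}/${MONTH}/${DAY}/${HOUR} variables in the feed's location path are expanded per instance according to the feed frequency. A simplified sketch of that expansion (illustrative only; Falcon's own expression language and scheduler handle this in practice):

```python
from datetime import datetime, timedelta

# Path template from the feed entity above.
TEMPLATE = "/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"

def instance_paths(template, start, count, step=timedelta(hours=1)):
    """Materialise `count` instance paths for an hours(1) feed, starting at `start`."""
    paths, t = [], start
    for _ in range(count):
        paths.append(template
                     .replace("${YEAR}", f"{t.year:04d}")
                     .replace("${MONTH}", f"{t.month:02d}")
                     .replace("${DAY}", f"{t.day:02d}")
                     .replace("${HOUR}", f"{t.hour:02d}"))
        t += step
    return paths

print(instance_paths(TEMPLATE, datetime(2012, 1, 1, 0), 2))
# ['/weblogs/2012-01-01-00', '/weblogs/2012-01-01-01']
```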
Process Entity
<process name="process-test" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="cluster-primary">
      <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>days(1)</frequency>
  <inputs>
    <input start="today(0,0)" end="today(0,0)" feed="feed-clicks-raw" name="input"/>
  </inputs>
  <outputs>
    <output instance="now(0,2)" feed="feed-clicks-clean" name="output"/>
  </outputs>
  <workflow engine="pig" path="/apps/clickstream/clean-script.pig"/>
  <retry policy="periodic" delay="minutes(10)" attempts="3"/>
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="input" workflow-path="/apps/clickstream/late"/>
  </late-process>
</process>
How frequently the process runs, how many instances can run in parallel, and in what order
Which cluster should the process run on and when
The processing logic.
Retry policy on failure
Handling late input feeds
Input & output feeds for process
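The retry and late-process policies in the process entity name two strategies: periodic (fixed delay) and exp-backoff (doubling delay). A simplified sketch of the resulting retry schedule (illustrative only; the actual scheduling is done by Falcon and Oozie):

```python
from datetime import datetime, timedelta

def retry_times(nominal, policy="periodic", delay=timedelta(minutes=10), attempts=3):
    """Sketch of the two retry policies above: 'periodic' retries at a fixed
    delay, 'exp-backoff' doubles the delay after each attempt."""
    times, t, d = [], nominal, delay
    for _ in range(attempts):
        t = t + d
        times.append(t)
        if policy == "exp-backoff":
            d *= 2
    return times

start = datetime(2011, 11, 2, 0, 0)
print(retry_times(start))  # retries at +10, +20, +30 minutes
print(retry_times(start, "exp-backoff", timedelta(hours=1)))  # +1h, +3h, +7h
```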
Demo #1: my first feed
• Start Falcon on a single cluster
• Submit one simple feed
• Check the generated job in Oozie
Replication & Retention
• Sophisticated retention policies expressed in one place
• Simplified data retention for audit, compliance, or data re-processing
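A retention policy such as limit="days(2)" action="delete" boils down to evicting feed instances older than the limit. A minimal sketch of that check (illustrative only, not Falcon code):

```python
from datetime import datetime, timedelta

def instances_to_evict(instances, now, limit=timedelta(days=2)):
    """Instances (timestamps) older than the retention limit are eviction candidates."""
    return [t for t in instances if now - t > limit]

now = datetime(2014, 9, 22, 12, 0)
instances = [now - timedelta(days=d) for d in (1, 2, 3)]
print(instances_to_evict(instances, now))  # only the 3-day-old instance
```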
Demo #2: process notifications
Motivation: be notified about process activity "outside" of the cluster
Falcon manages the workflow and sends a JMS notification; a Camel route reacts to the notification
Result: trigger an action (a Camel route) when Falcon sends a notification (eviction, late arrival, process execution, …)
Mix Big Data/Hadoop and ESB technologies
Demo #2 workflow: Falcon → ActiveMQ → Camel
2014-03-19 11:25:43,273 | INFO | LCON.my-process] | process-listener | rg.apache.camel.util.CamelLogger 176 | 74 - org.apache.camel.camel-core - 2.13.0.SNAPSHOT | Exchange[ExchangePattern: InOnly, BodyType: java.util.HashMap, Body: {brokerUrl=tcp://localhost:61616, timeStamp=2014-03-19T10:24Z, status=SUCCEEDED, logFile=hdfs://localhost:8020/falcon/staging/falcon/workflows/process/my-process/logs/instancePaths-2013-11-15-06-05.csv, feedNames=output, runId=0, entityType=process, nominalTime=2013-11-15T06:05Z, brokerTTL=4320, workflowUser=null, entityName=my-process, feedInstancePaths=hdfs://localhost:8020/data/output, operation=GENERATE, logDir=null, workflowId=0000026-140319105443372-oozie-jbon-W, cluster=local, brokerImplClass=org.apache.activemq.ActiveMQConnectionFactory, topicName=FALCON.my-process}]
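The Camel log above shows the JMS message body as a flattened key=value map (Java's Map.toString form). For illustration, such a body can be turned into a dict like this (a sketch that assumes values contain no ", " separators, which holds for the fields shown):

```python
# Abbreviated copy of the message body from the log above.
SAMPLE = ("brokerUrl=tcp://localhost:61616, status=SUCCEEDED, "
          "entityName=my-process, operation=GENERATE, runId=0")

def parse_map_body(body):
    """Parse a 'k=v, k=v' map body into a dict; split on the first '=' only,
    so values such as tcp://localhost:61616 survive intact."""
    return dict(pair.split("=", 1) for pair in body.split(", "))

msg = parse_map_body(SAMPLE)
print(msg["status"], msg["entityName"])  # SUCCEEDED my-process
```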
Demo #2: Data Notification
• All notifications are published to ActiveMQ
• Subscribers: Camel routes
• Add some files and see the notification system working!
• http://blog.nanthrax.net/2014/03/hadoop-cdc-and-processes-notification-with-apache-falcon-apache-activemq-and-apache-camel/
Topologies
• STANDALONE
– Single data center
– Single Falcon server
– Hadoop jobs and relevant processing involve only one cluster
• DISTRIBUTED
– Multiple data centers
– One Falcon server per DC
– Multiple instances of Hadoop clusters and workflow schedulers
Multi-cluster failover
Motivation: replicate a subset of data from one cluster to another, and guarantee different eviction policies depending on the data subset
Falcon manages the workflow on the primary cluster and the replication to the failover cluster
Result: supports business continuity without requiring full data reprocessing
Roadmap
• Improvements to the late-arrival messages in FALCON.ENTITY.TOPIC (ActiveMQ)
• Feed implicit processes (no need for a process entity), providing "native" CDC
• Straightforward MR usage (without Pig, Oozie workflow XML, …)
• More data acquisition
• Monitoring/management/designer dashboard
• Becoming an Apache Top-Level Project (TLP)!
Questions?