Introduction to DataFlow management using Apache NiFi Presented by: Anshuman Ghosh

Posted on 20-Mar-2017

Topics we will cover

DataFlow and problems.

What is Apache NiFi – History, key features, core components

Architecture: getting started with NiFi (single-server setup)

Architecture: scaling with NiFi (NiFi cluster setup)

Fundamentals of NiFi Web UI

Building a NiFi DataFlow Processor

Live demo

Testing

Deployment and automation

What next?

Q&A

DataFlow

The term “DataFlow” can be used in a variety of contexts.

In our context it is the flow of information between systems.

It is crucial to have a robust platform to create, manage and automate the flow of enterprise data.

There are many tools for data gathering and data flow, but more often than not we lack an integrated platform for them.

Ideally, we would have seamless integration ...

What enterprises look for

To be able to get data from any source

… to the systems that perform analytics

… and to those that make it available to users

Common DataFlow challenges

System failure

Mismatch between the rates of data production and consumption

Data priorities that change dynamically

Protocols and format changes; new systems, new protocols

Need of bidirectional data flow

Transparency and control

Security and privacy

Brief history of Apache NiFi

Developed at NSA (National Security Agency, USA) for over 8 years.

Engineers at the NSA, who later founded Onyara, developed a project called “Niagara Files”, which went on to become NiFi.

Through the NSA Technology Transfer Program, it was made available as the open source project “Apache NiFi” in 2014.

Hortonworks acquired Onyara in 2015 and offers “Hortonworks DataFlow powered by Apache NiFi”.

What is Apache NiFi

Viewed holistically, Apache NiFi is an integrated platform to collect, conduct and curate real-time data (data in motion).

Provides end-to-end DataFlow management from any source* to any destination*.

Provides data logistics – real-time operational visibility and control of the DataFlow.

Supports powerful and scalable directed graphs of data routing and data transformation.

All of this in a reliable and secure manner.

*The complete list of sources and destinations is in the official documentation.

Key features

Guaranteed data delivery – “at least once” semantics

Data buffering and Back pressure

Data prioritization in queue

Flow specific setting for “latency vs. throughput”

Data provenance

Visual control

Flow templates

Recovery/recording through the content repository

Clustering to scale-out

Security

Classloader Isolation

Core components of NiFi

NiFi, at its core, follows the concepts of flow-based programming.

Core components of NiFi are

FlowFile – the information packet; the unit of data moving through the flow.

FlowFile Processor – the processing engine; a black box.

Connection – the link between Processors; it acts as a bounded buffer.

Flow Controller – the scheduler.

Process Group – a subnet; a set of Processors and Connections packaged as one component.
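To make the "bounded buffer" idea concrete, here is a minimal toy sketch in plain Java. It is not NiFi code: the class name and the threshold parameter are hypothetical. In NiFi itself, back pressure stops the upstream Processor from being scheduled rather than rejecting data, so treat this only as an illustration of a queue with a capacity limit.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Toy model of a Connection as a bounded buffer between two processors (not NiFi code). */
final class BoundedConnectionSketch {
    private final BlockingQueue<String> queue;

    /** backPressureThreshold is a hypothetical name for the maximum number of queued items. */
    BoundedConnectionSketch(int backPressureThreshold) {
        this.queue = new ArrayBlockingQueue<>(backPressureThreshold);
    }

    /** Upstream side: returns false when the buffer is full, i.e. back pressure applies. */
    boolean offer(String flowFile) {
        return queue.offer(flowFile);
    }

    /** Downstream side: returns the oldest queued item, or null if the buffer is empty. */
    String poll() {
        return queue.poll();
    }

    /** Number of items currently waiting in the buffer. */
    int size() {
        return queue.size();
    }
}
```

With a threshold of 2, a third offer is refused until the downstream side polls an item, which is the essence of back pressure on a connection.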

Core components diagram

This is how a typical NiFi DataFlow might look

NiFi Architecture

NiFi executes within a JVM on a host Operating System.

NiFi Architecture – Clustering

Typical NiFi cluster

Core components of NiFi Cluster

NiFi Cluster Manager

Nodes

Primary Node

Isolated Processors

Heartbeats

Fundamentals of the Web UI

Building a DataFlow Processor

Drag the “Processor” icon from the “Component Toolbar” onto the canvas; this opens the ‘Add Processor’ wizard.

Building a DataFlow Processor

General ‘SETTINGS’ for the processor

Building a DataFlow Processor

‘SCHEDULING’ information

Building a DataFlow Processor

Setting up mandatory and optional ‘PROPERTIES’

Building a DataFlow Processor

Auto alert mechanism

If there is a configuration error, NiFi will not allow the processor to start.

Building a DataFlow Processor

If everything is set, we are ready to start the processor.

Demo 1

In this demo, we will go through a NiFi DataFlow that covers the following steps:

Connect to Kafka and consume from a topic.

Store the consumed data in local storage (optional).

Anonymize IP addresses.

Merge content before writing to HDFS (to avoid the small-files problem).

Finally, store the Kafka data in HDFS.

Look into error handling.

Look into the use of Expression Language.
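The slides do not say how the demo anonymizes IP addresses. As one possible approach, the sketch below masks the last octet of every IPv4-looking token in a line of text. The class name, the method name and the choice of zeroing the last octet are assumptions for illustration; in NiFi this kind of substitution would normally be configured on a processor rather than hand-coded.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: masks the last octet of IPv4-looking tokens in a line of text. */
final class IpAnonymizerSketch {
    // Captures the first three octets; the fourth is matched but dropped.
    // Octet ranges (0-255) are deliberately not validated in this sketch.
    private static final Pattern IPV4 =
            Pattern.compile("\\b(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})\\.\\d{1,3}\\b");

    /** Returns the line with every matched address rewritten as a.b.c.0. */
    static String anonymize(String line) {
        Matcher matcher = IPV4.matcher(line);
        return matcher.replaceAll("$1.0");
    }
}
```

For example, a log line containing 192.168.10.57 would come out with 192.168.10.0 in its place, while lines without an address pass through unchanged.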

Demo 2

In this demo, we will go through a NiFi DataFlow that covers the following steps:

Collect/fetch data files from a local location.

Update/add attributes.

Convert JSON strings into DB INSERT statements.

Connect to PostgreSQL and insert.

Error handling.
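The JSON-to-INSERT step can be pictured with a small plain-Java sketch. It assumes the JSON object has already been parsed into a flat map of column names to values; the class, method and table names are hypothetical, and this is not how NiFi implements the conversion internally. It builds a parameterized statement so that values are bound separately rather than concatenated into the SQL.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Illustrative only: turns a flat field map (e.g. parsed JSON) into a parameterized INSERT. */
final class InsertStatementSketch {

    /**
     * Builds "INSERT INTO table (c1, c2) VALUES (?, ?)".
     * Table and column names are concatenated as-is, so they must come from a trusted source.
     */
    static String toInsertSql(String table, Map<String, ?> fields) {
        if (fields.isEmpty()) {
            throw new IllegalArgumentException("at least one field is required");
        }
        String columns = String.join(", ", fields.keySet());
        String placeholders = String.join(", ", Collections.nCopies(fields.size(), "?"));
        return "INSERT INTO " + table + " (" + columns + ") VALUES (" + placeholders + ")";
    }

    /** Returns the values in the map's iteration order, ready to bind to the placeholders. */
    static List<Object> toParameters(Map<String, ?> fields) {
        return new ArrayList<>(fields.values());
    }
}
```

Passing an insertion-ordered map (such as a LinkedHashMap) keeps the column list and the parameter list aligned.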

Unit testing components

For component testing, the nifi-mock module can be used with JUnit.

The TestRunner interface allows us to test Processors and Controller Services.

Obtain a new TestRunner instance (package org.apache.nifi.util).

Add Controller Services and configure them.

Set Processor properties with setProperty(PropertyDescriptor, String).

Enqueue FlowFiles by using the enqueue methods of the TestRunner.

Start the Processor by calling the run() method of the TestRunner.

Validate the output using the TestRunner’s assertAllFlowFilesTransferred and assertTransferCount methods.

More details can be found here – https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#testing

Add the Maven dependency (nifi-mock).

Call the static newTestRunner method of the TestRunners class.

Call the addControllerService method to add a Controller Service.

Set its properties with setProperty(ControllerService, PropertyDescriptor, String).

Enable the service with enableControllerService(ControllerService).

Set Processor properties with setProperty(PropertyDescriptor, String).

Use the overloaded enqueue methods, which accept byte[], InputStream, or Path.

Call run(int): this invokes the methods annotated with @OnScheduled, then the Processor’s onTrigger method the given number of times, then the @OnUnscheduled methods, and finally the @OnStopped methods.

Validate the result with the assertAllFlowFilesTransferred and assertTransferCount methods.

Access the output FlowFiles by calling the getFlowFilesForRelationship() method.
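The steps above can be sketched as a single test. This is a minimal sketch and not taken from the talk: TagProcessor, its "Tag" property and the "tag" attribute are hypothetical, JUnit 4 is assumed, and nifi-mock (artifact nifi-mock, group org.apache.nifi) plus JUnit must be on the test classpath. No Controller Service is involved, so those steps are skipped.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;
import java.util.Set;

import org.apache.nifi.components.PropertyDescriptor;
import org.apache.nifi.components.Validator;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.util.MockFlowFile;
import org.apache.nifi.util.TestRunner;
import org.apache.nifi.util.TestRunners;
import org.junit.Test;

public class TagProcessorTest {

    /** Hypothetical processor under test: copies a property value into a FlowFile attribute. */
    public static class TagProcessor extends AbstractProcessor {
        static final PropertyDescriptor TAG = new PropertyDescriptor.Builder()
                .name("Tag")
                .required(true)
                .addValidator(Validator.VALID) // accepts any value; a real processor would validate
                .build();

        static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success")
                .build();

        @Override
        protected List<PropertyDescriptor> getSupportedPropertyDescriptors() {
            return Collections.singletonList(TAG);
        }

        @Override
        public Set<Relationship> getRelationships() {
            return Collections.singleton(REL_SUCCESS);
        }

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session) {
            FlowFile flowFile = session.get();
            if (flowFile == null) {
                return; // nothing queued for this trigger
            }
            flowFile = session.putAttribute(flowFile, "tag", context.getProperty(TAG).getValue());
            session.transfer(flowFile, REL_SUCCESS);
        }
    }

    @Test
    public void tagsFlowFileAndRoutesToSuccess() {
        // Obtain a TestRunner for the processor and configure it.
        TestRunner runner = TestRunners.newTestRunner(new TagProcessor());
        runner.setProperty(TagProcessor.TAG, "demo");

        // Enqueue one FlowFile and trigger the processor once.
        runner.enqueue("hello".getBytes(StandardCharsets.UTF_8));
        runner.run();

        // Validate routing, content and the added attribute.
        runner.assertAllFlowFilesTransferred(TagProcessor.REL_SUCCESS, 1);
        MockFlowFile out = runner.getFlowFilesForRelationship(TagProcessor.REL_SUCCESS).get(0);
        out.assertContentEquals("hello");
        out.assertAttributeEquals("tag", "demo");
    }
}
```

Keeping the processor as a nested class only serves to make the sketch self-contained; in a real project the test would target the processor class from the main source tree.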

Error handling

The following can occur:

Unexpected data format

Network connection or disk failure

Bug in a processor

Exceptions fall into two groups: ProcessException and all others (such as NullPointerException)

ProcessException – roll back the session and penalize the FlowFiles

All others – roll back the session, penalize the FlowFiles and yield the Processor

Testing automation, Deployment

NiFi provides a REST API for all components; the full documentation can be found here: https://nifi.apache.org/docs/nifi-docs/rest-api/index.html

The Apache NiFi community is working to improve this area.

We can set up the deployment in the following way:

Create the application, i.e. the entire DataFlow, on your local machine and test it.

Create a process group around it (optional).

Create a template. (Can be done from the Web UI or with a REST API call.)

Download the template. (Can be done from the Web UI or with a REST API call.)

Use a REST API call to import the template into the new environment.

Use a REST API call to instantiate the template.

Use REST API calls to update the Processors (properties, schedule, settings, etc.).
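As a rough sketch of the REST calls involved, the helper below only builds the endpoint URIs and makes no network calls. The paths reflect my understanding of the NiFi 1.x REST API and should be checked against the REST documentation for your version; the class name, base URL and IDs are hypothetical.

```java
import java.net.URI;

/** Illustrative only: builds NiFi REST endpoint URIs for a template-based deployment. */
final class NifiDeployEndpointsSketch {
    private final URI base;

    /** baseUrl is assumed to look like http://host:port/nifi-api (hypothetical host and port). */
    NifiDeployEndpointsSketch(String baseUrl) {
        // A trailing slash is required so that URI.resolve appends instead of replacing.
        this.base = URI.create(baseUrl.endsWith("/") ? baseUrl : baseUrl + "/");
    }

    /** GET: download a template as XML from the source environment. */
    URI downloadTemplate(String templateId) {
        return base.resolve("templates/" + templateId + "/download");
    }

    /** POST (multipart upload): import the template into a process group of the target environment. */
    URI uploadTemplate(String processGroupId) {
        return base.resolve("process-groups/" + processGroupId + "/templates/upload");
    }

    /** POST (JSON body naming the template): place the template's components on the canvas. */
    URI instantiateTemplate(String processGroupId) {
        return base.resolve("process-groups/" + processGroupId + "/template-instance");
    }

    /** PUT (JSON body with revision and new config): update one processor after instantiation. */
    URI updateProcessor(String processorId) {
        return base.resolve("processors/" + processorId);
    }
}
```

These URIs could then be handed to any HTTP client; request bodies, authentication and revision handling are left out of the sketch.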

Deployment

There is one more option:

Copy the whole flow (flow.xml.gz) from one environment to another.

This copies the entire canvas.

Take care with the encryption of sensitive properties.

What is next

We plan to continue working on testing and deployment and to update this material.

Please read more on NiFi development here – https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html

And for user guide – https://nifi.apache.org/docs/nifi-docs/html/user-guide.html

We have carried out POCs on some of our real use cases; please find them here

Link HDFS data ingestion using Apache

Link How to setup Apache NiFi

Link Expression Language Guide

For any questions or suggestions, please come by or write.

Q&A

Questions?

Thank you!

Presented by: Anshuman Ghosh