Integrating Apache Spark and NiFi for Data Lakes

MAKING BIG DATA COME ALIVE

Ron Bodkin, Founder & President
Scott Reisdorf, R&D Architect

Upload: dataworks-summithadoop-summit

Post on 16-Apr-2017



Agenda
• Requirements
• Design
• Demo


Goals for a Data Lake

• A central repository with trusted, consistent data
• Reduce costs by offloading analytical systems and archiving cold data
• Derive value quickly with easier discovery and prototyping
• A laboratory for experimenting with new technologies and data


What’s Needed For A Hadoop Data Lake?

• Automation of pipelines with metadata and performance tracking
• Governance with clear distinction of roles and responsibilities
• SLA tracking with alerts on failures or violations
• Interactive data discovery and experimentation
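The SLA tracking requirement above could be sketched roughly as follows. This is only an illustration in plain Python: the FeedRun fields, the check_sla helper, and the feed name are hypothetical and do not come from the slides or from any NiFi or Spark API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class FeedRun:
    """One execution of a feed (all field names are hypothetical)."""
    feed: str
    started: datetime
    finished: Optional[datetime]  # None while the run is still in progress
    succeeded: bool


def check_sla(run: FeedRun, max_duration: timedelta, now: datetime) -> List[str]:
    """Return alert messages for a failed run or an SLA violation."""
    alerts = []
    if run.finished is not None and not run.succeeded:
        alerts.append(f"{run.feed}: run failed")
    # For a run that has not finished, measure elapsed time up to 'now'.
    end = run.finished if run.finished is not None else now
    if end - run.started > max_duration:
        alerts.append(f"{run.feed}: SLA violated, ran longer than {max_duration}")
    return alerts


start = datetime(2016, 6, 1, 2, 0)
late_run = FeedRun("customer_orders", start, start + timedelta(hours=3), True)
failed_run = FeedRun("customer_orders", start, start + timedelta(minutes=5), False)

late_alerts = check_sla(late_run, timedelta(hours=1), now=start + timedelta(hours=4))
failed_alerts = check_sla(failed_run, timedelta(hours=1), now=start + timedelta(hours=4))
print(late_alerts)
print(failed_alerts)
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps the check easy to test.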


Example Ingestion Project

• 4000+ unique flat files and RDBMS tables, plus a few streaming data feeds
• Mix of incremental and snapshot data
• Ingest into Hadoop (minimally HDFS and Hive tables)
• Cleansing/encryption and data validation
• Metadata capture
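To make the "incremental and snapshot" distinction above concrete, here is a minimal plain-Python sketch. It assumes a snapshot feed delivers a full copy each time and an incremental feed is filtered by a high-water mark on a "modified" column; the function names and the column are hypothetical, not taken from the project described.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def load_snapshot(target: List[Row], batch: List[Row]) -> List[Row]:
    """Snapshot feed: each batch is a full copy, so it replaces the target."""
    return list(batch)


def load_incremental(
    target: List[Row], batch: List[Row], watermark: int, key: str = "modified"
) -> Tuple[List[Row], int]:
    """Incremental feed: append only rows newer than the last high-water mark."""
    new_rows = [r for r in batch if r[key] > watermark]
    new_mark = max([watermark] + [r[key] for r in new_rows])
    return target + new_rows, new_mark


table = [{"id": 1, "modified": 10}]
batch = [{"id": 1, "modified": 10}, {"id": 2, "modified": 15}]

table, mark = load_incremental(table, batch, watermark=10)
print(len(table), mark)

snapshot = load_snapshot(table, [{"id": 9, "modified": 20}])
print(snapshot)
```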

Focus shifts over time from data ingestion to transformation then to analytics


Design


Apache Spark Functions
• Cleanse
• Validate
• Profile
• Wrangle


© 2016 Think Big, a Teradata Company

Pipeline design with Apache NiFi
• Visual drag-and-drop
• Dozens of data connectors
• 150+ pre-built transforms
• Data lineage
• Batch and Streaming
• Extensible


Role separation

Apache NiFi
• IT Designers design models in NiFi
• Register with framework
• Integrated development process

Think Big framework
• Users configure new feeds
• Based on common model
• Generated and executed in NiFi

(Diagram arrows: register, deploy)


Design Approach

• User features around org. roles
• Visual design
• Streaming and Batch
• Fully governed
• Integrated Best Practices
• Secure, modern architecture
• Will be open source (Apache license)


Ingest and Prepare

• UI-guided feed creation
• Data protection
• Data cleanse
• Data validation
• Data profiling
• Powered by Apache Spark
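The cleanse, validate, and profile steps listed above can be illustrated with a small sketch. The slides say these steps are powered by Apache Spark; the code below is plain Python that only shows the logic of each step, not the Spark API or the framework's implementation. The column names and validation rules are hypothetical.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def cleanse(rec: Record) -> Record:
    """Standardize: trim whitespace and turn empty strings into None."""
    out: Record = {}
    for k, v in rec.items():
        v = v.strip() if v is not None else None
        out[k] = v if v else None
    return out


def validate(rec: Record) -> List[str]:
    """Return the reasons a record is invalid (hypothetical rules)."""
    reasons = []
    if rec.get("id") is None:
        reasons.append("id is required")
    age = rec.get("age")
    if age is not None and not age.isdigit():
        reasons.append("age must be a non-negative integer")
    return reasons


def profile(recs: List[Record]) -> Dict[str, Dict[str, int]]:
    """Per column: count of nulls and count of distinct non-null values."""
    nulls: Dict[str, int] = {}
    values: Dict[str, set] = {}
    for rec in recs:
        for k, v in rec.items():
            nulls.setdefault(k, 0)
            values.setdefault(k, set())
            if v is None:
                nulls[k] += 1
            else:
                values[k].add(v)
    return {k: {"nulls": nulls[k], "distinct": len(values[k])} for k in nulls}


raw = [
    {"id": " 1 ", "age": "34"},
    {"id": "", "age": "x"},
    {"id": "3", "age": " "},
]
cleansed = [cleanse(r) for r in raw]
valid = [r for r in cleansed if not validate(r)]
invalid = [(r, validate(r)) for r in cleansed if validate(r)]
stats = profile(valid)
print(valid)
print(invalid)
print(stats)
```

Routing invalid records to a separate collection together with their rejection reasons, instead of dropping them, is one common way to keep validation results auditable.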


Data Ingest Model

Metadata determines behavior of individual components
Adds many Hadoop-specific higher-level NiFi processors

Sources
• Extract Table (JDBC)
• Get File(s) (Filesystem)
• Message (JMS/Kafka)
• Other (HTTP/REST, etc.)

Processing components
• Unpack and/or merge small files
• Put file (HDFS)
• Cleanse/Standardize (Spark)
• Data Profile (Spark)
• Validate (Spark)
• Index Text (Elasticsearch)
• Merge / Dedupe (Hive)
• Compress & Archive Originals (HDFS, S3)

Also shown in the diagram: Metadata, Data policies
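The idea that metadata determines the behavior of individual components can be sketched as a generic runner that reads a feed's metadata and applies only the steps it names. This is an assumption-laden illustration in plain Python: the metadata keys ("steps", "dedupe_key"), the step registry, and the functions are hypothetical, not the framework's actual model or any NiFi processor.

```python
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, str]]
Meta = Dict[str, Any]


def trim_fields(rows: Rows, meta: Meta) -> Rows:
    """Cleanse step: strip surrounding whitespace from every field."""
    return [{k: v.strip() for k, v in r.items()} for r in rows]


def dedupe(rows: Rows, meta: Meta) -> Rows:
    """Dedupe step: keep the first row seen for each key named in metadata."""
    seen, out = set(), []
    for r in rows:
        key = r[meta["dedupe_key"]]
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out


# Hypothetical registry mapping step names in metadata to implementations.
STEPS: Dict[str, Callable[[Rows, Meta], Rows]] = {
    "cleanse": trim_fields,
    "dedupe": dedupe,
}


def run_feed(rows: Rows, meta: Meta) -> Rows:
    """Apply the steps listed in the feed's metadata, in order."""
    for name in meta["steps"]:
        rows = STEPS[name](rows, meta)
    return rows


rows = [
    {"id": "1 ", "user": "ann"},
    {"id": "1", "user": "ann"},
    {"id": "2", "user": "bob"},
]
full = run_feed(rows, {"steps": ["cleanse", "dedupe"], "dedupe_key": "id"})
cleanse_only = run_feed(rows, {"steps": ["cleanse"]})
print(full)
print(len(cleanse_only))
```

The same runner handles both feeds; only the metadata differs, which is the point the slide makes about a common model shared by many feeds.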


Data self-service and “wrangle”

• Graphical SQL builder
• 100+ transform functions
• Machine learning
• Publish and schedule
• Powered by Apache Spark


Data Discovery

• Google-like searching
• Extensible metadata
• Data profile
• Data sampling


Operations

• Dashboard
• Health Monitoring
• Data Confidence
• SLA enforcement
• Alerts
• Performance reports


Elasticsearch – Full Text Indexing

• Powerful search capabilities for users against data (think Google-like searching)
• NiFi processor extracts source data from a Hadoop table for indexing in Elasticsearch
• Incremental updates during ingest

Diagram: Data Lake query (select id, user, tweet from twitter_feed), then extract JSON for indexing
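The "extract JSON" step for the twitter_feed example might look something like the sketch below, which turns query result rows into the newline-delimited JSON body expected by the Elasticsearch bulk API (an action line followed by a document line, with a trailing newline). The to_bulk_payload helper, the sample rows, and the choice of "twitter_feed" as the index name are assumptions for illustration; the slides do not show how the NiFi processor does this.

```python
import json
from typing import Any, Dict, List


def to_bulk_payload(rows: List[Dict[str, Any]], index: str, id_field: str = "id") -> str:
    """Build a bulk-API body: one action line plus one document line per row."""
    lines = []
    for row in rows:
        action = {"index": {"_index": index, "_id": str(row[id_field])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(row))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Rows as they might come back from: select id, user, tweet from twitter_feed
rows = [
    {"id": 1, "user": "ann", "tweet": "hello data lake"},
    {"id": 2, "user": "bob", "tweet": "spark and nifi"},
]
payload = to_bulk_payload(rows, index="twitter_feed")
print(payload)
```

Reusing the source row's id as the document id means re-indexing the same row overwrites the earlier document, which suits incremental updates during ingest.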


Demo
