owf12/java michael hirt

Tackling Big Data with Hadoop and Graphical Open Source Integration Michaël Hirt Data Integration Product Manager

Upload: open-world-forum

Post on 12-May-2015


TRANSCRIPT

Page 1: OWF12/Java Michael hirt

Tackling Big Data with Hadoop and Graphical Open Source Integration

Michaël Hirt, Data Integration Product Manager

Page 2: OWF12/Java Michael hirt

© Talend 2011 2

Agenda

1. What is Big Data?

2. Talend’s Goal

3. What’s next? Big Data Quality and Big Data management

4. Talend Open Studio for Big Data in action

Page 3: OWF12/Java Michael hirt

What is Big Data?

Page 4: OWF12/Java Michael hirt

2015

What Is BIG Data?

"Big data" is information of extreme size, diversity, complexity and need for rapid processing.
Ted Friedman - Information Infrastructure and Big Data Projects Key Initiative Overview - July 2011

2020

275 exabytes of data flowing over the Internet each day (275,000,000,000,000,000,000 bytes)

200 billion intelligent devices (200,000,000,000)

50 gigabytes of data per person on Earth (50,000,000,000 bytes), 300 exabytes total

2,300 tweets per second (June 2011)
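As a quick sanity check on the figures above, the per-person and total volumes agree with each other if decimal units are assumed (1 GB = 10^9 bytes, 1 EB = 10^18 bytes). The slide does not state which convention it uses, so that choice is an assumption of this minimal Python sketch:

```python
# Assumption: decimal (SI) units; the slide does not say which convention applies.
GIGABYTE = 10**9
EXABYTE = 10**18

per_person_bytes = 50 * GIGABYTE      # "50 gigabytes of data per person on Earth"
total_bytes = 300 * EXABYTE           # "300 exabytes total"
daily_internet_bytes = 275 * EXABYTE  # "275 exabytes of data ... each day"

# Population implied by dividing the total by the per-person figure.
implied_population = total_bytes // per_person_bytes

print(f"Implied population: {implied_population:,}")
print(f"Daily Internet traffic in bytes: {daily_internet_bytes:,}")
```

Under these assumptions the two figures imply a population of six billion people, which is the same order of magnitude as the world population at the time of the talk.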

Page 5: OWF12/Java Michael hirt

© Talend 2011 – Strictly Private & Confidential

volume, variety, velocity

How to define Big Data is…

Key Takeaway #1

Hans Rosling – uses big data to analyze world health trends

Page 6: OWF12/Java Michael hirt

The 6 Dimensions of BIG Data

Primary challenges

Volume, Velocity, Variety, Complexity

And also: Validation, Lineage

Page 7: OWF12/Java Michael hirt

Forces us to think differently

Key Takeaway #2

Page 8: OWF12/Java Michael hirt

CRM

ERP

Finance

ETL

Data Quality

Normalized Data

Traditional Data Warehouse

Business Analyst

Business User

Warehouse Administrator

Traditional Data Flows

• Scheduled: daily or weekly, sometimes more frequently.

• Volumes rarely exceed terabytes

Executives

Page 9: OWF12/Java Michael hirt

CRM

ERP

Finance

The new world of big data

Social Networking

Big Data

Page 10: OWF12/Java Michael hirt

CRM

ERP

Finance

The new world of big data

Social Networking

Mobile Devices

Big Data

Page 11: OWF12/Java Michael hirt

CRM

ERP

Finance

The new world of big data

Social Networking

Mobile Devices

Transactions

Network Devices

Sensors

Big Data

Page 12: OWF12/Java Michael hirt

Data driven business

data

decisions

supports

Your business

drives

Information provides value to the business. If you can't rely on your information then the result can be missed opportunities, or higher costs.

Matthew West and Julian Fowler (1999). Developing High Quality Data Models. The European Process Industries STEP Technical Liaison Executive (EPISTLE).

information

enables

governance

Page 13: OWF12/Java Michael hirt

Big Data Production

Big Data Management

Big Data Consumption

Storage

Processing

Filtering

Mining

Analytics

Search

Enrichment

RDBMS, Analytical DB, NoSQL DB, ERP/CRM, SaaS, Social Media, Web Analytics, Log Files, RFID, Call Data Records, Sensors, Machine-Generated

Big Data Integration

Big Data Quality

BIG Data Management

Turn Big Data into actionable information

Page 14: OWF12/Java Michael hirt

BIG data driven business

BIG data

BIG information

BIG business

supports

drives

enables

Matthew West and Julian Fowler (1999). Developing High Quality Data Models. The European Process Industries STEP Technical Liaison Executive (EPISTLE).

governance

BIG decisions

Information provides value to the business. If you can't rely on your information then the result can be missed opportunities, or higher costs.

Page 15: OWF12/Java Michael hirt

Our goal

Page 16: OWF12/Java Michael hirt

Talend – The Market Leading Unified Integration Platform

Open source license, free of charge, optional support

Commercial license, subscription model

Data Quality, Data Integration, MDM, ESB

Talend Open Studio for

Monitoring, Execution, Deployment, Repository, Studio

Data Quality, Data Integration, MDM, ESB, BPM

Talend Enterprise

Talend Unified Platform

Recognized as the open source leader in each of its market categories by all industry analysts

Page 17: OWF12/Java Michael hirt

Trying to get from this…

Page 18: OWF12/Java Michael hirt

to this…

ONLY Talend generates code that is executed within MapReduce. This open approach removes the limitation of a proprietary “engine” to provide a truly unique and powerful set of tools for big data.

Why Talend…

Page 19: OWF12/Java Michael hirt

“Big Data for the Masses”

Page 20: OWF12/Java Michael hirt

…an open source ecosystem

Talend Open Studio for Big Data “Big Data for the Masses”

• Improves efficiency of big data job design with a graphical interface

• Abstracts and generates code

• Runs transforms inside Hadoop

• Native support for HDFS, Pig, HBase, Sqoop and Hive

• Apache License 2.0

• Embedded in Hortonworks Data Platform

• Certified with Cloudera, MapR and Greenplum

Goal: Democratize Big Data

Pig

Page 21: OWF12/Java Michael hirt

Big Data – How about Data Quality?

© Talend 2012

Page 22: OWF12/Java Michael hirt

In big data…poor data quality can be magnified at huge scale

Poor Data Quality + Big Data = Big Problems

Poor Data Quality * Big Data = Big Problems^2

Key Takeaway #3

Page 23: OWF12/Java Michael hirt

1. Pipelining: as part of the load process

2. Load the cluster, then implement and execute a data quality MapReduce job

Two methods for inserting data quality into a big data job
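The two methods listed above can be contrasted with a small sketch. This is an illustration in plain Python, not Talend-generated code and not the Hadoop API: the record shape, the `cleanse` and `is_valid` rules, and the in-memory `store` lists are all hypothetical stand-ins, and the second method uses a local map-and-filter pass where a real cluster would run a MapReduce job.

```python
from typing import Iterable

Record = dict  # hypothetical record type: one row as a plain dict


def cleanse(record: Record) -> Record:
    """Hypothetical data quality rule: trim and lower-case the email field."""
    cleaned = dict(record)  # copy so the raw record is left untouched
    cleaned["email"] = record.get("email", "").strip().lower()
    return cleaned


def is_valid(record: Record) -> bool:
    """Hypothetical validation rule: an email must contain '@'."""
    return "@" in record.get("email", "")


def load_with_pipelined_dq(source: Iterable[Record], store: list) -> None:
    """Method 1 (pipelining): apply data quality as part of the load process."""
    for record in source:
        cleaned = cleanse(record)
        if is_valid(cleaned):
            store.append(cleaned)  # only cleansed, valid records reach the store


def load_then_improve(source: Iterable[Record], store: list) -> list:
    """Method 2: load everything raw, then run a separate data quality pass."""
    store.extend(source)                       # load first, raw data lands as-is
    mapped = map(cleanse, store)               # map step: cleanse each record
    return [r for r in mapped if is_valid(r)]  # filter step: keep valid records


raw = [
    {"id": 1, "email": "  Alice@Example.com "},
    {"id": 2, "email": "not-an-email"},
]

pipelined_store: list = []
load_with_pipelined_dq(raw, pipelined_store)

raw_store: list = []
improved = load_then_improve(raw, raw_store)

print(pipelined_store)
print(improved)
```

Both paths end with the same cleansed records here. The difference is where the work happens: with pipelining the store never holds bad data, while with load-then-improve the raw data stays in the store and the cleansing runs as a separate pass over it.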

Page 24: OWF12/Java Michael hirt

Extract – Transform – Load

E-T-L

Page 25: OWF12/Java Michael hirt

Extract – Improve/Cleanse – Load

E-DQ-L

Page 26: OWF12/Java Michael hirt

CRM

ERP

Finance

Social Networking

Mobile Devices

Big Data

DQ

DQ

Pipelining: data quality with big data

• Use traditional data quality tools

• No new programming, no PhDs

• Once and done

Page 27: OWF12/Java Michael hirt

Big data alternative: Load and improve within the cluster

• Load first, improve later

• Really complex to build, limited tools

• Constant on, increments

• Insane performance

CRM

ERP

Finance

Social Networking

Mobile Devices

Big Data

DQ

DQ

Page 28: OWF12/Java Michael hirt

Let us show you…

Page 29: OWF12/Java Michael hirt

What’s next for Talend Big Data?

Page 30: OWF12/Java Michael hirt

Talend Open Studio for Big Data

4.0: HDFS

4.1: Hive & Sqoop

4.2: Pig

5.0: HBase

5.1: HCatalog & Oozie

Page 31: OWF12/Java Michael hirt

Talend Open Studio for Big Data

Packaged within Hortonworks Data Platform

…Eclipse tools for Hive, HDFS, Pig, Sqoop

…supports Oozie, HCatalog, Kerberos

Free to download and use under the Apache license

…democratizing big data through intuitive tools

Timeline: now, Q4 2012, 2013

Page 32: OWF12/Java Michael hirt

Questions / Thanks for attending

mhirt_at_talend.com