How to Get the Most from Big Data

0 Copyright 2015 FUJITSU Human Centric Innovation in Action Fujitsu Forum 2015 18th – 19th November

Upload: fujitsu-global

Post on 12-Apr-2017

Category:

Technology


TRANSCRIPT

Page 1: How to Get the Most from Big Data

0 Copyright 2015 FUJITSU

Human Centric Innovation

in Action

Fujitsu Forum 2015

18th – 19th November

Page 2: How to Get the Most from Big Data

How to Get the Most from Big Data

Page 3: How to Get the Most from Big Data

Dr. Fritz Schinkel, Head of Big Data Competence Center, Fujitsu

Page 4: How to Get the Most from Big Data

What Do We Expect – What Can We Expect?

Huge datasets – affordable storage

Fast changing values – analysis in time

Streaming data – real-time processing

Frequent business changes – rapid modeling, fast learning

Page 5: How to Get the Most from Big Data

Big Data Technology Enablers for Success

Page 6: How to Get the Most from Big Data

The Initial Big Data Challenge

Web Search in 1995

Altavista search engine

Full text search on 20 million pages

Big system – but scale-up

Overwhelmed by exponential data growth

Growing time to re-index

Complex page rank impossible

Google’s scale-out approach

Capacity and compute scale “infinitely”

Web index and page rank

Page 7: How to Get the Most from Big Data

From Google’s Success to Big Data Technologies

Compaq Alpha Server GS160

1995 altavista.digital.com

1999

Commodity hardware

2001

2002 Nutch open source search engine

2003 Google paper Map/Reduce

2004 NDFS Nutch Distributed Filesystem

2005 Map reduce + NDFS = Hadoop

2006 Apache Hadoop: Map / Reduce and HDFS

2008 Terasort world record (209 sec / 900 nodes)

Hadoop distributions: 2009, 2011

2011 Facebook 30 PB system

2014 Yahoo 42,000 servers / 455 PB

Page 8: How to Get the Most from Big Data

Distributed Parallel Processing

Distribute data

Code travels to data

Shared Nothing

Scale-out on demand

Affordable standard servers

Client

Master: NameNode, JobTracker

Slaves: DataNode + TaskTracker (three shown)

HDFS

Page 9: How to Get the Most from Big Data

Map Reduce is like Counting Votes

[Figure: 20 ballots, each offering the choices A, B, C and D, are split across three map tasks; a shuffle step groups the votes by choice; four reduce tasks sum them.]

Result: A:7, B:7, C:2, D:4
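
The vote-counting analogy can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle and reduce steps, not Hadoop code: the function names and the way the 20 ballots are split into three boxes (`BOXES`) are assumptions made for this example, chosen so that the totals match the result shown on the slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_ballots(ballot_box: Iterable[str]) -> List[Tuple[str, int]]:
    """Map step: emit one (choice, 1) pair per ballot in a single box."""
    return [(vote, 1) for vote in ballot_box]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: bring all pairs for the same choice together."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for choice, count in pairs:
        grouped[choice].append(count)
    return grouped


def reduce_votes(choice: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts collected for one choice."""
    return choice, sum(counts)


def count_votes(ballot_boxes: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Run map on every box, shuffle the pairs, then reduce per choice."""
    mapped = [pair for box in ballot_boxes for pair in map_ballots(box)]
    grouped = shuffle(mapped)
    return dict(reduce_votes(choice, counts) for choice, counts in grouped.items())


# Hypothetical split of the 20 ballots across three map tasks.
BOXES = [
    list("AABBCDA"),
    list("BBADDAB"),
    list("ABBCDA"),
]

if __name__ == "__main__":
    print(sorted(count_votes(BOXES).items()))
```

In a real cluster each box would live on a different node and the map step would run where the data is stored; only the small (choice, count) pairs travel over the network during the shuffle.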

Page 10: How to Get the Most from Big Data

More than Map Reduce (Hadoop Selected)

MapReduce Execution Engine (Linear)

Hive SQL

Pig Script

Tez Execution Engine (DAG)

YARN Cluster Resource Management

HDFS Redundant, Reliable Persistent Storage

Kafka Queueing

Datameer Visual Analytics

Impala SQL

HBase NoSQL Key value store

Spark Execution Engine (DAG)

Resilient Distributed Datasets (In-Memory)

Spark Streaming

Spark SQL

Spark MLlib

Spark GraphX

Layers: Storage, Resource mgt., Data mgt., Data access

Page 11: How to Get the Most from Big Data

Analytic Flows for Real-time Big Data Problems

E-commerce Log

ERP DB

CRM DB

Social media Service

Sensor Data Stream

Data collection

Machine learning

Aggregate & correlate

Statistical analysis

Prediction

Event Correlation

Research & development, science

Operation, automation, production

Interactive reporting, advertising

Text analysis

Page 12: How to Get the Most from Big Data

Big Data Infrastructure Reference Architecture

Various data → Consolidated data → Distilled essence → Applied knowledge

Extract, Collect → Cleanse, Transform → Analyze, Visualize → Decide, Act

Data Sources | Analytics Platform | Access

Batch processing platform

Event processing platform

Fast response platform

Data bases

Application server

Web content

Sensor data

Apps Services Queries

Visualization Reporting

Notification

Page 13: How to Get the Most from Big Data

Big Data Infrastructure Reference Architecture

Various data → Consolidated data → Distilled essence → Applied knowledge

Extract, Collect → Cleanse, Transform → Analyze, Visualize → Decide, Act

Data Sources | Analytics Platform | Access

Batch processing platform

Event processing platform

Fast response platform

Social media APIs

IMDB

DB / DW

Data at rest

Data in motion

Sensor data

Web content

Text / mail

Complex Event Processing

In-Memory / NoSQL / DWH

Big Data Batch Processing

Page 14: How to Get the Most from Big Data

Big Data Value Chain

Big Data

Collect Stream

Structured & unstructured data

Devices, sensors, Internet of Things

Cleanse Transform Analyze

Find Decide Navigate

Research & development, science

Operation, automation, production

Interactive reporting, advertising

Rapid modelling for faster insights

Social media, open data, linked data

Page 15: How to Get the Most from Big Data

Self-Service Data Science

Big Data

Extract Collect

Structured & unstructured data

Devices, sensors, Internet of Things

Cleanse Transform Analyze

Find Decide Act

Research & development, science

Operation, automation, production

Interactive reporting, advertising

Social media, open data, linked data

Big Data Import, Analysis, Visualization like Outlook, Excel and PowerPoint

Collect

Cleanse Transform Analyze

Find Decide Act

Page 16: How to Get the Most from Big Data

Use Best of Breed Technologies, Get Better Insights

Page 17: How to Get the Most from Big Data

Sentiment Analysis

Example Subjects

City

Company

Brand

Euro Crisis

Migration

Sentiment aspects

Page 18: How to Get the Most from Big Data

Look In and Find Out: Sentimental Data Flow

Scheduled collection of tweets for certain keywords

Compute emotional polarity of texts

Determine author’s location

Calculate tweet frequencies

Filter emotional buzzwords

Sentiment curves

Colored tweet list

Word cloud

Tweet map
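
Three of the steps in this flow (polarity of texts, emotional buzzwords, tweet frequencies) can be sketched with a simple word-list approach. The slides mention a sentiment dictionary but do not show it, so `SENTIMENT_WORDS`, its scores and all function names below are assumptions for illustration; location lookup is left out.

```python
import re
from collections import Counter
from datetime import datetime
from typing import Iterable, List, Tuple

# Hypothetical sentiment dictionary: +1 for positive, -1 for negative words.
SENTIMENT_WORDS = {
    "appreciated": 1,
    "welcome": 1,
    "great": 1,
    "attack": -1,
    "crisis": -1,
    "fear": -1,
}


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def polarity(text: str) -> int:
    """Sum the dictionary scores of all words; > 0 positive, < 0 negative."""
    return sum(SENTIMENT_WORDS.get(token, 0) for token in tokenize(text))


def buzzwords(text: str) -> List[str]:
    """Keep only the emotionally loaded words, e.g. as input for a word cloud."""
    return [token for token in tokenize(text) if token in SENTIMENT_WORDS]


def hourly_frequencies(tweets: Iterable[Tuple[datetime, str]]) -> Counter:
    """Count tweets per full hour, the raw data behind a frequency curve."""
    return Counter(
        timestamp.replace(minute=0, second=0, microsecond=0)
        for timestamp, _ in tweets
    )


if __name__ == "__main__":
    sample = "Clothing donation for refugees appreciated in Passau!"
    print(polarity(sample), buzzwords(sample))
```

A word-list score like this ignores negation and irony; it is meant only to show the shape of the computation, not the quality of analysis a production dictionary would give.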

Page 19: How to Get the Most from Big Data

Geographical Distribution of the Topic “Refugees”

Attacks on refugee homes: data from Amadeu-Antonio-Foundation, visualized by Fujitsu

Tweets about refugees: Google Maps with Twitter data, analyzed by Fujitsu

Page 20: How to Get the Most from Big Data

Presentation of Result and Drilldown

Time series of popularity

Geographical distribution

List of messages

Cloud of key words

Page 21: How to Get the Most from Big Data

Used Technology at a Glance

PRIMEFLEX® for Hadoop

Tweets, Word lists, Geo data

Import tables

Export tables/charts

Chart, Map, Table

Automatic update Dialog

... or other web server

Import: Twitter data (open interface); Automatic import; Geo data for villages; Sentiment dictionary

Processing: Data cleansing / de-duplication; Determine tweet locations; Polarity of texts; Frequency scale

Output: Time series for frequency; Geographical distribution; Message lists; Tag cloud of popular key words

HTML 5 based browser frontend
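
As a rough illustration of the de-duplication step and the tag cloud output named above, the following sketch drops repeated tweets and counts the remaining key words. The tweet shape (an id plus text), the `STOPWORDS` set and the function names are assumptions for this example; the slides do not describe how the actual solution implements these steps.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

Tweet = Tuple[int, str]  # hypothetical shape: (tweet id, text)

# Hypothetical stop-word list; a real one would be much longer.
STOPWORDS = {"the", "a", "for", "in", "of", "and", "to"}


def deduplicate(tweets: Iterable[Tweet]) -> List[Tweet]:
    """Drop tweets whose id or normalized text has been seen before."""
    seen_ids, seen_texts = set(), set()
    unique: List[Tweet] = []
    for tweet_id, text in tweets:
        normalized = " ".join(text.lower().split())
        if tweet_id in seen_ids or normalized in seen_texts:
            continue
        seen_ids.add(tweet_id)
        seen_texts.add(normalized)
        unique.append((tweet_id, text))
    return unique


def tag_cloud(tweets: Iterable[Tweet], size: int) -> List[Tuple[str, int]]:
    """Return the `size` most frequent key words with their counts."""
    counts = Counter(
        word
        for _, text in tweets
        for word in re.findall(r"[a-z]+", text.lower())
        if word not in STOPWORDS
    )
    return counts.most_common(size)


if __name__ == "__main__":
    raw = [
        (1, "Clothing donation for refugees"),
        (1, "Clothing donation for refugees"),
        (2, "clothing  donation for REFUGEES"),
        (3, "Refugees welcome in Passau"),
    ]
    print(tag_cloud(deduplicate(raw), 3))
```

The counts returned by `tag_cloud` would typically be mapped to font sizes by the frontend.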

Page 22: How to Get the Most from Big Data

Become More Current?

Clothing donation for refugees appreciated in Passau!

https://bigdata.fujitsu.com/

<10 sec hourly < 10 min

< 10 sec

< 10 sec continuous

<10 sec <10 sec

REST API Map Reduce / Tez

Streaming API Spark (in-memory)

Page 23: How to Get the Most from Big Data

Streaming – The Short Wire to Twitter

Stream of tweets for certain keywords

Group tweets by author and retweet

Calculate ranking

Aggregate time window of tweets

Popularity chart

Colored tweet list
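
The streaming flow (group by author and retweet, aggregate over a time window, calculate a ranking) can be sketched without any streaming framework. The class below is a plain-Python stand-in for what the slides run on Spark; the class name, the `retweet_of` parameter and the rule that a retweet is credited to the original author are assumptions for this example. It also assumes tweets arrive in timestamp order.

```python
from collections import Counter, deque
from typing import Deque, List, Optional, Tuple


class AuthorRanking:
    """Rank authors by tweets and retweets seen inside a sliding time window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events: Deque[Tuple[float, str]] = deque()
        self._counts: Counter = Counter()

    def add(self, timestamp: float, author: str,
            retweet_of: Optional[str] = None) -> None:
        """Record one tweet; a retweet is credited to the original author."""
        credited = retweet_of or author
        self._events.append((timestamp, credited))
        self._counts[credited] += 1
        self._expire(timestamp)

    def _expire(self, now: float) -> None:
        """Remove events that have fallen out of the window."""
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_author = self._events.popleft()
            self._counts[old_author] -= 1
            if self._counts[old_author] == 0:
                del self._counts[old_author]

    def top(self, n: int) -> List[Tuple[str, int]]:
        """Return the n most popular authors in the current window."""
        return self._counts.most_common(n)


if __name__ == "__main__":
    ranking = AuthorRanking(window_seconds=60)
    ranking.add(0, "alice")
    ranking.add(10, "bob", retweet_of="alice")
    ranking.add(20, "bob")
    print(ranking.top(2))
```

Keeping only the events of the current window in memory is what makes the result available within seconds, at the cost of losing history that the batch flow retains.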

Page 24: How to Get the Most from Big Data

Play and Interact – The User Interface

Chart of top authors

List of messages

Page 25: How to Get the Most from Big Data

Used Technology at a Glance

PRIMEFLEX® for Hadoop

Tweets

Import RDDs

Export tables/charts

Chart, Table

Streaming update Dialog

... or other web server

Import: Twitter data (open interface); Streaming import

Processing: Data cleansing / de-duplication; Determine tweet authors; Frequency scale

Output: Chart of the top 250; Message lists of rated authors

HTML 5 based browser frontend

Page 26: How to Get the Most from Big Data

Robust and Fast: Bringing both Together

Scheduled collection of tweets for certain keywords

Find key authors

Stream of tweets for certain users

Group by author

Calculate ranking

Aggregate tweets

Popularity chart

Colored tweet list

Author ranking

Setup streaming
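
One way to read this combined flow: the scheduled batch collection identifies the key authors for a keyword, and that list then configures which users the stream follows. The sketch below shows only this hand-over. The tweet shape (a dict with `author` and `text`) and both function names are assumptions for illustration, not the interface of the Twitter APIs or of the Fujitsu solution.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

Tweet = Dict[str, str]  # hypothetical shape: {"author": ..., "text": ...}


def find_key_authors(batch: Iterable[Tweet], keywords: List[str],
                     top_n: int) -> List[str]:
    """Batch side: rank authors by how often they tweet about the keywords."""
    lowered = [keyword.lower() for keyword in keywords]
    counts = Counter(
        tweet["author"]
        for tweet in batch
        if any(keyword in tweet["text"].lower() for keyword in lowered)
    )
    return [author for author, _ in counts.most_common(top_n)]


def follow(stream: Iterable[Tweet], authors: Iterable[str]) -> Iterator[Tweet]:
    """Streaming side: pass through only tweets written by the key authors."""
    wanted = set(authors)
    return (tweet for tweet in stream if tweet["author"] in wanted)


if __name__ == "__main__":
    collected = [
        {"author": "alice", "text": "Help for refugees"},
        {"author": "alice", "text": "Refugees arrive"},
        {"author": "bob", "text": "refugees welcome"},
        {"author": "carol", "text": "Nice weather"},
    ]
    key_authors = find_key_authors(collected, ["refugees"], top_n=2)
    live = [
        {"author": "carol", "text": "Still nice weather"},
        {"author": "alice", "text": "Donations needed"},
    ]
    print(key_authors, list(follow(live, key_authors)))
```

The batch step can afford a wide keyword search over a long period, while the stream stays narrow and fast because it watches only the selected users.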

Page 27: How to Get the Most from Big Data

PRIMEFLEX for Hadoop and Conclusions

Page 28: How to Get the Most from Big Data

Keep Pace with Business Needs: PRIMEFLEX for Hadoop

Rapid modelling for logic

In-memory for speed

Scale-out for volume

Page 29: How to Get the Most from Big Data

Platform: PRIMEFLEX for Hadoop

Software stack: Hadoop core (Map Reduce / HDFS); Streaming and in-memory technologies; Analytic framework

Hadoop platform sourcing options: On-premise (Entry or Rack option); Off-premise (Cloud offering); Storage or compute intensive workloads

Service and Consulting: Integration Service; Tool supported sizing; Hadoop and Analytic Services

Entry Rack Cloud

Big Data Management

Analytics

Analytic Services

Integration Service and Sizing

Page 30: How to Get the Most from Big Data

Like Some More:

Experts, demos & solutions @ Big Data Booth: C17-C18
