how to get the most from big data

Copyright 2015 FUJITSU

Human Centric Innovation in Action

Fujitsu Forum 2015

18th – 19th November


How to Get the Most from Big Data


Dr. Fritz Schinkel Head of Big Data Competence Center, Fujitsu


What Do We Expect – What Can We Expect?

Huge datasets – affordable storage

Fast-changing values – timely analysis

Streaming data – real-time processing

Frequent business changes – rapid modeling, fast learning


Big Data Technology Enablers for Success


The Initial Big Data Challenge

Web Search in 1995

Altavista search engine

Full text search on 20 million pages

Big system – but scale-up

Overwhelmed by exponential data growth

Growing time to re-index

Complex page rank impossible

Google’s scale-out approach

Capacity and compute scale “infinitely”

Web index and page rank


From Google’s Success to Big Data Technologies

Compaq Alpha Server GS160

1995 altavista.digital.com

1999

Commodity hardware

2002 Nutch open source search engine

2001

2014 Yahoo 42,000 servers / 455 PB

2011 Facebook 30 PB system

2003 Google paper Google File System, 2004 Google paper MapReduce

2004 NDFS Nutch Distributed Filesystem

2005 Map reduce + NDFS = Hadoop

2006 Apache Hadoop: Map / Reduce and HDFS

2008 Terasort world record (209 sec / 900 nodes)

Hadoop distributions

2009

2011


Distributed Parallel Processing

Distribute data

Code travels to data

Shared Nothing

Scale-out on demand

Affordable standard servers

Diagram: Client; Master (NameNode, JobTracker); three Slaves (each with DataNode and TaskTracker); HDFS
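The idea that code travels to data on shared-nothing nodes can be illustrated with a minimal Python sketch. It is not Hadoop code: the round-robin `partition` function only stands in for HDFS block placement, and the names `partition`, `local_stats` and `global_mean` as well as the sample readings are assumptions made for this illustration.

```python
def partition(records, num_nodes):
    """Distribute records round-robin across nodes (stand-in for HDFS block placement)."""
    nodes = [[] for _ in range(num_nodes)]
    for i, record in enumerate(records):
        nodes[i % num_nodes].append(record)
    return nodes


def local_stats(local_data):
    """The 'code' that travels to a node: it only sees that node's data."""
    return len(local_data), sum(local_data)


def global_mean(nodes):
    """Run local_stats on every node and combine the small partial results."""
    partials = [local_stats(local_data) for local_data in nodes]
    count = sum(c for c, _ in partials)
    total = sum(s for _, s in partials)
    return total / count


readings = [4, 8, 15, 16, 23, 42]          # hypothetical sample values
nodes = partition(readings, num_nodes=3)   # each node holds its own share
mean = global_mean(nodes)
print(nodes, mean)
```

Only the small `(count, sum)` pairs leave each node, while the raw data stays where it is stored. Adding nodes spreads the same work over more machines, which is the essence of scale-out.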


Map Reduce is like Counting Votes

Figure: 20 ballots for candidates A, B, C and D are counted by three map tasks, shuffled by candidate, and summed by four reduce tasks.

Result: A:7, B:7, C:2, D:4
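The vote-counting analogy can be sketched in a few lines of plain Python. This is an illustration of the map, shuffle and reduce steps, not Hadoop's API. The split of the ballots across three boxes is an assumption chosen so that the totals match the result on the slide.

```python
from collections import defaultdict


def map_phase(ballot_box):
    """Map: each counter turns the ballots of one box into (candidate, 1) pairs."""
    return [(vote, 1) for vote in ballot_box]


def shuffle(mapped_outputs):
    """Shuffle: bring all pairs for the same candidate together."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for candidate, one in pairs:
            grouped[candidate].append(one)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the pairs of each candidate independently."""
    return {candidate: sum(ones) for candidate, ones in grouped.items()}


# Hypothetical distribution of the 20 ballots across three ballot boxes.
ballot_boxes = [list("AABBCDA"), list("BBADABD"), list("ABCBAD")]

mapped = [map_phase(box) for box in ballot_boxes]   # could run in parallel
result = reduce_phase(shuffle(mapped))
print(result)
```

Each box can be mapped on a different machine, and each candidate can be reduced on a different machine. The shuffle is the only step that moves data between them.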


More than Map Reduce (Hadoop Selected)

Layers: storage, resource management, data management, data access

HDFS: redundant, reliable, persistent storage

YARN: cluster resource management

MapReduce: execution engine (linear)

Tez: execution engine (DAG)

Spark: execution engine (DAG) on Resilient Distributed Datasets (in-memory)

Hive: SQL

Pig: script

Impala: SQL

HBase: NoSQL key-value store

Kafka: queueing

Datameer: visual analytics

Spark Streaming, Spark SQL, Spark MLlib, Spark GraphX


Analytic Flows for Real-time Big Data Problems

E-commerce Log

ERP DB

CRM DB

Social media Service

Sensor Data Stream

Data collection

Machine learning

Aggregate & correlate

Statistical analysis

Prediction

Event Correlation

Research & development, science

Operation, automation, production

Interactive reporting, advertising

Text analysis


Big Data Infrastructure Reference Architecture

Data Sources, Analytics Platform, Access

Data stages: various data, consolidated data, distilled essence, applied knowledge

Process steps: extract and collect; cleanse and transform; analyze and visualize; decide and act

Batch processing platform

Event processing platform

Fast response platform

Databases

Application server

Web content

Sensor data

Apps Services Queries

Visualization Reporting

Notification



Social media APIs

IMDB

DB / DW

Data at rest

Data in motion

Sensor data

Web content

Text / mail

Complex Event Processing

In-Memory / NoSQL / DWH

Big Data Batch Processing


Big Data Value Chain

Big Data

Collect Stream

Structured & unstructured data

Devices, sensors, Internet of Things

Cleanse Transform Analyze

Find Decide Navigate



production


Rapid modelling for faster insights

Social media, open data, linked data


Self-Service Data Science

Big Data

Extract Collect

Structured & unstructured data

Devices, sensors, Internet of Things


Find Decide Act



production


Social media, open data, linked data

Big Data import, analysis and visualization as familiar as Outlook, Excel and PowerPoint



Use Best-of-Breed Technologies, Get Better Insights


Sentiment Analysis

Example Subjects

City

Company

Brand

Euro Crisis

Migration

…

Sentiment aspects


Look In and Find Out: Sentimental Data Flow

Scheduled collection of tweets for certain keywords

Compute emotional polarity of texts

Determine author’s location

Calculate tweet frequencies

Filter emotional buzzwords

Sentiment curves

Colored tweet list

Word cloud

Tweet map
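Two steps of this flow, computing the polarity of texts and calculating tweet frequencies, can be sketched as follows. This is a minimal illustration that assumes a simple dictionary-based approach. The word lists, the sample tweets and the function names `polarity` and `hourly_frequencies` are hypothetical and do not come from the demo.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sentiment dictionary; a real one would be far larger.
POSITIVE = {"welcome", "help", "thanks", "great"}
NEGATIVE = {"crisis", "fear", "attack", "bad"}


def polarity(text):
    """Return +1, -1 or 0 depending on which word list dominates the text."""
    words = [w.strip(".,!?#@").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return (score > 0) - (score < 0)


def hourly_frequencies(tweets):
    """Count tweets per hour, the raw material for a frequency time series."""
    return Counter(t["time"].strftime("%Y-%m-%d %H:00") for t in tweets)


tweets = [  # hypothetical sample data
    {"time": datetime(2015, 11, 18, 9, 5), "text": "Great help for refugees, thanks!"},
    {"time": datetime(2015, 11, 18, 9, 40), "text": "Fear of a new crisis"},
    {"time": datetime(2015, 11, 18, 10, 15), "text": "Clothing donation in Passau"},
]

polarities = [polarity(t["text"]) for t in tweets]
frequencies = hourly_frequencies(tweets)
print(polarities, dict(frequencies))
```

The polarity value could drive the coloring of the tweet list, and the hourly counts split by polarity could feed the sentiment curves.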


Geographical Distribution of the Topic “Refugees”

Tweets about refugees: Google Maps with Twitter data analyzed by Fujitsu

Attacks on refugee homes: data from the Amadeu Antonio Foundation visualized by Fujitsu


Presentation of Result and Drilldown

Time series of popularity

Geographical distribution

List of messages

Cloud of key words


Used Technology at a Glance

PRIMEFLEX® for Hadoop

Tweets

Word lists

Geo data

Chart

Map

Table

Import tables

Export tables/charts

…

Automatic update

Dialog

... or other web server

Import: Twitter data (open interface), automatic import, geo data for villages, sentiment dictionary

Processing: data cleansing/de-duplication, determine tweet locations, polarity of texts, frequency scale

Output: time series for frequency, geographical distribution, message lists, tag cloud of popular keywords

HTML 5 based browser frontend



Become More Current?

Clothing donation for refugees appreciated in Passau!

https://bigdata.fujitsu.com/

Batch path: REST API, Map Reduce / Tez: < 10 sec, hourly, < 10 min, < 10 sec

Streaming path: Streaming API, Spark (in-memory): < 10 sec, continuous, < 10 sec, < 10 sec


Streaming – The Short Wire to Twitter

Stream of tweets for certain keywords

Group tweets by author and retweet

Calculate ranking

Aggregate time window of tweets

Popularity chart

Colored tweet list
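The windowed aggregation and ranking steps can be sketched in plain Python. This is not Spark Streaming code, only an illustration of a sliding time window. It assumes that a retweet counts toward the original author, which the slides do not specify. The class name `AuthorRanking` and the sample events are hypothetical.

```python
from collections import Counter, deque


class AuthorRanking:
    """Rank authors by tweets and retweets seen in the last window_seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()      # (timestamp, credited author), oldest first
        self.scores = Counter()

    def add(self, timestamp, author, retweet_of=None):
        """Record one tweet; a retweet is credited to the original author (assumption)."""
        credited = retweet_of or author
        self.events.append((timestamp, credited))
        self.scores[credited] += 1
        self._expire(timestamp)

    def _expire(self, now):
        """Drop events that have fallen out of the time window."""
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_author = self.events.popleft()
            self.scores[old_author] -= 1
            if self.scores[old_author] == 0:
                del self.scores[old_author]

    def top(self, n):
        return self.scores.most_common(n)


ranking = AuthorRanking(window_seconds=60)
ranking.add(0, "alice")
ranking.add(10, "bob", retweet_of="alice")
ranking.add(20, "carol")
before = ranking.top(1)        # alice leads inside the first minute
ranking.add(70, "bob")         # the events at 0 s and 10 s leave the window
after = dict(ranking.top(2))
print(before, after)
```

Because only the events inside the window are kept in memory, each update is cheap, which is what allows the popularity chart to refresh continuously.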


Play and Interact – The User Interface

Chart of top authors

List of messages


Used Technology at a Glance

PRIMEFLEX® for Hadoop

Tweets

Chart

Table

Import RDDs

Export tables/charts

…

Streaming update

Dialog

... or other web server

Import: Twitter data (open interface), streaming import

Processing: data cleansing/de-duplication, determine tweet authors, frequency scale

Output: chart of the top 250, message lists of rated authors

HTML 5 based browser frontend


Robust and Fast: Bringing both Together

Scheduled collection of tweets for certain keywords

Find key authors

Set up streaming

Stream of tweets for certain users

Group by author

Calculate ranking

Aggregate tweets

Popularity chart

Colored tweet list

Author ranking


PRIMEFLEX for Hadoop and Conclusions


Keep pace with business needs: PRIMEFLEX for Hadoop

Rapid modelling for logic

In-memory for speed

Scale-out for volume


Platform: PRIMEFLEX for Hadoop

Software stack: Hadoop core (Map Reduce / HDFS), streaming and in-memory technologies, analytic framework

Hadoop platform sourcing options: on-premise (Entry or Rack option), off-premise (cloud offering), storage or compute intensive workloads

Service and consulting: integration service, tool-supported sizing, Hadoop and analytic services

Entry Rack Cloud

Big Data Management

Analytics

Analytic Services

Integration Service and Sizing



Would You Like Some More?

Experts, demos & solutions @ Big Data Booth: C17-C18