how to get the most from big data
TRANSCRIPT
0 Copyright 2015 FUJITSU
Human Centric Innovation in Action
Fujitsu Forum 2015
18th – 19th November
How to Get the Most from Big Data
Dr. Fritz Schinkel, Head of Big Data Competence Center, Fujitsu
What Do We Expect – What Can We Expect?
Huge datasets – affordable storage
Fast changing values – analysis in time
Streaming data – real-time processing
Frequent business changes – rapid modeling, fast learning
Big Data Technology Enablers for Success
The Initial Big Data Challenge
Web Search in 1995
AltaVista search engine
Full text search on 20 million pages
Big system – but scale-up
Overwhelmed by exponential data growth
Growing time to re-index
Complex page rank impossible
Google’s scale-out approach
Capacity and compute scale “infinitely”
Web index and page rank
From Google's Success to Big Data Technologies
1995 altavista.digital.com (Compaq Alpha Server GS160)
1999 Commodity hardware
2001
2002 Nutch open source search engine
2003 Google paper Map/Reduce
2004 NDFS Nutch Distributed Filesystem
2005 Map reduce + NDFS = Hadoop
2006 Apache Hadoop: Map / Reduce and HDFS
2008 Terasort world record (209 sec / 900 nodes)
2009, 2011 Hadoop distributions
2011 Facebook 30 PB system
2014 Yahoo 42,000 servers / 455 PB
Distributed Parallel Processing
Distribute data
Code travels to data
Shared Nothing
Scale-out on demand
Affordable standard servers
Client
Master: NameNode, JobTracker
Slaves: 3 × (DataNode, TaskTracker)
HDFS
Map Reduce is like Counting Votes
[Figure: ballots for candidates A, B, C and D pass through three map tasks, a shuffle step, and four reduce tasks]
Result: A:7 B:7 C:2 D:4
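The vote-counting analogy can be sketched in a few lines of plain Python. This is a single-process illustration, not Hadoop code: the function names (map_ballots, shuffle, reduce_counts) and the split of the ballots over three map tasks are assumptions made for the example. Only the final tally (A:7, B:7, C:2, D:4) comes from the slide.

```python
from collections import defaultdict


def map_ballots(ballots):
    """Map: emit a (candidate, 1) pair for every ballot in one partition."""
    return [(candidate, 1) for candidate in ballots]


def shuffle(mapped_partitions):
    """Shuffle: group the emitted values by candidate across all partitions."""
    grouped = defaultdict(list)
    for pairs in mapped_partitions:
        for candidate, one in pairs:
            grouped[candidate].append(one)
    return grouped


def reduce_counts(grouped):
    """Reduce: sum the values collected for each candidate."""
    return {candidate: sum(values) for candidate, values in grouped.items()}


# Hypothetical split of the 20 ballots over three map tasks.
partitions = [
    list("AABBCDA"),
    list("BBADDAB"),
    list("ABBCDA"),
]

result = reduce_counts(shuffle(map_ballots(p) for p in partitions))
print(sorted(result.items()))
```

Each map task only sees its own partition, which is why the counting can run on separate servers. The shuffle step is the only point where data for one candidate has to be brought together.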
More than Map Reduce (Hadoop Selected)
Storage: HDFS Redundant, Reliable Persistent Storage
Resource mgt.: YARN Cluster Resource Management
Data mgt. and data access:
MapReduce Execution Engine (Linear)
TEZ Execution Engine (DAG)
Spark Execution Engine (DAG), Resilient Distributed Datasets (In-Memory)
Hive SQL
Pig Script
Impala SQL
Spark SQL
Spark Streaming
Spark MLlib
Spark GraphX
HBase NoSQL key-value store
Kafka Queueing
Datameer Visual Analytics
Analytic Flows for Real-time Big Data Problems
E-commerce Log
ERP DB
CRM DB
Social media Service
Sensor Data Stream
Data collection
Machine learning
Aggregate & correlate
Statistical analysis
Prediction
Event Correlation
Research & development, science
Operation, automation, production
Interactive reporting, advertising
Text analysis
Big Data Infrastructure Reference Architecture
Various data -> Consolidated data -> Distilled essence -> Applied knowledge
Extract, Collect -> Cleanse, Transform -> Analyze, Visualize -> Decide, Act
Data Sources -> Analytics Platform -> Access
Batch processing platform
Event processing platform
Fast response platform
Data bases
Application server
Web content
Sensor data
Apps Services Queries
Visualization Reporting
Notification
Big Data Infrastructure Reference Architecture
Social media APIs
IMDB
DB / DW
Data at rest
Data in motion
Sensor data
Web content
Text / mail
Complex Event Processing
In-Memory / NoSQL / DWH
Big Data Batch Processing
Big Data Value Chain
Big Data
Collect Stream
Structured & unstructured data
Devices, sensors, Internet of Things
Cleanse Transform Analyze
Find Decide Navigate
Research & development, science
Operation, automation, production
Interactive reporting, advertising
Rapid modelling for faster insights
Social media, open data, linked data
Self-Service Data Science
Big Data
Extract Collect
Structured & unstructured data
Devices, sensors, Internet of Things
Cleanse Transform Analyze
Find Decide Act
Research & development, science
Operation, automation, production
Interactive reporting, advertising
Social media, open data, linked data
Big Data import, analysis and visualization, like Outlook, Excel and PowerPoint
Collect
Cleanse Transform Analyze
Find Decide Act
Use Best of Breed Technologies: Get Better Insights
Sentiment Analysis
Example Subjects
City
Company
Brand
Euro Crisis
Migration
…
Sentiment aspects
Look In and Find Out: Sentimental Data Flow
Scheduled collection of tweets for certain keywords
Compute emotional polarity of texts
Determine author’s location
Calculate tweet frequencies
Filter emotional buzzwords
Sentiment curves
Colored tweet list
Word cloud
Tweet map
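The polarity, frequency and buzzword steps of this flow can be illustrated with a small dictionary-based sketch. It is an assumption-laden stand-in, not the implementation behind the demo: the SENTIMENT word list, the tweet field names (time, text) and all function names are hypothetical, and real sentiment dictionaries are far larger and weighted more carefully.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sentiment dictionary: word -> polarity weight.
SENTIMENT = {"great": 1, "appreciated": 1, "welcome": 1, "bad": -1, "crisis": -1}


def tokens(text):
    """Lower-case the words of a text and strip common punctuation."""
    return [w.strip(".,!?#@:;\"'").lower() for w in text.split()]


def polarity(text):
    """Sum the dictionary weights of all words (0 means neutral)."""
    return sum(SENTIMENT.get(w, 0) for w in tokens(text))


def hourly_frequency(tweets):
    """Count tweets per hour bucket, the raw data for a sentiment curve."""
    return Counter(
        t["time"].replace(minute=0, second=0, microsecond=0) for t in tweets
    )


def emotional_buzzwords(tweets):
    """Keep only words found in the sentiment dictionary (word cloud input)."""
    return Counter(w for t in tweets for w in tokens(t["text"]) if w in SENTIMENT)


# Made-up sample tweets for illustration.
tweets = [
    {"time": datetime(2015, 11, 18, 9, 5), "text": "Clothing donation appreciated!"},
    {"time": datetime(2015, 11, 18, 9, 40), "text": "Great welcome in the city"},
    {"time": datetime(2015, 11, 18, 10, 15), "text": "The crisis is bad."},
]

scores = [polarity(t["text"]) for t in tweets]
frequency = hourly_frequency(tweets)
cloud = emotional_buzzwords(tweets)
print(scores, dict(frequency), dict(cloud))
```

The sign of each score would drive the coloring of the tweet list, and the hourly counts would feed the sentiment curves.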
Geographical Distribution of the Topic “Refugees”
Tweets about refugees: Google Maps with Twitter data analyzed by Fujitsu
Attacks on refugee homes: data from the Amadeu Antonio Foundation visualized by Fujitsu
Presentation of Result and Drilldown
Time series of popularity
Geographical distribution
List of messages
Cloud of key words
Used Technology at a Glance
PRIMEFLEX® for Hadoop
Import: Twitter data (open interface); automatic import; geo data for villages; sentiment dictionary
Processing: data cleansing / de-duplication; determine tweet locations; polarity of texts; frequency scale
Output: time series for frequency; geographical distribution; message lists; tag cloud of popular key words
[Diagram: tweets, word lists and geo data are imported as tables with automatic update; tables and charts (chart, map, table) are exported to an HTML 5 based browser frontend or other web server for dialog]
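Two of the processing steps named here, de-duplication and determining tweet locations, might look roughly like the sketch below. Everything in it is assumed for illustration: the GEO gazetteer with approximate coordinates, the author_location field, and the rule that two tweets are duplicates when their normalized text matches.

```python
# Hypothetical gazetteer: place name -> approximate (latitude, longitude).
GEO = {"passau": (48.57, 13.46), "munich": (48.14, 11.58)}


def deduplicate(tweets):
    """Keep the first tweet for each normalized text, drop later copies."""
    seen, unique = set(), []
    for tweet in tweets:
        key = " ".join(tweet["text"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique


def locate(tweet):
    """Look up the author's profile location; None if it is unknown."""
    place = tweet.get("author_location", "").strip().lower()
    return GEO.get(place)


# Made-up sample tweets for illustration.
tweets = [
    {"text": "Clothing donation appreciated", "author_location": "Passau"},
    {"text": "clothing  donation APPRECIATED", "author_location": "Munich"},
    {"text": "Something else entirely", "author_location": "Nowhere"},
]

unique = deduplicate(tweets)
located = [(t["text"], locate(t)) for t in unique]
print(located)
```

Tweets whose location cannot be resolved stay in the message list but would simply not appear on the map.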
Become More Current?
"Clothing donation for refugees appreciated in Passau!"
https://bigdata.fujitsu.com/
REST API, Map Reduce / Tez: hourly, < 10 min
Streaming API, Spark (in-memory): continuous, < 10 sec
All other steps: < 10 sec
Streaming – The Short Wire to Twitter
Stream of tweets for certain keywords
Group tweets by author and retweet
Calculate ranking
Aggregate time window of tweets
Popularity chart
Colored tweet list
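The windowed grouping and ranking can be sketched without a streaming engine. The slides name Spark for this flow; the plain-Python function below only mimics the logic of one time window. The field names (ts, author, retweet_of), the 60-second window and the rule that a retweet counts for the original author are assumptions for the example.

```python
from collections import Counter


def rank_authors(tweets, now, window_seconds=60, top=3):
    """Rank authors by tweets plus retweets of them inside the time window."""
    counts = Counter()
    for t in tweets:
        if now - window_seconds < t["ts"] <= now:
            # A retweet counts for the original author, not the retweeter.
            counts[t.get("retweet_of") or t["author"]] += 1
    return counts.most_common(top)


# Made-up stream; "ts" is seconds since the start of the stream.
tweets = [
    {"ts": 10, "author": "alice"},
    {"ts": 20, "author": "bob", "retweet_of": "alice"},
    {"ts": 30, "author": "carol"},
    {"ts": 95, "author": "bob"},
    {"ts": 100, "author": "dave", "retweet_of": "bob"},
]

first_window = rank_authors(tweets, now=60)
later_window = rank_authors(tweets, now=100)
print(first_window, later_window)
```

Re-evaluating the function as the window moves forward is what makes the popularity chart change while the user watches.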
Play and Interact – The User Interface
Chart of top authors
List of messages
Used Technology at a Glance
PRIMEFLEX® for Hadoop
Import: Twitter data (open interface); streaming import
Processing: data cleansing / de-duplication; determine tweet authors; frequency scale
Output: chart of the top 250; message lists of rated authors
[Diagram: tweets are imported as RDDs with streaming update; tables and charts (chart, table) are exported to an HTML 5 based browser frontend or other web server for dialog]
Robust and Fast: Bringing both Together
Scheduled collection of tweets for certain keywords
Find key authors
Set up streaming
Stream of tweets for certain users
Group by author
Calculate ranking
Aggregate tweets
Popularity chart
Colored tweet list
Author ranking
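How the batch and streaming flows might feed each other is sketched below: the scheduled collection yields key authors, and that list configures the filter for the stream. The function names and the choice of "most frequent author" as the definition of a key author are assumptions; the slide does not say how key authors are selected.

```python
from collections import Counter


def find_key_authors(batch_tweets, top=2):
    """Batch step: pick the most frequent authors from the scheduled collection."""
    counts = Counter(t["author"] for t in batch_tweets)
    return [author for author, _ in counts.most_common(top)]


def setup_stream(key_authors):
    """Return a filter that accepts only tweets from the key authors."""
    followed = set(key_authors)
    return lambda tweet: tweet["author"] in followed


def stream_ranking(stream, accept):
    """Streaming step: group accepted tweets by author and rank them."""
    return Counter(t["author"] for t in stream if accept(t)).most_common()


# Made-up data: a keyword-based batch and a later live stream.
batch = [{"author": a} for a in ["alice", "alice", "alice", "bob", "bob", "carol"]]
stream = [{"author": a} for a in ["bob", "bob", "alice", "dave"]]

key_authors = find_key_authors(batch)
ranking = stream_ranking(stream, setup_stream(key_authors))
print(key_authors, ranking)
```

The slow, robust batch job decides whom to follow, while the fast stream only has to process the much smaller set of tweets from those users.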
PRIMEFLEX for Hadoop and Conclusions
Keep Pace with Business Needs: PRIMEFLEX for Hadoop
Rapid modelling for logic
In-memory for speed
Scale-out for volume
Platform: PRIMEFLEX for Hadoop
Software stack: Hadoop core (Map Reduce / HDFS); streaming and in-memory technologies; analytic framework
Hadoop platform sourcing options: on-premise (Entry or Rack option); off-premise (Cloud offering); storage or compute intensive workloads
Service and consulting: integration service; tool supported sizing; Hadoop and analytic services
[Diagram: Entry, Rack and Cloud options carrying Big Data Management and Analytics, with Analytic Services and Integration Service and Sizing]
Would You Like Some More?
Experts, demos & solutions @ Big Data Booth: C17-C18