semantic web meetup 14.november 2013



Page 1: Semantic web meetup 14.november 2013

Big Data & Hadoop

Jean-Pierre König

3 October 2013

Semantic Web Meetup

Page 2

COMPANY PROFILE

Page 3

WE ARE HERE

From its base in Kreuzlingen, Switzerland, YMC has served well-known national and international clients since 2001.

Page 4

WE WORK WITH

Customers

Page 5

WE WORK WITH

Partners

Page 6

WE CREATE

WEB SOLUTIONS
§ Hosting & Support
§ Custom, tailor-made web solutions
§ Social media applications (e.g. corporate blogs, wikis, Facebook apps)
§ Web strategies
§ Shop systems, websites, intranets

BIG DATA ANALYTICS
§ Recommendation systems (e.g. for apps, web shops, websites and intranets)
§ Predictive models (e.g. for the interests of app users)
§ Integrated search systems (e.g. also for unstructured data)
§ Tailor-made web analytics systems (e.g. with real-time metrics and effects in social networks)
§ Training (Apache Hadoop)

MOBILE APPLICATIONS
§ Geolocation for location-specific services
§ Integration of social networks such as Facebook and Twitter
§ Apps for tablets and smartphones (iPhone, Android)
§ Mobile strategies

Page 7

WHAT IS BIG DATA

Page 8

WHAT IS BIG DATA

§ More general
  § When data sets become so large and complex that they become difficult to process, including capture, curation, storage, search, sharing, transfer, analysis, and visualization
  § They are difficult to work with using most RDBMS, statistics and visualization systems
  § They require massively parallel software running on tens, hundreds, or even thousands of servers
§ The 3 V's by Gartner
  § Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization. (2012)

Page 9

WHAT DRIVES BIG DATA

§ Human-generated data
  § Documents, transaction data, CRM, social media: your working life is devoted to looking at screens and typing more data into some system.
§ Sensor-generated data
  § There is a trend that a large part of the physical world around us will eventually be online: the Internet of Things.
§ Machine-generated data will quickly top human-generated data

Page 10

BUSINESS DRIVERS

Web Archives

Data Aggregation

Video, Audio & Image Processing

Data Pre-processing

Infrastructure Management

Sampling

Predictive Analytics

360° Customer Experience Management

Social Media Analysis

(Mass) Personalization

Recommendation Engines

Data as a Service

Research

Fraud protection

Risk management

Environment Safety

Digital Security

Infrastructure Observation

Increase Revenue

Improve Decision-Making

Risk Prevention

Page 11

THE EMERGING SOLUTIONS

§ NoSQL* movement
  § NoSQL databases are finding significant and growing industry use in big data and real-time web applications.
§ Hadoop and its ecosystem
  § Enterprise-grade solutions, consulting, support
  § Top 3 vendors: Cloudera, Hortonworks, MapR
  § Adoption throughout the software industry, e.g. IBM BigInsights, Microsoft HDInsight, Oracle Big Data Appliance, EMC/Spring/VMware Pivotal HD, HP HAVEn, Intel Distribution, Dell w/Cloudera

* Also referred to as "Not only SQL"

Page 12

HADOOP IN A NUTSHELL

Page 13

WHAT IS HADOOP

§ An open-source implementation of frameworks for reliable, scalable, distributed computing and data storage (Official Hadoop website)
§ A reliable shared storage and analysis system (O'Reilly: Hadoop, The Definitive Guide)
§ A free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment (Margaret Rouse)
§ A complete, open-source ecosystem for capturing, organizing, storing, searching, sharing, analyzing, visualizing, and ... (Jack Norris)

Page 14

A BRIEF HISTORY OF HADOOP

§ In 2002 Doug Cutting* started Nutch, an open-source web search engine
§ Fortunately, Google published papers that
  § described the architecture of its distributed file system, called GFS (2003)
  § introduced MapReduce (2004)
§ In 2005 Nutch released a new version with NDFS and MapReduce, which moved out to form an independent subproject called Hadoop in 2006
§ Cutting joined Yahoo! to build and run Hadoop at web scale
§ In 2008 Hadoop became a top-level Apache project and was used at Yahoo! (10k cores), Last.fm, Facebook and the New York Times

* Doug Cutting is also the creator of Apache Lucene

Page 15

HADOOP IN A NUTSHELL

§ HDFS
  § A distributed file system for storage
  § Highly fault-tolerant and designed to be deployed on low-cost commodity hardware
  § One master called the NameNode, many DataNodes (10+)
§ MapReduce
  § A batch query processor to run an ad hoc query against your whole dataset and get the results in a reasonable time
  § One master called the JobTracker, many TaskTrackers (10+)

Page 16

HADOOP FACT-SHEET

MapReduce/distributed processing
§ Economical
  § Commodity hardware
§ Scalable
  § Add nodes to increase parallelism
§ Fault tolerant
  § Auto-recover job failures
§ Data locality
  § Process where the data resides

HDFS/distributed storage
§ Economical
  § Commodity hardware
§ Scalable
  § Rebalances data on new nodes
§ Fault tolerant
  § Detects faults and auto-recovers
§ Reliable
  § Maintains multiple copies of data
§ High throughput
  § Because data is distributed
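The "detects faults and auto-recovers" and "maintains multiple copies" points can be illustrated with a toy model. This is only a sketch, not how HDFS is implemented: the node names, the block map and the choice of replacement nodes (simply the first live nodes in sorted order) are assumptions made for illustration.

```python
REPLICATION_FACTOR = 3  # assumed target number of copies per block


def under_replicated(block_map, live_nodes, replication=REPLICATION_FACTOR):
    """Return the blocks that have fewer live replicas than the target."""
    result = {}
    for block, nodes in block_map.items():
        live = [n for n in nodes if n in live_nodes]
        if len(live) < replication:
            result[block] = live
    return result


def re_replicate(block_map, live_nodes, replication=REPLICATION_FACTOR):
    """Copy under-replicated blocks to live nodes that do not hold them yet."""
    repaired = {}
    for block, nodes in block_map.items():
        live = [n for n in nodes if n in live_nodes]
        candidates = [n for n in sorted(live_nodes) if n not in live]
        missing = max(0, replication - len(live))
        repaired[block] = live + candidates[:missing]
    return repaired


# Hypothetical cluster state: every block has three replicas, then dn5 fails.
block_map = {
    "A": ["dn1", "dn5", "dn6"],
    "B": ["dn1", "dn5", "dn9"],
    "C": ["dn5", "dn9", "dn6"],
}
live_nodes = {"dn1", "dn6", "dn9", "dn10"}

needs_repair = under_replicated(block_map, live_nodes)
repaired = re_replicate(block_map, live_nodes)
```

After the repair step every block is back at three replicas on live nodes only.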

Page 17

HADOOP PRINCIPLES

§ Schema on read
§ Data locality
§ No shared memory or disks
§ Scales out to thousands of servers
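"Schema on read" means raw records are stored as they arrive and a structure is applied only when a job reads them. A minimal sketch in plain Python, assuming a hypothetical tab-separated log format; the field names and sample lines are invented for illustration.

```python
# Hypothetical raw log lines, stored exactly as they arrived (no schema at write time).
RAW_LINES = [
    "2013-10-03\t/index.html\t200",
    "2013-10-03\t/missing\t404",
    "malformed line without tabs",
]

# The "schema" is a list of (field name, parser) pairs chosen by the reader.
SCHEMA = [("date", str), ("path", str), ("status", int)]


def parse(line, schema):
    """Apply the schema at read time; return None if the line does not fit."""
    parts = line.split("\t")
    if len(parts) != len(schema):
        return None
    try:
        return {name: cast(value) for (name, cast), value in zip(schema, parts)}
    except ValueError:
        return None


def read_with_schema(lines, schema):
    """Yield only the records that fit the schema chosen for this read."""
    for line in lines:
        record = parse(line, schema)
        if record is not None:
            yield record


records = list(read_with_schema(RAW_LINES, SCHEMA))
```

A different job could read the same raw lines with a different schema, which is the point of deferring the structure to read time.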

Page 18

HADOOP SYSTEM COMPONENTS

            Masters                         Slaves (many of them)
HDFS        NameNode, Secondary NameNode    DataNode
MapReduce   JobTracker                      TaskTracker

Page 19

WRITING FILES ON HDFS*

§ File.txt is split into Block A, Block B and Block C
§ Client to NameNode: "Hey, I want to write A, B and C of my File.txt."
§ NameNode to Client: "OK, write to DataNodes 1, 5 and 9."

[Diagram: DataNode 1, DataNode 5, DataNode 6, DataNode 9, ... DataNode N, spread over Rack 1 and Rack 2, holding Blocks A, B, C and their replicas A`, B`, C`]

* Replication factor of 3
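The write path above can be mimicked with a toy model: split a file into blocks and let a stand-in for the NameNode choose three DataNodes per block. This is a simplified sketch, not the real HDFS placement policy: the rack layout, node names, block size and the rule "one replica on one rack, the rest on another rack" are assumptions for illustration, and the sketch expects at least two racks.

```python
REPLICATION_FACTOR = 3  # as on the slide

# Hypothetical cluster layout: rack name -> DataNode ids.
RACKS = {
    "rack1": ["dn1", "dn5", "dn6"],
    "rack2": ["dn9", "dn10", "dn11"],
}


def split_into_blocks(data, block_size):
    """Cut the file content into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, racks, replication=REPLICATION_FACTOR):
    """Toy NameNode: one replica on a first rack, the remaining ones on another rack."""
    rack_names = sorted(racks)
    placement = {}
    for block_id in range(num_blocks):
        first_rack = rack_names[block_id % len(rack_names)]
        other_rack = rack_names[(block_id + 1) % len(rack_names)]
        first = racks[first_rack][block_id % len(racks[first_rack])]
        others = racks[other_rack]
        rest = [others[(block_id + k) % len(others)] for k in range(replication - 1)]
        placement[block_id] = [first] + rest
    return placement


blocks = split_into_blocks(b"x" * 300, block_size=128)  # 3 blocks, like A, B and C
placement = place_replicas(len(blocks), RACKS)
```

Spreading the replicas of each block over two racks means that losing a single node, or a whole rack, still leaves at least one copy available.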

Page 20

READING FILES FROM HDFS

§ Client to NameNode: "Tell me the block locations of File.txt."
§ NameNode to Client:
  § A → DataNode 1, 5, 6
  § B → DataNode 1, 5, N
  § C → DataNode 5, 9, 6

[Diagram: DataNode 1, DataNode 5, DataNode 6, DataNode 9, ... DataNode N, spread over Rack 1 and Rack 2, holding Blocks A, B, C and their replicas A`, B`, C`]
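The read path can be sketched the same way: the client asks for the block locations and then fetches each block from a replica that is still reachable. The block-to-DataNode mapping is the one shown on the slide; the function names and the "take the first live replica" rule are assumptions for illustration (a real client would also consider how close each replica is).

```python
# Block locations as listed on the slide (replication factor 3).
BLOCK_LOCATIONS = {
    "File.txt": [
        ("A", ["DataNode 1", "DataNode 5", "DataNode 6"]),
        ("B", ["DataNode 1", "DataNode 5", "DataNode N"]),
        ("C", ["DataNode 5", "DataNode 9", "DataNode 6"]),
    ]
}


def get_block_locations(filename):
    """Toy NameNode lookup: ordered list of (block, replica nodes)."""
    return BLOCK_LOCATIONS[filename]


def plan_read(filename, unavailable=frozenset()):
    """Pick, for every block, the first replica that is not marked unavailable."""
    plan = []
    for block, replicas in get_block_locations(filename):
        live = [node for node in replicas if node not in unavailable]
        if not live:
            raise IOError(f"no live replica for block {block}")
        plan.append((block, live[0]))
    return plan


normal_plan = plan_read("File.txt")
degraded_plan = plan_read("File.txt", unavailable={"DataNode 1"})
```

With DataNode 1 unavailable, all three blocks can still be read, here from DataNode 5.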

Page 21

MAPREDUCE IN A NUTSHELL

Word Count Example

Input:   Deer Bear River / Car Car River / Deer Car Bear
Split:   [Deer Bear River] [Car Car River] [Deer Car Bear]
Map:     (Deer, 1) (Bear, 1) (River, 1) | (Car, 1) (Car, 1) (River, 1) | (Deer, 1) (Car, 1) (Bear, 1)
Shuffle: (Bear, 1) (Bear, 1) | (Car, 1) (Car, 1) (Car, 1) | (Deer, 1) (Deer, 1) | (River, 1) (River, 1)
Reduce:  (Bear, 2) (Car, 3) (Deer, 2) (River, 2)
Result:  Bear, 2; Car, 3; Deer, 2; River, 2
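The word count flow above can be reproduced in a few lines of plain Python. This is a single-process sketch of the map, shuffle and reduce stages, not Hadoop code; the function names are chosen for illustration.

```python
from collections import defaultdict

# The three input splits from the slide.
SPLITS = ["Deer Bear River", "Car Car River", "Deer Car Bear"]


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one split."""
    return [(word, 1) for word in split.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all counts by word, sorted by key."""
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return dict(sorted(groups.items()))


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


mapped = [pair for split in SPLITS for pair in map_phase(split)]
shuffled = shuffle_phase(mapped)
result = dict(reduce_phase(word, counts) for word, counts in shuffled.items())
print(result)  # {'Bear': 2, 'Car': 3, 'Deer': 2, 'River': 2}
```

In a real cluster each split would be mapped on a different node and the shuffle would move the grouped pairs across the network to the reducers.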

Page 22

MAPREDUCE VS. RDBMS

§ RDBMS
  § In a centralized database system, you've got one big disk connected to 4 or 8 or 16 big processors.
§ MapReduce
  § In a Hadoop cluster, every server has 2 or 4 or 8 CPUs. You can run your job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. You map the operation out to all of those servers and then you reduce the results back into a single result set.
  § Architecturally, the reason you're able to deal with lots of data is because Hadoop spreads it out. And the reason you're able to ask complicated computational questions is because you've got all of these processors, working in parallel, harnessed together.
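The idea of sending the code to each server, letting each one work on its own piece of the data, and then reducing the partial results can be sketched with worker threads standing in for servers. The partitions and the summing job are invented for illustration; a real cluster distributes work across machines, not threads.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical data set, already split into one partition per "server".
PARTITIONS = [
    [3, 1, 4],
    [1, 5, 9],
    [2, 6, 5],
]


def local_sum(partition):
    """The code shipped to each server: it only sees its own partition."""
    return sum(partition)


def run_job(partitions):
    """Map the function out to all partitions, then reduce to one result."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_results = list(pool.map(local_sum, partitions))
    return reduce(lambda a, b: a + b, partial_results, 0)


total = run_job(PARTITIONS)
```

Only the small partial sums travel back to the caller, while the bulk of the data stays where it is stored.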

Page 23

HADOOP ECOSYSTEM

Page 24

HADOOP ECOSYSTEM

[Figure: Hadoop ecosystem overview]

Page 25

HADOOP'S DATABASE: HBASE*

§ Unlike an RDBMS
  § No secondary indexes
  § No transactions
  § De-normalized, schemaless
§ Random read/write access to big data
§ Billions of rows and millions of columns
§ Automatic data sharding
§ Integrates with MapReduce

* Modeled after Google's BigTable
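The schemaless rows and the sharding by row key can be pictured with a toy table. This is a sketch of the general idea only, not the HBase API: the class, the split keys and the column names are invented for illustration, and versions and timestamps are left out.

```python
from bisect import bisect_right

# Hypothetical split points: region 0 holds keys < "h",
# region 1 holds "h" <= key < "p", region 2 holds the rest.
SPLIT_KEYS = ["h", "p"]


def region_for(row_key, split_keys=SPLIT_KEYS):
    """Return the index of the key range (shard) a row key falls into."""
    return bisect_right(split_keys, row_key)


class ToyTable:
    """Rows are looked up by key; each row may carry a different set of columns."""

    def __init__(self, split_keys=SPLIT_KEYS):
        self.split_keys = split_keys
        self.regions = [dict() for _ in range(len(split_keys) + 1)]

    def put(self, row_key, column, value):
        region = self.regions[region_for(row_key, self.split_keys)]
        region.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        region = self.regions[region_for(row_key, self.split_keys)]
        return region.get(row_key, {}).get(column)


table = ToyTable()
table.put("alice", "info:email", "alice@example.org")
table.put("zoe", "stats:logins", 7)
```

The two rows land in different key ranges and carry different columns, and each read or write touches only the range that owns the row key.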

Page 26

HADOOP USE CASES

Page 27

USE CASES

Data Warehousing

§ Complementary ETL process

[Diagram elements: sources OLTP, CRM, ERP, File Server, ...; ETL; Data Warehouse, Data Marts, Data Cubes; Analytics, Visualization, Reports; further sources Logs, Social Media, Sensors, ...; Sqoop, Flume, Java API; HDFS; Pig, Hive, MapReduce]

Page 28

USE CASES

Data Warehousing

§ Substitutive ETL process

[Diagram elements: sources OLTP, CRM, ERP, File Server, ...; further sources Logs, Social Media, Sensors, ...; Hadoop; Data Warehouse; Analytics, Visualization, Reports]

Page 29

USE CASES

Data Warehousing

§ (Predictive) Analytics at scale

[Diagram elements: sources OLTP, CRM, ERP, File Server, ...; further sources Logs, Social Media, Sensors, ...; Hadoop; Data Warehouse; Analytics, Visualization, Reports]

Page 30

USE CASES

Data Warehousing

§ Machine learning, natural language processing, sentiment at scale

[Diagram elements: sources OLTP, CRM, ERP, File Server, ...; further sources Logs, Social Media, Sensors, ...; Hadoop; ML + NLP*; Data Warehouse; Analytics, Visualization, Reports]

* Personalized recommendations: content, products, services …

Page 31

THANK YOU!

Page 32

CONTACT US

YMC AG
Sonnenstrasse 4
CH-8280 Kreuzlingen
Switzerland

[email protected]
Tel. +41 (0)71 508 24 86
www.ymc.ch
@YMC_Big_Data

Photo Credits:
Slide 03: Matterhorn and Lake by Noel Reynolds
Slide 24: Hadoop Ecosystem by Rishu Shrivastava