Big Data: Some Technical Challenges (Big Data: Quelques Enjeux Techniques)

Thales Communications & Security. A Tentative Typology of Big Analytics Problems (Essai de Typologie des Problèmes de Big Analytics). J.F. Marcotorchino, VP, Scientific Director, GBU SIX

Post on 30-May-2020


TRANSCRIPT

Page 1: BigData: Quelques Enjeux Techniques BIG DATA/BIG ANALYTICS Split . 3 / The information contained in this document and any att ... NoSQL databases / simple & complex query Big Analytics:

Thales Communications & Security

Big Data: Some Technical Challenges. A Tentative Typology of Big Analytics Problems

J.F. Marcotorchino, VP, Scientific Director, GBU SIX

Page 2:

The information contained in this document and any attachments are the property of THALES. You are hereby notified that any review, dissemination, distribution, copying or otherwise use of this document is strictly prohibited without Thales prior written approval. © THALES 2011. Template trtp version 7.0.8

BIG DATA / BIG ANALYTICS Split

Page 3:

Definitions

Big Data: all the technologies and techniques that help with scaling:

• Large file storage (virtual)

• Distributed processing (Hadoop) / MapReduce

• NoSQL databases / simple and complex queries

Big Analytics: techniques that are executed on a Big Data infrastructure and have the following properties:

• They adapt ad hoc techniques (statistics, learning) to this environment

• They scale linearly (O(N) or O(N log N) order of magnitude) or lend themselves to heavy parallelization

• Linearization is mandatory, either at the « criteria level » or at the « constraint polytopes level »

• They use special types of learning techniques based on dimension reduction
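The map-reduce pattern named in the definitions can be illustrated with a minimal single-process sketch. This is not Hadoop code: the function names `map_phase`, `shuffle` and `reduce_phase` and the word-count task are assumptions chosen for illustration only.

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit one (word, 1) pair per word of a document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: sum the values of each key independently."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(documents):
    # On a cluster each document could be mapped on a different node;
    # here the map calls simply run one after the other.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))

counts = word_count(["big data big analytics", "big storage"])
```

Because every key is reduced independently, both the map and the reduce steps can in principle be spread over several machines, which is the property the slide relies on.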

Page 4:

The 4 Vs (Les 4 V)

The 4 V Challenge

• Volume: large storage capacities are now available

  • NAS (Network Attached Storage), virtualized storage, cloud computing

• Velocity: strong demand for immediate results

  • Stream analytics for SEP/CEP (Stream & Complex Event Processing); in-memory computations adapted to key-value stores

• Variety: large diversity of heterogeneous data types

  • Structured data (classical DB entries) or semi-structured data (images with added metadata)

  • Unstructured data: text, speech, raw images, etc.

• Value: the intrinsic value of the « Data/Information » pair is now recognized by businesses
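The « Velocity » item mentions stream analytics and in-memory computations on key-value stores. A minimal sketch of that idea is a per-key sliding-window counter held in memory. The class name `SlidingWindowCounter` and its interface are hypothetical, and the sketch assumes that timestamps arrive in non-decreasing order for each key.

```python
from collections import defaultdict, deque

class SlidingWindowCounter:
    """Counts, per key, the events seen during the last `window` time units."""

    def __init__(self, window):
        self.window = window
        self.events = defaultdict(deque)  # key -> timestamps still in the window

    def add(self, key, timestamp):
        """Record one event and return the current count for its key."""
        queue = self.events[key]
        queue.append(timestamp)
        # Drop timestamps that fell out of the window (lower bound excluded).
        while queue and queue[0] <= timestamp - self.window:
            queue.popleft()
        return len(queue)

counter = SlidingWindowCounter(window=10)
```

Each update only touches the queue of one key, so the cost per event stays small and independent of the total stream length.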

Page 5:

Some Confusions to Avoid

Do not confuse combinatorial complexity with indexing complexity, that is, the difficulty of the computations with the management of huge data volumes (HPC vs Big Data).

• In the first case: it is not the amount of data per se that is the obstacle, but the intrinsic combinatorial structure of the problem to solve.

  • Example: ≅ 10^29300 solutions (Berend-Tassa estimate, 2010) to explore when clustering a set of N = 10000 objects or individuals.

  • Yet N = 10000 is not a huge amount.

• In the second case: it is the amount of data itself that poses a problem, through the structure of the indexing and storage architectures (difficulty due to scalability constraints).
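The order of magnitude quoted for N = 10000 can be checked with a short sketch. It assumes that the slide's figure refers to the number of partitions of an N-element set (the Bell number) and to the Berend-Tassa upper bound (0.792 N / ln(N + 1))^N; both function names are illustrative.

```python
import math

def bell_number(n):
    """Exact number of partitions of an n-element set, via the Bell triangle."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]          # each row starts with the last entry of the previous one
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

def berend_tassa_log10_bound(n):
    """log10 of the upper bound (0.792 n / ln(n + 1))^n on the n-th Bell number."""
    return n * math.log10(0.792 * n / math.log(n + 1))

exponent = berend_tassa_log10_bound(10000)  # base-10 exponent of the bound for N = 10000
```

Under these assumptions the exponent for N = 10000 lands between 29300 and 29400, in line with the slide, whereas exact enumeration is only practical for very small N.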

Page 6:

How to Address Scalability Problems

Scalability by « Linearization » vs scalability by « Parallelization »

• In the first mode: if the computing time needed for a population of N objects is T, then with a linear algorithm the computing time becomes ≅ αT when the population size jumps from N to αN.

• In the second mode: if an algorithm dedicated to a population of size N can be processed on a SINGLE machine within a time T, then when the population scales up to αN (α integer), the computations can be distributed over α machines to keep the computing time equal to T.

Combining both modes is the best possible approach (when suitable).
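The two modes can be combined in a small sketch: a linear pass over the data is split into α chunks of about N items each, and each chunk is handed to one worker. The worker threads here only stand in for the α machines of the slide (CPython threads do not speed up CPU-bound work), and the names `linear_pass`, `split` and `parallel_pass` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def linear_pass(chunk):
    """One O(len(chunk)) pass over the data: here, a sum of squares."""
    return sum(x * x for x in chunk)

def split(data, parts):
    """Split data into `parts` contiguous chunks of near-equal size."""
    size, rest = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < rest else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def parallel_pass(data, workers):
    """Run the linear pass on each chunk, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(linear_pass, split(data, workers)))
```

With α workers and αN items, each worker still processes about N items, which is the condition under which the computing time stays close to T.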

Page 7:

An Operational Characterization of Big Analytics Methods

Big Data Analytics: « Extended » vs « Intrinsic » cases

• « Extended » case:

  • NoSQL storage architectures, or NewSQL ones, can be used

  • Exhaustive analysis of the whole data set is not mandatory at all

  • « Analytic Sampling » or « Big Sampling » is sufficient in most cases, e.g. customer segmentation, CRM, cross-selling, churn and attrition analysis, intrusion analysis or HUMS (Health & Usage Monitoring Systems)

  • The rest of the population, outside the « samples », is processed by « inferential segmentation » or by « linear assignment »
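The « Extended » workflow can be sketched as follows: segments are learned on a sample only, then every member of the population is assigned to its nearest segment in a single linear pass. The use of k-means on the sample is an assumption made for this sketch (the slides do not name the segmentation method), and all function names are illustrative.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))

def kmeans(sample, k, iterations=10, seed=0):
    """Plain Lloyd iterations, run on the sample only."""
    centroids = random.Random(seed).sample(sample, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for point in sample:
            groups[nearest(point, centroids)].append(point)
        centroids = [
            tuple(sum(col) / len(group) for col in zip(*group)) if group else centroids[i]
            for i, group in enumerate(groups)
        ]
    return centroids

def segment_by_sampling(population, k, sample_size, seed=0):
    """Learn k segments on a sample, then assign the whole population in O(N * k)."""
    sample = random.Random(seed).sample(population, sample_size)
    centroids = kmeans(sample, k, seed=seed)
    labels = [nearest(point, centroids) for point in population]
    return centroids, labels
```

The expensive iterative step only sees `sample_size` points, while the full population is touched once, by the assignment pass.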

Page 8:

An Operational Characterization of Big Analytics Methods

Big Data Analytics: « Extended » vs « Intrinsic » cases

• « Intrinsic » case:

  • It is mandatory to rely on the full data set (exhaustivity), even though avoiding this remains a research topic

  • No a priori knowledge, or only partial knowledge, of the population structure

  • Data are stored in NoSQL architectures using the appropriate formats (for graph databases, for example: Neo4j, or FlockDB, an open source, distributed, fault-tolerant graph database for managing data at scale, chosen by Twitter)

  • To manage the exhaustivity constraint, it is necessary to use heuristics or meta-heuristics based upon linear iterations, or parallelization through distributed computations

Page 9:

Some NoSQL DB Types

• Key-Value Stores

• Column-Oriented DB

• Document-Oriented DB

• Graph Databases

Examples shown: BigTable (Google), (Facebook), Infinity DB, DynamoDB (Amazon), Neo4j

Complexity grows like E Rel (E = number of entities, Rel = average relationships per entity)

Page 10:

BIG DATA CONCEPTUAL FOUNDATIONS [Brewer CAP Assignment]

It is impossible to satisfy all 3 items: choose 2

[Figure: CAP triangle with vertices Consistency, Availability and Partition Tolerance, and edges CA, AP and CP. Systems placed on the diagram: MemcacheDB / Berkeley DB, Voldemort, CouchDB, HBase]

Page 11:

This document may not be reproduced, modified, adapted, published or translated in any way, in whole or in part, nor disclosed to a third party, without the prior written consent of Thales. © THALES 2012. All rights reserved. Template trtp version 7.1.0

Some Ideas for Solving Intrinsic Big Analytics Problems

Use mainly exhaustive methods (if possible, no statistical sampling) (Data Driven vs Hypothesis Driven)

• Affinity Analysis & Sequential Patterns (pure linear matchings, scalar products)

• Use classifiers with linear criteria

• Practice iterative queries

  • R2I2: Requêtage Récursif Itératif Intelligent (intelligent iterative recursive querying: two techniques applied in alternation, regularized similarity + clustering « on the fly »)

• Unsupervised clustering (no a priori) (extending « No K-Means » approaches using linear relational criteria)

• Text mining (word spotting)

• Reticular Data Analysis (social networks, huge IT networks): routing procedures, modularizations, dynamic topology
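The remark that affinity analysis reduces to scalar products can be illustrated by encoding each item as a 0/1 indicator vector over the transactions: the co-occurrence count of two items is then their dot product. The basket data and the function names below are invented for illustration.

```python
def indicator(transactions, item):
    """0/1 vector telling, for each transaction, whether it contains `item`."""
    return [1 if item in transaction else 0 for transaction in transactions]

def dot(u, v):
    """Scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def affinity(transactions, a, b):
    """Support of {a, b} and confidence of the rule a -> b, from scalar products."""
    va, vb = indicator(transactions, a), indicator(transactions, b)
    both = dot(va, vb)       # transactions containing a and b
    count_a = dot(va, va)    # transactions containing a
    support = both / len(transactions)
    confidence = both / count_a if count_a else 0.0
    return support, confidence

# Hypothetical baskets, for illustration only.
baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
support, confidence = affinity(baskets, "bread", "milk")
```

Each scalar product is a single linear pass over the transactions, which is why this family of methods fits the exhaustive, linear setting described on the slide.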

Page 12:

BIG ANALYTICS TYPOLOGY

Page 13:

Tentative structuring of Big Analytics Approaches

[Figure: methods positioned by « Level of Problem Complexity » versus « Lack of Population Knowledge », grouped into four families: Classical BI Data Mining, Vector Matching Structuring, Learning & Neural Nets, Reticular Data Structuring. Methods shown: Learning Model for Unsupervised Classification; Limited Layers Neural Nets; Naïve Bayes Networks; Self-Encoded and Hourglass-Shaped Neural Nets; Image & Video Analytics; Sequential Patterns Recognition & Affinity Analysis; Parallel Coordinates; Unsupervised Clustering; Large Networks Topological Design; Supervised Rule-Based Classification; Social Networks Communities Detection; Reticular Visual Analytics; BiClass SVM; Faces & Pattern Recognition; Piecewise Linear Regression; Multi-Class SVM; MOLAP and XOLAP; MDL Learning Models]

Page 14:

An Example of an Intrinsic Big Analytics Problem: Graph Modularity

Girvan-Newman's quadratic formulation

[Figure: Krebs' graph on American politics, with « Liberal », « Conservative » and « Centrist » communities. Source: S. Mandal (MIT)]

The modularity of a network is "the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random" (« Deviation to Independence »).

• Maximizing modularity rigorously may be NP-hard

• Use heuristic approaches

MIT heuristic algorithm:

• Construct the modularity matrix and find its largest eigenvalue and the corresponding eigenvector

• Partition the network into two parts based on the signs of the elements of that eigenvector

• Repeat for each part

• If a proposed split does not cause modularity to increase, declare the subgraph indivisible and do not split it

• When the entire graph consists of indivisible subgraphs, stop

Typical running time: O(N² log N) for a sparse graph
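The first split of the heuristic described on this slide can be sketched in plain Python. The sketch assumes an undirected graph with at least one edge, given as a 0/1 adjacency matrix, and it obtains the leading eigenvector by power iteration on a shifted matrix, which is a choice made here for self-containment rather than something stated in the slides. All function names are illustrative.

```python
def modularity_matrix(adj):
    """B[i][j] = A[i][j] - k_i * k_j / (2m) for an undirected 0/1 adjacency matrix."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)
    return [[adj[i][j] - degree[i] * degree[j] / two_m for j in range(n)]
            for i in range(n)]

def modularity(adj, labels):
    """Q = (1/2m) * sum of B[i][j] over ordered pairs (i, j) in the same group."""
    n = len(adj)
    two_m = sum(sum(row) for row in adj)
    b = modularity_matrix(adj)
    return sum(b[i][j] for i in range(n) for j in range(n)
               if labels[i] == labels[j]) / two_m

def leading_eigenvector(matrix, iterations=500):
    """Power iteration on matrix + shift * I (the shift keeps eigenvalues non-negative)."""
    n = len(matrix)
    shift = max(sum(abs(x) for x in row) for row in matrix)
    vector = [float(i + 1) for i in range(n)]  # arbitrary non-symmetric start
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vector[j] for j in range(n)) + shift * vector[i]
               for i in range(n)]
        norm = max(abs(x) for x in nxt)
        vector = [x / norm for x in nxt]
    return vector

def bisect(adj):
    """One split: group the nodes by the sign of the leading eigenvector of B."""
    vector = leading_eigenvector(modularity_matrix(adj))
    return [0 if x >= 0 else 1 for x in vector]

def adjacency(n, edges):
    """Build a symmetric 0/1 adjacency matrix from an edge list."""
    adj = [[0] * n for _ in range(n)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1
    return adj
```

A full implementation would apply `bisect` recursively and keep a split only when `modularity` increases, as the slide describes. The dense matrix used here costs O(N²) per iteration and is only meant for small examples.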

Page 15:

By relational transform, we turn the criterion into a linear function subject to linear constraints:

Xij − Xji = 0 ∀(i,j) (Symmetry)

Xii = 1 ∀i (Reflexivity)

Xij + Xjk − Xik ≤ 1 ∀(i,j,k) (Transitivity)

Xij ∈ {0,1} (Binarity)

Idea: relying on the locally linear « Louvain » algorithm (Blondel-Guillaume) (Univ. Louvain / UPMC LIP6), use the Linear Relational Form → O(N log N)

We can do more: thanks to the genericity of the Louvain algorithm, we can use better linear criteria than the Girvan-Newman one, based on Optimal Transport justifications, e.g. « Deviation to Indetermination » (Patricia Conde-Cespèdes)
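The relational encoding can be made concrete with a short sketch: a partition is represented by a 0/1 matrix X with Xij = 1 when i and j share a class, the four constraints are checked directly, and the criterion is evaluated as a linear function of the Xij. The sketch assumes the criterion is the Girvan-Newman modularity written as (1/2m) Σ (Aij − ki kj / 2m) Xij; the function names are illustrative, and the cubic transitivity check is only meant for small examples.

```python
def relation_from_labels(labels):
    """X[i][j] = 1 when i and j carry the same class label, else 0."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)] for i in range(n)]

def is_equivalence_relation(x):
    """Check binarity, reflexivity, symmetry and transitivity of X."""
    n = len(x)
    idx = range(n)
    binary = all(x[i][j] in (0, 1) for i in idx for j in idx)
    reflexive = all(x[i][i] == 1 for i in idx)
    symmetric = all(x[i][j] - x[j][i] == 0 for i in idx for j in idx)
    transitive = all(x[i][j] + x[j][k] - x[i][k] <= 1
                     for i in idx for j in idx for k in idx)
    return binary and reflexive and symmetric and transitive

def linear_modularity(adj, x):
    """Modularity as a linear function of X: (1/2m) * sum (A_ij - k_i k_j / 2m) * X_ij."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)
    return sum((adj[i][j] - degree[i] * degree[j] / two_m) * x[i][j]
               for i in range(n) for j in range(n)) / two_m
```

Since the objective is a weighted sum of the Xij, moving a single node between classes changes it by a sum over that node's row only, which is the kind of local update a Louvain-style heuristic exploits.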

Page 16:

Big Analytics: Some Topics of Interest

Big Analytics for Cyber-Security

Components for attack detection and investigation (intelligent IDS from normalized log analytics, IS passive and dynamic mapping, logs analytics, cyber intelligence):

• Attack detection from relational & content data, intelligent IDS and sandbox coupling

• Intelligent coupling with IS passive and dynamic mapping

• Big Data platform for logs analytics, visual analytics

Big Analytics for Smart Transport

Business Analytics Web portal for passenger behaviour and profile understanding, traffic anomaly detection:

• New components and use cases focused on mobility

• Approach based on space-time queries, BI, early warning engine, Big Analytics and optimization techniques for Smart City

• Fraud detection

Big Analytics for National Security

Social Web Intelligence for National Security:

• Cyber-infringement detection and investigation

• SNA: social mining, crisis management

Maritime security: predictive analysis & anomaly detection

E-border: Big Analytics on passenger logs

Big Analytics for Maintenance

HUMS (Health & Usage Monitoring Systems): applications to vehicle, radar, weapon systems, transport…

Page 17:

Big Analytics Innovation Trends at Medium-Range Horizon

• Coupling auto-encoder neural nets with predictive modeling → feature extraction

• Opening « Data Streaming Processing » (real time) to more sophisticated and powerful analytical tools → towards real-life CEP

• Coupling « Genetic Algorithms » with « Relational linear transforms » → linearization procedures

• In network analysis, addressing the complexity of dynamic graph modeling → dynamic modularization