Thales Communications & Security
Big Data: Some Technical Challenges. A Tentative Typology of Big Analytics Problems
J.F. Marcotorchino, VP, Scientific Director, GBU SIX
The information contained in this document and any attachments are the property of THALES. You are hereby notified that any review, dissemination, distribution, copying or otherwise use of this document is strictly prohibited without Thales prior written approval. © THALES 2011. Template trtp version 7.0.8
BIG DATA / BIG ANALYTICS Split
Definitions
Big Data: all the technologies and techniques that help systems scale
• Large file storage (virtual)
• Distributed processing (Hadoop) / MapReduce
• NoSQL databases / simple and complex queries
Big Analytics: techniques that are executed on a Big Data infrastructure and have the following properties:
• Adaptation of ad hoc techniques (statistics, learning) to this environment
• Linear scaling (O(N) or O(N log N) order of magnitude), or suitability for heavy parallelization
• Linearization is mandatory, either at the “criteria level” or at the “constraint polytopes level”
• Use of special types of learning techniques based on dimension reduction
The 4 V Challenge
• Volume: large storage capacities are now available
  • NAS (Network Attached Storage)
  • Virtualized storage
  • Cloud computing
• Velocity: strong demand for immediate results
  • Stream analytics for SEP/CEP (Stream and Complex Event Processing)
  • In-memory computations adapted to key-value stores
• Variety: large diversity of heterogeneous data types
  • Structured data (classical DB entries) or semi-structured data (images with added metadata)
  • Unstructured data: text, speech, raw images, etc.
• Value: the intrinsic value of the pair « Data/Information » is now recognized by businesses
Some Confusions to Avoid
Do not confuse combinatorial complexity with indexing complexity, i.e. the difficulty of the computations themselves with the management of huge data volumes (HPC vs BIG DATA)
• In the first case:
It is not the amount of data per se that is the obstacle, but the intrinsic combinatorial structure of the problem to solve:
  • Example: ≈ 10^29300 solutions (Berend-Tassa estimate, 2010) to explore for clustering a set of N = 10000 objects or individuals.
  • Nevertheless, N = 10000 is not a huge amount
• In the second case:
It is the amount of data itself that poses the problem, through the structure of the indexing and storage architectures (difficulty due to scalability constraints).
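The order of magnitude quoted for N = 10000 can be checked with a short sketch. It assumes the Berend-Tassa upper bound on Bell numbers, B_n < (0.792 n / ln(n + 1))^n, where B_n counts the partitions of a set of n objects; the function names are illustrative only.

```python
import math

def bell_number(n):
    """Exact Bell number B_n via the Bell triangle (practical only for small n)."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

def log10_bell_upper_bound(n):
    """log10 of the Berend-Tassa bound (0.792 n / ln(n + 1))^n, for n >= 1."""
    return n * math.log10(0.792 * n / math.log(n + 1))

# Sanity check on a small case: the bound must exceed the exact value.
print(bell_number(10), 10 ** log10_bell_upper_bound(10))

# For N = 10000 the exponent is a little above 29300,
# consistent with the figure of about 10^29300 partitions.
print(log10_bell_upper_bound(10000))
```

Even at this modest N, exhaustive enumeration is out of reach, which is the point of the first case: the difficulty is combinatorial, not a matter of volume.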
How to Address Scalability Problems
Scalability by « Linearization » VS Scalability by « Parallelization »
• In the first mode:
If the computing time needed for a population of N objects is T, a linear algorithm will take a computing time ≈ αT when the population size jumps from N to αN.
• In the second mode:
If an algorithm dedicated to a population of size N can be processed on a SINGLE machine within a time T, then when the population scales up to αN (α integer), the computations can be distributed over α machines to keep the computing time equal to T.
Combining both modes is the best possible approach (when applicable)
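The two modes can be illustrated with an idealized timing model. This is a minimal sketch, assuming a perfectly linear algorithm and an even split across machines with no communication or coordination overhead; the function names and the numbers are hypothetical.

```python
def linearized_time(base_time, alpha):
    """Single-machine time of a linear algorithm when N grows to alpha * N."""
    return alpha * base_time

def parallelized_time(base_time, alpha, machines):
    """Ideal time when the alpha * N workload is split evenly over several machines."""
    return alpha * base_time / machines

T = 60.0   # hypothetical: seconds needed for N objects on one machine
alpha = 8  # the population grows from N to 8 N

print(linearized_time(T, alpha))           # 480.0: linear growth on one machine
print(parallelized_time(T, alpha, alpha))  # 60.0: alpha machines keep the time at T
```

In practice the second figure is a lower bound, since distribution adds overhead that this model deliberately ignores.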
An Operational Characterization of Big Analytics Methods
Big Data Analytics: « Extended » VS « Intrinsic » cases
• « Extended » case:
  • Possible use of NoSQL storage architectures, or NewSQL ones
  • Exhaustive analysis of the whole data set is not mandatory at all
  • « Analytic Sampling » or « Big Sampling » is sufficient in most cases, e.g. customer segmentation, CRM, cross-selling, churn and attrition analysis, intrusion analysis or HUMS (Health & Usage Monitoring Systems)
  • The rest of the population, outside the « samples », is processed by « inferential segmentation » or by « linear assignment »
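The « Extended » workflow above can be sketched as follows: segment a sample, then assign every remaining individual to its nearest segment in a single linear pass. This is only an illustration under stated assumptions: the slides do not specify the segmentation method, so a basic k-means on the sample stands in for it, and all function names and data are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def nearest(point, centroids):
    """Index of the closest centroid (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])))

def segment_sample(sample, k, iterations=10):
    """Basic k-means on the sample only (stand-in for the segmentation step)."""
    centroids = [sample[i * len(sample) // k] for i in range(k)]  # evenly spaced start
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in sample:
            groups[nearest(p, centroids)].append(p)
        centroids = [centroid(g) if g else centroids[i] for i, g in enumerate(groups)]
    return centroids

def assign_remaining(population, centroids):
    """One linear pass over the population: O(N * k) distance evaluations."""
    return [nearest(p, centroids) for p in population]

# Hypothetical data: two well-separated groups of 100 individuals each.
group_a = [(0.1 * (i % 5), 0.1 * (i % 7)) for i in range(100)]
group_b = [(10 + 0.1 * (i % 5), 10 + 0.1 * (i % 7)) for i in range(100)]
population = group_a + group_b

sample = population[::10]               # 10 % systematic sample
centroids = segment_sample(sample, k=2)
labels = assign_remaining(population, centroids)
```

Only the sample goes through the iterative step; the cost on the full population stays linear in N.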
• « Intrinsic » case:
  • It is mandatory to rely on the full data set (exhaustivity); how to avoid doing so remains a research topic
  • No a priori knowledge, or only partial knowledge, of the population structure
  • Data are stored in NoSQL architectures using the appropriate formats (for graph databases, for example: Neo4j, or FlockDB, an open source, distributed, fault-tolerant graph database for managing data at scale, developed at Twitter)
  • To manage the exhaustivity constraint, one must use heuristics or meta-heuristics based on linear iterations, or parallelization through distributed computations
Some NoSQL DB Types
• Key-Value Stores
• Column-Oriented DB
• Document-Oriented DB
• Graph Databases
Examples shown: BigTable (Google), (Facebook), Infinity DB, DynamoDB (Amazon), Neo4j
Complexity grows with E and Rel, where E = number of entities and Rel = average number of relationships per entity
BIG DATA CONCEPTUAL FOUNDATIONS [Brewer CAP Assignment]
It is impossible to satisfy all 3 properties at once: choose 2
Figure: CAP triangle with vertices Consistency, Availability and Partition Tolerance, and sides CA, AP and CP. Products placed on the figure: MemcacheDB / Berkeley DB, Voldemort, CouchDB, HBase.
This document may not be reproduced, modified, adapted, published or translated in any way, in whole or in part, nor disclosed to a third party, without the prior written consent of Thales. © THALES 2012. All rights reserved. Template trtp version 7.1.0
Some Ideas for Solving Intrinsic Big Analytics Problems
Use mainly exhaustive methods, if possible without statistical sampling (data driven vs hypothesis driven)
• Affinity analysis and sequential patterns (pure linear matching via scalar products)
• Use classifiers with linear criteria
• Practice iterative queries
  • R2I2: Requêtage Récursif Itératif Intelligent (intelligent iterative recursive querying: two techniques applied alternately, regularized similarity and « on the fly » clustering)
• Unsupervised clustering with no a priori (extending « No K-Means » approaches using linear relational criteria)
• Text mining (word spotting)
• Reticular data analysis (social networks, huge IT networks): routing procedures, modularization, dynamic topology
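As an illustration of affinity analysis by scalar products, the sketch below encodes each item as a binary vector over the transactions, so that the co-occurrence count of two items is simply the dot product of their vectors. The data and the function name are hypothetical, and this is one simple reading of « linear matching », not the author's implementation.

```python
from itertools import combinations

def affinity_counts(baskets):
    """Co-occurrence count of every item pair, computed as a scalar product."""
    items = sorted({item for basket in baskets for item in basket})
    # One binary vector per item: 1 if the item appears in the basket, else 0.
    columns = {item: [1 if item in basket else 0 for basket in baskets]
               for item in items}
    return {(a, b): sum(x * y for x, y in zip(columns[a], columns[b]))
            for a, b in combinations(items, 2)}

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
counts = affinity_counts(baskets)
print(counts[("bread", "milk")])  # 2: bread and milk appear together in two baskets
```

Each scalar product is a single pass over the transactions, and the pairs are independent of one another, so the computation is easy to distribute.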
BIG ANALYTICS TYPOLOGY
Tentative structuring of Big Analytics approaches
Figure: chart with axes « Level of Problem Complexity » and « Lack of Population Knowledge », grouping methods into four families: Classical BI Data Mining, Vector Matching Structuring, Learning & Neural Nets, Reticular Data Structuring.
Methods placed on the chart:
• MOLAP and XOLAP
• Parallel Coordinates
• Piecewise Linear Regression
• Supervised Rule-Based Classification
• Naïve Bayes Networks
• BiClass SVM
• Multi-Class SVM
• MDL Learning Models
• Learning Model for Unsupervised Classification
• Limited-Layer Neural Nets
• Self-Encoded and Hourglass-Shaped Neural Nets
• Image & Video Analytics
• Faces & Pattern Recognition
• Sequential Patterns Recognition & Affinity Analysis
• Unsupervised Clustering
• Large Networks Topological Design
• Social Networks Communities Detection
• Reticular Visual Analytics
An Example of an Intrinsic Big Analytics Problem: Graph Modularity
Girvan-Newman's quadratic formulation
Figure: Krebs' graph on American politics, with “Liberal”, “Conservative” and “Centrist” groups (S. Mandal, MIT)
The modularity of a network is “the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random” (“Deviation to Independence”)
• Maximizing modularity rigorously may be NP-hard
• Use heuristic approaches
MIT heuristic algorithm:
• Construct the modularity matrix and find its largest eigenvalue and the corresponding eigenvector
• Partition the network into two parts based on the signs of the elements of that eigenvector
• Repeat for each part
• If a proposed split does not cause modularity to increase, declare the subgraph indivisible and do not split it
• When the entire graph consists of indivisible subgraphs, stop
Typical running time: O(N² log N) for a sparse graph
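The quoted definition can be made concrete with a short sketch. It assumes the usual normalized form of modularity for an undirected, unweighted graph: the fraction of edges inside groups minus the fraction expected when edges are placed at random with the same node degrees. The graph and the function name are illustrative.

```python
def modularity(edges, community):
    """Modularity of a partition; `community` maps each node to its group label."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Observed fraction of edges whose two ends fall in the same group.
    inside = sum(1 for u, v in edges if community[u] == community[v]) / m
    # Expected fraction under random placement: sum over groups of (d_c / 2m)^2,
    # where d_c is the total degree of group c.
    group_degree = {}
    for node, k in degree.items():
        group_degree[community[node]] = group_degree.get(community[node], 0) + k
    expected = sum((d / (2 * m)) ** 2 for d in group_degree.values())
    return inside - expected

# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}

print(modularity(edges, two_groups))  # 5/14, about 0.357
print(modularity(edges, one_group))   # 0.0: a single group never beats chance
```

A heuristic such as the bisection procedure above only needs this score to decide whether a proposed split is worth keeping.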
By a relational transform, we turn the criterion into a linear function subject to linear constraints:
Xij – Xji = 0 ∀(i,j) (Symmetry)
Xii = 1 ∀i (Reflexivity)
Xij + Xjk – Xik ≤ 1 ∀(i,j,k) (Transitivity)
Xij ∈ {0,1} (Binarity)
Idea: relying on the locally linear « Louvain » algorithm (Blondel-Guillaume) (Univ. Louvain / UPMC LIP6), use the Linear Relational Form
→ O(N log N)
We can do more: thanks to the genericity of the Louvain algorithm, we can use linear criteria better than Girvan-Newman's, based on Optimal Transport justifications, e.g. « Deviation to Indetermination » (Patricia Conde-Céspedes)
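The relational constraints can be checked mechanically. The sketch below builds the binary matrix X of a partition (Xij = 1 when i and j share a group), verifies the four constraints, and evaluates modularity as a function that is linear in X. It assumes the standard form Q(X) = (1/2m) Σij (Aij – ki kj / 2m) Xij; the function names are illustrative and the brute-force transitivity check is for small examples only.

```python
def relational_matrix(labels):
    """X[i][j] = 1 if objects i and j carry the same label, else 0."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)] for i in range(n)]

def is_equivalence(x):
    """Check binarity, reflexivity, symmetry and transitivity (O(n^3) check)."""
    r = range(len(x))
    binary = all(x[i][j] in (0, 1) for i in r for j in r)
    reflexive = all(x[i][i] == 1 for i in r)
    symmetric = all(x[i][j] - x[j][i] == 0 for i in r for j in r)
    transitive = all(x[i][j] + x[j][k] - x[i][k] <= 1 for i in r for j in r for k in r)
    return binary and reflexive and symmetric and transitive

def linear_modularity(adjacency, x):
    """Modularity written as a linear function of the relational matrix X."""
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    two_m = sum(degree)
    return sum((adjacency[i][j] - degree[i] * degree[j] / two_m) * x[i][j]
               for i in range(n) for j in range(n)) / two_m

# Two triangles joined by a single edge, as an adjacency matrix.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adjacency = [[0] * 6 for _ in range(6)]
for u, v in edges:
    adjacency[u][v] = adjacency[v][u] = 1

x = relational_matrix([0, 0, 0, 1, 1, 1])
print(is_equivalence(x))             # True: X encodes a valid partition
print(linear_modularity(adjacency, x))

# Symmetric and reflexive, but 0~1 and 1~2 without 0~2: transitivity fails.
not_a_partition = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
print(is_equivalence(not_a_partition))  # False
```

Because the coefficients (Aij – ki kj / 2m) do not depend on X, any other criterion that is linear in X can be swapped in by changing only that coefficient.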
Big Analytics: Some Topics of Interest
Big Analytics for Cyber-Security
Components for attack detection and investigation (intelligent IDS from normalized log analytics, passive and dynamic IS mapping, log analytics, cyber intelligence)
• Attack detection from relational and content data, coupling of intelligent IDS and sandbox
• Intelligent coupling with passive and dynamic IS mapping
• Big Data platform for log analytics and visual analytics
Big Analytics for Smart Transport
Business analytics Web portal for understanding passenger behaviour and profiles, and for traffic anomaly detection:
• New components and use cases focused on mobility
• Approach based on space-time queries, BI, an early warning engine, Big Analytics and optimization techniques for the Smart City
• Fraud detection
Big Analytics for National Security
Social Web Intelligence for National Security:
• Cyber-infringement detection and investigation
• SNA: social mining, crisis management
Maritime security: predictive analysis and anomaly detection
E-border: Big Analytics on passenger logs
Big Analytics for Maintenance
HUMS (Health & Usage Monitoring Systems): applications to vehicles, radar, weapon systems, transport…
Big Analytics innovation trends at medium-range horizon
• Coupling auto-encoder neural nets with predictive modeling → feature extraction
• Opening « Data Streaming Processing » (real time) to more sophisticated and powerful analytical tools → towards real-life CEP
• Coupling « Genetic Algorithms » with « Relational linear transforms » → linearization procedures
• In network analysis, addressing the complexity of dynamic graph modeling → dynamic modularization