Attack on Graph
DESCRIPTION
This talk covers how Trend Micro SPN uses HBase to solve graph model problems, and how we use PageRank to process our graph data for predictive purposes. We have also published a partial implementation of our graph solution, named HGraph, on GitHub for everyone interested in this topic.
TRANSCRIPT
Scott Miao
2013/12/14
Who am I
• RD, SPN, Trend Micro
• 3 years in the Hadoop ecosystem
• Expertise in HDFS/MR/HBase
• @takeshi.miao
THREATCONNECT
IP, domain, URL, filename, process, file hash, Virus detection, registry key, etc.
Product 1 Product 2 Product 3 …
Threat Connect
Sand-box File Detection
Threat Web
Web Reputation
Family Write-up
TE
Virus DB
APT KB
Most relevant threat reports with actionable intelligence on a single portal
Processes and correlates different data sources
A GRAPH
The problems
• Store a large volume of graph data
• Access a large volume of graph data
• Process a large volume of graph data
Big Data
STORE
Property Graph Model (1/3)
https://github.com/tinkerpop/blueprints/wiki/Property-Graph-Model
Property Graph Model (2/3)
• A property graph has these elements
  – a set of vertices
    • each vertex has a unique identifier.
    • each vertex has a set of outgoing edges.
    • each vertex has a set of incoming edges.
    • each vertex has a collection of properties defined by a map from key to value.
  – a set of edges
    • each edge has a unique identifier.
    • each edge has an outgoing tail vertex.
    • each edge has an incoming head vertex.
    • each edge has a label that denotes the type of relationship between its two vertices.
    • each edge has a collection of properties defined by a map from key to value.
Property Graph Model (3/3)
The domain model for Property Graph Model
The relational model for Property Graph Model
Massively scalable?
Active community?
Analyzable?
• We use HBase as our graph storage
  – Google BigTable and PageRank
  – HBaseCon 2012
The winner is…
Yeah, we are No. 1!!
Use HBase to store Graph data (1/3)
• Schema design
  – Table: vertex
    '<vertex-id>@<entity-type>', 'property:<property-key>@<property-value-type>', <property-value>
  – Table: edge
    '<vertex1-row-key>--><label>--><vertex2-row-key>', 'property:<property-key>@<property-value-type>', <property-value>
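The row-key and qualifier formats above can be sketched as plain string builders. This is an illustrative sketch only; the class and method names here are hypothetical and not taken from HGraph itself.

```java
// Hypothetical helpers illustrating the vertex/edge row-key formats above.
public class RowKeys {

    // '<vertex-id>@<entity-type>' for the vertex table
    public static String vertexRowKey(String vertexId, String entityType) {
        return vertexId + "@" + entityType;
    }

    // '<vertex1-row-key>--><label>--><vertex2-row-key>' for the edge table
    public static String edgeRowKey(String v1RowKey, String label, String v2RowKey) {
        return v1RowKey + "-->" + label + "-->" + v2RowKey;
    }

    // 'property:<property-key>@<property-value-type>' as the column qualifier
    public static String propertyQualifier(String key, String valueType) {
        return "property:" + key + "@" + valueType;
    }
}
```

Because the edge row key embeds both endpoint vertex row keys plus the label, a prefix scan on '<vertex1-row-key>' is enough to find all outgoing edges of a vertex.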
Use HBase to store Graph data (2/3)
• Sample
  – Table: vertex
    'myapps-ups.com@domain', 'property:ip@String', '…'
    'myapps-ups.com@domain', 'property:asn@String', '…'
    …
    'http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe@url', 'property:path@String', '…'
    'http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe@url', 'property:parameter@String', '…'
  – Table: edge
    'myapps-ups.com@domain-->host-->http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe@url', 'property:property1', '…'
    'myapps-ups.com@domain-->host-->http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe@url', 'property:property2', '…'
Use HBase to store Graph data (3/3)
• Tables
  – create 'test.vertex', {NAME => 'property', BLOOMFILTER => 'ROW', COMPRESSION => 'lzo', TTL => '7776000'}
  – create 'test.edge', {NAME => 'property', BLOOMFILTER => 'ROW', COMPRESSION => 'lzo', TTL => '7776000'}
ACCESS
It’s not me, actually…
HBase
Data Sources
Algorithms
Clients
1. Put data
2. Get Data
3. Process Data
Put Data
• The HBase schema design is simple and human-readable
• It is easy to write your own dumping tools as needed
  – MR/Pig/completebulkload
  – You can write cron jobs to clean up broken-edge data
  – TTL can also help retire old data
• We already have a lot of practice with this task
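The broken-edge cleanup mentioned above can be sketched as follows. An edge row is "broken" when one of its endpoint vertex rows no longer exists (for example, after it expired via TTL). In production this would be an MR job or cron-driven scan over the HBase tables; here plain in-memory sets stand in for the row keys, and all names are illustrative.

```java
import java.util.*;

// Sketch of broken-edge detection: collect edge rows whose tail or head
// vertex row key is missing from the vertex table.
public class BrokenEdgeCleaner {

    // Edge row key format: '<vertex1-row-key>--><label>--><vertex2-row-key>'
    public static List<String> findBrokenEdges(Set<String> vertexRowKeys,
                                               Collection<String> edgeRowKeys) {
        List<String> broken = new ArrayList<>();
        for (String edge : edgeRowKeys) {
            String[] parts = edge.split("-->");
            if (parts.length != 3
                    || !vertexRowKeys.contains(parts[0])
                    || !vertexRowKeys.contains(parts[2])) {
                broken.add(edge);
            }
        }
        return broken; // in a real job, these rows would then be deleted
    }
}
```

A cron job would feed this from two table scans and issue Deletes for the returned edge rows.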
Get Data (1/2)
• A Graph API
• Better semantics for manipulating graph data
  – A wrapper around the HBase client API
  – Rather than using the HBase client API directly
• Simple to use

Vertex vertex = this.graph.getVertex("40012");
Vertex subVertex = null;
Iterable<Edge> edges = vertex.getEdges(Direction.OUT, "knows", "foo", "bar");
for (Edge edge : edges) {
  subVertex = edge.getVertex(Direction.OUT);
  ...
}
Get Data (2/2)
• We implement the Blueprints API
  – It provides interfaces as a spec for users to implement
  – Currently the basic query methods are implemented
  – We can get benefits from it
    • Support from other libraries, if we implement more of the Blueprints API
      – http://www.tinkerpop.com/
      – RESTful server, graph algorithms, dataflow, etc.
PROCESS
• Thanks to the human-readable HBase schema design and naturally random-accessible data
  – Write your own MR jobs
  – Write your own Pig scripts/UDFs
• Ex. PageRank
  – http://zh.wikipedia.org/wiki/Pagerank
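The PageRank processing referenced above can be sketched in plain Java. This single-machine version only shows the per-iteration math with the usual damping factor (e.g. 0.85); in HGraph the equivalent computation runs as MapReduce over the HBase edge table, and this sketch ignores dangling vertices.

```java
import java.util.*;

// Single-machine sketch of iterative PageRank over an adjacency list.
public class PageRank {

    // graph: vertex -> outgoing neighbors. Returns vertex -> rank.
    public static Map<String, Double> compute(Map<String, List<String>> graph,
                                              int iterations, double damping) {
        int n = graph.size();
        Map<String, Double> rank = new HashMap<>();
        for (String v : graph.keySet()) rank.put(v, 1.0 / n); // uniform start

        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String v : graph.keySet()) next.put(v, (1 - damping) / n);
            for (Map.Entry<String, List<String>> e : graph.entrySet()) {
                List<String> outs = e.getValue();
                if (outs.isEmpty()) continue; // dangling vertices ignored in this sketch
                double share = damping * rank.get(e.getKey()) / outs.size();
                for (String out : outs) next.merge(out, share, Double::sum);
            }
            rank = next;
        }
        return rank;
    }
}
```

In the MR formulation, the mapper emits each vertex's rank share to its out-neighbors (read from the edge table) and the reducer sums the shares per vertex, one job per iteration.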
HGraph
• An open project published on GitHub
  – https://github.com/takeshimiao/HGraph
• A partial implementation released from our internal pilot project
  – Follows the HBase schema design
  – Reads data via the Blueprints API
  – Processes data with PageRank
• Download or 'git clone' it
  – Use 'mvn clean package'
  – Run on a Unix-like OS
    • Windows users may encounter some errors
There is another project
http://thinkaurelius.github.io/titan/
http://thinkaurelius.github.io/faunus/
OBSERVATIONS
• It seems to make Hadoop a de-facto big data platform
  – Loosens the binding to the MR framework and accommodates others
• A bunch of data-processing frameworks are migrating to it
YARN
http://hortonworks.com/hadoop/yarn/
• Impala vs. Hive (Stinger and Tez)
  – Impala seems more mature than Hive
• YARN!!
  – Hive Stinger and Tez are based on YARN (HDP2)
  – Impala also plans to migrate to YARN (CDH5)
  – Even HBase!! (HOYA)
SQL-on-Hadoop
Hive is built on top of a batch-processing framework (even MRv2), but Impala goes its own way!!
Todd Lipcon
Committer/PMC member on Apache Thrift, HBase, and Hadoop projects
• From what I have seen in Europe/CA/China, HBase is the most popular NoSQL solution if you have already adopted Hadoop
• Other NoSQL stores will not get you out of your OPS pain points
• So the best way is to pick the right tool and play it well
HBase is a popular NoSQL store
http://www.slideshare.net/Hadoop_Summit/what-is-the-point-of-hadoop?from_search=1 #p34