Media Upload and Sharing Website using HBASE
Tushar Mahajan
Santosh Mukherjee
Shubham Mathur
Agenda
Motivation for the project
Introduction
Summary of how we used Hadoop
Why HBASE not RDBMS?
Current Status
Challenges
Future Work
Motivation
"Web 2.0 is the business revolution in the computer industry caused by the move to the internet as platform and an attempt to understand the rules for success on that new platform. Chief among those rules is this: Build applications that harness network effects to get better the more people use them. (This is what I've elsewhere called 'harnessing collective intelligence.')"
- Tim O'Reilly, Grand Poobah 2.0
Why HBase and not an RDBMS?
An RDBMS is powerful, and ideal for small-scale use.
But what if, someday, my site ranks at the top of a Google search? How do I scale performance?
You can run several MySQL instances on different machines, but scaling MySQL is hard, and Oracle is expensive (and hard).
Machine cost goes up faster than speed.
To scale, you end up turning off all the relational features, and the secondary (!) indexes too (!!).
That defeats the point of an RDBMS: its power is building indexes and relational queries, not scaling the number of rows.
What about schema changes or migrations?
MySQL is not your friend there, and it only gets harder with more data.
Introduction
HBase is an Apache open-source project whose goal is to provide Bigtable-like storage for the Hadoop distributed computing environment.
Data Model
Similar to that of Bigtable.
Applications store data rows in labeled tables.
A data row has a sortable row key and an arbitrary number of columns.
A column name has the form “<family>:<label>” where <family> and <label> can be arbitrary byte arrays.
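For example (hypothetical names, matching this project's later schema), a media file could live in a row whose post family holds two labeled columns: post:name for the file name and post:data for the raw bytes.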
HBASE Storage Model
Column-oriented database.
Column names are arbitrary data, and each row can have a variable number of columns.
Supports random reads and writes.
Tables are split into roughly equal-sized regions.
Regions split as they grow, dynamically adapting to your data set.
HBase Query Language (HQL)
${HBASE_HOME}/bin/hbase shell [--help]
Usage: ./bin/hbase shell [--master:IP_ADDRESS:PORT] [--html]
Running the above command on the command line presents the following prompt:
hql>
Sample HBase Query: To create a table:
CREATE TABLE table_name (column_family_definition [, column_family_definition] ...)
column_family_definition:
  column_family_name
  [MAX_VERSIONS=n]
  [MAX_LENGTH=n]
  [COMPRESSION=NONE|RECORD|BLOCK]
  [IN_MEMORY]
  [BLOOMFILTER=NONE|BLOOMFILTER|COUNTING_BLOOMFILTER|RETOUCHED_BLOOMFILTER VECTOR_SIZE=n NUM_HASH=n]
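Following the syntax above, a concrete statement might look like this (the table and family names are hypothetical, chosen to match this project):
CREATE TABLE media (post MAX_VERSIONS=1 COMPRESSION=NONE);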
Sample HBASE Queries (Contd..)
SELECT
Syntax:
SELECT { column_name [, column_name] ... | expr [alias] | * } FROM table_name
[WHERE row='row_key' | STARTING FROM 'row-key'
[UNTIL 'stop-key']]
[NUM_VERSIONS = version_count]
[TIMESTAMP 'timestamp']
[LIMIT = row_count]
[INTO FILE 'file_name']
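For instance, a hypothetical query against the media table sketched earlier, using the row key from the schema slide below:
SELECT post:name FROM media WHERE row='hdfs://Downloads0408200911:12:07';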
Sample HBASE Queries (contd..)
Insert data into table
Syntax:
INSERT INTO table_name
  (column_name, ...) VALUES ('value', ...)
  WHERE row='row_key' [TIMESTAMP 'timestamp'];
column_name:
  column_family_name | column_family_name:column_label_name
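A concrete example against the same hypothetical media table:
INSERT INTO media (post:name) VALUES ('DiaryofJane.mp3') WHERE row='hdfs://Downloads0408200911:12:07';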
HQL FACTS
The hql> shell prompt has now been deprecated.
It has been replaced by a newer shell version.
PS: Don't bother mentioning HQL on IRC.
Sample PHP to Communicate with HBase
// open a new connection to the REST server. HBase Master default port is 60010
$hbase = new hbase_rest($ip, $port);
// get list of tables
$tables = $hbase->list_tables();
// get table column family names and compression settings
$table_info = $hbase->table_schema("search_index");
Sample PHP File (Contd..)
// get start and end row keys of each region
$regions = $hbase->regions($table);
// select data from hbase
$results = $hbase->select($table, $row_key);
// insert data into hbase; $column and $data can be arrays,
// with more than one column inserted in one request
$hbase->insert($table, $row, $column, $data);
// start a scanner on a set range of the table
$handle = $hbase->scanner_start($table, $cols, $start_row, $end_row);
// pull the next row of data for a scanner handle
$results = $hbase->scanner_get($handle);
// delete a scanner handle
$hbase->scanner_delete($handle);
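Putting these calls together, a minimal scan-loop sketch (assuming the hbase_rest wrapper above, and assuming scanner_get() returns a falsy value once the scanner is exhausted):
$hbase = new hbase_rest($ip, $port);
$handle = $hbase->scanner_start($table, $cols, $start_row, $end_row);
while ($row = $hbase->scanner_get($handle)) {
    print_r($row); // process one row of data
}
$hbase->scanner_delete($handle); // always release the scanner handle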
How to store data in HBASE?
Maybe not your raw log data...
Store the results of processing it with Hadoop.
By storing this processed version in HBase, you can keep up with huge data demands and serve it to your website.
Website access
Using the Thrift gateway, PHP code accesses HBase.
No additional caching beyond what HBase provides.
Large Data Storage
Over 9 billion rows and 1300 GB in HBase.
Can MapReduce over a 700 GB table in ~20 min.
That is about 6 million rows/sec.
Challenges
Lack of Documentation
It is new, so it is hard to find documentation, libraries, or tutorials.
Hostel Wireless Issues
Need at least two computers to test.
Thrift is still at an early stage: lots of PHP issues :( and no help nearby.
The Freenode IRC #hbase channel was very helpful (but the process is slow).
References
Home Page http://hbase.org
Wiki http://wiki.apache.org/hadoop/Hbase
Freenode IRC #hbase
http://rajeev1982.blogspot.com/2009/06/hbase-setup-0193.html
Overview
The file uploaded through the web page is inserted into HBase as its byte representation.
When the file is requested, we use its key to select the right HBase region and return the corresponding file to the user.
The HBase Table
The table consists of a unique row key.
Associated with the row key is a column family.
The column family comprises two columns.
One stores the file name whereas the other stores the actual file data.
HBASE Schema

  Row: TempAddress+timestamp          Post:Name          Post:Data (in bytes)
  hdfs://Downloads0408200911:12:07    DiaryofJane.mp3    000000101010101010101...
  ...                                 ...                ...
HBASE Schema (Contd..)
Each row gets a unique row key for its column family.
A timestamp is associated with the temporary download location of each file.
The timestamp includes both the time and the date of upload, to rule out clashes.
Backend Associated
The file available in the temporary download location is copied into HBase.
PHP is used as the framework.
The Thrift API acts as a bridge for PHP to communicate with HBase.
The Thrift API enables the socket connection.
The PHP code runs the HBase client code written in Java in the hbase directory, as sketched below.
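A minimal sketch of how that call-out from PHP might look (the shell-out mechanism, paths, and arguments are hypothetical; it assumes the HbaseClient class from the later code snippet is compiled and on the classpath):
// hypothetical shell-out from PHP to the Java HBase client
$cmd = "java -cp /path/to/hbase/classes HbaseClient "
     . escapeshellarg($temp) . " " . escapeshellarg($time);
exec($cmd, $output, $status);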
Backend Associated: Thrift
A software library and set of code-generation tools.
Developed by Facebook.
Used for implementing efficient and scalable backend services.
Goal: To enable efficient and reliable communication across programming languages.
Backend Associated (contd..)
The Java code takes as arguments the download location and the actual file, along with the file name.
A timestamp is then associated with the download location.
Since the download location is fixed for every user, we can generate a unique key using the timestamp.
Backend Associated (contd..)
Open a file stream to read the file.
Convert the file into its byte representation using Java methods.
Create a Put object for the table, keyed by the row key.
The byte representation of the file and the file name are then fed into this Put object.
The Put object then inserts the data into HBase.
Code Snippet

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HbaseClient {
    public static void main(String[] args) throws IOException {
        String temp = args[0], time = args[1]; // download location and upload time
        String rowkey = time + "." + temp;     // unique row key: timestamp + location
        Put p = new Put(Bytes.toBytes(rowkey));
        p.add(Bytes.toBytes("post"), Bytes.toBytes("name"), Bytes.toBytes(temp));
        new HTable(new HBaseConfiguration(), "media").put(p); // table name is a placeholder
    }
}
Program Execution
The URL associated with the file is then returned to the user.
When clicked, the URL is passed as an argument to another Java program that interacts with HBase.
That program creates a Get object.
Using the URL, which is also the unique row key in the table, it returns the data to the user.
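A minimal sketch of that retrieval side (assuming the same 0.20-era Java client API as the earlier snippet; the class name HbaseFetch and the table name "media" are placeholders):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HbaseFetch {
    public static void main(String[] args) throws IOException {
        String rowkey = args[0]; // the row key carried in the URL
        HTable table = new HTable(new HBaseConfiguration(), "media"); // placeholder name
        Result r = table.get(new Get(Bytes.toBytes(rowkey)));
        byte[] name = r.getValue(Bytes.toBytes("post"), Bytes.toBytes("name"));
        byte[] data = r.getValue(Bytes.toBytes("post"), Bytes.toBytes("data"));
        System.out.println(Bytes.toString(name) + ": " + data.length + " bytes");
    }
}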
Conclusion
The Thrift API enables the PHP code to communicate with the Java code in the HBase directory.
There still remains an issue with the scalability of the project.
We could handle that by using a distributed HDFS cluster as the temporary file location.
Conclusion (contd..)
We could then scale the HBase tables across multiple nodes if the table grows large.
This scaling of the HBase table could be done on the basis of the regions associated with the table.