Media Upload and Sharing Website using HBASE
Tushar Mahajan
Santosh Mukherjee
Shubham Mathur
Agenda
Motivation for the project
Introduction
Summary of how we used Hadoop
Why HBASE not RDBMS?
Current Status
Challenges
Future Work
Motivation
"Web 2.0 is the business revolution in the computer industry caused by the move to the internet as platform and an attempt to understand the rules for success on that new platform. Chief among those rules is this: Build applications that harness network effects to get better the more people use them. (This is what I've elsewhere called 'harnessing collective intelligence.')"
- Tim O'Reilly, Grand Poobah 2.0
Why HBase and not an RDBMS?
An RDBMS is powerful, and ideal for small-scale use.
But what if, someday, my site ranks at the top of a Google search? How do I scale performance?
You can run several MySQL instances on different machines, but scaling MySQL is hard, and Oracle is expensive (and hard).
Machine cost goes up faster than speed.
To scale, you end up turning off all the relational features, and the secondary (!) indexes too (!!).
That defeats the point of an RDBMS: its power is building indexes and relational queries, not scaling the number of rows.
What about schema changes or migrations?
MySQL is not your friend there, and it only gets harder with more data.
Introduction
HBase is an Apache open-source project whose goal is to provide Bigtable-like storage for the Hadoop distributed computing environment.
Data Model
Similar to that of Bigtable.
Applications store data rows in labeled tables.
A data row has a sortable row key and an arbitrary number of columns.
A column name has the form “<family>:<label>” where <family> and <label> can be arbitrary byte arrays.
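For example (hypothetical names, matching this project's later schema), a media file could live in a row whose post family holds two labeled columns: post:name for the file name and post:data for the raw bytes.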
HBASE Storage Model
Column-oriented database.
Column names are arbitrary data, and each row can have a variable number of columns.
Supports random reads and writes.
Tables are split into roughly equal-sized regions.
Regions split as they grow, dynamically adapting to your data set.
HBase Query Language (HQL)
${HBASE_HOME}/bin/hbase shell [--help]
Usage: ./bin/hbase shell [--master:IP_ADDRESS:PORT] [--html]
Running the above command on the command line presents the following prompt:
hql>
Sample HBase Query: To create a table:
CREATE TABLE table_name (column_family_definition [, column_family_definition] ...)
column_family_definition:
  column_family_name
  [MAX_VERSIONS=n]
  [MAX_LENGTH=n]
  [COMPRESSION=NONE|RECORD|BLOCK]
  [IN_MEMORY]
  [BLOOMFILTER=NONE|BLOOMFILTER|COUNTING_BLOOMFILTER|RETOUCHED_BLOOMFILTER VECTOR_SIZE=n NUM_HASH=n]
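Following the syntax above, a concrete statement might look like this (the table and family names are hypothetical, chosen to match this project):
CREATE TABLE media (post MAX_VERSIONS=1 COMPRESSION=NONE);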
Sample HBASE Queries (Contd..)
SELECT
Syntax:
SELECT { column_name [, column_name] ... | expr [alias] | * } FROM table_name
[WHERE row='row_key' | STARTING FROM 'row-key'
[UNTIL 'stop-key']]
[NUM_VERSIONS = version_count]
[TIMESTAMP 'timestamp']
[LIMIT = row_count]
[INTO FILE 'file_name']
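For instance, a hypothetical query against the media table sketched earlier, using the row key from the schema slide below:
SELECT post:name FROM media WHERE row='hdfs://Downloads0408200911:12:07';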
Sample HBASE Queries (contd..)
Insert data into table
Syntax:
INSERT INTO table_name
  (column_name, ...) VALUES ('value', ...)
  WHERE row='row_key' [TIMESTAMP 'timestamp'];
column_name:
  column_family_name | column_family_name:column_label_name
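A concrete example against the same hypothetical media table:
INSERT INTO media (post:name) VALUES ('DiaryofJane.mp3') WHERE row='hdfs://Downloads0408200911:12:07';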
HQL FACTS
The hql> shell prompt has now been deprecated.
It has been replaced by a newer shell version.
PS: Don't bother mentioning HQL on IRC.
Sample PHP to Communicate with HBase
// open a new connection to the REST server. HBase Master default port is 60010
$hbase = new hbase_rest($ip, $port);
// get list of tables
$tables = $hbase->list_tables();
// get table column family names and compression settings
$table_info = $hbase->table_schema("search_index");
Sample PHP File (Contd..)
// get start and end row keys of each region
$regions = $hbase->regions($table);
// select data from hbase
$results = $hbase->select($table, $row_key);
// insert data into hbase; $column and $data can be arrays,
// with more than one column inserted in one request
$hbase->insert($table, $row, $column, $data);
// start a scanner on a set range of the table
$handle = $hbase->scanner_start($table, $cols, $start_row, $end_row);
// pull the next row of data for a scanner handle
$results = $hbase->scanner_get($handle);
// delete a scanner handle
$hbase->scanner_delete($handle);
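Putting these calls together, a minimal scan-loop sketch (assuming the hbase_rest wrapper above, and assuming scanner_get() returns a falsy value once the scanner is exhausted):
$hbase = new hbase_rest($ip, $port);
$handle = $hbase->scanner_start($table, $cols, $start_row, $end_row);
while ($row = $hbase->scanner_get($handle)) {
    print_r($row); // process one row of data
}
$hbase->scanner_delete($handle); // always release the scanner handle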
How to store data in HBASE?
Maybe not your raw log data...
Store the results of processing it with Hadoop.
By storing this processed version in HBase, you can keep up with huge data demands and serve it to your website.
Website access
Using the Thrift gateway, PHP code accesses HBase.
No additional caching beyond what HBase provides.
Large Data Storage
Over 9 billion rows and 1300 GB in HBase.
Can MapReduce over a 700 GB table in ~20 min.
That is about 6 million rows/sec.
Challenges
Lack of Documentation
It is new, so it is hard to find documentation, libraries, or tutorials.
Hostel Wireless Issues
Need at least two computers to test.
Thrift is still at an early stage: lots of PHP issues :( and no help nearby.
The Freenode IRC #hbase channel was very helpful (but the process is slow).
References
Home Page http://hbase.org
Wiki http://wiki.apache.org/hadoop/Hbase
Freenode IRC #hbase
http://rajeev1982.blogspot.com/2009/06/hbase-setup-0193.html
Overview
The file uploaded through the web page is inserted into HBase as its byte representation.
When the file is requested, we use its key to select the right HBase region and return the corresponding file to the user.
The HBase Table
The table consists of a unique row key.
Associated with the row key is a column family.
The column family comprises two columns.
One stores the file name whereas the other stores the actual file data.
HBASE Schema

  Row: TempAddress+timestamp          Post:Name          Post:Data (in bytes)
  hdfs://Downloads0408200911:12:07    DiaryofJane.mp3    000000101010101010101...
  ...                                 ...                ...
HBASE Schema (Contd..)
Each row gets a unique row key for its column family.
A timestamp is associated with the temporary download location of each file.
The timestamp includes both the time and the date of upload, to rule out clashes.
Backend Associated
The file available in the temporary download location is copied into HBase.
PHP is used as the framework.
The Thrift API acts as a bridge for PHP to communicate with HBase.
The Thrift API enables the socket connection.
The PHP code runs the HBase client code written in Java in the hbase directory, as sketched below.
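A minimal sketch of how that call-out from PHP might look (the shell-out mechanism, paths, and arguments are hypothetical; it assumes the HbaseClient class from the later code snippet is compiled and on the classpath):
// hypothetical shell-out from PHP to the Java HBase client
$cmd = "java -cp /path/to/hbase/classes HbaseClient "
     . escapeshellarg($temp) . " " . escapeshellarg($time);
exec($cmd, $output, $status);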
Backend Associated: Thrift
A software library and set of code-generation tools.
Developed by Facebook.
Used for implementing efficient and scalable backend services.
Goal: To enable efficient and reliable communication across programming languages.
Backend Associated (contd..)
The Java code takes as arguments the download location and the actual file, along with the file name.
A timestamp is then associated with the download location.
Since the download location is fixed for every user, we can generate a unique key using the timestamp.
Backend Associated (contd..)
Open a file stream to read the file.
Convert the file into its byte representation using Java methods.
Create a Put object for the table, keyed by the row key.
The byte representation of the file and the file name are then fed into this Put object.
The Put object then inserts the data into HBase.
Code Snippet

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HbaseClient {
    public static void main(String[] args) throws IOException {
        String temp = args[0], time = args[1]; // download location and upload time
        String rowkey = time + "." + temp;     // unique row key: timestamp + location
        Put p = new Put(Bytes.toBytes(rowkey));
        p.add(Bytes.toBytes("post"), Bytes.toBytes("name"), Bytes.toBytes(temp));
        new HTable(new HBaseConfiguration(), "media").put(p); // table name is a placeholder
    }
}
Program Execution
The URL associated with the file is then returned to the user.
When clicked, the URL is passed as an argument to another Java program that interacts with HBase.
That program creates a Get object.
Using the URL, which is also the unique row key in the table, it returns the data to the user.
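A minimal sketch of that retrieval side (assuming the same 0.20-era Java client API as the earlier snippet; the class name HbaseFetch and the table name "media" are placeholders):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HbaseFetch {
    public static void main(String[] args) throws IOException {
        String rowkey = args[0]; // the row key carried in the URL
        HTable table = new HTable(new HBaseConfiguration(), "media"); // placeholder name
        Result r = table.get(new Get(Bytes.toBytes(rowkey)));
        byte[] name = r.getValue(Bytes.toBytes("post"), Bytes.toBytes("name"));
        byte[] data = r.getValue(Bytes.toBytes("post"), Bytes.toBytes("data"));
        System.out.println(Bytes.toString(name) + ": " + data.length + " bytes");
    }
}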
Conclusion
The Thrift API enables the PHP code to communicate with the Java code in the HBase directory.
There still remains an issue with the scalability of the project.
We could handle that by using a distributed HDFS cluster as the temporary file location.
Conclusion (contd..)
We could then scale the HBase tables across multiple nodes if the table grows large.
This scaling of the HBase table could be done on the basis of the regions associated with the table.