April 2014 HUG : Apache Phoenix

TRANSCRIPT

Page 1: April 2014 HUG : Apache Phoenix

© Hortonworks Inc. 2011

Apache Phoenix – SQL skin over HBase

Jeffrey Zhong ([email protected], [email protected])

Page 2: April 2014 HUG : Apache Phoenix


Overview

• What is Phoenix?
• Major Phoenix Features
• Futures
• Phoenix In Action
• Summary

Architecting the Future of Big Data

Page 3: April 2014 HUG : Apache Phoenix


What is Phoenix?

• SQL skin for HBase, originally developed at Salesforce.com and now an Apache Incubator project
• Targets low-latency queries over HBase data
• Query engine transforms SQL into native HBase APIs (put, delete, parallel scans) instead of MapReduce
• Delivered as a fat JDBC driver (client)
• Supports features not provided by HBase: secondary indexing, multi-tenancy, a simple hash join, and more

Page 4: April 2014 HUG : Apache Phoenix

Phoenix Semantics Support


Feature            Supported?
UPSERT / DELETE    Yes
SELECT             Yes
WHERE / HAVING     Yes
GROUP BY           Yes
ORDER BY           Yes
LIMIT              Yes
Views              Yes
JOIN               Yes (introduced in 4.0, limited to hash joins)
Transactions       No

Page 5: April 2014 HUG : Apache Phoenix


Why Phoenix?

• Leverage existing tooling: SQL clients
• Frees you from writing large amounts of code to do simple things:

SELECT COUNT(*) FROM WEB_STAT WHERE HOST='EU' AND CORE > 35 GROUP BY DOMAIN;

• Performance optimizations are transparent to the user: Phoenix breaks up queries into multiple scans and runs them in parallel. For aggregate queries, coprocessors perform partial aggregation on the local region server and return only the relevant data to the client.
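The scatter/gather idea above can be sketched in Python: split the rows into partitions, let each worker compute a partial filtered GROUP BY count (standing in for the server-side coprocessor), and merge the partials on the client. This is an illustrative toy, not Phoenix's actual implementation; the sample rows and the `parallel_group_count` helper are invented for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Toy rows: (host, domain, core), mimicking the WEB_STAT query above.
ROWS = [
    ("EU", "apache.org", 50), ("EU", "apache.org", 10),
    ("EU", "hortonworks.com", 90), ("NA", "apache.org", 80),
]

def partial_aggregate(partition):
    """'Server side': filter and partially aggregate one region's rows."""
    counts = Counter()
    for host, domain, core in partition:
        if host == "EU" and core > 35:
            counts[domain] += 1
    return counts

def parallel_group_count(rows, ways=2):
    """'Client side': run partitions in parallel, then merge partial results."""
    chunk = max(1, len(rows) // ways)
    partitions = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    merged = Counter()
    with ThreadPoolExecutor(max_workers=ways) as pool:
        for partial in pool.map(partial_aggregate, partitions):
            merged.update(partial)
    return dict(merged)
```

Only the small per-group partial counts cross the wire in the merge step, which is the point of doing the aggregation server-side.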

Page 6: April 2014 HUG : Apache Phoenix

Phoenix Query Optimization

0: jdbc:phoenix:localhost> explain SELECT count(*) FROM WEB_STAT WHERE HOST='EU' and CORE > 35 GROUP BY DOMAIN;
+------------+
|    PLAN    |
+------------+
| CLIENT PARALLEL 32-WAY RANGE SCAN OVER WEB_STAT ['EU'] |
| SERVER FILTER BY USAGE.CORE > 35 |
| SERVER AGGREGATE INTO DISTINCT ROWS BY [DOMAIN] |
| CLIENT MERGE SORT |
+------------+


CREATE TABLE IF NOT EXISTS WEB_STAT (
    HOST CHAR(2) NOT NULL,
    DOMAIN VARCHAR NOT NULL,
    FEATURE VARCHAR NOT NULL,
    DATE DATE NOT NULL,
    USAGE.CORE BIGINT,
    USAGE.DB BIGINT,
    STATS.ACTIVE_VISITOR INTEGER
    CONSTRAINT PK PRIMARY KEY (HOST, DOMAIN, FEATURE, DATE)
);

SELECT count(*) FROM WEB_STAT WHERE HOST='EU' and CORE > 35 GROUP BY DOMAIN;

WEB_STAT Table Schema

Page 7: April 2014 HUG : Apache Phoenix


Major Features In Phoenix

• DDL support: CREATE/DROP/ALTER TABLE for adding/removing columns
• Extending the schema at query time: Dynamic Columns
• Salting
• Mapping to an existing HBase table
• DML support: UPSERT VALUES for row-by-row insertion, UPSERT SELECT for mass data transfer between the same or different tables, and DELETE for deleting rows
• Secondary Indexes to improve performance for queries on non-row-key columns (still maturing)
• Multi-Tenancy (available in Phoenix 3.0/4.0)
• Limited Hash Join (available in Phoenix 3.0/4.0)
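The hash join mentioned above follows the classic algorithm: build an in-memory hash table over one side, then probe it with the other. A minimal Python sketch of the idea (the function and sample data are invented for illustration, not Phoenix code):

```python
def hash_join(left, right, left_key, right_key):
    """Build a hash table over the right side, then probe with the left side."""
    table = {}
    for row in right:
        table.setdefault(row[right_key], []).append(row)
    joined = []
    for row in left:
        # Each probe is an O(1) lookup instead of a scan of the right side.
        for match in table.get(row[left_key], []):
            joined.append({**row, **match})
    return joined
```

This is why the join is "limited": the build side must fit in memory, which is the main constraint of a pure hash join.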

Page 8: April 2014 HUG : Apache Phoenix

Phoenix Futures

• Improved secondary indexing: tolerant of region split/merge and RegionServer failures
• Improved JOIN support
• Transaction support
• Improved Phoenix / Hive interoperability
• More at http://phoenix.incubator.apache.org/roadmap.html

Page 9: April 2014 HUG : Apache Phoenix

Mapping an existing HBase Table


• create 't1', {NAME => 'f1', VERSIONS => 3}
  – put 't1', 'r1', 'f1:col1', 'val1'
  – put 't1', 'r2', 'f1:col2', 'val2'

• Mapping t1 into a Phoenix table
  – Phoenix stores its own metadata in the SYSTEM.CATALOG table, so you need to create a Phoenix table or view to map the existing HBase table
  – By default, Phoenix upper-cases unquoted identifiers, so it is better practice to always use double quotes

• create table "t1" (myPK VARCHAR PRIMARY KEY, "f1"."col1" VARCHAR);

0: jdbc:phoenix:localhost> select * from "t1";
+------------+------------+
|    MYPK    |    col1    |
+------------+------------+
| r1         | val1       |
| r2         | null       |
+------------+------------+
2 rows selected (0.049 seconds)

0: jdbc:phoenix:localhost> select * from t1;
Error: ERROR 1012 (42M03): Table undefined. tableName=T1 (state=42M03,code=1012)

Page 10: April 2014 HUG : Apache Phoenix


Changes Behind the Scenes of Mapping


• Metadata is inserted into the SYSTEM.CATALOG table

0: jdbc:phoenix:localhost> select table_name, column_name, table_type from system.catalog where table_name='t1';
+------------+-------------+------------+
| TABLE_NAME | COLUMN_NAME | TABLE_TYPE |
+------------+-------------+------------+
| t1         | null        | u          |
| t1         | MYPK        | null       |
| t1         | col1        | null       |
+------------+-------------+------------+

• An empty cell is created for each row. It is used to enforce PRIMARY KEY constraints, because HBase doesn't store cells with NULL values.

hbase(main):023:0> scan 't1'
ROW    COLUMN+CELL
 r1    column=f1:_0, timestamp=1397527184229, value=
 r1    column=f1:col1, timestamp=1397527184229, value=val1
 r2    column=f1:_0, timestamp=1397527197205, value=
 r2    column=f1:col2, timestamp=1397527197205, value=val2
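The role of that empty `_0` cell can be modeled with a toy in-memory store: NULL cells are never written, so without a marker cell a row whose non-PK columns are all NULL would simply disappear. A hypothetical Python sketch (the `upsert`/`select_all` helpers are invented for illustration):

```python
# Toy HBase-like store: {row_key: {column: value}}. Like HBase, it has no
# schema and simply does not store cells whose value is NULL.
store = {}

def upsert(row_key, columns):
    """Phoenix-style upsert: always write an empty marker cell so the row
    exists even when every non-PK column is NULL."""
    cells = {col: val for col, val in columns.items() if val is not None}
    cells["f1:_0"] = ""  # the empty marker cell keeps the row visible
    store.setdefault(row_key, {}).update(cells)

def select_all():
    """A row is returned as long as its marker cell exists."""
    return sorted(store)
```

Without the `f1:_0` marker, upserting a row with only NULL column values would write nothing at all, and the primary key would be lost.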

Page 11: April 2014 HUG : Apache Phoenix


Mapping an existing HBase Table – Cont.

• The way the bytes were serialized in the existing table must match the way Phoenix serializes them. Refer to the Phoenix data types reference (http://phoenix.incubator.apache.org/language/datatypes.html).

Page 12: April 2014 HUG : Apache Phoenix

Dynamic Columns - Extend Schema During Query

• HBase can create new columns (qualifiers) after a table is created. In Phoenix, a subset of columns may be specified at table create time, while the rest can be surfaced at query time through dynamic columns.
  – In the previous table mapping, we only mapped one column, "f1"."col1":

create table "t1" (myPK VARCHAR PRIMARY KEY, "f1"."col1" VARCHAR);

  – In order to get data from col2, we can do:

0: jdbc:phoenix:localhost> select * from "t1"("f1"."col2" VARCHAR);
+------------+------------+------------+
|    MYPK    |    col1    |    col2    |
+------------+------------+------------+
| r1         | val1       | null       |
| r2         | null       | val2       |
+------------+------------+------------+
2 rows selected (0.065 seconds)

Page 13: April 2014 HUG : Apache Phoenix

Secondary Index

• Index data is stored in a separate HBase table, located on different region servers than the data table.
• Two types of secondary index:

Immutable Indexes
  – Target tables whose rows are immutable once written
  – When new rows are inserted, updates are sent to the data table and then to the index table
  – The client handles failures

Mutable Indexes


Page 14: April 2014 HUG : Apache Phoenix


Phoenix Secondary Index – Cont.

Mutable Indexes
  – Implemented through coprocessors
  – Aborts the region server when an index update fails (this can be changed with a custom IndexFailurePolicy)

Courtesy of Jesse Yates, from the SF HBase User Group slides

Page 15: April 2014 HUG : Apache Phoenix


Phoenix Secondary Index – Cont.

• Index Creation
  – The same statement creates both types of index: immutable indexes are created for tables created with "IMMUTABLE_ROWS=true"; otherwise mutable indexes are created
  – DDL statement:

CREATE INDEX <index_name> ON <table_name> (<columns_to_index> ...) INCLUDE (<columns_to_cover> ...);

  – Example:

create index "t1_index" on "t1" ("f1"."col1");

  – Verify the index will be used:

0: jdbc:phoenix:localhost> explain select * from "t1" where "f1"."col1"='val1';
+------------+
|    PLAN    |
+------------+
| CLIENT PARALLEL 1-WAY RANGE SCAN OVER t1_index ['val1'] |
+------------+
1 row selected (0.037 seconds)

Page 16: April 2014 HUG : Apache Phoenix

Phoenix Secondary Index – Cont.

• How Index Data Is Stored

hbase(main):008:0> scan 't1_index'
ROW             COLUMN+CELL
 \x00r2         column=0:_0, timestamp=1397611429248, value=
 val1\x00r1     column=0:_0, timestamp=1397611429248, value=

The index row key is the concatenation of the indexed column values, delimited by a zero byte, ending with the data table's primary key. If you define covered columns, you will also see cells with their values in the index table.
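That encoding rule can be sketched in Python. This is a simplified model for illustration; the helper is invented and ignores details such as type-specific serialization:

```python
def index_row_key(indexed_values, data_row_key):
    """Concatenate the indexed column values, delimited by a zero byte,
    and end with the data table's primary key. NULL values become empty."""
    parts = [v if v is not None else b"" for v in indexed_values]
    return b"\x00".join(parts) + b"\x00" + data_row_key
```

For the scan output above: r1 (col1 = 'val1') maps to b"val1\x00r1", and r2 (col1 is NULL) maps to b"\x00r2", which is why the rows sort in that order.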

Page 17: April 2014 HUG : Apache Phoenix

Salted Table

• Salting prevents HBase region server hotspotting when row keys increase monotonically. Phoenix provides a way to salt the row key with a leading salt byte at table creation time.

For optimal performance, the number of salt buckets should match the number of region servers.


CREATE TABLE table (a_key VARCHAR PRIMARY KEY, a_col VARCHAR) SALT_BUCKETS = 20;
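A toy model of the idea: derive a stable one-byte salt from a hash of the row key and prepend it, so sequential keys scatter across buckets instead of hammering one region. The hash below is an arbitrary stand-in for illustration, not the hash Phoenix actually uses:

```python
SALT_BUCKETS = 20  # matches the SALT_BUCKETS = 20 in the DDL above

def salt_byte(row_key: bytes, buckets: int = SALT_BUCKETS) -> int:
    """Stable bucket number derived from a simple polynomial hash of the key."""
    h = 0
    for b in row_key:
        h = (h * 31 + b) & 0x7FFFFFFF
    return h % buckets

def salted_key(row_key: bytes) -> bytes:
    """Prepend the one-byte salt; the original key is preserved after it."""
    return bytes([salt_byte(row_key)]) + row_key
```

Because the salt is deterministic, point lookups still work (recompute the salt from the key), but range scans must now fan out across all buckets, which is the trade-off of salting.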

Page 18: April 2014 HUG : Apache Phoenix


Resources

• Apache Phoenix home page
  – http://phoenix.incubator.apache.org/index.html
• Mailing lists
  – http://phoenix.incubator.apache.org/mailing_list.html
• Latest release
  – Phoenix 3.0 for HBase 0.94.*, Phoenix 4.0 for HBase 0.98.1+ (http://phoenix.incubator.apache.org/download.html)
  – HDP (Hortonworks Data Platform) 2.1 will ship Phoenix 4.0

Page 19: April 2014 HUG : Apache Phoenix

Try It Yourself

• Load the sample data:
./psql.py localhost ../examples/WEB_STAT.sql ../examples/WEB_STAT.csv

• Start the SQL client:
./sqlline.py localhost

• Run the performance test:
./performance.py localhost 10000


This assumes the HBase ZooKeeper quorum string is "localhost" and you are in the bin folder of the installation.

Page 20: April 2014 HUG : Apache Phoenix


Summary

• Phoenix vs. HBase native APIs: as a rule of thumb, use Phoenix as your HBase client whenever possible, because Phoenix provides easy-to-use APIs and performance optimizations.

Page 21: April 2014 HUG : Apache Phoenix

Questions? Comments?