Scalable PHP applications with Cassandra

Posted on 28-Aug-2014

Category:

Software

DESCRIPTION

Developing a fast and scalable application for your fancy new startup is hard. Many factors can slow a website down, such as network latency, web server configuration or large assets, but as any developer who has worked with high traffic volumes knows, the real bottleneck is the database. In recent years a number of NoSQL solutions have come to the rescue, each with its own pros and cons. Apache Cassandra is one of the most widely used and mature "Big Data" NoSQL databases, and is currently deployed on several projects by tech giants like Twitter, eBay and Netflix, thanks to its extremely high throughput, automatic replication and decentralization. During the session I'll talk about how to leverage Apache Cassandra's best features and data modeling best practices so that your web application projects can respond to huge traffic peaks, using open source tools such as Zend Framework and phpcassa, and I'll describe a large e-commerce project currently using Cassandra.

TRANSCRIPT

@akira28

Scalable PHP web applications with Apache Cassandra

Andrea De Pirro

@akira28

About me

• Co-founder at Yameveo

• 9+ years developing in PHP

• 2+ years experience with Apache Cassandra

• Zend Framework Certified Engineer

Yameveo

Founded in 2012 in Barcelona, Yameveo is a young, dynamic and international company specialised in e-commerce and web application development.

www.yameveo.com @Yameveo

Yameveo Store

Dozens of e-commerce modules

store.yameveo.com

What we will talk about

• Apache Cassandra

• Data Modeling

• Cassandra & PHP

• Case study

Apache Cassandra

Apache Cassandra is a massively scalable open source NoSQL database. Cassandra is perfect for managing large amounts of structured, semi-structured, and unstructured data across multiple data centers and the cloud. Cassandra delivers continuous availability, linear scalability, and operational simplicity across many commodity servers with no single point of failure, along with a powerful dynamic data model designed for maximum flexibility and fast response times.

Apache Cassandra documentation

Why Cassandra

• Open Source (enterprise distribution also available)

• Linearly scalable

• Fault-tolerant

• Fully distributed

• Highly performant

• Flexible data model

Cassandra Uses

• Web analytics

• Web Applications

• Transaction logging

• Data collection

• …

Architecture

CAP Theorem

Only two of:

1. Consistency: all nodes see the same data at the same time

2. Availability: the guarantee that every request receives a response about whether it was successful or failed

3. Partition Tolerance: the system continues to operate despite message loss or failure of part of the system

CAP Theorem

Architecture

• Ring

• Each node has a unique token, and all nodes are identical (there is no master)

• Intra-ring communication via “Gossip” protocol

• Tokens range from 0 to 2^127 - 1 (RandomPartitioner)
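
To make the ring idea concrete, here is a minimal sketch in PHP of how a row key could be mapped to a node. It is a toy model and not Cassandra's implementation: the token space is shrunk to 0..99, the token is taken from a truncated MD5 hash (the RandomPartitioner also hashes keys with MD5, but over its full token range), and the node names, tokens and function names are invented for illustration.

```php
<?php
// Toy ring, assumption: 4 made-up nodes, tokens in a 0..99 token space.
// Each node owns the tokens greater than the previous node's token
// and up to (and including) its own token.
$ring = array(10 => 'node-a', 35 => 'node-b', 60 => 'node-c', 85 => 'node-d');

/**
 * Maps a row key to a token in the toy 0..99 space.
 * Assumption: a truncated MD5 stands in for the real partitioner hash.
 */
function tokenFor($rowKey)
{
    return hexdec(substr(md5($rowKey), 0, 7)) % 100;
}

/**
 * Returns the node that owns $token.
 *
 * @param int   $token
 * @param array $ring  map of node token => node name
 * @return string
 */
function nodeForToken($token, array $ring)
{
    ksort($ring); // visit the nodes in token order
    foreach ($ring as $nodeToken => $node) {
        if ($token <= $nodeToken) {
            return $node; // first node whose token is >= the key's token
        }
    }
    return reset($ring); // past the highest token: wrap around to the first node
}

$key = '/restaurants/barcelona';
echo $key, ' -> ', nodeForToken(tokenFor($key), $ring), "\n";
```

Because every node can run the same lookup, any node can route a request to the owner of a key, which is what lets the ring work without a central coordinator.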

Partitioning

Data Modeling

Data Model

• Cluster

• Keyspace

• Column Family

• Super Column

• Composite Columns

Data Model

Data Model

Data Modeling Problems

• Neither join nor subquery support

• Limited support for aggregation

• Ordering is done per-partition

• Ordering is specified at table creation time

Data Modeling Best Practices

• Don’t think of a relational table

• Model column families around query patterns

• De-normalize and duplicate for read performance

• Storing values in column names is perfectly OK

• Leverage wide rows for ordering, grouping, and filtering
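
As a rough illustration of the last two points, the sketch below models a single wide row as a plain PHP array. The row layout (one row per store, one column per deal, column names of the form "date|dealId") and the helper columnsWithPrefix are assumptions made up for this example; ksort() only imitates the fact that Cassandra keeps the columns of a row sorted by name.

```php
<?php
// Hypothetical wide row: one row per store, one column per deal.
// The column name carries the data ("<date>|<dealId>"); the value is left empty.
$dealsByStore = array(
    '20140622|432' => '',
    '20140621|211' => '',
    '20140621|12'  => '',
);

// In-memory stand-in for Cassandra's sorted column storage.
ksort($dealsByStore, SORT_STRING);

/**
 * Returns the column names that start with $prefix, in stored order.
 *
 * @param array  $row    map of column name => value
 * @param string $prefix
 * @return array
 */
function columnsWithPrefix(array $row, $prefix)
{
    $matches = array();
    foreach (array_keys($row) as $name) {
        if (strpos($name, $prefix) === 0) {
            $matches[] = $name;
        }
    }
    return $matches;
}

// All deals of the store for 21 June 2014, already ordered by column name.
print_r(columnsWithPrefix($dealsByStore, '20140621|'));
```

Since the date comes first in the column name, filtering by day becomes a contiguous range of columns, so ordering, grouping and filtering all come from how the name is built rather than from a query language.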

Some Numbers

Some Numbers

Cassandra & PHP

Apache Thrift

Thrift is an interface definition language and binary communication protocol that is used to define and create services for numerous languages. It is used as a remote procedure call (RPC) framework and was developed at Facebook for "scalable cross-language services development".

Wikipedia

Apache Thrift

PhpCassa

• Open Source

• Uses the Thrift protocol

• Compatible with Cassandra 0.7 through 1.2

• Optional C extension for improved performance

https://github.com/thobbs/phpcassa

"require": { "thobbs/phpcassa": "v1.1.0" }

Examples

Opening connections

    // phpcassa 1.x classes are namespaced
    use phpcassa\Connection\ConnectionPool;
    use phpcassa\ColumnFamily;
    use phpcassa\SuperColumnFamily;

    $pool = new ConnectionPool('Keyspace1');

Creating a column family object

    $users = new ColumnFamily($pool, 'Standard1');
    $super = new SuperColumnFamily($pool, 'Super1');

Inserting

    $users->insert('key', array('column1' => 'value1', 'column2' => 'value2'));

Querying

    $users->get('key'); // returns an array
    $users->multiget(array('key1', 'key2')); // returns an array of arrays

Removing

    $users->remove('key1'); // removes whole row
    $users->remove('key1', 'column1'); // removes 'column1'

Case Study

Flash Deals website

• 5 Apache servers

• 32 GB of RAM

• 8 CPUs

• 6 Cassandra nodes

• 4+ million visits/month

• 17+ million pages/month

• 600 GB of data

Requirement

• The client wanted a new way to navigate the website: deal attributes

• Millions of deals (hundreds of new and expiring ones every day)

• Dozens of stores and categories

• Performance is key!

How We Solved It

• New deals arrive every day, so queries are based on date and attributes

• Leverage Cassandra wide rows to create indexes

• Use Cassandra multiget whenever possible

Deals CF

RowKey  name                price  attributes   …
211     Miyagi Sushi        29     [21,20,114]
432     Mos Eisley Cantina  19     [21,20]
12      iPhone 5 32GB       549    [7]
…       …                   …

Attributes CF

RowKey  name         keyword
21      Restaurants  restaurants
114     Japanese     japanese
20      Barcelona    barcelona
7       Technology   tech

Cities CF

RowKey  name       attributeid  …
1       Madrid     12
8       Barcelona  20
32      Amsterdam  81

Urls CF

RowKey                           attributes  city  …
/restaurants/barcelona           [21]        8
/restaurants/barcelona/japanese  [21,114]    8
/tech                            [7]         -
/restaurants                     [21]        -
…                                …           …

AttributesDeals CF

RowKey        211   432   12    …
21|20140621   true  true  -
114|20140621  true  -     -
20|20140621   true  true  -
7|20140621    -     -     true
…             …     …     …
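
The slides only show the read side of this index. As a hedged sketch of the write side, the helper below builds the AttributesDeals rows that would have to be written when a deal is saved. The function name buildIndexRows and the idea of writing the index at save time are assumptions for illustration; only the row key format ("attributeId|Ymd") and the deal ids used as column names come from the table above.

```php
<?php
/**
 * Builds the AttributesDeals index rows for one deal.
 * Hypothetical helper, not part of the original project.
 *
 * @param int    $dealId
 * @param array  $attributeIds attribute ids attached to the deal
 * @param string $date         day in Ymd format, e.g. '20140621'
 * @return array map of row key => columns (deal id => 'true')
 */
function buildIndexRows($dealId, array $attributeIds, $date)
{
    $rows = array();
    foreach ($attributeIds as $attributeId) {
        // One row per attribute and day; the deal id becomes the column name.
        $rows[$attributeId . '|' . $date] = array($dealId => 'true');
    }
    return $rows;
}

$rows = buildIndexRows(211, array(21, 20, 114), '20140621');
print_r($rows);

// With phpcassa, each entry could then be written with the insert() call
// shown in the Examples slide, for instance:
//   foreach ($rows as $rowKey => $columns) {
//       $attributesDeals->insert($rowKey, $columns);
//   }
```

Writing the same deal id into several rows is deliberate duplication: it trades extra writes for a read that is a single row lookup per attribute.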

Code (Controller)

    /**
     * List deals action
     * e.g. /restaurants/barcelona/japanese
     */
    public function dealsAction()
    {
        $path = $this->getUrlPath(); // cleaned query string

        $url = $this->manager->getUrl($path);
        $attributes = Zend_Json::decode($url['attributes']);
        $cityId = $url['city'];
        $deals = $this->manager->getDeals($attributes, $cityId);
        $this->view->assign('deals', $deals);
        // …
    }

Code (Manager)

    /**
     * Retrieves the url containing attributes and city infos
     *
     * @param string $path
     * @return array $url
     */
    public function getUrl($path)
    {
        $pool = new ConnectionPool('Keyspace');
        $urls = new ColumnFamily($pool, 'Urls');
        $url = array(); // default when the lookup fails
        try {
            $url = $urls->get($path);
        } catch (Exception $e) {
            // …
        }
        return $url;
    }

Code (Manager)

    /**
     * Retrieves the deals matching the given attributes and city
     *
     * @param array $attributes
     * @param int $cityId
     * @return array $deals
     */
    public function getDeals($attributes, $cityId)
    {
        $pool = new ConnectionPool('Keyspace');
        $dealsCF = new ColumnFamily($pool, 'Deals');
        $deals = array(); // default when the lookup fails
        if (!empty($cityId)) {
            // a city is just another attribute
            $attributes[] = $this->getAttributeIdByCity($cityId);
        }
        try {
            $dealsIds = $this->getDealsIdsByAttributes($attributes);
            $deals = $dealsCF->multiget($dealsIds);
        } catch (Exception $e) {
            // …
        }
        return $deals;
    }

Code (Manager)

    /**
     * Retrieves an array of deals ids given an array of attribute ids
     *
     * @param array $attributes
     * @return array $dealsIds
     */
    protected function getDealsIdsByAttributes($attributes)
    {
        $dealsIds = array();
        $dealsGroups = array();
        $date = date('Ymd');
        $pool = new ConnectionPool('Keyspace'); // was undefined in this method
        $attributesDeals = new ColumnFamily($pool, 'AttributesDeals');
        foreach ($attributes as $attributeId) {
            $attributeKey = "$attributeId|$date";
            // the column names of the row are the deal ids
            $dealsGroups[] = array_keys($attributesDeals->get($attributeKey));
        }
        $countGroups = count($dealsGroups);
        if ($countGroups > 1) {
            // keep only the deals that have every requested attribute
            $dealsIds = call_user_func_array('array_intersect', $dealsGroups);
        } elseif ($countGroups == 1) {
            $dealsIds = reset($dealsGroups);
        }
        return $dealsIds;
    }
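
To see the intersection at work without a running cluster, the sketch below replays the same lookup on a plain PHP array that mirrors the AttributesDeals rows from the earlier slide. The function name dealIdsFor and the in-memory $attributesDeals array are stand-ins invented for this illustration (the real method reads each row through phpcassa), and the date is passed in rather than taken from date('Ymd') so the result is reproducible.

```php
<?php
// In-memory copy of the AttributesDeals rows (row key => deal id => flag).
$attributesDeals = array(
    '21|20140621'  => array(211 => 'true', 432 => 'true'),
    '114|20140621' => array(211 => 'true'),
    '20|20140621'  => array(211 => 'true', 432 => 'true'),
    '7|20140621'   => array(12 => 'true'),
);

/**
 * Returns the ids of the deals that carry every given attribute on $date.
 * Hypothetical stand-in for getDealsIdsByAttributes().
 *
 * @param array  $attributeIds
 * @param string $date            day in Ymd format
 * @param array  $attributesDeals in-memory index rows
 * @return array
 */
function dealIdsFor(array $attributeIds, $date, array $attributesDeals)
{
    $groups = array();
    foreach ($attributeIds as $attributeId) {
        $key = $attributeId . '|' . $date;
        $columns = isset($attributesDeals[$key]) ? $attributesDeals[$key] : array();
        $groups[] = array_keys($columns); // column names are the deal ids
    }
    if (count($groups) === 0) {
        return array();
    }
    if (count($groups) === 1) {
        return $groups[0];
    }
    // array_intersect() keeps the keys of the first array; reindex for clarity.
    return array_values(call_user_func_array('array_intersect', $groups));
}

// /restaurants/barcelona/japanese => attributes 21 and 114, plus city attribute 20
print_r(dealIdsFor(array(21, 114, 20), '20140621', $attributesDeals));
```

The ids returned here are what would be passed to multiget() on the Deals column family, so the whole page is served with one index read per attribute and one batched read for the deals.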

Cassandra future (and present)

• New PHP driver wrapping the C++ driver

• Cassandra 2.0

• CQL 3.0

Resources

• www.yameveo.com

• http://planetcassandra.org

• https://github.com/thobbs/phpcassa

• http://www.hakkalabs.co/articles/cassandra-data-modeling-guide

• http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/

• http://www.slideshare.net/DataStax/cassandra-community-webinar-introduction-to-apache-cassandra-12

• http://www.geroba.com/cassandra/apache-cassandra-byteorderedpartitioner/

Questions?

yameveo@yameveo.com

WE ARE HIRING!

Thanks!

joind.in/10865 lanyrd.com/scxyhk

www.yameveo.com

@akira28 @Yameveo

http://bit.ly/andreadepirro
