Cassandra Summit 2014: Interactive OLAP Queries Using Apache Cassandra and Spark

#CassandraSummit

OLAP WITH SPARK AND CASSANDRA

EVAN CHAN
SEPT 2014

WHO AM I?
Principal Engineer, Socrata, Inc. (http://www.socrata.com/)
@evanfchan
http://github.com/velvia
Creator of Spark Job Server (http://github.com/spark-jobserver/spark-jobserver)

WE BUILD SOFTWARE TO MAKE DATA USEFUL TO MORE PEOPLE.

data.edmonton.ca finances.worldbank.org data.cityofchicago.org
data.seattle.gov data.oregon.gov data.wa.gov
www.metrochicagodata.org data.cityofboston.gov
info.samhsa.gov explore.data.gov data.cms.gov data.ok.gov
data.nola.gov data.illinois.gov data.colorado.gov
data.austintexas.gov data.undp.org www.opendatanyc.com
data.mo.gov data.nfpa.org data.raleighnc.gov dati.lombardia.it
data.montgomerycountymd.gov data.cityofnewyork.us
data.acgov.org data.baltimorecity.gov data.energystar.gov
data.somervillema.gov data.maryland.gov data.taxpayer.net
bronx.lehman.cuny.edu data.hawaii.gov data.sfgov.org

WE ARE SWIMMING IN DATA!

BIG DATA AT SOCRATA
Tens of thousands of datasets, each one up to 30 million rows
Customer demand for billion row datasets
Want to analyze across datasets

BIG DATA AT OOYALA
2.5 billion analytics pings a day = almost a trillion events a year.
Roll up tables - 30 million rows per day

HOW CAN WE ALLOW CUSTOMERS TO QUERY A YEAR'S WORTH OF DATA?

Flexible - complex queries included
  Sometimes you can't denormalize your data enough
Fast - interactive speeds
Near Real Time - can't make customers wait hours before querying new data

RDBMS? POSTGRES?
Start hitting latency limits at ~10 million rows
No robust and inexpensive solution for querying across shards
No robust way to scale horizontally
Postgres runs a query on a single thread unless you partition (painful!)
Complex and expensive to improve performance (e.g. rollup tables, huge expensive servers)

OLAP CUBES?
Materialize summary for every possible combination
Too complicated and brittle
Takes forever to compute - not for real time
Explodes storage and memory
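To see why materializing every combination explodes, here is a rough back-of-the-envelope sketch. The object and function names are hypothetical and the cardinalities are made up; the only assumption is the standard counting argument that n dimensions give 2^n possible group-by combinations.

```scala
object CubeSize {
  // Each dimension is either grouped on or rolled up, so n dimensions
  // give 2^n distinct GROUP BY combinations to precompute.
  def groupByCombinations(dimensions: Int): BigInt =
    BigInt(2).pow(dimensions)

  // Upper bound on cells in a fully materialized cube: every dimension
  // contributes its distinct values plus one extra "ALL" (rolled-up) slot.
  def maxCells(cardinalities: Seq[Int]): BigInt =
    cardinalities.map(c => BigInt(c) + 1).product
}
```

Calling `CubeSize.groupByCombinations(20)` already gives over a million summaries, before counting the cells inside each one.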

When in doubt, use brute force.
- Ken Thompson

CASSANDRA
Horizontally scalable
Very flexible data modelling (lists, sets, custom data types)
Easy to operate
No fear of number of rows or documents
Best of breed storage technology, huge community
BUT: Simple queries only

APACHE SPARK
Horizontally scalable, in-memory queries
Functional Scala transforms - map, filter, groupBy, sort etc.
SQL, machine learning, streaming, graph, R, many more plugins all on ONE platform - feed your SQL results to a logistic regression, easy!
THE Hottest big data platform, huge community, leaving Hadoop in the dust
Developers love it

SPARK PROVIDES THE MISSING FAST, DEEP ANALYTICS PIECE OF CASSANDRA!

INTEGRATING SPARK AND CASSANDRA
Scala solutions:
  Datastax integration (CQL-based): https://github.com/datastax/spark-cassandra-connector
  Calliope: http://tuplejump.github.io/calliope/

A bit more work:

  Use traditional Cassandra client with RDDs
  Use an existing InputFormat, like CqlPagedInputFormat

The only reason to go here is probably that you are not on a CQL version of Cassandra, or you're using Shark/Hive.

A SPARK AND CASSANDRA OLAP ARCHITECTURE

SEPARATE STORAGE AND QUERY LAYERS
Combine best of breed storage and query platforms
Take full advantage of evolution of each
Storage handles replication for availability
Query can replicate data for scaling read concurrency - independent!

SCALE NODES, NOT DEVELOPER TIME!!

KEEPING IT SIMPLE
Maximize row scan speed
Columnar representation for efficiency
Compressed bitmap indexes for fast algebra
Functional transforms for easy memoization, testing, concurrency, composition
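As an illustration of "bitmap indexes for fast algebra", the sketch below builds one bitmap per distinct column value and answers AND/OR filters with set operations. It uses Scala's plain (uncompressed) `BitSet` as a stand-in; a real engine would likely use a compressed bitmap format. The object name and the sample columns are hypothetical and not taken from the talk.

```scala
import scala.collection.immutable.BitSet

object BitmapIndex {
  // One bitmap per distinct value: bit i is set when row i holds that value.
  def build[A](column: Seq[A]): Map[A, BitSet] =
    column.zipWithIndex
      .groupBy(_._1)
      .map { case (value, pairs) => value -> BitSet(pairs.map(_._2): _*) }

  // Hypothetical sample columns, four rows each.
  val crimeType = Seq("Burglary", "Theft", "Burglary", "Theft")
  val district  = Seq("North", "North", "South", "South")

  val byType     = build(crimeType)
  val byDistrict = build(district)

  // WHERE type = 'Burglary' AND district = 'South' becomes a bitmap intersection.
  val burglaryInSouth: BitSet = byType("Burglary") & byDistrict("South")

  // WHERE type = 'Theft' OR district = 'North' becomes a bitmap union.
  val theftOrNorth: BitSet = byType("Theft") | byDistrict("North")
}
```

The resulting bitmaps hold matching row positions, so only those rows need to be touched when the value columns are scanned.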

SPARK AS CASSANDRA'S CACHE

EVEN BETTER: TACHYON OFF-HEAP CACHING

INITIAL ATTEMPTS
import org.apache.spark.SparkContext._  // pair-RDD functions (needed before Spark 1.3)

val rows = Seq(
  Seq("Burglary", "19xx Hurston", 10),
  Seq("Theft", "55xx Floatilla Ave", 5)
)

// Group rows by the first column, then sum the third column per group.
// Assumes an existing SparkContext `sc`.
sc.parallelize(rows)
  .map { values => (values(0), values) }
  .groupByKey
  .mapValues { group => group.map(_(2).asInstanceOf[Int]).sum }

No existing generic query engine for Spark when we started (Shark was in infancy, had no indexes, etc.), so we built our own
For every row, need to extract out needed columns
Ability to select arbitrary columns means using Seq[Any], no type safety
Boxing makes integer aggregation very expensive and memory inefficient
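The type-safety problem with Seq[Any] rows can be reproduced without Spark. The following is a minimal sketch on plain Scala collections; the object name, the helper `sumBy`, and the third sample row are hypothetical additions for illustration.

```scala
import scala.util.Try

object UntypedRows {
  // Every cell is an Any, so Ints are stored boxed and must be cast back.
  val rows: Seq[Seq[Any]] = Seq(
    Seq("Burglary", "19xx Hurston", 10),
    Seq("Theft", "55xx Floatilla Ave", 5),
    Seq("Burglary", "12xx Example St", 7) // hypothetical extra row
  )

  // Sum the values in `valueCol`, grouped by the values in `keyCol`.
  // The compiler cannot check that `valueCol` really holds Ints.
  def sumBy(keyCol: Int, valueCol: Int): Map[Any, Int] =
    rows
      .groupBy(row => row(keyCol))
      .map { case (key, group) =>
        key -> group.map(row => row(valueCol).asInstanceOf[Int]).sum
      }

  // Correct columns: works.
  val ok: Map[Any, Int] = sumBy(0, 2)

  // Wrong column (addresses are Strings): compiles, but fails at runtime.
  val wrongColumn: Try[Map[Any, Int]] = Try(sumBy(0, 1))
}
```

The mistake in `wrongColumn` only surfaces as a runtime cast failure, which is the kind of error a typed columnar representation is meant to rule out.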

COLUMNAR STORAGE AND QUERYING

The traditional row-based data storage approach is dead
- Michael Stonebraker

TRADITIONAL ROW-BASED STORAGE
Same layout in memory and on disk:

Name     Age
Barak    46
Hillary  66

Each row is stored contiguously. All columns in row 2 come after row 1.

COLUMNAR STORAGE (MEMORY)
Name column

Row   0  1
Code  0  1

Dictionary: {0: "Barak", 1: "Hillary"}

Age column

Row  0   1
Age  46  66
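The in-memory layout above can be sketched in a few lines of Scala: the string column becomes an array of integer codes plus a dictionary, and the integer column is a primitive array. The object, the `Encoded` case class, and the function names are hypothetical; this is an illustration of dictionary encoding in general, not the engine described in the talk.

```scala
object DictionaryColumn {
  // A dictionary-encoded string column: distinct values plus one code per row.
  final case class Encoded(dictionary: Vector[String], codes: Array[Int]) {
    def decode(row: Int): String = dictionary(codes(row))
  }

  def encode(values: Seq[String]): Encoded = {
    val dictionary = values.distinct.toVector      // first-seen order
    val codeOf     = dictionary.zipWithIndex.toMap // value -> code
    Encoded(dictionary, values.map(codeOf).toArray)
  }

  // The two columns from the slide.
  val names: Encoded   = encode(Seq("Barak", "Hillary"))
  val ages: Array[Int] = Array(46, 66) // primitive ints, no boxing
}
```

With many repeated strings the codes array stays small while each distinct string is stored only once.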

COLUMNAR STORAGE (CASSANDRA)
Review: each physical row in Cassandra (i.e. a partition, identified by its partition key) stores its columns together on disk.

Schema CF

Rowkey  Type
Name    StringDict
Age     Int

Data CF

Rowkey  0   1
Name    0   1
Age     46  66
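One way to read the Data CF layout is as a pivot: each logical column becomes one physical row keyed by the column name, whose cells hold that column's values. The sketch below does this pivot on in-memory data. The object name, the function `pivot`, and the `chunkSize` parameter (how many values go into one cell) are assumptions for illustration; with `chunkSize = 1` it reproduces the table above.

```scala
object ColumnChunks {
  // Pivot row-oriented data into: column name -> sequence of cells,
  // where each cell batches up to `chunkSize` consecutive values.
  def pivot(
      columnNames: Seq[String],
      rows: Seq[Seq[Any]],
      chunkSize: Int
  ): Map[String, Vector[Vector[Any]]] =
    columnNames.zipWithIndex.map { case (name, idx) =>
      name -> rows.map(row => row(idx)).grouped(chunkSize).map(_.toVector).toVector
    }.toMap

  // Rows from the slide, with Name already dictionary-encoded.
  val encodedRows: Seq[Seq[Any]] = Seq(Seq(0, 46), Seq(1, 66))

  val oneValuePerCell = pivot(Seq("Name", "Age"), encodedRows, chunkSize = 1)
  val batchedCells    = pivot(Seq("Name", "Age"), encodedRows, chunkSize = 2)
}
```

A query that only needs Age then reads a single physical row and never touches the Name data.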

ADVANTAGES OF COLUMNAR STORAGE
Compression
  Dictionary compression - HUGE savings for low-cardinality string columns
  RLE
Reduce I/O
  Only columns needed for query are loaded from disk
Can keep strong types in memory, avoid boxing
Batch multiple rows in one cell for efficiency
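For readers unfamiliar with RLE (run-length encoding), here is a minimal sketch of the idea: consecutive repeats of a value are stored once with a count. The object name and sample values are hypothetical; production columnar formats use more compact binary layouts.

```scala
object RunLength {
  // Collapse consecutive equal values into (value, runLength) pairs.
  def encode[A](values: Seq[A]): Vector[(A, Int)] =
    values.foldLeft(Vector.empty[(A, Int)]) { (runs, v) =>
      runs.lastOption match {
        case Some((prev, n)) if prev == v => runs.init :+ (prev -> (n + 1))
        case _                            => runs :+ (v -> 1)
      }
    }

  // Expand (value, runLength) pairs back into the original sequence.
  def decode[A](runs: Seq[(A, Int)]): Vector[A] =
    runs.flatMap { case (v, n) => Vector.fill(n)(v) }.toVector
}
```

RLE pays off most on sorted or low-cardinality columns, where long runs are common.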

ADVANTAGES OF COLUMNAR QUERYING
Cache locality for aggregating column of data
Take advantage of CPU/GPU vector instructions for ints / doubles
Avoid row-ifying until last possible moment
Easy to derive computed columns
Use vector data / linear math libraries
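A small sketch of what "avoid row-ifying until the last possible moment" can look like: a group-by sum that runs as a tight loop over two primitive arrays (dictionary codes and values) and only maps codes back to strings at the end. All names and sample data here are hypothetical, and this single-threaded loop is not the engine from the talk.

```scala
object ColumnarGroupBy {
  // Sum `values` grouped by dictionary code. Both inputs are primitive
  // arrays of equal length, so the loop does no boxing or casting.
  def sumByCode(codes: Array[Int], values: Array[Int], dictionarySize: Int): Array[Long] = {
    require(codes.length == values.length, "columns must have the same row count")
    val sums = new Array[Long](dictionarySize)
    var i = 0
    while (i < codes.length) {
      sums(codes(i)) += values(i)
      i += 1
    }
    sums
  }

  // Hypothetical columns: a dictionary-encoded crime type and a count.
  val dictionary = Vector("Burglary", "Theft")
  val typeCodes  = Array(0, 1, 0, 1, 0)
  val counts     = Array(10, 5, 7, 3, 1)

  // Row-ify only at the end: attach the string keys to the finished sums.
  val result: Map[String, Long] =
    dictionary.zip(sumByCode(typeCodes, counts, dictionary.size).toVector).toMap
}
```

Because the inner loop reads two contiguous arrays sequentially, it benefits from the cache locality mentioned above.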

COLUMNAR QUERY ENGINE VS ROW-BASED IN SCALA

Custom RDD of column-oriented blocks of data
Uses ~10x less heap
10-100x faster for group by's on a single node
Scan speed in excess of 150M rows/sec/core for integer aggregations

SO, GREAT, OLAP WITH CASSANDRA AND SPARK. NOW WHAT?

DATASTAX: CASSANDRA SPARK INTEGRATION
Datastax Enterprise now comes with HA Spark
  HA master, that is.
spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector

SPARK SQL
Appeared with Spark 1.0
In-memory columnar store
Can read from Parquet and JSON now; direct Cassandra integration coming
Querying is not column-based (yet)
No indexes
Write custom functions in Scala ... take that, Hive UDFs!!
Integrates well with MLBase, Scala/Java/Python

CACHING A SQL TABLE FROM CASSANDRA
import com.datastax.spark.connector._  // adds sc.cassandraTable
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD      // implicit RDD -> SchemaRDD conversion (Spark 1.0-1.2)

sc.cassandraTable[GDeltRow]("gdelt", "1979to2009")
  .registerAsTable("gdelt")
sqlContext.cacheTable("gdelt")
sqlContext.sql("SELECT Actor2Code, Actor2Name, Actor2CountryCode, AvgTone from gdelt ORDER BY AvgTone DESC LIMIT

Remember Spark is lazy; nothing is executed until the collect()
In Spark 1.1+: use registerTempTable instead of registerAsTable

SOME PERFORMANCE NUMBERS
GDELT dataset, 117 million rows, 57 columns, ~50GB
Spark 1.0.2, AWS 8 x c3.xlarge, cached in memory

Query                                                       Avg time (sec)
SELECT count(*) FROM gdelt WHERE Actor2CountryCode = 'CHN'  0.49
SELECT 4 columns Top K                                      1.51
SELECT Top countries by Avg Tone (Group By)                 2.69

IMPORTANT - CACHING
By default, queries will read data from source - Cassandra - every time
Spark RDD Caching - much faster, but big waste of memory (row oriented)
Spark SQL table caching - fastest, memory efficient

WORK STILL NEEDED
Indexes
Columnar querying for fast aggregation
Tachyon support for Cassandra/CQL
Efficient reading from columnar storage formats

LESSONS
Extremely fast distributed querying for these use cases:
  Data doesn't change much (and only bulk changes)
  Analytical queries for subset of columns
  Focused on numerical aggregations
  Small numbers of group bys
For fast query performance, cache your data using Spark SQL
Concurrent queries are a frontier with Spark. Use additional Spark contexts.

THANK YOU!

EXTRA SLIDES

EXAMPLE CUSTOM INTEGRATION USING ASTYANAX

val cassRDD = sc.parallelize(rowkeys)
  .flatMap { rowkey => columnFamily.get(rowkey).execute().asScala }

SOME COLUMNAR ALTERNATIVES
MonetDB and Infobright - true columnar stores (storage + querying)
Vertica and C-Store
Google BigQuery - columnar cloud database, Dremel based
Amazon Redshift
