Leveraging Big Data and Real-Time Analytics at Cxense Simon Lia-Jonassen 08/04/15

About Cxense

Our mission is to help companies understand their audience and build great online user experiences.

– Stay longer on the site.
– Sign up for subscriptions.
– Find interesting articles.
– Buy recommended products.

About Cxense

Founded in 2010, ~100 employees in 2015.

Offices
– Melbourne, Tokyo, Singapore, Stockholm, Copenhagen, Oslo*, London, Buenos Aires, Rio de Janeiro, Miami, New York, San Francisco.

Some of our customers

Our solutions

How does it work!?

Event (example)

Content Profile (example)

Challenges

Data Volume and Traffic
– 5K+ websites
– 50M+ pages (last month)
– 500M+ users (last month)
– 10B+ events/month (20K events/sec peak)

Heterogeneity and Reliability
– Hundreds of mobile and desktop platforms, browsers, internet providers, etc.
– Multiple devices per user, cross-domain tracking (3rd party cookie is dying).
– Web pages (articles, image/video galleries, chats, search/front pages) and human language.
– The Internet is Broken™

Constraints and Requirements
– Online and real-time processing
  • Show and analyze what is happening right now.
– High and sustainable performance
  • Throughput: peak load of 10K+ requests/sec.
  • Latency: 100 ms latency constraint for ads and recommendations.
– Fault-tolerance and durability

Architecture and Data Flow (simplified)

Data Flow and Feeding

Communication
– HTTP with JSON payload.
– Durable and idempotent.

Local storage
– Atomically append to file.
– Use a new file each hour.
– Use a separate directory for each partition.
– Tail files and/or directories.

Metadata
– Keeps the state.
– Can go backwards and re-feed when needed.

System
– Semi-automatic configuration via Upstart and Crontab.
– Monitoring via Graphite and log files.
– Automatic alerting and centralized log search.
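
A minimal sketch of the local storage scheme from this slide: one directory per partition, one file per hour, append-only writes, and a reader that resumes from a saved offset so it can go back and re-feed. The directory layout, the file naming and the names event_path, append_event and read_from are assumptions made for illustration, not the actual Cxense implementation.

```python
import json
import os
from datetime import datetime, timezone


def event_path(root, partition, ts_ms):
    """Hypothetical layout: <root>/partition-NNN/<YYYYMMDD-HH>.log (UTC hour)."""
    hour = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).strftime("%Y%m%d-%H")
    return os.path.join(root, f"partition-{partition:03d}", f"{hour}.log")


def append_event(root, partition, event):
    """Append one JSON event as a single line to the file of its hour."""
    path = event_path(root, partition, event["time"])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    line = (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")
    # O_APPEND plus one write() call per event keeps each line in one piece
    # on POSIX file systems (an assumption about the deployment platform).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, line)
    finally:
        os.close(fd)
    return path


def read_from(path, offset):
    """Return the complete events stored after `offset` and the new offset.

    The offset plays the role of the metadata state: storing an older
    offset and calling read_from again re-feeds the same events.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    end = data.rfind(b"\n") + 1  # ignore a trailing, partially written line
    events = [json.loads(line) for line in data[:end].splitlines()]
    return events, offset + end
```

Because the consumer only keeps an offset per file, re-feeding is a matter of resetting that offset, which is why the receiving side has to be idempotent.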

The Cube

What is The Cube?
– Partitioned column store database.
– Using efficient string handling and integer compression.
– Provides fast filtering and aggregation over 50B data points.
– Guarantees low update latency (100ms).
– Exists in multiple variants:
  • Disk or memory based.
  • Partitioned by site, by user or by both.
– Low-level API.

Example:

time           user    rnd     siteid  url                                browser
1409425329634  “4szi”  “xzst”  “9978”  “cxnews.com”                       “Chrome”
1409425329634  “zthp”  “fd0z”  “9978”  “cxnews.com/seahawks-win-again…”   “Firefox”
1409425329635  “4szi”  “tzdt”  “9978”  “cxnews.com/tesla-model-3-will-…”  “Chrome”
1409425329640  “4szi”  “aext”  “9978”  “cxnews.com/elon-musk-is-awes…”    “Chrome”
1409425329640  “zx5t”  “dxrf”  “9978”  “cxnews.com/tesla-model-3-will-…”  “Safari”

© imdb.com

The Cube – Integer Columns

Frame of Reference Compression
– Compress the numbers in groups of 64.
– If the sequence is increasing, use the first number as the reference and compute the differences between each two consecutive numbers (deltas).
– Find the maximum number of bits (width) needed to represent the largest delta and compress the deltas using a fixed bit width.
– For non-increasing sequences, use the smallest number as the reference and the differences between the numbers and the reference as deltas.
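
The steps above can be sketched as follows. This is an illustration under assumptions rather than The Cube's actual code: the tuple layout, the function names and the use of a single Python integer as the packed bit buffer are all made up for the example, and "increasing" is read as non-decreasing so that every delta is non-negative.

```python
GROUP_SIZE = 64  # the slide compresses numbers in groups of 64


def encode_group(values):
    """Encode a non-empty group as (mode, reference, width, count, packed)."""
    if all(a <= b for a, b in zip(values, values[1:])):
        # Ordered group: reference is the first number, deltas are the
        # differences between consecutive numbers.
        mode, reference = "delta", values[0]
        deltas = [b - a for a, b in zip(values, values[1:])]
    else:
        # Unordered group: reference is the smallest number, deltas are
        # the distances from that reference.
        mode, reference = "offset", min(values)
        deltas = [v - reference for v in values]
    width = max(deltas, default=0).bit_length()  # bits for the largest delta
    packed = 0
    for i, delta in enumerate(deltas):
        packed |= delta << (i * width)  # fixed-width slot number i
    return mode, reference, width, len(deltas), packed


def decode_group(group):
    mode, reference, width, count, packed = group
    mask = (1 << width) - 1
    deltas = [(packed >> (i * width)) & mask for i in range(count)]
    if mode == "offset":
        return [reference + d for d in deltas]
    values = [reference]
    for d in deltas:
        values.append(values[-1] + d)
    return values


def compress(values):
    return [encode_group(values[i:i + GROUP_SIZE])
            for i in range(0, len(values), GROUP_SIZE)]


def decompress(groups):
    return [v for group in groups for v in decode_group(group)]
```

For the five timestamps in the example table the deltas are 0, 1, 5 and 0, so each delta fits in 3 bits next to a single full-size reference value.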

The Cube – String Columns

– A global lexicon maps all strings to numbers and back.
– For each column, we map global keys to a smaller set of numbers and back.
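
A small sketch of this two-level mapping, assuming plain dictionaries and lists in place of whatever structures The Cube really uses. The class names Lexicon and StringColumn and their methods are hypothetical.

```python
class Lexicon:
    """Global mapping: string <-> global key, shared by all columns."""

    def __init__(self):
        self._keys = {}
        self._strings = []

    def key(self, string):
        if string not in self._keys:
            self._keys[string] = len(self._strings)
            self._strings.append(string)
        return self._keys[string]

    def string(self, key):
        return self._strings[key]


class StringColumn:
    """Per-column mapping: global key <-> small local code."""

    def __init__(self, lexicon):
        self.lexicon = lexicon
        self._local = {}   # global key -> local code
        self._global = []  # local code -> global key
        self.codes = []    # one local code per row

    def append(self, string):
        global_key = self.lexicon.key(string)
        if global_key not in self._local:
            self._local[global_key] = len(self._global)
            self._global.append(global_key)
        self.codes.append(self._local[global_key])

    def get(self, row):
        return self.lexicon.string(self._global[self.codes[row]])
```

A column with few distinct values, such as browser, ends up with small local codes no matter how large the global lexicon grows, so the code column needs only a few bits per row.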

The Cube – Filtering

Filter
– Keep a bit-filter over a particular range of rows as a state.

Filtering
– By number or range: pass through a column and update the filter. Use binary search for ordered columns such as time, inverted index for user id.
– By key: map the key to a number and filter by the number.
– By set of keys: map the keys to a bit-set and filter using the bit-set.
– By pattern: filter by the set of keys matching the pattern.

Logical operations
– AND, OR, NOT: use unary negation, binary intersection/join and a stack of filters.

Advanced operations
– Use aggregation output as filtering input (e.g., top-list, explosion, histogram, etc.).
– Join between different cubes on one or multiple dimensions.
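
A rough illustration of bit-filters combined through a stack, assuming a Python integer as the bit-set (bit i set means row i passes). The function names and the postfix program format are invented for this sketch and are not The Cube's API.

```python
from bisect import bisect_left, bisect_right


def filter_equal(column, value):
    """Pass through a column and set the bit of every matching row."""
    bits = 0
    for row, v in enumerate(column):
        if v == value:
            bits |= 1 << row
    return bits


def filter_sorted_range(column, low, high):
    """Rows with low <= value <= high in an ordered column, via binary search."""
    start = bisect_left(column, low)
    stop = bisect_right(column, high)
    return ((1 << (stop - start)) - 1) << start


def evaluate(program, n_rows):
    """Evaluate a postfix program of filters and "AND" / "OR" / "NOT"."""
    all_rows = (1 << n_rows) - 1
    stack = []
    for op in program:
        if op == "AND":
            b, a = stack.pop(), stack.pop()
            stack.append(a & b)
        elif op == "OR":
            b, a = stack.pop(), stack.pop()
            stack.append(a | b)
        elif op == "NOT":
            stack.append(~stack.pop() & all_rows)
        else:
            stack.append(op)  # an already computed bit-filter
    return stack.pop()


def rows(bits):
    """List the row numbers whose bit is set."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]
```

With the five example rows, filtering browser by "Chrome" and time by a range and then evaluating `[chrome, recent, "AND"]` leaves only the rows that satisfy both conditions.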

The Cube – Aggregation

Operations
– Count: count the number of bits in the filter.
– Sum: sum the numbers where the filter bit is set.
– Cardinality: count the number of distinct keys/numbers.
– CardinalityEstimator: create a HyperLogLog cardinality estimator.
– Frequency: create a map of keys/numbers with the associated count.
– TopList: create a frequency map with only the k most popular keys/numbers.
– SumBy: create a map of keys/numbers with the associated sum.
– CardinalityMap: create a map of keys/numbers with the associated cardinality.
– FrequencyDistribution: create a histogram over frequencies.
– CardinalityDistribution: create a histogram over cardinalities.
– SumByDistribution: create a histogram over sums.
– NumericalStatistics: compute distribution statistics for numbers (min, max, percentiles).
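
A few of these operations, sketched over a bit-filter held in a Python integer (bit i set means row i is selected). The function names mirror the operation names on the slide but are otherwise assumptions, and exact sets and counters stand in for structures such as HyperLogLog.

```python
from collections import Counter


def selected(bits, column):
    """Values of the rows whose filter bit is set."""
    return [v for row, v in enumerate(column) if (bits >> row) & 1]


def count(bits):
    return bin(bits).count("1")


def total(bits, numbers):
    """Sum: add up the numbers where the filter bit is set."""
    return sum(selected(bits, numbers))


def cardinality(bits, column):
    return len(set(selected(bits, column)))


def frequency(bits, column):
    return Counter(selected(bits, column))


def top_list(bits, column, k):
    return frequency(bits, column).most_common(k)


def sum_by(bits, keys, numbers):
    """SumBy: map each selected key to the sum of its numbers."""
    sums = {}
    for key, number in zip(selected(bits, keys), selected(bits, numbers)):
        sums[key] = sums.get(key, 0) + number
    return sums
```

Every operation takes the filter as input, so the same aggregation code serves any combination of filters.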

The Cube – Updates

Partitioning
– Most of the data structures are partitioned into chunks of data in order to improve memory allocation, materialization, skipping, compression and locking.

Static and dynamic parts
– Each data column, lexicon or mapping consists of a static and a dynamic part.
– The static part is ordered: can use binary search and Minimal Perfect Hashing.
– The dynamic part is read-write: has to be searched exhaustively, but this is improved using Wavelet Trees.

Locking
– Distinct Read and Read-Write Locks with different granularity/scope.
– The updates are mostly appends, but some of the columns might be updated later (e.g., active time, exit query, etc.).

Maintenance
– Periodically flush the dynamic part into the static part.
– Remove the old data, delete unused strings, optimize the mapping.
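
The static/dynamic split can be illustrated with a toy set of values. This sketch assumes a sorted list for the static part and a plain list for the dynamic part. It leaves out Minimal Perfect Hashing, Wavelet Trees and locking entirely, and the class name TwoPartSet is made up.

```python
from bisect import bisect_left


class TwoPartSet:
    def __init__(self):
        self.static = []   # ordered, only rewritten by flush()
        self.dynamic = []  # unordered, receives all new values

    def __contains__(self, value):
        # Static part: binary search over the ordered values.
        i = bisect_left(self.static, value)
        if i < len(self.static) and self.static[i] == value:
            return True
        # Dynamic part: exhaustive scan.
        return value in self.dynamic

    def add(self, value):
        if value not in self:
            self.dynamic.append(value)

    def flush(self):
        """Maintenance: merge the dynamic part into the static part."""
        self.static = sorted(self.static + self.dynamic)
        self.dynamic = []
```

Lookups stay cheap as long as the dynamic part is small, which is the reason for flushing it periodically.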

The Cube – Advanced Data Types

Keyword vectors
– Represent user and document profiles.
– Each contains a document id, a version and a set of group-item pairs with a weight.
– Stored in a separate, highly partitioned set of containers.
– Each container keeps multiple groups.
– Each group contains document ids, items and weights as columns.
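
One possible reading of this layout, as a sketch only: a container holds several groups, and each group stores document ids, items and weights as three parallel columns. The class name, the method names and the separate versions map are assumptions, and replacing an older version of a vector is not handled.

```python
from collections import defaultdict


class KeywordVectorContainer:
    def __init__(self):
        self.versions = {}  # document id -> version of its stored vector
        # group name -> three parallel columns
        self.groups = defaultdict(
            lambda: {"doc_id": [], "item": [], "weight": []})

    def put(self, doc_id, version, pairs):
        """Store a vector given as (group, item, weight) triples."""
        self.versions[doc_id] = version
        for group, item, weight in pairs:
            columns = self.groups[group]
            columns["doc_id"].append(doc_id)
            columns["item"].append(item)
            columns["weight"].append(weight)

    def vector(self, doc_id):
        """Reassemble the (group, item, weight) triples of one document."""
        result = []
        for group, columns in self.groups.items():
            rows = zip(columns["doc_id"], columns["item"], columns["weight"])
            for d, item, weight in rows:
                if d == doc_id:
                    result.append((group, item, weight))
        return result
```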

The Cube – Advanced Data Types

Structured data
– Can represent any simple JSON object (document).
– Node types: Null, Object, Array, Integer, Float, String, Boolean.
– Stored in a separate container, separate columns for each node type.
– Each document is decomposed into a list of paths and nodes.
– Each node is added to the corresponding column.
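
The decomposition into paths and nodes might look roughly like this. The tuple-based path representation, the choice to store the child count for Object and Array nodes, and the names decompose and store are assumptions made for the sketch.

```python
import json


def decompose(node, path=()):
    """Yield (path, node_type, value) for every node of a parsed JSON value."""
    if node is None:
        yield path, "Null", None
    elif isinstance(node, bool):  # test bool first: bool is a subclass of int
        yield path, "Boolean", node
    elif isinstance(node, int):
        yield path, "Integer", node
    elif isinstance(node, float):
        yield path, "Float", node
    elif isinstance(node, str):
        yield path, "String", node
    elif isinstance(node, list):
        yield path, "Array", len(node)
        for index, child in enumerate(node):
            yield from decompose(child, path + (index,))
    elif isinstance(node, dict):
        yield path, "Object", len(node)
        for key, child in node.items():
            yield from decompose(child, path + (key,))
    else:
        raise TypeError(f"not a JSON value: {node!r}")


def store(doc_id, document, columns):
    """Add each node to the column of its node type."""
    for path, node_type, value in decompose(document):
        columns.setdefault(node_type, []).append((doc_id, path, value))


def parse_and_store(doc_id, text, columns):
    store(doc_id, json.loads(text), columns)
```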

Analytics API and UI

Analytics API
– RESTful API: client-server, HTTP requests and response codes, stateless, cacheable, etc.
– API resource paths, JSON in, JSON out.
– Most of the APIs require authentication.
– Simple integration via cx.py, Java/JavaScript/C#/Python/Perl/PHP or HTTP calls directly.

Traffic API
– A rich set of high-level APIs.
– Powerful ad-hoc syntax: types, groups, items, filters, fields, etc.
– See the demo!

Analytics UI
– HTML and JavaScript.
– Is built on top of the Analytics API.
– Has multiple fixed, functional views which can be combined with arbitrary filters.
– Premium users have a workspace area for dynamic, configurable widgets.

Demo Session

Thank you! Questions?

Credits: Erik Gorset & Oslo Dev Team

…btw, we are hiring!

Connect with Cxense

www.cxense.com
https://twitter.com/cxense
www.facebook.com/cxense
www.linkedin.com/company/cxense

© http://www.perspectivaconica.com/