TRANSCRIPT

Page 2

Hewlett Packard Enterprise confidential information. This is a rolling (up to three year) roadmap and is subject to change without notice.

This Roadmap contains Hewlett Packard Enterprise Confidential Information. If you have a valid Confidential Disclosure Agreement with Hewlett Packard Enterprise, disclosure of the Roadmap is subject to that CDA. If not, it is subject to the following terms: for a period of three years after the date of disclosure, you may use the Roadmap solely for the purpose of evaluating purchase decisions from HPE and use a reasonable standard of care to prevent disclosures. You will not disclose the contents of the Roadmap to any third party unless it becomes publicly known, is rightfully received by you from a third party without duty of confidentiality, or is disclosed with Hewlett Packard Enterprise's prior written approval.

Page 3

Please give me your feedback

– Use the mobile app to complete a session survey:
  1. Access “My schedule”
  2. Click on the session detail page
  3. Scroll down to “Rate & review”

– If the session is not on your schedule, just find it via the Discover app’s “Session Schedule” menu, click on this session, and scroll down to “Rate & Review”.

– If you have not downloaded our event app, please go to your phone’s app store and search for “Discover 2016 Las Vegas”.

– Thank you for providing your feedback, which helps us enhance content for future events.

Session ID: Bxxxxx
Speakers: Mark Fay, Natalia Stavisky

Page 4

Effectively managing & monitoring streaming data loads using Kafka and the Vertica Management Console

Mark Fay
Natalia Stavisky

Page 5

Vertica & Kafka Integration

In a world of just-in-time inventory and on-demand services, the ability to quickly load and analyze tremendous amounts of data is more important than ever before. Last year, HPE Vertica addressed this growing need by integrating with Apache Kafka to offer scalable, real-time loading from Kafka sources. Today Vertica continues to leverage these strengths by adding flexibility, monitoring, and the ability to relay data back to Kafka. With the upcoming Frontloader release, Vertica has created a data ecosystem capable of supporting even the most demanding needs.


Page 6

Agenda

1. Kafka Background

2. Vertica & Kafka Integration

3. Filtering & Parsing Enhancements

4. Closing the Loop: Vertica to Kafka Production

5. Scheduler CLI & Schema Enhancements

6. Monitoring Data Load with MC


Page 7

Kafka Background


Page 8

Apache Kafka Overview

A scalable, distributed message bus
– Apache project originating from LinkedIn
– Rich ecosystem of libraries and tools
– Highly optimized for low-latency streaming

Solves the data integration problem
– Producers decoupled from consumers
– O(N) instead of O(N²) data pipelines
– Throughput scalable independently of source & destination

[Diagram: producers A–C each wired directly to consumers X–Z (point-to-point pipelines), versus the same producers and consumers connected through Kafka]

Page 9

Apache Kafka Architecture

[Diagram: brokers A through N, each hosting a partition (Partition 0, Partition 1, … Partition N) holding a growing sequence of numbered offsets]

– A producer writes to a topic
– A consumer reads offsets from partitions
– Key concepts: topics, brokers, partitions, offsets

Page 10

Recap: Vertica & Kafka in 7.2 Excavator


Page 11

Streaming Load Architecture

− Vertica schedules loads to continuously consume from any source via Kafka
− JSON, Avro, or custom data formats
− CLI driven
− In-database monitoring

Page 12

Breaking Things Down

Load Scheduler
– Implements continuous, exactly-once streaming
– Dynamically prioritizes resources to load from many topics

Microbatch Commands
– Loads a finite chunk of data
– Updates stream progress

Kafka UDx Plugin
– Pulls data from Kafka
– Converts Kafka messages to Vertica tuples

Page 13

Kafka UDx Plugin
Extending Vertica's parallel load operators to load from Kafka

[Pipeline diagram: Source emits raw bytes → Filter emits transformed bytes → Parse emits Vertica tuples → Store writes (transformed) tuples]

─ Vertica's execution is modeled as a series of data transformations pipelined through operators for processing
─ The user-defined extension (UDx) framework enables custom logic within this pipeline
─ The UDx writer worries about domain logic; Vertica worries about resource management, parallelism, node communication, fault tolerance...

Source: acquire bytes (files, HDFS, Kafka)
Filter: transform bytes (decryption, decompression)
Parse: convert bytes to tuples (JSON, Avro)
Store: write tuples to projections (WOS, ROS)
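To make the mapping concrete, here is a minimal COPY sketch (not from the original slides; the table, topic, and broker names are placeholders) showing how these operators surface as clauses:

COPY web_events                     -- Store: tuples land in web_events' projections
SOURCE KafkaSource(
    stream='events|0|-2',           -- Source: raw bytes from topic 'events', partition 0
    brokers='localhost:9092',
    stop_on_eof=true)
PARSER KafkaJSONParser()            -- Parse: JSON bytes become Vertica tuples
DIRECT;

A FILTER clause (e.g., decompression) would slot between SOURCE and PARSER.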

Page 14

Microbatch Commands
Supporting ‘exactly once’ through Vertica transactions

[Diagram: within a single microbatch (µB) transaction, the Kafka data at offset X is inserted into Vertica and the new offsets are stored in Vertica, then both are committed together. The SQL on Page 15 shows this pattern.]

Page 15

Scheduler SQL Statements

-- Bootstrap the frame: run a microbatch for each item returned
SELECT source, target_table, partition, start_offset
FROM stream_microbatch_history;

COPY target_table
SOURCE KafkaSource(
    stream='topic|0|0,topic|1|0',
    brokers='broker:port',
    duration=interval '10000 milliseconds',
    stop_on_eof=true)
PARSER KafkaJSONParser()
REJECTED DATA AS TABLE rejections_table
DIRECT NO COMMIT;

INSERT INTO stream_microbatch_history (…, *)
FROM (SELECT KafkaOffsets() OVER ()) AS microbatch_results;

COMMIT;

– The stream_microbatch_history table stores state about what to do next. Bootstrap with a SELECT query.
– KafkaSource instructs Vertica nodes to load in parallel from Kafka for a period of time, starting at the specified <topic|partition|offset>s.
– KafkaJSONParser converts the Kafka JSON messages emitted by the source into Vertica tuples for storage.
– KafkaOffsets returns the ending offset for each <topic|partition|offset> read by the source. The next frame will start there.
– Commit atomically persists the data and the ending offsets. It's all or nothing!

Page 16

Scheduling


Page 17

Static Scheduling Algorithm

A simple example:
– 5 topics
– Concurrency of 1
– Frame split into 5 equal parts
– 10 seconds total: 2 seconds each

Example scheduling yields: | 1 | 2 | 3 | 4 | 5 |

Hot topics become starved. Lots of wasted time!

Page 18

Dynamic Scheduling Algorithm

[Diagram: slot allocation built up across the frame – 1 | 1 2 | 1 2 3 | 1 2 3 4 | 1 2 3 4 5]

To start, every batch gets an even portion of the frame.

If a batch ends early, split the leftover time evenly amongst the remaining batches. Corollary: batches that run later in the frame tend to have more time to run.

...but there's still some wasted time at the end of the frame.

Page 19

Dynamic Scheduling Algorithm

[Diagram: the reordered frame – 5 | 5 4 | 5 4 3 | 5 4 3 1 | 5 4 3 1 2]

Next time, sort by the runtime of the previous frame so that batches that ended early go first.

µB2 gets lots of time now!

Page 20

Dynamic Scheduling in Action

– Scheduler configured to load two topics with a frame duration of 5 seconds

– Two producers continuously producing at dynamic rates (dotted lines)

– Vertica’s load rate for the topics keeps up with the produce rate (solid lines), roughly 5 seconds behind

– Net throughput rate remains constant as load resources shift from one topic to the other

[Chart: produce rates (dotted) and Vertica load rates (solid) for the two topics over time]

Page 21

Since 7.2 Excavator


Page 22

Added Since 7.2 Excavator

– Multiple Kafka cluster support
  – Added capability within a scheduler configuration to set up multiple Kafka clusters
  – Kafka topics can be associated with clusters, allowing users to stream data into Vertica from anywhere
  – Single resource pool; single configuration

– Kafka version support
  – Added support for Kafka 0.9.x
  – Working with Confluent to keep up to date with Kafka's fast release cycles
  – 0.10 in the works

[Diagram: one scheduler configuration in Vertica fed by multiple Kafka clusters]

Page 23

User-Defined Filters and Parsers

Why only JSON & Avro in 7.2 Excavator?
– Kafka messages arrive with structure & metadata in the source
– Traditional parsers assume no structure; instead they discover that structure in the data stream
– The Kafka JSON & Avro parsers are specially designed to preserve & leverage that information without modifying the data stream

How can I use other formats? Inject a filter!
– KafkaInsertDelimiters(delimiter=E'$')
– KafkaInsertLengths()

Once filtered, data can be parsed using built-in parsers or your own custom UDParser (examples on the next page).

[Pipeline diagram: Source → Filter → Parse]


Page 25

User-Defined Filters and Parsers Example

KafkaInsertDelimiters(delimiter=E'$')
– Appends a delimiter character after each message
– Most built-in parsers look for a record boundary

COPY t SOURCE KafkaSource(stream='some_topic|0|-2', stop_on_eof=true, brokers='localhost:9092')
FILTER KafkaInsertDelimiters(delimiter=E'$')
RECORD TERMINATOR E'$' DIRECT;

Data in Kafka          Data emitted by SOURCE         Data emitted by FILTER
Offset 0: {a:"foo"}    {a:"foo"}{b:"bar"}{c:"baz"}    {a:"foo"}${b:"bar"}${c:"baz"}$
Offset 1: {b:"bar"}
Offset 2: {c:"baz"}

KafkaInsertLengths()
– Prepends a uint32 length before each message
– Custom parsers can inspect lengths for efficient parsing

COPY t SOURCE KafkaSource(stream='some_topic|0|-2', stop_on_eof=true, brokers='localhost:9092')
FILTER KafkaInsertLengths()
PARSER MyCustomParser() DIRECT;

Data in Kafka          Data emitted by SOURCE         Data emitted by FILTER
Offset 0: {a:"foo"}    {a:"foo"}{b:"bar"}{c:"baz"}    9{a:"foo"}9{b:"bar"}10{c:"baz"}
Offset 1: {b:"bar"}
Offset 2: {c:"baz"}

Page 26

KafkaAVROParser External Schema Support

– Avro documents have three parts:
  – Schema: a JSON blob describing the message(s) in the document
  – Object metadata: metadata for parsing the object (SpecificData vs GenericData)
  – The data (i.e., a Vertica row)

– Kafka Avro serializers typically do one document per Kafka message – lots of bloat!

– Remove bloat with two settings:
  – external_schema – specify the JSON header up front and omit it from your messages
  – with_metadata = false (default) to omit metadata (parse using Avro GenericData)

– Kafka 0.10 schema registry not supported yet

[Diagram: one shared schema (JSON) header serving many metadata+data message pairs]
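Putting the two settings together, a minimal COPY sketch might look like the following (illustrative only; the target table, topic, broker, and the one-field record schema are placeholder assumptions):

COPY avro_target
SOURCE KafkaSource(stream='avro_topic|0|-2', stop_on_eof=true, brokers='localhost:9092')
PARSER KafkaAVROParser(
    -- schema supplied up front, so messages carry data only
    external_schema='{"type":"record","name":"rec","fields":[{"name":"a","type":"string"}]}',
    -- the default; parse with Avro GenericData, no per-message metadata
    with_metadata=false)
DIRECT;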

Page 27

Vertica → Kafka: KafkaExport UDT
Send query results to a Kafka topic in parallel!

– Input:
  – Partition (optional, NULL for round-robin)
  – Key (optional, NULL for unkeyed)
  – Message

– Output is the messages that failed to send & the reasons why (at-least-once semantics)

– Typical Kafka producer settings available to control performance & reliability

– INSERT … (SELECT …) for error management

CREATE TEMP TABLE kafka_rejections(
    partition integer,
    key varchar(128),
    message varchar(2000),
    reason varchar(1000));

INSERT INTO kafka_rejections
SELECT KafkaExport(partition, key, message
    USING PARAMETERS
        brokers='host1:9092,host2:9092',
        topic='foo',
        message_timeout_ms=5000,
        queue_buffering_max_ms=2000,
        queue_buffering_max_messages='10000')
OVER (PARTITION BEST) FROM export_src;

Page 28

Vertica → Kafka: Notifiers

– Notifiers emit messages to external systems, starting with Kafka
– Data Collector hooks can trigger notifiers when a record is written
– Enables external monitoring of Vertica, with persistence!

CREATE NOTIFIER dc_to_kafka
ACTION 'kafka://localhost:9092'
MAXMEMORYSIZE '1GB';
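As a sketch of the Data Collector hook mentioned above (this uses Vertica's SET_DATA_COLLECTOR_NOTIFY_POLICY function; the component name and channel string here are placeholder choices):

-- Send a message on channel 'vertica.notifications' via dc_to_kafka
-- whenever the LoginFailures Data Collector component writes a record
SELECT SET_DATA_COLLECTOR_NOTIFY_POLICY('LoginFailures', 'dc_to_kafka', 'vertica.notifications', true);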

Page 29

Schema and CLI Enhancements
A more flexible, more [re-]usable Scheduler

– CLI reworked for more flexibility, maintainability & extensibility
– Separated configuration from state: configuring topics no longer rewrites the entire offsets history
– Better projection design to optimize scheduler operations
– More consistent CLI-to-config-schema mappings, making SQL-based monitoring easier (see the sketch below)

[Diagram: scheduler configuration components – Cluster, Source, Target, Load Spec, Microbatch]
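A sketch of what that SQL-based monitoring can look like (the config schema name stream_config is a placeholder; stream_microbatch_history is the state table shown on Page 15):

SELECT source, target_table, partition, start_offset
FROM stream_config.stream_microbatch_history;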

Page 30

From Old to New

Old CLI utilities:
– scheduler
– kafka-cluster
– topic

New CLI utilities:
– scheduler
– cluster
– source
– target
– load-spec
– microbatch

The topic utility managed several different components; now each component is separated into its own logical utility.

Page 31

More Flexibility

– Configure clusters that reference Kafka brokers:
  vkconfig cluster --create --cluster kafka1 --hosts some-kafka-broker:9092

– Separation of Topic (now: Source) and Target:
  vkconfig source --create --source topic1 --cluster kafka1 --partitions 3
  vkconfig source --create --source topic2 --cluster kafka1 --partitions 5
  vkconfig target --create --target-schema public --target-table tgt1

– Configure microbatches with N:1 source(s) → target:
  – Reuse sources and targets as desired
  – Full N:M multiplexing capabilities with M microbatches

  vkconfig microbatch --create --microbatch mb1 --target-schema public --target-table tgt1 --add-source-cluster kafka1 --add-source topic1
  vkconfig microbatch --update --microbatch mb1 --add-source-cluster kafka1 --add-source topic2

Note: the identifier options (--cluster, --source, --target-schema/--target-table, --microbatch) are the unique keys for referencing each specific part of the configuration.

Page 32

More [re-]usability

– COPY statements have many parameters, which are great for differing workloads.
– Sometimes, however, we want to reuse the same "load specification".
– Introducing a new CLI and configuration table: load spec

vkconfig load-spec --create --load-spec SPEC-1 --load-method DIRECT --parser KafkaJSONParser --parser-parameters flatten_tables=true

vkconfig microbatch --update --microbatch mb1 --load-spec SPEC-1

SPEC-1:
– Load DIRECT
– JSON format
– Flatten JSON
– No FILTERs

SPEC-2 (sketched below):
– Load TRICKLE
– Pipe-delimited CSV format
– Insert-delimiter FILTER
– Specific Kafka configs
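For completeness, a hypothetical vkconfig invocation for SPEC-2 (the --filters option usage and the exact filter string are assumptions, not from the original deck; the pipe-delimited column handling and Kafka configs would come from the spec's remaining options):

vkconfig load-spec --create --load-spec SPEC-2 --load-method TRICKLE --filters "KafkaInsertDelimiters(delimiter=E'\n')"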

Page 33

The New CLI

vkconfig cluster --create --cluster kafka1 --hosts some-kafka-broker:9092

vkconfig source --create --source topic1 --cluster kafka1 --partitions 3

vkconfig source --create --source topic2 --cluster kafka1 --partitions 5

vkconfig target --create --target-schema public --target-table tgt1

vkconfig load-spec --create --load-spec SPEC-1 --load-method DIRECT --parser KafkaJSONParser --parser-parameters flatten_tables=true

vkconfig microbatch --create --microbatch mb1 --target-schema public --target-table tgt1 --add-source-cluster kafka1 --add-source topic1

vkconfig microbatch --update --microbatch mb1 --add-source-cluster kafka1 --add-source topic2

– Each component has its own CLI
– Each instance of a component is uniquely identifiable
– All components are reusable
– Each component is independently editable
– CRUD keywords for consistency

Page 34

Upgrade Process

– Upgrade converts current scheduler settings & state to the new format
– Old config state is left intact for historical purposes, but is no longer used

vkconfig scheduler --upgrade [--upgrade-to-schema <desired-schema>]

– By default, upgrade converts your 7.2.x configuration within the same schema
– --upgrade-to-schema lets users move the upgraded schema to a new location
– All objects have human-readable identifiers; upgrade auto-generates names, which can be edited afterwards
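For example, to land the converted configuration in a fresh schema (the schema name below is a placeholder):

vkconfig scheduler --upgrade --upgrade-to-schema kafka_config_v8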

Page 35

Monitoring Kafka Loading with Vertica Management Console


Page 36

Monitoring data load activities in MC – available in Frontloader 8.0

– Displays the history of data loading jobs, including the COPY command
– Shows the outcome of individual COPY commands

Kafka loading means many, many COPY commands executed repeatedly over time…

[Screenshot: the MC data load view after configuring Kafka streams]

Page 37

How is Kafka loading different from other types of data loading? We need to track and display many different pieces of data:

– Is the data flowing?
– What microbatches are defined in the database?
– Is the data getting processed by my microbatches?
– Is the Scheduler running?
– How many messages have been processed in the last hour? In the last frame?
– Are there any errors?
– Are there any rejections?

The MC now presents separate views of Instance and Continuous types of loading, so the MC user can easily focus on the type of loading tasks they want to track.

Page 38

Continuous (Kafka) loading – your data flow at a glance

[Screenshot: monitoring Kafka loading via MC data collector streams]

Page 39

Explore the details…

[Screenshots: Scheduler and Microbatch detail views]

Page 40

Explore the details…

[Screenshots: Microbatch errors and Microbatch rejections views]

Page 41

Suspend or Resume Topic Processing

[Screenshot: suspending or resuming processing for a topic]

Page 42

Filtering Continuous Loads

[Screenshots: filtering out MC data collector monitoring streams; filtering on the source]

Page 43

Benefits of Using MC to Monitor Kafka

– Monitor the Scheduler: is it running?
– Monitor microbatches: are they enabled?
– Monitor microbatch message processing: is the data flowing the way it is expected to?
– Easier triage of errors and rejected data

Page 44

Wrap Up

– Enhancements to the integration
– Closed the loop:
  – Export Vertica records to Kafka
  – Write DC table data to Kafka
– Enhanced filtering & parsing capabilities
  – Any UDFilter & UDParser can be used, not just Kafka-specific ones
  – Native Vertica parsing
– Scheduler: extensible, relational schema design
  – Flexible
  – Easily SQL-monitored
– Scheduler CLI enhancements
– Monitoring
  – Browser-based access to the status of the scheduler and microbatches
  – Easy assessment of issues such as data not loading, errors, and rejections

Page 45

Thank you
Contact information