Time Series Processing with Solr and Spark
TRANSCRIPT
![Page 1: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/1.jpg)
OCTOBER 11-14, 2016 • BOSTON, MA
![Page 2: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/2.jpg)
Time Series Processing with Solr and Spark Josef Adersberger (@adersberger)
CTO, QAware
![Page 3: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/3.jpg)
TIME SERIES 101
![Page 4: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/4.jpg)
WE’RE SURROUNDED BY TIME SERIES
▸ Operational data: Monitoring data, performance metrics, log events, …
▸ Data Warehouse: Dimension time
▸ Measured Me: Activity tracking, ECG, …
▸ Sensor telemetry: Sensor data, …
▸ Financial data: Stock charts, …
▸ Climate data: Temperature, …
▸ Web tracking: Clickstreams, …
▸ …
@adersberger
![Page 5: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/5.jpg)
WE’RE SURROUNDED BY TIME SERIES (Pt. 2)
▸ Oktoberfest: Visitor and beer consumption trend
the singularity
![Page 6: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/6.jpg)
TIME SERIES: BASIC TERMS
▸ univariate time series
▸ multivariate time series
▸ multi-dimensional time series (time series tensor)
▸ time series set
▸ observation
![Page 7: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/7.jpg)
ILLUSTRATIVE OPERATIONS ON TIME SERIES
Time series => Time series: align, diff, downsampling, outlier
Time series => Scalar: min/max, avg/med, slope, std-dev
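To make the two kinds of operations concrete, here is a minimal sketch on plain arrays. The class and method names are hypothetical and are not part of the Chronix API; the downsampling rule (averaging fixed-size blocks) and the slope definition (first to last observation) are simplifying assumptions.

```java
import java.util.Arrays;

/** Illustrative time series operations on plain arrays (hypothetical, not the Chronix API). */
final class TimeSeriesOpsSketch {

    /** Time series => time series: difference between consecutive values. */
    static double[] diff(double[] values) {
        if (values.length < 2) {
            return new double[0];
        }
        double[] out = new double[values.length - 1];
        for (int i = 1; i < values.length; i++) {
            out[i - 1] = values[i] - values[i - 1];
        }
        return out;
    }

    /** Time series => time series: average every block of `factor` values (assumed rule). */
    static double[] downsample(double[] values, int factor) {
        int blocks = (values.length + factor - 1) / factor;
        double[] out = new double[blocks];
        for (int b = 0; b < blocks; b++) {
            int from = b * factor;
            int to = Math.min(from + factor, values.length);
            out[b] = Arrays.stream(values, from, to).average().orElse(Double.NaN);
        }
        return out;
    }

    /** Time series => scalar: arithmetic mean (NaN for an empty series). */
    static double avg(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    /** Time series => scalar: slope between the first and the last observation. */
    static double slope(long[] timestamps, double[] values) {
        int last = values.length - 1;
        return (values[last] - values[0]) / (double) (timestamps[last] - timestamps[0]);
    }
}
```

The first two methods return a new series, the last two collapse a series into one number, which mirrors the two groups on the slide.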
![Page 8: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/8.jpg)
OUR USE CASE
![Page 9: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/9.jpg)
Monitoring Data Analysis of a business-critical, worldwide distributed software system. Enable root cause analysis and anomaly detection.
> 1,000 nodes worldwide
> 10 processes per node
> 20 metrics per process (OS, JVM, App-spec.)
Measured every second.
= about 6.3 trillion observations p.a.
Data retention: 5 yrs.
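The yearly figure can be reproduced from the numbers on this slide. The sketch below uses the stated lower bounds (1,000 nodes, 10 processes, 20 metrics) and assumes a 365-day year; the class name is made up for illustration, and the five-year total assumes the volume stays constant.

```java
/** Back-of-the-envelope check of the yearly observation count (illustrative only). */
final class ObservationVolume {

    static long observationsPerYear(long nodes, long processesPerNode, long metricsPerProcess) {
        long secondsPerYear = 365L * 24 * 60 * 60; // 31,536,000 with one measurement per second
        return nodes * processesPerNode * metricsPerProcess * secondsPerYear;
    }

    public static void main(String[] args) {
        long perYear = observationsPerYear(1_000, 10, 20);
        System.out.println(perYear);     // 6307200000000, i.e. about 6.3 trillion
        System.out.println(perYear * 5); // 31536000000000 over 5 years at constant volume
    }
}
```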
![Page 10: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/10.jpg)
![Page 11: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/11.jpg)
USE CASE: EXPLORING
Drill-down host process measurements counters (metrics)
Query time series metadata
Superimpose time series
![Page 12: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/12.jpg)
USE CASE: STATISTICS
![Page 13: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/13.jpg)
USE CASE: ANOMALY DETECTION
Featuring Twitter Anomaly Detection (https://github.com/twitter/AnomalyDetection) and Yahoo EGADS (https://github.com/yahoo/egads)
![Page 14: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/14.jpg)
USE CASE: SQL AND ZEPPELIN
![Page 16: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/16.jpg)
![Page 18: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/18.jpg)
![Page 19: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/19.jpg)
AVAILABLE TIME SERIES DATABASES
https://github.com/qaware/big-data-landscape
![Page 20: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/20.jpg)
EASY-TO-USE BIG TIME SERIES DATA STORAGE & PROCESSING ON SPARK
![Page 21: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/21.jpg)
THE CHRONIX STACK (chronix.io)
Big time series database: scale-out, storage-efficient, interactive queries
No separate servers: Drop-in to existing Solr and Spark installations
Integrated into the relevant open source ecosystem
Core: Chronix Storage, Chronix Server, Chronix Spark, Chronix Format
Collection: Logstash, fluentd, collectd
Analytics Frontends: Grafana, Chronix Analytics, Zeppelin
Ingestion Bridge: Prometheus, KairosDB, OpenTSDB, InfluxDB, Graphite
![Page 22: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/22.jpg)
Distributed Data & Data Retrieval
‣ Data sharding
‣ Fast index-based queries
‣ Efficient storage format
Distributed Processing
‣ Heavy lifting distributed processing
‣ Efficient integration of Spark and Solr
Result Processing
‣ Post-processing on a smaller set of time series
[Diagram labels: node, data flow]
Icon credits to Nimal Raj (database), Arthur Shlain (console) and alvarobueno (tasklist)
![Page 23: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/23.jpg)
TIME SERIES MODEL
Set of univariate multi-dimensional numeric time series
▸ set … because it’s more flexible and better to parallelise if operations can input and output multiple time series.
▸ univariate … because multivariate will introduce too much complexity (and we have our set to bundle multiple time series).
▸ multi-dimensional … because the ability to slice & dice in the set of time series is very convenient for a lot of use cases.
▸ numeric … because it’s the most common use case.
A single time series is identified by a combination of its non-temporal dimensional values (e.g. unit “mem usage” + host “aws42” + process “tomcat”)
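As a rough illustration of this identity rule, a key can be built from nothing but the dimension values. The class below is a hypothetical sketch, not Chronix code; it assumes dimensions are plain string pairs and leaves out the temporal fields on purpose.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical identity key: the non-temporal dimension values of a time series. */
final class TimeSeriesKeySketch {

    private final Map<String, String> dimensions;

    TimeSeriesKeySketch(Map<String, String> dimensions) {
        // Defensive copy; TreeMap also gives a stable ordering for toString()
        this.dimensions = new TreeMap<>(dimensions);
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof TimeSeriesKeySketch
                && dimensions.equals(((TimeSeriesKeySketch) other).dimensions);
    }

    @Override
    public int hashCode() {
        return dimensions.hashCode();
    }

    @Override
    public String toString() {
        return dimensions.toString();
    }
}
```

Two chunks with unit "mem usage", host "aws42" and process "tomcat" produce equal keys regardless of the order in which the dimensions were supplied, so they can be grouped into the same time series.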
![Page 24: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/24.jpg)
CHRONIX SPARK API: ENTRY POINTS
CHRONIX SPARK
ChronixRDD
‣ Represents a set of time series
‣ Distributed operations on sets of time series
ChronixSparkContext
‣ Creates ChronixRDDs
‣ Speaks with the Chronix Server (Solr)
![Page 25: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/25.jpg)
CHRONIX SPARK API: DATA MODEL
ChronixRDD (set of MetricTimeSeries)
+ toDataFrame() => DataFrame of MetricObservation
+ toDataset() => Dataset<MetricTimeSeries>
+ toObservationsDataset() => Dataset<MetricObservation>
![Page 26: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/26.jpg)
SPARK APIs FOR DATA PROCESSING

| | RDD | DataFrame | Dataset |
|---|---|---|---|
| typed | yes | no | yes |
| optimized | medium | highly | highly |
| mature | yes | yes | medium |
| SQL | no | yes | no |
![Page 27: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/27.jpg)
CHRONIX RDD
Statistical operations
the set characteristic: a JavaRDD of MetricTimeSeries
Filter the set (esp. by dimensions)
![Page 28: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/28.jpg)
METRICTIMESERIES DATA TYPE
access all timestamps
the multi-dimensionality: get/set dimensions (attributes)
access all observations as stream
access all numeric values
![Page 29: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/29.jpg)
![Page 30: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/30.jpg)

    //Create Chronix Spark context from a SparkContext / JavaSparkContext
    ChronixSparkContext csc = new ChronixSparkContext(sc);

    //Read data into ChronixRDD
    SolrQuery query = new SolrQuery(
        "metric:\"java.lang:type=Memory/HeapMemoryUsage/used\"");
    ChronixRDD rdd = csc.query(query,
        "localhost:9983",  //ZooKeeper host
        "chronix",         //Solr collection for Chronix
        new ChronixSolrCloudStorage());

    //Calculate the overall min/max/mean of all time series in the RDD
    double min = rdd.min();
    double max = rdd.max();
    double mean = rdd.mean();

    DataFrame df = rdd.toDataFrame(sqlContext);
    DataFrame res = df
        .select("time", "value", "process", "metric")
        .where("process='jenkins-jolokia'")
        .orderBy("time");
    res.show();

![Page 31: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/31.jpg)
CHRONIX SPARK INTERNALS
![Page 32: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/32.jpg)
Distributed Data & Data Retrieval
‣ Data sharding (OK)
‣ Fast index-based queries (OK)
‣ Efficient storage format
![Page 33: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/33.jpg)
CHRONIX FORMAT: CHUNKING TIME SERIES
TIME SERIES (same fields for the logical time series and for each physical chunk)
‣ start: TimeStamp
‣ end: TimeStamp
‣ dimensions: Map<String, String>
‣ observations: byte[]
Chunking:
‣ 1 logical time series = n physical time series (chunks)
‣ 1 chunk = fixed amount of observations
‣ 1 chunk = 1 Solr document
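The chunking rule can be sketched in a few lines. This is an illustration under simplifying assumptions, not the Chronix implementation: observations are kept as raw arrays instead of an encoded byte[], dimensions are omitted, and the class names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: split one logical series into chunks with a fixed number of observations. */
final class ChunkingSketch {

    /** One physical chunk: start/end timestamps plus its slice of observations. */
    static final class Chunk {
        final long start;
        final long end;
        final long[] timestamps;
        final double[] values;

        Chunk(long[] timestamps, double[] values) {
            this.start = timestamps[0];
            this.end = timestamps[timestamps.length - 1];
            this.timestamps = timestamps;
            this.values = values;
        }
    }

    /** Cuts the series every `chunkSize` observations; the last chunk may be shorter. */
    static List<Chunk> chunk(long[] timestamps, double[] values, int chunkSize) {
        List<Chunk> chunks = new ArrayList<>();
        for (int from = 0; from < timestamps.length; from += chunkSize) {
            int to = Math.min(from + chunkSize, timestamps.length);
            chunks.add(new Chunk(
                    Arrays.copyOfRange(timestamps, from, to),
                    Arrays.copyOfRange(values, from, to)));
        }
        return chunks;
    }
}
```

Each resulting chunk carries its own start and end, which is what would allow a store to pre-filter chunks by time range without decoding their observations.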
![Page 34: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/34.jpg)
CHRONIX FORMAT: ENCODING OF OBSERVATIONS
Binary encoding of all timestamp/value pairs (observations) with ProtoBuf incl. binary compression.
Delta encoding leading to more effective binary compression
‣ … of time stamps (DDC, Date-Delta-Compaction)
  chunk: timespan, nbr. of observations
  periodic distributed time stamps (pts): timespan / nbr. of observations
  real time stamps (rts)
  if |pts(x) - rts(x)| < threshold: rts(x) = pts(x)
  value_to_store = pts(x) - rts(x)
‣ … of values: diff
  value_to_store = value(x) - value(x-1)
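The two delta encodings can be sketched as follows. This follows the slide's description only and is not the actual Chronix encoder, which additionally serializes with ProtoBuf and compresses the result. The class name is hypothetical, `step` stands for timespan / nbr. of observations, and `threshold` is a free parameter; timestamps that deviate less than the threshold are snapped to the periodic grid, so the timestamp encoding is lossy.

```java
/** Simplified sketch of the two delta encodings described above (not the Chronix code). */
final class DeltaEncodingSketch {

    /** Value diff: keep the first value, then store value(x) - value(x-1). */
    static double[] encodeValues(double[] values) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (i == 0) ? values[0] : values[i] - values[i - 1];
        }
        return out;
    }

    /** Inverse of encodeValues: running sum of the stored diffs. */
    static double[] decodeValues(double[] diffs) {
        double[] out = new double[diffs.length];
        for (int i = 0; i < diffs.length; i++) {
            out[i] = (i == 0) ? diffs[0] : out[i - 1] + diffs[i];
        }
        return out;
    }

    /**
     * Timestamp compaction: compare each real timestamp (rts) with the periodic
     * timestamp (pts) and store pts - rts, or 0 if the deviation is below the threshold.
     */
    static long[] encodeTimestamps(long[] rts, long start, long step, long threshold) {
        long[] out = new long[rts.length];
        for (int i = 0; i < rts.length; i++) {
            long pts = start + i * step;
            long deviation = pts - rts[i];
            out[i] = Math.abs(deviation) < threshold ? 0L : deviation;
        }
        return out;
    }

    /** Reconstructs timestamps; those within the threshold come back as pts. */
    static long[] decodeTimestamps(long[] stored, long start, long step) {
        long[] out = new long[stored.length];
        for (int i = 0; i < stored.length; i++) {
            out[i] = start + i * step - stored[i];
        }
        return out;
    }
}
```

For regularly sampled metrics most stored entries become 0 or small numbers, which is the kind of input a general-purpose compressor handles well.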
![Page 35: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/35.jpg)
CHRONIX FORMAT: TUNING CHUNK SIZE AND CODEC
GZIP + 128 kBytes
[Chart axes: storage demand, access time]
Florian Lautenschlager, Michael Philippsen, Andreas Kumlehn, Josef Adersberger: Chronix: Efficient Storage and Query of Operational Time Series. International Conference on Software Maintenance and Evolution 2016 (submitted)
![Page 36: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/36.jpg)
CHRONIX FORMAT: STORAGE EFFICIENCY BENCHMARK
![Page 37: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/37.jpg)
CHRONIX FORMAT: PERFORMANCE BENCHMARK
[Benchmark table: query, nbr of queries; unit: seconds]
![Page 38: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/38.jpg)
Distributed Processing
‣ Heavy lifting distributed processing
‣ Efficient integration of Spark and Solr
![Page 39: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/39.jpg)
SPARK AND SOLR BEST PRACTICES: ALIGN PARALLELISM
• Unit of parallelism in Spark: Partition
• Unit of parallelism in Solr: Shard
• 1 Spark Partition = 1 Solr Shard
[Diagram: Solr Shards holding SolrDocuments (Chunks), mapped to the Partitions of a ChronixRDD holding TimeSeries]
![Page 40: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/40.jpg)
ALIGN THE PARALLELISM WITHIN CHRONIXRDD

    public ChronixRDD queryChronixChunks(
            final SolrQuery query,
            final String zkHost,
            final String collection,
            final ChronixSolrCloudStorage<MetricTimeSeries> chronixStorage)
            throws SolrServerException, IOException {
        // first get a list of replicas to query for this collection
        List<String> shards = chronixStorage.getShardList(zkHost, collection);
        // parallelize the requests to the shards
        JavaRDD<MetricTimeSeries> docs = jsc.parallelize(shards, shards.size()).flatMap(
                (FlatMapFunction<String, MetricTimeSeries>) shardUrl ->
                        chronixStorage.streamFromSingleNode(
                                new KassiopeiaSimpleConverter(), shardUrl, query)::iterator);
        return new ChronixRDD(docs);
    }

‣ Figure out all Solr shards (using CloudSolrClient in the background)
‣ Query each shard in parallel and convert SolrDocuments to MetricTimeSeries
![Page 41: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/41.jpg)
SPARK AND SOLR BEST PRACTICES: PUSHDOWN

    SolrQuery query = new SolrQuery(
        "<Solr query containing filters and aggregations>");
    ChronixRDD rdd = csc.query(query, …

Predicate pushdown
• Pre-filter time series based on their metadata (dimensions, start, end) with Solr.
Aggregation pushdown
• Perform pre-aggregations (min/max/avg/…) at ingestion time and store them as metadata.
• (to come) Perform aggregations on Solr-level at query time by enabling Solr to decode observations
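A minimal sketch of the pre-aggregation idea, assuming each chunk's values are available as a double[] at ingestion time. The class and the metadata field names ("min", "max", "avg", "count") are hypothetical and do not reflect the Chronix schema.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: compute aggregates once at ingestion and keep them as chunk metadata. */
final class PreAggregationSketch {

    /** Returns metadata fields (hypothetical names) for one chunk of values. */
    static Map<String, Double> aggregate(double[] values) {
        DoubleSummaryStatistics stats = Arrays.stream(values).summaryStatistics();
        Map<String, Double> metadata = new LinkedHashMap<>();
        metadata.put("min", stats.getMin());
        metadata.put("max", stats.getMax());
        metadata.put("avg", stats.getAverage());
        metadata.put("count", (double) stats.getCount());
        return metadata;
    }

    /** Combines per-chunk minima without decoding any observations. */
    static double overallMin(Iterable<Map<String, Double>> chunkMetadata) {
        double min = Double.POSITIVE_INFINITY;
        for (Map<String, Double> metadata : chunkMetadata) {
            min = Math.min(min, metadata.get("min"));
        }
        return min;
    }
}
```

With such fields stored next to each chunk, a query for the overall minimum only needs to read metadata, which is the point of pushing the aggregation down.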
![Page 42: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/42.jpg)
SPARK AND SOLR BEST PRACTICES: EFFICIENT DATA TRANSFER
Reduce volume: Pushdown & compression
Use efficient protocols: Low-overhead, bulk, stream
Avoid remote transfer: Place Spark tasks (processes 1 partition) on the Solr node with the appropriate shard. (to come by using SolrRDD)
[Diagram: Export Handler and CloudSolrStream (Solr / SolrJ) deliver a bulk of JSON tuples to the Format Decoder and ChronixRDD (Chronix Spark)]
![Page 43: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/43.jpg)

    private Stream<MetricTimeSeries> streamWithCloudSolrStream(
            String zkHost, String collection, String shardUrl, SolrQuery query,
            TimeSeriesConverter<MetricTimeSeries> converter) throws IOException {
        Map params = new HashMap();
        params.put("q", query.getQuery());
        params.put("sort", "id asc");
        params.put("shards", extractShardIdFromShardUrl(shardUrl));
        params.put("fl",
                Schema.DATA + ", " + Schema.ID + ", " + Schema.START + ", " + Schema.END +
                ", metric, host, measurement, process, ag, group");
        params.put("qt", "/export");
        params.put("distrib", false);
        CloudSolrStream solrStream = new CloudSolrStream(zkHost, collection, params);
        solrStream.open();
        SolrTupleStreamingService tupStream =
                new SolrTupleStreamingService(solrStream, converter);
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(tupStream, Spliterator.SIZED), false);
    }

‣ Pin query to one shard
‣ Use export request handler
‣ Boilerplate code to stream response
![Page 44: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/44.jpg)
Time Series Databases should be first-class citizens.
Chronix leverages Solr and Spark to be storage efficient and to allow interactive queries for big time series data.
![Page 45: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/45.jpg)
THANK YOU! QUESTIONS?
Mail: [email protected] Twitter: @adersberger
TWITTER.COM/QAWARE - SLIDESHARE.NET/QAWARE
![Page 46: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/46.jpg)
BONUS SLIDES
![Page 47: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/47.jpg)
PERFORMANCE
![Page 49: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/49.jpg)
PREMATURE OPTIMIZATION IS NOT EVIL IF YOU HANDLE BIG DATA
Josef Adersberger
![Page 50: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/50.jpg)
PERFORMANCE
USING A JAVA PROFILER WITH A LOCAL CLUSTER
![Page 51: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/51.jpg)
PERFORMANCE
HIGH-PERFORMANCE, LOW-OVERHEAD COLLECTIONS
![Page 52: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/52.jpg)
PERFORMANCE
830 MB -> 360 MB (-57%)
unveiled wrong Jackson handling inside of SolrClient
![Page 53: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/53.jpg)
THE SECRETS OF DISTRIBUTED PROCESSING PERFORMANCE
Rule 1: Be as close to the data as possible! (CPU cache > memory > local disk > network)
Rule 2: Reduce data volume as early as possible! (as long as you don’t sacrifice parallelization)
Rule 3: Parallelize as much as possible! (max = #cores * x)
![Page 54: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/54.jpg)
PERFORMANCE
THE RULES APPLIED
‣ Rule 1: Be as close to the data as possible!
  1. Solr caching
  2. Spark in-memory processing with activated RDD compression
  3. Binary protocol between Solr and Spark
‣ Rule 2: Reduce data volume as early as possible!
  ‣ Efficient storage format (Chronix Format)
  ‣ Predicate pushdown to Solr (query)
  ‣ Group-by & aggregation pushdown to Solr (faceting within a query)
‣ Rule 3: Parallelize as much as possible!
  ‣ Scale-out on data-level with SolrCloud
  ‣ Scale-out on processing-level with Spark
![Page 55: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/55.jpg)
APACHE SPARK 101
![Page 56: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/56.jpg)
CHRONIX SPARK WONDERLAND
ARCHITECTURE
![Page 57: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/57.jpg)
APACHE SPARK
SPARK TERMINOLOGY (1/2)
▸ RDD: Has transformations and actions. Hides data partitioning & distributed computation. References a set of partitions (“output partitions”), materialized or not, and has dependencies on another RDD (“input partitions”). RDD operations are evaluated as late as possible (when an action is called). Unless it is the root RDD, the partitions of an RDD are held in memory, but they can be persisted on request.
▸ Partitions: (Logical) chunks of data. Default unit and level of parallelism - inside of a partition everything is a sequential operation on records. Has to fit into memory. Can have different representations (in-memory, on disk, off heap, …)
![Page 58: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/58.jpg)
APACHE SPARK
SPARK TERMINOLOGY (2/2)
▸ Job: A computation job which is launched when an action is called on an RDD.
▸ Task: The atomic unit of work (function). Bound to exactly one partition.
▸ Stage: Set of Task pipelines which can be executed in parallel on one executor.
▸ Shuffling: If partitions need to be transferred between executors. Shuffle write = outbound partition transfer. Shuffle read = inbound partition transfer.
▸ DAG Scheduler: Computes DAG of stages from RDD DAG. Determines the preferred location for each task.
![Page 59: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/59.jpg)
THE COMPETITORS / ALTERNATIVES
CHRONIX RDD VS. SPARK-TS
▸ Spark-TS provides no specific time series storage; it uses the Spark persistence mechanisms instead. This leads to less efficient storage usage and fewer possibilities for performance optimizations via predicate pushdown.
▸ In contrast to Spark-TS, Chronix does not align all time series values on one vector of timestamps. This leads to greater flexibility in time series aggregation.
▸ Chronix provides multi-dimensional time series as this is very useful for data warehousing and APM.
▸ Chronix has support for Datasets as this will be an important Spark API in the near future. But Chronix currently doesn’t support an IndexedRowMatrix for SparkML.
▸ Chronix is purely written in Java. There is no explicit support for Python and Scala yet.
▸ Chronix does not support a ZonedTime as this makes it way more complicated.
![Page 60: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/60.jpg)
CHRONIX SPARK INTERNALS
![Page 61: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/61.jpg)
CHRONIXRDD: GET THE CHUNKS FROM SOLR

    public ChronixRDD queryChronixChunks(
            final SolrQuery query,
            final String zkHost,
            final String collection,
            final ChronixSolrCloudStorage<MetricTimeSeries> chronixStorage)
            throws SolrServerException, IOException {
        // first get a list of replicas to query for this collection
        List<String> shards = chronixStorage.getShardList(zkHost, collection);
        // parallelize the requests to the shards
        JavaRDD<MetricTimeSeries> docs = jsc.parallelize(shards, shards.size()).flatMap(
                (FlatMapFunction<String, MetricTimeSeries>) shardUrl ->
                        chronixStorage.streamFromSingleNode(
                                new KassiopeiaSimpleConverter(), shardUrl, query)::iterator);
        return new ChronixRDD(docs);
    }

‣ Figure out all Solr shards (using CloudSolrClient in the background)
‣ Query each shard in parallel and convert SolrDocuments to MetricTimeSeries
![Page 62: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/62.jpg)
BINARY PROTOCOL WITH STANDARD SOLR CLIENT

    private Stream<MetricTimeSeries> streamWithHttpSolrClient(
            String shardUrl, SolrQuery query,
            TimeSeriesConverter<MetricTimeSeries> converter) {
        HttpSolrClient solrClient = getSingleNodeSolrClient(shardUrl);
        solrClient.setRequestWriter(new BinaryRequestWriter());
        query.set("distrib", false);
        SolrStreamingService<MetricTimeSeries> solrStreamingService =
                new SolrStreamingService<>(converter, query, solrClient, nrOfDocumentPerBatch);
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(solrStreamingService, Spliterator.SIZED), false);
    }

‣ Use HttpSolrClient pinned to one shard
‣ Use binary (request) protocol
‣ Boilerplate code to stream response
![Page 63: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/63.jpg)
EXPORT HANDLER PROTOCOL

    private Stream<MetricTimeSeries> streamWithCloudSolrStream(
            String zkHost, String collection, String shardUrl, SolrQuery query,
            TimeSeriesConverter<MetricTimeSeries> converter) throws IOException {
        Map params = new HashMap();
        params.put("q", query.getQuery());
        params.put("sort", "id asc");
        params.put("shards", extractShardIdFromShardUrl(shardUrl));
        params.put("fl",
                Schema.DATA + ", " + Schema.ID + ", " + Schema.START + ", " + Schema.END +
                ", metric, host, measurement, process, ag, group");
        params.put("qt", "/export");
        params.put("distrib", false);
        CloudSolrStream solrStream = new CloudSolrStream(zkHost, collection, params);
        solrStream.open();
        SolrTupleStreamingService tupStream =
                new SolrTupleStreamingService(solrStream, converter);
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(tupStream, Spliterator.SIZED), false);
    }

‣ Pin query to one shard
‣ Use export request handler
‣ Boilerplate code to stream response
![Page 64: Time Series Processing with Solr and Spark](https://reader034.vdocuments.mx/reader034/viewer/2022052405/5871c6941a28ab55058b8275/html5/thumbnails/64.jpg)
CHRONIXRDD: FROM CHUNKS TO TIME SERIES

    public ChronixRDD joinChunks() {
        JavaPairRDD<MetricTimeSeriesKey, Iterable<MetricTimeSeries>> groupRdd
                = this.groupBy(MetricTimeSeriesKey::new);
        JavaPairRDD<MetricTimeSeriesKey, MetricTimeSeries> joinedRdd
                = groupRdd.mapValues((Function<Iterable<MetricTimeSeries>, MetricTimeSeries>) mtsIt -> {
            MetricTimeSeriesOrdering ordering = new MetricTimeSeriesOrdering();
            List<MetricTimeSeries> orderedChunks = ordering.immutableSortedCopy(mtsIt);
            MetricTimeSeries result = null;
            for (MetricTimeSeries mts : orderedChunks) {
                if (result == null) {
                    result = new MetricTimeSeries
                            .Builder(mts.getMetric())
                            .attributes(mts.attributes()).build();
                }
                result.addAll(mts.getTimestampsAsArray(), mts.getValuesAsArray());
            }
            return result;
        });
        JavaRDD<MetricTimeSeries> resultJavaRdd =
                joinedRdd.map((Tuple2<MetricTimeSeriesKey, MetricTimeSeries> mtTuple) -> mtTuple._2);
        return new ChronixRDD(resultJavaRdd);
    }

‣ group chunks according to identity
‣ join chunks to logical time series