
Ingestion, Indexing and Retrieval of High-Velocity Multidimensional Sensor Data on a Single Node

Juan A. Colmenares, Reza Dorrigiv and Daniel Waddington

<juan.col@samsung.com>

Samsung Research America

Seminar Series, Department of Computer Science, University of California, Irvine

January 12, 2018

Disclaimer

• No part of this presentation necessarily represents the views and opinions of my current employer and research collaborators.

• This material was presented at the 2017 IEEE Int’l Conference on Big Data (IEEE BigData).

Multidimensional Data Sources

• Mobile Devices
• Vehicles
• Data Centers
• Power Grid
• Smart Appliances

Multidimensional Data

• Spatial-temporal data – [time, longitude, latitude, speed, …]

• Sensor data – [time, voltage, current, temp, …]

• Logs – [time, response latency, result count, …]

Example record (NYC Taxi Data):

id: 28379, time: 2015/12/04-11:52:21.134, latitude: 40.77, longitude: -73.89, occupants: 3, speed: 43.2 mph

Record: [f1, f2, f3, f4, …, f(N-1), f(N)] (with numerical indexing fields)

Demands for High Ingestion Rates in Industrial IoT

• Some IoT apps require
  – Ingesting millions of recs/sec
  – Processing queries on recently ingested and historical data

• Example
  – Telemetry of power distribution systems with micro-phasor measurement units (μPMU) [1, 2, 3]
  – 512+ samples/sec of AC voltages and currents, and other variables

[1] UC Berkeley, LBNL et al. http://pqubepmu.com/
[2] Pinte et al. Low voltage micro-phasor measurement unit (μPMU). PECI 2015.
[3] Andersen et al. DISTIL: Design and implementation of a scalable synchrophasor data processing system. IEEE SmartGridComm 2015.

New DBMSs to Meet High Ingestion Requirements

• Traditional DBMS
  – Optimized for read-heavy workloads
  – Offer very low ingest rates (< 300K recs/s)

• New time series databases
  – Gorilla [VLDB 2015], BTrDB [FAST 2016]

• New OLAP systems
  – Druid [SIGMOD 2014], VOLAP [CLUSTER 2016], Cubrik [VLDB 2016]

• Scale horizontally
• Sub-second query responses
• Some operate in-memory

• Low per-node ingestion rates

Key Question

Can we build a single-node multidimensional datastore able to:
(1) sustain ingestion rates much higher than those of individual nodes of existing DBMS
(2) while still offering similar query performance?

• To answer it:
  – Adopt a simple design to streamline ingestion
  – Conduct an experimental study confined to
    • Records with numerical indexing fields
    • Range queries

Separate Nodes for Ingestion and Storage

[Diagram labels: Data Stream; Data Ingestion Nodes, DI Node (1) … DI Node (N); Interim Storage; Data Storage Nodes, DS Node (1) … DS Node (M); Permanent Storage; Query Broker Node; Queries]

Similar to Druid’s design [SIGMOD 2014]

R-Tree

Family Variants: R*-tree, R+-tree, Hilbert R-tree, X-tree

Source: Wikipedia

K-dimensional (kd) Tree

K-d tree decomposition for the point set (2,3), (5,4), (9,6), (4,7), (8,1), (7,2)

Source: Wikipedia
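The decomposition on this slide can be reproduced with a minimal sketch, assuming the common construction that splits on the median point and cycles through the axes by depth. The names KDNode and build_kdtree are illustrative and do not come from the talk.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

Point = Tuple[float, ...]


class KDNode(NamedTuple):
    point: Point
    left: Optional["KDNode"]
    right: Optional["KDNode"]


def build_kdtree(points: Sequence[Point], depth: int = 0) -> Optional[KDNode]:
    """Split on the median point, cycling through the axes by depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2  # upper median when the count is even
    return KDNode(
        point=ordered[mid],
        left=build_kdtree(ordered[:mid], depth + 1),
        right=build_kdtree(ordered[mid + 1:], depth + 1),
    )


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(tree.point)        # root splits on x
print(tree.left.point)   # left child splits on y
print(tree.right.point)  # right child splits on y
```

With this median rule the root is (7,2), and its children are (5,4) and (9,6). Other tie-breaking or median choices would give a different but equally valid tree.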

Two-Level Indexing Scheme

1. An R*-Tree indexes data segments (bounding boxes)

2. A KD-Tree in each segment indexes individual records

[Figure labels: R*-Tree, in memory (Level 1); KD-Tree (Level 2); Serialized Data Segments (with the records)]

Bounding Box = {d1_min, d1_max, …, d3_min, d3_max}

Similar to EMINC [CloudDB 2009]

Two-Level Indexing Scheme (Range Query)

[Figure labels: Range Query; R*-Tree, in memory (Level 1); KD-Tree (Level 2); Data Segments (with the records)]

Bounding Box = {d1_min, d1_max, …, d3_min, d3_max}
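A range query over the two levels can be sketched as follows. This is only an illustration of the control flow: a plain list scan stands in for the in-memory R*-tree at level 1, and a linear filter stands in for the per-segment KD-tree at level 2. The class TwoLevelIndex and its methods are hypothetical names, not the MDDS API.

```python
from typing import List, Sequence, Tuple

Record = Tuple[float, ...]
Box = Tuple[Record, Record]  # (per-dimension minima, per-dimension maxima)


def bounding_box(records: Sequence[Record]) -> Box:
    dims = range(len(records[0]))
    return (tuple(min(r[d] for r in records) for d in dims),
            tuple(max(r[d] for r in records) for d in dims))


def boxes_intersect(a: Box, b: Box) -> bool:
    return all(a[0][d] <= b[1][d] and b[0][d] <= a[1][d]
               for d in range(len(a[0])))


def in_box(rec: Record, box: Box) -> bool:
    return all(box[0][d] <= rec[d] <= box[1][d] for d in range(len(rec)))


class TwoLevelIndex:
    def __init__(self) -> None:
        self.segments: List[Tuple[Box, List[Record]]] = []
        self.segments_read = 0  # how many segments queries had to open

    def add_segment(self, records: Sequence[Record]) -> None:
        self.segments.append((bounding_box(records), list(records)))

    def range_query(self, query: Box) -> List[Record]:
        hits: List[Record] = []
        for box, records in self.segments:
            # Level 1: skip segments whose bounding box misses the query.
            if not boxes_intersect(box, query):
                continue
            self.segments_read += 1
            # Level 2: search inside the segment (a KD-tree in the talk).
            hits.extend(r for r in records if in_box(r, query))
        return hits


index = TwoLevelIndex()
index.add_segment([(1, 1, 1), (2, 2, 2)])
index.add_segment([(10, 10, 10), (12, 11, 13)])
result = index.range_query(((0, 0, 0), (5, 5, 5)))
print(result, index.segments_read)
```

Only the first segment is opened here, since the second segment's bounding box does not intersect the query. The level-1 check exists to prune segments in exactly this way.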

Data Segment

[Figure labels: Packed KD-Tree (Serialized); Data Records; Record Descriptor]

Data Ingestion Procedure

• Steps 1 – 5 are performed only in memory

[Figure labels: Multi-Dimensional Datastore (MDDS); μ-batches; Thread Parallelism (chunks processed independently); Data accessible to queries from memory before becoming persistent; Concurrent Queries (while ingesting data); Exploit data locality]
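The idea of cutting the stream into μ-batches and processing each chunk independently can be sketched as below. This is a simplified illustration, not the five-step procedure from the slide: the batch size, the worker count, and the helpers micro_batches, build_segment and ingest are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield consecutive chunks of at most batch_size records."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def build_segment(batch):
    """Turn one chunk into an in-memory segment with its bounding box."""
    dims = range(len(batch[0]))
    box = (tuple(min(r[d] for r in batch) for d in dims),
           tuple(max(r[d] for r in batch) for d in dims))
    return {"box": box, "records": batch}


def ingest(stream, batch_size=4, workers=2):
    # Each chunk is handled independently, so chunks can run on any thread.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(build_segment, micro_batches(stream, batch_size)))


stream = [(t, t % 3, 10 - t) for t in range(10)]
segments = ingest(stream)
print([len(s["records"]) for s in segments])
print(segments[0]["box"])
```

In CPython the threads mainly show the structure, because the interpreter lock limits CPU-bound speedups. A native implementation would be needed to see real ingestion scaling.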

Evaluation

Systems
• Percona Server (enhanced MySQL) with storage engines XtraDB, MyISAM, and TokuDB
• SQLite3
• Druid [SIGMOD 2014]

Datasets and Queries
• NYC Taxi Trip
  – ~169M records
  – 10 numerical fields (out of 14)
  – Queries: 16 randomly generated queries on 1 km × 1 km areas
• US NOAA’s Global Historical Climatology Network - Daily (GHCN-Daily)
  – First 100M records
  – 6 numerical fields (out of 7)
  – Queries: 10 meaningful handcrafted queries (e.g., the average snow depth for Mount McKinley in Alaska)

Test Platform: Dell PowerEdge R720 Server
• Two 2.50-GHz Intel Xeon processors (20 hardware threads), 64GB of RAM, and an Intel 750 400GB SSD with ext4 file system.
• Ubuntu 14.04 LTS (Linux kernel 3.13.0-71).

Details of experimental setup at: https://arxiv.org/abs/1707.00825

Test Queries on NYC Taxi Data: Randomly Generated

Test Queries on GHCN-Daily Data: Meaningful Handcrafted

Characterization of Data Segmentation Schemes

• Uniformly Random Scheme (very simple)
  – Records assigned to data segments chosen uniformly at random

• Kd-tree partitioning based scheme
  – Tries to create well-populated segments with small overlap among their bounding hyperrectangles
  – Our hypothesis
    • It limits read amplification and improves query performance (but not quite!)
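The uniformly random scheme, and the overlap metric used to compare schemes, can be sketched as follows. The function names, the seed parameter, and the pairwise definition of "overlap" (two bounding boxes intersect) are assumptions for illustration; the talk does not spell out how overlaps were counted.

```python
import random
from itertools import combinations


def random_segments(records, num_segments, seed=0):
    """Assign each record to a segment chosen uniformly at random."""
    rng = random.Random(seed)
    segments = [[] for _ in range(num_segments)]
    for rec in records:
        segments[rng.randrange(num_segments)].append(rec)
    return [s for s in segments if s]  # drop segments that stayed empty


def bounding_box(records):
    dims = range(len(records[0]))
    return (tuple(min(r[d] for r in records) for d in dims),
            tuple(max(r[d] for r in records) for d in dims))


def boxes_intersect(a, b):
    return all(a[0][d] <= b[1][d] and b[0][d] <= a[1][d]
               for d in range(len(a[0])))


def count_overlaps(segments):
    """Number of segment pairs whose bounding boxes intersect."""
    boxes = [bounding_box(s) for s in segments]
    return sum(1 for a, b in combinations(boxes, 2) if boxes_intersect(a, b))


records = [(x, y) for x in range(10) for y in range(10)]
segments = random_segments(records, num_segments=4)
print(len(segments), count_overlaps(segments))

# Hand-built segments: only the first two boxes intersect.
manual = [[(0, 0), (1, 1)], [(0.5, 0.5), (2, 2)], [(5, 5), (6, 6)]]
print(count_overlaps(manual))
```

Random assignment spreads every segment across the whole data range, so its bounding boxes tend to overlap heavily. That is the read amplification the kd-tree scheme was expected to reduce.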


KD-tree Partitioning Based Scheme

[Figure: Chunk of Records (μ-batch) → (1) Bulk Loading → KD-Tree → (2) KD-Tree Partitioning → Partitioned KD-Tree → (3) Assembly (Serialization) → Data Segment]

(2) KD-Tree Partitioning
• Traverses the tree in depth-first pre-order, grouping the records based on the number of nodes in the subtrees (with enough records)
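A simplified sketch of the partitioning idea follows. It is not the authors' exact algorithm: the kd-tree is left implicit in the recursion, which splits at the median and cycles through the axes, and a whole subtree becomes one segment as soon as it holds at most max_records records. The threshold max_records and the function kd_partition are hypothetical.

```python
def kd_partition(records, max_records, depth=0):
    """Group records into segments by recursive median splits.

    The recursion visits the implicit kd-tree depth-first, left subtree
    first, and emits a subtree as one segment once it is small enough.
    """
    if len(records) <= max_records:
        return [list(records)]
    axis = depth % len(records[0])
    ordered = sorted(records, key=lambda r: r[axis])
    mid = len(ordered) // 2
    return (kd_partition(ordered[:mid], max_records, depth + 1)
            + kd_partition(ordered[mid:], max_records, depth + 1))


chunk = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
segments = kd_partition(chunk, max_records=3)
print(segments)
```

For this chunk the first split on x yields two segments whose x-ranges, [2, 5] and [7, 9], do not intersect. This is the small-overlap property the scheme aims for.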

Comparison between Segmentation Schemes: Ingestion Throughput

Comparison between Segmentation Schemes: Number of Overlaps among Segments

Fewer overlaps with kd-tree partitioning, except in one case (highlighted in the chart)

Comparison between Segmentation Schemes: Query Performance

Couldn’t validate our hypothesis: the kd-tree partitioning scheme does not yield better query performance


Single-Threaded / Bulk-Loading Ingestion

• 11x w/ binary data
• 2x w/ CSV data

Ingestion Thread Scaling and Influence of Queries

Percona Server, SQLite & Druid reported 35K, 30K, and 55K recs/s, respectively.

• 230x in the multithreaded scenario
• 160x overall
• 27x w/ CSV data

Query Response Times for NYC Taxi Data

• Queries on 1 km² areas with ranges in time, trip duration and passenger count.
• Percona Server and SQLite with a single multicolumn index.

• MDDS performs comparably to or better than Percona Server in 12 queries.
• It outperforms SQLite in Q7 and Q14.
• It outperforms Druid in Q6-Q16 (on 3- to 5-dimensional ranges).

Query Response Times for GHCN-Daily Data

• 10 meaningful queries (e.g., average snow depth for Mt. McKinley in Alaska).
• Percona Server and SQLite with multiple indices tailored to the queries.

• As expected, the RDBMSs outperform MDDS across all queries.
• MDDS outperforms Druid in half of the queries (with 3+ dimensional ranges).

Storage Footprint (in GB)

MDDS occupies
• 20-42% less storage space than the RDBMSs
• Up to 2x the space used by Druid (w/ heavy data compression)

Conclusions

• Developed a multidimensional datastore able to
  – Ingest high-velocity sensor data
  – Offer decent query performance

• Showed potential for significant reductions in the number of cluster nodes required to ingest high-velocity sensor data

• Compared a random scheme and a kd-tree partitioning based scheme for data segmentation
  – Kd-tree partitioning scheme produced less overlap between data segments, but did not yield better query performance
  – The random scheme is very simple and faster
    • Our first choice

Thanks

Questions?
