

Preprint version for self-archiving. The final publication is available at Springer via http://dx.doi.org/10.1007/s00778-015-0396-z.

Noname manuscript No. (will be inserted by the editor)

VDDA: Automatic Visualization-Driven Data Aggregation in Relational Databases

Uwe Jugel, Zbigniew Jerzak, Gregor Hackenbroich, Volker Markl

the date of receipt and acceptance should be inserted later

Abstract Visual analysis of high-volume numerical data is ubiquitous in many industries, including finance, banking, and discrete manufacturing. Contemporary, RDBMS-based systems for visualization of high-volume numerical data have difficulty coping with the hard latency requirements and high ingestion rates of interactive visualizations. Existing solutions for lowering the volume of large data sets disregard the spatial properties of visualizations and result in visualization errors. In this work, we introduce VDDA, a Visualization-Driven Data Aggregation approach that provides high-quality to error-free visualizations of high-volume data sets at high data reduction rates. Based on the M4 aggregation for producing error-free line charts, we develop a complete set of visualization-driven data aggregation operators for the most common chart types. We describe how to model aggregation-based data reduction at the query level in a visualization-driven query rewriting system. Our approach is generic and applicable to any visualization system that consumes data stored in relational databases. Using real-world data sets from the high-tech manufacturing, stock market, and sports analytics domains, we demonstrate that our visualization-driven data aggregation can reduce data volumes by up to two orders of magnitude, while preserving pixel-perfect visualizations of the raw data.

Keywords Relational databases, Data visualization, Dimensionality reduction, Line rasterization

Uwe Jugel, Zbigniew Jerzak, Gregor Hackenbroich
SAP SE, Walldorf/Dresden, Germany
{firstname}.{lastname}@sap.com

Volker Markl
Technische Universität Berlin, Berlin, Germany
[email protected]

1 Introduction

Enterprises are gathering petabytes of data in public and private clouds, with numerical data originating from various sources, including sensor networks [21], smart grids [33], financial markets, and many more. Large volumes of the recorded data are subsequently stored in relational databases. Relational databases, in turn, are used as backend by visual data analysis applications. Data analysts interact with the visualizations, and their actions are transformed by the applications into a series of queries issued against the relational database holding the original data. In state-of-the-art visual analytics software (e.g., Tableau, QlikView, SAP Lumira, etc.), such queries are issued to the database without considering the cardinality of the query result. However, when reading data from high-volume data sources, the result sets often contain millions of records. This leads to a very high bandwidth consumption between the visualization system and the database.

Let us consider the following example. SAP customers in high-tech manufacturing report that it is not uncommon for 100 engineers to simultaneously access a global database containing equipment monitoring data. The data originates from sensors embedded within the high-tech manufacturing machines. The common reporting frequency for such embedded sensors is 100 Hz [21]. An engineer usually accesses data that spans the last 12 hours for any given sensor. The engineer's visualization system may then use the following non-aggregating query to retrieve the necessary data from the database.

SELECT time, value FROM sensor
WHERE time > NOW() - 12 * 3600

The amount of data to transfer from the database to all visualization clients is 100 users · (12 · 3600) seconds · 100 Hz =


432 million records, i.e., over 4 million records per visualization client. Assuming a wire size of 60 bytes per record, the total traffic is 25.92 GB. Each user will have to wait for 259.2 MB to be loaded to the visualization client, before he or she can examine the chart that displays the sensor signal.

With the proliferation of high-frequency data sources and real-time visualization systems, the above concurrent usage pattern and its implications are observed by SAP not only in high-tech manufacturing, but across a constantly increasing number of industries, including sports analytics [32], finance, and utilities [33].
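The traffic figures above follow directly from the stated parameters; a minimal arithmetic check (the 60-byte wire size is the example's assumption):

```python
# Back-of-the-envelope check of the traffic estimate from the example.
users = 100
seconds = 12 * 3600          # 12 hours of data per engineer
hz = 100                     # sensor reporting frequency
bytes_per_record = 60        # assumed wire size per record

records_total = users * seconds * hz
records_per_client = seconds * hz

total_gb = records_total * bytes_per_record / 1e9
per_client_mb = records_per_client * bytes_per_record / 1e6

print(records_total)       # 432000000 records in total
print(records_per_client)  # 4320000 records per client
print(total_gb)            # 25.92 GB total traffic
print(per_client_mb)       # 259.2 MB per client
```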

1.1 Basic Idea

The final visualization, presented to the engineer, is inherently restricted to displaying the retrieved data using the width × height pixels of the canvas of the resulting visualization. Thereby, large data sets are transformed and projected onto the limited number of pixels by the rendering component of the visualization client. Consequently, the rendering process is an implicit, client-side data reduction, applied to any incoming result set, regardless of the number of records it contains. The goal of this paper is to leverage this fundamental observation and apply a corresponding data reduction already at the query level within the database.

[Figure 1 sketch: a) an unbounded query Q = SELECT t,v FROM T against the RDBMS returns 100k tuples (in 20 s); b) the rewritten query QR = MR(Q) returns 10k tuples (in 2 s), with vis1 == vis2.]

Fig. 1 Numerical data visualization: a) based on an unbounded, non-aggregating query; b) using visualization-driven data aggregation at the query level.

As illustrated in Figure 1, the goal is to rewrite an original, visualization-related query Q using a data reduction operator MR, such that the resulting query QR produces a much smaller result set, without impairing the resulting visualization. We aim to significantly reduce the data volumes and thus lower the high network bandwidth requirements. This will lead to shorter waiting times for the users of the visualization system. Note that the goal of our approach is not to compute images inside the database, since this prevents client-side interaction with the data. Instead, our system selects a subset of the original query result that can be consumed transparently by any visualization client.

1.2 Contributions

We first propose a visualization-driven query rewriting technique, relying on relational operators and parametrized with width, height, and type of the desired visualization. Secondly, focusing on the pixel-level, spatial properties of line charts, which are the predominant form of time series visualization, we develop a visualization-driven data aggregation that produces a small data subset, necessary to draw a pixel-perfect line chart of the underlying high-volume data set. Finally, in this extended version of the previously published paper [23], we generalize our approach for line charts and describe how to model appropriate data aggregation operators for other commonly used chart types.

M4. For line rendering in particular, we model the visualization process by selecting, for every time interval that corresponds to a pixel column in the final visualization, the tuples with the minimum and maximum value, and additionally the first and last tuples, having the minimum and maximum timestamp in that pixel column. To the best of our knowledge, there is no application or previous discussion of this data reduction model in the literature, even though it provides superior results for the purpose of two-color (binary) line visualizations. We denote this model as M4 aggregation and discuss and evaluate it for line visualizations in general, including anti-aliased (non-binary) line visualizations. The proposed aggregation model differs significantly from state-of-the-art time series dimensionality reduction techniques [16], which are often based on line simplification algorithms [36], such as the Ramer-Douglas-Peucker [10,20] and the Visvalingam-Whyatt [38] algorithms. These algorithms are computationally expensive, with O(n log n) complexity [27], and disregard the projection of the line to the pixels of the final visualization, resulting in pixel errors. In contrast, our M4 approach has O(n) complexity and provides pixel-perfect visualizations, as producible from the raw data.

VDDA. Motivated by the good performance and the ability to implement M4 using only the relational algebra, we aimed to generalize the underlying aggregation model. As a result, in this extended work, we define a complete set of Visualization-Driven Data Aggregation (VDDA) operators for the most common chart types, such as scatter plots and bar charts. To derive each VDDA operator, we present a detailed analysis of each considered visualization, describing how it intersects the available 2D pixel space into unique display units that correspond to the aggregation groups for VDDA.

Our approach differs from existing, systematic approaches to data visualization in that it is not yet another


model for building a visualization pipeline [7,37]. Instead, VDDA complements these existing approaches. Our approach also differs from automatic presentation techniques [31], by focusing on the spatial properties of the visualization, and thus not requiring additional semantic information about the data.

M4 in particular and VDDA in general rely only on relational operators for the data reduction. Consequently, our approach can be applied transparently in any RDBMS-based visualization system, using the proposed query rewriting technique. We demonstrate the improvements of our techniques in a real-world setting, using prototype implementations of our algorithms on top of SAP HANA [15] and Postgres (postgres.org).

The remainder of the paper is structured as follows. In Section 2, we describe our query rewriting system. In Section 3, we discuss the considered types of visualizations. In Section 4, we discuss the proposed M4 aggregation for line charts and describe the drawbacks of existing dimensionality reduction techniques. In Section 5, we analyze additional chart types and, based on their spatial properties, derive their VDDA operators in Section 6. In Section 7, we compare M4 with existing techniques and evaluate the improvements regarding query execution time, data efficiency, and visualization quality. Thereafter, in Section 8, we evaluate the capabilities of the additional VDDA operators. After discussing additional related work in Section 9, we conclude with Section 10.

2 Query Rewriting System

In this section, we describe our query rewriting approach. The approach is applicable to visual data analysis systems and facilitates a data-centric reduction of high-volume numerical data in relational databases.

2.1 Solution Overview and Basic Concepts

To incorporate operators for data reduction, an original query to a high-volume data source needs to be rewritten. The rewriting can either be done directly by the visualization client or by an additional query interface to the relational database management system (RDBMS). Independently of where the query is rewritten, the actual data reduction is always computed by the database itself. Figure 2 illustrates this query rewriting and data-centric data reduction approach.

Query Definition. The definition of a query starts at the visualization client, where the user first selects a data source, a visualization type, and the main query

[Figure 2 sketch: the visualization client sends the query and the visualization parameters (e.g., the selected time range) to the query rewriter, which issues a data reduction query to the RDBMS; the data-reduced query result flows back to the visualization client.]

Fig. 2 Visualization system with query rewriter.

parameters, e.g., a time range of an underlying time series data source. The data source and its related parameters define the original query. Many queries issued by our visualization clients are non-aggregating queries, such as the following.

SELECT time, value FROM series
WHERE series_id = '1'
AND time > $t1 AND time < $t2

Practically, the visualization client can define an arbitrary relational query, as long as the schema of the query result is compatible with the data model of the desired visualization.

Time Series Data Model. Most of our considered data sets are time series or sets of time series, e.g., one series for every sensor in a semiconductor manufacturing machine. A time series is a binary relation T(a, b) with two numeric columns a and b. To respect the formulas of the original paper, especially when discussing the M4 aggregation for line charts, we also denote time series as a relation T(t, v) with the timestamp t ∈ R and value v ∈ R. Any other relation that has at least two numerical attributes, e.g., one for each sensor, can easily be projected to this model. For instance, given a relation X(a, b, c) and knowing that a is the numerical timestamp and b and c are the numerical sensor values, we can derive two separate time series relations by means of projection and renaming using the relational algebra, i.e., Tb(t, v) = πt←a,v←b(X) and Tc(t, v) = πt←a,v←c(X).
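The projection-and-renaming step can be sketched on relations represented as tuple lists (a minimal illustration; the sample values are ours):

```python
# A relation X(a, b, c): a is the numerical timestamp,
# b and c are two numerical sensor values.
X = [(1, 10.0, 7.5), (2, 11.0, 7.4), (3, 10.5, 7.9)]

# Tb(t, v) = pi_{t<-a, v<-b}(X) and Tc(t, v) = pi_{t<-a, v<-c}(X):
# each projection yields a separate binary time series relation.
Tb = [(a, b) for a, b, _ in X]
Tc = [(a, c) for a, _, c in X]

print(Tb)  # [(1, 10.0), (2, 11.0), (3, 10.5)]
print(Tc)  # [(1, 7.5), (2, 7.4), (3, 7.9)]
```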

Visualization Parameters. In addition to the original query, the visualization client must also provide the type of the visualization and the corresponding width w and height h of the visualization canvas, on which the data is rendered using geometric shapes, i.e., the plain drawing area without axes, legend, labels, margins, etc.

Geometric Transformation Functions. When deriving the data reduction operators from the visualization parameters, our query rewriter assumes a visualization-specific geometric transformation to be computed from the query result by the visualization client. Given the width w and height h of the canvas, the visualization client uses the following geometric transformation functions x = fx(t) and y = fy(v), with x, y ∈ R,


to project each timestamp t and value v to the visualization's coordinate system.

fx(t) = w · (t − tstart) / (tend − tstart)
fy(v) = h · (v − vmin) / (vmax − vmin)    (1)

The projected real-valued time series data is then traversed by the drawing routines of the visualization client to derive the discrete pixels. We assume that the projected minimum and maximum timestamps and values of the selected time series (tstart, tend, vmin, vmax) match exactly with the real-valued left, right, bottom, and top boundaries (xleft, xright, ybottom, ytop) of the canvas area, i.e., we assume that the drawing routines do not apply an additional vertical or horizontal translation or rescaling operation to the data that the query rewriter is not aware of.
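The transformation functions of Equation (1) can be written down directly (a sketch; the function and parameter names mirror the text, the sample canvas size is ours):

```python
def fx(t, t_start, t_end, w):
    # Equation (1): map timestamp t to a real-valued x coordinate in [0, w].
    return w * (t - t_start) / (t_end - t_start)

def fy(v, v_min, v_max, h):
    # Equation (1): map value v to a real-valued y coordinate in [0, h].
    return h * (v - v_min) / (v_max - v_min)

# Example: a 200x50 pixel canvas; timestamps 0..100 and values 0.0..1.0
# map exactly onto the canvas boundaries, as assumed by the query rewriter.
print(fx(50, 0, 100, 200))    # 100.0 (middle of the canvas)
print(fy(0.5, 0.0, 1.0, 50))  # 25.0
```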

2.2 Extended Visualization Data Model

Different visualizations rely on different data models, and thus a visualization-related query has to produce results that follow a visualization-specific schema. Many visualizations are based on a simple two-column (binary) relation T(a, b), as already defined above by our time series data model. For many visualizations, the numeric columns a and b correspond to the x and y coordinates on the visualization canvas.

Some visualizations can display data with multiple dimensions, including several categorical and numerical dimensions. The most complex relation that we consider for basic (non-composite) visualizations is the relation T(a, b, c, d, e), with five numeric columns a to e. This relation is similar to the data model for a 5D scatter plot, as illustrated in Figure 3.

[Figure 3 sketch: scatter plot marks encoding position (a, b), size (c), color (d), and shape (e).]

Fig. 3 A 5D scatter plot with four continuous numerical dimensions a to d and one discrete dimension e with low cardinality.

In this example, the data column e is represented by the shape type, e.g., rectangle, triangle, or circle. The number of different shape types is very limited, so data column e should contain only a few distinct values, such as the id of a data subset. Such data columns are usually non-numeric, i.e., categorical data columns. However, in our work we consider all data columns to be numeric and assume that n distinct categorical values are represented by a sequence of numbers (1, . . . , n). The handling and

a) Conditional query to apply PAA data reduction:

WITH Q AS (SELECT t, v FROM sensors                   -- 1) original query Q
           WHERE id = 1 AND t >= $t1 AND t <= $t2),
     QC AS (SELECT count(*) c FROM Q)                 -- 2) cardinality query QC
SELECT * FROM Q WHERE (SELECT c FROM QC) <= 800       -- 3a) use Q if low cardinality
UNION
SELECT * FROM (                                       -- 3b) use QD if high cardinality
  SELECT min(t), avg(v) FROM Q                        --     reduction query QD: compute
  GROUP BY round(200 * (t - $t1) / ($t2 - $t1))       --     aggregates per pixel column
) AS QD WHERE (SELECT c FROM QC) > 800

b) resulting image vs. c) expected image [images omitted]

Fig. 4 Query rewriting result and visualization.

conversion of categorical data, for the purpose of data aggregation, is out of the scope of this paper.

For our approach, we moreover do not explicitly distinguish between discrete and continuous numerical columns, as required by automatic presentation techniques [31], which have to acquire semantic information about the data columns. Our aggregation model considers only the spatial properties of a visualization, independently of the semantics of the data. Internally, the proposed visualization-driven data aggregation operators may still convert real-valued data to discrete values, e.g., for computing aggregation group keys. However, the final query result always preserves the original data types of all visualized data columns. As a result, for all VDDA operators, the input relation has the same schema as the output relation and thus can be consumed transparently by any visualization client.
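The assumption that n distinct categorical values are represented by the numbers 1 to n can be sketched as a simple encoding pass (the function name and sample categories are ours, for illustration only):

```python
def encode_categories(values):
    # Assign each distinct categorical value the next number 1..n,
    # in order of first appearance, and encode the column with it.
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes) + 1)
    return [codes[v] for v in values], codes

# Example: a shape-type column for dimension e of the 5D scatter plot.
encoded, mapping = encode_categories(["circle", "square", "circle", "triangle"])
print(encoded)  # [1, 2, 1, 3]
print(mapping)  # {'circle': 1, 'square': 2, 'triangle': 3}
```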

2.3 Constructing the Rewritten Query

In our system (cf. Figure 2), the query rewriter handles all visualization-related queries. Therefore, it receives a query Q and the visualization parameters: type, width w, and height h. The goal of the rewriting is to apply an additional data reduction to those queries whose result set exceeds a certain size limit. The result of the rewriting process is exemplified in Figure 4a. The depicted rewritten query Q′ contains the following subqueries.

1) The original query Q,
2) a cardinality query QC on Q,
3) a cardinality check (conditional execution),
3a) to either use the result of Q directly, or
3b) to execute a data reduction query QR on Q.

Based on the type of the visualization, the query rewriter will use different data reduction queries QR, as defined in Sections 4 and 6.
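The rewritten query of Figure 4a can be exercised end to end; the following is a runnable sketch in which SQLite stands in for the RDBMS (the table name, the synthetic sensor data, and the parameter values are ours; the query structure, the 800-tuple limit, and the w = 200 grouping follow the figure):

```python
import sqlite3

# Synthetic sensor table with 10000 rows for sensor id 1.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sensors (id INT, t REAL, v REAL)")
con.executemany("INSERT INTO sensors VALUES (1, ?, ?)",
                [(i, (i % 17) * 0.1) for i in range(10000)])

rewritten = """
WITH Q AS (SELECT t, v FROM sensors
           WHERE id = 1 AND t >= ? AND t <= ?),
     QC AS (SELECT count(*) c FROM Q)
SELECT * FROM Q WHERE (SELECT c FROM QC) <= 800     -- 3a) low cardinality
UNION
SELECT * FROM (                                     -- 3b) PAA reduction QD
  SELECT min(t), avg(v) FROM Q
  GROUP BY round(200 * (t - ?) / (? - ?))
) AS QD WHERE (SELECT c FROM QC) > 800
"""
# Parameters: t1, t2 for Q, then t1, t2, t1 for the grouping function.
rows = con.execute(rewritten, (0, 9999, 0, 9999, 0)).fetchall()
print(len(rows))  # 201 group rows (one per pixel column) instead of 10000
```

Because the cardinality of Q (10000) exceeds the limit, branch 3a yields no rows and branch 3b returns one aggregate per pixel-column group, all from a single query execution.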


The SQL example in Figure 4a uses a width of w = 200 pixels, a cardinality limit of 4 · w = 800 tuples, and it applies a data reduction using a simple piece-wise aggregate approximation (PAA) [26] – a naive measure for time series dimensionality reduction. The corresponding visualization is a line chart, for which we only consider the width w to define the PAA parameters.

Our system composes all relevant subqueries into one single SQL query to ensure a fast query execution. Thereby, we first leverage that databases are able to reuse results of subqueries, and secondly assume that the execution time of counting the number of records selected by the original query is negligibly small, compared to the execution time of the original query itself. The two properties are true for most modern databases, and the first facilitates the second. In practice, subquery reuse can be achieved via common table expressions [34], as defined by the SQL:1999 standard.

Conditional Execution. Note that we model the described conditional execution using a union of the different subqueries (3a) and (3b) that have contradictory predicates based on the cardinality limit. This allows us to execute any user-defined query logic of the query Q only once. The execution of the original query Q and the data reduction query QR is all covered by this single query Q′. No intermediate data needs to be evaluated by the downstream components, and no high-volume data needs to be copied to a downstream data reduction or compression component.

Grouping Function. To compute group keys for the aggregation groups, we use the same geometric transformations y = fy(v) and x = fx(t) as used by the visualization client, and then round the resulting values to discrete group keys. For instance, the line chart query in Figure 4a groups the data in one dimension (time t) using the following grouping function fg(t), resulting in group keys between 0 and w = 200.

fg(t) = round(w · (t − tstart) / (tend − tstart))    (2)

The group keys of most 1D charts, including bar charts and line charts, can be computed using such a one-dimensional grouping function. 2D visualizations, in turn, require a vertical and horizontal grouping, incorporating both transformation functions fx and fy. For instance, a scatter plot uses the following grouping function fg(t, v), resulting in group keys between 0 and w · h.

fg(t, v) = h · round(w · (t − tstart) / (tend − tstart)) + round(h · (v − vmin) / (vmax − vmin))    (3)

In Section 6, based on these two basic grouping functions, we will define specific grouping functions for each considered type of visualization.
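Equations (2) and (3) amount to discretizing the geometric transformations into group keys; a minimal sketch (function names are ours; note that Python's round() differs from SQL's round() for exact halves, which is irrelevant for this illustration):

```python
def fg_1d(t, t_start, t_end, w):
    # Equation (2): one group key per pixel column, in 0..w.
    return round(w * (t - t_start) / (t_end - t_start))

def fg_2d(t, v, t_start, t_end, v_min, v_max, w, h):
    # Equation (3): one group key per pixel, in 0..w*h,
    # combining the horizontal and vertical grouping.
    return (h * round(w * (t - t_start) / (t_end - t_start))
            + round(h * (v - v_min) / (v_max - v_min)))

# Example on a 200x50 pixel canvas with t in 0..100 and v in 0.0..1.0.
print(fg_1d(50, 0, 100, 200))                      # 100
print(fg_2d(50, 0.5, 0, 100, 0.0, 1.0, 200, 50))   # 50*100 + 25 = 5025
```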

Aggregation Function. Figure 4b shows the resulting image that the visualization client derived from the reduced data, and Figure 4c shows the resulting image derived from the raw time series data. For this line chart example, the chosen averaging aggregation function significantly changes the perceivable shape of the time series. The quality of the visualization heavily depends on the type of aggregation function that is used to compute the aggregated values for each interval of the grouped data. In Section 4, we will discuss the utility of different types of aggregation functions for the purpose of visualization-driven data reduction.
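The shape distortion caused by an averaging aggregation function can be seen on toy data (the values are ours): within one pixel column, avg flattens exactly the extrema that a min/max-based aggregation would keep.

```python
# Values of one pixel-column group, containing a spike and a dip.
group = [10.0, -80.0, 95.0, 12.0]

avg = sum(group) / len(group)
print(avg)                      # 9.25: the spike and the dip vanish
print(min(group), max(group))   # -80.0 95.0: the extrema are preserved
```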

3 Data Visualization

There exist dozens of ways to visualize numerical data, but only a few of them work well with large data volumes. Bar charts, pie charts, and many other simple chart types consume too much space per record, i.e., per rendered shape [12]. The most common charts used for high-volume numerical data are shown in Figure 5. These are line charts and scatter plots, where a single data point can be presented using only a few pixels. Regarding space efficiency, these two basic chart types are only surpassed by space-filling approaches [25].

[Figure 5 sketch: a) line chart, b) scatter plot, and c) space-filling visualization, each plotting value over time.]

Fig. 5 Visualizations for high-volume data sets.

3.1 Visual Scalability

Line charts and scatter plots, but also bar charts, are basic components of many visualization systems. For example, they are an integral part of tools like Tableau, for which Mackinlay et al. systematically defined a comprehensive collection of composite chart types [31]. Independently of their applicability for high-volume data, we consider these chart types as a reference list that our VDDA approach has to cover. For completeness, we add space-filling visualizations to this list, even if they are not very common in RDBMS-based visualization systems. However, they are the most space-efficient way to display large data volumes and thus cannot be neglected. All supported chart types are listed and illustrated in Figure 6. For a detailed description of each chart type, please consult Mackinlay's paper [31]. For the considered chart types, Figure 6 also lists a visual scalability estimate, considering an UltraHD screen with 3840×2160 pixels.


Chart Type          Max. Number of Displayed Records
Text Table          384*216 cells, 83k
Aligned Bars        3840 cols * 216 rows, 829k
Stacked Bars        3840 cols * 20 stacks, 77k
Discrete Lines      3840 cols * 20 series, 77k
Scatter Plot        3840*2160 px, 8.29M
Gantt Chart         3840 cols * 216 lanes, 829k
Heat Map            960*432 cells, 500k
Highlight Table     384*216 cells, 83k
Side-by-Side Bars   3840 cols, 3840
Measure Bars        3840 cols, 3840
Continuous Lines    3840 cols * 20 series, 77k
Circle Chart        384 cols * 2160 px, 829k
Scatter Matrix      3840*2160 px, 8.29M
Histogram           3840 cols, 3840
Space-Filling Vis.  3840*2160 px, 8.29M

Fig. 6 Visualizations used by visual data analysis tools [31] and their estimated visual scalability.

Visual Scalability is a measure for how many unique data points a visualization can display unambiguously using the limited number of w · h pixels of the visualization canvas.

The visual scalability estimation reveals that some chart types can theoretically display large amounts of data unambiguously. For instance, a scatter plot with w = 3840 and h = 2160 can display a maximum number of 8.29 million records, when using single pixels to mark each record and with the underlying data set containing at least one record in the data range corresponding to each pixel. However, since most data sets are not distributed homogeneously, all but the space-filling visualizations will be subject to a lot of white space and can display significantly less data than the theoretical maximum of 8.29 million records; particularly if the chart uses large mark types like the bars of a bar chart. For these charts, an upstream data reduction can be most beneficial, even if the underlying data set only has a few hundred thousand records.

Note that devices with high display resolutions, e.g., 3840 × 2160 pixels, and mobile devices in general often use logical, device-independent pixels (dp) to position elements on the screen. On such devices, when rendering geometric shapes, e.g., to the 320 × 240 dp of an HTML5 Canvas, the browser internally uses the final physical pixel resolution, e.g., 640 × 480 px, to maintain the color values of each final pixel on the screen.

On such devices, the visualization client must use the physical width and height of the actual screen pixels to appropriately parametrize the VDDA queries.

3.2 The Importance of Line Charts

Line charts are the most commonly used type of visualization for displaying continuous signals. Drawing a line between each two consecutive points visually indicates the continuity of the signal to the viewer (see the line chart in Figure 5a). This is not the case when rendering the points separately using disconnected marks (see the scatter plot in Figure 5b).

Even though line charts are ubiquitous, a common misconception is that selecting one tuple per horizontal pixel is sufficient to visualize a larger data set [5]. As we will show in the following Section 4, this assumption is not true, and it thus requires an in-depth analysis of the line rendering process from a data reduction point of view. The complexity of line rendering was one of our main motivations for the focus on line charts in the original paper [23]. The extensions in this paper, in Sections 5 and 6, show that most other chart types follow a simpler rendering process, usually relying on homogeneously shaped marks like upright rectangles, restrictively set up in a few grid cells or positioned freely in the 2D pixel space of the visualization canvas.

4 Data Reduction Operators for Line Charts

The goal of our query rewriting system is to apply visualization-driven data reduction in the database by means of relational operators, i.e., using data aggregation. In the literature, we did not find any comprehensive discussion that describes the effects of data aggregation on rasterized line visualizations. Most time series dimensionality reduction techniques [16] are too generic and are not designed specifically for line charts. We will now remedy this shortcoming, describe several classes of operators that we considered and evaluated for data reduction, and discuss their utility for line visualizations.

4.1 Visualization-Driven Time Series Aggregation

For line charts, we model data reduction using a time-based grouping, aligning the time intervals with the pixel columns of the visualization. For each interval, and thus for each pixel column, we can compute aggregated values using one of the following options.
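The time-based grouping above can be sketched in a few lines of Python — a minimal illustration (not the paper's implementation), assuming a chart width of w pixel columns and a queried time range [t1, t2]:

```python
def pixel_column(t, t1, t2, w):
    """Map a timestamp t in [t1, t2] to a pixel-column group key,
    following the rounding-based grouping function used by the queries."""
    return round(w * (t - t1) / (t2 - t1))

# timestamps spanning a 4-pixel-wide chart fall into columns 0..4
keys = [pixel_column(t, 0, 100, 4) for t in (0, 10, 30, 55, 80, 100)]
print(keys)  # -> [0, 0, 1, 2, 3, 4]
```

Note that rounding (rather than truncating) produces up to w + 1 keys; the first and last "column" each cover only half an interval.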


Normal Aggregation. A simple form of data reduction is to compute an aggregated value and an aggregated timestamp using the aggregation functions min, max, avg, median, medoid, or mode. The resulting data reduction queries on a time series relation T(t, v), using a grouping function fg and two aggregation functions ft and fv, can be defined in relational algebra.1

fg(t)G ft(t),fv(v) (T)    (4)

We already used this simple form of aggregation in our query rewriting example (Figure 4), selecting a minimum (first) timestamp and an average value to model a piece-wise aggregate approximation (PAA) [26]. But any averaging function, i.e., using avg, median, medoid, or mode, will significantly distort the actual shape of the time series (see Figure 4b vs. 4c). To preserve the shape of the time series, we need to focus on the extrema of each group. For example, we want to select those tuples that have the minimum value vmin or maximum value vmax per group. Unfortunately, the grouping semantics of the relational algebra does not allow selection of non-aggregated values.

Value Preserving Aggregation. To select the corresponding tuples, based on the computed aggregated values, we need to join the aggregated data again with the underlying time series. Therefore, we replace one of the aggregation functions with the time-based group key (result of fg) and join the aggregation results with T on that group key and on the aggregated value or timestamp. In particular, the following query

πt,v(T ⋈ fg(t)=k ∧ v=vg (fg(t)G k←fg(t), vg←fv(v) (T)))    (5)

selects the corresponding timestamps t for each aggregated value vg = fv(v), and the following query

πt,v(T ⋈ fg(t)=k ∧ t=tg (fg(t)G k←fg(t), tg←ft(t) (T)))    (6)

selects the corresponding values v for each aggregated timestamp tg = ft(t). Note that these queries may select more than one tuple per group, if there are duplicate values or timestamps per group. However, in most of our high-volume time series data sources, timestamps are unique and the values are real-valued numbers with multiple decimal places, such that the average number of duplicates is less than one percent of the overall data. In scenarios with more significant duplicate ratios, the described queries (5) and (6) need to be encased with additional compensating aggregation operators to ensure appropriate data reduction rates.

Sampling. Using the value preserving aggregation we can express simple forms of systematic sampling, e.g.,

1 We use the common relational algebra notations π for projection, σ for selection, and denote data aggregation as <GroupingFunctions>G<AggregationFunctions>.

selecting every first tuple per group using the following query.

πt,v(T ⋈ fg(t)=k ∧ t=tmin (fg(t)G k←fg(t), tmin←min(t) (T)))

For query level random sampling, we can also combine the value preserving aggregation with the SQL 2003 concept of the TABLESAMPLE or a selection operator involving a random() function. While this allows us to conduct data sampling inside the database, we will show in our evaluation that these simple forms of sampling are not appropriate for line visualizations.

Composite Aggregations. In addition to the described queries (4), (5), and (6) that yield a single aggregated value per group, we also consider composite queries with several aggregated values per group. In relational algebra, we can model such queries as a union of two or more aggregating subqueries that use the same grouping function. Alternatively, we can modify the queries (5) and (6) to select multiple aggregated values or timestamps per group before joining again with the base relation. We then need to combine the join predicates, such that all different aggregated values or timestamps are correlated to either their missing timestamp or their missing value. The following MinMax aggregation is an example of a composite aggregation.

MinMax Aggregation. To preserve the shape of a time series, the first intuition is to group-wise select the vertical extrema, which we denote as min and max tuples. This is a composite, value-preserving aggregation, exemplified by the SQL query in Figure 7a. The query projects existing timestamps and values from a time series relation Q, after joining it with the aggregation results Qa. Q is defined by an original query, as already shown in Figure 4a. The group keys k are based on the rounded results of the (horizontal) geometric transformation of Q to the visualization's coordinate system. Finally, the join of Qa with Q is based on the aggregation group keys k and on matching the values v in Q

either with vmin or vmax from Qa. Figure 7b shows the resulting visualization, and we now observe that it closely matches the expected visualization of the raw data in Figure 7c. In Section 4.3, we later discuss the remaining pixel errors, indicated by arrows in Figure 7b.

4.2 The M4 Aggregation

The composite MinMax aggregation focuses on the vertical extrema of each pixel column, i.e., of each corresponding time span. There are already existing approaches for selecting extrema for the purpose of data reduction and data analysis [17]. But most of them only


a) value-preserving MinMax aggregation query

SELECT t,v FROM Q JOIN
  (SELECT round($w*(t-$t1)/($t2-$t1)) as k,  -- define key
          min(v) as v_min, max(v) as v_max   -- get min,max
   FROM Q GROUP BY k) as Q_a                 -- group by k
ON k = round($w*(t-$t1)/($t2-$t1))           -- join on k
AND (v = v_min OR v = v_max)                 -- &(min|max)

b) resulting image c) expected image

Fig. 7 MinMax query and resulting visualization.
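The MinMax query of Figure 7a can be tried out against an in-memory SQLite database — a sketch (not the paper's system) with a toy relation Q(t, v), a chart width of w = 2 pixel columns, and the time range [0, 6]:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Q (t REAL, v REAL)")
con.executemany("INSERT INTO Q VALUES (?, ?)",
                [(0, 5), (1, 9), (2, 1), (3, 4), (4, 7), (5, 2), (6, 6)])

w, t1, t2 = 2, 0, 6  # chart width in pixel columns, queried time range
rows = con.execute("""
    SELECT t, v FROM Q JOIN
      (SELECT round(? * (t - ?) / (? - ?)) AS k,  -- group key per pixel column
              min(v) AS v_min, max(v) AS v_max    -- vertical extrema per group
       FROM Q GROUP BY k) AS Q_a
    ON round(? * (t - ?) / (? - ?)) = Q_a.k       -- join tuples to their group
    AND (v = Q_a.v_min OR v = Q_a.v_max)          -- keep only min/max tuples
    ORDER BY t
""", (w, t1, t2, t1, w, t1, t2, t1)).fetchall()
print(rows)  # -> [(0.0, 5.0), (1.0, 9.0), (2.0, 1.0), (4.0, 7.0), (5.0, 2.0), (6.0, 6.0)]
```

SQLite's round() returns REAL values, so keys and result tuples come back as floats; any RDBMS with comparable round semantics behaves analogously.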

a) value-preserving M4 aggregation query

SELECT t,v FROM Q JOIN
  (SELECT round($w*(t-$t1)/($t2-$t1)) as k,  -- define key
          min(v) as v_min, max(v) as v_max,  -- get min,max
          min(t) as t_min, max(t) as t_max   -- get 1st,last
   FROM Q GROUP BY k) as Q_a                 -- group by k
ON k = round($w*(t-$t1)/($t2-$t1))           -- join on k
AND (v = v_min OR v = v_max OR               -- &(min|max|
     t = t_min OR t = t_max)                 --  1st|last)

b) resulting image == expected image

Fig. 8 M4 query and resulting visualization.

partially consider the implications for data visualization and neglect the final projection of the data to discrete screen pixels.

A line chart that is based on a reduced data set will always omit lines that would have connected the not selected tuples, and it will always add new approximating lines to bridge the not selected tuples between two consecutive selected tuples. The resulting errors (in the real-valued visualization space) are significantly reduced by the final discretization process of the drawing routines. This effect is the underlying principle of the proposed visualization-driven data reduction.

Intuitively, one may expect that selecting the minimum and maximum values, i.e., the tuples (tbottom, min(v)) and (ttop, max(v)), from exactly w groups, is sufficient to derive a correct line visualization. This intuition is elusive, and this form of data reduction – provided by the MinMax aggregation – does not guarantee an error-free line visualization of the time series. It ignores the important first and last tuples of each group. We now introduce the M4 aggregation that additionally selects these first and last tuples (min(t), vfirst) and (max(t), vlast). In Section 4.3, we then discuss how M4 surpasses the MinMax intuition.

M4 is a composite value-preserving aggregation (see Section 4.1) that groups a time series relation into w equidistant time spans, such that each group exactly corresponds to a pixel column in the visualization. For each group, M4 then computes the aggregates min(v), max(v), min(t), and max(t) – hence the name M4 – and then joins the aggregated data with the original time series, to add the missing timestamps tbottom and ttop and the missing values vfirst and vlast.

In Figure 8a, we present an example query using the M4 aggregation. This SQL query is very similar to the MinMax query in Figure 7a, adding only the min(t) and max(t) aggregations and the additional join predicates based on the first and last timestamps tmin and tmax. Figure 8b depicts the resulting visualization, which is now equal to the visualization of the unreduced underlying time series.

Complexity of M4. The required grouping and the computation of the aggregated values can be computed in O(n) for the n tuples of the base relation Q. The subsequent equi-join of the aggregated values with Q requires matching the n tuples in Q with 4 · w aggregated tuples, using a hash join in O(n + 4 · w), but w does not depend on n and is inherently limited by physical display resolutions, e.g., w = 5120 pixels for the latest WHXGA displays. Therefore, the described M4 aggregation has a complexity of O(n).
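The O(n) grouping-and-aggregation idea behind M4 can be sketched as a single hash-grouping pass in Python (a minimal sketch, not the relational implementation; the join-back is replaced by keeping whole tuples):

```python
def m4(points, t1, t2, w):
    """Reduce a time series [(t, v), ...] to at most four tuples per pixel
    column: the min/max-value tuples and the first/last tuples of each group."""
    groups = {}
    for t, v in points:                          # one pass over the n input tuples
        k = round(w * (t - t1) / (t2 - t1))      # pixel-column group key
        g = groups.setdefault(k, {"vmin": (t, v), "vmax": (t, v),
                                  "first": (t, v), "last": (t, v)})
        if v < g["vmin"][1]: g["vmin"] = (t, v)
        if v > g["vmax"][1]: g["vmax"] = (t, v)
        if t < g["first"][0]: g["first"] = (t, v)
        if t > g["last"][0]: g["last"] = (t, v)
    # union of the (up to) four selected tuples per group, in time order
    return sorted({p for g in groups.values() for p in g.values()})

series = [(0, 5), (1, 9), (2, 1), (3, 4), (4, 7), (5, 2), (6, 6)]
print(m4(series, 0, 6, 2))  # at most 4 tuples per pixel column survive
```

With rounding-based grouping the keys range over 0..w, so the output size is bounded by 4 · (w + 1) tuples, independent of n.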

4.3 Aggregation-Related Pixel Errors

In Figure 9, we compare three line visualizations: a) the schematic line visualization of a time series Q, b) the visualization of MinMax(Q), and c) the visualization of M4(Q). MinMax(Q) does not select the first and last tuples per pixel column, causing several types of line drawing errors.

In Figure 9b, the pixel (3,3) is not set correctly, since neither the start nor the end tuple of the corresponding line are included in the reduced data set. This kind of missing line error E1 is distinctly visible with time series that have a very heterogeneous time distribution, i.e., notable gaps, resulting in pixel columns not holding any data (see pixel column 3 in Figure 9b). A missing line error is often exacerbated by an additional false line error E2, as the line drawing still requires connecting two tuples to bridge the empty pixel column. Furthermore, both error types may also occur in neighboring pixel columns, because the inner-column lines do not always represent the complete set of pixels of a column. Additional inter-column pixels – below or above the inner-column pixels – can be derived from the inter-column lines. This will again cause missing or additional pixels if the reduced data set does not contain the correct tuples for the correct inter-column lines. In Figure 9b, the MinMax aggregation causes such an error E3 by setting the undesired pixel (1,2), derived from the


a) baseline vis. of Q   b) vis. of MinMax(Q)   c) vis. of M4(Q)
(The figure highlights the inter-group line, the inner-group lines, the background and foreground pixels, and the error locations E1, E2, and E3.)

Fig. 9 MinMax-related visualization error.

false line between the maximum tuple in the first pixel column and the (consecutive) maximum tuple in the second pixel column. Note that these errors are independent of the resolution of the desired raster image, i.e., of the chosen number of groups. If we do not explicitly select the first and last tuples, we cannot guarantee that all tuples for drawing the correct inter-column lines are included.

4.4 The M4 Upper Bound

Based on the observation of the described errors, the question arises if selecting only the four extremum tuples for each pixel column guarantees an error-free visualization. In the following, we prove the existence of an upper bound of tuples necessary for an error-free, two-color line visualization.

Definition 1. A width-based grouping of an arbitrary time series relation T(t, v) into w equidistant groups, denoted as G(T) = (B1, B2, ..., Bw), is derived from T using the surjective grouping function i = fg(t) = round(w · (t − tstart)/(tend − tstart)) to assign any tuple of T to the groups Bi. A tuple (t, v) is assigned to Bi if fg(t) = i.

Definition 2. An M4 aggregation GM4(T) selects the extremum tuples (tbottom_i, vmin_i), (ttop_i, vmax_i), (tmin_i, vfirst_i), and (tmax_i, vlast_i) from each Bi of the width-based grouping G(T).

Definition 3. A visualization relation V(x, y) with the attributes x ∈ N[1,w] and y ∈ N[1,h] contains all foreground pixels (x, y), representing all (black) pixels of all rasterized lines. V(x, y) contains none of the remaining (white) background pixels in N[1,w] × N[1,h].

Definition 4. A visualization operator visline(T) → V defines, for all tuples (t, v) in T, the corresponding foreground pixels (x, y) in V.

Thereby, visline first uses the linear transformation functions fx and fy to rescale all tuples in T to the coordinate system R[1,w] × R[1,h] (see Section 2) and then tests for all discrete pixels and all non-discrete lines – defined by the consecutive tuples of the transformed time series – if a pixel is on the line or not on the line. For brevity, we omit a detailed description of line rasterization [4, 6], and assume the following two properties to be true.

Fig. 10 Illustration of Theorem 1: inner-column pixels depend only on the top and bottom pixels, derived from the min and max tuples; all non-inner-column pixels can be derived from the first and last tuples.

Lemma 1. A rasterized line has no gaps.

Lemma 2. When rasterizing an inner-column line, no foreground pixels are set outside of the pixel column.

Theorem 1. Any two-color line visualization of an arbitrary time series T is equal to the two-color line visualization of a time series T′ that contains at least the four extrema of all groups of the width-based grouping of T, i.e., visline(GM4(T)) = visline(T).

Figure 10 illustrates the reasoning of Theorem 1.

Proof. Suppose a visualization relation V = visline(T)

represents the pixels of a two-color line visualization of an arbitrary time series T. Suppose a tuple can only be in one of the groups B1 to Bw that are defined by G(T) and that correspond to the pixel columns. Then there is only one pair of consecutive tuples pj ∈ Bi and pj+1 ∈ Bi+1, i.e., only one inter-column line between each pair of consecutive groups Bi and Bi+1. But then, all other lines defined by the remaining pairs of consecutive tuples in T must define inner-column lines.

As of Lemmas 1 and 2, all inner-column pixels can be defined from knowing the top and bottom inner-column pixels of each column. Furthermore, since fx and fy are linear transformations, we can derive these top and bottom pixels of each column from the min and max tuples (tbottom_i, vmin_i), (ttop_i, vmax_i) for each group Bi. The remaining inter-column pixels can be derived from all inter-column lines. Since there is only one inter-column line pj pj+1 between each pair of consecutive groups Bi and Bi+1, the tuple pj ∈ Bi is the last tuple (tmax_i, vlast_i) of Bi and pj+1 ∈ Bi+1 is the first tuple (tmin_i+1, vfirst_i+1) of Bi+1.

Consequently, all pixels of the visualization relation V can be derived from the tuples (tbottom_i, vmin_i), (ttop_i, vmax_i), (tmax_i, vlast_i), (tmin_i, vfirst_i) of each group Bi, i.e., V = visline(GM4(T)) = visline(T). □

Using Theorem 1, we can moreover derive

Theorem 2. There exists an error-free two-color line visualization of an arbitrary time series T, based on a subset T′ of T, with |T′| ≤ 4 · w.

No matter how big T is, selecting the correct 4 · w tuples from T allows us to create a pixel-perfect line visualization of T. Clearly, for the purpose of line visualization, the M4 aggregation is a data-efficient and predictable form of data reduction.

4.5 Comparison to Existing Approaches

In Section 4.1, we discussed averaging, systematic sampling, and random sampling as measures for data reduction. They provide some utility for time series dimensionality reduction in general, and may also provide useful data reduction for certain types of visualizations. However, for rasterized line visualizations, we have now proven the necessity of selecting the correct min, max, first, and last tuples. As a result, any averaging, systematic, or random sampling approach will fail to obtain good results for line visualizations, if it does not ensure that these important tuples are included.

Compared to our approach, the most competitive approaches, as found in the literature [16], are time series dimensionality reduction techniques based on line simplification. For instance, Fu et al. reduce a time series based on perceptually important points (PIP) [17]. Such approaches usually apply a distance measure, defined between each three consecutive points of a line, and try to minimize this measure ε for a selected data reduction rate. This min-ε problem is complemented by a min-# problem, where the minimum number of points needs to be found for a defined distance ε. However, due to the O(n²) worst-case complexity [27] of this optimization problem, there are many heuristic algorithms for line simplification. The fastest class of algorithms works sequentially in O(n), processing every point of the line only once. The two other main classes have a complexity of O(n log(n)) and either merge points until some error criterion is met (bottom up) or split the line (top down), starting with an initial approximating line, defined by the first and last points of the original line. The latter approaches usually provide a much better approximation of the original line [36]. Our aggregation-based data reduction, including M4, has a complexity of O(n).

Approximation Quality. To determine the approximation quality of a derived time series, most time series dimensionality reduction techniques use a geometric distance measure, defined between the segments of the approximating line and the (removed) points of the original line. The most common measure is the Euclidean (perpendicular) distance, as shown in Figure 11a, which is the shortest distance of a point pj to a line segment pi pk. Other commonly applied distance measures are the area of the triangle (pi, pj, pk) (Figure 11b) and the vertical distance of pj to pi pk (Figure 11c).

a) perpendicular distance   b) triangle area   c) vertical distance
(Each panel shows the original line and the approximating line.)

Fig. 11 Distance measures for line simplification.

However, when visualizing a line using discrete pixels, the main shortcoming of the described measures is their generality. They are geometric measures, defined in R × R, and do not consider the discontinuities in discrete 2D space, as defined by the cutting lines of two neighboring pixel rows or two neighboring pixel columns. For our approach, we consult the actual visualization to determine the approximation quality of the data reduction operators.

Visualization Quality. Two images of the same size can be easily compared pixel by pixel. A simple, commonly applied error measure is the mean square error

MSE = 1/(w·h) · Σx=1..w Σy=1..h (Ix,y(V1) − Ix,y(V2))²,

with Ix,y defining the luminance value of a pixel (x, y) of a visualization V. Nonetheless, Wang et al. have shown [39] that MSE-based measures, including the commonly applied peak-signal-to-noise ratio (PSNR) [9], do not approximate the model of human perception very well, and developed the Structural Similarity Index (SSIM). For brevity and due to the complexity of this measure, we have to pass on without a detailed description. The SSIM yields a similarity value between 1 and −1. The related normalized distance measure between two visualizations V1 and V2 is defined as:

DSSIM(V1, V2) = (1 − SSIM(V1, V2)) / 2    (7)

We use DSSIM to evaluate the quality of the visualization of the reduced data set, compared to the original visualization of the underlying unreduced data set.
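The pixel-wise MSE and the DSSIM normalization of Eq. (7) are easily stated in code; SSIM itself needs a full implementation (e.g., from an image library), so the sketch below only covers the two simple formulas, assuming luminance values in [0, 1]:

```python
def mse(img1, img2):
    """Mean square error between two equally sized luminance images,
    given as lists of pixel rows."""
    w, h = len(img1[0]), len(img1)
    return sum((img1[y][x] - img2[y][x]) ** 2
               for y in range(h) for x in range(w)) / (w * h)

def dssim(ssim_value):
    """Normalized distance from an SSIM similarity in [-1, 1], Eq. (7)."""
    return (1 - ssim_value) / 2

v1 = [[1, 0], [0, 1]]  # 2x2 two-color visualization
v2 = [[1, 0], [1, 1]]  # one differing pixel
print(mse(v1, v2))               # -> 0.25
print(dssim(1.0), dssim(-1.0))   # identical -> 0.0, maximally different -> 1.0
```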

5 Display Units

The visual scalability, i.e., the maximum number of displayable records for each chart type, depends on the maximum number of display units of that chart type, and thus on how each chart type separates the pixel space of the visualization canvas into such display units.

Display Unit. A display unit of a visualization displays one or several computed values to represent a potentially larger subset of records of the entire data set. The number of display units is limited by the size and structure of the visualization. A unique display unit is defined by one or several discrete coordinate values of one or several visualization-specific layout dimensions.

Examples of display units are the following.


Pixel. The display units of a scatter plot are the w · h single pixels (x, y) within the layout dimensions x ∈ N[1,w] and y ∈ N[1,h] of the scatter plot. The marks, i.e., the geometric shapes, in a scatter plot are freely positioned in the 2D pixel space, starting at any pixel (x, y).

Table cell. The cell of a table presents a text string to represent the data records underlying that cell. The unique coordinate (kx, ky) of a table cell depends on the size of the table and the size of the cells, i.e., kx ∈ N[1,w/wcell] and ky ∈ N[1,h/hcell].

Full-height bar. The set of wbar · hbar pixels for a full-height bar, including margins, is the display unit of a non-stacked bar chart. It can present two values to represent the underlying group of records, using a variable height of the rendered bar and the color of the bar. The unique coordinate k of a display unit depends on the width of the chart and the width, including margins, of the bars, i.e., k ∈ N[1,w/wbar].
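The display-unit coordinates of these examples reduce to integer division — hypothetical helpers (not from the paper), assuming cell and bar sizes in pixels include their margins:

```python
def table_cell(x, y, w_cell, h_cell):
    """Cell coordinate (kx, ky) of the pixel (x, y), for table cells
    of w_cell x h_cell pixels."""
    return (x // w_cell, y // h_cell)

def bar_index(x, w_bar):
    """Display unit k of a full-height bar of width w_bar (incl. margins)."""
    return x // w_bar

print(table_cell(130, 45, 64, 20))  # -> (2, 2)
print(bar_index(130, 16))           # -> 8
```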

A display unit is not necessarily the same as a mark, used to represent a data point. For instance, the marks in a scatter plot, such as rectangles or circles, are usually larger than one pixel and often require many neighboring pixels to be filled. Thereby, they overplot already drawn marks or parts of marks.

Nevertheless, for most visualizations, there is a strong correlation between marks and display units, and we consequently denote the maximum width and height, including margins, of bars, cells, and marks in general, as wbar, hbar, wcell, hcell, wmark, and hmark. If not stated otherwise, these measures correspond to the sizes of the underlying display units.

While rendering the marks, the data records are mapped to discrete display units, i.e., the visualization client transforms and projects the underlying data dimensions to the layout dimensions. In the described query-rewriting system, the data dimensions are represented by the data columns of the original query Q. The data columns are transformed and projected, already in the database, using the visualization-specific grouping functions, introduced in Section 2.3 and discussed more extensively in Section 6.1.

In Figure 12, we classify the considered, specific chart types into several chart type categories, based on the way they define display units in the 2D pixel space of a visualization. The resulting categories have the following properties.

1D Charts are basic visualizations, where the visual marks are laid out horizontally or vertically along one layout dimension k, with k ∈ N[1,w/wmark] or k ∈ N[1,h/hmark]. A small, visualization-specific set of marks is drawn inside every single (or between every two) horizontal or vertical display units.

Chart Type Category and Display Units:
– 1D Charts: one or several pixel columns or rows (size depends on grid size or mark size)
– 2D Plots (overlapping shapes): single pixels (x, y correlate to two data columns)
– Space-Filling Visualizations: single pixels (x, y defined by single data column)
– 2D Grids (no overlap): grid cells (one cell may display more than one value to summarize the represented data)
– Stacked Marks: 1. non-stacking units (lanes, cells, rows, etc.); 2. single pixels (in the stacking direction)
– Stacked Charts: 1. stack cells (horizontal and/or vertical); 2. original display units (of stacked charts)

Fig. 12 Display units of different chart categories.

2D Plots are basic visualizations, where the visual marks are laid out horizontally and vertically along two layout dimensions x and y, with x ∈ N[1,w] and y ∈ N[1,h]. At most one mark or fraction of a mark is visible inside every display unit (pixel). The marks of 2D plots (e.g., a cross in a scatter plot or a filled circle in a bubble chart) are often larger than one pixel and may hide several neighboring pixels.

Space-Filling Visualizations are basic visualizations, where the marks and display units are single pixels, ordered sequentially in the 2D pixel space, along one layout dimension, i.e., the unique coordinate of a display unit is a single discrete value k, with k ∈ N[1,w·h]. At most one mark is drawn for every display unit.

There are different space-filling algorithms [25]. The most common one is to render the pixels, for a list of up to w · h tuples, from top to bottom and from left to right, as illustrated in Figure 5c.
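Under one reading of this ordering — filling each pixel column top to bottom, advancing columns left to right — the mapping from the sequential key k to a pixel can be sketched as follows (a hypothetical helper; other space-filling orderings exist):

```python
def space_filling_pixel(k, w, h):
    """Map the sequential display-unit key k (0-based) to a pixel (x, y),
    filling each pixel column top to bottom, columns from left to right."""
    assert 0 <= k < w * h
    return (k // h, k % h)

# a 3x2 canvas is filled (0,0), (0,1), (1,0), (1,1), (2,0), (2,1)
print([space_filling_pixel(k, 3, 2) for k in range(6)])
```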

2D Grids are basic visualizations, where the marks may not leave the grid cells. Overplotting may only occur inside each grid cell. The display units (kx, ky), i.e., the grid cells, are defined by a vertical and horizontal intersection of the visualization canvas into groups of equal size, i.e., kx ∈ N[1,w/wcell] and ky ∈ N[1,h/hcell].

Stacked Charts are a combination of basic charts, e.g., 1D Charts or 2D Plots, in a 2D Grid. The charts can be combined by either stacking the complete charts (e.g., Aligned Bars) or by separating the display units of the basic charts and regrouping these units, according to their correlation in one of the shared data dimensions (e.g., Side-by-Side Bars). As a result, the nchart · nx · ny display units of a stacked chart are the union of the nchart separate display units of each of the nx · ny stacked charts. A display unit is defined by the coordinate (kchart, kx, ky), with kx ∈ N[1,w/wchart], ky ∈ N[1,h/hchart], kchart ∈ N[1,nchart], and nchart depending on the type of the stacked charts.

Stacked Marks are composite visualizations that spatially compose single marks of several correlated data dimensions. Thereby, the position of the marks of one data dimension depends on the position of the marks of the other data dimensions. The stacked marks are displayed inside the coarse display units of the non-stacked chart, e.g., the cell kchart of a bar chart, with kchart ∈ N[1,w/wbar]. The coarse units are subsequently separated per pixel row or pixel column. As a result, the display units of stacked marks are identified by (kchart, kstack), with kstack ∈ N[1,h] or kstack ∈ N[1,w].

6 Visualization-Driven Data Aggregation

Similar to the MinMax aggregation and the M4 aggregation for line charts, we can define an aggregation for each of the considered chart types listed in Figure 6. In the following, we define a general aggregation model, using an SQL query template that can be used for all types of visualizations. Thereafter, we refine this aggregation model for each considered chart type category, defined in Section 5, and eventually present a visualization-specific data aggregation operator for each supported chart type.

Figure 13 shows a complete VDDA query to compute a data-reduced subset for a bar chart. In the following, we use this query as a template, replacing the visualization-specific parts of this query, i.e., Abar and Qbar, with appropriate subqueries for each visualization. Note that this query is equal to the data reduction subquery QR as defined by the query rewriting process (cf. Section 2.3), i.e., it only models the visualization-specific data reduction and does not include the cardinality check and the conditional execution.

The query template contains the following subqueries in the form of common table expressions.

Q: A subquery containing the original query, as defined by the visualization client, incorporating all parameters that are not explicitly exposed to the query rewriter.

Qb: A subquery to compute the boundaries of the requested data range. This subquery computes the boundaries for all data dimensions that have to be projected to the corresponding layout dimensions of the visualization.

Qg: A subquery that adds an additional column k to Q, containing the group key that represents the corresponding display unit for each record.

A<chart>: A chart-specific subquery that simulates or approximates the visual data reduction of the rendering process. This subquery has as output the aggregated values and the corresponding group key. The output relation of A<chart> is not yet compatible with the visualization data model.

Q<chart>: A chart-specific subquery that correlates the aggregated values with the original data. This query yields a relation that is compatible with the corresponding visualization model.

The final, cleaned-up result set, excluding all auxiliary columns, can be derived from Q<chart> by projecting all data columns that correspond to data dimensions to be displayed by the visualization. The complete query QR implements a value-preserving aggregation, as already defined in Section 4.

6.1 Multidimensional Grouping

To compute the group keys for most visualizations, one or two numerical columns are projected to the layout dimensions of the visualization, using one of the grouping functions (2) and (3) introduced in Section 2.3. The required parameters are the numbers of display units for each layout direction, derived from the size of the visualization canvas and the size of the display units. The group key subquery does not require the originally selected data range to be exposed to the query rewriter, since these ranges can be computed by the boundary subquery Qb and thus be used by the grouping subquery Qg. Each record in Qg then contains the original record and one of the group keys k1 to kn, with n being the total number of display units. Independently of how many layout dimensions are considered for the visualization, the grouping function ensures that only one key is computed for every record, i.e., fg : R × ... × R → N. We can generalize the grouping functions to support an arbitrary number of dimensions. Let fi(ai) be a grouping function for distributing the values of a single data dimension ai to the ni display units of a layout dimension.

fi(ai) = round(ni · (ai − ai,min)/(ai,max − ai,min))

Then, using an fi for any of the D considered layout dimensions, we can define multidimensional grouping functions as follows.

2D : fg(a1, a2) = n1 · f2(a2) + f1(a1)

3D: fg(a1, a2, a3) = n1 · n2 · f3(a3) + n1 · f2(a2) + f1(a1)


WITH Q AS (SELECT a,b FROM T WHERE a > $a1 AND a < $a2), -- original query
Q_b AS (SELECT min(a) as a1, max(a) as a2 FROM Q),       -- computed boundaries
Q_g AS (SELECT                                           -- compute group keys using
    round( $n * (a - (SELECT a1 FROM Q_b))               -- the number of display units $n
         / (SELECT a2 - a1 FROM Q_b)                     -- and the boundaries in Q_b
    ) as k, a, b FROM Q),                                -- add key column to data columns
-- <bar-chart-subqueries> --
A_bar AS (SELECT k, max(b) b_max FROM Q_g GROUP BY k),   -- compute agg. values
Q_bar AS (SELECT * FROM Q_g JOIN A_bar                   -- correlate agg. with orig. data
    ON A_bar.k = Q_g.k                                   -- via equi-join on the group key
   AND A_bar.b_max = Q_g.b)                              -- and the aggregated values
-- </bar-chart-subqueries> --
SELECT a,b FROM Q_bar                                    -- remove auxiliary columns

Fig. 13 VDDA query to compute the minimal data subset for displaying the underlying data as a bar chart using n bars.

DD : fg(a1, ..., aD) = ∑_{i=1}^{D} fi(ai) · ∏_{j=0}^{i−1} nj    (8)

The factors n1 to nD are the numbers of display units of each layout dimension. The constant n0 = 1 is required for formal correctness and does not influence the result.
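For illustration, the generalized grouping of Equation 8 can be sketched in a few lines of Python (our own illustrative simulation of the relational grouping, with hypothetical names; the paper's rewriter emits SQL instead):

```python
def group_key(values, mins, maxs, ns):
    """Compute the combined group key f_g(a_1, ..., a_D) of Eq. 8.

    values:     the D attribute values of one record
    mins, maxs: per-dimension value boundaries (as computed by Q_b)
    ns:         per-dimension numbers of display units n_1..n_D
    """
    key, stride = 0, 1  # stride = n_0 * ... * n_(i-1), with n_0 = 1
    for a, lo, hi, n in zip(values, mins, maxs, ns):
        f_i = round(n * (a - lo) / (hi - lo))  # per-dimension key f_i(a_i)
        key += f_i * stride
        stride *= n
    return key

# 2D example: n_1 = 4 horizontal and n_2 = 3 vertical display units
k = group_key((0.5, 0.0), mins=(0.0, 0.0), maxs=(1.0, 1.0), ns=(4, 3))
# f_1 = round(4*0.5) = 2, f_2 = round(3*0.0) = 0  ->  k = 0*4 + 2 = 2
```

Setting D = 1, 2, or 3 reproduces the specialized grouping functions given above.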

Example. The most complex grouping function that we consider is defined for scatter matrices, which split the visualization into several smaller canvases. Each canvas uses the entire assigned pixel sub range to lay out the geometric shapes of the scatter plots. Assuming that all sub canvases have the same width ws and height hs, we can compute a combined group key, using the following multidimensional grouping function.

fg(a, b, c, d) = ws · hs · (w/ws) · round( (h/hs) · (d − dmin)/(dmax − dmin) )
               + ws · hs · round( (w/ws) · (c − cmin)/(cmax − cmin) )
               + ws · round( hs · (b − bmin)/(bmax − bmin) )
               + round( ws · (a − amin)/(amax − amin) )

The data columns a and b define the x and y coordinates of the marks in each sub canvas. The data columns c and d determine which sub canvas each tuple corresponds to.
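As a sketch, the scatter matrix key above can be computed as follows (illustrative Python with our own function names; bounds holds the per-column (min, max) pairs as computed by Qb):

```python
def scatter_matrix_key(a, b, c, d, bounds, ws, hs, w, h):
    """Combined 4D group key for a scatter matrix: a, b position the mark
    inside a sub canvas of ws x hs pixels; c, d select one of the
    (w/ws) x (h/hs) sub canvases."""
    def f(value, col, n):  # per-dimension grouping f_i(a_i)
        lo, hi = bounds[col]
        return round(n * (value - lo) / (hi - lo))
    return (ws * hs * (w // ws) * f(d, "d", h // hs)
            + ws * hs * f(c, "c", w // ws)
            + ws * f(b, "b", hs)
            + f(a, "a", ws))

bounds = {col: (0.0, 1.0) for col in "abcd"}
scatter_matrix_key(0.5, 0.0, 0.0, 0.0, bounds, ws=10, hs=10, w=100, h=100)
# only the innermost term contributes: round(10 * 0.5) = 5
```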

Multidimensional grouping can be expressed as a relational operator and implemented using SQL, as exemplified in Figure 14, showing Qb and Qg for a basic scatter plot visualization. In the following, we define the subqueries Qb and Qg for each chart type category.

1D Charts require dividing one data dimension a of the selected data range into na equidistant intervals corresponding to either the horizontal or vertical display units of the chart, using the following subqueries.

Qb = Gmax(a),min(a)(Q)

Qg = πk←fg(a),a,b(Q)

-- boundary query Q_b --
SELECT min(a) as a1, max(a) as a2,
       min(b) as b1, max(b) as b2 FROM Q

-- group key query Q_g --
SELECT $n_v * round(
         $n_h * (a - (SELECT a1 FROM Q_b))
              / (SELECT a2 - a1 FROM Q_b)
       ) + round(
         $n_v * (b - (SELECT b1 FROM Q_b))
              / (SELECT b2 - b1 FROM Q_b)
       ) as k, a, b FROM Q

Fig. 14 2D Boundary and group key subqueries of a scatter plot.

2D Plots require dividing two data dimensions a and b of the selected data range into na horizontal intervals and nb vertical intervals, resulting in na · nb data sub ranges, corresponding to the display units of the plot. The boundary subquery must compute the boundary values for the horizontal and the vertical value ranges. The values na and nb usually correspond to the pixel width w and the pixel height h of the canvas of the plot. 2D plots use the following subqueries.

Qb = Gmax(a),min(a),max(b),min(b)(Q)

Qg = πk←fg(a,b),a,b,c(Q)

Space-Filling Visualizations use the same group key subquery Qg as 1D Charts, with the number of display units defined as na = w · h.

2D Grids use the same grouping and boundary subqueries as 2D Plots, with na defined by the number of columns and nb defined by the number of rows of the grid, i.e., na = w/wcell and nb = h/hcell.

Stacked Charts require an outer grouping similar to the grouping of 2D Grids and an inner grouping defined by the type of the stacked charts. The boundary and group key subqueries are defined as follows.

Qb = Gmax(a),min(a),max(b),min(b),max(c),min(c)(Q)


Qg = πk←fg(a,b,c),a,b,c,d(Q)

Stacked Marks require aligning or combining the value ranges of the correlated data dimensions, for instance, by addition of different stackable data columns

Qs = πa,b,c,d,s←(b+c+d)(Q)

or by computing a correlated aggregation of a stacked data column b over a non-stacked data column a.

Qs = πa,b,c,s(Q ⋈ Ga,s←sum(b)(Q))

The latter is required if several data subsets are stored in a single data column b. In this case, each data subset must be identified by a key, provided by another data column c. For each value of a, at most one value must exist in b per key in c, to ensure that the stack sums in s correctly simulate the stacking. The resulting boundary and group key subqueries for stacked marks are similar to those of 1D Charts.

Qb = Gmax(a),min(a)(Qs)

Qg = πk←fg(a),a,b,c,...,s(Qs)

However, they use Qs as a substitute for the original query. Qg moreover projects all data columns from Qs, including the stacking column s.
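The two variants of the stacking subquery Qs can be simulated as follows (an illustrative Python sketch with our own function names, not the paper's implementation):

```python
from collections import defaultdict

def stack_by_addition(rows):
    """Q_s = pi_{a,b,c,d,s<-(b+c+d)}(Q): row-wise sum of stackable columns."""
    return [(a, b, c, d, b + c + d) for (a, b, c, d) in rows]

def stack_by_correlated_sum(rows):
    """Q_s = pi_{a,b,c,s}(Q join G_{a, s<-sum(b)}(Q)): total stack height per
    a-value, when the stacked subsets share one column b, keyed by c."""
    sums = defaultdict(float)
    for a, b, c in rows:
        sums[a] += b  # aggregate sum(b) grouped by a
    return [(a, b, c, sums[a]) for (a, b, c) in rows]  # join back on a

rows = [(1, 2.0, "x"), (1, 3.0, "y"), (2, 5.0, "x")]
stack_by_correlated_sum(rows)
# -> [(1, 2.0, 'x', 5.0), (1, 3.0, 'y', 5.0), (2, 5.0, 'x', 5.0)]
```

Note how the correlated variant requires at most one b value per (a, c) pair, as stated above; otherwise the stack sum s would double-count tuples.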

6.2 Visualization-Specific Aggregation Functions

The aggregation functions for the considered chart types are similar to the MinMax or M4 aggregation, selecting one or more extremum tuples for each display unit. These extremum tuples often represent the biggest geometric shape rendered inside or starting at this display unit, on the visualization canvas. In most visualizations this biggest shape completely hides or visually dominates all smaller shapes in the same display unit. In these cases, the aggregation subquery A<chart> computes the maximum values for each group and for each data column that is relevant for the visual aggregation, i.e., for data dimensions that influence the overplotting. The correlation subquery Q<chart> subsequently selects the corresponding values of the other visualized dimensions.

In the following, we denote the combination of a maximum aggregation and the corresponding correlation subquery as Qmax. Using the relational algebra, we can model Qmax for any number of data columns a1 to aD of the original query Q, with any combination of visually aggregating columns g1 to gV as

Qmax(g1, ..., gV ; a1, ..., aD) = πa1,...,aD(Qg ⋈θ Amax)

with an aggregation subquery Amax as

Amax(g1, ..., gV ) = GkG←k, g1max←max(g1), ..., gVmax←max(gV )(Qg)

and an equi-join condition θ as

θ ← (k = kG ∧ (g1 = g1max ∨ ... ∨ gV = gVmax)).

The subquery Qmax joins the result of Amax with Qg on the group key k and on any of the aggregated data columns, and subsequently projects all original data columns from the join result. For instance, the combined aggregation and correlation Qmax(a, b; a, b, c) is equal to the following SQL query.

WITH A_max AS ( -- aggregation --
  SELECT k, max(a) AS a_max, max(b) AS b_max
  FROM Q_g GROUP BY k
) -- correlation --
SELECT a,b,c FROM Q_g JOIN A_max AS A
ON Q_g.k = A.k AND (b_max = b OR a_max = a)

Qmax and thus Amax are defined independently of whether or not any of the columns a1 to aD of Q are used by the group key subquery Qg to compute k. In the following, we define and explain a specific Qmax for all considered visualizations, excluding line charts. This implies that, for all but line charts, we can find an approximating maximum aggregation to simulate the overplotting behavior of the rendering process.

Scatter Plots lay out marks in two dimensions, using geometric shapes as marks that are defined relative to the center of the shape. The center and size of a shape determine how much the shape can overplot smaller, underlying shapes. When rendering the geometric shapes, the center coordinates are projected to discrete canvas pixels. As a result, the shapes of all records corresponding to a single canvas pixel will share this pixel as their center. Several shapes with the same center pixel can be visible, e.g., smaller shapes can be drawn on top of bigger shapes. The final visibility of a shape depends on the rendering order, which our query rewriter is not aware of. Based on this observation, and since the relational model uses sets rather than (ordered) lists, we use a simplified aggregation model for scatter plots. For each display unit, i.e., pixel, we select only the tuple that produces the biggest shape, using the subquery Qmax(c; a, b, c, d1, ..., dm), with a and b projecting to the coordinates x and y of a shape, c defining the size of the shape, and d1 to dm representing any other dimension to be visualized, e.g., using various types of shapes and colors. For scatter plots that do not map any data dimension to the size of the marks, i.e., using equally sized marks, we cannot compute a biggest shape. In this case, we use Qmax(a; a, b, d1, ..., dm), aggregating horizontally over a, computing the tuple with the maximum a value for each display unit, i.e., the last horizontal tuple per pixel. The last shape usually overplots all other previously rendered shapes starting at the same pixel.
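The combined Qmax behavior – the Amax aggregation followed by the correlating equi-join – can be mimicked in a few lines (illustrative Python; record layout and function names are our own):

```python
def q_max(rows, key, agg_cols):
    """Simulate Q_max(g_1..g_V; ...): per group, keep the tuples that attain
    the maximum in at least one visually aggregating column.

    rows:     list of dict records (already carrying their group key)
    key:      name of the group-key column k
    agg_cols: visually aggregating columns g_1..g_V
    """
    maxima = {}  # k -> {col: max value}, i.e. the A_max subquery
    for r in rows:
        m = maxima.setdefault(r[key], {c: r[c] for c in agg_cols})
        for c in agg_cols:
            m[c] = max(m[c], r[c])
    # correlation: equi-join on k, keep rows matching any aggregated maximum
    return [r for r in rows
            if any(r[c] == maxima[r[key]][c] for c in agg_cols)]

pts = [{"k": 0, "x": 1, "y": 2, "size": 3},
       {"k": 0, "x": 1, "y": 5, "size": 7},  # biggest shape in pixel group 0
       {"k": 1, "x": 2, "y": 1, "size": 2}]
q_max(pts, "k", ["size"])  # keeps the size-7 tuple for k=0, and k=1's only tuple
```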


Aligned Bar Charts are a visual combination of several bar charts in a 2D grid. Each coarse display unit, i.e., each basic bar chart, displays the values of a single data column. A bar can only overplot bars in the same sub chart. As a result, we can use separately aggregated values – one for each data column c1 to cm – to simulate the overplotting, using the subquery Qmultibar = Qmax(c1, ..., cm; a, b, c1, ..., cm), with a and b projecting to kx and ky of the 2D grid of the composite chart.

Side-by-Side Bar Charts are a visual combination of several bar charts. Each display unit, i.e., a group of bars, displays at most one value per correlated data set, i.e., per data column. Each correlated bar of a group of bars is overplotted separately. As a result, we can use separately aggregated values – one for each data column – to simulate the overplotting, using the subquery Qmultibar, as used for aligned bar charts.

Stacked Bar Charts are composite charts, showing one composite geometric shape per display unit. For stacked bar charts, Qmax requires Qg to contain a geometric composition column s that simulates the geometric stacking (cf. Section 6.1). This column is used by Qmax to compute the visual aggregation but is not included in the final query result, i.e., we use the subquery Qmax(s; a, b1, ..., bn) for stacked bar charts, with the stacking column s computed from the stacked data columns b1 to bn, and with a projecting to the layout dimension k of the basic bar chart.

Measure Bar Charts are basic bar charts that can visualize one additional numerical data dimension, using the color of the bar. For measure bars, we use the subquery Qmax(b; a, b, c), with a projecting to the layout dimension k of the basic bar chart, b defining the height and c defining the color of a rendered bar.

Text Tables define a 2D grid. Every grid cell can display one value. When rendering several tuples inside the same cell, the text strings overlay each other, making the visual result unreadable to the user. For overlaid texts there is no biggest geometric mark that dominates the other marks. Consequently, a value must be displayed that represents the data range of the cell semantically, rather than visually. Commonly used summarizing representatives are the minimum, maximum, first, last, and mean values. From these options, we prefer the maximum value as an appropriate visual aggregation, as it would correspond to the largest graphical shape that could be rendered in a table cell, instead of the textual value. To visually aggregate the data for text tables, we use the subquery Qmax(c; a, b, c), with the data columns a and b projecting to the vertical and horizontal layout dimensions kx and ky of the 2D grid defined by the table. An aggregated maximum value of the data column c is displayed inside the cells.

Heat Maps define a 2D grid. Each cell of the grid is used to display one value. For heat maps that use variable-sized rectangles inside each cell, the visibility of the resulting shapes depends on the rendering order. If the heat map uses colors instead of shapes to indicate a value, only the colored table cell itself is visible. In both cases, we again use the tuple with the maximum value to represent the data, i.e., the largest or “hottest” tuple of each grid cell, using the subquery Qmax(c; a, b), with a and b projecting to kx and ky of the 2D grid of the heat map. The aggregated values of the data column c determine the color of a cell, or the size of the shape to be rendered inside a cell.

Highlight Tables define a 2D grid, rendering a geometric shape and a text value in each grid cell. They are similar to text tables and heat maps. For highlight tables, we use the same subquery Qmax(c; a, b), as used for heat maps, to compute a maximum tuple per cell.

Circle Charts are simplified scatter plots. Instead of using pixels as display units, one of the layout dimensions kx or ky is defined by a coarse intersection of the visualization canvas, i.e., kx ∈ N[1,w/wcell] or ky ∈ N[1,h/hcell]. The size and shape of the marks are fixed, i.e., circle charts can display two dimensions less than scatter plots. Shapes with the same center pixel fully overplot each other. As a result, only one tuple per display column and per pixel row is required to draw a correct visualization. Which tuple is to produce the last rendered shape depends on the rendering order. This order is unknown to the query rewriter, and we again consider the last tuple as the dominant representative, using the subquery Qmax(a; a, b, c), with a and b defining the coordinates (x, ky) or (kx, y) of the circle and c defining the color.

Scatter Matrices are combinations of scatter plots, splitting the screen into several smaller plots. The aggregation model is the same as for basic scatter plots. We select the tuple producing the biggest shape, according to the sizing dimension e, using the subquery Qmax(e; a, b, c, d, e), with a and b projecting to the x and y coordinates of the shapes of the individual scatter plots and c and d projecting to the layout dimensions kx and ky of the 2D grid of the matrix. For scatter matrices with equally sized marks, we again compute the last tuples, using Qmax(a; a, b, c, d).

Histograms are basic bar charts, displaying the value distribution of one data column. The computation of the histogram values is subject to the original query Q. The visualization-driven aggregation is the same as that of a basic bar chart and uses the subquery Qbar = Qmax(b; a, b), also exemplified in Figure 13, with a projecting to the layout dimension k of the basic bar chart and b determining the height of a bar.


Space-Filling Visualizations sequentially order the w · h pixels of the canvas, according to a space-filling algorithm. Unlike scatter plots, the visualized data set can have at most two data columns, one defining the order of the pixels and the other determining the color of a pixel. Therefore, we can treat space-filling visualizations as bar charts with a height of h = 1 pixel and a width of w′ = w · h pixels. We simulate the rendering of space-filling visualizations using the basic bar chart subquery Qbar = Qmax(b; a, b).

Gantt Charts have several lanes, displaying variable-sized, horizontally stacked bars. New bars can start at any horizontal pixel, having a size between one pixel and the number of pixels between the starting pixel of the bar and the rightmost pixel of the canvas. Assuming tuples to be rendered from left to right, a bar starting in a specific pixel column will overplot previous bars that start or stretch into this pixel column. We approximate Gantt charts by selecting, for each lane and for each pixel column, the maximum, i.e., farthest stretching tuple, using the subquery Qmax(s; a, b, c), with a determining the coarse layout dimension k, i.e., the lanes, and b and c representing the start and end times of the time intervals to be displayed in the Gantt chart. The auxiliary column s of Qg determines the size of the bars and is computed similarly to that of stacked bar charts (cf. Section 6.1), using an additional subquery Qs = πa,b,c,s←(c−b)(Q) as input, instead of the original query Q.

Continuous Line Charts draw one line between every two pixel columns and one or more lines inside each pixel column. Overplotting occurs mainly inside each pixel column (cf. Section 4.2). As stated previously, the maximum aggregation is not a sufficient approximation for line charts. Consequently, we use an M4 aggregation AM4, in combination with a correlation subquery Qline, exemplified as follows.

WITH A_M4 AS (
  SELECT k, min(a) as a_min, max(a) as a_max,
         min(b) as b_min, max(b) as b_max
  FROM Q_g GROUP BY k
)
-- Q_line --
SELECT a,b FROM Q_g JOIN A_M4 as A
ON Q_g.k = A.k AND (a_max = a OR a_min = a OR
                    b_max = b OR b_min = b)

Discrete Line Charts can be treated as continuous line charts, having a smaller virtual width or height, derived from the size of the display units, i.e., w′ = w/wcell, or h′ = h/hcell.
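For comparison, the effect of AM4 and Qline can be simulated outside the database (an illustrative Python sketch, assuming points sorted by time; ties and the exact grouping of Qg are simplified):

```python
def m4(points, n_groups):
    """Select the first, last, bottom, and top tuple per pixel column,
    as in the M4 aggregation (sketch; points are (t, v) pairs sorted by t)."""
    t0, t1 = points[0][0], points[-1][0]
    groups = {}
    for p in points:
        k = min(int(n_groups * (p[0] - t0) / (t1 - t0)), n_groups - 1)
        g = groups.setdefault(k, {"first": p, "last": p, "min": p, "max": p})
        if p[0] < g["first"][0]: g["first"] = p
        if p[0] > g["last"][0]:  g["last"] = p
        if p[1] < g["min"][1]:   g["min"] = p
        if p[1] > g["max"][1]:   g["max"] = p
    out = set()  # at most 4 distinct tuples survive per group
    for g in groups.values():
        out.update((g["first"], g["last"], g["min"], g["max"]))
    return sorted(out)

m4([(i, i % 3) for i in range(10)], n_groups=1)
# -> [(0, 0), (2, 2), (9, 0)]: first/min, max, and last tuple of the group
```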

7 Evaluation of Line Chart Aggregation

In the following evaluation, we compare the data reduction efficiency of the M4 aggregation with state-of-the-art line simplification approaches and with commonly used naive approaches, such as averaging, sampling, and rounding. Therefore, we apply all considered techniques to several real world data sets and measure the resulting visualization quality. For all aggregation-based data reduction operators, which can be expressed using the relational algebra, we also evaluate the query execution performance.

7.1 Real World Time Series Data

We consider three different data sets: the price of a single share on the Frankfurt stock exchange over 6 weeks (700k tuples), 71 minutes from a speed sensor of a soccer ball ([32], ball number 8, 7M records), and one week of sensor data from an electrical power sensor of a semiconductor manufacturing machine ([21], sensor MF03, 55M records). In Figure 15, we show excerpts of these data sets to display the differences in time and value distribution.

Fig. 15 a) financial, b) soccer, c) machine data.

For example, the financial data set (a) contains over 20000 tuples per working day, with the share price changing only slowly over time. In contrast, the soccer data set (b) contains over 1500 readings per second and is best described as a sequence of bursts. Finally, the machine sensor data (c) at 100Hz is a mixed signal that has time spans of low, high, and also bursty variation.

7.2 Query Execution Performance

In Section 4.1, we described how to express simple sampling or aggregation-based data reduction operators using relational algebra, including our proposed M4 aggregation. We now evaluate the query execution performance of these different operators. All evaluated queries were issued as SQL queries via ODBC over a (100Mbit) wide area network to a virtualized, shared SAP HANA v1.00.70 instance, running in an SAP data center at a remote location.


Fig. 16 Query performance: (a,b,c,d) financial data, (e,f) soccer data, (g,h) machine data.

The considered queries are: 1) a baseline query that selects all tuples to be visualized, 2) a PAA query that computes up to 4 · w average tuples, 3) a two-dimensional rounding query that selects up to w · h rounded tuples, 4) a stratified random sampling query that selects 4 · w random tuples, 5) a systematic sampling query that selects 4 · w first tuples, 6) a MinMax query that selects the two min and max tuples from 2 · w groups, and finally 7) our M4 query selecting all four extrema from w groups. Note that we adjusted the group numbers to ensure a fair comparison at similar data reduction rates, such that any reduction query can at most produce 4 · w tuples. All queries are parametrized using a width w = 1000 and (if required) a height h = 200.

In Figure 16, we plot the corresponding query execution times and total times for the three data sets. The total time measures the time from issuing the query to receiving all results at the SQL client. For the financial data set, we first selected three days from the data (70k records). We ran each query 20 times and obtained the query execution times, shown in Figure 16a. The fastest query is the baseline query, as it is a simple selection without additional operators. The other queries are slower, as they have to compute the additional data reduction. The slowest query is the rounding query, because it groups the data in two dimensions by w and h. The other data reduction queries only require one horizontal grouping. Comparing these execution times with the total times in Figure 16b, we see the baseline query losing its edge in query execution, ending up one order of magnitude slower than the other queries. Even for the low number of 70k records, the baseline query

Fig. 17 Performance with varying record count.

is dominated by the additional data transport time. Regarding the resulting total times, all data reduction queries are on the same level and manage to stay below one second. Note that the M4 aggregation does not have significantly higher query execution times and total times than the other data reduction queries. The observations are similar when selecting 700k records (30 days) from the financial data set (Figures 16c and 16d). The aggregation-based queries, including M4, are overall one order of magnitude faster than the baseline at a negligible increase of query execution time.

Our measurements show very similar results when running the different types of data reduction queries on the soccer and machine data sets. We requested 1400k records from the soccer data set, and 3600k records from the machine data set. The results in Figure 16(e–h) again show an improvement of total time by up to one order of magnitude.

We repeated all tests, using a Postgres 8.4.11 RDBMS running on a Xeon E5-2620 with 2.00GHz, 64GB RAM, 1TB HDD (no SSD) on Red Hat 6.3, hosted in the same data center as the HANA instances. All data was served from a RAM disk. The Postgres working memory was


Fig. 18 Data efficiency of evaluated techniques, showing DSSIM over number of tuples: a) M4 and MinMax vs. aggregation, b) M4 and MinMax vs. line simplification; I) binary, II) anti-aliased line visualizations.

set to 8GB. In Figure 17, we show the exemplary results for the soccer data set, plotting the total time for an increasing number of underlying records. We again observe the baseline query heavily depending on the limited network bandwidth. The aggregation-based approaches again perform much better. We made comparable observations with the finance data set and the machine data set.

The results of both the Postgres and the HANA system show that the baseline query, fetching all records, mainly depends on the database-outgoing network bandwidth. In contrast, the size of the result sets of all aggregation-based queries for any amount of underlying data is more or less constant and below 4 · w. Their total time mainly depends on the database-internal query execution time. The evaluation also shows that the M4 aggregation is equally fast as common aggregation-based data reduction techniques. M4 can reduce the time the user has to wait for the data by one order of magnitude in all our scenarios, and still provide the correct tuples for high quality line visualizations.

7.3 Visualization Quality and Data Efficiency

We now evaluate the robustness and the data efficiency regarding the achievable visualization quality. Therefore, we test M4 and the other aggregation techniques using different numbers of horizontal groups nh. We start with nh = 1 and end at nh = 2.5 · w. Thereby we select at most 10 · w records, i.e., twice as much data as is actually required for an error-free two-color line visualization. Based on the reduced data sets we compute an (approximating) visualization and compare it with the (baseline) visualization of the original data set. All considered visualizations are drawn using the open source Cairo graphics library (cairographics.org). The distance measure is the DSSIM, as motivated in Section 4.5. The underlying original time series of the evaluation scenario are 70k tuples (3 days) from the financial data set. The related visualization has w = 200 and h = 50. In the evaluated scenario, we allow the number of groups nh to be different from the width w of the visualization. This will show the robustness of our approach. However, in a real implementation, the engineers have to make sure that nh = w to achieve the best results. In addition to the aggregation-based operators, we also compare our approach with three different line simplification approaches, as described in Section 4.5. We use the Reumann-Witkam algorithm (reuwi) [35] as representative for sequential line simplification, the top-down Ramer-Douglas-Peucker (RDP) algorithm [10,20], and the bottom-up Visvalingam-Whyatt (visval) algorithm [38]. Since RDP does not allow setting a desired data reduction ratio, we precomputed the minimal ε that would produce a number of tuples proportional to the considered nh.

In Figure 18, we plot the measured visualization quality (DSSIM) over the resulting number of tuples for each grouping from nh = 1 to nh = 2.5 · w of an applied data reduction technique. For readability, we cut off all low quality results with DSSIM < 0.8. The lower the number of tuples and the higher the DSSIM, the more data efficient is the corresponding technique for the purpose of line visualization. Figures 18aI and 18bI depict these measures for binary line visualizations and Figures 18aII and 18bII for anti-aliased line visualizations. We observe the following results.

Sampling and Averaging operators (avg, first, and srandom) select a single aggregated value per (horizontal) group. They all show similar results and provide the lowest DSSIM. As discussed in Section 4, they will often fail to select the tuples that are important for line rasterization, i.e., the min, max, first, and last tuples that are required to set the correct inner-column and inter-column pixels.

2D-Rounding requires an additional vertical grouping into nv groups. We set nv = w/h · nh to have a proportional vertical grouping. The average visualization quality of 2D-rounding is higher than that of averaging and sampling.

MinMax queries select min(v) and max(v) and the corresponding timestamps per group. They provide very high DSSIM values already at low data volumes. On average, they have a higher data efficiency than all aggregation-based techniques (including M4, see Figure 18a), but are partially surpassed by line simplification approaches (see Figure 18b).

Line Simplification techniques (RDP and visval) on average provide better results than the aggregation-based techniques (compare Figures 18a and 18b). As seen previously [36], top-down (RDP) and bottom-up (visval) algorithms perform much better than the sequential ones (reuwi). However, in the context of rasterized line visualizations they are surpassed by M4 and also MinMax at nh = w and nh = 2 · w. Line simplification techniques often miss one of the min, max, first, or last tuples, because these tuples do not necessarily comply with the geometric distance measures used for line simplification, as described in Section 4.5.

M4 queries select min(v), max(v), min(t), max(t) and the corresponding timestamps and values per group. On average M4 provides a visualization quality of DSSIM > 0.9 but is usually below MinMax and the line simplification techniques. However, at nh = w, i.e., at any factor k of w, M4 provides error-free visualizations, without pixel errors, because any grouping with nh = k · w and k ∈ N+ also includes the min, max, first, and last tuples for nh = w.²

Anti-aliasing. The observed results for binary visualizations (Figures 18I) and anti-aliased visualizations (Figures 18II) are very similar. The absolute DSSIM values for anti-aliased visualizations are even better than for binary ones. This is because a single pixel error in a binary visualization implies a full color swap from one extreme to the other, e.g., from black (0) to white (255). Pixel errors in anti-aliased visualizations are less distinct, especially in overplotted areas, which are common for high-volume time series data. For example, a missing line will often result in a small increase in brightness, rather than a complete swap of full color values, and an additional false line will result in a small decrease in brightness of a pixel.

² Given w equally sized groups of a continuous range, we obtain 2 · w equally sized groups by adding intersections in the center of each group. The original intersections of the value range and thus their corresponding first and last tuples are still part of the query result. Similarly, the original min and max tuples become min or max tuples in the new subgroups.
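The subgroup argument of footnote 2 can also be checked numerically; a self-contained sketch (our own helper functions, not the paper's tooling):

```python
import random

def extrema_tuples(points, n_groups):
    """Return the set of first, last, min-value, and max-value tuples
    (the M4 selection) per equidistant time group."""
    t0, t1 = points[0][0], points[-1][0]
    groups = {}
    for p in points:
        k = min(int(n_groups * (p[0] - t0) / (t1 - t0)), n_groups - 1)
        groups.setdefault(k, []).append(p)
    out = set()
    for g in groups.values():
        out.update({min(g), max(g),                      # first and last tuple
                    min(g, key=lambda p: p[1]),          # min-value tuple
                    max(g, key=lambda p: p[1])})         # max-value tuple
    return out

random.seed(0)
series = [(t, random.random()) for t in range(1000)]
w = 50
# doubling the group count only adds tuples; the w-group extrema survive
assert extrema_tuples(series, w) <= extrema_tuples(series, 2 * w)
```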

7.4 Evaluation of Pixel Errors

A visual result of M4, MinMax, RDP, and averaging (PAA), applied to 400 seconds (40k tuples) of the machine data set, is shown in Figure 19. We use only 100×20 pixels for each visualization to reveal the pixel errors of each operator. M4 thereby also presents the error-free baseline image. We marked the pixel errors for MinMax, RDP, and PAA; black represents additional pixels and white the missing pixels compared to the base image.

Fig. 19 Projecting 40k tuples to 100×20 pixels, showing (top to bottom) M4/Baseline, PAA, MinMax, and RDP.

We see how MinMax draws very long, false connection lines (right of each of the three main positive spikes of the chart). MinMax also has several smaller errors, caused by the same effect. In this regard, RDP is better: since the distance of the unselected points to a long, false connection line is also very high, RDP will have to split this line again. RDP also applies a slight averaging in areas where the time series has a low variance, since the small distance between low varying values also decreases the corresponding measured distances. The most pixel errors are produced by the PAA-based data reduction, mainly caused by the averaging of the vertical extrema. Overall, MinMax results in 30 false pixels, RDP in 39 false pixels, and PAA in over 100 false pixels. M4 stays error-free.
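Counting such pixel errors amounts to comparing each reduced image with the baseline image; a minimal sketch, representing binary images as sets of lit pixels (our own representation, not the paper's tooling):

```python
def pixel_errors(base, approx):
    """Count additional (drawn but not in base) and missing pixels between
    two binary images, each given as a set of (x, y) pixel coordinates."""
    additional = approx - base  # false pixels, marked black in Fig. 19
    missing = base - approx     # missing pixels, marked white in Fig. 19
    return len(additional), len(missing)

base = {(0, 0), (1, 1), (2, 2)}
approx = {(0, 0), (1, 2), (2, 2)}
pixel_errors(base, approx)  # -> (1, 1): one false pixel, one missing pixel
```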

7.5 Data Reduction Potential

Let us now go back to the motivating example in Sec-tion 1. For the described scenario, we expected 100

Page 20: VDDA: Automatic Visualization-Driven Data Aggregation in ...€¦ · cations. Data analysts interact with the visualizations and their actions are transformed by the applications


users trying to visually analyze 12 hours of sensor data, recorded at 100 Hz. Each user has to wait for over 4 million records of data until he or she can examine the sensor signal visually. Assume that the sensor data is visualized using a line chart that relies on an M4-based aggregation, and that the maximum width of a chart is w = 2000 pixels. Then we know that M4 will select at most 4·w = 8000 tuples from the time series, independent of the chosen time span. The resulting maximum number of tuples, required to serve all 100 users with error-free line charts, is 100 users · 8000 = 800,000 tuples, instead of previously 432 million tuples. In this scenario we achieve a data reduction ratio of over 1:500.
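The arithmetic of this example can be checked in a few lines (a sketch of the calculation only; the constants are taken from the scenario description):

```python
USERS = 100            # engineers querying simultaneously
HZ = 100               # sensor reporting frequency
HOURS = 12             # queried time span per user
W = 2000               # maximum chart width in pixels

raw_per_user = HZ * HOURS * 3600   # 4,320,000 records per user
raw_total = USERS * raw_per_user   # 432,000,000 records in total
m4_per_user = 4 * W                # M4 selects at most 4*w = 8,000 tuples
m4_total = USERS * m4_per_user     # 800,000 tuples for all users
ratio = raw_total / m4_total       # 540.0, i.e., over 1:500
```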

8 Evaluation of VDDA

In the following, we evaluate the performance and discuss the visualization quality of the VDDA operators and visualization categories. For this evaluation, we use specific original queries Q on the Frankfurt stock exchange data (cf. Figure 15a), such that these baseline queries yield a reasonable visualization. The considered data set contains the 58 largest stock market shares, i.e., those having the largest number of records within the evaluated day. The data has three dimensions: the id of a share, the time t of a record, and the price v at that time. The complete data set, including all 58 subsets, comprises 450k records and is stored in a Postgres 9.3 database. The database runs on Ubuntu Linux 14.04 LTS on a 2.5 GHz Core2Duo, with 4 GB RAM and an SSD. Postgres uses up to 2 GB of working memory. The queries are issued via ODBC from the VDDA Demonstrator web application [22], which we extended to support the new chart types. All charts are rendered to an HTML canvas using the default drawing API of the browser (Firefox 34).

8.1 Evaluation Scenarios

We consider four different visualization scenarios. For each scenario, we perform multiple runs with an increasing number of underlying records, starting at 25k records (30 minutes) and ending at 450k records (9 hours). Depending on the type of visualization, we configured the original queries as detailed in Table 1. The table column Q defines the original query on the underlying data T(id, t, v). The column Q → V denotes how each visualization maps the result of the original query Q to the attributes of the visualization V. The last two table columns define the relevant subqueries, required to derive a complete VDDA query, as described in Section 6. The considered scenarios are the following.

Scn.  w    h    wg   hg   mark type      ids (cats)
1)    320   20   10   1   bars           10 (1)
2)    320  100    1   1   pixels          4 (1)
3)    120  100    1   1   stacked bars    5 (1)
4)    320  120   10   3   pixels         58 (3)

Table 2 Visualization parameters and data properties.

Scenario 1. As representative for 1D Charts and Stacked Charts, we consider the visualization of multiple data subsets using Aligned Bar Charts.

Scenario 2. As representative for 2D Plots, we visualize multiple data subsets in one Scatter Plot, using single pixels as marks. This scenario also represents 2D Grids, whose VDDA query is the same as for a low-resolution scatter plot.

Scenario 3. As representative for Stacked Marks, we first aggregate the data (from milliseconds to minutes) and visualize the correlated maximum values per minute as a Stacked Bar Chart.

Scenario 4. As representative for high-dimensional Stacked Charts, we visualize all 58 data subsets as a Scatter Matrix with 3×10 grid cells, i.e., with multiple data subsets sharing one cell.

For each scenario, Table 2 lists the visualization parameters, i.e., the size w×h of the visualization canvas, the size wg×hg of the visualization grid, stack, or matrix, the mark type, the number of data subsets (ids), and the number of data categories (cats).
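The group-key functions fg referenced in Table 1 differ per chart type: bar charts group by pixel column only, while scatter plots group by both pixel coordinates. The following is a minimal sketch of these two variants, under our own naming and signatures (not the paper's code):

```python
def fg_bars(t, w, t0, t1):
    """Group key for bar charts: the pixel column of timestamp t,
    for a chart of width w pixels over the time range [t0, t1)."""
    return int(w * (t - t0) / (t1 - t0))

def fg_scatter(t, v, w, h, t0, t1, v0, v1):
    """Group key for scatter plots: the pixel (x, y) of a point (t, v),
    for a w x h pixel canvas over the ranges [t0, t1) and [v0, v1)."""
    return (int(w * (t - t0) / (t1 - t0)),
            int(h * (v - v0) / (v1 - v0)))
```

For grid, stack, and matrix layouts, the cell key (e.g., id%3 in Scenario 4) is simply appended to the group key.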

8.2 Evaluation Results

We measure the query execution time until the query result is available to be sent from the remote backend system to the local web client. Furthermore, we measure the data transfer time and the rendering time. The sum of all three times is the total time the user has to wait until he or she can see a chart on the screen. For each scenario and each run, we compare the performance of a VDDA query QR with the performance of the unreduced baseline query Q.

The performance of the measured VDDA queries is similar to the results observed for the M4, MinMax, and the other competing data aggregation queries for line charts. The measurement results are depicted in Figure 20, showing the query execution times, data transfer times, and rendering times for each test run.

Overall, the VDDA queries are in most cases not significantly slower than the baseline queries, even though the database has to compute the additional data aggregation. In Scenarios 1, 2, and 4, the VDDA queries are even up to 3 times faster than the baseline, because the aggregation in the database significantly reduces the number of records to be prepared for the downstream components. Depending on the original query, less data has to be acquired and less data has to be serialized by the database. Even for small workloads on 25k underlying records, the VDDA queries are faster and already provide a small benefit. The higher the number of underlying tuples, the higher the speedup of VDDA over the baseline. The speedup and data reduction ratios are summarized in Table 3, listing the results for the first run with 25k underlying records and the last run with 450k underlying records for each scenario.

The overall average performance increase of all measured VDDA queries over all runs and scenarios is 60%, i.e., the queries are answered 1.6 times faster than the original queries. The average data reduction ratio is 1:43, with 1:1.4 at the lower end and 1:254 at the high end.

Table 1 Original queries Q on the base relation T(id, t, v) for each chart type; projection of Q to the attributes of the visualization V; visualization-specific aggregation query Qmax; corresponding subqueries on which Qmax depends.

1) aligned bars:   Q = σid≤12(T);   (id,t,v) → (kx, kcell, h);   Qmax(v; id,t,v);   Qg = πk←fg(id,t),id,t,v(Q)
2) scatter plot:   Q = σid≤6(T);   (id,t,v) → (kcolor, x, y);   Qmax(t; id,t,v);   Qg = πk←fg(t,v),id,t,v(Q)
3) stacked bars:   Q = Gid,t←⌈t⌉C,v←max(v)(Q0), with Q0 = σ11≤id≤15(T);   (id,t,v) → (kcolor, kcell, h);   Qmax(s; id,t,v);   Qg = πk←fg(t),id,t,v,s(Qs), with Qs = Q ⋈ Gt,s←sum(v)(Q)
4) scatter matrix:   Q = πid,m←id%3,t,v(T);   (id,m,t,v) → (kx, ky, x, y);   Qmax(v; id,m,t,v);   Qg = πk←fg(t,v,id,m),id,m,t,v(Q)

Fig. 20 VDDA vs. original query: query execution times (texec), data transfer times (tnet), and rendering times (trender) of four selected chart types: a) aligned bars, b) scatter matrix, c) stacked bars, d) scatter plot (25k-450k base records each).

Table 3 Speedup and data reduction of VDDA over baseline.

scenario     aligned bars    scatter matrix   stacked bars    scatter plot
run          25k     450k    25k     450k     25k     450k    25k     450k
speedup      1.8     2.9     1.2     1.9      1.0     1.0     1.1     1.6
reduction    1:15.6  1:254   1:3.1   1:33.3   1:1.4   1:19.9  1:1.7   1:13.3
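To make the comparison between a baseline query Q and its rewritten counterpart QR concrete, the following is a minimal sketch, our own and not code from the paper, of wrapping an original query in an M4-style aggregation using SQLite; the table and column names (ts, t, v) are assumptions:

```python
import sqlite3

def rewrite_m4(q, t_start, t_end, w):
    """Wrap an original query q (returning columns t, v) in an M4-style
    aggregation QR for a line chart of width w pixels (SQLite syntax)."""
    return f"""
    WITH Q AS ({q}),
    G AS (SELECT CAST({w} * (t - {t_start}) * 1.0
                      / ({t_end} - {t_start}) AS INT) AS k, t, v
          FROM Q),
    A AS (SELECT k, MIN(t) AS t1, MAX(t) AS t2,
                 MIN(v) AS v1, MAX(v) AS v2
          FROM G GROUP BY k)
    SELECT DISTINCT G.t, G.v
    FROM G JOIN A ON G.k = A.k
    WHERE G.t IN (A.t1, A.t2) OR G.v IN (A.v1, A.v2)
    ORDER BY G.t"""

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ts (t INTEGER, v REAL)")
con.executemany("INSERT INTO ts VALUES (?, ?)",
                [(0, 1), (1, 5), (2, 3), (3, 0), (4, 2), (5, 9), (6, 4), (7, 7)])
qr = rewrite_m4("SELECT t, v FROM ts WHERE t < 8", 0, 8, 2)
rows = con.execute(qr).fetchall()  # at most 4 tuples per pixel column
```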

8.3 Visualization Results

All VDDA operators, defined for the considered chart types, use a maximum aggregation to select the dominating tuples per display unit. For most chart types, this maximum aggregation effectively selects all tuples required to produce all correct pixels, so that the visualization of the reduced data is exactly the same as the visualization of the original data. This is also the case for the four evaluation scenarios.

Scenario 1. For bar charts, only one maximum tuple is required to draw the correct bar per pixel column. The visualizations of three runs are depicted in Figure 21. For each of the three runs, we also list the size of the base relation |T|, the size of the original query result |Q|, and the size of the VDDA query result |QR|. Note that the VDDA query usually selects more than w = 320 tuples, since the correlated aggregation does not remove duplicates (cf. Section 4.1).

Scenario 2. For scatter plots with single-pixel marks, only one last (maximum time) tuple is required to draw the correct pixel, given that the rendering process consumes the tuples ordered by time. The visualizations and data properties of three runs are depicted in Figure 22.

Scenario 3. For stacked bar charts, the tuples defining the highest stacked bars are required. These tuples are selected by correlating the highest sum of values per timestamp with the base relation and selecting the tuples that contribute to that sum. The visualizations and data properties of three runs are depicted in Figure 23.

Scenario 4. A scatter matrix is a composition of scatter plots. For each single scatter plot using single-pixel marks, the VDDA query selects the last (maximum time) tuple per pixel. This results in an error-free visualization, given that the rendering process consumes the tuples ordered by time. The visualizations and data properties of two runs are depicted in Figure 24.

Fig. 21 Scenario 1: Aligned Bars, ten shares, each in its own sub chart. |T|, |Q|, |QR| per run: 50k, 11k, 353; 250k, 53k, 344; 450k, 90k, 352.

Fig. 22 Scenario 2: Scatter Plot, four shares in one plot. |T|, |Q|, |QR| per run: 25k, 2k, 1196; 100k, 8.5k, 2509; 450k, 34.7k, 2540.

Fig. 23 Scenario 3: Stacked Bars, four shares in one chart. |T|, |Q|, |QR| per run: 25k, 632, 468; 250k, 6.9k, 623; 450k, 12k, 624.

Fig. 24 Scenario 4: Scatter Matrix, 58 shares in 3×10 matrix cells; w: 320, wmatrix: 10; h: 120, hmatrix: 3; dimmh: id, |dimmh|: 58; dimmv: id%3, |dimmv|: 3. |T|, |Q|, |QR| per run: 1) 25k, 25k, 8060 (25k records, 30 minutes); 2) 450k, 450k, 13538 (450k records, 9 hours).

For all scenarios, we observe that the number of records produced by the VDDA queries QR is virtually the same as the number of pixels or the number of bars of the visualization. The higher number of records produced by the original queries Q does not provide any benefit and also requires the client-side renderer to redraw many pixels or bars over and over again.

In summary, we can state that VDDA is able to produce correct visualizations for the most common chart types, at lower query execution times, lower data transfer times, and lower rendering times, compared to the times measured for the original queries.

9 Related Work

In this section, we discuss existing visualization systems and provide an overview of related data reduction techniques, discussing the differences to our approach.

9.1 The Visualization Pipeline

Visualization systems implement a very common data flow model, denoted as the visualization pipeline [19,2]. The pipeline comprises three steps for transforming raw data into a visual presentation on the screen.

data > Filtering > Mapping > Rendering > view

In this model, our proposed VDDA operators are part of the low-level Filtering process. However, they require the visualization parameters as input, e.g., the pixel width and pixel height of the final visualization, and the type of operator depends on the type of the visualization. Each VDDA operator moreover simulates some aspects of the Mapping and of the Rendering steps.

> Filtering( Mapping > Simulated Rendering ) >

Nevertheless, a VDDA operator is a real data-filtering operator, having raw data as input and producing plain data records as output. Consequently, it is also fully compatible with extended visualization pipeline models. For instance, Chi et al. [7] use a seven-step model to emphasize intermediate states of the original three-step model. In terms of their model, a VDDA operator is a low-level Data Stage Operator. Chi et al. also show that many kinds of apparently custom visualizations can be formalized using their model. Consequently, we regard VDDA to be applicable to virtually any type of visualization.

9.2 Visualization Systems

Regarding visualization-related data reduction, current state-of-the-art visualization systems and tools fall into three categories: they (A) do not use any data reduction, (B) compute and send images instead of data to visualization clients, or (C) rely on additional data reduction outside of the database. In Figure 25, we compare these systems to our solution (D), showing how each type of system applies and reduces a relational query Q on a time series relation T. Note that thin arrows indicate low-volume data flow, and thick arrows indicate that raw data needs to be transferred between the system's components or to the client.

Fig. 25 Visualization system architectures: A) without reduction, transferring the full result of Q(T); B) image-based, transferring pixels; C) additional reduction, transferring QR(Q(T)); D) in-DB reduction. The types differ in interactivity and bandwidth demands.

Visual Analytics Tools. Many visual analytics tools are systems of type A that do not apply any visualization-related data reduction, even though they often contain state-of-the-art (relational) data engines [40] that could be used for this purpose. For our visualization needs, we already evaluated four common candidates for such tools: Tableau Desktop 8.1 (tableausoftware.com), SAP Lumira 1.13 (saplumira.com), QlikView 11.20 (qlikview.com), and Datawatch Desktop 12.2 (datawatch.com). But none of these tools was able to quickly and easily visualize high-volume time series data having 1 million records or more. Since all tools allow working on data from a database or provide a tool-internal data engine, we see a great opportunity for our approach to be implemented in such systems. For brevity, we cannot provide a more detailed evaluation of these tools.

Interactive Visualization Systems. Our approach scales "only" linearly, potentially causing an expensive full scan of the raw data in many traditional DBMSs. Full scans may be avoided by materializing aggregated data, as done in many specialized data visualization systems, such as imMens [29], which provides interactive response times for analytical queries on hierarchically aggregated data sets. However, as type A systems, they require a full copy of the data to pre-compute the aggregates.

Client-Server Systems. The second system type B is commonly used in web-based solutions, e.g., financial websites like Yahoo Finance (finance.yahoo.com) or Google Finance (google.com/finance). Those systems reduce the data volumes by generating and caching raster images, and sending those instead of the actual data for most of their smaller visualizations. Purely image-based systems usually provide poor interactivity and are backed by a complementary system of type C, implemented as a rich-client application that allows exploring the data interactively. Systems B and C usually rely on additional data reduction or image generation components between the data engine and the client. Assuming a system C that allows arbitrary non-aggregating user queries Q, it will regularly need to transfer large query results from the database to the external data reduction components.
This may consume significant system-internal bandwidth and heavily impact the overall performance, as data transfer is one of the most costly operations.

Data-Centric System. Our visualization system (type D) can run expensive data reduction operations directly inside the data engine and still achieve the same level of

interactivity as provided by rich-client visualization systems (type C). Our system rewrites the original query Q, using the additional data reduction operators, producing a new query QR. When executing the new query, the data engine can then jointly optimize all operators in one single query graph, and the final (physical) operators can all directly access the shared in-memory data without requiring additional, expensive data transfer.

Cross-Layer Systems. Similar to our data-centric system, Wu et al. [41] recently proposed Ermac, a Data Visualization Management System (DVMS) that treats data visualizations as first-class citizens to be optimized across layers. A data visualization, and thus a query, in Ermac is specified using a declarative visualization specification language that supports aggregation over visual bins. We see VDDA as a complementary data aggregation for Ermac queries, e.g., to quickly visualize large data sets that are subject to overplotting.

9.3 Data Reduction

In the following, we give an overview of common data reduction methods and how they relate to visualizations.

Quantization. Many visualization systems explicitly or implicitly reduce continuous time-series data to discrete values, e.g., by generating images, or simply by rounding the data, e.g., to two decimal places. A rounding function is surjective and does not allow correct reproduction of the original data. In our system, we also consider lossy, rounding-based reduction, and can even model it as a relational query, facilitating a data-centric computation.

Time Series Representation. There are many works on time series representations [14], especially for the task of data mining [16]. The goal of most approaches is, similar to our goal, to obtain a much smaller representation of a complete time series. In many cases, this is accomplished by splitting the time series (horizontally) into equidistant or distribution-based time intervals and computing an aggregated value (average) for each interval [26]. Further reduction is then achieved by mapping the aggregates to a limited alphabet, for example, based on the (vertical) distribution of the values. The results are, e.g., character sequences or lists of line segments (see Section 4.5) that approximate the original time series. The validity of a representation is then tested by using it in a data mining task, such as time series similarity matching [42]. The main difference of our approach is our focus on relational operators and our incorporation of the spatial properties of the visualizations. None of the existing approaches discussed the



related aspects of line rasterization that facilitate the high quality and data efficiency of our approach.

Offline Aggregation and Synopsis. Traditionally, aggregates of temporal business data in OLAP cubes are very coarse-grained. The number of aggregation levels is limited, e.g., to years, months, and days, and the aggregation functions are restricted, e.g., to count, avg, sum, min, and max. For the purpose of visualization, such pre-aggregated data might not represent the raw data very well, especially when considering high-volume time series data with a time resolution of a few milliseconds. The problem is partially mitigated by the provisioning of (hierarchical or amnesic) data synopses [11,18]. However, synopsis techniques again rely on common time series dimensionality reduction techniques [16], and thus are subject to approximation errors. In this regard, we consider the development of a visualization-oriented data synopsis system that uses the proposed M4 aggregation to provide error-free visualizations as a challenging subject for future work.

Online Aggregation (Streaming). Even though this paper focuses on aggregation of static data, our work was initially driven by the need for interactive, real-time visualizations of high-velocity streaming data [24]. Indeed, we can apply the M4 aggregation for online aggregation, i.e., derive the four extremum tuples in O(n) and in a single pass over the input stream. A custom M4 implementation could scan the input data for the extremum tuples rather than the extremum values, and thus avoid the subsequent join required by the relational M4 (see Section 4.2).

Data Compression. We currently only consider data reduction at the application level. Any additional transport-level data reduction technique, e.g., packet compression or specialized compression of numerical data [28,16], is complementary to our data reduction.

Content Adaptation.
Our approach is similar to content adaptation in general [30], which is widely used for images, videos, and text in web-based systems. Content adaptation is one of our underlying ideas, which we extended towards a relational approach, with special attention to the spatial properties of visualizations.

Statistical Approaches. Statistical databases [1] can serve approximate results; they serve highly reduced approximate answers to user queries. Nevertheless, these answers cannot represent the raw data very well for the purpose of line visualization, since they apply simple random or systematic sampling, as discussed in Section 4. In theory, statistical databases could be extended with our approach, to serve, for example, M4 or MinMax query results as approximate answers.

Visualization-Driven Data Reduction. The usage of visualization parameters for data reduction has been

partially described by Burtini et al. [5], who use the width and height of a visualization to define parameters for some time series compression techniques. However, they describe a client-server system of type C (see Figure 25), applying the data reduction outside of the database. In our system, we push all data processing down to the database by means of query rewriting. There are other specialized systems that incorporate visualization-related query parameters for the purpose of data aggregation. For instance, the ScalaR system, described by Battle et al. [3], uses the regrid function of SciDB [8] to prevent large query results. However, existing works neither appropriately discuss data aggregation at the pixel level through overplotting, nor provide solutions for all common types of visualizations.

Visual Aggregation. Even though we consider visual aggregation through overplotting as the most ubiquitous form of visual aggregation, occurring in many data visualization systems, we are certainly aware that overplotting is perceptually not the most appropriate form of data aggregation. An overview of the many alternative visual aggregations in the literature is provided by Elmqvist and Fekete [13].
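The single-pass, online variant of M4 mentioned in the Online Aggregation paragraph above can be illustrated as follows. This is our own minimal sketch (function and variable names are not from the paper): it keeps the first, last, bottom, and top tuple per pixel-column group in one scan, avoiding the value-based join of the relational M4.

```python
def m4_single_pass(stream, t_start, t_end, width):
    """Single-pass M4: for each of `width` pixel-column groups, keep the
    first, last, bottom (min v), and top (max v) tuple of (t, v) pairs."""
    groups = [None] * width  # one dict of first/last/min/max tuples per group
    span = t_end - t_start
    for t, v in stream:
        if not (t_start <= t < t_end):
            continue  # tuple outside the visualized time range
        k = int(width * (t - t_start) / span)  # pixel-column group key
        g = groups[k]
        if g is None:
            groups[k] = {"first": (t, v), "last": (t, v),
                         "min": (t, v), "max": (t, v)}
        else:
            if t < g["first"][0]: g["first"] = (t, v)
            if t > g["last"][0]:  g["last"] = (t, v)
            if v < g["min"][1]:   g["min"] = (t, v)
            if v > g["max"][1]:   g["max"] = (t, v)
    # emit up to 4 tuples per group, without duplicates, ordered by time
    result = []
    for g in groups:
        if g is not None:
            result.extend(sorted(set(g.values())))
    return result
```

The output is bounded by 4·width tuples, matching the bound of the relational M4 aggregation.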

10 Conclusion

In this paper, we extended the M4 data aggregation [23] towards a more general Visualization-Driven Data Aggregation (VDDA) for many kinds of visualizations. We explained in detail how VDDA mimics the overplotting behavior of data visualizations using data aggregation in SQL queries, facilitating data reduction already at the query level. Driven by the spatial properties of each specific visualization, VDDA can reduce large data volumes to a limited, predictable number of records. We moreover described how VDDA integrates transparently with existing visualization systems, and eventually demonstrated the practicality of our approach in several real-world scenarios.

Currently, VDDA is designed for systems that visualize large data sets naively, i.e., by means of overplotting. However, we believe that big data visualization systems should not only consider the spatial limits of the screen, but also incorporate the limits of human perception, and moreover provide alternative representations of the data. In the future, we plan to investigate how our approach integrates into such systems.

We hope that this in-depth, interdisciplinary database and computer graphics research paper will inspire other researchers to investigate the boundaries between the two areas.



References

1. S. Agarwal, A. Panda, B. Mozafari, A. P. Iyer, S. Madden, and I. Stoica. Blink and it's done: Interactive queries on very large data. PVLDB, 5(12):1902–1905, 2012.

2. W. Aigner, S. Miksch, H. Schumann, and C. Tominski. Visualization of Time-Oriented Data. Human–Computer Interaction Series. Springer, 2011.

3. L. Battle, M. Stonebraker, and R. Chang. Dynamic reduction of query result sets for interactive visualization. In IEEE Big Data, pages 1–8. IEEE, 2013.

4. Jack E. Bresenham. Algorithm for computer control of a digital plotter. IBM Systems Journal, 4(1):25–30, 1965.

5. G. Burtini, S. Fazackerley, and R. Lawrence. Time series compression for adaptive chart generation. In CCECE, pages 1–6. IEEE, 2013.

6. Jim X. Chen and Xusheng Wang. Approximate line scan-conversion and antialiasing. In Computer Graphics Forum, pages 69–78. Wiley, 1999.

7. Ed H. Chi and John T. Riedl. An operator interaction framework for visualization systems. In Symposium on Information Visualization, pages 63–70. IEEE, 1998.

8. P. Cudré-Mauroux, H. Kimura, K.-T. Lim, J. Rogers, R. Simakov, E. Soroush, P. Velikhov, D. L. Wang, M. Balazinska, J. Becla, et al. A demonstration of SciDB: A science-oriented DBMS. PVLDB, 2(2):1534–1537, 2009.

9. David Salomon. Data Compression. Springer, 2007.

10. David H. Douglas and Thomas K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. The Cartographica Journal, 10(2):112–122, 1973.

11. Q. Duan, P. Wang, M. Wu, W. Wang, and S. Huang. Approximate query on historical stream data. In DEXA, pages 128–135. Springer, 2011.

12. S. G. Eick and A. F. Karr. Visual scalability. Journal of Computational and Graphical Statistics, 11(1):22–43, 2002.

13. Niklas Elmqvist and Jean-Daniel Fekete. Hierarchical aggregation for information visualization: Overview, techniques and design guidelines. TVCG, 16(3):439–454, 2010.

14. Philippe Esling and Carlos Agon. Time-series data mining. ACM Computing Surveys, 45(1):12–34, 2012.

15. Franz Färber, Sang Kyun Cha, Jürgen Primsch, Christof Bornhövd, Stefan Sigg, and Wolfgang Lehner. SAP HANA database: data management for modern business applications. SIGMOD Record, 40(4):45–51, 2012.

16. Tak-Chung Fu. A review on time series data mining. EAAI Journal, 24(1):164–181, 2011.

17. T.C. Fu, F.L. Chung, R. Luk, and C.M. Ng. Representing financial time series based on data point importance. EAAI Journal, 21(2):277–300, 2008.

18. S. Gandhi, L. Foschini, and S. Suri. Space-efficient online approximation of time series data: Streams, amnesia, and out-of-order. In ICDE, pages 924–935. IEEE, 2010.

19. Robert B. Haber and David A. McNabb. Visualization idioms: A conceptual model for scientific visualization systems. Visualization in Scientific Computing, 74:93, 1990.

20. J. Hershberger and J. Snoeyink. Speeding up the Douglas-Peucker line-simplification algorithm. University of British Columbia, Department of Computer Science, 1992.

21. Z. Jerzak, T. Heinze, M. Fehr, D. Gröber, R. Hartung, and N. Stojanovic. The DEBS 2012 Grand Challenge. In DEBS, pages 393–398. ACM, 2012.

22. U. Jugel, Z. Jerzak, G. Hackenbroich, and V. Markl. Faster visual analytics through pixel-perfect aggregation. PVLDB, 7(13):1705–1708, 2014.

23. U. Jugel, Z. Jerzak, G. Hackenbroich, and V. Markl. M4: A visualization-oriented time series data aggregation. PVLDB, 7(10):797–808, 2014.

24. Uwe Jugel and Volker Markl. Interactive visualization of high-velocity event streams. PVLDB (PhD Workshop), 5(13), 2012.

25. D. A. Keim, C. Panse, J. Schneidewind, M. Sips, M. C. Hao, and U. Dayal. Pushing the limit in visual data exploration: Techniques and applications. Lecture Notes in Artificial Intelligence, (2821):37–51, 2003.

26. Eamonn J. Keogh and Michael J. Pazzani. A simple dimensionality reduction technique for fast similarity search in large time series databases. In PAKDD, pages 122–133. Springer, 2000.

27. Alexander Kolesnikov. Efficient algorithms for vectorization and polygonal approximation. University of Joensuu, 2003.

28. Peter Lindstrom and Martin Isenburg. Fast and efficient compression of floating-point data. In TVCG, volume 12, pages 1245–1250. IEEE, 2006.

29. Zhicheng Liu, Biye Jiang, and Jeffrey Heer. imMens: Real-time visual querying of big data. In Computer Graphics Forum, volume 32, pages 421–430. Wiley Online Library, 2013.

30. W.Y. Ma, I. Bedner, G. Chang, A. Kuchinsky, and H. Zhang. A framework for adaptive content delivery in heterogeneous network environments. In MMCN, volume 3969, pages 86–100. SPIE, 2000.

31. Jock Mackinlay, Pat Hanrahan, and Chris Stolte. Show me: Automatic presentation for visual analysis. TVCG, 13(6):1137–1144, 2007.

32. C. Mutschler, H. Ziekow, and Z. Jerzak. The DEBS 2013 Grand Challenge. In DEBS, pages 289–294. ACM, 2013.

33. Office of Electricity Delivery & Energy Reliability. Smart Grid. January 2014. (energy.gov/oe/technology-development/smart-grid).

34. P. Przymus, A. Boniewicz, M. Burzańska, and K. Stencel. Recursive query facilities in relational databases: a survey. In DTA and BSBT, pages 89–99. Springer, 2010.

35. K. Reumann and A. P. M. Witkam. Optimizing curve segmentation in computer graphics. In Proceedings of the International Computing Symposium, pages 467–472. North-Holland Publishing Company, 1974.

36. W. Shi and C. Cheung. Performance evaluation of line simplification algorithms for vector generalization. The Cartographic Journal, 43(1):27–44, 2006.

37. C. Upson, Thomas A. Faulhaber Jr., D. Kamins, D. Laidlaw, D. Schlegel, J. Vroom, R. Gurwitz, and A. Van Dam. The application visualization system: A computational environment for scientific visualization. Computer Graphics and Applications, IEEE, 9(4):30–42, 1989.

38. Maheswari Visvalingam and J. D. Whyatt. Line generalisation by repeated elimination of points. The Cartographic Journal, 30(1):46–51, 1993.

39. Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

40. Richard Wesley, Matthew Eldridge, and Pawel T. Terlecki. An analytic data engine for visualization in Tableau. In SIGMOD, pages 1185–1194. ACM, 2011.

41. Eugene Wu, Leilani Battle, and Samuel R. Madden. The case for data visualization management systems. PVLDB, 7(10):903–906, 2014.

42. Y. Wu, D. Agrawal, and A. El Abbadi. A comparison of DFT and DWT based similarity search in time-series databases. In CIKM, pages 488–495. ACM, 2000.