a holistic approach to olap sessions composition: the ...julien.aligon/papers/dolap2014.pdf · 22],...

A Holistic Approach to OLAP Sessions Composition:The Falseto Experience

Julien Aligon, Kamal Boulil, Patrick Marcel, Verónika PeraltaUniversity of Tours

Blois, [email protected]

ABSTRACTOLAP is the main paradigm for flexible and effective explo-ration of multidimensional cubes in data warehouses. Dur-ing an OLAP session the user analyzes the results of a queryand determines a new query that will give her a better un-derstanding of information. Given the huge size of the dataspace, this exploration process is often tedious and may leavethe user disoriented and frustrated.

This paper presents an OLAP tool1 named Falseto (For-mer AnalyticaL Sessions for lEss Tedious Olap), that ismeant to assist query and session composition, by lettingthe user summarize, browse, query, and reuse former ana-lytical sessions. Falseto’s implementation on top of a formalframework is detailed. We also report the experiments werun to obtain and analyze real OLAP sessions and assessFalseto with them. Finally, we discuss how Falseto can beseen as a starting point for bridging OLAP with exploratorysearch, a search paradigm centered on the user and the evo-lution of her knowledge.

1. INTRODUCTIONWhile it is universally recognized that OLAP tools have

a key role in supporting flexible and effective exploration ofmultidimensional cubes in data warehouses, it is also com-monly agreed that the huge number of possible aggregationsand selections that can be operated on data may make theuser experience disorientating. OLAP queries are usuallyformulated in the form of sequences called OLAP sessions.During an OLAP session the user analyzes the results of acurrent query and, depending on the specific data she sees,determines a new query that will give her a better under-standing of information. This new query is often constructedby modifying the previous query, applying simple primitiveslike roll-up, drill-down or slice/dice. Noticeably, given thehuge number of such modification possibilities, this explo-

1This work was conducted as part of the frenchregional project DOPAn (http://www.info.univ-tours.fr/dp/fr/recherche/dopan)

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full cita-tion on the first page. Copyrights for components of this work owned by others thanACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-publish, to post on servers or to redistribute to lists, requires prior specific permissionand/or a fee. Request permissions from [email protected]’14, November 7, 2014, Shanghai, China.Copyright 2014 ACM 978-1-4503-0999-8/14/11 ...$15.00.http://dx.doi.org/10.1145/2666158.2666179.

ration process is often seen as tedious and may leave theuser disoriented and frustrated.

To make the user experience less disorientating, it hasrecently been suggested that database systems could takeadvantage of query logs [15]. Logs can indeed be used forlearning user preferences [13], for query recommendation [7,22], for query auto-completion [16], or for query composi-tion [17]. Similar efforts were also made in the context ofOLAP systems [4, 10, 2], but so far no usable OLAP toolimplements them, and this leds us to the development ofFalseto.

Falseto (Former AnalyticaL Sessions for lEss Tedious Olap)aims at assisting the user analyzing data cubes, by lettingher summarize, browse, query, and reuse former analyticalsessions. In particular, Falseto’s recommender system auto-matically suggests an analytical session considered as rele-vant regarding her current session. Falseto also lets the userquery and browse former sessions, which is particularly use-ful when a user initiates her session and the recommendersystem cannot suggest queries as no current session exists.Since the set of former sessions can be very large, Falsetopresents first summarized versions of the sessions in a waythat makes it easy to navigate and query them.

The paper is structured as follows. Section 2 introducesFalseto’s functionalities by means of examples. Section 3 de-scribes the logical framework underlying Falseto and Section4 presents its implementation. Section 5 reports the lessonslearned through various tests aiming at obtaining and ana-lyzing real OLAP sessions and assessing Falseto with them.After briefly reviewing related work, we discuss how Fasetocan be a starting point of future researches to bridge the gapbetween OLAP and exploratory search in Section 6. Section7 concludes the paper.

2. GETTING FAMILIAR WITH FALSETOAssume you are an OLAP user who has to analyze cen-

sus data obtained from the IPUMS (Integrated Public UseMicrodata Series) website [18], and gathered under the formof a multidimensional cube whose schema is depicted Figure1. It has 5 hierarchies describing people sex, race, occupa-tion and residence place, as well as census year. They areorganized as follows:

• Sex is a one-level hierarchy, containing two values:Male and Female.

• Race is a three-level hierarchy. MRN (stands for MajorRaces Number) indicates the number of major racesmix of a person (e.g. 2). Racegroup groups races in

Figure 1: IPUMS multidimensional schema

major categories (e.g. Black and Asian) and Race listsdetailed races (e.g. Black and Korean).

• Occupation is a four-level hierarchy. Category indi-cates 9 major occupation categories, that are special-ized in Sub-categories and Branches. Finally, Occupa-tion lists detailed occupations.

• City is a three-level hierarchy describing people resi-dence place. Region indicates 10 United States regions.State indicates state codes (e.g., CA for California) andCity lists United States cities.

• Year is a one-level hierarchy, containing six values,from 2000 to 2005.

• The schema describes six census indicators: total per-sonal income (INCTOT), annual property insurancecost (PROPINSR), annual gas cost (COSTGAS), an-nual water cost (COSTWATR), annual electricity cost(COSTELEC) and person weight (PERWT). Four mea-sures are defined for each indicator according to 4 ag-gregation functions (sum, max, min and average), forexample SUMINCTOT, MAXINCTOT, MININCTOTand AVGINCTOT. A 25th measure (EVENTCOUNT)counts the number of facts.

More specifically, assume that you have to answer the fol-lowing, rather vague, analytical question: What is the pro-file of female workers that better resists to a drop of averageincome? You would probably try to navigate the IPUMScube, playing with aggregation levels and filters, to discoverthis profile.

Falseto, whose main user interface is depicted in Figure 2,can be used to graphically create the queries of your session,execute them, view the results and decide on what the nextquery should be. Designing queries is done in the query de-sign tab (bottom left corner of Figure 2). The interface refersto the Dimension Fact Model [11] and allows to edit eachcomponent of the query, i.e., the measure set, the group-byset and the selection set. The reason for choosing this wayof representing OLAP queries is, aside the popularity of theDFM, to have a simple, uniform and non-textual model todepict the queries using the schema of the cube, which al-lows to focus on the three components of the query. Falsetoalso enables to devise a sequence of OLAP queries, yourOLAP session, in order to answer your analytical question.

Figure 2: Falseto’s main user interface

Figure 3: Log browsing with Falseto

Each time a query is created and evaluated, it is recordedand placed in the Session Viewer panel (top left corner ofFigure 2). This is the viewer for the past queries of yoursession. Each query can be subsequently retrieved, zoomed,reevaluated, edited, and used as a parameter for the holisticfunctionalities of Falseto.

The advanced features of Falseto are called holistic func-tionalities since they consider sessions in their entirety anddo not treat queries independently. Of course, all queries re-trieved by means of these functionalities can be reused andedited.

The first holistic functionality allows to browse the pastsessions recorded in the query log. The log browsing tab ofFalseto provides an overview of these past sessions (see Fig-ure 3). Each of the different panels represents a cluster (i.e.,a subset) of past sessions that are similar. The text in eachpanel indicates the fragments that appear in all the sessionsof this subset (group-by levels are in yellow, selections ingreen and measures in black). A left-click on a cluster dis-plays a summary of the cluster under the form of a sequenceof queries (bottom of Figure 3). Each query of this sessionsummarizes a portion of all the sessions of the cluster by dis-playing the common measures and selections, and the coars-est group-by set of the portion. It is also possible to scrolldown the detailed view in order to display each past sessionthat composes the cluster. Another possibility is to split a

cluster into several clusters of shorter sizes, or to merge clus-ters to have a coarser view of the log. Because the browsingof each cluster or individual session can be tedious, Falsetoalso offers filtering operators to reduce the number of clus-ters and sessions to browse. The specialization-based searchretrieves from the set of past sessions only those that includeat least one query containing the selections, measures andfiner (or identical) group-by sets of the query given as theinput of the operator. The similarity-based search retrievesfrom the set of past sessions only those that include at leastone query that is similar to the query given as input of theoperator. These two operations can also be used by givingnot a query but a session as input.

Finally, Falseto implements a recommender system thatsuggests a sequence of queries based on what you have donein your current session. This suggestion comes from a pastsession (recorded in the log) whose beginning is found simi-lar to your session, and can be seen as a potential and cus-tomized future for your current session. This recommendersystem can be seen as an instantiation of the generic frame-work proposed in [9, 10] with some of the similarity measuresdescribed in [5]. More details are given in Section 4 and thefull description can be found in [3, 2].

Note that the examples and figures used in this sectioncome from the Falseto tutorial, available online on the Falsetoweb site [8], where Falseto can also be downloaded.

3. THE LOGICAL FRAMEWORKThe logical framework underlying Falseto relies on the sep-

aration between queries and logs. The query design func-tionalities of Falseto rely on a simple query model and lan-guage described in Section 3.1, while holistic functionalitiesrely on a log manipulation language described in Section 3.2.

3.1 The query edition languageQuery edition is done with a simple language for manipu-

lating dimensional queries [11]. We now introduce the datamodel, the query model and finally the operators of this lan-guage. To keep the formalism simple, we consider cubes un-der a ROLAP perspective, described by a star schema. Moreprecisely, we consider that a dimension consists of one hier-archy, and we consider simple hierarchies without branches,i.e., consisting of chains of levels.

Let L be a set of attributes called levels, and for L ∈ L, amember is an element of the domain of L, noted Dom(L).A hierarchy hi is a set Lev(hi) = {L0, . . . , Ld} of levelstogether with a roll-up total order �hi of Lev(hi), which issuch that, for any Lj and Lk in Lev(hi), Lk �hi Lj if Lkis the roll-up of Lj . For each hierarchy hi, the top level isdenoted byALLi, has a single possible value, and determinesthe coarsest aggregation. The bottom level is denoted byDIMi and determines the finest aggregation level.

A multidimensional schema (or, briefly, a schema) is atriple M = 〈A,H,M〉 where:

• A is a finite set of levels, whose domains are assumedpairwise disjoint,

• H = {h1, . . . , hn} is a finite set of hierarchies, (suchthat the Lev(hi) for i ∈ {1, . . . , n} defines a partitionof A);

• M is a finite set of measure attributes, each defined ona numerical domain Dom(m).

Given a schemaM = 〈A,H, M〉, letDom(H) = Lev(h1)×. . . × Lev(hn); each G ∈ Dom(H) is called a group-by setof M. In what follows we will use the classical partial or-der on group-by sets, defined as follows. A group-by set Gis more general than a group-by set G′, noted G �H G′,if G = 〈a1, . . . , an〉, G′ = 〈a′1, . . . , a′n〉 and ∀i ∈ [1, n] it isai �hi a

′i.

Definition 3.1 (Fragment-based OLAP query). Aquery2 over schema M = 〈A,H,M〉 is a tripleq = 〈G,P,Meas〉 where:

1. G ∈ Dom(H) is the query group-by set;

2. P = {p1, . . . , pn} is a set of Boolean predicates, onefor each hierarchy, whose conjunction defines the se-lection predicate for q; they are of the form l = v, orl ∈ V , with l a level, v a value, V a set of values. Con-ventionally, we note pi = TRUEi if no selection on hiis made in q (all values being selected);

3. Meas ⊆ M is the measure set whose values are re-turned by q.

Definition 3.2 (OLAP Session and log). LetM =〈A,H,M〉 be a multidimensional schema and SC be a set ofqueries over M. A session s of k queries over M, noteds = 〈q1, . . . , qk〉, is a function from an ordered set pos(s) ofintegers (called positions) of size k to SC . A log L is a finiteset of sessions, noted L = {s1, . . . , sp}.

The operators of the query edition language allow to de-rive a query expression from another one by affecting a par-ticular component of the query (group-by set, selection setor measure set). The operators are similar to those tradi-tionally used in OLAP (such as roll-up, drill-down, or slice-and-dice) but instead of being applied on cubes (i.e. theinstances), they are applied on queries (i.e. the expressions).

The first two operators allow to modify the group-by setG, by changing a level, for a given hierarchy hi, to a coarseror finer granularity level. Two operators modify the selec-tion set P . The first operator adds a member c′i of a level l′ito a set of members Xi of the same level as l′i of a predicatepi ∈ P , for a given hierarchy hi. The second operator re-moves a member c′i of a level l′i from a set of members Xi ofthe same level as l′i of a predicate pi ∈ P , for a given hierar-chy hi. Finally, two operators allow modifying the measureset Meas, by either adding or removing a measure fragment.

Definition 3.3 (Query edition operators). Let q =〈G,P,Meas〉 be a query, hi be a hierarchy, li ∈ {c′i} be a se-lection predicate and m a measure.

• Rollupint(q, hi) = 〈G′, P,Meas〉 where li ∈ Lev(hi)and G′ = 〈l1, . . . , Rollup(li), . . . , ln〉, if li 6= ALLi,undefined otherwise.

• Drilldownint(q, hi) = 〈G′, P,Meas〉 where li ∈ Lev(hi)and G′ = 〈l1, . . . , Drilldown(li), . . . , ln〉, if li 6= DIMi,undefined otherwise.

• AddSelection(q, li ∈ {c′i}) = 〈G,P ′,Meas〉 where pi ∈P ′ with pi = li ∈ X ′i such as X ′i = Xi ∪ {c′i}

2This definition can be extended to add more query oprators(e.g. selections on measure values)

• RemoveSelection(q, li ∈ {c′i}) = 〈G,P ′,Meas〉 wherepi ∈ P ′ with pi = li ∈ X ′i such as X ′i = Xi − {ci}

• AddMeasure(q,m) = 〈G,P,Meas′〉 where Meas′ =Meas ∪ {m}.

• RemoveMeasure(q,m) = 〈G,P,Meas′〉 where Meas′ =Meas \ {m}.

3.2 The log manipulation languageThe holistic functionalities of Falseto can be easily de-

scribed by expressions in a language that manipulates logs,i.e., sets of sessions [6]. This language features five opera-tions and is inspired by the relational algebra. For the sakeof simplicity, the names and symbols of the operations arethose used in the relational algebra. Two of them are unaryoperations: selection σ and grouping/aggregation π. Threeof them are binary: join 1, union ∪ and difference \.

The selection operation enables to select from a log thosesessions satisfying a given condition. This condition is givenunder the form of a relation to another session.

Definition 3.4 (Selection). Let L be a log, s be asession and θ be a binary relation over sessions.

σθ,s(L) = {s′ ∈ L|θ(s, s′)}

As logs are defined as sets of sessions, the set union and setdifference operations can be used to manipulate logs. Thedefinitions are straightforward:

Definition 3.5 (Set operations). Let L and L′ betwo logs.L ∪ L′ = {s ∈ L or s ∈ L′} and L \ L′ = {s ∈ L|s /∈ L′}.

The join operation enables to combine the sessions in twologs, that are related through a relation θ.

Definition 3.6 (Join). Let L and L′ be logs, f be a bi-nary function outputting a session and θ be a binary relationover sessions.

L 1θ,f L′ = {f(s, s′)|s ∈ L, s′ ∈ L′, θ(s, s′)}.

Note that intersection can be simulated with the join op-eration. Indeed, L∩L′ = L 1same,first L

′ with, for any twosessions s, s′, first(s, s′) = s and same(s, s′) ≡ (s = s′).

Note also that this join operation is not symmetric unlessθ is symmetric and f(s, s′) = f(s′, s) for all s, s′. Moreover,the f function is needed for closeness reason. In that sense,it is inspired by the join operation defined in [1] for joiningcubes.

The grouping/aggregation operator allows to group ses-sions that are related to one another through a relation θ,and to aggregate them using an aggregation function agg.

Definition 3.7 (Grouping/Aggregation). Let L bea log, agg be a function aggregating a set of sessions into asession, and θ be a binary relation over sessions.

πθ,agg(L) = {agg({s′ ∈ L|θ(s, s′)})|s ∈ L}.

We briefly review the properties of the language in termsof closeness, completeness and minimality. It can easily beseen that the language is closed under composition, the re-sult of any operation being a set of sessions. The language iscomplete. Indeed, for any pair of logs L and L′, there is anexpression that enables the transformation L into L′. Thetransformation would proceed as follows:

1. Pick a session s in L with σ,

2. Transform it into a session s′ of L′ using π,

3. Repeat the previous steps until all sessions of L′ areformed,

4. Union all transformed sessions to form L′.

The language is not minimal since σ can be simulatedwith 1. Indeed, σθ,s(L) = L′ 1θ,f L where L′ = {s} andf(s, s′) = s′. It can easily be seen that removing σ makesthe language minimal.

4. UNDER THE HOOD OF FALSETOFalseto is implemented as a standalone Java application,

on top of the Mondrian OLAP engine, and uses the Wekamachine learning library. Falseto can be used with anydatabase system compatible with Mondrian and can be usedto query any cube with a star schema.

In terms of query edition, Falseto can be used to constructa query on a given schema from scratch, edit the query usingthe query edition operators, and evaluate the query over aninstance of the schema. All evaluated queries are kept in anin-memory structure representing the current session of theuser, who can go back to them for edition or reevaluationpurpose. The sessions of the log, that can be retrieved us-ing the holistic functionalities, are loaded in memory whenFalseto is started, and their queries can also be edited andevaluated.

The holistic functionalities of Falseto are described by ex-pressions in the log manipulation language introduced above.Noticeably, important inputs of these expressions are the re-lations over sessions that are used to parametrize the oper-ators (the θ in the definitions above). This is why, beforegiving the expressions corresponding to the holistic function-alities, we start by introducing relations between queries andsessions which will be used as parameters of the operators.

4.1 Relations over queries and sessionsTo describe the holistic functionalities of Falseto, we de-

scribe two types of relations. The first type are specializa-tion relations, enabling the descriptions of groups of queriesor sessions at various levels of details. The second type aresimilarity relations, allowing the description of how close ordistant two queries, or two sessions, are. We start by thespecialization relations.

Considering two group-by sets G and G′, recall that wenote G �H G′ if G is more general than G′ in the sense ofthe group-by lattice, whose top element G> is the coarsestgroup-by set. For two sets of selection predicates P, P ′, wenote P �p P ′ if P selects less values than P ′, i.e., val(P ) ⊆val(P ′), where val(X) is the set of values used in selectionpredicate X. Finally, for two measure sets M,M ′, we noteM �m M ′ if M ⊆M ′.

Definition 4.1 (Query specialization). Let q = 〈G,P,M〉 and q′ = 〈G′, P ′,M ′〉 be two queries over the sameschema. q is more general than q′, noted q � q′, if G �H G′

and P �p P ′ and M �m M ′.

We introduce a specialization relation over sessions thatis based on a specialization relation over queries. In whatfollows, s � s′ denotes that a session s is more general thana session s′.

Figure 4: Specialization relation over sessions

Definition 4.2 (Session specialization). A sessions = 〈q1, . . . , qns〉 is more general than another session s′ =〈q′1, . . . , q′ns′

〉, written s � s′, if there exists a sequence of

ns′ integers {i1, . . . , ins′ } with i1 = 1, ins′ = ns and, fork ∈ {1, ns′−1}, ik ≤ ik+1, such that, for all j ∈ {1, . . . , ns′},it is qij � q′j.

The intuition of this specialization relation is given inFigure 4. Note that if s � s′ then the sequence of inte-gers {i1, . . . , ins′ } defines a partition P = {p1, . . . , pns} ofqueries(s′) where the pi’s can be ordered according to thequeries of s, and the queries in each pi, each less generalthan qi, constitute a sub-session of s′. More details on thisspecialization relation can be found in [6].

Finally, we suppose a function dist applied to pairs of ses-sions, giving the distance between two sessions. Similarityrelations over sessions can be defined from such a function.For instance, sim(s, s′) ≡ dist(s, s′) ≤ α for some real αindicates that two sessions s and s′ are similar if their dis-tance is below a threshold. Falseto implements such a rela-tion, that is based on an extension of the Smith-Watermanalignement algorithm [21], whose goal is to efficiently findthe best alignment between subsequences of two given se-quences by ignoring their non-matching parts. This exten-sion was proven a good measurement of OLAP session sim-ilarity in [5].

4.2 Holistic functionalitiesRecall from Section 2 that Falseto includes three holis-

tic functionalities: log searching, log browsing and sessionrecommendation.

Log searchTo search the log for useful sessions, Falseto implementsdifferent versions of the log selection σθ,s. A first ver-sion enables the filtering of a log using for θ the sim re-lation based on the extension of the Smith-Waterman al-gorithm. This operation returns those sessions of the logthat are close to the session s given as a second param-eter. Likewise, a log can be filtered using for θ the spe-cialization relation over sessions defined above. This ver-sion of the selection is very useful as it allows to search a

log for queries by giving only a partial description of thesequeries. For instance, to search the log L for sessions con-taining queries with selection year = 2013, it is sufficient touse σ�,〈〈GAll,P={year=2013},∅〉〉(L), where GAll is the coarsestgroup-by set.

Log summarization and browsingTo allow for a user-friendly browsing of the log, Falseto im-plements operations for summarizing it, using the π group-ing/aggregation operator. The basic summarization oper-ation allows to form pairs of most similar sessions, thatare subsequently aggregated using the specialization rela-tion. The θ relation used for grouping corresponds toθsim(s, s′) ≡ @s′′ ∈ L such that sim(s, s′′) > sim(s, s′) andsim(s′, s′′) > sim(s, s′). The aggregation function mergespairs of sessions into a session more general than the two ele-ments of the pair. It uses a tilted-time window [12] to favor aprecise summary in proportion as queries are recent betweenthe sessions. The interest being to provide the least amountof information on old queries. It can be declaratively writtenas follows: aggmerge(s, s

′) = {s′′ such that both s′′ � s ands′′ � s′ and for all i ∈ [1, |s′′|], s′′ is the least session w.r.t.� such that both 〈q′′i 〉 � s|i and 〈q′′i 〉 � s′|i} where the q′′i ’sare the queries of s′′ and s|i represents the portion of a ses-sion s that corresponds to q′′i in the sense of the tilted-timewindow.

The π∗ operation is a generalization of the basic summa-rization operation, and summarizes a log so as to obtainonly one session generalizing all the sessions of the log. Itis implemented as an ascendant hierarchical clustering thatat each steps uses πθsim,aggmerge to group pairs of sessions.When Falseto is started, it computes the complete cluster-ing of the log, stores the resulting dendogram and displaysa summary of the log in its log viewer. This summary is notthe session that summarizes the complete log since such asession would most likely be too much general to be useful.Instead, Falseto proposes nodes of the dendogram that cor-respond to summarized sessions that contains at least onequery that is not the most general one. To save space on theviewer, these nodes are represented as tag clouds, where thetags are the query fragments. On demand, the user can seethe details of the corresponding cluster. The dendogram canalso be browsed: each cluster displayed on the viewer can besplit to display its two direct descendants. This “drill-down”browsing operation exploits the results of a π∗ operation, inthe sense that, from any session s of the dendogram result-ing of π∗, the pair of sessions summarized with aggmerge toobtain s is given.

Recommender systemFinally, the Falseto recommender system [3, 2] can be de-scribed by the three following phases:

1. The Alignment phase selects from the log L a set ofsessions that are similar to the current session scurand determines from them a set F of potential futuresfor scur. This is implemented using a join operationF = {scur} ./sim,future L where sim is the similarityrelation and future is the function that determines ifa session in L is a potential future for scur.

2. The Ranking phase chooses a base recommendationr as a subsession of a candidate recommendation inF by considering both its similarity with scur and

its frequency in L. It is implemented by {r} =πall,aggrank (F ) where all(s, s′) holds for any s, s′ of Fand aggrank is the function that compares pairwise thesubsessions of F and returns the most relevant one.

3. The Fitting phase adapts r to scur by looking for rele-vant patterns in the query fragments of scur and r andusing them to make changes to the queries in r, so asto deliver a recommendation r′. It is implemented bya join {r′} = {scur} ./all,fit {l} where l is the log ses-sion from which r is extracted and fit is the functionused for adapting r to scur.

We close this section by noting that the holistic function-alities of Falseto, and therefore the specialization relationsand similarity relations so far used in Falseto, are exclu-sively based on the query expressions. This is done for thesake of achieving an interactive response time. Indeed, ifquery answer is used instead of query syntax, this wouldhave a substantial impact on the time taken to compute arecommendation or summary of the log. Studying perfor-mant ways of incorporating query answers in the holisticfunctionalities of Falseto is part of our future work.

5. EXPERIMENTSWe tested our approach by running experiments with

OLAP users analyzing the cube whose schema is describedin Section 2. We first describe the test protocol and thenreport the lessons learned.

Our objectives for running these tests were to obtain realOLAP sessions and automatically assess Falseto’s holisticfunctionalities using these sessions. Due to the limited timewe had with the users, the goal was not to ask the users todirectly assess Falseto’s holistic functionalities. Such assess-ment is part of our future work as explained in the followingsection.

5.1 Test protocolThe test consisted of developing sessions for four analyti-

cal questions on the IPUMS cube:

1. (Q1) Is there a trend in the evolution of the averagecost of gas for some profiles?

2. (Q2) Are energy costs following the evolution of theaverage income for some profiles?

3. (Q3) What are the individual profiles to target for acampaign of energy cost reduction (i.e., those who paytoo much)?

4. (Q4) Where is it better to live in terms of energy costs,for an individual profile?

Note that these questionnaires can be found online via theFalseto web site [8].

The 17 OLAP users engaged in the test were students ofMaster’s programs specialized in BI: 7 of them were stu-dents of the European Erasmus Mundus IT4BI program(http://it4bi.univ-tours.fr/) and 10 were students of theFrench program in Information systems and BI of the Uni-versity of Tours. The test was not part of the program, wasnot graded and all the participants were volunteers.

Students were first shown the Falseto tool (mostly thequery design functionalities) and then they had 2 hours to

develop sessions. Then they were asked to identify in thesessions relevant queries, in the sense that the results of theserelevant queries should be included in a report summarizingthe session. They were also asked to justify the relevance ofthese queries.

With this corpus of annotated sessions, we tested Falseto’sholistic functionalities as follows. We developed differentnaive algorithms using Falseto’s holistic functionalities toautomatically generate sessions, and we compared these syn-thetic sessions with the real ones. More precisely:

1. To understand Falseto’s log search capacities, let Abe an analytical question and Al be the subset of logqueries developed for A. We first randomly choose aquery among the initial queries of Al. Then, with anincreasing probability for ending the session, at eachstep the current query is used to filter the log and re-trieve among the resulting session the query that isthe closest to the current one. This query becomes thenew current query (and is removed from the log). Forthese generated sessions, we report the recall and preci-sion compared to the log queries, computed as follows.Recall (R) is the number of queries of the generatedsessions that also belong to Al (noted Ag) over the to-

tal number of queries in Al, R =|Ag||Al|

. Precision (P)

is the number of queries of the generated sessions thatalso belong to Al (Ag) over the total number of queries

in the generated session (Sg), P =|Ag||Sg| . We also mea-

sure recall and precision w.r.t. relevant queries. Inthis case, Recall is the number of relevant queries inthe generated session (Rg) over the number of rele-

vant queries in Al (RAl ), R′ =|Rg||RA

l| , while precision

is the number of relevant queries in the generated ses-sion (Rg) over the number of queries in the generated

session (Sg), P′ =

|Rg||Sg| .

2. To understand Falseto’s capacities of suggesting newinformation (i.e., not found in the log) when answeringan analytical need, we generated sessions by pickinga session in the log, removing the final part of thesession and replacing it with the session suggested byFalseto’s recommender system. For this we report thesession novelty (the fraction of the generated sessionthat corresponds to queries that cannot be found in thelog), the coverage (fraction of times the recommenderhas somtehing to suggest) and the average similarityof the generated session to the sessions developed foranswering the analytical question (to make sure thatthe generated session does not diverge from its goal).

3. Finally, to understand the overall benefit of Falseto as-suming that the recommender system is used in com-bination with the log search facility, we use Falseto’srecommender system on the sessions generated withthe first test, and we report the metrics listed above.

5.2 Lessons learnedWe begin by describing the log obtained. 27 sessions (308

queries, 108 of them are relevant) were collected. A globalanalysis of the overall log is reported in Table 1 (effort is theinverse ratio of relevant queries per session, i.e., the numberof queries it takes to obtain one relevant query).

Analysis of the log min max avg std dev.Size of session 4 31 11.4 6.35Relevant queries / session 2 11 4 2.61Ratio of relevant queries 0.12 0.71 0.37 0.19Effort 1.4 8 3.08 1.76

Table 1: Analysis of the log

Figure 5: OLAP operations for a typical session

Figure 5 shows the use of OLAP operations along an av-erage sesssion. It plots the average number of OLAP op-erations used between two consecutive queries, the averageaggregation level and the average number of filters of queries,and the average number of relevant queries for the sessionsof the log (all sizes normalized). The aggregation level in-dicates the position in the group-by set lattice and variesbetween 0 (coarsest level) and 12 (finest level). The numberof filters indicates the number of filters per hierarchy andvaries between 0 (no selection) and 5 (selections on all di-mensions). The overall pattern is that the number of filtersand level of detail increase along the session, while the num-ber of OLAP operations between consecutive queries variesbefore a final drop at the end of the session. This indicatesthat users favor mainly a top-down exploration whose aimis to collect filters. Interstingly, the curve for the averagenumber of OLAP operations and that of average number ofrelevant queries are mostly inversely correlated. When theavereage number of operations increase, the average numberof relevant queries decrease, and vice versa. This may in-dicate that relevant queries are found when search is morefocused. In particular, the final drop of the number of op-erations seems to indicate that the sessions become morefocused in the end.

Figure 6 reports the distribution of the positions of rel-evant queries (in red) in the sessions. Note that given therelative small size of the log, manual inspection was possi-ble and done, and the queries in yellow color (around 9%)indicate the presence of a query that does not seem to con-tribute to the session (being a repetition of another queryof the session, or comporting edition mistakes like forgetingto incorporate the good measure attribute to analyze).

An analysis by question is reported in Table 2. We havecharacterized the average similarity of the sessions in the

Figure 6: Distribution of relevant queries

Question analysis Q1 Q2 Q3 Q4Number of sessions 8 7 6 5Number of queries 82 57 86 56Number of relevant queries 26 21 34 27Queries / session 10.25 8.14 14.33 11.2Relevant queries / session 3.25 3 5.67 5.4Intra-question similarity 0.55 0.16 0.13 0.21Inter-question similarityQ1 0.55Q2 0.02 0.16Q3 0.11 0.15 0.13Q4 0.14 0.08 0.16 0.21

Table 2: Question analysis

log, using the similarity defined in [5], which is quite low(0.14) as expected, as well as the intra-question similarityand the inter-question similarity. It can be noted that theintra-question similarity shows important variations, whichindicates that sessions developed for answering a given ana-lytical question can be quite different. However, this intra-question similarity is almost always greater than the averagesession similarity of the complete log. A notable exception isQ3, where the intra-question similarity (0.13) is lower thanthe average similarity to the sessions of both questions 2(0.15) and 4 (0.16). Interestingly, this is the question thatgenerated the highest number of queries per sessions, whichcan be understood as the difficulty of the question.

We then report the results of the 3 naive automatic ses-sion generation algorithms described above. For each of thequestions and each of the tests, 100 sessions were generated.The result of the first test, aiming at understanding Falseto’slog search capacity, is reported in table 3, that shows recall

Log search R(7) P(7) R(13) P(13) R(18) P(18)Q1 0.16 0.91 0.29 0.93 0.39 0.95Q2 0.11 0.80 0.14 0.67 0.14 0.57Q3 0.05 0.79 0.08 0.69 0.11 0.64Q4 0.04 0.38 0.08 0.35 0.09 0.31

Table 3: Log search capacity

Recommender system Q1 Q2 Q3 Q4Average novelty 0.25 1 0.42 0.39Average coverage 0.88 1 0.68 1Average recommendation similarity to questionRecommendations for Q1 0.65 0.03 0.14 0.27Recommendations for Q2 0.03 0.33 0.20 0.15Recommendations for Q3 0.12 0.15 0.28 0.21Recommendations for Q4 0.19 0.12 0.23 0.39

Table 4: Recommender system capacity

Search and recommendations Q1 Q2 Q3 Q4Average recall 0.32 0.16 0.08 0.08Average precision 0.92 0.61 0.61 0.34

Average novelty 0.08 0.54 0.33 0.5Average coverage 1 1 1 1Average recommendation similarity to questionRecommendations for Q1 0.66 0.03 0.14 0.15Recommendations for Q2 0.02 0.33 0.17 0.19Recommendations for Q3 0.15 0.17 0.23 0.24Recommendations for Q4 0.05 0.25 0.26 0.29

Table 5: Combining search facilities with recommen-dations

(R) and precision (P) for various average sizes of sessions(in brackets). We observe that precision is almost alwaysgood while recall increases with the average session size. Asimilar phenomenon occurs when computing recall and pre-cision with respect to relevant queries (not reported heredue to lack of space). This test shows how effective Falsetocan be for retrieving in a log queries and therefore sessionsdeveloped for an analytical need that is the same as the onecurrently investigated by the current session.

The result of the second test, aiming at understandingthe capacity of Falseto’s recommender system, is reportedin Table 4. It can be seen that coverage is very good (arecommendation is very frequently proposed). Novelty isgenerally good except for question 1, which can be explainedby the fact that the sessions developed for this question arevery close to each other. Importantly, the recommendationsare always closer to the sessions of the question they weregenerated for (in bold), compared to the sessions of otherquestions, which means that the focus on the question ispreserved. This is further confirmed by the fact that theaverage similarity of the recommandations to the sessions ofthe question is even above the intra-question similarity.

The results of the last test (generating sessions by rec-ommending for the sessions generated for test number 1) isreported in table 5. These results are slightly degraded com-pared to the previous tests, which is due to the combinationof the two holistic functionalities (the recommendations forQ3 are more similar to those for Q4 than Q3, in italic).

Overall, these tests confirm that, even with a very naiveway of using Falseto’s functionalities, the sessions obtainedremained focused on the analytical question addressed andinclude both previously developed queries and novel queries.

Finally, we tested the responsiveness of Falseto, to check ifthe holistic functionalities are compatible with an interactiveuse. The tests were run on MAC-OSX Mavericks with 16Gb of RAM and a 2.5 GHz Intel Core i5.

Figure 7: Falseto’s responsiveness

Figure 7 shows the average time it takes to compute i)the initial summary using clustering (in blue), ii) a searchin the log (in green) and iii) a recommendation (in yellow),for synthetic logs of increasing sizes (a maximum of 200 ses-sions corresponding to 1888 queries). It can be seen that logsearch and recommendation times are perfectly compatiblewith an interactive use (the maximum average was 152 msin our tests, for computing recommendations with a log of200 sessions). The time to compute the initial clusteringis much more substantial and increases with the size of thelog, but it remains acceptable since it is executed only oncewhen Falseto is started. This test confirms Falseto’s respon-siveness, and, importantly, justifies the use of only the querysyntax (and not the query answer, as explained in Section4) in the holistic functionalities of Falseto.

6. BRIDGING THE GAP BETWEEN OLAPAND EXPLORATORY SEARCH

6.1 Related workIt is now admitted in the database community that query

logs can be leveraged not only for physical tuning, but alsofor user empowerment [15]. Logs can indeed be used forlearning user preferences [13], for query recommendation [7,22], for query auto-completion [16], or for query composition[17]. In particular, the study of [17] showed that SQL querycomposition time can be heavily reduced when users areprovided with a tool to browse, search and reuse formerSQL queries organized in sessions.

In the OLAP context, the navigational nature of analyt-ical sessions over data cubes is well known [19], and ap-proaches have been developped for improving query perfor-mance holistically at the session level [19, 14], or suggest-ing queries or facts to assist the user in her navigation [20,10]. Noticeably, while many efforts have been devoted tospeedup OLAP query processing, the question of how effec-tive an OLAP session has attracted little attention, while itis predominent in other fields like information retrieval oreven more specifically, exploratory search.

Exploratory search [23] can be defined as a searchparadigm centered on the user and the evolution of herknowledge, aiming at a better support for information un-derstanding by moving beyond the traditional query-browse-refine model. As illustrated by Figure 8, exploratory searchconsists of two main activities: exploratory browsing and fo-cused searching. Exploratory browsing essentially helps re-late the problem context to similar documented experiences,while focused searching is generally intended to help the

Figure 8: Model of exploratory search behavior [23]

user follow a known or expected trail. Exploratory search isparticularly driven by the quality of the user’s experience,and suitable metrics for measuring it have been proposed[23]: engagement and enjoyment (the number of actionableevents), information novelty (the rate at which new infor-mation is encountered), task success (do users reach a par-ticular target and got enough detail en route to their goal),task time (the time spent to complete the task), and learningand cognition (the evolution of individual knowledge).

Exploratory search attracted the interest of the databasecommunity, and was even the topic of a recent post onthe SIGMOD blog (http://wp.sigmod.org/), with a specificsection that compares it to OLAP. The conclusion of thissection is that OLAP has serious limitations in supportingexploratory search, especially in the domain of discovery,adaptability to new information needs, and the support ofall types of relevant data and users. We next discuss howFalseto could help to bridge the gap between OLAP andexploratory search, and overcome these limitations.

6.2 Bridging the gapFigure 8 can be used as a first driver for bridging the gap

between OLAP and exploratory search. Interstingly, the testwe did with users showed that OLAP sessions could followthis model (see Figure 5). The two languages implementedby Falseto and presented in Section 3 can be seen as tar-geting the two activities of exploratory search. On the onehand, the query edition language can be used for focusedsearching, and on the other hand, the log manipulation lan-guage can be used to describe exploratory browsing tasksthat take advantage of former sessions. With this idea inmind, we designed a simple protocol to understand, via usertests, how Falseto can be used to better support exploratorysearch in the context of OLAP. This protocol relies on pro-duction of sessions in response to a given analytical question,and production of reports by the inclusion and comment, ina document, of the relevant queries from the session.

We first start by adapting to the OLAP context the met-rics introduced above for assessing exploratory search appli-cations. These metrics could be computed as follows:

• User engagement is measured as the number of queriesfound relevant and included in the report created while

answering a given analytical question. It is also mea-sured as the number of clicks on the holistic function-alities.

• Information novelty is measured as the number ofqueries in a report that are relevant and that appearin no other reports of the same analytical need.

• Task time is measured as the time taken to complete asession (to answer the analytical question). The timespent on holistic functionalities (exploratory browsing)is also reported. We also measure the user effort as theratio between the number of queries in the session andthe number of queries in the report.

• Task success is measured as the precision and recall ofthe report that answers to a given analytical question.Precision is the number of relevant queries in a reportover the overall number of queries in the report. Recallis the number of queries of a report that appear in (orthat are included in a query of) another report.

• Finally, learning and cognition can be measured byrecording a user profile and assess i) how this profileevolves through consecutive uses of Falseto and ii) ifthis user profile leads to improvements of the othermetrics through consecutive uses of Falseto.

We now sketch the protocol more precisely. The completeprotocol can be downloaded from the Falseto website [8].It corresponds to a test that is organized in three phases.The first phase aims at giving the analysts the sufficientknowledge to use both the basic and holistic functionalitiesof Falseto. The analysts are asked to complete the Falsetotutorial. After this phase, the test supervisors should consti-tute three groups of analysts (named Basic, Advanced andComplete from now on), ensuring in each group an evendistribution of skills with Falseto.

In the second phase, the analysts in the the Basic group,will use a basic version of Falseto that does not include theholistic functionalities. In the Advanced group, the analystswill use a version of Falseto that only includes the holis-tic functionalities, but does not allow the analyst to editqueries. In the Complete group, the analysts will use thecomplete version of Falseto that includes all the functional-ities. All analysts are asked to solve an exercise consistingof two analytical questions that can be solved by navigatingthe data cube. The second need is deliberately more fuzzythan the first one, and probably requires a longer navigationto get answered. The analysts should answer one require-ment after the other. The form of the answer to the questionis twofold: the log that records the session (automaticallygenerated by Falseto) and a report. This report is a docu-ment that contains a list of annotated query results, takenfrom the session.

During the last phase, each analyst should evaluate all thereports that were produced to answer a particular analyticalquestion. They should indicate, for each query, if this queryis relevant (does it contribute to answer the need, accordingto the analyst) and mark all the reports that include thisquery. If a report does not include the query stricto sensubut a superset of it (for instance the query filters out oneyear but the report includes a query that includes all theyears but which apart from that is exactly the same) thisreport should also be marked as including the query.

Implementing this protocol is part of our future work, andwill allow to compute the metrics introduced above. As men-tioned already, OLAP is limited in its exploratory searchcapabilities in terms of discovery, adaptability to new infor-mation needs, and the support of all types of relevant dataand users. Implementing this protocol will at least enable toanswer to the limitations in terms of discovery (can Falsetoleads the user towards interesting findings they would havemissed without Falseto) and in term of supporting differenttypes of users (naive and expert).

7. CONCLUSIONThis paper introduced Falseto, an OLAP tool that is

meant to facilitate OLAP query and session composition.During a session with Falseto, the user can summarize,browse, query and reuse former sessions recorded in aquery log, and can take advantage of sessions suggested byFalseto’s recommender system. This tool relies on a logi-cal framework that includes a language for editing queriesand a language for manipulating logs. We report the lessonslearned from analyzing real OLAP sessions and using themto assess Falseto’s functionalities. Finally, we discussed howFalesto can link OLAP to the field of exploratory search,a search paradigm centered on the evolution of the user’sunderstanding.

Falseto can be seen as the starting point of a platform fordeveloping user-centric OLAP facilities. The first short-termperspective of this work is to better assess Falseto’s holisticfunctionalities with user tests, so as to improve the existingfunctionalities and devise new ones. A second short-termperspective is to make Falseto compatible with logs providedby standard OLAP tools. The challenge being to translatethe queries of these logs according to the format supportedby Falseto, by limiting information loss. Another short-termperspective is to find better ways of graphically displayingsessions and summaries. On the long run, our aim is to beable to benchmark OLAP activities in terms of effectiveness,which would be possible by bridging the gap between OLAPand exploratory search.

8. REFERENCES[1] Rakesh Agrawal, Ashish Gupta, and Sunita Sarawagi.

Modeling Multidimensional Databases. In ICDE,pages 232–243, 1997.

[2] Julien Aligon. Similarity based recommendation ofOLAP sessions. PhD thesis, Universite FrancoisRabelais Tours, 2013.

[3] Julien Aligon, Enrico Gallinucci, Matteo Golfarelli,Patrick Marcel, and Stefano Rizzi. A CollaborativeFiltering Approach for Recommending OLAPSessions. Under submission, 2014.

[4] Julien Aligon, Matteo Golfarelli, Patrick Marcel,Stefano Rizzi, and Elisa Turricchia. MiningPreferences from OLAP Query Logs for ProactivePersonalization. In ADBIS, pages 84–97, 2011.

[5] Julien Aligon, Matteo Golfarelli, Patrick Marcel,Stefano Rizzi, and Elisa Turricchia. Similaritymeasures for OLAP sessions. KAIS, 39(2):463–489,2014.

[6] Julien Aligon, Haoyuan Li, Patrick Marcel, andArnaud Soulet. Towards a logical framework forOLAP query log manipulation. In PersDB, 2012.

[7] Gloria Chatzopoulou, Magdalini Eirinaki, Suju Koshy,Sarika Mittal, Neoklis Polyzotis, and JothiSwarubini Vindhiya Varman. The QueRIE system forPersonalized Query Recommendations. IEEE DataEng. Bull., 34(2):55–60, 2011.

[8] Falseto web site.http://vega.info.univ-tours.fr:29082/TEA/.

[9] Arnaud Giacometti, Patrick Marcel, and Elsa Negre.A framework for recommending OLAP queries. InDOLAP, pages 73–80, 2008.

[10] Arnaud Giacometti, Patrick Marcel, and Elsa Negre.Recommending Multidimensional Queries. In DaWaK,pages 453–466, 2009.

[11] Matteo Golfarelli and Stefano Rizzi. Data WarehouseDesign: Modern Principles and Methodologies.McGraw-Hill, 2009.

[12] Jiawei Han, Yixin Chen, Guozhu Dong, Jian Pei,Benjamin W. Wah, Jianyong Wang, and Y. Dora Cai.Stream Cube: An Architecture for Multi-DimensionalAnalysis of Data Streams. Distributed and ParallelDatabases, 18(2):173–197, 2005.

[13] Stefan Holland, Martin Ester, and Werner Kießling.Preference Mining: A Novel Approach on Mining UserPreferences for Personalized Applications. In PKDD,pages 204–216, 2003.

[14] Niranjan Kamat, Prasanth Jayachandran, KarthikTunga, and Arnab Nandi. Distributed and interactivecube exploration. In ICDE, pages 472–483, 2014.

[15] Nodira Khoussainova, Magdalena Balazinska,Wolfgang Gatterbauer, YongChul Kwon, and DanSuciu. A Case for A Collaborative Query ManagementSystem. In CIDR, 2009.

[16] Nodira Khoussainova, YongChul Kwon, MagdalenaBalazinska, and Dan Suciu. SnipSuggest:Context-Aware Autocompletion for SQL. PVLDB,4(1):22–33, 2010.

[17] Nodira Khoussainova, YongChul Kwon, Wei-TingLiao, Magdalena Balazinska, Wolfgang Gatterbauer,and Dan Suciu. Session-Based Browsing for MoreEffective Query Reuse. In SSDBM, pages 583–585,2011.

[18] Minnesota Population Center. Integrated Public UseMicrodata Series. http://www.ipums.org, 2008.

[19] Carsten Sapia. PROMISE: Predicting Query Behaviorto Enable Predictive Caching Strategies for OLAPSystems. In DAWAK, pages 224–233, 2000.

[20] Sunita Sarawagi. User-Adaptive Exploration ofMultidimensional Data. In VLDB, pages 307–316,2000.

[21] Temple Smith and Michael Waterman. Identificationof Common Molecular Subsequences. Journal ofMolecular Biology, 147:195–197, 1981.

[22] K. Stefanidis, M. Drosou, and E. Pitoura. ”You MayAlso Like” Results in Relational Databases. InPersDB, 2009.

[23] Ryen W. White and Resa A. Roth. ExploratorySearch: Beyond the Query-Response Paradigm.Synthesis Lectures on Information Concepts,Retrieval, and Services. Morgan & ClaypoolPublishers, 2009.

a holistic approach to olap sessions composition: the ...julien.aligon/papers/dolap2014.pdf · 22],...

Documents