enhancing web-based analytics applications through...

11
Enhancing Web-based Analytics Applications through Provenance Akhilesh Camisetty, Chaitanya Chandurkar, Maoyuan Sun, and David Koop Abstract— Visual analytics systems continue to integrate new technologies and leverage modern environments for exploration and collaboration, making tools and techniques available to a wide audience through web browsers. Many of these systems have been developed with rich interactions, offering users the opportunity to examine details and explore hypotheses that have not been directly encoded by a designer. Understanding is enhanced when users can replay and revisit the steps in the sensemaking process, and in collaborative settings, it is especially important to be able to review not only the current state but also what decisions were made along the way. Unfortunately, many web-based systems lack the ability to capture such reasoning, and the path to a result is transient, forgotten when a user moves to a new view. This paper explores the requirements to augment existing client-side web applications with support for capturing, reviewing, sharing, and reusing steps in the reasoning process. Furthermore, it considers situations where decisions are made with streaming data, and the insights gained from revisiting those choices when more data is available. It presents a proof of concept, the Shareable Interactive Manipulation Provenance framework (SIMProv.js), that addresses these requirements in a modern, client-side JavaScript library, and describes how it can be integrated with existing frameworks. Index Terms—Collaboration, provenance, streaming data, history, web. 1 I NTRODUCTION With the emergence of web libraries for interactive visual exploration and analysis of data [5, 12, 14, 16, 48, 55], users are able to analyze and explore a wide variety of data from their web browsers. Support for animated transitions and event handling gives users the ability to interactively filter data, change views, or dynamically steer computa- tions. As these frameworks and tools continue to evolve, there has been a tradeoff between the fluidity offered by client-side frameworks and the complexity of the applications. Specifically, where server-side applications capture information as pages are navigated and offer mech- anisms to return to previous steps or settings, client-side applications offer little infrastructure to recall or persist past actions. This either limits the complexity of the visualization or leads to user frustration. However, server-side solutions require servers to provide storage and interfaces and users to trust servers with their information. In addition, it is usually more difficult to customize the content and granularity of information captured, and the data tends to be less connected to the actions a user makes. We have explored the space of solutions for adding provenance to support reflection and collaboration, and de- veloped a prototype JavaScript framework (SIMProv.js) to show how client-controlled provenance can augment interactive web applications. Provenance—data describing how a result was achieved—has been shown to be useful for a variety of tasks [20, 30, 49, 52]. It is important in documenting work accomplished, engendering trust in results, and supporting reproducibility of past results. However, most work on capturing computational provenance, which is focused on the history of the design and execution of tasks, has concerned static computations. In practice, data exploration and analysis involves dynamic interaction with the computed results, and those actions may trigger computational updates usually presented to the user immediately. In addition, the design of the computation is an important step to document, not only for accuracy, but also to see what changes were developed as the computation evolved [21]. Finally, the goals and decisions of a user, as they digest and interact with visual displays are an important piece in understanding past and current analyses [46], often enhanced by user-provided annotations. With the dominance of the web, it is becoming more common for users to use and collaborate on web analytics applications. Poly- Chrome [3] is a solution that seeks to facilitate this using general web-standard techniques and focuses on fine-grained events; we seek Akhilesh Camisetty, Chaitanya Chandurkar, Maoyuan Sun, and David Koop are with UMass Dartmouth. E-mail: {acamisetty, cchandurkar, smaoyuan, dkoop}@umassd.edu. coarser actions to enable users to navigate and reflect on their past work. Provenance, generated as a user performs analysis steps, can be useful in helping a user understand another’s work when users are working asynchronously and a handoff [60] occurs. While the promise of provenance is well-recognized, user privacy has also become a part of any conversation about web-based tools and applications. Sites that store such information on servers must make users aware of such de- tails, and any project that involves sensitive information often cannot function in a hosted environment unless those servers are controlled by the investigator’s organization or compliant with specific regulations and security protocols. Thus, we wish to ensure that all provenance data is maintained on a user’s machine leveraging browser storage like IndexedDB. Ideally, applications may be delivered and executed over the web via the browser, but all history is stored locally–no provenance travels back to the server without user permission. Another emerging area where provenance can help is streaming data. As visual analytics deals with updating data, web applications are well-positioned to link with data streams, but we need to consider how the provenance and the streaming data match (or don’t). It is useful to both recall or revisit the exact state of the application and data when reviewing provenance, but it can also be enlightening to see the application in the same state but the data updated with more or more recent data. Thus, it makes sense to support both of these choices when possible. In this paper, we describe a framework that enables developers to add provenance features to their interactive web applications and users to improve their analyses through these features. Key contributions include: Desiderata and strategies for capturing, using, and sharing prove- nance in web applications. A framework that supports both private, user-controlled prove- nance and asynchronous and synchronous collaboration. Strategies for capturing and revisiting provenance when web analysis applications use streaming data. A reflection on decisions made in evolving the framework to better embrace modern web technologies and support real-time analytics. A developer-focused perspective that considers the barriers to the adoption of provenance in web applications. Please see the supplemental video for a better understanding of how these aspects have been integrated into a prototype system.

Upload: others

Post on 08-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Enhancing Web-based Analytics Applications through …faculty.cs.niu.edu/~dakoop/pubs/simprov.pdfEnhancing Web-based Analytics Applications through Provenance Akhilesh Camisetty, Chaitanya

Enhancing Web-based Analytics Applications through Provenance

Akhilesh Camisetty, Chaitanya Chandurkar, Maoyuan Sun, and David Koop

Abstract— Visual analytics systems continue to integrate new technologies and leverage modern environments for exploration andcollaboration, making tools and techniques available to a wide audience through web browsers. Many of these systems have beendeveloped with rich interactions, offering users the opportunity to examine details and explore hypotheses that have not been directlyencoded by a designer. Understanding is enhanced when users can replay and revisit the steps in the sensemaking process, andin collaborative settings, it is especially important to be able to review not only the current state but also what decisions were madealong the way. Unfortunately, many web-based systems lack the ability to capture such reasoning, and the path to a result is transient,forgotten when a user moves to a new view. This paper explores the requirements to augment existing client-side web applicationswith support for capturing, reviewing, sharing, and reusing steps in the reasoning process. Furthermore, it considers situations wheredecisions are made with streaming data, and the insights gained from revisiting those choices when more data is available. It presentsa proof of concept, the Shareable Interactive Manipulation Provenance framework (SIMProv.js), that addresses these requirements in amodern, client-side JavaScript library, and describes how it can be integrated with existing frameworks.

Index Terms—Collaboration, provenance, streaming data, history, web.

1 INTRODUCTION

With the emergence of web libraries for interactive visual explorationand analysis of data [5, 12, 14, 16, 48, 55], users are able to analyzeand explore a wide variety of data from their web browsers. Supportfor animated transitions and event handling gives users the ability tointeractively filter data, change views, or dynamically steer computa-tions. As these frameworks and tools continue to evolve, there hasbeen a tradeoff between the fluidity offered by client-side frameworksand the complexity of the applications. Specifically, where server-sideapplications capture information as pages are navigated and offer mech-anisms to return to previous steps or settings, client-side applicationsoffer little infrastructure to recall or persist past actions. This eitherlimits the complexity of the visualization or leads to user frustration.However, server-side solutions require servers to provide storage andinterfaces and users to trust servers with their information. In addition,it is usually more difficult to customize the content and granularityof information captured, and the data tends to be less connected tothe actions a user makes. We have explored the space of solutionsfor adding provenance to support reflection and collaboration, and de-veloped a prototype JavaScript framework (SIMProv.js) to show howclient-controlled provenance can augment interactive web applications.

Provenance—data describing how a result was achieved—has beenshown to be useful for a variety of tasks [20, 30, 49, 52]. It is importantin documenting work accomplished, engendering trust in results, andsupporting reproducibility of past results. However, most work oncapturing computational provenance, which is focused on the historyof the design and execution of tasks, has concerned static computations.In practice, data exploration and analysis involves dynamic interactionwith the computed results, and those actions may trigger computationalupdates usually presented to the user immediately. In addition, thedesign of the computation is an important step to document, not onlyfor accuracy, but also to see what changes were developed as thecomputation evolved [21]. Finally, the goals and decisions of a user,as they digest and interact with visual displays are an important piecein understanding past and current analyses [46], often enhanced byuser-provided annotations.

With the dominance of the web, it is becoming more common forusers to use and collaborate on web analytics applications. Poly-Chrome [3] is a solution that seeks to facilitate this using generalweb-standard techniques and focuses on fine-grained events; we seek

• Akhilesh Camisetty, Chaitanya Chandurkar, Maoyuan Sun, and David Koopare with UMass Dartmouth. E-mail: {acamisetty, cchandurkar, smaoyuan,dkoop}@umassd.edu.

coarser actions to enable users to navigate and reflect on their pastwork. Provenance, generated as a user performs analysis steps, canbe useful in helping a user understand another’s work when users areworking asynchronously and a handoff [60] occurs. While the promiseof provenance is well-recognized, user privacy has also become a partof any conversation about web-based tools and applications. Sites thatstore such information on servers must make users aware of such de-tails, and any project that involves sensitive information often cannotfunction in a hosted environment unless those servers are controlled bythe investigator’s organization or compliant with specific regulationsand security protocols. Thus, we wish to ensure that all provenancedata is maintained on a user’s machine leveraging browser storage likeIndexedDB. Ideally, applications may be delivered and executed overthe web via the browser, but all history is stored locally–no provenancetravels back to the server without user permission.

Another emerging area where provenance can help is streamingdata. As visual analytics deals with updating data, web applicationsare well-positioned to link with data streams, but we need to considerhow the provenance and the streaming data match (or don’t). It isuseful to both recall or revisit the exact state of the application and datawhen reviewing provenance, but it can also be enlightening to see theapplication in the same state but the data updated with more or morerecent data. Thus, it makes sense to support both of these choices whenpossible.

In this paper, we describe a framework that enables developers toadd provenance features to their interactive web applications and usersto improve their analyses through these features. Key contributionsinclude:

• Desiderata and strategies for capturing, using, and sharing prove-nance in web applications.

• A framework that supports both private, user-controlled prove-nance and asynchronous and synchronous collaboration.

• Strategies for capturing and revisiting provenance when webanalysis applications use streaming data.

• A reflection on decisions made in evolving the framework tobetter embrace modern web technologies and support real-timeanalytics.

• A developer-focused perspective that considers the barriers tothe adoption of provenance in web applications.

Please see the supplemental video for a better understanding of howthese aspects have been integrated into a prototype system.

Page 2: Enhancing Web-based Analytics Applications through …faculty.cs.niu.edu/~dakoop/pubs/simprov.pdfEnhancing Web-based Analytics Applications through Provenance Akhilesh Camisetty, Chaitanya

2 RELATED WORK

Ragan et al.’s survey contains an organizational framework for prove-nance in visualization and data analysis [49]. In that framework, ourwork can be characterized as focusing on the interaction and visual-ization types with the direct purposes of replication, action recovery,and collaboration. Note that the types of provenance may also be cat-egorized more generally. Computational provenance focuses on thelineage of a specification or result regardless of who or what impactedeach step [20]. Analytic provenance focuses on a user’s actions andpurpose [46]. One may try to infer from computational provenancewhat a user’s intent was, and from a high-level understanding, whatdata was used or computations run. This intersection is most salientwhere a user action leads directly to a computational action (or viceversa). Gotz and Zhou organize provenance based on semantic richnesswith high-level tasks (e.g. identifying insights) at one end and low-levelevents (e.g. mouse clicks) at the other [24]. Our work focuses on theaction tier, capturing exploration actions and allowing users to performinsight actions like annotation and meta actions like undo and redo.

Provenance Structure & Display A challenge with provenanceis effectively supporting visual and interactive exploration [30]. GRAS-PARC introduced the history tree as a way to organize snapshots ofthe steps scientists take in an investigation [7], and Derthick and Rossshowed how such a tree could be used to efficiently visit past states [17].VisTrails introduced change-based provenance as a way to encode statesof the history (or version) tree [21]. Unlike version control softwarelike Git [22], change-based provenance relies on prescriptive changes,designated when an action occurs rather than when modifications arecommitted. Network visualizations like Image Graphs [40] and Ex-Plates [56] help organize related states and can also display informationabout the mutation of data of visual representation. Thumbnails can bea key component in displaying history [4, 29], and studies have shownthey can be effective even at low resolutions [35, 54]. Furthermore,thumbnails may be cropped or zoomed to focus on particular changesthat help users identify types of actions [37].

Collaboration in Visual Analytics The visualization communityhas acknowledged the importance of collaboration, and there has beensignificant work to classify the different contexts and outline associatedchallenges [28, 33, 51]. Modes of collaboration are divided accord-ing to both space and time. Co-located users will collaborate withdifferent strategies than those who are distributed in different loca-tions. Synchronous collaboration has users working at the same timewhile asynchronous collaboration allows users to work more indepen-dently [33]. An important challenge in asynchronous collaborativesensemaking is the handoff problem where one user’s analysis is pickedup by another user who must understand the work that has alreadybeen done and where to go next, a type of knowledge transfer [60]. Insynchronous collaboration, the problem of concurrent updates is oftenaddressed via operational transformation which can utilize timestampsand buffers to keep clients in the same state [18, 45].

Collaboration Infrastructure and Interfaces Collaborative anal-ysis requires users be aware of others’ work as well as their own;interfaces need to facilitate user access and awareness. Using serverinfrastructure, VisPortal persisted visualizations and notes across ses-sions and allowed users to build off each other’s work [34]. Hajizadehet al. investigated views where a collaborator’s selections could beoverlaid over a user’s visualization [27]. The CLIP tool showed notonly that analysts benefit from externalizations of notes and discoveries,but also introduced partial and full merging strategies for integratingcollaborators’ work into a user’s workspace [41]. A network or clus-ter view that organizes insights juxtaposed with a timeline has beenshown to aid users in synchronous [11] and asynchronous [10] col-laboration scenarios. McGrath et al. investigated the branching andmerging in their protocol for co-located, synchronous collaboration,providing methods for users to switch between coupled and decoupledexploration as an analysis proceeds [42]. PolyChrome is a native webframework for synchronous collaboration, and pushes DOM eventsbetween browsers [3]; our work is similar but targets the action tier toprovide a coarser provenance.

Web Provenance Provenance on the web has been examined inthe context of linked data or the lineage of information displayed on aweb page [25, 26]. It also includes analytic provenance tasks, trackingactions across different sites for sensemaking [44]. Because of the in-terconnected nature of the web, various data sources may be combinedin an interactive system. Thus, an insight gained during interactiondepends on the data sources, the integration and transformation of thatdata, and the actions of a given user. Our focus is on web applicationswhere a user interacts with a single page that updates interactively ratherthan examining the browsing history of a user. However, we expectthat provenance schemes that work across pages could be coupled withour focused provenance.

Using Provenance The ability to use the captured provenance isimportant, and visualizations of history help users navigate and explorepast work while algorithms might also suggest directions for furtheranalysis. Work in this space includes explicitly showing dimensioncoverage during collaborative analysis [50] and displaying personalinteraction histories via direct encoding [19]. Similarly, scented wid-gets help guide users to explore certain parts of the visualization [57].Dabek and Callan show how building a grammar-based model usinginteraction history might help suggest or correct directions during anal-ysis [15]. REACT uses captured history to generate personalized actionrecommendations for query changes [43], and EVLIN extends theidea to a richer provenance model and also generates visualizationrecommendations [38].

Web App Debugging There are a variety of tools to help replayor debug web applications. Tools like Unravel seek to help developersbetter understand or reverse engineer websites [31]. Timelapse providesan interpreter that can be embedded with modern web browsers tocapture detailed debugging information [8]. Visualization is also usefulfor understanding asynchronous interactions [1]. One of our goals isto keep users in the same environment they are used to, augmenting itinstead of requiring users to use special software or tools when browsingthe web. Other tools, like Vega’s visual debugger [32], provide targeteddebugging for specific libraries.

Logging Mechanisms Other solutions exist for logging actionsas users interact with a web application and run in a standard browser.Google Analytics is a popular framework that allows developers to trackuser page views and developer-identified events through a JavaScriptlibrary [23]. In such frameworks, data is automatically sent to andstored on servers where it can be analyzed by users in a dashboardinterface. In addition, it is not data a user can use to help recall theirexploration. Other solutions, like mimic.js, provide fine-grained dataabout interaction events [6]. Again, while these solutions allow forpost-hoc analyses based on different provenance data, we seek solutionsto help users as they work by providing external memory for past statesand analyses.

Libraries Methods to store and recall past state range from frag-ment encoding to libraries that seek to provide undo and redo capabili-ties equivalent to desktop applications on the web. Web developers mayencode state in URLs with fragment identifiers; the fragment identifierdefines parameters that impact the current state of the page. Client-side web storage (IndexedDB, WebSQL) provides larger storage forsuch data. Libraries that provide undo and redo capabilities (e.g., [2])align with one of the core goals that users are able to use the recordedprovenance immediately. However, unlike logging or provenance solu-tions, undo stacks are often not serializable, and they do not supportbranching or jumping between states.

3 DESIGN DESIDERATA

Interactive web applications have become an important medium forvisual analytics work. Many tools are being developed for web browsersbecause they can support users in many different settings. At the sametime, rich toolkits and more libraries have made it possible to developmore complex applications. As this trend continues, it is important thatwe support analysts who are using these tools. One important aspect ofthis support is to augment the reasoning process by freeing users from

Page 3: Enhancing Web-based Analytics Applications through …faculty.cs.niu.edu/~dakoop/pubs/simprov.pdfEnhancing Web-based Analytics Applications through Provenance Akhilesh Camisetty, Chaitanya

Average speed by circuit segment

2 3 4 5 6 7

# Lap

050

100150200250300350400450500550600

Racer Journalist

Driver

0500

1,0001,5002,0002,5003,0003,500

-60 -40 -20 0 20 40 60

Banking angle (degrees)

050

100150200250300350

0 50 100 150 200 250 300

Speed (km/h)

020406080

100120140160180200220240

-50 -40 -30 -20 -10 0 10 20 30

Accelleration (m/s²)

0100200300400500600

(a) Racer

Average speed by circuit segment

2 3 4 5 6 7

# Lap

050

100150200250300350400450500550600

Racer Journalist

Driver

0500

1,0001,5002,0002,5003,0003,500

-60 -40 -20 0 20 40 60

Banking angle (degrees)

050

100150200250300350400450500

0 50 100 150 200 250 300

Speed (km/h)

020406080100

120140160180200220240260

-50 -40 -30 -20 -10 0 10 20 30

Accelleration (m/s²)

0100200300400500600700800

(b) Journalist

Average speed by circuit segment

2 3 4 5 6 7

# Lap

020406080

100120140160180200220

Racer Journalist

Driver

0100200300400500600700800900

1,0001,1001,200

-60 -40 -20 0 20 40 60

Banking angle (degrees)

050

100150200250300350

0 50 100 150 200 250 300

Speed (km/h)

0102030405060708090

100110

-50 -40 -30 -20 -10 0 10 20 30

Accelleration (m/s²)

050

100150200250300350

(c) Banking

Fig. 1: Shown are three states from an analyst using a dc.js web application examining data captured from two motorcyclists racing around atrack. One of the riders is a professional racer (a) and one is a journalist (b). In the course of investigation, the analyst decides to examine howmuch speed the racer was carrying as he held a steep banking angle (c). While such web apps enable useful analyses, the visited states and pathsof reasoning are generally lost when the analyst ends the session.

having to remember every step of their analysis. For example, Fig. 1shows three thumbnails of the many states an analyst might encounterwhen comparing the performance of two motorcycle racers. Thesethumbnails provide markers in a user’s sensemaking process.

While provenance has been shown to have value in a variety ofsettings and for a variety of purposes [49], there has been less workon designing web frameworks to support the analysis process. Clearly,many of the principles from desktop applications or workflows can beleveraged on the web, but the web also presents its own challengesand opportunities. Among these, collaboration is an important benefitthat the web offers. Instead of users working alone, networked envi-ronments open many opportunities for analysts to discuss insights anditerate with others on new ideas. However, the web also exacerbatesissues that exist in desktop environments. Specifically, security andprivacy are of paramount importance especially in a sensitive analysis,and as applications receive and send data, it can be difficult to knowwhat information is being exposed. With the frantic pace of web devel-opment, the longevity of a web application can suffer due to librariesgoing unsupported or rapid changes in languages. Our goal has been todevelop a framework that supports provenance for visual analysis in theweb environment, and we have identified desiderata both by examiningprovenance in other contexts as well as current web applications andframeworks.

Reflection We wish to support self-reflection at different timescales. An important feature for reflection is the ability to revisit paststates, including states that were just created, as part of an analysis. Theweb offers many features, but undo and redo are not standard featuresoutside of text entry fields. When these features are offered, it is oftenas checkpoints that are stored by the server. A second scale is revisitingwork from hours or perhaps days ago. Here, the goal may be to pickup where one left off or to return to a known base for new exploration.Finally, returning to archived provenance can aid a user who months oryears later needs to recall how a particular result was achieved. Here, itmay be more instructive to step through multiple states to examine theprocess used rather than simply locating the final result.

Collaboration We wish to support collaboration at different levels.While self-reflection is important, collaboration has become a focusin today’s interconnected world. The web allows work to be postedfor anyone to examine, but in the context of analysis, it may be morelikely for users to share with a smaller group or even just one otherperson. For smaller, more controlled environments, a user should beable to control the data, storing it locally rather than posting it on aserver. However, collaboration at scale brings about many interestingopportunities. If many users are analyzing a specific problem, all withinthe same application, we can start to build collective intelligence ina more automated fashion, identifying common paths of inquiry aswell as unique results. Instead of users all beginning at the same placeand performing similar first steps, it is possible for them to jump in atexisting waypoints and extend those. Another aspect of collaborationis whether it is shared or independent. In other words, should oneuser’s work impact all others, or should users simply be notified of

each others’ progress? Shared collaboration produces a single resultand works well when different aspects can be analyzed at the sametime, but if users are constantly getting in each other’s way, this can beproblematic. Collaboration with independent views is usually enabledby notifications of changes that a user may choose to examine andperhaps integrate, but that user may also continue to work withoutthe same type of interruption that active shared collaboration mayintroduce.

Streaming Data Much analysis is still carried out on staticdatasets, but with tasks that demand immediate examination even ifthe data has not all loaded, or on datasets that are too large to open atonce, or when current data is more important that past data, we needto support analysis on changing data. Streaming data includes databeing generated in real-time, but data that is loaded incrementally inbatches is currently a more likely use case. Here, revisiting a past stateis fraught with issues. The data may not be available any longer, oran analysis on past data may not make sense when examining morecurrent data. We seek a provenance framework to allow past states tobe recalled in a manner that is either consistent with past data or can beupdated for current data.

Interface Features Users need interfaces where navigation isfeatured to explore and make use of provenance. These interfacesshould support common tasks like undo and redo but also allow jumpsback to known states. To permit such jumps, it is important that one canidentify states, and this often requires some description or depictionof that state. At the same time, understanding the relationship of onestate to another is important. Were two results derived from a similarwaypoint; was another developed at the same time?

Recording Insights Users need to be able to provide their owninsights. Clearly, if a machine was able to perform all of the tasksinvolved in sensemaking, there would be no need for humans in theloop, but this is not the current situation. Instead, we need to makesure that users can relate their own insights to the structures and dataautomatically captured. If a particular state is important, allow the userto tag and comment on it.

Application Integration We want any provenance framework tobe general enough to integrate with different styles of applications. Ofcourse, the time involved in adding collaborative provenance capabili-ties can be prohibitive, but we believe that the core requirements areidentifying the actions or states that should be captured and being ableto serialize and materialize them. In many cases, the visual analyticsapplication does not generate significant state with respect to the un-derlying data or analysis. In other words, because a user drives thechanges, the provenance is purely user-generated which means that thisdata need not grow as significantly as the provenance of a distributedcomputation that is capturing each function call and parameter setting.

Storage & Running Time Another important consideration isthe minimization of time and space constraints needed for provenance.While fully client-side web applications have not always been usedfor involved analyses, this may be partially due to the need for tools

Page 4: Enhancing Web-based Analytics Applications through …faculty.cs.niu.edu/~dakoop/pubs/simprov.pdfEnhancing Web-based Analytics Applications through Provenance Akhilesh Camisetty, Chaitanya

Yearly Performance (radius: fluctuation/index ratio, color: gain/loss)

19861987

198819891990

19911992

19931994

1995

1996

19971998

1999

-400 -200 0 200 400 600Index Gain

-100%-80%-60%-40%-20%0%20%40%60%80%100%120%

Ind

ex

Ga

in %

GainOrLoss

Gain(58%)

Loss(41%)

Quarters reset

Q1

Q2

Q3

Q4

Day of Week reset

0 200 400

Mon

Tue

Wed

Thu

Fri

Days by Fluctuation(%)

-25% -20% -15% -10% -5% 0% 5% 10% 15% 20% 25%0

50

100

150

200

250

300

Monthly Index Abs Move & Volume/500,000 Chart

1986 1988 1990 1992 1994 1996 1998 20000

200400600800

1,0001,2001,4001,6001,8002,0002,2002,4002,600

Monthly Index Average

Monthly Index Move

1986 1988 1990 1992 1994 1996 1998 2000

1,038 selected out of 3,580 records

Nasdaq 100 Index

Date Open Close Change Volume

1986/01

01/06/1986 130 130 0.00 992400

01/07/1986 130 132 2.00 1275000

01/13/1986 129 129 0.00 894900

01/14/1986 129 129 0.00 1029000

01/20/1986 132 131 -1.00 894300

01/21/1986 131 131 0.00 1107000

01/27/1986 130 130 0.00 1105000

01/28/1986 130 131 1.00 1154000

1986/02

02/03/1986 132 134 2.00 1076000

02/04/1986 134 134 0.00 1253000

Reset All

H

Õ

*

,

ǩ

¢

µ

(a) Persistent View

Yearly Performance (radius: fluctuation/index ratio, color: gain/loss)

19861987

198819891990

19911992

19931994

1995

1996

19971998

19992000

2001 2002

2003

200420052006 2007

2008

200920102011

2012

-1,200 -1,000 -800 -600 -400 -200 0 200 400 600 800Index Gain

-140%-120%-100%-80%-60%-40%-20%0%20%40%60%80%100%120%

Ind

ex

Ga

in %

GainOrLoss

Gain(55%)

Loss(44%)

Quarters reset

Q1

Q2Q3

Q4

Day of Week reset

0 200 400 600 800 1,000

Mon

Tue

Wed

Thu

Fri

Days by Fluctuation(%)

-25% -20% -15% -10% -5% 0% 5% 10% 15% 20% 25%0

100

200

300

400

500

600

Monthly Index Abs Move & Volume/500,000 Chart

1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006 2008 2010 20120

5001,0001,5002,0002,5003,0003,5004,0004,5005,000

Monthly Index Average

Monthly Index Move

1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012

1,962 selected out of 6,724 records

Nasdaq 100 Index

Date Open Close Change Volume

1986/01

01/06/1986 130 130 0.00 992400

01/07/1986 130 132 2.00 1275000

01/13/1986 129 129 0.00 894900

01/14/1986 129 129 0.00 1029000

01/20/1986 132 131 -1.00 894300

01/21/1986 131 131 0.00 1107000

01/27/1986 130 130 0.00 1105000

01/28/1986 130 131 1.00 1154000

1986/02

02/03/1986 132 134 2.00 1076000

02/04/1986 134 134 0.00 1253000

Reset All

H

Õ

*

,

ǩ

¢

µ

(b) Updating View

Fig. 2: With streaming data, a user often begins analysis without the full dataset. Here, in an analysis of the NASDAQ stock index, when the userexamined data until 2000, there was a strong upward trend, but the full dataset shows this trend does not continue. We advocate for provenancethat both allows a user to return to the exact view they examined when originally analyzing the data–the persistent view (a), and allows a user toperform the exact same analysis using the full dataset–the updating view (b).

that help users keep track of their progress and process. As theseapplications become more complex, they may generate more complexprovenance data that consumes more space and takes more time tomaterialize. Here, we can examine techniques that use storage tomaintain checkpoints that are faster to load than replaying changes.

4 SOLUTIONS

Now, we seek to address these desiderata and focus on some ideas thatextend current practices. For streaming data, we propose two modesfor revisiting and replaying provenance information. We also exploreusing the same provenance information to support both asynchronousand synchronous collaboration. We examine strategies to improve theefficiency of application integration as well as to improve the efficiencyfor users via a storage-running time tradeoff.

4.1 InterfaceAn important aspect of the interface is that it supports not only capturebut allows users to explore the data as it is captured. This becomes moreimportant when streaming data and collaboration are involved as morevariables complicate a user’s mental map of the analysis. Furthermore,the interface should serve as an additional overlay or integration withexisting web applications, not a replacement or container for them.In this sense, the interface should be minimal but expand to providehelpful overviews of the provenance including tables of past actionsand trees or relationships between states.

In order to support reflection, we see thumbnails as an importantingredient to aid users looking to discover past work [35,54]. However,a thumbnail that shows the entire view may be of limited utility whenchanges are relegated to a particular region of the page. Thus, wemight allow developer configuration of how thumbnails are capturedfrom a particular application. Specifically, there may be three types ofthumbnails: overview, action-focused, and result-focused. An overviewthumbnail shows the entire visualization in its current state. Action-focused thumbnails, then, focus on what the user changed (e.g. achange in selection) while result-focused thumbnails show the regionof the visualization that was updated as the result of the action.

4.2 Streaming DataA streaming data application is as an application where the data is notloaded in a single pass [13]. Importantly, this means that not all of thedata has been loaded when the interface is first shown, and the interfacecontinues to update as more data is ingested. Note that streaming dataapplications may be able to access all of the data that eventually comesin, or they may choose to purge the raw data as filtering or aggregationis completed. In some cases, the data is too big to all be stored, andin other cases, the data becomes stale and thus less useful for onlineanalytics.

Crouser et al. identify some challenges that visual analytics faceswith streaming data, including the importance of recognizing and un-derstanding changes in insights and decisions as the data changes [13].Because the data keeps changing as the user analyzes it, that user canquickly face cognitive overload as they try to keep track of and com-pare new features with views in the past. Because the user is engagedin analysis, they are updating views while the streaming data is alsoenacting change on the same views. To this end, support for recallingpast views as they were originally seen is important. To support this,we can associate an order and identifier to the chunks of data as theycome in and bind the most recent chunk identifier in each provenanceaction as it is captured.

If an application has the entire dataset available and can recall paststates of the analyses, analysts also have an interesting opportunity tocompare that view from the past with an updated version that reflectsthe current reservoir of data. The past view is the persistent, set-in-stonesnapshot of the analysis when decisions were made, but an updatingmode changes as new data is made available, as if the user had stoppedanalysis and watched as the application’s views updated as new datacame in (see Fig. 2 for an example). However, this non-persistent modeshould allow the user to revisit any previous state and apply the latestdata to it. This enables reflection on whether the additional, perhapsmore recent, data changes the analyst’s perspective on any past insights.

4.3 Collaboration

Our goal is to facilitate both asynchronous and synchronous collab-oration through the same mechanism–provenance. Synchronous col-laboration can take advantage of a wide swath of conferencing toolsand screen-sharing software, while asynchronous collaboration can besupported through various serializations, but we believe that standardiz-ing the mechanism will allow users to transition between modes moreeasily. Shared views can be helpful in initial analyses, but independentwork that can later be examined could allow a greater diversity of ideas.Because provenance is stored in the same way for both modes, it iseasy to switch between the modes. Fig. 3 shows how two users might

Name Collaboration Sharing/Privacy Technology

AsyncUser Asynchronous User-Controlled FileAsyncPub Asynchronous Public Web FileSyncPriv Synchronous Private Network Ad-hocSyncOrg Synchronous Organization Network Hosted

Table 1: Different collaboration strategies that can be supported in aweb application

Page 5: Enhancing Web-based Analytics Applications through …faculty.cs.niu.edu/~dakoop/pubs/simprov.pdfEnhancing Web-based Analytics Applications through Provenance Akhilesh Camisetty, Chaitanya

Interacts

User-1 InteractiveVisualization

Provenance is recorded as user interacts with

the visualization

Provenance Storage

Records User Action

Provenance Data

Exports Provenance in

JSONRecorded provenance can be exported to gist

or saved locally

Shares provenance with User-2

User-2

Loads provenance back into

visualization using SIMProv

InteractiveVisualization

Initial S tate

F requencyOrdering

S wapped76...41

C lusterOrdering

S wapped 4...6

S wapped18...21

S wapped42...47

S wapped14...21

S wapped17...20

S wapped36...28

x894 March 30 2016,6:16:50 pm

c7f5 March 30 2016,6:17:16 pm

f190 March 30 2016,6:17:28 pm

j497 March 30 2016,6:17:44 pm

jaf7 March 30 2016,6:17:50 pm

g15c March 30 2016,6:18:08 pm

ta76 March 30 2016,6:21:25 pm

T humbnail Id R ec orded A t A c tions

Navigates to particular version from Povenance

Gallery

Further interacts with visualization and records new

versions

Exports New version of

Provenance

Sends it back to User-1

Loads new provenance back into visualization

Here user can see versions recorded by

both User-1 and User-2

Gallery View allow users to view past explorations and

navigate between them using inverse/forward actions and

checkpoints

Provenance Storage persisting both old and new versions

User-1

Provenance Data

InteractiveVisualization

Fig. 3: Two users leverage a provenance framework to asynchronously collaborate on work in a web application.

asynchronously collaborative with a provenance-based solution.Especially with the proliferation of personal information stored and

transferred about the web, we must be wary about how provenance isshared and allow users to maintain privacy when desired [59]. To sup-port this, we seek to capture all information locally; provenance shouldnot travel outside a user’s own machine without their consent. This ispossible through the use client-side browser storage like IndexedDB;the application and dependent libraries can be loaded across the web,but the captured provenance stays in a local database. To support col-laboration, we allow this data to be exported in JSON format to a fileor directly as a web-hosted file that can be immediately shared. Table 1shows different strategies and technologies to enable collaboration andwhat the privacy characteristics are.

A final consideration is the join between streaming data and collabo-ration. As streaming data presents challenges beyond standard visualanalytics, collaboration may be useful in this context. Specifically, ifone user has done analysis on a subset of the data, and another userrevisits that analysis in the non-persistent mode, the second user may,like the first user, be able to gain different insights. More interestingly,if the data streaming is not uniform, different users may see views ofthe data, allowing a collaboration to more quickly identify insights.In the synchronous case, this becomes problematic because the prove-nance needs to store the state of the stream at each action, but it couldbe possible to collaborate in a non-persistent mode where users seedifferent views of the data but adhere to the same action history.

4.4 Efficient Storage and AccessThe storage of provenance in a version tree builds on the concepts intro-duced in VisTrails including change-based provenance [21]. We extendchange-based actions to normalize the data involved in a change as itmay be used in multiple changes (e.g. a forward and inverse change).In addition, we distinguish changes that update the visualization fromthose that update state. Finally, the materialization algorithm is ex-tended to consider the costs of each change when evaluating potentialpaths.

4.4.1 DefinitionsA version tree is tree where each node is a version–a state–of the entitybeing tracked by the tree, and each edge signifies that a child node wasderived from its parent node. In contrast to undo stacks, version treespersist all versions, adding branches where multiple versions have beenderived from the same starting state. An edge contains informationabout how a version is transformed. Each change is a function that useschange data to transform one version into another. Ideally, the changedata specifies only the information that changed from one version to theother, but it could, for the simplicity of implementation but at the costof efficiency, encode the entire state. Inverse changes support efficientundo capabilities, encoding a transformation up the tree. The root nodeof the version tree must have an associated reset method that resets an

application to its original state; using this and the sequence of changesalong a path from the root, we can derive any other version.

A change is a specific type of transformation that mutates one versioninto another. The focus is on the transformation—how the state isupdated—rather than exactly how a user enacted it. Formally, givena version v and a change c, c(v) should produce a new valid versionv′. Where a change specifies the type of transformation, the changedata specifies the data and parameters that a change uses to construct aderivative version. Ideally, the change data and changes allow efficientencoding of the transformation between states, but we aim to supportthe range of possibilities because inefficient provenance can often stillbe a good starting point. With (forward) changes, going backwardto previous states requires reseting the visualization to its startingstate and replaying actions until the desired state is reached. Inversechanges are changes that go from a child version to a parent version–the“undo” direction. Note that in some cases, we can share change dataacross changes, even more so with inverse actions since their data oftenoverlaps with a corresponding forward action.

While encoding the entire state of an application as the change datafor every change may be inefficient, having such information wouldallow us to skip replaying entire change sequences in order to get toa particular version. Instead, we simply update the application’s stateaccording to the data stored with that version. To take advantage of thisshortcut more generally, we use a checkpoint for a version that encodesthe full state for that version. More concretely, given a checkpointfunction C and checkpoint data dv for a version v, C(dv), no matterwhat the current version is, will update the visualization to the versionv. We also allow developer-defined checkpoint rules that specify theconditions for creating a checkpoint. Each rule is passed the versiontree and the current node, and returns a boolean indicating whether acheckpoint should be persisted.

Because checkpoints essentially shadow a web application’s state,we can also consider changes that act on that shadow state instead

Change A transformation from one version to anotherStateChange A Change that specifies the transformation of the

serialized application state of one version to the stateof another

ChangeData Data referenced by one or more ChangesCheckpoint Serialization of application state for a versionAction An encapsulation of Changes (change, inverse, stat-

eChange, stateInverse), corresponding ChangeDataobject(s), an optional Checkpoint, and metadata(timestamp, thumbnail, etc.)

Trail An encapsulation of all Actions and the correspond-ing version tree

Table 2: Terms related to the provenance representation

Page 6: Enhancing Web-based Analytics Applications through …faculty.cs.niu.edu/~dakoop/pubs/simprov.pdfEnhancing Web-based Analytics Applications through Provenance Akhilesh Camisetty, Chaitanya

A B C D

E

I

FH

K

J

G

1 23

4

Fig. 4: Given that we are currently at the N node in the above versiontree and wish to jump to the F node, there are a number of poten-tial paths depending on whether only forward changes (1), forwardchanges and inverses (2), checkpoints and forward state changes (3),or checkpoints and inverse state changes (4), are available. Generally,implementing more features leads to more efficient switches betweenversions.

of updating the application directly. Specifically, we define a statechange, cS, as a change that takes a checkpoint corresponding to aparticular version v and returns a new checkpoint corresponding toanother version v′. Note that SIMProv.js allows users to define botha state change and an inverse state change in addition to a non-statechange and inverse. Often, updating shadow state can be more efficientbecause there are no interface updates that need to be processed untilthe final state is loaded. In addition, state changes may be chunked [29],and these chunks compressed.

Finally, an action encapsulates all data associated with a node andthe edge to its parent in version tree. It includes at least one change thatspecifies how the state represented by the parent node is transformed tothe state represented by the given node. However, it may also includeany of an inverse change, a state change, and an inverse state change. Itmay store an optional checkpoint that represents the state correspondingto the node. Importantly, changes may share change data. If the inverseof adding a particular filter is removing it, the specification of thatvalue may serve as the change data for both the change and the inversechange. It may also serve as the change data for state and inverse statechanges.

4.4.2 Materializing Versions

A key operation on version trees is the materialization of a specific ver-sion, transforming the application to that version’s state from whereverwe happen to be in the version tree. Because we may have both forwardand inverse changes as well as the state-changes of both types, there area number of potential paths to reach the desired state (see Fig. 4). At aminimum, we need a reset callback and forward changes to return toany version. Specifically, we can reset and then run all changes alongthe path from the root to the desired node (Path 1 in Fig. 4). Withinverses, we add the ability to follow the inverses back to a commonparent and then down to the desired node (Path 2 in Fig. 4). Withonly checkpoints and forward state changes (no inverses), we have theability to transform the shadow state from the checkpoint to the desirednode (Path 3 in Fig. 4). Finally, if we add inverse state changes, we canutilize nearby checkpoints even if they are not ancestors of the targetnode (Path 4 in Fig. 4). Note that even with state changes, inverses, andcheckpoints, we still may find that returning to the root node may be anefficient solution.

The cost of each strategy depends on a number of factors and mayalso be impacted by the tradeoff between efficient materialization andstorage costs. In order to determine the most efficient method to obtaina target version v′ from the current version v via provenance, we mustfirst consider various starting nodes and then the paths from those nodesto v′. Besides the current node, the root node and checkpoints are alsovalid starting points. We use a breadth-first search from the target nodeto identify the nearest checkpoints. Due to inverses, a checkpoint maybe descendant of v′ in addition to an ancestor. Each checkpoint has an

associated cost of materialization, and reseting the application to itsinitial state has a cost as well. Next, we calculate the path from eachpotential starting node to v′ and check it for both “normal” change andstate change sequences. First, we check that the necessary inversesexist. Recall that if there are no inverse changes defined, we may stillmaterialize from the root node. Given a valid sequence of changes, wecalculate the associated cost. Each change has an associated cost, andthus, each path has a total cost based on the sums of the change costs.The minimal cost path is selected from all possible starting nodes.

5 PROTOTYPE FRAMEWORKS

We have completed two prototypes of the proposed framework. Thefirst prototype focused on showing that provenance could be capturedand navigated in web applications via a plug-in style system. It had athumbnail strip, gallery view, and annotation panel. It also supportedundo and redo, and integrated a change-based model to track and replayprovenance. The second prototype focused on support for modernweb applications, featuring asynchronous calls and integration withIndexedDB storage so that provenance could be robustly saved andmaintained even when a user navigated away from an application.Behind the scenes, we rewrote code to better delineate between thecore features and the user interface. In addition, this prototype focusedon the challenges of streaming data and real-time collaboration.

One goal of the SIMProv.js framework is to allow developers flex-ibility in capturing provenance information and users flexibility inpersisting, sharing, and using it. Because a general approach for inte-grating provenance in interactive web applications cannot cover everypotential situation, we provide developers with tools to facilitate prove-nance capture and let them designate both where and how provenanceis captured. SIMProv.js provides features for persisting provenance,capturing thumbnails of the current state, and storing annotations. Inaddition, it provides capabilities for importing, reviewing, and replay-ing past provenance information. Since JavaScript is the language ofthe web, SIMProv.js is written in JavaScript, building on the growingstack of existing technologies and libraries.

5.1 InterfaceWhile the ability to capture and persist provenance is clearly important,our goal for the framework is to make the provenance useable. Thisrequires interfaces that permit navigation, browsing, and annotation.We provide a set of panels and widgets inline in the library so userscan inject them (along with proper style rules) into existing pages. Thetoolbar interface is a widget that can be relocated about the page. Fig. 5shows the toolbar and the associated panels. The toolbar provides undoand redo buttons, an annotation panel, a scrollable thumbnail view,and a menu to access methods to store and import provenance data. Italso provides access to a full-screen gallery that shows a table withdetailed information about the versions and thumbnails (see Fig. 7) anda tree showing the relationships between different versions. In addition,the interface provides two submenus to facilitate streaming data andsynchronous collaboration.

Thumbnail Strip Because thumbnails allow quick decisions aboutwhich versions to return to, we provide a thumbnail panel that has allof the versions along the current path (just like with undo and redo). Auser can select a thumbnail to materialize the version associated with it.When hovering over a particular thumbnail, a tooltip displays amountof time that has passed since that version was created in a human-readable form (e.g., “a few seconds ago”) along with a string describingthe change. For visualizations built entirely in canvas or a singlesvg, capturing these thumbnails is fairly straightforward. However,many applications use multiple graphical views and also encode data inHTML elements. Thus, we also need to be able to capture entire regionsof the page. In addition, with transitions, we must often wait to capturethe thumbnail until the transition has ended; sometimes this results inempty thumbnails if a user moves quickly. We use rasterizeHTML.js [9]as a helper library in these cases because it works to respect style rulesat all levels. Indeed, even when trying to capture a simple svg, stylerules that reference parent elements do not work correctly, leading tomisrendered thumbnails.

Page 7: Enhancing Web-based Analytics Applications through …faculty.cs.niu.edu/~dakoop/pubs/simprov.pdfEnhancing Web-based Analytics Applications through Provenance Akhilesh Camisetty, Chaitanya

Created a few seconds ago by dakoopChan

ged tab to ( #btnRevenues )

( ktn3rtgn )

Add ( m8ql9jk6 )

Add ( d240othv )

Add ( 1xiaeni3 )

Add ( o5bw2vf6 )

Add ( au8yx5sg )

Tab ( dvjk4ntq )

Add ( ie71zq81 )

Tab ( ktn3rtgn )

Add ( yyt1nc4r )

Tab ( uj7oubcu )

Tab ( t8l4tluz )

Tab ( w29thaxn )

Tab ( 9h0aywi9 )

Initial ( lxtwdzwp )au8yx5sg

At March-31-2018

On 04:07:13

By dakoop

o5bw2vf6

At March-31-2018

On 04:07:17

By dakoop

1xiaeni3

At March-31-2018

On 04:07:23

By dakoop

d240othv

At March-31-2018

On 04:07:47

By dakoop

m8ql9jk6

At March-31-2018

On 04:07:51

By dakoop

Thumbnail ID Captured Operations

Thumbnail Strip Gallery & Version Tree View

Annotation Panel

yyt1nc4r

ie71zq81

o5bw2vf6

ID Annotate

In 2012, the system saw less money fromthe state

AsynchronousCollaboration

SynchronousCollaboration

StreamingOptions

UndoRedo

Persistent

Updating

Fig. 5: The SIMProv.js toolbar and associated panels. The toolbar provides undo and redo functions, panels for comments and navigation bythumbnail, and a submenu for importing and exporting provenance. It also provides a gallery overlay that shows the table and tree view withlinked highlighting.

Gallery & Tree Panel Accessible from the toolbar, the gallerypanel shows a table with versions as rows as well as a node-link dia-gram showing the version tree. These views complement each otherbecause the table allows quick visual browsing while the tree makesthe relationships and derivation history for specific versions clearer.In addition, the views are linked so that a user can quickly locate aversion from the table in the tree and vice versa. For each version, thetable view has a thumbnail, identifier, timestamp and icons that allowa user to locate the version in the tree, and view and edit annotationsassociated with the version. The tree view shows the textual descriptionof the change and upon hovering over a particular version, highlightsthat version in the table view. Clicking a node loads the version. Theview can be zoomed and panned, and when a version is located fromthe table view, the view automatically pans to show the linked node.

Streaming Data The second-to-last menu item concerns stream-ing data. Most importantly, the corresponding submenu contains atoggle between the persistent and updating (or non-persistent) modes.This allows a user to dynamically configure whether they should revisitpast states as they were originally viewed or with the potential lightnew data brings to them. In addition, users can check the amount ofstream data downloaded and configure associated streams.

Synchronous Collaboration The final menu item contains a sub-menu for synchronous collaboration. In order to allow many users tocollaborate without the need to install a server, one user can serve asthe host for a collaboration session and provide keys for users that wishto join the shared view. A key is required to access the collaboration,providing security for the analysis. The collaboration is facilitated bypushing provenance between users so even if a user is disconnected,they can review past states and potentially resync with the collaborationwhen they are connected again. During synchronous collaboration,only the main view is shared. Users can independently add annotationsor review past states without affecting others’ work.

5.2 Designing CaptureA key concern in creating useful provenance is having a design that cap-tures meaningful changes at a reasonable granularity. Thus, a developermust identify the transformations that must be captured and the listenersor callbacks that can be utilized to do so. If using checkpoints and statechanges, the developer needs to identify a schema for that state. Often,consulting the application’s own representations, if they exist, is usefulhere. Then, representations for the change data and changes must bedefined. Identifying overlaps between changes and their inverses orstate-based counterparts lead to more efficient representations. Notethat the change data must be serializable regardless of whether it isbeing used for a state change.

SIMProv.js’s framework focuses on helping users design actions tofacilitate capture and replay. An action can include both forward and in-verse data to facilitate inverse changes, and the developer is responsibleboth for determining a proper serialization of the information neededto revisit a state from the previous one and methods to take this dataand enact the change to move between states. To facilitate faster speedat the cost of storage, a developer can also specify a checkpoint methodthat dumps all of the state along with a corresponding updateStatemethod to rematerialize that state. Fig. 6 shows how a new action tosupport tracking changes in the tabs of a multi-tab analytics applicationwas developed.

5.3 Persistence and Collaboration

As a client-side framework, SIMProv.js stores its provenance in thebrowser’s local storage, and does not send information back to a server.This was a key change in our revised framework, as the persistingprovenance used to require that the user export it from memory, butnow provenance is written to browser storage after each action. Notonly does this reduce the burden on the server, eliminating the need tostore everyone’s provenance, but it also puts the user in control of dataderived from their interactions. If a user wishes to remove provenanceinformation, SIMProv.js provides a delete database method to clear outall provenance information. Note that we will need to alert the userif space in browser storage is running low. If a user wishes to keepthe information private, SIMProv.js can archive it to a local disk as aJSON file for future review. However, users may also choose to sharethat file or directly upload their provenance to a server where theircollaborators or perhaps the public can view it. Collaborators can alsowork synchronously using a serverless architecture where provenance isshared between clients as it is generated. For streaming data used in anapplication, it may be infeasible to store the data with the provenance.However, we can tag each provenance action with the timestamp ofstream [17] eliminating the need to store all of it if the data can berestreamed or loaded at a later time.

5.4 Navigating and Organizing Provenance

As described, the framework provides a thumbnail strip as well as agallery that includes both a list view and tree view that with sharedhighlighting. A concern here is having too much information, andstrategies to address this include undo-as-delete [29], chunking [37],and pruning [36]. In addition, the organization of the provenance canaid in identifying states [40, 56]. In the first prototype, we highlightednodes in the thumbnail strip where branches existed, but the navigationproved difficult so it was removed. We do think there is a way toallow navigation from the strip in the future. The list view provides the

Page 8: Enhancing Web-based Analytics Applications through …faculty.cs.niu.edu/~dakoop/pubs/simprov.pdfEnhancing Web-based Analytics Applications through Provenance Akhilesh Camisetty, Chaitanya

configuration = { ...

actions: {Tab: actionTab , ...} }

...

function checkpoint() {...

state[’tab’] = $(’#uc-tab .active’)[0].id.slice(4); }

function updateFromState() {...

if (state[’tab’]) {$(’#btn’ + state[’tab’]).click(); }

function createTabAction(tab, prevTab) {actionData = {

forwardData: tab,

inverseData: prevTab,

type: ’Tab’,

inverseAction: ’Tab’,

information: ’Changed tab to ’ + tab};

simprov.acquire(actionData);

}

function actionTab(state, action, isState) {data = action.actionData;

if (isState) {state[’tab’] = data.forwardData;

return state;} else {$(’#btn’ + data.inverseData).click(); }

}

// use click events to create action

$(’#btnExpenditures’).click(function () {createTabAction(’Expenditures’, lastTab);

lastTab = ’Expenditures’;

});

...

Fig. 6: To add a new action to be captured, we need to edit methodsthat capture and update state, add new methods to create and enact theaction, and register the action with simprov.

opportunity to sort states by various criteria including visual thumbnailsimilarity or state differences. The version tree can also be reorganizedto facilitate exploration by similarity [36], but this impacts the abilityto see derivations. Because of our goal in maintaining the relationshipbetween views, we do not currently employ these strategies but plan toincorporate them in the future as we investigate the scalability of theframework.

6 CASE STUDIES

In order to evaluate the prototypes, we have focused on applicationscreated using the dc.js framework which builds visualizations on topof the crossfilter library [16]. However, with the first version, wealso explored how to use Vega to support provenance in the Polestarvisualization design tool [58]. An important factor in these exampleswas that we tried to not only instrument a single application but alsoexamine how multiple applications could be enabled by building abridge from those libraries to SIMProv.js.

6.1 Crossfilter ExplorationsWhen users can interact with a visualization, it allows them to exploretheir own ideas and scenarios in addition to the default settings orprescribed configurations. We examined three different applicationscreated in dc.js that facilitate these types of explorations. The NASDAQstocks example (shown in Fig. 2) allows a user to explore the perfor-mance of the stocks on that exchange over time. If that data is streamedin, examining only the data until 2000 could lead to very differentconclusions than those obtained by examining the data through 2012.Another application examined characteristics like speed, acceleration,and banking angle for two different motorcycle racers, one a profes-sional and the other a journalist (shown in Fig. 1). These attributescould be filtered by range which filtered the data shown on a map

Fig. 7: Using Polestar to investigate the cars dataset, one user createdsome initial visualizations tracking fuel economy over the years (bluerows), posted provenance as a gist, and a second user added moredesigns (white rows) to show relationships with other attributes.

of the track. Specifically, a user can investigate differences betweenthe professional and the journalist in very fine-grained situations (e.g.locations where the rider is banking left at low speed). A final applica-tion examines trends in revenue and expenses among the University ofCalifornia campuses from 2008-2012 (shown in Fig. 5).

To make the process of adding provenance to these applicationseasier, we created a bridge from dc.js to SIMProv.js, providing userswith not only the ability to undo and redo past filtering steps but alsotrack and compare different combinations of filters. This bridge can bereused with minimal configuration (a small template of code per chart)for many different dc.js visualizations. Crossfilter is a JavaScript librarythat allows users to quickly analyze large multivariate datasets [12], anddc.js provides methods to integrate its functionality with pre-definedcharts [16]. These libraries support filters for individual dimensions(attributes) of the dataset. dc.js maintains user-queryable filters andprovides notification of new or changed filters. Thus, we define aforward change as the addition or removal of a particular filter from aspecific chart. The inverse change is defined as the removal or addition,respectively, or the same. For ranged filters, this operation is notnecessarily boolean and may instead track a change from one range toanother. To create a checkpoint, we query the filters set on every chartand persist them. To materialize that checkpoint, we clear and set thefilters to exactly those that were stored earlier.

6.2 Collaborative Visualization DesignVisualization design is an iterative process, often requiring explorationand refinement. When working with a client or colleague, sharing thestate of the visualization is useful. It is also useful to share multipledesigns and show their relationships. Finally, a static view is not nearlyas useful as the ability to share working, interactive visualizations,where users may change and tweak shared views. Keeping track of thisdesign history may be important not only for reflection and learning,but also to spur future ideas. Consider a design that did not work wellfor a particular dataset but would be great for another. Being able toreview the aspects of the old design, though it was never publishedwould be very useful.

The Polestar application provides an interface for creating a vi-sualization from multi-attribute datasets, similar to applications like

Page 9: Enhancing Web-based Analytics Applications through …faculty.cs.niu.edu/~dakoop/pubs/simprov.pdfEnhancing Web-based Analytics Applications through Provenance Akhilesh Camisetty, Chaitanya

Tableau [58]. By mapping attributes to different visualization channels,a user can explore different visualizations of a dataset. Polestar pro-vides its own undo and redo capabilities by tracking specific variableswhen an event occurs as well as bookmarks where visualization speci-fications can be preserved in the browser’s local storage. However, itcurrently lacks ability to show the steps involved in creating a book-marked visualization–the iterative design process. While SIMProv.jsprovides its own interfaces, we were able to integrate the functionalityinto Polestar without adding a new layer to the interface. Undo and redoare then provenance-backed, and the gallery view and import/exportcapabilities can be included in Polestar’s toolbar.

As an example, when exploring the standard cars dataset, a usermight be interested in a relationship between fuel economy and modelyear. That user might try adding encodings of other attributes andthen decide to post the provenance on the web; SIMProv.js allowsusers to automatically upload the provenance as a gist on Github. Ifanother user downloads the provenance and examines some of the pastvisualizations, she may choose different techniques like small multiplesor different attributes to emphasize like country of origin. As otherusers post their updates, the original user can see how they have builton his ideas and iterate on them. Again, the derivation of the designor those designs that were discarded may be relevant to others, andSIMProv.js seeks to preserve this provenance. See Fig. 7 for some ofthe provenance produced by this exploration.

7 DISCUSSION

SIMProv.js provides an initial solution for capturing and using prove-nance from web-based visual analyses. There are a number of potentialissues that should examined both when including provenance and im-proving the support for and utility of the provenance. For provenance tobe of use years or even days in the future, the data and the specificationof the interactive application must be available in the form they existedwhen the original analysis was performed. That use of provenancemay allow not only the replay of past work but also analysis of userdecisions.

Updates A clear concern when storing provenance is the evolutionof the interactive application or the data used. If a web developer fixes abug in a visualization or changes a variable name, past provenance maynot load or may produce a different result than when it was generated.Schemes that were used for versioning web services may be usefulhere [39]. Another option is to archive the page and its resources (thedata and dependent libraries) as part of a provenance bundle. If thedata comes from a curated store, it alleviates the need for maintaininganother copy, and for this reason, the choice on what to include can beleft to the user. Similarly, some library versions may not need to beexplicitly included if they exist in a centralized archive. While contentdelivery networks (CDNs) help centralize standard libraries, they canalso serve to archive past versions. A final option is for the applicationdeveloper to provide upgrade schemes that translate past provenancefrom a previous version of the application to the latest format. Forrenaming or reorganization, helper routines can mitigate the requiredeffort here.

Understanding Users Existing work [41, 50] has shown that pro-viding users with more information about what they have explored andthe insights already gained improves their ability to analyze data. Ourconcern in evaluating our framework is trying to understand provenance-specific concerns. Specifically, we are curious how the various presen-tations of provenance (thumbnails, list view, tree view) aid in recalland navigation. Additionally, we plan to evaluate strategies that usersemploy to revisit past states, known and unknown, after a few minutesversus a few days. For streaming data, we are interested in understand-ing when users toggle between persistent and updating modes, and howthis affects analyses.

Developer Cost A brief survey of prototypes and demos fromVIS 2017 projects that were made publicly available [47] showed fewallowed users to persist states, and we believe one reason for this isthat the cost to enable provenance from scratch is high. The reasonwe examined examples from dc.js and vega is that they have fairly

standard interfaces for responding to specific changes. For dc.js, wewere able to define a reusable base of changes in a provenance bridge.The bridge was used for the case study to enable three different dc.jsvisualizations. Configuring the differences meant identifying the chartsand their types, and providing label information. While vega provideda uniform method for listening for signals, understanding exactly whichdata to capture, how to inject state, and the dependencies betweensignals required a deeper understanding of the underlying visualizationdesign. We believe that if developers build visualizations with prove-nance in mind, greater support in the libraries will help simplify thelinks between the visualizations and provenance mechanisms.

Referenced Data Another potential issue developers may face inextending an existing application to support provenance is a relianceon in-memory semantics. For example, suppose there is a check onwhether a specific object exists in a particular container; if that objectwas passed from the container by reference and eventually ended upin the data, we expect that such a query would return true. However,if that object was deserialized from provenance change data and thecontainer was deserialized from a provenance checkpoint, the normalcomparison of those objects will show them as dissimilar. A relatedissue is that referenced data in an action or checkpoint may be modifiedby later actions. In order to guard against this, we clone the object sothat it will not be mutated before the provenance is persisted. Again,this may introduce the issue where two clones of the same object areviewed as unequal instead of the same, but we have not yet encounteredthis. We recommend not relying on any operators that require equalitychecks based on memory location for provenance-enabled applications.

8 CONCLUSION AND FUTURE WORK

We have examined a number of aspects that important for users to cap-ture, recall, store, and share provenance information from web-basedapplications. As more work is done in client-side web environments,the ability to capture and preserve the explorations is increasingly im-portant. We believe that large-scale client-side web visualizations andanalytic environments are rare because web provenance frameworks donot exist. We see bridges to existing frameworks as a path to increaseadoption. A benefit of client-side web environments is that they protectprivacy in a medium where that is usually not expected. If the initialspecifications come from a server, a user can take the framework, plugin their own data, and capture provenance without it being reportedback to a server. At the same time, users may choose to share prove-nance with a colleague by sending the provenance file or share it withthe world by posting it to a common site or potentially the server itself.They can also use provenance to facilitate synchronous collaborationand deal with the cognitive burden that streaming data brings.

While SIMProv.js provides an initial prototype, we believe that theinitial work with streaming data may be interesting to apply to pro-gressive computations [53], where a user’s steering might interruptcomputations before they finish. In those cases, what should the prove-nance record? Clearly, we can record the point at which a computationwas interrupted or a new computation begun, but how do we capturethe state of the data or visualization at that point?

More opportunities come from the wealth of data that can be pro-duced by capturing provenance. Especially for focused applicationswhere many users may interact with a visualization (e.g. interactivevisualizations published by the media), techniques to aggregate oranalyze the data will certainly be useful in understanding trends orpatterns. Another opportunity from the data is to try to extract some ofthe higher-level insight or task provenance which might be aided by alarge amount of data [15].

REFERENCES

[1] S. Alimadadi, A. Mesbah, and K. Pattabiraman. Understanding asyn-chronous interactions in full-stack javascript. In Proceedings of the 38thInternational Conference on Software Engineering, ICSE ’16, pages 1169–1180, New York, NY, USA, 2016. ACM.

[2] Backbone.undo.js. http://backbone.undojs.com.[3] S. K. Badam and N. Elmqvist. Polychrome: A cross-device framework

for collaborative web visualization. In Proceedings of the Ninth ACM

Page 10: Enhancing Web-based Analytics Applications through …faculty.cs.niu.edu/~dakoop/pubs/simprov.pdfEnhancing Web-based Analytics Applications through Provenance Akhilesh Camisetty, Chaitanya

International Conference on Interactive Tabletops and Surfaces, ITS ’14,pages 109–118, New York, NY, USA, 2014. ACM.

[4] A. Bezerianos, F. Chevalier, P. Dragicevic, N. Elmqvist, and J. D. Fekete.Graphdice: A system for exploring multivariate social networks. InProceedings of the 12th Eurographics / IEEE - VGTC Conference onVisualization, EuroVis’10, pages 863–872, Chichester, UK, 2010. TheEurographs Association & John Wiley &; Sons, Ltd.

[5] Bokeh Development Team. Bokeh: Python library for interactive visual-ization, 2015.

[6] S. Breslav, A. Khan, and K. Hornbæk. Mimic: Visual analytics of onlinemicro-interactions. In Proceedings of the 2014 International WorkingConference on Advanced Visual Interfaces, AVI ’14, pages 245–252, NewYork, NY, USA, 2014. ACM.

[7] K. Brodlie, A. Poon, H. Wright, L. Brankin, G. Banecki, and A. Gay.Grasparc-a problem solving environment integrating computation andvisualization. In Visualization, 1993. Visualization ’93, Proceedings.,IEEE Conference on, pages 102–109, Oct 1993.

[8] B. Burg, R. Bailey, A. J. Ko, and M. D. Ernst. Interactive record/replayfor web application debugging. In Proceedings of the 26th Annual ACMSymposium on User Interface Software and Technology, UIST ’13, pages473–484, New York, NY, USA, 2013. ACM.

[9] C. Burgmer. rasterizeHTML.js. http://cburgmer.github.io/

rasterizeHTML.js/.[10] Y. Chen, J. Alsakran, S. Barlowe, J. Yang, and Y. Zhao. Supporting effec-

tive common ground construction in asynchronous collaborative visualanalytics. In 2011 IEEE Conference on Visual Analytics Science andTechnology (VAST), pages 101–110, Oct 2011.

[11] H. Chung, S. Yang, N. Massjouni, C. Andrews, R. Kanna, and C. North.Vizcept: Supporting synchronous collaboration for constructing visual-izations in intelligence analysis. In 2010 IEEE Symposium on VisualAnalytics Science and Technology, pages 107–114, Oct 2010.

[12] Crossfitler: Fast multidimensional filtering for coordinated views. http://square.github.io/crossfilter, 2015.

[13] R. J. Crouser, L. Franklin, and K. Cook. Rethinking visual analytics forstreaming data applications. IEEE Internet Computing, 21(4):72–76, 2017.

[14] D3.js. http://d3js.org.[15] F. Dabek and J. J. Caban. A grammar-based approach for modeling user

interactions and generating suggestions during the data exploration process.IEEE Transactions on Visualization and Computer Graphics, 23(1):41–50,Jan 2017.

[16] dc.js: Dimensional charting javascript library. http://dc-js.github.io/dc.js/, 2015.

[17] M. Derthick and S. Roth. Enhancing data exploration with a branchinghistory of user operations. Knowledge-Based Systems, 14(1):65 – 74,2001.

[18] C. A. Ellis and S. J. Gibbs. Concurrency control in groupware systems.In Proceedings of the 1989 ACM SIGMOD International Conference onManagement of Data, SIGMOD ’89, pages 399–407, New York, NY, USA,1989. ACM.

[19] M. Feng, C. Deng, E. Peck, and L. Harrison. Hindsight: Encouragingexploration through direct encoding of personal interaction history. IEEETransactions on Visualization and Computer Graphics, PP(99):1–1, 2016.

[20] J. Freire, D. Koop, E. Santos, and C. T. Silva. Provenance for computa-tional tasks: A survey. Computing in Science and Engineering, 10(3):11–21, 2008.

[21] J. Freire, C. Silva, S. Callahan, E. Santos, C. Scheidegger, and H. Vo. Man-aging rapidly-evolving scientific workflows. In International Provenanceand Annotation Workshop (IPAW), LNCS 4145, pages 10–18. SpringerVerlag, 2006.

[22] Git. http://git-scm.com.[23] Google analytics. https://www.google.com/analytics.[24] D. Gotz and M. X. Zhou. Characterizing users’ visual analytic activity for

insight provenance. Information Visualization, 8(1):42–55, 2009.[25] P. Groth. Provenancejs: Revealing the provenance of web pages. In

D. L. McGuinness, J. R. Michaelis, and L. Moreau, editors, Provenanceand Annotation of Data and Processes, volume 6378 of Lecture Notes inComputer Science, pages 283–285. Springer Berlin Heidelberg, 2010.

[26] P. T. Groth, Y. Gil, J. Cheney, and S. Miles. Requirements for provenanceon the web. IJDC, 7(1):39–56, 2012.

[27] A. H. Hajizadeh, M. Tory, and R. Leung. Supporting awareness throughcollaborative brushing and linking of tabular data. IEEE Transactions onVisualization and Computer Graphics, 19(12):2189–2197, Dec 2013.

[28] J. Heer and M. Agrawala. Design considerations for collaborative visual

analytics. Information Visualization, 7(1):49–62, 2008.[29] J. Heer, J. Mackinlay, C. Stolte, and M. Agrawala. Graphical histories

for visualization: Supporting analysis, communication, and evaluation.Visualization and Computer Graphics, IEEE Transactions on, 14(6):1189–1196, 2008.

[30] M. Herschel, R. Diestelkamper, and H. Ben Lahmar. A survey on prove-nance: What for? what form? what from? The VLDB Journal, 26(6):881–906, Dec 2017.

[31] J. Hibschman and H. Zhang. Unravel: Rapid web application reverseengineering via interaction recording, source tracing, and library detection.In Proceedings of the 28th Annual ACM Symposium on User InterfaceSoftware & Technology, UIST ’15, pages 270–279, New York, NY, USA,2015. ACM.

[32] J. Hoffswell, A. Satyanarayan, and J. Heer. Visual debugging techniquesfor reactive data visualization. Computer Graphics Forum, 35(3):271–280,2016.

[33] P. Isenberg, N. Elmqvist, J. Scholtz, D. Cernea, K.-L. Ma, and H. Hagen.Collaborative visualization: Definition, challenges, and research agenda.Information Visualization, 10(4):310–326, 2011.

[34] T. J. Jankun-Kelly, O. Kreylos, K.-L. Ma, B. Hamann, K. I. Joy, J. Shalf,and E. W. Bethel. Deploying web-based visual exploration tools on thegrid. IEEE Computer Graphics and Applications, 23(2):40–50, Mar 2003.

[35] S. Kaasten, S. Greenberg, and C. Edwards. How people recognize pre-viously seen web pages from titles, urls and thumbnails. In People andComputers XVI (Proceedings of Human Computer Interaction, pages247–265, 2001.

[36] D. Koop. Versioning version trees: The provenance of actions that affectmultiple versions. In Proceedings of the 6th International Provenanceand Annotation Workshop on Provenance and Annotation of Data andProcesses - Volume 9672, IPAW 2016, pages 109–121, Berlin, Heidelberg,2016. Springer-Verlag.

[37] D. Kurlander and S. Feiner. Editable graphical histories. In Visual Lan-guages, 1988., IEEE Workshop on, pages 127–134, Oct 1988.

[38] H. B. Lahmar and M. Herschel. Provenance-based recommendationsfor visual data exploration. In 9th USENIX Workshop on the Theoryand Practice of Provenance (TaPP 2017), Seattle, WA, 2017. USENIXAssociation.

[39] P. Leitner, A. Michlmayr, F. Rosenberg, and S. Dustdar. End-to-endversioning support for web services. In Services Computing, 2008. SCC’08. IEEE International Conference on, volume 1, pages 59–66, July 2008.

[40] K.-L. Ma. Image graphs–a novel approach to visual data exploration. InProceedings of the Conference on Visualization ’99: Celebrating Ten Years,VIS ’99, pages 81–88, Los Alamitos, CA, USA, 1999. IEEE ComputerSociety Press.

[41] N. Mahyar and M. Tory. Supporting communication and coordinationin collaborative sensemaking. IEEE Transactions on Visualization andComputer Graphics, 20(12):1633–1642, Dec 2014.

[42] W. McGrath, B. Bowman, D. McCallum, J. D. Hincapie-Ramos,N. Elmqvist, and P. Irani. Branch-explore-merge: Facilitating real-timerevision control in collaborative visual exploration. In Proceedings ofthe 2012 ACM International Conference on Interactive Tabletops andSurfaces, ITS ’12, pages 235–244, New York, NY, USA, 2012. ACM.

[43] T. Milo and A. Somech. React: Context-sensitive recommendations fordata analysis. In Proceedings of the 2016 International Conference onManagement of Data, SIGMOD ’16, pages 2137–2140, New York, NY,USA, 2016. ACM.

[44] P. Nguyen, K. Xu, A. Wheat, B. Wong, S. Attfield, and B. Fields.Sensepath: Understanding the sensemaking process through analyticprovenance. Visualization and Computer Graphics, IEEE Transactionson, 22(1):41–50, Jan 2016.

[45] D. A. Nichols, P. Curtis, M. Dixon, and J. Lamping. High-latency, low-bandwidth windowing in the jupiter collaboration system. In Proceedingsof the 8th Annual ACM Symposium on User Interface and Software Tech-nology, UIST ’95, pages 111–120, New York, NY, USA, 1995. ACM.

[46] C. North, R. Chang, A. Endert, W. Dou, R. May, B. Pike, and G. Fink.Analytic provenance: Process+interaction+insight. In CHI ’11 ExtendedAbstracts on Human Factors in Computing Systems, CHI EA ’11, pages33–36, New York, NY, USA, 2011. ACM.

[47] Open access VIS. http://oavis.steveharoz.com.[48] Plotly Technologies Inc. Collaborative data science. https://plot.ly,

2015.[49] E. Ragan, A. Endert, J. Sanyal, and J. Chen. Characterizing provenance

in visualization and data analysis: An organizational framework of prove-

Page 11: Enhancing Web-based Analytics Applications through …faculty.cs.niu.edu/~dakoop/pubs/simprov.pdfEnhancing Web-based Analytics Applications through Provenance Akhilesh Camisetty, Chaitanya

nance types and purposes. Visualization and Computer Graphics, IEEETransactions on, 22(1):31–40, Jan 2016.

[50] A. Sarvghad and M. Tory. Exploiting analysis history to support col-laborative data analysis. In Proceedings of the 41st Graphics InterfaceConference, GI ’15, pages 123–130, Toronto, Ont., Canada, Canada, 2015.Canadian Information Processing Society.

[51] Y. B. Shrinivasan and J. J. van Wijk. Supporting the analytical reason-ing process in information visualization. In Proceedings of the SIGCHIConference on Human Factors in Computing Systems, CHI ’08, pages1237–1246, New York, NY, USA, 2008. ACM.

[52] Y. L. Simmhan, B. Plale, and D. Gannon. A survey of data provenance ine-science. SIGMOD Rec., 34(3):31–36, 2005.

[53] C. Stolper, A. Perer, and D. Gotz. Progressive visual analytics: User-drivenvisual exploration of in-progress analytics. Visualization and ComputerGraphics, IEEE Transactions on, 20(12):1653–1662, Dec 2014.

[54] J. Teevan, E. Cutrell, D. Fisher, S. M. Drucker, G. Ramos, P. Andre, andC. Hu. Visual snippets: Summarizing web pages for search and revisitation.In Proceedings of CHI 2009. Association for Computing Machinery, Inc.,April 2009.

[55] Vega: A visualization grammar. http://vega.github.io/vega.[56] J. W. and E. N. Explates: Spatializing interactive analysis to scaffold

visual exploration. Computer Graphics Forum, 32(3pt4):441–450, 2013.[57] W. Willett, J. Heer, and M. Agrawala. Scented widgets: Improving

navigation cues with embedded visualizations. Visualization and ComputerGraphics, IEEE Transactions on, 13(6):1129–1136, 2007.

[58] K. Wongsuphasawat, D. Moritz, A. Anand, J. Mackinlay, B. Howe, andJ. Heer. Voyager: Exploratory analysis via faceted browsing of visualiza-tion recommendations. IEEE Transactions on Visualization and ComputerGraphics, 22(1):649–658, Jan 2016.

[59] K. Xu, S. Attfield, T. J. Jankun-Kelly, A. Wheat, P. H. Nguyen, andN. Selvaraj. Analytic provenance for sensemaking: A research agenda.IEEE Computer Graphics and Applications, 35(3):56–64, May 2015.

[60] J. Zhao, M. Glueck, P. Isenberg, F. Chevalier, and A. Khan. Supportinghandoff in asynchronous collaborative sensemaking using knowledge-transfer graphs. IEEE Transactions on Visualization and Computer Graph-ics, 24(1):340–350, Jan 2018.