data services in your spreadsheet

Upload: vthung

Post on 30-May-2018


  • 8/14/2019 Data Services in Your Spreadsheet

    Data Services in your Spreadsheet!

    Regis Saint-Paul, Boualem Benatallah
    School of Computer Science & Engineering
    University of New South Wales
    Sydney NSW 2052, Australia
    {regiss, boualem}@cse.unsw.edu.au

    Julien Vayssiere
    SAP Research, RC Brisbane
    Level 12, 133 Mary Street
    Brisbane QLD 4000, Australia

    [email protected]

    ABSTRACT

    Service-oriented architecture offers high-level and integrated access to data and applications across the company. Using Data Services, together with Service Data Objects, developers can now refer to business entities rather than storage structures. They are relieved from the burden of repetitive and low-level tasks such as joining tables.

    Sadly, a vast community of developers does not yet enjoy this relief. They are end-user programmers, and their programming environment of choice is the spreadsheet. In this paper, we identify the criteria that define, from an end-user point of view, a good integration of spreadsheets with a service-oriented architecture. Based on these criteria, we propose SpreadATOR, a generic bridge between data services and spreadsheets.

    Categories and Subject Descriptors

    H.4.1 [Office Automation]: Spreadsheets; E.2 [Data Storage Representations]: Object representation

    General Terms

    Languages, Design

    Keywords

    Spreadsheet, Service-oriented, architecture, integration

    1. INTRODUCTION

    End-user programmers (45 million of them in the US alone, as estimated for 2001 [18]) routinely use spreadsheets to visualize, manipulate, and analyze data. Thanks to this environment, they can build more or less complex applications that solve their daily problems. Even building a report can be seen as programming an application that takes corporate data as input and outputs a presentation. To build this application, spreadsheet users have to import data and place them in spreadsheet cells, highlight the important pieces, perhaps compute some aggregates, and add a chart or two. If well done, this application can be used again later on a new set of data to effortlessly produce a new report.

    Service-oriented architecture (SOA) emerged as a response to the general problems of enterprise application integration (EAI) and enterprise information integration (EII) [11, 20, 5]. This architecture offers a unified and high-level view of the company resources.

    Copyright is held by the author/owner(s). WWW2007, May 8-12, 2007, Banff, Canada.

    In particular, with the advent of data services [8] and standards such as Service Data Objects (SDO) [19], professional developers now benefit from high-level and integrated data access. Integrated, because developers can transparently access, from a single end-point, information that may be managed by various systems and stored in various locations. High-level, because data services rely on a conceptual model of the information, namely the Entity-Relationship (ER) model. For example, developers can retrieve from a data service an entity customer. This entity presents information retrieved partly from a relational database and partly from supply chain management software. From this entity, developers can also access related entities such as the purchase orders or the invoices of this particular customer.

    Undoubtedly, spreadsheets need to be integrated with SOA. Indeed, some efforts to this end already exist. For instance, Microsoft Excel Services [1] allows Excel computations to be incorporated as part of a larger process. Another example is Visual Studio Tools for Office (VSTO) [14], which makes it possible to isolate the presentation elements of a spreadsheet from the data it contains. Data are stored as separate XML documents and can be consumed by other applications.

    Those initiatives, however, are meant for professional developers. A solid background in object-oriented programming is needed to use VSTO. In this article, we are concerned with the large majority of spreadsheet users: the non-professional programmers. Manipulating SDO entities can already be done by using the macro language that accompanies spreadsheet environments. But few end-users are ready to invest time in learning a macro language, which amounts to learning object-oriented programming.

    For the majority of end-users, spreadsheet programming means formulas and cell manipulations only, sometimes assisted by wizards or visual assistants. An integration solution has to preserve this programming model and to accommodate existing spreadsheet environments. With this in mind, we want to investigate how the integration of spreadsheets with SOA can positively impact the daily tasks of end-users, and what would characterize, to them, a good integration.

    The integration of spreadsheets with data services has several facets. In this article, we focus on the problem of importing and manipulating data delivered by data services. Other aspects, such as the exportation of data from spreadsheets, are left for future research.


    This paper is organized as follows. In Section 2, we define more precisely the model of spreadsheet programming and present the framework of service-oriented architecture and data services.

    In Section 3.1, our first contribution is to identify the dimensions of the importation and manipulation of data objects in spreadsheets; that is, we define the characteristics of a good integration solution with data services. In particular, we identify how the richness of the ER model can benefit end-users when they program with the flat tabular representation of data proposed by a spreadsheet.

    On this basis, we review in Section 3.2 the existing approaches to data importation. As we will see, those approaches are not fully satisfactory because (i) they do not embrace the spreadsheet programming model and, thus, have limited programmability and (ii) while they offer some predefined manipulations that take advantage of the underlying structure of imported data, users cannot benefit from this structure in their programming.

    In Section 4, we present our approach to data importation, called SpreadATOR. Our other contributions are:

    • In order to give more programming possibilities to end-users and to leverage their expertise in spreadsheet programming, we propose to import data by using the traditional spreadsheet programming model. To this end, SpreadATOR expresses the specification of importation with formulas. But contrary to other approaches, those formulas are not limited to primitive data types and can be used to access collections of data or composite data structures.

    • We propose a method that lets users exploit the relationships that exist between entities to access related information. This method also lets users program computations that apply to a business entity (e.g. a customer) and can be reused when the entity appears in another context (e.g. in a list of customers).

    • We propose to enhance the traceability of imported data by offering a generic facility for the display of meta-data and by allowing users to use meta-data in formulas.

    In Section 5, we show how our prototype implementation of SpreadATOR, an add-in to MS Excel, can be used with publicly available data sources and how it simplifies the spreadsheet developer's task when working with composite data. We present additional related work in Section 6 and conclude in Section 7.

    2. BACKGROUND

    2.1 Spreadsheet Programming Model

    Spreadsheet applications come in a variety of implementations and bear an even more varied set of features. In addition, several research proposals have been made to extend the spreadsheet model in various directions.

    In order to facilitate our exposition, we introduce in this section what we call the traditional model of spreadsheet programming. We intend here to capture the essence of spreadsheets common to the vast majority of commercial implementations. We reserve the discussion of the various extensions to this model, mainly research prototypes, for Section 6.

    [Figure 1: Spreadsheets in Service Oriented Architecture. Three layers: the resource layer (DBMS, OLAP, CRM application, supply chain application; logical model, e.g. relational), the service layer (data services; conceptual model, Entity-Relationship (SDO)) and the spreadsheet layer (grid model).]

    The traditional spreadsheet model is based on a grid, also called a worksheet, where cells are identified by their coordinates, denoted using letters for the columns and numbers for the rows. For example, C4 is the cell located in the third column of the fourth row. Each cell contains either a single atomic value or no value at all. This value is obtained either from direct user input (e.g. the user may input C4=6) or from a formula expression (e.g. the user may set C4=SIN(C3)+5). Formulas are expressions that combine (i) functions, such as SIN(x) or + in this last example, (ii) constant expressions and (iii) variable references denoted as cell coordinates, C3 in this example. A spreadsheet application may contain several worksheets, together referred to as the workbook.
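    The traditional model can be sketched in a few lines of Python. This is only an illustrative toy under our own assumptions, not part of any spreadsheet product: the class name Worksheet and its methods are hypothetical, and formulas are represented as Python callables rather than parsed from text.

```python
import math


class Worksheet:
    """Toy grid: a cell holds a constant, a formula, or nothing at all."""

    def __init__(self):
        # Maps a coordinate such as "C4" to a constant or to a callable (formula).
        self.cells = {}

    def set(self, ref, content):
        self.cells[ref] = content

    def value(self, ref):
        content = self.cells.get(ref)  # None models an empty cell
        if callable(content):
            # A formula is re-evaluated on every read, mimicking continuous evaluation.
            return content(self)
        return content


sheet = Worksheet()
sheet.set("C3", 0.0)                                    # direct user input
sheet.set("C4", lambda s: math.sin(s.value("C3")) + 5)  # C4 = SIN(C3) + 5
```

    Changing C3 and reading C4 again yields the updated result, which is the behaviour that makes progressive evaluation so cheap in a spreadsheet.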

    Spreadsheets are often described as the most successful end-user development environment [15, 12]. An explanation for this well-deserved reputation can be sought in the cognitive dimensions (CD) framework [6]. CDs are criteria that help to understand the difficulties faced by developers in using a system. An example of a cognitive dimension is progressive evaluation. It measures how far developers need to go in their programming before being able to check whether what they did is correct. Spreadsheets evaluate continuously and thus do very well along this dimension.

    Some tasks, such as adding a chart, cannot be done by formula programming. A wizard dialog can in this case be used to guide users through the sequence of operations needed to accomplish the task. Indeed, formulas themselves can be built with the help of such a visual assistant.

    The spreadsheet programming model is thus a combination of functional programming, grid manipulations and visual programming. All these elements combine to deliver an intuitive experience, with little or no learning barrier and great payoffs. Extensions to spreadsheets have to preserve as much as possible the qualities of this programming model. We refer the reader to [12] for a more complete exposition of the spreadsheet programming model. In the following, we introduce when needed the characteristics of this model that we believe are important when importing and manipulating data.

    2.2 Service Oriented Architecture

    Our framework is the service-oriented architecture illustrated in Figure 1. It consists of three layers.

    The resource layer is the domain of data management systems, e.g. relational databases or data warehouses, as well as applications, e.g. customer relationship or supply chain management software. At this layer, data are structured according to a logical model, concerned with issues such as retrieval performance or storage costs. For example, data can be organized in relational tables with various degrees of normalization in order to reduce the computation time of queries. For their part, applications are accessed in this layer using specific, often heterogeneous, programming interfaces (e.g. C++ for the CRM and Java for the supply chain software).

    Programming at this level often implies tedious and repetitive manipulations. For example, in order to retrieve from a relational database all the desired information regarding a customer, programmers may have to join several tables in a rather complex SQL expression.

    The service layer helps in providing a higher-level experience to developers. Application interfaces are made homogeneous thanks to web services, and information can be accessed through data services. For instance, ALDSP from BEA [8] proposes data access through the Service Data Object (SDO) standard [19, 21]. Service data objects, or their .NET cousin [2], rely on the Entity-Relationship model and allow developers to use this model for accessing and updating data.

    This approach relieves developers from the low-level manipulations typical of the resource layer. When an entity such as customer is accessed, all the related information can also be accessed easily. The customer entity has explicit relationships with other entities such as purchase orders. Developers can use these relationships to gain access to related information.
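    The contrast between the two layers can be illustrated with a small Python sketch. It is not SDO and not any vendor's data service: the table layout, the order numbers and the function get_customer are hypothetical, and an in-memory SQLite database stands in for the resource layer. The join is written once, inside the service-like function; callers then simply follow the orders relationship of the returned entity.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class PurchaseOrder:
    number: str
    total: float


@dataclass
class Customer:
    id: str
    last_name: str
    first_name: str
    orders: list = field(default_factory=list)  # the "order" relationship


def make_db():
    """Resource layer: two relational tables (hypothetical schema and data)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE customer (id TEXT PRIMARY KEY, last_name TEXT, first_name TEXT)")
    db.execute("CREATE TABLE purchase_order (number TEXT PRIMARY KEY, customer_id TEXT, total REAL)")
    db.execute("INSERT INTO customer VALUES ('001', 'Prefect', 'Ford')")
    db.executemany(
        "INSERT INTO purchase_order VALUES (?, ?, ?)",
        [("PO-1", "001", 70.0), ("PO-2", "001", 95.0)],
    )
    return db


def get_customer(db, customer_id):
    """Service layer: hide the join and return an entity with its related orders."""
    rows = db.execute(
        "SELECT c.id, c.last_name, c.first_name, p.number, p.total "
        "FROM customer c LEFT JOIN purchase_order p ON p.customer_id = c.id "
        "WHERE c.id = ? ORDER BY p.number",
        (customer_id,),
    ).fetchall()
    if not rows:
        return None
    cid, last, first = rows[0][:3]
    orders = [PurchaseOrder(n, t) for (_, _, _, n, t) in rows if n is not None]
    return Customer(cid, last, first, orders)


ford = get_customer(make_db(), "001")
```

    With this sketch, ford.orders gives direct access to the purchase orders without the caller ever writing SQL.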

    The third layer in Figure 1 represents the spreadsheet application. We have seen in Section 2.1 that, in the spreadsheet environment, variables are cells and the only available data structures are cell matrices.

    When going from the bottom layer of resources to the upper layer of services, developers gain in terms of abstraction. By contrast, programming in a spreadsheet looks like a huge step backward. Here lies the main difficulty. A good integration will allow spreadsheet developers to benefit from the high-level abstractions of data services while letting them work with the simple structures offered by a worksheet. In the next section, we discuss what this implies.

    3. DATA IMPORTATION AND MANIPULATION

    This section presents the dimensions of the problem of importing data from data services into spreadsheets and of their subsequent manipulation. We then review the existing approaches to data importation and identify their strengths and weaknesses.

    3.1 Dimensions of the problem

    3.1.1 Incremental construction of grid representation

    We have seen in Section 2.1 that one of the characteristics of the spreadsheet programming model is the possibility to build applications incrementally, starting simple and gradually adding more and more formulas.

    Importation consists in locating a data resource and building its representation on the worksheet grid. This should be accomplished in a way that preserves the incrementality of spreadsheet programming. Users should be able to modify any part of a grid representation, adding or removing imported information, with a minimum of side effects on the rest of the application. And since the unit of modification in a spreadsheet is the cell, building the grid representation ought to be possible cell by cell. In terms of cognitive dimensions, this notion is referred to as viscosity.

    3.1.2 Traceability

    It is important to know where the information displayed in a spreadsheet comes from and what it means, not only at the time the application is built, but throughout its life-cycle. For example, if a cell displays the value "Prefect", does it stand for the last name of a customer, a misspelling of "Perfect" or the model name of a car?

    Clearly, users should be able at any given time to know which external information is represented in any given cell. This information is twofold. First, the cell value has to be precisely bound to an external resource. This binding is needed to perform the retrieval of the value and its subsequent refresh. This identification is thus at the system level. Second, the information needs to be identified at the user level with what is called meta-data. Meta-data include, for example, the semantics of the value, its precision or, if it is a distance, the unit in which it is expressed (e.g. miles, kilometers or light-years). Finally, users may need to refer to meta-data in formula expressions.
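    A minimal Python sketch of these two levels of identification follows. It is an illustration under our own assumptions: the class ImportedCell, the binding syntax, the accessor meta and the sample distance value are all hypothetical and do not describe an existing spreadsheet feature.

```python
from dataclasses import dataclass, field


@dataclass
class ImportedCell:
    value: object
    binding: str                              # system level: which external resource (hypothetical syntax)
    meta: dict = field(default_factory=dict)  # user level: meta-data about the value


def meta(cell, key):
    """Formula-style accessor: return one piece of meta-data, or None if absent."""
    return cell.meta.get(key)


name_cell = ImportedCell(
    "Prefect", "Customer/001/lastName", {"semantics": "customer last name"}
)
distance_cell = ImportedCell(
    42.0, "Trip/7/distance", {"semantics": "distance", "unit": "kilometers"}
)


def in_miles(cell):
    """A computation that consults the unit meta-data before using the value."""
    factor = {"miles": 1.0, "kilometers": 0.621371}[meta(cell, "unit")]
    return cell.value * factor
```

    Here meta(name_cell, "semantics") settles whether "Prefect" is a last name or a car model, and in_miles shows meta-data being used inside a computation rather than merely displayed.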

    3.1.3 Parametric data access

    Parametric data access is needed in order to compose importation operations. For example, a user may first import a list of customers from the Customer Relationship Management software. Then, additional data regarding the current processing of customer orders may be retrieved from the supply chain management software. In this scenario, the second importation depends on data that come from the first importation.

    We have seen in Section 2.1 that, in spreadsheet programming, cell references are used as variables. In a stock portfolio application, for example, it means that the stock name would be stored in a cell, and that the importation of its quote would use that cell reference as a parameter.

    3.1.4 Data access efficiency

    Importation may impact both the data service used to deliver the information and the spreadsheet application that retrieves it. To illustrate this problem, consider the portfolio application mentioned above. A table consisting of the names and quotations of a list of stocks has to be imported from some relational database. It is more efficient to retrieve this list with a single selection and projection query than to query each individual cell value separately.
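    The difference can be made concrete with a small Python sketch that counts the queries received by a data source. Everything here is hypothetical: FakeStockService is a stand-in for a real data service, and the stock names and quotes are made up.

```python
class FakeStockService:
    """Stand-in for a data service; counts how many queries it receives."""

    def __init__(self, quotes):
        self._quotes = dict(quotes)
        self.queries = 0

    def quote(self, name):
        """One query per cell value."""
        self.queries += 1
        return self._quotes[name]

    def quotes(self, names):
        """One query for the whole table (selection and projection in one shot)."""
        self.queries += 1
        return [(n, self._quotes[n]) for n in names]


DATA = {"AAA": 10.0, "BBB": 20.5, "CCC": 7.25}  # made-up portfolio
names = ["AAA", "BBB", "CCC"]

per_cell = FakeStockService(DATA)
table_a = [(n, per_cell.quote(n)) for n in names]  # cell-by-cell importation

batched = FakeStockService(DATA)
table_b = batched.quotes(names)                    # single-query importation
```

    Both strategies produce the same table, but the first costs one round trip per stock while the second costs a single one, whatever the size of the portfolio.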

    3.1.5 Relationship-based manipulations

    As mentioned in Section 2.2, data services offer high-level access to corporate information based on the Entity-Relationship model. We argued that a good integration of spreadsheets has to give users access to this model. But what does that mean concretely? To illustrate this, consider a cell that displays, as in Figure 2, the last name of a customer, say Prefect. Accessing the underlying ER model, represented on the left of the figure, means here that we can display additional details about this customer, say his first name and address, or the list of his recent purchases. That is, the value is not seen as isolated, but as an element of a larger composite entity, here the customer, in relationship with other entities, e.g. purchase orders.

    [Figure 2: Relationship navigation. The ER model shows the entity Customer (lastName, firstName) linked by a 1-to-* "order" relationship to the entity PO (poDate, poShipAddress, poTotal), with sample customers 001 (Prefect, Ford), 002 (Dent, Arthur) and 003 (Beeblebrox, Zaphod). A spreadsheet displays the imported list of customers (last name, first name). Navigating through the order relationship for customer 001 fills the template PODetailsAndAverage with the list of orders of that customer (order dates 08/08/2004 and 05/10/2005, order totals 70.00 and 95.00) and the average total PO, 82.50.]

    The pivot table, a feature found in MS Excel, implements such a mechanism and, incidentally, offers an illustration of the shortcomings that an integration solution should avoid. The pivot table computes an aggregate, e.g. a sum or an average, of a collection of values grouped along some dimensions. It is the spreadsheet representation of an OLAP cube. It takes the form of a table, i.e. a collection of cells, where horizontal and vertical headers represent the chosen dimensions and where each non-header cell is an aggregate. A right-click on an aggregated value pops up a context menu that offers to display the details of this aggregate, that is, the list of individual values that were summed or averaged to produce this cell content. This is a form of relationship navigation: the aggregated value is in relation with the individual values used to compute the aggregate. However, the navigation experience offered by this method poses two problems: it is fixed and closed.
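    The relationship between an aggregate and its details can be sketched in Python as follows. This is not how Excel implements pivot tables; the function names and the sample rows are hypothetical. The sketch only shows the structure at stake: each aggregate keeps a link to the individual values it summarizes.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, year, amount).
ROWS = [
    ("North", 2004, 120.0),
    ("North", 2004, 80.0),
    ("North", 2005, 50.0),
    ("South", 2005, 200.0),
]


def pivot(rows):
    """Group amounts by (region, year); keep the detail values behind each aggregate."""
    details = defaultdict(list)
    for region, year, amount in rows:
        details[(region, year)].append(amount)
    totals = {key: sum(values) for key, values in details.items()}
    return totals, dict(details)


totals, details = pivot(ROWS)


def drill_down(key):
    """Navigate from an aggregate cell to the individual values it was computed from."""
    return details[key]
```

    In this sketch the drilled-down values are ordinary data that any other computation can reuse, for instance sum(v for v in drill_down(("North", 2004)) if v > 100). That kind of open and customizable access is what the fixed context menu does not provide.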

    The navigation is fixed since it is not possible to specify how the details are displayed. In the pivot table case, they are displayed in a new worksheet as a table where each column is a dimension and each row an individual value. But suppose that you modify this new worksheet to compute some custom aggregations, say a sum of all values greater than 100. If you happen to need a similar computation for some other aggregate values, you will unfortunately have to do the work again, as the details of each aggregate are displayed in their own newly created worksheet. There are workarounds. For example, the computation can be programmed in a separate worksheet that refers to the worksheet automatically produced by this tool. But this method is more complex. What would be needed here is the possibility to customize the navigation.

    The navigation is also closed since the details of an aggregate are accessible only from within the pivot table, through this particular context menu. It is not possible, for example, to access the details of an aggregate from another tool or from a formula expression, nor is it possible to refer to the origin of a cell value from a formula or another tool. Only the resulting aggregate value is accessible. The structure to which it belongs, a lattice in this case, is known to the pivot table, since it allows users to drill down, roll up or display the details from a given aggregate. But the structure is only accessible through the set of predefined manipulations offered by this tool. It is possible to perform a drill-down operation by using the pivot table API from the macro language of Excel, but not to build spreadsheet-based computations that refer to the drilled-down data.

    This hinders the interoperability between importation solutions. The spreadsheet becomes integrated with a collection of separate systems, but no interaction is possible between them in spreadsheet programming. However, we observe that this situation is not due to a defect in the pivot table itself. It has to do with the approach that consists in integrating the spreadsheet directly with the resource layer (see Section 2.2 and Figure 1), where resources are heterogeneous. Integrating with the service layer makes it possible to build a single solution to access a variety of systems, and closedness in this case would hopefully become irrelevant.

    We also want to emphasize that relationship navigation is different from parametric importation. The difference is that relationships are pre-built in the conceptual model and users do not need to express them: they are ready to use. Relationships are precious when the parameters are not trivial (for example, when obtaining the purchase orders of a customer involves joining several tables with composite foreign keys). Roughly put, relationship navigation is to parametric data access what SDOs are to SQL.

    Now that we have a clearer picture of what to expect from an integration solution with SOA, we examine existing approaches to data importation and see how well they do along these five dimensions.

    3.2 Review of existing approaches

    We observed that approaches to data importation and manipulation can usefully be classified into two categories: formula-based importation and external mapping definition. For each, we sketch below its main traits, give some examples of commercial products, and discuss its relative merits according to the criteria identified in the previous section.

    3.2.1 Formula-based importation

    In this model, the grid representation of external data is obtained from formula evaluation, as illustrated in Figure 3. Examples of this method for MS Excel include the built-in database functions (e.g. DGET, DSUM, DAVERAGE, etc.) and the Real Time Data (RTD) provider. The function DGET(x,y,z), when used in a cell formula, retrieves from a database x the value corresponding to attribute y of a tuple identified by z. RTD is an extension mechanism that can be used by a professional developer to provide access to dynamic values. They correspond respectively to a pull and a push model of data importation. In addition to those features, a professional developer can easily extend the library of functions available in the formula language with User Defined Functions (UDF). Figure 3 illustrates this approach with a UDF Customer that takes the customer number and an attribute as parameters.

    The main advantage of using formulas for data importation is that it is perfectly in line with the spreadsheet programming model and thus shares its good properties:

    • The grid representation can be built incrementally and each cell can be modified individually;

    • Traceability is immediate since the formula reflects the origin of the information, provided its syntax is not too abstruse. However, only the binding of a value to external data is expressed in a formula. Additional meta-data support is needed;

    • Data access is naturally parametric, and only limited by the way functions are defined. For instance, importing data with the function Customer(id, attributeName) gives more flexibility than with CustomerLastName(id) but less than with GetData(entityType, id, attributeName). Note that the three versions could be provided.

    [Figure 3: Formula-based importation. A spreadsheet cell contains the formula =Customer(001, lastName), which retrieves the last name of customer 001 from the Customer entities (lastName, firstName): 001 (Prefect, Ford), 002 (Dent, Arthur), 003 (Beeblebrox, Zaphod).]
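    The three levels of generality mentioned in the last item can be layered on top of one another, as in the following Python sketch. It is only an illustration: the dictionary ENTITIES is a hypothetical in-memory stand-in for a data service, filled with the customers of Figure 3, and the functions are plain Python rather than actual spreadsheet UDFs.

```python
# Hypothetical in-memory stand-in for a data service (customers of Figure 3).
ENTITIES = {
    "Customer": {
        "001": {"lastName": "Prefect", "firstName": "Ford"},
        "002": {"lastName": "Dent", "firstName": "Arthur"},
        "003": {"lastName": "Beeblebrox", "firstName": "Zaphod"},
    }
}


def get_data(entity_type, entity_id, attribute_name):
    """Most general form: every part of the access is a parameter."""
    return ENTITIES[entity_type][entity_id][attribute_name]


def customer(entity_id, attribute_name):
    """Entity type fixed; identifier and attribute remain parameters."""
    return get_data("Customer", entity_id, attribute_name)


def customer_last_name(entity_id):
    """Least flexible form: only the identifier is a parameter."""
    return customer(entity_id, "lastName")
```

    Each more specific function is a thin wrapper around the more general one, which is one way in which all three versions could be provided at little cost.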

    The efficiency of the data access is however more problematic. First, a straightforward implementation of the functions used for data retrieval implies an individual query to the data resource for each cell. This can have a significant impact on the data provider, the spreadsheet application and even the network. It can be mitigated by implementing some cache mechanism local to the spreadsheet, but (i) this implementation is not trivial and (ii) even so, the evaluation of separate functions for each of the cells will have an impact.

    The major concern is thus that formulas are not suited to importing collections of data. They can return only values compatible with the cell content, i.e. a single primitive value such as a text, a date, or a number. In addition to the performance impact, this limitation also makes it impossible to import collections of data of varying, or unknown, size. This, of course, is not acceptable, as most of the data served by data services fall in one or both of these categories.

    Finally, although relationship navigation can be implemented, very few approaches actually propose this feature. For example, the Excel add-in for MS Analysis Server [3] implements it by parsing the formula to identify the component of the structure to which it refers and by offering some navigation choices in a context menu. As mentioned in Section 3.1, this navigation is closed and fixed.

    3.2.2 External mapping definition

    An alternative way to import data into spreadsheets is to specify which data to import and where on the spreadsheet to place them, as shown in Figure 4. We refer to this specification as the mapping definition. It is programmed by spreadsheet users through wizards or visual assistants. This definition does not involve formulas and, in that sense, it is external to the spreadsheet application.
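    As a rough illustration, a mapping definition can be thought of as a small data structure kept outside the grid and applied to it in one pass. The Python sketch below mirrors the mapping of Figure 4 under simplifying assumptions of ours: the tuple layout and the function apply_mapping are hypothetical, customer identifiers are listed explicitly instead of being selected by a query, and only single-letter columns are handled.

```python
CUSTOMERS = {
    "001": {"lastName": "Prefect", "firstName": "Ford"},
    "002": {"lastName": "Dent", "firstName": "Arthur"},
    "003": {"lastName": "Beeblebrox", "firstName": "Zaphod"},
}

# Mapping definition, stored outside the grid:
# (top-left target cell, customer ids to import, attributes to import).
MAPPING = [
    ("B2", ["001"], ["lastName"]),
    ("C2", ["001"], ["firstName"]),
    ("B3", ["002", "003"], ["lastName", "firstName"]),  # fills the range B3:C4
]


def apply_mapping(mapping, customers):
    """Fill a grid (dict of cell -> value) from the mapping definition."""
    grid = {}
    for top_left, ids, attributes in mapping:
        col, row = top_left[0], int(top_left[1:])  # single-letter columns only
        for r, cid in enumerate(ids):
            for c, attr in enumerate(attributes):
                cell = f"{chr(ord(col) + c)}{row + r}"
                grid[cell] = customers[cid][attr]
    return grid


grid = apply_mapping(MAPPING, CUSTOMERS)
```

    Note that the resulting cells contain plain values: nothing in the grid itself records where a value came from, since that knowledge lives only in MAPPING.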

    The XML mapping tool [17] that ships with MS Excel is an example of this approach. It relies on drag-and-drop operations for the mapping definition. Users can select XML elements from the tree representation of an XML schema and drop them over the cells where they want the information to appear. This results in a mapping between cells and XML schema elements not much different from that represented in Figure 4. It is not yet a complete mapping definition, as the data to import also need to be specified. This is done in a separate operation in which the user actually selects an XML document of that schema.

    [Figure 4: External mapping definition. The Customer entities (lastName, firstName) are 001 (Prefect, Ford), 002 (Dent, Arthur) and 003 (Beeblebrox, Zaphod). The spreadsheet displays them as a table with the columns Last name and First name. The external mapping definition, kept outside the grid, associates cell B2 with Customer(001, lastName), cell C2 with Customer(001, firstName) and the range B3:C4 with Customer(>001, {lastName, firstName}).]

    Another example is given by the pivot table (see footnote 1), already introduced in Section 3.1. When used with external data, a series of wizard dialogs helps users to formulate the "which" part of their mapping, i.e. to select a data source and build a query. Then, users can specify the "where" part which, in this case, means choosing a single cell: the upper-left corner of the pivot table. The number of cells used by the pivot table will depend on the size of the data that need to be displayed.

    This approach, a mix of wizard dialogs to select the data and of visual assistants to build the grid representation, has been adopted by the vast majority of software vendors to propose the integration of spreadsheets with their applications; far too many indeed to start citing them. The advantages of the approach are clear: (i) the data access can be made very efficient, since users first build a query to retrieve all the related data in one shot, so only one query needs to be sent to the data resource; (ii) it is possible to specify a grid representation of collections of data of unknown or varying size.

    There are drawbacks however, mainly because the mapping definition is not programmed in a spreadsheet style, with formulas.

    • Traceability is problematic. The cell content is usually identified by headers; only the cell location determines its content, meaning that it cannot be moved. Some approaches resort to using the comment zone of cells to store information regarding the origin of the data. Using the comment zone in fact highlights that spreadsheets do not provide any facility to store, display and use meta-data. This problem is recognized as very significant when spreadsheets are used in contexts such as Business Intelligence or reporting [10]. This has led us to propose in Section 4.3 a systematic support of meta-data.

    • Parametric data access supposes that some spreadsheet cells can be referenced in the external mapping definition. If we take the example of Figure 4, it means that rather than defining the mapping for customer 001, we would define it for customer A1. This is not a technical difficulty, but it introduces an important problem and, probably for that reason, we have not reviewed any approach allowing it. The problem is that of hidden dependency, one of the cognitive dimensions of the framework introduced in Section 2.1. It means that if we define the mapping this way, the cells C2 and B2 of our example now become dependent on the value of cell A1. When this happens in spreadsheet programming, the dependency between C2 and A1 is clear from the fact that the formula in C2 refers to A1. Here however, cell C2 itself does not show this dependency; it is hidden unless the dialog used to define the external mapping is displayed.

    • The grid representation is no longer built incrementally: all the data are imported at the same time. More often than not, in order to add an attribute to an imported table, it is necessary to go through all the steps of the importation process again. The previous importation, and possibly all the modifications made by the user, is simply erased and replaced by a new one.

    1. Arguably, the pivot table is not a pure importation solution, in the sense that it can also transform the imported data to compute aggregates; but (i) it is an importation tool when used with an OLAP data source and (ii) being standard in Excel makes it a good example.

                    Formula-based       External
    Efficiency      Problematic         Good
    Traceability    Good                Fragile
    Incremental     Good                Pre-commitment
    Parametric      Good                Inconsistency, hidden dep.
    Relationship    Closed and fixed    Closed and fixed

    Table 1: A comparison of importation models

    Note that to mitigate this last problem, or the fact, for instance, that the way mappings are defined forces collections of values to be imported as contiguous collections of cells, some products allow an external mapping to be transformed into a formula-based importation (e.g. SAP BEx Analyzer [4], Oracle BI [13] or Microsoft Analysis server [3]). A set of cells that obtains its values through an external mapping can be refactored into formula expressions that retrieve the cell values. It then becomes possible to relocate any of these cells and, thus, to insert an empty row or column in a table without preventing future data refresh. However, this method cannot be called formula-based. Data are still queried as defined during the mapping definition, and the formulas refer to the result of this query. Thus, both traceability (since the formula doesn't refer to the external data) and parametric access (since the query isn't in the formula) remain a problem.

    Regarding relationship-based manipulation, the situation is the same as in the formula-based approach. Such manipulations are possible, and some form of navigation is usually supported by products in this category. But, as discussed for the pivot table in Section 3.1, they are closed and fixed.

    4. A NOVEL APPROACH: SPREADATOR

    A formula-based importation, since it conforms to the spreadsheet programming model, is more satisfying than an external mapping definition on the dimensions concerned with the programming aspects (see Table 1). However, formula-based importation is not suited for importing collections

    [Figure 5: Formula-based external mapping definition. Customer entities with attributes lastName and firstName (001: Prefect, Ford; 002: Dent, Arthur; 003: Beeblebrox, Zaphod); an external mapping definition grid holding the formulas =Customers[001].LastName and =\\Customers\002@lastName; and the spreadsheet application grid showing the resulting values Prefect and Dent.]

    of data or composite data: formulas return only primitive types. This weakness has a huge practical impact, as most of the data users need to import are indeed collections or composites. As a result, existing systems for data importation rely almost solely on an external definition of the mapping.

    It appears, then, that in order to satisfy all the criteria listed in Section 3.1, we need to blend the qualities of both approaches. This is what we attempt to do with SpreadATOR (footnote 2). In a nutshell, SpreadATOR is essentially an external mapping approach, but it also offers a spreadsheet-like programming experience based on formulas. Section 4.1 presents the formula language that we use to construct the external mapping definition.

    In Section 3.1.5, we discussed why we believe that neither the formula approach nor the external mapping approach is entirely satisfying when it comes to relationship-based manipulations. We present in Section 4.2 how those manipulations are eased in SpreadATOR and what the resulting benefits are for the end-user programmer.

    Finally, we saw that conveying meta-data information is necessary to achieve good traceability. We propose in Section 4.3 an innovative method to (i) convey this information and (ii) allow end-users to incorporate it in their computations.

    4.1 Formula-based external mapping

    We define SpreadATOR as a middleware for spreadsheet integration. It adopts an external mapping approach to import data retrieved from data services. The innovation of SpreadATOR is to make this mapping definition explicit and to blend it with the rest of the formula-based programming of the spreadsheet application.

    The SpreadATOR mapping definition is thus based on formula expressions that are very similar to spreadsheet formulas. They are stored in cells and can use other cell references. Figure 5 shows such a mapping definition in cells B2 and B3. The exact syntax chosen for the language is not important here; it is merely a matter of preference or implementation. To emphasize this idea, we used a formula with an object-oriented syntax in cell B2 and one in the style of XPath in B3.
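
    The claim that the concrete syntax is secondary can be illustrated with a small sketch. The Python code below is not part of SpreadATOR; it assumes a hypothetical XML rendering of the customer data of Figure 5 and reads the same last name once through object attributes and once through an XPath-style path, using the standard xml.etree.ElementTree module. All function names are illustrative.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical XML rendering of the customer data shown in Figure 5.
CUSTOMERS_XML = """
<Customers>
  <Customer id="001" lastName="Prefect" firstName="Ford"/>
  <Customer id="002" lastName="Dent" firstName="Arthur"/>
  <Customer id="003" lastName="Beeblebrox" firstName="Zaphod"/>
</Customers>
"""


@dataclass
class Customer:
    lastName: str
    firstName: str


root = ET.fromstring(CUSTOMERS_XML)

# Object view of the same data, keyed by customer number.
customers = {
    node.get("id"): Customer(node.get("lastName"), node.get("firstName"))
    for node in root.findall("Customer")
}


def last_name_object_style(key):
    # Rough analogue of the object-oriented formula of cell B2.
    return customers[key].lastName


def last_name_path_style(key):
    # Rough analogue of the XPath-style formula of cell B3,
    # expressed with an attribute predicate.
    node = root.find(f"Customer[@id='{key}']")
    if node is None:
        raise KeyError(key)
    return node.get("lastName")
```

    Both functions return the same value for a given customer number; only the notation used to reach it differs.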

    Our implementation of SpreadATOR relies on JScript.Net for formula evaluation. JScript is an implementation of JavaScript for the .Net framework. Therefore, the formula

    Footnote 2: SpreadATOR stands for Spreadsheets and dATA Objects Reconciled.


    syntax we adopted corresponds to that of cell B2.

    Although not strictly speaking an object-oriented language, JavaScript has the advantage of being easily interfaced with pure object libraries. In particular, it is possible to use any assembly compliant with .Net from JScript expressions. Thus, our implementation can benefit from ADO.Net [2], the .Net equivalent of SDO, and can interact with other data access libraries such as Application Programming Interfaces (API). Section 5 presents such a scenario. To avoid unnecessary confusion, we adopt for the rest of this paper an object-oriented terminology and, for example, speak of object instances rather than entities.

    Figure 5 shows the mapping definition in one spreadsheet grid and its evaluation in another. In reality, the spreadsheet user only sees one grid and can choose to display either the formulas or their evaluations, as is already the case with traditional formulas. Traditional formulas and SpreadATOR formulas are merged in a single interface, making the programming very intuitive to spreadsheet developers, who remain oblivious of the fact that the two types of formulas are maintained by different systems.

    Formula expressions in (the current implementation of) SpreadATOR are essentially JavaScript statements. However, we needed to extend the language slightly with a few keywords and some syntactic sugar. First, spreadsheet formulas need to reference cells. Because cell coordinates could collide with the names of object members or variables, we enclose them in angle brackets (e.g. =Customers[<A1>].lastName imports the last name of the customer whose number corresponds to cell A1). Second, we needed some mechanism to allow the mapping of collections of values to collections of cells. This is achieved by using the character * instead of the identifier of an element of the list. For example, a column containing the last names of a list of customers is obtained by =Customers[*].lastName (footnote 3). Finally, three other keywords (obj, template, and metadata) take a special meaning in SpreadATOR formulas and are presented in the following sections.

    Statements that return an object reference are also valid. For example, =Customers[001] returns a reference to an instance of customer, and =Customers returns a reference to the complete list of customers. The reference returned is managed by SpreadATOR; for Excel, the cell simply contains a string representation of these objects (obtained by the default transtyping given by toString()). The advantages of storing a reference to a composite entity in a cell are:

    (i) It is now possible to refer directly to these composite objects to build their representation on the spreadsheet. For example, if B2=Customers[001], we can have a formula B3=<B2>.lastName. Thus, it suffices that the content of B2 changes (e.g. if it is replaced by a reference to customer 002) for all related formulas to change accordingly. This makes formulas shorter, easier to read and more efficient to compute;

    (ii) The content of cell B2 now has a type. It is possible to display additional information corresponding to that particular entity type and to permit navigation to related entities through the template mechanism, described in the next section.
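
    The formula conventions described above (cell references in angle brackets, * for collections, and cells holding object references) can be sketched as a toy evaluator. This is only an illustration in Python under stated assumptions: the actual implementation relies on JScript.Net, and the CUSTOMERS dictionary, the regular expressions and the evaluate function are hypothetical names introduced here. The sketch handles only the formula shapes quoted in the text.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    lastName: str
    firstName: str


# Hypothetical in-memory stand-in for a data service (values from Figure 5).
CUSTOMERS = {
    "001": Customer("Prefect", "Ford"),
    "002": Customer("Dent", "Arthur"),
    "003": Customer("Beeblebrox", "Zaphod"),
}

# Accepts =Customers[KEY], =Customers[KEY].attr, =<CELL> and =<CELL>.attr,
# where KEY is a literal number, a cell reference such as <A1>, or *.
FORMULA = re.compile(
    r"^=(?:Customers\[(?P<key>[^\]]+)\]|<(?P<cell>[A-Z]+\d+)>)"
    r"(?:\.(?P<attr>\w+))?$"
)
CELL_REF = re.compile(r"^<([A-Z]+\d+)>$")


def evaluate(formula, cells):
    """Evaluate one formula against a dict of already evaluated cells."""
    match = FORMULA.match(formula)
    if match is None:
        raise ValueError(f"unsupported formula: {formula}")
    attr = match.group("attr")

    def project(obj):
        # Return the requested attribute, or the object reference itself.
        return getattr(obj, attr) if attr else obj

    if match.group("cell"):            # formula applied to a referenced object
        return project(cells[match.group("cell")])
    key = match.group("key")
    if key == "*":                     # collection mapping: one value per customer
        return [project(customer) for customer in CUSTOMERS.values()]
    ref = CELL_REF.match(key)
    if ref:                            # parametric access through a cell value
        key = str(cells[ref.group(1)])
    return project(CUSTOMERS[key])
```

    With cells = {"A1": "002"}, evaluating =Customers[<A1>].lastName follows the value of A1, so changing A1 changes the imported customer. Storing the result of =Customers[001] in B2 and then evaluating =<B2>.lastName mirrors the first advantage listed above: the dependent formula no longer names the customer itself.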

    Footnote 3: Note that only one * is allowed per formula.

    We want to emphasize that, despite the object-like syntax of the formula language used in SpreadATOR, we don't assume any familiarity of end-users with object-oriented programming. First, users only access pre-built objects available from data services; they do not actually create these objects. Second, the use of formulas doesn't exclude the complementary use of wizards and visual assistants to generate the formulas. Traditional spreadsheet formulas are themselves often built using a wizard dialog. The visual assistant we propose in SpreadATOR, called the object explorer and presented in Section 5, makes constructing the grid representation an experience very close to that of using the XML mapping tool in Excel.

    4.2 Relationship-based manipulations

    We saw in Section 3.1.5 that existing approaches offer limited support for relationship-based manipulation. In particular, these approaches do not allow spreadsheet programmers to actually benefit from relationships in their programming.

    To address this problem, we propose to introduce a template mechanism. The idea of templates is not new to spreadsheets, and end-users are already familiar with it. The innovation of SpreadATOR is to associate templates with the type of composite objects (each type may have several templates) and thereby to allow defining a generic grid representation for instances of that type. Templates are given names and are proposed in a drop-down menu (see Figure 6(a)) whose content depends on the current cell selection.

    It is the fact that SpreadATOR formulas can return references to composite objects that makes it possible to associate templates with types. To illustrate this mechanism, suppose a worksheet with a formula A1=Customers[001]; that is, cell A1 contains, from SpreadATOR's point of view, a reference to the instance of type Customer that represents customer 001. When A1 is selected, users can open a template associated with the type Customer (or create a new template for that type). An internal object named obj is associated with the instance referenced in A1. Users can use this reference to build a representation of this object.

    So the difference between a template and a worksheet is that instead of referencing an external object, as in =Customers[001].lastName, a template references an internal object denoted obj. For example, in a template suited for objects of type Customer, we can have formulas such as =obj.lastName. The approach used to build the template is exactly the same as for a worksheet and relies on the same visual assistant; only the formulas generated by the assistant are different.

    The template defines how to represent an object obj. Accessing a template is equivalent to a relationship navigation, since the template can display any information related to the selected instance, for example the list of purchase orders of the selected customer. It is in fact more powerful, since users are not limited to displaying only the destination entity (or entities) of a relationship.

    Furthermore, SpreadATOR allows the customized grid representation of an object type to be accessed from a worksheet that contains instances of that type. For example, suppose that a customer template called "PO details" is used to compute some custom aggregate, say the average of the purchase orders (POs) whose total exceeds $100, and that the result is in cell G4 of the template. From our worksheet example above, where cell A1 contains a


    reference to customer 001, we can access the custom aggregate of the template by using the formula =template(A1, PO details, G4).

    This formula can easily be duplicated for all the customers present on a worksheet by simply changing the reference A1. In object-oriented terminology, it is as if the type Customer was extended with a new method that computes the custom aggregate. When this formula is evaluated, obj is associated with the reference contained in A1. It can be seen as the SpreadATOR equivalent of the keyword this in object-oriented programming. However, obj stands for "current cell composite content" rather than "current instance of that class".
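
    The template mechanism can be pictured as a registry of named grids attached to a type, where each template cell is a formula over obj. The Python sketch below is an illustration only, not the add-in's implementation: the TEMPLATES registry, the PurchaseOrder class and the template function are hypothetical, and the aggregate is the "average of the POs whose total exceeds $100" example from the text.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class PurchaseOrder:
    total: float


@dataclass
class Customer:
    lastName: str
    orders: list = field(default_factory=list)


def average_large_orders(obj):
    # Formula of template cell G4: average of the POs whose total exceeds 100.
    totals = [po.total for po in obj.orders if po.total > 100]
    return mean(totals) if totals else None


# Hypothetical registry: type -> template name -> cell -> formula over obj.
TEMPLATES = {
    Customer: {
        "PO details": {
            "A1": lambda obj: obj.lastName,
            "G4": average_large_orders,
        }
    }
}


def template(value, name, cell):
    """Evaluate one cell of a named template with obj bound to `value`."""
    formulas = TEMPLATES[type(value)][name]
    return formulas[cell](value)
```

    Calling template(customer, "PO details", "G4") for each customer on a worksheet behaves like calling a method added to the Customer type, with obj bound to whatever composite value the referenced cell currently holds.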

    By comparison, computing this custom aggregate on a list of customers with the traditional spreadsheet programming approach (that is, without resorting to another programming paradigm such as a macro language) is more complex. For example, you could import the list of customers in a worksheet and either (i) import all the purchase orders of all customers as one large table in a second worksheet or (ii) import the list of POs in a separate worksheet for each customer. In either case, you'll have to rebuild the join between the list of customers and that of POs; that is, you'll have to search for the starting and ending rows corresponding to a customer (case (i)) or for the worksheet corresponding to a customer (case (ii)). The formula language of spreadsheets includes search functions precisely for such situations. But why should users have to build this join when it is readily available in the ER model of the data?
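
    To make the cost of case (i) concrete, the following Python sketch rebuilds the join by hand over a flat table of purchase orders, locating the first and last rows of a customer with a binary search, much as a spreadsheet search function would. The PO_ROWS table and the function name are assumptions made for this illustration, and the table is assumed to be sorted by customer number.

```python
from bisect import bisect_left, bisect_right

# Hypothetical flat import of all purchase orders (case (i)): one row per PO,
# sorted by customer number, as it might sit in a second worksheet.
PO_ROWS = [
    ("001", 50.0),
    ("001", 150.0),
    ("001", 250.0),
    ("002", 120.0),
    ("003", 80.0),
]


def average_large_orders(customer_id, rows, threshold=100.0):
    """Average of a customer's PO totals above `threshold`, or None if none."""
    keys = [row[0] for row in rows]
    start = bisect_left(keys, customer_id)   # first row of the customer
    end = bisect_right(keys, customer_id)    # one past the last row
    totals = [total for _, total in rows[start:end] if total > threshold]
    return sum(totals) / len(totals) if totals else None
```

    The search for start and end is exactly the join that the relationship between customers and purchase orders already encodes in the data model.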

    4.3 Meta-data management

    Meta-data are not very different in nature from other information related to a given entity. They mainly differ in their semantics and usage. A meta-data item is typically not a value that users want to lay out on the spreadsheet, because it is not an essential attribute of the information accessed. Meta-data represent a complement of information, a kind of documentation; they speak about data.

    For example, Business Intelligence (BI) software often provides access to Key Performance Indicators (KPI) such as "Order processing delay", which expresses in days the average time needed to process a customer purchase order. This value is a high-level aggregate, and users need to know exactly what it means, how it is evaluated, when the value was last computed, what its precision is, etc. We need meta-data (i) to always be accessible whenever we examine a KPI and (ii) not to occupy cells of their own on the worksheet, unless, of course, the specific application we build calls for it.

    Thus, we propose in SpreadATOR to display meta-data separately from the worksheet, in a reserved space of the user interface (see the bottom-right area in Figure 6(a)). We define meta-data as a collection of (name, value) pairs obtained from a collection of (name, formula) tuples that depend, in the same way as the template mechanism, on the type of the composite object contained in the cell. The former is displayed as a list when a cell containing a composite object is selected, while the latter is a collection defined by the user in which each formula refers to the selected object through the keyword obj introduced in the previous section. For example, the end user can define a meta-data item for type Customer with (Last contacted, obj.lastContactDate). When a cell containing a customer is selected, the evaluation of this formula is displayed in the bottom-right section of the screen, e.g. (Last contacted, 21/06/2006).

    Though this value is not needed for display, it could be needed for computation. For example, users may want to highlight customers that have not been contacted for a while. Since meta-data are visible at the same time as the worksheet, they can be used in drag-and-drop operations. The keyword metadata is used to refer to their value. For example, =metadata(A1, Last contacted) returns the meta-data named "Last contacted" for the composite object (here of type Customer) contained in cell A1. Meta-data are hence both very similar and complementary to templates. While a template can be used to display a large quantity of related information, the purchase orders of a customer for instance, meta-data offer a simple and intuitive mechanism to display complementary information about an imported entity.
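
    The meta-data definition above, (name, formula) tuples attached to a type and evaluated into (name, value) pairs for the selected object, can be sketched as follows. This Python fragment is a hedged illustration: the METADATA registry and the functions metadata_pairs and metadata are hypothetical names, and only the "Last contacted" example comes from the text.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Customer:
    lastName: str
    lastContactDate: date


# Hypothetical registry: type -> list of (name, formula over obj) tuples.
METADATA = {
    Customer: [("Last contacted", lambda obj: obj.lastContactDate)],
}


def metadata_pairs(value):
    """(name, value) pairs to display when a cell holding `value` is selected."""
    return [(name, formula(value)) for name, formula in METADATA.get(type(value), [])]


def metadata(value, name):
    """Value of one named meta-data item, usable inside a computation."""
    return dict(metadata_pairs(value))[name]
```

    A worksheet formula could then compare metadata(customer, "Last contacted") with the current date to highlight customers who have not been contacted for a while, without the date ever occupying a cell of its own.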

    5. CONSUMING REAL-WORLD DATA

    To validate our approach, we demonstrate in this section how RSS feeds can be easily accessed and manipulated in Excel using SpreadATOR. We show (i) that access to an RSS feed can be performed using a generic library, that is, a library which has not been developed specifically for use from within a spreadsheet, and (ii) that the composite structure of RSS feeds can be laid out on the worksheet and manipulated in formulas using pure visual programming, i.e. through a combination of point-and-click and drag-and-drop operations.

    RSS is a popular XML format for news syndication. RSS documents are XML documents with several levels of nested parent/child relationships. A typical RSS document includes one or more channels, where each channel contains a collection of news items. RSS documents are published by news-oriented web sites or blogs and are widely available on the Internet.

    We use here the library called RSS.Net (footnote 4). This library exposes a class called RssFeed, which is an object-oriented representation of an RSS document. An instance of RssFeed can be built using the static method Read(url) of this class.

    To be available in SpreadATOR, a library or a service first has to be referenced. This is done using a dialog where users either provide the service URL (from which a library is produced using .Net utilities) or, as in our example, select the compiled file of a library (called an assembly in .Net terminology).

    Once this is done, all the public resources of the library, including the RssFeed class, can be used in spreadsheet formulas. Figure 6(a) presents the SpreadATOR add-in as it appears in Excel. In the top right corner, SpreadATOR offers a zone where users can input formulas (though formulas can also be input using the Object Explorer, as explained hereafter). In this figure, we can see that cell A8 contains the formula RssFeed.Read(URL). The URL used in this example is that of the IEEE Transactions on Computers (footnote 5). Evaluating this formula returns an object of type RssFeed, whose default transtyping as a string is displayed in the cell (in this case, it corresponds to the URL of the feed).

    Footnote 4: available at http://www.rssdotnet.com

    Footnote 5: located at http://csdl.computer.org/rss/tc.xml


    (a) A worksheet accessing several RSS feeds (b) Template view of one RSS feed

    Figure 6: The SpreadATOR add-in user interface in MS Excel

    In this example, three different feeds have been accessed in three different rows of the spreadsheet. Details of those feeds are provided in other columns. For row 8, the last refresh date and the most recent news item are accessed using the formulas <A8>.LastModified and <A8>.Channels[0].Items[0], respectively. Again, these formulas can be entered by using the Object Explorer, as described below. Other rows use similar formulas, with only the cell reference changed. This illustrates that any level in the nested structure of the RSS document can be freely laid out on the worksheet. It also shows that the individual components of the composite value contained in a cell can be accessed directly within formula expressions using only cell references, hence avoiding the impedance mismatch problem faced in traditional spreadsheet programming.
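
    The nested access pattern used in these formulas can be reproduced outside Excel with a small object model. The Python sketch below is an assumption-laden stand-in, not the RSS.Net API: the classes only borrow the names RssFeed, RssChannel, Channels and Items from the text, read_feed is a hypothetical helper that parses an inline RSS string with the standard xml.etree.ElementTree module instead of fetching a URL, and each class defines the string shown when the object sits in a cell.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical minimal RSS 2.0 document used instead of a network fetch.
RSS_TEXT = """<rss version="2.0">
  <channel>
    <title>Example channel</title>
    <item><title>First item</title></item>
    <item><title>Second item</title></item>
  </channel>
</rss>"""


@dataclass
class RssItem:
    Title: str

    def __str__(self):
        return self.Title


@dataclass
class RssChannel:
    Title: str
    Items: list

    def __str__(self):
        # Default string representation: the title of the channel.
        return self.Title


@dataclass
class RssFeed:
    Url: str
    Channels: list

    def __str__(self):
        # Default string representation: the URL of the feed.
        return self.Url


def read_feed(url, text):
    """Build the nested feed/channel/item objects from RSS text."""
    root = ET.fromstring(text)
    channels = [
        RssChannel(
            channel.findtext("title", default=""),
            [RssItem(item.findtext("title", default="")) for item in channel.findall("item")],
        )
        for channel in root.findall("channel")
    ]
    return RssFeed(url, channels)
```

    With a8 = read_feed("http://example.org/feed.xml", RSS_TEXT), str(a8) plays the role of the text displayed in cell A8, and a8.Channels[0].Items[0] corresponds to the most recent news item reached from that cell.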

    The composite value contained in cell A8 can also be displayed in detail using a template corresponding to its type (see Section 4.2). Figure 6(b) shows a detailed display of the RssFeed instance of cell A8. In this template, all formulas use the keyword obj to refer to the current instance to be displayed. For example, the formula used in cell C6 of Figure 6(b) is obj.Channels[0]. This formula returns an object of type RssChannel, whose default string representation is the title of the channel.

    The Object Explorer can be seen on the right of Figure 6(b). In this mode, it displays the details of the type contained in the selected cell which, in our case, is the type RssChannel since cell C6 is selected. The Object Explorer can be used to build a layout by simply selecting one of the properties of the object contained in the selected cell and dragging the corresponding node over to an empty cell. This results in the formula corresponding to that node being copied into the cell. For instance, one can see that the node description is selected and that the corresponding formula is displayed in the bottom part of the panel.

    The detailed layout of Figure 6(b) can be reused for any instance of the RssFeed class. Hence, similar details can be obtained for any of the three RssFeed objects in Figure 6(a) through a simple click. Moreover, it is possible to create several templates, corresponding to as many views of complex data of a given type. Templates can also be nested. For instance, it is possible to create a template for the type RssChannel and access it from any cell containing a composite value of that type (e.g. cell C6 of Figure 6(b)).

    6. RELATED WORK

    We already reviewed in Section 3.2 the existing approaches to data importation in spreadsheets, and our discussion was focused on mainstream spreadsheets. But spreadsheets have also received sustained attention from the research community. Several proposals have been made to extend spreadsheets in order to introduce features found in conventional programming languages and make these features easy to exploit for end-users.

    An early work in that area is the Analytic Spreadsheet Package (ASP) [16], where the language Smalltalk-80 is used to build a spreadsheet whose cells can contain instances of objects. Object visualization within cells is provided either by the default transtyping mechanism offered by Smalltalk with the printString protocol or by instantiating objects that derive from DisplayObject to build a custom visualization.

    A more advanced integration of object-oriented features, as well as functional programming, into the spreadsheet environment was proposed in [9]. This work extends the traditional spreadsheet to support programming abstractions such as encapsulation, reuse, recursive functions, higher-order functions and polymorphism. It defines a full spreadsheet-based language where worksheets are seen as methods and, when grouped in a workbook collection, collectively define a class.

    In [12], an extension to Excel is proposed to allow end-users to build custom functions. In this approach, the type system of the spreadsheet is extended so that whole matrices can be stored in a single cell. A cell that contains a matrix is displayed in a different way so that end-users clearly know that its content is composite. Forms/3 [7] is a prototype that implements several extensions to spreadsheet programming. It allows, for instance, recursive computations or exception handling. Cells in Forms/3 can contain any type of data.

    All these approaches are very interesting; they explore how to redefine the spreadsheet programming model in order to bring into it the powerful abstractions found in other


    languages. The integration of spreadsheets in SOA would be much facilitated if mainstream spreadsheet applications (and their user base) decided to adopt some of the ideas proposed in those works. For example, if Excel actually handled matrix types as cell values, we could benefit from this type system and propose a richer mapping of composite external objects.

    Our concern in this article is almost the opposite, since we precisely try to leave the spreadsheet programming model untouched. SpreadATOR acts as a middleware; its formula language could as well be hidden from users, who could choose to rely solely on the visual assistant (the object explorer). SpreadATOR, for instance, does not provide any mechanism to actually build those object abstractions. Thanks to this, we were able to implement our prototype as an add-in to an existing spreadsheet application. We try to bring to end-users the small part of the benefits of programming at the conceptual level that we think does not imply any major change in the way they already work with spreadsheets.

    7. CONCLUSION

    In this article, we have defined the problem of spreadsheet integration with data services from the viewpoint of spreadsheet users. We tried to answer how these developers could benefit from the higher-level and integrated view of IT resources offered by service-oriented architecture. More specifically, we discussed how to leverage, in the importation and manipulation of data, the conceptual modeling of information as provided by data services and APIs such as SDO or ADO.Net. We identified the shortcomings of existing solutions and proposed a novel approach to spreadsheet integration called SpreadATOR.

    An important aspect of SpreadATOR is that it can be integrated with existing spreadsheet applications such as MS Excel. It does not require an extension of the spreadsheet language and can act as a middleware. At the same time, its interface blends with MS Excel. This allows users who need it to easily introduce programmatic aspects in their importation, for example by using a cell reference to make it parametric. An additional benefit is an improved readability of the mapping. Finally, the object explorer offers the necessary support to avoid formula input and is very similar to the schema mapping tool already offered in Excel.

    The superiority of specialized importation tools over the generic approach followed by SpreadATOR is their capacity to provide very specific wizard dialogs or visual metaphors to assist users (e.g. they can refer to dimensions when accessing an OLAP server and to tables when accessing a relational database). However, we believe that SpreadATOR is in fact compatible with these high-level features. We argue that these specific assistants should output a mapping definition in a common formula-based importation language such as the one introduced in this article. This can easily be done since, as demonstrated, SpreadATOR is able to work with any (.Net) API. For end-users, the benefit is a spreadsheet application over which they have complete control, as well as the possibility to combine various importation tools in the same application. By using a common mapping definition, importation systems would also leverage the common facilities offered by SpreadATOR, such as the template mechanism or the meta-data management, and save significant development time.

    8. REFERENCES

    [1] Excel Services Overview. Technical report, Microsoft Corp., 2006.

    [2] ADO.Net Tech Preview Entity Data Model. Technical report, Microsoft Corp., June 2006.

    [3] Designing Reports with the Microsoft Excel Add-in for SQL Server analysis services. Microsoft Corp., 2004.

    [4] SAP NetWeaver: A Complete Platform for Large-Scale Business Intelligence. Technical report, Winter Corp., 2005.

    [5] G. Alonso et al. Web Services - Concepts, Architectures and Application. Springer-Verlag, 2004.

    [6] A. Blackwell and T. Green. HCI Models, Theories, and Frameworks: Toward an Interdisciplinary Science. J.M. Carroll Editor, chapter Notational systems - the cognitive dimensions of notations framework. Morgan Kaufmann, 2003.

    [7] M. Burnett et al. Forms/3: A first-order visual language to explore the boundaries of the spreadsheet paradigm. Journal of Functional Programming, 11(2):155-206, 2001.

    [8] M. Carey. Data delivery in a service-oriented world: the BEA AquaLogic data services platform. In SIGMOD'06, pages 695-705, New York, USA, 2006.

    [9] C. Clack and L. Braine. Object-oriented functional spreadsheets. In GlaFP'97, September 1997.

    [10] K. Gile. Keeping IT sane in a crazy BI world of Excel. Technical Report 36353, Forrester, 2005.

    [11] A. Halevy et al. Enterprise information integration: successes, challenges and controversies. In SIGMOD'05, pages 778-787, New York, USA, 2005.

    [12] S. P. Jones, A. Blackwell, and M. Burnett. A user-centred approach to functions in Excel. In ICFP'03, pages 165-176, New York, NY, USA, 2003.

    [13] K. Laker. Exploiting the power of Oracle using Microsoft Excel. Technical report, Oracle Corp., 2004.

    [14] E. Lippert and E. Carter. .Net programming for office: using C# with Excel, Word, Outlook and Infopath. Addison Wesley, 2005.

    [15] B. A. Nardi and J. R. Miller. The spreadsheet interface: A basis for end user programming. In INTERACT'90, pages 977-983. North-Holland, 1990.

    [16] K. W. Piersol. Object-oriented spreadsheets: the analytic spreadsheet package. In OOPSLA'86, pages 385-390, New York, USA, 1986. ACM Press.

    [17] F. Rice. Creating XML mappings in Excel 2003. Technical report, Microsoft Corp., 2005.

    [18] C. Scaffidi, M. Shaw, and B. Myers. Estimating the numbers of end users and end user programmers. In VL/HCC'05, pages 207-214, 2005.

    [19] Next-generation data programming: Service data objects. Technical report, IBM, BEA, 2003.

    [20] http://www.service-architecture.com.

    [21] K. Williams and B. Daniel. An introduction to service data objects. Java Developer's Journal, October 2004.